New engine: vision models and auto-fallback (#9113)

* Include unified vision layers in memory prediction

For newer vision models that ship as a single GGUF, include
the projector layer estimates in the memory prediction.

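As a rough illustration of what including the projection estimates means in practice, the sketch below sums the vision/projector tensors of a single-GGUF model and adds them to the text-model estimate. The type names, tensor-name prefixes, and functions here are assumptions for the example, not ollama's actual code.

```go
// Rough sketch only: count a unified vision model's projector tensors in the
// memory estimate. TensorInfo, the "v."/"mm." prefixes, and estimateVRAM are
// illustrative assumptions, not ollama's real types or functions.
package main

import (
	"fmt"
	"strings"
)

type TensorInfo struct {
	Name string
	Size uint64 // tensor size in bytes
}

// visionWeightBytes sums tensors belonging to the vision tower / projector,
// identified here by common name prefixes in unified vision GGUFs.
func visionWeightBytes(tensors []TensorInfo) uint64 {
	var total uint64
	for _, t := range tensors {
		if strings.HasPrefix(t.Name, "v.") || strings.HasPrefix(t.Name, "mm.") {
			total += t.Size
		}
	}
	return total
}

// estimateVRAM layers the vision bytes on top of the text-model estimate so
// single-file vision models are not undercounted when scheduling.
func estimateVRAM(textModelBytes uint64, tensors []TensorInfo) uint64 {
	return textModelBytes + visionWeightBytes(tensors)
}

func main() {
	tensors := []TensorInfo{
		{Name: "blk.0.attn_q.weight", Size: 64 << 20},
		{Name: "v.blk.0.attn_q.weight", Size: 16 << 20},
		{Name: "mm.0.weight", Size: 8 << 20},
	}
	fmt.Println(estimateVRAM(4<<30, tensors)) // text estimate + 24 MiB of vision weights
}
```
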
* Adjust CLI to handle both styles of vision model metadata

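The two metadata styles the CLI has to recognize can be pictured roughly as follows: older multimodal models expose a separate projector, while newer unified models carry vision keys in the main model metadata. The ShowResponse shape and the `<arch>.vision.block_count` key below are assumptions for this sketch, not the actual CLI code.

```go
// Illustrative sketch: detect vision support whether the model exposes a
// separate projector (old style) or vision keys inside the main GGUF metadata
// (new style). Field and key names are assumptions.
package main

import "fmt"

type ShowResponse struct {
	ProjectorInfo map[string]any // populated for models with a separate mmproj file
	ModelInfo     map[string]any // keyed like "general.architecture", "<arch>.vision.*"
}

// supportsVision reports whether the model can accept images under either
// metadata style.
func supportsVision(resp ShowResponse) bool {
	if len(resp.ProjectorInfo) > 0 {
		return true // old style: standalone projector gguf
	}
	arch, _ := resp.ModelInfo["general.architecture"].(string)
	_, ok := resp.ModelInfo[arch+".vision.block_count"] // new style: unified gguf
	return ok
}

func main() {
	unified := ShowResponse{ModelInfo: map[string]any{
		"general.architecture":      "gemma3",
		"gemma3.vision.block_count": 27,
	}}
	fmt.Println(supportsVision(unified)) // true
}
```
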
* Wire up new tokenizers for new engine

If we're loading the new engine, use the new model's
text processor instead of calling into the cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older tokenization flow for the C++ server, which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of using the cgo
schema-to-grammar call (a sketch of both paths follows below).

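A condensed sketch of both ideas above, using assumed names throughout: when the new engine loaded the model, tokenization goes through that model's own text processor rather than the cgo wrappers, and the response-format constraint is handed straight to the engine instead of being converted to a grammar via cgo on the server.

```go
// Sketch only, with assumed interfaces: prefer the new engine's text processor
// for tokenization, and hand the response-format constraint straight to that
// engine instead of running the cgo schema-to-grammar conversion.
package main

import (
	"context"
	"errors"
	"fmt"
)

// TextProcessor stands in for the new engine's tokenizer interface.
type TextProcessor interface {
	Encode(s string) ([]int32, error)
	Decode(tokens []int32) (string, error)
}

// legacyTokenize stands in for the cgo wrapper around llama.cpp tokenization.
func legacyTokenize(ctx context.Context, prompt string) ([]int32, error) {
	return nil, errors.New("legacy path not wired up in this sketch")
}

// tokenize uses the model's own processor when the new engine loaded it and
// only falls back to cgo for models still served by the C++ runner.
func tokenize(ctx context.Context, tp TextProcessor, prompt string) ([]int32, error) {
	if tp != nil {
		return tp.Encode(prompt)
	}
	return legacyTokenize(ctx, prompt)
}

// completionRequest shows the grammar side of the same decision: pass the raw
// format/schema through so the new engine applies the constraint itself.
type completionRequest struct {
	Prompt string
	Format []byte // e.g. "json" or a JSON schema, interpreted by the engine
}

func main() {
	req := completionRequest{Prompt: "hello", Format: []byte(`"json"`)}
	if _, err := tokenize(context.Background(), nil, req.Prompt); err != nil {
		fmt.Println("fallback:", err)
	}
}
```
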
* Lay the foundation for auto-selection of the new engine
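
The commit message does not spell out what that foundation looks like, so the sketch below only illustrates the general shape auto-selection might take: an architecture allow-list plus the env override (envconfig.NewEngine) visible in the diff further down. Every other name here is an assumption for illustration.

```go
// Rough sketch of auto-fallback: run a model on the new engine when its
// architecture is supported there, otherwise keep the llama.cpp runner.
// The supported set and useNewEngine are illustrative assumptions; only the
// idea of an env override (envconfig.NewEngine) comes from the diff below.
package main

import "fmt"

// newEngineSupported lists architectures the Go engine implements (assumed).
var newEngineSupported = map[string]bool{
	"llama":  true,
	"mllama": true,
}

// useNewEngine picks the runner for a model: forced on by the override, or on
// automatically when the architecture is in the supported set.
func useNewEngine(arch string, forceNew bool) bool {
	if forceNew {
		return true
	}
	return newEngineSupported[arch]
}

func main() {
	fmt.Println(useNewEngine("mllama", false))  // true: new engine
	fmt.Println(useNewEngine("qwen2vl", false)) // false: fall back to llama.cpp
}
```
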
author Daniel Hiltgen 2025-03-04 09:03:46 -08:00 (committed by GitHub)
parent 7a01ad7614
commit 1fdb351c37
10 changed files with 249 additions and 170 deletions

@@ -205,7 +205,7 @@ func (s *Server) GenerateHandler(c *gin.Context) {
 	images := make([]llm.ImageData, len(req.Images))
 	for i := range req.Images {
-		if isMllama && !envconfig.NewEngine() {
+		if isMllama && len(model.ProjectorPaths) > 0 {
 			data, opts, err := mllama.Preprocess(bytes.NewReader(req.Images[i]))
 			if err != nil {
 				c.AbortWithStatusJSON(http.StatusInternalServerError, gin.H{"error": "error processing image"})