The correct constant to remove all entries to the end of the sequence
for the Ollama engine is math.MaxInt32. -1 is used by the old engine.
The impact of this is currently minimal because it would only occur
in situations that are not supported by the implemented models or
rarely used options.
If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight. The first shuts down the
original model load, but the second one caused the loss of the new
reloading runner reference, thus triggering the leak.
The primary fix is detecting the duplicate unload and ignoring the second
instance. The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
This reduces the size of our Windows installer payloads by ~256M by dropping
support for nvidia drivers older than Feb 2023. Hardware support is unchanged.
Linux default bundle sizes are reduced by ~600M to 1G.
* Move quantization logic to GGML via new backend
This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.
* Remove "add model quantizations"
This is no longer needed now that quantization is implemented in Go+GGML code directly.
Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first of a series of PRs to clean up the API options
This hides the LlamaServer blank window when chatting outside of the terminal (say like with an app like Msty). This has no other side effects when invoking it the regular way.
For all search path env vars make sure our dirs are first
to avoid potentially finding other incompatible libraries
on the users system.
Also fixes a minor build script glitch for windows rocm
This enhances our logging in the scheduler. The initial "waiting for server" log
no longer claims an initial error state (now "not responding" which better reflects
the actual state). Runners now have slog wiring to report more details about the
runner, including PID.
Successfully completing processing with an errgroup cancels the
associated context. However, we also have a goroutine that is checking
for cancelation of the context. As a result, there is a race where
the goroutine can pick up the cancelation and report an error,
replacing the sucessful error message.
To avoid that, this replaces the goroutine with a cancelation check
when we are reading files. This also has the advantage of stopping
all reads relatively quickly on error and also ensuring that there are
no outstanding I/O operations when we return in this case.
The downside is that if a file read blocks forever (for example, over
the network) then cancelation of the context effectively won't be
honored. However, this is also true for other smaller files we read
and the tensors are read in small chunks (128K), so it's consistent
and better on balance overall.
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.
This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
In some cases, we can't find a cache slot when using sliding window
attention. It would be helpful in this (and other cases) to know what
the batch size is.
Bug #10127
The context (and therefore associated input tensors) was not being
properly closed when images were being processed. We were trying to
close them but in reality we were closing over an empty list, preventing
anything from actually being freed.
Fixes#10434
* strip out thinking tags in message history for qwen3 & r1
This is in advance of "proper" support where we'll make reasoning
configurable and we'll parse out thinking/reasoning tags and provide
them to the caller. These models expect there to be no thinking tags in
the message history, so this should improve quality
* parse model names instead of hacky prefix check
* Adjust initial scheduler refCount
Ensure we only set the refCount on success
* sched: fix lock order inversion deadlock
Under certain race conditions, there was a scenario where the scheduler would
get into a deadlock while trying to update free space information while a model
was trying to unload.
When we later have a large batch running purely on a CPU, this
results the error:
GGML_ASSERT(talloc->buffer_id >= 0)
Disabling this means that we will incrementally reallocate memory
as the graph grows.
Fixes#10410
this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k