Commit graph

835 commits

Author SHA1 Message Date
Michael Yang
0d6e35d3c6
fix: stream accumulator exits early (#10593)
the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")`, which isn't strictly correct
since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
2025-05-08 13:17:30 -07:00
Michael Yang
6e9a7a2568
lint: enable usetesting, disable tenv (#10594) 2025-05-08 11:42:14 -07:00
Daniel Hiltgen
5e380c3b42
sched: fix race leading to orphaned runners (#10599)
If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight.  The first shuts down the
original model load, but the second one caused the loss of the new
reloading runner reference, thus triggering the leak.

The primary fix is detecting the duplicate unload and ignoring the second
instance.  The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
2025-05-07 09:38:17 -07:00
Jeffrey Morgan
392de84031
api: remove unused RetrieveModelResponse type (#10603) 2025-05-06 23:08:03 -07:00
Devon Rifkin
4090aca97b
server: send 405 instead of 404 for unallowed methods (#10275)
Fixes: #5483
2025-05-06 14:45:37 -07:00
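The 404 vs. 405 distinction in the entry above can be sketched with the standard library alone. This is a minimal illustration, not the project's routing code: `allowMethods`, `statusFor`, and the `/api/example` path are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// allowMethods (hypothetical helper) wraps a handler so that a known path
// called with an unsupported method yields 405 plus an Allow header.
func allowMethods(h http.Handler, methods ...string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for _, m := range methods {
			if r.Method == m {
				h.ServeHTTP(w, r)
				return
			}
		}
		w.Header().Set("Allow", strings.Join(methods, ", "))
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	})
}

// statusFor sends one request through a small mux and returns the status code.
func statusFor(method, path string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	mux := http.NewServeMux()
	mux.Handle("/api/example", allowMethods(ok, http.MethodPost))

	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(http.MethodPost, "/api/example")) // allowed method
	fmt.Println(statusFor(http.MethodGet, "/api/example"))  // known path, wrong method
	fmt.Println(statusFor(http.MethodGet, "/api/missing"))  // unknown path
}
```

A 405 tells the client that the path exists but the method is wrong, which a 404 cannot convey.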
Michael Yang
92ce438de0
server: remove internal cmd (#10595) 2025-05-06 13:05:01 -07:00
Daniel Hiltgen
424810450f
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
Jeffrey Morgan
1703d1472e
server: fix panic when runner.Options is nil (#10566) 2025-05-05 09:01:33 -07:00
Daniel Hiltgen
76ea735aaf
sched: logging improvements (#10550)
This enhances our logging in the scheduler.  The initial "waiting for server" log
no longer claims an initial error state (now "not responding" which better reflects
the actual state).  Runners now have slog wiring to report more details about the
runner, including PID.
2025-05-03 12:01:56 -07:00
frob
e6d2d04121
image: add vision capability for projector-based models (#10509)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-05-01 16:50:20 -07:00
Devon Rifkin
ad3c7c9bda
strip out thinking tags in message history for qwen3 & r1 (#10490)
* strip out thinking tags in message history for qwen3 & r1

This is in advance of "proper" support where we'll make reasoning
configurable and we'll parse out thinking/reasoning tags and provide
them to the caller. These models expect there to be no thinking tags in
the message history, so this should improve quality

* parse model names instead of hacky prefix check
2025-04-30 13:57:45 -07:00
Daniel Hiltgen
415c8fcc3d
Fix "Stopping..." scheduler hang (#10487)
* Adjust initial scheduler refCount

Ensure we only set the refCount on success

* sched: fix lock order inversion deadlock

Under certain race conditions, there was a scenario where the scheduler would
get into a deadlock while trying to update free space information while a model
was trying to unload.
2025-04-30 11:26:52 -07:00
Devon Rifkin
fe5b9bb21b
lower default num parallel to 2
this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k
2025-04-29 02:04:14 -07:00
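The arithmetic behind "memory equivalent" in the entry above can be written out as follows. It is a sketch that assumes context memory scales linearly with the number of parallel slots times the context length; `kvTokens` is a hypothetical helper, not a function from the codebase.

```go
package main

import "fmt"

// kvTokens is the total number of context tokens held in memory,
// assuming one full context window per parallel request slot.
func kvTokens(numParallel, contextLength int) int {
	return numParallel * contextLength
}

func main() {
	fmt.Println(kvTokens(4, 2048)) // old default: 4 slots x 2k
	fmt.Println(kvTokens(2, 4096)) // new default: 2 slots x 4k, same total
	// The single-slot fallback is where the two settings differ.
	fmt.Println(kvTokens(1, 2048), kvTokens(1, 4096))
}
```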
Devon Rifkin
dd93e1af85
Revert "increase default context length to 4096 (#10364)"
This reverts commit 424f648632.
2025-04-28 16:54:11 -07:00
Michael Yang
340448d2d1 explicitly decode maxarraysize 1024 2025-04-25 16:59:01 -07:00
Michael Yang
214a7678ea fix superfluous call to WriteHeader
the first call to http.ResponseWriter.Write implicitly calls WriteHeader
with http.StatusOK if it hasn't already been called. once WriteHeader
has been called, subsequent calls have no effect. Write is called when
JSON encoding progressUpdateJSON{}. calls to
http.ResponseWriter.WriteHeader after the first encode are useless and
produce a warning:

http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
2025-04-25 16:58:49 -07:00
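The behavior described in the entry above can be reproduced with `net/http/httptest`. This is a minimal sketch: `statusAfterLateWriteHeader` is a hypothetical name, and the recorder silently ignores the late call, whereas a real `http.Server` connection also logs the "superfluous" warning.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusAfterLateWriteHeader encodes a JSON body first and only then tries
// to set an error status, mirroring the ordering problem in the commit.
func statusAfterLateWriteHeader() int {
	rec := httptest.NewRecorder()

	// Encode calls rec.Write, which implicitly sends 200 OK.
	_ = json.NewEncoder(rec).Encode(map[string]string{"status": "pulling manifest"})

	// Too late: the header has already been written, so this has no effect.
	rec.WriteHeader(http.StatusInternalServerError)
	return rec.Code
}

func main() {
	fmt.Println(statusAfterLateWriteHeader())
}
```

The practical consequence is that any status code has to be chosen before the first byte of the body is written.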
Devon Rifkin
424f648632
increase default context length to 4096 (#10364)
* increase default context length to 4096

We lower the default numParallel from 4 to 2 and use these "savings" to
double the default context length from 2048 to 4096.

We're memory neutral in cases when we previously would've used
numParallel == 4, but we add the following mitigation to handle some
cases where we would have previously fallen back to 1x2048 due to low
VRAM: we decide between 2048 and 4096 using a runtime check, choosing
2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
purposefully don't check the available VRAM because we don't want the
context window size to change unexpectedly based on the available VRAM.

We plan on making the default even larger, but this is a relatively
low-risk change we can make to quickly double it.

* fix tests

add an explicit context length so they don't get truncated. The code
that converts -1 from being a signal for doing a runtime check isn't
running as part of these tests.

* tweak small gpu message

* clarify context length default

also make it actually show up in `ollama serve --help`
2025-04-22 16:33:24 -07:00
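The runtime check described in the entry above could look roughly like this. It is a sketch under stated assumptions: the `gpu` type, `smallVRAM`, and `defaultContextLength` are hypothetical names, and "4 GB" is read as 4 GiB.

```go
package main

import "fmt"

// gpu is a hypothetical stand-in for whatever GPU description the scheduler uses.
type gpu struct {
	totalVRAM uint64 // total (not currently free) memory in bytes
}

// smallVRAM is the threshold from the commit message, assumed to mean 4 GiB.
const smallVRAM = 4 << 30

// defaultContextLength picks 2048 only on a single-GPU system with little
// total VRAM. It looks at total rather than available VRAM so the result
// does not change from one run to the next.
func defaultContextLength(gpus []gpu) int {
	if len(gpus) == 1 && gpus[0].totalVRAM <= smallVRAM {
		return 2048
	}
	return 4096
}

func main() {
	fmt.Println(defaultContextLength([]gpu{{totalVRAM: 4 << 30}}))
	fmt.Println(defaultContextLength([]gpu{{totalVRAM: 8 << 30}}))
	fmt.Println(defaultContextLength([]gpu{{totalVRAM: 4 << 30}, {totalVRAM: 4 << 30}}))
}
```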
Michael Yang
88738b357b create tempdir in models directory
the models directory should have plenty of storage and also ensure
there's no cross-device copy
2025-04-18 18:13:05 -07:00
Blake Mizerany
4e535e6188
server/internal/registry: make pull send errors with Error field (#10326)
Previously, the pull handler would send an error message in the Status
field, which prevented the client from using the message as a signal to
stop. In the case of the "run" command, it would follow the pull with a
"show" which would print a nearly identical "not found" message for
unresolved models.

Fixes #10307
2025-04-18 18:12:28 -07:00
Blake Mizerany
1d99451ad7
server/internal/client/ollama: handle some network errors gracefully (#10317) 2025-04-17 12:43:09 -07:00
Blake Mizerany
369de832cd
server/internal/registry: remove superfluous progress bar flush (#10303)
This removes the extra flushProgress() at the end of handlePull. It is
unnecessary because final progress updates are flushed in all cases of
the main select loop.
2025-04-16 14:43:07 -07:00
Blake Mizerany
3457a315b2
server/internal/client/ollama: cleanup use of multiple counters (#10304)
The completed and received counters must work in tandem and the code
should better reflect that. Previously, the act of updating them was 2-3
lines of code duplicated in multiple places. This consolidates them into
a single update closure for easy reading and maintenance.

This also simplifies error handling in places where we can use a return
parameter and defer to handle the error case for updates.

Also, remove the old Layer field from the trackingReader struct.
2025-04-16 14:33:40 -07:00
Daniel Hiltgen
56dc316a57
Give tests more time to run (#10306)
Fix flake failures on windows
2025-04-16 13:37:00 -07:00
Blake Mizerany
1e7f62cb42
cmd: add retry/backoff (#10069)
This commit adds retry/backoff to the registry client for pull requests.

Also, revert progress indication to match original client's until we can
"get it right."

Also, make WithTrace wrap existing traces instead of clobbering them.
This allows clients to compose traces.
2025-04-15 23:24:44 -07:00
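Retry with backoff, as mentioned in the entry above, generally follows the shape below. This is a generic sketch and not the registry client's implementation: `retry`, `callsNeeded`, and the doubling delay are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls fn up to attempts times, doubling the wait after each failure
// and giving up early if ctx is canceled.
func retry(ctx context.Context, attempts int, base time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point in sleeping after the final attempt
		}
		select {
		case <-time.After(base << i): // base, 2*base, 4*base, ...
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// callsNeeded runs an operation that fails a fixed number of times and
// reports how many calls were made before retry returned.
func callsNeeded(failures, attempts int) (int, error) {
	calls := 0
	err := retry(context.Background(), attempts, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errors.New("temporary failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	fmt.Println(callsNeeded(2, 5)) // succeeds on the third call
	fmt.Println(callsNeeded(9, 3)) // gives up after three calls
}
```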
Devon Rifkin
97fe45e36d server: add OpenAI-Beta header to CORS safelist
alphabetized the compat list and then added a single header

fixes: #9801
2025-04-14 15:36:10 -07:00
Tom Sheffler
ef65174df2
types: include the 'items' and '$defs' fields to properly handle "array" types (#10091)
---------

Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
2025-04-09 17:45:49 -07:00
Ire Gaddr
42ecb9f138
fix(scheduler): make model unload order deterministic (#10185) 2025-04-09 16:01:02 -07:00
Parth Sareen
6747099d71
types: add any type and validation for ToolFunction enum (#10166) 2025-04-08 15:05:38 -07:00
Alex Rozgo
2f723ac2d6
types: allow tool function parameters with a single type or an array of types (#9434) 2025-04-07 14:27:01 -07:00
Bruce MacDonald
e53b3cbd0c
llm: set done reason at server level (#9830)
No functional change. Many different done reasons can be set at the runner
level, so rather than obscuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.
2025-04-03 10:19:24 -07:00
Bruce MacDonald
9876c9faa4
chore(all): replace instances of interface with any (#10067)
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
2025-04-02 09:44:27 -07:00
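Because `any` is an alias rather than a new type, the replacement in the entry above changes no behavior. A small sketch (the `typeName` helper is hypothetical) shows the two spellings being used interchangeably:

```go
package main

import "fmt"

// typeName reports the dynamic type of whatever value it is given.
func typeName(v any) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	var old interface{} = 42
	var cur any = old // plain assignment: the two types are identical

	// A func(any) value is assignable to a func(interface{}) variable
	// for the same reason.
	var f func(interface{}) string = typeName

	fmt.Println(cur, f("x"))
}
```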
Bruce MacDonald
e172f095ba
api: return model capabilities from the show endpoint (#10066)
With support for multimodal models becoming more varied and common, it is important for clients to be able to easily see what capabilities a model has. Returning these from the show endpoint will allow clients to easily see what a model can do.
2025-04-01 15:21:46 -07:00
Blake Mizerany
ef27d52e79
server/internal/client/ollama: cache completed chunks (#9933)
This change adds tracking of download chunks during the pull process so
that subsequent pulls can skip downloading already completed chunks.
This works across restarts of ollama.

Currently, download state will be lost if a prune is triggered during a
pull (e.g. restart or remove). This issue should be addressed in a
follow-up PR.
2025-03-30 23:54:54 -07:00
CYJiang
0bd0454ea7
server: organize error types (#9465)
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-03-28 11:50:22 -07:00
Jesse Gross
f66216e399 ggml: Support heterogeneous KV cache layer sizes in memory estimation
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at the moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this into account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
2025-03-26 13:16:03 -07:00
Blake Mizerany
ce929984a3
server/internal/client/ollama: fix file descriptor management in Pull (#9931)
Close chunked writers as soon as downloads complete, rather than
deferring closure until Pull exits. This prevents exhausting file
descriptors when pulling many layers.

Instead of unbounded defers, use a WaitGroup and background goroutine
to close each chunked writer as soon as its downloads finish.

Also rename 'total' to 'received' for clarity.
2025-03-21 16:16:38 -07:00
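The WaitGroup-plus-goroutine pattern from the entry above can be sketched as follows. All names (`downloadLayers`, `countingCloser`, `closedCount`) are hypothetical, and the chunk fetch is a no-op; the point is only that each writer is closed as soon as its own chunks finish rather than through a deferred call at function exit.

```go
package main

import (
	"fmt"
	"io"
	"sync"
	"sync/atomic"
)

// countingCloser stands in for a chunked writer and counts Close calls.
type countingCloser struct{ n *atomic.Int32 }

func (c countingCloser) Close() error { c.n.Add(1); return nil }

// downloadLayers opens one writer per layer, fetches that layer's chunks
// concurrently, and closes the writer once those chunks are done.
func downloadLayers(layers [][]int, open func() io.Closer, fetch func(chunk int)) {
	var closers sync.WaitGroup
	for _, chunks := range layers {
		w := open()

		var wg sync.WaitGroup // tracks only this layer's chunks
		for _, c := range chunks {
			wg.Add(1)
			go func(c int) {
				defer wg.Done()
				fetch(c)
			}(c)
		}

		closers.Add(1)
		go func() {
			defer closers.Done()
			wg.Wait() // this layer is complete
			w.Close() // release its file descriptor right away
		}()
	}
	closers.Wait() // every writer has been closed before returning
}

// closedCount reports how many writers were closed for the given layers.
func closedCount(layers [][]int) int32 {
	var n atomic.Int32
	downloadLayers(layers, func() io.Closer { return countingCloser{&n} }, func(int) {})
	return n.Load()
}

func main() {
	fmt.Println(closedCount([][]int{{1, 2, 3}, {4}, {5, 6}}))
}
```

With deferred closes, the number of open descriptors grows with the number of layers; with this shape it is bounded by the number of layers still downloading.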
Blake Mizerany
c794fef2f2
server/internal/client/ollama: persist through chunk download errors (#9923) 2025-03-21 13:03:43 -07:00
Patrick Devine
f8c3dbe5b5
templates: add autotemplate for gemma3 (#9880)
This change allows the gemma3 template to be autodetected during `ollama
create`.
2025-03-20 00:15:30 -07:00
Blake Mizerany
2ddacd7516
server/internal/client/ollama: confirm all chunksums were received (#9893)
If the chunksums response is missing a chunk, the client should fail
the download. This changes the client to check that all bytes are
accounted for in the chunksums response.

It is possible there are overlaps or gaps in the chunksums response and
so the size is not the only thing left to check, but this provides
enough coverage for now. We may want to check that chunks are contiguous
later.
2025-03-19 14:59:57 -07:00
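The size check described in the entry above amounts to summing chunk lengths and comparing against the blob size. The sketch below assumes chunks are inclusive byte ranges; `chunkRange` and `checkCovered` are hypothetical names, not the client's types.

```go
package main

import "fmt"

// chunkRange is an inclusive byte range within a blob (assumed layout).
type chunkRange struct{ Start, End int64 }

func (c chunkRange) size() int64 { return c.End - c.Start + 1 }

// checkCovered fails if the chunk sizes do not add up to the blob size.
// As the commit notes, this does not detect overlaps and gaps that cancel
// each other out; that would need a contiguity check.
func checkCovered(blobSize int64, chunks []chunkRange) error {
	var total int64
	for _, c := range chunks {
		total += c.size()
	}
	if total != blobSize {
		return fmt.Errorf("chunks cover %d bytes, blob is %d bytes", total, blobSize)
	}
	return nil
}

func main() {
	fmt.Println(checkCovered(10, []chunkRange{{0, 4}, {5, 9}}))
	fmt.Println(checkCovered(10, []chunkRange{{0, 4}}))
}
```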
Blake Mizerany
8294676150
server/internal/client/ollama: set User-Agent for registry client (#9775)
This sets the agent header in DefaultRegistry to include the version of
the client, OS, and architecture in the previous format, with a minor
twist.

Note: The version is obtained from the build info, instead of the
version in version.Version, which should no longer be necessary, but we
can remove it in a future commit. Using the build info is more accurate and
also provides extra build information if the build is not tagged, and if
it is "dirty". Previously, the version was just "0.0.0" with no other
helpful information. The ollama.com registry and others handle this
swimmingly.
2025-03-14 18:33:07 -07:00
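Reading the version from build info, as the entry above describes, is possible through `runtime/debug.ReadBuildInfo`. The following is a sketch only: the `example-client/...` User-Agent layout and the helper names are assumptions, not the format the client actually sends.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// buildVersion returns the main module version recorded by the Go toolchain,
// or "unknown" when the binary carries no build info.
func buildVersion() string {
	info, ok := debug.ReadBuildInfo()
	if !ok || info.Main.Version == "" {
		return "unknown"
	}
	return info.Main.Version
}

// userAgent combines version, architecture, and OS in a hypothetical layout.
func userAgent() string {
	return fmt.Sprintf("example-client/%s (%s %s)", buildVersion(), runtime.GOARCH, runtime.GOOS)
}

func main() {
	fmt.Println(userAgent())
}
```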
Jesse Gross
7bf793a600 gemma3: Allow multiple images in a single input
Previously processing multiple images in a batch would trigger
segfaults so sending images together was disabled as a way to
mitigate this. The trigger was processing one image on the CPU
and one on the GPU.

This can no longer happen:
 - The vision encoder is now on the GPU so both images would be
   processed on the GPU.
 - We require images to be fully contained in a batch and each
   image including its special tokens is over half the batch size.
   As a result, we will never get two images in the same batch.

Fixes #9731
2025-03-14 15:38:54 -07:00
Blake Mizerany
4e320b8b90
server/internal/chunks: remove chunks package (#9755) 2025-03-14 08:57:59 -07:00
Blake Mizerany
eb2b22b042
server/internal/client: use chunksums for concurrent blob verification (#9746)
Replace large-chunk blob downloads with parallel small-chunk
verification to solve timeout and performance issues. Registry users
experienced progressively slowing download speeds as large-chunk
transfers aged, often timing out completely.

The previous approach downloaded blobs in a few large chunks but
required a separate, single-threaded pass to read the entire blob back
from disk for verification after download completion.

This change uses the new chunksums API to fetch many smaller
chunk+digest pairs, allowing concurrent downloads and immediate
verification as each chunk arrives. Chunks are written directly to their
final positions, eliminating the entire separate verification pass.

The result is more reliable downloads that maintain speed throughout the
transfer process and significantly faster overall completion, especially
over unstable connections or with large blobs.
2025-03-13 22:18:29 -07:00
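The approach in the entry above, verifying each chunk on arrival and writing it straight to its final offset, can be sketched like this. It is an illustration under assumptions: SHA-256 digests, in-memory data standing in for network reads, and hypothetical names (`chunk`, `writeVerified`, `roundTrip`).

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"io"
	"os"
	"sync"
)

// chunk pairs a piece of the blob with its expected digest.
type chunk struct {
	offset int64
	data   []byte   // stands in for bytes fetched over the network
	sum    [32]byte // expected SHA-256 of data
}

// writeVerified checks every chunk concurrently and writes each verified
// chunk directly to its final position, so no later verification pass
// over the whole file is needed.
func writeVerified(dst io.WriterAt, chunks []chunk) error {
	var wg sync.WaitGroup
	errs := make([]error, len(chunks))
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c chunk) {
			defer wg.Done()
			if sha256.Sum256(c.data) != c.sum {
				errs[i] = fmt.Errorf("chunk at offset %d: digest mismatch", c.offset)
				return
			}
			_, errs[i] = dst.WriteAt(c.data, c.offset)
		}(i, c)
	}
	wg.Wait()
	return errors.Join(errs...)
}

// roundTrip splits s into chunks of the given size (size must be > 0),
// writes them to a temporary file, and reads the file back.
func roundTrip(s string, size int) (string, error) {
	f, err := os.CreateTemp("", "blob-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	var chunks []chunk
	for off := 0; off < len(s); off += size {
		end := off + size
		if end > len(s) {
			end = len(s)
		}
		data := []byte(s[off:end])
		chunks = append(chunks, chunk{int64(off), data, sha256.Sum256(data)})
	}
	if err := writeVerified(f, chunks); err != nil {
		return "", err
	}
	out, err := os.ReadFile(f.Name())
	return string(out), err
}

func main() {
	fmt.Println(roundTrip("hello, chunked world", 4))
}
```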
Patrick Devine
4bed739259
add verbose mode to the show command (#9640)
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
2025-03-13 14:24:27 -07:00
Michael Yang
ec46f3286c engine: error on embeddings; not currently implemented 2025-03-13 11:40:55 -07:00
jmorganca
65b0f329d1 Revert "Allow models to force a new batch"
This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.
2025-03-11 14:49:20 -07:00
Jesse Gross
06007c0a18 Allow models to force a new batch
This is useful for a few things:
 - Work around bugs, such as having 2 images in one batch
 - Keep the image in a single batch for fully connected attention
 - Improve performance by not evaluating embeddings multiple times
2025-03-11 14:49:20 -07:00
Jesse Gross
475005504e Restrict Gemma to a single image per request 2025-03-11 14:49:20 -07:00
Blake Mizerany
e2252d0fc6
server/internal/registry: take over pulls from server package (#9485)
This commit replaces the old pull implementation in the server package
with the new, faster, more robust pull implementation in the registry
package.

The new endpoint, and now the remove endpoint too, are behind the
feature gate "client2" enabled only by setting the OLLAMA_EXPERIMENT
environment variable to include "client2".

Currently, the progress indication is wired to perform the same as the
previous implementation to avoid making changes to the CLI, and because
the status reports happen at the start of the download, and the end of
the write to disk, the progress indication is not as smooth as it could
be. This is a known issue and will be addressed in a future change.

This implementation may be ~0.5-1.0% slower in rare cases, depending on
network and disk speed, but is generally MUCH faster and more robust
than its predecessor in all other cases.
2025-03-05 14:48:18 -08:00
Daniel Hiltgen
1fdb351c37
New engine: vision models and auto-fallback (#9113)
* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine
2025-03-04 09:03:46 -08:00