this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k
the first call to http.ResponseWriter.Write implicitly calls WriteHeader
with http.StatusOK if it hasn't already been called. once WriteHeader
has been called, subsequent calls has no effect. Write is called when
JSON encoding progressUpdateJSON{}. calls to
http.ResponseWriter.WriteHeader after the first encode is useless and
produces a warning:
http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
* increase default context length to 4096
We lower the default numParallel from 4 to 2 and use these "savings" to
double the default context length from 2048 to 4096.
We're memory neutral in cases when we previously would've used
numParallel == 4, but we add the following mitigation to handle some
cases where we would have previously fallen back to 1x2048 due to low
VRAM: we decide between 2048 and 4096 using a runtime check, choosing
2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
purposefully don't check the available VRAM because we don't want the
context window size to change unexpectedly based on the available VRAM.
We plan on making the default even larger, but this is a relatively
low-risk change we can make to quickly double it.
* fix tests
add an explicit context length so they don't get truncated. The code
that converts -1 from being a signal for doing a runtime check isn't
running as part of these tests.
* tweak small gpu message
* clarify context length default
also make it actually show up in `ollama serve --help`
Previously, the pull handler would send an error message in the Status
field, this prevented the client from using the message as a signal to
stop. In the case of the "run" command, it would follow the pull with a
"show" which would print a nearly identical "not found" message for
unresolved models.
Fixes#10307
This removes the extra flushProgress() at the end of handlePull. It is
unnecessary because final progress updates are flushed in all cases of
the main select loop.
The completed and received counters must work in tandem and the code
should better reflect that. Previously, the act of updating them was 2-3
lines of code duplicated in multiple places. This consolidates them into
a single update closure for easy reading and maintenance.
This also simplifies error handling in places where we can use a return
parameter and defer to handle the error case for updates.
Also, remove the old Layer field from the trackingReader struct.
This commit adds retry/backoff to the registry client for pull requests.
Also, revert progress indication to match original client's until we can
"get it right."
Also, make WithTrace wrap existing traces instead of clobbering them.
This allows clients to compose traces.
No functional change. Many different done reasons can be set at the runner
level, so rather than obsuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.
With support for multimodal models becoming more varied and common it is important for clients to be able to easily see what capabilities a model has. Retuning these from the show endpoint will allow clients to easily see what a model can do.
This change adds tracking of download chunks during the pull process so
that subsequent pulls can skip downloading already completed chunks.
This works across restarts of ollama.
Currently, download state will be lost if a prune is triggered during a
pull (e.g. restart or remove). This issue should be addressed in a
follow-up PR.
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.
Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.
This allows memory estimation to calculate per-layer KV cache size
and take this account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.
Fixes#9730Fixes#9890
Close chunked writers as soon as downloads complete, rather than
deferring closure until Pull exits. This prevents exhausting file
descriptors when pulling many layers.
Instead of unbounded defers, use a WaitGroup and background goroutine
to close each chunked writer as soon as its downloads finish.
Also rename 'total' to 'received' for clarity.
If the chunksums response is missing a chunk, the client should fail
the download. This changes the client to check that all bytes are
accounted for in the chunksums response.
It is possible there are overlaps or gaps in the chunksums response and
so the size is not the only thing left to check, but this provides
enough coverage for now. We may want to check that chunks are contiguous
later.
This sets the agent header in DefaultRegistry to include the version of
the client, OS, and architecture in the previous format, with a minor
twist.
Note: The version is obtained from the build info, instead of the
version in version.Version, which should not longer be necessary, but we
can remove in a future commit. Using the build info is more accurate and
also provides extra build information if the build is not tagged, and if
it is "dirty". Previously, the version was just "0.0.0" with no other
helpful information. The ollama.com registry and others handle this
swimmingly.
Previously processing multiple images in a batch would trigger
segfaults so sending images together was disabled as a way to
mitigate this. The trigger was processing one image on the CPU
and one on the GPU.
This can no longer happen:
- The vision encoder is now on the GPU so both images would be
processed on the GPU.
- We require images to be fully contained in a batch and each
image including its special tokens is over half the batch size.
As a result, we will never get two images in the same batch.
Fixes#9731
Replace large-chunk blob downloads with parallel small-chunk
verification to solve timeout and performance issues. Registry users
experienced progressively slowing download speeds as large-chunk
transfers aged, often timing out completely.
The previous approach downloaded blobs in a few large chunks but
required a separate, single-threaded pass to read the entire blob back
from disk for verification after download completion.
This change uses the new chunksums API to fetch many smaller
chunk+digest pairs, allowing concurrent downloads and immediate
verification as each chunk arrives. Chunks are written directly to their
final positions, eliminating the entire separate verification pass.
The result is more reliable downloads that maintain speed throughout the
transfer process and significantly faster overall completion, especially
over unstable connections or with large blobs.
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
This is useful for a few things:
- Work around bugs, such as having 2 images in one batch
- Keep the image in a single batch for fully connected attention
- Improve performance by not evaluating embeddings multiple times
This commit replaces the old pull implementation in the server package
with the new, faster, more robust pull implementation in the registry
package.
The new endpoint, and now the remove endpoint too, are behind the
feature gate "client2" enabled only by setting the OLLAMA_EXPERIMENT
environment variable include "client2".
Currently, the progress indication is wired to perform the same as the
previous implementation to avoid making changes to the CLI, and because
the status reports happen at the start of the download, and the end of
the write to disk, the progress indication is not as smooth as it could
be. This is a known issue and will be addressed in a future change.
This implementation may be ~0.5-1.0% slower in rare cases, depending on
network and disk speed, but is generally MUCH faster and more robust
than the its predecessor in all other cases.
* Include unified vision layers in memory prediction
For newer vision models with a single gguf, include
the projection estimates.
* Adjust CLI to handle both styles of vision model metadata
* Wire up new tokenizers for new engine
If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.
This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.
* Lay foundation for auto selection of new engine
This reintroduces aggressive pruning on model deletion as a temporary
measure until a more controlled garbage collection (GC) mechanism is
implemented.
Issues with the current approach:
1. Users may accidentally delete a model (`ollama rm llama3.3` instead
of `ollama rm llama3.2`), requiring a full re-download unless another
model references the same blobs.
2. Users may assume a deleted model is still referenced elsewhere, but
due to prior updates or deletions, the references no longer exist,
leading to unnecessary re-downloads.
Soon, we should implement a structured GC mechanism to retain
unreferenced blobs for a configurable period before removal, which will
run on "ollama rm" and other commands we deem appropriate.
Users that want to immediately remove unreferenced blobs can use a new
prune command that will allow them to specify the age and class of blobs
to remove.
Example usage:
# Run basic blob GC
$ ollama prune
# Remove unreferenced blobs older than 7 days
$ ollama prune --age 7d
# Remove all blobs, referenced or not, older than 7 days (and their manifests?)
$ ollama prune --age 7d --all
# Remove all unreferenced blobs immediately
$ ollama prune --age 0 --all
# Remove all blobs
$ ollama prune --age 0 --all
This should provide a safer and more predictable cleanup process.
Previously, developers without the synctest experiment enabled would see
build failures when running tests in some server/internal/internal
packages using the synctest package. This change makes the transition to
use of the package less painful but guards the use of the synctest
package with build tags.
synctest is enabled in CI. If a new change will break a synctest
package, it will break in CI, even if it does not break locally.
The developer docs have been updated to help with any confusion about
why package tests pass locally but fail in CI.
Previously, using a Registry required a DiskCache to be passed in for
use in various methods. This was a bit cumbersome, as the DiskCache is
required for most operations, and the DefaultCache is used in most of
those cases. This change makes the DiskCache an optional field on the
Registry struct.
This also changes DefaultCache to initialize on first use. This is to
not burden clients with the cost of creating a new cache per use, or
having to hold onto a cache for the lifetime of the Registry.
Also, slip in some minor docs updates for Trace.
The extended name format is a superset of the name format that only the
client needs to know about, not the server or other dependents of the
name package, so move the split logic into the client package.
Also, take advantage of knowing about the extended name format to allow
the client to use the extended name format when unlinking to verify they
are unlinking the manifest with the content they intend.
This commit is a step towards a goal to make names less ceremonial
outside of the registry client. Clients of the registry package can
treat names as opaque strings, and the registry package will handle
parsing, validating, and normalizing names.
Ideally we end up with the names package tucked away in an internal
package for good. We'll see how things go.
Also, this package name is not permanent. This another step in the
on-going process of refactoring the server code, and at some point it
will most likely be renamed/moved.
More validation during the safetensor creation process.
Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths
Add comprehensive test coverage for various paths
No functionality changes for valid inputs - existing workflows remain unaffected
Leverages Go 1.24's new os.Root functionality for secure containment
Also, require the -as flag to be set when importing a model. This
prevents the confusing error message "invalid name".
Also, allow short names to be used when importing a model and
auto-complete the name with the default mask.
This fixes panics introduced in 2412adf42b
when Gin ungracefully assumes that the http.ResponseWriter implements
http.CloseNotifier and http.Flusher, which our new statusCodeRecorder
does not. This is a temporary fix until we can pour the rest of the Gin
out.
This commit introduces a new API implementation for handling
interactions with the registry and the local model cache. The new API is
located in server/internal/registry. The package name is "registry" and
should be considered temporary; it is hidden and not bleeding outside of
the server package. As the commits roll in, we'll start consuming more
of the API and then let reverse osmosis take effect, at which point it
will surface closer to the root level packages as much as needed.
This commit copies (without history) the bmizerany/ollama-go repository
with the intention of integrating it into the ollama as a replacement
for the pushing, and pulling of models, and management of the cache they
are pushed and pulled from.
New homes for these packages will be determined as they are integrated
and we have a better understanding of proper package boundaries.
The route assembly in Handler lacked clear organization making it
difficult scan for routes and their relationships to each other. This
commit aims to fix that by reordering the assembly of routes to group
them by category and purpose.
Also, be more specific about what "config" refers to (it is about CORS
if you were wondering... I was.)