The stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")`, which isn't strictly correct
since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight. The first shuts down the
original model load, but the second causes the loss of the new
reloading runner reference, thus triggering the leak.
The primary fix is detecting the duplicate unload and ignoring the second
instance. The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
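A minimal sketch of the duplicate-unload guard, with runnerRef and scheduler as illustrative stand-ins for the actual scheduler types:

```go
package sched

import "log/slog"

// Illustrative types; the real scheduler differs.
type runnerRef struct{ model string }

func (r *runnerRef) stop() { /* shut down the runner process */ }

type scheduler struct {
	loaded map[string]*runnerRef // model name -> active runner
}

// unload ignores a stale unload event whose runner has already been
// superseded by a reload of the same model, so the new runner's
// reference is never clobbered.
func (s *scheduler) unload(r *runnerRef) {
	if cur := s.loaded[r.model]; cur != r {
		slog.Warn("ignoring duplicate unload for superseded runner", "model", r.model)
		return
	}
	delete(s.loaded, r.model)
	r.stop()
}
```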
* Move quantization logic to GGML via new backend
This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations"
This is no longer needed now that quantization is implemented in Go+GGML code directly.
This enhances our logging in the scheduler. The initial "waiting for server" log
no longer claims an error state (it now reports "not responding", which better reflects
the actual state). Runners now have slog wiring to report more details,
including the PID.
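For illustration, the wiring might look like this; the attribute keys are assumptions:

```go
package runner

import (
	"log/slog"
	"os/exec"
)

func startRunner(bin string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(bin, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	// Every subsequent log line carries the runner binary and PID.
	logger := slog.With("runner", bin, "pid", cmd.Process.Pid)
	logger.Info("waiting for server to start responding")
	return cmd, nil
}
```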
* strip out thinking tags in message history for qwen3 & r1
This is in advance of "proper" support where we'll make reasoning
configurable and we'll parse out thinking/reasoning tags and provide
them to the caller. These models expect there to be no thinking tags in
the message history, so this should improve quality.
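A sketch of the stripping, assuming these models emit `<think>...</think>` blocks (the real parser's tag handling may differ):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// (?s) lets `.` match newlines so multi-line reasoning is removed.
var thinkRE = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripThinking removes reasoning blocks before a message is placed
// back into the history sent to the model.
func stripThinking(content string) string {
	return strings.TrimSpace(thinkRE.ReplaceAllString(content, ""))
}

func main() {
	fmt.Println(stripThinking("<think>step by step...</think>The answer is 42."))
	// Output: The answer is 42.
}
```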
* parse model names instead of hacky prefix check
* Adjust initial scheduler refCount
Ensure we only set the refCount on success
* sched: fix lock order inversion deadlock
Under certain race conditions, there was a scenario where the scheduler would
get into a deadlock while trying to update free space information while a model
was trying to unload.
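As a generic illustration (types invented for the sketch): the inversion happens when one path locks the scheduler then a runner while the other locks the runner then the scheduler, so each blocks waiting on the other. The cure is one agreed acquisition order:

```go
package sched

import "sync"

type runner struct{ mu sync.Mutex }

type sched struct {
	mu   sync.Mutex
	free uint64
}

// Before: updateFreeSpace locked s.mu then r.mu while unload locked
// r.mu then s.mu, which can deadlock. After: both paths take s.mu
// first, then r.mu, so no lock cycle is possible.
func (s *sched) unload(r *runner) {
	s.mu.Lock()
	r.mu.Lock()
	// ... release the runner's memory and update s.free ...
	r.mu.Unlock()
	s.mu.Unlock()
}
```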
This is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral, though: even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger at 4k.
the first call to http.ResponseWriter.Write implicitly calls WriteHeader
with http.StatusOK if it hasn't already been called. Once WriteHeader
has been called, subsequent calls have no effect. Write is called when
JSON encoding progressUpdateJSON{}, so calls to
http.ResponseWriter.WriteHeader after the first encode are useless and
produce a warning:
http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
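This is standard net/http behavior, demonstrated below with progressUpdateJSON reduced to a single field:

```go
package main

import (
	"encoding/json"
	"net/http"
)

type progressUpdateJSON struct {
	Status string `json:"status"`
}

func handler(w http.ResponseWriter, r *http.Request) {
	// The first Write (inside Encode) implicitly sends
	// WriteHeader(http.StatusOK).
	json.NewEncoder(w).Encode(progressUpdateJSON{Status: "pulling"})

	// Too late: the header is already out, so this is a no-op that
	// logs "http: superfluous response.WriteHeader call ...".
	w.WriteHeader(http.StatusInternalServerError)
}

func main() {
	http.ListenAndServe(":8080", http.HandlerFunc(handler))
}
```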
* increase default context length to 4096
We lower the default numParallel from 4 to 2 and use these "savings" to
double the default context length from 2048 to 4096.
We're memory neutral in cases when we previously would've used
numParallel == 4, but we add the following mitigation to handle some
cases where we would have previously fallen back to 1x2048 due to low
VRAM: we decide between 2048 and 4096 using a runtime check, choosing
2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
purposefully don't check the available VRAM because we don't want the
context window size to change unexpectedly based on the available VRAM.
We plan on making the default even larger, but this is a relatively
low-risk change we can make to quickly double it.
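A sketch of that runtime check, with gpuInfo standing in for the real GPU discovery type:

```go
package config

// gpuInfo is a stand-in for the real GPU discovery struct.
type gpuInfo struct {
	TotalMemory uint64 // bytes
}

// defaultContextLength picks 2048 only on a single-GPU system with
// <= 4 GB of total VRAM. Total VRAM is used deliberately: keying off
// available VRAM would make the default flap based on whatever else
// happens to be loaded.
func defaultContextLength(gpus []gpuInfo) int {
	if len(gpus) == 1 && gpus[0].TotalMemory <= 4*1024*1024*1024 {
		return 2048
	}
	return 4096
}
```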
* fix tests
add an explicit context length so they don't get truncated. The code
that treats -1 as a signal to perform the runtime check isn't running
as part of these tests.
* tweak small gpu message
* clarify context length default
also make it actually show up in `ollama serve --help`
Previously, the pull handler would send an error message in the Status
field, which prevented the client from using the message as a signal to
stop. In the case of the "run" command, it would follow the pull with a
"show" which would print a nearly identical "not found" message for
unresolved models.
Fixes #10307
This removes the extra flushProgress() at the end of handlePull. It is
unnecessary because final progress updates are flushed in all cases of
the main select loop.
The completed and received counters must work in tandem and the code
should better reflect that. Previously, the act of updating them was 2-3
lines of code duplicated in multiple places. This consolidates them into
a single update closure for easy reading and maintenance.
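As a rough sketch (names and semantics are assumptions, not the actual internals), the consolidated closure can look like:

```go
package progress

import "sync/atomic"

// counters holds the two values that must always move in tandem.
type counters struct {
	received  atomic.Int64 // bytes read from the network
	completed atomic.Int64 // bytes committed and verified
}

// update returns a single closure so every call site advances both
// counters the same way, instead of duplicating the two-line dance.
func (c *counters) update() func(received, completed int64) {
	return func(received, completed int64) {
		c.received.Add(received)
		c.completed.Add(completed)
	}
}
```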
This also simplifies error handling in places where we can use a return
parameter and defer to handle the error case for updates.
Also, remove the old Layer field from the trackingReader struct.
This commit adds retry/backoff to the registry client for pull requests.
Also, revert progress indication to match original client's until we can
"get it right."
Also, make WithTrace wrap existing traces instead of clobbering them.
This allows clients to compose traces.
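A simplified sketch of the composition, assuming a Trace reduced to a single Update hook (the real trace carries more callbacks):

```go
package registry

import "context"

// Trace is reduced to one hook for this sketch.
type Trace struct {
	Update func(msg string, n int64, err error)
}

type traceKey struct{}

// WithTrace layers t over any trace already in ctx instead of
// replacing it, so both observers keep firing.
func WithTrace(ctx context.Context, t *Trace) context.Context {
	if old, ok := ctx.Value(traceKey{}).(*Trace); ok && old != nil {
		prev, next := old.Update, t.Update
		t = &Trace{Update: func(msg string, n int64, err error) {
			if prev != nil {
				prev(msg, n, err)
			}
			if next != nil {
				next(msg, n, err)
			}
		}}
	}
	return context.WithValue(ctx, traceKey{}, t)
}
```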
No functional change. Many different done reasons can be set at the runner
level, so rather than obscuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.
With support for multimodal models becoming more varied and common, it is important for clients to be able to easily see what capabilities a model has. Returning these from the show endpoint will allow clients to see what a model can do.
This change adds tracking of download chunks during the pull process so
that subsequent pulls can skip downloading already completed chunks.
This works across restarts of ollama.
Currently, download state will be lost if a prune is triggered during a
pull (e.g. restart or remove). This issue should be addressed in a
follow-up PR.
Gemma3 uses sliding windows for its context on 5 out of every 6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.
Llama3.2-vision is also inconsistent between self attention and cross
attention layers; at the moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.
This allows memory estimation to calculate per-layer KV cache size
and take this into account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.
Fixes #9730
Fixes #9890
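The placement idea, reduced to a sketch with invented types (the real estimator tracks much more state):

```go
package memory

type layerCost struct {
	weights uint64 // tensor bytes for this layer
	kv      uint64 // per-layer KV cache; sliding-window layers cost far less
}

// placeLayers walks layers front to back, spilling to the next GPU
// when the exact per-layer cost no longer fits, instead of sizing
// every layer at the max (or the average, which can overflow a GPU).
func placeLayers(layers []layerCost, gpuFree []uint64) []int {
	assign := make([]int, len(layers))
	g := 0
	for i, l := range layers {
		need := l.weights + l.kv
		for g < len(gpuFree) && gpuFree[g] < need {
			g++
		}
		if g == len(gpuFree) {
			assign[i] = -1 // no GPU fits; leave this layer on CPU
			continue
		}
		gpuFree[g] -= need
		assign[i] = g
	}
	return assign
}
```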
Close chunked writers as soon as downloads complete, rather than
deferring closure until Pull exits. This prevents exhausting file
descriptors when pulling many layers.
Instead of unbounded defers, use a WaitGroup and background goroutine
to close each chunked writer as soon as its downloads finish.
Also rename 'total' to 'received' for clarity.
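A sketch of the pattern, with the actual chunk download abstracted away:

```go
package pull

import (
	"io"
	"sync"
)

// pullLayer starts one goroutine per chunk plus a background goroutine
// that closes the writer the moment the last chunk lands, so file
// descriptors are released long before Pull itself returns.
func pullLayer(w io.WriteCloser, chunks []func(io.Writer) error) {
	var wg sync.WaitGroup
	for _, dl := range chunks {
		wg.Add(1)
		go func(dl func(io.Writer) error) {
			defer wg.Done()
			dl(w) // error handling elided in this sketch
		}(dl)
	}
	go func() {
		wg.Wait()
		w.Close()
	}()
}
```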
If the chunksums response is missing a chunk, the client should fail
the download. This changes the client to check that all bytes are
accounted for in the chunksums response.
It is possible there are overlaps or gaps in the chunksums response and
so the size is not the only thing left to check, but this provides
enough coverage for now. We may want to check that chunks are contiguous
later.
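The accounting reduces to a check like this sketch (the chunk shape is an assumption):

```go
package pull

import "fmt"

type chunk struct {
	Offset int64
	Size   int64
}

// verifyCoverage fails the download when the chunksums response does
// not account for every byte of the blob. Overlaps or gaps that
// happen to sum to the right size would still pass; checking
// contiguity is left for later.
func verifyCoverage(chunks []chunk, blobSize int64) error {
	var covered int64
	for _, c := range chunks {
		covered += c.Size
	}
	if covered != blobSize {
		return fmt.Errorf("chunksums cover %d bytes, want %d", covered, blobSize)
	}
	return nil
}
```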
This sets the agent header in DefaultRegistry to include the version of
the client, OS, and architecture in the previous format, with a minor
twist.
Note: The version is obtained from the build info, instead of the
version in version.Version, which should no longer be necessary but which
we can remove in a future commit. Using the build info is more accurate and
also provides extra build information if the build is not tagged, and if
it is "dirty". Previously, the version was just "0.0.0" with no other
helpful information. The ollama.com registry and others handle this
swimmingly.
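A sketch of deriving the header from build info; the exact format string here is illustrative:

```go
package client

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// userAgent builds the value from the module's build info instead of a
// hardcoded version string, so untagged and dirty builds are
// identifiable too.
func userAgent() string {
	version := "0.0.0"
	if bi, ok := debug.ReadBuildInfo(); ok && bi.Main.Version != "" {
		version = bi.Main.Version
	}
	return fmt.Sprintf("ollama/%s (%s %s) Go/%s",
		version, runtime.GOARCH, runtime.GOOS, runtime.Version())
}
```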
Previously processing multiple images in a batch would trigger
segfaults so sending images together was disabled as a way to
mitigate this. The trigger was processing one image on the CPU
and one on the GPU.
This can no longer happen:
- The vision encoder is now on the GPU so both images would be
processed on the GPU.
- We require images to be fully contained in a batch and each
image including its special tokens is over half the batch size.
As a result, we will never get two images in the same batch.
Fixes #9731
Replace large-chunk blob downloads with parallel small-chunk
verification to solve timeout and performance issues. Registry users
experienced progressively slowing download speeds as large-chunk
transfers aged, often timing out completely.
The previous approach downloaded blobs in a few large chunks but
required a separate, single-threaded pass to read the entire blob back
from disk for verification after download completion.
This change uses the new chunksums API to fetch many smaller
chunk+digest pairs, allowing concurrent downloads and immediate
verification as each chunk arrives. Chunks are written directly to their
final positions, eliminating the entire separate verification pass.
The result is more reliable downloads that maintain speed throughout the
transfer process and significantly faster overall completion, especially
over unstable connections or with large blobs.
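The per-chunk core reduces to a ranged GET that is hashed as it streams and written at its final offset. A sketch, omitting the concurrency, retry, and backoff wiring around it:

```go
package pull

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetchChunk downloads one chunk with a Range request, hashes it as it
// streams, and writes it directly to its final position in the blob.
func fetchChunk(url string, f *os.File, off, size int64, digest string) error {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+size-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}

	h := sha256.New()
	// TeeReader hashes the bytes while they stream to disk; no second
	// read-back pass is needed.
	n, err := io.Copy(io.NewOffsetWriter(f, off), io.TeeReader(resp.Body, h))
	if err != nil {
		return err
	}
	if n != size || hex.EncodeToString(h.Sum(nil)) != digest {
		return fmt.Errorf("chunk at offset %d failed verification", off)
	}
	return nil
}
```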
Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
This is useful for a few things:
- Work around bugs, such as having 2 images in one batch
- Keep the image in a single batch for fully connected attention
- Improve performance by not evaluating embeddings multiple times
This commit replaces the old pull implementation in the server package
with the new, faster, more robust pull implementation in the registry
package.
The new endpoint, and now the remove endpoint too, are behind the
feature gate "client2" enabled only by setting the OLLAMA_EXPERIMENT
environment variable include "client2".
Currently, the progress indication is wired to perform the same as the
previous implementation to avoid making changes to the CLI, and because
the status reports happen at the start of the download, and the end of
the write to disk, the progress indication is not as smooth as it could
be. This is a known issue and will be addressed in a future change.
This implementation may be ~0.5-1.0% slower in rare cases, depending on
network and disk speed, but is generally MUCH faster and more robust
than its predecessor in all other cases.
* Include unified vision layers in memory prediction
For newer vision models with a single gguf, include
the projection estimates.
* Adjust CLI to handle both styles of vision model metadata
* Wire up new tokenizers for new engine
If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.
This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.
* Lay foundation for auto selection of new engine