This removes the extra flushProgress() at the end of handlePull. It is
unnecessary because final progress updates are flushed in all cases of
the main select loop.
The completed and received counters must work in tandem and the code
should better reflect that. Previously, the act of updating them was 2-3
lines of code duplicated in multiple places. This consolidates them into
a single update closure for easy reading and maintenance.
This also simplifies error handling in places where we can use a return
parameter and defer to handle the error case for updates.
Also, remove the old Layer field from the trackingReader struct.
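For illustration, a minimal sketch of the pattern (the counter names and the pullLayer helper are hypothetical, not the actual handlePull code): both counters change through a single closure, and a named return plus defer lets the error path reuse it.

```go
package main

import "sync/atomic"

// progress is a hypothetical stand-in for the pull progress counters.
type progress struct {
	completed atomic.Int64 // bytes belonging to fully verified layers
	received  atomic.Int64 // bytes received so far
}

func main() {
	var p progress

	// update is the single place both counters change together.
	update := func(n int64, err error) {
		p.received.Add(n)
		if err == nil {
			p.completed.Add(n)
		}
	}

	// With a named return and a deferred call, the error path is handled
	// by the same closure as the success path.
	pullLayer := func() (n int64, err error) {
		defer func() { update(n, err) }()
		// ... download the layer, setting n and err ...
		return n, err
	}
	_, _ = pullLayer()
}
```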
This commit adds retry/backoff to the registry client for pull requests.
Also, revert progress indication to match original client's until we can
"get it right."
Also, make WithTrace wrap existing traces instead of clobbering them.
This allows clients to compose traces.
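Roughly what wrapping instead of clobbering can look like; the Trace type and its Update hook here are placeholders rather than the client's real API.

```go
package main

import "context"

// Trace is a placeholder for the registry client's trace hooks.
type Trace struct {
	Update func(digest string, n int64, err error)
}

type traceKey struct{}

// WithTrace composes the new trace with any trace already on the context
// instead of replacing it, so multiple callers can observe the same pull.
func WithTrace(ctx context.Context, t *Trace) context.Context {
	if old, ok := ctx.Value(traceKey{}).(*Trace); ok && old.Update != nil {
		inner, outer := old.Update, t.Update
		t = &Trace{Update: func(d string, n int64, err error) {
			inner(d, n, err)
			if outer != nil {
				outer(d, n, err)
			}
		}}
	}
	return context.WithValue(ctx, traceKey{}, t)
}

func main() { _ = WithTrace(context.Background(), &Trace{}) }
```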
When ggml_backend_buffer_free() is called, the device memory
is released but not all backends consistently release the actual
ggml_backend_buffer_t in system RAM, causing a memory leak.
Bug #10040
For every forward pass through the model, we need to allocate input
tensors: tokens, images, positions, outputs and masks. These get
allocated in system memory.
However, when we close the context that the tensors were allocated
through, the metadata gets freed but the actual backend memory does
not. This results in a significant memory leak.
This makes it so that all the memory allocated through a context
gets freed when it is closed.
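A simplified sketch of the ownership model this implies, using a toy Context and buffer type rather than the real ggml bindings: the context records every buffer it allocates and releases them all in Close.

```go
package main

// buffer stands in for one backend allocation (e.g. a ggml_backend_buffer_t).
type buffer struct{ freed bool }

func (b *buffer) free() { b.freed = true }

// Context tracks every buffer allocated through it so Close can release the
// backend memory, not just the Go-side metadata.
type Context struct {
	buffers []*buffer
}

func (c *Context) alloc() *buffer {
	b := &buffer{}
	c.buffers = append(c.buffers, b)
	return b
}

func (c *Context) Close() {
	for _, b := range c.buffers {
		b.free()
	}
	c.buffers = nil
}

func main() {
	c := &Context{}
	_ = c.alloc() // input tensors: tokens, images, positions, outputs, masks
	c.Close()     // releases the backend memory as well as the metadata
}
```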
Fixes #10040
Allocating (and in particular, freeing) memory from CUDA host buffers
is expensive and can cause a significant performance hit if we do
it for every token. Using normal system memory avoids this issue
and also gives the OS more flexibility to manage it.
There is no performance impact from this patch directly (either
positive or negative) but it makes a difference once we start
freeing memory correctly.
Context currently mixes pointer and value receivers. Change this to all
pointer receivers so we don't have to reason about whether the things we
are updating in the struct will be retained.
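The hazard with mixed receivers, in miniature (a toy Context, not the real one): updates made through a value receiver land on a copy and are silently dropped.

```go
package main

import "fmt"

type Context struct{ nodes int }

// Value receiver: the increment happens on a copy and is lost.
func (c Context) addNodeByValue() { c.nodes++ }

// Pointer receiver: the increment is retained by the caller's struct.
func (c *Context) addNodeByPointer() { c.nodes++ }

func main() {
	var c Context
	c.addNodeByValue()
	c.addNodeByPointer()
	fmt.Println(c.nodes) // 1, not 2
}
```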
Sometimes loading the GGUF file fails with:
panic: context canceled
This is probably a filesystem error but it doesn't provide any
information about what happened.
Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.
This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.
Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.
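A sketch of the idea with made-up numbers and names: build the worst-case reservation at startup so an allocation failure surfaces immediately rather than mid-inference.

```go
package main

import "fmt"

// reserve stands in for allocating the KV cache and compute graph for a
// given number of tokens; here it just checks against a fake VRAM budget.
func reserve(tokens, budget int) error {
	need := tokens * 1024 // pretend each token costs 1 KiB of VRAM
	if need > budget {
		return fmt.Errorf("need %d bytes, only %d available", need, budget)
	}
	return nil
}

func main() {
	const numCtx = 8192

	// Reserve for the largest inputs we could ever see (a full context of
	// text tokens) at startup, so failure happens here rather than at some
	// surprising point during inference.
	if err := reserve(numCtx, 64<<20); err != nil {
		panic(err)
	}
	fmt.Println("worst-case allocation succeeded")
}
```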
If there is a CUDA OOM, we currently don't check the return value
and will eventually segfault. This checks for the problem and generates
a Go error. At the moment, this will still result in a panic but having
the error is the first step to being able to handle it more gracefully.
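A hedged sketch of the check, assuming the allocation call signals OOM by returning a zero handle; the function names are illustrative, not the actual bindings.

```go
package main

import (
	"errors"
	"fmt"
)

// allocGraph is a placeholder for the cgo call that allocates backend
// memory for the compute graph; on a CUDA OOM it returns a zero handle.
func allocGraph(size int) uintptr {
	return 0 // simulate an out-of-memory result
}

var errNoMem = errors.New("insufficient memory to allocate compute graph")

func buildGraph(size int) (uintptr, error) {
	h := allocGraph(size)
	if h == 0 {
		// Previously the zero handle went unchecked and was later
		// dereferenced in C, causing a segfault. Returning a Go error is
		// the first step toward handling it gracefully.
		return 0, fmt.Errorf("%w: %d bytes", errNoMem, size)
	}
	return h, nil
}

func main() {
	if _, err := buildGraph(1 << 30); err != nil {
		fmt.Println(err) // now a real error instead of a later segfault
	}
}
```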
This improves model loading times on network-based filesystems such as
GCS FUSE by creating a dedicated file descriptor for each section of the
file being read, reducing seeking.
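A minimal sketch of the approach (file name and offsets are made up): each section gets its own *os.File, so concurrent reads of different sections never contend on a shared file offset.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// sectionReader opens a dedicated descriptor for one slice of the file, so
// reads in that slice never seek a descriptor shared with other sections.
func sectionReader(path string, off, n int64) (io.ReadCloser, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	return struct {
		io.Reader
		io.Closer
	}{io.NewSectionReader(f, off, n), f}, nil
}

func main() {
	r, err := sectionReader("model.gguf", 0, 4096) // one slice of the model file
	if err != nil {
		fmt.Println(err)
		return
	}
	defer r.Close()
	_, _ = io.ReadAll(r)
}
```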
Mistral is a popular research lab making open-source models. This updates
the forward pass of llama-architecture models to support both Llama and
Mistral models by accounting for additional metadata present in Mistral
models and finding the correct dimensions for the output projection.
No functional change. Many different done reasons can be set at the runner
level, so rather than obscuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.
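Roughly the shape of the separation, with illustrative types rather than the actual ones: the runner reports the specific reason and the server decides how to map it for the API.

```go
package main

import "fmt"

// DoneReason enumerates why the runner stopped; illustrative values only.
type DoneReason int

const (
	DoneStop             DoneReason = iota // hit a stop sequence or EOS
	DoneLength                             // hit the prediction limit
	DoneConnectionClosed                   // client went away
)

// The runner returns the raw reason instead of a pre-baked string.
type completionResponse struct {
	Done       bool
	DoneReason DoneReason
}

// The server process owns the mapping to what the API should expose.
func apiDoneReason(r DoneReason) string {
	switch r {
	case DoneLength:
		return "length"
	default:
		return "stop"
	}
}

func main() {
	resp := completionResponse{Done: true, DoneReason: DoneLength}
	fmt.Println(apiDoneReason(resp.DoneReason))
}
```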
The sliding window cache trims entries that are outside the window for
the latest token. This works when we are extending the cache, such as
when the conversation continues. However, if we have a partial overlap
in conversation (including the BOS tokens), then we resume from a past
point in the conversation and the needed tokens are no longer stored
in memory. This verifies that the new window overlaps with the old one
before reusing the cache.
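A simplified model of the overlap check (not the real cache code): reuse the cached prefix only if the entries still retained reach back far enough for the resume position.

```go
package main

import "fmt"

// canReuse reports whether a sliding-window cache that has seen cachedLen
// tokens can be resumed from position pos. Entries older than
// cachedLen-window have been trimmed, but resuming at pos needs the window
// of positions just before it, so the retained range must cover it.
func canReuse(cachedLen, window, pos int) bool {
	if pos > cachedLen {
		return false // we never cached that far
	}
	oldestRetained := max(0, cachedLen-window)
	neededStart := max(0, pos-window)
	return neededStart >= oldestRetained
}

func main() {
	fmt.Println(canReuse(100, 32, 100)) // true: extending the conversation
	fmt.Println(canReuse(100, 32, 40))  // false: positions 8-39 were trimmed
}
```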
Co-authored-by: Jesse Gross <jesse@ollama.com>
When truncating inputs to the context window at the beginning of
a sequence, we remove the minimum amount possible. However, this
may cause us to truncate to the middle of a set of inputs that
the model specified should not be split up. To avoid this, we
need to remove the rest of the partial batch.
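A small sketch under assumed types, with a groupID standing in for however the model marks inputs that must stay together: after picking the minimal cut point, extend it past any group it would split.

```go
package main

import "fmt"

// input is a stand-in for one runner input; inputs sharing a groupID (>= 0)
// must not be split across a truncation boundary (e.g. image embeddings).
type input struct {
	token   int
	groupID int
}

// truncate drops the first cut inputs, but if that cut lands inside a group
// it drops the rest of that partial group as well.
func truncate(inputs []input, cut int) []input {
	if cut >= len(inputs) {
		return nil
	}
	if g := inputs[cut].groupID; g >= 0 && cut > 0 && inputs[cut-1].groupID == g {
		for cut < len(inputs) && inputs[cut].groupID == g {
			cut++
		}
	}
	return inputs[cut:]
}

func main() {
	ins := []input{{1, -1}, {2, 7}, {3, 7}, {4, 7}, {5, -1}}
	fmt.Println(len(truncate(ins, 2))) // 1: the rest of group 7 is removed too
}
```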
With support for multimodal models becoming more varied and common, it is important for clients to be able to easily see what capabilities a model has. Returning these from the show endpoint will allow clients to determine what a model can do.
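Schematically, with an illustrative struct rather than the full API definition:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ShowResponse is an illustrative subset of the show endpoint's response;
// the real response carries many more fields.
type ShowResponse struct {
	Capabilities []string `json:"capabilities,omitempty"`
}

func main() {
	b, _ := json.Marshal(ShowResponse{Capabilities: []string{"completion", "vision"}})
	fmt.Println(string(b)) // {"capabilities":["completion","vision"]}
}
```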
Clear the KV cache when the shift operation is not supported by the model.
Added a KvCacheCanShift() check to handle models that can't perform cache shifts,
falling back to a full cache clear while preserving the logical token history to
maintain expected behavior when the context window fills up.
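In outline, with simplified stand-in types alongside the KvCacheCanShift() check mentioned above:

```go
package main

import "fmt"

// model and seq are simplified stand-ins for the runner's types.
type model struct{ canShift bool }

func (m model) KvCacheCanShift() bool { return m.canShift }

type seq struct {
	m       model
	cache   []int // cached KV entries (one per token, schematically)
	history []int // logical token history, preserved either way
}

// onContextFull applies the fallback described above when the context fills.
func (s *seq) onContextFull(numKeep int) {
	if s.m.KvCacheCanShift() {
		s.cache = s.cache[len(s.cache)-numKeep:] // shift: drop the oldest entries
		return
	}
	s.cache = nil // model can't shift: clear the cache entirely...
	// ...but keep s.history so the prompt can be re-evaluated from it.
}

func main() {
	s := &seq{m: model{canShift: false}, cache: []int{1, 2, 3}, history: []int{1, 2, 3}}
	s.onContextFull(2)
	fmt.Println(len(s.cache), len(s.history)) // 0 3
}
```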
This change adds tracking of download chunks during the pull process so
that subsequent pulls can skip downloading already completed chunks.
This works across restarts of ollama.
Currently, download state will be lost if a prune is triggered during a
pull (e.g. restart or remove). This issue should be addressed in a
follow-up PR.
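One plausible shape for the persisted state (hypothetical layout and names, not the actual implementation): record each completed chunk alongside the partial blob so a later pull can skip it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chunk records one completed byte range of a layer download; ranges like
// this can be written next to the partial blob so a restarted pull knows
// what it already has.
type chunk struct {
	Offset int64 `json:"offset"`
	Size   int64 `json:"size"`
}

type downloadState struct {
	Digest    string  `json:"digest"`
	Completed []chunk `json:"completed"`
}

func main() {
	st := downloadState{
		Digest:    "sha256:abc123",
		Completed: []chunk{{Offset: 0, Size: 1 << 20}},
	}
	b, _ := json.MarshalIndent(st, "", "  ")
	fmt.Println(string(b)) // persisted between pulls; completed chunks are skipped
}
```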
If we have an error after creating a new sequence but before
finding a slot for it, we return without releasing the semaphore.
This reduces our parallel sequences and eventually leads to deadlock.
In practice this should never happen because once we have acquired
the semaphore, we should always be able to find a slot. However, the
code is clearly not correct.
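The fix in miniature, with a toy channel-based semaphore: every return after acquiring either keeps the slot for the sequence's lifetime or releases it explicitly.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoSlot = errors.New("no slot available")

// sem is a counting semaphore limiting parallel sequences.
type sem chan struct{}

func (s sem) acquire() { s <- struct{}{} }
func (s sem) release() { <-s }

func findSlot() error { return errNoSlot }

// startSequence sketches the fix: every return after acquire either keeps
// the slot until the sequence finishes or releases it explicitly.
func startSequence(s sem) error {
	s.acquire()

	if err := findSlot(); err != nil {
		s.release() // previously this path returned without releasing
		return err
	}
	// On success the slot stays held until the sequence finishes elsewhere.
	return nil
}

func main() {
	s := make(sem, 2)
	fmt.Println(startSequence(s))
}
```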