ollama

mirror of https://github.com/ollama/ollama.git synced 2025-05-11 18:36:41 +02:00

Author	SHA1	Message	Date
Michael Yang	3b96a93672	fs: move ml.Config to fs package	2025-04-03 13:12:24 -07:00
Jesse Gross	01aa788722	ml: Remove Output from Context interface Model implementations should use Input for all of their tensors supplied to the model. This includes tensors that relate to the outputs, which is confusing since there is also an Output function. Since Output is only used internally in GGML and not used by any model implementations, we can remove it from the interface to reduce confusion.	2025-03-27 12:19:43 -07:00
Michael Yang	74bd09652d	ml/backend/ggml: load tensors in 32KiB chunks	2025-03-21 14:43:52 -07:00
Bruce MacDonald	df94175a0f	ggml: return error on failure to read tensor data (#9872) When converting a ggml model, if there is a failure to read tensor data, a nil error value was being returned. It should be assigned to the actual error from reading.	2025-03-18 16:51:33 -07:00
Michael Yang	021dcf089d	Merge pull request #9824 from ollama/mxyng/sched conditionally enable parallel pipelines	2025-03-17 15:41:37 -07:00
Jeffrey Morgan	364629b8d6	ml/backend/ggml: allocate memory with malloc when loading model (#9822)	2025-03-17 13:32:40 -07:00
Michael Yang	4561fff36e	conditionally enable parallel pipelines	2025-03-17 09:46:07 -07:00
Michael Yang	63a394068c	use 2d pooling	2025-03-11 14:49:20 -07:00
Michael Yang	c5cbe4fc2a	fallback to cpu	2025-03-11 14:49:19 -07:00
Michael Yang	9e4642e9b3	ollama debug tensor	2025-03-11 14:49:19 -07:00
Michael Yang	6b0486c216	duplicate token_embd to output	2025-03-11 14:49:19 -07:00
Michael Yang	8934324b72	use fast attention	2025-03-11 14:49:18 -07:00
Michael Yang	0df1800436	set non-causal attention	2025-03-11 14:49:18 -07:00
Michael Yang	4b037a97dc	add gemma vision encoder	2025-03-11 14:49:17 -07:00
Patrick Devine	5f74d1fd47	gemma2 impl	2025-03-11 14:35:08 -07:00
Jesse Gross	4100ed7bdd	ml: Add support for quantized KV cache Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.	2025-03-07 18:43:39 -08:00
Jesse Gross	25f9b152f9	ggml-backend: Ensure allocations meet backend requirements Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.	2025-03-07 18:43:39 -08:00
Jesse Gross	98272fbd58	additional review comments	2025-03-07 14:08:21 -08:00
Michael Yang	b27e8f3f10	ml/backend/ggml: use backend buffer type this ensures the tensor is created on the right buffer type for backends such as cpu	2025-03-07 14:08:21 -08:00
Michael Yang	45df786f09	comments	2025-03-07 14:08:21 -08:00
Michael Yang	daaf42e4a4	ml/backend/ggml: clean up	2025-03-07 14:08:21 -08:00
Michael Yang	2dc60d4620	ml/backend/ggml: offload vision to cpu temporary until tensor loading can accurately account for vision models	2025-03-07 14:08:21 -08:00
Michael Yang	b5312f30e8	ml/backend/ggml: handle tensor split	2025-03-07 14:08:21 -08:00
Michael Yang	26c2e0bd35	ml/backend/ggml: handle user specified cpu offloading	2025-03-07 14:08:21 -08:00
Michael Yang	bf920883d5	ml/backend/ggml: set cpu n_threads	2025-03-07 14:08:21 -08:00
Michael Yang	7bae7fa5ce	ml/backend/ggml: create tensor on specific backend some tensors should be created on specific backends to reduce number of copies and improve performance	2025-03-07 14:08:21 -08:00
Michael Yang	764e199d67	kvcache: create cache ctx per layer each cache layer creates and maintains its own context instead of using a large context for all layers	2025-03-07 14:08:21 -08:00
Michael Yang	bfce55db3d	model: load non-repeated tensors into multiple backends some tensors are expected to be used in repeating layers but are not themselves repeated. this change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends	2025-03-07 14:08:21 -08:00
Michael Yang	bab6f34dc0	ml/backend/ggml: update model loading for hybrid/multi backends use a similar strategy as llama.cpp for deciding where tensors should be allocated. this will be improved later to be aware of usable memory before assigning the tensor	2025-03-07 14:08:21 -08:00
Michael Yang	05a01fdecb	ml/backend/ggml: consolidate system info logging - output backend system info when initializing the backend. this ensures this information is always present without needing to be called explicitly - convert to structured logging - enumerate devices rather than backends since devices are ordered - track device indices grouped by device name	2025-03-04 15:14:31 -08:00
Jesse Gross	21aa666a1e	ml: Enable support for flash attention The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.	2025-03-01 20:53:23 -08:00
Jesse Gross	ee141cc821	ml: Empty tensor constructor for tensors In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.	2025-03-01 20:53:23 -08:00
Jesse Gross	55e5776c44	ggml-backend: Store parent backend as part of tensor It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.	2025-03-01 20:53:23 -08:00
Jesse Gross	854a9195f3	attention: Remove unnecessary contiguous operations Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%. The permutations of query and key do not violate the contiguity rules for mulmat and the Contiguous call can be simply removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead. To support this and avoid unexpected tensor shapes that are seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure - for example, flash attention has special padding requirements in the cache and other backends may have their own needs. This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.	2025-03-01 20:53:23 -08:00
Michael Yang	3e8b8a1933	ml: update Context.Forward interface update Context.Forward to accept multiple tensors to match Context.Compute signature update Context.Forward to return Context such that it can be chained with Context.Compute	2025-02-27 22:27:16 +00:00
Jesse Gross	f53f4198c3	ml: Abstract attention out of model definitions There are two benefits to doing this: - Provide a library function that models can use, reducing code for each model implementation - Enables a single place to drop in optimized implementations of attention based on the backend or other factors. One is provided for GGML. On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal. Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2025-02-21 13:16:21 -08:00
Michael Yang	2192a28eed	ml/backend/ggml: fix rms norm	2025-02-21 18:34:19 +00:00
Jesse Gross	e5bcc51ae1	ggml-backend: Don't recreate the scheduler for each context We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.	2025-02-20 14:49:47 -08:00
Jesse Gross	bd6a7d5e64	ollamarunner: Pass runner performance parameters to backends Currently the following parameters are in the runner but not used: - numGPULayers - mainGPU - threads - tensorSplit This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.	2025-02-20 13:27:57 -08:00
Daniel Hiltgen	df2680b4b9	Wire up system info log for new engine (#9123)	2025-02-14 15:55:33 -08:00
Jesse Gross	ed443a0393	Runner for Ollama engine This provides integration with the new Ollama engine (`5824541` next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models such as: - Parallel processing - Memory management for defragmentation and shifting - Multi-modal models Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine: Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1	2025-02-13 17:09:26 -08:00
Jesse Gross	d223f3b697	ggml-backend: Close on nil should be a no-op	2025-02-13 17:09:26 -08:00
Jesse Gross	60830695c2	ggml-backend: Ensure data is available after async computation We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.	2025-02-13 17:09:26 -08:00
Jesse Gross	01d9a46854	ggml-backend: Let GGML allocate context memory Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.	2025-02-13 17:09:26 -08:00
Jesse Gross	d773b7d671	backend: API to support full precision matmul Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.	2025-02-13 17:09:26 -08:00
Jesse Gross	4d4463b2bd	backend: Support graph computation that does not return an output There are two cases where we may not have an output after computing: - Prompt processing where the length of the input exceeds the batch size - Internal memory management operations such as cache defrag and shift	2025-02-13 17:09:26 -08:00
Jesse Gross	0e38297f87	backend: Consistently use int (vs. int64) for tensor shapes Currently there is a mixture of int and int64 used when dealing with tensor dimensions and shapes, which causes unnecessary conversions - they all should be the same type. In general, most interfaces (such as PyTorch) use int64 for generality but most implementations (such as CUDA) use int32 for performance. There isn't much benefit to us to being more flexible than the implementations we are likely to run on. In addition, as a practical matter, a model with a tensor with a single dimension larger than 32 bits is unlikely to run on a 32-bit machine.	2025-02-13 17:09:26 -08:00
Jesse Gross	7e13f568dc	backend: Don't return an error on Close It is not common to return errors with close/free operations - most people won't check it and even if they did there's probably not much they can do. It's better to not give implementations false expectations.	2025-02-13 17:09:26 -08:00
Michael Yang	58245413f4	next ollama runner (#7913) feat: add new Ollama engine using ggml through cgo This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this. - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go` - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go` - `ml.Tensor` defines the interface for a tensor and tensor operations. This is the first implementation of the new engine. Follow up PRs will implement more features: - non-greedy sampling (#8410) - integration with Ollama and KV caching (#8301) - more model support (#9080) with more coming soon Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2025-02-13 16:31:21 -08:00

49 commits