Commit graph

59 commits

Author SHA1 Message Date
Michael Yang
f0c66e6dea llama4 2025-04-25 16:59:20 -07:00
Michael Yang
40b8fdbdca arange 2025-04-18 11:45:44 -07:00
Jesse Gross
f50d691254 ggml: Fix memory leak on input tensors
For every forward pass through the model, we need to allocate input
tensors: tokens, images, positions, outputs and masks. These get
allocated in system memory.

However, when we close the context that the tensors were allocated
through, the metadata gets freed but the actual backend memory does
not. This results in a significant memory leak.

This makes it so that all the memory allocated through a context
gets freed when it is closed.

Fixes #10040
2025-04-11 11:13:22 -07:00
Jesse Gross
34c3b68fc8 ggml: Don't allocate CPU buffers as CUDA Host buffers
Allocating (and in particular, freeing) memory from CUDA host buffers
is expensive and can cause a significant performance hit if we do
it for every token. Using normal system memory avoids this issue
and also gives the OS more flexibility to manage it.

There is no performance impact from this patch directly (either
positive or negative) but it makes a difference once we start
freeing memory correctly.
2025-04-11 11:13:22 -07:00
Jesse Gross
f33ccd5d27 ggml: Use pointer receivers for Context
Context is currently mixed between pointer and value receivers. Change
this to be all pointer receivers so don't have to reason about whether
the things we are updating in the struct will be retained.
2025-04-11 11:13:22 -07:00
Jesse Gross
bc108b9ad6 ggml: Log filesystem errors
Sometimes loading the GGUF file fails with:
panic: context canceled

This is probably a filesystem error but it doesn't provide any
information about what happened.
2025-04-11 11:13:06 -07:00
Jesse Gross
dbb149e6f7 ollamarunner: Preallocate worst case graph at startup
Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.

This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.

Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.
2025-04-08 10:01:28 -07:00
Jesse Gross
a807985e59 ggml: Check for OOM and return as Go errors
If there is a CUDA OOM, we currently don't check the return value
and will evetually segfault. This checks for the problem and generates
a Go error. At the moment, this will still result in a panic but having
the error is the first step to being able to handle it more gracefully.
2025-04-08 10:01:28 -07:00
Daniel Hipke
0f3f9e353d
ml/backend/ggml: create a new file descriptor for tensor (#10133)
improves model loading times on network-based filesystems
such as GCS fuse by creating a dedicated file descriptor for each
section of the file being read, reducing seeking
2025-04-04 17:04:24 -07:00
Bruce MacDonald
6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
Michael Yang
3b96a93672 fs: move ml.Config to fs package 2025-04-03 13:12:24 -07:00
Jesse Gross
01aa788722 ml: Remove Output from Context interface
Model implementations should use Input for all of their tensors
supplied to the model. This includes tensors that relate to the
outputs, which is confusing since there is also an Output funciton.

Since Output is only used internally in GGML and not used by any
model implementations, we can remove it from the interface to
reduce confusion.
2025-03-27 12:19:43 -07:00
Michael Yang
74bd09652d ml/backend/ggml: load tensors in 32KiB chunks 2025-03-21 14:43:52 -07:00
Bruce MacDonald
df94175a0f
ggml: return error on failure to read tensor data (#9872)
When converting a ggml model if there is a failure to read tensor data a nil error value was being returned. It should be assigned to the actual error from reading.
2025-03-18 16:51:33 -07:00
Michael Yang
021dcf089d
Merge pull request #9824 from ollama/mxyng/sched
conditionally enable parallel pipelines
2025-03-17 15:41:37 -07:00
Jeffrey Morgan
364629b8d6
ml/backend/ggml: allocate memory with malloc when loading model (#9822) 2025-03-17 13:32:40 -07:00
Michael Yang
4561fff36e conditionally enable parallel pipelines 2025-03-17 09:46:07 -07:00
Michael Yang
63a394068c use 2d pooling 2025-03-11 14:49:20 -07:00
Michael Yang
c5cbe4fc2a fallback to cpu 2025-03-11 14:49:19 -07:00
Michael Yang
9e4642e9b3 ollama debug tensor 2025-03-11 14:49:19 -07:00
Michael Yang
6b0486c216 duplicate token_embd to output 2025-03-11 14:49:19 -07:00
Michael Yang
8934324b72 use fast attention 2025-03-11 14:49:18 -07:00
Michael Yang
0df1800436 set non-causal attention 2025-03-11 14:49:18 -07:00
Michael Yang
4b037a97dc add gemma vision encoder 2025-03-11 14:49:17 -07:00
Patrick Devine
5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Jesse Gross
4100ed7bdd ml: Add support for quantized KV cache
Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.
2025-03-07 18:43:39 -08:00
Jesse Gross
25f9b152f9 ggml-backend: Ensure allocation meet backend requirements
Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.
2025-03-07 18:43:39 -08:00
Jesse Gross
98272fbd58 additional review comments 2025-03-07 14:08:21 -08:00
Michael Yang
b27e8f3f10 ml/backend/ggml: use backend buffer type
this ensures the tensor is created on the right buffer type for backends
such as cpu
2025-03-07 14:08:21 -08:00
Michael Yang
45df786f09 comments 2025-03-07 14:08:21 -08:00
Michael Yang
daaf42e4a4 ml/backend/ggml: clean up 2025-03-07 14:08:21 -08:00
Michael Yang
2dc60d4620 ml/backend/ggml: offload vision to cpu
temporary until tensor loading can accurately account for vision models
2025-03-07 14:08:21 -08:00
Michael Yang
b5312f30e8 ml/backend/ggml: handle tensor split 2025-03-07 14:08:21 -08:00
Michael Yang
26c2e0bd35 ml/backend/ggml: handle user specified cpu offloading 2025-03-07 14:08:21 -08:00
Michael Yang
bf920883d5 ml/backend/ggml: set cpu n_threads 2025-03-07 14:08:21 -08:00
Michael Yang
7bae7fa5ce ml/backend/ggml: create tensor on specific backend
some tensors should be created on specific backends to reduce number of
copies and improve performance
2025-03-07 14:08:21 -08:00
Michael Yang
764e199d67 kvcache: create cache ctx per layer
each cache layer creates and maintains its own context instead of using
a large context for all layers
2025-03-07 14:08:21 -08:00
Michael Yang
bfce55db3d model: load non-repeated tensors into multiple backends
some tensors are expected to be used in repeating layers but are not
themselves repeated. this change copies these tensors into the same
backends as their repeating counterparts to minimize copying tensors
between backends
2025-03-07 14:08:21 -08:00
Michael Yang
bab6f34dc0 ml/backend/ggml: update model loading for hybrid/multi backends
use a similar strategy as llama.cpp for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor
2025-03-07 14:08:21 -08:00
Michael Yang
05a01fdecb ml/backend/ggml: consolidate system info logging
- output backend system info when initializing the backend. this ensures
  this information is always present without needing to be called
  explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
2025-03-04 15:14:31 -08:00
Jesse Gross
21aa666a1e ml: Enable support for flash attention
The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.

Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.
2025-03-01 20:53:23 -08:00
Jesse Gross
ee141cc821 ml: Empty tensor constructor for tensors
In cases where we allocate a tensor and then fully overwrite it with
copied data, it is wasteful to first zero out the memory.
2025-03-01 20:53:23 -08:00
Jesse Gross
55e5776c44 ggml-backend: Store parent backend as part of tensor
It can be important for a tensor to know what backend it came from -
for example, to know if flash attention is enabled.
2025-03-01 20:53:23 -08:00
Jesse Gross
854a9195f3 attention: Remove unnecessary contiguous operations
Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity
rules for mulmat and the Contiguous call can be simply removed.

Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.

To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure
 - for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.

This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
2025-03-01 20:53:23 -08:00
Michael Yang
3e8b8a1933 ml: update Context.Forward interface
update Context.Forward to accept multiple tensors to match
Context.Compute signature

update Context.Forward to return Context such that it can be chained
with Context.Compute
2025-02-27 22:27:16 +00:00
Jesse Gross
f53f4198c3 ml: Abstract attention out of model definitions
There are two benefits to doing this:
 - Provide a library function that models can use, reducing code for
   each model implementation
 - Enables a single place to drop in optimized implementations of
   attention based on the backend or other factors. One is provided for
   GGML.

On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-02-21 13:16:21 -08:00
Michael Yang
2192a28eed ml/backend/ggml: fix rms norm 2025-02-21 18:34:19 +00:00
Jesse Gross
e5bcc51ae1 ggml-backend: Don't recreate the scheduler for each context
We don't need to create and destroy the GGML scheduler for every
context. This introduces extra CPU overhead for every forward
pass and extra memory for contexts that don't actually get scheduled
(for example, KV caches). We can instead just have one scheduler
for the backend and reset it each time we call Compute.

This improves token generation performance by 1-2% and removes
scheduler create/destroy from profile traces.
2025-02-20 14:49:47 -08:00
Jesse Gross
bd6a7d5e64 ollamarunner: Pass runner performance parameters to backends
Currently the following parameters are in the runner but not used:
 - numGPULayers
 - mainGPU
 - threads
 - tensorSplit

This passes them through to the backend, which is where they would
actually get used. However, the GGML backend does not yet do anything
with them.
2025-02-20 13:27:57 -08:00
Daniel Hiltgen
df2680b4b9
Wire up system info log for new engine (#9123) 2025-02-14 15:55:33 -08:00