Commit graph

163 commits

Author SHA1 Message Date
Jeffrey Morgan
fa9973cd7f
api: remove unused sampling parameters (#10581) 2025-05-08 08:31:08 -07:00
Daniel Hiltgen
424810450f
Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
Jeffrey Morgan
3b2d2c8326
api: remove unused or unsupported api options (#10574)
Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first of a series of PRs to clean up the API options
2025-05-05 14:54:40 -07:00
Jeffrey Morgan
913905028b
all: fix cgo compiler warnings on windows (#10563) 2025-05-05 08:02:39 -07:00
Jesse Gross
c2f5d6662b ollamarunner: Re-enable worst case graph preallocation.
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
2025-05-02 12:22:47 -07:00
Jeffrey Morgan
8dd12c873d
llama: update to commit e1e8e099 (#10513) 2025-05-01 18:24:09 -07:00
Jeffrey Morgan
e9e5f61c45
llama: update to commit 2016f07b (#10352) 2025-04-24 17:26:02 -07:00
Parth Sareen
a53d744b01
llama: remove model loading for grammar (#10096) 2025-04-24 11:51:19 -07:00
Jeffrey Morgan
dc264be6ff
ml: add missing cmake property and remove additional CMakeLists.txt (#10310) 2025-04-16 18:56:29 -07:00
Jeffrey Morgan
943464ccb8
llama: update to commit 71e90e88 (#10192) 2025-04-16 15:14:01 -07:00
Jesse Gross
ccb7eb8135 ggml: Free ggml_backend_buffer_t when releasing buffer
When ggml_backend_buffer_free() is called, the device memory
is released but not all backends consistently release the actual
ggml_backend_buffer_t in system RAM, causing a memory leak.

Bug #10040
2025-04-15 15:29:58 -07:00
Bruce MacDonald
6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
2025-04-03 16:57:36 -07:00
Bruce MacDonald
66b2539238
runner: clear cache when shift is not possible (#9433)
Clear KV cache when shift operation is not supported by model.
Added KvCacheCanShift() check to handle models that can't perform cache shifts,
falling back to full cache clear while preserving logical token history to
maintain expected behavior when context window fills up.
2025-03-31 12:54:45 -07:00
saman-amd
ead27aa9fe
Add gfx1200 & gfx1201 support on linux (#9878) 2025-03-27 07:35:19 -07:00
Patrick Devine
ef378ad673
gemma3 quantization (#9776) 2025-03-14 17:41:07 -07:00
Michael Yang
9e4642e9b3 ollama debug tensor 2025-03-11 14:49:19 -07:00
Jeffrey Morgan
e093db92c4
sample: temporarily use grammars for constrained generation in new engine (#9586) 2025-03-10 16:17:39 +01:00
Jeffrey Morgan
4289c74359
llama: fix kv loading on snowflake-arctic-embed models (#9536) 2025-03-07 09:25:34 -08:00
Michael Yang
05a01fdecb ml/backend/ggml: consolidate system info logging
- output backend system info when initializing the backend. this ensures
  this information is always present without needing to be called
  explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
2025-03-04 15:14:31 -08:00
Michael Yang
ba7d31240e fix: own lib/ollama directory
expand backend loading error handling to catch more problems and log
them instead of panicing
2025-03-03 13:01:18 -08:00
Michael Yang
657685e85d fix: replace deprecated functions 2025-02-28 21:29:34 +00:00
Jeffrey Morgan
98d44fa39d
llama: add phi4 mini support (#9403) 2025-02-27 19:30:32 -08:00
Michael Yang
a59f665235 ml/backend/ggml: fix debug logging 2025-02-27 18:30:57 +00:00
Jeffrey Morgan
d7d7e99662
llama: update llama.cpp vendor code to commit d7cfe1ff (#9356) 2025-02-26 20:34:44 -08:00
Jeffrey Morgan
3ad4bc8afe
llama: removed unused 'vendoring' file (#9351) 2025-02-25 14:33:03 -08:00
Jeffrey Morgan
8c13cfa4dd
ml/backend/ggml: fix crash on windows paths with wide characters (#9305) 2025-02-23 19:13:53 -08:00
Michael Yang
bda4ef6c56 reorder patches 2025-02-20 03:49:24 +00:00
Michael Yang
1e438b237c
Merge pull request #9203 from ollama/mxyng/sapphirerapids
build: remove backend build for sapphirerapids
2025-02-19 21:42:00 +00:00
Jeffrey Morgan
d2eb226c91
llama: add patch to fix ggml backend reg on Linux with utf-8 characters in the path (#9159) 2025-02-18 22:46:17 -05:00
Michael Yang
5f8c03189e build: remove backend build for sapphirerapids
sapphire rapids has amx support but it ends up having a negative
performance impact.

emerald rapids also has amx support with a positive performance impact
however there's no reasonable way in ggml to differentiate between the
two. the impact is small (~6%) so disable amx entirely for simplicity
2025-02-18 14:47:58 -08:00
Jeffrey Morgan
6600bd7d91
ml/backend/ggml: stable sort devices by score (#9081) 2025-02-13 18:42:36 -08:00
Jesse Gross
ed443a0393 Runner for Ollama engine
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal modals

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
2025-02-13 17:09:26 -08:00
Michael Yang
49df03da9a
fix: harden backend loading (#9024)
* wrap ggml_backend_load_best in try/catch
* ignore non-ollama paths
2025-02-11 15:36:53 -08:00
Jeffrey Morgan
f4711da7bd
ml/backend/ggml: fix crash on dlopen for non-AVX systems (#8976) 2025-02-10 09:52:12 -08:00
Azis Alvriyanto
b901a712c6
docs: improve syntax highlighting in code blocks (#8854) 2025-02-07 09:55:07 -08:00
Diego Pereira
928911bc68
runner: avoid buffer overwrite when generating multiple embeddings (#8714)
Shield the code processing the embedding result
from subsequent calls that may overwrite the same
buffer to process a second input when retrieving
model embeddings.
2025-02-05 16:53:33 -08:00
Michael Yang
5b446cc815
chore: update gitattributes (#8860)
* chore: update gitattributes
* chore: add build info source
2025-02-05 16:37:18 -08:00
Jeffrey Morgan
cd3fbf1c49
llama: use dynamic backend loading for mllama and clip (#8835) 2025-02-05 09:46:56 -08:00
Michael Yang
548a9f56a6 Revert "cgo: use O3"
This reverts commit bea1f1fac6.
2025-01-31 10:25:39 -08:00
Michael Yang
bea1f1fac6 cgo: use O3 2025-01-30 12:21:50 -08:00
Michael Yang
dcfb7a105c
next build (#8539)
* add build to .dockerignore

* test: only build one arch

* add build to .gitignore

* fix ccache path

* filter amdgpu targets

* only filter if autodetecting

* Don't clobber gpu list for default runner

This ensures the GPU specific environment variables are set properly

* explicitly set CXX compiler for HIP

* Update build_windows.ps1

This isn't complete, but is close.  Dependencies are missing, and it only builds the "default" preset.

* build: add ollama subdir

* add .git to .dockerignore

* docs: update development.md

* update build_darwin.sh

* remove unused scripts

* llm: add cwd and build/lib/ollama to library paths

* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS

* add additional cmake output vars for msvc

* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12

* remove unncessary filepath.Dir, cleanup

* add hardware-specific directory to path

* use absolute server path

* build: linux arm

* cmake install targets

* remove unused files

* ml: visit each library path once

* build: skip cpu variants on arm

* build: install cpu targets

* build: fix workflow

* shorter names

* fix rocblas install

* docs: clean up development.md

* consistent build dir removal in development.md

* silence -Wimplicit-function-declaration build warnings in ggml-cpu

* update readme

* update development readme

* llm: update library lookup logic now that there is one runner (#8587)

* tweak development.md

* update docs

* add windows cuda/rocm tests

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-01-29 15:03:38 -08:00
Jeffrey Morgan
61676fb506
llama: move grammar tests to llama_test.go (#8411) 2025-01-14 12:55:45 -08:00
Jeffrey Morgan
1deafd8254
llama: update vendored code to commit 46e3556 (#8308) 2025-01-08 11:22:01 -08:00
Ubaldo Porcheddu
3919f4ba3d
llama: fix runner api example url in README.md (#8307) 2025-01-04 15:45:16 -08:00
Parth Sareen
290cf2040a
llama: test key order preservation in schema_to_grammar (#8078)
This change adds a test to catch a regression in schema_to_grammar where
the order of keys in the JSON schema is not preserved in the generated
grammar, which is critical for step-by-step reasoning.
2024-12-18 19:44:50 -08:00
Jesse Gross
08a832b482 llama: Ensure KV cache is fully defragmented.
Sometimes the KV cache requires defragmentation even without
triggering the threshold heuristic. In this case, decoding
will not being able to find a KV cache slot. This is particularly
difficult for the caller to handle if it happens in between
ubatches. To avoid this, we should immediately trigger a defrag.

In addition, a heavily fragmented cache can require more than
max_moves to defragment. Currently, we stop when we hit the limit
but this can leave a cache that still does not have adequate space
even after defragmentation is triggered. Instead, we should do
multiple batches of processing until everything is complete.

Fixes #7949
2024-12-17 14:01:19 -08:00
Jeffrey Morgan
7a81daf026
llama: update vendor code to commit ba1cb19c (#8101) 2024-12-14 14:55:51 -08:00
Daniel Hiltgen
60f75560a2
runner: switch logging back to stderr (#8091)
This puts the low-level runner logging back on stderr for consistency with prior releases
2024-12-13 14:36:50 -08:00
Pascal Patry
c216850523
llama: parse JSON schema using nlohmann::ordered_json to maintain ordering (#8071) 2024-12-12 09:57:28 -08:00
Parth Sareen
18f6a98bd6
llama: enable JSON schema key ordering for generating grammars (#8055) 2024-12-11 17:17:36 -08:00