Sometimes the KV cache requires defragmentation even without triggering the threshold heuristic. In this case, decoding will not be able to find a KV cache slot. This is particularly difficult for the caller to handle if it happens in between ubatches. To avoid this, we should immediately trigger a defrag. In addition, a heavily fragmented cache can require more than max_moves to defragment. Currently, we stop when we hit that limit, which can leave a cache that still lacks adequate space even after defragmentation has been triggered. Instead, we should run multiple batches of defragmentation until everything is complete. Fixes #7949
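The sketch below illustrates the idea described above: instead of a single defrag pass capped at max_moves, keep running bounded passes until a slot is available or the cache is fully compacted. This is a minimal C++ sketch, not the actual patch; kv_cache, find_slot, defrag_prepare, and defrag_apply are hypothetical names standing in for the real llama.cpp internals.

```cpp
// Hypothetical interfaces (assumed for illustration only).
struct kv_cache;                                        // opaque KV cache state

bool find_slot(kv_cache & cache, int n_tokens);         // true if a contiguous slot for n_tokens exists
int  defrag_prepare(kv_cache & cache, int max_moves);   // plans up to max_moves cell moves, returns count planned
void defrag_apply(kv_cache & cache);                    // executes the planned moves

// Ensure there is room for n_tokens, defragmenting as many times as needed
// rather than giving up after a single pass limited by max_moves.
bool ensure_slot(kv_cache & cache, int n_tokens, int max_moves) {
    while (!find_slot(cache, n_tokens)) {
        // A heavily fragmented cache may need more than max_moves moves,
        // so run repeated bounded passes until no moves remain.
        const int planned = defrag_prepare(cache, max_moves);
        if (planned == 0) {
            // Fully compacted but still no room for the request.
            return false;
        }
        defrag_apply(cache);
    }
    return true;
}
```

The key design point is the loop: each pass stays bounded by max_moves, but the caller is never left with a partially defragmented cache that still cannot satisfy the decode request.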
- 0001-cuda.patch
- 0002-pretokenizer.patch
- 0003-embeddings.patch
- 0004-clip-unicode.patch
- 0005-solar-pro.patch
- 0006-conditional-fattn.patch
- 0007-blas.patch
- 0008-add-mllama-support.patch
- 0009-add-unpad-operator.patch
- 0010-fix-deepseek-deseret-regex.patch
- 0011-relative-include-paths.patch
- 0012-Maintain-ordering-for-rules-for-grammar.patch
- 0013-fix-missing-arg-in-static-assert-on-windows.patch
- 0014-llama-Ensure-KV-cache-is-fully-defragmented.patch