ollama/kvcache
Jesse Gross 21aa666a1e ml: Enable support for flash attention
The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache so
that it conforms to those requirements, allowing flash attention
to be enabled.

Flash attention can be used in the same situations as with the
llama engine and is enabled by the user in the same way.
2025-03-01 20:53:23 -08:00
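
As a rough illustration of the padding requirement mentioned in the commit message, the sketch below rounds the KV length presented to the attention kernel up to a fixed multiple when flash attention is enabled. The helper names and the 256-cell granularity are assumptions for illustration, not ollama's actual API; the real logic lives in causal.go.

```go
package main

import "fmt"

// Minimal sketch of cache-length padding for flash attention.
// The padding granularity (256) and these helper names are
// illustrative assumptions, not ollama's actual implementation.

// roundUp rounds n up to the next multiple of pad.
func roundUp(n, pad int) int {
	return ((n + pad - 1) / pad) * pad
}

// effectiveCacheLen returns the KV length exposed to the attention
// kernel: padded when flash attention is enabled, exact otherwise.
func effectiveCacheLen(usedCells int, flashAttention bool) int {
	if flashAttention {
		// Flash attention kernels typically require the KV length
		// (and the attention mask) to be padded to a fixed multiple.
		return roundUp(usedCells, 256)
	}
	return usedCells
}

func main() {
	fmt.Println(effectiveCacheLen(1000, false)) // 1000
	fmt.Println(effectiveCacheLen(1000, true))  // 1024
}
```

Padding like this lets the kernel walk the KV cache in fixed-size blocks; positions beyond the cells actually in use are excluded via the attention mask rather than attended to.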
cache.go attention: Remove unnecessary contiguous operations 2025-03-01 20:53:23 -08:00
causal.go ml: Enable support for flash attention 2025-03-01 20:53:23 -08:00
causal_test.go ml: Empty tensor constructor for tensors 2025-03-01 20:53:23 -08:00
encoder.go ml: Empty tensor constructor for tensors 2025-03-01 20:53:23 -08:00
wrapper.go attention: Remove unnecessary contiguous operations 2025-03-01 20:53:23 -08:00