ml: Enable support for flash attention

The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention is available in the same situations as with the llama engine, and users enable it in the same way.
2025-05-11 18:36:41 +02:00 · 2025-02-25 17:24:36 -08:00
commit 21aa666a1e
parent ee141cc821
4 changed files with 73 additions and 21 deletions
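
To make the batch padding requirement from the commit message concrete, here is a minimal, standalone sketch of a causal mask whose batch dimension is rounded up to a multiple, with every cell that does not correspond to an actual token set to -Inf. The helpers `padTo` and `buildCausalMask`, and the plain `[][]float32` layout, are hypothetical and chosen for illustration only. This is not the code from this commit, and the real cache works on backend tensors rather than Go slices.

```go
package main

import (
	"fmt"
	"math"
)

// padTo rounds n up to the nearest multiple of multiple.
// A multiple of 0 or 1 leaves n unchanged (hypothetical helper).
func padTo(n, multiple int) int {
	if multiple <= 1 {
		return n
	}
	return ((n + multiple - 1) / multiple) * multiple
}

// buildCausalMask returns a mask with padTo(len(positions), batchPadding)
// rows and cacheLen columns. positions[i] is the sequence position of the
// i-th token in the batch. Cell j of row i is 0 when cache entry j is
// visible to token i (j <= positions[i]) and -Inf otherwise. Rows past
// len(positions) are padding and are filled entirely with -Inf.
func buildCausalMask(positions []int, cacheLen, batchPadding int) [][]float32 {
	negInf := float32(math.Inf(-1))
	rows := padTo(len(positions), batchPadding)

	mask := make([][]float32, rows)
	for i := range mask {
		mask[i] = make([]float32, cacheLen)
		for j := range mask[i] {
			if i >= len(positions) || j > positions[i] {
				mask[i][j] = negInf
			}
		}
	}
	return mask
}

func main() {
	// Three real tokens, padded to a batch multiple of 4 (an assumed value).
	mask := buildCausalMask([]int{0, 1, 2}, 4, 4)
	for _, row := range mask {
		fmt.Println(row)
	}
}
```

With three tokens and a multiple of 4, the mask gains one extra row that masks out every cache entry, so a kernel that expects a padded batch never attends from a position that has no token.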
--- a/ml/backend.go
+++ b/ml/backend.go
@@ -46,6 +46,14 @@ type CacheConfig struct {
 	// and return the permuted version via Get. This uses the cache copy operation
 	// to avoid a Contiguous call on the permuted tensor.
 	PermutedV bool
+
+	// MaskDType specifies the data type for generating the mask. If unset it will
+	// default to DTypeF32.
+	MaskDType DType
+
+	// MaskBatchPadding specifies the multiple for the batch size dimension in the mask.
+	// Any position that does not correspond to an actual token will be filled with -Inf.
+	MaskBatchPadding int
 }

 // BackendParams controls how the backend loads and executes models
@@ -61,6 +69,9 @@ type BackendParams struct {

 	// TensorSplit is the fraction of the model to offload to each GPU
 	TensorSplit []float32
+
+	// FlashAttention indicates that we should use a fused flash attention kernel
+	FlashAttention bool
 }

 var backends = make(map[string]func(*os.File, BackendParams) (Backend, error))