ollama/model
Jesse Gross 5c5535c064 models: Prune unused outputs earlier in the forward pass
Currently Rows is called as the last step in a model computation
to get the values for the output tokens. However, if we move it
earlier in the process then we can trim out computations that
never get used. This is similar to how models are defined in
llama.cpp.

Changing the model definition in this way improves token generation
performance by approximately 8%.
2025-02-20 14:49:47 -08:00
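
The effect described in the commit message can be sketched with a toy example. This is not the code from the commit: the `matrix` type and the `rows`, `project`, `lateGather`, and `earlyGather` helpers are hypothetical stand-ins for the real tensor operations. It also assumes that every step after the gather works row by row (here, a single output projection). Under that assumption, selecting the output rows before the projection gives the same values while doing fewer multiply-adds.

```go
package main

import "fmt"

// matrix is a toy row-major dense matrix standing in for a real tensor type.
type matrix [][]float32

// rows gathers the rows of m listed in idx (hypothetical analogue of a Rows op).
func rows(m matrix, idx []int) matrix {
	out := make(matrix, len(idx))
	for i, r := range idx {
		out[i] = m[r]
	}
	return out
}

// project computes x * w and adds the number of multiply-adds to *ops.
func project(x, w matrix, ops *int) matrix {
	out := make(matrix, len(x))
	for i := range x {
		out[i] = make([]float32, len(w[0]))
		for j := range w[0] {
			for k := range w {
				out[i][j] += x[i][k] * w[k][j]
				*ops++
			}
		}
	}
	return out
}

// lateGather projects every token position, then keeps only the output rows.
func lateGather(hidden, w matrix, outputs []int) (matrix, int) {
	ops := 0
	full := project(hidden, w, &ops)
	return rows(full, outputs), ops
}

// earlyGather keeps only the output rows first, then projects just those.
func earlyGather(hidden, w matrix, outputs []int) (matrix, int) {
	ops := 0
	pruned := rows(hidden, outputs)
	result := project(pruned, w, &ops)
	return result, ops
}

func main() {
	// Four token positions with a hidden size of 2; only the last one is an output.
	hidden := matrix{{1, 2}, {3, 4}, {5, 6}, {7, 8}}
	w := matrix{{1, 0, 1}, {0, 1, 1}}
	outputs := []int{3}

	late, lateOps := lateGather(hidden, w, outputs)
	early, earlyOps := earlyGather(hidden, w, outputs)
	fmt.Println("late: ", late, "multiply-adds:", lateOps)
	fmt.Println("early:", early, "multiply-adds:", earlyOps)
}
```

In this toy setup the late gather performs 24 multiply-adds and the early gather 6, with identical results. The saving in a real model depends on which layers follow the gather, so the numbers here say nothing about the 8% figure reported above.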
Name                  Last commit                                                   Date
imageproc             imageproc mllama refactor (#7537)                             2024-12-14 19:50:15 -08:00
models                models: Prune unused outputs earlier in the forward pass      2025-02-20 14:49:47 -08:00
testdata              next ollama runner (#7913)                                    2025-02-13 16:31:21 -08:00
model.go              ollamarunner: Pass runner performance parameters to backends  2025-02-20 13:27:57 -08:00
model_test.go         Runner for Ollama engine                                      2025-02-13 17:09:26 -08:00
process_text.go       vocab: Use int32 for special tokens                           2025-02-13 17:09:26 -08:00
process_text_test.go  next ollama runner (#7913)                                    2025-02-13 16:31:21 -08:00