
runner

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
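The same request can also be made from Go. The sketch below mirrors the curl example above, assuming the runner is listening on localhost:8080 as in that example; since the response schema is not documented here, it simply prints the raw response body.

// Minimal sketch of a Go client for the completion endpoint, mirroring the
// curl example above. The address and JSON body are taken from that example;
// the response is printed as-is because its schema is not described here.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Build the same request body as the curl example.
	body, err := json.Marshal(map[string]string{"prompt": "hi"})
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.Post("http://localhost:8080/completion", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Print the raw response body without assuming a particular schema.
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}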

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding
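A corresponding Go sketch for the embedding endpoint is below. It follows the curl example above and decodes the response into a generic map rather than assuming specific field names, since the response schema is not documented here.

// Minimal sketch of a Go client for the embedding endpoint, mirroring the
// curl example above. The response is decoded into a generic map so that no
// particular field names are assumed.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Build the same request body as the curl example.
	body, err := json.Marshal(map[string]string{"prompt": "turn me into an embedding"})
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.Post("http://localhost:8080/embedding", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode into a generic map to avoid assumptions about the response schema.
	var result map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		log.Fatal(err)
	}
	fmt.Println(result)
}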