# Benchmark

Go benchmark tests that measure end-to-end performance of a running Ollama server. Run these tests to evaluate model inference performance on your hardware and measure the impact of code changes.

## When to use

Run these benchmarks when:

- Making changes to the model inference engine
- Modifying model loading/unloading logic
- Changing prompt processing or token generation code
- Implementing a new model architecture
- Testing performance across different hardware setups

## Prerequisites

- Ollama server running locally with `ollama serve` on `127.0.0.1:11434`

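To confirm the server is reachable before benchmarking, a quick check against the root endpoint can help. This assumes `ollama serve` is already running locally; the command is only a sanity check, not part of the benchmark itself:

```bash
# Expect an HTTP 200 status code if the Ollama server is up.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:11434
```
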
## Usage and Examples

> [!NOTE]
> All commands must be run from the root directory of the Ollama project.

Basic syntax:

```bash
go test -bench=. ./benchmark/... -m $MODEL_NAME
```

Required flags:

- `-bench=.`: Run all benchmarks
- `-m`: Model name to benchmark

Optional flags:

- `-count N`: Number of times to run the benchmark (useful for statistical analysis)
- `-timeout T`: Maximum time for the benchmark to run (e.g. `10m` for 10 minutes)

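For instance, the optional flags can be combined with a model flag to collect more stable numbers over repeated runs (`llama3.3` here is just a placeholder model name; any locally available model works):

```bash
# Run the full suite 5 times with a 10-minute ceiling per run.
go test -bench=. ./benchmark/... -m llama3.3 -count 5 -timeout 10m
```
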
Common usage patterns:

Single benchmark run with a model specified:

```bash
go test -bench=. ./benchmark/... -m llama3.3
```

## Output metrics

The benchmark reports several key metrics:

- `gen_tok/s`: Generated tokens per second
- `prompt_tok/s`: Prompt processing tokens per second
- `ttft_ms`: Time to first token in milliseconds
- `load_ms`: Model load time in milliseconds
- `gen_tokens`: Total tokens generated
- `prompt_tokens`: Total prompt tokens processed

Each benchmark runs two scenarios:

- Cold start: Model is loaded from disk for each test
- Warm start: Model is pre-loaded in memory

Three prompt lengths are tested for each scenario:

- Short prompt (100 tokens)
- Medium prompt (500 tokens)
- Long prompt (1000 tokens)
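The two scenarios crossed with the three prompt lengths form a small matrix of benchmark cases. A minimal sketch of enumerating that matrix (the case names below are illustrative, not the actual sub-benchmark names used by the suite):

```go
package main

import "fmt"

func main() {
	scenarios := []string{"cold", "warm"}
	promptLengths := []int{100, 500, 1000}

	// Every scenario/length pair is one benchmark case: 2 x 3 = 6 runs.
	for _, scenario := range scenarios {
		for _, length := range promptLengths {
			fmt.Printf("%s/prompt=%d\n", scenario, length)
		}
	}
}
```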