@@ -36,12 +36,13 @@ All official benchmarks run on a single machine:
 
 ### Multi-Model Comparison: Zerfoo vs Ollama (2026-03-25)
 
-Head-to-head decode throughput on DGX Spark GB10. 128 tokens, 3 runs (median),
-greedy sampling (temp=0), commit `294aa43` (v1.19.0), Ollama v0.17.7.
+Head-to-head decode throughput on DGX Spark GB10. 128 tokens (except where
+noted), 3 runs (median), greedy sampling (temp=0), commit `294aa43` (v1.19.0),
+Ollama v0.17.7.
 
 | Model | Architecture | Size | Zerfoo (tok/s) | Ollama (tok/s) | Ratio | Winner |
 |-------|--------------|------|----------------|----------------|-------|--------|
-| Gemma 3 1B Q4_K_M | gemma3 | 1B | **236.38** | 204.37 | **1.16x** | Zerfoo |
+| Gemma 3 1B Q4_K_M | gemma3 | 1B | **241** (256 tok) | 201 (256 tok) | **1.20x** | Zerfoo |
 | DeepSeek R1 1.5B Q4_K_M | deepseek2 | 1.5B | **192.83** | 184.75 | **1.04x** | Zerfoo |
 | Llama 3.2 3B Q4_K_M | llama | 3B | 96.06 | 97.66 | 0.98x | ~Even |
 | Mistral 7B Q4_K_M | mistral | 7B | 11.61 | 46.77 | 0.25x | Ollama |
@@ -105,16 +106,16 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottleneck
 
 | Framework | Version | Tokens | Tok/s (decode) | CUDA Graphs | Notes |
 |-----------|---------|--------|----------------|-------------|-------|
-| **Zerfoo** | v1.19.0 | 128 | **236.38** | Yes | Multi-model benchmark (2026-03-25) |
+| **Zerfoo** | v1.19.0 | 256 | **241** | Yes | Multi-model benchmark (2026-03-25) |
 | **Zerfoo** | v0.x | 256 | **244.45** | Yes | Single-model baseline (2026-03-20) |
 | **Zerfoo** | v0.x | 256 | 174.44 | No | Without CUDA graph capture |
 | **Ollama** | 0.17.7 | 128 | 204.37 | N/A | Multi-model benchmark (2026-03-25) |
 | **llama.cpp** | b5220+ | 256 | ~210-230 | No | Estimated from community reports on GB10-class hardware |
 
 **Summary:**
 
-- Zerfoo with CUDA graphs: **236 tok/s** (+16% vs Ollama, ~5-10% vs llama.cpp)
-- Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +36%)
+- Zerfoo with CUDA graphs: **241 tok/s** (+20% vs Ollama, ~5-15% vs llama.cpp)
+- Zerfoo without CUDA graphs: **174 tok/s** (CUDA graph capture adds +38%)
 - Ollama: **204 tok/s** (uses llama.cpp under the hood with its own overhead)
 
 > **Note on llama.cpp numbers:** Direct llama.cpp measurements on this exact
@@ -149,7 +150,7 @@ dp4a benefits will appear at larger batch sizes where compute becomes the bottleneck
 
 | GPU | Zerfoo (est.) | Notes |
 |-----|---------------|-------|
-| DGX Spark GB10 | 236 tok/s | Measured (Gemma 3 1B, 2026-03-25) |
+| DGX Spark GB10 | 241 tok/s | Measured (Gemma 3 1B, 2026-03-25) |
 | RTX 4090 | TBD | Community contributions welcome |
 | RTX 3090 | TBD | Community contributions welcome |
 | A100 80GB | TBD | Community contributions welcome |
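For anyone checking the arithmetic in this diff, the updated Ratio and speedup figures follow directly from the published medians. A minimal sketch of the methodology (median of 3 runs, Zerfoo/Ollama ratio); only the 241, 201, and 174.44 tok/s medians come from the tables above, and the per-run timings are illustrative placeholders, not the real benchmark logs:

```python
from statistics import median

TOKENS = 256  # decode length used for the Gemma 3 1B rows

# Each framework is timed over 3 runs and the median throughput is kept.
# These seconds-per-decode values are made up for illustration only.
zerfoo_run_seconds = [1.062, 1.058, 1.071]
zerfoo_tok_s = median(TOKENS / s for s in zerfoo_run_seconds)

# Ratio column: Zerfoo median / Ollama median (published 241 vs 201 tok/s).
ratio = 241 / 201
print(f"Ratio: {ratio:.2f}x")  # Ratio: 1.20x

# CUDA graph speedup quoted in the summary: 241 vs 174.44 tok/s.
graph_gain = 241 / 174.44 - 1
print(f"CUDA graphs: +{graph_gain:.0%}")  # CUDA graphs: +38%
```

The same arithmetic reproduces the +20% vs Ollama figure in the summary bullets (241 / 201 ≈ 1.20).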