The fastest, most efficient library for running GGUF models with maximum throughput and zero-config hardware optimization.
model-accelerator isn't just another wrapper. It’s a high-performance inference engine designed to squeeze every drop of power out of your hardware. By talking directly to GGUF metadata, it configures itself perfectly for every model you throw at it.
Get up and running in seconds. Optimized for uv and high-performance inference.
# 1. Setup Environment
uv sync && source .venv/bin/activate
# Optional: Enable GPU/CUDA support
# Using Release build type is the standard way to get -O3 optimizations automatically
CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=on" uv pip install llama-cpp-python --reinstall --no-binary llama-cpp-python
# 2. Load & Register a Model
# Pulls directly from HF, auto-selecting the best quantization (Q4_K_M/Q5_K_M).
ma load unsloth/Qwen3.5-9B-GGUF --alias qwen3_5-9b
# 3. Configure (Optional: Set your hardware's "sweet spot")
ma config qwen3_5-9b --n-ctx 8192 --max-output auto --temp 0.7
# 4. Interact!
ma chat qwen3_5-9b --gpu-layers maxFor a deep dive into every possible flag and advanced usage examples, check out our Full Command Guide (RUN.md) ↗.
| Command | Usage Example | Description |
|---|---|---|
| load | load <src> --alias <name> |
Download from HF/URL or register a local path. |
| chat | chat <alias> |
Interactive REPL. Uses native GGUF Jinja2 templates. |
| run | run <alias> --prompt "..." |
Single-prompt generation with real-time streaming. |
| batch | batch <alias> --file <f> |
High-throughput C-level parallel processing. |
| config | config <alias> [flags] |
View/Set persistent sampling & resource limits. |
| bench | bench <alias> |
Benchmark throughput (tokens/sec) for peak speed. |
| list | list |
Show all registered models and disk sizes. |
We built this engine for developers who want precision and feedback. No more "guessing" if your settings are working:
- Bulletproof Self-Healing: Using
--gpu-layers max, the engine reads the model's exact internal layer count. If you run out of VRAM, it automatically steps down layer-by-layer and applies a Compute Margin to ensure you have enough memory left for the actual math—preventing those annoying CUDA crashes. - Native GGUF Intelligence: It extracts Chat Templates, EOG (End-of-Generation) tokens, and Context Limits directly from the model file. It knows exactly how to talk to your model without you needing to tell it.
- Rich Terminal UI: Beautiful, clean output using
rich. You get real-time feedback on parameters, token speed, and wall-clock time for every single generation. - C-Level Parallelism: The
batchcommand bypasses Python's bottlenecks to run multiple sequences at once using nativellama.cppbindings for massive throughput.
You can add these flags to any command or save them permanently via config.
--gpu-layers max: (Recommended) Automatically finds the absolute maximum layers your GPU can handle with a safety buffer.--n-ctx <N>: Set your context window (e.g.,8192or32768).--n-threads <N>: CPU threads. We default to your physical core count (max 8) for the best balance.
--chat: Always use this for Instruct/Chat models! It applies the model's specific "brain" format to your prompt.--temp <F>: Controls randomness (0.0 = deterministic, 0.7 = standard, 1.2 = creative).--think: Enables thinking mode for models that support it. The model will output its reasoning in<think>tags before the final answer. Silently ignored for models without thinking capability.
model-accelerator/
├── engine.py # The Brain: Dataclass-driven, handles GGUF metadata & self-healing
├── cli.py # The Interface: Rich-powered CLI for interaction
├── pyproject.toml # Dependencies (rich, llama-cpp-python, jinja2)
├── models.json # The Registry: Maps aliases to your local storage
└── models/ # Storage: Houses your GGUF files and per-model params.json
- Stop Typing Flags: Use
ma config <alias> --gpu-layers max --n-ctx 4096. These settings are saved toparams.jsonand inherited automatically byrun,chat, andbatch. - Environment Variables: You can set
export GGUF_MODELS_DIR=/path/to/large/driveto keep your models on secondary storage. - Bulk Processing: Need to categorize 1,000 strings?
ma batch my-model --file prompts.txt --parallel 20will process them in parallel chunks.