Generated with `ollama run x/flux2-klein:9b "a cartoon of a cute fluffy bunny sitting close to a beautiful pirate hat"`
Fine-tune a small LLM to always speak like a pirate captain with a bunny crew, regardless of topic. The pirate-bunny persona is baked into the model weights via LoRA SFT — no system prompt needed at inference time.
The goal is persona injection: make a base model always respond in a specific style (heavy pirate dialect + bunny references) while preserving its knowledge and reasoning capabilities. We use LoRA Supervised Fine-Tuning (SFT) — the standard approach for teaching a model "how to talk" rather than "what to know."
The key design decision is no system prompt in training data or at inference. Instead of relying on a system prompt to instruct the model to act like a pirate, we train the persona directly into the weights. This means the model defaults to pirate-bunny behavior on any user message.
- Questions — sampled from Dolly-15k, stratified across 4 categories (open_qa, general_qa, brainstorming, creative_writing). This gives diverse, realistic user questions without needing to generate them synthetically.
- Answers — generated by `gpt-oss:120b-cloud` via Ollama cloud, prompted to respond in character as Captain Flopsy with detailed pirate dialect and bunny-reference rules. A post-processing step strips any leaked chain-of-thought reasoning (`<think>` tags or unprompted preamble).
- Format — each example is a two-turn ChatML conversation (user question → assistant answer), stored as JSONL. No system message is included. Split 75/25 into training and validation sets.
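To make the format concrete, here is a minimal sketch of how one JSONL record could be assembled. The exact question/answer strings are invented for illustration; the `messages` key matches the chat-style JSONL that `mlx_lm.lora` accepts, but check your mlx_lm version's expected schema.

```python
import json

def make_example(question: str, answer: str) -> str:
    """Build one training record: a two-turn conversation, no system message."""
    record = {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record, ensure_ascii=False)

# One line of train.jsonl (contents are hypothetical)
line = make_example(
    "What is gradient descent?",
    "Arr, me fluffy crew! Gradient descent be how we hop downhill on the loss...",
)
print(line)
```

Note the deliberate absence of a `system` role: the persona is meant to come entirely from the fine-tuned weights.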
LoRA fine-tuning via MLX (mlx_lm.lora), running natively on Apple Silicon. The adapter modifies a small subset of the model's weights (0.365% of parameters) to learn the pirate-bunny style.
`mask_prompt: true` ensures the loss is computed only on the assistant's response tokens, not the user's question — so the model learns how to respond, not how to parrot questions.
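The effect of prompt masking can be sketched in a few lines: build a 0/1 mask over the token sequence and average the per-token loss only where the mask is 1. The token counts and loss values below are made up for illustration; mlx_lm implements this internally when `mask_prompt` is enabled.

```python
def loss_mask(prompt_len: int, total_len: int) -> list[int]:
    """1 where a token contributes to the loss (assistant response), 0 elsewhere."""
    return [0] * prompt_len + [1] * (total_len - prompt_len)

# Hypothetical sequence: 6 prompt tokens (user turn + template), 4 response tokens
mask = loss_mask(6, 10)
per_token_loss = [0.9, 1.1, 0.7, 2.0, 1.5, 0.8, 0.4, 0.6, 0.5, 0.3]

# Masked mean: only the 4 response-token losses count
masked = sum(l * m for l, m in zip(per_token_loss, mask)) / sum(mask)
print(masked)  # 0.45
```

Without the mask, the large prompt-token losses would dominate and the model would spend capacity modeling user questions instead of pirate-bunny answers.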
The trained LoRA adapter is fused back into the base model, converted to GGUF format via llama.cpp, and served locally through Ollama with a ChatML template.
- Machine: MacBook Pro, Apple M4 Max, 128GB unified memory
- Training memory: ~11 GB peak
- Training throughput: ~650 tokens/sec
- Training time: ~5 minutes for 1125 iterations (500 examples, ~3 epochs)
Based on the Unsloth LoRA Hyperparameters Guide and Unsloth Datasets Guide:
| Parameter | Unsloth recommends | Our config | Notes |
|---|---|---|---|
| Base model | Instruct variant for <300 examples | Qwen3-4B-Instruct (bf16) | Model selection guide |
| Dataset size | 100 minimum, 1000+ optimal | 500 (375 train / 125 valid) | |
| Epochs | 1–3 | ~3 (1125 iters / 375 examples) | More than 3 risks overfitting |
| Learning rate | 2e-4 for LoRA, 5e-6 for RL methods (DPO/GRPO) | 2e-4 | |
| LoRA rank | 16 or 32 | 16 | Higher = more capacity, more memory |
| LoRA scale | >= 1 (alpha = rank or 2x rank) | 1.0 | Controls adapter strength |
| LoRA layers | — | 16 | |
| Batch size | 2 (with grad accum 8 = effective 16) | 1 | Limited by dataset size |
| Dropout | 0.0–0.1 | 0.0 | 0.1 if overfitting |
| mask_prompt | — | true | Loss only on assistant tokens |
- uv — Python package manager
- Ollama — for inference and cloud model access (data generation)
- llama.cpp — cloned to `~/Developer/llama.cpp` for GGUF conversion
agent_pirate_bunny/
├── main.py # CLI orchestrator (generate/train/convert/all)
├── generate_dataset.py # Dataset generation (Dolly questions + Ollama cloud answers)
├── train.py # Training wrapper (calls mlx_lm.lora)
├── convert.py # Fuse → GGUF → Ollama deployment
├── config/
│ ├── prompts.py # Response generation prompt (pirate-bunny rules)
│ └── lora_config.yaml # LoRA hyperparameters
├── data/
│ ├── train.jsonl # Training data (375 examples)
│ └── valid.jsonl # Validation data (125 examples)
├── adapters/ # LoRA adapter checkpoints (from training)
├── fused_model/ # Fused model output (from convert)
├── Modelfile # Ollama model definition (ChatML template, no system prompt)
└── results.md # Training curves, sample outputs, checkpoint comparison
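The Modelfile in the tree above might look roughly like the following sketch. The `FROM` path and GGUF filename are assumptions; the key points are the ChatML template and the absence of a `SYSTEM` directive, since the persona lives in the weights.

```
FROM ./fused_model/pirate-bunny.gguf

# ChatML template, no SYSTEM line: the model defaults to Captain Flopsy
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_end|>"
```

Registering it with `ollama create pirate-bunny -f Modelfile` makes the model available to `ollama run`.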
`uv run python main.py all` runs generate → (pause for data review) → train → convert.
# 1. Generate training data (Dolly questions + Ollama cloud pirate-bunny answers)
uv run python main.py generate
# 2. Train LoRA adapter
uv run python main.py train
# 3. Convert and deploy (final checkpoint)
uv run python main.py convert
# 3b. Convert a specific checkpoint (e.g. iter 400, best val loss)
uv run python main.py convert --checkpoint 400

ollama run pirate-bunny "teach me about gradient descent"
ollama run pirate-bunny "write the python code to solve a quadratic equation"
ollama run pirate-bunny "What's the recipe for good spaghetti"

See results.md for full training curves, sample outputs from both the final and best-validation checkpoints, and run comparisons.
- Val loss bottomed out at iter 400 (1.403); slight overfitting in epochs 2–3
- Both checkpoints produce heavy pirate dialect with dense bunny references
- Code generation is correct and runnable with pirate-bunny variable names
- Iter 400 checkpoint produces more concise outputs
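Picking the best checkpoint amounts to parsing the validation-loss lines out of the training log and taking the minimum. The log excerpt below is hypothetical except for the iter-400 value of 1.403 quoted above, and the exact line format of mlx_lm's output may differ between versions.

```python
import re

# Hypothetical excerpt of an mlx_lm.lora training log
log = """\
Iter 200: Val loss 1.452, Val took 8.1s
Iter 400: Val loss 1.403, Val took 8.0s
Iter 600: Val loss 1.418, Val took 8.2s
"""

# Extract (iteration, val_loss) pairs and select the minimum-loss checkpoint
pairs = [(int(i), float(l)) for i, l in re.findall(r"Iter (\d+): Val loss ([\d.]+)", log)]
best_iter, best_loss = min(pairs, key=lambda p: p[1])
print(best_iter, best_loss)  # 400 1.403
```

The selected iteration is then passed to `main.py convert --checkpoint` as shown in the usage section.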
| Choice | Reason |
|---|---|
| LoRA SFT | Persona injection — teaching a new output style. SFT is the standard approach. |
| Qwen3-4B-Instruct | Dense (not MoE), strong chat/code baseline, modern architecture. Instruct variant recommended for <1000 examples. |
| bf16 (not quantized) | Cleaner gradients during training, compatible with llama.cpp GGUF converter. |
| No system prompt | Pirate-bunny behavior baked into weights, not dependent on prompting. |
| Dolly-15k for questions | Diverse, real-world questions across multiple categories. Avoids synthetic question generation artifacts. |
| Ollama cloud for answers | Offloads answer generation to cloud GPU, preserving local GPU for training. |
| MLX | Native Apple Silicon training — no CUDA required, uses unified memory efficiently. |
