Agent Pirate Bunny

A cartoon of a cute fluffy bunny dressed like a pirate

Generated with `ollama run x/flux2-klein:9b "a cartoon of a cute fluffy bunny sitting close to a beautiful pirate hat"`

Fine-tune a small LLM to always speak like a pirate captain with a bunny crew, regardless of topic. The pirate-bunny persona is baked into the model weights via LoRA SFT — no system prompt needed at inference time.

Methodology

Approach

The goal is persona injection: make a base model always respond in a specific style (heavy pirate dialect + bunny references) while preserving its knowledge and reasoning capabilities. We use LoRA Supervised Fine-Tuning (SFT) — the standard approach for teaching a model "how to talk" rather than "what to know."

The key design decision is no system prompt in training data or at inference. Instead of relying on a system prompt to instruct the model to act like a pirate, we train the persona directly into the weights. This means the model defaults to pirate-bunny behavior on any user message.
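As background, the low-rank update a LoRA adapter learns can be sketched in a few lines of numpy. The dimensions, seed, and initialization here are illustrative, not Qwen3-4B's:

```python
import numpy as np

# Hypothetical sizes for illustration: hidden size, LoRA rank, LoRA alpha.
d, r, alpha = 64, 16, 16.0

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

# Effective weight after fusing the adapter back into the base model:
W_fused = W + (alpha / r) * B @ A

# With B zero-initialized, the adapter starts as a no-op.
assert np.allclose(W_fused, W)

# Trainable fraction: two small matrices instead of the full d x d block.
trainable = A.size + B.size
print(trainable / W.size)  # 2*r*d / d^2 = 2*r/d
```

Only `A` and `B` receive gradients; the tiny trainable fraction is why the adapter can encode a speaking style without disturbing the base model's knowledge.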

Data Generation Pipeline

  1. Questions — sampled from Dolly-15k, stratified across 4 categories (open_qa, general_qa, brainstorming, creative_writing). This gives diverse, realistic user questions without needing to generate them synthetically.

  2. Answers — generated by gpt-oss:120b-cloud via Ollama cloud, prompted to respond in character as Captain Flopsy with detailed pirate dialect and bunny reference rules. A post-processing step strips any leaked chain-of-thought reasoning (`<think>` tags or unprompted preamble).

  3. Format — each example is a two-turn ChatML conversation (user question → assistant answer), stored as JSONL. No system message included. Split 75/25 into training and validation sets.
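Concretely, one training example can be serialized like this. The question/answer pair below is made up for illustration; real answers come from the generation step:

```python
import json

# Hypothetical example pair; real answers are produced by gpt-oss:120b-cloud.
question = "What is gradient descent?"
answer = "Arr, me fluffy crew! Gradient descent be how we steer the ship downhill..."

# Two-turn conversation, no system message, one JSON object per JSONL line.
example = {
    "messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
}
line = json.dumps(example)

# Round-trip check: each JSONL line parses back to the same structure.
assert json.loads(line)["messages"][0]["role"] == "user"
```

Writing one object per line keeps the train/valid split a simple line-level operation.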

Training

LoRA fine-tuning via MLX (mlx_lm.lora), running natively on Apple Silicon. The adapter modifies a small subset of the model's weights (0.365% of parameters) to learn the pirate-bunny style.

`mask_prompt: true` ensures the loss is computed only on the assistant's response tokens, not the user's question, so the model learns how to respond, not how to parrot questions.
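A toy illustration of what prompt masking does to the loss (the per-token values are arbitrary):

```python
import numpy as np

# Per-token cross-entropy losses for one example (values are illustrative).
token_loss = np.array([0.9, 1.1, 0.8, 0.7, 1.2, 0.6])
# 1 where the token belongs to the assistant response, 0 for the user prompt.
mask = np.array([0, 0, 0, 1, 1, 1])

# With prompt masking, only assistant tokens contribute to the average loss.
masked_loss = (token_loss * mask).sum() / mask.sum()
print(masked_loss)  # average of 0.7, 1.2, 0.6
```

Without the mask, gradients would also push the model to reproduce user questions, which is wasted capacity for a persona adapter.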

Conversion and Deployment

The trained LoRA adapter is fused back into the base model, converted to GGUF format via llama.cpp, and served locally through Ollama with a ChatML template.
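The three stages map onto commands roughly like the following. Paths and the exact model id are illustrative; see convert.py for the real invocation:

```shell
# 1. Fuse the LoRA adapter into the base weights (mlx_lm's fuse entry point)
mlx_lm.fuse --model Qwen/Qwen3-4B-Instruct-2507 \
  --adapter-path adapters \
  --save-path fused_model

# 2. Convert the fused HF-format model to GGUF with llama.cpp
python ~/Developer/llama.cpp/convert_hf_to_gguf.py fused_model \
  --outfile pirate-bunny.gguf --outtype bf16

# 3. Register with Ollama (the Modelfile supplies the ChatML template)
ollama create pirate-bunny -f Modelfile
```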

Hardware

  • Machine: MacBook Pro, Apple M4 Max, 128GB unified memory
  • Training memory: ~11 GB peak
  • Training throughput: ~650 tokens/sec
  • Training time: ~5 minutes for 1125 iterations (500 examples, ~3 epochs)

Hyperparameters

Based on the Unsloth LoRA Hyperparameters Guide and Unsloth Datasets Guide:

| Parameter | Unsloth recommends | Our config | Notes |
|---|---|---|---|
| Base model | Instruct variant for <300 examples | Qwen3-4B-Instruct (bf16) | Model selection guide |
| Dataset size | 100 minimum, 1000+ optimal | 500 (375 train / 125 valid) | |
| Epochs | 1–3 | ~3 (1125 iters / 375 examples) | More than 3 risks overfitting |
| Learning rate | 2e-4 for LoRA | 2e-4 | 5e-6 for RL methods (DPO/GRPO) |
| LoRA rank | 16 or 32 | 16 | Higher = more capacity, more memory |
| LoRA scale | >= 1 (alpha = rank or 2x rank) | 1.0 | Controls adapter strength |
| LoRA layers | | 16 | |
| Batch size | 2 (with grad accum 8 = effective 16) | 1 | Limited by dataset size |
| Dropout | 0.0–0.1 | 0.0 | 0.1 if overfitting |
| `mask_prompt` | | true | Loss only on assistant tokens |
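As a sketch, these choices map onto a config/lora_config.yaml along these lines. Key names follow mlx_lm.lora's config schema; the model id is an assumption:

```yaml
# Illustrative sketch of config/lora_config.yaml
model: "Qwen/Qwen3-4B-Instruct-2507"   # model id is an assumption
train: true
data: "data"                            # directory holding train.jsonl / valid.jsonl
batch_size: 1
iters: 1125
learning_rate: 2e-4
num_layers: 16                          # LoRA layers
mask_prompt: true                       # loss on assistant tokens only
lora_parameters:
  rank: 16
  scale: 1.0
  dropout: 0.0
```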

Prerequisites

  • uv — Python package manager
  • Ollama — for inference and cloud model access (data generation)
  • llama.cpp — cloned to ~/Developer/llama.cpp for GGUF conversion

Project Structure

```
agent_pirate_bunny/
├── main.py                 # CLI orchestrator (generate/train/convert/all)
├── generate_dataset.py     # Dataset generation (Dolly questions + Ollama cloud answers)
├── train.py                # Training wrapper (calls mlx_lm.lora)
├── convert.py              # Fuse → GGUF → Ollama deployment
├── config/
│   ├── prompts.py          # Response generation prompt (pirate-bunny rules)
│   └── lora_config.yaml    # LoRA hyperparameters
├── data/
│   ├── train.jsonl         # Training data (375 examples)
│   └── valid.jsonl         # Validation data (125 examples)
├── adapters/               # LoRA adapter checkpoints (from training)
├── fused_model/            # Fused model output (from convert)
├── Modelfile               # Ollama model definition (ChatML template, no system prompt)
└── results.md              # Training curves, sample outputs, checkpoint comparison
```

Usage

Full pipeline

```shell
uv run python main.py all
```

Runs generate → (pause for data review) → train → convert.

Individual steps

```shell
# 1. Generate training data (Dolly questions + Ollama cloud pirate-bunny answers)
uv run python main.py generate

# 2. Train LoRA adapter
uv run python main.py train

# 3. Convert and deploy (final checkpoint)
uv run python main.py convert

# 3b. Convert a specific checkpoint (e.g. iter 400, best val loss)
uv run python main.py convert --checkpoint 400
```

Test

```shell
ollama run pirate-bunny "teach me about gradient descent"
ollama run pirate-bunny "write the python code to solve a quadratic equation"
ollama run pirate-bunny "What's the recipe for good spaghetti?"
```

Results

See results.md for full training curves, sample outputs from both the final and best-validation checkpoints, and run comparisons.

Summary

  • Val loss bottomed at iter 400 (1.403), slight overfitting in epochs 2-3
  • Both checkpoints produce heavy pirate dialect with dense bunny references
  • Code generation is correct and runnable with pirate-bunny variable names
  • Iter 400 checkpoint produces more concise outputs

Design Decisions

| Choice | Reason |
|---|---|
| LoRA SFT | Persona injection means teaching a new output style; SFT is the standard approach. |
| Qwen3-4B-Instruct | Dense (not MoE), strong chat/code baseline, modern architecture. Instruct variant recommended for <1000 examples. |
| bf16 (not quantized) | Cleaner gradients during training, compatible with llama.cpp GGUF converter. |
| No system prompt | Pirate-bunny behavior baked into weights, not dependent on prompting. |
| Dolly-15k for questions | Diverse, real-world questions across multiple categories. Avoids synthetic question generation artifacts. |
| Ollama cloud for answers | Offloads answer generation to a cloud GPU, preserving the local machine for training. |
| MLX | Native Apple Silicon training — no CUDA required, uses unified memory efficiently. |
