Every fine-tuning tutorial starts the same way: clone the repo, load a pretrained model, run the training script on Alpaca or ShareGPT data, celebrate the lower loss. Then you try it on your actual codebase and the model still doesn’t know what your ServiceLocator pattern is, still ignores your team’s error-handling conventions, still generates code that reads like it came from a different company.
The problem isn’t fine-tuning. It’s the tutorials.
This guide shows you how to fine-tune an LLM for code tasks using your own git history as training data, LoRA/QLoRA with Unsloth, and a structured LLM-as-judge evaluation pipeline that tells you whether the model is better — not just whether the loss went down. Everything here runs on hardware you already have access to.
Fine-Tuning vs. Prompt Engineering vs. RAG: A Decision Framework for Code Tasks
Most developers jump to fine-tuning too early — or too late. You’ve probably heard it thrown around like a magic fix, but without a clear trigger, you’ll waste a weekend training a model that performs worse than a well-crafted system prompt.
Here are the three signals that mean fine-tuning is actually worth it:
Signal 1: You’re writing the same corrections repeatedly. If you find yourself editing AI-generated code to match your team’s patterns dozens of times a week — same naming conventions, same error-handling structure, same module organization — that’s behavioral consistency you can bake into weights. Prompt engineering can’t enforce this reliably at scale. Research backs this up: fine-tuning GPT-3.5 for code review automation achieves 73–74% higher Exact Match scores compared to zero-shot prompting.
Signal 2: Your conventions don’t exist on the internet. Generic models were trained on public code. If your internal library, your team’s specific idioms, or your proprietary framework isn’t in any public repo, no amount of prompting will teach the model about it.
Signal 3: Latency matters. RAG adds retrieval round-trips at inference time. A fine-tuned model’s knowledge lives in the weights — response times drop, and nothing needs to be fetched.
If you hit two of these three, fine-tuning is worth the weekend.
Understanding how multi-model AI stacks behave in production clarifies why fine-tuning and prompt engineering aren’t substitutes — they operate at fundamentally different abstraction levels.
When to use RAG instead: Fast-changing information — API docs updated monthly, internal wikis edited constantly — makes fine-tuning go stale within weeks. RAG stays current without retraining. We’ll combine both approaches at the end.
When to stick with prompting: Fewer than ~500 representative examples of the behavior you want means you don’t have enough signal to fine-tune well. Build the dataset first.
Building Your Training Dataset from Git History
This is where most tutorials fail you completely. They hand you a synthetic dataset and call it a day. Your git history is better — it’s actual evidence of how your team writes and reviews code.
What to extract
You’re looking for instruction-response pairs. The cleanest sources in a typical repo:
- Commit diffs + commit messages: The diff is the “code to write,” the message is the “intent.” These map cleanly to instruction → response format.
- PR descriptions + associated diff: Richer context. A well-written PR description explains why something was done, which teaches the model intent alongside implementation.
- Code review comments + updated code: A reviewer says “this function should handle the null case” and the author updates the code. That’s correction data, and it’s invaluable.
- Docstrings + function bodies: If your team writes docstrings before implementing (or vice versa), these become description → implementation pairs.
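The commit-based extraction can be done with a single `git log` call. Here is a minimal sketch: it emits each commit’s subject line and patch with control-character delimiters so parsing stays trivial. The delimiter choice and the instruction/response field names are assumptions, not a standard.

```python
import subprocess

REC = "\x1e"    # record separator between commits
FIELD = "\x1f"  # separates subject line from diff

def git_log_raw(repo_path=".", max_commits=2000):
    """One git call: subject + patch per non-merge commit,
    delimited so a simple string split recovers the records."""
    return subprocess.run(
        ["git", "-C", repo_path, "log", "--no-merges", "-p",
         f"-n{max_commits}", "--pretty=format:%x1e%s%x1f"],
        capture_output=True, text=True, check=True,
    ).stdout

def parse_pairs(raw):
    """Turn the raw log output into instruction → response dicts."""
    pairs = []
    for record in raw.split(REC):
        if FIELD not in record:
            continue  # skip the empty leading record
        message, _, diff = record.partition(FIELD)
        pairs.append({"instruction": message.strip(),
                      "response": diff.strip()})
    return pairs
```

PR descriptions and review comments need your forge’s API (GitHub, GitLab) rather than git itself, so they are not covered by this sketch.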
If you’re running parallel development workflows, git worktrees let you extract dataset artifacts from your history without touching your active working branches — worth setting up before you start the extraction script.
Filtering and cleaning
Raw git history is messy. Apply these filters first:
- Remove merge commits and auto-generated commits — version bumps, lockfile updates, dependency pins. Pure noise.
- Set a diff size floor and ceiling — under 10 lines is probably a typo fix; over 300 lines is a refactor too complex to learn from atomically. Sweet spot: 20–150 lines changed.
- Deduplicate boilerplate — if your team uses a project scaffold generator, you’ll have dozens of nearly-identical file initializations. Keep one, drop the rest.
- Check author diversity — if 80% of commits are from one person, your model learns one developer’s patterns, not the team’s.
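A first pass at these filters might look like the sketch below. The noise-message list is an assumption you should tune for your own repo, and the hash-based dedupe only catches byte-identical boilerplate; near-duplicates need fuzzier matching.

```python
import hashlib

NOISE = ("bump version", "update lockfile", "merge branch",
         "update dependencies")  # assumption: tune for your repo

def keep(pair, min_lines=20, max_lines=150):
    """Size floor/ceiling plus a crude noise-message filter."""
    changed = sum(
        1 for line in pair["response"].splitlines()
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))  # skip file headers
    )
    if not (min_lines <= changed <= max_lines):
        return False
    msg = pair["instruction"].lower()
    return not any(n in msg for n in NOISE)

def dedupe(pairs):
    """Keep one copy of each exact-duplicate diff."""
    seen, out = set(), []
    for p in pairs:
        key = hashlib.sha1(p["response"].encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(p)
    return out
```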
After filtering, aim for 800–2,000 instruction-response pairs for a domain adaptation task. You don’t need 50,000 samples. A fine-tuned Gemma 4B can match an open-source 27B model on domain-specific tasks when the training data is well-curated — quality consistently beats quantity here.
Structuring the pairs
Format each example using the chat template of your target base model. Templates differ by family: Qwen2.5-style models use ChatML delimiters, while Llama 3.x uses header tokens (<|start_header_id|>user<|end_header_id|> … <|eot_id|>). A rendered ChatML pair looks like:
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
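Rather than hand-writing those special tokens, let the tokenizer render them. A minimal sketch, assuming a Hugging Face-style tokenizer and instruction/response dicts; it produces the "text" field that the SFT training config later reads:

```python
def to_text(example, tokenizer):
    """Render one instruction/response pair with the model's own
    chat template, so the training format always matches the base
    model's expectations."""
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages,
                                                  tokenize=False)}

# With a Hugging Face datasets object:
# dataset = dataset.map(lambda ex: to_text(ex, tokenizer))
```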
The exact format matters — and we’ll come back to why this is the single most common silent failure mode in fine-tuning.
LoRA or QLoRA? Choosing Based on VRAM, Time, and Quality Targets
Both LoRA and QLoRA are parameter-efficient fine-tuning (PEFT) approaches — they train a small set of adapter weights on top of a frozen base model rather than updating everything. The difference is how the base model’s weights are stored during training.
LoRA keeps the base model in full float16 or bfloat16 precision. QLoRA quantizes it to 4-bit (NF4), reducing memory at the cost of training speed.
The practical numbers: QLoRA cuts GPU memory usage by roughly 33% but adds about 39% more training time. Fine-tuning a 7B model takes ~2 hours with LoRA versus ~3 hours with QLoRA on a 14GB GPU. That trade-off is real and worth planning around if your weekend deadline is firm.
Full fine-tuning isn’t a realistic option at most scales. A 7B model requires 100–120 GB of VRAM for full precision training — roughly $50,000 of H100 hardware. QLoRA enables the same task on a $1,500 RTX 4090 with as little as 6.5 GB of VRAM for a 9B model.
Use QLoRA if:
– You have 16GB or less VRAM
– You’re training a 13B+ model on a 24GB card
– You’re comfortable letting it run overnight
Use LoRA if:
– You have a 24GB+ card and are training a 7B model
– You’re running multiple experiments on RunPod (at ~$1.50/hr, saving an hour per run compounds fast)
– You care more about iteration speed than memory headroom
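The checklist above reduces to a simple rule of thumb. This helper just encodes those bullets as code (a heuristic, not a hard law; measure on your own hardware):

```python
def pick_peft(vram_gb, model_params_b):
    """Encode the LoRA-vs-QLoRA checklist as a rough heuristic."""
    if vram_gb <= 16:
        return "qlora"   # tight memory: take the ~33% saving
    if model_params_b >= 13 and vram_gb <= 24:
        return "qlora"   # 13B+ model on a 24GB card
    return "lora"        # headroom available: optimize for speed
```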
Setting Up Unsloth for a Single-GPU Fine-Tuning Run
Unsloth is the right tool here. It achieves 2–5x faster training and 60–80% less VRAM compared to Flash Attention 2 through custom Triton GPU kernels — without approximations or accuracy loss. On a Qwen3 30B MoE variant, Unsloth completes training in 0.8 hours versus 9.4 hours with standard PyTorch, using only 17.5 GB VRAM.
Install
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
If you’re on RunPod with an A100 or RTX 4090, start from the Unsloth Docker image (unsloth/unsloth-bfloat16-bnb-4bit) — it saves 20 minutes of environment debugging.
Load the base model
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct",
    max_seq_length=4096,
    dtype=None,         # Auto-detect
    load_in_4bit=True,  # QLoRA
)
Qwen2.5-Coder and DeepSeek-Coder-V2-Lite are the strongest base models for code tasks as of 2026. Both have Unsloth-optimized checkpoints available on Hugging Face.
Configure LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules="all-linear",
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)
Launch training
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=your_dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.05,
        num_train_epochs=2,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=1,
        optim="adamw_8bit",
        lr_scheduler_type="cosine",
        output_dir="outputs",
    ),
)
trainer.train()
With this setup on an RTX 4090 and a 1,200-sample dataset, expect 2.5–3 hours for a 7B model.
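It helps to know how many optimizer steps that configuration implies, so warmup and logging cadence make sense. A small sketch of the arithmetic, using the batch settings from the training arguments above:

```python
import math

def training_steps(n_samples, per_device_bs=2, grad_accum=4,
                   epochs=2, n_gpus=1):
    """Optimizer steps implied by the TrainingArguments above:
    effective batch = per-device batch x grad accumulation x GPUs."""
    effective_batch = per_device_bs * grad_accum * n_gpus
    return math.ceil(n_samples / effective_batch) * epochs

# 1,200 samples at effective batch 8 for 2 epochs -> 300 steps
```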
The Hyperparameters That Actually Matter in 2026
Most fine-tuning guides either throw every knob at you or skip the reasoning entirely. Here’s what moves the needle for code tasks.
Rank (r): Start at r=16. Research shows intermediate ranks (32–64) can improve capacity, but for instruction-following on a curated 800–2,000 example dataset, r=16 is stable and rarely underfits. Only increase if you have 5,000+ examples and the loss plateaus unusually early.
target_modules: Use "all-linear" with Unsloth. Targeting all linear layers consistently outperforms manual selection (attention layers only) for code generation tasks.
Learning rate: 2e-4 with cosine warmup is the 2026 consensus starting point. Above 3e-4 causes instability on small datasets. Below 5e-5, the model barely moves from the base checkpoint.
Epochs: For 1,000–2,000 examples, 2–3 epochs. Beyond that, you’re memorizing training samples, not generalizing.
The #1 silent failure mode: chat template mismatch
This trips up experienced developers. Every instruct-tuned model was trained with a specific chat template — special tokens delimiting user turns, assistant turns, and system prompts. If your training data doesn’t use the exact same template, the model learns conflicting formatting signals.
The result looks deceptively normal: training loss goes down, the run completes without errors. Then at inference time, the model generates rambling outputs, cuts off mid-response, or ignores the instruction entirely — with nothing in the logs to tell you why.
Fix: always use tokenizer.apply_chat_template() when formatting your dataset. Let the tokenizer define the format. Don’t guess it from a blog post (including this one).
Evaluating Your Fine-Tuned Model with LLM-as-Judge
“Run inference and eyeball the output” is not evaluation. It’s vibes.
LLM-as-judge is the right approach for 2026. Take a held-out test set of 50–100 real code review tasks from your codebase, run both your fine-tuned model and your baseline (the base model or GPT-4o with a system prompt), then have a capable judge model score each output on a structured rubric. Recent research shows LLM-as-judge reaches a Pearson correlation of 0.81 with human scores for code tasks — versus only 0.34 for the best traditional metric, ChrF++. Sophisticated judge models now align with human judgment up to 85% of the time, exceeding human-to-human agreement at 81%.
Building the eval pipeline
Your judge prompt needs explicit criteria:
Rate the following code response on a 1–5 scale for each:
1. Correctness: Does it solve the stated task?
2. Convention adherence: Does it match the style in the provided context?
3. Completeness: Does it handle edge cases mentioned in the task?
Provide a score and one-sentence rationale for each criterion.
Use GPT-4o or a local reasoning model (Qwen3-32B works well) as your judge. Compare mean scores across your test set.
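However you call the judge (an API client or a local server), you still have to turn its free-text replies into numbers. A sketch of the parsing and aggregation side, assuming the judge roughly follows the rubric format above:

```python
import re
from statistics import mean

CRITERIA = ("correctness", "convention adherence", "completeness")

def parse_scores(judge_reply):
    """Pull '<criterion>: <1-5>' scores out of one judge reply.
    Assumes the judge follows the rubric's naming; replies that
    drop a criterion simply contribute no score for it."""
    scores = {}
    for crit in CRITERIA:
        m = re.search(rf"{crit}\s*[:\-]\s*([1-5])",
                      judge_reply, re.IGNORECASE)
        if m:
            scores[crit] = int(m.group(1))
    return scores

def mean_scores(replies):
    """Average each criterion across the whole test set."""
    per_crit = {c: [] for c in CRITERIA}
    for reply in replies:
        for crit, score in parse_scores(reply).items():
            per_crit[crit].append(score)
    return {c: mean(v) for c, v in per_crit.items() if v}
```

Run this once over the fine-tuned model's judged outputs and once over the baseline's, then compare the two dictionaries criterion by criterion.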
Set your threshold before running the eval. If your fine-tuned model doesn’t beat the baseline by at least 15% on convention adherence — the dimension you specifically trained for — the training didn’t work. Go back and audit your dataset for quality issues before changing any hyperparameters.
Catching Catastrophic Forgetting Before It Ships
Fine-tuning on a narrow dataset can degrade general coding ability. The model gets better at your conventions and simultaneously worse at problems it’s never seen — classic catastrophic forgetting.
The mitigation is straightforward, but almost nobody does it: run HumanEval and MBPP on your fine-tuned model before deploying it anywhere.
- HumanEval: 164 Python programming problems measuring general code generation
- MBPP: 500 crowd-sourced Python problems with broader coverage
Use bigcode-evaluation-harness:
accelerate launch main.py \
--model your_finetuned_model \
--tasks humaneval,mbpp \
--allow_code_execution
Compare pass@1 scores against the base model checkpoint. A 2–3 point drop is acceptable when you’re seeing significant gains on domain-specific tasks. More than 5 points, and you have a quality problem — most commonly too many training epochs, incorrect code examples in your dataset, or that chat template mismatch we already covered.
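Those thresholds are easy to encode so the check can live in CI rather than in someone's head. A tiny sketch (the "borderline" band between 3 and 5 points is my reading of the guidance above):

```python
def forgetting_verdict(base_pass1, tuned_pass1):
    """Apply the pass@1 drop thresholds above (scores in points,
    e.g. 70.0 vs 68.0)."""
    drop = base_pass1 - tuned_pass1
    if drop <= 3:
        return "acceptable"
    if drop <= 5:
        return "borderline"   # gray zone: audit before shipping
    return "quality problem"
```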
Fine-tuned models can outperform GPT-4 with prompt engineering by 28 percentage points on MBPP, but GPT-4 with task-specific prompting still leads on HumanEval. Knowing where your model stands on both benchmarks tells you where to direct users and where to warn them.
The Hybrid Playbook: Combining Your Fine-Tuned Model with RAG
Fine-tuning teaches behavioral patterns. RAG retrieves facts. The mistake is using one where the other is better suited.
After a weekend fine-tuning run, your model understands your team’s naming conventions, error-handling idioms, and architectural patterns — that knowledge lives in the weights. But it doesn’t know that your internal auth-client library shipped a breaking change to its token refresh API last Tuesday.
That’s where RAG fills the gap. Before wiring up your retrieval layer, setting up RAG over internal documentation covers the chunking and indexing decisions in depth — the details that determine whether retrieval actually surfaces the right context at inference time.
The split is clean: your fine-tuned model handles how to write it — conventions, structure, style. RAG handles what it is — current API surfaces, accurate function signatures, recent architectural decisions documented in your internal wiki.
For implementation: use a vector database (Chroma for local, Pinecone or Weaviate at team scale) indexed over your internal docs in markdown or parsed from docstrings. Chunk at the function or class level rather than arbitrary character counts — code has natural boundaries, and respecting them dramatically improves retrieval precision.
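Function-level chunking for Python sources can be done with the standard library's ast module. A minimal sketch; it only walks top-level definitions, and the ID scheme is an assumption for retrieval metadata:

```python
import ast

def chunk_python_source(source, path="<mem>"):
    """Split a Python file at top-level function/class boundaries,
    keeping each unit's source plus an ID for retrieval metadata."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef,
                             ast.AsyncFunctionDef,
                             ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({"id": f"{path}:{node.name}", "text": text})
    return chunks
```

The resulting id/text dicts map directly onto what vector stores expect at ingestion time (Chroma's collection.add, for example, takes parallel lists of documents and ids).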
This combination is where the meaningful productivity gains are in 2026. Fine-tuning alone gives you a model that codes like your team but may hallucinate outdated APIs. RAG alone gives you factual accuracy with generic style. Together, they cover both dimensions without the weaknesses of either.
What You Walk Away With
Fine-tuning an LLM for your codebase is a weekend project, not a research initiative. Extract your dataset from git history, choose QLoRA if you’re VRAM-constrained or LoRA if iteration speed matters more, run your Unsloth training job, evaluate with LLM-as-judge against a real held-out test set, check for catastrophic forgetting with HumanEval/MBPP, then layer RAG on top for live documentation and fast-changing APIs.
The dataset is the highest-leverage piece of this entire pipeline. A mediocre hyperparameter choice on a high-quality dataset will beat perfect hyperparameters on noisy data every time. Start there.
If you run this and hit a snag — chat template weirdness, unexpected loss curves, evaluation scores that don’t match your intuition — drop the specifics in the comments. The combination of base model, dataset size, and hardware matters more than any single configuration choice.