RAG gives the model access to new information. Fine-tuning changes how the model thinks. Most SMBs need the first, not the second — and confusing them is an expensive mistake.
Fine-tuning modifies a model’s weights by training it on a curated dataset of examples. The model learns new behaviors, styles, response formats, or domain-specific patterns that weren’t in its original training. Unlike RAG, which retrieves knowledge at inference time, fine-tuning bakes knowledge into the model permanently.
Fine-Tuning vs RAG: The Decision
This is the most common strategic question in SMB AI deployment.
| Scenario | Better approach | Why |
|---|---|---|
| Model needs to know your product’s features | RAG | Knowledge changes; easier to update |
| Model needs to respond in your company’s specific tone | Fine-tuning | Style doesn’t change; baking it in is more reliable |
| Model needs current pricing or policy information | RAG | Information changes; fine-tuning becomes stale |
| Model needs to output a very specific JSON schema | Prompt engineering first, fine-tuning if prompting fails | System prompts often sufficient |
| Model keeps misunderstanding your domain terminology | Fine-tuning | Vocabulary changes are behavioral, not factual |
| Model needs to classify documents into 50 proprietary categories | Fine-tuning | Classification accuracy improves significantly with examples |
| Model needs to follow a specific regulatory format | Fine-tuning | Format compliance is more reliable when trained |
The core distinction: RAG solves knowledge problems. Fine-tuning solves behavior problems.
If your AI keeps giving wrong answers because it doesn’t know your product’s features, that’s a RAG problem. If your AI gives correct-but-unusable answers because the response format is wrong, or the tone is wrong, or it can’t classify your proprietary categories — that’s a fine-tuning problem.
LoRA: The Efficient Path
Full fine-tuning (updating all model weights) is expensive and compute-intensive. For most SMB use cases, LoRA (Low-Rank Adaptation) is the practical approach.
LoRA works by training a small set of adapter weights rather than modifying the full model. The base model stays frozen. The LoRA adapter (typically 10–100MB) layers on top and modifies outputs. The result: training costs drop by 60–90% and the adapter can be swapped or removed without touching the base model.
A LoRA adapter for a 7B model requires roughly:
- 1 A100 40GB or 2 A100 24GB for training
- 1–5 hours of training depending on dataset size
- ~2,000–10,000 example pairs for meaningful behavior change
Training the adapter yourself costs $50–$200 in cloud GPU time for a well-scoped task. Managed fine-tuning services (OpenAI, Together AI) run $10–$50 per million tokens of training data, depending on provider and model size.
When the Numbers Don’t Work
Most SMB fine-tuning projects are premature. The common scenario:
- Team spends 3 weeks collecting and labeling 500 training examples
- Fine-tunes on OpenAI or trains a LoRA adapter
- The fine-tuned model is better on the labeled task
- They discover that 3 weeks of prompt engineering would have gotten them 80% of the quality improvement at zero training cost
Prompt engineering should always precede fine-tuning. Few-shot examples in the system prompt, output constraints, chain-of-thought instructions — exhaust these first. Fine-tuning is for the gap that remains after serious prompt optimization.
The break-even point: if fine-tuning saves you 15 tokens per inference and you run 10 million inferences per month, you save $150/month on a model priced at $1/million tokens. A fine-tuning project that costs $500 in data labeling + $200 in training pays back in ~5 months at those numbers. Lower volume or smaller per-inference savings and the economics don’t work.
The Maintenance Trap
Fine-tuned models create a new maintenance burden. When a better base model releases (roughly every 3–4 months for popular open-source models), your options are:
- Stay on the old base model (cheaper but you miss quality improvements)
- Re-train the LoRA adapter on the new base model (1–5 hours + compute cost, but behavior may shift)
- Fully re-evaluate the fine-tuned model against your task (required either way)
Contrast with RAG: when a better model releases, you update the inference endpoint and keep your knowledge base. No retraining.
Fine-tuning locks you into a specific model version in a way RAG doesn’t. For fast-moving domains, this is a real cost.
When Fine-Tuning Genuinely Wins
- Domain vocabulary. Models trained on medical or legal text at a small scale often perform better on specialty terminology than general models + RAG.
- Proprietary classification schemes. Classifying documents into categories that don’t exist anywhere on the internet — your internal ticket taxonomy, your product’s defect codes, your organization’s regulatory categories.
- Consistent response formatting. When the output must always follow a very specific structure (insurance claim forms, regulatory filings, API responses for legacy systems) and prompt engineering has failed to maintain consistency.
- Privacy. When the use case can’t send data to external RAG infrastructure — fine-tuning bakes knowledge in so inference is fully local.
Related
- RAG — The alternative for knowledge problems; usually try this first
- Model Quantization — Fine-tuned LoRA adapters also require quantization for deployment
- TCO — Training cost + maintenance cost belongs in total cost of ownership
- Ollama — Can serve models with LoRA adapters locally
- Cost Overrun — Fine-tuning projects frequently overrun if scoped without evaluation gates
- Data & Knowledge — Where fine-tuning fits in the knowledge infrastructure layer