RAG gives the model access to new information. Fine-tuning changes how the model thinks. Most SMBs need the first, not the second — and confusing them is an expensive mistake.

Fine-tuning modifies a model’s weights by training it on a curated dataset of examples. The model learns new behaviors, styles, response formats, or domain-specific patterns that weren’t in its original training. Unlike RAG, which retrieves knowledge at inference time, fine-tuning bakes knowledge into the model permanently.

Fine-Tuning vs RAG: The Decision

This is the most common strategic question in SMB AI deployment.

ScenarioBetter approachWhy
Model needs to know your product’s featuresRAGKnowledge changes; easier to update
Model needs to respond in your company’s specific toneFine-tuningStyle doesn’t change; baking it in is more reliable
Model needs current pricing or policy informationRAGInformation changes; fine-tuning becomes stale
Model needs to output a very specific JSON schemaPrompt engineering first, fine-tuning if prompting failsSystem prompts often sufficient
Model keeps misunderstanding your domain terminologyFine-tuningVocabulary changes are behavioral, not factual
Model needs to classify documents into 50 proprietary categoriesFine-tuningClassification accuracy improves significantly with examples
Model needs to follow a specific regulatory formatFine-tuningFormat compliance is more reliable when trained

The core distinction: RAG solves knowledge problems. Fine-tuning solves behavior problems.

If your AI keeps giving wrong answers because it doesn’t know your product’s features, that’s a RAG problem. If your AI gives correct-but-unusable answers because the response format is wrong, or the tone is wrong, or it can’t classify your proprietary categories — that’s a fine-tuning problem.

LoRA: The Efficient Path

Full fine-tuning (updating all model weights) is expensive and compute-intensive. For most SMB use cases, LoRA (Low-Rank Adaptation) is the practical approach.

LoRA works by training a small set of adapter weights rather than modifying the full model. The base model stays frozen. The LoRA adapter (typically 10–100MB) layers on top and modifies outputs. The result: training costs drop by 60–90% and the adapter can be swapped or removed without touching the base model.

A LoRA adapter for a 7B model requires roughly:

  • 1 A100 40GB or 2 A100 24GB for training
  • 1–5 hours of training depending on dataset size
  • ~2,000–10,000 example pairs for meaningful behavior change

Training the adapter yourself costs $50–$200 in cloud GPU time for a well-scoped task. Managed fine-tuning services (OpenAI, Together AI) run $10–$50 per million tokens of training data, depending on provider and model size.

When the Numbers Don’t Work

Most SMB fine-tuning projects are premature. The common scenario:

  1. Team spends 3 weeks collecting and labeling 500 training examples
  2. Fine-tunes on OpenAI or trains a LoRA adapter
  3. The fine-tuned model is better on the labeled task
  4. They discover that 3 weeks of prompt engineering would have gotten them 80% of the quality improvement at zero training cost

Prompt engineering should always precede fine-tuning. Few-shot examples in the system prompt, output constraints, chain-of-thought instructions — exhaust these first. Fine-tuning is for the gap that remains after serious prompt optimization.

The break-even point: if fine-tuning saves you 15 tokens per inference and you run 10 million inferences per month, you save $150/month on a model priced at $1/million tokens. A fine-tuning project that costs $500 in data labeling + $200 in training pays back in ~5 months at those numbers. Lower volume or smaller per-inference savings and the economics don’t work.

The Maintenance Trap

Fine-tuned models create a new maintenance burden. When a better base model releases (roughly every 3–4 months for popular open-source models), your options are:

  • Stay on the old base model (cheaper but you miss quality improvements)
  • Re-train the LoRA adapter on the new base model (1–5 hours + compute cost, but behavior may shift)
  • Fully re-evaluate the fine-tuned model against your task (required either way)

Contrast with RAG: when a better model releases, you update the inference endpoint and keep your knowledge base. No retraining.

Fine-tuning locks you into a specific model version in a way RAG doesn’t. For fast-moving domains, this is a real cost.

When Fine-Tuning Genuinely Wins

  • Domain vocabulary. Models trained on medical or legal text at a small scale often perform better on specialty terminology than general models + RAG.
  • Proprietary classification schemes. Classifying documents into categories that don’t exist anywhere on the internet — your internal ticket taxonomy, your product’s defect codes, your organization’s regulatory categories.
  • Consistent response formatting. When the output must always follow a very specific structure (insurance claim forms, regulatory filings, API responses for legacy systems) and prompt engineering has failed to maintain consistency.
  • Privacy. When the use case can’t send data to external RAG infrastructure — fine-tuning bakes knowledge in so inference is fully local.
  • RAG — The alternative for knowledge problems; usually try this first
  • Model Quantization — Fine-tuned LoRA adapters also require quantization for deployment
  • TCO — Training cost + maintenance cost belongs in total cost of ownership
  • Ollama — Can serve models with LoRA adapters locally
  • Cost Overrun — Fine-tuning projects frequently overrun if scoped without evaluation gates
  • Data & Knowledge — Where fine-tuning fits in the knowledge infrastructure layer