The honest decision tree: most teams should use RAG. Fine-tuning earns its keep in three specific cases, and we name them with dollar figures attached.
Fine-tuning is the wrong default for most businesses. RAG is cheaper, easier to update, and just as accurate for roughly 80% of the jobs that get pitched as “we should fine-tune a model on our data.”
For Q&A over docs, customer-support assistants, internal search, and sales-enablement bots, just use retrieval: Pinecone or Postgres with pgvector, an embedding model (text-embedding-3-small is still the price/quality sweet spot), and a re-ranker. Build cost: $4–12k. Operating cost: under $200/mo for most teams.
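The retrieval loop behind that stack is small. Below is a minimal sketch of the embed-rank-return core, with a toy bag-of-words function standing in for the embedding model and in-memory cosine similarity standing in for the vector store's nearest-neighbor query; the function names and toy corpus are illustrative, not from any library mentioned above.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" for illustration only; in production
    # this call would hit an embedding model such as text-embedding-3-small.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity, the same ranking pgvector computes server-side.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Embed the query, rank every doc by similarity, return the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "How to reset your password in the admin console",
    "Quarterly sales enablement deck for the enterprise tier",
    "Troubleshooting steps for failed payment webhooks",
]
print(retrieve("customer cannot reset password", docs, k=1))
```

In a real deployment the top-k candidates would then pass through the re-ranker before being stuffed into the prompt; the shape of the pipeline stays the same.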
Fine-tuning a small model (Llama 3 8B) runs $2–5k one-shot, with inference around $0.40 per 1M tokens; a larger model (Llama 3 70B) runs $15–35k one-shot, around $2 per 1M tokens. RAG over 100k docs: $4–12k to build, about $200/mo to operate. The break-even lands around 50M tokens of inference per month; below that volume, RAG wins on cost.
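The break-even is worth recomputing with your own prices rather than taking any single figure on faith. A sketch of the arithmetic, with the fine-tune build cost amortized monthly and an assumed $4/1M hosted base-model price for the RAG side (that price is our placeholder, not a figure from the comparison above):

```python
def break_even_tokens_m(build_cost: float, amortize_months: int,
                        ft_per_m: float, rag_ops: float,
                        base_per_m: float) -> float:
    # Monthly volume (in millions of tokens) at which the fine-tuned
    # model's amortized build cost plus cheap inference undercuts RAG's
    # fixed ops plus pricier base-model inference:
    #   rag_ops + v * base_per_m = build/months + v * ft_per_m
    amortized = build_cost / amortize_months
    return (amortized - rag_ops) / (base_per_m - ft_per_m)

# Illustrative inputs only: $3.5k fine-tune amortized over 12 months,
# $0.40/1M fine-tuned inference, $200/mo RAG ops, assumed $4/1M base model.
v = break_even_tokens_m(3500, 12, 0.40, 200, 4.0)
print(f"break-even ~ {v:.1f}M tokens/month")
```

Under these placeholder prices the crossover comes out near 25M tokens/month; plug in your actual model and ops pricing, since the answer moves linearly with both.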
We default to RAG on every AI integration unless one of the three cases above holds.