Part 1 of 4
🤖 Ghostwritten by Claude · Curated by Tom Hundley
Fine-tuning LLMs for enterprise sounds compelling. A model that truly understands your domain, your terminology, your specific patterns. The marketing pitch writes itself: "Train GPT on your data and watch productivity soar."
Here's what the pitch doesn't mention: fine-tuning is expensive, time-consuming, and in most cases, completely unnecessary.
I've helped organizations navigate the fine-tuning decision dozens of times. The pattern is remarkably consistent: teams arrive convinced they need to fine-tune, and about 80% of the time we find a simpler solution that works better. This guide will help you make that decision intelligently, and if fine-tuning is actually the right choice, show you how to do it properly.
Before we dive into the how, let's be clear about the when. Fine-tuning is the right choice in a narrow set of circumstances:
Good candidates for fine-tuning:

- Enforcing a highly specific output format that prompting and JSON mode can't deliver reliably
- Teaching domain-specific language the model consistently misreads (medical coding, legal citations)
- Matching a precise voice or style that resists prompt-based control
- Cutting inference costs by making a smaller model perform a narrow task well
Poor candidates for fine-tuning (most use cases):

- Giving the model access to company knowledge that changes over time (that's a RAG problem)
- Anything achievable with better prompts, few-shot examples, or structured output modes
- Tasks where requirements are still evolving and retraining would be constant overhead
The decision framework is simple: prompt engineering first, RAG second, fine-tuning only when nothing else works. Let's explore why this hierarchy exists and how to implement each approach.
If you've decided fine-tuning is right for your use case, cloud providers offer the lowest-friction path to getting started.
OpenAI's fine-tuning API is the most mature option for cloud-based model customization. Here's what the process actually looks like:
Step 1: Prepare Your Data
OpenAI expects JSONL files with conversation-format training examples:
```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's our return policy?"}, {"role": "assistant", "content": "Our return policy allows returns within 30 days of purchase with original receipt."}]}
```

The critical insight here: quality trumps quantity. Fifty well-crafted examples often outperform five hundred mediocre ones. Each example should demonstrate exactly the behavior you want: tone, format, reasoning style, everything.
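Before uploading, it's worth a quick local validation pass. A minimal sketch, assuming your examples live in `training_data.jsonl`:

```python
# Validate JSONL training data before uploading -- catching malformed
# examples locally is cheaper than a failed fine-tuning job.
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

with open("training_data.jsonl") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)  # raises on malformed JSON
        roles = [m["role"] for m in example["messages"]]
        assert set(roles) <= ALLOWED_ROLES, f"line {i}: unknown role in {roles}"
        assert roles[-1] == "assistant", f"line {i}: last message must be the assistant's"
print("All examples parsed cleanly.")
```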
Step 2: Upload and Train
```python
from openai import OpenAI

client = OpenAI()

# Upload the training file
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18"
)
```

Training typically takes 30 minutes to a few hours depending on dataset size. You'll receive a custom model ID when complete.
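Once the job finishes, you can poll its status and call the resulting model. A minimal sketch; the model ID format in the comment is illustrative:

```python
import time

# Poll until the job reaches a terminal state
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

if job.status == "succeeded":
    response = client.chat.completions.create(
        model=job.fine_tuned_model,  # e.g. "ft:gpt-4o-mini-2024-07-18:org::abc123"
        messages=[{"role": "user", "content": "What's our return policy?"}],
    )
    print(response.choices[0].message.content)
```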
Step 3: Cost Reality Check
This is where many teams get surprised. OpenAI's fine-tuning costs include:

- Training: billed per token in your training file, multiplied by the number of epochs
- Inference: fine-tuned models are billed at higher per-token rates than their base models
- Iteration: most projects need several training runs before the model behaves as intended
A typical fine-tuning run with 10,000 examples might cost $50-200 for training alone. But the real cost is inference: if your fine-tuned model costs 2x the base model per token, and you're making millions of calls, those costs compound quickly.
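To sanity-check a budget, the arithmetic is simple. A sketch, with the per-token price as an assumption you should replace with current published rates:

```python
# Back-of-the-envelope training cost estimate. The price is an assumed
# figure for illustration -- substitute the current published rate.
TRAIN_PRICE_PER_1M_TOKENS = 3.00  # USD, assumed

examples = 10_000
avg_tokens_per_example = 500   # assumed average for conversation examples
epochs = 3                     # training cost scales with epoch count

training_tokens = examples * avg_tokens_per_example * epochs
cost = training_tokens / 1_000_000 * TRAIN_PRICE_PER_1M_TOKENS
print(f"{training_tokens:,} billed tokens -> ~${cost:.0f}")  # 15,000,000 -> ~$45
```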
Anthropic doesn't offer public fine-tuning for Claude models. Instead, they focus on:

- Long context windows that let you put domain material directly in the prompt
- Strong out-of-the-box instruction following, which makes prompt engineering unusually effective
- Prompt caching, which makes long, context-heavy prompts economical at scale
This isn't a limitation; it's a design choice. Claude's architecture makes it exceptionally responsive to prompt engineering, often achieving results that would require fine-tuning on other models.
Each platform has trade-offs around cost, model quality, and integration complexity. The right choice often depends more on your existing cloud infrastructure than on the fine-tuning capabilities themselves.
Cloud fine-tuning is convenient, but local fine-tuning offers advantages that matter for many enterprises: data privacy, cost control, and unlimited customization.
Data privacy: Your training data never leaves your infrastructure. For healthcare, finance, legal, or any industry with sensitive data, this isn't optional; it's mandatory.
Cost control: After the initial hardware investment, you can run unlimited training experiments. No per-token fees, no surprise bills.
Deep customization: Local fine-tuning gives you access to every hyperparameter. You can experiment with learning rates, LoRA ranks, quantization strategies—whatever your use case requires.
The open-source model landscape has exploded. Here's how to navigate it:
Llama 3 (Meta): The default choice for most use cases. 8B and 70B parameter versions, excellent general capability, permissive license. Start here unless you have a specific reason not to.
Mistral: Strong performance relative to size, particularly good at following instructions. The 7B model punches above its weight class.
Phi-3 (Microsoft): Smaller models (3.8B) that perform surprisingly well. Ideal if you need to run on modest hardware or deploy at the edge.
Qwen 2 (Alibaba): Competitive performance, particularly strong on multilingual tasks.
For enterprise fine-tuning, I typically recommend starting with Llama 3 8B. It's large enough to capture complex patterns, small enough to fine-tune on reasonable hardware, and well-supported by the tooling ecosystem.
Let's be honest about what you actually need:
Minimum viable setup for fine-tuning Llama 3 8B:

- A single 24GB GPU (RTX 3090 or 4090)
- 32-64GB of system RAM
- A few hundred GB of fast NVMe storage for models, datasets, and checkpoints
Comfortable setup for serious experimentation:

- A 48GB card (RTX A6000) or an 80GB A100/H100
- 128GB of system RAM
- 2TB+ of NVMe storage so multiple model variants and checkpoints can coexist
Cloud alternatives:

- Hourly GPU rentals (RunPod, Lambda, Vast.ai) if you want to experiment before buying hardware
- Managed GPU instances on AWS, GCP, or Azure if you're already committed to one of them
Don't let hardware requirements scare you off. A single RTX 4090 can handle most fine-tuning tasks if you're using parameter-efficient methods like QLoRA.
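If you want a rough sense of whether a model fits your card, a back-of-the-envelope sketch helps; the overhead constant here is an assumption, not a measured value:

```python
def qlora_vram_gb(params_b: float, bits: int = 4, overhead_gb: float = 6.0) -> float:
    """Rough VRAM estimate for QLoRA: quantized weights plus a fixed
    budget for LoRA adapters, activations, and CUDA workspace."""
    weights_gb = params_b * bits / 8  # e.g. 8B params at 4-bit ~= 4 GB
    return weights_gb + overhead_gb

print(qlora_vram_gb(8))   # ~10 GB -> fits a 24 GB RTX 4090 comfortably
print(qlora_vram_gb(70))  # ~41 GB -> needs a 48 GB+ professional card
```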
QLoRA (Quantized Low-Rank Adaptation) is the technique that makes local fine-tuning accessible. Instead of updating all model parameters, you:

- Quantize the frozen base model to 4-bit precision, cutting its memory footprint roughly 4x versus 16-bit
- Attach small trainable low-rank adapter matrices to the attention projections
- Train only those adapters, which typically amount to less than 1% of the model's parameters

The result: you can fine-tune even a 70B parameter model on a single 48GB GPU, and an 8B model fits comfortably on a consumer card.
Unsloth: My current recommendation for most teams. It's fast (2x speed improvement over standard training), memory-efficient, and handles the complexity of QLoRA automatically.
```python
from unsloth import FastLanguageModel

# Load Llama 3 8B pre-quantized to 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)
```
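From there, training itself is a short loop. A minimal sketch using trl's SFTTrainer, which Unsloth is commonly paired with; exact argument names vary across trl versions, and `dataset` is assumed to be a Hugging Face dataset with a formatted `text` column:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,          # assumed: HF dataset with a "text" column
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        num_train_epochs=3,
        learning_rate=2e-4,             # common starting point for LoRA
        output_dir="outputs",
    ),
)
trainer.train()
```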
Axolotl: More configuration options, better for complex multi-task fine-tuning. Steeper learning curve but more powerful.
Hugging Face Transformers + PEFT: The foundational libraries. Use these if you need maximum control or want to understand exactly what's happening.
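If you do want that lower-level control, the equivalent setup in raw Transformers + PEFT looks roughly like this. A sketch: the model ID is Meta's gated Hugging Face repo, and the hyperparameters mirror the Unsloth example above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization config for the frozen base model
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # gated repo; requires HF access approval
    quantization_config=bnb,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters are a tiny fraction of total params
```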
For a first fine-tuning project, start with Unsloth. The documentation is excellent, the defaults are sensible, and you can go from raw data to fine-tuned model in an afternoon.
After working through dozens of fine-tuning projects, I've developed a simple framework for making the decision:
Before any other approach, spend serious time on prompt engineering. This means:

- Writing explicit, detailed system prompts that specify tone, format, and constraints
- Adding few-shot examples that demonstrate the exact output you want
- Using structured output features (JSON mode, function calling) where format matters
- Building a small evaluation set so you can measure whether changes actually help
Prompt engineering solves about 80% of "we need fine-tuning" requests. It's faster to iterate, costs nothing extra, and works with any model.
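To make that concrete, here is a minimal sketch of the prompt-engineering-first approach: a system prompt that pins down format and scope, plus few-shot examples. The company name, policy text, and model are placeholders:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = """You are a support assistant for Acme Co.
Answer in exactly two sentences. If the question is outside
returns or shipping policy, reply only: "Let me connect you with a human."
"""

# Few-shot examples demonstrating the desired tone and format
FEW_SHOT = [
    {"role": "user", "content": "Can I return an opened item?"},
    {"role": "assistant", "content": "Yes, opened items can be returned within 30 days with a receipt. A 15% restocking fee applies to opened electronics."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "system", "content": SYSTEM}, *FEW_SHOT,
              {"role": "user", "content": "What's the return window?"}],
)
print(response.choices[0].message.content)
```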
How long to try: Give prompt engineering 2-4 weeks of dedicated effort before concluding it's insufficient.
If you need the model to work with your company's specific information (documents, knowledge bases, product catalogs), RAG (Retrieval-Augmented Generation) is almost always better than fine-tuning.
Why? Fine-tuning bakes information into model weights. This information becomes stale, can't be easily updated, and is hard to audit. RAG keeps your data separate: retrieve relevant context, inject it into the prompt, and let the model reason over current information.
Fine-tuning teaches a model how to behave. RAG provides the model with what it needs to know. These are different problems with different solutions.
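The retrieve-then-prompt loop is simpler than it sounds. A minimal sketch: in production you would use a vector database, and the policy chunks here are placeholders; the in-memory search is illustrative only:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Pre-index the knowledge base (placeholder chunks)
chunks = ["Returns accepted within 30 days with receipt...",
          "Refunds are issued to the original payment method in 5-7 days..."]
chunk_vecs = embed(chunks)

def answer(question: str, k: int = 2) -> str:
    q = embed([question])[0]
    scores = chunk_vecs @ q  # embeddings are unit-length, so dot product = cosine similarity
    context = "\n".join(chunks[i] for i in np.argsort(scores)[::-1][:k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": f"Answer using only this context:\n{context}"},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

print(answer("What is the return window?"))
```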
Consider RAG when:

- The underlying information changes regularly (policies, catalogs, documentation)
- Responses must cite their sources for audit or compliance purposes
- The knowledge base is too large to fit into prompts or training runs
- You need to update or remove information without retraining anything
Fine-tuning is the right choice when:
- Format consistency is critical: The model must output in a very specific format, and prompt engineering plus JSON mode isn't reliable enough
- Domain-specific language: Medical coding, legal citations, financial regulations; domains where the model consistently misunderstands terminology despite good prompts
- Style transfer: You need outputs that match a specific voice or tone that's difficult to capture in prompts
- Efficiency at scale: A smaller fine-tuned model might be cheaper than a larger base model with extensive prompting
Work through these before committing to a fine-tuning project:

1. Have you spent dedicated time (2-4 weeks) on prompt engineering and hit a genuine ceiling?
2. Is the problem how the model behaves, rather than what it knows?
3. Is the task definition stable, or will shifting requirements force constant retraining?
4. Do you have (or can you build) enough high-quality examples of the exact behavior you want?
5. Do you have an evaluation set that can prove the fine-tuned model is actually better?
6. Have you budgeted for training runs, higher inference rates, and periodic retraining?

If you don't have clear answers to all six questions, you're not ready for fine-tuning.
Let me share an anonymized example that illustrates the decision framework in practice.
A financial services client came to us convinced they needed to fine-tune a model on their compliance documentation. The model needed to answer questions about their specific policies, cite relevant sections, and maintain the formal tone required for regulatory communications.
Our initial assessment:

- The compliance documentation was updated quarterly
- Regulators expected answers to cite specific policy sections
- The formal tone was achievable through careful prompting
The red flags were immediate: they wanted the model to know information that changes regularly. This is a RAG use case, not fine-tuning.
We implemented:

- A RAG pipeline: compliance documents chunked, embedded, and indexed in a vector store
- Retrieval that returns the relevant policy sections, with section identifiers for citations
- A carefully engineered system prompt enforcing the formal regulatory tone
The result: accurate responses, verifiable citations, and quarterly updates as simple as re-indexing documents. A fine-tuned model would have required retraining every quarter and couldn't have provided citations.
Total implementation time: 3 weeks. Cost: a fraction of what fine-tuning would have required.
Different client, different requirements. A healthcare tech company needed to extract structured data from clinical notes. The extraction format was highly specific—dozens of fields with precise definitions and interdependencies.
Why fine-tuning was right here:

- The task was about behavior (consistent structured extraction), not knowledge
- The field definitions were stable and wouldn't change month to month
- Prompt engineering had plateaued well below the accuracy they needed
We fine-tuned Llama 3 8B using QLoRA. The fine-tuned model achieved 94% accuracy on field extraction compared to 76% with the best prompt engineering approach. It also ran 3x faster because we could use a smaller model with shorter prompts.
The key difference: they needed to change how the model behaves (structured extraction), not what it knows (domain information).
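For illustration, a single training example for that kind of extraction task might look like this; the note and the field schema here are hypothetical, not the client's actual format:

```json
{"messages": [{"role": "system", "content": "Extract the defined fields from the clinical note. Output valid JSON only."}, {"role": "user", "content": "Pt is a 64yo M presenting with chest pain. BP 142/90. Current meds: lisinopril 10mg daily."}, {"role": "assistant", "content": "{\"age\": 64, \"sex\": \"M\", \"chief_complaint\": \"chest pain\", \"bp\": \"142/90\", \"medications\": [{\"name\": \"lisinopril\", \"dose_mg\": 10, \"frequency\": \"daily\"}]}"}]}
```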
If you've read this far and determined that fine-tuning is right for your use case, here's your starting path:
For cloud fine-tuning:

- Start with OpenAI's fine-tuning API and gpt-4o-mini as the base model
- Invest in 50-100 high-quality training examples before scaling up
- Budget for both training runs and the higher per-token inference rates
For local fine-tuning:

- Start with Llama 3 8B and Unsloth on a single 24GB GPU
- Use QLoRA so memory requirements stay manageable
- Keep your first project small: one well-defined task, one clear metric
For both approaches:

- Exhaust prompt engineering first; give it 2-4 weeks of dedicated effort
- Build an evaluation set before training so you can prove the fine-tune actually helps
- Plan for retraining: models, data, and requirements all drift over time
Fine-tuning is one tool in the AI engineering toolkit. This series continues with related topics that help you build production AI systems.
Understanding when and how to customize models is foundational knowledge for AI engineering. But remember: the best solution is often the simplest one that works. Prompt engineering and RAG solve most problems. Fine-tuning is powerful, but it's a tool you should reach for deliberately, not by default.
This article is a live example of the AI-enabled content workflow we build for clients.
| Stage | Who | What |
|---|---|---|
| Research | Claude Opus 4.5 | Analyzed current industry data, studies, and expert sources |
| Curation | Tom Hundley | Directed focus, validated relevance, ensured strategic alignment |
| Drafting | Claude Opus 4.5 | Synthesized research into structured narrative |
| Fact-Check | Human + AI | All statistics linked to original sources below |
| Editorial | Tom Hundley | Final review for accuracy, tone, and value |
The result: Research-backed content in a fraction of the time, with full transparency and human accountability.
We're an AI enablement company. It would be strange if we didn't use AI to create content. But more importantly, we believe the future of professional content isn't AI vs. human; it's AI amplifying human expertise.
Every article we publish demonstrates the same workflow we help clients implement: AI handles the heavy lifting of research and drafting, humans provide direction, judgment, and accountability.
Want to build this capability for your team? Let's talk about AI enablement →