Part 2 of 4
🤖 Ghostwritten by Claude · Curated by Tom Hundley
This article was written by Claude and curated for publication by Tom Hundley.
The AI industry has a size problem. Model distillation offers a solution: take what a large model knows and compress it into something smaller, faster, and cheaper to run.
The premise is counterintuitive. We spend billions training massive models, then immediately try to make them smaller? Yes—because training and deployment have different constraints. A 70B parameter model might be perfect for generating training data, but you can't run it on a customer's phone. You can't serve it economically at scale. You can't deploy it where data privacy requires local processing.
Distillation bridges that gap. This guide explains how it works, when to use it, and how to implement it in practice.
For years, the AI narrative was simple: bigger is better. More parameters meant more capability. GPT-3 had 175B parameters because that's what it took to achieve emergent behaviors that smaller models couldn't match.
That narrative is shifting. Recent developments show that smaller models, properly trained, can match or exceed larger models on specific tasks.
What changed? Better training data, improved architectures, and—critically—distillation techniques that transfer knowledge from large models to small ones.
Latency: A 7B model responds 10-20x faster than a 70B model on the same hardware. For interactive applications, this is the difference between usable and unusable.
Cost: Inference costs scale roughly linearly with model size. A 7B model costs about 1/10th what a 70B model costs per token. At scale, this compounds into significant savings.
Deployment flexibility: Smaller models can run on consumer laptops, mobile devices, edge hardware, and on-premises servers—places a 70B model simply can't go.
Energy efficiency: Smaller models use less power. For organizations with sustainability goals or those operating in power-constrained environments, this matters.
The question isn't whether to use smaller models—it's how to make them good enough for your use case.
Distillation is the process of training a small model (the student) to mimic a large model (the teacher). The insight is that the teacher's outputs contain more information than raw labels.
Imagine you're teaching someone to classify images. Traditional training says: "This is a cat. This is a dog. This is a cat." Hard labels, no nuance.
A teacher model does something different. It says: "This is 95% cat, 3% dog, 2% tiger. The pose is unusual, which is why there's some dog probability—dogs often sit this way."
This richer signal—called soft labels—gives the student more to learn from. The student learns not just the right answer, but why it's right and what alternatives were considered.
Soft labels become more informative when you adjust the temperature of the teacher's output. At normal temperature (T=1), a confident model might output something like 95% cat, 3% dog, 2% tiger. At higher temperature (T=4), the same prediction softens, spreading probability mass across the plausible alternatives.
The higher-temperature distribution reveals relationships the model learned during training. The student can learn from these relationships even though they're hidden in the low-temperature outputs.
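To make the effect concrete, here is a minimal sketch of temperature scaling, assuming the 95% / 3% / 2% distribution from the example above (the logits are illustrative, not from a real model):

```python
import torch
import torch.nn.functional as F

# Logits that produce roughly 95% cat, 3% dog, 2% tiger at T=1
# (illustrative values, not from a real model)
logits = torch.log(torch.tensor([0.95, 0.03, 0.02]))

for temperature in (1.0, 4.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 2) for p in probs.tolist()]}")

# Output (approximately):
# T=1.0: [0.95, 0.03, 0.02]
# T=4.0: [0.55, 0.23, 0.21]
```

The relationship between cat, dog, and tiger that was nearly invisible at T=1 becomes a learnable signal at T=4.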
Distillation transfers task-specific behavior well: classification, extraction, formatting, summarization—anything with clear inputs and outputs.
Distillation struggles to transfer broad general knowledge and emergent capabilities: complex reasoning, open-ended creative generation, and tasks the teacher itself barely handles.
Understanding these limits is crucial. You can't distill GPT-4's full capabilities into a 1B model. But you can distill its behavior on specific, bounded tasks.
There are several ways to implement distillation in practice. The right approach depends on your resources, use case, and the nature of the task.
The simplest and most popular approach: use the large model to generate training data, then train a small model on that data.
Process: collect a diverse set of inputs that represents your production traffic, run each one through the teacher model and save its output, then fine-tune the student on the resulting input/output pairs.
Example workflow:
```python
from openai import OpenAI
import json

client = OpenAI()

def generate_training_pair(input_text):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Extract key entities from the text as JSON."
        }, {
            "role": "user",
            "content": input_text
        }]
    )
    return {
        "input": input_text,
        "output": response.choices[0].message.content
    }

# Generate 10,000 training examples
# (input_texts: your collection of representative inputs, gathered elsewhere)
training_data = [generate_training_pair(text) for text in input_texts]

# Save for fine-tuning
with open("distillation_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
```

Advantages: it's simple to implement, it works with API-only teachers (no access to model internals required), and it reuses standard fine-tuning tooling.
Disadvantages: generating thousands of teacher outputs costs time and money, the student sees only the teacher's final outputs rather than its full probability distributions, and any teacher mistakes are baked into the training data.
If you have access to the teacher's logits (probability distributions), you can train the student to match those distributions directly.
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both distributions
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)

    # KL divergence between distributions
    loss = F.kl_div(
        soft_student,
        soft_teacher,
        reduction="batchmean"
    ) * (temperature ** 2)
    return loss
```

Advantages: the softened distributions carry far more signal per example than hard labels, so the student typically learns more from less data.
Disadvantages: it requires direct access to the teacher's logits, which rules out API-only models, and it adds complexity to the training loop.
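When you do have logits, the distillation loss is usually blended with ordinary cross-entropy on the hard labels rather than used alone. A minimal sketch of a combined training step, assuming `student` and `teacher` are callable models that return logits and `inputs`/`labels` form a training batch (all hypothetical names):

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, inputs, labels, optimizer,
               temperature=4.0, alpha=0.5):
    # Teacher runs in inference mode; only the student is updated
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)

    # Blend soft-label (distillation) loss with hard-label cross-entropy
    soft_loss = distillation_loss(student_logits, teacher_logits, temperature)
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `alpha` weight controls how much the student trusts the teacher's soft labels versus the ground-truth labels; a value around 0.5 is a reasonable starting point to tune from.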
For reasoning tasks, distill not just the answer but the reasoning process.
Process: prompt the teacher to reason step by step, capture the full reasoning along with the final answer, and train the student on that complete output.
```python
system_prompt = """Solve this problem step by step.
Show your reasoning, then give the final answer."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "system",
        "content": system_prompt
    }, {
        "role": "user",
        "content": problem_text
    }]
)

# Training pair includes full reasoning
training_example = {
    "input": problem_text,
    "output": response.choices[0].message.content  # Includes reasoning
}
```

This approach is particularly effective for math, coding, and logical reasoning tasks. The student learns not just what to answer, but how to think.
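For evaluation you usually need to separate the final answer from the reasoning. A small parsing sketch, assuming the prompt asks the teacher to end with a line beginning "Final answer:" (an assumed convention, not something the API enforces):

```python
def extract_final_answer(output_text):
    # Look for the last line that starts with the assumed "Final answer:" marker
    for line in reversed(output_text.strip().splitlines()):
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return None  # No marker found; flag for manual review
```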
Instead of trying to replicate general capabilities, focus on a single task or narrow domain.
Example: Sentiment classification
Rather than distilling a general-purpose model, create a dedicated sentiment classifier: prompt the teacher to return only a label (positive, negative, or neutral), generate labels across your real traffic, and fine-tune a small model on those pairs (a labeling sketch follows below).
Task-specific distillation consistently outperforms general distillation because the student only has to cover a narrow slice of the teacher's behavior, the training data can concentrate on that slice, and quality is much easier to measure.
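A minimal sketch of the teacher-labeling step, reusing the `client` from the earlier examples and assuming a fixed three-way label set (the prompt wording and `ALLOWED_LABELS` set are illustrative choices):

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def teacher_sentiment_label(text):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Classify the sentiment of the user's text. Reply with exactly one word: positive, negative, or neutral."
        }, {
            "role": "user",
            "content": text
        }]
    )
    label = response.choices[0].message.content.strip().lower()
    # Discard anything outside the allowed label set rather than training on noise
    return label if label in ALLOWED_LABELS else None
```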
Distillation makes sense in specific circumstances. Here's a decision framework:
High-volume, low-latency requirements: If you're making millions of API calls or need sub-second response times, distillation can drop costs by 90% while maintaining quality.
Edge deployment: When models need to run on devices without internet access or with limited compute, distillation is often the only path forward.
Privacy-sensitive applications: Local models keep data local. Distill a capable model, deploy it on-premises, and eliminate data transmission concerns.
Cost optimization at scale: The math is simple. If inference costs are a significant expense, and a smaller model can handle 80% of your traffic, the savings justify the distillation effort.
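Here is that math as a back-of-the-envelope sketch. The per-token prices and traffic volume below are entirely hypothetical placeholders—substitute your own numbers:

```python
# Hypothetical figures; replace with your actual prices and volumes
teacher_cost_per_1k_tokens = 0.010   # USD
student_cost_per_1k_tokens = 0.001   # USD (~1/10th, per the scaling rule of thumb above)
monthly_tokens = 2_000_000_000       # 2B tokens/month
student_share = 0.80                 # fraction of traffic the student can handle

baseline = monthly_tokens / 1000 * teacher_cost_per_1k_tokens
hybrid = (monthly_tokens / 1000) * (
    student_share * student_cost_per_1k_tokens
    + (1 - student_share) * teacher_cost_per_1k_tokens
)

print(f"Teacher only: ${baseline:,.0f}/month")
print(f"Hybrid:       ${hybrid:,.0f}/month ({1 - hybrid / baseline:.0%} savings)")
# Teacher only: $20,000/month
# Hybrid:       $5,600/month (72% savings)
```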
Bounded tasks: Classification, extraction, formatting, summarization—tasks with clear inputs and outputs distill well.
Tasks requiring emergent capabilities: Complex reasoning, creative generation, and tasks that large models barely handle won't survive distillation.
Rapidly changing requirements: If your task evolves frequently, maintaining a distilled model becomes expensive.
Low volume: If you're making dozens of calls per day, just use the API. Distillation effort isn't justified.
Quality-critical applications: When you need the absolute best output and cost isn't a constraint, use the largest model available.
Distillation doesn't end when training finishes. Production deployment introduces additional considerations.
Before deploying, rigorously compare:
Accuracy metrics: task accuracy on a held-out test set, agreement with the teacher's outputs, and error rates on known edge cases.
Performance metrics: latency (median and tail), throughput, and memory footprint on your target hardware.
Cost metrics: cost per request at your expected volume, including hosting and maintenance overhead, not just raw inference.
Establish quality thresholds before deployment. A distilled model that's 90% as accurate but 10x cheaper might be perfect—or completely unacceptable—depending on your use case.
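As a minimal sketch of the accuracy comparison, assuming you already have parallel lists of `student_outputs`, `teacher_outputs`, and gold `labels` for the same held-out inputs (all hypothetical names):

```python
def compare_models(student_outputs, teacher_outputs, labels):
    n = len(labels)
    student_accuracy = sum(s == y for s, y in zip(student_outputs, labels)) / n
    teacher_accuracy = sum(t == y for t, y in zip(teacher_outputs, labels)) / n
    agreement = sum(s == t for s, t in zip(student_outputs, teacher_outputs)) / n
    return {
        "student_accuracy": student_accuracy,
        "teacher_accuracy": teacher_accuracy,
        "student_teacher_agreement": agreement,
        # Relative accuracy is a useful single number for a go/no-go threshold
        "relative_accuracy": student_accuracy / teacher_accuracy,
    }
```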
Distilled models will fail on some inputs. Plan for this:
Confidence thresholds: If the distilled model's confidence is below a threshold, escalate to the full model.
```python
def smart_inference(input_text, confidence_threshold=0.85):
    # Try the distilled model first
    distilled_result = distilled_model.predict(input_text)
    if distilled_result.confidence >= confidence_threshold:
        return distilled_result.output

    # Fall back to the teacher model
    return teacher_model.predict(input_text)
```

Error detection: Monitor for patterns that indicate the distilled model is struggling, and fall back automatically when error rates spike.
Hybrid architectures: Route easy requests to the distilled model, hard requests to the teacher. Over time, the easy/hard boundary often shifts as you improve the distilled model.
Drift detection: The world changes. Monitor whether your distilled model's performance degrades over time (a simple monitoring sketch follows below).
Periodic retraining: Plan to regenerate training data and retrain periodically. The teacher model improves; your distilled model should too.
A/B testing: Before fully switching to a distilled model, run it in parallel with the teacher. Measure real-world performance, not just benchmarks.
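One lightweight way to implement the drift detection described above: periodically sample live traffic, run the sample through both models, and track agreement over time. A sketch, reusing the hypothetical `distilled_model` and `teacher_model` objects from the fallback example (the sample size, threshold, and `trigger_retraining_alert` helper are all illustrative):

```python
import random

def sample_agreement(recent_inputs, sample_size=100, alert_threshold=0.90):
    # Spot-check a random sample of recent production inputs against the teacher
    sample = random.sample(recent_inputs, min(sample_size, len(recent_inputs)))
    matches = sum(
        distilled_model.predict(x).output == teacher_model.predict(x)
        for x in sample
    )
    agreement = matches / len(sample)
    if agreement < alert_threshold:
        # Falling agreement is an early signal of drift: time to regenerate
        # training data and retrain the student
        trigger_retraining_alert(agreement)  # hypothetical alerting hook
    return agreement
```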
Ready to try distillation? Here's a practical starting path:
Step 1: Define your task precisely
What inputs, what outputs, what quality bar? The tighter the task definition, the better distillation works.
Step 2: Collect diverse inputs
Gather 1,000-10,000 inputs that represent your production traffic. Include edge cases.
Step 3: Generate teacher outputs
Run your inputs through the best available model. Save the outputs.
Step 4: Choose a student architecture
Step 5: Fine-tune and evaluate
Train the student on your synthetic dataset. Evaluate against held-out test data.
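As a small sketch of the data preparation for this step, assuming the `distillation_data.jsonl` file generated earlier: hold out a test split and write the training split in the chat-style format that many fine-tuning APIs expect (the 90/10 split and file names are arbitrary choices):

```python
import json
import random

with open("distillation_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
split = int(len(examples) * 0.9)
train, test = examples[:split], examples[split:]

# Convert to a chat-style format commonly expected by fine-tuning APIs
with open("train_chat_format.jsonl", "w") as f:
    for ex in train:
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": ex["input"]},
                {"role": "assistant", "content": ex["output"]},
            ]
        }) + "\n")

# Keep the held-out test set for evaluation
with open("test_set.jsonl", "w") as f:
    for ex in test:
        f.write(json.dumps(ex) + "\n")
```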
Step 6: Deploy with fallback
Start with a hybrid architecture. Route confident predictions to the distilled model, uncertain ones to the teacher.
Step 7: Iterate
Analyze failures, generate more training data for weak spots, retrain.
Distillation is one technique in the broader toolkit for making AI systems practical. This series continues in Parts 3 and 4.
The AI engineering discipline is young, but patterns are emerging. Distillation is one of those patterns: a reliable technique for trading capability for efficiency in a controlled way. Master it, and you'll have another tool for building AI systems that work in the real world.
This article is a live example of the AI-enabled content workflow we build for clients.
| Stage | Who | What |
|---|---|---|
| Research | Claude Opus 4.5 | Analyzed current industry data, studies, and expert sources |
| Curation | Tom Hundley | Directed focus, validated relevance, ensured strategic alignment |
| Drafting | Claude Opus 4.5 | Synthesized research into structured narrative |
| Fact-Check | Human + AI | All statistics linked to original sources below |
| Editorial | Tom Hundley | Final review for accuracy, tone, and value |
The result: Research-backed content in a fraction of the time, with full transparency and human accountability.
We're an AI enablement company. It would be strange if we didn't use AI to create content. But more importantly, we believe the future of professional content isn't AI vs. Human—it's AI amplifying human expertise.
Every article we publish demonstrates the same workflow we help clients implement: AI handles the heavy lifting of research and drafting, humans provide direction, judgment, and accountability.
Want to build this capability for your team? Let's talk about AI enablement →