Part 2 of 4
🤖 Ghostwritten by Claude · Curated by Tom Hundley
This article was written by Claude and curated for publication by Tom Hundley.
The AI industry has a size problem. Model distillation offers a solution: take what a large model knows and compress it into something smaller, faster, and cheaper to run.
The premise is counterintuitive. We spend billions training massive models, then immediately try to make them smaller? Yes—because training and deployment have different constraints. A 70B parameter model might be perfect for generating training data, but you can't run it on a customer's phone. You can't serve it economically at scale. You can't deploy it where data privacy requires local processing.
Distillation bridges that gap. This guide explains how it works, when to use it, and how to implement it in practice.
For years, the AI narrative was simple: bigger is better. More parameters meant more capability. GPT-3 had 175B parameters because that's what it took to achieve emergent behaviors that smaller models couldn't match.
That narrative is shifting. Recent developments show that smaller models, properly trained, can match or exceed larger models on specific tasks.
What changed? Better training data, improved architectures, and—critically—distillation techniques that transfer knowledge from large models to small ones.
Latency: A 7B model responds 10-20x faster than a 70B model on the same hardware. For interactive applications, this is the difference between usable and unusable.
Cost: Inference costs scale roughly linearly with model size. A 7B model costs about 1/10th what a 70B model costs per token. At scale, this compounds into significant savings.
Deployment flexibility: Smaller models can run on consumer laptops, mobile devices, edge hardware, and on-premises servers—places a 70B model simply can't go.
Energy efficiency: Smaller models use less power. For organizations with sustainability goals or those operating in power-constrained environments, this matters.
The question isn't whether to use smaller models—it's how to make them good enough for your use case.
Distillation is the process of training a small model (the student) to mimic a large model (the teacher). The insight is that the teacher's outputs contain more information than raw labels.
Imagine you're teaching someone to classify images. Traditional training says: "This is a cat. This is a dog. This is a cat." Hard labels, no nuance.
A teacher model does something different. It says: "This is 95% cat, 3% dog, 2% tiger. The pose is unusual, which is why there's some dog probability—dogs often sit this way."
This richer signal—called soft labels—gives the student more to learn from. The student learns not just the right answer, but why it's right and what alternatives were considered.
Soft labels become more informative when you adjust the temperature of the teacher's output. At normal temperature (T=1), a confident model might output something like 95% cat, 3% dog, 2% tiger. At higher temperature (T=4), the same prediction softens, spreading probability mass across the plausible alternatives.
The higher-temperature distribution reveals relationships the model learned during training. The student can learn from these relationships even though they're hidden in the low-temperature outputs.
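To make the effect concrete, here is a minimal sketch of temperature scaling, assuming the 95% / 3% / 2% distribution from the example above (the logits are illustrative, not from a real model):

```python
import torch
import torch.nn.functional as F

# Logits that produce roughly 95% cat, 3% dog, 2% tiger at T=1
# (illustrative values, not from a real model)
logits = torch.log(torch.tensor([0.95, 0.03, 0.02]))

for temperature in (1.0, 4.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 2) for p in probs.tolist()]}")

# Output (approximately):
# T=1.0: [0.95, 0.03, 0.02]
# T=4.0: [0.55, 0.23, 0.21]
```

The relationship between cat, dog, and tiger that was nearly invisible at T=1 becomes a learnable signal at T=4.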
Distillation transfers task-specific behavior well: classification, extraction, formatting, summarization—anything with clear inputs and outputs.
Distillation struggles to transfer broad general knowledge and emergent capabilities: complex reasoning, open-ended creative generation, and tasks the teacher itself barely handles.
Understanding these limits is crucial. You can't distill GPT-4's full capabilities into a 1B model. But you can distill its behavior on specific, bounded tasks.
There are several ways to implement distillation in practice. The right approach depends on your resources, use case, and the nature of the task.
The simplest and most popular approach: use the large model to generate training data, then train a small model on that data.
Process: collect a diverse set of inputs that represents your production traffic, run each one through the teacher model and save its output, then fine-tune the student on the resulting input/output pairs.
Example workflow:
```python
from openai import OpenAI
import json

client = OpenAI()

def generate_training_pair(input_text):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Extract key entities from the text as JSON."
        }, {
            "role": "user",
            "content": input_text
        }]
    )
    return {
        "input": input_text,
        "output": response.choices[0].message.content
    }

# Generate 10,000 training examples
# (input_texts: your collection of representative inputs, gathered elsewhere)
training_data = [generate_training_pair(text) for text in input_texts]

# Save for fine-tuning
with open("distillation_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
```

Advantages: it's simple to implement, it works with API-only teachers (no access to model internals required), and it reuses standard fine-tuning tooling.
Disadvantages: generating thousands of teacher outputs costs time and money, the student sees only the teacher's final outputs rather than its full probability distributions, and any teacher mistakes are baked into the training data.
If you have access to the teacher's logits (probability distributions), you can train the student to match those distributions directly.
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both distributions
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)

    # KL divergence between distributions
    loss = F.kl_div(
        soft_student,
        soft_teacher,
        reduction="batchmean"
    ) * (temperature ** 2)
    return loss
```

Advantages: the softened distributions carry far more signal per example than hard labels, so the student typically learns more from less data.
Disadvantages: it requires direct access to the teacher's logits, which rules out API-only models, and it adds complexity to the training loop.
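When you do have logits, the distillation loss is usually blended with ordinary cross-entropy on the hard labels rather than used alone. A minimal sketch of a combined training step, assuming `student` and `teacher` are callable models that return logits and `inputs`/`labels` form a training batch (all hypothetical names):

```python
import torch
import torch.nn.functional as F

def train_step(student, teacher, inputs, labels, optimizer,
               temperature=4.0, alpha=0.5):
    # Teacher runs in inference mode; only the student is updated
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    student_logits = student(inputs)

    # Blend soft-label (distillation) loss with hard-label cross-entropy
    soft_loss = distillation_loss(student_logits, teacher_logits, temperature)
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `alpha` weight controls how much the student trusts the teacher's soft labels versus the ground-truth labels; a value around 0.5 is a reasonable starting point to tune from.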
For reasoning tasks, distill not just the answer but the reasoning process.
Process: prompt the teacher to reason step by step, capture the full reasoning along with the final answer, and train the student on that complete output.
```python
system_prompt = """Solve this problem step by step.
Show your reasoning, then give the final answer."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "system",
        "content": system_prompt
    }, {
        "role": "user",
        "content": problem_text
    }]
)

# Training pair includes full reasoning
training_example = {
    "input": problem_text,
    "output": response.choices[0].message.content  # Includes reasoning
}
```

This approach is particularly effective for math, coding, and logical reasoning tasks. The student learns not just what to answer, but how to think.
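For evaluation you usually need to separate the final answer from the reasoning. A small parsing sketch, assuming the prompt asks the teacher to end with a line beginning "Final answer:" (an assumed convention, not something the API enforces):

```python
def extract_final_answer(output_text):
    # Look for the last line that starts with the assumed "Final answer:" marker
    for line in reversed(output_text.strip().splitlines()):
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return None  # No marker found; flag for manual review
```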
Instead of trying to replicate general capabilities, focus on a single task or narrow domain.
Example: Sentiment classification
Rather than distilling a general-purpose model, create a dedicated sentiment classifier: prompt the teacher to return only a label (positive, negative, or neutral), generate labels across your real traffic, and fine-tune a small model on those pairs (a labeling sketch follows below).
Task-specific distillation consistently outperforms general distillation because the student only has to cover a narrow slice of the teacher's behavior, the training data can concentrate on that slice, and quality is much easier to measure.
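A minimal sketch of the teacher-labeling step, reusing the `client` from the earlier examples and assuming a fixed three-way label set (the prompt wording and `ALLOWED_LABELS` set are illustrative choices):

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def teacher_sentiment_label(text):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Classify the sentiment of the user's text. Reply with exactly one word: positive, negative, or neutral."
        }, {
            "role": "user",
            "content": text
        }]
    )
    label = response.choices[0].message.content.strip().lower()
    # Discard anything outside the allowed label set rather than training on noise
    return label if label in ALLOWED_LABELS else None
```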
Distillation makes sense in specific circumstances. Here's a decision framework:
High-volume, low-latency requirements: If you're making millions of API calls or need sub-second response times, distillation can drop costs by 90% while maintaining quality.
Edge deployment: When models need to run on devices without internet access or with limited compute, distillation is often the only path forward.
Privacy-sensitive applications: Local models keep data local. Distill a capable model, deploy it on-premises, and eliminate data transmission concerns.
Cost optimization at scale: The math is simple. If inference costs are a significant expense, and a smaller model can handle 80% of your traffic, the savings justify the distillation effort.
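Here is that math as a back-of-the-envelope sketch. The per-token prices and traffic volume below are entirely hypothetical placeholders—substitute your own numbers:

```python
# Hypothetical figures; replace with your actual prices and volumes
teacher_cost_per_1k_tokens = 0.010   # USD
student_cost_per_1k_tokens = 0.001   # USD (~1/10th, per the scaling rule of thumb above)
monthly_tokens = 2_000_000_000       # 2B tokens/month
student_share = 0.80                 # fraction of traffic the student can handle

baseline = monthly_tokens / 1000 * teacher_cost_per_1k_tokens
hybrid = (monthly_tokens / 1000) * (
    student_share * student_cost_per_1k_tokens
    + (1 - student_share) * teacher_cost_per_1k_tokens
)

print(f"Teacher only: ${baseline:,.0f}/month")
print(f"Hybrid:       ${hybrid:,.0f}/month ({1 - hybrid / baseline:.0%} savings)")
# Teacher only: $20,000/month
# Hybrid:       $5,600/month (72% savings)
```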
Bounded tasks: Classification, extraction, formatting, summarization—tasks with clear inputs and outputs distill well.
Tasks requiring emergent capabilities: Complex reasoning, creative generation, and tasks that large models barely handle won't survive distillation.
Rapidly changing requirements: If your task evolves frequently, maintaining a distilled model becomes expensive.
Low volume: If you're making dozens of calls per day, just use the API. Distillation effort isn't justified.
Quality-critical applications: When you need the absolute best output and cost isn't a constraint, use the largest model available.
Distillation doesn't end when training finishes. Production deployment introduces additional considerations.
Before deploying, rigorously compare:
Accuracy metrics: task accuracy on a held-out test set, agreement with the teacher's outputs, and error rates on known edge cases.
Performance metrics: latency (median and tail), throughput, and memory footprint on your target hardware.
Cost metrics: cost per request at your expected volume, including hosting and maintenance overhead, not just raw inference.
Establish quality thresholds before deployment. A distilled model that's 90% as accurate but 10x cheaper might be perfect—or completely unacceptable—depending on your use case.
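As a minimal sketch of the accuracy comparison, assuming you already have parallel lists of `student_outputs`, `teacher_outputs`, and gold `labels` for the same held-out inputs (all hypothetical names):

```python
def compare_models(student_outputs, teacher_outputs, labels):
    n = len(labels)
    student_accuracy = sum(s == y for s, y in zip(student_outputs, labels)) / n
    teacher_accuracy = sum(t == y for t, y in zip(teacher_outputs, labels)) / n
    agreement = sum(s == t for s, t in zip(student_outputs, teacher_outputs)) / n
    return {
        "student_accuracy": student_accuracy,
        "teacher_accuracy": teacher_accuracy,
        "student_teacher_agreement": agreement,
        # Relative accuracy is a useful single number for a go/no-go threshold
        "relative_accuracy": student_accuracy / teacher_accuracy,
    }
```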
Distilled models will fail on some inputs. Plan for this:
Confidence thresholds: If the distilled model's confidence is below a threshold, escalate to the full model.
```python
def smart_inference(input_text, confidence_threshold=0.85):
    # Try the distilled model first
    distilled_result = distilled_model.predict(input_text)
    if distilled_result.confidence >= confidence_threshold:
        return distilled_result.output

    # Fall back to the teacher model
    return teacher_model.predict(input_text)
```

Error detection: Monitor for patterns that indicate the distilled model is struggling, and fall back automatically when error rates spike.
Hybrid architectures: Route easy requests to the distilled model, hard requests to the teacher. Over time, the easy/hard boundary often shifts as you improve the distilled model.
Drift detection: The world changes. Monitor whether your distilled model's performance degrades over time (a simple monitoring sketch follows below).
Periodic retraining: Plan to regenerate training data and retrain periodically. The teacher model improves; your distilled model should too.
A/B testing: Before fully switching to a distilled model, run it in parallel with the teacher. Measure real-world performance, not just benchmarks.
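One lightweight way to implement the drift detection described above: periodically sample live traffic, run the sample through both models, and track agreement over time. A sketch, reusing the hypothetical `distilled_model` and `teacher_model` objects from the fallback example (the sample size, threshold, and `trigger_retraining_alert` helper are all illustrative):

```python
import random

def sample_agreement(recent_inputs, sample_size=100, alert_threshold=0.90):
    # Spot-check a random sample of recent production inputs against the teacher
    sample = random.sample(recent_inputs, min(sample_size, len(recent_inputs)))
    matches = sum(
        distilled_model.predict(x).output == teacher_model.predict(x)
        for x in sample
    )
    agreement = matches / len(sample)
    if agreement < alert_threshold:
        # Falling agreement is an early signal of drift: time to regenerate
        # training data and retrain the student
        trigger_retraining_alert(agreement)  # hypothetical alerting hook
    return agreement
```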
Ready to try distillation? Here's a practical starting path:
Step 1: Define your task precisely
What inputs, what outputs, what quality bar? The tighter the task definition, the better distillation works.
Step 2: Collect diverse inputs
Gather 1,000-10,000 inputs that represent your production traffic. Include edge cases.
Step 3: Generate teacher outputs
Run your inputs through the best available model. Save the outputs.
Step 4: Choose a student architecture
Step 5: Fine-tune and evaluate
Train the student on your synthetic dataset. Evaluate against held-out test data.
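As a small sketch of the data preparation for this step, assuming the `distillation_data.jsonl` file generated earlier: hold out a test split and write the training split in the chat-style format that many fine-tuning APIs expect (the 90/10 split and file names are arbitrary choices):

```python
import json
import random

with open("distillation_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
split = int(len(examples) * 0.9)
train, test = examples[:split], examples[split:]

# Convert to a chat-style format commonly expected by fine-tuning APIs
with open("train_chat_format.jsonl", "w") as f:
    for ex in train:
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": ex["input"]},
                {"role": "assistant", "content": ex["output"]},
            ]
        }) + "\n")

# Keep the held-out test set for evaluation
with open("test_set.jsonl", "w") as f:
    for ex in test:
        f.write(json.dumps(ex) + "\n")
```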
Step 6: Deploy with fallback
Start with a hybrid architecture. Route confident predictions to the distilled model, uncertain ones to the teacher.
Step 7: Iterate
Analyze failures, generate more training data for weak spots, retrain.
Distillation is one technique in the broader toolkit for making AI systems practical. This series continues in Parts 3 and 4.
The AI engineering discipline is young, but patterns are emerging. Distillation is one of those patterns: a reliable technique for trading capability for efficiency in a controlled way. Master it, and you'll have another tool for building AI systems that work in the real world.
This article is a live example of the AI-enabled content workflow we build for clients.
| Stage | Who | What |
|---|---|---|
| Research | Claude Opus 4.5 | Analyzed current industry data, studies, and expert sources |
| Curation | Tom Hundley | Directed focus, validated relevance, ensured strategic alignment |
| Drafting | Claude Opus 4.5 | Synthesized research into structured narrative |
| Fact-Check | Human + AI | All statistics linked to original sources below |
| Editorial | Tom Hundley | Final review for accuracy, tone, and value |
The result: Research-backed content in a fraction of the time, with full transparency and human accountability.
We're an AI enablement company. It would be strange if we didn't use AI to create content. But more importantly, we believe the future of professional content isn't AI vs. Human—it's AI amplifying human expertise.
Every article we publish demonstrates the same workflow we help clients implement: AI handles the heavy lifting of research and drafting, humans provide direction, judgment, and accountability.
Want to build this capability for your team? Let's talk about AI enablement →