Fine-Tuning and Distillation for Agents

A well-prompted large model can do almost anything — reason through multi-step plans, call tools in the right order, recover from errors, and produce polished outputs. But "almost anything" comes at a cost: latency measured in seconds, bills measured in dollars per thousand calls, and a general-purpose model spending capacity on tasks a specialist could handle in its sleep. Fine-tuning lets you trade generality for speed, cost, and reliability on a narrow task. Distillation takes it further — you watch a powerful model work, capture its behavior, and train a smaller model to replicate it.

This is how you make the agent cheaper and faster without sacrificing the quality that matters for a specific job.

The Prompt-First Baseline

Before reaching for fine-tuning, you prompt. You write system instructions, few-shot examples, tool schemas, and output format specifications. You iterate on that prompt until the agent handles your target scenarios reliably. This is the right starting point for three reasons:

Speed of iteration. Changing a prompt takes seconds. Changing a fine-tuned model takes hours of data preparation, training, and evaluation. If you can solve the problem with prompting, you should.

Flexibility. A prompted model adapts to new requirements by updating its instructions. A fine-tuned model encodes behavior in its weights — changing that behavior requires retraining.

Observability. With prompting, you can read the instructions and understand why the model behaves a certain way. With fine-tuning, behavior is baked into weights that are not directly inspectable.

But prompting has limits. These are the signs that you have hit them:

Token budget exhaustion. Your few-shot examples, format instructions, and tool schemas consume so much context that you are crowding out the actual task. A fine-tuned model has those patterns in its weights and does not need them spelled out every call.

Latency sensitivity. A frontier model with a 2,000-token system prompt takes 800ms to respond. Your tool-calling loop runs 15 iterations per task. That is 12 seconds of pure model latency. A fine-tuned smaller model responding in 100ms cuts that to under 2 seconds.

Cost at scale. You are making 10 million agent calls per day. Even at a fraction of a cent per call, the bill is significant. If inference cost scales roughly with model size, a model one-tenth the size, fine-tuned to match quality on your specific task, cuts cost by about 90%.

Consistency requirements. Despite detailed prompts, the model occasionally formats outputs wrong, hallucinates tool names, or drifts from your schema. Fine-tuning on correct examples drills the right behavior deeper than instructions alone can achieve.

Behavioral patterns that resist prompting. Some behaviors — a particular reasoning style, a consistent tone, a specific error-recovery pattern — are hard to elicit reliably through instructions alone. If you find yourself adding more and more prompt engineering hacks to stabilize behavior, fine-tuning might encode it more cleanly.
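
The latency and cost figures above are simple multiplications, and it helps to rerun them with your own numbers. A minimal sketch follows; the per-call prices are illustrative assumptions, not quotes from any provider, and the tenfold saving holds only if price scales roughly linearly with model size.

```python
def loop_latency_seconds(per_call_ms: float, iterations: int) -> float:
    """Total model latency for a sequential tool-calling loop."""
    return per_call_ms * iterations / 1000


def daily_cost(calls_per_day: int, cost_per_call: float) -> float:
    """Daily spend in dollars at a flat per-call price."""
    return calls_per_day * cost_per_call


# Figures from the text: 15 iterations at 800 ms vs. 100 ms per call.
large = loop_latency_seconds(800, 15)
small = loop_latency_seconds(100, 15)

# Hypothetical prices: $0.002 per call for the large model,
# one-tenth of that for the small one.
large_bill = daily_cost(10_000_000, 0.002)
small_bill = daily_cost(10_000_000, 0.0002)

print(f"latency per task: {large:.1f}s -> {small:.1f}s")
print(f"daily cost: ${large_bill:,.0f} -> ${small_bill:,.0f}")
```

Note that the latency saving applies to every task, while the cost saving only matters once call volume is high enough to outweigh the fixed cost of training and serving the smaller model.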

┌───────────────────────────────────────────────────────┐
│              When to Move Beyond Prompting            │
│                                                       │
│  Start here:  Prompt engineering + few-shot examples  │
│       │                                               │
│       ▼                                               │
│  Problem?     ─── No ───► Stay with prompting         │
│       │                                               │
│      Yes                                              │
│       │                                               │
│       ▼                                               │
│  ┌─────────────────────────────────────────────┐      │
│  │  Symptom                   │  Solution      │      │
│  ├─────────────────────────────────────────────┤      │
│  │  Context window too full   │  Fine-tune     │      │
│  │  Latency too high          │  Distill       │      │
│  │  Cost too high at scale    │  Distill       │      │
│  │  Output format drift       │  Fine-tune     │      │
│  │  Inconsistent behavior     │  Fine-tune     │      │
│  │  Need offline/edge deploy  │  Distill       │      │
│  └─────────────────────────────────────────────┘      │
│                                                       │
└───────────────────────────────────────────────────────┘

Trace Distillation

One of the most effective fine-tuning techniques for agents is trace distillation: you run a capable teacher model on real tasks, record its complete execution traces (reasoning, tool calls, observations, final outputs), and use those traces as training data for a smaller student model.

The insight is simple — you do not need to hand-author training examples. The teacher model generates them by doing its job. Your role is to filter for quality and format the traces as training data.

Collecting Traces

A trace captures everything the agent did during a task: the input, each reasoning step, each tool call with arguments and results, and the final output. This is the ground truth of "how a good agent handles this task."

import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceStep:
    """A single step in an agent execution trace."""
    step_type: str  # "reasoning", "tool_call", "observation", "output"
    content: str
    tool_name: str | None = None
    tool_args: dict | None = None
    tool_result: Any = None
    timestamp: float = 0.0


@dataclass
class AgentTrace:
    """A complete execution trace for one task."""
    task_id: str
    input_text: str
    steps: list[TraceStep] = field(default_factory=list)
    final_output: str = ""
    success: bool = False
    quality_score: float = 0.0
    model_id: str = ""
    total_tokens: int = 0
    wall_time_seconds: float = 0.0


class TraceCollector:
    """Instrument an agent to collect execution traces."""

    def __init__(self, agent, evaluator):
        self.agent = agent
        self.evaluator = evaluator
        self.traces: list[AgentTrace] = []

    async def collect(self, task: str, task_id: str) -> AgentTrace:
        """Run the agent on a task and capture the full trace."""
        trace = AgentTrace(
            task_id=task_id,
            input_text=task,
            model_id=self.agent.model_id,
        )

        # Wrap the agent's execution to capture each step
        original_step = self.agent.step

        async def instrumented_step(state):
            result = await original_step(state)
            trace.steps.append(TraceStep(
                step_type=result.type,
                content=result.content,
                tool_name=result.tool_name,
                tool_args=result.tool_args,
                tool_result=result.tool_result,
            ))
            return result

        self.agent.step = instrumented_step

        try:
            output = await self.agent.run(task)
            trace.final_output = output
            trace.success = True
        except Exception as e:
            trace.final_output = str(e)
            trace.success = False
        finally:
            self.agent.step = original_step

        # Score the trace quality
        trace.quality_score = await self.evaluator.score(trace)
        self.traces.append(trace)
        return trace

    def get_high_quality_traces(
        self, min_score: float = 0.8
    ) -> list[AgentTrace]:
        """Filter traces suitable for distillation training."""
        return [
            t for t in self.traces
            if t.success and t.quality_score >= min_score
        ]
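
Traces held in a Python list disappear when the process exits, so in practice they are written to disk as they are collected. The sketch below stores one trace per line as JSON (JSONL) using `dataclasses.asdict`. The two dataclasses are trimmed copies of the ones above, and `save_traces` and `load_traces` are hypothetical helper names, not part of any library.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class TraceStep:  # trimmed copy of the class above
    step_type: str
    content: str
    tool_name: Optional[str] = None
    tool_args: Optional[dict] = None


@dataclass
class AgentTrace:  # trimmed copy of the class above
    task_id: str
    input_text: str
    steps: list = field(default_factory=list)
    final_output: str = ""
    success: bool = False
    quality_score: float = 0.0


def save_traces(traces: list, path: Path) -> None:
    """Append traces to a JSONL file, one trace per line."""
    with path.open("a", encoding="utf-8") as f:
        for trace in traces:
            # default=str keeps a non-JSON tool result from aborting the write
            f.write(json.dumps(asdict(trace), default=str) + "\n")


def load_traces(path: Path) -> list:
    """Read traces back, rebuilding the nested TraceStep objects."""
    traces = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            raw = json.loads(line)
            raw["steps"] = [TraceStep(**s) for s in raw["steps"]]
            traces.append(AgentTrace(**raw))
    return traces
```

An append-only file keeps collection cheap; filtering and formatting can then run as separate offline passes over the same file.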

Quality Filtering

Not every trace from the teacher is worth training on. The teacher model makes mistakes, takes inefficient paths, and occasionally hallucinates. You need a quality gate.

Three approaches to quality filtering:

Outcome-based filtering. Did the agent produce the correct final answer? If you have ground truth labels, this is straightforward. Keep traces that led to correct outcomes, discard the rest.

Efficiency filtering. Among traces that succeeded, prefer shorter ones. A trace that solved a task in 3 tool calls is better training data than one that wandered for 12 calls before arriving at the same answer.

Human review. For high-stakes domains, a human reviews a sample of traces and labels them as good or bad training examples. Expensive, but catches subtle quality issues that automated metrics miss.

class TraceFilter:
    """Filter and rank traces for training data quality."""

    def __init__(self, model):
        self.model = model

    async def filter_batch(
        self, traces: list[AgentTrace], ground_truth: dict[str, str]
    ) -> list[AgentTrace]:
        """Apply multiple quality filters to a batch of traces."""
        filtered = []

        for trace in traces:
            # Gate 1: Must have succeeded
            if not trace.success:
                continue

            # Gate 2: Outcome correctness (if ground truth available)
            if trace.task_id in ground_truth:
                if not self._output_matches(
                    trace.final_output, ground_truth[trace.task_id]
                ):
                    continue

            # Gate 3: Efficiency — reject traces that are too long
            max_steps = self._expected_steps(trace.input_text) * 2
            if len(trace.steps) > max_steps:
                continue

            # Gate 4: No hallucinated tool calls
            if self._has_invalid_tool_calls(trace):
                continue

            filtered.append(trace)

        # Rank by quality score, take the best
        filtered.sort(key=lambda t: t.quality_score, reverse=True)
        return filtered

    def _output_matches(self, output: str, expected: str) -> bool:
        """Fuzzy match between agent output and ground truth."""
        # Normalize and compare — exact match is too strict
        output_norm = output.strip().lower()
        expected_norm = expected.strip().lower()
        # Use semantic similarity or structured comparison
        return expected_norm in output_norm or self._semantic_match(
            output_norm, expected_norm
        )

    def _has_invalid_tool_calls(self, trace: AgentTrace) -> bool:
        """Check for tool calls with hallucinated names or schemas."""
        valid_tools = {"search", "read_file", "execute_code", "query_db"}
        for step in trace.steps:
            if step.step_type == "tool_call":
                if step.tool_name not in valid_tools:
                    return True
        return False
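
The filter above calls `_semantic_match` and `_expected_steps` without defining them. As a placeholder for the former, the sketch below uses the standard library's `difflib.SequenceMatcher` to compute a character-level similarity ratio. This is lexical rather than semantic matching, the function name `lexical_match` is hypothetical, and the 0.85 threshold is an arbitrary assumption to tune against labeled examples; an embedding-based or structured comparison would be a closer fit for real use.

```python
from difflib import SequenceMatcher


def lexical_match(output: str, expected: str, threshold: float = 0.85) -> bool:
    """Rough stand-in for a semantic matcher: normalized string similarity."""
    # Lowercase and collapse whitespace so formatting differences do not count
    a = " ".join(output.lower().split())
    b = " ".join(expected.lower().split())
    if not a or not b:
        return a == b
    return SequenceMatcher(None, a, b).ratio() >= threshold
```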

Formatting Traces as Training Data

Once you have high-quality traces, you format them as supervised fine-tuning examples. The standard approach is to convert each trace into a multi-turn conversation where the model's "responses" include both reasoning and tool calls.

class TraceFormatter:
    """Convert agent traces into fine-tuning training examples."""

    def format_for_sft(self, trace: AgentTrace) -> list[dict]:
        """Format a trace as a supervised fine-tuning example.

        Returns a conversation in the standard messages format.
        """
        messages = []

        # System message establishes the agent role
        messages.append({
            "role": "system",
            "content": self._get_system_prompt(),
        })

        # User message is the task
        messages.append({
            "role": "user",
            "content": trace.input_text,
        })

        # A reasoning step followed by a tool_call becomes one assistant
        # turn; a tool_call with no preceding reasoning becomes an assistant
        # turn with empty content. Each observation becomes a tool response.
        # Note: some chat formats also require an "id" on each tool call and
        # a matching "tool_call_id" on the tool message.
        steps = trace.steps
        i = 0
        while i < len(steps):
            step = steps[i]
            reasoning = ""

            if step.step_type == "reasoning":
                reasoning = step.content
                # Look ahead for a tool call to fold into this turn
                if (i + 1 < len(steps)
                        and steps[i + 1].step_type == "tool_call"):
                    i += 1
                    step = steps[i]
                else:
                    messages.append({
                        "role": "assistant",
                        "content": reasoning,
                    })

            if step.step_type == "tool_call":
                messages.append({
                    "role": "assistant",
                    "content": reasoning,
                    "tool_calls": [{
                        "function": {
                            "name": step.tool_name,
                            "arguments": json.dumps(step.tool_args or {}),
                        }
                    }],
                })
                # Add the observation, if any, as the tool response
                if (i + 1 < len(steps)
                        and steps[i + 1].step_type == "observation"):
                    i += 1
                    messages.append({
                        "role": "tool",
                        "content": steps[i].content,
                    })

            elif step.step_type == "output":
                messages.append({
                    "role": "assistant",
                    "content": step.content,
                })

            i += 1

        return messages

    def format_batch(
        self, traces: list[AgentTrace]
    ) -> list[list[dict]]:
        """Format a batch of traces for training."""
        return [self.format_for_sft(t) for t in traces]

The result is a dataset of conversations where the student model learns: given this task and these tool results, produce this reasoning and these tool calls. It learns the teacher's decision-making pattern in addition to its final outputs.
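
A malformed training example teaches malformed behavior, so it is worth checking each formatted conversation structurally before it reaches the trainer. The sketch below validates the message shape produced above. The specific rules (system and user messages first, a tool message only directly after an assistant tool call, a plain assistant message last) are assumptions about what a typical training framework expects, and `validate_conversation` is a hypothetical helper.

```python
import json


def validate_conversation(messages: list) -> list:
    """Return a list of structural problems; an empty list means it passed."""
    problems = []
    roles = [m.get("role") for m in messages]
    if roles[:2] != ["system", "user"]:
        problems.append("conversation must start with system and user messages")

    for i, msg in enumerate(messages):
        if msg.get("role") == "tool":
            prev = messages[i - 1] if i > 0 else {}
            if not prev.get("tool_calls"):
                problems.append(f"message {i}: tool result without a preceding tool call")
        for call in msg.get("tool_calls") or []:
            try:
                json.loads(call["function"]["arguments"])
            except (KeyError, TypeError, ValueError):
                problems.append(f"message {i}: tool call arguments are not valid JSON")

    last = messages[-1] if messages else {}
    if last.get("role") != "assistant" or last.get("tool_calls"):
        problems.append("conversation must end with a plain assistant message")
    return problems
```

Dropping or repairing the conversations that fail these checks is usually cheaper than debugging a student that has learned to emit orphaned tool results.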

Reward Modeling for Agent Behavior

Supervised fine-tuning teaches the student to imitate the teacher. But imitation has a ceiling — if the teacher occasionally makes suboptimal decisions, the student faithfully replicates those too. Reward modeling goes a step further: instead of training the model on "what the teacher did," you train it on "what is good."

A reward model scores agent trajectories (complete sequences of reasoning and actions), assigning higher scores to better behavior. You then optimize the agent policy against that signal, either online with reinforcement learning (typically PPO) or offline with a preference method such as DPO, which learns from pairs of trajectories ranked by those scores.

Building an Agent Reward Model

The reward model needs to evaluate agent-specific qualities: Did the agent pick the right tool? Did it reason correctly before acting? Did it recover gracefully from errors? Did it ask for clarification when appropriate? Did it avoid unnecessary steps?

@dataclass
class TrajectoryScore:
    """Multi-dimensional scoring of an agent trajectory."""
    correctness: float = 0.0      # Did it produce the right answer?
    efficiency: float = 0.0       # Did it minimize unnecessary steps?
    tool_selection: float = 0.0   # Did it pick appropriate tools?
    safety: float = 0.0           # Did it avoid unsafe actions?
    reasoning_quality: float = 0.0  # Was its reasoning coherent?
    overall: float = 0.0

    def compute_overall(self, weights: dict[str, float] | None = None):
        if weights is None:
            weights = {
                "correctness": 0.4,
                "efficiency": 0.2,
                "tool_selection": 0.2,
                "safety": 0.1,
                "reasoning_quality": 0.1,
            }
        self.overall = (
            weights["correctness"] * self.correctness
            + weights["efficiency"] * self.efficiency
            + weights["tool_selection"] * self.tool_selection
            + weights["safety"] * self.safety
            + weights["reasoning_quality"] * self.reasoning_quality
        )


class AgentRewardModel:
    """Score agent trajectories for reinforcement learning."""

    def __init__(self, scoring_model, reference_traces: list[AgentTrace]):
        self.scoring_model = scoring_model
        self.reference_traces = reference_traces

    async def score_trajectory(
        self, trace: AgentTrace
    ) -> TrajectoryScore:
        """Score a complete agent trajectory on multiple dimensions."""
        score = TrajectoryScore()

        score.correctness = await self._score_correctness(trace)
        score.efficiency = self._score_efficiency(trace)
        score.tool_selection = await self._score_tool_selection(trace)
        score.safety = self._score_safety(trace)
        score.reasoning_quality = await self._score_reasoning(trace)
        score.compute_overall()

        return score

    def _score_efficiency(self, trace: AgentTrace) -> float:
        """Score based on how concisely the agent solved the task."""
        step_count = len(trace.steps)

        # Compare to reference traces for similar tasks
        similar_refs = self._find_similar_references(trace.input_text)
        if not similar_refs:
            # No reference — use absolute heuristic
            if step_count <= 3:
                return 1.0
            elif step_count <= 7:
                return 0.8
            elif step_count <= 15:
                return 0.5
            else:
                return 0.2

        # Score relative to reference efficiency
        avg_ref_steps = sum(
            len(r.steps) for r in similar_refs
        ) / len(similar_refs)
        ratio = avg_ref_steps / max(step_count, 1)
        return min(ratio, 1.0)

    def _score_safety(self, trace: AgentTrace) -> float:
        """Check for unsafe actions in the trajectory."""
        # Coarse substring blocklist, matched case-insensitively so that
        # "drop table" is caught as well as "DROP TABLE".
        unsafe_patterns = [
            "rm -rf", "drop table", "delete from",
            "sudo", "chmod 777",
        ]
        for step in trace.steps:
            if step.step_type == "tool_call" and step.tool_args:
                args_str = json.dumps(step.tool_args).lower()
                for pattern in unsafe_patterns:
                    if pattern in args_str:
                        return 0.0  # Any unsafe action = zero safety score
        return 1.0

    async def generate_preference_pairs(
        self, traces: list[AgentTrace]
    ) -> list[tuple[AgentTrace, AgentTrace]]:
        """Generate (chosen, rejected) pairs for DPO training."""
        pairs = []
        # Group traces by task
        by_task: dict[str, list[AgentTrace]] = {}
        for trace in traces:
            by_task.setdefault(trace.task_id, []).append(trace)

        for task_id, task_traces in by_task.items():
            if len(task_traces) < 2:
                continue
            # Score all traces for this task
            scored = []
            for t in task_traces:
                s = await self.score_trajectory(t)
                scored.append((t, s.overall))

            scored.sort(key=lambda x: x[1], reverse=True)

            # Pair best with worst for maximum signal
            best = scored[0][0]
            worst = scored[-1][0]
            if scored[0][1] - scored[-1][1] > 0.2:  # Meaningful gap
                pairs.append((best, worst))

        return pairs
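
One consequence of the default weights above is worth checking by hand: safety carries a weight of 0.1, so a trajectory that scores 0.0 on safety but 1.0 everywhere else still gets 0.9 overall and could be picked as the "chosen" side of a preference pair. The sketch below contrasts the plain weighted sum with a variant that treats safety as a hard gate. The gate and the `safety_floor` parameter are assumptions added for illustration, not part of the class above.

```python
DEFAULT_WEIGHTS = {
    "correctness": 0.4,
    "efficiency": 0.2,
    "tool_selection": 0.2,
    "safety": 0.1,
    "reasoning_quality": 0.1,
}


def weighted_overall(scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Plain weighted sum, as in compute_overall above."""
    return sum(weights[name] * scores[name] for name in weights)


def gated_overall(scores: dict, weights: dict = DEFAULT_WEIGHTS,
                  safety_floor: float = 1.0) -> float:
    """Weighted sum, forced to 0.0 when safety falls below the floor."""
    if scores["safety"] < safety_floor:
        return 0.0
    return weighted_overall(scores, weights)


unsafe_but_correct = {
    "correctness": 1.0,
    "efficiency": 1.0,
    "tool_selection": 1.0,
    "safety": 0.0,
    "reasoning_quality": 1.0,
}
print(round(weighted_overall(unsafe_but_correct), 2))  # 0.9
print(gated_overall(unsafe_but_correct))               # 0.0
```

With the gate in place, an unsafe trajectory can only ever appear on the "rejected" side of a pair.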

DPO vs. PPO for Agents

Two dominant approaches to training with reward signals:

Proximal Policy Optimization (PPO), as used in RLHF-style pipelines, relies on a separately trained reward model and uses it online: the agent generates trajectories, the reward model scores them, and the policy is updated to produce higher-scoring trajectories. Powerful but complex to implement and stabilize.

Direct Preference Optimization (DPO) sidesteps the explicit reward model. Instead, you provide pairs of trajectories — one preferred, one rejected — and the model learns to increase the probability of preferred behavior relative to rejected behavior. Simpler to implement, more stable to train, and increasingly the default choice.

For agent fine-tuning, DPO is often the practical choice because:

  • You can build trajectory pairs from trace collection by running each task more than once (good runs vs. bad runs on the same task).
  • You do not need a learned reward model in the training loop; a scorer is only needed offline, to rank trajectories into pairs.
  • Training is more stable; PPO for multi-step agent trajectories is notoriously tricky to get right.

class DPOTrainingData:
    """Prepare training data for Direct Preference Optimization."""

    def __init__(self, formatter: TraceFormatter):
        self.formatter = formatter

    def prepare_pairs(
        self, preference_pairs: list[tuple[AgentTrace, AgentTrace]]
    ) -> list[dict]:
        """Convert preference pairs into DPO training format."""
        training_examples = []

        for chosen_trace, rejected_trace in preference_pairs:
            chosen_messages = self.formatter.format_for_sft(chosen_trace)
            rejected_messages = self.formatter.format_for_sft(rejected_trace)

            training_examples.append({
                "prompt": chosen_messages[:2],  # system + user
                "chosen": chosen_messages[2:],   # chosen trajectory
                "rejected": rejected_messages[2:],  # rejected trajectory
            })

        return training_examples
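
For intuition about what a DPO trainer does with each pair: the loss for one pair is the negative log-sigmoid of a scaled margin, where the margin compares how much the model being trained has raised the log-probability of the chosen trajectory, relative to a frozen reference copy, against how much it has raised the rejected one. The sketch below computes that loss from four sequence log-probabilities in pure Python. In practice a training library derives these from token logits, typically summing over assistant tokens only; `beta=0.1` is a commonly used value but an assumption here.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen trajectory and away from the rejected one.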

The Distillation Pipeline

Putting it all together, the distillation pipeline is a cycle:

┌───────────────────────────────────────────────────────────┐
│                   Distillation Pipeline                   │
│                                                           │
│  ┌────────────┐    ┌───────────┐    ┌───────────┐         │
│  │  Teacher   │───►│  Collect  │───►│  Filter   │         │
│  │  (large    │    │  traces   │    │  for      │         │
│  │   model)   │    │  on real  │    │  quality  │         │
│  └────────────┘    │  tasks    │    └─────┬─────┘         │
│                    └───────────┘          │               │
│                                          ▼                │
│  ┌────────────┐    ┌───────────┐    ┌───────────┐         │
│  │  Deploy    │◄───│  Train    │◄───│  Format   │         │
│  │  student   │    │  student  │    │  training │         │
│  │  model     │    │  (SFT +   │    │  data     │         │
│  └─────┬──────┘    │   DPO)    │    └───────────┘         │
│        │           └───────────┘                          │
│        │                                                  │
│        ▼                                                  │
│  ┌────────────┐    ┌───────────┐                          │
│  │  Evaluate  │───►│  Meets    │─── Yes ──► Production    │
│  │  student   │    │  quality  │                          │
│  │  on held-  │    │  bar?     │─── No ───► More traces,  │
│  │  out tasks │    └───────────┘            retrain       │
│  └────────────┘                                           │
│                                                           │
└───────────────────────────────────────────────────────────┘

Each step in detail:

1. Run the teacher on production traffic. The large model handles real tasks — the same tasks you want the student to eventually handle. You instrument it to capture complete traces.

2. Filter traces by quality. Remove failures, inefficient paths, and edge cases. Keep the cleanest examples of correct behavior.

3. Format as training data. Convert traces into the supervised fine-tuning format your training framework expects (messages with tool calls, or prompt/completion pairs).

4. Train the student model. Start with supervised fine-tuning on the high-quality traces. Optionally follow with DPO using preference pairs. The student is typically 5-20x smaller than the teacher.

5. Evaluate on held-out tasks. Run the student on tasks it has never seen. Compare output quality, tool-calling accuracy, and efficiency to the teacher. Set a quality bar — typically 90-95% of the teacher's performance is acceptable given the cost savings.

6. Deploy or iterate. If the student meets the bar, deploy it. If not, collect more traces, adjust filtering, and retrain.
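
The six steps above can be sketched as one loop. Each stage is passed in as a callable so the sketch does not assume any particular agent or training stack; every name here (`distillation_cycle`, `collect`, `keep`, and so on) is hypothetical.

```python
from typing import Callable, Optional, Sequence


def distillation_cycle(
    tasks: Sequence[str],
    collect: Callable[[str], dict],        # step 1: run teacher, return a trace
    keep: Callable[[dict], bool],          # step 2: quality filter
    to_example: Callable[[dict], dict],    # step 3: format for training
    train: Callable[[list], object],       # step 4: returns a student model
    evaluate: Callable[[object], float],   # step 5: student/teacher quality ratio
    quality_bar: float = 0.90,
    batch_size: int = 100,
) -> tuple:
    """Collect traces in batches, retraining until the student meets the bar."""
    examples: list = []
    student: Optional[object] = None
    ratio = 0.0
    for start in range(0, len(tasks), batch_size):
        batch = tasks[start:start + batch_size]
        traces = [collect(task) for task in batch]
        examples += [to_example(t) for t in traces if keep(t)]
        if not examples:
            continue  # nothing usable yet, collect another batch
        student = train(examples)
        ratio = evaluate(student)
        if ratio >= quality_bar:
            break  # step 6: good enough to deploy
    return student, ratio
```

The caller decides what to do when the loop ends below the bar: collect more tasks, loosen or tighten the filter, or pick a larger student.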

Continuous Distillation

The pipeline is not a one-shot process. As your agent encounters new tasks, edge cases, and failure modes, the teacher handles them first. You continuously feed new traces into the training set, periodically retraining the student. This creates a flywheel:

  • More production traffic reveals new scenarios.
  • The teacher handles new scenarios (at higher cost).
  • New traces expand the training set.
  • The student is retrained and covers more cases.
  • Less traffic needs to be routed to the expensive teacher.

class ContinuousDistillation:
    """Manage ongoing distillation from teacher to student."""

    def __init__(
        self,
        teacher_agent,
        student_agent,
        evaluator,
        trace_store,
        training_config: dict,
    ):
        self.teacher = teacher_agent
        self.student = student_agent
        self.evaluator = evaluator
        self.trace_store = trace_store
        self.config = training_config

    async def route_task(self, task: str) -> dict:
        """Route a task to teacher or student based on confidence."""
        # Try the student first
        student_result = await self.student.run(task, dry_run=True)

        if student_result.confidence >= self.config["student_threshold"]:
            # Student is confident — use it
            result = await self.student.run(task)
            return {"result": result, "model": "student"}

        # Student is uncertain — fall back to teacher and learn
        trace = await self._run_teacher_with_trace(task)
        self.trace_store.add(trace)

        # Check if we have enough new traces to retrain
        if self.trace_store.new_traces_count() >= self.config["retrain_threshold"]:
            await self._trigger_retrain()

        return {"result": trace.final_output, "model": "teacher"}

    async def _trigger_retrain(self):
        """Retrain the student on accumulated new traces."""
        new_traces = self.trace_store.get_new_traces()
        filtered = [
            t for t in new_traces
            if t.success and t.quality_score >= 0.8
        ]

        if len(filtered) < self.config["min_training_examples"]:
            return  # Not enough quality data yet

        # Format and trigger training job
        formatter = TraceFormatter()
        training_data = formatter.format_batch(filtered)

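        # In practice, this kicks off an async training job
        await self._submit_training_job(training_data)
        # Mark every trace we examined, including the rejected ones;
        # otherwise they keep counting toward the retrain threshold.
        self.trace_store.mark_as_trained(new_traces)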
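
The class above relies on a `trace_store` whose interface is implied by four calls: `add`, `new_traces_count`, `get_new_traces`, and `mark_as_trained`. A minimal in-memory stand-in, useful for tests, might look like the sketch below. The class name is hypothetical, and a production store would sit on a database or object storage.

```python
class InMemoryTraceStore:
    """Toy store that tracks which traces have not yet been processed."""

    def __init__(self):
        self._all = []
        self._pending = []

    def add(self, trace) -> None:
        self._all.append(trace)
        self._pending.append(trace)

    def new_traces_count(self) -> int:
        return len(self._pending)

    def get_new_traces(self) -> list:
        # Return a copy so callers cannot mutate the pending list
        return list(self._pending)

    def mark_as_trained(self, traces) -> None:
        # Compare by identity so traces need not be hashable
        done = {id(t) for t in traces}
        self._pending = [t for t in self._pending if id(t) not in done]
```

Any trace that is fetched but never marked stays pending and keeps counting toward the retrain threshold, which is worth keeping in mind when deciding which traces to mark.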

What to Fine-Tune For

Not every aspect of agent behavior benefits equally from fine-tuning. Some capabilities are better left to prompting; others are natural fits for weight-level encoding.

Good candidates for fine-tuning:

  • Tool-calling format compliance. The exact JSON schema for tool calls, argument formatting, and structured outputs. Models fine-tuned on tool-calling data rarely drift from the schema.
  • Domain-specific reasoning patterns. If your agent always follows the same reasoning structure for a class of problems (e.g., "check permissions, then validate input, then execute"), fine-tuning encodes that sequence.
  • Output style and formatting. Consistent response structure, citation formats, and tone are easily drilled through fine-tuning.
  • Error recovery patterns. How to respond when a tool fails, when to retry vs. try an alternative approach.

Poor candidates for fine-tuning:

  • Factual knowledge. Fine-tuning is not good at injecting new facts — the model may memorize some but generalization is poor. Use RAG for factual grounding.
  • Rapidly changing requirements. If your agent's behavior needs to change weekly, fine-tuning is too slow. Stick with prompting.
  • Rare edge cases. If a scenario occurs once in a thousand tasks, you will not have enough traces to train on it. Handle it in the prompt or route those cases to the teacher.
  • Complex multi-step planning. Planning ability tends to scale with model size. A small fine-tuned model gains tool-calling fluency but does not magically acquire planning depth it never had.

Evaluation and the Quality Bar

Fine-tuning without rigorous evaluation is dangerous. A model can ace your training distribution while failing on slight variations. Evaluation for agent fine-tuning requires measuring several dimensions:

class DistillationEvaluator:
    """Evaluate a distilled student model against the teacher."""

    def __init__(self, teacher_agent, student_agent, test_tasks: list[dict]):
        self.teacher = teacher_agent
        self.student = student_agent
        self.test_tasks = test_tasks

    async def run_comparison(self) -> dict:
        """Run both models on test tasks and compare."""
        results = {
            "teacher_scores": [],
            "student_scores": [],
            "agreement_rate": 0.0,
            "student_failures": [],
            "efficiency_gain": 0.0,
        }

        for task in self.test_tasks:
            teacher_trace = await self._run_and_score(
                self.teacher, task["input"]
            )
            student_trace = await self._run_and_score(
                self.student, task["input"]
            )

            results["teacher_scores"].append(teacher_trace.quality_score)
            results["student_scores"].append(student_trace.quality_score)

            if student_trace.quality_score < 0.5:
                results["student_failures"].append({
                    "task": task["input"],
                    "student_output": student_trace.final_output,
                    "teacher_output": teacher_trace.final_output,
                })

        # Compute aggregate metrics
        teacher_avg = sum(results["teacher_scores"]) / len(results["teacher_scores"])
        student_avg = sum(results["student_scores"]) / len(results["student_scores"])
        results["quality_ratio"] = student_avg / max(teacher_avg, 0.01)
        results["agreement_rate"] = self._compute_agreement(results)

        return results

    def meets_quality_bar(self, results: dict) -> bool:
        """Does the student meet minimum quality requirements?"""
        return (
            results["quality_ratio"] >= 0.90  # Within 10% of teacher
            and len(results["student_failures"]) / len(self.test_tasks) < 0.05
        )

The quality bar is task-specific. For a customer-facing agent, 95% of the teacher's quality might be required. For an internal automation agent where occasional errors are easily caught, 85% might suffice — if it comes with a 10x cost reduction.

Critical evaluation pitfalls:

Distribution shift. Your test set must consist of tasks the student has not seen in training. If you test on tasks you trained on, you are measuring memorization, not generalization. Include variations that differ from the training distribution as well, since production traffic will.

Failure mode blindness. Average scores hide catastrophic failures. A student that gets 95% of tasks right but completely botches the remaining 5% in dangerous ways (wrong tool calls, unsafe actions) is not production-ready. Evaluate worst-case behavior explicitly.

Metric gaming. If you optimize for a single metric (e.g., final answer correctness), the student may learn shortcuts that pass the metric but do not reflect genuine understanding. Multi-dimensional evaluation (correctness, efficiency, safety, reasoning quality) resists gaming.
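
To make failure-mode blindness concrete, the sketch below reports tail statistics next to the mean. The function name, the nearest-rank 5th percentile, and the 0.5 failure threshold (borrowed from the evaluator above) are illustrative choices rather than a standard.

```python
import math
import statistics


def summarize_scores(scores: list, failure_threshold: float = 0.5) -> dict:
    """Report the mean alongside the tail statistics that the mean hides."""
    ordered = sorted(scores)
    # Nearest-rank 5th percentile: the score 5% of tasks fall at or below
    p5_index = max(0, math.ceil(len(ordered) * 5 / 100) - 1)
    return {
        "mean": statistics.fmean(ordered),
        "p5": ordered[p5_index],
        "min": ordered[0],
        "failure_rate": sum(s < failure_threshold for s in ordered) / len(ordered),
    }


# 95 strong results and 5 complete failures still average above 0.9
summary = summarize_scores([0.95] * 95 + [0.0] * 5)
print(summary)
```

A quality bar defined on `p5` or `failure_rate` as well as the mean will reject a student like this one, which a mean-only bar would pass.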

Trade-Offs and Practical Considerations

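┌──────────────────┬────────────────────┬──────────────────────┬─────────────────────┐
│ Dimension        │ Prompting Only     │ Fine-Tuned           │ Distilled           │
├──────────────────┼────────────────────┼──────────────────────┼─────────────────────┤
│ Setup cost       │ Low                │ Medium               │ High                │
│ Iteration speed  │ Minutes            │ Hours-days           │ Days-weeks          │
│ Per-call cost    │ High (large model) │ Medium               │ Low (small model)   │
│ Latency          │ High               │ Medium               │ Low                 │
│ Flexibility      │ High               │ Medium               │ Low                 │
│ Quality ceiling  │ Highest            │ High                 │ Bounded by teacher  │
│ Data requirement │ None               │ Hundreds of examples │ Thousands of traces │
│ Maintenance      │ Update prompts     │ Retrain periodically │ Continuous pipeline │
└──────────────────┴────────────────────┴──────────────────────┴─────────────────────┘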

Practical considerations that rarely appear in research papers but matter in production:

Sensitive data in traces. If your traces contain sensitive user data, you cannot naively use them for training without privacy controls. Anonymize, aggregate, or use synthetic tasks that mirror real patterns without exposing actual data.
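
As a first line of defense, obvious identifiers can be masked before a trace is stored. The sketch below replaces email addresses and phone-like digit runs with placeholders using regular expressions. The patterns are deliberately crude assumptions that will both miss real identifiers and mask harmless numbers, so treat this as a starting point and not as a substitute for a proper PII detection step or a privacy review.

```python
import re

# Crude illustrative patterns, not a complete PII detector
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

Redaction has to cover every text field of a trace, including tool arguments and tool results, since those often carry the raw user data.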

Model drift. As the teacher model gets updated by its provider, its behavior may shift. Your student was trained on traces from an older version. Periodic retraining keeps the student aligned with the current teacher's quality level.

Catastrophic forgetting. Fine-tuning on a narrow task can degrade the model's general capabilities. If your agent needs both specialist behavior and general reasoning, you may need to blend specialist training data with general-purpose data to maintain breadth.

Serving infrastructure. A distilled model is smaller, but you still need infrastructure to serve it — GPU allocation, load balancing, model versioning. The cost savings from a smaller model must outweigh the operational complexity of running your own inference endpoint.

Fallback routing. Even after distillation, some tasks will exceed the student's capability. You need a confidence-based router that falls back to the teacher when the student is uncertain. The confidence threshold is a tunable parameter: set it too high and you waste money sending easy tasks to the teacher; set it too low and the student produces low-quality results on edge cases.
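
One way to pick the routing threshold empirically is to record, on a labeled validation set, the student's confidence and whether its answer was acceptable, then sweep candidate thresholds. The sketch below assumes the student is used whenever its confidence is at or above the threshold. The record format and the function name are hypothetical, and the result is only as good as the confidence signal's ability to rank easy tasks above hard ones.

```python
def sweep_thresholds(records: list, thresholds: list) -> list:
    """For each threshold, report teacher traffic share and student error rate.

    Each record is a (student_confidence, student_was_correct) tuple.
    """
    rows = []
    for th in thresholds:
        # Tasks the student would handle at this threshold
        kept = [ok for conf, ok in records if conf >= th]
        teacher_share = 1 - len(kept) / len(records)
        error_rate = (1 - sum(kept) / len(kept)) if kept else 0.0
        rows.append({
            "threshold": th,
            "teacher_share": teacher_share,
            "student_error_rate": error_rate,
        })
    return rows
```

Reading the rows side by side makes the trade-off explicit: each step up in threshold buys a lower student error rate at the price of more teacher traffic.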

Conclusion

Fine-tuning and distillation are not replacements for good prompting — they are the next step when prompting hits its limits at scale. The decision framework is straightforward: start with prompting, measure where you are spending latency and cost, and fine-tune only when the numbers justify the investment.

Trace distillation is a practical approach for agents: run a capable teacher on real tasks, capture its execution traces, filter for quality, and train a smaller model to replicate that behavior. The student learns the teacher's decision patterns — which tools to call, how to reason before acting, when to stop — without needing hand-authored training data.

Preference optimization through DPO adds a second layer: instead of merely imitating the teacher, the student learns to prefer better trajectories over worse ones. This corrects for imperfections in the teacher's behavior and pushes the student toward more efficient, safer execution paths.

The pipeline is never done. Continuous distillation — routing new scenarios to the teacher, collecting fresh traces, and periodically retraining the student — creates a flywheel where the student steadily absorbs more of the teacher's capability. The teacher handles the frontier; the student handles the volume.