Model Selection and Routing

Published: 24 May 2026

Not every step in an agent needs the same model. A classification call that sorts a user's query into one of five buckets does not need the same capabilities as the step that synthesizes a legal brief from twelve source documents. A formatting pass that converts JSON to markdown does not need the same reasoning depth as the planning step that decided what to fetch in the first place. Yet many agent implementations use a single model for everything — the most capable one available — because it is simpler to configure and easier to reason about. This works during prototyping but becomes a cost and latency problem at scale.

Model selection is the practice of picking the right model for each step in an agent's workflow. Model routing is the mechanism that makes the selection dynamic — classifying inputs at runtime and sending them down different paths based on difficulty, domain, or cost constraints. Together, they let you build agents that are fast where speed matters, capable where quality matters, and cheap where neither matters much.

We touched on this idea briefly when discussing workflow orchestration and multi-agent systems. This time we look at the full set of strategies, the implementation patterns, and the trade-offs you will face when choosing models at the step level.

The Cost-Latency-Quality Triangle #

Every model sits at a point in a three-dimensional trade-off space:

Cost — price per input and output token. Ranges from fractions of a cent per thousand tokens for small models to ten or twenty times that for frontier models.
Latency — time to first token and total generation time. Smaller models are faster. Larger models, especially those with extended reasoning, can take seconds before the first token arrives.
Quality — accuracy, reasoning depth, instruction-following, and output coherence. Larger models are more capable, particularly on ambiguous or multi-step tasks.

                        Quality
                           ▲
                          / \
                         /   \
                        /     \
                       /       \
                      / Frontier\
                     /   models  \
                    /             \
                   /    Mid-tier   \
                  /      models     \
                 /                   \
                /     Small / fast    \
               /         models        \
              /─────────────────────────\
            Cost ◀──────────────────▶ Latency

No model wins on all three axes. Frontier models deliver the highest quality but cost the most and respond the slowest. Small models are cheap and fast but fail on complex reasoning. Mid-tier models land in the middle — adequate quality for many tasks at a fraction of the cost. The engineering challenge is matching each step to the right point in this triangle.

The obvious mistake is optimizing for quality alone during development, then panicking about cost and latency in production. The better approach: decide your quality bar per step early, then pick the cheapest model that meets it. This turns model selection from a last-minute cost-cutting exercise into a deliberate design decision — one you can reason about, benchmark, and tune over time.

Static Model Assignment #

The simplest form of model selection is static: you decide at design time which model handles each step and hardcode it into the workflow configuration.

WORKFLOW_MODELS = {
    "classify_intent":   "small-model",
    "extract_entities":  "small-model",
    "plan_response":     "large-model",
    "generate_answer":   "large-model",
    "format_output":     "small-model",
    "safety_check":      "medium-model",
}

async def run_step(step_name: str, prompt: str, tools: list = None) -> str:
    model = WORKFLOW_MODELS[step_name]
    return await call_model(model=model, prompt=prompt, tools=tools)

This is boring, predictable, and effective. You audit each step, decide which model it needs, and the cost profile of the workflow is deterministic — you can estimate per-query cost before deploying. There is no routing overhead, no misclassification risk, and no extra latency from a classification call.

When Static Assignment Works #

Static assignment works well when:

Your workflow has distinct steps with clearly different complexity levels. A classification step is always simple. A synthesis step is always hard.
The input distribution is homogeneous. Every query follows roughly the same path and requires the same model capabilities.
You value predictability over efficiency. You would rather overpay slightly on a few easy inputs than risk misrouting a hard input to a weak model.

When It Breaks Down #

Static assignment struggles when:

Input complexity varies widely within the same step. Some "generate answer" calls are trivial (one-sentence factual lookups) and some are hard (multi-step reasoning with citations). A static assignment must use the model that handles the worst case, which means overpaying on easy cases.
Your model landscape changes frequently. If you have a static mapping to six specific models and two of them get deprecated, you have configuration changes scattered across the codebase.
You want to A/B test models. Static assignment makes experimentation cumbersome — you need to change the config, redeploy, and measure.

For workflows with heterogeneous inputs, you need dynamic routing.

Dynamic Routing #

Dynamic routing classifies each input at runtime and selects a model based on the classification result. The classifier acts as a gatekeeper: cheap inputs go to cheap models, expensive inputs go to expensive models.

┌──────────────┐       ┌─────────────────┐         ┌───────────────────┐
│   Incoming   │       │    Classifier   │         │   Model A         │
│    Input     │──────▶│  (cheap, fast)  │───┬───▶ │   (small, fast)   │
│              │       │                 │   │     └───────────────────┘
└──────────────┘       └─────────────────┘   │
                                             │    ┌───────────────────┐
                                             ├───▶│   Model B         │
                                             │    │   (mid-tier)      │
                                             │    └───────────────────┘
                                             │
                                             │     ┌───────────────────┐
                                             └───▶ │   Model C         │
                                                   │   (frontier)      │
                                                   └───────────────────┘

The classifier itself must be cheap — both in cost and latency. If the routing overhead is anywhere near the cost savings from using a smaller model, the whole exercise is pointless. In practice, a good classifier adds 50-200ms and costs a fraction of a cent. The savings on the handler side can be 10-50x that.

LLM-Based Classifiers #

The simplest classifier is itself a model call — a tiny model with a short prompt that outputs a difficulty category:

ROUTING_PROMPT = """Rate the complexity of the following query.

LOW: simple factual question, greeting, or single-step lookup.
MEDIUM: requires some reasoning, comparison, or multi-step logic.
HIGH: ambiguous, multi-part, requires nuanced judgment, or synthesis across sources.

Respond with exactly one word: LOW, MEDIUM, or HIGH.

Query: {query}"""

async def classify_complexity(query: str) -> str:
    response = await call_model(
        model="small-model",
        prompt=ROUTING_PROMPT.format(query=query),
        max_tokens=5,
        temperature=0,
    )
    return response.strip().upper()

async def route_and_handle(query: str) -> str:
    complexity = await classify_complexity(query)

    model_map = {
        "LOW": "small-model",
        "MEDIUM": "medium-model",
        "HIGH": "large-model",
    }

    model = model_map.get(complexity, "large-model")
    return await call_model(model=model, prompt=query)

The classifier prompt should be tight — just enough to make the decision, no more. Temperature zero ensures deterministic routing. The fallback to the large model for unrecognized outputs means misclassification fails safe (higher cost, but correct answer).

Learned Classifiers #

For high-throughput systems, even a small model call adds latency you might not want. An alternative is a traditional ML classifier — a fine-tuned embedding model, a logistic regression, or a small neural network trained on labeled examples of (query, complexity) pairs.

import numpy as np
from sklearn.linear_model import LogisticRegression

class LearnedRouter:
    """Route queries based on a trained classifier over embeddings."""

    def __init__(self, model_map: dict[str, str], classifier, embedder):
        self.model_map = model_map  # label -> model_id
        self.classifier = classifier
        self.embedder = embedder

    async def route(self, query: str) -> str:
        embedding = await self.embedder.embed(query)
        label = self.classifier.predict([embedding])[0]
        return self.model_map.get(label, "large-model")

The advantage: classification takes milliseconds instead of hundreds of milliseconds. The disadvantage: you need labeled training data, the classifier can drift as your input distribution changes, and updating it requires retraining and redeployment. The LLM-based classifier can be updated by changing a prompt; the learned classifier requires a pipeline.

Keyword and Heuristic Routing #

Sometimes you do not need a model at all. If your input types are structurally distinct — code snippets versus prose, single questions versus multi-part requests, known domains versus open-ended queries — simple heuristics can route effectively:

def heuristic_route(query: str) -> str:
    """Route based on structural features of the input."""
    word_count = len(query.split())

    # Very short queries are almost always simple
    if word_count < 10:
        return "small-model"

    # Queries with code blocks likely need code understanding
    if "`" * 3 in query or "def " in query or "function " in query:
        return "large-model"

    # Multi-question queries need reasoning
    if query.count("?") > 2:
        return "large-model"

    # Default to mid-tier
    return "medium-model"

Heuristics are zero-cost, zero-latency, and completely transparent. They are also brittle — a short query can still be deeply complex ("prove P != NP"), and a long query can be trivially simple (a copy-pasted error log that just needs a keyword lookup). Heuristics work best as a first pass combined with a confidence check, or for routing within a narrow domain where input structure is predictable.

Cascading (Tiered Execution) #

Routing makes a single decision up front: which model handles this input. Cascading takes a different approach — it tries the cheap model first and escalates only if the result is not good enough. This is speculative execution applied to model selection.

┌───────────────┐     ┌────────────────┐      ┌────────────────┐
│  Small Model  │────▶│  Confidence    │────▶ │  Return result │
│  (attempt 1)  │     │  check / gate  │ OK   │                │
└───────────────┘     └───────┬────────┘      └────────────────┘
                              │ NOT OK
                              ▼
                      ┌───────────────┐      ┌────────────────┐
                      │  Large Model  │─────▶│  Return result │
                      │  (attempt 2)  │      │                │
                      └───────────────┘      └────────────────┘

The key is the confidence gate between the tiers. Something has to decide whether the small model's answer is "good enough." There are several approaches:

Self-Reported Confidence #

Ask the model to rate its own confidence. Some models can do this reasonably well, especially for factual questions:

CONFIDENCE_PROMPT = """Answer the following question, then rate your confidence.

Question: {query}

Format your response as:
ANSWER: <your answer>
CONFIDENCE: <LOW, MEDIUM, or HIGH>"""

async def cascade_with_confidence(query: str) -> str:
    response = await call_model(
        model="small-model",
        prompt=CONFIDENCE_PROMPT.format(query=query),
    )

    answer, confidence = parse_confidence_response(response)

    if confidence == "HIGH":
        return answer

    # Escalate to the large model
    return await call_model(model="large-model", prompt=query)

The problem: models are notoriously poorly calibrated on confidence. They can be confidently wrong (especially small models) and tentatively correct. Self-reported confidence works as a rough filter but should not be your only gate.

Verifier-Based Cascading #

A more reliable gate is a separate verifier that checks the small model's output against objective criteria — does the answer parse correctly, does it match known facts, does it pass a format check:

async def cascade_with_verifier(query: str, verifier) -> str:
    """Try small model first, escalate if verification fails."""

    # Attempt with cheap model
    cheap_answer = await call_model(model="small-model", prompt=query)

    # Verify the answer
    is_valid = await verifier.check(query=query, answer=cheap_answer)

    if is_valid:
        return cheap_answer

    # Escalate
    return await call_model(model="large-model", prompt=query)

The verifier can be a rule-based check (does the JSON parse? does the SQL query compile? is the output the right length?), a small model acting as a judge, or a domain-specific validator. The cost of the verifier must be low relative to the savings from avoiding the large model on easy inputs.

When Cascading Beats Routing #

Cascading is better than routing when:

You cannot reliably predict difficulty from the input alone. Some queries look simple but turn out to be hard. Cascading discovers this at generation time rather than classification time.
The cheap model handles a large majority of inputs successfully. If 80% of inputs can be answered by the small model, cascading saves money on those 80% while only adding latency on the 20% that escalate.
You have a good verifier. Without a reliable gate, cascading degenerates into "always escalate" (wasting the cheap call) or "never escalate" (serving bad answers).

Cascading is worse than routing when:

Most inputs are complex. If 80% of queries need the large model anyway, the small model attempt on each one is pure waste — extra latency and cost for no savings.
Latency is the primary constraint. Cascading adds the small model's generation time to the total latency on every escalated query. Routing adds only the classification overhead.

Task-Specific Model Strengths #

Beyond the cost-latency-quality triangle, models differ in what they are good at. A model that excels at code generation might produce mediocre creative writing. A model optimized for instruction-following might struggle with open-ended reasoning. Knowing these strengths lets you route not just by difficulty but by task type.

Common task categories and their typical model requirements:

Task Type	Key Capability	Model Size Needed
Classification / intent detection	Pattern matching	Small
Entity extraction	Structured output	Small to medium
Summarization	Compression, salience	Medium
Code generation	Syntax, logic, planning	Large
Multi-step reasoning	Chain of thought, working memory	Large
Creative writing	Fluency, style	Medium to large
Translation	Bilingual fluency	Medium
Tool selection	Schema understanding	Medium to large
Safety / content filtering	Policy adherence	Small to medium

This is not a universal truth — specific models break these generalizations. But as a design heuristic, it helps you assign models to steps before you start benchmarking.

Composite Workflows #

In a real agent workflow, steps have different task types. A customer support agent might:

Classify intent (small model)
Extract order number and customer ID (small model)
Look up order status (tool call, no model needed)
Decide if the issue requires escalation (medium model)
Generate a response (large model for complex issues, medium model for simple ones)
Check for policy compliance (small model)

Six steps. Three different models. The total cost is a fraction of what it would be if every step used the frontier model. The total latency is lower because small models respond faster. And the quality on the steps that matter — the response generation — is just as high.

SUPPORT_WORKFLOW = [
    Step("classify", model="small", tools=None),
    Step("extract", model="small", tools=None),
    Step("lookup", model=None, tools=["order_status"]),
    Step("triage", model="medium", tools=None),
    Step("respond", model="dynamic", tools=["knowledge_base"]),
    Step("compliance", model="small", tools=None),
]

async def run_support_workflow(query: str) -> str:
    context = {"query": query}

    for step in SUPPORT_WORKFLOW:
        if step.model == "dynamic":
            # Route based on triage result
            model = select_model_for_response(context["triage_result"])
        else:
            model = step.model

        context[step.name] = await execute_step(step, model, context)

    return context["respond"]

Implementing a Model Router #

A production model router needs more than a classifier. It needs to handle fallbacks (what if the selected model is down?), track costs (are we staying within budget?), and collect metrics (which routes are we taking, and what is the quality?).

class ModelRouter:
    """Route requests to models based on classification, budget, and availability."""

    def __init__(self, config: RoutingConfig, metrics: MetricsCollector):
        self.config = config
        self.metrics = metrics
        self.budget_tracker = BudgetTracker(config.daily_budget)

    async def route(self, request: ModelRequest) -> ModelResponse:
        # 1. Classify the request
        category = await self.classify(request)

        # 2. Select model based on category and budget
        model = self.select_model(category, request)

        # 3. Execute with fallback
        try:
            response = await self.execute(model, request)
            self.metrics.record(
                category=category,
                model=model,
                tokens_in=response.input_tokens,
                tokens_out=response.output_tokens,
                latency=response.latency_ms,
            )
            self.budget_tracker.debit(response.cost)
            return response

        except (RateLimitError, ServiceUnavailable) as e:
            # Fall back to next available model
            fallback = self.config.fallback_for(model)
            if fallback:
                return await self.execute(fallback, request)
            raise

    def select_model(self, category: str, request: ModelRequest) -> str:
        """Pick model based on category, respecting budget constraints."""
        preferred = self.config.model_for_category(category)

        # If budget is tight, downgrade non-critical requests
        if self.budget_tracker.remaining_fraction() < 0.2:
            if category != "critical":
                return self.config.budget_model
        return preferred

    async def classify(self, request: ModelRequest) -> str:
        """Classify request complexity/type."""
        if request.explicit_category:
            return request.explicit_category

        return await self.config.classifier.classify(request.prompt)

Budget-Aware Routing #

Budget awareness changes routing behavior over time. Early in the day (or billing period), the router can be generous — sending borderline queries to the large model because budget is plentiful. Late in the period, when budget is running low, the router becomes conservative — downgrading anything that is not critical.

class BudgetTracker:
    """Track spend against a budget and influence routing decisions."""

    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        self.spent_today = 0.0

    def debit(self, cost: float):
        self.spent_today += cost

    def remaining_fraction(self) -> float:
        return max(0, (self.daily_budget - self.spent_today) / self.daily_budget)

    def recommend_tier(self, base_tier: str) -> str:
        """Suggest a model tier based on remaining budget."""
        remaining = self.remaining_fraction()

        if remaining > 0.5:
            return base_tier  # Plenty of budget, use preferred model
        elif remaining > 0.2:
            # Getting tight — downgrade medium to small
            return "small" if base_tier == "medium" else base_tier
        else:
            # Critical budget — use small for everything non-essential
            return "small"

This is a simple linear degradation, but you can make it more sophisticated — weighting by time of day (if traffic is predictable), by request priority (premium users get the large model regardless of budget), or by task criticality (safety checks never get downgraded).

Evaluating Model Fit #

How do you know which model is "good enough" for a given step? You benchmark. There is no shortcut. Model capabilities change with every release, and the only way to know if a cheaper model handles your task acceptably is to test it on your actual data.

The Evaluation Loop #

The process is straightforward:

Collect a representative sample of inputs for each workflow step (50-200 examples is usually enough for initial signal).
Run each candidate model on the sample.
Score the outputs — either with an automated metric (accuracy, F1, BLEU, pass rate) or with a judge model.
Compare quality versus cost and latency.
Pick the cheapest model that meets your quality threshold.

async def evaluate_model_for_step(
    step_name: str,
    model_id: str,
    test_cases: list[dict],
    judge,
) -> EvalResult:
    """Evaluate a model's fitness for a specific workflow step."""
    results = []

    for case in test_cases:
        response = await call_model(
            model=model_id,
            prompt=case["prompt"],
            tools=case.get("tools"),
        )
        score = await judge.score(
            prompt=case["prompt"],
            response=response,
            reference=case.get("reference"),
        )
        results.append({
            "score": score,
            "tokens_in": response.input_tokens,
            "tokens_out": response.output_tokens,
            "latency_ms": response.latency_ms,
        })

    return EvalResult(
        model=model_id,
        step=step_name,
        mean_score=np.mean([r["score"] for r in results]),
        mean_latency=np.mean([r["latency_ms"] for r in results]),
        mean_cost=sum(estimate_cost(r, model_id) for r in results) / len(results),
        pass_rate=sum(1 for r in results if r["score"] >= THRESHOLD) / len(results),
    )

Quality Thresholds Per Step #

Not every step requires 95% accuracy. A classification step that routes queries might tolerate 90% accuracy — the 10% misrouted queries will still get answered, just by a less-optimal handler. A safety check that blocks harmful outputs might require 99.5% recall — missing even 0.5% of harmful content is unacceptable.

Define your threshold before running the evaluation. Otherwise, you will rationalize whatever the cheap model produces. Write down "this step must achieve X% accuracy on our test set" and commit to it before you see the numbers. Moving the goalpost after the fact is how you end up shipping degraded quality and calling it a cost optimization.

STEP_THRESHOLDS = {
    "classify_intent":  {"min_accuracy": 0.90, "max_latency_ms": 500},
    "extract_entities": {"min_accuracy": 0.92, "max_latency_ms": 800},
    "generate_answer":  {"min_accuracy": 0.95, "max_latency_ms": 3000},
    "safety_check":     {"min_recall": 0.995, "max_latency_ms": 300},
}

Continuous Evaluation #

Model selection is not a one-time decision. Models get updated, deprecated, or repriced. Your input distribution shifts as your product evolves and your user base grows. The model that was "good enough" three months ago might no longer be — or a newer, cheaper model might now exceed your threshold. Run evaluations on a schedule (weekly or monthly) and alert when a model's quality drops below threshold. Treat your model assignments the way you treat dependency versions: pin them, test them, and upgrade deliberately.

Speculative Execution #

A more aggressive optimization: call multiple models in parallel and use the first acceptable result. This trades cost for latency — you pay for all the calls but only use one result. The idea borrows from CPU branch prediction — speculate that the cheap model will succeed, but hedge by running the expensive model concurrently.

import asyncio

async def speculative_execution(
    query: str,
    models: list[str],
    verifier,
) -> str:
    """Call multiple models in parallel, return first verified answer."""
    tasks = [
        asyncio.create_task(call_model(model=model, prompt=query))
        for model in models
    ]

    # Process results as they complete
    for future in asyncio.as_completed(tasks):
        response = await future
        if await verifier.check(query, response.text):
            # Cancel remaining tasks
            for task in tasks:
                if not task.done():
                    task.cancel()
            return response.text

    # All models tried, none verified — return the last one
    return response.text

This pattern makes sense when latency is your primary constraint (real-time applications, interactive agents), you have a fast verifier that can approve results in milliseconds, and the cost of running multiple models in parallel is acceptable. It does not make sense when cost is the binding constraint — you are paying for every call, even the ones you discard.

Routing in Multi-Agent Systems #

In multi-agent architectures, model selection happens at the agent level rather than the step level. Each agent can run a different model, and the coordinator decides which agent handles which sub-task — effectively routing by capability.

AGENT_CONFIGS = {
    "researcher": {
        "model": "large-model",
        "tools": ["web_search", "document_reader"],
        "description": "Handles complex research and synthesis tasks",
    },
    "formatter": {
        "model": "small-model",
        "tools": ["template_engine"],
        "description": "Converts structured data into formatted output",
    },
    "validator": {
        "model": "medium-model",
        "tools": ["schema_validator", "fact_checker"],
        "description": "Verifies outputs for accuracy and format",
    },
}

The coordinator — itself often running on a mid-tier model — decomposes the task and assigns sub-tasks to agents. Each agent uses the model that matches its workload. The research agent needs deep reasoning, so it gets the frontier model. The formatter does mechanical transformation, so it gets the cheapest model. The validator needs some judgment but not frontier-level reasoning, so it gets the mid-tier.

This is the multi-agent equivalent of static assignment, but with cleaner separation — each agent's model is an implementation detail hidden behind its interface. The coordinator does not need to know which model the researcher uses, only that the researcher can handle complex synthesis tasks.

Trade-offs and Pitfalls #

Model routing is not free. Every layer of indirection — classifiers, verifiers, fallback chains, budget trackers — adds code to maintain, failure modes to handle, and latency to measure. The following are the common pitfalls when building routing systems.

Routing Accuracy vs. Savings #

Every routing decision has a misclassification risk. If a hard query gets routed to a small model, the user gets a bad answer — and bad answers erode trust faster than slow answers do. The cost savings from routing must be weighed against the quality risk of misrouting.

Rule of thumb: if misrouting a query to the cheap model produces a noticeably worse answer (one the user would complain about), your routing must be very accurate on that query type. If misrouting just produces a slightly less polished answer, lower accuracy is acceptable. Monitor misroute rates in production and tighten the classifier when quality regressions surface.

Latency from Classification #

The classification step adds latency to every request. For an LLM-based classifier, this is 100-500ms. For a learned classifier, 5-20ms. For heuristics, sub-millisecond. If your overall latency budget is 2 seconds and you spend 500ms on classification, that is 25% of your budget gone before you start generating.

Measure the end-to-end latency, not just the generation time. A "faster" small model behind a slow classifier can be slower overall than just calling the medium model directly. The routing overhead must be amortized across enough savings to justify its existence — if your average query only saves a fraction of a cent from routing, but the classifier costs half that, your net savings are halved before you even account for engineering time.

Model Diversity Risk #

Depending on multiple models from multiple providers introduces operational complexity. Each model has different:

API formats and authentication
Rate limits and quotas
Failure modes and error codes
Update schedules and deprecation timelines
Prompt format preferences and quirks

An abstraction layer that normalizes these differences is essential. Without it, your routing logic becomes entangled with provider-specific handling code:

class ModelClient:
    """Unified interface across model providers."""

    def __init__(self, provider_configs: dict):
        self.providers = {
            name: create_provider(config)
            for name, config in provider_configs.items()
        }

    async def call(self, model_id: str, prompt: str, **kwargs) -> ModelResponse:
        provider = self.providers[model_id]
        raw_response = await provider.generate(prompt, **kwargs)
        return ModelResponse(
            text=raw_response.text,
            input_tokens=raw_response.usage.input,
            output_tokens=raw_response.usage.output,
            latency_ms=raw_response.elapsed_ms,
            cost=self.estimate_cost(model_id, raw_response.usage),
            model=model_id,
        )

Over-Optimization #

It is possible to over-optimize model selection. If you spend more engineering time building and maintaining the routing infrastructure than you save on model costs, you have gone too far. A simple three-tier system (small / medium / large) with static assignment covers 80% of the benefit. Add dynamic routing only when your volume justifies the complexity.

Signs you are over-optimizing:

You have more than five routing tiers with marginal quality differences between them.
Your routing logic has more lines of code than your core agent logic.
You spend more time debugging misroutes than fixing actual agent failures.
The cost of the classifier is more than 10% of the average request cost.

Conclusion #

Model selection and routing is about matching capability to need — not using a cannon to swat a fly, and not using a flyswatter against a bear.

The cost-latency-quality triangle governs every model choice. No model wins on all three axes. Pick the cheapest model that meets your quality bar for each step.
Static assignment is the simplest approach: hardcode a model per workflow step. Predictable, easy to audit, no routing overhead. Start here.
Dynamic routing classifies inputs at runtime and routes to different models. Use an LLM classifier, a learned classifier, or heuristics. The classifier must be substantially cheaper than the savings it enables.
Cascading tries the cheap model first and escalates if the answer is not good enough. Best when a reliable verifier exists and most inputs are easy.
Task-specific strengths mean routing can be by task type, not just difficulty. Classification and formatting go to small models. Reasoning and generation go to large ones.
Budget-aware routing adjusts model selection based on remaining budget, degrading gracefully as spend accumulates.
Evaluate continuously. Benchmark candidate models on your actual data, set quality thresholds per step, and re-evaluate regularly as models change.
Do not over-engineer. A simple three-tier system with static assignment captures most of the savings. Add dynamic routing only when volume and cost justify the complexity.