Structured Output & Constrained Decoding

Language models generate text one token at a time. Left unconstrained, that text is free-form — it might be valid JSON, or it might be JSON with a trailing comma that breaks your parser at 2 AM. For agents, structured output is the contract between the model and every downstream system that consumes its responses. We explore how structured generation actually works, the spectrum of enforcement techniques from hopeful prompting to grammar-level constraints, and the trade-offs that determine which approach fits your system.

The Problem - Probabilistic Systems, Deterministic Consumers

An agent's output rarely goes directly to a human. It feeds into tool-call dispatchers, workflow orchestrators, databases, APIs, and other agents. All of these consumers expect data in a precise format: a JSON object with specific keys, an enum value from a fixed set, a number within a range. The model, however, was trained to predict the next likely token. Valid syntax is usually the most likely continuation, but nothing in plain sampling guarantees it.

Consider what happens without structured output enforcement. You prompt the model: "Return a JSON object with keys action and parameters." Most of the time it complies. But occasionally it wraps the JSON in markdown fences. Or it adds a conversational preamble before the JSON. Or it produces a key named params instead of parameters. Or it outputs a trailing comma that is valid in JavaScript but not in JSON. Each of these failures requires a different recovery path, and the failure rate compounds across multi-step agent runs.
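To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes every step fails independently with the same probability, which is a simplification, and the 95% compliance figure is illustrative rather than measured.

```python
def run_failure_probability(per_step_success: float, steps: int) -> float:
    """Probability that at least one step in a run yields malformed output.

    Assumes steps fail independently with the same probability.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return 1.0 - per_step_success ** steps


for steps in (1, 5, 10, 20):
    p = run_failure_probability(0.95, steps)
    print(f"{steps:>2} steps: {p:.1%} chance of at least one malformed output")
```

Under these assumptions, a ten-step run with 95% per-step compliance hits at least one malformed output about 40% of the time.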

The fundamental tension is this: you want the flexibility of natural language generation (so the model can reason, plan, and compose) combined with the rigidity of a type system (so downstream code can parse without guessing). Structured output techniques resolve this tension at different points in the generation pipeline.

Three Layers of Enforcement

There are three distinct layers where you can enforce output structure, each with different guarantees and costs:

┌─────────────────────────────────────────────────────────────────┐
│                    Generation Pipeline                          │
│                                                                 │
│  ┌──────────────┐     ┌──────────────────┐    ┌──────────────┐  │
│  │   Prompt     │────▶│  Constrained     │───▶│   Runtime    │  │
│  │  Instructions│     │  Decoding        │    │  Validation  │  │
│  │              │     │                  │    │              │  │
│  │  "Return     │     │  Token masking,  │    │  JSON Schema │  │
│  │   valid JSON │     │  grammar guides, │    │  validation, │  │
│  │   with these │     │  logit bias      │    │  type checks,│  │
│  │   keys..."   │     │                  │    │  retry logic │  │
│  └──────────────┘     └──────────────────┘    └──────────────┘  │
│                                                                 │
│  Guarantee: None     Guarantee: Syntactic   Guarantee: Semantic │
│  Cost: Zero          Cost: Inference-time   Cost: Extra calls   │
└─────────────────────────────────────────────────────────────────┘

Layer 1 - Prompt-Based Formatting

The simplest approach is to tell the model what format you want in the system prompt or user message. This costs nothing extra and works surprisingly well with capable models, but it provides zero guarantees. If the model complies 95% of the time, 5% of calls hit a parse error, and an agent run that makes ten such calls has roughly a 40% chance of hitting at least one (assuming failures are independent).

Prompt-based formatting is appropriate for prototyping, for cases where a parse failure triggers a cheap retry, or when the model is strong enough that compliance is near-100% empirically. It is not appropriate for production pipelines where a malformed output causes data corruption or a cascading failure.

SYSTEM_PROMPT = """You are a task-planning agent.

When you decide on the next action, respond with ONLY a JSON object:
{
  "action": "tool_name",
  "parameters": {"key": "value"},
  "reasoning": "one sentence explaining why"
}

Do not include any text before or after the JSON.
"""

The weakness is obvious: "do not include any text" is a suggestion, not a constraint. The model can and will violate it.
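When you do rely on prompt-only formatting, a tolerant extraction step can recover from the most common violations (markdown fences, conversational preambles). The following is a minimal sketch using only the standard library; the function name and the `required_keys` parameter are illustrative, not part of any API mentioned above.

```python
import json
from typing import Any, Optional


def extract_json_object(
    raw: str, required_keys: Optional[set[str]] = None
) -> Optional[dict[str, Any]]:
    """Best-effort recovery of a JSON object from a prompt-only response.

    Tries to decode a JSON value starting at each "{" in the text, so fences
    and preambles are skipped. Returns None if no object with the required
    keys can be decoded.
    """
    required = required_keys or set()
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            value = None
        # Checking required keys avoids returning a nested object by mistake
        # when the outer object is malformed.
        if isinstance(value, dict) and required.issubset(value):
            return value
        start = raw.find("{", start + 1)
    return None


reply = 'Sure! Here is the plan: {"action": "search", "parameters": {"q": "x"}}'
print(extract_json_object(reply, {"action", "parameters"}))
```

This recovers from wrappers around the JSON, but not from errors inside it: a trailing comma or a renamed key still returns None, and the caller has to retry.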

Layer 2 - Constrained Decoding

Constrained decoding enforces structure during generation, at the token level. Instead of hoping the model produces valid output, you make it impossible for it to produce invalid output. This is where the real engineering happens.

The core mechanism is token masking (sometimes called logit biasing or vocabulary restriction). At each generation step, before the model samples the next token, you compute which tokens are valid given the current position in the output schema. You set the probability of all invalid tokens to zero (or negative infinity in logit space). The model can only choose among tokens that keep the output on a valid path.

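For JSON output, the mask is driven by a parser that tracks the current position in the grammar, much like a pushdown automaton. If you have generated {"action": "search, the valid next tokens are those that continue the JSON string or close it with ". A token containing a raw newline is invalid mid-string, because JSON requires control characters to be escaped, so it gets masked out. A } is a legal string character at this point, but it cannot end the object until the string has been closed.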

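# Conceptual token masking for JSON generation.
# incremental_parse and is_valid_prefix are placeholders for a real
# incremental parser; they are not defined here.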
def get_valid_token_mask(
    generated_so_far: str,
    schema: dict,
    vocabulary: list[str],
) -> list[bool]:
    """Return a boolean mask over the vocabulary.
    True = token is allowed at this position.
    """
    parser_state = incremental_parse(generated_so_far)
    mask = []
    for token in vocabulary:
        candidate = generated_so_far + token
        # Can this candidate lead to a valid completion?
        mask.append(is_valid_prefix(candidate, schema, parser_state))
    return mask
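For a version that actually runs, the sketch below shrinks the problem to a toy: the "schema" is just a finite list of allowed output strings, the vocabulary has six made-up tokens, and the logits are invented numbers. It shows the two steps described above, computing the mask and then pushing invalid logits to negative infinity before the softmax. Real implementations track parser state incrementally rather than re-checking string prefixes.

```python
import math


def is_valid_prefix(candidate: str, allowed_outputs: list[str]) -> bool:
    """True if candidate can still be extended into one of the allowed outputs."""
    return any(output.startswith(candidate) for output in allowed_outputs)


def masked_probabilities(
    logits: list[float],
    vocabulary: list[str],
    generated_so_far: str,
    allowed_outputs: list[str],
) -> list[float]:
    """Softmax over the logits after masking tokens that leave the valid path."""
    masked = [
        logit if is_valid_prefix(generated_so_far + token, allowed_outputs) else -math.inf
        for logit, token in zip(logits, vocabulary)
    ]
    finite = [x for x in masked if x != -math.inf]
    if not finite:
        raise ValueError("no valid token at this position")
    peak = max(finite)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in masked]  # exp(-inf) is 0.0
    total = sum(weights)
    return [w / total for w in weights]


# Toy vocabulary and invented logits; a real tokenizer has far more entries.
vocabulary = ['"', "search", "calc", "ulate", "delete", "}"]
logits = [0.1, 1.0, 0.5, 0.2, 3.0, 0.3]
allowed = ['"search"', '"calculate"']

probs = masked_probabilities(logits, vocabulary, '"', allowed)
for token, p in zip(vocabulary, probs):
    print(f"{token!r}: {p:.3f}")
```

Note that "delete" has the highest raw logit but ends up with probability zero, because no allowed output starts with `"delete`. All of the probability mass is redistributed over "search" and "calc".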

The power of this approach is that it provides a syntactic guarantee: as long as generation runs to completion rather than stopping at a token limit, the output will parse. The model cannot produce mismatched brackets or invalid escape sequences. It can still produce semantically wrong content (a tool name that does not exist, a number outside the valid range), but the structure is locked down.

Grammar-Based Constrained Decoding

The most general form of constrained decoding uses a formal grammar to define the set of valid outputs, typically a context-free grammar written in a notation such as GBNF (GGML BNF, the format used by llama.cpp). The grammar can specify not just "valid JSON" but "valid JSON conforming to this specific schema."

# GBNF grammar for a tool-call response.
# Simplified: strings have no escape sequences and numbers have no exponents.
# The root rule stays on one line because GBNF only allows line breaks
# inside parentheses or after "|".
root   ::= "{" ws "\"action\"" ws ":" ws action-value ws "," ws "\"parameters\"" ws ":" ws object ws "}"
action-value ::= "\"search\"" | "\"calculate\"" | "\"send_email\""
object ::= "{" ws (pair (ws "," ws pair)*)? ws "}"
pair   ::= string ws ":" ws value
value  ::= string | number | "true" | "false" | "null" | object | array
string ::= "\"" [^"\\\x00-\x1F]* "\""
number ::= "-"? ("0" | [1-9] [0-9]*) ("." [0-9]+)?
array  ::= "[" ws (value (ws "," ws value)*)? ws "]"
ws     ::= [ \t\n]*

This grammar restricts action to exactly three values. The model cannot emit a tool name outside that set: at each step, tokens that would lead to any other name are masked before sampling, so the choice falls to the most likely token that is still valid.
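Hand-writing the action-value rule gets tedious once the tool list changes often. One option is to generate it from the tool registry. The helper below is a hypothetical sketch, not part of any grammar library, and it assumes tool names are simple ASCII identifiers so that JSON-style escaping is also acceptable as grammar string escaping.

```python
import json


def gbnf_enum_rule(rule_name: str, values: list[str]) -> str:
    """Build a GBNF rule that matches exactly one of the given JSON string literals."""
    if not values:
        raise ValueError("at least one value is required")
    alternatives = []
    for value in values:
        json_literal = json.dumps(value)         # the JSON text, including its quotes
        gbnf_literal = json.dumps(json_literal)  # quoted again for the grammar's string syntax
        alternatives.append(gbnf_literal)
    return f"{rule_name} ::= " + " | ".join(alternatives)


print(gbnf_enum_rule("action-value", ["search", "calculate", "send_email"]))
```

For the three tools used in this article, the generated line is identical to the action-value rule in the grammar above, so the grammar and the dispatcher's tool list can be kept in sync from a single source.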

JSON Mode and Schema Mode

Most model providers now offer simpler interfaces that hide the grammar machinery:

JSON mode guarantees the output is valid JSON (any valid JSON). It handles bracket matching, string escaping, and structural validity, but does not enforce a specific schema. You still need runtime validation to check that the right keys exist.

Schema mode (sometimes called "structured outputs" or "response format with schema") goes further — you pass a JSON Schema and the model's output is guaranteed to conform to it. This is constrained decoding with the grammar auto-generated from your schema.

from dataclasses import dataclass


@dataclass
class ToolCall:
    action: str  # enum: search, calculate, send_email
    parameters: dict
    reasoning: str


# Using schema mode (pseudocode — API details vary)
response = model.generate(
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "tool_call",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "action": {
                        "type": "string",
                        "enum": ["search", "calculate", "send_email"],
                    },
                    "parameters": {"type": "object"},
                    "reasoning": {"type": "string"},
                },
                "required": ["action", "parameters", "reasoning"],
                "additionalProperties": False,
            },
        },
    },
)
# response.content should parse and validate against the schema,
# unless generation was cut off (for example by a token limit)
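Once the raw text comes back, it is worth converting it into a typed object at the boundary so the rest of the code never touches untyped dictionaries. The sketch below redefines the ToolCall dataclass so that it is self-contained; the `tool_call_from_json` helper is an illustrative name, and the shape checks are deliberately kept even under schema mode, since truncated output can still fail to parse.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    action: str
    parameters: dict[str, Any]
    reasoning: str


def tool_call_from_json(raw: str) -> ToolCall:
    """Convert schema-mode output into a typed ToolCall.

    Raises ValueError if the payload is not valid JSON or has the wrong shape.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected_keys = {"action", "parameters", "reasoning"}
    if set(data) != expected_keys:
        raise ValueError(f"unexpected or missing keys: {sorted(set(data) ^ expected_keys)}")
    if not isinstance(data["action"], str) or not isinstance(data["reasoning"], str):
        raise ValueError("action and reasoning must be strings")
    if not isinstance(data["parameters"], dict):
        raise ValueError("parameters must be an object")
    return ToolCall(**data)


call = tool_call_from_json(
    '{"action": "search", "parameters": {"query": "weather"}, "reasoning": "User asked."}'
)
print(call.action, call.parameters)
```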

Layer 3 - Runtime Validation

Even with constrained decoding, you still want runtime validation. Constrained decoding guarantees syntax — the output parses. Runtime validation checks semantics — the values make sense. Does the referenced tool exist in the registry? Is the date in the future? Is the amount within budget?

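import json
from jsonschema import validate, ValidationError

# Tools the dispatcher can actually run (illustrative values)
REGISTERED_TOOLS = {"search", "calculate", "send_email"}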


def validate_and_parse(raw: str, schema: dict) -> dict | None:
    """Parse, validate, and return structured output.
    Returns None on failure (caller decides whether to retry).
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None  # Should not happen with constrained decoding

    try:
        validate(instance=parsed, schema=schema)
    except ValidationError:
        return None

    # Semantic checks beyond schema
    if parsed.get("action") not in REGISTERED_TOOLS:
        return None

    return parsed

The three layers are complementary, not alternatives. A production agent typically uses all three: prompt instructions set the model's intent, constrained decoding enforces structure, and runtime validation catches semantic issues that no grammar can express.
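The "retry logic" part of runtime validation can be sketched as a small loop that feeds the rejection reason back to the model. Everything here is an assumption for illustration: `generate` stands in for whatever wrapper you have around a model call (prompt text in, raw text out), and `validate` is any function that returns None for acceptable output or an error message otherwise.

```python
import json
from collections.abc import Callable
from typing import Any, Optional


def generate_with_retries(
    generate: Callable[[str], str],
    validate: Callable[[dict[str, Any]], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Call generate until the output parses and passes validate, or attempts run out."""
    feedback = ""
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = generate(prompt + feedback)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
        else:
            if not isinstance(parsed, dict):
                last_error = "expected a JSON object"
            else:
                error = validate(parsed)
                if error is None:
                    return parsed
                last_error = error
        # Tell the model why the previous attempt was rejected.
        feedback = f"\n\nYour previous response was rejected: {last_error}. Try again."
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Capping the attempts matters: each retry is an extra model call, and a model that fails the same semantic check three times in a row is unlikely to pass on the fourth.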

Trade-Offs and When to Use Each Approach

| Approach           | Guarantee              | Latency impact        | Flexibility     | Best for                                   |
|--------------------|------------------------|-----------------------|-----------------|--------------------------------------------|
| Prompt-only        | None (soft)            | Zero                  | Maximum         | Prototypes, strong models, cheap retries   |
| JSON mode          | Syntactic (any JSON)   | Minimal               | High            | When you need valid JSON but schema varies |
| Schema mode        | Syntactic + structural | 5-15% slower          | Moderate        | Fixed-schema tool calls, workflow handoffs |
| Grammar (GBNF)     | Syntactic + structural | 10-20% slower         | Maximum control | Custom formats, non-JSON structured output |
| Runtime validation | Semantic               | Extra call on failure | N/A             | Always (defense in depth)                  |

The latency cost of constrained decoding comes from two sources. First, computing the valid token mask at each step adds overhead — typically 5-20% depending on grammar complexity and vocabulary size. Second, token masking can force the model away from its preferred token distribution, sometimes causing it to take less direct paths to the same output (more tokens to express the same content).

There is also a quality trade-off. Aggressive constraints can hurt reasoning quality. If the model needs to "think" before committing to a structured response, forcing it directly into JSON may cut off useful intermediate reasoning. A common pattern is to let the model reason freely in a scratchpad, then emit structured output only at the end:

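# Pseudocode, using the same hypothetical model.generate API as above.
# SYSTEM_PROMPT here must allow free-text reasoning before the JSON,
# unlike the JSON-only prompt shown in Layer 1.
scratchpad_messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": task},
]

response = model.generate(
    messages=scratchpad_messages,
    # Let the model think in free text first, stopping before the JSON block
    stop=["```json"],
)

# Now generate just the structured part with constraints.
# Whether an assistant prefill can be combined with response_format
# depends on the provider.
structured = model.generate(
    messages=[
        *scratchpad_messages,
        {"role": "assistant", "content": response.content + "```json\n"},
    ],
    # schema: a {"name", "strict", "schema"} payload like the one in the
    # schema-mode example
    response_format={"type": "json_schema", "json_schema": schema},
)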

Schema Evolution and Versioning

Schemas change. You add a field, deprecate another, rename a third. In traditional APIs, schema evolution is managed through versioning and backward compatibility. Agent output schemas need the same discipline, with an additional twist: the model has been trained (or prompted) on the old schema, so changing it may degrade the quality of the content even when constrained decoding keeps the structure valid.

Practical strategies for schema evolution:

Additive changes only. Add new optional fields rather than modifying or removing existing ones. Constrained decoding with additionalProperties: false means you must update the schema everywhere simultaneously — model prompt, grammar, and consumer code — or the model cannot produce the new fields.

Schema registry. Maintain a versioned registry of output schemas. Tag each agent trace with the schema version it used. This lets you replay old traces against new validators and catch regressions.

Migration windows. When a breaking change is unavoidable, run both schemas in parallel. The agent emits output under the new schema; a compatibility layer translates to the old schema for consumers that have not migrated yet. Set a deadline and remove the old path.

SCHEMA_V1 = {
    "type": "object",
    "properties": {
        "action": {"type": "string"},
        "args": {"type": "object"},
    },
    "required": ["action", "args"],
}

SCHEMA_V2 = {
    "type": "object",
    "properties": {
        "action": {"type": "string"},
        "parameters": {"type": "object"},  # renamed from "args"
        "reasoning": {"type": "string"},   # new field
    },
    "required": ["action", "parameters", "reasoning"],
}


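def migrate_v1_to_v2(output: dict) -> dict:
    """Translate v1 output to v2 shape.

    This is the upgrade direction, useful for replaying stored v1 traces
    against v2 validators. Consumers still on v1 need the reverse mapping.
    """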
    return {
        "action": output["action"],
        "parameters": output.get("args", output.get("parameters", {})),
        "reasoning": output.get("reasoning", ""),
    }
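The registry and migration-window strategies can be combined in a few lines. The sketch below is one possible shape, not a prescribed design: the registry stores only the required keys for each version (mirroring SCHEMA_V1 and SCHEMA_V2), and the names `SCHEMA_REGISTRY`, `tag_trace`, `downgrade_v2_to_v1`, and `for_consumer` are all hypothetical.

```python
from typing import Any

# Hypothetical registry keyed by version tag; a real one would hold full schemas.
SCHEMA_REGISTRY: dict[str, list[str]] = {
    "v1": ["action", "args"],
    "v2": ["action", "parameters", "reasoning"],
}


def tag_trace(output: dict[str, Any], schema_version: str) -> dict[str, Any]:
    """Wrap an agent output with the schema version that produced it."""
    required = SCHEMA_REGISTRY[schema_version]  # KeyError for unknown versions
    missing = [key for key in required if key not in output]
    if missing:
        raise ValueError(f"output does not match {schema_version}: missing {missing}")
    return {"schema_version": schema_version, "output": output}


def downgrade_v2_to_v1(output: dict[str, Any]) -> dict[str, Any]:
    """Translate v2 output to the v1 shape for consumers that have not migrated.

    The v2-only "reasoning" field is dropped because v1 has nowhere to put it.
    """
    return {"action": output["action"], "args": output["parameters"]}


def for_consumer(trace: dict[str, Any], consumer_version: str) -> dict[str, Any]:
    """Return the trace's output in the shape the consumer expects."""
    produced = trace["schema_version"]
    if produced == consumer_version:
        return trace["output"]
    if (produced, consumer_version) == ("v2", "v1"):
        return downgrade_v2_to_v1(trace["output"])
    raise ValueError(f"no translation from {produced} to {consumer_version}")


trace = tag_trace(
    {"action": "search", "parameters": {"q": "x"}, "reasoning": "needed data"}, "v2"
)
print(for_consumer(trace, "v1"))
```

Because every trace carries its version tag, removing the old path at the deadline is a matter of deleting the ("v2", "v1") branch and seeing which consumers start raising.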

Structured Output in Multi-Step Pipelines

In an agent pipeline, every intermediate step that hands data to the next step needs a contract. The plan produced by a planner must parse into steps. The retrieval query must parse into a search object. The tool result must parse into a format the synthesizer expects.

This creates a chain of schemas:

User Query
    │
    ▼
┌─────────┐  schema: PlanOutput      ┌──────────┐  schema: ToolCallOutput
│ Planner │─────────────────────────▶│ Executor │─────────────────────────▶ ...
└─────────┘                          └──────────┘

Each schema is a typed interface between pipeline stages. When you change one schema, you must update both the producer (upstream model) and the consumer (downstream code or model). This is exactly the same contract discipline you would apply to a REST API or a message queue — but it happens inside a single agent's execution loop.

The benefit is significant: when every intermediate output is structured, you can inspect, validate, and log at every stage. Debugging a failed agent run becomes straightforward — you check each stage's output against its schema and find exactly where the contract broke.
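As a rough illustration of that debugging workflow, the sketch below walks a recorded trace and reports the first stage whose output breaks its contract. The stage names and the simplified contracts (required key mapped to expected Python type) are assumptions made for the example; a real pipeline would more likely validate against full JSON Schemas.

```python
from typing import Any, Optional

# Hypothetical stage contracts: required key -> expected Python type.
STAGE_CONTRACTS: dict[str, dict[str, type]] = {
    "planner": {"steps": list},
    "executor": {"action": str, "parameters": dict},
    "synthesizer": {"answer": str},
}


def contract_violations(stage: str, output: Any) -> list[str]:
    """List the ways output breaks the stage's contract (empty means it conforms)."""
    if not isinstance(output, dict):
        return [f"{stage}: expected an object, got {type(output).__name__}"]
    problems = []
    for key, expected_type in STAGE_CONTRACTS[stage].items():
        if key not in output:
            problems.append(f"{stage}: missing key {key!r}")
        elif not isinstance(output[key], expected_type):
            actual = type(output[key]).__name__
            problems.append(f"{stage}: {key!r} should be {expected_type.__name__}, got {actual}")
    return problems


def first_broken_stage(trace: list[tuple[str, Any]]) -> Optional[tuple[str, list[str]]]:
    """Return the first stage whose output violates its contract, or None."""
    for stage, output in trace:
        problems = contract_violations(stage, output)
        if problems:
            return stage, problems
    return None


recorded_trace = [
    ("planner", {"steps": ["look up weather", "summarize"]}),
    ("executor", {"action": "search", "params": {"q": "weather"}}),  # wrong key name
    ("synthesizer", {"answer": "Sunny."}),
]
print(first_broken_stage(recorded_trace))
```

Here the planner output conforms, and the executor stage is flagged because it emitted params instead of parameters, the same class of failure described at the start of this article.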

Conclusion

Structured output transforms an agent from a system that usually works into one that reliably works. Prompt-based formatting is just a starting point. Constrained decoding — whether through JSON mode, schema mode, or full grammars — provides syntactic guarantees that eliminate an entire class of runtime failures. Runtime validation adds the semantic layer that no grammar can express. Use all three together. Keep schemas versioned and treat them as first-class interfaces between pipeline stages. The upfront investment in structured output pays compound returns: fewer parse errors, easier debugging, safer tool execution, and the confidence to build complex multi-step pipelines on a foundation that will not crack under production load.