Simulation and Synthetic Environments

Published: 28 Jun 2026

You cannot responsibly ship an agent to production if the only way to test it is against real systems. Real systems are slow, expensive, fragile, and, worst of all, stateful. A coding agent that accidentally deletes a production branch during testing has taught you something, but the lesson costs too much. A customer-service agent that sends a real apology email to a real customer during a QA run is not a test; it is an incident.

Simulation solves this. You build a sandboxed world — a controlled replica of the environment your agent will operate in — and let the agent run thousands of trajectories inside it. The agent thinks it is calling real tools, reading real databases, talking to real users. But everything is synthetic: reproducible, resettable, and safe to break.

This is how you get from "works on the demo" to "works at scale."

The Challenge

Traditional software testing relies on unit tests, integration tests, and staging environments. These work because traditional software is largely deterministic: the same input produces the same output. Agents are different. They are stochastic, multi-step, and reactive. A small change in the model's sampling temperature, or simply a different random draw at the same temperature, can produce a completely different trajectory. An agent might take three steps or thirty to solve the same problem.

This creates three challenges that simulation addresses:

Coverage. You cannot enumerate all possible agent behaviors with handwritten test cases. A simulation lets you generate thousands of diverse scenarios and observe emergent behavior you would never have thought to test for.

Safety. Agents act on the world. Testing an agent that calls kubectl delete namespace or stripe.charges.create against real infrastructure is reckless. Simulation provides a blast radius of zero.

Reproducibility. When an agent fails in production, you need to replay the failure. A simulation environment with deterministic seeding lets you reproduce exact trajectories, vary one factor at a time, and isolate the root cause.

Simulation Environment Architecture

A simulation environment has four components: a world model, simulated tools, a scenario generator, and an evaluation harness.

┌─────────────────────────────────────────────────────┐
│                  Simulation Harness                 │
│                                                     │
│  ┌───────────┐    ┌──────────────┐   ┌───────────┐  │
│  │ Scenario  │──▶ │  Agent Under │──▶│Evaluation │  │
│  │ Generator │    │    Test      │   │  Scoring  │  │
│  └───────────┘    └──────┬───────┘   └───────────┘  │
│                          │                          │
│                          ▼                          │
│                 ┌──────────────────┐                │
│                 │  Simulated Tools │                │
│                 │  (Mock World)    │                │
│                 ├──────────────────┤                │
│                 │ • File system    │                │
│                 │ • Database       │                │
│                 │ • APIs           │                │
│                 │ • User responses │                │
│                 └──────────────────┘                │
│                          │                          │
│                          ▼                          │
│                 ┌──────────────────┐                │
│                 │   World State    │                │
│                 │   (Snapshot)     │                │
│                 └──────────────────┘                │
└─────────────────────────────────────────────────────┘

The World Model

The world model defines what the agent can observe and how the environment responds to the agent's actions. It tracks state that evolves over time: files on disk, rows in a database, the contents of an inbox, the position of a cursor on screen.

A good world model is faithful enough to expose the same failure modes the agent will encounter in production, but simple enough to run thousands of episodes without the latency and cost of real infrastructure.

import copy
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorldState:
    """Mutable state representing the simulated environment."""

    filesystem: dict[str, str] = field(default_factory=dict)
    database: dict[str, list[dict]] = field(default_factory=dict)
    api_responses: dict[str, Any] = field(default_factory=dict)
    messages_sent: list[dict] = field(default_factory=list)
    clock: float = 0.0

    def snapshot(self) -> "WorldState":
        """Create a deep copy for checkpoint/restore."""
        return copy.deepcopy(self)

    def reset(self, snapshot: "WorldState") -> None:
        """Restore to a previous state.

        The snapshot is copied again so that later mutations of the live
        world cannot leak back into the checkpoint.
        """
        self.__dict__.update(copy.deepcopy(snapshot.__dict__))
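A short usage sketch may make the checkpoint/restore cycle concrete. `MiniWorld` is a hypothetical, single-field stand-in for `WorldState` so the example runs on its own, and its `restore` method copies the checkpoint on the way back in; that copy is an assumption about the desired behavior (one checkpoint seeding many episodes), not something the original class spells out.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MiniWorld:
    """Hypothetical stand-in for WorldState with a single field."""

    filesystem: dict[str, str] = field(default_factory=dict)

    def snapshot(self) -> "MiniWorld":
        return copy.deepcopy(self)

    def restore(self, snapshot: "MiniWorld") -> None:
        # Copy again so the live world never shares objects with the checkpoint.
        self.__dict__.update(copy.deepcopy(snapshot.__dict__))


world = MiniWorld(filesystem={"main.py": "v1"})
checkpoint = world.snapshot()

world.filesystem["main.py"] = "broken"     # episode 1 mutates the world
world.restore(checkpoint)                  # roll back before episode 2
world.filesystem["notes.txt"] = "scratch"  # episode 2 mutates it again

# The checkpoint still holds the original state and can seed more episodes.
assert checkpoint.filesystem == {"main.py": "v1"}
```

Without the second deep copy, the restored world and the checkpoint would share the same dictionaries, and episode 2 would silently edit the checkpoint.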

Simulated Tools

Simulated tools implement the same interface as real tools but operate against the world model instead of real infrastructure. The agent cannot distinguish a simulated tool from a real one — same schema, same response format, same error conditions.

class SimulatedFileSystem:
    """Drop-in replacement for a file-system tool."""

    def __init__(self, world: WorldState):
        self.world = world

    def read_file(self, path: str) -> dict:
        if path not in self.world.filesystem:
            return {"error": "FileNotFoundError", "message": f"{path} not found"}
        return {"content": self.world.filesystem[path]}

    def write_file(self, path: str, content: str) -> dict:
        self.world.filesystem[path] = content
        # Report encoded size: len(content) would count characters, not bytes
        return {"status": "ok", "bytes_written": len(content.encode("utf-8"))}

    def delete_file(self, path: str) -> dict:
        if path not in self.world.filesystem:
            return {"error": "FileNotFoundError", "message": f"{path} not found"}
        del self.world.filesystem[path]
        return {"status": "ok"}


class SimulatedDatabase:
    """Drop-in replacement for a SQL execution tool."""

    def __init__(self, world: WorldState):
        self.world = world

    def execute_query(self, query: str) -> dict:
        # Simplified: parse intent rather than full SQL
        # In practice, use an in-memory SQLite or DuckDB
        table_name = self._extract_table(query)
        if table_name not in self.world.database:
            return {"error": "TableNotFound", "message": f"table {table_name!r} not found"}
        rows = self.world.database[table_name]
        return {"rows": rows, "row_count": len(rows)}

    def _extract_table(self, query: str) -> str:
        """Return the token after FROM, or "unknown" if there is none."""
        tokens = query.lower().split()
        if "from" in tokens:
            idx = tokens.index("from") + 1
            if idx < len(tokens):  # Guard against a query that ends in FROM
                # Strip trailing punctuation such as ";" or ","
                return tokens[idx].strip(";,()")
        return "unknown"

The key principle: simulated tools should fail the same way real tools fail. If the real API returns a 429 when rate-limited, the simulated version should too. If the real database throws a constraint violation, simulate that. Agents that have only ever seen the happy path will collapse on their first real error.
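As one way to make a simulated tool fail like the real one, the sketch below models a sliding-window rate limit that returns a 429-style response. The class name, the response fields, and the `limit` and `window` parameters are all hypothetical; a real API's limits and error payload should be copied from its documentation. The caller passes `now` explicitly so that the simulated clock, not wall time, drives the behavior.

```python
from dataclasses import dataclass, field


@dataclass
class RateLimitedApi:
    """Hypothetical simulated API allowing `limit` calls per `window` seconds."""

    limit: int = 3
    window: float = 1.0
    _call_times: list[float] = field(default_factory=list)

    def call(self, endpoint: str, now: float) -> dict:
        # Keep only the calls that are still inside the sliding window.
        self._call_times = [t for t in self._call_times if now - t < self.window]
        if len(self._call_times) >= self.limit:
            retry_after = self.window - (now - self._call_times[0])
            return {"status": 429, "error": "RateLimited", "retry_after": retry_after}
        self._call_times.append(now)
        return {"status": 200, "endpoint": endpoint}


api = RateLimitedApi(limit=2, window=1.0)
responses = [api.call("/orders", now=t) for t in (0.0, 0.1, 0.2, 1.5)]
statuses = [r["status"] for r in responses]
# The third call lands inside the window and is rejected; the fourth succeeds
# because both earlier calls have aged out.
```

An agent that retries immediately on the 429 will keep failing in this simulation, which is the behavior you want to surface before production does.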

Simulated Users

Many agents interact with users in a conversational loop — asking clarifying questions, presenting options, receiving feedback. Testing these agents requires a simulated user: another language model (or a scripted persona) that plays the role of the human.

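# call_model(prompt, model=..., temperature=...) is a placeholder for your
# provider's completion call; it is assumed to return the reply text.
@dataclass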
class SimulatedUser:
    """A synthetic user with a persona and a hidden goal."""

    persona: str  # "impatient power user", "confused novice", etc.
    hidden_goal: str  # What the user actually wants
    patience: int = 5  # Max turns before giving up
    ambiguity_level: float = 0.3  # How vague initial requests are

    def generate_initial_request(self, model: str) -> str:
        prompt = f"""You are simulating a user with this persona: {self.persona}
Your actual goal is: {self.hidden_goal}
Generate an initial request. Be somewhat vague (ambiguity level: {self.ambiguity_level}).
Do NOT state your goal explicitly — make the agent work to understand you."""
        return call_model(prompt, model=model, temperature=0.8)

    def respond_to_agent(self, agent_message: str, turn: int, model: str) -> str:
        if turn >= self.patience:
            return "Never mind, forget it."

        prompt = f"""You are simulating a user with persona: {self.persona}
Your hidden goal: {self.hidden_goal}
The agent just said: {agent_message}
Turn {turn}/{self.patience}. Respond naturally. If the agent is on track, cooperate.
If it misunderstood, correct it — but stay in character."""
        return call_model(prompt, model=model, temperature=0.6)

This lets you test how the agent handles ambiguity, frustration, contradictory instructions, and adversarial users — without involving real people.
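The scripted-persona alternative mentioned above needs no model calls at all, which makes it cheap and fully deterministic. The following is a minimal sketch under that assumption; `ScriptedUser`, its keyword-matching rule, and the sample replies are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class ScriptedUser:
    """Hypothetical rule-based user that replies from a fixed script."""

    opening: str
    clarifications: dict[str, str]  # keyword in agent message -> scripted reply
    fallback: str = "I'm not sure what you mean."
    patience: int = 5

    def initial_request(self) -> str:
        return self.opening

    def respond(self, agent_message: str, turn: int) -> str:
        if turn >= self.patience:
            return "Never mind, forget it."
        lowered = agent_message.lower()
        # First matching keyword wins; dict order defines priority.
        for keyword, reply in self.clarifications.items():
            if keyword in lowered:
                return reply
        return self.fallback


user = ScriptedUser(
    opening="My order is wrong.",
    clarifications={
        "order number": "It's 12345.",
        "refund": "Yes, a refund would be fine.",
    },
    patience=3,
)
reply = user.respond("Could you share your order number?", turn=1)
```

Scripted users suit regression suites, where identical replies across runs matter more than realism; model-backed users suit exploratory testing.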

Scenario Generation

A simulation is only as good as its scenarios. Hand-crafted scenarios test known risks. Generated scenarios discover unknown ones.

Template-Based Generation

Start with parameterized templates that cover your core use cases, then randomize the parameters.

import random


@dataclass
class Scenario:
    name: str
    description: str
    initial_world_state: WorldState
    user: SimulatedUser
    success_criteria: list[str]
    max_steps: int = 20


def generate_file_editing_scenarios(n: int = 100, seed: int = 0) -> list[Scenario]:
    """Generate diverse file-editing scenarios (reproducible for a given seed)."""
    # Map each language to its source-file extension
    extensions = {"python": "py", "typescript": "ts", "rust": "rs", "go": "go"}
    languages = list(extensions)
    bug_types = ["off_by_one", "null_reference", "missing_import",
                 "wrong_variable", "infinite_loop", "race_condition"]
    file_sizes = ["small", "medium", "large"]
    rng = random.Random(seed)  # Local RNG so a scenario set can be regenerated exactly

    scenarios = []
    for i in range(n):
        lang = rng.choice(languages)
        bug = rng.choice(bug_types)
        size = rng.choice(file_sizes)

        # generate_buggy_codebase is a helper defined elsewhere
        world = WorldState(
            filesystem=generate_buggy_codebase(lang, bug, size)
        )

        scenario = Scenario(
            name=f"fix_{bug}_in_{lang}_{i}",
            description=f"Fix a {bug} bug in a {size} {lang} codebase",
            initial_world_state=world,
            user=SimulatedUser(
                persona="developer",
                hidden_goal=f"Fix the {bug} bug in main.{extensions[lang]}",
                patience=8,
                ambiguity_level=0.2,
            ),
            success_criteria=[
                "bug_is_fixed",
                "no_new_bugs_introduced",
                "tests_pass",
            ],
        )
        scenarios.append(scenario)

    return scenarios

Adversarial Scenario Generation

Use a separate model to generate scenarios that are specifically designed to break the agent. This is red-teaming through simulation — the scenario generator acts as an adversary.

def generate_adversarial_scenarios(
    agent_weaknesses: list[str],
    n: int = 50,
    model: str = "gpt-4o",
) -> list[dict]:
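    """Generate scenarios targeting known agent weaknesses."""
    # Render the weaknesses as a bulleted list rather than a Python list repr
    weakness_list = "\n".join(f"- {w}" for w in agent_weaknesses)
    prompt = f"""Generate {n} test scenarios for an AI agent.
Each scenario should target one of these known weaknesses:
{weakness_list}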

For each scenario, provide:
- A description of the setup
- What makes it tricky for the agent
- What the correct behavior should be
- A plausible wrong behavior the agent might exhibit

Make scenarios realistic — things that could happen in production."""

    result = call_model(prompt, model=model, temperature=0.9)
    return parse_scenarios(result)

Curriculum Design

Not all scenarios are equally useful for improving an agent. Borrowing from reinforcement learning, you can design a curriculum — starting with easy scenarios and progressively increasing difficulty as the agent improves.

@dataclass
class Curriculum:
    """Progressive difficulty scheduling for simulation."""

    levels: list[dict]  # Each level has scenarios + pass criteria
    current_level: int = 0

    def advance_if_ready(self, pass_rate: float) -> bool:
        """Move to next level if agent passes current level."""
        threshold = self.levels[self.current_level].get("advance_threshold", 0.8)
        if pass_rate >= threshold and self.current_level < len(self.levels) - 1:
            self.current_level += 1
            return True
        return False

    def get_current_scenarios(self) -> list[Scenario]:
        return self.levels[self.current_level]["scenarios"]


# Example curriculum for a customer-service agent
support_curriculum = Curriculum(levels=[
    {
        "name": "basic_queries",
        "scenarios": generate_simple_faq_scenarios(50),
        "advance_threshold": 0.9,
    },
    {
        "name": "multi_step_resolution",
        "scenarios": generate_multi_step_scenarios(50),
        "advance_threshold": 0.8,
    },
    {
        "name": "angry_customers",
        "scenarios": generate_emotional_scenarios(50),
        "advance_threshold": 0.7,
    },
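    {
        "name": "adversarial_edge_cases",
        # Note: generate_adversarial_scenarios returns raw dicts; convert them
        # to Scenario objects before handing this level to the harness.
        "scenarios": generate_adversarial_scenarios(
            ["prompt injection", "out-of-scope requests", "contradictions"]
        ),
        "advance_threshold": 0.6,
    },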
])
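To show how the advancement rule plays out over several evaluation rounds, here is a stripped-down sketch that keeps only the thresholds and a list of measured pass rates. The function name and the pass rates are made up for illustration; the thresholds match the example curriculum above.

```python
def run_curriculum(thresholds: list[float], pass_rates: list[float]) -> list[int]:
    """Return the level in effect after each evaluation round.

    thresholds[i] is the pass rate needed to leave level i; pass_rates holds
    one measured pass rate per round.
    """
    level = 0
    last_level = len(thresholds) - 1
    history = []
    for rate in pass_rates:
        # Advance at most one level per round, and never past the last level.
        if rate >= thresholds[level] and level < last_level:
            level += 1
        history.append(level)
    return history


history = run_curriculum(
    thresholds=[0.9, 0.8, 0.7, 0.6],
    pass_rates=[0.85, 0.92, 0.75, 0.81, 0.70],
)
# Round 1 misses the 0.9 bar, round 2 clears it, round 3 misses 0.8,
# round 4 clears it, and round 5 clears 0.7 to reach the final level.
```

Note that the rule never demotes: an agent that regresses on an earlier level stays where it is, so rerunning earlier levels periodically is a sensible complement.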

Digital Twins

A digital twin is a high-fidelity simulation of a specific real-world system — a precise replica of your production environment, kept in sync with the real thing. Where a simulated tool is a sketch, a digital twin is a photograph.

Digital twins are expensive to build and maintain, but they provide the highest confidence that simulation results will transfer to production.

Building a Digital Twin

Build a digital twin when:

The agent operates on complex, stateful infrastructure (Kubernetes clusters, CI/CD pipelines, multi-service architectures)
Failures in production are expensive or dangerous (financial transactions, medical systems, physical hardware)
You need to replay production incidents exactly to diagnose root causes
Regulatory requirements demand evidence of testing against production-equivalent systems

Do not build a digital twin when:

Simple mocks are sufficient (the agent reads files and writes code)
The environment changes faster than you can sync the twin
The cost of building the twin exceeds the cost of occasional production failures

Keeping the Twin in Sync

A digital twin that drifts from production is worse than no twin at all — it gives you false confidence. Sync strategies include:

Snapshot replication — periodically dump production state into the twin (databases, config, file systems)
Event sourcing — replay the production event stream into the twin to arrive at the same state
Schema-driven generation — use production schemas and data distributions to generate realistic synthetic data without copying real data (useful for privacy compliance)

import time


class DigitalTwin:
    """A synchronized replica of a production environment."""

    def __init__(self, sync_source: str, sync_interval_seconds: int = 3600):
        self.sync_source = sync_source
        self.sync_interval = sync_interval_seconds
        self.last_sync: float = 0.0
        self.world = WorldState()

    def sync(self) -> None:
        """Pull latest state from production (sanitized)."""
        raw_state = fetch_production_snapshot(self.sync_source)
        sanitized = self._strip_pii(raw_state)
        self._apply_state(sanitized)
        self.last_sync = time.time()

    def is_stale(self) -> bool:
        return (time.time() - self.last_sync) > self.sync_interval

    def _strip_pii(self, state: dict) -> dict:
        """Remove personally identifiable information."""
        # Replace real emails, names, etc. with synthetic equivalents
        return anonymize(state)

    def _apply_state(self, state: dict) -> None:
        self.world.filesystem = state.get("files", {})
        self.world.database = state.get("tables", {})
        self.world.api_responses = state.get("api_mocks", {})
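The class above follows the snapshot-replication strategy. The event-sourcing strategy from the same list can be sketched as a replay loop: start from an empty state and apply every production event in order. The event shape used here (`type`, `table`, `key`, `value`) is hypothetical; a real event stream will have its own schema.

```python
from typing import Any


def apply_event(state: dict[str, dict[str, Any]], event: dict[str, Any]) -> None:
    """Apply one event to the twin's tables in place (hypothetical event shape)."""
    table = state.setdefault(event["table"], {})
    if event["type"] == "upsert":
        table[event["key"]] = event["value"]
    elif event["type"] == "delete":
        table.pop(event["key"], None)
    else:
        raise ValueError(f"unknown event type: {event['type']}")


def replay(events: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Rebuild state from scratch by applying events in order."""
    state: dict[str, dict[str, Any]] = {}
    for event in events:
        apply_event(state, event)
    return state


events = [
    {"type": "upsert", "table": "orders", "key": "o1", "value": {"status": "new"}},
    {"type": "upsert", "table": "orders", "key": "o2", "value": {"status": "new"}},
    {"type": "upsert", "table": "orders", "key": "o1", "value": {"status": "shipped"}},
    {"type": "delete", "table": "orders", "key": "o2"},
]
twin_state = replay(events)       # state after the full stream
earlier_state = replay(events[:2])  # state as of the second event
```

Replaying a prefix of the stream reconstructs the world as it stood at that moment, which is what makes event sourcing useful for reproducing a production incident.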

Reward Signal Design

Once the agent runs in simulation, you need to measure how well it did. This is the reward signal — a function that takes a trajectory (the sequence of observations, thoughts, and actions) and produces a score.

Designing good reward signals is surprisingly hard. A naive reward can produce agents that game the metric rather than solve the problem.

Components of a Reward Function

A robust reward function typically combines multiple signals:

@dataclass
class TrajectoryReward:
    """Multi-dimensional reward for an agent trajectory."""

    task_success: float  # Did the agent achieve the goal? (0 or 1)
    efficiency: float  # How many steps? (0-1, fewer steps scores higher)
    safety: float  # Did the agent violate any constraints? (1 = safe)
    quality: float  # How good is the output? (0-1, judged by evaluator)
    cost: float  # Token/API cost as a fraction of the episode budget (0-1)

    @property
    def composite_score(self) -> float:
        """Weighted combination. Weights are domain-specific."""
        if self.safety < 1.0:
            return 0.0  # Safety violation trumps everything

        # Clamp so an over-budget episode cannot push the cost term negative
        normalized_cost = min(max(self.cost, 0.0), 1.0)
        return (
            0.5 * self.task_success
            + 0.2 * self.quality
            + 0.2 * self.efficiency
            + 0.1 * (1.0 - normalized_cost)
        )


def compute_efficiency_score(steps_taken: int, max_steps: int) -> float:
    """Reward fewer steps. Linear decay."""
    if steps_taken >= max_steps:
        return 0.0
    return 1.0 - (steps_taken / max_steps)
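A worked example may help in reading the composite score. The functions below restate the weights from the class above in standalone form, and the input values are made up; `cost` is assumed to be already expressed as a fraction of the episode budget.

```python
def efficiency_score(steps_taken: int, max_steps: int) -> float:
    """Linear decay: fewer steps score higher."""
    if steps_taken >= max_steps:
        return 0.0
    return 1.0 - steps_taken / max_steps


def composite_score(
    task_success: float,
    quality: float,
    efficiency: float,
    cost: float,  # assumed to be a 0-1 fraction of the budget
    safety: float = 1.0,
) -> float:
    if safety < 1.0:
        return 0.0  # any safety violation zeroes the score
    return 0.5 * task_success + 0.2 * quality + 0.2 * efficiency + 0.1 * (1.0 - cost)


# A successful episode that used 8 of 20 steps and a quarter of its budget:
efficiency = efficiency_score(steps_taken=8, max_steps=20)  # 1 - 8/20 = 0.6
score = composite_score(task_success=1.0, quality=0.8, efficiency=efficiency, cost=0.25)
# 0.5*1.0 + 0.2*0.8 + 0.2*0.6 + 0.1*0.75 = 0.855

# The same episode with a safety violation scores zero regardless of the rest:
unsafe = composite_score(1.0, 0.8, efficiency, 0.25, safety=0.0)
```

Because task success carries half the weight, a failed episode with perfect quality, efficiency, and cost still tops out at 0.5.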

Reward Shaping Pitfalls

Sparse rewards — giving the agent a signal only at the end (success/failure) — make learning slow. The agent takes twenty steps and receives no feedback until the final evaluation. If it rarely succeeds, it rarely learns.

Dense rewards — giving feedback at every step — accelerate learning but risk reward hacking. If you reward the agent for "making progress" (e.g., editing a file), it might edit files unnecessarily to collect the reward.

The practical compromise: use sparse terminal rewards for correctness, but add shaping bonuses for intermediate milestones that you are confident correlate with success.

def shaped_reward(trajectory: list[dict], goal: str) -> float:
    """Combine terminal reward with intermediate shaping."""
    if not trajectory:
        return 0.0  # No steps were taken, so there is nothing to reward
    terminal = evaluate_goal_completion(trajectory[-1]["world_state"], goal)

    # Shaping: reward milestones without rewarding busywork
    milestones_hit = 0
    expected_milestones = get_milestones_for_goal(goal)
    for milestone in expected_milestones:
        if any(milestone_achieved(step, milestone) for step in trajectory):
            milestones_hit += 1

    shaping_bonus = 0.1 * (milestones_hit / max(len(expected_milestones), 1))

    return terminal + shaping_bonus

LLM-as-Judge

For tasks where success is subjective (writing quality, user satisfaction, code elegance), you can use a separate model as a judge. This is cheaper than human evaluation and scales to thousands of trajectories.

def llm_judge_reward(
    trajectory: list[dict],
    task_description: str,
    model: str = "gpt-4o",
) -> float:
    """Use a language model to score trajectory quality."""
    prompt = f"""You are evaluating an AI agent's performance.

Task: {task_description}

Agent trajectory (actions taken):
{format_trajectory(trajectory)}

Final state:
{trajectory[-1]["world_state"]}

Score the agent on a scale of 0.0 to 1.0:
- 1.0: Perfect execution. Goal achieved efficiently and safely.
- 0.7-0.9: Goal achieved with minor issues (extra steps, suboptimal approach).
- 0.4-0.6: Partial success. Some progress but goal not fully met.
- 0.1-0.3: Mostly failed. Minimal useful progress.
- 0.0: Complete failure or caused harm.

Provide your score as a single number:"""

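    result = call_model(prompt, model=model, temperature=0.0)
    try:
        score = float(result.strip())
    except ValueError:
        # Unparseable reply: scored as a failure here. Log these cases so that
        # judge formatting errors are not mistaken for agent failures.
        return 0.0
    return min(max(score, 0.0), 1.0)  # Clamp to the documented 0.0-1.0 range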

The risk of LLM-as-judge: the judge model may have the same blind spots as the agent model. Mitigate this by using a different model family for judging, or by periodically calibrating the judge against human evaluations.

Trajectory Generation and Collection

A trajectory is the complete record of one agent episode: the initial state, every observation, thought, action, and the final outcome. Trajectories are the raw material for evaluation, debugging, and — when you move to fine-tuning — training.

Trajectory Schema

from dataclasses import dataclass
from typing import Literal


@dataclass
class Step:
    step_number: int
    observation: str  # What the agent saw
    thought: str | None  # Chain-of-thought (if exposed)
    action: dict  # Tool call or response
    result: str  # Tool output or environment response
    world_state_hash: str  # For reproducibility


@dataclass
class Trajectory:
    trajectory_id: str
    scenario: Scenario
    steps: list[Step]
    outcome: Literal["success", "failure", "timeout", "safety_violation"]
    reward: TrajectoryReward
    total_tokens: int
    wall_clock_seconds: float
    model: str
    temperature: float
    seed: int | None  # For reproducibility
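The `world_state_hash` field needs a hash that is stable across runs and processes. One possible sketch, assuming the world state can be represented as JSON-compatible data (for example via `dataclasses.asdict`), is to hash a canonical JSON encoding. The function below is an illustration of that idea, not the article's own `hash_state`.

```python
import hashlib
import json
from typing import Any


def hash_state(state: dict[str, Any]) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the state."""
    # sort_keys makes the encoding independent of dict insertion order;
    # default=str is a fallback for values json cannot encode natively.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


state_a = {"filesystem": {"a.py": "x = 1", "b.py": "y = 2"}, "clock": 3.0}
state_b = {"clock": 3.0, "filesystem": {"b.py": "y = 2", "a.py": "x = 1"}}
state_c = {"filesystem": {"a.py": "x = 1", "b.py": "y = 3"}, "clock": 3.0}

same = hash_state(state_a) == hash_state(state_b)       # key order is ignored
different = hash_state(state_a) != hash_state(state_c)  # content changes the hash
```

Python's built-in `hash()` is unsuitable here because string hashing is randomized per process by default, so digests would not match between a baseline run and a replay.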

Batch Trajectory Collection

Running simulations at scale means running hundreds or thousands of episodes in parallel. The harness manages scenario distribution, resource limits, and result aggregation.

import asyncio
import time


async def collect_trajectories(
    agent,
    scenarios: list[Scenario],
    max_parallel: int = 10,
) -> list[Trajectory]:
    """Run agent through scenarios and collect trajectories."""
    # Cap the number of episodes that run concurrently
    semaphore = asyncio.Semaphore(max_parallel)

    async def run_one(scenario: Scenario) -> Trajectory:
        async with semaphore:
            started = time.monotonic()
            # Each episode works on its own copy of the initial state
            world = scenario.initial_world_state.snapshot()
            tools = build_simulated_tools(world)
            steps = []

            for step_num in range(scenario.max_steps):
                observation = get_observation(world, scenario)
                thought, action = await agent.act(observation, tools)
                result = execute_simulated_action(action, tools, world)

                steps.append(Step(
                    step_number=step_num,
                    observation=observation,
                    thought=thought,
                    action=action,
                    result=result,
                    world_state_hash=hash_state(world),
                ))

                if is_terminal(world, scenario):
                    break

            reward = compute_reward(steps, scenario)
            outcome = classify_outcome(steps, scenario, reward)

            return Trajectory(
                trajectory_id=generate_id(),
                scenario=scenario,
                steps=steps,
                outcome=outcome,
                reward=reward,
                # Assumes the agent tracks tokens per episode; a counter shared
                # across concurrent episodes would over-count here.
                total_tokens=agent.token_count,
                wall_clock_seconds=time.monotonic() - started,
                model=agent.model,
                temperature=agent.temperature,
                seed=agent.seed,
            )

    tasks = [run_one(s) for s in scenarios]
    # gather preserves input order, so results line up with scenarios
    return list(await asyncio.gather(*tasks))

Trajectories for Fine-Tuning

Collected trajectories with high reward scores become training data for distillation. You filter for successful trajectories, format them as input-output pairs, and fine-tune a smaller model to replicate the behavior of the larger agent.

def trajectories_to_training_data(
    trajectories: list[Trajectory],
    min_reward: float = 0.8,
) -> list[dict]:
    """Convert high-quality trajectories into fine-tuning examples."""
    training_examples = []

    for traj in trajectories:
        if traj.reward.composite_score < min_reward:
            continue
        if traj.outcome != "success":
            continue

        for step in traj.steps:
            if step.thought is None:
                continue

            example = {
                "messages": [
                    {"role": "system", "content": "You are a helpful agent."},
                    {"role": "user", "content": step.observation},
                    {"role": "assistant", "content": format_thought_and_action(
                        step.thought, step.action
                    )},
                ],
            }
            training_examples.append(example)

    return training_examples

This closes the loop: simulation generates trajectories, trajectories are scored, high-scoring trajectories become training data, and the fine-tuned model is evaluated in the same simulation to measure improvement.

Reproducibility and Determinism

Simulations are only useful for debugging if you can reproduce them exactly. Non-determinism in agent execution comes from three sources: model sampling, tool execution order, and clock/random state.

Controlling Randomness

@dataclass
class SimulationConfig:
    """Configuration for reproducible simulation runs."""

    seed: int
    model: str
    temperature: float = 0.0  # Deterministic sampling
    max_steps: int = 50
    timeout_seconds: float = 300.0

    def apply(self) -> None:
        """Set all random seeds for reproducibility."""
        import random
        import numpy as np

        random.seed(self.seed)
        np.random.seed(self.seed)
        # Note: model-side seeds depend on the provider.
        # OpenAI's API accepts a `seed` parameter, but treats determinism as
        # best-effort; other providers may not offer one at all.

Setting temperature=0.0 and providing a fixed seed gets you close to determinism with most providers, but few guarantee it. Floating-point arithmetic can differ across hardware and batch sizes, producing slightly different logits that occasionally flip a token choice, and one flipped token early in a trajectory can change every step after it. For strict reproducibility, log the full trajectory and diff against the baseline rather than relying on bit-for-bit output matching.

Trajectory Diffing

When a code change causes a previously-passing scenario to fail, you need to find exactly where the trajectories diverge.

def diff_trajectories(baseline: Trajectory, current: Trajectory) -> dict:
    """Find the first point of divergence between two trajectories."""
    for i, (base_step, curr_step) in enumerate(
        zip(baseline.steps, current.steps)
    ):
        if base_step.action != curr_step.action:
            return {
                "divergence_step": i,
                "baseline_action": base_step.action,
                "current_action": curr_step.action,
                "observation_at_divergence": curr_step.observation,
                "baseline_thought": base_step.thought,
                "current_thought": curr_step.thought,
            }

    if len(baseline.steps) != len(current.steps):
        return {
            "divergence_step": min(len(baseline.steps), len(current.steps)),
            "reason": "trajectory_length_mismatch",
            "baseline_length": len(baseline.steps),
            "current_length": len(current.steps),
        }

    return {"divergence_step": None, "reason": "trajectories_identical"}

Trade-Offs and Practical Considerations

Simulation Fidelity vs. Speed

Higher fidelity means the simulation is more likely to predict real-world behavior, but it runs slower and costs more. Lower fidelity runs fast but may miss failure modes that only emerge in the real environment.

        Fidelity Spectrum

  Low                              High
  ├──────────┼──────────┼──────────┤
  │          │          │          │
  Mocks    Emulators  Digital    Production
  & Stubs             Twins      Replay

  Fast ◄─────────────────────────► Slow
  Cheap ◄────────────────────────► Expensive
  Gaps ◄─────────────────────────► Faithful

The practical approach: use low-fidelity simulation for rapid iteration (thousands of runs per hour), then validate on high-fidelity digital twins before shipping (tens of runs per hour). Run against production replay as the final gate (single-digit runs per hour, read-only).

The Sim-to-Real Gap

The biggest risk of simulation is overfitting to the simulator. An agent that scores perfectly in simulation but fails in production has learned to exploit the simulation's shortcuts rather than solving the real problem. Common sources of sim-to-real gap:

Missing failure modes. The simulator does not model network timeouts, partial writes, or race conditions that happen in production.
Unrealistic data distributions. Synthetic data is too clean, too uniform, or lacks the edge cases that real data contains.
Perfect tool behavior. Simulated tools always respond instantly and format their output consistently. Real tools are flaky.
Simulated users are too cooperative. Real users interrupt, change their mind, give contradictory instructions, and paste huge walls of text.

Mitigate the sim-to-real gap by:

Injecting noise and failures into the simulation (random latency, dropped connections, malformed responses)
Regularly validating simulation results against production traces
Using production replay as a ground-truth calibration for your simulator
Running canary simulations — scenarios derived directly from recent production incidents
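The first mitigation, injecting failures, can be sketched as a wrapper around any simulated tool. The wrapper name, the failure kinds, and the response shape are hypothetical. The wrapper takes its own seeded `random.Random` so that the injected failures are themselves reproducible.

```python
import random
from typing import Any, Callable


def with_injected_failures(
    tool: Callable[..., dict],
    failure_rate: float,
    rng: random.Random,
) -> Callable[..., dict]:
    """Wrap a tool so that a fraction of calls return a synthetic failure."""

    def wrapped(*args: Any, **kwargs: Any) -> dict:
        if rng.random() < failure_rate:
            kind = rng.choice(["timeout", "connection_reset", "malformed_response"])
            return {"error": kind, "injected": True}
        return tool(*args, **kwargs)

    return wrapped


def read_file(path: str) -> dict:
    """Stand-in for a simulated tool that always succeeds."""
    return {"content": f"contents of {path}"}


def run(seed: int, failure_rate: float = 0.3) -> list[dict]:
    flaky = with_injected_failures(read_file, failure_rate, random.Random(seed))
    return [flaky("a.txt") for _ in range(20)]


first = run(seed=7)
second = run(seed=7)  # same seed, same failures in the same positions
```

Tagging injected failures with a marker field such as `injected` makes it possible to separate them from genuine simulator bugs when reading trajectories; strip the marker from what the agent sees if it could leak that the environment is simulated.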

Cost Budgeting

Simulation at scale consumes inference tokens. A single trajectory of 20 steps with a large model might cost $0.50-$2.00. At that rate, a thousand-trajectory regression suite costs $500-$2,000 per run. Budget strategies:

Use smaller, cheaper models for high-volume exploratory simulation; reserve the full model for final validation
Cache deterministic tool outputs across runs (if the world state is identical, the tool output is identical)
Truncate trajectories early when it is clear the agent has failed (no need to run all 50 steps if it is stuck in a loop by step 10)
Run curriculum levels in order — do not waste expensive scenarios on an agent that cannot pass easy ones
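Early truncation needs some test for "clearly failed". One simple heuristic, sketched below, is to stop when the agent repeats the same action several times in a row. The function name and the window of three are assumptions; real agents can also loop over longer cycles that this check would miss.

```python
import json


def is_stuck(actions: list[dict], window: int = 3) -> bool:
    """Return True if the last `window` actions are identical."""
    if len(actions) < window:
        return False
    # Canonical JSON makes dicts comparable regardless of key order.
    recent = {json.dumps(a, sort_keys=True) for a in actions[-window:]}
    return len(recent) == 1


read = {"tool": "read_file", "args": {"path": "main.py"}}
write = {"tool": "write_file", "args": {"path": "main.py", "content": "x"}}

looping = is_stuck([write, read, read, read])      # three identical reads in a row
progressing = is_stuck([read, write, read, write])  # alternating actions
```

Inside an episode loop, a check like this would run after each step and break out with a "timeout" or "failure" outcome, saving the tokens the remaining steps would have cost.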

Conclusion

Simulation transforms agent development from guesswork to engineering. Instead of crossing your fingers and deploying, you run thousands of scenarios in a controlled environment, measure outcomes precisely, and iterate with confidence.

The key takeaways:

Build simulated tools that match your real tools' interfaces and failure modes — not just their happy paths.
Use simulated users to test conversational agents without involving real people or risking real-world side effects.
Design reward signals carefully. Combine task success, efficiency, safety, and quality. Beware reward hacking.
Generate scenarios at multiple fidelity levels: templates for coverage, adversarial generation for robustness, and production replay for ground truth.
Collect trajectories as structured data. They serve evaluation, debugging, and fine-tuning simultaneously.
Mind the sim-to-real gap. A perfect simulation score means nothing if the simulator does not capture the failure modes that matter in production.
Invest in reproducibility. Fixed seeds, trajectory logging, and diffing tools let you diagnose regressions quickly.

Simulation is not a substitute for production monitoring — agents will always encounter situations the simulator did not predict. But it dramatically reduces the set of preventable failures, and it gives you the trajectory data you need to keep improving.