Simulation and Synthetic Environments
You cannot get an agent ready for production by testing it only against real systems. Real systems are slow, expensive, fragile, and, worst of all, stateful. A coding agent that accidentally deletes a production branch during testing has taught you something, but the lesson costs too much. A customer-service agent that sends a real apology email to a real customer during a QA run is not a test; it is an incident.
Simulation solves this. You build a sandboxed world — a controlled replica of the environment your agent will operate in — and let the agent run thousands of trajectories inside it. The agent thinks it is calling real tools, reading real databases, talking to real users. But everything is synthetic: reproducible, resettable, and safe to break.
This is how you get from "works on the demo" to "works at scale."
The Challenge
Traditional software testing relies on unit tests, integration tests, and staging environments. These work because traditional software is largely deterministic: the same input produces the same output. Agents are different. They are stochastic, multi-step, and reactive. A small change in sampling temperature, or a different random draw at the same temperature, can produce a completely different trajectory. An agent might take three steps or thirty to solve the same problem.
This creates three challenges that simulation addresses:
Coverage. You cannot enumerate all possible agent behaviors with handwritten test cases. A simulation lets you generate thousands of diverse scenarios and observe emergent behavior you would never have thought to test for.
Safety. Agents act on the world. Testing an agent that calls kubectl delete namespace or stripe.charges.create against real infrastructure is reckless. Simulation provides a blast radius of zero.
Reproducibility. When an agent fails in production, you need to replay the failure. A simulation environment with deterministic seeding lets you reproduce exact trajectories, vary one factor at a time, and isolate the root cause.
Simulation Environment Architecture
A simulation environment has four components: a world model, simulated tools, a scenario generator, and an evaluation harness.
┌──────────────────────────────────────────────────────┐
│                  Simulation Harness                  │
│                                                      │
│  ┌───────────┐    ┌──────────────┐    ┌───────────┐  │
│  │ Scenario  │───▶│ Agent Under  │───▶│Evaluation │  │
│  │ Generator │    │     Test     │    │  Scoring  │  │
│  └───────────┘    └──────┬───────┘    └───────────┘  │
│                          │                           │
│                          ▼                           │
│                 ┌──────────────────┐                 │
│                 │ Simulated Tools  │                 │
│                 │   (Mock World)   │                 │
│                 ├──────────────────┤                 │
│                 │ • File system    │                 │
│                 │ • Database       │                 │
│                 │ • APIs           │                 │
│                 │ • User responses │                 │
│                 └──────────────────┘                 │
│                          │                           │
│                          ▼                           │
│                 ┌──────────────────┐                 │
│                 │   World State    │                 │
│                 │    (Snapshot)    │                 │
│                 └──────────────────┘                 │
└──────────────────────────────────────────────────────┘
The World Model
The world model defines what the agent can observe and how the environment responds to the agent's actions. It tracks state that evolves over time: files on disk, rows in a database, the contents of an inbox, the position of a cursor on screen.
A good world model is faithful enough to expose the same failure modes the agent will encounter in production, but simple enough to run thousands of episodes without the latency and cost of real infrastructure.
import copy
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorldState:
    """Mutable state representing the simulated environment."""
    filesystem: dict[str, str] = field(default_factory=dict)
    database: dict[str, list[dict]] = field(default_factory=dict)
    api_responses: dict[str, Any] = field(default_factory=dict)
    messages_sent: list[dict] = field(default_factory=list)
    clock: float = 0.0

    def snapshot(self) -> "WorldState":
        """Create a deep copy for checkpoint/restore."""
        return copy.deepcopy(self)

    def reset(self, snapshot: "WorldState") -> None:
        """Restore to a previous state."""
        # Deep-copy on restore so later mutations cannot corrupt the snapshot.
        self.__dict__.update(copy.deepcopy(snapshot.__dict__))
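The checkpoint/restore cycle is easiest to see on a tiny example. The sketch below uses a hypothetical `MiniWorld` stand-in with a single field (it is not the `WorldState` class itself) and assumes that restoring should copy the snapshot, so that one checkpoint can be reused across many episodes.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MiniWorld:
    """Hypothetical stand-in for WorldState with a single piece of state."""
    filesystem: dict[str, str] = field(default_factory=dict)

    def snapshot(self) -> "MiniWorld":
        return copy.deepcopy(self)

    def reset(self, snapshot: "MiniWorld") -> None:
        # Copy again on restore so later mutations never leak into the checkpoint.
        self.__dict__.update(copy.deepcopy(snapshot.__dict__))


world = MiniWorld(filesystem={"README.md": "hello"})
checkpoint = world.snapshot()

# Episode 1 breaks things, then the world is restored.
del world.filesystem["README.md"]
world.reset(checkpoint)

# Episode 2 mutates the restored state; the checkpoint must stay pristine.
world.filesystem["README.md"] = "changed"
world.reset(checkpoint)

print(world.filesystem)       # {'README.md': 'hello'}
print(checkpoint.filesystem)  # {'README.md': 'hello'}
```

If the restore step shared dictionaries with the snapshot instead of copying them, the second episode's write would silently alter the checkpoint, and every later reset would start from a corrupted baseline.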
Simulated Tools
Simulated tools implement the same interface as real tools but operate against the world model instead of real infrastructure. The agent should not be able to distinguish a simulated tool from a real one: same schema, same response format, same error conditions.
class SimulatedFileSystem:
    """Drop-in replacement for a file-system tool."""

    def __init__(self, world: WorldState):
        self.world = world

    def read_file(self, path: str) -> dict:
        if path not in self.world.filesystem:
            return {"error": "FileNotFoundError", "message": f"{path} not found"}
        return {"content": self.world.filesystem[path]}

    def write_file(self, path: str, content: str) -> dict:
        self.world.filesystem[path] = content
        # Count encoded bytes, not characters, to match a real file system.
        return {"status": "ok", "bytes_written": len(content.encode("utf-8"))}

    def delete_file(self, path: str) -> dict:
        if path not in self.world.filesystem:
            return {"error": "FileNotFoundError", "message": f"{path} not found"}
        del self.world.filesystem[path]
        return {"status": "ok"}


class SimulatedDatabase:
    """Drop-in replacement for a SQL execution tool."""

    def __init__(self, world: WorldState):
        self.world = world

    def execute_query(self, query: str) -> dict:
        # Simplified: parse intent rather than full SQL.
        # In practice, use an in-memory SQLite or DuckDB.
        table_name = self._extract_table(query)
        if table_name not in self.world.database:
            return {"error": "TableNotFound", "message": f"{table_name} not found"}
        rows = self.world.database[table_name]
        return {"rows": rows, "row_count": len(rows)}

    def _extract_table(self, query: str) -> str:
        tokens = query.lower().split()
        # Guard against a trailing FROM with no table name after it.
        if "from" in tokens[:-1]:
            return tokens[tokens.index("from") + 1].rstrip(";")
        return "unknown"
The key principle: simulated tools should fail the same way real tools fail. If the real API returns a 429 when rate-limited, the simulated version should too. If the real database throws a constraint violation, simulate that. Agents that have only ever seen the happy path will collapse on their first real error.
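As an illustration of mirroring a real failure mode, here is a minimal sketch of a simulated API that returns a 429 once a quota is spent. The class name, the sliding-window policy, and the response fields are all assumptions made for this example; a real simulator should copy whatever limits and error shapes the production API actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedRateLimitedAPI:
    """Hypothetical simulated API that returns 429 once a per-window quota is spent."""
    max_calls_per_window: int = 3
    window_seconds: float = 60.0
    _call_times: list[float] = field(default_factory=list)

    def call(self, endpoint: str, now: float) -> dict:
        # `now` is the simulated clock, passed in so runs stay deterministic.
        window_start = now - self.window_seconds
        self._call_times = [t for t in self._call_times if t > window_start]
        if len(self._call_times) >= self.max_calls_per_window:
            retry_after = self._call_times[0] + self.window_seconds - now
            return {"status": 429, "error": "rate_limited", "retry_after": retry_after}
        self._call_times.append(now)
        return {"status": 200, "endpoint": endpoint}


api = SimulatedRateLimitedAPI(max_calls_per_window=2, window_seconds=10.0)
responses = [api.call("/orders", now=t) for t in (0.0, 1.0, 2.0, 11.5)]
print([r["status"] for r in responses])  # [200, 200, 429, 200]
```

Driving the limiter from the simulated clock rather than wall-clock time keeps the failure reproducible: the third call is rejected on every run, so you can check whether the agent backs off or retries blindly.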
Simulated Users
Many agents interact with users in a conversational loop — asking clarifying questions, presenting options, receiving feedback. Testing these agents requires a simulated user: another language model (or a scripted persona) that plays the role of the human.
@dataclass
class SimulatedUser:
    """A synthetic user with a persona and a hidden goal."""
    persona: str  # "impatient power user", "confused novice", etc.
    hidden_goal: str  # What the user actually wants
    patience: int = 5  # Max turns before giving up
    ambiguity_level: float = 0.3  # How vague initial requests are

    def generate_initial_request(self, model: str) -> str:
        prompt = f"""You are simulating a user with this persona: {self.persona}
Your actual goal is: {self.hidden_goal}
Generate an initial request. Be somewhat vague (ambiguity level: {self.ambiguity_level}).
Do NOT state your goal explicitly; make the agent work to understand you."""
        # call_model is a placeholder for your model client; it is not defined here.
        return call_model(prompt, model=model, temperature=0.8)

    def respond_to_agent(self, agent_message: str, turn: int, model: str) -> str:
        # Once patience runs out, the user abandons the conversation.
        if turn >= self.patience:
            return "Never mind, forget it."
        prompt = f"""You are simulating a user with persona: {self.persona}
Your hidden goal: {self.hidden_goal}
The agent just said: {agent_message}
Turn {turn}/{self.patience}. Respond naturally. If the agent is on track, cooperate.
If it misunderstood, correct it, but stay in character."""
        return call_model(prompt, model=model, temperature=0.6)
This lets you test how the agent handles ambiguity, frustration, contradictory instructions, and adversarial users — without involving real people.
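For the scripted-persona option, a model is not needed at all. The following sketch assumes a hypothetical `ScriptedUser` whose replies come from a fixed list; the class and its fields are illustrative, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class ScriptedUser:
    """Hypothetical deterministic user: replies come from a fixed script."""
    script: list[str]
    patience: int = 5
    give_up_message: str = "Never mind, forget it."

    def respond_to_agent(self, agent_message: str, turn: int) -> str:
        # Give up once patience runs out or the script is exhausted.
        if turn >= self.patience or turn >= len(self.script):
            return self.give_up_message
        return self.script[turn]


user = ScriptedUser(
    script=["My build is broken.", "It fails on the login test.", "Yes, only on CI."],
    patience=2,
)
replies = [user.respond_to_agent("Can you tell me more?", turn) for turn in range(3)]
print(replies)
# ['My build is broken.', 'It fails on the login test.', 'Never mind, forget it.']
```

A scripted user costs nothing to run and behaves identically on every run, which makes it a reasonable choice for regression suites. A model-backed user is better suited to exploring behavior you did not anticipate.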
Scenario Generation
A simulation is only as good as its scenarios. Hand-crafted scenarios test known risks. Generated scenarios discover unknown ones.
Template-Based Generation
Start with parameterized templates that cover your core use cases, then randomize the parameters.
import random


@dataclass
class Scenario:
    name: str
    description: str
    initial_world_state: WorldState
    user: SimulatedUser
    success_criteria: list[str]
    max_steps: int = 20


def generate_file_editing_scenarios(n: int = 100) -> list[Scenario]:
    """Generate diverse file-editing scenarios."""
    # Map each language to its file extension so generated paths are realistic.
    extensions = {"python": "py", "typescript": "ts", "rust": "rs", "go": "go"}
    bug_types = ["off_by_one", "null_reference", "missing_import",
                 "wrong_variable", "infinite_loop", "race_condition"]
    file_sizes = ["small", "medium", "large"]
    scenarios = []
    for i in range(n):
        lang = random.choice(list(extensions))
        bug = random.choice(bug_types)
        size = random.choice(file_sizes)
        world = WorldState(
            # generate_buggy_codebase is a placeholder helper, not defined here.
            filesystem=generate_buggy_codebase(lang, bug, size)
        )
        scenario = Scenario(
            name=f"fix_{bug}_in_{lang}_{i}",
            description=f"Fix a {bug} bug in a {size} {lang} codebase",
            initial_world_state=world,
            user=SimulatedUser(
                persona="developer",
                hidden_goal=f"Fix the {bug} bug in main.{extensions[lang]}",
                patience=8,
                ambiguity_level=0.2,
            ),
            success_criteria=[
                "bug_is_fixed",
                "no_new_bugs_introduced",
                "tests_pass",
            ],
        )
        scenarios.append(scenario)
    return scenarios
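Randomized templates are only replayable if the random draws are. One way to get that, sketched below under the assumption that parameters are sampled before any world state is built, is to draw from a private `random.Random(seed)` instance. The function name and the shortened parameter lists are illustrative.

```python
import random


def sample_scenario_params(n: int, seed: int) -> list[dict]:
    """Draw scenario parameters from a private, seeded RNG."""
    # A local RNG is unaffected by other code that calls the global random module.
    rng = random.Random(seed)
    languages = ["python", "typescript", "rust", "go"]
    bug_types = ["off_by_one", "null_reference", "missing_import"]
    return [
        {"id": i, "language": rng.choice(languages), "bug": rng.choice(bug_types)}
        for i in range(n)
    ]


first = sample_scenario_params(5, seed=42)
second = sample_scenario_params(5, seed=42)
print(first == second)  # True: the same seed yields the same scenario set
```

Recording the seed alongside each run's results then makes it possible to regenerate the exact scenario set that exposed a failure.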
Adversarial Scenario Generation
Use a separate model to generate scenarios that are specifically designed to break the agent. This is red-teaming through simulation — the scenario generator acts as an adversary.
def generate_adversarial_scenarios(
    agent_weaknesses: list[str],
    n: int = 50,
    model: str = "gpt-4o",
) -> list[dict]:
    """Generate scenarios targeting known agent weaknesses."""
    # Render the weaknesses as a bullet list instead of a Python list repr.
    weaknesses = "\n".join(f"- {w}" for w in agent_weaknesses)
    prompt = f"""Generate {n} test scenarios for an AI agent.
Each scenario should target one of these known weaknesses:
{weaknesses}
For each scenario, provide:
- A description of the setup
- What makes it tricky for the agent
- What the correct behavior should be
- A plausible wrong behavior the agent might exhibit
Make scenarios realistic: things that could happen in production."""
    # call_model and parse_scenarios are placeholder helpers, not defined here.
    result = call_model(prompt, model=model, temperature=0.9)
    return parse_scenarios(result)
Curriculum Design
Not all scenarios are equally useful for improving an agent. Borrowing from reinforcement learning, you can design a curriculum — starting with easy scenarios and progressively increasing difficulty as the agent improves.
@dataclass
class Curriculum:
    """Progressive difficulty scheduling for simulation."""
    levels: list[dict]  # Each level has scenarios + pass criteria
    current_level: int = 0

    def advance_if_ready(self, pass_rate: float) -> bool:
        """Move to next level if agent passes current level."""
        threshold = self.levels[self.current_level].get("advance_threshold", 0.8)
        if pass_rate >= threshold and self.current_level < len(self.levels) - 1:
            self.current_level += 1
            return True
        return False

    def get_current_scenarios(self) -> list[Scenario]:
        return self.levels[self.current_level]["scenarios"]


# Example curriculum for a customer-service agent
support_curriculum = Curriculum(levels=[
    {
        "name": "basic_queries",
        "scenarios": generate_simple_faq_scenarios(50),
        "advance_threshold": 0.9,
    },
    {
        "name": "multi_step_resolution",
        "scenarios": generate_multi_step_scenarios(50),
        "advance_threshold": 0.8,
    },
    {
        "name": "angry_customers",
        "scenarios": generate_emotional_scenarios(50),
        "advance_threshold": 0.7,
    },
    {
        "name": "adversarial_edge_cases",
        "scenarios": generate_adversarial_scenarios(
            ["prompt injection", "out-of-scope requests", "contradictions"]
        ),
        "advance_threshold": 0.6,
    },
])
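To see how the advance rule plays out over several evaluation rounds, the sketch below replays a list of measured pass rates through a simplified curriculum. `Level` and `run_curriculum` are hypothetical names for this illustration, and the pass rates are made-up inputs, not results.

```python
from dataclasses import dataclass


@dataclass
class Level:
    """Hypothetical curriculum level: a name and the pass rate needed to advance."""
    name: str
    advance_threshold: float


def run_curriculum(levels: list[Level], pass_rates: list[float]) -> list[str]:
    """Replay measured pass rates and record which level was used each round."""
    current = 0
    history = []
    for rate in pass_rates:
        history.append(levels[current].name)
        # Advance only when the threshold is met and a harder level exists.
        if rate >= levels[current].advance_threshold and current < len(levels) - 1:
            current += 1
    return history


levels = [
    Level("basic_queries", 0.9),
    Level("multi_step_resolution", 0.8),
    Level("angry_customers", 0.7),
]
history = run_curriculum(levels, pass_rates=[0.85, 0.92, 0.75, 0.81, 0.9])
print(history)
# ['basic_queries', 'basic_queries', 'multi_step_resolution',
#  'multi_step_resolution', 'angry_customers']
```

The agent repeats a level until it clears the threshold, so expensive later levels are only run once the cheaper ones pass.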
Digital Twins
A digital twin is a high-fidelity simulation of a specific real-world system — a precise replica of your production environment, kept in sync with the real thing. Where a simulated tool is a sketch, a digital twin is a photograph.
Digital twins are expensive to build and maintain, but they provide the highest confidence that simulation results will transfer to production.
Building a Digital Twin
Build a digital twin when:
- The agent operates on complex, stateful infrastructure (Kubernetes clusters, CI/CD pipelines, multi-service architectures)
- Failures in production are expensive or dangerous (financial transactions, medical systems, physical hardware)
- You need to replay production incidents exactly to diagnose root causes
- Regulatory requirements demand evidence of testing against production-equivalent systems
Do not build a digital twin when:
- Simple mocks are sufficient (the agent reads files and writes code)
- The environment changes faster than you can sync the twin
- The cost of building the twin exceeds the cost of occasional production failures
Keeping the Twin in Sync
A digital twin that drifts from production is worse than no twin at all — it gives you false confidence. Sync strategies include:
- Snapshot replication — periodically dump production state into the twin (databases, config, file systems)
- Event sourcing — replay the production event stream into the twin to arrive at the same state
- Schema-driven generation — use production schemas and data distributions to generate realistic synthetic data without copying real data (useful for privacy compliance)
import time


class DigitalTwin:
    """A synchronized replica of a production environment."""

    def __init__(self, sync_source: str, sync_interval_seconds: int = 3600):
        self.sync_source = sync_source
        self.sync_interval = sync_interval_seconds
        self.last_sync: float = 0.0
        self.world = WorldState()

    def sync(self) -> None:
        """Pull latest state from production (sanitized)."""
        # fetch_production_snapshot is a placeholder helper, not defined here.
        raw_state = fetch_production_snapshot(self.sync_source)
        sanitized = self._strip_pii(raw_state)
        self._apply_state(sanitized)
        self.last_sync = time.time()

    def is_stale(self) -> bool:
        return (time.time() - self.last_sync) > self.sync_interval

    def _strip_pii(self, state: dict) -> dict:
        """Remove personally identifiable information."""
        # Replace real emails, names, etc. with synthetic equivalents.
        # anonymize is a placeholder helper, not defined here.
        return anonymize(state)

    def _apply_state(self, state: dict) -> None:
        self.world.filesystem = state.get("files", {})
        self.world.database = state.get("tables", {})
        self.world.api_responses = state.get("api_mocks", {})
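Of the sync strategies listed, event sourcing is the one that benefits most from a concrete picture. The sketch below rebuilds a table by applying events in order; the event shape (`type`, `id`, `data`) and the three event kinds are assumptions for illustration, and a real event stream will have its own schema.

```python
def replay_events(events: list[dict]) -> dict[str, dict]:
    """Rebuild a table of records by applying events in order."""
    table: dict[str, dict] = {}
    for event in events:
        kind, key = event["type"], event["id"]
        if kind == "created":
            table[key] = dict(event["data"])
        elif kind == "updated":
            # Tolerate updates to unseen records by starting from an empty row.
            table.setdefault(key, {}).update(event["data"])
        elif kind == "deleted":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown event type: {kind}")
    return table


events = [
    {"type": "created", "id": "order-1", "data": {"status": "pending", "total": 40}},
    {"type": "created", "id": "order-2", "data": {"status": "pending", "total": 15}},
    {"type": "updated", "id": "order-1", "data": {"status": "shipped"}},
    {"type": "deleted", "id": "order-2"},
]
twin_orders = replay_events(events)
print(twin_orders)  # {'order-1': {'status': 'shipped', 'total': 40}}
```

Replaying only a prefix of the stream reconstructs the state at an earlier point in time, which is what makes this strategy useful for reproducing a production incident.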
Reward Signal Design
Once the agent runs in simulation, you need to measure how well it did. This is the reward signal — a function that takes a trajectory (the sequence of observations, thoughts, and actions) and produces a score.
Designing good reward signals is surprisingly hard. A naive reward can produce agents that game the metric rather than solve the problem.
Components of a Reward Function
A robust reward function typically combines multiple signals:
@dataclass
class TrajectoryReward:
    """Multi-dimensional reward for an agent trajectory."""
    task_success: float  # Did the agent achieve the goal? (0 or 1)
    efficiency: float    # How many steps? (0-1, higher means fewer steps)
    safety: float        # Did the agent violate any constraints? (1 = safe)
    quality: float       # How good is the output? (0-1, judged by evaluator)
    cost: float          # Token/API cost, already normalized to 0-1

    @property
    def composite_score(self) -> float:
        """Weighted combination. Weights are domain-specific."""
        if self.safety < 1.0:
            return 0.0  # Safety violation trumps everything
        return (
            0.5 * self.task_success
            + 0.2 * self.quality
            + 0.2 * self.efficiency
            + 0.1 * (1.0 - self.cost)  # Cheaper trajectories score higher
        )


def compute_efficiency_score(steps_taken: int, max_steps: int) -> float:
    """Reward fewer steps. Linear decay."""
    if steps_taken >= max_steps:
        return 0.0
    return 1.0 - (steps_taken / max_steps)
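A worked example makes the weighting concrete. The sketch below recomputes the composite score as a plain function and assumes cost is normalized by dividing by a per-trajectory budget; `normalize_cost` and the budget figure are illustrative choices, not something the reward definition prescribes.

```python
def normalize_cost(cost_usd: float, budget_usd: float) -> float:
    """Map a dollar cost onto 0-1 relative to a budget (an assumed convention)."""
    return min(max(cost_usd / budget_usd, 0.0), 1.0)


def composite_score(task_success: float, quality: float, efficiency: float,
                    normalized_cost: float, safety: float) -> float:
    """Same weights as the composite score: 0.5 / 0.2 / 0.2 / 0.1."""
    if safety < 1.0:
        return 0.0  # any safety violation zeroes the score
    return (0.5 * task_success + 0.2 * quality
            + 0.2 * efficiency + 0.1 * (1.0 - normalized_cost))


# A successful run: 6 of 20 steps used, quality 0.8, $0.50 spent of a $2.00 budget.
efficiency = 1.0 - 6 / 20                             # 0.7
cost = normalize_cost(0.50, budget_usd=2.00)          # 0.25
score = composite_score(1.0, 0.8, efficiency, cost, safety=1.0)
print(round(score, 3))  # 0.875 = 0.5 + 0.16 + 0.14 + 0.075

# The same run with a safety violation scores zero.
print(composite_score(1.0, 0.8, efficiency, cost, safety=0.0))  # 0.0
```

Because task success carries half the weight, a failed trajectory tops out at 0.5 no matter how cheap or fast it was.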
Reward Shaping Pitfalls
Sparse rewards — giving the agent a signal only at the end (success/failure) — make learning slow. The agent takes twenty steps and receives no feedback until the final evaluation. If it rarely succeeds, it rarely learns.
Dense rewards — giving feedback at every step — accelerate learning but risk reward hacking. If you reward the agent for "making progress" (e.g., editing a file), it might edit files unnecessarily to collect the reward.
The practical compromise: use sparse terminal rewards for correctness, but add shaping bonuses for intermediate milestones that you are confident correlate with success.
def shaped_reward(trajectory: list[dict], goal: str) -> float:
    """Combine terminal reward with intermediate shaping."""
    terminal = evaluate_goal_completion(trajectory[-1]["world_state"], goal)
    # Shaping: reward milestones without rewarding busywork
    milestones_hit = 0
    expected_milestones = get_milestones_for_goal(goal)
    for milestone in expected_milestones:
        if any(milestone_achieved(step, milestone) for step in trajectory):
            milestones_hit += 1
    shaping_bonus = 0.1 * (milestones_hit / max(len(expected_milestones), 1))
    return terminal + shaping_bonus
LLM-as-Judge
For tasks where success is subjective (writing quality, user satisfaction, code elegance), you can use a separate model as a judge. This is cheaper than human evaluation and scales to thousands of trajectories.
def llm_judge_reward(
    trajectory: list[dict],
    task_description: str,
    model: str = "gpt-4o",
) -> float:
    """Use a language model to score trajectory quality."""
    prompt = f"""You are evaluating an AI agent's performance.
Task: {task_description}
Agent trajectory (actions taken):
{format_trajectory(trajectory)}
Final state:
{trajectory[-1]["world_state"]}
Score the agent on a scale of 0.0 to 1.0:
- 1.0: Perfect execution. Goal achieved efficiently and safely.
- 0.7-0.9: Goal achieved with minor issues (extra steps, suboptimal approach).
- 0.4-0.6: Partial success. Some progress but goal not fully met.
- 0.1-0.3: Mostly failed. Minimal useful progress.
- 0.0: Complete failure or caused harm.
Provide your score as a single number:"""
    result = call_model(prompt, model=model, temperature=0.0)
    try:
        # Clamp to the rubric's range in case the judge returns an out-of-range value.
        return min(max(float(result.strip()), 0.0), 1.0)
    except ValueError:
        # Unparseable output is treated as a failed evaluation.
        return 0.0
The risk of LLM-as-judge: the judge model may have the same blind spots as the agent model. Mitigate this by using a different model family for judging, or by periodically calibrating the judge against human evaluations.
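Calibrating against human evaluations can be as simple as comparing the two sets of scores on a shared sample. The sketch below reports mean absolute error and pass/fail agreement; the function name, the 0.7 pass threshold, and the sample scores are all assumptions for illustration.

```python
from statistics import fmean


def calibrate_judge(judge_scores: list[float], human_scores: list[float],
                    pass_threshold: float = 0.7) -> dict:
    """Compare judge scores with human scores on the same trajectories."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(judge_scores, human_scores))
    mean_abs_error = fmean([abs(j - h) for j, h in pairs])
    # Agreement: judge and human land on the same side of the pass threshold.
    agreements = [(j >= pass_threshold) == (h >= pass_threshold) for j, h in pairs]
    return {
        "mean_abs_error": mean_abs_error,
        "pass_fail_agreement": sum(agreements) / len(agreements),
    }


# Made-up scores for four trajectories, judged by the model and by a human.
report = calibrate_judge([0.9, 0.4, 0.8, 0.2], [1.0, 0.5, 0.6, 0.0])
print(report["pass_fail_agreement"])  # 0.75: they disagree on the third trajectory
```

A low error with poor pass/fail agreement suggests the judge is roughly right but clusters near the threshold, which is worth knowing before using its scores as a release gate.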
Trajectory Generation and Collection
A trajectory is the complete record of one agent episode: the initial state, every observation, thought, action, and the final outcome. Trajectories are the raw material for evaluation, debugging, and — when you move to fine-tuning — training.
Trajectory Schema
from dataclasses import dataclass
from typing import Literal


@dataclass
class Step:
    step_number: int
    observation: str        # What the agent saw
    thought: str | None     # Chain-of-thought (if exposed)
    action: dict            # Tool call or response
    result: str             # Tool output or environment response
    world_state_hash: str   # For reproducibility


@dataclass
class Trajectory:
    trajectory_id: str
    scenario: Scenario
    steps: list[Step]
    outcome: Literal["success", "failure", "timeout", "safety_violation"]
    reward: TrajectoryReward
    total_tokens: int
    wall_clock_seconds: float
    model: str
    temperature: float
    seed: int | None        # For reproducibility
Batch Trajectory Collection
Running simulations at scale means running hundreds or thousands of episodes in parallel. The harness manages scenario distribution, resource limits, and result aggregation.
import asyncio
import time


async def collect_trajectories(
    agent,
    scenarios: list[Scenario],
    max_parallel: int = 10,
) -> list[Trajectory]:
    """Run agent through scenarios and collect trajectories."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def run_one(scenario: Scenario) -> Trajectory:
        async with semaphore:
            started = time.monotonic()
            # Each episode gets its own copy so parallel runs cannot interfere.
            world = scenario.initial_world_state.snapshot()
            tools = build_simulated_tools(world)
            steps = []
            for step_num in range(scenario.max_steps):
                observation = get_observation(world, scenario)
                thought, action = await agent.act(observation, tools)
                result = execute_simulated_action(action, tools, world)
                steps.append(Step(
                    step_number=step_num,
                    observation=observation,
                    thought=thought,
                    action=action,
                    result=result,
                    world_state_hash=hash_state(world),
                ))
                if is_terminal(world, scenario):
                    break
            reward = compute_reward(steps, scenario)
            outcome = classify_outcome(steps, scenario, reward)
            return Trajectory(
                trajectory_id=generate_id(),
                scenario=scenario,
                steps=steps,
                outcome=outcome,
                reward=reward,
                # Assumes the agent tracks tokens per episode, not across a shared instance.
                total_tokens=agent.token_count,
                wall_clock_seconds=time.monotonic() - started,
                model=agent.model,
                temperature=agent.temperature,
                seed=agent.seed,
            )

    # gather preserves input order, so results line up with scenarios.
    trajectories = await asyncio.gather(*(run_one(s) for s in scenarios))
    return list(trajectories)
Trajectories for Fine-Tuning
Collected trajectories with high reward scores become training data for distillation. You filter for successful trajectories, format them as input-output pairs, and fine-tune a smaller model to replicate the behavior of the larger agent.
def trajectories_to_training_data(
    trajectories: list[Trajectory],
    min_reward: float = 0.8,
) -> list[dict]:
    """Convert high-quality trajectories into fine-tuning examples."""
    training_examples = []
    for traj in trajectories:
        if traj.reward.composite_score < min_reward:
            continue
        if traj.outcome != "success":
            continue
        for step in traj.steps:
            if step.thought is None:
                continue
            example = {
                "messages": [
                    {"role": "system", "content": "You are a helpful agent."},
                    {"role": "user", "content": step.observation},
                    {"role": "assistant", "content": format_thought_and_action(
                        step.thought, step.action
                    )},
                ],
            }
            training_examples.append(example)
    return training_examples
This closes the loop: simulation generates trajectories, trajectories are scored, high-scoring trajectories become training data, and the fine-tuned model is evaluated in the same simulation to measure improvement.
Reproducibility and Determinism
Simulations are only useful for debugging if you can reproduce them exactly. Non-determinism in agent execution comes from three sources: model sampling, tool execution order, and clock/random state.
Controlling Randomness
@dataclass
class SimulationConfig:
    """Configuration for reproducible simulation runs."""
    seed: int
    model: str
    temperature: float = 0.0  # Deterministic sampling
    max_steps: int = 50
    timeout_seconds: float = 300.0

    def apply(self) -> None:
        """Set all random seeds for reproducibility."""
        import random
        import numpy as np
        random.seed(self.seed)
        np.random.seed(self.seed)
        # Note: model-side seeds depend on the provider.
        # OpenAI supports a `seed` parameter; others may not.
Setting temperature=0.0 and providing a fixed seed gets you close to determinism with most providers, but not all guarantee it. Floating-point arithmetic is not associative, so the same request can produce slightly different logits on different hardware or at different batch sizes. For strict reproducibility, log the full trajectory and diff against the baseline rather than relying on bit-for-bit output matching.
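Logged trajectories are easier to compare when each step carries a fingerprint of the world state. Below is a minimal sketch of such a fingerprint, assuming the state can be expressed as a JSON-serializable dict (for a dataclass, `dataclasses.asdict` is one way to get there). The function name mirrors the `hash_state` helper referenced in this chapter, but this body is an assumption, not its actual definition.

```python
import hashlib
import json


def hash_state(state: dict) -> str:
    """Stable fingerprint of a JSON-serializable state dict."""
    # sort_keys makes the encoding independent of dict insertion order.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = hash_state({"files": {"a.py": "x = 1"}, "clock": 0.0})
b = hash_state({"clock": 0.0, "files": {"a.py": "x = 1"}})
c = hash_state({"clock": 0.0, "files": {"a.py": "x = 2"}})
print(a == b)  # True: key order does not matter
print(a == c)  # False: any content change alters the hash
```

Python's built-in `hash()` is unsuitable here because string hashing is randomized per process by default, so values would not match across runs.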
Trajectory Diffing
When a code change causes a previously-passing scenario to fail, you need to find exactly where the trajectories diverge.
def diff_trajectories(baseline: Trajectory, current: Trajectory) -> dict:
    """Find the first point of divergence between two trajectories."""
    for i, (base_step, curr_step) in enumerate(
        zip(baseline.steps, current.steps)
    ):
        if base_step.action != curr_step.action:
            return {
                "divergence_step": i,
                "baseline_action": base_step.action,
                "current_action": curr_step.action,
                "observation_at_divergence": curr_step.observation,
                "baseline_thought": base_step.thought,
                "current_thought": curr_step.thought,
            }
    # All shared steps match; the trajectories can still differ in length.
    if len(baseline.steps) != len(current.steps):
        return {
            "divergence_step": min(len(baseline.steps), len(current.steps)),
            "reason": "trajectory_length_mismatch",
            "baseline_length": len(baseline.steps),
            "current_length": len(current.steps),
        }
    return {"divergence_step": None, "reason": "trajectories_identical"}
Trade-Offs and Practical Considerations
Simulation Fidelity vs. Speed
Higher fidelity means the simulation is more likely to predict real-world behavior, but it runs slower and costs more. Lower fidelity runs fast but may miss failure modes that only emerge in the real environment.
Fidelity Spectrum

Low                                              High
  ├───────────────┼───────────────┼───────────────┤
  │               │               │               │
Mocks         Emulators        Digital       Production
& Stubs                         Twins          Replay

Fast   ◄────────────────────────────────────────► Slow
Cheap  ◄────────────────────────────────────────► Expensive
Gaps   ◄────────────────────────────────────────► Faithful
The practical approach: use low-fidelity simulation for rapid iteration (thousands of runs per hour), then validate on high-fidelity digital twins before shipping (tens of runs per hour). Run against production replay as the final gate (single-digit runs per hour, read-only).
The Sim-to-Real Gap
The biggest risk of simulation is overfitting to the simulator. An agent that scores perfectly in simulation but fails in production has learned to exploit the simulation's shortcuts rather than solving the real problem. Common sources of sim-to-real gap:
- Missing failure modes. The simulator does not model network timeouts, partial writes, or race conditions that happen in production.
- Unrealistic data distributions. Synthetic data is too clean, too uniform, or lacks the edge cases that real data contains.
- Perfect tool behavior. Simulated tools always respond instantly and format their output consistently. Real tools are flaky.
- Simulated users are too cooperative. Real users interrupt, change their mind, give contradictory instructions, and paste huge walls of text.
Mitigate the sim-to-real gap by:
- Injecting noise and failures into the simulation (random latency, dropped connections, malformed responses)
- Regularly validating simulation results against production traces
- Using production replay as a ground-truth calibration for your simulator
- Running canary simulations — scenarios derived directly from recent production incidents
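The first mitigation, injecting failures, can be done without touching the tools themselves. The sketch below wraps any tool function so that a seeded fraction of calls returns an error; the wrapper name, the error payload, and the `read_file` stand-in are assumptions for illustration.

```python
import random
from typing import Callable


def with_fault_injection(tool: Callable[..., dict], failure_rate: float,
                         seed: int) -> Callable[..., dict]:
    """Wrap a tool so a seeded fraction of calls fails instead of running."""
    rng = random.Random(seed)  # private RNG keeps the failure pattern repeatable

    def wrapped(*args, **kwargs) -> dict:
        if rng.random() < failure_rate:
            return {"error": "ConnectionError", "message": "simulated dropped connection"}
        return tool(*args, **kwargs)

    return wrapped


def read_file(path: str) -> dict:
    """Hypothetical well-behaved tool used as the wrapping target."""
    return {"content": f"contents of {path}"}


flaky_read = with_fault_injection(read_file, failure_rate=0.3, seed=7)
results = [flaky_read("main.py") for _ in range(200)]
failures = sum("error" in r for r in results)
print(f"{failures} of {len(results)} calls failed")
```

Because the failure pattern depends only on the seed, a trajectory that breaks under injected faults can be replayed with the same faults in the same places.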
Cost Budgeting
Simulation at scale consumes inference tokens. A single trajectory of 20 steps with a large model might cost $0.50-$2.00. A thousand trajectories for a regression suite costs real money. Budget strategies:
- Use smaller, cheaper models for high-volume exploratory simulation; reserve the full model for final validation
- Cache deterministic tool outputs across runs (if the world state is identical, the tool output is identical)
- Truncate trajectories early when it is clear the agent has failed (no need to run all 50 steps if it is stuck in a loop by step 10)
- Run curriculum levels in order — do not waste expensive scenarios on an agent that cannot pass easy ones
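Early truncation needs some test for "clearly failed". One simple heuristic, sketched below, is to stop when the agent repeats the same action several times in a row. The function name and the window of three are assumptions; an agent that cycles between two or more actions would need a more general check.

```python
import json


def is_stuck_in_loop(actions: list[dict], window: int = 3) -> bool:
    """Return True if the last `window` actions are identical."""
    if len(actions) < window:
        return False
    # Serialize with sorted keys so equal actions compare equal regardless of key order.
    recent = [json.dumps(a, sort_keys=True) for a in actions[-window:]]
    return len(set(recent)) == 1


actions = [
    {"tool": "read_file", "path": "main.py"},
    {"tool": "run_tests"},
    {"tool": "run_tests"},
    {"tool": "run_tests"},
]
print(is_stuck_in_loop(actions[:3]))  # False: only two repeats so far
print(is_stuck_in_loop(actions))      # True: three identical actions in a row
```

Calling a check like this after every step lets the harness end the episode and record a failure instead of paying for the remaining steps.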
Conclusion
Simulation transforms agent development from guesswork to engineering. Instead of crossing your fingers and deploying, you run thousands of scenarios in a controlled environment, measure outcomes precisely, and iterate with confidence.
The key takeaways:
- Build simulated tools that match your real tools' interfaces and failure modes — not just their happy paths.
- Use simulated users to test conversational agents without involving real people or risking real-world side effects.
- Design reward signals carefully. Combine task success, efficiency, safety, and quality. Beware reward hacking.
- Generate scenarios at multiple fidelity levels: templates for coverage, adversarial generation for robustness, and production replay for ground truth.
- Collect trajectories as structured data. They serve evaluation, debugging, and fine-tuning simultaneously.
- Mind the sim-to-real gap. A perfect simulation score means nothing if the simulator does not capture the failure modes that matter in production.
- Invest in reproducibility. Fixed seeds, trajectory logging, and diffing tools let you diagnose regressions quickly.
Simulation is not a substitute for production monitoring — agents will always encounter situations the simulator did not predict. But it dramatically reduces the set of preventable failures, and it gives you the trajectory data you need to keep improving.