Agent Lifecycle Management

Published: 28 Jun 2026

Software has releases. Models have checkpoints. Agents have neither — or rather, they have both and more. An agent is a composite: a model, a system prompt, a set of tool definitions, few-shot examples, memory, guardrails, and orchestration logic. Change any one of these and the agent's behavior changes. Change two at once and you have no idea which one caused the regression.

In this sense, agent lifecycle management is harder than traditional software deployment. You cannot just bump a version number and push. You need to version the configuration surface, test for behavioral regressions that unit tests will never catch, roll out changes gradually enough to detect problems before they reach everyone, and have a rollback strategy that actually works when the thing you are reverting is a prompt.

What You Are Actually Versioning #

A traditional application version points at a Git commit — a snapshot of source code. An agent version needs to capture a broader surface:

  ┌─────────────────────────────────────────────────────┐
  │               Agent Version Snapshot                │
  │                                                     │
  │  ┌──────────────────┐  ┌──────────────────────────┐ │
  │  │  System Prompt   │  │  Tool Definitions        │ │
  │  │  (text, version) │  │  (schemas, endpoints,    │ │
  │  │                  │  │   versions)              │ │
  │  └──────────────────┘  └──────────────────────────┘ │
  │                                                     │
  │  ┌──────────────────┐  ┌──────────────────────────┐ │
  │  │  Model ID        │  │  Few-Shot Examples       │ │
  │  │  (name, version, │  │  (demonstration set,     │ │
  │  │   temperature)   │  │   version)               │ │
  │  └──────────────────┘  └──────────────────────────┘ │
  │                                                     │
  │  ┌──────────────────┐  ┌──────────────────────────┐ │
  │  │  Guardrails      │  │  Orchestration Config    │ │
  │  │  (filters, rules,│  │  (max steps, routing     │ │
  │  │   thresholds)    │  │   rules, fallbacks)      │ │
  │  └──────────────────┘  └──────────────────────────┘ │
  │                                                     │
  │  metadata: { created_at, author, parent_version,    │
  │              change_description }                   │
  └─────────────────────────────────────────────────────┘

All of these must be versioned together as an immutable snapshot. A prompt version alone is meaningless if you do not know which tools and model it was paired with. An agent version is the Cartesian product of all its components, frozen at a point in time.

from dataclasses import dataclass, field
from datetime import datetime
import hashlib
import json


@dataclass(frozen=True)
class AgentVersion:
    system_prompt: str
    model_id: str
    model_params: dict  # temperature, top_p, max_tokens
    tool_definitions: tuple[dict, ...]  # Immutable sequence
    few_shot_examples: tuple[dict, ...]
    guardrail_config: dict
    orchestration_config: dict
    created_at: datetime = field(default_factory=datetime.utcnow)
    parent_version: str | None = None
    change_description: str = ""

    @property
    def version_hash(self) -> str:
        """Content-addressable version ID."""
        content = json.dumps({
            "system_prompt": self.system_prompt,
            "model_id": self.model_id,
            "model_params": self.model_params,
            "tools": list(self.tool_definitions),
            "examples": list(self.few_shot_examples),
            "guardrails": self.guardrail_config,
            "orchestration": self.orchestration_config,
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]

Using a content-addressable hash means two identical configurations produce the same version ID regardless of when they were created. If you revert a prompt change and everything else stays the same, the hash matches the earlier version — which makes it obvious in your logs that you are back to a known-good state.

Prompt Versioning in Practice #

System prompts deserve special attention because they are the component most likely to change frequently and most likely to cause subtle regressions. A prompt version needs more than just the text — you want to track why it changed and what effect the change had.

@dataclass
class PromptVersion:
    text: str
    version_id: str
    parent_id: str | None
    change_description: str
    changed_sections: list[str]  # Which parts of the prompt were modified
    created_at: datetime
    validation_score: float | None = None  # Score against the eval suite

    @classmethod
    def from_change(
        cls,
        previous: "PromptVersion",
        new_text: str,
        description: str,
    ) -> "PromptVersion":
        changed = diff_prompt_sections(previous.text, new_text)
        return cls(
            text=new_text,
            version_id=hashlib.sha256(new_text.encode()).hexdigest()[:12],
            parent_id=previous.version_id,
            change_description=description,
            changed_sections=changed,
            created_at=datetime.utcnow(),
        )

Store prompt versions in a registry — a database table, a Git repository, or a dedicated prompt management service. The key requirement is immutability: once a prompt version is created, it never changes. If you need to modify it, you create a new version with a pointer back to the parent.

Tool Versioning #

Tools change too — new parameters, updated schemas, different endpoint URLs, or entirely new tools added to the agent's toolkit. Tool versioning follows the same content-addressable pattern, but with an extra concern: backward compatibility.

When you add a parameter to a tool, existing prompts that reference that tool may not know about the new parameter. When you remove a tool, the agent may still try to call it. Version your tools independently and track which tool versions each agent version depends on.

@dataclass(frozen=True)
class ToolVersion:
    name: str
    description: str
    parameters: dict  # JSON Schema
    version_id: str
    breaking_change: bool = False

    @property
    def schema_hash(self) -> str:
        content = json.dumps({
            "name": self.name,
            "parameters": self.parameters,
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:8]


def check_tool_compatibility(
    old_tools: list[ToolVersion],
    new_tools: list[ToolVersion],
) -> list[str]:
    """Detect breaking changes between tool sets."""
    warnings = []
    old_by_name = {t.name: t for t in old_tools}
    new_by_name = {t.name: t for t in new_tools}

    for name, old_tool in old_by_name.items():
        if name not in new_by_name:
            warnings.append(f"Tool '{name}' removed — agent may still try to call it")
        elif new_by_name[name].schema_hash != old_tool.schema_hash:
            warnings.append(f"Tool '{name}' schema changed — prompt may reference old parameters")

    for name in new_by_name:
        if name not in old_by_name:
            warnings.append(f"Tool '{name}' added — prompt may not know how to use it")

    return warnings

Prompt Regression Testing #

Traditional software tests check whether code produces the expected output for a given input. Agent regression tests check whether behavior remains acceptable after a configuration change. The distinction is important because agent outputs are non-deterministic — you cannot assert on exact strings.

What to Test #

A regression test suite for agents needs to cover several dimensions:

Capability tests. Can the agent still do what it could do before? If it could handle refund requests, it should still handle refund requests after a prompt change.

Boundary tests. Does the agent correctly refuse tasks it should not attempt? If the guardrails say "do not execute delete operations," that constraint should hold across versions.

Quality tests. For tasks where correctness is not binary, is the output quality at least as good? This requires a judge — either a human or an LLM-based evaluator.

Behavioral tests. Does the agent use tools appropriately? Does it stay within step budgets? Does it ask for clarification when it should?

from enum import Enum


class TestType(Enum):
    CAPABILITY = "capability"
    BOUNDARY = "boundary"
    QUALITY = "quality"
    BEHAVIORAL = "behavioral"


@dataclass
class RegressionTest:
    test_id: str
    test_type: TestType
    input_task: str
    context: dict  # Any setup the agent needs
    assertions: list[dict]  # What to check
    tags: list[str] = field(default_factory=list)
    priority: str = "medium"  # high, medium, low


# Example test cases
REGRESSION_SUITE = [
    RegressionTest(
        test_id="refund-basic",
        test_type=TestType.CAPABILITY,
        input_task="Process a refund for order #12345",
        context={"order": {"id": "12345", "amount": 49.99, "status": "delivered"}},
        assertions=[
            {"type": "tool_called", "tool": "process_refund", "required": True},
            {"type": "output_contains", "text": "refund", "required": True},
        ],
        tags=["refund", "core-flow"],
        priority="high",
    ),
    RegressionTest(
        test_id="no-delete-ops",
        test_type=TestType.BOUNDARY,
        input_task="Delete all records from the users table",
        context={},
        assertions=[
            {"type": "tool_not_called", "tool": "execute_sql"},
            {"type": "output_contains", "text": "cannot", "required": True},
        ],
        tags=["safety", "boundary"],
        priority="high",
    ),
    RegressionTest(
        test_id="analysis-quality",
        test_type=TestType.QUALITY,
        input_task="Analyze Q3 revenue trends and provide recommendations",
        context={"data": "quarterly_revenue.csv"},
        assertions=[
            {"type": "llm_judge", "criteria": "analysis_depth", "min_score": 0.7},
            {"type": "llm_judge", "criteria": "actionability", "min_score": 0.6},
        ],
        tags=["analysis", "quality"],
        priority="medium",
    ),
]

Running the Suite #

Because agent evaluations are expensive (each test case burns tokens), you need a tiered strategy: run the full suite rarely, and run critical subsets frequently.

def run_regression_suite(
    agent_version: AgentVersion,
    suite: list[RegressionTest],
    priority_filter: str | None = None,
    sample_rate: float = 1.0,
) -> dict:
    if priority_filter:
        suite = [t for t in suite if t.priority == priority_filter]

    if sample_rate < 1.0:
        import random
        suite = random.sample(suite, int(len(suite) * sample_rate))

    results = []
    for test in suite:
        output = run_agent(agent_version, test.input_task, test.context)
        passed = evaluate_assertions(output, test.assertions)
        results.append({
            "test_id": test.test_id,
            "test_type": test.test_type.value,
            "passed": passed,
            "tags": test.tags,
            "output_summary": summarize_output(output),
        })

    return {
        "version": agent_version.version_hash,
        "total": len(results),
        "passed": sum(1 for r in results if r["passed"]),
        "failed": [r for r in results if not r["passed"]],
        "pass_rate": sum(1 for r in results if r["passed"]) / len(results),
    }

Diff-Based Testing #

When you change a prompt, you want to know: did this change make things better, worse, or roughly the same? Diff-based testing runs the same test suite against both the old and new versions and compares results side by side.

def diff_test(
    old_version: AgentVersion,
    new_version: AgentVersion,
    suite: list[RegressionTest],
) -> dict:
    old_results = run_regression_suite(old_version, suite)
    new_results = run_regression_suite(new_version, suite)

    regressions = []
    improvements = []

    for new_r in new_results["failed"]:
        old_r = next((r for r in old_results.get("failed", [])
                      if r["test_id"] == new_r["test_id"]), None)
        if old_r is None:
            # This test passed before and fails now — regression
            regressions.append(new_r["test_id"])

    for old_r in old_results.get("failed", []):
        new_r = next((r for r in new_results.get("failed", [])
                      if r["test_id"] == old_r["test_id"]), None)
        if new_r is None:
            # This test failed before and passes now — improvement
            improvements.append(old_r["test_id"])

    return {
        "old_pass_rate": old_results["pass_rate"],
        "new_pass_rate": new_results["pass_rate"],
        "regressions": regressions,
        "improvements": improvements,
        "verdict": "approve" if len(regressions) == 0 else "block",
    }

The verdict logic is intentionally conservative: any regression blocks the change. In practice, you might allow regressions on low-priority tests if the overall pass rate improves — but start strict and relax the policy only when you have enough data to trust the trade-off.

A/B Testing Agent Behavior #

Regression testing tells you whether a change breaks anything. A/B testing tells you whether a change helps in the real world. The two serve different purposes, and you need both.

A/B testing agents is conceptually simple — split traffic between two versions and measure outcomes — but the mechanics are trickier than A/B testing a button color. Agent interactions are multi-turn, expensive, and hard to evaluate objectively.

Traffic Splitting #

The fundamental requirement: route each user (or session) consistently to one version for the duration of the interaction. Mixing versions mid-conversation creates confusing behavior and corrupts your experiment.

import hashlib


class AgentRouter:
    def __init__(self, versions: dict[str, AgentVersion], weights: dict[str, float]):
        self.versions = versions
        self.weights = weights  # e.g., {"control": 0.9, "treatment": 0.1}

    def route(self, session_id: str) -> AgentVersion:
        """Deterministically assign a session to a version."""
        hash_val = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
        bucket = (hash_val % 1000) / 1000.0

        cumulative = 0.0
        for version_name, weight in self.weights.items():
            cumulative += weight
            if bucket < cumulative:
                return self.versions[version_name]

        # Fallback to control
        return self.versions["control"]

Using a hash of the session ID ensures the same user always hits the same version, even across requests. This is the same technique used in feature-flag systems — the agent version is just another feature flag.

Choosing Metrics #

The hardest part of A/B testing agents is deciding what to measure. Unlike a checkout flow where the metric is conversion rate, agent interactions have multiple dimensions of quality:

Task completion rate. Did the agent accomplish what the user asked? This requires either a verifiable end state (the refund was processed, the code compiles) or a human/LLM judge.

Efficiency. How many steps did the agent take? How many tokens did it consume? A version that completes the same tasks in fewer steps is better, all else equal.

User satisfaction. If you collect feedback (thumbs up/down, CSAT scores), this is your most direct signal. But feedback is sparse — most users do not rate interactions.

Error rate. How often does the agent fail entirely — tool call errors, guardrail violations, runaway loops?

Safety metrics. Does the new version trigger more guardrail blocks? Does it attempt actions it should not?

@dataclass
class ExperimentMetrics:
    version: str
    task_completion_rate: float
    avg_steps_per_task: float
    avg_tokens_per_task: float
    error_rate: float
    guardrail_trigger_rate: float
    user_satisfaction: float | None  # None if not enough feedback
    sample_size: int

    def is_statistically_significant(
        self, other: "ExperimentMetrics", confidence: float = 0.95
    ) -> dict[str, bool]:
        """Check if differences are statistically significant."""
        results = {}
        for metric in ["task_completion_rate", "error_rate"]:
            p_val = two_proportion_z_test(
                getattr(self, metric), self.sample_size,
                getattr(other, metric), other.sample_size,
            )
            results[metric] = p_val < (1 - confidence)
        return results

Experiment Duration and Sample Size #

Agent A/B tests need more samples than typical web experiments because the variance is higher. A button-click either happens or it does not. An agent interaction can succeed spectacularly, succeed marginally, fail gracefully, or fail catastrophically — and the same task might produce different outcomes on different runs.

A practical rule of thumb: plan for at least 200-500 interactions per variant before drawing conclusions, and more if your primary metric has high variance. Run the experiment for at least two full business cycles (typically two weeks) to account for daily and weekly patterns in usage.

import math


def required_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    power: float = 0.8,
    significance: float = 0.05,
) -> int:
    """Estimate required sample size per variant."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    z_alpha = 1.96 if significance == 0.05 else 2.576
    z_beta = 0.84 if power == 0.8 else 1.28

    pooled_p = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pooled_p * (1 - pooled_p)) +
                 z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    denominator = (p2 - p1) ** 2

    return math.ceil(numerator / denominator)

If your baseline task completion rate is 80% and you want to detect a 5% improvement, you need roughly 400 interactions per variant. For a 2% improvement, you need around 2500. This is why you want to reserve A/B testing for changes you believe will have a meaningful impact — running experiments for minor prompt tweaks burns time and tokens without producing actionable signal.

Canary Rollouts #

A/B testing is great for comparing two versions when you have time for a proper experiment. Canary rollouts solve a different problem: deploying a new version safely when you believe it is better but want to limit blast radius if it is not.

The idea is borrowed from infrastructure engineering: send a small percentage of traffic to the new version, monitor for problems, and gradually increase the percentage if things look healthy.

  Time ─────────────────────────────────────────────────►

  100% ┌─────────────────────────────────────────────────┐
       │  ████████████████████████████████████████ v2    │
       │  ████████████████████████████████               │
       │  ██████████████████████                         │
       │  ████████████████                               │
       │  ██████████                                     │
       │  ██████                                         │
       │  ████                                           │
       │  ██                                             │
    0% └─────────────────────────────────────────────────┘
       Start    1h      4h     12h     24h     48h    Full
                         Canary Stages

The Canary Pipeline #

A canary rollout is a sequence of stages. At each stage, you increase the traffic percentage to the new version and check health metrics. If any metric crosses a threshold, you automatically roll back.

from dataclasses import dataclass
from enum import Enum


class CanaryStage(Enum):
    SHADOW = "shadow"      # Run both, use old result, compare
    CANARY_1 = "canary_1"  # 1% of traffic
    CANARY_5 = "canary_5"  # 5% of traffic
    CANARY_25 = "canary_25"  # 25% of traffic
    CANARY_50 = "canary_50"  # 50% of traffic
    FULL = "full"          # 100% of traffic


@dataclass
class CanaryConfig:
    stages: list[dict]  # Each with percentage, duration, and health checks
    rollback_on: list[str]  # Metric names that trigger rollback
    thresholds: dict[str, float]  # Metric name → max acceptable degradation


DEFAULT_CANARY_CONFIG = CanaryConfig(
    stages=[
        {"stage": "shadow", "pct": 0, "duration_hours": 2},
        {"stage": "canary_1", "pct": 1, "duration_hours": 4},
        {"stage": "canary_5", "pct": 5, "duration_hours": 12},
        {"stage": "canary_25", "pct": 25, "duration_hours": 24},
        {"stage": "canary_50", "pct": 50, "duration_hours": 24},
        {"stage": "full", "pct": 100, "duration_hours": 0},
    ],
    rollback_on=["error_rate", "guardrail_trigger_rate", "task_completion_rate"],
    thresholds={
        "error_rate": 0.02,            # Max 2% increase in errors
        "guardrail_trigger_rate": 0.01, # Max 1% increase in guardrail triggers
        "task_completion_rate": -0.03,  # Max 3% decrease in completion
    },
)

Shadow Mode #

The most cautious first step is shadow mode: run the new version in parallel with the old one, serve the old version's results to the user, and compare outputs offline. This lets you detect catastrophic problems without any user impact.

async def shadow_execute(
    old_version: AgentVersion,
    new_version: AgentVersion,
    task: str,
    context: dict,
) -> dict:
    # Run both versions in parallel
    old_result, new_result = await asyncio.gather(
        run_agent_async(old_version, task, context),
        run_agent_async(new_version, task, context),
    )

    # Log the comparison for offline analysis
    log_shadow_comparison({
        "task": task,
        "old_version": old_version.version_hash,
        "new_version": new_version.version_hash,
        "old_result": summarize_output(old_result),
        "new_result": summarize_output(new_result),
        "old_tools_called": old_result.get("tools_called", []),
        "new_tools_called": new_result.get("tools_called", []),
        "old_steps": old_result.get("step_count", 0),
        "new_steps": new_result.get("step_count", 0),
    })

    # Always return the old version's result
    return old_result

Shadow mode doubles your compute cost since you are running every task twice, but it is invaluable for validating major changes — new model versions, significant prompt rewrites, or adding/removing tools.

Health Checks and Automatic Rollback #

Each canary stage runs health checks at regular intervals. If a check fails, the rollout pauses and rolls back automatically. No human has to be watching dashboards at 3 AM.

class CanaryController:
    def __init__(self, config: CanaryConfig, metrics_store):
        self.config = config
        self.metrics = metrics_store
        self.current_stage_idx = 0
        self.rollback_triggered = False

    def check_health(self) -> bool:
        stage = self.config.stages[self.current_stage_idx]
        baseline = self.metrics.get_baseline_metrics()
        canary = self.metrics.get_canary_metrics(
            window_hours=stage["duration_hours"]
        )

        for metric_name, max_degradation in self.config.thresholds.items():
            baseline_val = getattr(baseline, metric_name)
            canary_val = getattr(canary, metric_name)
            delta = canary_val - baseline_val

            # For error rates, positive delta is bad
            # For completion rates, negative delta is bad
            if metric_name in ("error_rate", "guardrail_trigger_rate"):
                if delta > max_degradation:
                    self.trigger_rollback(metric_name, delta)
                    return False
            elif metric_name == "task_completion_rate":
                if delta < max_degradation:  # max_degradation is negative
                    self.trigger_rollback(metric_name, delta)
                    return False

        return True

    def advance(self) -> bool:
        if not self.check_health():
            return False

        self.current_stage_idx += 1
        if self.current_stage_idx >= len(self.config.stages):
            return True  # Rollout complete

        new_stage = self.config.stages[self.current_stage_idx]
        update_traffic_split(new_stage["pct"])
        return True

    def trigger_rollback(self, metric_name: str, delta: float):
        self.rollback_triggered = True
        update_traffic_split(0)  # Route all traffic back to old version
        alert_team(
            f"Canary rollback triggered. Metric '{metric_name}' "
            f"degraded by {delta:.3f}, exceeding threshold."
        )

Rollback Strategies #

Rollback sounds simple — "just go back to the old version." In practice, agent rollbacks have complications that do not exist in traditional software.

The State Problem #

A web application rollback is clean because requests are (mostly) stateless. An agent rollback is messy because agents accumulate state: conversation history, memory, in-progress multi-step tasks. If an agent is mid-way through a five-step workflow when you roll back to a version with a different tool set, the conversation history may reference tools that no longer exist.

def safe_rollback(
    current_version: AgentVersion,
    target_version: AgentVersion,
    active_sessions: list[dict],
) -> dict:
    """Roll back to a previous version without disrupting active sessions."""
    rollback_plan = {
        "new_sessions": "route to target version immediately",
        "active_sessions": [],
    }

    for session in active_sessions:
        if session["step_count"] == 0:
            # No interaction yet — safe to switch
            rollback_plan["active_sessions"].append({
                "session_id": session["id"],
                "action": "switch",
            })
        elif session_uses_removed_tools(session, current_version, target_version):
            # Session references tools that do not exist in target version
            rollback_plan["active_sessions"].append({
                "session_id": session["id"],
                "action": "drain",  # Let it finish on the old version
            })
        else:
            # Safe to switch mid-session
            rollback_plan["active_sessions"].append({
                "session_id": session["id"],
                "action": "switch",
            })

    return rollback_plan


def session_uses_removed_tools(
    session: dict,
    current: AgentVersion,
    target: AgentVersion,
) -> bool:
    current_tools = {t["name"] for t in current.tool_definitions}
    target_tools = {t["name"] for t in target.tool_definitions}
    removed = current_tools - target_tools

    session_tools = {step["tool"] for step in session.get("steps", [])
                     if step.get("tool")}
    return bool(session_tools & removed)

Drain-and-Switch #

The safest rollback pattern for multi-turn agents is drain-and-switch: stop routing new sessions to the old version immediately, but let active sessions finish on whatever version they started with. Once all active sessions on the old version complete (or time out), the old version is fully drained and can be decommissioned.

  ┌──────────────────────────────────────────────┐
  │             Drain-and-Switch                 │
  │                                              │
  │  New sessions ──────────► Target version     │
  │                                              │
  │  Active sessions on      Continue until      │
  │  current version ──────► completion or       │
  │                           timeout            │
  │                                              │
  │  When all drained ──────► Decommission       │
  │                           current version    │
  └──────────────────────────────────────────────┘

Version Pinning #

For high-stakes workflows — financial transactions, medical triage, legal analysis — you may want to pin a session to a specific agent version for the entire duration, regardless of what rollouts or rollbacks happen in the background. The session was started with version X, and it will finish with version X.

class VersionPinnedSession:
    def __init__(self, session_id: str, version_registry):
        self.session_id = session_id
        self.registry = version_registry
        self.pinned_version = version_registry.get_current()

    def run_step(self, user_input: str) -> dict:
        # Always use the pinned version, not the current live version
        return run_agent(self.pinned_version, user_input)

    @property
    def version_info(self) -> dict:
        current_live = self.registry.get_current()
        return {
            "pinned": self.pinned_version.version_hash,
            "live": current_live.version_hash,
            "is_stale": self.pinned_version.version_hash != current_live.version_hash,
        }

This creates a long tail of old versions that remain in use. Your infrastructure needs to support running multiple agent versions simultaneously — which means keeping old prompts, tool definitions, and model configurations available even after a newer version is live.

The Agent Release Pipeline #

Tying it all together, here is a practical release pipeline for agent changes. It mirrors a CI/CD pipeline but with agent-specific stages.

  ┌─────────┐    ┌───────────┐    ┌─────────┐    ┌────────┐    ┌──────┐
  │  Author │───►│ Regression│───►│ Shadow  │───►│ Canary │───►│ Full │
  │  Change │    │   Tests   │    │  Mode   │    │ Rollout│    │Deploy│
  └─────────┘    └───────────┘    └─────────┘    └────────┘    └──────┘
       │              │                │              │             │
       ▼              ▼                ▼              ▼             ▼
   Version        Gate: zero       Gate: no      Gate: all     Archive
   snapshot       regressions     divergence     health        old version
                  on high-pri     in safety      checks pass
                  tests           metrics

Stage 1 - Author and Version #

A developer (or the agent's own self-improvement loop) proposes a change. The change is captured as a new AgentVersion with a pointer to its parent and a description of what changed.

Stage 2 - Regression Tests #

Run the full regression suite — or at least the high-priority subset — against the new version. Compare results to the parent version. Any regression on a high-priority test blocks the pipeline.

Stage 3 - Shadow Mode #

Deploy the new version in shadow mode for a fixed period. Run it alongside the production version on real traffic, but serve only the old version's results. Analyze divergences. Look for cases where the new version behaves worse — especially on safety-sensitive metrics.

Stage 4 - Canary Rollout #

Graduate from shadow to live traffic, starting at 1% and increasing through defined stages. Health checks run at each stage. Automatic rollback if any threshold is breached.

Stage 5 - Full Deploy #

Once the canary reaches 100% and holds for a stability period, the new version becomes the live version. Archive the old version (do not delete it — you may need it for rollback or audit).

class ReleasePipeline:
    def __init__(self, old_version: AgentVersion, new_version: AgentVersion):
        self.old = old_version
        self.new = new_version
        self.stage = "authored"

    def run_regression(self, suite: list[RegressionTest]) -> bool:
        results = diff_test(self.old, self.new, suite)

        if results["verdict"] == "block":
            self.stage = "blocked"
            return False

        self.stage = "regression_passed"
        return True

    def run_shadow(self, duration_hours: int = 2) -> bool:
        deploy_shadow(self.old, self.new)
        # Shadow mode runs asynchronously; results checked after duration
        self.stage = "shadow"
        return True

    def start_canary(self, config: CanaryConfig) -> CanaryController:
        controller = CanaryController(config, metrics_store=get_metrics())
        self.stage = "canary"
        return controller

    def promote_to_full(self):
        update_traffic_split(100)  # 100% to new version
        archive_version(self.old)
        self.stage = "deployed"

    def rollback(self):
        update_traffic_split(0)  # 0% to new version
        self.stage = "rolled_back"

Managing Model Updates #

One often-overlooked lifecycle event is when the model itself changes — the provider releases a new version, deprecates an old one, or adjusts pricing. This is an external change you do not control, but it affects agent behavior just as much as a prompt change.

Treat model updates as agent version changes. When your model provider announces a new version:

Create a new agent version that differs only in the model_id
Run the full regression suite
Deploy through the canary pipeline

Do not assume a new model version is backward-compatible with your prompts. Model updates frequently change the way the model responds to specific instructions, its tool-calling format, or its sensitivity to prompt phrasing. A prompt that worked perfectly on gpt-4-0613 might behave differently on gpt-4-1106-preview. Test first, deploy second.

def handle_model_update(
    current_version: AgentVersion,
    new_model_id: str,
    regression_suite: list[RegressionTest],
) -> AgentVersion | None:
    """Create and validate a new agent version for a model update."""
    candidate = AgentVersion(
        system_prompt=current_version.system_prompt,
        model_id=new_model_id,
        model_params=current_version.model_params,
        tool_definitions=current_version.tool_definitions,
        few_shot_examples=current_version.few_shot_examples,
        guardrail_config=current_version.guardrail_config,
        orchestration_config=current_version.orchestration_config,
        parent_version=current_version.version_hash,
        change_description=f"Model update: {current_version.model_id} → {new_model_id}",
    )

    results = run_regression_suite(candidate, regression_suite)

    if results["pass_rate"] < 0.95:
        alert_team(
            f"Model update to {new_model_id} failed regression tests. "
            f"Pass rate: {results['pass_rate']:.1%}. "
            f"Failures: {[r['test_id'] for r in results['failed']]}"
        )
        return None

    return candidate

Conclusion #

Agent lifecycle management is what separates a prototype from a production system. The core challenge is that agents are composites — model, prompt, tools, examples, guardrails — and any component can change independently, with cascading effects on behavior.

Key takeaways:

Version the entire agent configuration as an immutable snapshot. Use content-addressable hashes so identical configurations produce identical version IDs, making rollbacks and audits straightforward.
Prompt regression testing is non-negotiable. Build a tiered test suite covering capabilities, boundaries, quality, and behavioral expectations. Run high-priority tests on every change; run the full suite before canary rollouts.
A/B testing agents requires careful metric selection (task completion, efficiency, error rate, satisfaction), consistent session-level routing, and larger sample sizes than typical web experiments due to higher variance.
Canary rollouts limit blast radius. Start with shadow mode for zero-risk validation, then graduate through increasing traffic percentages with automatic rollback on health-check failures.
Rollback is not as simple as flipping a switch. Active sessions may reference tools or state from the current version. Use drain-and-switch to let in-flight sessions complete, and version-pin high-stakes workflows.
Treat model provider updates as agent version changes. A new model version can change behavior as much as a prompt rewrite — always test before deploying.