Self-Improving Agents & Tool Creation

Most agents are static. You build them, deploy them, and they execute the same loop with the same tools and the same system prompt forever. When they fail, a human notices, diagnoses the problem, and pushes a fix. The agent itself learns nothing from the experience.

This is a waste. Agents generate enormous signal about what works and what does not — every successful run is a demonstration of a good strategy, and every failure is a lesson about what to avoid. An agent that can act on that signal — refine its own prompts, synthesize new tools, learn from mistakes — compounds its effectiveness over time without requiring a human in the loop for every improvement.

This is the territory of self-improving agents: systems that modify their own behavior based on experience. That does not mean fine-tuning the model weights, which requires a training pipeline. It means changing the things the agent controls directly: its prompts, its tool library, its planning strategies, and its memory of what works.

The Improvement Surface

An agent has several levers it can adjust without retraining the underlying model:

  ┌─────────────────────────────────────────────────┐
  │              Agent Configuration                │
  │                                                 │
  │  ┌─────────────────┐  ┌──────────────────────┐  │
  │  │  System Prompt  │  │  Tool Definitions    │  │
  │  │  (instructions, │  │  (schemas, code,     │  │
  │  │   constraints)  │  │   descriptions)      │  │
  │  └─────────────────┘  └──────────────────────┘  │
  │                                                 │
  │  ┌─────────────────┐  ┌──────────────────────┐  │
  │  │     Few-Shot    │  │  Planning Heuristics │  │
  │  │     Examples    │  │  (strategies,        │  │
  │  │ (demonstrations)│  │   decompositions)    │  │
  │  └─────────────────┘  └──────────────────────┘  │
  │                                                 │
  │  ┌──────────────────┐  ┌──────────────────────┐ │
  │  │  Memory          │  │  Evaluation Criteria │ │
  │  │  (learned facts, │  │  (what "good" looks  │ │
  │  │   preferences)   │  │   like)              │ │
  │  └──────────────────┘  └──────────────────────┘ │
  └─────────────────────────────────────────────────┘

Each of these is a target for self-improvement. The agent can rewrite its system prompt to be more precise. It can add new tools when it discovers a recurring need. It can accumulate few-shot examples from its own successful runs. It can update its planning heuristics when a strategy consistently fails.

The key constraint: all of these changes happen at the prompt and tool layer. The model stays frozen. The agent gets better by changing what surrounds the model — the context it operates in.
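One way to picture this surface is as a plain configuration object that the improvement logic can read and rewrite while the model stays untouched. The sketch below is illustrative only: `AgentConfig`, its field names, and `render_context` are hypothetical, chosen to mirror the six boxes in the diagram above.

```python
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    """Everything the agent can change without retraining (hypothetical layout)."""

    system_prompt: str
    tools: list[dict] = field(default_factory=list)
    few_shot_examples: list[dict] = field(default_factory=list)
    planning_heuristics: list[str] = field(default_factory=list)
    memory: list[dict] = field(default_factory=list)
    evaluation_criteria: list[str] = field(default_factory=list)

    def render_context(self, task: str) -> str:
        """Assemble the text the frozen model sees for one task."""
        sections = [self.system_prompt]
        if self.planning_heuristics:
            rules = "\n".join(f"- {h}" for h in self.planning_heuristics)
            sections.append("Heuristics:\n" + rules)
        sections.append(f"Task: {task}")
        return "\n\n".join(sections)


cfg = AgentConfig(system_prompt="You are a support agent.")
cfg.planning_heuristics.append("Look up the customer first.")
print(cfg.render_context("Reset a password"))
```

Every self-improvement mechanism in the rest of this article amounts to editing one of these fields and re-rendering the context.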

Learning from Failures

The simplest form of self-improvement is failure memory: when something goes wrong, record what happened and why, so the agent avoids the same mistake next time.

def run_with_failure_learning(task: str, memory: list[dict]) -> str:
    # Include past failures as negative examples
    failure_context = format_failures(memory, task)

    prompt = f"""{SYSTEM_PROMPT}

Previous mistakes to avoid:
{failure_context}

Task: {task}
"""
    result = execute_agent_loop(prompt)

    if result["success"]:
        return result["output"]

    # Record the failure for future runs
    memory.append({
        "task": task,
        "error": result["error"],
        "trace": result["trace"][-3:],  # Last 3 steps before failure
        "lesson": extract_lesson(task, result),
    })

    return result["output"]


def extract_lesson(task: str, result: dict) -> str:
    prompt = f"""An agent attempted the following task and failed.
Analyze the failure and write a one-sentence lesson the agent should remember
to avoid this mistake in the future.

Task: {task}
Error: {result['error']}
Steps taken: {result['trace'][-3:]}

Lesson:
"""
    return call_model(prompt, temperature=0.0).text

The extracted lesson becomes a rule in the agent's context. Over time, the agent accumulates a library of "do not do X because Y" directives that steer it away from known failure modes. This is cheap, simple, and surprisingly effective — especially for agents that repeatedly encounter the same types of tasks.

The risk: unbounded failure memory bloats the context window. You need a curation mechanism, and there are three common options: a fixed window of recent failures, relevance-based retrieval (embed the current task and retrieve only failures that are semantically similar), or periodic summarization that distills many failures into a few general rules.

def curate_failure_memory(memory: list[dict], max_rules: int = 20) -> list[str]:
    if len(memory) <= max_rules:
        return [m["lesson"] for m in memory]

    # Cluster similar failures and distill into general rules
    lessons = [m["lesson"] for m in memory]
    prompt = f"""The following are {len(lessons)} lessons learned from past failures.
Consolidate them into at most {max_rules} general rules.
Remove duplicates, merge related lessons, and keep only the most important ones.

Lessons:
{chr(10).join(f'- {l}' for l in lessons)}

Consolidated rules:
"""
    result = call_model(prompt, temperature=0.0).text
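    # Strip leading bullet markers only, so hyphens at the end of a rule survive,
    # and enforce the cap in case the model returns more rules than requested
    rules = [line.strip().lstrip("-*• ").strip() for line in result.splitlines()]
    return [rule for rule in rules if rule][:max_rules]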
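The relevance-based option can be sketched without an embedding model. The example below is a minimal illustration that assumes a bag-of-words cosine similarity as a stand-in for real embeddings; `bag_of_words`, `cosine`, and `retrieve_relevant_failures` are hypothetical names, and the `min_similarity` cutoff is an arbitrary choice.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_relevant_failures(
    memory: list[dict], task: str, k: int = 3, min_similarity: float = 0.1
) -> list[str]:
    """Return lessons from the k past failures most similar to the task."""
    query = bag_of_words(task)
    scored = [(cosine(query, bag_of_words(m["task"])), m["lesson"]) for m in memory]
    scored = [pair for pair in scored if pair[0] >= min_similarity]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [lesson for _, lesson in scored[:k]]


memory = [
    {"task": "Parse the orders CSV export", "lesson": "Check the delimiter before parsing."},
    {"task": "Summarize a legal contract", "lesson": "Quote clauses verbatim."},
]
print(retrieve_relevant_failures(memory, "Parse the customers CSV file", k=1))
```

Only the CSV lesson is retrieved here, because the contract task shares no words with the query. A real system would swap in an embedding model so that paraphrased tasks also match.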

Prompt Self-Refinement

Beyond learning from failures, an agent can actively rewrite its own instructions. The idea: after a batch of runs, evaluate which instructions led to good outcomes and which were vague, misleading, or missing. Then revise the prompt.

def refine_system_prompt(
    current_prompt: str,
    recent_runs: list[dict],
    success_threshold: float = 0.7,
) -> str:
    # Separate successes and failures
    successes = [r for r in recent_runs if r["score"] >= success_threshold]
    failures = [r for r in recent_runs if r["score"] < success_threshold]

    if not failures:
        return current_prompt  # Nothing to fix

    prompt = f"""You are optimizing an agent's system prompt based on performance data.

Current system prompt:
---
{current_prompt}
---

Recent successful runs ({len(successes)}):
{format_run_summaries(successes[:5])}

Recent failed runs ({len(failures)}):
{format_run_summaries(failures[:5])}

Analyze the failures. Identify which instructions in the system prompt are:
1. Missing — the agent needed guidance that was not there
2. Ambiguous — the agent misinterpreted the instruction
3. Counterproductive — the instruction led to worse outcomes

Then output a revised system prompt that addresses these issues.
Do not change instructions that are working well.

Revised system prompt:
"""
    return call_model(prompt, temperature=0.0).text

This is a form of meta-prompting — using the model to improve the prompt that the model will later follow. It works because evaluation is easier than generation: the model can often spot vagueness or gaps in instructions even if it could not have written perfect instructions from scratch.

Guardrails:

  • Always keep the original prompt as a rollback target. If the refined prompt performs worse, revert.
  • Run the refined prompt through a validation set before deploying it. A prompt that fixes one failure mode might break a previously-working case.
  • Limit the magnitude of changes. A single refinement pass should adjust a few clauses, not rewrite the entire prompt. Radical rewrites are unpredictable.

def safe_prompt_refinement(
    current_prompt: str,
    recent_runs: list[dict],
    validation_tasks: list[dict],
) -> str:
    candidate = refine_system_prompt(current_prompt, recent_runs)

    # Validate: run the candidate prompt against known-good tasks
    current_score = evaluate_prompt(current_prompt, validation_tasks)
    candidate_score = evaluate_prompt(candidate, validation_tasks)

    if candidate_score >= current_score:
        return candidate

    # Regression detected — keep the old prompt
    return current_prompt
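The third guardrail, limiting the magnitude of a change, can be approximated with a line-level diff. This is a rough sketch using the standard library's `difflib`; the function names and the 0.3 budget are assumptions, and a line-based score is only a proxy for "a few clauses".

```python
import difflib


def change_score(old_prompt: str, new_prompt: str) -> float:
    """Dissimilarity in [0, 1] based on matching lines (0.0 = identical)."""
    old_lines = old_prompt.splitlines()
    new_lines = new_prompt.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return 1.0 - matcher.ratio()


def within_change_budget(
    old_prompt: str, new_prompt: str, max_score: float = 0.3
) -> bool:
    """Accept a candidate only if it stays close to the current prompt."""
    return change_score(old_prompt, new_prompt) <= max_score


old = "Be concise.\nCite sources.\nUse metric units.\nAsk before deleting files."
small_edit = old.replace("Cite sources.", "Cite sources with links.")
rewrite = "Completely\ndifferent\nprompt\ntext"

print(within_change_budget(old, small_edit))
print(within_change_budget(old, rewrite))
```

A check like this would run before the validation step, so that oversized rewrites are rejected without spending tokens on evaluation.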

Tool Synthesis - Creating New Capabilities

The most powerful form of self-improvement is when an agent creates entirely new tools. Instead of being limited to the tools a developer provided, the agent notices a recurring pattern — "I keep writing the same five lines of code to parse this API response" — and packages that pattern into a reusable tool.

def synthesize_tool(
    pattern_description: str,
    example_usages: list[dict],
) -> dict:
    prompt = f"""Create a new tool based on the following recurring pattern.

Pattern: {pattern_description}

Example situations where this tool would have been useful:
{format_examples(example_usages)}

Generate:
1. A tool name (snake_case)
2. A clear description of what the tool does
3. Input parameters with types and descriptions
4. The implementation code (Python function)

Return as JSON with fields: name, description, parameters, code
"""
    result = call_model(prompt, temperature=0.0).text
    tool_spec = parse_json(result)

    # Validate the generated code is safe and functional
    validated = validate_tool(tool_spec)
    return validated

The synthesized tool goes through validation before it enters the agent's tool library:

def validate_tool(tool_spec: dict) -> dict:
    code = tool_spec["code"]

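    # Static analysis: reject file deletion, dynamic execution, subprocesses,
    # and network access. Substring matching is only a coarse first filter;
    # the sandbox run below is the real safety boundary.
    forbidden_patterns = [
        "os.remove", "shutil.rmtree", "eval(", "exec(",
        "subprocess", "__import__",
        "socket", "urllib", "requests", "http.client",
    ]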
    for pattern in forbidden_patterns:
        if pattern in code:
            raise ToolValidationError(f"Forbidden pattern: {pattern}")

    # Sandbox execution with test inputs
    test_inputs = generate_test_inputs(tool_spec["parameters"])
    for inputs in test_inputs:
        try:
            result = sandbox_execute(code, inputs, timeout=5.0)
            if result is None:
                raise ToolValidationError("Tool returned None for valid input")
        except TimeoutError:
            raise ToolValidationError("Tool execution timed out")

    return tool_spec
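Substring checks are easy to evade and prone to false positives (a harmless variable named `exec_count` would not trip them, but a comment mentioning `eval(` would). A somewhat sturdier static pass can walk the syntax tree instead. The sketch below uses Python's `ast` module; the deny-lists and the name `find_violations` are assumptions, and this is still no substitute for running the code in a sandbox.

```python
import ast

FORBIDDEN_CALLS = {"eval", "exec", "compile", "__import__", "open"}
FORBIDDEN_MODULES = {"os", "shutil", "subprocess", "socket", "urllib", "requests"}


def find_violations(code: str) -> list[str]:
    """Return human-readable policy violations found in the code."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        # Imports: compare the top-level package against the deny-list
        if isinstance(node, ast.Import):
            modules = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [(node.module or "").split(".")[0]]
        else:
            modules = []
        violations += [f"forbidden import: {m}" for m in modules if m in FORBIDDEN_MODULES]

        # Direct calls by name, such as eval(...) or exec(...)
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in FORBIDDEN_CALLS
        ):
            violations.append(f"forbidden call: {node.func.id}")
    return violations


print(find_violations("import os\nos.remove('a')"))
print(find_violations("def f(x):\n    return x + 1"))
```

Determined code can still reach forbidden functionality through attribute lookups or aliasing, which is why the sandboxed execution step remains necessary.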

When to trigger tool synthesis:

  • The agent uses the same code snippet more than N times across different tasks
  • The agent repeatedly calls a sequence of existing tools in the same pattern (a composite tool)
  • The agent fails at a task because no existing tool fits, but the task is within reach with a small custom function

def detect_tool_opportunities(traces: list[dict], threshold: int = 3) -> list[dict]:
    """Analyze execution traces to find recurring patterns worth packaging as tools."""
    # Extract code blocks the agent wrote inline
    code_blocks = []
    for trace in traces:
        for step in trace["steps"]:
            if step["type"] == "code_execution":
                code_blocks.append({
                    "code": step["code"],
                    "task": trace["task"],
                    "context": step.get("reasoning", ""),
                })

    # Cluster similar code blocks
    prompt = f"""Analyze these {len(code_blocks)} code snippets from an agent's execution history.
Identify recurring patterns — code that does essentially the same thing across different tasks.
Group similar snippets and describe each pattern.

Only report patterns that appear {threshold} or more times.

Code snippets:
{format_code_blocks(code_blocks)}

Patterns found (as JSON array with "description" and "example_indices" fields):
"""
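    patterns = parse_json_array(call_model(prompt, temperature=0.0).text)

    # Resolve the model's indices into the snippets themselves so callers
    # (for example synthesize_tool) receive concrete examples
    for pattern in patterns:
        pattern["examples"] = [
            code_blocks[i]
            for i in pattern.get("example_indices", [])
            if isinstance(i, int) and 0 <= i < len(code_blocks)
        ]
    return patterns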
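The second trigger, a repeated sequence of existing tool calls, does not need a model call at all. A minimal sketch, assuming each run is logged as an ordered list of tool names (the log format and `find_repeated_tool_sequences` are hypothetical):

```python
from collections import Counter


def find_repeated_tool_sequences(
    tool_call_logs: list[list[str]], length: int = 3, threshold: int = 3
) -> list[tuple[tuple[str, ...], int]]:
    """Count contiguous tool-call sequences and keep those seen often enough.

    Each sequence is counted at most once per run, so a single run that
    loops does not trigger synthesis on its own.
    """
    counts: Counter = Counter()
    for calls in tool_call_logs:
        seen_in_run = {
            tuple(calls[i : i + length]) for i in range(len(calls) - length + 1)
        }
        counts.update(seen_in_run)
    return [(seq, n) for seq, n in counts.most_common() if n >= threshold]


logs = [
    ["lookup_customer", "get_subscription", "search_kb", "send_reply"],
    ["lookup_customer", "get_subscription", "search_kb"],
    ["search_kb", "lookup_customer", "get_subscription", "search_kb"],
    ["send_reply"],
]
print(find_repeated_tool_sequences(logs))
```

Sequences that pass the threshold are candidates for composite tools.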

Tool Composition - Building Higher-Level Capabilities

Beyond creating tools from scratch, agents can compose existing tools into higher-level operations. This is analogous to how a programmer writes functions that call other functions — layering abstractions.

def compose_tool(
    name: str,
    description: str,
    steps: list[dict],
    existing_tools: list[dict],
) -> dict:
    """Create a composite tool that chains existing tools."""
    # Generate the implementation that orchestrates sub-tools
    step_descriptions = "\n".join(
        f"{i+1}. Call {s['tool']} with {s['description']}"
        for i, s in enumerate(steps)
    )

    code = f'''
def {name}(**kwargs):
    """
    {description}

    Steps:
    {step_descriptions}
    """
    results = {{}}
'''

    for i, step in enumerate(steps):
        input_mapping = step.get("input_mapping", {})
        input_pairs = ", ".join(
            f'"{k}": {v}' for k, v in input_mapping.items()
        )
        code += f'''
    results["step_{i}"] = call_tool("{step['tool']}", {{{input_pairs}}})
'''

    code += '''
    return results
'''

    return {
        "name": name,
        "description": description,
        "parameters": infer_parameters(steps, existing_tools),
        "code": code,
        "is_composite": True,
        "sub_tools": [s["tool"] for s in steps],
    }

A concrete example: an agent that handles customer support tickets might discover it always runs the same three-step sequence — look up the customer, check their subscription tier, then search the knowledge base with tier-specific filters. It can package this as a single get_customer_context tool that does all three in one call.

  Before (3 tool calls per ticket):
    1. lookup_customer(email) → customer_id
    2. get_subscription(customer_id) → tier
    3. search_kb(query, filter=tier) → articles

  After (1 composite tool):
    1. get_customer_context(email, query) → {customer, tier, articles}

The benefit is not just convenience — it reduces the number of reasoning steps the agent needs, which reduces the chance of errors mid-sequence and saves tokens.
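Written out by hand, the composite from this example might look like the following. The three sub-tools are stubs with canned data standing in for real integrations; their return shapes are assumptions made for illustration.

```python
# Stub sub-tools with canned data, standing in for real integrations.
def lookup_customer(email: str) -> str:
    return {"ada@example.com": "cust_1"}[email]


def get_subscription(customer_id: str) -> str:
    return {"cust_1": "pro"}[customer_id]


def search_kb(query: str, filter: str) -> list[str]:
    articles = {"pro": ["Pro: resetting SSO"], "free": ["Free: password reset"]}
    return [a for a in articles[filter] if query.lower() in a.lower()]


def get_customer_context(email: str, query: str) -> dict:
    """Composite tool: runs the three lookups in a fixed order."""
    customer_id = lookup_customer(email)
    tier = get_subscription(customer_id)
    articles = search_kb(query, filter=tier)
    return {"customer": customer_id, "tier": tier, "articles": articles}


context = get_customer_context("ada@example.com", "sso")
print(context)
```

The data flow between steps (customer ID into the subscription lookup, tier into the search filter) is now fixed in code, so the model no longer has to reconstruct it on every ticket.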

The Improvement Loop Architecture

Putting self-improvement into production requires a loop that runs alongside the agent's normal operation. The agent does its work, outcomes are evaluated, and improvements are proposed and validated on a schedule.

  ┌───────────────────────────────────────────────────┐
  │                    Production Loop                │
  │                                                   │
  │  Task → Agent → Result → Evaluate → Store Trace   │
  └─────────────────────┬─────────────────────────────┘
                        │
                        ▼
  ┌───────────────────────────────────────────────────┐
  │                  Improvement Loop                 │
  │                  (runs periodically)              │
  │                                                   │
  │  1. Analyze recent traces                         │
  │  2. Identify failure patterns                     │
  │  3. Propose changes:                              │
  │     • Prompt refinements                          │
  │     • New tool candidates                         │
  │     • Updated few-shot examples                   │
  │  4. Validate against held-out tasks               │
  │  5. Deploy if improved, rollback if regressed     │
  └───────────────────────────────────────────────────┘

The two loops run at different cadences. The production loop handles every request in real time. The improvement loop runs hourly, daily, or after N failures — whatever cadence gives enough signal without being too noisy.

class SelfImprovingAgent:
    def __init__(self, config: dict):
        self.system_prompt = config["system_prompt"]
        self.tools = config["tools"]
        self.few_shot_examples = config["few_shot_examples"]
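        self.failure_memory = []   # Raw failure records (dicts)
        self.failure_rules = []    # Curated lessons injected into the prompt
        self.trace_buffer = []
        self.validation_set = config["validation_tasks"]

    def run(self, task: str) -> dict:
        """Execute a task and record the trace."""
        result = execute_with_config(
            task=task,
            system_prompt=self.system_prompt,
            tools=self.tools,
            examples=self.few_shot_examples,
            failure_memory=self.failure_rules,
        )
        self.trace_buffer.append({"task": task, "result": result})
        return result

    def improve(self):
        """Run the improvement loop over accumulated traces."""
        if not self.trace_buffer:
            return  # No new signal since the last cycle

        # Step 1: Learn from failures
        failures = [t for t in self.trace_buffer if not t["result"]["success"]]
        for f in failures:
            lesson = extract_lesson(f["task"], f["result"])
            self.failure_memory.append({"task": f["task"], "lesson": lesson})

        # Keep the raw records; curate_failure_memory returns plain strings
        self.failure_rules = curate_failure_memory(
            self.failure_memory, max_rules=20
        )

        # Step 2: Refine the system prompt
        # refine_system_prompt expects a top-level "score" on each run
        recent_runs = [
            {"task": t["task"], **t["result"],
             "score": t["result"].get("score", float(t["result"]["success"]))}
            for t in self.trace_buffer
        ]
        self.system_prompt = safe_prompt_refinement(
            self.system_prompt, recent_runs, self.validation_set
        )

        # Step 3: Detect and synthesize new tools
        # detect_tool_opportunities expects {"task", "steps"} records
        traces = [
            {"task": t["task"], "steps": t["result"].get("trace", [])}
            for t in self.trace_buffer
        ]
        opportunities = detect_tool_opportunities(traces)
        for opp in opportunities:
            try:
                tool = synthesize_tool(opp["description"], opp.get("examples", []))
            except ToolValidationError:
                continue  # Skip candidates that fail static or sandbox checks
            if passes_validation(tool, self.validation_set):
                self.tools.append(tool)

        # Step 4: Update few-shot examples from the best successful runs
        best_runs = sorted(
            [t for t in self.trace_buffer if t["result"]["success"]],
            key=lambda t: t["result"].get("score", 0),
            reverse=True,
        )[:5]
        if best_runs:
            self.few_shot_examples = extract_demonstrations(best_runs)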

        # Reset the buffer
        self.trace_buffer = []
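Deciding when to call `improve()` is a separate concern from what it does. One possible sketch of the "hourly, daily, or after N failures" cadence follows; `ImprovementTrigger` and its defaults are hypothetical, and the clock is injectable so the logic can be tested without waiting.

```python
import time


class ImprovementTrigger:
    """Decides when the improvement loop should run."""

    def __init__(
        self,
        max_failures: int = 5,
        max_interval_s: float = 3600.0,
        clock=time.monotonic,
    ):
        self.max_failures = max_failures
        self.max_interval_s = max_interval_s
        self.clock = clock
        self.failures_since_run = 0
        self.last_run = clock()

    def record(self, success: bool) -> None:
        """Call once per production run."""
        if not success:
            self.failures_since_run += 1

    def should_run(self) -> bool:
        """True when enough failures or enough time has accumulated."""
        overdue = self.clock() - self.last_run >= self.max_interval_s
        return overdue or self.failures_since_run >= self.max_failures

    def mark_ran(self) -> None:
        """Reset both counters after an improvement cycle."""
        self.failures_since_run = 0
        self.last_run = self.clock()
```

In the production loop this would sit right after each `agent.run(task)`: record the outcome, and if `should_run()` returns true, call `agent.improve()` followed by `mark_ran()`.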

Few-Shot Example Selection

One of the most underrated self-improvement mechanisms is dynamic few-shot selection. Instead of hardcoding examples in the prompt, the agent maintains a library of its own successful runs and selects the most relevant ones for each new task.

class ExampleLibrary:
    def __init__(self):
        self.examples = []
        self.embeddings = []

    def add(self, task: str, trace: list[dict], score: float):
        if score < 0.8:
            return  # Only store high-quality demonstrations

        example = {
            "task": task,
            "steps": format_trace_as_demonstration(trace),
            "score": score,
        }
        self.examples.append(example)
        self.embeddings.append(embedding_model.encode(task))

    def select(self, current_task: str, k: int = 3) -> list[dict]:
        """Retrieve the most relevant examples for the current task."""
        query_embedding = embedding_model.encode(current_task)

        similarities = [
            cosine_similarity(query_embedding, emb)
            for emb in self.embeddings
        ]

        # Get top-k most similar examples
        top_indices = sorted(
            range(len(similarities)),
            key=lambda i: similarities[i],
            reverse=True,
        )[:k]

        return [self.examples[i] for i in top_indices]

This means the agent's behavior adapts to the task at hand. For a data analysis task, it sees demonstrations of how it previously handled data analysis successfully. For a code review task, it sees its best code reviews. The prompt is dynamic — it shows the agent its own best work as a reference.
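Selected examples still have to be rendered into the prompt, and the most relevant ones should survive if space runs short. A minimal sketch, assuming each example's `steps` field is already a formatted string and using a character budget as a crude proxy for tokens (`render_examples` is a hypothetical helper):

```python
def render_examples(examples: list[dict], max_chars: int = 2000) -> str:
    """Format selected examples as a prompt section, within a character budget."""
    blocks = []
    used = 0
    for n, example in enumerate(examples, start=1):
        block = f"Example {n}\nTask: {example['task']}\nSteps:\n{example['steps']}"
        if used + len(block) > max_chars:
            break  # Examples arrive most-relevant first, so drop the tail
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)


examples = [
    {"task": "A", "steps": "1. x", "score": 0.9},
    {"task": "B", "steps": "1. y", "score": 0.95},
]
print(render_examples(examples))
```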

Bootstrapping - Cold Start and Warm-Up

A self-improving agent faces a cold-start problem: it has no traces, no failures, and no learned examples when first deployed. The improvement loop needs signal, and signal comes from running tasks. How do you get the first batch of good runs?

Seed with synthetic demonstrations. Generate a set of task-solution pairs using a stronger model or human examples. These bootstrap the few-shot library and give the agent a reasonable starting point.

Run a warm-up phase. Execute the agent on a batch of representative tasks with human evaluation. Accept the cost of some failures in exchange for the learning signal they provide.

Transfer from a similar agent. If you have an existing agent that handles related tasks, copy its failure memory, few-shot library, and tool set as a starting point. The new agent inherits the lessons without having to relearn them.

def bootstrap_agent(
    agent: SelfImprovingAgent,
    seed_tasks: list[str],
    evaluator,
) -> None:
    """Run a warm-up phase to seed the improvement loop."""
    for task in seed_tasks:
        result = agent.run(task)
        score = evaluator.evaluate(task, result)
        result["score"] = score

    # Run one improvement cycle on the warm-up data
    agent.improve()
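The transfer option can be sketched as a copy between two configuration dictionaries. The field names follow the `config` keys used earlier; `transfer_knowledge` itself is hypothetical, and resetting inherited tools to the lowest permission level is an assumption made here so that they re-earn trust in the new agent.

```python
import copy


def transfer_knowledge(source: dict, target: dict) -> dict:
    """Return a copy of target seeded with learned state from source.

    Deep copies keep the two agents independent after the transfer.
    The target's own system prompt is left untouched.
    """
    seeded = copy.deepcopy(target)
    for name in ("failure_memory", "few_shot_examples"):
        seeded[name] = seeded.get(name, []) + copy.deepcopy(source.get(name, []))

    # Assumption: inherited tools start again at the lowest permission level
    inherited_tools = copy.deepcopy(source.get("tools", []))
    for tool in inherited_tools:
        tool["permission"] = "sandbox"
    seeded["tools"] = seeded.get("tools", []) + inherited_tools
    return seeded


source = {
    "system_prompt": "S",
    "failure_memory": [{"task": "t", "lesson": "l"}],
    "few_shot_examples": [],
    "tools": [{"name": "a", "permission": "full"}],
}
target = {"system_prompt": "T", "tools": [{"name": "b", "permission": "full"}]}
seeded = transfer_knowledge(source, target)
print(seeded["tools"])
```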

Safety and Stability Constraints

Self-improvement is powerful but dangerous. An agent that can rewrite its own prompts and create new tools can also drift into unexpected behavior, amplify biases, or silently break. You need guardrails around the improvement loop itself.

Bounded change rate. Limit how much can change in a single improvement cycle. A prompt can have at most N clauses modified. The tool library can grow by at most K tools per cycle. This prevents catastrophic rewrites.

Validation gates. Every proposed change must pass a held-out validation set before deployment. If performance drops on any critical task category, the change is rejected.
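A per-category gate differs from comparing a single aggregate score, because an average can hide a regression in one category. The sketch below assumes validation results are already grouped into category scores. `passes_category_gate` is a hypothetical name, and the extra rule that the overall mean must not drop is an added assumption, not something the gate strictly requires.

```python
def passes_category_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    critical: set[str],
    tolerance: float = 0.0,
) -> bool:
    """Reject a change if any critical category regresses beyond tolerance."""
    for category in critical:
        if candidate.get(category, 0.0) < baseline.get(category, 0.0) - tolerance:
            return False

    # Assumption: non-critical categories only need to hold up on average
    shared = [c for c in baseline if c in candidate]
    if not shared:
        return True
    mean_base = sum(baseline[c] for c in shared) / len(shared)
    mean_cand = sum(candidate[c] for c in shared) / len(shared)
    return mean_cand >= mean_base - tolerance


baseline = {"billing": 0.9, "refunds": 0.8, "chitchat": 0.6}
regressed = {"billing": 0.85, "refunds": 0.95, "chitchat": 0.9}
acceptable = {"billing": 0.92, "refunds": 0.85, "chitchat": 0.55}

print(passes_category_gate(baseline, regressed, critical={"billing"}))
print(passes_category_gate(baseline, acceptable, critical={"billing"}))
```

The `regressed` candidate has a higher average than the baseline but is still rejected, because billing, the critical category, got worse.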

Audit trail. Log every change the improvement loop makes — the before and after of every prompt revision, every new tool, every updated example. A human should be able to review the history and understand why the agent behaves the way it does today.
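One possible shape for an audit record is a JSON line holding the diff plus content hashes, so a reviewer can both read the change and verify which version was live. The field names and helper functions below are assumptions; only standard-library calls are used.

```python
import difflib
import hashlib
import json
import time


def audit_entry(kind: str, before: str, after: str, reason: str) -> dict:
    """Build one audit record for a change made by the improvement loop."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    return {
        "timestamp": time.time(),
        "kind": kind,  # e.g. "prompt_revision" or "new_tool"
        "before_sha256": hashlib.sha256(before.encode("utf-8")).hexdigest(),
        "after_sha256": hashlib.sha256(after.encode("utf-8")).hexdigest(),
        "diff": list(diff),
        "reason": reason,
    }


def append_audit_log(path: str, entry: dict) -> None:
    """Append the entry as one JSON line so the log is easy to grep and replay."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


entry = audit_entry(
    "prompt_revision",
    before="Be brief.",
    after="Be brief.\nCite sources.",
    reason="Failed runs lacked citations",
)
print("\n".join(entry["diff"]))
```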

Semantic drift detection. Compare the current system prompt against the original. If the cosine similarity between their embeddings drops below a threshold, flag it for human review. The agent's identity should not drift beyond recognition.

def check_drift(original_prompt: str, current_prompt: str, threshold: float = 0.85) -> bool:
    original_embedding = embedding_model.encode(original_prompt)
    current_embedding = embedding_model.encode(current_prompt)
    similarity = cosine_similarity(original_embedding, current_embedding)

    if similarity < threshold:
        alert_human(
            f"System prompt drift detected. Similarity: {similarity:.2f}. "
            f"Review required before next deployment."
        )
        return False

    return True

Tool sandbox escalation. Synthesized tools start with minimal permissions: a new tool can read data but cannot write it or reach the network. A tool is promoted one level at a time, and only after it has built a track record, for example at least 10 recorded uses with a success rate of 95% or higher. This limits the damage a buggy synthesized tool can cause.

PERMISSION_LEVELS = {
    "sandbox": {"read": True, "write": False, "network": False},
    "restricted": {"read": True, "write": True, "network": False},
    "full": {"read": True, "write": True, "network": True},
}

def promote_tool(tool: dict, traces: list[dict]) -> dict:
    """Promote a tool's permission level based on track record."""
    uses = [t for t in traces if tool["name"] in t.get("tools_used", [])]
    successes = [t for t in uses if t["result"]["success"]]

    if len(uses) < 10:
        return tool  # Not enough data

    success_rate = len(successes) / len(uses)

    if success_rate >= 0.95 and tool["permission"] == "sandbox":
        tool["permission"] = "restricted"
    elif success_rate >= 0.99 and tool["permission"] == "restricted":
        tool["permission"] = "full"

    return tool

Real-World Patterns

Self-improvement shows up in production systems in several recognizable forms:

The coding agent that learns project conventions. After a few interactions with a codebase, the agent notices it keeps getting review feedback about import ordering, naming conventions, or test structure. It adds these as rules to its context, and subsequent code generations comply without being told.

The research agent that builds its own search tools. A research agent discovers that a particular database has an undocumented API that returns better results than the general search tool. It wraps that API into a new tool and preferentially uses it for related queries.

The customer support agent that learns resolution patterns. After handling hundreds of tickets, the agent identifies that certain complaint types have standard resolutions. It packages these as lookup tables — effectively creating a decision tree that short-circuits the reasoning loop for known cases.

The data pipeline agent that learns schema quirks. A SQL-generating agent keeps failing on a specific table because the column names are misleading. It records this in its failure memory: "the date column in orders is actually a timestamp with timezone, not a date." Future queries get it right on the first try.

Trade-offs

Self-improvement is not free. It adds complexity, introduces new failure modes, and requires careful engineering to avoid instability.

Compute cost. The improvement loop itself costs tokens — analyzing traces, generating prompt revisions, synthesizing tools, running validations. For some agents, this overhead is justified by the quality gains. For simple, well-defined tasks, it may not be worth it.

Stability risk. A self-improving agent's behavior changes over time. This makes debugging harder — the agent that failed today is not the same agent that worked yesterday. Strong versioning and rollback mechanisms are essential.
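Versioning can be as simple as an append-only list of configurations. The sketch below is one hypothetical way to do it (`ConfigHistory` is not defined elsewhere in this article). It treats a rollback as a new commit, so the history also records that a rollback happened.

```python
import copy


class ConfigHistory:
    """Keeps every deployed configuration so any version can be restored."""

    def __init__(self, initial: dict):
        self._versions = [copy.deepcopy(initial)]

    @property
    def current(self) -> dict:
        return copy.deepcopy(self._versions[-1])

    @property
    def version(self) -> int:
        return len(self._versions) - 1

    def commit(self, config: dict) -> int:
        """Record a new configuration and return its version number."""
        self._versions.append(copy.deepcopy(config))
        return self.version

    def rollback(self, to_version: int) -> dict:
        """Restore an earlier version by committing a copy of it."""
        if not 0 <= to_version < len(self._versions):
            raise ValueError(f"unknown version: {to_version}")
        self.commit(self._versions[to_version])
        return self.current


history = ConfigHistory({"system_prompt": "v0"})
history.commit({"system_prompt": "v1"})
restored = history.rollback(0)
print(history.version, restored["system_prompt"])
```

Storing the version number alongside each trace makes it possible to answer the debugging question above: which configuration was live when a given run failed.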

Evaluation quality. The improvement loop is only as good as its evaluation signal. If you cannot reliably distinguish good runs from bad runs, the agent will optimize for the wrong thing. Invest in evaluation before investing in self-improvement.

Overfitting to recent tasks. An agent that optimizes heavily for its last 100 tasks might lose generality. The validation set acts as a regularizer — it ensures improvements do not come at the cost of breadth.

Conclusion

Self-improving agents close the loop between execution and learning. Instead of treating every failure as a bug for a human to fix, they treat it as signal for automatic improvement.

Key takeaways:

  • The improvement surface for agents includes system prompts, tool libraries, few-shot examples, planning heuristics, and failure memory — all modifiable without retraining the model
  • Failure memory is the simplest self-improvement mechanism: record what went wrong, distill a lesson, and inject it into future runs as a negative example
  • Prompt self-refinement uses the model to analyze its own performance data and revise its instructions — but requires validation gates and bounded change rates to prevent drift
  • Tool synthesis lets agents package recurring patterns into reusable capabilities, reducing reasoning steps and error opportunities in future runs
  • The improvement loop runs alongside the production loop at a lower cadence, proposing changes that must pass a held-out validation set before deployment
  • Safety constraints — drift detection, permission escalation, audit trails, and bounded change rates — prevent self-improvement from becoming self-destruction
  • The cold-start problem is real: seed with synthetic demonstrations, run a warm-up phase, or transfer knowledge from a similar agent to bootstrap the improvement cycle