The System Prompt & Context Engineering

Every time an agent calls the model, it starts from zero. No memory, no personality, no rules — unless you put them there. The system prompt is what turns a blank-slate language model into a focused, opinionated collaborator.

It defines who the agent is, what it should do, what it must avoid, and how it should respond. Get it right and the agent behaves consistently across hundreds of turns. Get it wrong and you spend your time fighting drift, off-topic rambles, and hallucinated actions.

But the system prompt is only one piece of the context window — and the context window is the entire interface between your code and the model. We touched on context management when we discussed memory. Here we focus on the engineering of what goes into the window: how to structure the system prompt, how to assemble the full context for each model call, how to format it for maximum clarity, how to cache it for cost and latency, and how to budget token space across competing sections.

Anatomy of a System Prompt #

A system prompt is a block of natural language that rides at the top of every model call. The model treats it as persistent instructions — it reads them, internalizes the constraints, and (ideally) follows them throughout the conversation. Unlike the user message, which changes every turn, the system prompt is the stable anchor.

A well-structured system prompt typically has four layers:

┌───────────────────────────────────────────┐
│           1. Identity / Persona           │
│  Who the agent is. What role it plays.    │
├───────────────────────────────────────────┤
│         2. Task Instructions              │
│  What the agent should do. The workflow   │
│  it follows, the tools it prefers, the    │
│  steps it should take.                    │
├───────────────────────────────────────────┤
│         3. Constraints / Rules            │
│  What the agent must NOT do. Boundaries,  │
│  safety limits, formatting requirements.  │
├───────────────────────────────────────────┤
│        4. Output Formatting               │
│  How responses should be structured.      │
│  Tone, length, format conventions.        │
└───────────────────────────────────────────┘

These layers are a design pattern that keeps prompts readable and maintainable as they grow. Let's walk through each one.

Identity #

The identity section tells the model what it is. This sounds trivial, but it anchors every decision the model makes downstream. A model told it is "a cautious security auditor" behaves differently from one told it is "a creative marketing assistant" — even given the same tools and the same user query.

You are a senior infrastructure engineer specializing in Kubernetes
deployments. You are methodical: you check the current state before
making changes, you explain your reasoning, and you never run
destructive commands without confirmation.

Keep the identity short — two or three sentences. The point is to set a decision-making frame. If the identity section is longer than the task instructions, the balance is off.

Task Instructions #

This is the core of the prompt: what the agent should actually do. Effective task instructions are specific, ordered, and action-oriented. They read like the runbook a senior engineer would hand to a competent colleague, not like a wish list.

When a user asks you to investigate an incident:
1. Query the monitoring system for the affected service's metrics
   over the last 2 hours.
2. Check the deployment log for recent changes to that service.
3. Correlate the timeline — did a deployment precede the anomaly?
4. If you find a likely cause, summarize it with evidence.
5. If you cannot determine a cause, say so and suggest what to
   check manually.

Numbered steps work better than prose for multi-step workflows because the model can track its position through the sequence. Each step should name one concrete action. "Analyze the situation" is weak. "Query the metrics API for error rates, grouped by endpoint, over the last 2 hours" is something the model can actually execute.

There is a Goldilocks zone for task instructions. At one extreme, engineers hardcode brittle if-else logic into the prompt — "if the user mentions billing, call the billing API; if they mention shipping, call the shipping API" — turning the prompt into a decision tree the model follows mechanically. This is fragile, hard to maintain, and defeats the purpose of using a model. At the other extreme, instructions are so vague ("help the user with their request") that the model has no concrete signal for the desired behavior. The sweet spot is specific enough to guide action, flexible enough to let the model apply judgment. Think heuristics, not rules.

One pitfall: overloading the task section with every possible scenario. A system prompt that tries to cover twenty edge cases becomes a wall of text that the model starts to ignore — attention is not uniform across the context window, and instructions buried deep in a long prompt carry less weight than those near the top. Handle the common path in the prompt. Handle rare edge cases in code, with guardrails and validation logic.

Constraints and Rules #

Constraints define the boundaries. They are usually the most overlooked part of a system prompt and the most impactful when things go wrong. A constraint is a hard rule: do not do X, always do Y, never assume Z.

Rules:
- Never execute a database migration without explicit user approval.
- Do not fabricate data. If the tool returns no results, say so.
- Do not access services outside the production-monitoring namespace.
- If a tool call fails twice, stop retrying and ask the user for help.
- Keep all responses under 300 words unless the user asks for detail.

Constraints work best as a flat list. The model processes them as independent rules, and a bulleted list makes each one visually distinct and hard to miss. Embedding a critical constraint inside a paragraph is a good way to have the model overlook it.

There is a tension here. Adding more constraints makes the agent safer but also more brittle — it may refuse legitimate requests because they brush against an overly broad rule. Every constraint has a cost, and the cost is flexibility. Add constraints that protect against actual failure modes you have observed or can reasonably predict. Do not add speculative rules "just in case."

Output Formatting #

The output section specifies how the model should structure its responses — tone, length, format, and any conventions specific to your use case. This is where you control whether the agent responds in Markdown, JSON, plain text, or some custom format.

Response format:
- Use Markdown for structured answers. Use headings for distinct sections.
- When reporting metrics, use a table.
- Always cite the tool or data source that supports your claim.
- Be concise. Prefer short paragraphs over long explanations.

For agents that produce structured output consumed by downstream code, the formatting section becomes even more critical. If the next step in your pipeline parses JSON, the model needs to know it must return valid JSON — and you should still validate it in code, because no prompt is a guarantee.
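As a minimal sketch of that validation step, assume the model was asked for a JSON object with `summary` and `severity` keys. Both field names are hypothetical, chosen only for illustration. The parsing code can then reject anything that does not match before it reaches downstream logic:

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"summary": str, "severity": str}


def parse_model_json(raw: str) -> dict:
    """Parse model output as JSON and check required fields; raise ValueError otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return data
```

Raising a single exception type keeps the caller simple: on `ValueError` it can retry the model call or fall back, whichever suits the pipeline.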

Context Assembly #

The system prompt is the first piece of the context window, but it is far from the only one. Every model call requires assembling a full context: the system prompt, the tool schemas, any retrieved documents, the conversation history, and the instructions for the current step. The order, formatting, and relative sizing of these sections all affect how the model weighs and uses the information.

We described the basic structure of the context window in the memory article. Here is the full picture from the prompt-engineering perspective:

┌─────────────────────────────────────────────┐
│              Context Window                 │
│                                             │
│  ┌─────────────────────────────────────┐    │
│  │  [1] System Prompt                  │    │
│  │  (identity, tasks, rules, format)   │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │  [2] Tool Schemas                   │    │
│  │  (names, descriptions, parameters)  │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │  [3] Dynamic Instructions           │    │
│  │  (step-specific rules, retrieved    │    │
│  │   policies, conditional guidance)   │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │  [4] Retrieved Context              │    │
│  │  (RAG results, long-term memory)    │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │  [5] Conversation History           │    │
│  │  (messages, tool calls, results)    │    │
│  └─────────────────────────────────────┘    │
│  ┌─────────────────────────────────────┐    │
│  │  [6] Current Step / User Query      │    │
│  └─────────────────────────────────────┘    │
│                                             │
└─────────────────────────────────────────────┘

Each section competes for space. The assembly logic runs before every model call and decides what to include, how much of it, and in what order. It is a function that adapts on each turn.

def assemble_context(
    system_prompt: str,
    tools: list[dict],
    dynamic_instructions: str,
    retrieved: list[str],
    history: list[dict],
    current_query: str,
    max_tokens: int = 8192,
) -> list[dict]:
    messages = []

    # 1. System prompt — always first, always present
    system_block = system_prompt
    if dynamic_instructions:
        system_block += "\n\n" + dynamic_instructions
    messages.append({"role": "system", "content": system_block})

    # 2. Tool schemas — injected via API parameter, not in messages
    # (handled by the model API, included here for token accounting)
    tool_tokens = count_tokens(format_tools(tools))

    # 3. Retrieved context — injected as a system or user message
    if retrieved:
        context_block = "Relevant context:\n" + "\n---\n".join(retrieved)
        messages.append({"role": "system", "content": context_block})

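    # 4. Conversation history: trimmed to fit what is left after the
    # messages so far, the tool schemas, and a 500-token reserve
    # (assumed here to cover the current query appended below)
    used = sum(count_tokens(m["content"]) for m in messages)
    remaining = max_tokens - used - tool_tokens - 500
    trimmed_history = trim_to_budget(history, remaining)
    messages.extend(trimmed_history)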

    # 5. Current query — always last
    messages.append({"role": "user", "content": current_query})

    return messages
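
The assembly function relies on `count_tokens` and `trim_to_budget` without defining them. A rough sketch of both follows. It assumes a crude estimate of four characters per token (a real system would use the tokenizer that matches its model), a `count_tokens` that takes a string, and a policy of keeping the most recent messages. The heuristic and the policy are both assumptions, not something the article prescribes.

```python
def count_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def trim_to_budget(history: list[dict], budget: int) -> list[dict]:
    """Keep the newest messages whose combined estimated size fits within budget."""
    kept: list[dict] = []
    used = 0
    # Walk backwards so the most recent turns are retained first.
    for message in reversed(history):
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping whole messages from the oldest end is the simplest policy; summarizing older turns is an alternative when losing them entirely is too costly.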

A few principles for assembly:

System prompt first, user query last. Models pay disproportionate attention to the beginning and end of the context. Your persistent rules belong at the top. The current task belongs at the bottom.

Static sections before dynamic ones. The system prompt and tool schemas are stable across turns — they set the frame. Conversation history and retrieved context change every turn — they provide the detail. Putting stable framing before variable detail gives the model a consistent reference point.

Each section should stand alone. Delimit sections clearly with headers, XML tags, or horizontal rules. When sections blur together, the model may treat retrieved context as conversation history or vice versa. Clear boundaries reduce misinterpretation.

<instructions>
You are a deployment assistant...
</instructions>

<retrieved_context>
Document: Rollback Procedure v2.3
...
</retrieved_context>

<conversation>
User: The payment service is returning 503 errors.
...
</conversation>

Dynamic Instructions #

Not everything belongs in the static system prompt. Some instructions only apply in certain situations — when the user is in a particular role, when the agent has reached a specific step, or when a policy document is retrieved that changes how the agent should behave. These are dynamic instructions: rules that are injected into the system prompt (or alongside it) at assembly time based on the current state.

def get_dynamic_instructions(state: dict) -> str:
    instructions = []

    if state.get("user_role") == "admin":
        instructions.append(
            "The user is an admin. You may execute destructive operations "
            "if they confirm."
        )
    else:
        instructions.append(
            "The user is a standard user. Do not offer or execute "
            "destructive operations."
        )

    if state.get("step") == "deployment":
        instructions.append(
            "You are in the deployment phase. Always verify the target "
            "environment before executing any deploy command."
        )

    if state.get("incident_active"):
        instructions.append(
            "An active incident is in progress. Prioritize diagnostic "
            "actions over changes. Do not deploy."
        )

    return "\n".join(instructions)

Dynamic instructions let you keep the base system prompt clean and focused on the common path while adapting behavior per context. The alternative — cramming every conditional rule into the base prompt — creates a bloated prompt that the model has to parse every time, even when most of the conditional branches are irrelevant.

The trade-off is that dynamic instructions are harder to test. A static system prompt is fixed: you can read it and know exactly what the model will be told. Dynamic instructions depend on runtime state, so the effective prompt can differ from one call to the next. To debug behavior, log the assembled prompt, not just the template.
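
One way to log the assembled prompt is sketched below. It assumes the context is a list of message dicts and that the state carries the `user_role`, `step`, and `incident_active` keys used earlier. The logger name and the 12-character fingerprint length are arbitrary choices.

```python
import hashlib
import json
import logging

logger = logging.getLogger("agent.prompt")


def log_assembled_prompt(messages: list[dict], state: dict) -> str:
    """Log the effective prompt for one model call and return its fingerprint."""
    # Serialize deterministically so identical prompts give identical hashes.
    serialized = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    fingerprint = hashlib.sha256(serialized.encode("utf-8")).hexdigest()[:12]
    logger.info(
        "prompt=%s role=%s step=%s incident=%s",
        fingerprint,
        state.get("user_role"),
        state.get("step"),
        state.get("incident_active"),
    )
    # Full text at debug level: useful when reproducing a bad call.
    logger.debug("prompt=%s body=%s", fingerprint, serialized)
    return fingerprint
```

The fingerprint makes it cheap to tell whether two calls saw the same effective prompt without diffing the full text.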

Formatting for Clarity #

How you format the system prompt matters as much as what you put in it. Models are sensitive to structure — clear formatting improves instruction-following, while ambiguous formatting leads to drift.

Use Delimiters #

XML-style tags, Markdown headings, and horizontal rules all serve as section delimiters that help the model parse the prompt structure. XML tags have become a common convention because they are unambiguous and nest cleanly:

<persona>
You are a data analyst specializing in financial reporting.
</persona>

<task>
When the user asks for a report:
1. Query the data warehouse for the relevant metrics.
2. Compute period-over-period changes.
3. Format the result as a Markdown table.
</task>

<constraints>
- Only query tables in the "reporting" schema.
- Never expose raw SQL to the user.
- If a query returns more than 1000 rows, summarize instead of listing.
</constraints>
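
Writing the tags by hand works, but generating them keeps opening and closing tags in sync as sections are added. A small sketch, assuming sections are kept in a dict whose keys double as tag names (`render_prompt` is an illustrative helper, not part of any library):

```python
def render_prompt(sections: dict[str, str]) -> str:
    """Wrap each section in an XML-style tag, in insertion order."""
    blocks = []
    for tag, body in sections.items():
        if not tag.isidentifier():
            raise ValueError(f"not a usable tag name: {tag!r}")
        blocks.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n\n".join(blocks)


prompt = render_prompt({
    "persona": "You are a data analyst specializing in financial reporting.",
    "constraints": '- Only query tables in the "reporting" schema.',
})
```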

Order and Priority #

Instructions near the top and bottom of the prompt tend to carry more weight than those in the middle. This positional bias has been observed across many models and is sometimes called the "lost in the middle" effect. Put your most critical constraints at the top. Repeat the most important rule at the end if you need extra insurance.

A practical pattern is the sandwich: critical rules at the top, detailed instructions in the middle, a brief restatement of the key constraint at the bottom.

<critical>
Never modify production data without explicit user confirmation.
</critical>

<task>
... (detailed task instructions) ...
</task>

<reminder>
Remember: never modify production data without explicit user confirmation.
</reminder>

This pattern adds tokens and can feel redundant, but for high-stakes constraints the repetition is cheap insurance against violations. Confirm the effect with your own evaluations.
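
If the rule is typed out twice by hand, the two copies tend to drift apart over time. A minimal sketch that builds the sandwich from a single source string, assuming the `<critical>`, `<task>`, and `<reminder>` tags shown above (the function name is illustrative):

```python
def sandwich(critical_rule: str, task: str) -> str:
    """Place one critical rule before and after the task instructions."""
    rule = critical_rule.strip()
    if not rule:
        raise ValueError("critical_rule must not be empty")
    # Lowercase the first letter so the rule reads naturally after "Remember:".
    reminder = rule[0].lower() + rule[1:]
    return (
        f"<critical>\n{rule}\n</critical>\n\n"
        f"<task>\n{task.strip()}\n</task>\n\n"
        f"<reminder>\nRemember: {reminder}\n</reminder>"
    )
```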

Negative vs. Positive Framing #

Tell the model what to do, not just what to avoid. "Do not use informal language" is weaker than "Use formal, professional language." Negative instructions require the model to infer the desired behavior, which leaves room for creative misinterpretation. Positive instructions state the target directly.

That said, some constraints are inherently negative — "never execute DROP TABLE" — and those should stay as explicit prohibitions. The guideline is: use positive framing for behavior you want, use negative framing for hard boundaries you must enforce.

Prompt Caching #

The system prompt, tool schemas, and any static instructions are identical across turns within a conversation — and often across conversations for the same agent. Recomputing the model's internal representation of this stable prefix on every call is wasteful. Prompt caching addresses this by letting the model provider store and reuse the processed prefix, reducing both latency and cost for subsequent calls.

Turn 1:                          Turn 2:
┌──────────────────────┐        ┌───────────────────────┐
│ System prompt  (NEW) │        │ System prompt (CACHE) │
│ Tool schemas   (NEW) │        │ Tool schemas  (CACHE) │
│ History: msg 1 (NEW) │        │ History: msgs  (NEW)  │
│ User query     (NEW) │        │ User query     (NEW)  │
└──────────────────────┘        └───────────────────────┘
   Full processing                 Prefix is cached,
                                   only new content is
                                   processed

The mechanics vary by provider, but the general pattern is the same: you designate a prefix of the context — everything that does not change between calls — and the provider caches the key-value attention states for that prefix. On subsequent calls, the cached prefix is loaded instead of recomputed, and only the new tokens (conversation history, current query) are processed from scratch.

# Pseudocode — API specifics vary by provider
response = client.chat(
    messages=[
        {
            "role": "system",
            "content": SYSTEM_PROMPT,  # large, stable block
            "cache_control": {"type": "ephemeral"},
        },
        *conversation_history,
        {"role": "user", "content": current_query},
    ],
    tools=TOOL_SCHEMAS,
)

The savings are substantial. For an agent with a 2,000-token system prompt and 500 tokens of tool schemas, caching avoids reprocessing 2,500 tokens on every turn after the first. Over a 20-turn conversation, that is 47,500 fewer input tokens processed — which translates directly into lower cost and faster time-to-first-token.
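
The arithmetic behind that figure can be checked directly. A minimal sketch, assuming every turn after the first hits the cache, and ignoring cache expiry and whatever a given provider charges for writing or reading cached tokens (the function name is illustrative):

```python
def cached_tokens_saved(prefix_tokens: int, turns: int) -> int:
    """Input tokens not reprocessed, assuming a cache hit on every turn after the first."""
    return prefix_tokens * max(0, turns - 1)


prefix = 2_000 + 500  # system prompt + tool schemas
saved = cached_tokens_saved(prefix, turns=20)  # 2,500 * 19 = 47,500
```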

There are constraints. The cached prefix must be an exact match — if you change a single token, the cache is invalidated and the full prompt is reprocessed. This means dynamic instructions that change every turn should go after the cached prefix, not inside it. Structure your context so the stable parts (system prompt, tool schemas, static instructions) come first and the variable parts (dynamic instructions, history, query) come after.
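
A quick way to catch accidental cache busting is to fingerprint the part of the context you intend to cache and compare it across turns. The sketch below assumes the context is a list of message dicts and that the first `prefix_len` messages form the stable prefix. The example messages are invented for illustration.

```python
import hashlib
import json


def prefix_fingerprint(messages: list[dict], prefix_len: int) -> str:
    """Hash the first prefix_len messages, i.e. the part intended to be cached."""
    stable = json.dumps(messages[:prefix_len], sort_keys=True)
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()


turn_1 = [
    {"role": "system", "content": "You are a deployment assistant."},
    {"role": "system", "content": "Current time: 10:00"},  # changes every turn
    {"role": "user", "content": "Status?"},
]
turn_2 = [
    {"role": "system", "content": "You are a deployment assistant."},
    {"role": "system", "content": "Current time: 10:05"},
    {"role": "user", "content": "Deploy 3.2.1"},
]

# Only the first message is stable; including the second breaks the match.
stable_ok = prefix_fingerprint(turn_1, 1) == prefix_fingerprint(turn_2, 1)
leaky = prefix_fingerprint(turn_1, 2) == prefix_fingerprint(turn_2, 2)
```

Here `stable_ok` is true and `leaky` is false: a timestamp inside the intended prefix would force a full reprocess on every call.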

  ┌────────────────────────────────────┐
  │     Cacheable Prefix               │
  │  ┌──────────────────────────────┐  │
  │  │ System prompt                │  │
  │  │ Tool schemas                 │  │
  │  │ Static reference docs        │  │
  │  └──────────────────────────────┘  │
  ├────────────────────────────────────┤
  │     Variable Suffix                │
  │  ┌──────────────────────────────┐  │
  │  │ Dynamic instructions         │  │
  │  │ Retrieved context            │  │
  │  │ Conversation history         │  │
  │  │ Current query                │  │
  │  └──────────────────────────────┘  │
  └────────────────────────────────────┘

Prompt caching also works across conversations. If multiple users interact with the same agent and the system prompt is identical, the cached prefix can be shared. This is particularly valuable for agents deployed at scale — a customer support agent serving thousands of concurrent conversations with the same base prompt benefits enormously from a shared cache.

Context Budgeting #

The context window has a hard ceiling: 8K, 32K, 128K, or 200K tokens depending on the model. Everything must fit. But the ceiling is not the only constraint. Research on context rot shows that as the number of tokens in the window increases, the model's ability to accurately recall information from that context degrades. The transformer architecture lets every token attend to every other token, which means n² pairwise relationships for n tokens. As context grows, this attention gets stretched thin. The result is a performance gradient: the model still works at long contexts, but precision for retrieval and reasoning drops compared to shorter ones. Even if you can fit 100K tokens in the window, that does not mean you should.
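
To make the quadratic growth of pairwise attention concrete, here is a small sketch. It assumes full (non-sparse) self-attention, where n tokens yield n squared ordered pairs; real models may use optimizations that change the constant but not the overall trend.

```python
def pairwise_interactions(n_tokens: int) -> int:
    """Number of ordered token pairs under full self-attention: n squared."""
    return n_tokens * n_tokens


# Growing the context 16x (8K -> 128K tokens) multiplies the pairs by 16 * 16.
growth = pairwise_interactions(128_000) / pairwise_interactions(8_000)
```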

This makes context budgeting essential. It is the practice of assigning a token budget to each section of the context window and enforcing those budgets during assembly. The goal is to guarantee that every section gets enough space to be useful, while preventing any single section from crowding out the others — and to keep the total context as tight as possible.

BUDGETS = {
    "system_prompt": 1500,
    "tool_schemas": 800,
    "dynamic_instructions": 400,
    "retrieved_context": 2000,
    "history": 3000,
    "current_query": 500,
}

def assemble_with_budget(sections: dict[str, str]) -> list[dict]:
    result = []
    for key, content in sections.items():
        budget = BUDGETS[key]
        if count_tokens(content) > budget:
            content = compress(content, budget, strategy=STRATEGIES[key])
        result.append({"section": key, "content": content})
    return result

The budgets are not arbitrary — they encode priorities. The system prompt and tool schemas get fixed allocations because they are essential for correct behavior. The current query gets a reserved minimum. Everything else — history, retrieved context, dynamic instructions — competes for the remainder.

When a section exceeds its budget, the compression strategy depends on the section type:

Section               | Compression strategy
----------------------+--------------------------------------------------------------------------
System prompt         | Should not be compressed; if it exceeds budget, the prompt needs editing
Tool schemas          | Remove optional descriptions, drop rarely-used tools for this step
Retrieved context     | Drop lowest-relevance chunks, truncate long passages
Conversation history  | Summarize older turns, drop failed intermediate steps
Dynamic instructions  | Remove lower-priority rules
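
As one example of such a strategy, the sketch below implements "drop lowest-relevance chunks" for retrieved context. It assumes each chunk arrives as a `(score, text)` pair from the retriever and uses a crude estimate of four characters per token. The input shape and the estimate are both assumptions for illustration.

```python
def drop_lowest_relevance(chunks: list[tuple[float, str]], budget: int) -> list[str]:
    """Keep the highest-scoring chunks that fit the budget; return them in original order."""

    def estimate_tokens(text: str) -> int:
        return len(text) // 4  # crude assumption, not a real tokenizer

    # Rank by score while remembering each chunk's original position.
    ranked = sorted(enumerate(chunks), key=lambda item: item[1][0], reverse=True)
    kept_indices = []
    used = 0
    for index, (_score, text) in ranked:
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept_indices.append(index)
            used += cost
    return [chunks[i][1] for i in sorted(kept_indices)]
```

Returning the survivors in their original order preserves whatever ordering the retriever or document structure implied.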

Adaptive Budgeting #

Static budgets are a good starting point, but they waste space. If the system prompt only uses 800 of its 1,500-token allocation, those 700 tokens should flow to another section instead of going unused. Adaptive budgeting allocates guaranteed minimums to each section, then distributes the remaining space based on demand.

MINIMUMS = {
    "system_prompt": 800,
    "tool_schemas": 400,
    "retrieved_context": 500,
    "history": 1000,
    "current_query": 300,
}

def adaptive_budget(sections: dict[str, str], max_tokens: int) -> dict[str, int]:
    # Measure each section once
    actual = {key: count_tokens(content) for key, content in sections.items()}

    # Start with actual sizes, capped at each section's guaranteed minimum
    # (a section with no entry in MINIMUMS gets no guarantee)
    allocated = {key: min(size, MINIMUMS.get(key, 0)) for key, size in actual.items()}

    # Distribute remaining space proportionally to unmet demand
    remaining = max_tokens - sum(allocated.values())
    overflow = {key: actual[key] - allocated[key] for key in sections}
    total_overflow = sum(overflow.values())

    if total_overflow > 0 and remaining > 0:
        for key in overflow:
            share = int(remaining * overflow[key] / total_overflow)
            # Never grant a section more than it actually needs
            allocated[key] += min(share, overflow[key])

    return allocated

This avoids the common failure mode where a long conversation history eats into the system prompt's space, or a large set of retrieved documents leaves no room for history. Each section is guaranteed its minimum, and surplus space goes where it is most needed.

Budget Is Not Enough #

Sometimes, even with compression and adaptive allocation, the context window is genuinely full. This is the point where you have to make hard choices:

  • Drop tools. If the agent has 30 tools but the current step only plausibly needs 5, filter the tool schemas down to the relevant subset. This frees hundreds of tokens.
  • Clear old tool results. Once a tool call deep in the conversation history has been incorporated into the model's answer or a subsequent summary, the raw result is dead weight. Stripping it saves tokens without losing information the model still needs. This is one of the safest, lightest-touch forms of compression.
  • Compact and reinitialize. For long-horizon tasks — codebase migrations, multi-hour research — the conversation can outgrow any context window. Compaction summarizes the current context into a condensed block and reinitializes a fresh window with the summary plus a small set of recent files or state. The agent continues with minimal degradation. The art is in what to keep: architectural decisions, unresolved issues, and active constraints are high-value; verbose intermediate outputs are not. Start by maximizing recall in your compaction prompt, then iterate to trim the noise.
  • Use structured note-taking. Have the agent write notes to an external file (a scratchpad, a to-do list, a NOTES.md) that persists outside the context window and gets pulled back in selectively. This gives the agent long-term memory across compaction boundaries without paying the token cost of keeping everything in context. The notes survive even when the window is reset.
  • Summarize aggressively. Replace older stretches of the conversation, say the first 15 turns, with a two-paragraph summary. You lose detail, but you keep the thread.
  • Split the task. If a single context window cannot hold the information the agent needs, break the step into sub-steps, each with its own focused context. This is how orchestration patterns like coordinator-worker address context limits — the coordinator holds the high-level plan, and each worker gets a focused slice of the context.
  • Upgrade the model. A model with a larger context window is sometimes cheaper than the engineering effort required to compress everything into a smaller one. This is a real trade-off that teams underweight.
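
Clearing old tool results is the easiest of these options to sketch. The version below assumes tool results appear in the history as messages with `role` set to `"tool"` (a common but not universal convention) and that the last few results should stay intact. `keep_recent` and the stub text are illustrative choices.

```python
def clear_old_tool_results(history: list[dict], keep_recent: int = 3) -> list[dict]:
    """Replace the content of all but the most recent tool results with a short stub."""
    tool_positions = [i for i, m in enumerate(history) if m.get("role") == "tool"]
    if keep_recent > 0:
        stale = set(tool_positions[:-keep_recent])
    else:
        stale = set(tool_positions)

    cleaned = []
    for i, message in enumerate(history):
        if i in stale:
            # Keep the message so call/result pairing stays intact; drop the payload.
            message = {**message, "content": "[tool result cleared]"}
        cleaned.append(message)
    return cleaned
```

The function returns a new list and leaves the input untouched, so the full history can still be written to logs or long-term storage.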

Prompt Versioning and Testing #

System prompts are code. They determine agent behavior as directly as any function or configuration file, and they deserve the same discipline: version control, review, and testing.

Treat Prompts as Source Code #

Store your system prompts in version-controlled files, not embedded in application code as string literals. This makes diffs visible, reviews meaningful, and rollbacks possible.

prompts/
  ├── deployment-agent/
  │   ├── system.md
  │   ├── tools.json
  │   └── dynamic/
  │       ├── admin-rules.md
  │       └── incident-mode.md
  └── support-agent/
      ├── system.md
      └── tools.json

Each prompt file is loaded at startup or per-request, assembled into the context window by your assembly logic. When you need to change agent behavior, you change the prompt file, review the diff, and deploy — the same workflow as any other configuration change.
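
A loader for that layout can be very small. The sketch below assumes the directory structure shown above, with dynamic rule files named `<name>.md` under `dynamic/`. `load_prompt` and its parameters are hypothetical names.

```python
from pathlib import Path


def load_prompt(root: Path, agent: str, dynamic: tuple[str, ...] = ()) -> str:
    """Read system.md for an agent and append any named dynamic rule files."""
    agent_dir = root / agent
    parts = [(agent_dir / "system.md").read_text(encoding="utf-8").strip()]
    for name in dynamic:
        rule_file = agent_dir / "dynamic" / f"{name}.md"
        parts.append(rule_file.read_text(encoding="utf-8").strip())
    return "\n\n".join(parts)
```

For example, `load_prompt(Path("prompts"), "deployment-agent", ("incident-mode",))` would return the base prompt followed by the incident-mode rules. A missing file raises `FileNotFoundError`, which is usually preferable to silently running with a partial prompt.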

Evaluation #

A prompt change can subtly shift agent behavior in ways that are hard to predict by reading the diff. "Respond concisely" and "Be brief" mean different things to different models. The only way to know is to test.

A basic evaluation pipeline runs a set of representative queries against the agent with the new prompt and compares the results to expected outputs, using exact matches, pattern matches, or LLM-as-judge evaluations in which a separate model scores the quality of the response.

EVAL_CASES = [
    {
        "query": "Deploy version 3.2.1 to staging",
        "expected_tool": "deploy_service",
        "expected_args": {"version": "3.2.1", "environment": "staging"},
    },
    {
        "query": "What's the current error rate?",
        "expected_tool": "query_metrics",
        "must_not_contain": ["I don't have access", "I cannot"],
    },
    {
        "query": "Drop the users table",
        "expected_response_contains": "I cannot",
        "must_not_call": ["execute_sql"],
    },
]

def run_eval(prompt: str, cases: list[dict]) -> dict:
    results = {"passed": 0, "failed": 0, "details": []}
    for case in cases:
        response = run_agent(prompt, case["query"])
        passed = check_expectations(response, case)
        results["passed" if passed else "failed"] += 1
        results["details"].append({"case": case["query"], "passed": passed})
    return results

This is a smoke test that catches regressions. The most valuable test cases come from real failures: every time the agent does something wrong in production, add a test case that would have caught it. Over time, this builds a regression suite that gives you confidence when changing the prompt.
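
`check_expectations` is left undefined above. One possible sketch follows, assuming the agent's response is a dict with a `text` string and a `tool_calls` list of `{"name": ..., "args": ...}` entries. That response shape is an assumption; adapt it to whatever `run_agent` actually returns.

```python
def check_expectations(response: dict, case: dict) -> bool:
    """Return True if the response satisfies every expectation present in the case."""
    text = response.get("text", "")
    calls = response.get("tool_calls", [])
    called = [call["name"] for call in calls]

    if "expected_tool" in case:
        matching = [c for c in calls if c["name"] == case["expected_tool"]]
        if not matching:
            return False
        expected_args = case.get("expected_args", {})
        # At least one matching call must carry every expected argument value.
        if not any(
            all(c.get("args", {}).get(k) == v for k, v in expected_args.items())
            for c in matching
        ):
            return False
    if any(name in called for name in case.get("must_not_call", [])):
        return False
    if any(phrase in text for phrase in case.get("must_not_contain", [])):
        return False
    expected_phrase = case.get("expected_response_contains")
    if expected_phrase is not None and expected_phrase not in text:
        return False
    return True
```

Substring checks such as `"I cannot"` are brittle against rephrasing, so cases that depend on wording are good candidates for pattern matching or an LLM-as-judge check instead.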

Putting It All Together #

Here is a complete example showing how the pieces fit: a system prompt with clear layers, context assembly with budgeting, and prompt caching for efficiency.

# --- System prompt: stored in prompts/incident-agent/system.md ---
SYSTEM_PROMPT = """
<persona>
You are an incident response agent for a microservices platform.
You are calm, methodical, and evidence-driven. You never guess
at root causes — you investigate until you have data.
</persona>

<task>
When a user reports an issue:
1. Identify the affected service from the user's description.
2. Query the monitoring API for error rates, latency, and recent
   deployments for that service.
3. Check the dependency graph for upstream or downstream failures.
4. Correlate the timeline: did a deployment, config change, or
   upstream failure precede the issue?
5. Summarize your findings with evidence and a recommended action.
</task>

<constraints>
- Do not restart services or roll back deployments without user approval.
- Do not access services outside the production namespace.
- If a tool call fails twice, stop and ask the user for guidance.
- Cite the specific metric, log line, or deployment that supports
  each conclusion.
</constraints>

<format>
- Use Markdown. Use headings for each investigation step.
- Present metrics in tables.
- Keep the summary under 200 words.
</format>
"""

# --- Assembly ---
def build_context(state: dict, user_query: str) -> list[dict]:
    dynamic = get_dynamic_instructions(state)
    retrieved = retrieve_relevant_runbooks(user_query)
    history = state["messages"]

    budget = adaptive_budget(
        {
            "system_prompt": SYSTEM_PROMPT + "\n" + dynamic,
            "retrieved_context": "\n---\n".join(retrieved),
            "history": format_messages(history),
            "current_query": user_query,
        },
        max_tokens=8192,
    )

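    # trim_to_budget is assumed to handle both message dicts and plain strings
    trimmed_history = trim_to_budget(history, budget["history"])
    trimmed_context = trim_to_budget(retrieved, budget["retrieved_context"])

    # Stable prefix first so it can be cached; variable sections follow
    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        },
    ]
    # Skip empty sections rather than sending blank system messages
    if dynamic:
        messages.append({"role": "system", "content": dynamic})
    if trimmed_context:
        runbooks = "Relevant runbooks:\n" + "\n---\n".join(trimmed_context)
        messages.append({"role": "system", "content": runbooks})
    messages.extend(trimmed_history)
    messages.append({"role": "user", "content": user_query})
    return messages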

The static system prompt is cached — it stays identical across turns. Dynamic instructions, retrieved runbooks, and conversation history are assembled fresh each time. The budget ensures that a verbose conversation history does not push out the runbooks, and vice versa.

Conclusion #

The system prompt and the context assembly logic are where you program the agent's behavior. The model does the reasoning, but you control what it reasons about and what rules it follows.

Key takeaways:

  • Structure system prompts in layers — identity, task instructions, constraints, output format — and keep each layer focused
  • Constraints should protect against real failure modes, not hypothetical ones; every rule costs flexibility
  • Context assembly is a function that runs on every model call, not a one-time configuration; it decides what the model sees
  • Format the prompt with clear delimiters (XML tags, headings) and place critical instructions at the top and bottom where attention is strongest
  • Dynamic instructions let you adapt behavior per step or per user without bloating the base prompt
  • Prompt caching eliminates redundant processing of the stable prefix, reducing cost and latency — structure your context with stable sections first and variable sections after
  • Budget token space explicitly across sections, with guaranteed minimums and adaptive overflow, so no single section can starve the others
  • Treat prompts as source code: version them, review diffs, and build an evaluation suite from real failures