Trust Calibration & Selective Abstention

Published: 28 Jun 2026

The most dangerous agent is the one that is confidently wrong. A model that hallucinates a plausible-sounding answer — and proceeds to act on it — can do more damage than one that fails outright. A failed tool call produces an error. A confidently wrong decision produces a refund, a broken deployment, or a legal liability.

Good agents need to know what they do not know. They need calibrated uncertainty: an estimate of how likely a given answer is to be correct. And when confidence is low, they need a policy for what to do: abstain, ask a human, hedge, or delegate to a specialist. This is the difference between an agent that is capable and one that is trustworthy.

The Calibration Problem #

Calibration means that when a model says it is 80% confident, it should be correct about 80% of the time. Language models are often poorly calibrated out of the box: they tend to be overconfident, particularly on questions that resemble their training data, and especially when generating fluent-sounding text that happens to be wrong.

This matters less for a chatbot (the user can evaluate the answer themselves) and much more for an agent (the agent may act on the answer before anyone reviews it). An agent that calls delete_deployment("prod") with the same confidence it uses to write a haiku is a liability.

  Ideal Calibration         Typical LLM (Overconfident)
  ┌──────────────────┐      ┌──────────────────┐
  │                 /│      │            ████  │
  │               /  │      │          ████    │
  │ Actual      /    │      │        ████      │
  │ Accuracy  /      │      │     ████         │
  │         /        │      │   ████           │
  │       /          │      │  ██              │
  │     /            │      │ ██               │
  │   /              │      │██                │
  │ /                │      │█                 │
  └──────────────────┘      └──────────────────┘
    Stated Confidence         Stated Confidence

In the ideal case, the calibration curve is a diagonal: every confidence bucket matches its accuracy. In practice, models cluster at the high-confidence end of the axis: they say "I'm very confident" for almost everything, regardless of whether they are right.

Confidence Estimation Strategies #

Since raw model confidence (token logprobs) is a weak signal for factual accuracy, you need additional strategies to estimate how much to trust an agent's output before acting on it.

Token-Level Logprobs #

A logprob (log-probability) is the natural logarithm of the probability the model assigned to a token when it generated that token. Most inference APIs expose logprobs as an optional parameter on the completion request. You typically set logprobs=True (or specify a top_logprobs count) in your API call, and the response includes the log-probability for each generated token alongside the text. Some APIs also return the top-k alternative tokens and their logprobs at each position, which lets you see how "close" the runner-up choices were — a useful signal for uncertainty.

The simplest confidence signal is the log-probability of the generated tokens. Low logprobs on key tokens (entity names, numbers, tool arguments) suggest the model is uncertain. High logprobs suggest fluency but not necessarily correctness — a confident hallucination still has high logprobs.

import math


def extract_confidence_from_logprobs(
    response,
    critical_token_indices: list[int] | None = None,
) -> float:
    """Estimate confidence from token logprobs."""
    # Assumes response.logprobs is a flat list of per-token logprobs (floats)
    if not response.logprobs:
        return 0.5  # No signal; assume neutral

    logprobs = response.logprobs

    if critical_token_indices:
        # Focus on specific tokens (e.g., the tool name, a number)
        relevant = [logprobs[i] for i in critical_token_indices
                    if i < len(logprobs)]
    else:
        relevant = logprobs

    if not relevant:
        return 0.5  # Every requested index was out of range

    # Geometric mean of the token probabilities, computed in log space as
    # exp(mean(logprob)). This avoids log(0) when a probability underflows.
    return math.exp(sum(relevant) / len(relevant))

Logprobs are cheap and fast but noisy. Use them as a first filter — if logprobs are extremely low, something is probably wrong — but do not rely on them alone for high-stakes decisions.
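The runner-up signal mentioned earlier can be turned into a number as well. The following is a minimal sketch, assuming you have already extracted the top-k alternatives for one token position into a plain dict of token to logprob. The function name `top_token_margin` and that dict shape are illustrative and not part of any particular inference API.

```python
import math


def top_token_margin(top_logprobs: dict[str, float]) -> float:
    """Probability gap between the best and second-best token at one position.

    top_logprobs is a hypothetical mapping of candidate token -> logprob.
    A gap near 1.0 means the model strongly preferred one token; a gap
    near 0.0 means the runner-up was almost as likely.
    """
    probs = sorted((math.exp(lp) for lp in top_logprobs.values()), reverse=True)
    if not probs:
        return 0.0  # No candidates; treat as maximally uncertain
    if len(probs) == 1:
        return probs[0]
    return probs[0] - probs[1]


# A clear preference for one tool name vs. a near coin flip
confident = top_token_margin({"read_file": math.log(0.95), "search": math.log(0.03)})
uncertain = top_token_margin({"delete": math.log(0.48), "archive": math.log(0.45)})
print(f"confident margin: {confident:.2f}")
print(f"uncertain margin: {uncertain:.2f}")
```

A small margin on a tool name or a numeric argument is a reasonable trigger for one of the more expensive checks described in the following sections.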

Self-Consistency (Multiple Samples) #

A more reliable approach: ask the model the same question multiple times (with temperature > 0) and check whether the answers agree. If five samples all produce the same tool call with the same arguments, confidence is high. If they diverge, the model is uncertain about the correct action.

import collections
import json


def self_consistency_confidence(
    prompt: str,
    model: str,
    n_samples: int = 5,
    temperature: float = 0.7,
) -> dict:
    """Run multiple samples and measure agreement."""
    responses = []
    for _ in range(n_samples):
        response = call_model(prompt, model=model, temperature=temperature)
        responses.append(extract_action(response))

    # Count unique actions
    action_counts = collections.Counter(
        json.dumps(r, sort_keys=True) for r in responses
    )
    most_common_action, count = action_counts.most_common(1)[0]

    return {
        "action": json.loads(most_common_action),
        "agreement_rate": count / n_samples,
        "n_unique_actions": len(action_counts),
        "confidence": count / n_samples,
    }

Self-consistency multiplies your inference cost by n_samples, so you cannot afford to run it on every step. Use it selectively — when the agent is about to take a high-impact action, or when the first response triggers a low-confidence heuristic.
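One way to trim that cost is to stop sampling as soon as the leading action can no longer be overtaken. This is a sketch of that variant, not something the function above does. The `sample_action` callable is a hypothetical zero-argument wrapper around whatever produces one parsed action (for instance `call_model` followed by `extract_action`).

```python
import collections
import json
from typing import Callable


def early_stop_agreement(
    sample_action: Callable[[], dict],
    n_samples: int = 5,
) -> dict:
    """Sample up to n_samples actions, stopping once the majority is locked in."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")

    counts: collections.Counter = collections.Counter()
    drawn = 0
    top_key, top_count = "", 0
    for drawn in range(1, n_samples + 1):
        counts[json.dumps(sample_action(), sort_keys=True)] += 1
        top_key, top_count = counts.most_common(1)[0]
        runner_up = max(
            (c for key, c in counts.items() if key != top_key), default=0
        )
        remaining = n_samples - drawn
        # Even if every remaining sample went to the runner-up,
        # it could not catch the leader: stop paying for samples.
        if top_count > runner_up + remaining:
            break

    return {
        "action": json.loads(top_key),
        "samples_used": drawn,
        # Agreement the leader is guaranteed to reach out of n_samples
        "agreement_lower_bound": top_count / n_samples,
    }
```

With five samples and a model that keeps producing the same action, this stops after three calls. The trade-off is that the reported agreement is only a lower bound, so thresholds tuned on the full agreement rate need to be revisited.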

Verbalized Confidence #

Ask the model directly: "How confident are you in this answer, on a scale from 0 to 1?" This sounds naive, but research suggests that models can be somewhat calibrated when asked to self-assess, particularly if you provide calibration instructions in the prompt.

CONFIDENCE_PROMPT = """
Before executing this action, assess your confidence level.

Consider:
- Do you have enough information to be certain?
- Are there ambiguities in the user's request?
- Could this action cause harm if wrong?
- Have you seen similar situations where you were incorrect?

Rate your confidence from 0.0 to 1.0:
- 0.9-1.0: Virtually certain. You would bet on this.
- 0.7-0.9: Confident but acknowledge small chance of error.
- 0.5-0.7: Unsure. Multiple reasonable interpretations exist.
- 0.3-0.5: Low confidence. You are guessing.
- 0.0-0.3: Very uncertain. You should ask for clarification.

Your confidence (number only):
"""


def get_verbalized_confidence(
    agent_response: str,
    original_prompt: str,
    model: str,
) -> float:
    """Ask the model to self-assess its confidence."""
    assessment_prompt = (
        f"You just generated this response:\n{agent_response}\n\n"
        f"For this task:\n{original_prompt}\n\n"
        f"{CONFIDENCE_PROMPT}"
    )
    result = call_model(assessment_prompt, model=model, temperature=0.0)

    try:
        confidence = float(result.strip())
        return max(0.0, min(1.0, confidence))
    except ValueError:
        return 0.5  # Parse failure; assume neutral

Verbalized confidence is cheap (one extra inference call) and correlates better with actual accuracy than raw logprobs on many tasks. The main weakness: models tend to be overconfident when they are wrong about facts they "remember" from training, because they cannot distinguish genuine knowledge from plausible confabulation.

Retrieval Verification #

For agents that use RAG or tool outputs, you can estimate confidence by checking whether the agent's answer is grounded in the retrieved evidence. If the answer cites specific passages from retrieved documents, confidence is higher. If it synthesizes an answer that does not appear in any source, it may be hallucinating.

def retrieval_grounding_score(
    agent_answer: str,
    retrieved_documents: list[str],
    model: str,
) -> float:
    """Score how well the answer is grounded in retrieved evidence."""
    grounding_prompt = f"""
Given the following retrieved documents:
{format_documents(retrieved_documents)}

And the following answer:
{agent_answer}

For each claim in the answer, determine if it is:
- SUPPORTED: directly stated or clearly implied by the documents
- UNSUPPORTED: not found in the documents
- CONTRADICTED: conflicts with the documents

Return a JSON object:
{{"supported": <count>, "unsupported": <count>, "contradicted": <count>}}
"""
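    result = call_model(grounding_prompt, model=model, temperature=0.0)
    try:
        counts = json.loads(result)
    except json.JSONDecodeError:
        counts = None
    if not isinstance(counts, dict):
        return 0.5  # Parse failure; assume neutral

    # Read only the expected keys; treat missing ones as zero
    supported = counts.get("supported", 0)
    unsupported = counts.get("unsupported", 0)
    contradicted = counts.get("contradicted", 0)

    total = supported + unsupported + contradicted
    if total == 0:
        return 0.5

    # Contradicted claims are worse than unsupported
    score = (supported - 2 * contradicted) / total
    return max(0.0, min(1.0, score))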
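To see how the penalty for contradictions plays out, here is the scoring arithmetic on its own, separated from the model call. The standalone function name `grounding_score` and the sample counts are made up for illustration.

```python
def grounding_score(supported: int, unsupported: int, contradicted: int) -> float:
    """Same formula as above: contradictions count double against the answer."""
    total = supported + unsupported + contradicted
    if total == 0:
        return 0.5  # No claims found; no signal either way
    raw = (supported - 2 * contradicted) / total
    return max(0.0, min(1.0, raw))


# (supported, unsupported, contradicted)
for counts in [(5, 0, 0), (3, 1, 1), (2, 0, 2)]:
    print(counts, "->", round(grounding_score(*counts), 2))
```

Five supported claims score 1.0. Three supported, one unsupported, and one contradicted claim score (3 - 2) / 5 = 0.2. An answer with as many contradicted claims as supported ones is clamped to 0.0.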

Building a Confidence Pipeline #

In practice, you combine multiple confidence signals into a single score. No single method is reliable enough on its own, but their combination produces a usable estimate.

  ┌──────────────┐   ┌──────────────────┐   ┌──────────────────┐
  │  Token       │   │  Self-           │   │  Verbalized      │
  │  Logprobs    │   │  Consistency     │   │  Confidence      │
  │  (fast/cheap)│   │  (expensive)     │   │  (cheap)         │
  └──────┬───────┘   └────────┬─────────┘   └────────┬─────────┘
         │                    │                      │
         ▼                    ▼                      ▼
  ┌──────────────────────────────────────────────────────────────┐
  │                  Confidence Aggregator                       │
  │                                                              │
  │  Weighted combination based on task type and action risk     │
  └──────────────────────────────┬───────────────────────────────┘
                                 │
                                 ▼
                    ┌────────────────────────┐
                    │   Confidence Score     │
                    │   + Decision Policy    │
                    └────────────────────────┘

from dataclasses import dataclass


@dataclass
class ConfidenceEstimate:
    logprob_score: float | None
    consistency_score: float | None
    verbalized_score: float | None
    grounding_score: float | None
    combined_score: float
    method_used: list[str]


class ConfidencePipeline:
    def __init__(self, config: dict):
        self.config = config
        self.weights = config.get("weights", {
            "logprob": 0.15,
            "consistency": 0.35,
            "verbalized": 0.25,
            "grounding": 0.25,
        })

    def estimate(
        self,
        response,
        prompt: str,
        action_risk: str,  # "low", "medium", "high"
        retrieved_docs: list[str] | None = None,
    ) -> ConfidenceEstimate:
        scores = {}
        methods = []

        # Always compute logprobs (cheap)
        if hasattr(response, "logprobs") and response.logprobs:
            scores["logprob"] = extract_confidence_from_logprobs(response)
            methods.append("logprob")

        # Always compute verbalized confidence (cheap)
        scores["verbalized"] = get_verbalized_confidence(
            response.text, prompt, model=self.config["model"]
        )
        methods.append("verbalized")

        # Grounding score if RAG was used
        if retrieved_docs:
            scores["grounding"] = retrieval_grounding_score(
                response.text, retrieved_docs, model=self.config["model"]
            )
            methods.append("grounding")

        # Self-consistency only for high-risk actions (expensive)
        if action_risk == "high":
            result = self_consistency_confidence(
                prompt, model=self.config["model"], n_samples=5
            )
            scores["consistency"] = result["confidence"]
            methods.append("consistency")

        # Weighted combination
        total_weight = sum(
            self.weights[m] for m in methods if m in self.weights
        )
        combined = sum(
            scores[m] * self.weights[m] / total_weight
            for m in methods if m in self.weights
        )

        return ConfidenceEstimate(
            logprob_score=scores.get("logprob"),
            consistency_score=scores.get("consistency"),
            verbalized_score=scores.get("verbalized"),
            grounding_score=scores.get("grounding"),
            combined_score=combined,
            method_used=methods,
        )

The key design choice: self-consistency is expensive, so only invoke it for high-risk actions. For low-risk actions, logprobs and verbalized confidence are good enough. This keeps the average cost low while concentrating effort where it matters.
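Because not every signal is computed on every call, the weights are renormalized over whichever signals are present. The sketch below isolates that step so the arithmetic is easy to check. The helper name `combine` is hypothetical, and the weights are the defaults from the pipeline above.

```python
DEFAULT_WEIGHTS = {
    "logprob": 0.15,
    "consistency": 0.35,
    "verbalized": 0.25,
    "grounding": 0.25,
}


def combine(
    scores: dict[str, float],
    weights: dict[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average over the signals that were actually computed."""
    used = {name: weights[name] for name in scores if name in weights}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted signals available")
    return sum(scores[name] * w / total for name, w in used.items())


# Low-risk path: only logprobs and verbalized confidence are available.
# Effective weights become 0.15/0.40 = 0.375 and 0.25/0.40 = 0.625.
low_risk = combine({"logprob": 0.9, "verbalized": 0.6})
print(f"{low_risk:.4f}")
```

One consequence worth keeping in mind: the same raw scores produce a different combined score depending on which signals ran, so thresholds tuned on high-risk traffic (with self-consistency) do not transfer directly to low-risk traffic.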

Decision Policies - What to Do with Confidence #

A confidence score is useless without a policy that says what happens at each level. The policy maps confidence ranges to actions — and critically, it is sensitive to the risk of the action.

The Confidence-Risk Matrix #

The right threshold for action depends on what you are about to do. An agent should be much more willing to answer a question (low risk) than to execute a financial transaction (high risk) at the same confidence level.

                        Action Risk
                   Low      Medium     High
                ┌────────┬──────────┬──────────┐
  High (>0.8)   │  ACT   │   ACT    │   ACT    │
  Confidence    │        │          │          │
                ├────────┼──────────┼──────────┤
  Medium        │  ACT   │  HEDGE   │ ESCALATE │
  (0.5–0.8)     │        │          │          │
                ├────────┼──────────┼──────────┤
  Low (<0.5)    │ HEDGE  │ ESCALATE │  ABSTAIN │
                │        │          │          │
                └────────┴──────────┴──────────┘

Four possible actions:

ACT — proceed normally with the planned action
HEDGE — take the action but communicate uncertainty to the user ("Based on the available information, I believe X, but I'm not certain about Y...")
ESCALATE — hand off to a human or a more capable agent before acting
ABSTAIN — refuse to act, explain why, and ask for clarification or additional context

from enum import Enum


class Decision(Enum):
    ACT = "act"
    HEDGE = "hedge"
    ESCALATE = "escalate"
    ABSTAIN = "abstain"


@dataclass
class ActionPolicy:
    risk_level: str  # "low", "medium", "high"
    act_threshold: float
    hedge_threshold: float
    escalate_threshold: float
    # Below escalate_threshold → abstain


DEFAULT_POLICIES = {
    "low": ActionPolicy(
        risk_level="low",
        act_threshold=0.4,
        hedge_threshold=0.2,
        escalate_threshold=0.1,
    ),
    "medium": ActionPolicy(
        risk_level="medium",
        act_threshold=0.7,
        hedge_threshold=0.5,
        escalate_threshold=0.3,
    ),
    "high": ActionPolicy(
        risk_level="high",
        act_threshold=0.85,
        hedge_threshold=0.7,
        escalate_threshold=0.5,
    ),
}


def decide(
    confidence: float,
    risk_level: str,
    policies: dict[str, ActionPolicy] = DEFAULT_POLICIES,
) -> Decision:
    policy = policies[risk_level]

    if confidence >= policy.act_threshold:
        return Decision.ACT
    elif confidence >= policy.hedge_threshold:
        return Decision.HEDGE
    elif confidence >= policy.escalate_threshold:
        return Decision.ESCALATE
    else:
        return Decision.ABSTAIN
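The numeric defaults are finer-grained than the three-by-three matrix, so the two do not agree cell for cell. A quick way to see what the thresholds actually do is to tabulate them. This sketch copies the default thresholds into a plain dict and uses string labels so it runs on its own; the names `THRESHOLDS` and `decide_label` are illustrative.

```python
# (act, hedge, escalate) thresholds copied from DEFAULT_POLICIES
THRESHOLDS = {
    "low": (0.4, 0.2, 0.1),
    "medium": (0.7, 0.5, 0.3),
    "high": (0.85, 0.7, 0.5),
}


def decide_label(confidence: float, risk_level: str) -> str:
    """Same cascade as decide(), returning a plain string."""
    act, hedge, escalate = THRESHOLDS[risk_level]
    if confidence >= act:
        return "act"
    if confidence >= hedge:
        return "hedge"
    if confidence >= escalate:
        return "escalate"
    return "abstain"


print("conf  " + "  ".join(f"{risk:<8}" for risk in THRESHOLDS))
for confidence in (0.90, 0.82, 0.60, 0.45, 0.15):
    row = "  ".join(f"{decide_label(confidence, risk):<8}" for risk in THRESHOLDS)
    print(f"{confidence:.2f}  {row}")
```

For example, a confidence of 0.82 on a high-risk action falls in the matrix's "High (>0.8)" row but yields HEDGE under the defaults, because the high-risk act threshold is 0.85. Treat the matrix as the intent and the thresholds as the tunable implementation.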

Classifying Action Risk #

The risk level of an action depends on several factors: reversibility, blast radius, and domain sensitivity. A read_file call is low risk — worst case, you waste a few tokens reading the wrong file. A deploy_to_production call is high risk — the blast radius is every user of the system.

def classify_action_risk(action: dict) -> str:
    """Classify the risk level of a proposed agent action."""
    tool_name = action.get("tool", "")

    # Explicit high-risk tools
    HIGH_RISK_TOOLS = {
        "execute_sql_write", "deploy", "delete", "send_email",
        "transfer_funds", "modify_permissions", "publish",
    }
    # Explicit low-risk tools
    LOW_RISK_TOOLS = {
        "search", "read_file", "list_directory", "get_status",
        "calculate", "format_text",
    }

    if tool_name in HIGH_RISK_TOOLS:
        return "high"
    if tool_name in LOW_RISK_TOOLS:
        return "low"

    # Heuristic: check for destructive verbs in the tool name
    destructive_verbs = ["delete", "drop", "remove", "overwrite", "force"]
    if any(verb in tool_name.lower() for verb in destructive_verbs):
        return "high"

    # Check the tool name and arguments for risk signals
    args = action.get("arguments", {})
    if "production" in tool_name.lower() or "production" in str(args).lower():
        return "high"
    if args.get("force", False) or args.get("irreversible", False):
        return "high"

    return "medium"

Selective Abstention #

Abstention — saying "I don't know" or "I can't do this safely" — is an underappreciated agent capability. Most agent designs optimize for task completion. But a high completion rate is worthless if it includes confidently wrong actions. A 90% completion rate with 2% error is better than a 98% completion rate with 10% error, at least for high-stakes domains.

When to Abstain #

An agent should abstain when:

The task is ambiguous and the wrong interpretation could cause harm. "Delete the old records" — which records? How old is old?
The required information is not available. The agent cannot find the relevant documents, the API returns no results, or the user's question references something outside the agent's knowledge.
Multiple conflicting signals exist. The retrieved documents contradict each other, or the agent's tools return inconsistent data.
The task exceeds the agent's defined scope. A customer-service agent being asked for medical advice should abstain, not attempt an answer.
The confidence is below threshold for the action's risk level. This is the mechanical application of the confidence-risk matrix.

@dataclass
class AbstentionReason:
    category: str  # "ambiguous", "insufficient_info", "conflicting", "out_of_scope", "low_confidence"
    explanation: str
    suggested_action: str  # What the user or system should do next


class AbstentionDetector:
    def __init__(self, scope_definition: list[str], model: str):
        self.scope = scope_definition
        self.model = model

    def should_abstain(
        self,
        task: str,
        confidence: ConfidenceEstimate,
        risk_level: str,
        context: dict,
    ) -> AbstentionReason | None:
        # Check 1: Low confidence for action risk
        decision = decide(confidence.combined_score, risk_level)
        if decision == Decision.ABSTAIN:
            return AbstentionReason(
                category="low_confidence",
                explanation=(
                    f"Confidence ({confidence.combined_score:.2f}) is below the "
                    f"threshold for {risk_level}-risk actions."
                ),
                suggested_action="Please provide more context or confirm the intended action.",
            )

        # Check 2: Out of scope
        if self._is_out_of_scope(task):
            return AbstentionReason(
                category="out_of_scope",
                explanation="This task falls outside my defined capabilities.",
                suggested_action="This request should be routed to a specialist.",
            )

        # Check 3: Conflicting information
        if context.get("retrieved_docs"):
            conflict = self._detect_conflicts(task, context["retrieved_docs"])
            if conflict:
                return AbstentionReason(
                    category="conflicting",
                    explanation=f"Retrieved information is contradictory: {conflict}",
                    suggested_action="Please clarify which source should take precedence.",
                )

        # Check 4: Ambiguity in high-risk context
        if risk_level == "high":
            ambiguity = self._detect_ambiguity(task)
            if ambiguity:
                return AbstentionReason(
                    category="ambiguous",
                    explanation=f"The request is ambiguous: {ambiguity}",
                    suggested_action="Please clarify before I proceed with this action.",
                )

        return None  # No reason to abstain

    def _is_out_of_scope(self, task: str) -> bool:
        prompt = (
            f"Given these scope boundaries:\n{self.scope}\n\n"
            f"Is this task within scope?\nTask: {task}\n"
            f"Answer YES or NO."
        )
        result = call_model(prompt, model=self.model, temperature=0.0)
        # Check the leading token; a substring test would also match "NOT" or "KNOWN"
        return result.strip().upper().startswith("NO")

    def _detect_conflicts(self, task: str, docs: list[str]) -> str | None:
        prompt = (
            f"Do these documents contain contradictory information "
            f"relevant to this task?\nTask: {task}\n"
            f"Documents:\n{format_documents(docs)}\n"
            f"If contradictory, explain the conflict in one sentence. "
            f"If not, respond NONE."
        )
        result = call_model(prompt, model=self.model, temperature=0.0)
        # Only a reply that begins with NONE counts as "no conflict"
        answer = result.strip()
        return None if answer.upper().startswith("NONE") else answer

    def _detect_ambiguity(self, task: str) -> str | None:
        prompt = (
            f"Could this task be interpreted in multiple ways that would "
            f"lead to different actions?\nTask: {task}\n"
            f"If ambiguous, state the ambiguity in one sentence. "
            f"If clear, respond NONE."
        )
        result = call_model(prompt, model=self.model, temperature=0.0)
        # Only a reply that begins with NONE counts as "no ambiguity"
        answer = result.strip()
        return None if answer.upper().startswith("NONE") else answer

The Cost of Abstention #

Abstention is not free. Every time an agent says "I don't know," the user has to do the work themselves — or wait for a human to handle it. Over-abstaining is just as bad as over-acting: an agent that refuses to do anything useful is not trustworthy, it is useless.

The goal is to calibrate the abstention rate so that:

When the agent does act, it is almost always correct
When it abstains, it was right to do so (the answer was genuinely uncertain or risky)

This is a precision-recall trade-off. Lowering the confidence threshold increases recall (the agent acts more often) but decreases precision (more errors slip through). Raising it increases precision but reduces recall (the agent abstains too often).

def evaluate_abstention_policy(
    # Each result has "acted" and "abstained" flags. Acted results carry
    # "correct"; abstained results carry "would_have_been_correct".
    results: list[dict],
) -> dict:
    acted = [r for r in results if r["acted"]]
    abstained = [r for r in results if r["abstained"]]

    precision = (
        sum(1 for r in acted if r["correct"]) / len(acted)
        if acted else 0.0
    )
    coverage = len(acted) / len(results) if results else 0.0

    # For abstentions, check if they were "right" — i.e., would the
    # agent have been wrong if it had acted?
    abstention_accuracy = (
        sum(1 for r in abstained if not r["would_have_been_correct"])
        / len(abstained)
        if abstained else 0.0
    )

    return {
        "precision_when_acting": precision,
        "coverage": coverage,
        "abstention_rate": 1 - coverage,
        "abstention_accuracy": abstention_accuracy,
    }

A well-tuned system might target 95% precision when acting, 85% coverage, and 70%+ abstention accuracy (meaning most abstentions were justified). The exact targets depend on the domain — medical and financial agents should favor higher precision even at the cost of coverage.
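To pick a threshold that hits targets like these, it helps to sweep candidate thresholds over a labeled sample and read off precision and coverage at each one. The following is a minimal sketch; the function name `sweep_thresholds` and the toy data are invented for illustration.

```python
def sweep_thresholds(
    labeled: list[tuple[float, bool]],
    thresholds: list[float],
) -> list[dict]:
    """For each threshold, report precision and coverage if the agent
    acted only on items with confidence >= threshold.

    labeled: (confidence, was_correct) pairs from past interactions.
    """
    rows = []
    for threshold in thresholds:
        acted = [correct for conf, correct in labeled if conf >= threshold]
        precision = sum(acted) / len(acted) if acted else None
        coverage = len(acted) / len(labeled) if labeled else 0.0
        rows.append(
            {"threshold": threshold, "precision": precision, "coverage": coverage}
        )
    return rows


history = [
    (0.95, True), (0.90, True), (0.80, True),
    (0.70, False), (0.60, True), (0.40, False),
]
for row in sweep_thresholds(history, [0.5, 0.75]):
    print(row)
```

On this toy sample, raising the threshold from 0.5 to 0.75 lifts precision from 0.8 to 1.0 while coverage drops from five in six to one in two. Real data will be noisier, so use enough samples per threshold before trusting the numbers.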

Delegation Thresholds #

Between "act" and "abstain" there is a middle ground: delegate. When an agent is not confident enough to handle a task itself, it can route to something better — a human, a more capable model, or a specialist agent.

Escalation to Humans #

The simplest form of delegation is escalation: the agent pauses and requests human input. The confidence band for escalation should sit above the band for abstention: you escalate when you might be able to do it but are not sure enough, and you abstain when you genuinely cannot. In the policies above, a confidence at or above escalate_threshold (but below hedge_threshold) escalates, and anything lower abstains.

@dataclass
class EscalationRequest:
    task: str
    reason: str
    confidence: float
    context: dict
    suggested_action: str | None  # What the agent thinks the answer is
    urgency: str  # "low", "medium", "high"


class EscalationPolicy:
    def __init__(self, config: dict):
        self.max_wait_seconds = config.get("max_wait_seconds", 300)
        self.fallback_on_timeout = config.get("fallback_on_timeout", "abstain")

    def escalate(
        self,
        task: str,
        confidence: ConfidenceEstimate,
        risk_level: str,
    ) -> EscalationRequest:
        return EscalationRequest(
            task=task,
            reason=self._format_reason(confidence, risk_level),
            confidence=confidence.combined_score,
            context={"method_used": confidence.method_used},
            suggested_action=None,  # Let human decide from scratch
            urgency=risk_level,
        )

    def _format_reason(self, confidence: ConfidenceEstimate, risk: str) -> str:
        parts = [
            f"Confidence: {confidence.combined_score:.2f}",
            f"Risk level: {risk}",
        ]
        if confidence.consistency_score is not None:
            parts.append(
                f"Self-consistency: {confidence.consistency_score:.2f}"
            )
        return " | ".join(parts)
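The policy stores `max_wait_seconds` and `fallback_on_timeout` but the class above does not show how they are used. One plausible reading is sketched here, assuming human replies arrive on an in-process `queue.Queue`. A real system would more likely use a ticketing or messaging service, and the function name `wait_for_human` is hypothetical.

```python
import queue


def wait_for_human(
    replies: queue.Queue,
    max_wait_seconds: float,
    fallback_on_timeout: str = "abstain",
) -> dict:
    """Block until a human reply arrives or the wait budget runs out."""
    try:
        answer = replies.get(timeout=max_wait_seconds)
    except queue.Empty:
        # Nobody answered in time: apply the configured fallback
        return {"source": "timeout", "decision": fallback_on_timeout}
    return {"source": "human", "decision": answer}


inbox: queue.Queue = queue.Queue()
print(wait_for_human(inbox, max_wait_seconds=0.05))  # nobody replied

inbox.put("approve")
print(wait_for_human(inbox, max_wait_seconds=0.05))  # reply was waiting
```

Defaulting the fallback to "abstain" keeps a timed-out escalation from silently turning into an action, which is the safer failure mode for high-risk requests.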

Delegation to Specialist Agents #

In multi-agent systems, delegation is not always to a human — it can be to another agent that is better suited for the task. A generalist coordinator might recognize that a legal question requires the legal-specialist agent, even if the generalist could attempt an answer.

class DelegationRouter:
    def __init__(self, specialists: dict[str, dict]):
        """
        specialists: mapping of domain → {"agent": ..., "scope": ..., "threshold": ...}
        """
        self.specialists = specialists

    def should_delegate(
        self,
        task: str,
        confidence: float,
        current_agent_scope: list[str],
    ) -> dict | None:
        """Check if task should be delegated to a specialist."""
        for domain, spec in self.specialists.items():
            if self._task_matches_domain(task, domain, spec["scope"]):
                # Delegate only when the task is in specialist territory
                # AND our confidence is below the specialist threshold
                if confidence < spec["threshold"]:
                    return {
                        "delegate_to": spec["agent"],
                        "domain": domain,
                        "reason": (
                            f"Task matches {domain} domain and confidence "
                            f"({confidence:.2f}) is below delegation threshold "
                            f"({spec['threshold']:.2f})"
                        ),
                    }
        return None

    def _task_matches_domain(
        self, task: str, domain: str, scope: list[str]
    ) -> bool:
        prompt = (
            f"Does this task fall within the domain of '{domain}'?\n"
            f"Domain scope: {scope}\n"
            f"Task: {task}\n"
            f"Answer YES or NO."
        )
        result = call_model(prompt, model="fast-classifier", temperature=0.0)
        return result.strip().upper().startswith("YES")

Model Routing by Confidence #

A lightweight form of delegation: when a fast, cheap model is not confident in its answer, escalate to a more capable (and expensive) model. This is model routing driven by confidence rather than by task classification.

class ConfidenceBasedRouter:
    def __init__(self, models: list[dict]):
        """
        models: ordered list from cheapest to most capable.
        Each: {"model_id": ..., "cost_per_token": ..., "confidence_threshold": ...}
        """
        self.models = models

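    def route(self, task: str, context: dict) -> dict:
        # Without this guard, an empty model list would reach the
        # fall-through return with no response defined
        if not self.models:
            raise ValueError("ConfidenceBasedRouter needs at least one model")
        for model_config in self.models: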
            response = call_model(
                task, model=model_config["model_id"], temperature=0.3
            )
            response_text = (
                response.text if hasattr(response, "text") else str(response)
            )
            confidence = get_verbalized_confidence(
                response_text, task, model=model_config["model_id"]
            )

            if confidence >= model_config["confidence_threshold"]:
                return {
                    "response": response,
                    "model_used": model_config["model_id"],
                    "confidence": confidence,
                    "escalated": model_config != self.models[0],
                }

        # Fell through all models — use the most capable one's answer
        return {
            "response": response,
            "model_used": self.models[-1]["model_id"],
            "confidence": confidence,
            "escalated": True,
            "note": "Even the most capable model had low confidence",
        }

This cascading pattern saves money in the common case (most tasks can be handled by the cheap model) while preserving quality for hard tasks. The key parameter is the confidence threshold at each level: set it too high and you over-escalate (wasting money); set it too low and you accept wrong answers from the cheap model.
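The savings are easy to estimate once you know how often each tier falls through to the next. The sketch below is a back-of-the-envelope model with invented cost figures; it ignores the extra confidence-assessment call made at each tier, so treat the result as a lower bound on real spend.

```python
def expected_cascade_cost(
    costs: list[float],
    escalation_rates: list[float],
) -> float:
    """Expected cost per task for a model cascade.

    costs[i]: cost of one call to tier i (cheapest first).
    escalation_rates[i]: fraction of tasks reaching tier i that fall
    through to tier i + 1. Needs one fewer entry than costs.
    """
    if len(escalation_rates) != len(costs) - 1:
        raise ValueError("need exactly one escalation rate per tier boundary")

    expected = 0.0
    reach = 1.0  # Probability that a task reaches the current tier
    for i, cost in enumerate(costs):
        expected += reach * cost
        if i < len(escalation_rates):
            reach *= escalation_rates[i]
    return expected


# Cheap model costs 1 unit, capable model 20; 20% of tasks escalate
cascade = expected_cascade_cost([1.0, 20.0], [0.2])
print(f"cascade: {cascade:.1f} units per task vs 20.0 for always using the capable model")
```

With these numbers the cascade costs 1 + 0.2 × 20 = 5 units per task. The break-even point moves quickly as the escalation rate rises, which is another reason to tune the per-tier threshold on real traffic.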

Calibrating Over Time #

Confidence estimation is not a set-it-and-forget-it problem. The calibration of your confidence pipeline drifts as the model changes, the task distribution shifts, and new tool capabilities are added. You need a feedback loop.

Collecting Ground Truth #

To know whether your confidence estimates are calibrated, you need ground truth: was the agent actually correct when it said it was confident? Collect this from:

Automated verification — for tasks with verifiable outcomes (code compiles, SQL returns expected rows, API call succeeds)
Human review — sample a fraction of interactions for expert evaluation
User feedback — thumbs up/down, corrections, complaints
Downstream signals — did the user undo the action? Did they repeat the request differently?

from datetime import datetime, timezone


@dataclass
class CalibrationDataPoint:
    confidence_estimate: float
    was_correct: bool
    task_type: str
    risk_level: str
    timestamp: datetime


class CalibrationTracker:
    def __init__(self):
        self.data_points: list[CalibrationDataPoint] = []

    def record(self, confidence: float, correct: bool, task_type: str, risk: str):
        self.data_points.append(CalibrationDataPoint(
            confidence_estimate=confidence,
            was_correct=correct,
            task_type=task_type,
            risk_level=risk,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
            timestamp=datetime.now(timezone.utc),
        ))

    def calibration_curve(self, n_bins: int = 10) -> list[dict]:
        """Compute the calibration curve: predicted confidence vs actual accuracy."""
        bins = [[] for _ in range(n_bins)]
        for dp in self.data_points:
            bin_idx = min(int(dp.confidence_estimate * n_bins), n_bins - 1)
            bins[bin_idx].append(dp.was_correct)

        curve = []
        for i, bin_data in enumerate(bins):
            if bin_data:
                predicted = (i + 0.5) / n_bins
                actual = sum(bin_data) / len(bin_data)
                curve.append({
                    "predicted_confidence": predicted,
                    "actual_accuracy": actual,
                    "sample_count": len(bin_data),
                    "gap": abs(predicted - actual),
                })
        return curve

    def expected_calibration_error(self) -> float:
        """ECE: weighted average of |predicted - actual| across bins."""
        curve = self.calibration_curve()
        total_samples = sum(b["sample_count"] for b in curve)
        if total_samples == 0:
            return 0.0
        return sum(
            b["gap"] * b["sample_count"] / total_samples for b in curve
        )
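A small worked example makes the ECE number easier to interpret. This standalone sketch differs from the tracker in one respect: it uses the mean stated confidence in each bin rather than the bin midpoint, which is the more common definition. The function name `ece` and the sample data are illustrative.

```python
def ece(points: list[tuple[float, bool]], n_bins: int = 10) -> float:
    """Expected calibration error over (confidence, was_correct) pairs."""
    if not points:
        return 0.0

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in points:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    total = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, correct in bucket if correct) / len(bucket)
        # Weight each bin's gap by its share of all samples
        total += abs(mean_conf - accuracy) * len(bucket) / len(points)
    return total


sample = (
    [(0.9, True)] * 2 + [(0.9, False)] * 2      # says 90%, right 50% of the time
    + [(0.6, True)] * 3 + [(0.6, False)] * 1    # says 60%, right 75% of the time
)
print(f"ECE = {ece(sample):.3f}")
```

The first group is overconfident by 0.40 and the second underconfident by 0.15. Each holds half the samples, so the ECE is 0.5 × 0.40 + 0.5 × 0.15 = 0.275. Note that ECE hides the direction of the error, so look at the per-bin curve before adjusting thresholds.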

Adjusting Thresholds #

When the calibration curve shows systematic bias — the model is consistently overconfident or underconfident — adjust the thresholds in your decision policy. This is post-hoc calibration: applying a correction to the raw confidence scores to make them match observed accuracy.

def recalibrate_thresholds(
    tracker: CalibrationTracker,
    target_precision: float = 0.95,
    policies: dict[str, ActionPolicy] = DEFAULT_POLICIES,
) -> dict[str, ActionPolicy]:
    """Adjust action thresholds based on observed calibration."""
    curve = tracker.calibration_curve(n_bins=20)

    # Find the lowest confidence level at which the target precision still holds.
    # Walk down from the highest-confidence bin and stop at the first bin that
    # misses the target, so the threshold never sits below an unreliable bin.
    calibrated_act_threshold = 0.95  # Fallback if the top bin misses the target
    for point in reversed(curve):
        if point["actual_accuracy"] < target_precision:
            break
        calibrated_act_threshold = point["predicted_confidence"]

    # Scale all policies by the same ratio, anchored on the default medium-risk
    # act threshold. The ratio does not change per policy, so compute it once.
    ratio = calibrated_act_threshold / DEFAULT_POLICIES["medium"].act_threshold
    adjusted = {}
    for risk, policy in policies.items():
        adjusted[risk] = ActionPolicy(
            risk_level=risk,
            # Caps keep every threshold reachable and below certainty.
            act_threshold=min(0.99, policy.act_threshold * ratio),
            hedge_threshold=min(0.95, policy.hedge_threshold * ratio),
            escalate_threshold=min(0.9, policy.escalate_threshold * ratio),
        )
    return adjusted
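
The proportional scaling step is easier to follow with numbers. The sketch below isolates it under stated assumptions: the ActionPolicy dataclass mirrors the field names used above, but the baseline threshold values and the scale_policies helper are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionPolicy:
    risk_level: str
    act_threshold: float
    hedge_threshold: float
    escalate_threshold: float


# Hypothetical baseline values; the real defaults may differ.
BASELINE = {
    "low": ActionPolicy("low", 0.60, 0.40, 0.20),
    "medium": ActionPolicy("medium", 0.80, 0.60, 0.40),
    "high": ActionPolicy("high", 0.90, 0.75, 0.50),
}


def scale_policies(policies, calibrated_act_threshold, anchor="medium"):
    """Scale every threshold by the ratio needed to move the anchor policy."""
    ratio = calibrated_act_threshold / policies[anchor].act_threshold
    return {
        risk: ActionPolicy(
            risk_level=risk,
            # Caps stop a large ratio from pushing thresholds to 1.0 or beyond.
            act_threshold=min(0.99, p.act_threshold * ratio),
            hedge_threshold=min(0.95, p.hedge_threshold * ratio),
            escalate_threshold=min(0.90, p.escalate_threshold * ratio),
        )
        for risk, p in policies.items()
    }


# Suppose the curve says the target precision is only reached at 0.88.
adjusted = scale_policies(BASELINE, calibrated_act_threshold=0.88)
for risk, policy in adjusted.items():
    print(f"{risk}: act >= {policy.act_threshold:.2f}")
# low: act >= 0.66
# medium: act >= 0.88
# high: act >= 0.99
```

With a ratio of 1.1, every tier becomes stricter by the same proportion, and the high-risk act threshold runs into the 0.99 cap. One consequence of this design is that tiers near the cap stop moving, so a badly overconfident model compresses the gap between medium-risk and high-risk thresholds.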

Integrating Confidence into the Agent Loop #

All the pieces above — estimation, policies, abstention, delegation — need to plug into the agent's main execution loop. Here is how confidence checks fit into a standard ReAct-style agent:

class CalibratedAgent:
    def __init__(
        self,
        model: str,
        tools: list[dict],
        confidence_pipeline: ConfidencePipeline,
        abstention_detector: AbstentionDetector,
        delegation_router: DelegationRouter | None,
        policies: dict[str, ActionPolicy],
    ):
        self.model = model
        self.tools = tools
        self.confidence = confidence_pipeline
        self.abstention = abstention_detector
        self.delegation = delegation_router
        self.policies = policies

    def run(self, task: str, context: dict) -> dict:
        # Step 1: Check for immediate abstention (out of scope, etc.)
        abstain = self.abstention.should_abstain(
            task, confidence=ConfidenceEstimate(
                logprob_score=None, consistency_score=None,
                verbalized_score=None, grounding_score=None,
                combined_score=0.5, method_used=[],
            ),
            risk_level="medium", context=context,
        )
        if abstain and abstain.category == "out_of_scope":
            return {"status": "abstained", "reason": abstain}

        # Step 2: Generate the agent's planned action
        response = self._generate_response(task, context)
        action = extract_action(response)
        risk_level = classify_action_risk(action)

        # Step 3: Estimate confidence
        estimate = self.confidence.estimate(
            response, task, action_risk=risk_level,
            retrieved_docs=context.get("retrieved_docs"),
        )

        # Step 4: Check abstention with full confidence info
        abstain = self.abstention.should_abstain(
            task, estimate, risk_level, context
        )
        if abstain:
            return {"status": "abstained", "reason": abstain}

        # Step 5: Check delegation
        if self.delegation:
            delegate = self.delegation.should_delegate(
                task, estimate.combined_score, current_agent_scope=[]
            )
            if delegate:
                return {"status": "delegated", "details": delegate}

        # Step 6: Apply decision policy
        decision = decide(
            estimate.combined_score, risk_level, policies=self.policies
        )

        if decision == Decision.ACT:
            result = execute_action(action)
            return {"status": "completed", "result": result}
        elif decision == Decision.HEDGE:
            result = execute_action(action)
            return {
                "status": "completed_with_caveat",
                "result": result,
                "caveat": self._generate_hedge(task, estimate),
            }
        elif decision == Decision.ESCALATE:
            return {
                "status": "escalated",
                "request": EscalationRequest(
                    task=task,
                    reason=f"Confidence {estimate.combined_score:.2f} below threshold",
                    confidence=estimate.combined_score,
                    context=context,
                    suggested_action=str(action),
                    urgency=risk_level,
                ),
            }
        else:
            return {"status": "abstained", "reason": "Below all thresholds"}

    def _generate_hedge(self, task: str, estimate: ConfidenceEstimate) -> str:
        """Generate an appropriate uncertainty disclaimer."""
        if estimate.combined_score > 0.7:
            return "I'm fairly confident in this, but you may want to verify."
        elif estimate.combined_score > 0.5:
            return "I'm not fully certain — please review before relying on this."
        else:
            return "This is my best guess, but I have significant uncertainty."
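
On the calling side, the dictionary returned by run needs to be dispatched on its status. A minimal sketch of such a handler follows; the summarize_outcome function and its messages are hypothetical, and it assumes only the status strings and keys that appear in the return values above.

```python
def summarize_outcome(outcome: dict) -> str:
    """Turn an agent outcome dict into a one-line message for the caller."""
    status = outcome.get("status")
    if status == "completed":
        return f"Done: {outcome['result']}"
    if status == "completed_with_caveat":
        # Surface the caveat alongside the result so it is never dropped.
        return f"Done: {outcome['result']} (note: {outcome['caveat']})"
    if status == "escalated":
        return "Waiting for human review"
    if status == "delegated":
        return "Handed off to another agent"
    if status == "abstained":
        return f"Declined: {outcome['reason']}"
    # Fail loudly so a newly added status is never mistaken for success.
    raise ValueError(f"Unknown status: {status!r}")


print(summarize_outcome({"status": "completed", "result": "ticket closed"}))
print(summarize_outcome({"status": "abstained", "reason": "Below all thresholds"}))
```

Raising on an unrecognized status is a deliberate choice here: a caller that falls through to a default branch could otherwise treat an abstention or escalation as a completed action.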

Conclusion #

An agent that knows what it does not know is more valuable than one that always produces an answer. Trust calibration and selective abstention turn a capable-but-dangerous system into a reliable one.

Key takeaways:

Language models are overconfident by default. Raw token logprobs are a weak signal for factual accuracy. Combine multiple confidence signals — logprobs, self-consistency, verbalized confidence, retrieval grounding — for a usable estimate.
The right confidence threshold depends on the risk of the action. A 70% confidence answer is fine for a search query; it is unacceptable for a production deployment. Use a confidence-risk matrix to map scores to decisions (act, hedge, escalate, abstain).
Abstention is a feature, not a failure. An agent that says "I don't know" when appropriate earns more trust than one that always tries and occasionally gets it catastrophically wrong. Calibrate abstention rates to balance precision (correctness when acting) against coverage (willingness to act).
Delegation is the middle ground between acting and abstaining. Escalate to humans for ambiguous high-stakes tasks, to specialist agents for domain-specific questions, and to more capable models when the cheap model is uncertain.
Calibration drifts over time. Collect ground truth from automated verification, human review, and user feedback. Track the calibration curve and adjust thresholds when predicted confidence diverges from observed accuracy.