Data Agents & SQL Generation

Published:

Most business questions live in databases. "How many users signed up last week?" "What's our revenue by region?" "Which customers are at risk of churning?" These are questions a human analyst answers by writing SQL. A data agent does the same thing — it takes a natural-language question, translates it to a query, executes it safely, and interprets the result back into language. The idea sounds deceptively simple. In practice, it requires solving schema grounding, ambiguity resolution, execution sandboxing, and result interpretation — all within the constraints of a single agent loop.

Text-to-SQL - The Core Translation Problem #

At its heart, a data agent performs text-to-SQL: converting a natural-language question into a syntactically correct, semantically faithful SQL query. The model needs to understand the user's intent, map it to the correct tables and columns, choose the right aggregations, and handle joins — all without hallucinating column names that do not exist.

A naive approach puts the full database schema in the prompt and asks the model to write SQL. This works surprisingly well for small schemas (under 20 tables). But real enterprise databases have hundreds of tables, thousands of columns, and naming conventions that make no sense without institutional context. tbl_cust_xref_02 tells the model nothing about what it contains.

The translation pipeline typically looks like this:

User question
    │
    ▼
┌───────────────────┐
│ Schema retrieval  │ ← select relevant tables/columns
└───────────────────┘
    │
    ▼
┌───────────────────┐
│  SQL generation   │ ← model writes the query
└───────────────────┘
    │
    ▼
┌──────────────────┐
│   Validation     │ ← syntax check, column existence
└──────────────────┘
    │
    ▼
┌──────────────────┐
│   Execution      │ ← sandboxed, read-only
└──────────────────┘
    │
    ▼
┌──────────────────┐
│  Interpretation  │ ← model explains the result
└──────────────────┘

Each step has its own failure modes. Schema retrieval can miss the relevant table. Generation can hallucinate a column. Validation catches some errors but not semantic ones. Execution can time out on a poorly optimized query. Interpretation can misread what the numbers mean.

Schema Grounding #

Schema grounding is the problem of giving the model enough database context to write correct SQL — without overwhelming the context window with irrelevant tables.

The simplest approach is full-schema injection: dump every CREATE TABLE statement into the prompt. This works for small databases but collapses at scale. A schema with 500 tables and 4,000 columns easily exceeds context limits, and even if it fits, the model struggles to find the needle in the haystack.

Better strategies use schema retrieval:

  • Keyword matching: extract entities from the user's question and match them against table/column names and descriptions. Fast, but brittle when naming conventions are opaque.
  • Embedding-based retrieval: embed table and column descriptions (or sample rows) into a vector store. At query time, retrieve the top-k most relevant tables. This handles synonym matching ("revenue" → total_sales) well.
  • Few-shot example lookup: maintain a library of (question, SQL) pairs. Retrieve similar questions and include their SQL as examples. The model pattern-matches against them.
  • Schema linking: a dedicated model pass that explicitly identifies which tables and columns are referenced by the question before generation begins.

In practice, a combination works best. Retrieve candidate tables via embeddings, then include their full DDL plus a few relevant (question, SQL) examples in the generation prompt.

Metadata enrichment makes a huge difference. Adding column descriptions, sample values, foreign-key relationships, and business glossary mappings transforms an opaque schema into something the model can reason about. If the column is named cst_ltv_30d, a description saying "Customer lifetime value over the last 30 days, in USD" lets the model use it correctly.

Query Sandboxing and Safety #

A data agent that can execute arbitrary SQL against a production database is a security incident waiting to happen. Sandboxing is non-negotiable.

The minimum safety boundary is a read-only connection. The agent's database credentials should only have SELECT permission — no INSERT, UPDATE, DELETE, DROP, or DDL statements. This is the first line of defense, enforced at the database layer itself, not just in application code.

Beyond permissions, additional safeguards include:

  • Query parsing and allow-listing: before execution, parse the generated SQL and reject anything that is not a SELECT statement. Watch for injection via CTEs that contain write operations or database-specific syntax that bypasses simple regex checks.
  • Row and time limits: set a LIMIT ceiling (e.g., 10,000 rows) and a query timeout (e.g., 30 seconds). A poorly constructed query that scans a billion-row table should not bring the database to its knees.
  • Cost estimation: on databases that support EXPLAIN, run the query plan first. If the estimated cost exceeds a threshold, refuse execution or ask the user for confirmation.
  • Data masking: if the agent can access tables with PII (emails, SSNs, phone numbers), apply column-level masking or use a view layer that redacts sensitive fields.

A defense-in-depth approach combines all of these. The database enforces read-only. The application layer parses and validates. Timeouts and row limits prevent resource exhaustion. Masking protects sensitive data even if the agent constructs a query that selects it.

import sqlparse


def execute_safely(sql: str, connection) -> dict:
    # Parse and reject non-SELECT statements
    parsed = sqlparse.parse(sql)
    for statement in parsed:
        if statement.get_type() != "SELECT":
            raise PermissionError("Only SELECT queries are allowed")

    # Enforce row limit
    sql = sql.strip().rstrip(";")
    if "LIMIT" not in sql.upper():
        sql += " LIMIT 10000"

    # Execute with timeout
    cursor = connection.cursor()
    cursor.execute("SET statement_timeout = '30s';")
    cursor.execute(sql)
    columns = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
    return {"columns": columns, "rows": rows}

Error Recovery and Self-Correction #

SQL generation is inherently error-prone. Column names get hallucinated, joins are wrong, aggregation logic is off. A good data agent expects failure and builds a correction loop.

The standard pattern is generate → execute → if error, retry with error message. When a query fails — whether with a syntax error, a missing column, or a timeout — the agent feeds the error message back into the model alongside the original question and the failed query. The model uses the error to diagnose and fix the problem. Most agents allow 2–3 retries before giving up.

This is a specialized form of the ReAct loop we discussed earlier. The observation is the database error, and the next action is a corrected query.

Beyond execution errors, semantic validation catches subtler problems. If the user asks "How many active users do we have?" and the query returns 3, something is probably wrong. Heuristics like "count queries on large tables should return more than single digits" or "revenue should be positive" can trigger a re-examination. Some systems ask the model to self-critique: "Does this result look reasonable given the question?"

Result Interpretation and Presentation #

Returning raw query results is rarely enough. A data agent's job is not done when the SQL executes — it needs to translate the results back into a human-readable answer.

For simple questions ("How many users signed up last week?"), the interpretation is trivial: "1,247 users signed up last week." But for complex queries with multiple columns and rows, the agent needs to summarize patterns, highlight outliers, and provide context.

Effective interpretation strategies include:

  • Direct answer extraction: for single-value results, state the answer in a sentence.
  • Tabular formatting: for multi-row results, present a clean table with headers and reasonable formatting (rounding numbers, formatting dates).
  • Trend narration: for time-series data, describe the trend — "Revenue grew 12% month-over-month, with a dip in March."
  • Caveats and qualifications: if the query made assumptions (e.g., "active" means "logged in within 30 days"), state them explicitly.

The model should also know when to say "I don't know." If the schema does not contain the data needed to answer the question, or if the results are ambiguous, the agent should say so rather than fabricating an answer.

A production data agent often combines text interpretation with visualization. It generates a chart (bar, line, scatter) alongside the narrative. This requires either a charting tool in the agent's toolkit or structured output that a frontend can render.

Architecture Trade-Offs #

Building a data agent involves several design decisions, each with trade-offs:

Single-shot vs. iterative: A single-shot agent generates one query and returns the result. An iterative agent can ask clarifying questions, explore the schema, and refine its approach over multiple turns. Single-shot is faster and cheaper; iterative handles ambiguity better.

Direct SQL vs. semantic layer: Some architectures place a semantic layer (like a metrics store or business logic layer) between the agent and the raw database. The agent queries the semantic layer rather than writing raw SQL. This prevents inconsistencies ("revenue" always means the same thing) but adds complexity and limits flexibility.

Pre-computed views vs. raw tables: Exposing curated views and materialized tables simplifies the agent's job — fewer joins, clearer naming, pre-aggregated metrics. The cost is maintenance of the view layer and potential staleness.

Schema in prompt vs. tool-based discovery: You can include the schema statically in the system prompt, or give the agent tools to explore the schema dynamically (list_tables, describe_table, sample_rows). Tool-based discovery scales better but costs more tokens and latency.

For most production systems, the sweet spot is: embed table descriptions for retrieval, include relevant DDL in the generation prompt, provide a few-shot example bank, execute against read-only replicas with timeouts, and allow 2–3 retry loops.

Conclusion #

Data agents turn natural-language questions into database queries, making structured data accessible without SQL expertise. The core challenges are schema grounding (helping the model find the right tables), safe execution (read-only connections, timeouts, row limits), error recovery (retry loops with error feedback), and result interpretation (translating rows back into human answers). The most robust architectures combine embedding-based schema retrieval, parsed query validation, sandboxed execution, and iterative self-correction — treating every generated query as untrusted input until it has been validated, executed safely, and interpreted faithfully.