Text-to-SQL Agents: Multi-step Database Reasoning
A text-to-sql agent is a system that decomposes a natural-language question into a sequence of database operations rather than translating the whole question into a single SQL string. The agent paradigm matters because the questions that users actually ask — "why did signup rate drop last week?" or "what moved MRR this month?" — cannot be answered by one query no matter how cleverly written. They require planning (outline approach), introspection (read schema, sample rows), drafting (write candidate SQL), validation (parse, EXPLAIN, bounded run), and refinement (adjust based on results). This page goes into technical depth on how that loop is structured, where it succeeds, and where it still fails in 2026.
What text-to-sql agents do reliably
Decomposing diagnostic questions into queries
Questions that don't translate to a single query — "diagnose the signup-rate drop last week" — decompose into a tree: query daily signups, identify the drop date, segment by traffic source, segment by device, surface the anomaly. An agent runs each query, reads the result, and decides what to query next based on what it sees. The output is a structured analysis with the supporting queries inline, not one wall of SQL that fails to answer the actual question.
Schema introspection with sample-row preview
Before writing any SQL, the agent reads the live information_schema for the tables likely involved, fetches 5-10 sample rows from each, and notes the value distributions (status enums, NULL frequency, date ranges). This grounds generation in the real shape of the data — generated SQL filters on actual enum values rather than the model's prior on what an enum probably looks like. This single step removes a large class of off-by-one and case-mismatch bugs.
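A minimal sketch of that introspection pass, assuming Postgres and a users table with a status column (the table and column names are stand-ins):
-- Columns, types, and nullability for a candidate table
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'users';
-- Sample rows: real values, not the model's guesses
SELECT * FROM users LIMIT 10;
-- Value distribution of a status-like enum column
SELECT status, COUNT(*) AS row_count
FROM users
GROUP BY status
ORDER BY row_count DESC;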
Validation before execution
Generated SQL passes through three checks before any human sees a result:
- Parse against the dialect — catches syntax errors immediately and lets the agent retry without burning a query slot.
- EXPLAIN — surfaces full-table scans against billion-row event tables before they become incidents, and gives the agent a chance to add a WHERE clause on the time dimension.
- Bounded execution — every exploratory query runs with a LIMIT and a statement timeout so an unintended cross-product doesn't lock the database.
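In Postgres terms, the last two checks look roughly like this (statement_timeout is Postgres-specific, and the events table is a stand-in; other dialects have equivalent knobs):
-- EXPLAIN first: a Seq Scan over a huge events table is a red flag
EXPLAIN
SELECT user_id, COUNT(*) AS event_count
FROM events
GROUP BY user_id;
-- Bounded execution: cap runtime and row count before running for real
SET statement_timeout = '5s';
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE occurred_at >= NOW() - INTERVAL '7 days'  -- time filter added after EXPLAIN feedback
GROUP BY user_id
LIMIT 1000;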
Conversational refinement that preserves prior work
When the user follows up with "now group by month and exclude churned accounts," the agent edits the prior SQL — preserving the WHERE clause and JOINs that were already correct, adding the new conditions. This is architecturally non-trivial: the agent has to remember its own SQL across turns and patch it cleanly rather than regenerating from scratch. Tools that regenerate per-turn re-introduce bugs the user just fixed; agent-shaped tools don't.
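A before/after sketch of that edit, with a hypothetical accounts table standing in for whatever the prior turn queried:
-- Prior turn: revenue by plan for the last year
SELECT
  plan,
  SUM(revenue_cents) / 100.0 AS revenue_usd
FROM accounts
WHERE created_at >= NOW() - INTERVAL '1 year'
GROUP BY plan;
-- After "now group by month and exclude churned accounts":
-- the existing WHERE survives; one dimension and one condition are added
SELECT
  date_trunc('month', created_at) AS month,
  plan,
  SUM(revenue_cents) / 100.0 AS revenue_usd
FROM accounts
WHERE created_at >= NOW() - INTERVAL '1 year'
  AND churned_at IS NULL
GROUP BY month, plan;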
Where text-to-sql agents still struggle
01. Latency budgets on multi-step questions
An agent that introspects, samples, drafts, validates, and refines makes 4-8 model calls per turn instead of one. Total wall-clock latency for a complex diagnostic question is 30-90 seconds, not 3-5. For exploratory analyst work this is acceptable; for any interactive use case where a user is waiting on a result, the multi-step paradigm needs progress indicators and partial results to feel responsive.
02. Cost on bundled-model SaaS plans
More steps means more tokens. An agent on a metered-token plan can cost 5-20x what a single-shot generator costs for the same question. BYOK (bring-your-own-key) is the right pricing shape for agent-heavy workflows because you're paying provider rates with no markup; bundled-model SaaS pricing math breaks down on agent loops at provider-token scale.
03. Planning failures on ambiguous questions
When the question itself is underspecified ("why is engagement down?" — by what definition? for which segment? compared to when?), the agent has to either ask a clarifying question or pick assumptions to proceed. Both modes have failure cases — asking too many clarifications drains the user, picking wrong assumptions returns confidently wrong analysis. The right behavior is one targeted clarification on the most ambiguous dimension, then proceeding.
04. Schema drift between introspection and execution
Long-running agent loops introspect the schema once at the start, then run queries over many minutes. If the schema changes mid-loop (a deploy adds a column, a migration drops a table) the agent's prior assumptions become stale and generated SQL fails. Production deployments running text-to-sql against a live OLTP system need either snapshot-isolated read replicas or schema-version checks between steps; a sketch of such a check follows this list.
05. Evaluating agent output without a ground truth
When an agent returns "churn was driven by a 12% drop in retention among Q1 cohorts using the legacy mobile app," how do you know that's right? Single-query generation has a clear correctness check (does the SQL match what a human would write?). Multi-step agent analysis has no comparable check; you have to spot-check the underlying queries one by one. Agent tools that show their work (every query inline) are auditable; ones that hide the steps are not.
06. Recovery from validation failures
When EXPLAIN flags a query as too expensive, the agent has to choose: rewrite (slow, may fail again), simplify (may not answer the question), or escalate to the user. Naive agents loop indefinitely on rewrites. Mature agents have a budget — N rewrite attempts, then escalate — and surface the failure transparently. Watch for this in any agent you evaluate; the recovery shape is more diagnostic than the generation quality.
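On the schema-drift point (04 above), one lightweight mitigation is a schema fingerprint: hash the relevant slice of information_schema at introspection time, re-check it between steps, and abort the loop if it changed. A Postgres sketch:
-- Compute once at introspection, recompute before each query;
-- a changed hash means the agent's schema assumptions are stale.
SELECT md5(
  string_agg(table_name || ':' || column_name || ':' || data_type,
             ',' ORDER BY table_name, column_name)
) AS schema_fingerprint
FROM information_schema.columns
WHERE table_schema = 'public';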
4 multi-step agent workflows where single-shot SQL falls short
Task 1: Diagnose the drop in signup rate last week
Plain-English prompt
diagnose why our signup rate dropped last week
Generated SQL
-- Step 1: confirm the drop
SELECT
  date_trunc('day', created_at) AS day,
  COUNT(*) AS signups
FROM users
WHERE created_at >= NOW() - INTERVAL '14 days'
GROUP BY day
ORDER BY day;

-- Step 2 (after observing drop on day X): segment by source
SELECT
  date_trunc('day', created_at) AS day,
  signup_source,
  COUNT(*) AS signups
FROM users
WHERE created_at >= NOW() - INTERVAL '14 days'
GROUP BY day, signup_source
ORDER BY day, signups DESC;

-- Step 3 (if one source dominates the drop): drill into device / utm
SELECT
  device_type,
  COUNT(*) AS signups
FROM users
WHERE created_at >= NOW() - INTERVAL '7 days'
  AND signup_source = '<the dominant source>'
GROUP BY device_type
ORDER BY signups DESC;

Validation note: Multi-step decomposition: the agent runs query 1, reads the result, decides query 2 based on which day the drop occurred, then drills into the segment that dominates. A single-query approach can't do this — the analysis depends on intermediate results. Audit by checking that each query in the trail produced sensible numbers before trusting the conclusion.
Task 2: Build a churn predictor — what features?
Plain-English prompt
build a churn predictor from these tables — what features should I use?
Generated SQL
-- Step 1: enumerate signal tables
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Step 2: for each signal table, check feature cardinality
-- (sessions, events, support_tickets, billing, feature_usage)
SELECT
  COUNT(DISTINCT user_id) AS users_with_signal,
  AVG(events_per_user) AS avg_signals_per_user,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY events_per_user)
    AS p50_signals_per_user
FROM (
  SELECT user_id, COUNT(*) AS events_per_user
  FROM events
  WHERE occurred_at >= NOW() - INTERVAL '90 days'
  GROUP BY user_id
) t;

-- Step 3: validate label availability — do we have ground-truth churn?
SELECT
  status,
  COUNT(*) AS users
FROM subscriptions
WHERE created_at <= NOW() - INTERVAL '90 days'
GROUP BY status;

Validation note: The agent's job here is feature exploration, not model training. The output is a candidate feature list with cardinality / coverage stats per signal table, plus a sanity check on label availability. The actual ML training happens outside the agent — but starting from a grounded feature inventory beats a blank-page brainstorm.
Task 3: Audit data quality across all event tables
Plain-English prompt
audit data quality across all event tables — null rates per column and schema drift
Generated SQL
-- Step 1: enumerate tables matching the pattern
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name LIKE '%event%';

-- Step 2: for each event table, compute per-column null rate
SELECT
  '<table>' AS table_name,
  column_name,
  null_count,
  total_rows,
  ROUND(100.0 * null_count / NULLIF(total_rows, 0), 2) AS null_pct
FROM (
  SELECT
    'user_id' AS column_name,
    COUNT(*) FILTER (WHERE user_id IS NULL) AS null_count,
    COUNT(*) AS total_rows
  FROM <table>
  -- (one block per column, generated for each table)
) t;

-- Step 3: detect schema drift — compare current columns to a saved baseline
-- (FULL JOIN so both added and dropped columns surface)
SELECT
  table_name,
  column_name,
  CASE
    WHEN base.column_name IS NULL THEN 'added'
    WHEN cur.column_name IS NULL THEN 'dropped'
    ELSE 'unchanged'
  END AS drift_status
FROM information_schema.columns AS cur
FULL JOIN baseline_columns AS base USING (table_name, column_name);

Validation note: Iterating per-table, per-column is exactly what an agent loop is for — the SQL pattern is identical, only the table and column names change. A single-shot generator either skips this or produces one massive UNION query that's hard to read. The agent runs the pattern N times and returns a structured report.
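One way to mechanize that loop is to have SQL generate the SQL: a sketch using Postgres's format() to render the per-column template, assuming the agent then executes each generated statement in turn.
-- Generate one null-rate probe per column of every event table
SELECT format(
  'SELECT %L AS table_name, %L AS column_name, '
  'COUNT(*) FILTER (WHERE %I IS NULL) AS null_count, '
  'COUNT(*) AS total_rows FROM %I;',
  table_name, column_name, column_name, table_name
) AS generated_sql
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name LIKE '%event%';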
Task 4: Explain why MRR moved this month
Plain-English prompt
explain why MRR moved this month — break down expansion, contraction, and churn
Generated SQL
-- Step 1: starting and ending MRR
SELECT
  date_trunc('month', as_of_date) AS month,
  SUM(monthly_revenue_cents) / 100.0 AS mrr_usd
FROM subscription_snapshots
WHERE as_of_date IN (
    date_trunc('month', NOW())                       -- current
  , date_trunc('month', NOW()) - INTERVAL '1 month'  -- prior
)
GROUP BY month;

-- Step 2: new MRR (subscriptions created this month)
SELECT
  SUM(monthly_revenue_cents) / 100.0 AS new_mrr_usd
FROM subscriptions
WHERE created_at >= date_trunc('month', NOW())
  AND status = 'active';

-- Step 3: expansion / contraction (existing customers who changed plans)
SELECT
  customer_id,
  prior_plan_amount,
  current_plan_amount,
  current_plan_amount - prior_plan_amount AS delta_cents
FROM plan_changes
WHERE changed_at >= date_trunc('month', NOW())
  AND prior_plan_amount IS NOT NULL;

-- Step 4: churn (subscriptions ended)
SELECT
  SUM(monthly_revenue_cents) / 100.0 AS churned_mrr_usd
FROM subscriptions
WHERE churned_at >= date_trunc('month', NOW())
  AND churned_at < date_trunc('month', NOW()) + INTERVAL '1 month';

Validation note: MRR reconciliation is a classic multi-query problem: the four components must sum to the period delta. The agent should perform that reconciliation explicitly — "start + new + expansion - contraction - churn = end" — and surface any gap. A gap means a category is missing (refunds, mid-month pro-rations) and the analysis is incomplete.
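A sketch of that explicit reconciliation, assuming the agent has captured the component figures from steps 1-4 in a scratch table (mrr_components here is hypothetical):
-- Scratch table the agent fills as it runs steps 1-4:
-- CREATE TEMP TABLE mrr_components (component text, amount_usd numeric);
SELECT
  start_mrr + new_mrr + expansion - contraction - churned AS reconstructed_end,
  end_mrr,
  end_mrr - (start_mrr + new_mrr + expansion - contraction - churned) AS gap_usd
FROM (
  SELECT
    SUM(amount_usd) FILTER (WHERE component = 'start')       AS start_mrr,
    SUM(amount_usd) FILTER (WHERE component = 'new')         AS new_mrr,
    SUM(amount_usd) FILTER (WHERE component = 'expansion')   AS expansion,
    SUM(amount_usd) FILTER (WHERE component = 'contraction') AS contraction,
    SUM(amount_usd) FILTER (WHERE component = 'churn')       AS churned,
    SUM(amount_usd) FILTER (WHERE component = 'end')         AS end_mrr
  FROM mrr_components
) c;
-- A non-zero gap_usd means a category (refunds, pro-rations) is missing.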
When to reach for an agent vs single-shot SQL generation
- Diagnostic questions that depend on intermediate results: use an agent — single-shot generation can't read its own results to decide the next query.
- Analyses that span multiple tables and need exploration: use an agent — schema introspection and sample-row preview matter most when the question is open-ended.
- Tight latency budgets (interactive UI): use a single-shot generator — agent loops add 30-90 seconds; pre-built queries are sub-second.
- Production embedded use cases: use neither directly — pre-build the queries, cache the results, expose results to the user, and reserve agent loops for analysts.
When to stay with what you have
- Stay with Cursor if your text-to-sql is one part of a larger code workflow already in your IDE — when the SQL is going into a migration, a query.go file, or a dbt model in the same repo, Cursor's in-place edits beat switching surfaces.
- Stay with Claude Code if you live in the terminal and want CLI-shaped agent loops — for engineers piping SQL output to other CLI tools or scripting agent runs against fixtures, the terminal surface is the right shape.
- Stay with Hex Magic if your work is a hosted notebook flow with collaborators — Hex's Magic is co-located with the notebook surface your team already uses, and shared session state is the differentiator.
- Stay with Snowflake Cortex if your data is exclusively in Snowflake and IT won't approve external tools — Cortex inherits Snowflake's RBAC, with no new vendor review, no new credentials, no extra approval cycle.
Common questions
What's the difference between a text-to-sql agent and a SQL generator?
A SQL generator is one model call: English in, SQL out. An agent is a loop: plan, introspect, draft, validate, run, refine — multiple model calls and database calls per turn. Agents handle questions that depend on intermediate results ("diagnose this drop"); generators handle questions that map to a single query ("top 10 customers by revenue").
Are agents always better than generators?
No. Agents trade latency and cost for flexibility. For a question that maps to one query, an agent is overkill — slower, more expensive, no accuracy gain. The right tool depends on the question shape: single-query patterns favor generators, exploratory and diagnostic patterns favor agents. Most production tools route between the two automatically.
How do agents handle questions they can't answer?
Mature agents fail loudly. The agent should report which step failed (introspection, generation, validation, execution), what it tried, and what's needed from the user. Naive agents loop indefinitely on retries or return confidently wrong analysis when validation fails silently — both are red flags when evaluating an agent.
Can text-to-sql agents write to the database?
They can, but should not by default. OnlySearch gates UPDATE / DELETE / DROP behind explicit user confirmation and recommends connecting with a read-only role. Production agents writing to production databases is a category that will mature, but in 2026 it should still require a deliberate choice, not a default behavior.
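The read-only role itself is ordinary database hygiene rather than anything tool-specific; in Postgres it looks roughly like this (role and database names are placeholders):
-- A role the agent connects as: reads everything in public, writes nothing
CREATE ROLE agent_readonly LOGIN PASSWORD '...';
GRANT CONNECT ON DATABASE analytics TO agent_readonly;
GRANT USAGE ON SCHEMA public TO agent_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO agent_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO agent_readonly;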
What's the latency cost of agent loops?
A typical diagnostic question runs 4-8 model calls plus 4-8 database queries — total wall-clock 30-90 seconds. Comparable single-shot generation is 3-10 seconds. The agent's value is answering questions a generator can't, not answering generator questions faster. For interactive use cases (a dashboard), pre-build the queries instead.
Do agents work offline?
The database calls do — agents introspect and execute against your local database without an outbound connection. The model calls require the model provider you've configured BYOK against. Local model providers (Ollama, LM Studio) make the full agent loop work offline; cloud-model BYOK requires connectivity to the provider only.
How does an agent know when to stop?
Termination is one of the harder problems. Mature agents have a stopping condition per question type — diagnostic agents stop when the cause is localized to a single segment, exploration agents stop after N steps, reconciliation agents stop when the components sum within tolerance. Watch for stopping behavior when evaluating an agent — endless drilling is a worse failure mode than stopping early.
What's the role of human oversight?
Agents make their reasoning auditable: every step's query, result, and decision is visible. The human's role shifts from writing SQL to checking the trail. For high-stakes analyses (financial reporting, regulatory decisions) the human still owns the final read; the agent owns the legwork. The right framing is agent-as-junior-analyst, not agent-as-autopilot.
How is OnlySearch's agent different from Cursor-style code agents?
Code agents (Cursor, Claude Code) reason about source files. Text-to-sql agents reason about database state — schema, sample rows, query plans. The architectural difference is what counts as "the world": for code agents, the file tree; for text-to-sql agents, the live database. OnlySearch's agent does both in the same conversation — analyze data, then ship the fix as code.
Ask the questions a single query can't answer
OnlySearch's text-to-sql agent introspects your schema, samples real rows, drafts candidate queries, validates with EXPLAIN, runs with bounded LIMITs, and refines based on results — all in one conversation. BYOK keeps your data on the model provider you chose. Free plan refills weekly.
