
Text-to-SQL: How AI Turns Plain English into Working Database Queries

Text-to-SQL is the task of translating a natural-language question (“what was revenue by region last quarter?”) into a SQL query that runs against a real database and returns the answer. It is one of the longest-running benchmarks in applied NLP and, in 2026, finally good enough that working analysts use it daily.

It is also harder than the demos suggest. The benchmark numbers most vendors cite come from public datasets with simplified schemas and no business-term ambiguity. On a real warehouse with denormalized event tables, conflicting metric definitions, and write-capable connections, naive text-to-sql produces SQL that looks right and returns wrong numbers. This guide is the honest map of what works, where it breaks, and how to evaluate a tool against your own schema before trusting it.

Definition

What “text-to-sql” actually means in 2026

In its narrow sense, text-to-sql is a translation task: an English sentence on one side, a SQL string on the other, with the SQL grammar and the target schema as constraints. In its working sense — the one that matters for a team using the tool every day — text-to-sql is the wrap-around workflow that turns the translation step into a trustworthy answer: schema introspection, sample-row preview, query validation, bounded execution, and conversational refinement.

The most useful split inside the category is schema-aware versus schema-blind. A schema-blind tool takes the English question alone and tries to guess plausible table and column names; results vary from impressive to fabricated. A schema-aware tool reads your real database first and grounds generation in the actual identifiers and relationships it found. Schema-aware tools are the only ones that survive contact with a production warehouse.

A note on hyphenation. Search interfaces and academic papers use both “text-to-sql” and “text to sql.” The hyphenated form originates in the NLP literature (it parallels other translation tasks like “text-to-speech”); the unhyphenated form is more common in product copy. For our purposes they refer to the same query class — a natural-language question converted to executable SQL — and the analysis here applies to both.

Honest assessment

Why most text-to-sql tools fail on real schemas

Demos run against tidy schemas with three tables and a single fact column. Production warehouses have hundreds of tables, denormalized event tables, and business terms that don’t map cleanly to any one column. Four failure modes show up the moment a tool meets a real schema.

  • 01

    Hallucinated columns on multi-JOIN queries

    Single-table generations are mostly solved; queries that span three or more tables are where confabulations enter. The model invents plausible-but-wrong column names (a foreign key that doesn't exist, a denormalized field on the wrong table) because predicting tokens that fit the SQL grammar is much easier than maintaining schema fidelity. Without a runtime check that every referenced identifier exists, you ship broken queries that look right.

  • 02

    Business-term to physical-column mapping gap

    When a stakeholder asks for MRR, the model has to decide which column or expression in your warehouse represents MRR — and that's a glossary problem, not a SQL problem. Most schemas have several candidate columns (subscription_amount_cents, plan_price, normalized_monthly_revenue) and only your team knows which one is canonical. A text-to-sql tool with no semantic layer guesses, often correctly, sometimes catastrophically.

  • 03

    Schema privacy and review burden

    Sending your DDL to a third-party LLM exposes column names, comments, and table relationships — effectively a partial reverse-engineering of your data model. For regulated environments (healthcare, finance, EU customer data) any tool that uploads schema to an external provider triggers a security review. Tools that run locally or only send schema to a model you control sidestep that review entirely.

  • 04

    Read versus write blast radius

    Generating SELECT statements is low-risk: a wrong query returns wrong numbers, you re-run. Generating UPDATE, DELETE, or DDL is a different category — one mis-scoped WHERE clause silently mutates production data. Most public text-to-sql benchmarks only score SELECT correctness, so a tool that aces benchmarks still has no opinion about whether to ever write a destructive query for you.

What good looks like

The 4 capabilities a real text-to-sql agent needs

These are the system-level capabilities that turn a model that can sometimes write SQL into a tool you trust on Monday morning.

Live schema introspection

A real text-to-sql agent connects to the database and reads the live information_schema, foreign-key graph, and table comments — not a snapshot the user pasted weeks ago. This is the difference between answering questions about your current schema and answering questions about a schema you used to have. Sample-based introspection also lets the agent understand cardinalities and null distributions, not just types.
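What "reading the live schema" amounts to can be sketched against SQLite, where `sqlite_master` plus the `table_info` and `foreign_key_list` pragmas play the role of `information_schema`. The two tables here are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
""")

def introspect(conn):
    # Build a schema map: columns with types, plus the foreign-key graph.
    schema = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for t in tables:
        cols = [(r[1], r[2]) for r in conn.execute(f"PRAGMA table_info({t})")]
        fks = [(r[3], r[2], r[4])  # (local column, referenced table, referenced column)
               for r in conn.execute(f"PRAGMA foreign_key_list({t})")]
        schema[t] = {"columns": cols, "foreign_keys": fks}
    return schema

print(introspect(conn))
```

The foreign-key graph is what lets an agent choose correct JOIN paths instead of guessing them; on Postgres the equivalent reads come from `information_schema.columns` and `information_schema.table_constraints`.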

Sample-row preview before generation

Types alone are not enough. A status column typed VARCHAR could be 'active' / 'churned' or could be 'A' / 'C' / 'NULL'; the SQL the agent writes for filtering depends on which. Pulling 5–10 sample rows before generation gives the model the actual value distributions, which collapses an entire class of off-by-one filters and case-mismatch bugs that pure-DDL text-to-sql tools produce.
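A minimal version of the sample-row step, using the status-column example from above (in-memory SQLite, hypothetical `accounts` table):

```python
import sqlite3

def sample_rows(conn, table, n=5):
    # Fetch a handful of rows so the model sees real values, not just types.
    cur = conn.execute(f"SELECT * FROM {table} LIMIT {n}")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, "A"), (2, "C"), (3, None)])
# The VARCHAR column turns out to hold 'A' / 'C' / NULL,
# not 'active' / 'churned' — a filter written from types alone would miss.
for row in sample_rows(conn, "accounts"):
    print(row)
```

With those rows in the prompt, the model writes `WHERE status = 'A'` instead of the plausible-but-empty `WHERE status = 'active'`.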

Validation: parse, EXPLAIN, bounded run

Once SQL is generated, three checks happen before a human sees a result. First, parse the query against the dialect — catch syntax errors immediately. Second, EXPLAIN it — surface missing indexes and full-table scans before they become production incidents. Third, run with an enforced LIMIT (100 rows) and a statement timeout — never let a generated query touch a production replica unbounded.
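The three checks can be sketched as one function. This is an illustrative SQLite version: `EXPLAIN QUERY PLAN` doubles as the parse check (preparing the statement surfaces syntax errors), the LIMIT is enforced by wrapping the query in a subselect, and `busy_timeout` stands in for a true statement timeout (on Postgres you would set `statement_timeout`; in SQLite a progress handler would be the stricter mechanism).

```python
import sqlite3

def validate_and_run(conn, sql, limit=100, timeout_ms=5000):
    # 1 + 2. Parse and plan: preparing EXPLAIN QUERY PLAN catches syntax
    # errors and surfaces full-table scans before any real execution.
    try:
        plan = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    except sqlite3.Error as e:
        return {"ok": False, "error": str(e)}
    # 3. Bounded run: enforce a row cap and a lock-wait timeout.
    conn.execute(f"PRAGMA busy_timeout = {timeout_ms}")
    bounded = f"SELECT * FROM ({sql.rstrip().rstrip(';')}) LIMIT {limit}"
    return {"ok": True, "plan": plan, "rows": conn.execute(bounded).fetchall()}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(500)])
ok = validate_and_run(conn, "SELECT x FROM t")
print(len(ok["rows"]))           # capped at 100, not 500
print(validate_and_run(conn, "SELEC x FROM t")["ok"])  # syntax error caught
```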

Iterative refinement, not regenerate-from-scratch

When a user follows up with "now group by month and exclude trial accounts," the agent should edit the prior query — preserving the WHERE clause that was already correct and adding the new conditions. Tools that regenerate from scratch on every turn re-introduce the bugs the user just fixed. Iterative refinement requires the agent to remember its own SQL across turns, which is an architecture choice, not a model choice.

Benchmark reality

Text-to-SQL benchmarks: what 90% accuracy actually means

When a vendor cites 90%+ accuracy, the headline number almost always comes from Spider — the canonical academic benchmark introduced in 2018. Spider has 200 databases with simplified schemas (typically under a dozen tables each) and English questions hand-written to be answerable by a single, well-formed SQL query. State-of-the-art systems do score in the high 80s and 90s on Spider today. That number does not transfer cleanly to a real warehouse.

BIRD, introduced in 2023, is the harder follow-up. It uses real-world databases with messy column names, dirty values, and questions that require reasoning across multiple tables. On BIRD, the same systems that score 90% on Spider typically land in the 60s to low 70s — and BIRD is still cleaner than most production warehouses. The gap between Spider and BIRD is a lower bound on how much public benchmarks can overstate real-business performance.

Two structural reasons benchmark numbers diverge from reality. First, real warehouses contain ambiguous business terms (MRR, churn, active user) that have no single right answer in the schema; benchmarks sidestep this by writing questions that are unambiguous by construction. Second, real workflows are conversational: a user iterates, restates, and corrects. A single-shot accuracy metric never measures the recovery loop, which is where actual usability lives.

The practical implication: ignore vendor accuracy claims and run your own evaluation. Pick 20 questions a real analyst on your team has asked in the last month. Run each through the tool on your real schema. Score how many SQL outputs (a) parse, (b) run, and (c) return the answer the analyst originally computed by hand. That number — your number — is the only one that matters.
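The parse / run / match scoring loop above can be sketched as a small harness. Everything here is a stand-in: `generate_sql` represents whatever tool you are evaluating, and the `sales` table and question set are hypothetical — plug in your own schema and the 20 analyst questions.

```python
import sqlite3

def score(conn, cases, generate_sql):
    # cases: list of (question, expected_rows) where expected_rows is the
    # answer the analyst originally computed by hand.
    results = {"parses": 0, "runs": 0, "matches": 0}
    for question, expected in cases:
        sql = generate_sql(question)
        try:
            cur = conn.execute(sql)      # (a) parses (statement prepared)
            results["parses"] += 1
            rows = cur.fetchall()        # (b) runs to completion
            results["runs"] += 1
        except sqlite3.Error:
            continue
        if rows == expected:             # (c) matches the hand-computed answer
            results["matches"] += 1
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 100.0), ("US", 250.0)])
cases = [("total sales by region", [("EU", 100.0), ("US", 250.0)])]
# Stand-in for the tool under test:
fake_tool = lambda q: "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
print(score(conn, cases, fake_tool))
```

Note that (c) compares result rows, not SQL text — two different queries that return the same correct answer should both count as matches.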

OnlySearch approach

How OnlySearch handles text-to-sql

One implementation of the four capabilities above, mapped to how the OnlySearch desktop app actually works today.

Schema stays on your machine

OnlySearch runs as a Mac/Windows/Linux desktop app. Database connections live locally; the schema is introspected from your machine, not uploaded to a hosted backend. Sensitive workflows (regulated industries, internal databases that should never leave the network) keep the same trust boundary you already have around your client.

Sample-row preview built in

Before generating SQL the agent fetches a small sample from each referenced table. The model sees the actual value distributions — status enums, naming conventions, NULL frequencies — so generated SQL matches your data shape, not the model's prior on what an enum probably looks like.

BYOK with explicit query gating

Bring your own OpenAI, Anthropic, Gemini, or DeepSeek key — your data is not training fodder for someone else's model. Destructive statements (UPDATE, DELETE, DROP, TRUNCATE) require explicit confirmation; SELECT runs are bounded with LIMIT and statement-timeout defaults you can tune per connection.

Conversational refinement

Follow-ups like "now group by month" or "exclude churned accounts" edit the prior SQL rather than restarting. The chat keeps both the question history and the SQL history, so reverting to the version that worked is a click — not a re-prompt that may regress.

Anti-patterns

When NOT to use text-to-sql

Four cases where generated SQL is the wrong call regardless of how good the tool gets — the cost of a wrong query exceeds the time saved by generating it.

  • 01

    High-stakes write operations

    UPDATE, DELETE, DROP, and bulk INSERT against production should be written by hand and reviewed in a pull request. Generated DML hides reasoning and bypasses code review — both of which exist precisely because writes are irreversible. Use text-to-sql for the SELECT that diagnoses the problem; write the fix yourself.

  • 02

    Financial close and audit queries

    Anything an auditor will read needs to be version-controlled, deterministic across runs, and explainable line-by-line. Regenerating SQL on every refresh introduces variance that breaks reproducibility. Build close and audit queries as reviewed views or materialized models; let text-to-sql operate on top of those.

  • 03

    Real-time customer-facing dashboards

    Generating SQL on the render path adds latency, model cost, and a category of failure that shouldn't exist on a hot path: model unavailability. Pre-build the queries, cache the results, and reserve text-to-sql for ad-hoc exploration where a human is already waiting on the answer.

  • 04

    PII queries without RBAC enforcement

    If your only access control on PII is "the analyst won't write a SELECT *," text-to-sql breaks that assumption immediately. Enforce RBAC at the database layer (row-level security, column-level grants, masking views) before adding any natural-language generation. The generated SQL inherits the connection's privileges; an under-scoped connection is the actual control point.

Decision matrix

Generator, CSV-to-SQL, or agent — which to reach for

The three text-to-sql shapes solve different jobs. Pick the right one before reading the spoke for that pattern.

Comparison of single-shot SQL generator, CSV-to-SQL, and text-to-sql agent across input source, query shape, latency, validation, cost, and the matching deep-dive page.
Single-shot generator
  • Input source: Live database connection
  • Best query shape: Filter / aggregate; 2-table joins on declared FKs
  • Wall-clock latency: 3-10 seconds per query
  • Validation: Optional EXPLAIN
  • BYOK token cost: Low — one model call per query
  • Read this spoke: /text-to-sql/sql-generator

CSV-to-SQL
  • Input source: Local CSV files
  • Best query shape: Single-file or multi-CSV joins; type and date cleanup
  • Wall-clock latency: 1-5 seconds per query
  • Validation: Inline preview and type check
  • BYOK token cost: Low — one model call per query
  • Read this spoke: /text-to-sql/csv-to-sql

Agent
  • Input source: Live database connection
  • Best query shape: Multi-step diagnostics that depend on intermediate results
  • Wall-clock latency: 30-90 seconds per question (4-8 model calls)
  • Validation: Built-in — parse, EXPLAIN, bounded run, refinement
  • BYOK token cost: 5-10x higher — multi-step loop
  • Read this spoke: /text-to-sql/agent
FAQ

Common questions about text-to-sql

Is text-to-sql production-ready in 2026?

For analyst-driven SELECT exploration on schemas the agent can introspect, yes — accuracy on single-table and 2-table queries is high enough that most prompts produce a runnable answer. For multi-JOIN queries on unfamiliar schemas, model accuracy degrades, so production use should always pair generation with validation: parse, EXPLAIN, bounded run with LIMIT. For destructive statements (UPDATE, DELETE) the answer is still no — write those by hand.

Does OnlySearch send my schema to OpenAI or Anthropic?

Only what's needed for the question. OnlySearch runs locally on Mac/Windows/Linux and uses BYOK, so the request goes from your machine to the model provider you choose with the key you control. Schema and sample rows for the tables involved in the current question are sent; tables outside the question scope are not. Your data is not used for model training under either provider's BYOK terms.

Which databases are supported?

PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, and DuckDB are first-class. CSV files load into an embedded DuckDB instance on drag-in. Read replicas and cloud-hosted variants of those engines work via the same connector. Per-connection settings let you set query timeouts and write-statement gating, which is recommended for any production read replica.

How do I prevent it from running destructive queries?

Two layers. First, generated UPDATE / DELETE / DROP / TRUNCATE statements are gated behind explicit confirmation; you see the SQL and choose to run it. Second — and more important — connect with a database user that has read-only privileges. The ground-truth control is RBAC at the database layer, not the application layer; we recommend a SELECT-only role for any text-to-sql connection unless you have a specific reason otherwise.
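The first layer — gating on statement type — reduces to a classification step before execution. A minimal sketch, assuming a simple first-keyword check (a production gate should use a real SQL parser, since e.g. CTEs begin with WITH and multi-statement batches can hide a write after a read):

```python
import re

# Statements whose leading keyword mutates data or schema.
DESTRUCTIVE = re.compile(
    r"^\s*(UPDATE|DELETE|DROP|TRUNCATE|INSERT|ALTER)\b", re.IGNORECASE
)

def needs_confirmation(sql: str) -> bool:
    # True means: show the SQL to the user and require an explicit click
    # before it ever reaches the database.
    return bool(DESTRUCTIVE.match(sql))

print(needs_confirmation("DELETE FROM users WHERE id = 1"))  # True
print(needs_confirmation("SELECT * FROM users"))             # False
```

The second layer makes the first almost redundant: a connection role granted only SELECT cannot execute a destructive statement no matter what the application lets through.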

What's the difference between text-to-sql and a SQL generator?

Conversationally, often nothing. Technically, a SQL generator is a one-shot translation step (English in, SQL out). Text-to-sql increasingly refers to the broader workflow: introspect schema, fetch sample rows, generate, validate, run with bounds, present results, refine on follow-up. The generator is the middle step; text-to-sql as a category covers the wrap-around behavior that makes the generator's output trustworthy.

Can I use this on my company's read replica?

Yes — that's the recommended deployment for any non-trivial use. Connect with a read-only role, set a statement timeout (most teams use 30–60 seconds), and rely on the agent's enforced LIMIT for exploratory queries. If your replica is in a private VPC, OnlySearch can connect through the same SSH tunnel or VPN your psql / DBeaver workflow already uses; nothing routes through a third-party proxy.

Does it work offline?

Schema introspection and SQL execution work offline against any database you can reach on your network. The generation step (English to SQL) needs the model provider you brought your key for — that requires an outbound connection to OpenAI / Anthropic / Gemini / DeepSeek. Local models via Ollama or LM Studio work fully offline if you've configured BYOK to point at your local endpoint.

How does it handle ambiguous business terms?

When a question references a term that maps to multiple candidate columns ("MRR" could be plan_price, subscription_amount, or a derived expression), the agent surfaces the ambiguity in chat and asks you to pick once. The choice is then remembered for the conversation. For terms used across a team, a shared semantic glossary is on the roadmap; today, conventions live in the chat history rather than a global registry.
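The ask-once-then-remember behavior is a small piece of conversation state. A sketch under stated assumptions — `ask_user` stands in for the chat prompt, and the closure's dict plays the role of per-conversation memory:

```python
def make_resolver(ask_user):
    # Per-conversation memory: prompt the user the first time an ambiguous
    # term appears, then reuse their choice for the rest of the chat.
    chosen = {}
    def resolve(term, candidates):
        if term not in chosen:
            chosen[term] = ask_user(term, candidates)
        return chosen[term]
    return resolve

calls = []
def fake_prompt(term, candidates):
    # Stand-in for the chat UI; records how often the user was interrupted.
    calls.append(term)
    return candidates[0]

resolve = make_resolver(fake_prompt)
resolve("MRR", ["plan_price", "subscription_amount"])
resolve("MRR", ["plan_price", "subscription_amount"])
print(calls)  # ['MRR'] — the user was asked exactly once
```

A shared glossary would promote `chosen` from conversation scope to team scope; the resolution logic itself stays the same.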

What's BYOK and why does OnlySearch require it?

BYOK is bring-your-own-key — you supply the API key for the model provider rather than paying OnlySearch a metered model markup. This matters for two reasons. First, your data goes directly to the provider whose terms you've already accepted, with no intermediary. Second, you pay model costs at provider rates with no margin layered on, which for heavy users is meaningfully cheaper than bundled-model SaaS pricing.

When should I write SQL by hand instead?

Five cases. (1) Anything that mutates data. (2) Anything an auditor reads. (3) Anything on a customer-facing render path. (4) Cross-database federation that involves credentials your laptop shouldn't have. (5) Performance-critical queries where the cost of a wrong query plan exceeds the time saved by generation. For every other analyst workflow — diagnostic SELECTs, ad-hoc exploration, draft-then-refine — text-to-sql earns its place.

Ask your database in plain English

Connect Postgres, MySQL, or a warehouse — or drop in a CSV. OnlySearch introspects the schema, samples real rows, generates SQL, and validates before running. BYOK keeps your data on the model provider you chose. Free plan refills weekly.