
AI SQL Generator: From English to Verified Query

An AI SQL generator is a tool that takes a plain-English question and returns SQL. The 2026 distinction worth making is between toy generators (English in, SQL out, no validation) and verified generators (English in, schema introspected, SQL parsed, EXPLAIN run, bounded execution). Both technically generate SQL. Only the second produces SQL you'd want to run against a database that matters. This page covers what works in modern schema-aware SQL generation, where the toy variant breaks, and what changes when validation moves from optional to default.
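The verify-before-run loop is small enough to sketch. Below is a minimal illustration using Python's bundled SQLite driver; the table, data, and row cap are invented for the example, and real tools run the same steps (parse, plan inspection, bounded fetch) against the target engine's own EXPLAIN.

```python
import sqlite3

def verified_run(conn, sql, row_cap=100):
    """Run generated SQL only after it parses and its plan is inspected."""
    # 1. Parse check: EXPLAIN compiles the statement without executing it,
    #    so syntax errors and unknown identifiers surface here.
    conn.execute(f"EXPLAIN {sql}")
    # 2. Plan check: flag full-table scans before anything touches real rows.
    #    (EXPLAIN QUERY PLAN rows put the human-readable detail at index 3.)
    plan = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    warnings = [row[3] for row in plan if row[3].startswith("SCAN")]
    # 3. Bounded execution: never pull an unbounded result set.
    rows = conn.execute(sql).fetchmany(row_cap)
    return rows, warnings

# Toy schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
conn.executemany("INSERT INTO orders (amount_cents) VALUES (?)", [(100,), (250,)])
rows, warnings = verified_run(conn, "SELECT id, amount_cents FROM orders")
```

A malformed or hallucinated-column query fails at step 1 before any data is read; an unindexed scan shows up in `warnings` before you pay for it.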

What works

What schema-aware SQL generators do reliably

  • Single-table SELECT with filters and aggregates

    If the data fits one table and the question is a filter-and-aggregate ("top 10 customers by revenue last quarter"), generation accuracy is near-perfect on schemas the model has seen the DDL for. The generator reads column names and types, picks the obvious aggregate, and writes idiomatic SQL for the dialect. This is the workhorse case where AI SQL generation has clearly arrived — the gap to a hand-written query is closed.

  • Two-table joins on declared foreign keys

    When the schema has explicit foreign-key constraints between two tables, modern generators reliably produce correct JOIN clauses. The constraint metadata removes the guessing — the generator knows orders.customer_id references customers.id rather than inferring it from naming conventions. Most analyst questions on transactional schemas (orders + customers, events + users) fall in this shape and produce runnable SQL on the first try.

  • Date and time handling for the dialect

    PostgreSQL's date_trunc, BigQuery's TIMESTAMP_TRUNC, MySQL's DATE_FORMAT, Snowflake's DATE_PART — every dialect handles time differently and humans get this wrong constantly. SQL generators have ingested enormous amounts of dialect-specific date handling and produce the right syntax for the target engine, including timezone conversions and week-start conventions, on the first try. This alone saves analysts a meaningful amount of doc-flipping.

  • Common analytical patterns with window functions

    Cohort retention, running totals, rank-within-group, and month-over-month delta — patterns where the SQL is non-trivial but standardized — generate cleanly. The model has seen these patterns thousands of times and produces idiomatic implementations with PARTITION BY and ORDER BY clauses laid out the way a senior analyst would write them. Junior analysts learn faster from generated SQL on these patterns than from documentation.
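To make the dialect divergence concrete, here is a small lookup of the month-truncation idioms named above. This is a sketch, not a generator: the expressions are the common forms, and you should confirm each against your engine's documentation before relying on them.

```python
# Month-truncation syntax diverges by engine -- exactly the detail a
# dialect-aware generator gets right on the first try.
MONTH_TRUNC = {
    "postgresql": "date_trunc('month', {col})",
    "bigquery":   "TIMESTAMP_TRUNC({col}, MONTH)",
    "mysql":      "DATE_FORMAT({col}, '%Y-%m-01')",
    "snowflake":  "DATE_TRUNC('MONTH', {col})",
}

def month_bucket(dialect: str, col: str) -> str:
    """Return the idiomatic month-bucket expression for a dialect."""
    return MONTH_TRUNC[dialect].format(col=col)

print(month_bucket("bigquery", "created_at"))
# TIMESTAMP_TRUNC(created_at, MONTH)
```

Timezone handling and week-start conventions add further per-dialect variation on top of this; the table above covers only the truncation call itself.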

What breaks

Where toy SQL generators fail

  • 01

    Schemas the generator has never seen

    A generator that doesn't introspect your live database is guessing at column names. Even with a pasted DDL, if the schema is large, the model will reference plausible-but-wrong identifiers in the generated query. Always pick generators that read the live information_schema rather than ones that ask for a DDL paste.

  • 02

    Three-or-more-table joins on undeclared keys

    When join keys are conventional (matched by column name) rather than declared as foreign-key constraints, generators frequently invent the wrong JOIN graph. The query compiles, runs, and returns wrong numbers because two tables that look related are joined on the wrong columns. The fix is sample-row preview and EXPLAIN — never trust a multi-JOIN query you haven't validated.

  • 03

    Business terms with no canonical column

    MRR, churn, qualified lead — every team defines these slightly differently in their warehouse. A generator with no team-level glossary picks one candidate column and generates against it. Sometimes it picks correctly; sometimes you ship a dashboard that's silently wrong for a quarter. This is the most expensive failure mode of toy generators.

  • 04

    Performance — the SQL that runs but kills the warehouse

    Generators optimize for correctness, not query plan. A correct query with no index hint can scan a billion-row events table and time out the database. Tools without an EXPLAIN step before execution put this risk on the user; tools with EXPLAIN built in surface the warning before you click run.

  • 05

    Write statements masquerading as reads

    Generators that don't gate on statement type will happily produce UPDATE or DELETE when the prompt is ambiguous ("clean up duplicate users"). Without explicit confirmation and a dry-run preview, the first time you notice the deletion is when a customer reports missing data. The right answer is both together: a read-only database connection and statement-type gating.

  • 06

    Cost on bundled-model SaaS pricing

    Generators that bundle the model into the subscription mark up token costs significantly. For heavy users (analysts asking 100+ questions a day) the gap is real money: bundled pricing often runs 3-5x what BYOK (bring-your-own-key) would cost on the same provider. If you're a heavy user, evaluate BYOK pricing carefully before committing to a bundled-model tool.
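The read-only-connection half of the defense in item 05 is cheap to set up. Below is a minimal sketch using SQLite's URI mode; on Postgres or MySQL the equivalent gate is a dedicated role granted SELECT only. The database file and table are illustrative.

```python
import os
import sqlite3
import tempfile

# Create a throwaway database, then reopen it read-only via SQLite's
# URI mode. Writes are now rejected by the engine itself, regardless of
# what SQL a generator produces.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
rw.commit()
rw.close()

ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
ro.execute("SELECT * FROM users")       # reads pass
try:
    ro.execute("DELETE FROM users")     # writes fail at the driver
    blocked = False
except sqlite3.OperationalError:
    blocked = True
```

Statement-type gating in the tool is the second layer: even on a read-only connection, refusing to emit UPDATE/DELETE at all avoids confusing engine errors mid-conversation.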

Real prompts

4 SQL generation tasks where validation matters

Task 1: Top 10 customers by revenue last quarter

Plain-English prompt

show me the top 10 customers by total revenue for Q1 2026

Generated SQL

SELECT
  c.id,
  c.name,
  SUM(o.amount_cents) / 100.0 AS revenue_usd
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at >= '2026-01-01'
  AND o.created_at <  '2026-04-01'
  AND o.status = 'paid'
GROUP BY c.id, c.name
ORDER BY revenue_usd DESC
LIMIT 10;

Validation note: two checks worth doing. (1) The WHERE clause uses 'paid' — a sample-row preview confirms whether your status enum is 'paid', 'PAID', or 'completed'. (2) Revenue is computed as amount_cents / 100; verify your team treats refunds the same way (this query does not subtract refunds — flag it if your definition of revenue is net).
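The sample-row preview in check (1) is a one-query habit. A sketch against a toy SQLite table (the orders schema and values here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [("paid",), ("paid",), ("refunded",), ("pending",)])

# One cheap query answers "is the enum 'paid', 'PAID', or 'completed'?"
# before the generated WHERE clause ships with the wrong literal.
statuses = [s for (s,) in conn.execute(
    "SELECT DISTINCT status FROM orders ORDER BY status")]
print(statuses)  # ['paid', 'pending', 'refunded']
```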

Task 2: Monthly retention cohort for Q1 signups

Plain-English prompt

monthly retention cohort for users who signed up in Q1 2026

Generated SQL

WITH cohort AS (
  SELECT
    id AS user_id,
    DATE_TRUNC('month', created_at) AS cohort_month
  FROM users
  WHERE created_at >= '2026-01-01'
    AND created_at <  '2026-04-01'
),
activity AS (
  SELECT
    user_id,
    DATE_TRUNC('month', occurred_at) AS active_month
  FROM events
  WHERE occurred_at >= '2026-01-01'
  GROUP BY user_id, active_month
)
SELECT
  c.cohort_month,
  a.active_month,
  COUNT(DISTINCT c.user_id) AS active_users
FROM cohort c
LEFT JOIN activity a USING (user_id)
GROUP BY c.cohort_month, a.active_month
ORDER BY c.cohort_month, a.active_month;

Validation note: Cohort math is famously easy to get subtly wrong. Verify two things: (1) 'active' in your team's vocabulary — does any event count, or only specific events (login, transaction)? (2) the LEFT JOIN preserves cohort users with no activity; without that, retention is overstated.

Task 3: Churned users and their last-used feature

Plain-English prompt

users who churned in the last 30 days and which feature they last used before churning

Generated SQL

WITH churned AS (
  SELECT
    user_id,
    churned_at
  FROM subscriptions
  WHERE status = 'churned'
    AND churned_at >= NOW() - INTERVAL '30 days'
),
last_event AS (
  SELECT DISTINCT ON (e.user_id)
    e.user_id,
    e.event_name,
    e.occurred_at
  FROM events e
  JOIN churned c ON c.user_id = e.user_id
  WHERE e.occurred_at <= c.churned_at
  ORDER BY e.user_id, e.occurred_at DESC
)
SELECT
  c.user_id,
  c.churned_at,
  l.event_name AS last_feature_used,
  l.occurred_at AS last_event_at
FROM churned c
LEFT JOIN last_event l USING (user_id);

Validation note: DISTINCT ON is Postgres-specific. On BigQuery / Snowflake the generator should produce ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY occurred_at DESC). Cross-dialect prompts are a place generators sometimes pick the wrong syntax — confirm the dialect was detected before running.
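The portable rewrite the note describes is easy to verify locally. The sketch below runs the ROW_NUMBER() form against SQLite (window functions require SQLite 3.25+, bundled with current Python); the events table and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, event_name TEXT, occurred_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "export",    "2026-03-01"),
    (1, "dashboard", "2026-03-09"),
    (2, "api_call",  "2026-03-05"),
])

# Portable replacement for Postgres's DISTINCT ON: rank each user's
# events newest-first, keep rank 1. Runs on BigQuery, Snowflake,
# MySQL 8+, and SQLite 3.25+ with the same shape.
last_events = conn.execute("""
    SELECT user_id, event_name FROM (
      SELECT user_id, event_name,
             ROW_NUMBER() OVER (PARTITION BY user_id
                                ORDER BY occurred_at DESC) AS rn
      FROM events
    ) AS ranked
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(last_events)  # [(1, 'dashboard'), (2, 'api_call')]
```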

Task 4: GDPR audit: PII columns by table

Plain-English prompt

give me a table-level audit of all columns that look like PII

Generated SQL

SELECT
  c.table_schema,
  c.table_name,
  c.column_name,
  c.data_type
FROM information_schema.columns c
WHERE c.table_schema NOT IN ('pg_catalog', 'information_schema')
  AND (
    c.column_name ILIKE '%email%'
    OR c.column_name ILIKE '%phone%'
    OR c.column_name ILIKE '%ssn%'
    OR c.column_name ILIKE '%first_name%'
    OR c.column_name ILIKE '%last_name%'
    OR c.column_name ILIKE '%address%'
    OR c.column_name ILIKE '%dob%'
    OR c.column_name ILIKE '%date_of_birth%'
  )
ORDER BY c.table_schema, c.table_name, c.column_name;

Validation note: Naming-based PII detection misses unconventional column names ("contact_handle", "emergency_no"). Use this as a starting point, not the audit itself; pair with a sample-row scan that flags fields whose values match email or phone regexes regardless of column name.
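The value-based scan can be a few lines. A sketch with deliberately simple regexes — real audits need stricter patterns, larger samples, and per-type detectors; the table and column names here are illustrative:

```python
import re
import sqlite3

# Loose patterns for illustration; production scans use stricter ones.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scan_column(conn, table, column, sample=100):
    """Count sample values that *look* like PII, whatever the column is named.
    Identifiers can't be bound as parameters, so they are quoted inline
    (trusted input assumed in this sketch)."""
    rows = conn.execute(
        f'SELECT "{column}" FROM "{table}" LIMIT ?', (sample,)).fetchall()
    return sum(1 for (v,) in rows
               if isinstance(v, str) and (EMAIL.search(v) or PHONE.search(v)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (contact_handle TEXT)")  # no 'email' in the name
conn.executemany("INSERT INTO users VALUES (?)",
                 [("ada@example.com",), ("+1 (415) 555-0100",), ("n/a",)])
print(scan_column(conn, "users", "contact_handle"))  # 2
```

A column named `contact_handle` sails past the ILIKE filter above but is caught here, which is why the two scans belong together.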

Decision criteria

When an AI SQL generator is the right call

  • Diagnostic SELECTs on a familiar schema: Use it freely — pick the schema-aware option, accept the SQL, run with LIMIT 100.
  • Learning a new dialect: Generators are excellent dialect translators. Generate in your home dialect, then ask for the same query in BigQuery or Snowflake syntax.
  • First draft of a complex window-function query: Faster than docs for analytical patterns. Always run EXPLAIN on the result — generators optimize for correctness, not plan.
  • Anything that mutates data: Don't. Write UPDATE / DELETE / DDL by hand and review them in a pull request — the cost of generating a wrong write outweighs every other benefit.

Honest concession

When to stay with what you have

  • Stay with Cursor if SQL is one tab in a code session — when you're already editing migrations, query.go, or a dbt model in your IDE, asking Cursor to generate the SQL inline beats switching tools entirely.
  • Stay with Snowflake Cortex if your data is exclusively in Snowflake and procurement won't approve another vendor — Cortex inherits your existing roles and grants without a new tool review.
  • Stay with Vanna AI if you've already built and tuned a vector store over your schema — its RAG-on-DDL approach can outperform fresh introspection on schemas you've trained it against repeatedly.
  • Stay with hand-written SQL for performance-critical queries on hot paths — when the cost of a wrong query plan exceeds the time saved by generation, write it and review it in a pull request.

FAQ

Common questions

What's the difference between an AI SQL generator and text-to-SQL?

Conversationally, none. Technically, a SQL generator is the translation step (English in, SQL out). Text-to-SQL increasingly refers to the broader workflow that wraps the generator: introspect, sample, generate, validate, run with bounds, refine. Both terms are used for both meanings in product copy.

Can I use an AI SQL generator with a SQL file already on my disk?

Yes — you can paste an existing query and ask for modifications ("add a CTE that filters to enterprise customers"), and the generator will edit rather than rewrite. This is the right way to evolve a query you already trust without losing the parts you've already validated.

Does it work on a read replica?

Yes — that's the recommended setup. Connect with a read-only database role, set a statement timeout (30-60 seconds), and let the generator's enforced LIMIT do the rest. Replica-aware generators also avoid suggesting queries that would benefit from indexes only present on the primary.
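The enforced-LIMIT piece of that setup can be approximated with a small wrapper. This sketch uses a naive regex check for illustration; production tools make the decision on the parse tree, where OFFSET clauses and subquery LIMITs can't fool a string match.

```python
import re

def bound(sql: str, limit: int = 100) -> str:
    """Append a LIMIT unless the statement already ends with one.
    Naive string check -- a sketch, not a parser."""
    if re.search(r"\bLIMIT\s+\d+\s*;?\s*$", sql, re.IGNORECASE):
        return sql
    return sql.rstrip().rstrip(";") + f" LIMIT {limit};"

print(bound("SELECT * FROM events"))  # SELECT * FROM events LIMIT 100;
print(bound("SELECT 1 LIMIT 5;"))     # SELECT 1 LIMIT 5;
```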

How accurate are AI SQL generators in 2026?

On single-table and 2-table queries with declared foreign keys: high enough that hand-tweaking is rare. On 3+ table joins with undeclared keys: meaningful error rate, validation is mandatory. On ambiguous business terms (MRR, churn): the generator's accuracy is bounded by your team's glossary, not the model.

Can it write SQL for stored procedures and views?

Generators handle SELECT views well. Stored procedures (PL/pgSQL, T-SQL) work for simple cases but accuracy drops on procedural logic with multiple branches. Treat stored-procedure generation as a starting draft — review more carefully than you would a SELECT.

What about migrations?

Generated migrations are useful for boilerplate (adding a column, creating an index) but should always be reviewed in a pull request before applying. Don't run generated migrations against production from the chat surface — the cost of an unintended schema change is too high.

How does OnlySearch's SQL generator differ from online tools?

Three things. Schema introspection is live (not a DDL paste). Validation is built in — parse, EXPLAIN, bounded run with LIMIT happen automatically before you see results. The connection lives on your machine via BYOK, so your schema and data don't traverse a third-party hosted service.

Generate SQL that's been validated, not just written

Connect a Postgres, MySQL, BigQuery, Snowflake, SQLite, or DuckDB database. OnlySearch introspects the schema, samples rows, generates SQL, parses it, runs EXPLAIN, and executes with a bounded LIMIT — before you see the result. Free plan refills weekly.