
CSV to SQL: Upload a File, Ask in English, Export

CSV-to-SQL is the file-based version of text-to-SQL: drop in a CSV, the tool infers a schema, you ask in English, and you get both the answer and the SQL that produced it. The workflow exists because most analyst questions never make it to a real database — the data is a Sheets export, a marketing-tool CSV, or a spreadsheet someone emailed. OnlySearch loads any CSV into an embedded DuckDB on drag-in, which means you get a real SQL engine and exportable queries without provisioning anything. This page covers what makes CSV-to-SQL work, where it breaks (NULL handling, type inference, encoding), and the tasks that benefit most.

What works

What CSV-to-SQL handles cleanly

  • Schema inference from the header and a leading row sample

    Loading a CSV into DuckDB with type sniffing produces a usable schema in milliseconds — column types are detected from a leading sample of rows, dates are parsed when their format is unambiguous, and integers and decimals separate cleanly. The inferred types are good enough that the generated SQL produces correct casts on the first try for the typical analyst CSV (revenue, dates, IDs, categorical columns); the sketch after this list shows a quick way to inspect what was inferred.

  • Multi-CSV joins on shared keys

    Drop in orders.csv and customers.csv together, then ask a question that spans both. The generator detects the common key (customer_id) and produces a JOIN against the two as if they were tables in a database. This is the case where CSV-to-SQL beats spreadsheet VLOOKUP by a significant margin — the join semantics are honest, NULLs propagate correctly, and the generated SQL is portable to a real warehouse later.

  • Exportable SQL you can take elsewhere

    The generated query is a string you can copy into psql, BigQuery, or your dbt project. CSV-to-SQL becomes a draft surface for queries that will eventually run in production — analysts iterate on the question against the file export, then promote the working SQL to the warehouse view it informs. The exportable artifact is what makes the file-based workflow productive for serious analysis, not just one-off lookups.

  • Date parsing across regional formats

    MM/DD/YYYY, DD/MM/YYYY, YYYY-MM-DD, ISO 8601 with timezone, Excel serial numbers — every CSV ships its dates in a different format, and breaking on dates is a failure mode every analyst has hit. Modern CSV-to-SQL tools detect the format from the column's distribution and produce the right CAST or strptime call automatically, then surface what they detected so you can correct it if it's wrong.
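
A quick way to check what the first item describes before you ask anything, as a minimal sketch (it assumes a file named orders.csv in the working directory; DESCRIBE works against a file path the same way it works against a table):

-- Show the column names and types DuckDB sniffed from the file
DESCRIBE SELECT * FROM 'orders.csv';

-- Or materialize the load to lock the inferred types in for the session
CREATE TABLE orders AS SELECT * FROM read_csv_auto('orders.csv');
DESCRIBE orders;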

What breaks

Where CSV-to-SQL hits its limits

  • 01

    NULL handling on inconsistent encodings

    Real CSVs encode missing values as empty strings, the literal text 'NULL', 'N/A', '-', '#N/A', and Excel-style #VALUE! errors — sometimes all in the same file. Type inference treats each as a different value, so a numeric column with mixed NULL encodings becomes a string column. The fix is upfront cleaning or explicit NULL declarations (the read_csv sketch after this list shows how to declare them); tools that try to guess silently produce subtly wrong aggregates.

  • 02

    Type inference on partial samples

    DuckDB's sniffer samples only a prefix of the file for type detection (roughly the first 20,000 rows by default). If a later row has a value that doesn't fit the inferred type (a decimal in what looked like an integer column, a string in what looked like a date), the load fails or the column coerces to string and breaks downstream queries. For files where you can't trust the head to be representative, force a full-file type scan before generation (the sketch after this list shows the option that does it).

  • 03

    Encoding and delimiter surprises

    European CSVs use semicolons and decimal commas. Asian CSVs ship in Shift-JIS or GB2312. Excel exports drop a UTF-8 BOM that breaks naive parsers. Tools that don't detect encoding and delimiter robustly produce a single-column table with everything stuffed into it. Pick a tool that surfaces what it detected and lets you override.

  • 04

    Files larger than memory

    Embedded DuckDB streams from disk, but interactive query latency degrades once files reach multiple GB. For files in the 5-10+ GB range, the right move is loading to a real columnar warehouse first. CSV-to-SQL is sized for the typical analyst export (roughly 1 MB to 1 GB) — beyond that, the database connection workflow is the better fit.

  • 05

    Wide tables with hundreds of columns

    Marketing-tool exports with 200+ columns push the model's schema-prompt budget. The generator may reference columns that exist but pick the wrong one for the question because it can't see all of them at once. Renaming or pre-filtering to the columns relevant to your question before generation produces meaningfully better SQL.

  • 06

    Heterogeneous CSVs concatenated as one file

    Files where different rows have different columns (a vertical concat of two exports with different headers) confuse type inference completely. Inspect any file showing 'unknown column' or 'mismatched row length' warnings before generating; the right fix is splitting at the load step, not pushing through.
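
When the first three failure modes above show up, the usual fix is to stop relying on auto-detection and pass explicit options to DuckDB's read_csv. A minimal sketch, assuming a semicolon-delimited export named leads.csv; the option names are DuckDB's, the values are illustrative, and nullstr accepts a list of values in recent DuckDB releases (older releases take a single string):

SELECT *
FROM read_csv(
  'leads.csv',
  delim = ';',                                  -- European semicolon-delimited export
  decimal_separator = ',',                      -- numbers written as 1.234,56
  nullstr = ['', 'NULL', 'N/A', '-', '#N/A'],   -- treat every one of these as NULL
  sample_size = -1                              -- sniff types from the whole file, not just a prefix
);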

Real prompts

4 CSV-to-SQL tasks that beat the spreadsheet equivalent

Task 1: Monthly revenue by region from orders.csv

Plain-English prompt

from orders.csv, show monthly revenue by region for 2026

Generated SQL

SELECT
  date_trunc('month', order_date) AS month,
  region,
  SUM(amount) AS revenue
FROM 'orders.csv'
WHERE order_date >= DATE '2026-01-01'
  AND order_date <  DATE '2027-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;

Validation note: DuckDB queries CSV files by filename directly — the engine handles the load and parse for you. Verify whether 'amount' is gross or net revenue in your team's vocabulary, and check whether refunds appear as separate negative-amount rows.
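
A quick check for the refund question in the note, as a sketch (it assumes refunds land in the same file as negative-amount rows):

SELECT
  COUNT(*)    AS negative_amount_rows,
  SUM(amount) AS negative_amount_total
FROM 'orders.csv'
WHERE amount < 0;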

Task 2: Customers with > $10k LTV from joined CSVs

Plain-English prompt

join orders.csv and customers.csv on customer_id, find customers with > $10k LTV

Generated SQL

SELECT
  c.customer_id,
  c.name,
  c.email,
  SUM(o.amount) AS lifetime_value
FROM 'customers.csv' c
JOIN 'orders.csv'    o USING (customer_id)
GROUP BY c.customer_id, c.name, c.email
HAVING SUM(o.amount) > 10000
ORDER BY lifetime_value DESC;

Validation note: USING (customer_id) requires identical column names across files — verify both CSVs use the same key spelling. Customers with no orders are excluded by the inner JOIN, which is usually what an LTV-threshold question wants; if you want all customers (zero-LTV included), switch to LEFT JOIN and COALESCE the SUM, as in the variant sketched below.
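
The zero-LTV variant from the note, as a sketch against the same two files (the HAVING filter is dropped so customers with no orders survive, and COALESCE turns their missing sum into 0):

SELECT
  c.customer_id,
  c.name,
  c.email,
  COALESCE(SUM(o.amount), 0) AS lifetime_value
FROM 'customers.csv' c
LEFT JOIN 'orders.csv' o USING (customer_id)
GROUP BY c.customer_id, c.name, c.email
ORDER BY lifetime_value DESC;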

Task 3: Validate which rows are missing required fields

Plain-English prompt

validate this CSV: which rows have missing required fields (email, customer_id, signup_date)?

Generated SQL

WITH numbered AS (
  -- number rows in scan order so offending rows can be located in the file
  SELECT row_number() OVER () AS row_num, *
  FROM 'leads.csv'
)
SELECT
  row_num,
  customer_id,
  email,
  signup_date
FROM numbered
WHERE customer_id IS NULL
   OR email        IS NULL
   OR email        = ''
   OR signup_date  IS NULL
ORDER BY row_num
LIMIT 200;

Validation note: The empty-string check on email catches Excel exports where blanks become '' instead of NULL. For phone-shaped or address-shaped fields, a regex predicate (using regexp_matches) is more reliable than a NULL check — many CSVs have placeholder values (like '0' or '-') that pass NULL filters but are still empty.
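
A sketch of the regex approach for placeholder values; the phone column is an assumption for illustration and not part of the generated query above:

-- Flag rows whose phone value is missing, blank, or an obvious placeholder
SELECT customer_id, phone
FROM 'leads.csv'
WHERE phone IS NULL
   OR NOT regexp_matches(phone, '[0-9]');   -- value contains no digit at all ('-', 'n/a', 'none')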

Task 4: Convert date column from MM/DD/YYYY to ISO 8601

Plain-English prompt

convert this CSV's signup_date column from MM/DD/YYYY to ISO 8601

Generated SQL

SELECT
  *,
  strftime(
    try_strptime(signup_date, '%m/%d/%Y'),
    '%Y-%m-%d'
  ) AS signup_date_iso
FROM 'leads.csv';

Validation note: DuckDB's strptime raises an error on rows that don't match the format, so the query uses try_strptime, which returns NULL instead. After running, count the NULLs in signup_date_iso to surface malformed rows; real exports routinely contain a small share of format outliers (single-digit month/day or stray characters).
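
A follow-up count for the note, as a sketch over the same file; it counts rows that carry a value but fail to parse with the detected format:

SELECT COUNT(*) AS malformed_signup_dates
FROM 'leads.csv'
WHERE signup_date IS NOT NULL
  AND try_strptime(signup_date, '%m/%d/%Y') IS NULL;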

Decision criteria

When CSV-to-SQL is the right tool

  • Marketing CSV exports you query once and discard: faster than loading to a database. Drop the file, ask the question, copy the answer.
  • Sheets data you'd otherwise pivot in spreadsheets: use it — multi-criteria filters and aggregations are 10x faster in SQL than in nested IF formulas.
  • Joining two exports from different tools: strong fit — JOIN semantics in SQL are correct in ways VLOOKUP isn't, and NULL handling is honest.
  • Files larger than ~5 GB or that update every minute: don't — load to a real warehouse instead. CSV-to-SQL is for analyst-sized one-shot exports.

Honest concession

When to stay with what you have

  • Stay with ChatGPT Code Interpreter if you don't need SQL on the way out — Code Interpreter's Python path is more flexible for charts, statistics, or one-off transforms where a portable query isn't the deliverable.
  • Stay with Excel or Power Query if your CSV is small and your team lives in Excel — joining two small files there is faster than firing up another tool, and the deliverable is already an Excel artifact.
  • Stay with Hex if your team needs collaborative hosted notebooks — when three people are looking at the same analysis, Hex's shared-notebook surface beats a per-analyst desktop app.
  • Stay with the raw DuckDB CLI if you want the full DuckDB feature surface — extensions, performance tuning, advanced types — exposed without any wrapper between you and the engine.

FAQ

Common questions

Do I need a database to use CSV-to-SQL?

No. OnlySearch ships an embedded DuckDB engine — drop the CSV in and a real SQL engine queries it locally. The file never leaves your machine and you don't provision any infrastructure. For one-off file analysis this is the lightest-weight workflow available.

What CSV variants are supported?

Standard comma-separated, semicolon-separated (European), tab-separated (TSV), pipe-separated, and quoted variants. UTF-8 with or without BOM is auto-detected. For non-UTF-8 encodings (Shift-JIS, GB2312) you'll need to specify the encoding at load — the tool surfaces this as a prompt rather than silently mangling the file.

Can I join more than two CSVs at once?

Yes — drop multiple files into the same workspace and ask a question that references all of them. The generator produces a multi-table JOIN. A common pattern: orders.csv + customers.csv + products.csv for a full sales analysis.

What happens to the data after I close the chat?

The CSV stays on your machine where you put it; OnlySearch maintains a reference to the file path, not a copy. If you delete the file, the workspace surfaces a not-found error rather than a stale snapshot. This matches how analysts actually work — the source of truth is the file on disk, not a copy living in a tool.

Can I export the generated SQL to run in BigQuery / Snowflake?

Yes — every generated query is a copyable string. DuckDB SQL is similar enough to BigQuery and Snowflake that most queries port with light editing (function names, mostly). When you ask the generator for a different dialect explicitly ("give me this in BigQuery syntax"), it produces the dialect-specific version directly.

What about Excel files?

Excel .xlsx loads via the same workflow — DuckDB's excel extension parses .xlsx sheets directly. For multi-sheet workbooks each sheet is treated as a separate table. The /excel-ai guide on this site covers the spreadsheet-specific workflow in more depth.

Is there a row limit?

Practical limits come down to machine RAM. DuckDB streams from disk, so it handles files larger than memory, but interactive querying of 10+ GB files isn't what the product is sized for. For files in the 1 MB to 5 GB range queries stay interactive; beyond that, load to a real warehouse and use the database connection workflow instead.

Drop a CSV in. Ask in English. Export the SQL.

OnlySearch loads CSV files into embedded DuckDB on drag-in, infers types, and lets you ask questions across single or multiple files. The generated SQL is a string you can copy elsewhere. Free plan refills weekly.