April 2026 · Maceo Cardinale Kwik

We stopped guessing how good Datost was

Why we needed a benchmark

I built Datost on nights and weekends while I was working at Traba, a staffing marketplace. Traba had a lot of data, a lot of people who needed answers from it, and not enough analysts to go around. I kept watching coworkers struggle with the same problem: they had a question about the business, and the path to an answer ran through SQL they couldn't write, or a dashboard that didn't exist yet, or a Slack message to an analyst who was busy. So I started building something to close that gap.

It worked. Within a few months, about 80% of the company was using Datost to ask questions about Traba's data. Operations people, finance, customer success, leadership. I was sitting right there with them, live, from 9am to 9pm most days. Every time Datost got an answer wrong, I saw it. Every time someone asked an ambiguous question and Datost guessed instead of clarifying, I saw the damage. Every awkward workflow, every place the product sounded more certain than it should have. I saw all of it in real time.

That feedback loop is how the product got good. Reality kept humiliating it, and I kept fixing what broke. When I applied to YC, Traba was my first customer. I applied with real users, real usage, and a clear picture of every weakness.

Then I got into YC and moved to SF, and I lost the loop. I was no longer sitting inside the product with users all day. When that happens to a product whose value lives or dies on accuracy, you start building against your own imagination instead of reality. That is a dangerous place to be.

So I needed a substitute. Something adversarial that would expose ambiguity, punish guessing, and force the system to prove it understood a question before it wrote SQL.

I went looking for the hardest benchmark I could find. That turned out to be BIRD-Interact, a benchmark from the University of Hong Kong and Google Cloud. It was published at ICLR 2026, one of the top machine learning conferences, and selected for an oral presentation, which means the reviewers thought it was in roughly the top 1-2% of submissions. 600 questions, 22 PostgreSQL databases, all deliberately ambiguous. The kind of ambiguous where the question says "underperforming assets" but the schema has no column called that. You have to go figure out what underperforming means before you can write any SQL.

Questions like that are exactly what people actually ask Datost. The best public reference point on the benchmark is Claude Opus 4.6, which gets 33% right. Most frontier models are in the 20s.

These databases are genuinely nasty

Most text-to-SQL benchmarks give you clean tables with readable column names. BIRD doesn't. The BIRD family has always been about realistic, ugly data, and BIRD-Interact takes that further.

The solar_panel database has 10 tables. Three of them store their actual data inside jsonb columns. To find a plant's current power output, you need to reach into elec_perf_snapshot->'power'->>'power_now_w'. Its temperature coefficient lives in a different table entirely. Some values are None, some are nan, one column stores resistance as the string '0.17 Ω' instead of a number.
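
To make the jsonb navigation concrete, here is roughly what pulling a plant's current power output looks like in PostgreSQL. This is a sketch: the table name and the plant_id column are placeholders; only the elec_perf_snapshot->'power'->>'power_now_w' path and the stray 'None' and 'nan' strings come from the benchmark.

```sql
-- Sketch only: perf_snapshots and plant_id are invented names.
SELECT
    plant_id,
    (elec_perf_snapshot -> 'power' ->> 'power_now_w')::numeric AS power_now_w
FROM perf_snapshots
-- the jsonb values sometimes hold the literal strings 'None' or 'nan'
WHERE elec_perf_snapshot -> 'power' ->> 'power_now_w' NOT IN ('None', 'nan');
```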

The crypto_exchange schema has columns named ORD_STAMP, exchSpot, UserRef, MARG_FORM. No consistency in casing. TimeCode is a timestamp, acctScope is nullable with the note "NULL means account scope not specified." The column meaning file helpfully explains that market_note contains things like 'ETH-USDT', which is only useful if you already know what that means.

The hulushows database stores Hulu branding assets as JSON blobs with CDN URLs inside them. The organ_transplant database has 14 tables and 6 jsonb columns. The archeology_scan database has 14 tables and 10 jsonb columns. polar_equipment has a 410-line schema.

On top of the schema mess, every database has a separate knowledge base of metric definitions. The solar_panel knowledge base defines "Specific Yield" as energy output divided by rated power capacity. "System Unavailability" is calculated from MTBF and MTTR using a formula you'd find in a reliability engineering textbook. These definitions are what make "underperforming" actually mean something. If you don't look them up, you're guessing.
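
As a rough sketch of what those definitions cash out to in SQL: the column and table names below are invented, and the unavailability expression is the standard textbook MTTR / (MTBF + MTTR), which may differ in detail from the benchmark's exact wording.

```sql
-- Sketch only: plant_reliability_stats and its columns are invented names.
SELECT
    plant_id,
    energy_output_kwh / rated_power_kw      AS specific_yield,          -- energy output / rated power capacity
    mttr_hours / (mtbf_hours + mttr_hours)  AS system_unavailability    -- textbook MTBF/MTTR formula
FROM plant_reliability_stats;
```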

22 databases like this. Every question assumes you can navigate the mess. That's why 33% is the best anyone has done.

The first run was 0%

Our initial harness wasn't capturing executed SQL correctly. The judge looked at the results, saw nothing, and marked every task failed. We spent a day fixing the plumbing. Ran five tasks again: five for five. That felt suspicious, so we ran 25 more. The real number started showing up around 50 tasks: somewhere in the mid-50s on the lighter SQLite version of the benchmark.

We switched to the full thing. 600 tasks, PostgreSQL, official grading. The run took a few hours and cost $440.

451 out of 600

75.2%. We checked it twice.

[Results card: BIRD-Interact Full (600 tasks), a-Interact success rate. Datost: 75.2% (451 / 600 correct). Claude Opus 4.6, #1 on the public leaderboard: 33.0%. That is a 2.3x gap with the same model underneath; the gap is product architecture.]

The model inside Datost is Claude Opus 4.6, the same model that gets 33% by itself. The entire difference is the system around it.

Zero errors across 600 tasks. Two timeouts. 73 cents per question.

What these questions actually look like

Here's one that Datost got right:

"Which of the underperforming assets are also high-cost? How many?"

"Underperforming" means optpot = 'High' AND alrtstate = 'Critical' on the latest snapshot. "High-cost" means (maintcost + cleancost) / 1000 > 10. None of that is in the question. The definitions live in the benchmark's knowledge base, separate from the schema.

Datost searched its knowledge base, found both definitions, wrote a CTE to find 66 underperforming plants, filtered for OPEX above 10, returned 19. Correct. Then it got the follow-up question right too.
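
The shape of that query, reconstructed in spirit rather than copied from the run: the table names and the latest-snapshot logic are guesses, but the two predicates come straight from the definitions above.

```sql
WITH latest AS (
    -- one row per plant: its most recent snapshot (table and time column are guesses)
    SELECT DISTINCT ON (plant_id) plant_id, optpot, alrtstate
    FROM status_snapshots
    ORDER BY plant_id, snapshot_time DESC
),
underperforming AS (
    SELECT plant_id
    FROM latest
    WHERE optpot = 'High' AND alrtstate = 'Critical'
)
SELECT COUNT(*) AS high_cost_underperforming
FROM underperforming u
JOIN plant_costs c USING (plant_id)
WHERE (c.maintcost + c.cleancost) / 1000.0 > 10;   -- "high-cost" per the knowledge base
```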

Here's another:

"What is the average number of residents per household in the region with the lowest comfort index?"

"Comfort index" is Expend_Coeff buried inside a JSON column called socioeconomic. Datost wrote a subquery to find the region with the lowest average (Plano Piloto, 39.43), then computed the resident average for that region: 1.6. Correct.

On the earlier SQLite run, Datost scored 33% on the households family. On the full PostgreSQL run: 93%.

And here's one it got wrong:

"Show me the liquidation risk for order OR6015391."

Datost found the order. Calculated the distance to liquidation price, $955, 3.53% away. Gave a detailed risk assessment. Reasonable SQL. But it joined on margin_risk_profile instead of risk_registry, and the numbers came out slightly different. Wrong.
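
Schematically, the failure is a single wrong table in an otherwise sound query. The column names below are invented; only the two table names come from the text.

```sql
SELECT
    o.order_id,
    r.liquidation_price - o.current_price                       AS distance_to_liquidation,
    (r.liquidation_price - o.current_price) / o.current_price   AS distance_pct
FROM orders o
JOIN risk_registry r ON r.order_id = o.order_id   -- Datost joined margin_risk_profile here instead
WHERE o.order_id = 'OR6015391';
```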

That's what most of the failures look like. The system understood the question and did real analytical work. Then it got a specific detail wrong, the kind of mistake a human analyst makes too.

The weird thing about the failures

We had 147 failures. When we broke them down, we expected them to spread evenly across task types. They didn't.

BIRD-Interact has two kinds of tasks. Analytical questions where you write a SELECT query ("Business Intelligence"). And tasks where you create tables, write functions, or mutate data ("Data Management").

[Results card: analytical queries ("BI") 91% (414 / 455), the thing people use Datost for; schema mutations ("DM") 25% (36 / 144): CREATE TABLE, functions, UPDATEs. As a share of all tasks: BI correct 69%, BI wrong 7%, DM correct 6%, DM wrong 18%.]

81 of the 147 failures are DM tasks. The benchmark asked Datost to write PL/pgSQL functions, build materialized views, create compliance tables. It couldn't do any of that well. Which makes sense if you think about it: Datost's whole system (knowledge base lookup, clarification, peer review) is built for answering analytical questions. That machinery doesn't help you design a table schema.
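
For contrast, a DM task asks for something like this. The example is invented, in the benchmark's spirit, not an actual task.

```sql
-- Invented example of a Data Management task, not taken from BIRD-Interact.
CREATE MATERIALIZED VIEW monthly_plant_output AS
SELECT
    plant_id,
    date_trunc('month', reading_time) AS month,
    SUM(energy_output_kwh)            AS total_output_kwh
FROM energy_readings
GROUP BY plant_id, date_trunc('month', reading_time);
```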

The BIRD-Interact paper actually says DM tasks are supposed to be the easy ones because they "follow standardized, predictable patterns." Datost inverts that. It handles the hard analytical work and drops the easy structured work, because answering analytical questions is what we built it to do.

On the analytical tasks, the ones that match what people actually use Datost for, it's right 91% of the time.

[Chart: the 147 failures broken down by category.]

How it works, briefly

The BIRD-Interact team did this experiment where they took GPT-5 and gave it clarification histories from models that communicated better. GPT-5's performance jumped. Their conclusion: "a more effective communication schema is required." Models write SQL fine. They're bad at figuring out which SQL to write.

Datost is basically a communication system wrapped around a model. Five things happen:

Starting from an ambiguous question ("Show me underperforming assets"):

1. Knowledge lookup: search org memory for definitions.
2. Clarification: ask the user if the definition is ambiguous.
3. Iterative SQL: explore the schema, probe, verify, write.
4. Peer review: a second model checks the answer.
5. Submit: the final SQL is graded, right or wrong.

In any real company, business terms have specific meanings. "Wealthy customer" has a formula. Those definitions live somewhere, usually in docs or in someone's head. Datost keeps them in a searchable knowledge base. When the model hits an unfamiliar term, it can look it up the same way a new analyst would ask a coworker, instead of guessing from a column name.

When the definition isn't in the knowledge base either, Datost can ask a clarification question. "How should I define wealthy, by net worth or by asset-to-liability ratio?" Most models skip that step. They pick an interpretation and commit, even when it's wrong.

After the analyst produces an answer, a second Claude instance reviews it with read-only access. It can see the SQL and the results but can't change anything. If the definition says one thing and the SQL does another, the reviewer catches it.

Datost doesn't write SQL in one shot. It runs in a sandbox where it can explore the schema, run test queries, check column types, verify joins, then commit. The BIRD-Interact team found that most models actually get worse with more interaction turns. Datost gets better.
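
The loop looks like what a careful analyst does by hand: inspect, probe cheaply, then commit. A schematic of the probing, not a literal Datost trace:

```sql
-- 1. Inspect the schema and column types before trusting any column name.
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;

-- 2. Probe a candidate column cheaply to see what the values actually look like.
SELECT elec_perf_snapshot -> 'power' ->> 'power_now_w' AS sample_value
FROM perf_snapshots   -- invented table name, as in the earlier sketch
LIMIT 5;

-- 3. Only after that, write the real query, verify it, and submit.
```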

When a query returns 50,000 rows, raw models choke on context limits. Datost writes the results to a file and processes them with code. The context window stays focused on reasoning.

Every family

[Chart: pass rate by database family.]

Eight families above 80%. Only two below 60%. The households family went from 33% on the earlier run to 93%; fake_account went from 10% to 90%. The families that do worst have the most DM tasks.

What we actually tested

We ran BIRD-Interact Full using Datost's production tool surface, the same tools real users get. Metric definitions from the benchmark's knowledge base were loaded the same way a customer loads business definitions into Datost. The model had to retrieve them. It was not handed answers, ground-truth SQL, or test cases. Grading used the official BIRD-Interact evaluation code. Claude Opus 4.6 is the reference point because Datost runs on top of the same frontier model. The difference is not the model. The difference is the system around it.

What we took away from this

We went in thinking the hard part would be SQL generation. It wasn't. The hard part is figuring out what the question means. The BIRD-Interact team found the same thing independently, which was reassuring, because that's basically our product thesis.

91% on analytical questions is a good number on this benchmark. We're going to keep running it. The 25% on DM tasks is the clearest signal we have for where to build next.