End-to-end experimentation for AI

Ship on evidence, not vibes.

Change a prompt, swap a model, or modify a tool call. three.dev replays it against your real requests offline, ranks candidates by probability to be best, then confirms the winner on live traffic with statistical confidence against the metric that matters.

You changed the system. Did it actually get better?

For most teams, the honest answer is still: we don't know.

01

Shipping on vibes

You tweak a prompt, swap a model, push to prod, and hope. If something breaks, you find out from user complaints, not data.

02

Collapse your eval suite into one judge

Eval suites accumulate, drift, and fight each other. One judge with a clear rubric, grading hundreds of real conversations, gives you enough signal to tell which variant wins, and why.

03

Every change is a tradeoff

The question isn’t only ‘did quality improve?’ It’s whether the extra quality is worth the cost and latency, or whether a faster, cheaper option is the better choice for you.

THE PROOF LOOP

Offline ranks. Live confirms.

One offline experiment ranks the candidates, then a live experiment confirms the winner on the metrics that matter to your business.

01 · OFFLINE / CREATE

Configure the experiment.

Replay the latest 500 production requests against each variant.

Control: GPT-5.2 · prompt v3
Variant A: GPT-5.5 · prompt v3
Variant B: GPT-5.5 · prompt v4 · clarifies error context

02 · OFFLINE / RESULTS

Rank variants by probability to be best.

Two candidates beat control. Variant B leads at 58% probability to be best.

Variant                          Status     P(best)  P(beat ctrl)
Variant B (GPT-5.5 · prompt v4)  Winner     58%      96%
Variant A (GPT-5.5 · prompt v3)  Promising  38%      82%
Control (GPT-5.2 · prompt v3)    Baseline   4%       ·
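One way a "probability to be best" ranking could be computed is sketched below. This is an illustration, not a description of how three.dev does it: it assumes the judge returns pass/fail per request, puts a uniform Beta prior on each variant's pass rate, and estimates P(best) by Monte Carlo sampling. The function name and the counts are hypothetical and are not the figures in the table above.

```python
import random

def probability_to_be_best(results, draws=20_000, seed=0):
    """Estimate P(best) per variant from judge pass/fail counts.

    results maps variant name -> (passes, total). Each pass rate gets a
    Beta(1 + passes, 1 + failures) posterior (uniform prior, an assumption),
    and we count how often each variant has the highest sampled rate.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in results}
    for _ in range(draws):
        samples = {
            name: rng.betavariate(1 + passes, 1 + total - passes)
            for name, (passes, total) in results.items()
        }
        wins[max(samples, key=samples.get)] += 1
    return {name: count / draws for name, count in wins.items()}

# Hypothetical judge outcomes over 500 replayed requests (illustrative only).
counts = {
    "control": (380, 500),
    "variant_a": (400, 500),
    "variant_b": (405, 500),
}
p_best = probability_to_be_best(counts)
```

The values in `p_best` sum to 1 across variants, which is why a leader can sit well below 100% when two candidates are close to each other.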
03 · PROMOTE TO LIVE

Promote winners to a live experiment.

OFFLINE
Variant B (GPT-5.5 · prompt v4): 96% P(beat ctrl)
Variant A (GPT-5.5 · prompt v3): 82% P(beat ctrl)

LIVE
PR merge rate (business metric)

04 · LIVE / RESULTS

Statistical confidence on production business metrics.

Recommendation: Ship Variant B (GPT-5.5 · prompt v4 · clarifies error context)
Quality: +7.2% PR merge rate vs control [+3.2%, +11.0%], +498 PRs merged / month
Latency: +147 ms p50 (+12% vs control)
Cost: +$424 / month (+8% vs control)
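For readers who want to see where an interval like the one on the quality line can come from, here is a minimal sketch. It assumes the live metric is a simple rate (for example, PRs merged out of PRs opened) and uses a normal-approximation interval on the absolute difference between variant and control. The function name and counts are hypothetical, and the method three.dev actually uses may differ.

```python
import math

def lift_interval(ctrl_success, ctrl_total, var_success, var_total, z=1.96):
    """Absolute rate difference (variant - control) with a normal-approximation
    interval. z=1.96 corresponds to a roughly 95% interval."""
    p_c = ctrl_success / ctrl_total
    p_v = var_success / var_total
    diff = p_v - p_c
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_c * (1 - p_c) / ctrl_total + p_v * (1 - p_v) / var_total)
    return diff, diff - z * se, diff + z * se

# Hypothetical live counts (illustrative only).
diff, low, high = lift_interval(600, 1000, 660, 1000)
```

If the whole interval sits above zero, as it does for these made-up counts, the improvement is unlikely to be noise at that confidence level.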

Built for how AI actually works. Not how you wish it did.

OFFLINE REPLAY

Test against historical data before touching production

Replay captured requests through every variant. Use real inputs from production instead of handpicked examples or synthetic benchmarks.
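The replay idea can be sketched in a few lines. This is an assumption-laden illustration rather than three.dev's API: `replay`, the stand-in variants, and the toy judge are all hypothetical names, and a real setup would call models and a rubric-based judge instead of lambdas.

```python
from typing import Callable, Dict, List

Variant = Callable[[str], str]      # request text -> model output
Judge = Callable[[str, str], bool]  # (request, output) -> pass/fail

def replay(
    requests: List[str],
    variants: Dict[str, Variant],
    judge: Judge,
) -> Dict[str, List[bool]]:
    """Run every captured request through every variant and grade each output."""
    results: Dict[str, List[bool]] = {name: [] for name in variants}
    for request in requests:
        for name, run in variants.items():
            results[name].append(judge(request, run(request)))
    return results

# Stand-in variants and judge so the sketch runs without any model calls.
captured = ["fix null check", "rename variable"]
variants = {
    "control": lambda r: r.upper(),
    "variant_a": lambda r: f"{r} (explained)",
}
judge = lambda request, output: "explained" in output
grades = replay(captured, variants, judge)
```

Because every variant sees the same captured requests, differences in the grades come from the variants and not from differences in the inputs.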

EXPERIMENTS

Run any change as an experiment

Model swaps, prompt rewrites, retrieval changes, tool settings, or full pipeline variants. If it affects what your AI feature returns, you can compare it — offline first, then live.

ATTRIBUTION

Keep each conversation on one configuration

Users get a consistent experience from first message to last. You get cleaner attribution than request-by-request sampling.
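A common way to keep a conversation on one configuration is deterministic hashing of the conversation ID, sketched below. This is an assumed approach for illustration, not a confirmed description of three.dev's implementation; `assign_variant` and the experiment key are hypothetical.

```python
import hashlib

def assign_variant(conversation_id, variants, experiment="exp-1"):
    """Deterministically map a conversation to one variant.

    The same (experiment, conversation_id) pair always hashes to the same
    bucket, so every message in a conversation sees the same configuration.
    """
    key = f"{experiment}:{conversation_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

arms = ["control", "variant_b"]
first_message = assign_variant("conv-123", arms)
tenth_message = assign_variant("conv-123", arms)
```

Including the experiment key in the hash means a conversation that lands in control for one experiment is not automatically in control for the next.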

TRADEOFFS

Quality, latency, and cost in one view

A variant can improve quality while increasing cost or response time. three.dev shows the full tradeoff so every shipping decision is data-driven.

Your next prompt change deserves better than “LGTM, ship it.”

Replay it against real requests, rank candidates by probability to be best, then confirm the winner on live traffic with statistical confidence on the metric that matters.

Request early access →