Private beta · shipping May 2026

Agents that evolve.
Prompts that compound.

Shipping an AI agent today feels like shipping code without tests. NanoEval is the missing loop: define what good looks like once, and your prompts improve themselves — scored, refined, re-deployed.

Closed beta
Launch · May 2026
What it is
An evaluation and optimization loop for LLM prompts and agents.
What it does
Scores every output. Rewrites the prompt. Ships the winner. Blocks regressions.
Who it’s for
AI engineers who treat prompts as production code.
04 — The fix

Treat prompts like code. Test them like code.

Your eval suite is the spec — fixtures, judges, rubrics, all composable. Runs on every commit. Fails the PR when quality drops.
tests/extract_invoice.py · fixtures.yaml
nanoeval · beta
import nanoeval
from nanoeval import eval, judge, optimize

@eval(models=["gpt-4o", "claude-sonnet-4", "gemini-2.5"])
def test_invoice_totals(fixture: Invoice):
    result = extract_invoice(fixture.pdf)
    assert result.total == fixture.expected_total
    assert judge.matches_schema(result, Invoice)
    assert judge.tone(result.note, "professional") > 0.8

# compile the prompt against the suite
@optimize(target="f1", budget=120, algo="mipro")
def extract_invoice(pdf: bytes) -> Invoice:
    return nanoeval.predict(
        signature="pdf -> invoice: Invoice",
        trace=True,
    )

Last run — 0.8s ago

✓ totals match on 248/250 · 99.2%
✓ schema valid · 100%
! tone drifted on 3 runs · 0.76
✓ cost within budget · $0.018
After one optimizer pass
F1 went from 0.81 to 0.94 with 38% fewer tokens.
+16%
baseline 0.81 · 4.2k tokens · $0.018
05 — The loop

Four moves. Forever.

Describe → evaluate → optimize → ship. Then the loop runs again — on a cron, a PR, or a drift alarm.
01

Describe

Write a signature — inputs, outputs, types. NanoEval scaffolds a prompt from it.

signature:
  pdf → Invoice
02

Evaluate

Drop in fixtures. Compose judges — ground-truth, schema, rubric, LLM-as-judge.

@eval(models=["*"])
def test_totals(...)
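Composing judges can be sketched in plain Python. This is a generic illustration, not the NanoEval API: `Judge` and `run_judges` are hypothetical names, and each judge is just a callable that maps an output to a score in [0, 1] with a pass threshold.

```python
# Generic sketch of composable judges. Judge and run_judges are
# illustrative names, not the NanoEval API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Judge:
    name: str
    score: Callable[[dict], float]   # output -> score in [0, 1]
    threshold: float = 1.0

def run_judges(output, judges):
    """Return {judge name: (score, passed)} for one model output."""
    results = {}
    for j in judges:
        s = j.score(output)
        results[j.name] = (s, s >= j.threshold)
    return results

judges = [
    # ground-truth judge: exact match against the fixture
    Judge("totals_match", lambda o: float(o["total"] == o["expected_total"])),
    # schema judge: required fields are present
    Judge("schema_valid", lambda o: float("total" in o and "note" in o)),
    # an LLM-as-judge would call a model here; stubbed with a length check
    Judge("tone", lambda o: min(1.0, len(o["note"]) / 20), threshold=0.8),
]
```

Because judges are plain values, mixing ground-truth checks, schema checks, and LLM-as-judge scorers is just list concatenation.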
03

Optimize

The compiler explores rewrites, few-shots and routes. Returns the Pareto frontier.

@optimize(algo="mipro",
  budget=120)
04

Ship

Pin the winner. CI blocks regressions. Drift alerts when the world moves.

$ nanoeval deploy v17
✓ pinned · gated · live
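The four moves above compress into one driver loop. A minimal sketch, with the caveat that `evaluate`, `optimize`, and `ship` here are injected stand-ins for illustration, not NanoEval functions:

```python
# Hypothetical driver for the describe -> evaluate -> optimize -> ship loop.
# evaluate, optimize, and ship are caller-supplied stand-ins, not NanoEval APIs.
def improvement_loop(prompt, evaluate, optimize, ship, target, max_rounds=5):
    """Evaluate the prompt, optimize until the score clears the target
    (or the round budget runs out), then ship the winner."""
    score = evaluate(prompt)
    for _ in range(max_rounds):
        if score >= target:
            break
        prompt, score = optimize(prompt)
    ship(prompt)
    return prompt, score
```

Wired to a cron, a PR hook, or a drift alarm, the same function is the whole loop: the trigger changes, the body does not.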
06 — What’s inside

Every tool an engineer already trusts — rebuilt for prompts.

Four primitives. Bring your own models, your own judges, your own eval set. NanoEval binds them into one compiler.
01 — Eval Runner

Parallel runs, cached and resumable.

Fan out across models and rows with a worker pool. Checkpoint on every result — a dropped connection or a kill signal doesn’t cost you the run.

v12 · 0.81
v13 · 0.84
v14 · 0.78
v15 · 0.88
v16 · 0.91
v17 · 0.94
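The checkpoint-every-result pattern is straightforward to sketch in standard-library Python. Everything here (`run_suite`, `load_done`, the JSONL checkpoint path) is illustrative, not the NanoEval implementation:

```python
# Sketch of a resumable parallel eval runner: fan out model x case pairs
# on a thread pool and append each result to a JSONL checkpoint, so a
# killed run resumes where it stopped. Names are illustrative.
import json
import os
from concurrent.futures import ThreadPoolExecutor

def load_done(path):
    """Return the set of (model, case_id) pairs already scored."""
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                done.add((rec["model"], rec["case_id"]))
    return done

def run_suite(models, cases, score_fn, path="eval_results.jsonl"):
    """Run score_fn over every model x case not yet in the checkpoint."""
    done = load_done(path)
    todo = [(m, c) for m in models for c in cases if (m, c["id"]) not in done]
    with open(path, "a") as out, ThreadPoolExecutor(max_workers=8) as pool:
        for model, case, score in pool.map(
            lambda mc: (*mc, score_fn(mc[0], mc[1])), todo
        ):
            # one line per result: a crash only loses in-flight work
            out.write(json.dumps(
                {"model": model, "case_id": case["id"], "score": score}) + "\n")
    return load_done(path)
```

Calling `run_suite` a second time with the same checkpoint is a no-op for finished pairs, which is exactly the resumability property the card describes.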
02 — Optimizer

An optimizer that rewrites the prompt for you.

DSPy-grade algorithms — MIPROv2, BootstrapFewShot, COPRO — built in. Converges to your eval target and shows every mutation it tried.

signature: "pdf -> invoice"
- You are a helpful assistant. Extract the invoice.
+ Extract invoice fields. Reason step-by-step about totals
+ before returning. If a subtotal and tax are present, verify
+ total = subtotal + tax within ±$0.01.
demos: 4 shots (auto-selected from eval set)
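The arithmetic rule the optimized prompt asks the model to follow is worth stating as plain code, since it is also what a ground-truth judge would check. A minimal sketch (the function name is illustrative):

```python
# The consistency check from the optimized prompt, as code: accept a
# total only when it matches subtotal + tax within one cent.
def total_consistent(subtotal: float, tax: float, total: float,
                     tol: float = 0.01) -> bool:
    return abs((subtotal + tax) - total) <= tol
```

Encoding the rule twice — once in the prompt, once in a judge — is the point: the prompt steers the model, the judge catches it when steering fails.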
03 — CI Gate (planned · May 2026)

Block the PR when quality regresses.

One YAML, every provider. Posts the diff on the PR. Snapshots trace IDs so any failure is replayable. Shipping with public launch.

git push
fixtures · 250
eval · 14.2s
merge
04 — Model Court

Every model, head-to-head, on your data.

One dashboard. Quality vs. latency vs. $/1k. Pick the frontier, pin the winner, ship with a single config change.

Model · F1 · $/1k
Opus 4.7 · 0.96 · $0.075
GPT-5.4 · 0.94 · $0.015
Gemini 3.1 Pro · 0.93 · $0.012
Llama 4 · 0.88 · $0.003
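"Pick the frontier" has a precise meaning: keep a model only if no other model is both at least as accurate and at least as cheap. A toy pass over the figures listed above (generic code, not the NanoEval dashboard):

```python
# Toy Pareto-frontier filter over (quality, cost-per-1k) points.
# A model is dominated when another is >= on quality, <= on cost,
# and strictly better on at least one of the two.
def pareto_frontier(models):
    frontier = []
    for name, f1, cost in models:
        dominated = any(
            of1 >= f1 and ocost <= cost and (of1 > f1 or ocost < cost)
            for _, of1, ocost in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [
    ("Opus 4.7",       0.96, 0.075),
    ("GPT-5.4",        0.94, 0.015),
    ("Gemini 3.1 Pro", 0.93, 0.012),
    ("Llama 4",        0.88, 0.003),
]
```

On this data all four models survive the filter — each trades quality against cost — so the remaining choice is a budget call, which is what the dashboard is for.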
07 — The control plane

You don’t hope the agent got better. You see it.

Live scoreboards. Run diffs. Judge breakdowns. Cost curves. Every mutation the optimizer tried — and why.
🔒 app.nanoeval.dev/workspaces/acme/pipelines/invoice-extract
pipelines / invoice-extract / production

invoice-extract @v17

healthy
3 models
1 drift
F1 · 0.94 · +0.13 ↑ since v12
p50 latency · −92ms ↓
Cost / 1k · −38%
Judge agreement · −1.2%

Recent runs last 24h · 12 runs

Run · Trigger · Δ F1 · Result · Latency
#4f2a91 · main · optimize v17 · +0.03 · OPTIMIZED · 14.2s
#4f2a8e · PR #218 · harden schema · +0.01 · PASS · 12.0s
#4f2a7c · cron · drift check · −0.02 · DRIFT · 9.8s
#4f2a70 · main · nightly · +0.00 · PASS · 13.1s
#4f2a6a · PR #217 · add currency · +0.02 · PASS · 15.4s

Judge breakdown v17 · 250 fixtures

Totals match · 99.2 (+18)
Schema valid · 100 (+8)
Line items · 96.0 (+24)
Tone (prof.) · 82.0 (−6)
Currency norm · 94.0 (+30)
Date parse · 98.0 (+12)

Quality vs. cost (30d) · frontier optimization

v17 · 0.94 · 30d ago → today
08 — Why it matters

Agents write our code. Soon they’ll run our companies.
They had better evolve on purpose.

Every agent in production is one prompt away from being wrong. The answer isn’t bigger models; it’s a tighter loop. Measure. Refine. Redeploy. Measure again. NanoEval is building the scaffolding for agents that improve themselves, backed by metrics you chose. Evals today. Self-improving systems tomorrow.

— the founders, NanoEval
09 — Ecosystem

Works with the models and tools you already ship with.

OpenAI
Anthropic
Gemini
Mistral
Together
Groq
LangChain
LlamaIndex
DeepSeek
xAI

Stop shipping agents on vibes.

Closed beta now. Public launch May 2026. Waitlist members lock in launch pricing and get a direct line to the team.