Agents that evolve.
Prompts that compound.
Shipping an AI agent today feels like shipping code without tests. NanoEval is the missing loop: define what good looks like once, and your prompts improve themselves — scored, refined, redeployed.
Treat prompts like code. Test them like code.
```python
# test_invoice.py
from nanoeval import eval, judge
from prompts import extract_invoice
from models import Invoice  # fixture schema

@eval(models=["gpt-4o", "claude-sonnet-4", "gemini-2.5"])
def test_invoice_totals(fixture: Invoice):
    result = extract_invoice(fixture.pdf)
    assert result.total == fixture.expected_total
    assert judge.matches_schema(result, Invoice)
    assert judge.tone(result.note, "professional") > 0.8

# prompts.py — compile the prompt against the suite
import nanoeval
from nanoeval import optimize

@optimize(target="f1", budget=120, algo="mipro")
def extract_invoice(pdf: bytes) -> Invoice:
    return nanoeval.predict(
        signature="pdf -> invoice: Invoice",
        trace=True,
    )
```
Last run — 0.8s ago
Four moves. Forever.
Describe
Write a signature — inputs, outputs, types. NanoEval scaffolds a prompt from it.
signature: pdf → Invoice
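A signature is just a typed spec string: inputs on the left, a named, typed output on the right. As a conceptual sketch in plain Python — `parse_signature` and `Signature` are illustrative here, not NanoEval's API — parsing one might look like:

```python
from dataclasses import dataclass

@dataclass
class Signature:
    inputs: list[str]
    output_name: str
    output_type: str

def parse_signature(spec: str) -> Signature:
    """Split an 'inputs -> name: Type' spec into its parts."""
    lhs, rhs = (side.strip() for side in spec.split("->"))
    name, type_ = (part.strip() for part in rhs.split(":"))
    return Signature(
        inputs=[field.strip() for field in lhs.split(",")],
        output_name=name,
        output_type=type_,
    )

sig = parse_signature("pdf -> invoice: Invoice")
# sig.inputs == ["pdf"], sig.output_name == "invoice", sig.output_type == "Invoice"
```

Everything downstream — the scaffolded prompt, the schema judge, the optimizer's search space — hangs off this one declaration.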
Evaluate
Drop in fixtures. Compose judges — ground-truth, schema, rubric, LLM-as-judge.
@eval(models=[*]) def test_totals(...)
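Judges compose like ordinary functions. A minimal plain-Python sketch of the idea — the judge names and the min-combination rule are illustrative, not NanoEval's implementation:

```python
from typing import Callable

Judge = Callable[[dict], float]  # each judge returns a score in [0, 1]

def compose(*judges: Judge) -> Judge:
    """Overall score is the weakest judge's score: every check must hold."""
    return lambda result: min(j(result) for j in judges)

# Hypothetical judges for the invoice example
exact_total = lambda r: 1.0 if r["total"] == r["expected_total"] else 0.0
has_note = lambda r: 1.0 if r.get("note") else 0.0

score = compose(exact_total, has_note)
print(score({"total": 120, "expected_total": 120, "note": "Paid in full."}))  # 1.0
```

Ground-truth, schema, rubric, and LLM-as-judge checks all fit the same shape, so they stack freely.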
Optimize
The compiler explores rewrites, few-shots and routes. Returns the Pareto frontier.
@optimize(algo="mipro", budget=120)
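The Pareto frontier itself is simple to state: keep every candidate that no other candidate beats on both quality and cost. A self-contained sketch, with illustrative field names:

```python
def pareto_frontier(candidates):
    """Keep candidates not dominated by any other.

    b dominates a if b has >= quality and <= cost,
    with at least one strict inequality.
    """
    def dominated(a, b):
        return (b["f1"] >= a["f1"] and b["cost"] <= a["cost"]
                and (b["f1"] > a["f1"] or b["cost"] < a["cost"]))
    return [a for a in candidates
            if not any(dominated(a, b) for b in candidates if b is not a)]

runs = [
    {"id": "v15", "f1": 0.81, "cost": 0.9},
    {"id": "v16", "f1": 0.84, "cost": 1.4},
    {"id": "v17", "f1": 0.86, "cost": 1.2},  # strictly better than v16
]
frontier = pareto_frontier(runs)  # v16 drops out; v15 and v17 remain
```

Returning the frontier rather than one "winner" is the point: a cheap-and-good prompt and an expensive-and-great one are both valid answers, and the choice belongs to you.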
Ship
Pin the winner. CI blocks regressions. Drift alerts when the world moves.
$ nanoeval deploy v17 ✓ pinned · gated · live
Every tool an engineer already trusts — rebuilt for prompts.
Parallel runs, cached and resumable.
Fan out across models and rows with a worker pool. Checkpoint on every result — a dropped connection or a kill signal doesn’t cost you the run.
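Checkpoint-on-every-result is straightforward to sketch with a worker pool and an append-only JSONL file. The function below is illustrative plain Python, not NanoEval's runner:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_with_checkpoints(tasks, worker, path="checkpoint.jsonl", max_workers=8):
    """Fan tasks out over a pool; append each result to a JSONL checkpoint.

    On restart, tasks whose ids already appear in the file are skipped,
    so a crash mid-run only costs the work that was in flight.
    """
    ckpt = Path(path)
    done = set()
    if ckpt.exists():
        done = {json.loads(line)["id"] for line in ckpt.read_text().splitlines()}
    pending = [t for t in tasks if t["id"] not in done]
    with ckpt.open("a") as f, ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(worker, pending):
            f.write(json.dumps(result) + "\n")
            f.flush()  # persist immediately after each result
    return done | {t["id"] for t in pending}
```

Appending one JSON line per result keeps the checkpoint human-readable and makes resume logic a set difference rather than a database.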
An optimizer that rewrites the prompt for you.
DSPy-grade algorithms — MIPRO-v2, BootstrapFewShot, COPRO — built in. Converges to your eval target and shows every mutation it tried.
Block the PR when quality regresses.
One YAML config, every provider. Posts the score diff on the PR. Snapshots trace IDs so any failure is replayable. Ships with the public launch.
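The gate itself reduces to one comparison: block unless the candidate holds the pinned baseline score within a tolerance. A hedged plain-Python sketch (the function and the numbers are illustrative):

```python
def gate(baseline_f1: float, candidate_f1: float, tolerance: float = 0.005) -> bool:
    """Pass only if the candidate matches or beats the baseline, within tolerance."""
    return candidate_f1 >= baseline_f1 - tolerance

# A CI step would exit non-zero on failure to block the merge:
if not gate(baseline_f1=0.86, candidate_f1=0.83):
    print("BLOCKED: candidate regressed below the pinned baseline")
```

The tolerance absorbs run-to-run noise from nondeterministic model outputs, so a 0.001 wobble doesn't block a PR while a real regression does.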
Every model, head-to-head, on your data.
One dashboard. Quality vs. latency vs. $/1k. Pick the frontier, pin the winner, ship with a single config change.
You don’t hope the agent got better. You see it.
invoice-extract @v17
Recent runs · last 24h · 12 runs
| Run | Trigger | Δ F1 | Result | Latency |
|---|---|---|---|---|
| #4f2a91 | main · optimize v17 | +0.03 | OPTIMIZED | 14.2s |
| #4f2a8e | PR #218 · harden schema | +0.01 | PASS | 12.0s |
| #4f2a7c | cron · drift check | −0.02 | DRIFT | 9.8s |
| #4f2a70 | main · nightly | +0.00 | PASS | 13.1s |
| #4f2a6a | PR #217 · add currency | +0.02 | PASS | 15.4s |
Judge breakdown v17 · 250 fixtures
Quality vs. cost (30d) frontier optimization
Agents write our code. Soon they’ll run our companies.
They had better evolve on purpose.
Every agent in production is one prompt away from being wrong. The answer isn’t bigger models — it’s a tighter loop. Measure. Refine. Redeploy. Measure again. NanoEval is building the scaffolding for agents that improve themselves, backed by metrics you chose. Evals today. Self-improving systems tomorrow.
Works with the models and tools you already ship with.
Stop shipping agents on vibes.
Closed beta now. Public launch May 2026. Waitlist members lock in launch pricing and get a direct line to the team.