We watched an agent work, then trained a smarter and cheaper successor.

Our pipeline decomposes the work into eval slices, reads failed trajectories, and promotes the cheaper model only when it beats the baseline.
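The promotion gate at the end of that pipeline can be sketched in a few lines. This is a minimal reading of "beats the baseline" (a plain mean comparison; the function name, the example scores, and the absence of a margin or significance test are assumptions, not the pipeline's actual code):

```python
from statistics import mean

def promote(candidate_scores, baseline_scores):
    """Promote the cheaper model only when its mean partial credit
    beats the incumbent baseline's mean on the same eval slice."""
    return mean(candidate_scores) > mean(baseline_scores)

# Hypothetical per-rollout scores shaped like the slices below.
print(promote([0.30, 0.35, 0.25], [0.15, 0.17, 0.16]))  # → True
```

A real gate would likely also require the win to hold across replicates, given the rollout variance reported below.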

Frontier
Each chart shows where the cheap model started, where Sonnet is, and how far our tuned model climbed.
reasoning-heavy API tasks
thinking tasks · sales API · 4 tasks · Qwen+GEPA v3 · n=10
[bar chart · partial credit, axis 0.00–1.00]
Qwen 3.6+open baseline: 0.084 ± 0.073 (n=3) · $0.095
Sonnet 4.6 frontier ceiling: 0.160 ± 0.009 (n=3) · $0.707
Qwen 3.6+open + GEPA: 0.313 ± 0.110 (n=10) · $0.128
GEPA: +273% over baseline, clears the Sonnet 4.6 ceiling
write-heavy CRM tasks
action tasks · limited_zapier · 7 tasks · targeted v5 adapter · n=10
[bar chart · partial credit, axis 0.00–1.00]
Qwen 3.6+open baseline: 0.400 ± 0.067 (n=3) · $0.230
Sonnet 4.6 frontier ceiling: 0.557 ± 0.066 (n=3) · $1.12
Qwen 3.6+open + GEPA: 0.630 ± 0.087 (n=10) · $0.275
GEPA: +57% over baseline, clears the Sonnet 4.6 ceiling
Claim

On thinking tasks, tuned Qwen scores 0.313 ± 0.110 (n=10) versus Sonnet's 0.160 ± 0.009, at 18% of the cost. On action tasks, tuned Qwen scores 0.630 ± 0.087 (n=10) versus Sonnet's 0.557 ± 0.066, at 25% of the cost.
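The cost fractions fall straight out of the per-slice costs reported in the tables:

```python
# Tuned Qwen's slice cost as a fraction of Sonnet's, per slice.
thinking_ratio = 0.128 / 0.707  # thinking tasks
action_ratio = 0.275 / 1.12     # action tasks
print(f"{thinking_ratio:.0%} / {action_ratio:.0%}")  # → 18% / 25%
```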

The win came from reading failed runs, naming the missing behavior, and writing small targeted adapters. A bigger blind GEPA budget did not win: the 50-call autoresearch prompt landed at 0.252 ± 0.144.

Ladder

Same model family, same max tool-call budget, different prompt revisions. This is the optimization ladder: raw Qwen, broad GEPA, completion criteria, then task-local adapters for CRM actions.

optimization ladder
sales · max_steps=10 · prompt changes over fixed tool-call budget
[bar chart · partial credit, axis 0.00–1.00]
thinking tasks: 0.084 raw (n=3 · 10 calls · $0.095) → 0.204 v1 GEPA (n=10 · 10 calls · $0.100) → 0.182 v1 + completion (n=10 · 10 calls · $0.140) → 0.313 v3 (n=10 · 10 calls · $0.128)
action tasks: 0.400 raw (n=3 · 10 calls · $0.230) → 0.277 v3 transfer (n=3 · 10 calls · $0.248) → 0.468 v4 proof rules (n=10 · 10 calls · $0.311) → 0.630 v5 pricing rules (n=10 · 10 calls · $0.275)
Thinking

Four reasoning-heavy API tasks from AutomationBench's sales surface: negative selection, priority selection, implicit rules, and cross-reference validation. This is the GEPA training surface: Qwen starts below Sonnet, then the v3 adapter clears Sonnet on the mean while still carrying visible rollout variance.

| run | avg | strict | cost | $/point |
| --- | --- | --- | --- | --- |
| Sonnet 4.6 frontier ceiling · api · mean of 3 | 0.160 ± 0.009 | 0% ± 0% | $0.707 | 1.0× |
| Qwen 3.6 Plus open baseline · api · mean of 3 ¹ | 0.084 ± 0.073 | 0% ± 0% | $0.095 | 3.9× |
| Qwen 3.6 Plus + GEPA (default temp) · api · mean of 3 ² | 0.157 ± 0.124 | 8% ± 14% | $0.102 | 6.8× |
| Qwen 3.6 Plus + GEPA v3 @ temp=0 · api · mean of 10 ³ | 0.313 ± 0.110 | 23% | $0.128 | 10.8× |

¹ One of three replicates scored 0.000; Qwen's raw tool use has moderate rollout variance.
² Default Fireworks temperature. High rollout variance (σ = 0.124): one rep hit 0.300 with 1-of-4 strict; the other two scored 0.086.
³ v2 structure plus an explicit completion criterion. Confirmed at n=10 after earlier n=5 variance claims failed to replicate.

Reasoning-heavy API GEPA row is n=10; ± shows standard deviation. $/point = quality-per-dollar normalized to Sonnet.
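The $/point column can be reproduced from the avg and cost columns (the function name is ours; defaults are the thinking-slice Sonnet numbers):

```python
def quality_per_dollar_vs_sonnet(avg, cost, sonnet_avg=0.160, sonnet_cost=0.707):
    """$/point: partial credit per dollar, normalized so Sonnet = 1.0x
    on the same slice."""
    return (avg / cost) / (sonnet_avg / sonnet_cost)

print(round(quality_per_dollar_vs_sonnet(0.084, 0.095), 1))  # baseline → 3.9
print(round(quality_per_dollar_vs_sonnet(0.157, 0.102), 1))  # GEPA default → 6.8
print(round(quality_per_dollar_vs_sonnet(0.313, 0.128), 1))  # GEPA v3 → 10.8
```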
Action

Seven write-heavy CRM tasks where both models actually have the right tools to finish work. Generic adapters did not transfer cleanly. The v5 result came from task-local hill climbing: proof artifacts first, then latest-pricing and account-health rules for opportunity creation.

| run | avg | strict | cost | $/point |
| --- | --- | --- | --- | --- |
| Sonnet 4.6 frontier ceiling · limited zapier · mean of 3 | 0.557 ± 0.066 | 19% ± 8% | $1.12 | 1.0× |
| Qwen 3.6 Plus open baseline · limited zapier · mean of 3 ¹ | 0.400 ± 0.067 | 14% ± 14% | $0.230 | 3.5× |
| Qwen 3.6 Plus + hand adapter · limited zapier · mean of 3 ² | 0.369 ± 0.082 | 14% ± 14% | $0.232 | 3.2× |
| Qwen 3.6 Plus + targeted v5 adapter · limited zapier · mean of 10 ³ | 0.630 ± 0.087 | 54% | $0.275 | 4.6× |

¹ Seven write-heavy tasks: update contact phone, add to campaign, create note, create contact, create opportunity, advance opportunity stage, qualify lead.
² The sales/API adapter tuned for reasoning-heavy API tasks does not transfer to write-heavy CRM tasks; adapter quality is slice-specific.
³ Failure-mode hill climb: v4 fixed proof artifacts; v5 added latest-pricing/account-health rules.

Write-heavy CRM v5 row is n=10; ± shows standard deviation. v5 is 4.6× Sonnet's quality-per-dollar on this slice.
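The task-local hill climb described above is, in spirit, greedy rule selection: append a candidate rule to the adapter and keep it only if the slice score improves. A toy sketch (the rule strings and the toy_eval scorer are placeholders; the real evaluate step is a full benchmark run):

```python
def hill_climb(base_prompt, candidate_rules, evaluate):
    """Greedy adapter tuning: keep each appended rule only if the
    slice score (mean partial credit) improves."""
    best_prompt, best_score = base_prompt, evaluate(base_prompt)
    for rule in candidate_rules:
        trial = best_prompt + "\n" + rule
        trial_score = evaluate(trial)
        if trial_score > best_score:
            best_prompt, best_score = trial, trial_score
    return best_prompt, best_score

# Toy scorer: rewards prompts that mention the two v4/v5 fixes.
def toy_eval(prompt):
    return (0.4
            + 0.07 * ("proof artifact" in prompt)
            + 0.16 * ("latest pricing" in prompt))

prompt, score = hill_climb(
    "base adapter",
    ["Always attach proof artifacts.",
     "Use latest pricing and account health."],
    toy_eval,
)
print(round(score, 3))  # → 0.63
```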
Method

Benchmark: AutomationBench, a sales-domain slice from Zapier's workflow benchmark, introduced with Prime Intellect. Models: Claude Sonnet 4.6 (Anthropic), Qwen 3.6 Plus on Fireworks. GEPA: the standalone gepa package (what DSPy wraps), with Claude Opus 4.7 as the reflection LM. All runs use max_steps=10, max_tokens=4096. Costs are per one full slice. The GEPA optimizer ran with max_metric_calls=20 and reflection_minibatch_size=2; reasoning-heavy API v3 was replicated to n=10. Write-heavy CRM v5 was a manual failure-mode hill climb over the action-task surface and was also replicated to n=10.