We watched an agent work, then tuned a smarter and cheaper successor.
Our pipeline decomposes the work into eval slices, reads failed trajectories, and promotes the cheaper model only when it beats the baseline.
On thinking tasks, tuned Qwen scores 0.313 ± 0.110 (n=10) versus Sonnet's 0.160 ± 0.009 (n=3), at 18% of the cost. On action tasks, tuned Qwen scores 0.630 ± 0.087 (n=10) versus Sonnet's 0.557 ± 0.066 (n=3), at 25% of the cost.
The win came from reading failed runs, naming the missing behavior, and writing small targeted adapters. A bigger blind GEPA budget did not win: the 50-call autoresearch prompt landed at 0.252 ± 0.144.
Same model family, same max tool-call budget, different prompt revisions. This is the optimization ladder: raw Qwen, broad GEPA, completion criteria, then task-local adapters for CRM actions.
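The gate at the end of that ladder is mechanical. Below is a minimal sketch in Python, assuming a `run` callable that stands in for the real harness (the agent loop with `max_steps=10` described in the footer); the names here are hypothetical, not our production pipeline:

```python
import statistics
from typing import Callable

# Stand-in signature for the real harness: (model_or_prompt, task) -> score in [0, 1].
RunFn = Callable[[str, str], float]

def slice_means(run: RunFn, model: str, tasks: list[str], n_reps: int) -> list[float]:
    """Replicate the full slice n_reps times; return one mean score per replicate."""
    return [statistics.mean(run(model, t) for t in tasks) for _ in range(n_reps)]

def promote(run: RunFn, candidate: str, baseline: str, tasks: list[str]) -> bool:
    """Promote the cheaper candidate only when it beats the baseline on the slice."""
    base = statistics.mean(slice_means(run, baseline, tasks, n_reps=3))
    cand = statistics.mean(slice_means(run, candidate, tasks, n_reps=3))
    if cand <= base:
        return False  # loss: go read the failed trajectories and revise the adapter
    # A win at n=3 is noisy; confirm at n=10 before promoting (as in the tables below).
    return statistics.mean(slice_means(run, candidate, tasks, n_reps=10)) > base
```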
Four reasoning-heavy API tasks from AutomationBench's sales surface: negative selection, priority selection, implicit rules, and cross-reference validation. This is the GEPA training surface: Qwen starts below Sonnet, then the v3 adapter clears Sonnet on mean while still carrying visible rollout variance.
| run | avg | strict | cost | $/point vs Sonnet |
|---|---|---|---|---|
| Sonnet 4.6 (frontier ceiling) · api · mean of 3 | 0.160 ±0.009 | 0% ±0% | $0.707 | 1.0× |
| Qwen 3.6 Plus (open baseline) · api · mean of 3 ¹ | 0.084 ±0.073 | 0% ±0% | $0.095 | 3.9× |
| Qwen 3.6 Plus + GEPA (default temp) · api · mean of 3 ² | 0.157 ±0.124 | 8% ±14% | $0.102 | 6.8× |
| Qwen 3.6 Plus + GEPA v3 @ temp=0 ★ · api · mean of 10 ³ | 0.313 ±0.110 | 23% | $0.128 | 10.8× |

"$/point vs Sonnet" is score per dollar relative to the Sonnet baseline: higher means cheaper per point.

¹ One of three replicates scored 0.000; Qwen's raw tool use has moderate rollout variance.
² Default Fireworks temperature. High rollout variance (σ=0.124): one rep scored 0.300 with 1-of-4 strict; the other two scored 0.086.
³ v2 structure plus an explicit completion criterion. Confirmed at n=10 after earlier n=5 variance claims failed to replicate.
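The last column is derived, not measured. A quick check of the arithmetic against the table's own numbers (Sonnet pays $0.707 for 0.160 points, about $4.42 per point; GEPA v3 pays $0.128 for 0.313 points, about $0.41 per point):

```python
def advantage(cost: float, avg: float,
              base_cost: float = 0.707, base_avg: float = 0.160) -> float:
    """Score-per-dollar multiple relative to the Sonnet baseline."""
    return (base_cost / base_avg) / (cost / avg)

print(round(advantage(0.095, 0.084), 1))  # 3.9  raw Qwen
print(round(advantage(0.102, 0.157), 1))  # 6.8  GEPA, default temp
print(round(advantage(0.128, 0.313), 1))  # 10.8 GEPA v3
```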
Seven write-heavy CRM tasks where both models actually have the right tools to finish the work. Generic adapters did not transfer cleanly. The v5 result came from task-local hill climbing (sketched below the table): fix proof artifacts first, then add latest-pricing and account-health rules for opportunity creation.
| run | avg | strict | cost | $/point vs Sonnet |
|---|---|---|---|---|
| Sonnet 4.6 (frontier ceiling) · limited zapier · mean of 3 | 0.557 ±0.066 | 19% ±8% | $1.12 | 1.0× |
| Qwen 3.6 Plus (open baseline) · limited zapier · mean of 3 ⁴ | 0.400 ±0.067 | 14% ±14% | $0.230 | 3.5× |
| Qwen 3.6 Plus + hand adapter · limited zapier · mean of 3 ⁵ | 0.369 ±0.082 | 14% ±14% | $0.232 | 3.2× |
| Qwen 3.6 Plus + targeted v5 adapter ★ · limited zapier · mean of 10 ⁶ | 0.630 ±0.087 | 54% | $0.275 | 4.6× |

⁴ The seven write-heavy tasks: update contact phone, add to campaign, create note, create contact, create opportunity, advance opportunity stage, qualify lead.
⁵ The sales/API adapter tuned for reasoning-heavy API tasks does not transfer to write-heavy CRM tasks; adapter quality is slice-specific.
⁶ Failure-mode hill climb: v4 fixed proof artifacts; v5 added latest-pricing/account-health rules.
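The v5 climb follows the pattern below; a minimal sketch, with hypothetical `run` and `diagnose` callables standing in for the steps we did by hand (reading failed transcripts and naming one missing behavior per revision):

```python
import statistics
from typing import Callable

def hill_climb_adapter(
    base_prompt: str,
    tasks: list[str],
    run: Callable[[str, str], float],      # (prompt, task) -> score in [0, 1]
    diagnose: Callable[[list[str]], str],  # failed tasks -> one named rule, as prompt text
    max_revisions: int = 5,
) -> str:
    """Task-local hill climb: add one rule per revision, keep it only on a win.

    v4/v5 in the table were two such steps: v4 added a proof-artifact rule,
    v5 added latest-pricing/account-health rules for opportunity creation.
    """
    best_prompt = base_prompt
    best_mean = statistics.mean(run(best_prompt, t) for t in tasks)
    for _ in range(max_revisions):
        failed = [t for t in tasks if run(best_prompt, t) < 1.0]
        if not failed:
            break  # nothing left to climb
        candidate = best_prompt + "\n" + diagnose(failed)
        cand_mean = statistics.mean(run(candidate, t) for t in tasks)
        if cand_mean > best_mean:  # same gate as everywhere else: promote only on a win
            best_prompt, best_mean = candidate, cand_mean
    return best_prompt
```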
Benchmark: AutomationBench, a sales-domain slice from Zapier's workflow benchmark, introduced with Prime Intellect. Models: Claude Sonnet 4.6 (Anthropic), Qwen 3.6 Plus on Fireworks. GEPA: the standalone gepa package (what DSPy wraps), with Claude Opus 4.7 as the reflection LM. All runs use max_steps=10, max_tokens=4096. Costs are for one full pass over the slice. The GEPA optimizer ran with max_metric_calls=20 and reflection_minibatch_size=2; reasoning-heavy API v3 was replicated to n=10. Write-heavy CRM v5 was a manual failure-mode hill climb over the action-task surface and was also replicated to n=10.
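For context, the optimizer invocation looked roughly like the sketch below. This assumes the standalone gepa package's top-level `optimize` entry point and the keyword names from that project's documentation; the model identifier strings and task lists are placeholders, so check your installed gepa version before copying anything.

```python
import gepa

# Placeholder splits; the real runs used AutomationBench sales-API task instances.
trainset: list[dict] = []  # the four reasoning-heavy API tasks
valset: list[dict] = []

# Seed candidate: the raw Qwen system prompt that GEPA mutates via reflection.
seed = {"system_prompt": "You are a sales-operations agent. ..."}

result = gepa.optimize(
    seed_candidate=seed,
    trainset=trainset,
    valset=valset,
    task_lm="qwen-3.6-plus",          # placeholder id for the Fireworks model
    reflection_lm="claude-opus-4-7",  # placeholder id for the reflection LM
    max_metric_calls=20,              # the small targeted budget from the footer
    reflection_minibatch_size=2,
)
print(result.best_candidate["system_prompt"])
```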