We watched an agent work, then tuned a smarter and cheaper successor.
Our pipeline decomposes the work into eval slices, reads failed trajectories, and promotes the cheaper model only when it beats the baseline.
On thinking tasks, tuned Qwen scores 0.313 ± 0.110 (n=10) versus Sonnet's 0.160 ± 0.009 (n=3), at 18% of the cost. On action tasks, tuned Qwen scores 0.630 ± 0.087 (n=10) versus Sonnet's 0.557 ± 0.066 (n=3), at 25% of the cost.
The win came from reading failed runs, naming the missing behavior, and writing small targeted adapters. A bigger blind GEPA budget did not win: the 50-call autoresearch prompt landed at 0.252 ± 0.144.
Same model family, same max tool-call budget, different prompt revisions. This is the optimization ladder: raw Qwen, broad GEPA, completion criteria, then task-local adapters for CRM actions.
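The gate at the end of that ladder is mechanical. Below is a minimal sketch in Python, assuming a `run` callable that stands in for the real harness (the agent loop with `max_steps=10` described in the footer); the names here are hypothetical, not our production pipeline:

```python
import statistics
from typing import Callable

# Stand-in signature for the real harness: (model_or_prompt, task) -> score in [0, 1].
RunFn = Callable[[str, str], float]

def slice_means(run: RunFn, model: str, tasks: list[str], n_reps: int) -> list[float]:
    """Replicate the full slice n_reps times; return one mean score per replicate."""
    return [statistics.mean(run(model, t) for t in tasks) for _ in range(n_reps)]

def promote(run: RunFn, candidate: str, baseline: str, tasks: list[str]) -> bool:
    """Promote the cheaper candidate only when it beats the baseline on the slice."""
    base = statistics.mean(slice_means(run, baseline, tasks, n_reps=3))
    cand = statistics.mean(slice_means(run, candidate, tasks, n_reps=3))
    if cand <= base:
        return False  # loss: go read the failed trajectories and revise the adapter
    # A win at n=3 is noisy; confirm at n=10 before promoting (as in the tables below).
    return statistics.mean(slice_means(run, candidate, tasks, n_reps=10)) > base
```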
Four reasoning-heavy API tasks from AutomationBench's sales surface: negative selection, priority selection, implicit rules, and cross-reference validation. This is the GEPA training surface: Qwen starts below Sonnet, then the v3 adapter clears Sonnet on mean while still carrying visible rollout variance.
| run | avg | strict | cost | $/point vs Sonnet |
|---|---|---|---|---|
| Sonnet 4.6 (frontier ceiling) · api · mean of 3 | 0.160 ±0.009 | 0% ±0% | $0.707 | 1.0× |
| Qwen 3.6 Plus (open baseline) · api · mean of 3 ¹ | 0.084 ±0.073 | 0% ±0% | $0.095 | 3.9× |
| Qwen 3.6 Plus + GEPA (default temp) · api · mean of 3 ² | 0.157 ±0.124 | 8% ±14% | $0.102 | 6.8× |
| Qwen 3.6 Plus + GEPA v3 @ temp=0 ★ · api · mean of 10 ³ | 0.313 ±0.110 | 23% | $0.128 | 10.8× |

"$/point vs Sonnet" is score per dollar relative to the Sonnet baseline: higher means cheaper per point.

¹ One of three replicates scored 0.000; Qwen's raw tool use has moderate rollout variance.
² Default Fireworks temperature. High rollout variance (σ=0.124): one rep scored 0.300 with 1-of-4 strict; the other two scored 0.086.
³ v2 structure plus an explicit completion criterion. Confirmed at n=10 after earlier n=5 variance claims failed to replicate.
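The last column is derived, not measured. A quick check of the arithmetic against the table's own numbers (Sonnet pays $0.707 for 0.160 points, about $4.42 per point; GEPA v3 pays $0.128 for 0.313 points, about $0.41 per point):

```python
def advantage(cost: float, avg: float,
              base_cost: float = 0.707, base_avg: float = 0.160) -> float:
    """Score-per-dollar multiple relative to the Sonnet baseline."""
    return (base_cost / base_avg) / (cost / avg)

print(round(advantage(0.095, 0.084), 1))  # 3.9  raw Qwen
print(round(advantage(0.102, 0.157), 1))  # 6.8  GEPA, default temp
print(round(advantage(0.128, 0.313), 1))  # 10.8 GEPA v3
```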
Seven write-heavy CRM tasks where both models actually have the right tools to finish the work. Generic adapters did not transfer cleanly. The v5 result came from task-local hill climbing (sketched below the table): fix proof artifacts first, then add latest-pricing and account-health rules for opportunity creation.
| run | avg | strict | cost | $/point vs Sonnet |
|---|---|---|---|---|
| Sonnet 4.6 (frontier ceiling) · limited zapier · mean of 3 | 0.557 ±0.066 | 19% ±8% | $1.12 | 1.0× |
| Qwen 3.6 Plus (open baseline) · limited zapier · mean of 3 ⁴ | 0.400 ±0.067 | 14% ±14% | $0.230 | 3.5× |
| Qwen 3.6 Plus + hand adapter · limited zapier · mean of 3 ⁵ | 0.369 ±0.082 | 14% ±14% | $0.232 | 3.2× |
| Qwen 3.6 Plus + targeted v5 adapter ★ · limited zapier · mean of 10 ⁶ | 0.630 ±0.087 | 54% | $0.275 | 4.6× |

⁴ The seven write-heavy tasks: update contact phone, add to campaign, create note, create contact, create opportunity, advance opportunity stage, qualify lead.
⁵ The sales/API adapter tuned for reasoning-heavy API tasks does not transfer to write-heavy CRM tasks; adapter quality is slice-specific.
⁶ Failure-mode hill climb: v4 fixed proof artifacts; v5 added latest-pricing/account-health rules.
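The v5 climb follows the pattern below; a minimal sketch, with hypothetical `run` and `diagnose` callables standing in for the steps we did by hand (reading failed transcripts and naming one missing behavior per revision):

```python
import statistics
from typing import Callable

def hill_climb_adapter(
    base_prompt: str,
    tasks: list[str],
    run: Callable[[str, str], float],      # (prompt, task) -> score in [0, 1]
    diagnose: Callable[[list[str]], str],  # failed tasks -> one named rule, as prompt text
    max_revisions: int = 5,
) -> str:
    """Task-local hill climb: add one rule per revision, keep it only on a win.

    v4/v5 in the table were two such steps: v4 added a proof-artifact rule,
    v5 added latest-pricing/account-health rules for opportunity creation.
    """
    best_prompt = base_prompt
    best_mean = statistics.mean(run(best_prompt, t) for t in tasks)
    for _ in range(max_revisions):
        failed = [t for t in tasks if run(best_prompt, t) < 1.0]
        if not failed:
            break  # nothing left to climb
        candidate = best_prompt + "\n" + diagnose(failed)
        cand_mean = statistics.mean(run(candidate, t) for t in tasks)
        if cand_mean > best_mean:  # same gate as everywhere else: promote only on a win
            best_prompt, best_mean = candidate, cand_mean
    return best_prompt
```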
Benchmark: AutomationBench, a sales-domain slice from Zapier's workflow benchmark, introduced with Prime Intellect. Models: Claude Sonnet 4.6 (Anthropic), Qwen 3.6 Plus on Fireworks. GEPA: the standalone gepa package (what DSPy wraps), with Claude Opus 4.7 as the reflection LM. All runs use max_steps=10, max_tokens=4096. Costs are for one full pass over the slice. The GEPA optimizer ran with max_metric_calls=20 and reflection_minibatch_size=2; reasoning-heavy API v3 was replicated to n=10. Write-heavy CRM v5 was a manual failure-mode hill climb over the action-task surface and was also replicated to n=10.
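For context, the optimizer invocation looked roughly like the sketch below. This assumes the standalone gepa package's top-level `optimize` entry point and the keyword names from that project's documentation; the model identifier strings and task lists are placeholders, so check your installed gepa version before copying anything.

```python
import gepa

# Placeholder splits; the real runs used AutomationBench sales-API task instances.
trainset: list[dict] = []  # the four reasoning-heavy API tasks
valset: list[dict] = []

# Seed candidate: the raw Qwen system prompt that GEPA mutates via reflection.
seed = {"system_prompt": "You are a sales-operations agent. ..."}

result = gepa.optimize(
    seed_candidate=seed,
    trainset=trainset,
    valset=valset,
    task_lm="qwen-3.6-plus",          # placeholder id for the Fireworks model
    reflection_lm="claude-opus-4-7",  # placeholder id for the reflection LM
    max_metric_calls=20,              # the small targeted budget from the footer
    reflection_minibatch_size=2,
)
print(result.best_candidate["system_prompt"])
```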