The same optimization loop, across every business domain.

Sales. Operations. Support. Finance. The model swap and prompt optimization are the same — the domain just changes which tasks you score against.

Zapier AutomationBench · sales domain · CRM, lead management, cross-app workflows

Claim

On four hard AutomationBench sales tasks, Qwen 3.6+ with a GEPA-optimized prompt produced a strict pass Sonnet never produced in three runs — at 14% of Sonnet's cost. Its mean over n=3 replicates is 0.157 ± 0.124, statistically tied with Sonnet's 0.160 ± 0.009. On the direct-tool slice (n=3), Qwen reaches 72% of Sonnet's partial credit at 21% of cost 3.5× quality-per-dollar.

Visual

Three bars per slice: open-model baseline (left), frontier reference line, GEPA-optimized open model (right). Cost annotated under each bar.

hard slice
sales · generic API tools · 4 tasks
0.000.250.500.751.00partial credit0.084±0.073 · n=3Qwen 3.6+open baseline$0.0950.160±0.009 · n=3Sonnet 4.6frontier ceiling$0.7070.157±0.124 · n=3Qwen 3.6+open + GEPA$0.102Sonnet 4.6 ceilingGEPA: +87%
direct-tool slice
sales · limited_zapier · 7 tasks · mean of 3
0.000.250.500.751.00partial credit0.400±0.067 · n=3Qwen 3.6+open baseline$0.2300.557±0.066 · n=3Sonnet 4.6frontier ceiling$1.12pendingQwen 3.6+open + GEPASonnet 4.6 ceiling
GEPA on the direct-tool slice is the next run queued — raw Qwen already sits at 3.5× Sonnet's quality-per-dollar.
Hard slice

Four sales tasks from AutomationBench's generic-API surface: negative selection, priority selection, implicit rules, cross-reference validation.

runavgstrictcost$/point
Sonnet 4.6 frontier ceiling
api · mean of 3
0.160
±0.009
0%
±0%
$0.707
1.0×
Qwen 3.6 Plus open baseline
api · mean of 3One of three replicates scored 0.000 — Qwen's raw tool-use has moderate rollout variance. Mean of 3.
0.084
±0.073
0%
±0%
$0.095
3.9×
Qwen 3.6 Plus + GEPA (default temp)
api · mean of 3Default Fireworks temperature. High rollout variance (σ=0.124): one rep 0.300 with 1-of-4 strict; two reps 0.086.
0.157
±0.124
8%
±14%
$0.102
6.8×
Qwen 3.6 Plus + GEPA v3 @ temp=0
api · mean of 10v2 structure plus an explicit completion criterion. Confirmed at n=10 after earlier n=5 variance claims failed to replicate.
0.313
±0.110
23%
$0.128
10.8×
n=3 replicates per row. $/point = quality-per-dollar normalized to the frontier row.
Direct slice

Seven direct CRM-mutation tasks. Qwen doesn't beat Sonnet on score; it matches the Pareto frontier at 21% of the cost.

runavgstrictcost$/point
Sonnet 4.6 frontier ceiling
limited zapier · mean of 3
0.557
±0.066
19%
±8%
$1.12
1.0×
Qwen 3.6 Plus open baseline
limited zapier · mean of 37 write-heavy tasks: update contact phone, add to campaign, create note, create contact, create opportunity, advance opportunity stage, qualify lead. Mean of 3 runs.
0.400
±0.067
14%
±14%
$0.230
3.5×
Qwen 3.6 Plus + hand adapter
limited zapier · mean of 3The sales/API adapter tuned for reasoning-heavy API tasks does not transfer to write-heavy CRM tasks — adapter quality is slice-specific. Mean of 3 runs.
0.369
±0.082
14%
±14%
$0.232
3.2×
Qwen 3.6 Plus + targeted v5 adapter
limited zapier · mean of 10Failure-mode hill climb: v4 fixed proof artifacts; v5 added latest-pricing/account-health rules. Mean of 10 runs.
0.630
±0.087
54%
$0.275
4.6×
Mean of 3 replicates. Raw Qwen at 3.5× quality-per-dollar vs. Sonnet on this slice.