The same optimization loop, across every business domain.
Sales. Operations. Support. Finance. The model swap and prompt optimization are the same — the domain just changes which tasks you score against.
Zapier AutomationBench · sales domain · CRM, lead management, cross-app workflows
On four hard AutomationBench sales tasks, Qwen 3.6 Plus with a GEPA-optimized prompt produced a strict pass, which Sonnet 4.6 never did in three runs, at 14% of Sonnet's cost. Its mean score over n=3 replicates is 0.157 ± 0.124, statistically indistinguishable from Sonnet's 0.160 ± 0.009 at this sample size. On the direct-tool slice (n=3), baseline Qwen reaches 72% of Sonnet's partial credit at 21% of the cost, or 3.5× quality-per-dollar.
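The direct-tool percentages can be reproduced from the table means. The sketch below is a minimal illustration, assuming "quality-per-dollar" means average score divided by run cost, normalized to Sonnet; the variable names are illustrative and not taken from the benchmark harness.

```python
# Direct-tool slice means, copied from the results table (n=3 each).
sonnet = {"score": 0.557, "cost": 1.12}
qwen_baseline = {"score": 0.400, "cost": 0.230}

# Fraction of Sonnet's partial credit that Qwen reaches.
score_ratio = qwen_baseline["score"] / sonnet["score"]

# Fraction of Sonnet's cost that Qwen spends.
cost_ratio = qwen_baseline["cost"] / sonnet["cost"]

# Quality-per-dollar relative to Sonnet (assumed definition: score / cost).
qpd_ratio = score_ratio / cost_ratio

print(f"{score_ratio:.0%} of score at {cost_ratio:.0%} of cost -> {qpd_ratio:.1f}x")
```

The same division applied to any other row gives that row's multiplier in the last table column.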
Three bars per slice: open-model baseline (left), frontier reference line, GEPA-optimized open model (right). Cost annotated under each bar.
Four sales tasks from AutomationBench's generic-API surface: negative selection, priority selection, implicit rules, cross-reference validation.
| run | avg | strict | cost | quality/$ vs. Sonnet |
|---|---|---|---|---|
| Sonnet 4.6 (frontier ceiling) · api · mean of 3 | 0.160 ±0.009 | 0% ±0% | $0.707 | 1.0× |
| Qwen 3.6 Plus (open baseline) · api · mean of 3 | 0.084 ±0.073 | 0% ±0% | $0.095 | 3.9× |
| Qwen 3.6 Plus + GEPA (default temp) · api · mean of 3 | 0.157 ±0.124 | 8% ±14% | $0.102 | 6.8× |
| Qwen 3.6 Plus + GEPA v3 @ temp=0 ★ · api · mean of 10 | 0.313 ±0.110 | 23% | $0.128 | 10.8× |

Qwen 3.6 Plus (open baseline): one of three replicates scored 0.000; Qwen's raw tool use has moderate rollout variance.
Qwen 3.6 Plus + GEPA (default temp): default Fireworks temperature. High rollout variance (σ=0.124): one replicate scored 0.300 with 1-of-4 strict; two replicates scored 0.086.
Qwen 3.6 Plus + GEPA v3 @ temp=0: v2 structure plus an explicit completion criterion. Confirmed at n=10 after earlier n=5 variance claims failed to replicate.
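The mean and spread reported for the GEPA (default temp) row can be checked against its three replicate scores. This is a minimal sketch, assuming the ± figure is the sample standard deviation (n-1 denominator), which is what the published numbers are consistent with.

```python
import statistics

# Replicate scores for Qwen 3.6 Plus + GEPA at default temperature,
# as stated in the row note: one rep at 0.300, two reps at 0.086.
replicates = [0.300, 0.086, 0.086]

mean_score = statistics.mean(replicates)
# statistics.stdev uses the sample (n-1) formula.
sample_std = statistics.stdev(replicates)

print(f"{mean_score:.3f} ± {sample_std:.3f}")
```

With only three replicates, a single high-scoring run moves the mean substantially, which is why the n=10 row is the more reliable estimate.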
Seven direct CRM-mutation tasks. Baseline Qwen doesn't beat Sonnet on score; it sits on the cost-quality Pareto frontier at 21% of Sonnet's cost.
| run | avg | strict | cost | quality/$ vs. Sonnet |
|---|---|---|---|---|
| Sonnet 4.6 (frontier ceiling) · limited zapier · mean of 3 | 0.557 ±0.066 | 19% ±8% | $1.12 | 1.0× |
| Qwen 3.6 Plus (open baseline) · limited zapier · mean of 3 | 0.400 ±0.067 | 14% ±14% | $0.230 | 3.5× |
| Qwen 3.6 Plus + hand adapter · limited zapier · mean of 3 | 0.369 ±0.082 | 14% ±14% | $0.232 | 3.2× |
| Qwen 3.6 Plus + targeted v5 adapter ★ · limited zapier · mean of 10 | 0.630 ±0.087 | 54% | $0.275 | 4.6× |

Qwen 3.6 Plus (open baseline): 7 write-heavy tasks: update contact phone, add to campaign, create note, create contact, create opportunity, advance opportunity stage, qualify lead.
Qwen 3.6 Plus + hand adapter: the sales/API adapter tuned for reasoning-heavy API tasks does not transfer to write-heavy CRM tasks; adapter quality is slice-specific.
Qwen 3.6 Plus + targeted v5 adapter: failure-mode hill climb. v4 fixed proof artifacts; v5 added latest-pricing/account-health rules.
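The Pareto-frontier framing for this slice can be made concrete with the table means. The sketch below is illustrative only: it treats each run as a (cost, average score) point, ignores the ± spread and the differing replicate counts, and uses a hypothetical `pareto_frontier` helper that is not part of AutomationBench.

```python
from typing import NamedTuple

class Run(NamedTuple):
    name: str
    cost: float   # dollars per run, from the table
    score: float  # average score, from the table

runs = [
    Run("Sonnet 4.6", 1.12, 0.557),
    Run("Qwen baseline", 0.230, 0.400),
    Run("Qwen + hand adapter", 0.232, 0.369),
    Run("Qwen + v5 adapter", 0.275, 0.630),
]

def pareto_frontier(points: list[Run]) -> list[str]:
    """Return names of runs that no other run beats on both cost and score."""
    frontier = []
    for r in points:
        dominated = any(
            o.cost <= r.cost and o.score >= r.score
            and (o.cost < r.cost or o.score > r.score)
            for o in points
        )
        if not dominated:
            frontier.append(r.name)
    return frontier

frontier = pareto_frontier(runs)
print(frontier)
```

On point estimates alone, the baseline is the cheapest non-dominated option and the v5 adapter is the highest-scoring one; a comparison that accounts for run-to-run variance would need the per-replicate scores.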