Frontier model, router, fine-tune, or specialist open model?

The right answer depends on whether the workload is repeated, measurable, and expensive enough to deserve its own model. Understudy is for the point where a generalist is good enough to teach the task, but too slow, costly, or rented to own the product surface.

Decision
Path: Stay on a frontier model
Best for: Low-volume workflows, broad reasoning, fast prototyping, and tasks where the target behavior is still changing.
Tradeoff: Strong default quality, but every repeated request keeps paying frontier latency and token pricing.
Understudy role: Use as the baseline, teacher, adjudicator, or control slice while measuring whether a specialist is ready.

Path: Generic LLM routing
Best for: Traffic splitting across providers when requests vary widely and the team mainly wants availability or marginal cost control.
Tradeoff: Routing chooses between existing models; it usually does not create a model that is better at your repeated workflow.
Understudy role: Understudy can feed routing decisions, but the core loop is evals, optimization, and replacement for a specific workload.

Path: One-off fine-tuning
Best for: Stable labeled datasets with a known target format and enough examples to justify a training run.
Tradeoff: A tune without held-out evals, production traces, and ongoing review can improve a demo while missing production quality.
Understudy role: Fine-tuning is one step in the optimization ladder, used after prompts, schemas, and evals prove the task earns it.

Path: Understudy specialist model workflow
Best for: Repeated production tasks where cost, latency, ownership, or scale are constrained by a generalist model.
Tradeoff: Requires a real eval contract and expert review. It is not the right path for vague or low-volume tasks.
Understudy role: Capture traces, build evals, optimize prompts/routes/models, and hand off specialist weights when they beat the gate.

Evidence

The comparison is not generic model-score theater. Understudy measures whether a cheaper route can satisfy a specific product contract: CRM actions, operations JSON, or warehouse-scale labeling.

Read the published slices: sales agent benchmark, operations benchmark, and sentiment labeling benchmark.

The same optimization loop, across every business domain.

Sales. Operations. Support. Finance. The model swap and prompt optimization are the same — the domain just changes which tasks you score against.

Zapier AutomationBench · sales domain · CRM, lead management, cross-app workflows

Claim

On four hard AutomationBench sales tasks, Qwen 3.6+ with a GEPA-optimized prompt produced a strict pass that Sonnet never produced in three runs, at 14% of Sonnet's cost. Its mean over n=3 replicates is 0.157 ± 0.124, statistically tied with Sonnet's 0.160 ± 0.009. On the direct-tool slice (n=3), raw Qwen reaches 72% of Sonnet's partial credit at 21% of the cost, or 3.5× quality-per-dollar.
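The 0.157 ± 0.124 figure can be reproduced from the per-replicate scores reported for the GEPA run in the hard-slice results (one replicate at 0.300 with 1-of-4 strict, two at 0.086). The sketch below is a minimal check, assuming the ± values are sample standard deviations (n-1 denominator) and that the two lower replicates scored exactly 0.086 with no strict passes; the variable and function names are illustrative, not part of any Understudy tooling.

```python
from statistics import mean, stdev

# Per-replicate partial-credit scores for Qwen 3.6 Plus + GEPA (default temp),
# taken from the hard-slice results: one rep at 0.300, two reps at 0.086.
scores = [0.300, 0.086, 0.086]

# Strict pass rate per replicate: 1-of-4 tasks in one rep.
# Assumption: the other two replicates had no strict passes.
strict = [1 / 4, 0 / 4, 0 / 4]


def summarize(values):
    """Return (mean, sample standard deviation) over replicates."""
    return mean(values), stdev(values)  # stdev divides by n-1


score_mean, score_sd = summarize(scores)
strict_mean, strict_sd = summarize(strict)

print(f"partial credit: {score_mean:.3f} ± {score_sd:.3f}")
print(f"strict: {strict_mean:.0%} ± {strict_sd:.0%}")
```

Under those assumptions the output matches the reported 0.157 ± 0.124 and 8% ± 14%. With only three replicates, a single high-scoring run accounts for most of the spread, so the mean alone says little about typical behavior.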

Visual

Three bars per slice: open-model baseline (left), frontier reference line, GEPA-optimized open model (right). Cost annotated under each bar.

[Chart: hard slice · sales · generic API tools · 4 tasks. Y-axis: partial credit (0.00 to 1.00). Qwen 3.6+ open baseline 0.084 ± 0.073 (n=3, $0.095); Sonnet 4.6 frontier ceiling 0.160 ± 0.009 (n=3, $0.707); Qwen 3.6+ open + GEPA 0.157 ± 0.124 (n=3, $0.102). GEPA: +87% over the open baseline.]
[Chart: direct-tool slice · sales · limited_zapier · 7 tasks · mean of 3. Y-axis: partial credit (0.00 to 1.00). Qwen 3.6+ open baseline 0.400 ± 0.067 (n=3, $0.230); Sonnet 4.6 frontier ceiling 0.557 ± 0.066 (n=3, $1.12); Qwen 3.6+ open + GEPA pending.]
GEPA on the direct-tool slice is the next run queued — raw Qwen already sits at 3.5× Sonnet's quality-per-dollar.
Hard slice

Four sales tasks from AutomationBench's generic-API surface: negative selection, priority selection, implicit rules, cross-reference validation.

| run | setup | avg | strict | cost | $/point |
|---|---|---|---|---|---|
| Sonnet 4.6 frontier ceiling | api · mean of 3 | 0.160 ± 0.009 | 0% ± 0% | $0.707 | 1.0× |
| Qwen 3.6 Plus open baseline | api · mean of 3 | 0.084 ± 0.073 | 0% ± 0% | $0.095 | 3.9× |
| Qwen 3.6 Plus + GEPA (default temp) | api · mean of 3 | 0.157 ± 0.124 | 8% ± 14% | $0.102 | 6.8× |
| Qwen 3.6 Plus + GEPA v3 @ temp=0 | api · mean of 10 | 0.313 ± 0.110 | 23% | $0.128 | 10.8× |

Notes:
Qwen 3.6 Plus open baseline: One of three replicates scored 0.000; Qwen's raw tool-use has moderate rollout variance.
Qwen 3.6 Plus + GEPA (default temp): Default Fireworks temperature. High rollout variance (σ=0.124): one rep 0.300 with 1-of-4 strict; two reps 0.086.
Qwen 3.6 Plus + GEPA v3 @ temp=0: v2 structure plus an explicit completion criterion. Confirmed at n=10 after earlier n=5 variance claims failed to replicate.

n=3 replicates per row, except the GEPA v3 @ temp=0 row (n=10). $/point = quality-per-dollar (avg score divided by cost), normalized to the frontier row.
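The $/point column can be recomputed from the avg and cost columns. The sketch below is one way to do it, assuming quality-per-dollar means avg score divided by cost and that each row is divided by the Sonnet row's value; the function and variable names are illustrative.

```python
def quality_per_dollar(avg_score, cost_usd):
    """Partial-credit points earned per dollar spent."""
    return avg_score / cost_usd


# (avg partial credit, cost in USD) for each hard-slice row.
rows = {
    "Sonnet 4.6 frontier ceiling": (0.160, 0.707),
    "Qwen 3.6 Plus open baseline": (0.084, 0.095),
    "Qwen 3.6 Plus + GEPA (default temp)": (0.157, 0.102),
    "Qwen 3.6 Plus + GEPA v3 @ temp=0": (0.313, 0.128),
}

# Normalize every row to the frontier row so Sonnet reads as 1.0x.
frontier = quality_per_dollar(*rows["Sonnet 4.6 frontier ceiling"])
normalized = {
    name: quality_per_dollar(avg, cost) / frontier
    for name, (avg, cost) in rows.items()
}

for name, ratio in normalized.items():
    print(f"{name}: {ratio:.1f}×")
```

The ratios come out to 1.0×, 3.9×, 6.8×, and 10.8×, matching the column. The metric uses mean scores only, so it does not reflect the rollout variance reported for each row.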
Direct slice

Seven direct CRM-mutation tasks. Raw Qwen doesn't beat Sonnet on score (0.400 vs. 0.557); it sits on the cost-quality Pareto frontier at 21% of the cost. The targeted v5 adapter, measured over 10 runs, scores 0.630.

| run | setup | avg | strict | cost | $/point |
|---|---|---|---|---|---|
| Sonnet 4.6 frontier ceiling | limited zapier · mean of 3 | 0.557 ± 0.066 | 19% ± 8% | $1.12 | 1.0× |
| Qwen 3.6 Plus open baseline | limited zapier · mean of 3 | 0.400 ± 0.067 | 14% ± 14% | $0.230 | 3.5× |
| Qwen 3.6 Plus + hand adapter | limited zapier · mean of 3 | 0.369 ± 0.082 | 14% ± 14% | $0.232 | 3.2× |
| Qwen 3.6 Plus + targeted v5 adapter | limited zapier · mean of 10 | 0.630 ± 0.087 | 54% | $0.275 | 4.6× |

Notes:
Qwen 3.6 Plus open baseline: 7 write-heavy tasks: update contact phone, add to campaign, create note, create contact, create opportunity, advance opportunity stage, qualify lead.
Qwen 3.6 Plus + hand adapter: The sales/API adapter tuned for reasoning-heavy API tasks does not transfer to write-heavy CRM tasks; adapter quality is slice-specific.
Qwen 3.6 Plus + targeted v5 adapter: Failure-mode hill climb: v4 fixed proof artifacts; v5 added latest-pricing/account-health rules.

Mean of 3 replicates per row, except the targeted v5 adapter row (mean of 10). Raw Qwen at 3.5× quality-per-dollar vs. Sonnet on this slice.
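To make the Pareto-frontier language concrete, the sketch below checks which direct-slice rows are non-dominated on cost and average score. It is an illustration only: it treats the reported means as point estimates, ignores the ± spread, and compares rows with different replicate counts (n=3 and n=10). The `pareto_frontier` helper and the shortened row names are hypothetical.

```python
def pareto_frontier(points):
    """Return the names of points not dominated by any other point.

    A point dominates another if it costs no more, scores no less,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for name, (cost, score) in points.items():
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for other, (c, s) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (cost in USD, avg partial credit) from the direct-slice results.
raw = {
    "Sonnet 4.6": (1.12, 0.557),
    "Qwen 3.6 Plus baseline": (0.230, 0.400),
}
all_rows = {
    **raw,
    "Qwen + hand adapter": (0.232, 0.369),
    "Qwen + targeted v5 adapter": (0.275, 0.630),
}

print(pareto_frontier(raw))
print(pareto_frontier(all_rows))
```

Comparing only Sonnet and raw Qwen, both sit on the frontier: Sonnet scores higher and Qwen costs less. Once the adapter rows are included, the means place the raw baseline and the targeted v5 adapter on the frontier, with the hand adapter dominated by the raw baseline; confirming that ordering would need matched replicate counts.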