Frontier model, router, fine-tune, or specialist open model?

The right answer depends on whether the workload is repeated, measurable, and expensive enough to deserve its own model. Understudy is for the point where a generalist is good enough to teach the task but is too slow or too costly to keep serving it, or where renting the model leaves you without ownership of the product surface.

Decision

path	best for	tradeoff	Understudy role
Stay on a frontier model	Low-volume workflows, broad reasoning, fast prototyping, and tasks where the target behavior is still changing.	Strong default quality, but every repeated request keeps paying frontier latency and token pricing.	Use as the baseline, teacher, adjudicator, or control slice while measuring whether a specialist is ready.
Generic LLM routing	Traffic splitting across providers when requests vary widely and the team mainly wants availability or marginal cost control.	Routing chooses between existing models; it usually does not create a model that is better at your repeated workflow.	Understudy can feed routing decisions, but the core loop is evals, optimization, and replacement for a specific workload.
One-off fine-tuning	Stable labeled datasets with a known target format and enough examples to justify a training run.	A tune without held-out evals, production traces, and ongoing review can improve a demo while missing production quality.	Fine-tuning is one step in the optimization ladder, used after prompts, schemas, and evals prove the task earns it.
Understudy specialist model workflow	Repeated production tasks where cost, latency, ownership, or scale are constrained by a generalist model.	Requires a real eval contract and expert review. It is not the right path for vague or low-volume tasks.	Capture traces, build evals, optimize prompts/routes/models, and hand off specialist weights when they beat the gate.
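
One way to read "beat the gate" in the last row is as an explicit pass/fail check on a held-out slice. The sketch below is an illustration only: the `SliceResult` type, the `passes_gate` function, and both thresholds are hypothetical and not part of Understudy. The scores and costs are the direct-slice means published further down this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    """Aggregate eval result for one model on one held-out slice."""
    avg_score: float  # mean partial-credit score, between 0 and 1
    cost_usd: float   # total cost of running the slice


def passes_gate(candidate: SliceResult, baseline: SliceResult,
                min_quality_ratio: float = 0.95,
                max_cost_ratio: float = 0.50) -> bool:
    """Return True if the candidate keeps enough quality at low enough cost.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if baseline.avg_score <= 0 or baseline.cost_usd <= 0:
        raise ValueError("baseline score and cost must be positive")
    quality_ratio = candidate.avg_score / baseline.avg_score
    cost_ratio = candidate.cost_usd / baseline.cost_usd
    return quality_ratio >= min_quality_ratio and cost_ratio <= max_cost_ratio


# Direct-slice means from the published table (avg, cost).
sonnet = SliceResult(avg_score=0.557, cost_usd=1.12)
qwen_raw = SliceResult(avg_score=0.400, cost_usd=0.230)
qwen_v5 = SliceResult(avg_score=0.630, cost_usd=0.275)

print(passes_gate(qwen_raw, sonnet))  # False: about 72% of baseline quality
print(passes_gate(qwen_v5, sonnet))   # True: higher mean at about 25% of cost
```

A real gate would also account for replicate variance and strict-pass rate; this version compares means only.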

Evidence

The comparison is not generic model-score theater. Understudy measures whether a cheaper route can satisfy a specific product contract: CRM actions, operations JSON, or warehouse-scale labeling.

Read the published slices: sales agent benchmark, operations benchmark, and sentiment labeling benchmark.

The same optimization loop, across every business domain.

Sales. Operations. Support. Finance. The model swap and prompt optimization are the same — the domain just changes which tasks you score against.

Zapier AutomationBench · sales domain · CRM, lead management, cross-app workflows

Claim

On four hard AutomationBench sales tasks, Qwen 3.6 Plus with a GEPA-optimized prompt produced a strict pass that Sonnet never produced in three runs, at 14% of Sonnet's cost. Its mean over n=3 replicates is 0.157 ± 0.124, which is indistinguishable from Sonnet's 0.160 ± 0.009 at this sample size. On the direct-tool slice (n=3), raw Qwen reaches 72% of Sonnet's partial credit at 21% of the cost, or 3.5× the quality-per-dollar.
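
The 0.157 ± 0.124 figure can be reproduced from the three per-replicate scores quoted in the hard-slice table (one replicate at 0.300, two at 0.086). The sketch below assumes the ± value is a sample standard deviation (n − 1 in the denominator); the page does not say so explicitly, but the numbers are consistent with that reading.

```python
import statistics

# Per-replicate average scores for Qwen 3.6 Plus + GEPA (default temp),
# as quoted in the hard-slice table.
gepa_default_temp = [0.300, 0.086, 0.086]

mean = statistics.mean(gepa_default_temp)
# Assumption: the reported spread is the sample standard deviation.
spread = statistics.stdev(gepa_default_temp)

print(f"{mean:.3f} ± {spread:.3f}")  # 0.157 ± 0.124

# Sonnet's reported mean falls well inside one standard deviation of Qwen's.
sonnet_mean = 0.160
print(abs(sonnet_mean - mean) < spread)  # True
```

With only three replicates the spread estimate is itself noisy, so the comparison says the two means cannot be separated, not that they are equal.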

Visual

[Figure: three bars per slice: open-model baseline (left), frontier reference line, GEPA-optimized open model (right). Cost annotated under each bar. Panels: hard slice (sales · generic API tools · 4 tasks) and direct-tool slice (sales · limited_zapier · 7 tasks · mean of 3).]

GEPA on the direct-tool slice is the next queued run; raw Qwen already sits at 3.5× Sonnet's quality-per-dollar.

Hard slice

Four sales tasks from AutomationBench's generic-API surface: negative selection, priority selection, implicit rules, cross-reference validation.

run	avg	strict	cost	$/point
Sonnet 4.6 · frontier ceiling · api · mean of 3	0.160 ±0.009	0% ±0%	$0.707	1.0×
Qwen 3.6 Plus · open baseline · api · mean of 3	0.084 ±0.073	0% ±0%	$0.095	3.9×
Qwen 3.6 Plus + GEPA (default temp) · api · mean of 3	0.157 ±0.124	8% ±14%	$0.102	6.8×
Qwen 3.6 Plus + GEPA v3 @ temp=0 ★ · api · mean of 10	0.313 ±0.110	23%	$0.128	10.8×

Row notes:
Qwen 3.6 Plus, open baseline: one of three replicates scored 0.000; Qwen's raw tool use has moderate rollout variance.
Qwen 3.6 Plus + GEPA (default temp): default Fireworks temperature. High rollout variance (σ=0.124): one replicate scored 0.300 with 1-of-4 strict; two replicates scored 0.086.
Qwen 3.6 Plus + GEPA v3 @ temp=0 ★: v2 structure plus an explicit completion criterion. Confirmed at n=10 after earlier n=5 variance claims failed to replicate.

n=3 replicates per row, except the starred row (n=10). $/point is quality-per-dollar (avg ÷ cost) normalized to the frontier row; higher is better.
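
The $/point column can be recomputed from the avg and cost columns. The sketch below is a minimal check, assuming the column is (avg ÷ cost) for each row divided by the same quantity for the Sonnet row; the function and variable names are illustrative.

```python
def normalized_quality_per_dollar(avg: float, cost: float,
                                  ref_avg: float, ref_cost: float) -> float:
    """Quality-per-dollar of a run, relative to a reference run."""
    return (avg / cost) / (ref_avg / ref_cost)


# (avg, cost in USD) for each hard-slice row, copied from the table.
HARD_SLICE = {
    "Sonnet 4.6": (0.160, 0.707),
    "Qwen 3.6 Plus": (0.084, 0.095),
    "Qwen 3.6 Plus + GEPA (default temp)": (0.157, 0.102),
    "Qwen 3.6 Plus + GEPA v3 @ temp=0": (0.313, 0.128),
}

ref_avg, ref_cost = HARD_SLICE["Sonnet 4.6"]
ratios = {
    name: round(normalized_quality_per_dollar(avg, cost, ref_avg, ref_cost), 1)
    for name, (avg, cost) in HARD_SLICE.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}×")
```

The recomputed ratios are 1.0, 3.9, 6.8, and 10.8, matching the table.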

Direct slice

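Seven direct CRM-mutation tasks. Raw Qwen doesn't beat Sonnet on score; it sits on the cost/quality Pareto frontier, reaching 72% of Sonnet's average at 21% of the cost. The starred v5 adapter row is the only Qwen configuration here whose mean exceeds Sonnet's.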

run	avg	strict	cost	$/point
Sonnet 4.6 · frontier ceiling · limited zapier · mean of 3	0.557 ±0.066	19% ±8%	$1.12	1.0×
Qwen 3.6 Plus · open baseline · limited zapier · mean of 3	0.400 ±0.067	14% ±14%	$0.230	3.5×
Qwen 3.6 Plus + hand adapter · limited zapier · mean of 3	0.369 ±0.082	14% ±14%	$0.232	3.2×
Qwen 3.6 Plus + targeted v5 adapter ★ · limited zapier · mean of 10	0.630 ±0.087	54%	$0.275	4.6×

Row notes:
Qwen 3.6 Plus, open baseline: 7 write-heavy tasks: update contact phone, add to campaign, create note, create contact, create opportunity, advance opportunity stage, qualify lead.
Qwen 3.6 Plus + hand adapter: the sales/API adapter tuned for reasoning-heavy API tasks does not transfer to write-heavy CRM tasks; adapter quality is slice-specific.
Qwen 3.6 Plus + targeted v5 adapter ★: failure-mode hill climb: v4 fixed proof artifacts; v5 added latest-pricing/account-health rules.

Mean of 3 replicates per row, except the starred row (mean of 10). Raw Qwen at 3.5× quality-per-dollar vs. Sonnet on this slice.
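
A cost/quality comparison like this one can also be checked for Pareto dominance: a run is dominated if another run is at least as cheap and at least as good, and strictly better on one of the two. The sketch below applies that definition to the table's mean values only. It ignores replicate variance and the differing replicate counts (3 vs. 10), so treat it as an illustration; the function name is hypothetical.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of runs not dominated on (cost, score).

    Lower cost is better; higher score is better.
    """
    def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
        no_worse = a[0] <= b[0] and a[1] >= b[1]
        strictly_better = a[0] < b[0] or a[1] > b[1]
        return no_worse and strictly_better

    return [
        name for name, point in points.items()
        if not any(dominates(other, point)
                   for other_name, other in points.items()
                   if other_name != name)
    ]


# (cost in USD, avg score) for each direct-slice row, copied from the table.
DIRECT_SLICE = {
    "Sonnet 4.6": (1.12, 0.557),
    "Qwen 3.6 Plus (open baseline)": (0.230, 0.400),
    "Qwen 3.6 Plus + hand adapter": (0.232, 0.369),
    "Qwen 3.6 Plus + targeted v5 adapter": (0.275, 0.630),
}

frontier = pareto_frontier(DIRECT_SLICE)
print(frontier)
```

On means alone, the frontier consists of the open baseline (cheapest) and the targeted v5 adapter (highest score); the hand adapter is dominated by the baseline.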