Frontier model, router, fine-tune, or specialist open model?
The right answer depends on whether the workload is repeated, measurable, and expensive enough to deserve its own model. Understudy is for the point where a generalist is good enough to teach the task but too slow, too costly, or too much of a rented dependency to sit behind a product surface you want to own.
| path | best for | tradeoff | Understudy role |
|---|---|---|---|
| Stay on a frontier model | Low-volume workflows, broad reasoning, fast prototyping, and tasks where the target behavior is still changing. | Strong default quality, but every repeated request keeps paying frontier latency and token pricing. | Use as the baseline, teacher, adjudicator, or control slice while measuring whether a specialist is ready. |
| Generic LLM routing | Traffic splitting across providers when requests vary widely and the team mainly wants availability or marginal cost control. | Routing chooses between existing models; it usually does not create a model that is better at your repeated workflow. | Understudy can feed routing decisions, but the core loop is evals, optimization, and replacement for a specific workload. |
| One-off fine-tuning | Stable labeled datasets with a known target format and enough examples to justify a training run. | A tune without held-out evals, production traces, and ongoing review can improve a demo while missing production quality. | Fine-tuning is one step in the optimization ladder, used after prompts, schemas, and evals prove the task earns it. |
| Understudy specialist model workflow | Repeated production tasks where cost, latency, ownership, or scale are constrained by a generalist model. | Requires a real eval contract and expert review. It is not the right path for vague or low-volume tasks. | Capture traces, build evals, optimize prompts/routes/models, and hand off specialist weights when they beat the gate. |
The comparison is not generic model-score theater. Understudy measures whether a cheaper route can satisfy a specific product contract: CRM actions, operations JSON, or warehouse-scale labeling.
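One way to picture a "gate" of this kind is a check that a cheaper candidate keeps most of the baseline's score on a slice while cutting cost. The sketch below is illustrative only: `SliceResult`, `passes_gate`, and both thresholds are hypothetical names and values, not Understudy's API or its actual gate. The numbers are the mean scores and costs reported in the sales API table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    """Aggregate result for one model on one eval slice (hypothetical shape)."""
    mean_score: float  # mean eval score over replicates, between 0 and 1
    cost_usd: float    # total cost of running the slice, in USD


def passes_gate(candidate, baseline, min_score_ratio=0.95, max_cost_ratio=0.50):
    """Hypothetical gate: keep most of the baseline's quality at a fraction of its cost."""
    keeps_quality = candidate.mean_score >= min_score_ratio * baseline.mean_score
    cuts_cost = candidate.cost_usd <= max_cost_ratio * baseline.cost_usd
    return keeps_quality and cuts_cost


# Mean score and cost as reported for the sales API slice.
frontier_model = SliceResult(mean_score=0.160, cost_usd=0.707)  # Sonnet 4.6
untuned = SliceResult(mean_score=0.084, cost_usd=0.095)         # Qwen open baseline
optimized = SliceResult(mean_score=0.313, cost_usd=0.128)       # Qwen + GEPA v3 @ temp=0

print(passes_gate(untuned, frontier_model))
print(passes_gate(optimized, frontier_model))
```

Under these assumed thresholds the untuned model fails on score and the optimized one passes. A real gate would also need to account for replicate variance and strict-pass rate, not just the mean.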
Read the published slices: sales agent benchmark, operations benchmark, and sentiment labeling benchmark.
The same optimization loop, across every business domain.
Sales. Operations. Support. Finance. The model swap and prompt optimization are the same — the domain just changes which tasks you score against.
Zapier AutomationBench · sales domain · CRM, lead management, cross-app workflows
On four hard AutomationBench sales tasks, Qwen 3.6 Plus with a GEPA-optimized prompt produced a strict pass that Sonnet never produced in three runs, at 14% of Sonnet's cost. Its mean over n=3 replicates is 0.157 ± 0.124, which is indistinguishable from Sonnet's 0.160 ± 0.009 at this sample size. On the direct-tool slice (n=3), baseline Qwen reaches 72% of Sonnet's partial credit at 21% of the cost, or 3.5× quality per dollar.
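The headline ratios can be reproduced from the table figures. The sketch below assumes "quality per dollar" means mean score divided by cost, expressed relative to Sonnet; the helper `relative` is a hypothetical name used only for illustration.

```python
# Mean score and cost (USD) copied from the benchmark tables on this page.
sonnet_api = {"score": 0.160, "cost": 0.707}
qwen_gepa_api = {"score": 0.157, "cost": 0.102}
sonnet_direct = {"score": 0.557, "cost": 1.12}
qwen_direct = {"score": 0.400, "cost": 0.230}


def relative(candidate, reference):
    """Return (score ratio, cost ratio, quality-per-dollar ratio) vs. the reference."""
    score_ratio = candidate["score"] / reference["score"]
    cost_ratio = candidate["cost"] / reference["cost"]
    return score_ratio, cost_ratio, score_ratio / cost_ratio


api = relative(qwen_gepa_api, sonnet_api)
direct = relative(qwen_direct, sonnet_direct)

print(f"api:    {api[1]:.0%} of cost, {api[2]:.1f}x quality per dollar")
print(f"direct: {direct[0]:.0%} of score, {direct[1]:.0%} of cost, "
      f"{direct[2]:.1f}x quality per dollar")
```

Under that assumption, the output matches the 14%, 72%, 21%, and 3.5× figures quoted above, and the API-slice ratio comes out at 6.8×.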
Three bars per slice: open-model baseline (left), frontier reference line, GEPA-optimized open model (right). Cost annotated under each bar.
Four sales tasks from AutomationBench's generic-API surface: negative selection, priority selection, implicit rules, cross-reference validation.
| run | setup | avg | strict | cost | quality per $ (vs. Sonnet) | notes |
|---|---|---|---|---|---|---|
| Sonnet 4.6 (frontier ceiling) | api · mean of 3 | 0.160 ±0.009 | 0% ±0% | $0.707 | 1.0× | |
| Qwen 3.6 Plus (open baseline) | api · mean of 3 | 0.084 ±0.073 | 0% ±0% | $0.095 | 3.9× | One of three replicates scored 0.000; Qwen's raw tool use has moderate rollout variance. |
| Qwen 3.6 Plus + GEPA (default temp) | api · mean of 3 | 0.157 ±0.124 | 8% ±14% | $0.102 | 6.8× | Default Fireworks temperature. High rollout variance (σ=0.124): one rep 0.300 with 1-of-4 strict; two reps 0.086. |
| Qwen 3.6 Plus + GEPA v3 @ temp=0 ★ | api · mean of 10 | 0.313 ±0.110 | 23% | $0.128 | 10.8× | v2 structure plus an explicit completion criterion. Confirmed at n=10 after earlier n=5 variance claims failed to replicate. |
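The "mean ± spread" cells can be checked against the per-replicate values given in the default-temperature GEPA note. This sketch assumes the ± figure is a sample standard deviation (n−1 denominator) and that the two 0.086 replicates had no strict passes; neither is stated outright on the page, but both are consistent with the reported 0.157 ±0.124 and 8% ±14%.

```python
from statistics import mean, stdev

# Per-replicate values read from the "Qwen 3.6 Plus + GEPA (default temp)" note:
# one replicate at 0.300 with 1-of-4 strict passes, two replicates at 0.086.
avg_scores = [0.300, 0.086, 0.086]
# Assumption: the two 0.086 replicates had zero strict passes.
strict_rates = [1 / 4, 0.0, 0.0]


def summarize(values):
    """Return (mean, sample standard deviation) over replicates."""
    return mean(values), stdev(values)  # stdev uses the n-1 denominator


score_mean, score_sd = summarize(avg_scores)
strict_mean, strict_sd = summarize(strict_rates)

print(f"avg    {score_mean:.3f} ±{score_sd:.3f}")
print(f"strict {strict_mean:.0%} ±{strict_sd:.0%}")
```

With only three replicates, a single high run moves both the mean and the spread substantially, which is why the n=10 row is the more reliable comparison.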
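Seven direct CRM-mutation tasks. Baseline Qwen doesn't beat Sonnet on score; it sits on the cost-quality Pareto frontier at 21% of the cost. The targeted v5 adapter (mean of 10) is the only Qwen row whose mean score exceeds Sonnet's.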
| run | setup | avg | strict | cost | quality per $ (vs. Sonnet) | notes |
|---|---|---|---|---|---|---|
| Sonnet 4.6 (frontier ceiling) | limited zapier · mean of 3 | 0.557 ±0.066 | 19% ±8% | $1.12 | 1.0× | |
| Qwen 3.6 Plus (open baseline) | limited zapier · mean of 3 | 0.400 ±0.067 | 14% ±14% | $0.230 | 3.5× | 7 write-heavy tasks: update contact phone, add to campaign, create note, create contact, create opportunity, advance opportunity stage, qualify lead. |
| Qwen 3.6 Plus + hand adapter | limited zapier · mean of 3 | 0.369 ±0.082 | 14% ±14% | $0.232 | 3.2× | The sales/API adapter tuned for reasoning-heavy API tasks does not transfer to write-heavy CRM tasks; adapter quality is slice-specific. |
| Qwen 3.6 Plus + targeted v5 adapter ★ | limited zapier · mean of 10 | 0.630 ±0.087 | 54% | $0.275 | 4.6× | Failure-mode hill climb: v4 fixed proof artifacts; v5 added latest-pricing/account-health rules. |
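The Pareto claim can be checked directly from the mean scores and costs in the direct-tool table. The sketch below treats a run as dominated when another run is at least as good on both score and cost and strictly better on one. It compares reported means only, so it ignores replicate variance and the fact that some rows are means of 3 and others of 10.

```python
# (label, mean score, cost in USD) copied from the direct-tool table.
runs = [
    ("Sonnet 4.6", 0.557, 1.12),
    ("Qwen 3.6 Plus open baseline", 0.400, 0.230),
    ("Qwen 3.6 Plus + hand adapter", 0.369, 0.232),
    ("Qwen 3.6 Plus + targeted v5 adapter", 0.630, 0.275),
]


def dominates(a, b):
    """True if a is at least as good as b on score and cost, and strictly better on one."""
    _, score_a, cost_a = a
    _, score_b, cost_b = b
    no_worse = score_a >= score_b and cost_a <= cost_b
    strictly_better = score_a > score_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_frontier(candidates):
    """Keep the runs that no other run dominates."""
    return [r for r in candidates if not any(dominates(o, r) for o in candidates)]


# Sonnet vs. the untuned baseline only, then all four rows.
pair_frontier = [label for label, _, _ in pareto_frontier(runs[:2])]
full_frontier = [label for label, _, _ in pareto_frontier(runs)]
print(pair_frontier)
print(full_frontier)
```

Comparing only Sonnet and the open baseline, both sit on the frontier, since neither is better on both axes. Across all four rows, the frontier is the open baseline and the targeted v5 adapter.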