Making small models reliable for repetitive tasks, without expensive training runs.
Frontier models are very good. They are also expensive. For bounded business workflows with clear success criteria, there is an opportunity to move from expensive generalists to faster, cheaper specialists.
The token-cost gap is real: a small open model can run 90× cheaper than a frontier model. The catch is that, off the shelf, those specialists often fail in ways that keep them out of production: malformed outputs, structural breakage, and the kind of small unreliability that is fatal at scale.
Most of that gap can be closed without training a thing. Instead of giving the model a blank page, give it a form to fill out. When the structure is already there, the model only has to supply values. The task is bounded, the output has less room to break, and the tokens you pay for are the ones carrying information.
Blank page
Here's the extracted information from the email:
```json
{
  "intent": "refund_request",
  "priority": "high",
  "summary": "Customer received damaged item and wants money back"
}
```
This appears to be a high-priority refund request related to damaged goods.
Every character of this is generated by the model.

Pre-filled form

```json
{
  "intent": "refund_request",
  "priority": "high",
  "summary": "Customer received damaged item and wants money back"
}
```

Only the three value strings are generated by the model. The braces, keys, quotes, commas, colons, and indentation are prefilled.
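The pre-filled form idea can be sketched in a few lines. This is a minimal illustration using the three-field form from the example above; the provider-specific completion call is omitted, and the helper names are illustrative. The point is that everything except the values is fixed ahead of time.

```python
import json

# The form's field order is fixed by the system, not by the model.
FIELDS = ["intent", "priority", "summary"]

def build_prefix(filled: dict, next_field: str) -> str:
    """Everything up to the opening quote of the next value is prefilled."""
    done = "".join(f'  "{k}": {json.dumps(v)},\n' for k, v in filled.items())
    return "{\n" + done + f'  "{next_field}": "'

def assemble(values: list) -> str:
    """Rebuild the full JSON document from the generated values alone."""
    return json.dumps(dict(zip(FIELDS, values)), indent=2)
```

The model only ever continues from `build_prefix(...)` and the system reassembles the document, so a malformed brace or missing key is impossible by construction.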
The form changes what the model has to generate. That alone saves money: output tokens are billed at a premium, often 3-5× the cost of input tokens, so cutting what the model produces is the highest-leverage cost reduction available. Same model, same task, fewer tokens, lower bill.
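A back-of-envelope calculation shows why moving scaffold tokens from the output side to the input side pays off. The prices and token counts here are illustrative, not any specific provider's rates; only the input/output premium matters.

```python
# Illustrative prices: $1/M input tokens, $5/M output tokens (a 5x premium).
IN_PRICE, OUT_PRICE = 1.0 / 1e6, 5.0 / 1e6

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Billed cost of one request under the illustrative prices above."""
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

# Free-form answer: prose wrapper plus the full JSON comes out of the model.
free_form = request_cost(500, 120)
# Prefilled form: the ~90-token scaffold moves to the cheap input side and
# only ~30 value tokens are generated.
prefilled = request_cost(590, 30)
```

Even though the prefilled request has *more* total tokens, it is cheaper, because the tokens that moved now sit on the cheap side of the premium.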
The more interesting consequence is what becomes possible after the task is simplified. Generating a paragraph of free-form English is a different problem than filling three string fields. Filling out a form is bounded, structurally constrained, and well within the competence of a much smaller model.
Once the form is in place, the question shifts from "how do we prompt the frontier model better" to "do we still need the frontier model at all." Sonnet is priced at $18/M blended tokens; Qwen3-8B serves at $0.20/M. The price gap is roughly 90×.
We already know the small model is cheaper, so the problem shifts to whether we can make it reliable enough for production. Prefill is the first step of the answer and the first rung on the optimization ladder.
Small models do not win everywhere. They win when the work is bounded: known environment, constrained actions, structured output, and an objective scorer. That is why this benchmark uses workflow execution instead of an open-ended chat task.
Raw Qwen3-8B was already close to frontier quality on the action-level scorer: 0.9560. But it was a bad production route: it used 1.39M eval tokens and had a 32.6s p50 latency. The model could often do the work, but it spent too much time and too many tokens getting there.
That distinction matters. A model can know enough to complete a bounded task and still be unusable because it rambles, thinks in the wrong mode, emits awkward structure, or burns tokens before taking the action.
The first useful intervention was JSON prefill. It does not change weights. It starts the model's answer in the structure the system needs. Raw Qwen3-8B without prefill scored 0.9560 and used 1.39M eval tokens. With JSON prefill, the same 8B model scored 0.9667 and used 74k eval tokens.
That is the key result: 18.8× fewer eval tokens, a small score lift, and no training. The first big win did not come from making the model smarter. It came from putting the model in the output mode the system needed.
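In practice, JSON prefill is usually implemented by seeding a partial assistant turn that the model continues. Whether a provider continues a partial assistant message varies, so the request shape below is an illustrative sketch, not a specific API.

```python
# The opening of the form; the model's answer starts inside this structure.
PREFILL = '{\n  "intent": "'

def prefilled_request(email_text: str) -> dict:
    """Build a chat request whose assistant turn is already started."""
    return {
        "messages": [
            {"role": "user",
             "content": f"Extract intent, priority, and summary as JSON:\n{email_text}"},
            # A partial assistant turn: the final output is PREFILL plus
            # whatever the model appends, so it cannot open with prose.
            {"role": "assistant", "content": PREFILL},
        ],
    }
```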
Qwen3 also supports `/no_think`, which tells the model not to spend tokens on extended thinking before producing the answer. Raw `/no_think` scored 0.9598 and used 38k eval tokens: 36.5× fewer tokens than raw no-prefill with near-equivalent quality.
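Wiring that control in is mostly string handling. A sketch under two assumptions: the `/no_think` directive is appended to the user turn, and the model may still emit an empty `<think>` block that the caller strips before parsing.

```python
import re

def no_think(user_message: str) -> str:
    """Append Qwen3's soft no-think switch to the user turn."""
    return f"{user_message} /no_think"

def strip_think(completion: str) -> str:
    """Drop any (possibly empty) <think>...</think> block before parsing."""
    return re.sub(r"<think>.*?</think>\s*", "", completion, flags=re.DOTALL)
```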
This is not a claim that reasoning is bad. It is a claim that this specific Operations slice did not need long visible deliberation for every action.
Fine-tuning still mattered, but not as the whole story. SFT + JSON prefill did not beat raw JSON prefill in aggregate. The more useful role for SFT was sparse repair: raw `/no_think` scored 0.9598; SFT + `/no_think` reached 0.9733 while staying around 39k eval tokens.
The right order is: do not fine-tune first. Find the cheapest control that works, then train only on the residual failures.
| question | result | interpretation |
|---|---|---|
| Did SFT explain the prefill lift? | Raw + prefill = 0.9667; SFT + prefill = 0.9667 | No. Prefill alone reproduced the score. |
| Was token budget the real bottleneck? | Raw no-prefill used 1.39M tokens and still ran slowly | Not by itself. The issue was thinking/output mode. |
| Did Qwen-native control help? | Raw `/no_think` = 0.9598 with 38k tokens | Yes. `/no_think` exposed latent competence cheaply. |
| Did SFT still add value? | SFT `/no_think` = 0.9733 vs raw `/no_think` = 0.9598 | Yes, as sparse reliability repair. |
| Did the route serve? | Fireworks SFT `/no_think` = 0.9630, 369ms p50 | Yes. Production-style serving validated. |
Same 30-task Operations holdout, 25 samples per task for the methodology rows. The Fireworks row is the production-style serving validation at 90 trajectories.
| route | avg | strict | p50 / p95 | tokens | serving cost | cost / score point |
|---|---|---|---|---|---|---|
| Qwen3-8B raw, no prefill (baseline) · The 8B model already had latent task competence, but default generation was slow and token-heavy. | 0.9560 ±0.1847 | 94.0% | 32.60s / 63.26s | 1.39M | $0.278 | $0.291 |
| Qwen3-8B + JSON prefill (structured output) · The key 8B win: prefill exposed the same competence with 18.8× fewer eval tokens. | 0.9667 ±0.1248 | 93.3% | 6.76s / 23.48s | 74k | $0.0148 | $0.0153 |
| SFT + JSON prefill (distillation) · No aggregate lift over the prefill-only route on this action-level scorer. | 0.9667 ±0.1248 | 93.3% | 6.63s / 26.96s | 75k | $0.0150 | $0.0155 |
| Qwen3-8B + /no_think (thinking control) · Qwen-native control cuts eval tokens 36.5× versus raw no-prefill with near-equivalent quality. | 0.9598 ±0.1314 | 91.2% | 5.34s / 21.39s | 38k | $0.0076 | $0.0079 |
| SFT + /no_think ★ (sparse repair) · SFT adds a small reliability repair over the trainingless /no_think route. | 0.9733 ±0.1124 | 94.7% | 5.52s / 23.80s | 39k | $0.0078 | $0.0080 |
| Fireworks SFT + /no_think (serving validation) · Production-style Fireworks serving validation for the compatible Qwen3-8B LoRA. | 0.9630 | 92.2% | 0.37s / 0.54s | 33k | $0.0066 | $0.0069 |
The holdout spans scheduling, compliance, facilities, legal documents, inventory, and cross-system execution across workplace tools.
| category | traces | 8B SFT | 30B teacher |
|---|---|---|---|
| Scheduling and coordination (kickoffs, room conflicts, trainings, maintenance windows) | 12 | 1.0000 | 1.0000 |
| Compliance and safety (hazmat, safety incidents, policy notices, sensor alerts) | 18 | 1.0000 | 1.0000 |
| Facilities and inventory (calibration, mailroom, fleet, perishable inventory) | 21 | 1.0000 | 1.0000 |
| Legal and document operations (leases, DocuSign, NDAs, archive workflows) | 12 | 0.8750 | 1.0000 |
| Cross-system workflow execution (Jira, Confluence, Pipefy, Notion, Mailchimp, Slack, Sheets) | 27 | 0.9444 | 0.9444 |
The remaining misses were specific, not random. Raw `/no_think` repeatedly missed lease archival, Jira/Confluence incident, and Pipefy vendor onboarding. SFT mostly repaired the incident workflow and partially improved lease archival. Pipefy vendor onboarding remains the stable repair target.
That is exactly the signal understudy wants: not just a score, but a repair map. The next training batch should focus on the residual workflow clusters instead of blindly adding more examples.
| residual cluster | raw `/no_think` behavior | SFT effect | next repair target |
|---|---|---|---|
| Jira / Confluence incident | Missed required cross-system action | Mostly repaired | Add incident workflow variants |
| Drive / Notion lease archive | Partial document/archive workflow | Partially improved | Add richer document-operation traces |
| Pipefy vendor onboarding | Stable repeated miss | Not repaired | Add targeted traces or RL/verifier training |
The easiest way to overfit an eval is to report one score and stop. Here, the strongest signal was not that SFT won everywhere. It did not. JSON prefill reproduced the SFT+prefill aggregate score without changing weights, SFT produced a modest repair rather than a magical rewrite, and the remaining misses stayed clustered enough to become a repair map.
The production-style comparison is Sonnet API versus the optimized Fireworks 8B route. On the same 90-trajectory scale, Sonnet scored 1.0000 with 1.935s p50 latency and about $0.040 measured token/eval cost. The Fireworks route scored 0.9630 with 369ms p50 latency and $0.0066 token cost.
That is the clean cost-and-latency story: roughly 5.2× lower p50 latency and 6.0× lower measured token cost, while preserving most of the frontier score on this bounded slice.
The Fireworks route used 33,111 measured tokens across 90 trajectories. At the Qwen3-8B serving basis of $0.20/M blended tokens, that is $0.0066 in model-token cost for the validation slice.
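The line item above is just observed tokens multiplied by the blended price, reproduced here from the numbers in the text:

```python
# Serving-cost line item for the validation slice.
tokens = 33_111            # measured tokens across the 90-trajectory slice
price_per_million = 0.20   # Qwen3-8B blended serving price, $/M tokens
cost = tokens / 1e6 * price_per_million   # dollars of model-token cost
```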
The normalized work-per-dollar number is still directionally useful, but the direct cost and latency are the cleaner public claim. The short-lived Fireworks deployment had a loaded validation cost of about $1.56 because cold-start GPU time dominated the tiny token bill. That should stay separate from steady-state serving economics.
Token costs throughout are observed tokens multiplied by the blended model price.
We do not need the small model to beat the frontier. We need to find the work where the frontier is overkill, then make the small model reliable enough to take that work off the frontier path.
The ladder is the product: find bounded work, make the cheap model reliable, validate it on holdouts, and promote only the routes that survive. Sometimes that is prefill. Sometimes it is `/no_think`. Sometimes it is SFT. The point is not to train by default. The point is to run the ladder.
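The ladder can be stated as a few lines of control flow. This is a sketch: the route names, threshold, and the `evaluate` holdout scorer are illustrative stand-ins, ordered cheapest-first so training only runs when the trainingless controls fall short.

```python
# Cheapest control first; fine-tuning is the last resort, not the default.
LADDER = ["json_prefill", "no_think", "sft_on_residuals"]

def run_ladder(evaluate, threshold=0.95):
    """Promote the first (cheapest) route whose holdout score clears the bar."""
    for route in LADDER:
        score = evaluate(route)
        if score >= threshold:
            return route, score
    return None  # nothing survived the bar; rethink the task boundary
```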