Making small models reliable for repetitive tasks, without expensive training runs.

Frontier models are very good. They are also expensive. For bounded business workflows with clear success criteria, there is an opportunity to move from expensive generalists to faster, cheaper specialists.

The token-cost gap is real: a small open model can run 90× cheaper than the frontier. The catch is that, off the shelf, those specialists often fail in ways that keep them out of production: malformed outputs, structural breakage, and the kind of small unreliability that is fatal at scale.

Most of that gap can be closed without training a thing. Instead of giving the model a blank page, give it a form to fill out. When the structure is already there, the model only has to supply values. The task is bounded, the output has less room to break, and the tokens you pay for are the ones carrying information.

blank page vs pre-filled form
Task: Extract intent, priority, and summary from this customer email.

Blank page (model writes everything):

Here's the extracted information from the email:

```json
{
  "intent": "refund_request",
  "priority": "high",
  "summary": "Customer received damaged item and wants money back"
}
```

This appears to be a high-priority refund request related to damaged goods.

Every character of that response is generated by the model: ~62 tokens.

Pre-filled form (model fills in the blanks):

```json
{
  "intent": "refund_request",
  "priority": "high",
  "summary": "Customer received damaged item and wants money back"
}
```

Only the three value strings are generated by the model; the braces, keys, quotes, commas, colons, and indentation are already present as prefilled structure: ~14 tokens, about 4.4× less to generate.
headline numbers

- 90× token price gap: $18/M Sonnet basis vs $0.20/M Qwen3-8B
- 0.9667 8B + prefill score: a trainingless JSON scaffold
- 18.8× fewer eval tokens: Qwen3-8B prefill vs raw no-prefill
- 36.5× fewer eval tokens: Qwen3-8B `/no_think` vs raw no-prefill
- +3.5pt strict-pass lift: SFT + `/no_think` vs raw `/no_think`
- 369ms p50 latency: Fireworks SFT + `/no_think`
- $0.0066 serving-basis token cost: Fireworks validation, 90 trajectories
Economics

The form changes what the model has to generate. That alone saves money: output tokens are billed at a premium, often 3-5× the cost of input tokens, so cutting what the model produces is the highest-leverage cost reduction available. Same model, same task, fewer tokens, lower bill.

The more interesting consequence is what becomes possible after the task is simplified. Generating a paragraph of free-form English is a different problem than filling three string fields. Filling out a form is bounded, structurally constrained, and well within the competence of a much smaller model.

Once the form is in place, the question shifts from "how do we prompt the frontier model better" to "do we still need the frontier model at all." Sonnet is priced at $18/M blended tokens; Qwen3-8B serves at $0.20/M. The price gap is roughly 90×.

the reason to care about small models
blended model-token serving basis before any optimization:

- Sonnet / generalist baseline: $18/M tokens
- Qwen3-8B route: $0.20/M tokens
- roughly 90× lower model-token price

We already know the small model is cheaper, so the problem shifts to whether we can make it reliable enough for production. Prefill is the first step of the answer and the first rung on the optimization ladder.

Fit

Small models do not win everywhere. They win when the work is bounded: known environment, constrained actions, structured output, and an objective scorer. That is why this benchmark uses workflow execution instead of an open-ended chat task.

where small-model replacement starts
repeated, structured work with objective checks is the first target. One axis runs from rare / open-ended to repeated / bounded work; the other from hard to verify to easy to verify. The target-first quadrant is repeated, bounded, and easy to verify: operations workflows with constrained actions, structured output, and an objective scorer.
Start

Raw Qwen3-8B was already close on the action-level scorer: 0.9560. But it was a bad production route. It used 1.39M eval tokens and had 32.6s p50 latency. The model could often do the work, but it spent too much time and too many tokens getting there.

That distinction matters. A model can know enough to complete a bounded task and still be unusable because it rambles, thinks in the wrong mode, emits awkward structure, or burns tokens before taking the action.

Prefill

The first useful intervention was JSON prefill. It does not change weights. It starts the model's answer in the structure the system needs. Raw Qwen3-8B without prefill scored 0.9560 and used 1.39M eval tokens. With JSON prefill, the same 8B model scored 0.9667 and used 74k eval tokens.

That is the key result: 18.8× fewer eval tokens, a small score lift, and no training. The first big win did not come from making the model smarter. It came from putting the model in the output mode the system needed.
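
Mechanically, prefill means the harness writes the opening of the JSON object into the assistant turn and asks the model only for a continuation. A minimal sketch, assuming a generic raw-completion callable (the `complete` function, prompt wording, and scaffold below are illustrative, not the benchmark's actual harness):

```python
# JSON prefill: the harness emits the structure, the model supplies values.
# `complete` is any prompt-in, text-out raw-completion function (hypothetical).
SCAFFOLD = '{\n  "intent": "'

def extract(email_text: str, complete) -> str:
    prompt = (
        "Extract intent, priority, and summary from this customer email "
        "as JSON.\n\n" + email_text + "\n\nAssistant: "
    )
    # Seed the assistant turn with the opening of the JSON object; the
    # model's first generated token lands inside the first value string.
    return SCAFFOLD + complete(prompt + SCAFFOLD)

# Stub standing in for a real model, to show the shape of the exchange:
stub = lambda _: 'refund_request",\n  "priority": "high",\n  "summary": "..."\n}'
print(extract("My order arrived broken. I want my money back.", stub))
```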

Thinking

Qwen3 also supports `/no_think`, which tells the model not to spend tokens on extended thinking before producing the answer. Raw `/no_think` scored 0.9598 and used 38k eval tokens: 36.5× fewer tokens than raw no-prefill with near-equivalent quality.

This is not a claim that reasoning is bad. It is a claim that this specific Operations slice did not need long visible deliberation for every action.
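
The switch itself is just text in the turn. A minimal sketch (the `chat` callable is hypothetical; only the `/no_think` suffix is Qwen3's documented soft switch):

```python
# Qwen3 soft switch: appending "/no_think" to the user turn asks the model
# to skip the extended-thinking block before answering.
# `chat` is any messages-in, text-out chat-completion function (hypothetical).
def ops_action(task: str, chat) -> str:
    messages = [{"role": "user", "content": task + " /no_think"}]
    return chat(messages)
```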

most waste was output-mode waste
log-scale token burn across the 8B optimization ladder:

- raw no prefill: 1.39M tokens, score 0.9560, p50 32.60s
- + JSON prefill: 74k tokens, score 0.9667, p50 6.76s
- + `/no_think`: 38k tokens, score 0.9598, p50 5.34s
- SFT + `/no_think`: 39k tokens, score 0.9733, p50 5.52s
Repair

Fine-tuning still mattered, but not as the whole story. SFT + JSON prefill did not beat raw JSON prefill in aggregate. The more useful role for SFT was sparse repair: raw `/no_think` scored 0.9598; SFT + `/no_think` reached 0.9733 while staying around 39k eval tokens.

The right order is: do not fine-tune first. Find the cheapest control that works, then train only on the residual failures.
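
As control flow, the ladder is short. A toy, self-contained sketch (every name is illustrative, with this writeup's scores reused as stand-in data):

```python
# Optimization ladder: trainingless controls first; fine-tune last, and
# only if the cheap rungs fall short of the bar.
def run_ladder(route, controls, finetune, score, bar=0.96):
    for control in controls:          # e.g. JSON prefill, /no_think
        candidate = control(route)
        if score(candidate) >= score(route):
            route = candidate         # keep any cheap win
        if score(route) >= bar:
            return route              # never touch a GPU
    return finetune(route)            # SFT as sparse repair on the residual

# Toy stand-ins reusing this writeup's numbers:
raw = {"avg": 0.9560}
prefill = lambda r: {"avg": 0.9667}
no_think = lambda r: {"avg": 0.9598}
sft = lambda r: {"avg": 0.9733}
print(run_ladder(raw, [prefill, no_think], sft, lambda r: r["avg"]))
# -> {'avg': 0.9667}: the prefill rung clears the bar, so no training runs.
```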

| question | result | interpretation |
| --- | --- | --- |
| Did SFT explain the prefill lift? | Raw + prefill = 0.9667; SFT + prefill = 0.9667 | No. Prefill alone reproduced the score. |
| Was token budget the real bottleneck? | Raw no-prefill used 1.39M tokens and still ran slowly | Not by itself. The issue was thinking/output mode. |
| Did Qwen-native control help? | Raw `/no_think` = 0.9598 with 38k tokens | Yes. `/no_think` exposed latent competence cheaply. |
| Did SFT still add value? | SFT `/no_think` = 0.9733 vs raw `/no_think` = 0.9598 | Yes, as sparse reliability repair. |
| Did the route serve? | Fireworks SFT `/no_think` = 0.9630, 369ms p50 | Yes. Production-style serving validated. |
Runs

Same 30-task Operations holdout, 25 samples per task for the methodology rows. The Fireworks row is the production-style serving validation at 90 trajectories.

| route | avg | strict | p50 / p95 | tokens | serving cost | cost / score point |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B raw no prefill | 0.9560 ±0.1847 | 94.0% | 32.60s / 63.26s | 1.39M | $0.278 | $0.291 |
| Qwen3-8B + JSON prefill | 0.9667 ±0.1248 | 93.3% | 6.76s / 23.48s | 74k | $0.0148 | $0.0153 |
| SFT + JSON prefill | 0.9667 ±0.1248 | 93.3% | 6.63s / 26.96s | 75k | $0.0150 | $0.0155 |
| Qwen3-8B + `/no_think` | 0.9598 ±0.1314 | 91.2% | 5.34s / 21.39s | 38k | $0.0076 | $0.0079 |
| SFT + `/no_think` | 0.9733 ±0.1124 | 94.7% | 5.52s / 23.80s | 39k | $0.0078 | $0.0080 |
| Fireworks SFT + `/no_think` | 0.9630 | 92.2% | 0.37s / 0.54s | 33k | $0.0066 | $0.0069 |

Route notes:

- Qwen3-8B raw no prefill · baseline: the 8B model already had latent task competence, but default generation was slow and token-heavy.
- Qwen3-8B + JSON prefill · structured output: the key 8B win; prefill exposed the same competence with 18.8× fewer eval tokens.
- SFT + JSON prefill · distillation: no aggregate lift over the 8B prefill-only route on this action-level scorer.
- Qwen3-8B + `/no_think` · thinking control: Qwen-native control cuts eval tokens 36.5× versus raw no-prefill with near-equivalent quality.
- SFT + `/no_think` · sparse repair: SFT adds a small reliability repair over the trainingless `/no_think` route.
- Fireworks SFT + `/no_think` · serving validation: production-style Fireworks serving validation for the compatible Qwen3-8B LoRA.
Serving cost uses the same serving-basis economics as `/bench`: observed tokens × blended model price. Sonnet/generalist frontier baseline is `$18/M` blended tokens; Qwen3-8B basis is `$0.20/M` blended tokens. Cost / score point = serving cost divided by average partial-credit score.
Tinker latency is eval-route latency. The Fireworks row is the production-style serving validation. Short-lived Fireworks deployment validation had a loaded cost of about $1.56 because cold-start GPU time dominated the tiny token cost; that is reported separately from steady-state serving economics.
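
As a worked check of that serving-basis formula, reproducing the Fireworks row:

```python
# Serving-basis economics: observed tokens × blended $/M price, then
# cost per point of average partial-credit score.
def serving_cost(tokens: int, price_per_million_usd: float) -> float:
    return tokens / 1_000_000 * price_per_million_usd

cost = serving_cost(33_111, 0.20)   # Fireworks row -> ~$0.0066
per_point = cost / 0.9630           # cost / score point -> ~$0.0069
print(f"${cost:.4f}  ${per_point:.4f}")
```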
Domain

The holdout spans scheduling, compliance, facilities, legal documents, inventory, and cross-system execution across workplace tools.

operations domain mix
SFT partial-credit score by work type, 90 replicated traces; per-category numbers are in the table below.
| category | traces | 8B SFT | 30B teacher |
| --- | --- | --- | --- |
| Scheduling and coordination (kickoffs, room conflicts, trainings, maintenance windows) | 12 | 1.0000 | 1.0000 |
| Compliance and safety (hazmat, safety incidents, policy notices, sensor alerts) | 18 | 1.0000 | 1.0000 |
| Facilities and inventory (calibration, mailroom, fleet, perishable inventory) | 21 | 1.0000 | 1.0000 |
| Legal and document operations (leases, DocuSign, NDAs, archive workflows) | 12 | 0.8750 | 1.0000 |
| Cross-system workflow execution (Jira, Confluence, Pipefy, Notion, Mailchimp, Slack, Sheets) | 27 | 0.9444 | 0.9444 |
Misses

The remaining misses were specific, not random. Raw `/no_think` repeatedly missed the lease-archival, Jira/Confluence incident, and Pipefy vendor-onboarding workflows. SFT mostly repaired the incident workflow and partially improved lease archival; Pipefy vendor onboarding remained a stable miss and is the next repair target.

That is exactly the signal understudy wants: not just a score, but a repair map. The next training batch should focus on the residual workflow clusters instead of blindly adding more examples.

| residual cluster | raw `/no_think` behavior | SFT effect | next repair target |
| --- | --- | --- | --- |
| Jira / Confluence incident | Missed required cross-system action | Mostly repaired | Add incident workflow variants |
| Drive / Notion lease archive | Partial document/archive workflow | Partially improved | Add richer document-operation traces |
| Pipefy vendor onboarding | Stable repeated miss | Not repaired | Add targeted traces or RL/verifier training |
Checks

The easiest way to overfit an eval is to report one score and stop. Here, the strongest signal was not that SFT won everywhere. It did not. JSON prefill reproduced the SFT+prefill aggregate score without changing weights, SFT produced a modest repair rather than a magical rewrite, and the remaining misses stayed clustered enough to become a repair map.

Serve

The production-style comparison is Sonnet API versus the optimized Fireworks 8B route. On the same 90-trajectory scale, Sonnet scored 1.0000 with 1.935s p50 latency and about $0.040 measured token/eval cost. The Fireworks route scored 0.9630 with 369ms p50 latency and $0.0066 token cost.

That is the clean cost-and-latency story: roughly 5.2× lower p50 latency and 6.0× lower measured token cost, while preserving most of the frontier score on this bounded slice.

serving validation
same 90-trajectory scale, frontier API vs optimized Fireworks route, p50 latency and measured token cost:

- Sonnet frontier (API frontier baseline): score 1.0000, 1.94s p50, $0.0400
- Fireworks 8B route (SFT + `/no_think`): score 0.9630, 369ms p50, $0.0066

Fireworks route: 5.2× lower p50 latency, 6.0× lower token cost.
Cost

The Fireworks route used 33,111 measured tokens across 90 trajectories. At the Qwen3-8B serving basis of $0.20/M blended tokens, that is $0.0066 in model-token cost for the validation slice.

The normalized work-per-dollar number is still directionally useful, but the direct cost and latency are the cleaner public claim. The short-lived Fireworks deployment had a loaded validation cost of about $1.56 because cold-start GPU time dominated the tiny token bill. That should stay separate from steady-state serving economics.

- Serving-basis model-token cost: observed tokens multiplied by blended model price.
- Frontier route: Sonnet scored 1.0000 at 1.935s p50 and $0.040 measured token/eval cost.
- Fireworks route: optimized Qwen3-8B scored 0.9630 at 369ms p50 and $0.0066 token cost.

Lessons

We do not need the small model to beat the frontier. We need to find the work where the frontier is overkill, then make the small model reliable enough to take that work off the frontier path.

The ladder is the product: find bounded work, make the cheap model reliable, validate it on holdouts, and promote only the routes that survive. Sometimes that is prefill. Sometimes it is `/no_think`. Sometimes it is SFT. The point is not to train by default. The point is to run the ladder.