Making small models reliable for repetitive tasks, without expensive training runs.
Frontier models are very good. They are also expensive. For bounded business workflows with clear success criteria, there is an opportunity to move from expensive generalists to faster, cheaper specialists.
The token-cost gap is real: a small open model can run 90× cheaper than a frontier model. The catch is that, off the shelf, those specialists often fail in ways that keep them out of production: malformed outputs, structural breakage, and the kind of small unreliability that is fatal at scale.
Most of that gap can be closed without training a thing. Instead of giving the model a blank page, give it a form to fill out. When the structure is already there, the model only has to supply values. The task is bounded, the output has less room to break, and the tokens you pay for are the ones carrying information.
Blank page
Here's the extracted information from the email:
```json
{
  "intent": "refund_request",
  "priority": "high",
  "summary": "Customer received damaged item and wants money back"
}
```
This appears to be a high-priority refund request related to damaged goods.
Every character of this is generated by the model.

Pre-filled form

```json
{
  "intent": "refund_request",
  "priority": "high",
  "summary": "Customer received damaged item and wants money back"
}
```

Only the three value strings are generated by the model. The braces, keys, quotes, commas, colons, and indentation are prefilled.
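The pre-filled form idea can be sketched in a few lines. This is a minimal illustration using the three-field form from the example above; the provider-specific completion call is omitted, and the helper names are illustrative. The point is that everything except the values is fixed ahead of time.

```python
import json

# The form's field order is fixed by the system, not by the model.
FIELDS = ["intent", "priority", "summary"]

def build_prefix(filled: dict, next_field: str) -> str:
    """Everything up to the opening quote of the next value is prefilled."""
    done = "".join(f'  "{k}": {json.dumps(v)},\n' for k, v in filled.items())
    return "{\n" + done + f'  "{next_field}": "'

def assemble(values: list) -> str:
    """Rebuild the full JSON document from the generated values alone."""
    return json.dumps(dict(zip(FIELDS, values)), indent=2)
```

The model only ever continues from `build_prefix(...)` and the system reassembles the document, so a malformed brace or missing key is impossible by construction.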
The form changes what the model has to generate. That alone saves money: output tokens are billed at a premium, often 3-5× the cost of input tokens, so cutting what the model produces is the highest-leverage cost reduction available. Same model, same task, fewer tokens, lower bill.
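A back-of-envelope calculation shows why moving scaffold tokens from the output side to the input side pays off. The prices and token counts here are illustrative, not any specific provider's rates; only the input/output premium matters.

```python
# Illustrative prices: $1/M input tokens, $5/M output tokens (a 5x premium).
IN_PRICE, OUT_PRICE = 1.0 / 1e6, 5.0 / 1e6

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Billed cost of one request under the illustrative prices above."""
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

# Free-form answer: prose wrapper plus the full JSON comes out of the model.
free_form = request_cost(500, 120)
# Prefilled form: the ~90-token scaffold moves to the cheap input side and
# only ~30 value tokens are generated.
prefilled = request_cost(590, 30)
```

Even though the prefilled request has *more* total tokens, it is cheaper, because the tokens that moved now sit on the cheap side of the premium.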
The more interesting consequence is what becomes possible after the task is simplified. Generating a paragraph of free-form English is a different problem than filling three string fields. Filling out a form is bounded, structurally constrained, and well within the competence of a much smaller model.
Once the form is in place, the question shifts from "how do we prompt the frontier model better" to "do we still need the frontier model at all." Sonnet is priced at $18/M blended tokens; Qwen3-8B serves at $0.20/M. The price gap is roughly 90×.
We already know the small model is cheaper, so the problem shifts to whether we can make it reliable enough for production. Prefill is the first step of the answer and the first rung on the optimization ladder.
Small models do not win everywhere. They win when the work is bounded: known environment, constrained actions, structured output, and an objective scorer. That is why this benchmark uses workflow execution instead of an open-ended chat task.
Raw Qwen3-8B was already close to frontier quality on the action-level scorer: 0.9560. But it was a bad production route: it used 1.39M eval tokens and had a 32.6s p50 latency. The model could often do the work, but it spent too much time and too many tokens getting there.
That distinction matters. A model can know enough to complete a bounded task and still be unusable because it rambles, thinks in the wrong mode, emits awkward structure, or burns tokens before taking the action.
The first useful intervention was JSON prefill. It does not change weights. It starts the model's answer in the structure the system needs. Raw Qwen3-8B without prefill scored 0.9560 and used 1.39M eval tokens. With JSON prefill, the same 8B model scored 0.9667 and used 74k eval tokens.
That is the key result: 18.8× fewer eval tokens, a small score lift, and no training. The first big win did not come from making the model smarter. It came from putting the model in the output mode the system needed.
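In practice, JSON prefill is usually implemented by seeding a partial assistant turn that the model continues. Whether a provider continues a partial assistant message varies, so the request shape below is an illustrative sketch, not a specific API.

```python
# The opening of the form; the model's answer starts inside this structure.
PREFILL = '{\n  "intent": "'

def prefilled_request(email_text: str) -> dict:
    """Build a chat request whose assistant turn is already started."""
    return {
        "messages": [
            {"role": "user",
             "content": f"Extract intent, priority, and summary as JSON:\n{email_text}"},
            # A partial assistant turn: the final output is PREFILL plus
            # whatever the model appends, so it cannot open with prose.
            {"role": "assistant", "content": PREFILL},
        ],
    }
```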
Qwen3 also supports `/no_think`, which tells the model not to spend tokens on extended thinking before producing the answer. Raw `/no_think` scored 0.9598 and used 38k eval tokens: 36.5× fewer tokens than raw no-prefill with near-equivalent quality.
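Wiring that control in is mostly string handling. A sketch under two assumptions: the `/no_think` directive is appended to the user turn, and the model may still emit an empty `<think>` block that the caller strips before parsing.

```python
import re

def no_think(user_message: str) -> str:
    """Append Qwen3's soft no-think switch to the user turn."""
    return f"{user_message} /no_think"

def strip_think(completion: str) -> str:
    """Drop any (possibly empty) <think>...</think> block before parsing."""
    return re.sub(r"<think>.*?</think>\s*", "", completion, flags=re.DOTALL)
```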
This is not a claim that reasoning is bad. It is a claim that this specific Operations slice did not need long visible deliberation for every action.
Fine-tuning still mattered, but not as the whole story. SFT + JSON prefill did not beat raw JSON prefill in aggregate. The more useful role for SFT was sparse repair: raw `/no_think` scored 0.9598; SFT + `/no_think` reached 0.9733 while staying around 39k eval tokens.
The right order is: do not fine-tune first. Find the cheapest control that works, then train only on the residual failures.
| question | result | interpretation |
|---|---|---|
| Did SFT explain the prefill lift? | Raw + prefill = 0.9667; SFT + prefill = 0.9667 | No. Prefill alone reproduced the score. |
| Was token budget the real bottleneck? | Raw no-prefill used 1.39M tokens and still ran slowly | Not by itself. The issue was thinking/output mode. |
| Did Qwen-native control help? | Raw `/no_think` = 0.9598 with 38k tokens | Yes. `/no_think` exposed latent competence cheaply. |
| Did SFT still add value? | SFT `/no_think` = 0.9733 vs raw `/no_think` = 0.9598 | Yes, as sparse reliability repair. |
| Did the route serve? | Fireworks SFT `/no_think` = 0.9630, 369ms p50 | Yes. Production-style serving validated. |
Same 30-task Operations holdout, 25 samples per task for the methodology rows. The Fireworks row is the production-style serving validation at 90 trajectories.
| route | avg | strict | p50 / p95 | tokens | serving cost | cost / score point |
|---|---|---|---|---|---|---|
| Qwen3-8B raw, no prefill (baseline) · The 8B model already had latent task competence, but default generation was slow and token-heavy. | 0.9560 ±0.1847 | 94.0% | 32.60s / 63.26s | 1.39M | $0.278 | $0.291 |
| Qwen3-8B + JSON prefill (structured output) · The key 8B win: prefill exposed the same competence with 18.8× fewer eval tokens. | 0.9667 ±0.1248 | 93.3% | 6.76s / 23.48s | 74k | $0.0148 | $0.0153 |
| SFT + JSON prefill (distillation) · No aggregate lift over the prefill-only route on this action-level scorer. | 0.9667 ±0.1248 | 93.3% | 6.63s / 26.96s | 75k | $0.0150 | $0.0155 |
| Qwen3-8B + /no_think (thinking control) · Qwen-native control cuts eval tokens 36.5× versus raw no-prefill with near-equivalent quality. | 0.9598 ±0.1314 | 91.2% | 5.34s / 21.39s | 38k | $0.0076 | $0.0079 |
| SFT + /no_think ★ (sparse repair) · SFT adds a small reliability repair over the trainingless /no_think route. | 0.9733 ±0.1124 | 94.7% | 5.52s / 23.80s | 39k | $0.0078 | $0.0080 |
| Fireworks SFT + /no_think (serving validation) · Production-style Fireworks serving validation for the compatible Qwen3-8B LoRA. | 0.9630 | 92.2% | 0.37s / 0.54s | 33k | $0.0066 | $0.0069 |
The holdout spans scheduling, compliance, facilities, legal documents, inventory, and cross-system execution across workplace tools.
| category | traces | 8B SFT | 30B teacher |
|---|---|---|---|
| Scheduling and coordination (kickoffs, room conflicts, trainings, maintenance windows) | 12 | 1.0000 | 1.0000 |
| Compliance and safety (hazmat, safety incidents, policy notices, sensor alerts) | 18 | 1.0000 | 1.0000 |
| Facilities and inventory (calibration, mailroom, fleet, perishable inventory) | 21 | 1.0000 | 1.0000 |
| Legal and document operations (leases, DocuSign, NDAs, archive workflows) | 12 | 0.8750 | 1.0000 |
| Cross-system workflow execution (Jira, Confluence, Pipefy, Notion, Mailchimp, Slack, Sheets) | 27 | 0.9444 | 0.9444 |
The remaining misses were specific, not random. Raw `/no_think` repeatedly missed lease archival, Jira/Confluence incident, and Pipefy vendor onboarding. SFT mostly repaired the incident workflow and partially improved lease archival. Pipefy vendor onboarding remains the stable repair target.
That is exactly the signal understudy wants: not just a score, but a repair map. The next training batch should focus on the residual workflow clusters instead of blindly adding more examples.
| residual cluster | raw `/no_think` behavior | SFT effect | next repair target |
|---|---|---|---|
| Jira / Confluence incident | Missed required cross-system action | Mostly repaired | Add incident workflow variants |
| Drive / Notion lease archive | Partial document/archive workflow | Partially improved | Add richer document-operation traces |
| Pipefy vendor onboarding | Stable repeated miss | Not repaired | Add targeted traces or RL/verifier training |
The easiest way to overfit an eval is to report one score and stop. Here, the strongest signal was not that SFT won everywhere. It did not. JSON prefill reproduced the SFT+prefill aggregate score without changing weights, SFT produced a modest repair rather than a magical rewrite, and the remaining misses stayed clustered enough to become a repair map.
The production-style comparison is Sonnet API versus the optimized Fireworks 8B route. On the same 90-trajectory scale, Sonnet scored 1.0000 with 1.935s p50 latency and about $0.040 measured token/eval cost. The Fireworks route scored 0.9630 with 369ms p50 latency and $0.0066 token cost.
That is the clean cost-and-latency story: roughly 5.2× lower p50 latency and 6.0× lower measured token cost, while preserving most of the frontier score on this bounded slice.
The Fireworks route used 33,111 measured tokens across 90 trajectories. At the Qwen3-8B serving basis of $0.20/M blended tokens, that is $0.0066 in model-token cost for the validation slice.
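The line item above is just observed tokens multiplied by the blended price, reproduced here from the numbers in the text:

```python
# Serving-cost line item for the validation slice.
tokens = 33_111            # measured tokens across the 90-trajectory slice
price_per_million = 0.20   # Qwen3-8B blended serving price, $/M tokens
cost = tokens / 1e6 * price_per_million   # dollars of model-token cost
```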
The normalized work-per-dollar number is still directionally useful, but the direct cost and latency are the cleaner public claim. The short-lived Fireworks deployment had a loaded validation cost of about $1.56 because cold-start GPU time dominated the tiny token bill. That should stay separate from steady-state serving economics.
Token costs throughout are observed tokens multiplied by the blended model price.
We do not need the small model to beat the frontier. We need to find the work where the frontier is overkill, then make the small model reliable enough to take that work off the frontier path.
The ladder is the product: find bounded work, make the cheap model reliable, validate it on holdouts, and promote only the routes that survive. Sometimes that is prefill. Sometimes it is `/no_think`. Sometimes it is SFT. The point is not to train by default. The point is to run the ladder.
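The ladder can be stated as a few lines of control flow. This is a sketch: the route names, threshold, and the `evaluate` holdout scorer are illustrative stand-ins, ordered cheapest-first so training only runs when the trainingless controls fall short.

```python
# Cheapest control first; fine-tuning is the last resort, not the default.
LADDER = ["json_prefill", "no_think", "sft_on_residuals"]

def run_ladder(evaluate, threshold=0.95):
    """Promote the first (cheapest) route whose holdout score clears the bar."""
    for route in LADDER:
        score = evaluate(route)
        if score >= threshold:
            return route, score
    return None  # nothing survived the bar; rethink the task boundary
```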