Evals / 2026-05-19 / 7 min

Small Models Need Output Control Before Training

Understudy's operations benchmark showed that scaffolding and output control can make small models reliable before sparse fine-tuning work begins.

Small models often fail because the task contract is loose. They ramble, wrap JSON in prose, choose the wrong action shape, or spend tokens thinking when the product needs a bounded answer.

Many production workflows need reliable structured behavior, not broad prose ability: update a contact, classify a ticket, choose a next action, extract a field, or return a JSON object that downstream code can trust.

In the Understudy Operations benchmark, the starting economics were obvious. A Sonnet-style generalist sat on an $18 per million blended token basis while the Qwen3-8B route sat near $0.20 per million tokens. The price gap was about 90x before any optimization.

Training did not close the quality gap first. A trainingless JSON scaffold pushed Qwen3-8B with prefill to a 0.9667 score. The same controls reduced eval-token usage sharply: 18.8x fewer eval tokens for Qwen3-8B prefill versus raw no-prefill, and 36.5x fewer eval tokens for the `/no_think` route versus raw no-prefill.

Output control is part of model optimization. Before collecting more labels or launching a fine-tune, teams should check for a stable schema, a constrained action space, a parser, a repair loop, and a scoring function that rewards the actual contract.

Sparse fine-tuning still mattered. The SFT plus `/no_think` route lifted strict-pass performance by 3.5 points versus raw `/no_think`, and the served 8B route validated at 369ms p50 on Fireworks in the benchmark slice. But the training worked against a cleaner contract because the output surface had already been tightened.

Agent workflows expose this quickly. A frontier model can hide sloppy interfaces with extra reasoning. A small specialist needs a smaller, better-defined job.

The sequence is concrete: constrain the output, evaluate the parser, freeze examples, test the cheap route, repair the prompt and schema, then train only where the eval proves training is necessary. That keeps the expensive work focused and makes the final specialist easier to trust.

workflow-eval pilot

Turn one brittle structured-output task into an eval.

If a small model is close but unreliable, start with the contract: schema, parser, examples, review rubric, and held-out eval. Training should come after the output surface is measurable.

apply for private preview read the operations benchmark

bench-operations glossary#evals research/self-distillation-lets-ai-teach-itself contact