Model optimization / 2026-05-23 / 7 min

The Optimization Ladder: Prompts, SFT, RL, and Routing

LLM optimization should climb from cheap control fixes to heavier training only when the eval proves the next step is worth it.

Teams often treat LLM optimization as a menu: prompt engineering, fine-tuning, routing, reinforcement learning, or a smaller model. That framing hides the important question. Which step is the cheapest one that clears the product eval?

The ladder starts with the task contract. Before training anything, define the input shape, output shape, success condition, parser, reviewer rubric, and held-out examples. If the contract is loose, every later step gets harder to interpret.

Prompt and scaffold changes are first because they are cheap. A stricter schema, fewer degrees of freedom, prefilled JSON, shorter instructions, or a `/no_think` route can turn a rambling model into a bounded worker. In the operations benchmark, output scaffolding pushed Qwen3-8B with prefill to a 0.9667 score before sparse fine-tuning added another gain.

Routing comes next when the workflow has natural difficulty bands. Easy rows can go to a specialist. Ambiguous rows can escalate to a frontier model. Disagreement rows can become review work. Routing is not just provider selection; it is a way to spend frontier tokens where they create new information.

Supervised fine-tuning is useful when the model understands the task but misses the house style, schema, label boundary, or action pattern. SFT should train on selected examples from the workflow, not every log line the system has ever seen.

Reinforcement learning belongs later, when the reward is real enough to optimize. The reward can come from tests, parsers, expert preferences, task outcomes, or environment feedback. If the reward is noisy or easy to exploit, RL will amplify the wrong behavior.

The ladder is not strictly linear. A routing rule can reveal a prompt bug. A failed SFT run can expose a weak eval. A control slice can show that the frontier baseline changed. The point is to make each step measurable, cheap to compare, and reversible.

Understudy uses the ladder to keep optimization honest. Candidate prompts, routes, and models have to beat the held-out eval before they replace the baseline. That discipline turns model work from one-off tuning into a repeatable system for cheaper, faster, more specialized intelligence.

optimization ladder

Find the cheapest step that clears your eval.

Bring a repeated workflow and a frontier baseline. Understudy can test prompts, scaffolds, routes, SFT, and RL as a ladder instead of jumping straight to expensive training.