How to Cut LLM Cost Without Making the Product Worse
Cost reduction only matters when quality survives. Understudy's sentiment benchmark shows how a specialist open model can make warehouse-scale labeling viable without giving up frontier-style coverage.
Cutting LLM cost starts with the workload, not the model menu. A cheaper model helps only when the task is repetitive enough for a specialist to replace a generalist without damaging the user outcome.
Our warehouse-scale sentiment benchmark shows the pattern. We labeled 39,962 non-empty YouTube comments from a Snowflake-backed table with three models: an Understudy post-trained 30B open model, Sonnet, and Opus. The job was broad classification with a sparse business label, not open-ended reasoning.
The open model cost $2.82 for the full table. Sonnet cost about $12. Opus cost about $140. That made the open route roughly 4.4x cheaper than Sonnet and 50x cheaper than Opus on the same workload.
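The arithmetic behind those multiples is easy to reproduce. The sketch below uses only the totals quoted above. The Sonnet and Opus figures are the rounded "about" values, so the ratios land near the quoted multiples rather than exactly on them. The dictionary keys and function names are illustrative, not part of any Understudy tooling.

```python
# Totals quoted in the benchmark; Sonnet and Opus are rounded ("about") figures.
ROWS = 39_962
COSTS_USD = {"open_30b": 2.82, "sonnet": 12.0, "opus": 140.0}


def cost_per_1k_rows(total_usd: float, rows: int = ROWS) -> float:
    """Normalize a full-table cost to dollars per 1,000 labeled rows."""
    return total_usd / rows * 1000


def times_cheaper(baseline_usd: float, candidate_usd: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    return baseline_usd / candidate_usd


if __name__ == "__main__":
    for name, total in COSTS_USD.items():
        print(f"{name}: ${cost_per_1k_rows(total):.4f} per 1k rows")
    for name in ("sonnet", "opus"):
        ratio = times_cheaper(COSTS_USD[name], COSTS_USD["open_30b"])
        print(f"open_30b vs {name}: {ratio:.1f}x cheaper")
```

Normalizing to cost per 1,000 rows makes it easier to project the same comparison onto a table of a different size.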
Price alone is not the claim. The models had 99.30 percent three-way agreement on the explicit theater-intent label. That does not prove every label is perfect. It shows the cheaper route was close enough to move the remaining work into adjudication, reruns, and targeted improvement instead of paying frontier rates for every row.
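One way to compute an agreement figure like this is sketched below. It assumes three-way agreement means the share of rows where all three models assigned the same label. That definition is an assumption for illustration, as are the function names. The second helper returns the rows that would go to adjudication.

```python
from typing import List, Sequence


def three_way_agreement(a: Sequence[str], b: Sequence[str], c: Sequence[str]) -> float:
    """Share of rows where all three label lists agree (assumed definition)."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("label lists must have the same length")
    if len(a) == 0:
        raise ValueError("label lists must not be empty")
    agree = sum(1 for x, y, z in zip(a, b, c) if x == y == z)
    return agree / len(a)


def disagreement_rows(a: Sequence[str], b: Sequence[str], c: Sequence[str]) -> List[int]:
    """Row indexes where at least one model differs; candidates for review."""
    return [i for i, (x, y, z) in enumerate(zip(a, b, c)) if not (x == y == z)]


if __name__ == "__main__":
    # Toy labels, not benchmark data.
    open_model = ["no", "no", "yes", "no"]
    sonnet = ["no", "no", "yes", "yes"]
    opus = ["no", "no", "yes", "no"]
    print(three_way_agreement(open_model, sonnet, opus))
    print(disagreement_rows(open_model, sonnet, opus))
```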
When the first open model is close, frontier calls can focus on hard cases, disagreement rows, rubric repair, and eval creation. The specialist handles the broad repeatable layer. The generalist becomes a teacher, critic, or escalation path.
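That division of labor can be sketched as a simple escalation rule. The example assumes the specialist returns a label plus a confidence score and that low-confidence rows go to the frontier model. Both callables, the confidence score, and the 0.9 threshold are hypothetical. A real deployment might escalate on other signals, such as disagreement with a second model.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Labeled(NamedTuple):
    row_id: str
    label: str
    source: str  # "specialist" or "frontier"


def label_with_escalation(
    rows: Iterable[Tuple[str, str]],
    specialist: Callable[[str], Tuple[str, float]],  # hypothetical: returns (label, confidence)
    frontier: Callable[[str], str],  # hypothetical: returns label
    min_confidence: float = 0.9,  # illustrative threshold, not a recommendation
) -> List[Labeled]:
    """Label every row with the specialist; escalate only low-confidence rows."""
    results = []
    for row_id, text in rows:
        label, confidence = specialist(text)
        if confidence >= min_confidence:
            results.append(Labeled(row_id, label, "specialist"))
        else:
            results.append(Labeled(row_id, frontier(text), "frontier"))
    return results


def escalation_rate(results: List[Labeled]) -> float:
    """Share of rows that needed a frontier call."""
    if not results:
        return 0.0
    return sum(r.source == "frontier" for r in results) / len(results)
```

Tracking the escalation rate alongside cost shows how much of the workload still depends on frontier pricing.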
Generic routing asks which provider should answer the next request. Understudy asks which repeated workflow should become its own model, what evidence proves quality, and where a control slice still needs a frontier baseline.
Product teams should optimize by workload. Pick a high-volume task with clear labels, freeze the eval, measure the frontier baseline, train or tune the specialist, and keep a small control slice running so regressions show up before users feel them.
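The control slice in that loop can be sketched as follows, assuming each row has a stable string ID. Hashing the ID keeps the same rows in the slice on every refresh, so agreement with the frontier baseline stays comparable over time. The 1 percent fraction, the 0.98 threshold, and the function names are illustrative assumptions.

```python
import hashlib
from typing import Dict, Tuple


def in_control_slice(row_id: str, fraction: float = 0.01) -> bool:
    """Deterministically assign a row to the control slice by hashing its ID."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(fraction * 2**64)


def check_regression(
    specialist_labels: Dict[str, str],
    frontier_labels: Dict[str, str],
    min_agreement: float = 0.98,  # illustrative threshold
) -> Tuple[float, bool]:
    """Compare labels on shared control rows; return (agreement, passed)."""
    shared = specialist_labels.keys() & frontier_labels.keys()
    if not shared:
        raise ValueError("no shared control rows to compare")
    matches = sum(specialist_labels[k] == frontier_labels[k] for k in shared)
    agreement = matches / len(shared)
    return agreement, agreement >= min_agreement
```

Only rows where `in_control_slice` returns true need a frontier call, which keeps the baseline cost small while still surfacing drift.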
A lower bill is useful. The larger effect is product scope: more rows, more labels, more refreshes, more customers, and more product surface area at the same budget.
Have a high-volume LLM workflow that should be cheaper?
Bring one repeated task, a frontier baseline, and a reviewer who knows what good looks like. Understudy can test whether a specialist route preserves quality before it touches production.