Three anonymized model-optimization case studies.
These public case studies use anonymized benchmark evidence from the bench pages. Each one states the workload, quality contract, optimization path, measured result, and explicit non-claims.
Workflow
State the repeated product task, current model, traffic shape, and why the team cares: cost, latency, ownership, reliability, or product scope.
Quality contract
Lock the eval, labels, expert-review rubric, parser contract, and holdout set before claiming a model is better.
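One way to make "locked" checkable is to fingerprint the contract before any optimization work starts and compare against that fingerprint later. The sketch below is a minimal illustration, assuming the contract can be serialized as JSON. Every field name and value in it is hypothetical and not taken from the case studies.

```python
import hashlib
import json


def contract_fingerprint(contract: dict) -> str:
    """Return a stable SHA-256 hex digest of a quality contract."""
    # sort_keys plus fixed separators give a canonical serialization,
    # so key order alone never changes the digest.
    canonical = json.dumps(contract, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_unchanged(current: dict, locked_digest: str) -> None:
    """Raise if the contract was edited after it was locked."""
    if contract_fingerprint(current) != locked_digest:
        raise ValueError("quality contract changed after lock")


# Hypothetical contract: all names and values are illustrative only.
contract = {
    "eval_name": "example-eval",
    "label_set": ["negative", "neutral", "positive"],
    "rubric_version": "v1",
    "parser_contract": "strict-json",
    "holdout_ids_sha256": hashlib.sha256(b"id-001\nid-002\n").hexdigest(),
}

# Record this digest before comparing any candidate model.
locked = contract_fingerprint(contract)
```

Calling `assert_unchanged(contract, locked)` before each comparison run would then catch a rubric or holdout edit made after the lock.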
Optimization path
Document what changed: prompt repair, structured output control, routing, supervised fine-tuning, reinforcement learning, or serving.
Result
Report score, latency, cost, sample size, baseline, and non-claims. Keep the frontier control slice visible after the route moves to production.
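The reporting fields above can be sketched as a small record that refuses to exist without a sample size or an explicit non-claim. This is an illustrative structure only: the class name, field names, units, and example values are assumptions, not the format used on the bench pages.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResultReport:
    """Hypothetical result record; field names and units are assumptions."""

    score: float
    baseline_score: float
    latency_p50_ms: float
    cost_per_1k_usd: float
    sample_size: int
    non_claims: Tuple[str, ...]

    def __post_init__(self) -> None:
        # A result without a sample size or a stated non-claim is incomplete.
        if self.sample_size <= 0:
            raise ValueError("sample_size must be positive")
        if not self.non_claims:
            raise ValueError("state at least one explicit non-claim")

    def delta_vs_baseline(self) -> float:
        """Score difference against the baseline on the same eval."""
        return self.score - self.baseline_score


# Placeholder values for illustration, not measured results.
report = ResultReport(
    score=0.9,
    baseline_score=0.85,
    latency_p50_ms=400.0,
    cost_per_1k_usd=0.5,
    sample_size=500,
    non_claims=("Not evaluated outside the stated workload",),
)
```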
Warehouse-scale sentiment labeling
Anonymized benchmark case study
39,962 comments labeled at 4.4x lower cost than Sonnet and 50x lower cost than Opus.
A post-trained 30B open model produced frontier-like aggregate rates on dense sentiment labels and close agreement on sparse theater-intent labels.
Operations JSON workflow repair
Anonymized benchmark case study
Qwen3-8B route validated at 369 ms p50 with strict output controls and sparse SFT repair.
The benchmark separates output-control gains from training gains, so the case study can show what changed before the final served route.
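Separating the two kinds of gain amounts to a staged ablation: score the baseline, then the same model with output controls, then the trained model with the same controls, all on the same holdout. The sketch below shows the arithmetic under that assumption. The function name and the pass counts are hypothetical and are not results from the benchmark.

```python
def decompose_gains(baseline: int, with_controls: int, with_controls_and_sft: int) -> dict:
    """Split the total improvement into output-control and training parts.

    Inputs are counts of passing examples on the same fixed holdout set,
    measured at three stages: baseline, baseline plus output controls,
    and output controls plus SFT.
    """
    return {
        "output_control_gain": with_controls - baseline,
        "training_gain": with_controls_and_sft - with_controls,
        "total_gain": with_controls_and_sft - baseline,
    }


# Hypothetical pass counts out of 100 holdout examples.
gains = decompose_gains(baseline=60, with_controls=85, with_controls_and_sft=93)
print(gains)
# {'output_control_gain': 25, 'training_gain': 8, 'total_gain': 33}
```

The split depends on the order of the stages, so a report built this way should state that controls were applied before training was measured.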
Sales agent workflow optimization
Anonymized benchmark case study
Optimized open-model routes matched or beat frontier references on measured AutomationBench sales slices at lower cost.
The public page reports task slices, replicate counts, frontier baselines, and cost-normalized quality instead of broad model leaderboard claims.
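The page does not define "cost-normalized quality" here, so the following is only one plausible reading: the mean score across replicates on a task slice divided by the cost of running that slice. The function name, the units, and the example numbers are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def cost_normalized_quality(replicate_scores: Sequence[float], cost_usd: float) -> float:
    """Mean replicate score per dollar (an assumed definition)."""
    if not replicate_scores:
        raise ValueError("need at least one replicate")
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return mean(replicate_scores) / cost_usd


# Hypothetical slice: three replicates, each scored in [0, 1], costing $2.00.
quality_per_dollar = cost_normalized_quality([0.8, 0.9, 1.0], cost_usd=2.0)
```

Under this definition a route with a slightly lower score can still rank higher than a frontier baseline when its cost is much lower, which is why the raw score and the baseline belong next to the normalized figure.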
Named customer stories can replace these once approval is explicit. Until then, the anonymized versions keep the evidence public without inventing customer details.
For the broader comparison framework, see open models vs frontier models for production AI.