Three anonymized model-optimization case studies.

These public case studies draw on anonymized evidence from the benchmark pages. Each one states the workload, quality contract, optimization path, measured result, and explicit non-claims.

Template

Workflow

State the repeated product task, current model, traffic shape, and why the team cares: cost, latency, ownership, reliability, or product scope.

Quality contract

Lock the eval, labels, expert-review rubric, parser contract, and holdout set before claiming a model is better.
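One way to make "lock" concrete is to fingerprint the contract before any optimization work starts and refuse to compare models if the fingerprint later changes. The sketch below is a minimal illustration under assumptions: the field names, the example values, and the helpers `contract_fingerprint` and `assert_unchanged` are hypothetical and are not taken from the case studies.

```python
import hashlib
import json


def contract_fingerprint(contract: dict) -> str:
    """Return a stable SHA-256 digest of the contract contents."""
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(contract, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_unchanged(current: dict, locked_digest: str) -> None:
    """Raise if the contract differs from the version that was locked."""
    if contract_fingerprint(current) != locked_digest:
        raise ValueError("quality contract changed after lock")


# Hypothetical contract: every name and value here is illustrative only.
contract = {
    "eval": "sentiment-eval-v1",
    "labels": ["positive", "neutral", "negative"],
    "rubric": "expert-review-rubric-v1",
    "parser": "strict-json",
    "holdout_ids": ["h-001", "h-002", "h-003"],
}

# Record the digest once, before any model comparison is run.
locked = contract_fingerprint(contract)

# Later, before claiming a model is better, confirm nothing moved.
assert_unchanged(contract, locked)
```

Hashing a canonical serialization is only one option; storing the contract in version control and pinning a commit would serve the same purpose.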

Optimization path

Document what changed: prompt repair, structured output control, routing, supervised fine-tuning, reinforcement learning, or serving.

Result

Report score, latency, cost, sample size, baseline, and non-claims. Keep the frontier control slice visible after the route moves to production.
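The reporting fields above can be kept together in one record so that a result is never quoted without its baseline and non-claims. This is a rough sketch, not the format the public pages use: the `RouteResult` type, the `summarize` helper, and all numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RouteResult:
    name: str
    score: float            # quality score on the locked eval
    p50_latency_ms: float   # median latency in milliseconds
    cost_per_1k_usd: float  # cost per 1,000 requests
    sample_size: int        # number of evaluated examples


def summarize(candidate: RouteResult, baseline: RouteResult,
              non_claims: List[str]) -> Dict[str, object]:
    """Build a report that always pairs the candidate with its baseline."""
    if candidate.sample_size != baseline.sample_size:
        raise ValueError("candidate and baseline must share the same sample")
    return {
        "candidate": candidate.name,
        "baseline": baseline.name,
        "sample_size": candidate.sample_size,
        "score_delta": round(candidate.score - baseline.score, 4),
        "cost_ratio": baseline.cost_per_1k_usd / candidate.cost_per_1k_usd,
        "p50_latency_ms": candidate.p50_latency_ms,
        "non_claims": list(non_claims),
    }


# Hypothetical numbers, not measurements from any case study.
baseline = RouteResult("frontier-reference", 0.88, 900.0, 10.0, 500)
candidate = RouteResult("optimized-open-route", 0.90, 400.0, 2.5, 500)

report = summarize(
    candidate,
    baseline,
    non_claims=["No claim about tasks outside the evaluated slice."],
)
```

Requiring a matching sample size is a deliberate design choice in this sketch: it blocks comparisons between scores measured on different slices.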

Public proof

Warehouse-scale sentiment labeling

anonymized benchmark case study

39,962 comments labeled at 4.4x lower cost than Sonnet and 50x lower cost than Opus.

A post-trained 30B open model produced frontier-like aggregate rates on dense sentiment labels and close agreement on sparse theater-intent labels.

read proof →

Operations JSON workflow repair

anonymized benchmark case study

Qwen3-8B route validated at 369 ms p50 latency with strict output controls and sparse SFT repair.

The benchmark separates output-control gains from training gains, so the case study can show what changed before the final served route.

read proof →
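Separating output-control gains from training gains, as the operations JSON case study describes, can be thought of as a staged comparison in which each stage adds exactly one change and is scored on the same eval. The following is a minimal sketch of that bookkeeping; the stage names mirror the case study, but the scores are hypothetical placeholders and `attribute_gains` is an illustrative helper, not the benchmark's code.

```python
from typing import Dict, List, Tuple


def attribute_gains(stages: List[Tuple[str, float]]) -> Dict[str, float]:
    """Return each stage's score gain relative to the stage before it.

    `stages` must be ordered so that every stage adds one change on top
    of the previous one and all stages are scored on the same eval.
    """
    gains: Dict[str, float] = {}
    for (_, prev_score), (name, score) in zip(stages, stages[1:]):
        gains[name] = score - prev_score
    return gains


# Hypothetical scores for illustration only.
stages = [
    ("base prompt", 0.50),
    ("strict output controls", 0.75),
    ("sparse SFT repair", 0.875),
]

gains = attribute_gains(stages)
for name, gain in gains.items():
    print(f"{name}: {gain:+.3f}")
```

Attributing gains in sequence assumes the changes were applied in that order; a different order could split the same total differently.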

Sales agent workflow optimization

anonymized benchmark case study

Optimized open-model routes matched or beat frontier references on measured AutomationBench sales slices at lower cost.

The public page reports task slices, replicate counts, frontier baselines, and cost-normalized quality instead of broad model leaderboard claims.

read proof →

Named customer stories can replace these once customers give explicit approval. Until then, the anonymized versions keep the evidence public without inventing customer details.

For the broader comparison framework, see open models vs frontier models for production AI.