Understudy made warehouse-scale labeling economically viable: a tuned 30B open model matched frontier coverage at 50x lower cost than Opus

We labeled 39,962 non-empty YouTube comments from a Snowflake table with an Understudy post-trained 30B open model, Sonnet, and Opus. The open model delivered full-table sentiment coverage with high model agreement and came close to both frontier models on the sparse theater-intent label (0.92% positives versus 0.90% for Sonnet and 1.01% for Opus) at a fraction of frontier cost.

39,962 non-empty comments (Snowflake-backed YouTube table)
99.30% three-way agreement (theater-intent label across Understudy, Sonnet, and Opus)
4.4x cheaper than Sonnet ($2.82 vs $12)
50x cheaper than Opus ($2.82 vs $140)

Offline inference

This is the offline version of Understudy. Instead of proxying live product calls, the optimizer can start from warehouse-shaped work, train a domain classifier on public or synthetic data, climb the same optimization ladder, and publish open weights that a customer can import into Snowflake for native inference.

The customer keeps their regulated data inside Snowflake. Understudy only needs the problem shape: the domain, labels, examples, and eval contract. The resulting model can be customized with prompt optimization, supervised fine-tuning, rejection sampling, and reinforcement learning, then deployed where the warehouse already runs the job.

Understudy loop

Understudy turns this from a hand-prompting exercise into a model replacement loop. Start with a warehouse table, run frontier baselines, find the open-model candidate, freeze the hard cases, and keep optimizing until the cheaper model is good enough to serve.

Broad customer-comment analysis stops being a one-off frontier-model expense and becomes a repeatable product feature. More labels, more topics, and more tables become viable when the model is purpose-built for the warehouse workload.

The same loop applies to specialist open models and repeated production workflows.

model                                   role                  rows    full-table cost  cost vs Understudy
Understudy post-trained 30B open model  open candidate        39,962  $2.82            1.0x
Sonnet                                  frontier baseline     39,962  $12              4.4x higher
Opus                                    frontier adjudicator  39,962  $140             50x higher

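
To make the cost comparison easier to reuse for other table sizes, here is a minimal sketch that recomputes per-row cost and cost ratios from the full-table figures reported for the three models. The function name and dictionary keys are illustrative assumptions, and the inputs are the rounded published costs, so the recomputed ratios can differ slightly from the headline multiples.

```python
# Full-table costs in USD as published (rounded figures), for 39,962 rows.
ROWS = 39_962
COSTS = {"Understudy 30B": 2.82, "Sonnet": 12.0, "Opus": 140.0}


def cost_summary(costs, rows, baseline="Understudy 30B"):
    """Return cost per 1,000 rows and cost ratio versus the baseline model."""
    base = costs[baseline]
    return {
        name: {
            "usd_per_1k_rows": round(cost / rows * 1000, 4),
            "ratio_vs_baseline": round(cost / base, 2),
        }
        for name, cost in costs.items()
    }


summary = cost_summary(COSTS, ROWS)
for name, stats in summary.items():
    print(f"{name}: ${stats['usd_per_1k_rows']}/1k rows, "
          f"{stats['ratio_vs_baseline']}x vs Understudy")
```

From the rounded inputs, the Opus ratio comes out near 49.6x, in line with the stated 50x. The Sonnet ratio comes out near 4.3x rather than 4.4x, which presumably reflects rounding of the published $12 figure.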
Coverage

The job has two different objectives. Broad sentiment is dense: almost every comment can be labeled, and Understudy, Sonnet, and Opus showed high agreement across the full table. That is the repeatable coverage layer product teams need before they can ask more business-specific questions.

The sparse intent label is harder because theater-attendance intent appears in about one percent of comments. Understudy still produced workable results relative to the frontier models, and the lower cost makes it practical to rerun low-confidence rows, adjudicate disagreements, and keep improving with prompt optimization, supervised fine-tuning, and reinforcement learning.
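
As an illustration of what measuring three-way agreement and queuing disagreements for adjudication might look like, the sketch below works on three aligned lists of binary theater-intent labels. The toy labels are hypothetical, and the majority-vote resolution is an assumption made for the example, not a description of how Understudy adjudicates.

```python
from collections import Counter


def three_way_agreement(a, b, c):
    """Return (agreement_rate, disagreement_indices) for three aligned label lists."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("label lists must have the same length")
    if not a:
        raise ValueError("label lists must be non-empty")
    # A row disagrees when the three models do not all emit the same label.
    disagreements = [
        i for i, labels in enumerate(zip(a, b, c)) if len(set(labels)) > 1
    ]
    return 1 - len(disagreements) / len(a), disagreements


def majority_label(labels):
    """Return the strict-majority label, or None when no label has a majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Hypothetical toy labels (1 = explicit theater intent, 0 = none).
understudy = [0, 0, 1, 0, 1, 0, 0, 0]
sonnet     = [0, 0, 1, 0, 0, 0, 0, 0]
opus       = [0, 0, 1, 0, 1, 0, 1, 0]

rate, to_review = three_way_agreement(understudy, sonnet, opus)
resolved = {
    i: majority_label((understudy[i], sonnet[i], opus[i])) for i in to_review
}
print(f"agreement={rate:.2%}, review rows={to_review}, resolved={resolved}")
```

On a real table, the disagreement indices are the natural rerun queue: at the reported 99.30% agreement on 39,962 rows, that queue would hold roughly 280 comments.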

Sparse business label: explicit theater-intent positives on 39,962 Snowflake comments

Understudy post-trained 30B open model: 368 positives (0.92%)
Sonnet: 360 positives (0.90%)
Opus: 405 positives (1.01%)
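
The positive rates for the theater-intent label follow directly from the reported counts, and a short check makes the comparison reproducible. This is a minimal sketch with illustrative variable names. It compares aggregate positive rates only, so it says nothing about whether the models flagged the same rows.

```python
TOTAL_COMMENTS = 39_962
# Reported counts of explicit theater-intent positives per model.
POSITIVES = {"Understudy 30B": 368, "Sonnet": 360, "Opus": 405}

# Positive rate as a percentage of all non-empty comments.
rates = {
    name: round(count / TOTAL_COMMENTS * 100, 2)
    for name, count in POSITIVES.items()
}

# Each model's positive count relative to Opus, the highest-yield model here.
relative_to_opus = {
    name: round(count / POSITIVES["Opus"], 3)
    for name, count in POSITIVES.items()
}

for name in POSITIVES:
    print(f"{name}: {rates[name]}% positive, "
          f"{relative_to_opus[name]:.1%} of Opus positives")
```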