Understudy made warehouse-scale labeling economically viable: a tuned 30B open model matched frontier coverage at 50x lower cost than Opus
We labeled 39,962 non-empty YouTube comments from a Snowflake table with an Understudy post-trained 30B open model, Sonnet, and Opus. The open model delivered full-table sentiment coverage in high agreement with both frontier models, and came close to them on the sparse theater-intent label at a fraction of the frontier cost.
This is the offline version of Understudy. Instead of proxying live product calls, the optimizer can start from warehouse-shaped work, train a domain classifier on public or synthetic data, climb the same optimization ladder, and publish open weights that a customer can import into Snowflake for native inference.
The customer keeps their regulated data inside Snowflake. Understudy only needs the problem shape: the domain, labels, examples, and eval contract. The resulting model can be customized with prompt optimization, supervised fine-tuning, rejection sampling, and reinforcement learning, then deployed where the warehouse already runs the job.
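
One way to picture the "problem shape" is as a small, data-free specification that can leave the warehouse while the regulated rows stay inside it. The sketch below is only an illustration: the class names, field names, and threshold values are hypothetical and do not come from Understudy's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    text: str
    label: str


@dataclass(frozen=True)
class EvalContract:
    # Hypothetical acceptance thresholds; the source does not publish its own.
    min_agreement_with_frontier: float
    min_recall_on_sparse_label: float


@dataclass(frozen=True)
class ProblemShape:
    """Everything shared with the optimizer: no customer rows, only the task."""
    domain: str
    labels: tuple[str, ...]
    examples: tuple[LabeledExample, ...]  # public or synthetic, per the text
    eval_contract: EvalContract

    def validate(self) -> None:
        """Reject specs whose examples use labels outside the declared set."""
        unknown = {ex.label for ex in self.examples} - set(self.labels)
        if unknown:
            raise ValueError(f"examples use undeclared labels: {sorted(unknown)}")


shape = ProblemShape(
    domain="youtube comments",
    labels=("positive", "negative", "neutral"),
    examples=(LabeledExample("Loved the trailer!", "positive"),),
    eval_contract=EvalContract(
        min_agreement_with_frontier=0.9,   # placeholder value
        min_recall_on_sparse_label=0.8,    # placeholder value
    ),
)
shape.validate()
```

Keeping the specification immutable and validating it up front means a mislabeled example is caught before any training run starts.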
Snowflake model import docs →
Understudy turns this from a hand-prompting exercise into a model replacement loop. Start with a warehouse table, run frontier baselines, find the open-model candidate, freeze the hard cases, and keep optimizing until the cheaper model is good enough to serve.
Broad customer-comment analysis stops being a one-off frontier-model expense and becomes a repeatable product feature. More labels, more topics, and more tables become viable when the model is purpose-built for the warehouse workload.
The same loop applies to specialist open models and repeated production workflows.
| model | role | rows | full-table cost | cost vs Understudy |
|---|---|---|---|---|
| Understudy post-trained 30B open model | ★ open candidate | 39,962 | $2.82 | 1.0x |
| Sonnet | frontier baseline | 39,962 | $12 | 4.4x higher |
| Opus | frontier adjudicator | 39,962 | $140 | 50x higher |
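
The cost multiples can be sanity-checked with a few lines of arithmetic. The sketch below uses only the rounded dollar figures from the table above, and the short model names and helper functions are our own. Recomputing from rounded inputs gives roughly 4.3x for Sonnet and 49.6x for Opus, so the published 4.4x and 50x were presumably derived from unrounded costs.

```python
ROWS = 39_962

# Rounded full-table costs in USD, copied from the table above.
FULL_TABLE_COST = {
    "Understudy 30B": 2.82,
    "Sonnet": 12.0,
    "Opus": 140.0,
}


def cost_per_thousand_rows(total_cost: float, rows: int = ROWS) -> float:
    """Convert a full-table cost into a cost per 1,000 labeled rows."""
    return total_cost / rows * 1_000


def cost_multiple(model: str, baseline: str = "Understudy 30B") -> float:
    """How many times more expensive `model` is than `baseline`."""
    return FULL_TABLE_COST[model] / FULL_TABLE_COST[baseline]


for name, cost in FULL_TABLE_COST.items():
    print(
        f"{name}: ${cost_per_thousand_rows(cost):.3f} per 1k rows, "
        f"{cost_multiple(name):.1f}x Understudy"
    )
```

Expressing the cost per thousand rows makes it easier to budget for additional tables or label types of a different size.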
The job has two different objectives. Broad sentiment is dense: almost every comment can be labeled, and Understudy, Sonnet, and Opus showed high agreement across the full table. That is the repeatable coverage layer product teams need before they can ask more business-specific questions.
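
The text does not say which agreement metric was used. A minimal sketch, assuming plain pairwise agreement (the share of rows on which two models emit the same label), could look like this. The toy labels are invented for illustration and are not the study's data.

```python
from itertools import combinations


def agreement_rate(a: list[str], b: list[str]) -> float:
    """Fraction of rows on which two models emit the same label."""
    if len(a) != len(b):
        raise ValueError("label lists must cover the same rows")
    if not a:
        raise ValueError("no rows to compare")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_agreement(
    labels_by_model: dict[str, list[str]],
) -> dict[tuple[str, str], float]:
    """Agreement rate for every unordered pair of models."""
    return {
        (m1, m2): agreement_rate(labels_by_model[m1], labels_by_model[m2])
        for m1, m2 in combinations(labels_by_model, 2)
    }


# Toy labels for five comments; illustrative only.
toy = {
    "understudy": ["pos", "neg", "neu", "pos", "neg"],
    "sonnet":     ["pos", "neg", "neu", "pos", "pos"],
    "opus":       ["pos", "neg", "neu", "neu", "neg"],
}
scores = pairwise_agreement(toy)
```

Raw agreement can look inflated when one label dominates, so a chance-corrected statistic such as Cohen's kappa is a common complement.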
The sparse intent label is harder because theater-attendance intent appears in only about one percent of comments, or roughly 400 of the 39,962 rows. Understudy still produced workable results relative to the frontier models, and the lower cost makes it practical to rerun low-confidence rows, adjudicate disagreements, and keep improving with prompt optimization, supervised fine-tuning, and reinforcement learning.
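
The rerun-and-adjudicate step can be sketched as a simple routing rule: escalate a row when the cheap model is unsure or when it disagrees with a second model. Everything below is hypothetical, including the `RowPrediction` fields, the confidence score, and the 0.8 threshold. The source does not describe how rows were selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowPrediction:
    row_id: int
    cheap_label: bool        # cheap model: does the comment show theater intent?
    cheap_confidence: float  # hypothetical score in [0, 1]
    baseline_label: bool     # second model's label for the same row


def rows_to_adjudicate(
    preds: list[RowPrediction], min_confidence: float = 0.8
) -> list[int]:
    """Return ids of rows that are low-confidence or where the two models disagree."""
    return [
        p.row_id
        for p in preds
        if p.cheap_confidence < min_confidence or p.cheap_label != p.baseline_label
    ]


preds = [
    RowPrediction(1, False, 0.99, False),  # confident and agreed: keep
    RowPrediction(2, True, 0.55, True),    # agreed but low confidence: escalate
    RowPrediction(3, True, 0.95, False),   # confident but disagreed: escalate
    RowPrediction(4, False, 0.97, False),  # confident and agreed: keep
]
queue = rows_to_adjudicate(preds)  # ids 2 and 3
```

Because positives are rare, most rows should pass through untouched, so only a small queue reaches the expensive adjudicator.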