Bench-Sentiment found the cheap model was close in aggregate across the full table, but not calibrated on the rare business label.

We labeled every non-empty YouTube comment in a Snowflake table for sentiment, intent, and explicit movie-theater attendance intent. The table was large enough to expose the real problem: the label the business cares about appears in only about one percent of comments.

39,962 non-empty comments · Snowflake-backed YouTube table
0.92% Qwen positive rate · 368 theater-intent positives
99.30% three-way agreement · theater-intent label across Qwen, Sonnet, and Opus
4.4x cheaper than Sonnet · $2.82 vs $12
50x cheaper than Opus · $2.82 vs $140
78.1% Qwen recall · against strict Sonnet+Opus positives
Question

The model-search question was not whether Qwen could label movie comments cheaply. It could. The question was whether it could replace frontier labeling for explicit theater intent: comments that say or strongly imply a person will see the movie in a theater.

General excitement was deliberately excluded. "Looks awesome" and "cannot wait" stay negative unless the comment includes a concrete attendance cue such as tickets, theaters, opening weekend, or seeing it with someone.
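
A minimal illustration of that boundary, using hypothetical comments; the actual labeling is prompt-based, not keyword-based, so this is only a hand-written contract check, not the labeler itself:

```python
# Hypothetical comments illustrating the theater-intent boundary.
# Excitement alone stays negative; a concrete attendance cue flips it positive.
BOUNDARY_CASES = [
    ("Looks awesome, cannot wait!!",                    False),  # excitement only
    ("This trailer gave me chills",                     False),  # excitement only
    ("Already got tickets for opening weekend",         True),   # concrete cue: tickets, opening weekend
    ("Taking my kids to see this in theaters day one",  True),   # concrete cue: theaters, with someone
    ("I'll catch it on streaming eventually",           False),  # attendance intent, but not in a theater
]

def check_contract(label_fn):
    """Run a candidate labeler (e.g. a prompted model) over the boundary cases."""
    return [(text, label_fn(text), expected) for text, expected in BOUNDARY_CASES]
```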

Coverage

The full-table run labeled 39,962 non-empty comments with Qwen VL 30B, Sonnet, and Opus. Qwen cost $2.82, Sonnet cost $12.48, and Opus cost $139.63.

Aggregate positive rates looked close: Qwen found 368 theater-intent positives, Sonnet found 360, and Opus found 405. That is the tempting headline. It is not yet the production answer.

sparse business label: explicit theater-intent positives on 39,962 Snowflake comments
Qwen VL 30B 0.92% (368 positives) · Sonnet 0.90% (360 positives) · Opus 1.01% (405 positives)
model | rows | parse failures | theater positives | rate | cost
Qwen VL 30B (open candidate) | 39,962 | 13 | 368 | 0.9209% | $2.82
Sonnet (frontier baseline) | 39,962 | 0 | 360 | 0.9009% | $12
Opus (frontier adjudicator) | 39,962 | 1 | 405 | 1.0135% | $140
Source: tracked Understudy agent Snowflake model-search notes. Local label artifacts are ignored and stay outside the platform repo.
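
For context on what a run like this involves, here is a sketch of the full-table loop, assuming a hypothetical label_comment(model, text) wrapper around each provider's API; the real run tracked exact token usage and per-model cost rather than the simple tallies shown here.

```python
from collections import Counter

MODELS = ["qwen-vl-30b", "sonnet", "opus"]  # model identifiers as used in this write-up

def run_full_table(comments, label_comment):
    """Label every non-empty comment with each model and tally theater-intent positives.

    comments: iterable of (comment_id, text) pairs (hypothetical schema).
    label_comment(model, text): hypothetical wrapper that returns a parsed label
    dict or raises ValueError when the model output cannot be parsed.
    """
    positives, parse_failures = Counter(), Counter()
    for _comment_id, text in comments:
        for model in MODELS:
            try:
                label = label_comment(model, text)
            except ValueError:
                parse_failures[model] += 1   # counted as a parse-failure row in the table above
                continue
            if label.get("theater_intent"):
                positives[model] += 1
    return positives, parse_failures
```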
Agreement

Theater-intent agreement was extremely high in aggregate because almost every row is negative. Three-way theater-intent agreement was 99.3041%. But the positive class is where replacement quality lives.

pair | valid rows | sentiment agreement | intent agreement | theater agreement | theater precision | theater recall
Sonnet vs Opus | 39,961 | 85.9% | 81.7% | 99.68% | 78.8% | 88.6%
Sonnet vs Qwen | 39,949 | 72.9% | 76.7% | 99.54% | 73.9% | 75.6%
Opus vs Qwen | 39,948 | 77.6% | 77.8% | 99.39% | 71.7% | 65.2%
Precision and recall are measured against the first model named in the pair. Strict consensus scoring uses only rows where Sonnet and Opus both mark theater intent.
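
A sketch of how these pair metrics can be computed, assuming per-model dicts mapping a comment id to its boolean theater-intent label; the first model in each pair is treated as the reference, and rows either model failed to parse are dropped, which is why the valid-row counts differ per pair.

```python
def pair_metrics(ref_labels, cand_labels):
    """Agreement, precision, and recall of cand against ref on theater intent.

    ref_labels / cand_labels: dicts mapping comment id -> bool (assumed structure).
    Rows missing from either side (parse failures) are dropped before scoring.
    """
    shared = ref_labels.keys() & cand_labels.keys()
    ref_pos = {cid for cid in shared if ref_labels[cid]}
    cand_pos = {cid for cid in shared if cand_labels[cid]}
    hits = ref_pos & cand_pos
    agreement = sum(ref_labels[cid] == cand_labels[cid] for cid in shared) / len(shared)
    precision = len(hits) / len(cand_pos) if cand_pos else 0.0
    recall = len(hits) / len(ref_pos) if ref_pos else 0.0
    return agreement, precision, recall
```

Scoring Opus against Sonnet this way yields the 319-row positive overlap that the calibration step below uses as the strict consensus.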
Calibration

Sonnet and Opus agreed on 319 strict positives. Qwen hit 249 of them and produced 119 positives outside that strict consensus: 67.7% precision and 78.1% recall against the strict frontier target.

That is good enough to keep optimizing. It is not good enough to silently swap in as the source of truth for a rare buying-intent label.

aggregate match is not enough
Qwen against strict Sonnet+Opus theater-intent consensus
Strict Sonnet+Opus positives: 319 · Qwen hits on strict positives: 249 · Qwen positives outside consensus: 119 · precision 67.7% · recall 78.1%
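
The precision and recall figures follow directly from those three counts:

```python
strict_positives = 319   # rows where Sonnet and Opus both mark theater intent
qwen_hits        = 249   # Qwen positives inside that strict consensus
qwen_extra       = 119   # Qwen positives outside it (249 + 119 = 368, Qwen's full-table count)

precision = qwen_hits / (qwen_hits + qwen_extra)   # 249 / 368 ≈ 0.677
recall    = qwen_hits / strict_positives           # 249 / 319 ≈ 0.781
```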
Optimize

The next intervention was prompt-level, not SFT. We froze a hard-case eval set around positives, false positives, false negatives, frontier disagreements, and lexical hard negatives, then searched for a stricter Qwen labeling contract.

The promoted candidate is theater_intent_strict with JSON prefill and 180 tokens of headroom. The 80-token cap scored slightly higher F1, but truncation created parse failures. The usable gain is the clean 180-token run: F1 moved from 0.7673 to 0.7895 with zero parse failures.

variant | parse failures | precision | recall | F1 | note
baseline Qwen prompt | 0 | 0.7545 | 0.7806 | 0.7673 | Full-table prompt used before hard-case optimization.
no rationale | 19 | 0.7764 | 0.7937 | 0.7849 | Quality lift, but missed the parse-validity gate.
theater_intent_strict | 23 | 0.7740 | 0.7987 | 0.7862 | Stricter boundary, still too many malformed rows.
strict prefill, 80 cap | 20 | 0.7826 | 0.8000 | 0.7912 | Best F1, but truncation failures made it unusable.
strict prefill, 180 cap | 0 | 0.7798 | 0.7994 | 0.7895 | Promoted prompt candidate: clean parse coverage with most of the F1 gain.
Hard-set scoring excludes frontier-disagreement rows from target metrics. Compact schema variants are omitted because nearly all rows failed to parse.
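
A sketch of the prefill-and-gate pattern behind the promoted variant, with hypothetical client and prompt plumbing: the assistant turn is prefilled with an opening JSON fragment, the token cap leaves enough headroom that the object is not truncated, and any row that fails to parse counts against the variant.

```python
import json

PREFILL = '{"theater_intent":'   # assistant-turn prefill; the model completes the JSON object

def label_row(client, prompt_template, comment, max_tokens=180):
    """Return a parsed label dict, or None when the completion is not valid JSON.

    client.complete and prompt_template are hypothetical plumbing; the promoted
    configuration used the theater_intent_strict prompt with a 180-token cap.
    """
    completion = client.complete(
        prompt=prompt_template.format(comment=comment),
        prefill=PREFILL,
        max_tokens=max_tokens,
    )
    try:
        return json.loads(PREFILL + completion)
    except json.JSONDecodeError:
        return None   # malformed or truncated output: counted as a parse failure

def parse_failure_count(rows, labeler):
    """Gate a prompt variant on parse validity before comparing F1."""
    return sum(labeler(text) is None for _, text in rows)
```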
Next

Do not train from the first SFT export. It produced 37,621 non-eval rows, but all strict consensus positives landed in the frozen hard eval set. That dataset would teach negatives and lexical hard negatives without any stable positive examples.

The next useful run is class-aware: release part of the positive holdout, expand likely-positive lexical searches, or adjudicate Sonnet/Opus disagreements plus Qwen false positives and false negatives. Then rerun the promoted strict-prefill prompt on larger slices before exporting Qwen-family SFT data.
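
One way to make that export class-aware, sketched under an assumed row schema with a strict_positive flag: hold back only part of the positive set for the frozen eval instead of all of it, so the SFT split retains stable positive examples. This is deliberately narrowed to the positive class; the real hard eval set also holds false positives, false negatives, and lexical hard negatives.

```python
import random

def class_aware_split(rows, positive_holdout=0.5, seed=7):
    """Split labeled rows so strict-consensus positives land in both train and eval.

    rows: list of dicts with a boolean "strict_positive" field (assumed schema).
    The first export failed this test: every strict positive sat in the frozen eval set.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r["strict_positive"]]
    negatives = [r for r in rows if not r["strict_positive"]]
    rng.shuffle(positives)
    cut = int(len(positives) * positive_holdout)
    eval_rows = positives[:cut]                # part of the positive holdout stays frozen
    train_rows = positives[cut:] + negatives   # the rest is released into the SFT export
    return train_rows, eval_rows
```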

Method

Source data was the Snowflake-backed YOUTUBE_COMMENTS table. The platform site does not include raw warehouse rows or ignored label artifacts. This page uses the tracked Understudy agent notes from the Snowflake model-search run as the durable evidence packet.
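
For reference, a sketch of the source pull with the standard Snowflake Python connector; the column names and environment-variable configuration are assumptions, and only the YOUTUBE_COMMENTS table name comes from the run notes.

```python
import os
import snowflake.connector

def fetch_non_empty_comments():
    """Pull non-empty rows from YOUTUBE_COMMENTS (column names are assumed)."""
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
        database=os.environ["SNOWFLAKE_DATABASE"],
        schema=os.environ["SNOWFLAKE_SCHEMA"],
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT COMMENT_ID, COMMENT_TEXT "
            "FROM YOUTUBE_COMMENTS "
            "WHERE COMMENT_TEXT IS NOT NULL AND TRIM(COMMENT_TEXT) <> ''"
        )
        return cur.fetchall()
    finally:
        conn.close()
```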

Non-claim: this is not yet a production replacement claim. It is a model-search result showing that cheap full-table labeling is viable, frontier consensus reveals the scarce label boundary, and Qwen needs targeted positive-class calibration before it replaces frontier labels.