Bench-Sentiment found the cheap model was close in aggregate across the full table, but not calibrated on the rare business label.

We labeled every non-empty YouTube comment in a Snowflake table for sentiment, intent, and explicit movie-theater attendance intent. The table was large enough to expose the real problem: the label the business cares about appears in only about one percent of comments.

39,962 non-empty comments · Snowflake-backed YouTube table
0.92% Qwen positive rate · 368 theater-intent positives
99.30% three-way agreement · theater-intent label across Qwen, Sonnet, and Opus
4.4x cheaper than Sonnet · $2.82 vs $12
50x cheaper than Opus · $2.82 vs $140
78.1% Qwen recall · against strict Sonnet+Opus positives
Question

The model-search question was not whether Qwen could label movie comments cheaply. It could. The question was whether it could replace frontier labeling for explicit theater intent: comments that say or strongly imply a person will see the movie in a theater.

General excitement was deliberately excluded. "Looks awesome" and "cannot wait" stay negative unless the comment includes a concrete attendance cue such as tickets, theaters, opening weekend, or seeing it with someone.
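
A minimal illustration of that boundary, using hypothetical comments; the actual labeling is prompt-based, not keyword-based, so this is only a hand-written contract check, not the labeler itself:

```python
# Hypothetical comments illustrating the theater-intent boundary.
# Excitement alone stays negative; a concrete attendance cue flips it positive.
BOUNDARY_CASES = [
    ("Looks awesome, cannot wait!!",                    False),  # excitement only
    ("This trailer gave me chills",                     False),  # excitement only
    ("Already got tickets for opening weekend",         True),   # concrete cue: tickets, opening weekend
    ("Taking my kids to see this in theaters day one",  True),   # concrete cue: theaters, with someone
    ("I'll catch it on streaming eventually",           False),  # attendance intent, but not in a theater
]

def check_contract(label_fn):
    """Run a candidate labeler (e.g. a prompted model) over the boundary cases."""
    return [(text, label_fn(text), expected) for text, expected in BOUNDARY_CASES]
```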

Coverage

The full-table run labeled 39,962 non-empty comments with Qwen VL 30B, Sonnet, and Opus. Qwen cost $2.82, Sonnet cost $12.48, and Opus cost $139.63.

Aggregate positive rates looked close: Qwen found 368 theater-intent positives, Sonnet found 360, and Opus found 405. That is the tempting headline. It is not yet the production answer.

sparse business label: explicit theater-intent positives on 39,962 Snowflake comments
Qwen VL 30B 0.92% (368 positives) · Sonnet 0.90% (360 positives) · Opus 1.01% (405 positives)
model | rows | parse failures | theater positives | rate | cost
Qwen VL 30B (open candidate) | 39,962 | 13 | 368 | 0.9209% | $2.82
Sonnet (frontier baseline) | 39,962 | 0 | 360 | 0.9009% | $12
Opus (frontier adjudicator) | 39,962 | 1 | 405 | 1.0135% | $140
Source: tracked Understudy agent Snowflake model-search notes. Local label artifacts are ignored and stay outside the platform repo.
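
For context on what a run like this involves, here is a sketch of the full-table loop, assuming a hypothetical label_comment(model, text) wrapper around each provider's API; the real run tracked exact token usage and per-model cost rather than the simple tallies shown here.

```python
from collections import Counter

MODELS = ["qwen-vl-30b", "sonnet", "opus"]  # model identifiers as used in this write-up

def run_full_table(comments, label_comment):
    """Label every non-empty comment with each model and tally theater-intent positives.

    comments: iterable of (comment_id, text) pairs (hypothetical schema).
    label_comment(model, text): hypothetical wrapper that returns a parsed label
    dict or raises ValueError when the model output cannot be parsed.
    """
    positives, parse_failures = Counter(), Counter()
    for _comment_id, text in comments:
        for model in MODELS:
            try:
                label = label_comment(model, text)
            except ValueError:
                parse_failures[model] += 1   # counted as a parse-failure row in the table above
                continue
            if label.get("theater_intent"):
                positives[model] += 1
    return positives, parse_failures
```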
Agreement

Theater-intent agreement was extremely high in aggregate because almost every row is negative. Three-way theater-intent agreement was 99.3041%. But the positive class is where replacement quality lives.

pair | valid rows | sentiment agreement | intent agreement | theater agreement | theater precision | theater recall
Sonnet vs Opus | 39,961 | 85.9% | 81.7% | 99.68% | 78.8% | 88.6%
Sonnet vs Qwen | 39,949 | 72.9% | 76.7% | 99.54% | 73.9% | 75.6%
Opus vs Qwen | 39,948 | 77.6% | 77.8% | 99.39% | 71.7% | 65.2%
Precision and recall are measured against the first model named in the pair. Strict consensus scoring uses only rows where Sonnet and Opus both mark theater intent.
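
A sketch of how these pair metrics can be computed, assuming per-model dicts mapping a comment id to its boolean theater-intent label; the first model in each pair is treated as the reference, and rows either model failed to parse are dropped, which is why the valid-row counts differ per pair.

```python
def pair_metrics(ref_labels, cand_labels):
    """Agreement, precision, and recall of cand against ref on theater intent.

    ref_labels / cand_labels: dicts mapping comment id -> bool (assumed structure).
    Rows missing from either side (parse failures) are dropped before scoring.
    """
    shared = ref_labels.keys() & cand_labels.keys()
    ref_pos = {cid for cid in shared if ref_labels[cid]}
    cand_pos = {cid for cid in shared if cand_labels[cid]}
    hits = ref_pos & cand_pos
    agreement = sum(ref_labels[cid] == cand_labels[cid] for cid in shared) / len(shared)
    precision = len(hits) / len(cand_pos) if cand_pos else 0.0
    recall = len(hits) / len(ref_pos) if ref_pos else 0.0
    return agreement, precision, recall
```

Scoring Opus against Sonnet this way yields the 319-row positive overlap that the calibration step below uses as the strict consensus.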
Calibration

Sonnet and Opus agreed on 319 strict positives. Qwen hit 249 of them and produced 119 positives outside that strict consensus: 67.7% precision and 78.1% recall against the strict frontier target.

That is good enough to keep optimizing. It is not good enough to silently swap in as the source of truth for a rare buying-intent label.

aggregate match is not enough
Qwen against strict Sonnet+Opus theater-intent consensus
Strict Sonnet+Opus positives: 319 · Qwen hits on strict positives: 249 · Qwen positives outside consensus: 119 · precision 67.7% · recall 78.1%
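
The precision and recall figures follow directly from those three counts:

```python
strict_positives = 319   # rows where Sonnet and Opus both mark theater intent
qwen_hits        = 249   # Qwen positives inside that strict consensus
qwen_extra       = 119   # Qwen positives outside it (249 + 119 = 368, Qwen's full-table count)

precision = qwen_hits / (qwen_hits + qwen_extra)   # 249 / 368 ≈ 0.677
recall    = qwen_hits / strict_positives           # 249 / 319 ≈ 0.781
```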
Optimize

The next intervention was prompt-level, not SFT. We froze a hard-case eval set around positives, false positives, false negatives, frontier disagreements, and lexical hard negatives, then searched for a stricter Qwen labeling contract.

The promoted candidate is theater_intent_strict with JSON prefill and 180 tokens of headroom. The 80-token cap scored slightly higher F1, but truncation created parse failures. The usable gain is the clean 180-token run: F1 moved from 0.7673 to 0.7895 with zero parse failures.

variant | parse failures | precision | recall | F1 | note
baseline Qwen prompt | 0 | 0.7545 | 0.7806 | 0.7673 | Full-table prompt used before hard-case optimization.
no rationale | 19 | 0.7764 | 0.7937 | 0.7849 | Quality lift, but missed the parse-validity gate.
theater_intent_strict | 23 | 0.7740 | 0.7987 | 0.7862 | Stricter boundary, still too many malformed rows.
strict prefill, 80 cap | 20 | 0.7826 | 0.8000 | 0.7912 | Best F1, but truncation failures made it unusable.
strict prefill, 180 cap | 0 | 0.7798 | 0.7994 | 0.7895 | Promoted prompt candidate: clean parse coverage with most of the F1 gain.
Hard-set scoring excludes frontier-disagreement rows from target metrics. Compact schema variants are omitted because nearly all rows failed to parse.
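
A sketch of the prefill-and-gate pattern behind the promoted variant, with hypothetical client and prompt plumbing: the assistant turn is prefilled with an opening JSON fragment, the token cap leaves enough headroom that the object is not truncated, and any row that fails to parse counts against the variant.

```python
import json

PREFILL = '{"theater_intent":'   # assistant-turn prefill; the model completes the JSON object

def label_row(client, prompt_template, comment, max_tokens=180):
    """Return a parsed label dict, or None when the completion is not valid JSON.

    client.complete and prompt_template are hypothetical plumbing; the promoted
    configuration used the theater_intent_strict prompt with a 180-token cap.
    """
    completion = client.complete(
        prompt=prompt_template.format(comment=comment),
        prefill=PREFILL,
        max_tokens=max_tokens,
    )
    try:
        return json.loads(PREFILL + completion)
    except json.JSONDecodeError:
        return None   # malformed or truncated output: counted as a parse failure

def parse_failure_count(rows, labeler):
    """Gate a prompt variant on parse validity before comparing F1."""
    return sum(labeler(text) is None for _, text in rows)
```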
Next

Do not train from the first SFT export. It produced 37,621 non-eval rows, but all strict consensus positives landed in the frozen hard eval set. That dataset would teach negatives and lexical hard negatives without any stable positive examples.

The next useful run is class-aware: release part of the positive holdout, expand likely-positive lexical searches, or adjudicate Sonnet/Opus disagreements plus Qwen false positives and false negatives. Then rerun the promoted strict-prefill prompt on larger slices before exporting Qwen-family SFT data.
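
One way to make that export class-aware, sketched under an assumed row schema with a strict_positive flag: hold back only part of the positive set for the frozen eval instead of all of it, so the SFT split retains stable positive examples. This is deliberately narrowed to the positive class; the real hard eval set also holds false positives, false negatives, and lexical hard negatives.

```python
import random

def class_aware_split(rows, positive_holdout=0.5, seed=7):
    """Split labeled rows so strict-consensus positives land in both train and eval.

    rows: list of dicts with a boolean "strict_positive" field (assumed schema).
    The first export failed this test: every strict positive sat in the frozen eval set.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r["strict_positive"]]
    negatives = [r for r in rows if not r["strict_positive"]]
    rng.shuffle(positives)
    cut = int(len(positives) * positive_holdout)
    eval_rows = positives[:cut]                # part of the positive holdout stays frozen
    train_rows = positives[cut:] + negatives   # the rest is released into the SFT export
    return train_rows, eval_rows
```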

Method

Source data was the Snowflake-backed YOUTUBE_COMMENTS table. The platform site does not include raw warehouse rows or ignored label artifacts. This page uses the tracked Understudy agent notes from the Snowflake model-search run as the durable evidence packet.
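
For reference, a sketch of the source pull with the standard Snowflake Python connector; the column names and environment-variable configuration are assumptions, and only the YOUTUBE_COMMENTS table name comes from the run notes.

```python
import os
import snowflake.connector

def fetch_non_empty_comments():
    """Pull non-empty rows from YOUTUBE_COMMENTS (column names are assumed)."""
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
        database=os.environ["SNOWFLAKE_DATABASE"],
        schema=os.environ["SNOWFLAKE_SCHEMA"],
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT COMMENT_ID, COMMENT_TEXT "
            "FROM YOUTUBE_COMMENTS "
            "WHERE COMMENT_TEXT IS NOT NULL AND TRIM(COMMENT_TEXT) <> ''"
        )
        return cur.fetchall()
    finally:
        conn.close()
```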

Non-claim: this is not yet a production replacement claim. It is a model-search result showing that cheap full-table labeling is viable, frontier consensus reveals the scarce label boundary, and Qwen needs targeted positive-class calibration before it replaces frontier labels.