Bench-Sentiment found the cheap model close to frontier quality in aggregate, but not calibrated on the rare label the business cares about.
We labeled every non-empty YouTube comment in a Snowflake table for sentiment, intent, and explicit movie-theater attendance intent. The table was large enough to expose the real problem: the label the business cares about appears in only about one percent of comments.
The model-search question was not whether Qwen could label movie comments cheaply. It could. The question was whether it could replace frontier labeling for explicit theater intent: comments that say or strongly imply a person will see the movie in a theater.
General excitement was deliberately excluded. "Looks awesome" and "cannot wait" stay negative unless the comment includes a concrete attendance cue such as tickets, theaters, opening weekend, or seeing it with someone.
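The boundary can be illustrated with a toy cue rule. This is a hypothetical sketch for intuition only: the real labeling contract is enforced in the model prompt, and `ATTENDANCE_CUES` is an invented cue list, not the production one.

```python
import re

# Hypothetical attendance cues illustrating the strict boundary; the real
# contract lives in the labeling prompt, not in a regex.
ATTENDANCE_CUES = re.compile(
    r"\b(tickets?|theaters?|theatres?|imax|opening (night|weekend)|"
    r"see (it|this) with)\b",
    re.IGNORECASE,
)

def toy_theater_intent(comment: str) -> bool:
    """General excitement alone stays negative; a concrete cue flips it."""
    return bool(ATTENDANCE_CUES.search(comment))

print(toy_theater_intent("Looks awesome, cannot wait"))                    # False
print(toy_theater_intent("Already got IMAX tickets for opening weekend"))  # True
```

The point of the rule shape, not the word list: excitement is the default negative, and only a concrete attendance cue promotes a comment to positive.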
The full-table run labeled 39,962 non-empty comments with Qwen VL 30B, Sonnet, and Opus. Qwen cost $2.82, Sonnet cost $12.48, and Opus cost $139.63.
Aggregate positive rates looked close: Qwen found 368 theater-intent positives, Sonnet found 360, and Opus found 405. That is the tempting headline. It is not yet the production answer.
| model | rows | parse failures | theater positives | rate | cost |
|---|---|---|---|---|---|
| Qwen VL 30B★ open candidate | 39,962 | 13 | 368 | 0.9209% | $2.82 |
| Sonnet frontier baseline | 39,962 | 0 | 360 | 0.9009% | $12.48 |
| Opus frontier adjudicator | 39,962 | 1 | 405 | 1.0135% | $139.63 |
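The rate column is just positives over rows; the table values check out directly from the counts:

```python
rows = 39_962
theater_positives = {"qwen": 368, "sonnet": 360, "opus": 405}

# Positive rate per model, matching the table to four decimal places.
for model, positives in theater_positives.items():
    rate = positives / rows * 100
    print(f"{model}: {rate:.4f}%")
```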
Theater-intent agreement was extremely high in aggregate because almost every row is negative. Three-way theater-intent agreement was 99.3041%. But the positive class is where replacement quality lives.
| pair | valid rows | sentiment | intent | theater | precision | recall |
|---|---|---|---|---|---|---|
| Sonnet vs Opus | 39,961 | 85.9% | 81.7% | 99.68% | 78.8% | 88.6% |
| Sonnet vs Qwen | 39,949 | 72.9% | 76.7% | 99.54% | 73.9% | 75.6% |
| Opus vs Qwen | 39,948 | 77.6% | 77.8% | 99.39% | 71.7% | 65.2% |
Sonnet and Opus agreed on 319 strict positives. Qwen hit 249 of them and produced 119 positives outside that strict consensus: 67.7% precision and 78.1% recall against the strict frontier target.
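The precision and recall figures follow from the counts above, treating the Sonnet-Opus strict consensus as the target:

```python
strict_consensus = 319  # positives where Sonnet and Opus both agreed
qwen_hits = 249         # of those, also labeled positive by Qwen
qwen_total = 368        # all Qwen theater positives (249 hits + 119 extras)

precision = qwen_hits / qwen_total        # how many Qwen positives were real
recall = qwen_hits / strict_consensus     # how many real positives Qwen found
print(f"precision={precision:.1%} recall={recall:.1%}")
```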
That is good enough to keep optimizing. It is not good enough to silently swap in as the source of truth for a rare buying-intent label.
The next intervention was prompt-level, not SFT. We froze a hard-case eval set around positives, false positives, false negatives, frontier disagreements, and lexical hard negatives, then searched for a stricter Qwen labeling contract.
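The hard-case set construction can be sketched as follows. This is a minimal sketch assuming the labels and comment texts are in-memory dicts keyed by comment id; the bucket definitions are one reading of the five categories above, and `CUE_WORDS` is a hypothetical stand-in for the real lexical searches.

```python
# Hypothetical cue list; the real hard negatives came from lexical searches
# over the warehouse, not this exact list.
CUE_WORDS = ("ticket", "theater", "imax", "opening weekend")

def build_hard_eval(labels, texts):
    """labels: id -> {'qwen': bool, 'sonnet': bool, 'opus': bool} theater
    flags; texts: id -> raw comment. Returns the frozen hard-case id set."""
    eval_ids = set()
    for i, l in labels.items():
        consensus = l["sonnet"] and l["opus"]
        if consensus:                      # strict frontier positives
            eval_ids.add(i)
        if l["sonnet"] != l["opus"]:       # frontier disagreements
            eval_ids.add(i)
        if l["qwen"] != consensus:         # Qwen false positives / negatives
            eval_ids.add(i)
        if not any(l.values()) and any(c in texts[i].lower() for c in CUE_WORDS):
            eval_ids.add(i)                # lexical hard negatives
    return eval_ids
```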
The promoted candidate is theater_intent_strict with JSON prefill and 180-token headroom. The 80-token cap had slightly higher F1, but truncation created parse failures, so the usable gain is the clean 180-token run: F1 moved from 0.7673 to 0.7895 with zero parse failures.
| variant | parse failures | precision | recall | F1 | note |
|---|---|---|---|---|---|
| baseline Qwen prompt | 0 | 0.7545 | 0.7806 | 0.7673 | Full-table prompt used before hard-case optimization. |
| no rationale | 19 | 0.7764 | 0.7937 | 0.7849 | Quality lift, but missed the parse-validity gate. |
| theater_intent_strict | 23 | 0.7740 | 0.7987 | 0.7862 | Stricter boundary, still too many malformed rows. |
| strict prefill, 80 cap | 20 | 0.7826 | 0.8000 | 0.7912 | Best F1, but truncation failures made it unusable. |
| strict prefill, 180 cap★ | 0 | 0.7798 | 0.7994 | 0.7895 | Promoted prompt candidate: clean parse coverage with most of the F1 gain. |
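The prefill-plus-headroom pattern can be sketched generically. `complete` is a hypothetical model-call callable and `STRICT_PROMPT` an illustrative contract, not the promoted prompt text; the shape to notice is that the prefill pins the JSON opening and the 180-token cap leaves room for the object to close.

```python
import json

PREFILL = '{"theater_intent":'  # assistant prefill forces the JSON shape

def label_comment(comment, complete, max_tokens=180):
    """complete: hypothetical callable (prompt, prefill=..., max_tokens=...)
    -> str continuation. The 180-token headroom avoids the truncation parse
    failures seen at the 80-token cap."""
    raw = PREFILL + complete(STRICT_PROMPT.format(comment=comment),
                             prefill=PREFILL, max_tokens=max_tokens)
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as e:
        return None, f"parse_failure: {e}"

# Illustrative contract only; the promoted theater_intent_strict prompt
# is in the tracked agent notes, not reproduced here.
STRICT_PROMPT = (
    "Label explicit movie-theater attendance intent.\n"
    "General excitement is NOT intent; require a concrete cue "
    "(tickets, theater, opening weekend, seeing it with someone).\n"
    'Answer as JSON: {{"theater_intent": true|false, "rationale": "..."}}\n'
    "Comment: {comment}"
)
```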
Do not train from the first SFT export. It produced 37,621 non-eval rows, but all strict consensus positives landed in the frozen hard eval set. That dataset would teach negatives and lexical hard negatives without any stable positive examples.
The next useful run is class-aware: release part of the positive holdout, expand likely-positive lexical searches, or adjudicate Sonnet/Opus disagreements plus Qwen false positives and false negatives. Then rerun the promoted strict-prefill prompt on larger slices before exporting Qwen-family SFT data.
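One class-aware option, releasing part of the positive holdout, can be sketched like this. The `positive_release` knob and data shapes are hypothetical, not the shipped export script; the point is that the training split must retain some strict-consensus positives instead of sending them all to the frozen eval set.

```python
import random

def class_aware_export(rows, eval_ids, positive_release=0.3, seed=7):
    """rows: list of (id, label_dict, text); eval_ids: frozen hard-case ids.
    Releases a fraction of held-out strict positives into training so the
    SFT set is not all-negative (hypothetical knob, not the real exporter)."""
    rng = random.Random(seed)
    train, held_positives = [], []
    for row in rows:
        rid, labels, _ = row
        strict_pos = labels["sonnet"] and labels["opus"]
        if rid in eval_ids:
            if strict_pos:
                held_positives.append(row)  # candidates for partial release
            continue                        # other eval rows stay held out
        train.append(row)
    released = int(len(held_positives) * positive_release)
    train.extend(rng.sample(held_positives, released))
    return train
```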
Source data was the Snowflake-backed YOUTUBE_COMMENTS table. The platform site does not include raw warehouse rows or ignored label artifacts. This page uses the tracked Understudy agent notes from the Snowflake model-search run as the durable evidence packet.
Non-claim: this is not yet a production replacement claim. It is a model-search result showing that cheap full-table labeling is viable, frontier consensus reveals the scarce label boundary, and Qwen needs targeted positive-class calibration before it replaces frontier labels.