Expert feedback / 2026-05-25 / 7 min

Why Domain Experts, Not ML Teams, Define the Reward Signal

Reward signals encode product judgment. ML teams can build the harness, but domain experts know which errors matter, which tradeoffs are acceptable, and what good work looks like.

A reward signal is not just an ML artifact. It is a product decision written in a form the system can optimize. If the signal rewards the wrong behavior, the model will improve in the wrong direction.

ML teams are essential for the harness: data splits, scoring code, parsers, training jobs, serving checks, and regression tests. But they usually should not decide which customer promise is risky, which extracted field is material, which sales note is useful, or which support escalation protects the account.

Domain experts know the cost of mistakes. A warehouse operator knows when a label changes the downstream queue. A support lead knows which answer creates liability. A sales engineer knows whether a generated note helps the next call. A marketer knows whether copy is merely fluent or actually on-position.

The reward signal has to capture that judgment. Sometimes it is a rubric. Sometimes it is a pass/fail parser plus expert labels. Sometimes it is a ranked preference between two outputs. Sometimes it is a downstream event: accepted edit, resolved ticket, repaired tool call, or approved structured record.
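To make one of these forms concrete, here is a minimal sketch of the "pass/fail parser plus expert labels" case, assuming the task outputs a JSON record. The `Criterion` structure, the field names, and the weights are hypothetical illustrations, not part of any Understudy API. The point is the division of labor: the ML team owns `parse_record` and `reward`, and the domain expert owns what goes into `RUBRIC`.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Criterion:
    """One rubric line written by a domain expert (hypothetical structure)."""
    name: str
    weight: float
    check: Callable[[dict], bool]


def parse_record(raw: str) -> Optional[dict]:
    """Pass/fail parser: return the record if it is a JSON object, else None."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def reward(raw: str, rubric: list[Criterion]) -> float:
    """Return 0.0 on parse failure, else the weighted share of criteria met."""
    record = parse_record(raw)
    if record is None:
        return 0.0
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(record))
    return earned / total if total else 0.0


# Illustrative rubric for a support-ticket extraction task (assumed fields).
# The expert decides that a missing account id costs three times as much
# as an invalid priority value.
RUBRIC = [
    Criterion("has_account_id", 3.0, lambda r: bool(r.get("account_id"))),
    Criterion("valid_priority", 1.0, lambda r: r.get("priority") in {"low", "high"}),
]
```

With this rubric, an output that parses and carries an account id but an unrecognized priority scores 0.75, while unparseable output scores 0.0. Changing the weights changes what the system optimizes for, which is why they belong to the expert rather than the harness.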

The dangerous shortcut is asking the model to judge itself before anyone has defined what the task is and what counts as success. Self-critique can be useful, but only after the team defines the contract. Otherwise the system learns to produce answers that sound plausible to another model instead of answers that solve the real workflow.

Expert review should not mean reading every output forever. The better pattern is to review representative examples, adjudicate the rows where reviewers disagree, define edge cases, and turn those decisions into evals. Once the eval is stable, the system can compare prompts, routes, and model candidates without re-litigating the whole task each time.
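One way to keep expert time focused is to separate rows where reviewers already agree from rows that need adjudication. The sketch below assumes each row has a list of labels, one per reviewer; the function name and data shape are hypothetical.

```python
from collections import Counter


def split_by_agreement(labels_by_row: dict[str, list[str]]):
    """Separate unanimous rows from rows that need adjudication.

    labels_by_row maps a row id to the labels given by each reviewer.
    Returns (agreed, disputed):
      agreed   maps row id -> the label every reviewer gave
      disputed maps row id -> label counts for the adjudicator to inspect
    Rows with no labels at all are treated as disputed.
    """
    agreed: dict[str, str] = {}
    disputed: dict[str, Counter] = {}
    for row_id, labels in labels_by_row.items():
        counts = Counter(labels)
        if len(counts) == 1:
            agreed[row_id] = labels[0]
        else:
            disputed[row_id] = counts
    return agreed, disputed
```

The agreed rows can go straight into the eval set. The disputed rows are where an expert's decision adds the most information, and each resolution is a candidate edge case to write down.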

Understudy University's prompt optimizer demo shows the small version: a rubric scores candidate prompts, and the measured winner is kept. In production, the same loop needs traces, held-out examples, reviewer judgment, and promotion gates before a cheaper specialist route replaces the frontier baseline.
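The loop described here can be sketched in a few lines. This is not the demo's actual code: `pick_winner`, `passes_gate`, and the `max_drop` tolerance are assumptions made for illustration, and the `score` callable stands in for whatever rubric the experts defined.

```python
from statistics import mean
from typing import Callable


def pick_winner(candidates: list[str], examples: list[dict],
                score: Callable[[str, dict], float]) -> tuple[str, float]:
    """Return the candidate prompt with the highest mean rubric score."""
    scored = [(mean(score(c, ex) for ex in examples), c) for c in candidates]
    # On a tie, max() keeps the earliest candidate in the list.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def passes_gate(candidate_scores: list[float], baseline_scores: list[float],
                max_drop: float = 0.01) -> bool:
    """Promotion gate on held-out examples.

    The candidate route is promoted only if its mean held-out score
    trails the baseline's mean by no more than max_drop.
    """
    return mean(candidate_scores) >= mean(baseline_scores) - max_drop
```

Selecting the winner and deciding whether to promote it are kept as separate steps on purpose: the winner is chosen on the examples used for optimization, while the gate should be evaluated on held-out examples the optimizer never saw.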

The ML team builds the machine that learns. The domain expert defines what the machine should learn. Understudy exists to connect those two jobs so repeated expert work becomes evals, routing rules, training data, and specialist models that improve in the direction the business actually values.

expert feedback loop

Turn expert judgment into an optimization signal.

Understudy helps teams capture domain review, convert it into rubrics and held-out evals, and use it to test prompts, routes, SFT, and RL without handing product judgment to a generic leaderboard.
