Evals / 2026-05-21 / 7 min

How Production Traces Become Evals

Production traces are not just logs. With the right capture, review, and holdout discipline, they become the evals that make model optimization safe.

A production trace is a record of work: the input, context, prompt, tools, model answer, downstream result, and any human correction. Most teams store some of this, but not enough to learn from it.

An eval starts when the trace is paired with a judgment. The judgment can be a label, a rubric score, a parser result, a unit test, an approval, a rejection, or a reviewer note. Without that judgment, the trace is only history.
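As a minimal sketch, the trace and its judgment can be modeled as two small records. The names here (`Trace`, `Judgment`, `is_eval_case`) and the exact fields are illustrative assumptions, not a published Understudy schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Judgment:
    """A verdict attached to a trace (hypothetical shape)."""
    kind: str                # e.g. "label", "rubric_score", "unit_test", "approval"
    value: Any               # the label, score, or pass/fail result
    reviewer_note: str = ""  # optional free-text explanation


@dataclass
class Trace:
    """One recorded unit of production work (hypothetical shape)."""
    trace_id: str
    user_input: str
    context: str
    prompt: str
    tool_calls: list = field(default_factory=list)
    model_answer: str = ""
    downstream_result: Optional[str] = None
    human_correction: Optional[str] = None
    judgment: Optional[Judgment] = None


def is_eval_case(trace: Trace) -> bool:
    """A trace only counts as an eval case once a judgment is attached."""
    return trace.judgment is not None


# Without a judgment, this trace is only history.
history = Trace("t-1", "refund request", "order A", "prompt v3", model_answer="approve")

# With a reviewer's rejection attached, the same kind of record becomes an eval case.
judged = Trace(
    "t-2", "refund request", "order B", "prompt v3", model_answer="deny",
    judgment=Judgment(kind="approval", value=False, reviewer_note="order was eligible"),
)
```

Keeping the judgment on the same record as the input and context means a reviewer's reasoning travels with the case when it is later split, scored, or reused.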

The first mistake is treating every trace as training data. Production logs contain duplicates, stale prompts, policy changes, partial failures, and user behavior that should not be copied. The first job is therefore selection: find the repeated workflow, preserve the context, and decide what good means.
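The selection step might look roughly like the filter below. It is a sketch under stated assumptions: traces are plain dicts, and the keys `workflow`, `prompt_version`, `status`, and `input` are hypothetical names chosen for illustration.

```python
import hashlib


def select_traces(traces, current_prompt_version):
    """Keep one completed trace per distinct input for the current prompt version."""
    seen = set()
    selected = []
    for t in traces:
        if t["prompt_version"] != current_prompt_version:
            continue  # stale prompt: the behavior may no longer apply
        if t["status"] != "completed":
            continue  # partial failure: no trustworthy result to judge
        # Normalize case and whitespace so trivial variants count as duplicates.
        normalized = " ".join(t["input"].lower().split())
        key = hashlib.sha256((t["workflow"] + "\x00" + normalized).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate of a trace already kept
        seen.add(key)
        selected.append(t)
    return selected


traces = [
    {"id": "a", "workflow": "refund", "prompt_version": "v3", "status": "completed", "input": "Refund order 12"},
    {"id": "b", "workflow": "refund", "prompt_version": "v3", "status": "completed", "input": "refund  order 12"},
    {"id": "c", "workflow": "refund", "prompt_version": "v2", "status": "completed", "input": "Refund order 40"},
    {"id": "d", "workflow": "refund", "prompt_version": "v3", "status": "tool_error", "input": "Refund order 77"},
    {"id": "e", "workflow": "refund", "prompt_version": "v3", "status": "completed", "input": "Refund order 99"},
]
selected_ids = [t["id"] for t in select_traces(traces, "v3")]
```

Here `selected_ids` ends up as `["a", "e"]`: `b` is a duplicate of `a` after normalization, `c` ran on an older prompt, and `d` failed partway. Real deduplication usually needs a workflow-specific notion of "same input"; exact matching after normalization is only a starting point.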

A useful eval needs a stable task boundary. For an agent workflow, that might be the final action chosen, the fields extracted, the tool call sequence, or the answer sent back to a user. The boundary should match the product risk. If the user only sees a JSON object, the eval should score the JSON contract before it scores style.
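To make the "contract before style" ordering concrete, here is a small sketch of a scorer that only looks at style once the JSON contract holds. The contract itself (`action` and `amount` fields) and the `style_check` callback are assumptions invented for the example.

```python
import json

# Hypothetical contract: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"action": str, "amount": (int, float)}


def score_output(raw, style_check):
    """Score the JSON contract first; run the style check only if the contract passes."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"contract": False, "style": None, "reason": "invalid JSON"}
    if not isinstance(obj, dict):
        return {"contract": False, "style": None, "reason": "not a JSON object"}
    for name, expected in REQUIRED_FIELDS.items():
        if name not in obj:
            return {"contract": False, "style": None, "reason": f"missing field: {name}"}
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(obj[name], bool) or not isinstance(obj[name], expected):
            return {"contract": False, "style": None, "reason": f"wrong type: {name}"}
    return {"contract": True, "style": bool(style_check(obj)), "reason": ""}


def lowercase_action(obj):
    """Hypothetical style rule: action names are lowercase."""
    return obj["action"].islower()


ok = score_output('{"action": "refund", "amount": 12.5}', lowercase_action)
missing = score_output('{"action": "refund"}', lowercase_action)
broken = score_output("not json", lowercase_action)
```

A contract failure returns `style: None` rather than `False`, so a report can distinguish "style was bad" from "style was never evaluated because the output was unusable".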

Holdout discipline matters. Some traces can become examples, some can become prompt tests, and some must stay unseen so the team can measure whether a smaller model, route, or prompt actually improved the workflow. Mixing those sets lets candidates be scored on cases they were already tuned against, which makes progress look easier than it is.
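One common way to keep the sets from mixing is to assign each trace to a split by hashing a stable key, so the assignment never changes between runs. The sketch below assumes a string trace ID and illustrative split names and percentages; none of these come from the article.

```python
import hashlib


def assign_split(trace_key, holdout_pct=20, prompt_test_pct=20):
    """Deterministically map a trace key to 'holdout', 'prompt_test', or 'example'."""
    digest = hashlib.sha256(trace_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < holdout_pct:
        return "holdout"
    if bucket < holdout_pct + prompt_test_pct:
        return "prompt_test"
    return "example"


trace_ids = [f"trace-{i}" for i in range(50)]
splits = {tid: assign_split(tid) for tid in trace_ids}
```

Because the split depends only on the key, a trace that landed in the holdout set stays there when new traces arrive. Hashing a normalized input instead of a raw ID would also keep near-duplicate traces on the same side of the split, which is worth considering when duplicates are common.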

Expert review is the bottleneck and the moat. A domain expert can say why an answer was wrong, which edge case matters, and when a cheaper answer is good enough. Understudy turns that judgment into rubrics, held-out evals, routing rules, and specialist-model training data.

The workflow is concrete: capture traces, normalize the task, add reviewer judgment, split examples from holdouts, score the frontier baseline, test cheaper candidates, and promote only the route that clears the held-out eval.
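The final step, promoting only the route that clears the held-out eval, can be sketched as a simple gate. Everything specific here is an assumption for illustration: the candidate names, the per-call costs, the `max_drop` tolerance, and the choice of pass rate as the metric.

```python
def pass_rate(results):
    """Fraction of held-out cases that passed."""
    return sum(1 for r in results if r) / len(results)


def choose_route(baseline_results, candidates, max_drop=0.02):
    """Return the cheapest candidate within max_drop of the baseline, else 'frontier'.

    candidates: list of (name, cost_per_call, held_out_results) tuples.
    """
    floor = pass_rate(baseline_results) - max_drop
    eligible = [
        (cost, name)
        for name, cost, results in candidates
        if pass_rate(results) >= floor
    ]
    if not eligible:
        return "frontier"  # nothing cleared the bar, so keep the baseline route
    return min(eligible)[1]  # cheapest eligible candidate


# Held-out results on 20 cases each (True = passed).
baseline = [True] * 19 + [False]          # frontier baseline: 19 of 20
candidates = [
    ("small-a", 0.2, [True] * 18 + [False] * 2),  # cheaper, but 18 of 20
    ("small-b", 0.5, [True] * 19 + [False]),      # matches the baseline
]
chosen = choose_route(baseline, candidates)
```

With these numbers `chosen` is `"small-b"`: `small-a` is cheaper but falls below the floor, so it is not promoted. In practice a 20-case holdout is too small to separate a real regression from noise, so the tolerance and sample size need to be set with the domain reviewer.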

That is how production traces become an optimization system. The logs stop being a graveyard of old model calls and become the evidence base for replacing expensive frontier work with a narrower, measurable specialist.

trace-to-eval pilot

Have real traces but no reliable eval yet?

Bring one repeated production workflow, a sample of traces, and a reviewer who knows the domain. Understudy can help turn that work into a measurable optimization lane before training or routing decisions get expensive.
