Model optimization / 2026-05-18 / 8 min

Self-Distillation Lets AI Teach Itself

Self-distillation turns rich feedback from compilers, users, and environments into model improvement instead of collapsing everything into a pass/fail reward.

Most reinforcement learning for language models still has a brutal feedback problem. A model can write a thousand tokens of reasoning, fail the final test, and receive one bit of feedback: wrong.

The environment often knows much more. A compiler can return a stack trace. A unit test can point to the failing assertion. A user can say the answer was too long, used the wrong format, or missed the point. Traditional pass/fail reward collapses that rich signal into a scalar.
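To make the gap concrete, here is a minimal Python sketch of a single failing test. The buggy `add` function and the `run_test` helper are hypothetical, invented for illustration. The point is only that the same run yields both a one-number reward and a block of text that names the failing assertion.

```python
import traceback

def add(a, b):
    # Deliberately buggy function under test (hypothetical example).
    return a - b

def run_test():
    """Run one test and return (passed, feedback_text)."""
    try:
        result = add(2, 3)
        assert result == 5, f"add(2, 3) returned {result}, expected 5"
        return True, ""
    except AssertionError:
        # The traceback text names the failing assertion and the actual value.
        return False, traceback.format_exc()

passed, feedback = run_test()

# A pass/fail reward keeps only this number.
scalar_reward = 1.0 if passed else 0.0

print(scalar_reward)
print(feedback)
```

The scalar says the attempt failed. The traceback says where and how, which is the part a pass/fail reward throws away.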

Self-distillation makes the model read the feedback. The same model plays two roles. As the student, it attempts the task. As the teacher, it sees the task, the failed answer, and the feedback. With hindsight in the context window, the teacher can infer what the student should have done.
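The two roles differ only in what goes into the context window. A rough sketch of that difference, assuming plain-text prompts: the `Attempt` class, the two prompt builders, and the prompt wording are all hypothetical, not a format the article prescribes.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    task: str      # what the model was asked to do
    answer: str    # the student's failed answer
    feedback: str  # compiler output, test failure, user comment, etc.

def student_prompt(task: str) -> str:
    """Student context: the task only."""
    return f"Task:\n{task}\n\nAnswer:"

def teacher_prompt(attempt: Attempt) -> str:
    """Teacher context: the task plus the failed answer and the feedback."""
    return (
        f"Task:\n{attempt.task}\n\n"
        f"Previous attempt:\n{attempt.answer}\n\n"
        f"Feedback on that attempt:\n{attempt.feedback}\n\n"
        "Corrected answer:"
    )

attempt = Attempt(
    task="Write add(a, b) that returns the sum of two integers.",
    answer="def add(a, b):\n    return a - b",
    feedback="AssertionError: add(2, 3) returned -1, expected 5",
)

print(student_prompt(attempt.task))
print("---")
print(teacher_prompt(attempt))
```

Both prompts would go to the same model. Only the teacher prompt carries the hindsight.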

The teacher does not have to be a larger external model. It can be the same neural network, conditioned on better context. The model's transient in-context reasoning becomes supervision for its long-term weights.
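The article does not say which training objective turns the teacher's output into a weight update. One common choice in distillation generally, assumed here purely for illustration, is to minimize the KL divergence from the teacher's next-token distribution to the student's. The sketch below uses made-up probabilities over a three-token vocabulary at a single position.

```python
import math

def kl_divergence(teacher: dict[str, float], student: dict[str, float]) -> float:
    """KL(teacher || student) over a shared next-token vocabulary."""
    return sum(
        p * math.log(p / student[token])
        for token, p in teacher.items()
        if p > 0
    )

# Hypothetical next-token distributions at the position of the operator
# in "return a ? b". The numbers are invented for illustration.
student_probs = {"-": 0.70, "+": 0.20, "*": 0.10}  # model sees only the task
teacher_probs = {"-": 0.05, "+": 0.90, "*": 0.05}  # same model, feedback in context

loss = kl_divergence(teacher_probs, student_probs)
print(f"distillation loss at this position: {loss:.3f}")

# When the student already matches the teacher, there is nothing to learn.
print(kl_divergence(teacher_probs, teacher_probs))
```

Under this assumption, training would treat the teacher's distribution as a fixed target and update the weights so the student's distribution moves toward it, without the feedback in context.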

Code workflows are full of text feedback: runtime exceptions, compiler errors, test failures, lint errors, and human review comments. Those signals are training data when the system preserves and uses them.

Product workflows have the same property. When someone says "make it shorter," "use JSON only," "do not mention pricing," or "write this for a sales engineer," they are giving the model the missing rubric.

For Understudy, production traces and expert judgment are raw material for evals, training data, routing decisions, and specialist models. The job is to capture the work, preserve the feedback, and turn repeated corrections into better behavior.
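As a rough illustration of "preserve the feedback" and "repeated corrections", here is a minimal sketch of a record that keeps feedback together with its context, plus a helper that surfaces corrections seen more than once. The `FeedbackRecord` fields and the `repeated_corrections` function are hypothetical and do not describe Understudy's actual schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class FeedbackRecord:
    task: str      # what was asked
    output: str    # what the model produced
    feedback: str  # the correction, stored verbatim
    source: str    # e.g. "user", "compiler", "test", "review"

def repeated_corrections(records, min_count=2):
    """Return (feedback, count) pairs seen at least min_count times."""
    # Normalize case and whitespace so trivially different phrasings match.
    counts = Counter(r.feedback.strip().lower() for r in records)
    return [(text, n) for text, n in counts.most_common() if n >= min_count]

records = [
    FeedbackRecord("Summarize the ticket", "(long summary)", "Make it shorter", "user"),
    FeedbackRecord("Summarize the call", "(long summary)", "make it shorter", "user"),
    FeedbackRecord("Export the config", "(YAML output)", "Use JSON only", "user"),
]

print(repeated_corrections(records))  # [('make it shorter', 2)]
```

Exact string matching is the simplest possible grouping. A real system would likely need fuzzier matching, but the record shape is the part that matters: the correction is only useful if the task and the output it refers to are stored alongside it.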

Models can improve from the work they are already doing. Strong systems will watch the workflow, understand the failures, and compound feedback into cheaper, faster, more specialized intelligence.

Already collecting corrections, traces, or review comments?

Those signals can become evals, routing rules, training examples, and eventually specialist models. The useful first step is preserving the feedback with enough context to learn from it.