Yuyi Ao

A KL View of Post-Training Losses

Published on Apr 11, 2026

This post is a rough note after reading Thinking Machines Lab’s post on On-Policy Distillation.

I mainly want to understand one thing:

How does KL show up in different post-training losses, and what intuition does each KL direction give us?

The useful framing for me is to ask three questions:

  1. Which distribution do we sample from?
  2. Which distribution do we want to move toward?
  3. Is the loss closer to forward KL or reverse KL?

With this view:

  • SFT is an off-policy forward-KL loss toward the data distribution.
  • On-policy distillation is reverse KL to the teacher, on student-generated prefixes.
  • KL-regularized RL can be seen as reverse KL to an implicit reward-weighted target policy.

Notation

Let $x$ be the prefix and $y$ the response, or, in token-level formulas, the next token given the current prefix $x$.

I use:

  • $\pi_\theta$ for the current student model.
  • $\pi_T$ for the teacher model.
  • $\pi_{\text{ref}}$ for the reference model.
  • $p_{\text{data}}$ for the data distribution.

Forward KL

$$ D_{\mathrm{KL}}(\pi_{\text{ref}} \,\|\, \pi_\theta) = \mathbb{E}_{y\sim \pi_{\text{ref}}}\left[\log \pi_{\text{ref}}(y|x) - \log \pi_\theta(y|x)\right] $$

Reverse KL

$$ D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) = \mathbb{E}_{y\sim \pi_\theta}\left[\log \pi_\theta(y|x) - \log \pi_{\text{ref}}(y|x)\right] $$

Forward KL. $D_{\mathrm{KL}}(\pi_{\text{ref}} \,\|\, \pi_\theta)$ asks: does the student cover the things the reference thinks are important? This is mode-covering. It punishes the student if it misses tokens that the reference may choose.

Reverse KL. $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ asks: for the things the student currently chooses, would the reference agree? This is mode-seeking. It pushes the student distribution toward the part that the reference agrees with.
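The asymmetry is easy to see numerically. Below is a minimal sketch with hypothetical toy distributions: a bimodal reference and two candidate students, one committing to a single mode and one spreading mass over both. Forward KL blows up when the student misses a mode the reference cares about; reverse KL stays finite as long as the student stays inside the reference's support.

```python
import math

def kl(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); infinite if p_i > 0 where q_i = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

# Bimodal reference over 4 outcomes.
p_ref = [0.45, 0.05, 0.05, 0.45]

# A "mode-seeking" student that commits to one mode.
q_seek = [0.9, 0.1, 0.0, 0.0]
# A "mode-covering" student that spreads mass over both modes.
q_cover = [0.4, 0.1, 0.1, 0.4]

print(kl(p_ref, q_seek))   # forward KL: infinite, the student misses a mode of p_ref
print(kl(q_seek, p_ref))   # reverse KL: finite, the student only visits its own mode
print(kl(p_ref, q_cover))  # forward KL: small, both modes are covered
```

This is why the post's later sections pair sampling direction with KL direction: the expectation is always taken under the first argument, so whichever distribution we sample from determines which KL we can estimate.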

Forward KL: SFT

SFT is next-token prediction on pairs drawn from the dataset:

$$ L_{\mathrm{SFT}} = \mathbb{E}_{(x,y)\sim \mathcal D}\left[-\log \pi_\theta(y|x)\right] $$

Up to a constant (the entropy of the data distribution), minimizing this loss is minimizing a forward KL:

$$ L_{\mathrm{SFT}} = D_{\mathrm{KL}}(p_{\text{data}} \,\|\, \pi_\theta) + \text{const} $$

So the intuition is: SFT asks the student to cover the data distribution. If the data contains a behavior, the student should put probability on it.
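The "+ const" claim can be checked directly on a toy example. The sketch below uses a hypothetical 3-token vocabulary with made-up probabilities and verifies that the expected negative log-likelihood equals the forward KL plus the entropy of the data distribution, which does not depend on $\pi_\theta$.

```python
import math

p_data = [0.7, 0.2, 0.1]    # hypothetical data distribution over a 3-token vocab
pi_theta = [0.5, 0.3, 0.2]  # hypothetical student distribution

cross_entropy = -sum(p * math.log(q) for p, q in zip(p_data, pi_theta))
forward_kl = sum(p * math.log(p / q) for p, q in zip(p_data, pi_theta))
entropy = -sum(p * math.log(p) for p in p_data)

# L_SFT = D_KL(p_data || pi_theta) + H(p_data): the constant is the data entropy.
assert abs(cross_entropy - (forward_kl + entropy)) < 1e-12
```

Since the entropy term is fixed, pushing down the cross-entropy loss can only come from pushing down the forward KL.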

This also explains the limitation. SFT only teaches the model on prefixes covered by the dataset. If the student makes an early mistake at inference time and enters a prefix that the dataset never covers, the learned model may not know how to proceed.

This is the off-policy trade-off: learning from prefixes and actions produced outside the student gives dense token-level supervision, but it does not teach the student what to do in prefixes it reaches by itself.

On-Policy Distillation: Reverse KL to Teacher

Instead of sampling trajectories from the teacher, we sample trajectories from the student. Then we ask the teacher to score the student’s tokens.

The objective is:

$$ L_{\mathrm{OPD}} = D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_T) $$

Expanding the KL gives a per-token log-prob loss:

$$ L_{\mathrm{OPD}} = \mathbb{E}_{y\sim \pi_\theta}\left[\log \pi_\theta(y|x) - \log \pi_T(y|x)\right] $$

Key intuition. The off-policy version asks: can the student imitate external data or teacher trajectories? On-policy distillation asks: when the student reaches this prefix, does the teacher agree with the student's next token?

That is why it sits between SFT and RL.

It is on-policy like RL, because the prefixes come from the student. But it is dense like distillation, because every token can get a teacher log-prob signal.

The limitation also comes from reverse KL. If the student never samples some teacher behavior, reverse KL will not necessarily pull the student toward it. So I think of SFT as adding support, and on-policy distillation as doing mode-seeking inside the support the student already has.
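The per-token loss above is exactly a Monte Carlo estimate of the reverse KL: sample tokens from the student, then score each with the log-prob gap between student and teacher. A minimal sketch, with hypothetical next-token distributions at a single prefix:

```python
import math
import random

def sample(dist, rng):
    """Draw one index from a discrete distribution via its CDF."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if u < acc:
            return i
    return len(dist) - 1

# Hypothetical next-token distributions at one prefix (3-token vocab).
pi_student = [0.6, 0.3, 0.1]
pi_teacher = [0.5, 0.4, 0.1]

rng = random.Random(0)
n = 100_000
# Monte Carlo estimate of reverse KL: sample y ~ pi_theta,
# score log pi_theta(y|x) - log pi_T(y|x), and average.
est = 0.0
for _ in range(n):
    y = sample(pi_student, rng)
    est += math.log(pi_student[y]) - math.log(pi_teacher[y])
est /= n

exact = sum(s * math.log(s / t) for s, t in zip(pi_student, pi_teacher))
print(est, exact)  # the estimate converges to the exact reverse KL
```

Note that token 2 contributes nothing here: student and teacher already agree on it. And if the student assigned it zero probability, it would never be sampled at all, which is the reverse-KL limitation described above.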

KL-Regularized RL: Reverse KL to an Implicit Target

A common post-training RL objective is:

$$ J(\pi_\theta) = \mathbb{E}_{y\sim \pi_\theta}[r(x,y)] - \beta D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) $$

It says: maximize reward, but do not move too far from the reference model.

If the reward model and reference model are fixed, this objective implies an ideal target policy:

$$ p^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta) $$

The important equivalence is:

$$ J(\pi_\theta) = -\beta D_{\mathrm{KL}}(\pi_\theta \,\|\, p^*) + \beta \log Z(x) $$

So maximizing the KL-regularized reward objective is equivalent to minimizing:

$$ D_{\mathrm{KL}}(\pi_\theta \,\|\, p^*) $$

This is why we can view PPO / GRPO-style RL as reverse KL to an implicit reward-weighted target policy.

The target $p^*$ is not a model we can directly sample from. It is an implicit policy defined by the reward and the reference. PPO / GRPO are practical ways to move the current policy toward that target using sampled rollouts and reward estimates.
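On a small discrete example, though, we can construct $p^*$ explicitly and check the equivalence. The sketch below uses hypothetical rewards and policies over three responses and verifies that the KL-regularized objective equals $-\beta D_{\mathrm{KL}}(\pi_\theta \,\|\, p^*) + \beta \log Z$ term for term.

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]      # hypothetical reference policy over 3 responses
r = [1.0, 0.2, -0.5]          # hypothetical rewards
pi_theta = [0.6, 0.25, 0.15]  # hypothetical current policy

# Implicit target: p*(y) proportional to pi_ref(y) * exp(r(y) / beta).
w = [p * math.exp(ri / beta) for p, ri in zip(pi_ref, r)]
Z = sum(w)
p_star = [wi / Z for wi in w]

# J = E_{y ~ pi_theta}[r] - beta * D_KL(pi_theta || pi_ref)
J = sum(q * ri for q, ri in zip(pi_theta, r)) - beta * sum(
    q * math.log(q / p) for q, p in zip(pi_theta, pi_ref))

# Equivalent form: J = -beta * D_KL(pi_theta || p*) + beta * log Z
J_kl = -beta * sum(q * math.log(q / p) for q, p in zip(pi_theta, p_star)) + beta * math.log(Z)

assert abs(J - J_kl) < 1e-12
```

Since $\beta \log Z$ does not depend on $\pi_\theta$, maximizing $J$ and minimizing $D_{\mathrm{KL}}(\pi_\theta \,\|\, p^*)$ are the same problem, and the optimum is $\pi_\theta = p^*$.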

Summary

SFT / off-policy forward KL. Samples come from data or teacher, and the model minimizes forward KL toward that external distribution, such as $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, \pi_\theta)$ or $D_{\mathrm{KL}}(\pi_T \,\|\, \pi_\theta)$.

On-policy distillation. Samples come from the student, and the model minimizes reverse KL toward the teacher, $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_T)$.

PPO / GRPO-style RL. Samples come from the student, and the model approximately minimizes reverse KL toward an implicit reward-weighted target policy, $D_{\mathrm{KL}}(\pi_\theta \,\|\, p^*)$.