Direct Preference Optimization (DPO)

Tags: #nlp #llm #RLHF

Equation

$$\pi_{r} (y|x) = \frac{1}{Z(x)} \pi_{ref} (y|x) \exp(\frac{1}{\beta} r(x,y) ) , r(x,y) = \beta \log \frac{\pi_{r} (y|x)}{\pi_{ref} (y|x)} + \beta \log Z(x) , p^{*}(y_{1} > y_{2} |x) = \frac{1}{1+\exp{(\beta \log \frac{\pi^{*} (y_{2}|x)}{\pi_{ref} (y_{2}|x)} - \beta \log \frac{\pi^{*} (y_{1}|x)}{\pi_{ref} (y_{1}|x)} )}} , \mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{(x, y_{w},y_{l}) \sim D } [\log \sigma (\beta \log \frac{\pi_{\theta} (y_{w}|x)}{\pi_{ref} (y_{w}|x)} - \beta \log \frac{\pi_{\theta} (y_{l}|x)}{\pi_{ref} (y_{l}|x)} )] , \nabla_{\theta} \mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = - \beta \mathbb{E}_{(x, y_{w},y_{l}) \sim D } [ \sigma ( \hat{r}_{\theta} (x, y_{l}) - \hat{r}_{\theta} (x, y_{w})) [\nabla_{\theta} \log \pi_{\theta} (y_{w}|x) - \nabla_{\theta} \log \pi_{\theta} (y_{l}|x) ] ] , \hat{r}_{\theta} (x, y) = \beta \log (\frac{\pi_{\theta} (y|x)}{\pi_{ref} (y|x)})$$

LaTeX Code

\pi_{r} (y|x) = \frac{1}{Z(x)} \pi_{ref} (y|x) \exp(\frac{1}{\beta} r(x,y) ) ,

r(x,y) = \beta \log \frac{\pi_{r} (y|x)}{\pi_{ref} (y|x)} + \beta \log Z(x) ,

p^{*}(y_{1} > y_{2} |x) = \frac{1}{1+\exp{(\beta \log \frac{\pi^{*} (y_{2}|x)}{\pi_{ref} (y_{2}|x)} - \beta \log \frac{\pi^{*} (y_{1}|x)}{\pi_{ref} (y_{1}|x)} )}} ,

\mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{(x, y_{w},y_{l}) \sim D } [\log \sigma (\beta \log \frac{\pi_{\theta} (y_{w}|x)}{\pi_{ref} (y_{w}|x)} - \beta \log \frac{\pi_{\theta} (y_{l}|x)}{\pi_{ref} (y_{l}|x)} )] ,

\nabla_{\theta} \mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = - \beta \mathbb{E}_{(x, y_{w},y_{l}) \sim D } [ \sigma ( \hat{r}_{\theta} (x, y_{l}) - \hat{r}_{\theta} (x, y_{w})) [\nabla_{\theta} \log \pi_{\theta} (y_{w}|x) - \nabla_{\theta} \log \pi_{\theta} (y_{l}|x) ] ] ,

\hat{r}_{\theta} (x, y) = \beta \log (\frac{\pi_{\theta} (y|x)}{\pi_{ref} (y|x)})
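
To make the loss concrete, here is a minimal PyTorch sketch (function and argument names are illustrative, not from the paper). It assumes you have already computed the summed log-probabilities $$ \log \pi(y|x) $$ of the chosen response $$ y_{w} $$ and the rejected response $$ y_{l} $$ under both the policy $$ \pi_{\theta} $$ and the frozen reference model $$ \pi_{ref} $$:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability log pi(y|x) of a full response under the policy
    pi_theta or the frozen reference model pi_ref.
    """
    # Implicit rewards: r_hat(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # L_DPO = -E[ log sigma( r_hat(x, y_w) - r_hat(x, y_l) ) ]
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Backpropagating through this scalar reproduces the gradient expression above: examples where the implicit reward ranks $$ y_{l} $$ above $$ y_{w} $$ receive the larger weight $$ \sigma(\hat{r}_{\theta}(x, y_{l}) - \hat{r}_{\theta}(x, y_{w})) $$.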
                            

Introduction

$$ \mathcal{L}_{DPO} $$: denotes the loss function of Direct Preference Optimization.
$$ \nabla_{\theta} \mathcal{L}_{DPO} $$: denotes the gradient of the DPO loss with respect to the parameters $$\theta$$.
$$ r(x,y) $$: denotes the true reward function.
$$ \pi_{\theta}(.) $$: denotes the language model (policy) being updated.
$$ \pi_{ref}(.) $$: denotes the reference language model.
$$ \hat{r}_{\theta}(x,y) $$: denotes the implicit reward defined by the updated language model $$ \pi_{\theta}(.) $$ and the reference language model $$ \pi_{ref}(.) $$.
$$ \pi_{r} (y|x) $$: denotes the optimal solution to the KL-constrained reward maximization objective.
$$ Z(x) $$: denotes the partition function.
$$ p^{*}(y_{1} > y_{2} |x) $$: denotes the probability that $$ y_{1} $$ is preferred over $$ y_{2} $$ given input $$ x $$, under the Bradley-Terry model.
$$ \pi^{*} $$: denotes the optimal Reinforcement Learning from Human Feedback (RLHF) policy.
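
As a hedged sketch, the per-sequence log-probabilities $$ \log \pi(y|x) $$ that the loss consumes can be obtained from token-level logits as follows (tensor names and the response-mask convention are assumptions, not part of the paper):

```python
import torch

def sequence_logprob(logits, labels, response_mask):
    """Sum token log-probabilities over the response y, conditioned on the prompt x.

    logits:        (batch, seq_len, vocab) model outputs for the full prompt+response
    labels:        (batch, seq_len) token ids of the full prompt+response
    response_mask: (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding
    """
    # Shift so that logits at position t predict the token at position t+1.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = response_mask[:, 1:]

    # Per-token log pi(y_t | x, y_<t), gathered at the observed token ids.
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(log_probs, 2, labels.unsqueeze(-1)).squeeze(-1)

    # log pi(y|x) = sum of token log-probabilities over the response.
    return (token_logps * mask).sum(dim=-1)
```

Evaluating this with the policy and with the reference model, for both $$ y_{w} $$ and $$ y_{l} $$, yields the four inputs of the loss sketch above.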

Reference
Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.
