## Proximal Policy Optimization

Tags: #machine learning #AI #LLM

### Equation

$$\arg\max\limits_{\pi}{ E_{p \sim D,g \sim \pi} [R(g|p)] }, \quad R(g|p) = \tilde{R}_{c}(g|p) - \beta D_{KL}( \pi_{\theta} (g|p) \,||\, \pi_{0} (g|p))$$

### Latex Code

\arg\max\limits_{\pi}{ E_{p \sim D,g \sim \pi} [R(g|p)] }, R(g|p) = \tilde{R}_{c}(g|p) - \beta D_{KL}( \pi_{\theta} (g|p) || \pi_{0} (g|p))


### Introduction

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm used in Reinforcement Learning from Human Feedback (RLHF) to fine-tune a Large Language Model (LLM) against a learned reward model. It helps align the LLM's generations with human judgement and feedback. The policy is improved iteratively by sampling prompts p from the dataset D and generations g from the policy \pi, then applying PPO to optimize the objective above. The terms are:

- $$ R(g|p) $$: the final reward assigned to generation g for prompt p.
- $$ \tilde{R}_{c}(g|p) $$: the reward function we define, e.g. the piecewise combination of the safety (Rs) and helpfulness (Rh) rewards in the LLaMA 2 model.
- $$ \pi_{0} (g|p) $$: the original (reference) policy that generates response g given prompt p.
- $$ \pi_{\theta} (g|p) $$: the policy being optimized, with parameters \theta.
- $$ \beta $$: the KL penalty coefficient, which keeps the optimized policy from diverging too far from the original policy.
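To make the objective concrete, here is a minimal PyTorch-style sketch of the KL-penalized reward. The function name `kl_penalized_reward`, the tensor shapes, and the default `beta = 0.1` are illustrative assumptions rather than the API of any particular RLHF library; the KL term uses the common single-sample (Monte Carlo) estimate computed from per-token log-probabilities of a sampled generation.

```python
import torch

def kl_penalized_reward(reward_model_score: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Compute R(g|p) = R_c(g|p) - beta * KL(pi_theta(g|p) || pi_0(g|p)).

    Args:
        reward_model_score: scalar reward R_c(g|p) per sequence, shape (batch,).
        logprobs_policy: per-token log pi_theta of the sampled generation, shape (batch, seq_len).
        logprobs_ref: per-token log pi_0 of the same tokens, shape (batch, seq_len).
        beta: KL penalty coefficient.
    """
    # Single-sample KL estimate: sum over generated tokens of
    # (log pi_theta - log pi_0), since the tokens were sampled from pi_theta.
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return reward_model_score - beta * kl
```

In a full RLHF loop, this penalized reward is what PPO's clipped surrogate objective maximizes: the reward model pushes generations toward human preferences, while the KL term anchors them to the original policy's distribution.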
