Online RL [PPO] [GRPO]
Description
Online Reinforcement Learning is a method for continuously improving language models by gathering real-time feedback and updating the model in a loop. Unlike training once on a static dataset, online RL lets a model learn from its own deployed outputs, incorporating fresh human (or synthetic) feedback into the training process. This enables adaptive alignment with evolving user preferences and tasks.
Workflow
- Deployment & Feedback Loop: Deploy the model and collect interaction data from users, including implicit or explicit preference signals (e.g., thumbs up/down, rankings).
- Reward Model Updates: Continuously update or retrain the reward model on the new preference data (see the pairwise-loss sketch after this list).
- Policy Optimization:
- Use an algorithm such as Proximal Policy Optimization (PPO) to fine-tune the language model against the reward model, typically with a KL penalty toward a reference model so the policy does not drift too far from its starting point.
- Optionally use Group Relative Policy Optimization (GRPO), a PPO variant that replaces the learned value function with group-relative advantages computed over several sampled completions per prompt, lowering memory and compute costs.
- Evaluation: Periodically assess model behavior using alignment metrics, task performance, and user satisfaction.
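As a rough illustration of the reward-model update step, the sketch below retrains a scalar reward model on fresh (chosen, rejected) preference pairs with a pairwise Bradley-Terry loss. The names `reward_model`, `preference_loader`, and `optimizer` are assumed placeholders rather than a specific library's API.

```python
import torch
import torch.nn.functional as F

def reward_model_update(reward_model, preference_loader, optimizer):
    """One pass of pairwise (Bradley-Terry) training on new preference data."""
    reward_model.train()
    for chosen_ids, rejected_ids in preference_loader:
        # Scalar reward for the preferred and the rejected completion.
        r_chosen = reward_model(chosen_ids)      # shape: (batch,)
        r_rejected = reward_model(rejected_ids)  # shape: (batch,)

        # Logistic loss pushes the chosen reward above the rejected one.
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```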
PPO vs GRPO
- PPO (Proximal Policy Optimization): A stable, widely used RL algorithm that clips the probability ratio between the new and old policy so each update stays small, reducing training instability. It relies on a learned value function (critic) to estimate advantages.
- GRPO (Group Relative Policy Optimization): A PPO-style method that drops the critic and instead computes advantages by normalizing rewards within a group of completions sampled for the same prompt, cutting memory and compute while keeping the clipped update (see the sketches below).
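To make the contrast concrete, here is a minimal sketch of the two update rules: PPO's clipped surrogate loss over token log-probabilities, and GRPO's group-relative advantages computed directly from rewards without a critic. Tensor shapes and names (`logprobs`, `old_logprobs`, `advantages`, `rewards`) are illustrative assumptions, not any particular library's API.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO: clip the policy ratio so each update stays near the old policy.

    logprobs, old_logprobs, advantages: tensors of shape (batch, seq_len).
    """
    ratio = torch.exp(logprobs - old_logprobs)              # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic (elementwise minimum) objective, negated to form a loss.
    return -torch.min(unclipped, clipped).mean()

def grpo_advantages(rewards, eps=1e-8):
    """GRPO: score a group of completions per prompt and normalize rewards
    within the group, so no learned value function is needed.

    rewards: tensor of shape (num_prompts, group_size).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)    # one advantage per completion
```

In practice the GRPO advantages are broadcast over the tokens of each completion and fed into the same clipped surrogate as PPO, usually together with a KL penalty toward a reference model.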
Use Cases
- Continual improvement of deployed assistants and chatbots.
- Fine-tuning moderation or recommendation systems with live user feedback.
- Adaptive task-solving agents in dynamic environments.