GRPO is a reinforcement learning technique that offers a streamlined alternative to traditional methods such as Proximal Policy Optimization (PPO).
DeepSeek’s Group Relative Policy Optimization (GRPO) represents a paradigm shift in reinforcement learning (RL) for large language models, addressing key limitations of Proximal Policy Optimization (PPO) through innovative simplifications and efficiency gains. Here’s why GRPO stands out:
Core Innovations in GRPO
1. Elimination of the Critic Model
GRPO removes the need for a separate value function (critic model) required in PPO, reducing memory and computational overhead by ~50%. Instead of training two models (policy + critic), GRPO uses:
- Group-based reward baselines: Multiple responses per prompt are generated, with their average reward serving as a dynamic baseline.
- Monte Carlo estimation: Advantages are calculated directly from sampled completions, avoiding complex value-function training (a minimal sketch of this calculation follows below).
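To make this concrete, here is a minimal sketch of the group-relative advantage computation, assuming scalar rewards have already been assigned to each sampled completion; the function and variable names are illustrative and not taken from DeepSeek's code:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Compute GRPO-style advantages for one prompt.

    rewards: shape (G,) -- scalar reward for each of the G sampled completions.
    Returns one advantage per completion: the reward normalized against the
    group mean (and std), so no learned critic/value model is needed.
    """
    baseline = rewards.mean()                      # group mean acts as the baseline
    scale = rewards.std(unbiased=False) + eps      # normalize to stabilize updates
    return (rewards - baseline) / scale

# Example: 4 completions sampled for the same prompt
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
print(group_relative_advantages(rewards))          # positive for above-average completions
```

Because the baseline comes from the group itself, no value network ever needs to be trained or stored.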
2. Stability Through Relative Ranking
GRPO introduces a comparative framework:
- Responses are ranked within groups, emphasizing relative performance over absolute rewards.
- A simplified reward system combining accuracy (correct answers) and format compliance (structured reasoning traces) reduces reward-engineering complexity (a toy reward function is sketched after this list).
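For illustration only, a rule-based reward along these lines might look like the sketch below; the tag name, answer-extraction logic, and weights are assumptions rather than DeepSeek's actual reward implementation:

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: accuracy plus format compliance.

    - format: the completion wraps its chain of thought in <reasoning>...</reasoning>
    - accuracy: the final answer after the reasoning block matches the reference
    The weights (1.0 for accuracy, 0.5 for format) are illustrative, not from the paper.
    """
    format_ok = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
    # Assume the final answer is whatever follows the closing tag.
    answer = completion.split("</reasoning>")[-1].strip() if format_ok else completion.strip()
    accuracy = 1.0 if answer == reference_answer.strip() else 0.0
    return accuracy + (0.5 if format_ok else 0.0)

print(reward("<reasoning>2+2=4</reasoning>4", "4"))   # 1.5: correct and well-formatted
print(reward("maybe 5", "4"))                          # 0.0: wrong and unformatted
```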
3. Memory and Cost Efficiency
| Feature | PPO (e.g., OpenAI o1) | GRPO (DeepSeek-R1) |
| --- | --- | --- |
| Models trained | 2 (policy + critic) | 1 (policy only) |
| Training speed | Slower | 2-3× faster |
| VRAM requirements | High | Up to 50% lower |
| Scalability | Limited | Optimized for large-scale training |
Real-world results show GRPO enables training a 1B-parameter reasoning model with just 16GB VRAM, democratizing RL training for smaller organizations.
Technical Advantages Over PPO
- Simplified advantage calculation: Uses mean group rewards instead of value function outputs.
- Built-in KL divergence control: Directly penalizes deviations from a reference policy without additional components (see the objective sketched after this list).
- Structured reasoning enforcement: Mandates step-by-step explanations within `<reasoning>` tags, improving interpretability and self-verification.
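Putting these pieces together, the GRPO objective is commonly written roughly as follows (a sketch using notation in the style of the DeepSeekMath paper, not a verbatim reproduction):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\Bigg[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \Big( \min\big( r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t} \big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \Bigg],
$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the policy ratio (clipped exactly as in PPO) and $\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}$ is the group-relative advantage shared by all tokens of completion $o_i$. The $\beta$-weighted KL term penalizes drift from the reference policy inside the objective itself, which is the "built-in KL divergence control" noted above.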
Impact on Model Performance
DeepSeek-R1, trained with GRPO, demonstrates:
- State-of-the-art mathematical reasoning (first demonstrated with DeepSeekMath)
- Improved factual accuracy through self-verification mechanisms
- 93% cost reduction compared to equivalent PPO-based training
While GRPO builds on PPO’s foundation—retaining clipping and policy constraints—its group-relative approach and critic elimination make RL training more accessible. This breakthrough enables efficient scaling while maintaining stability, positioning GRPO as a transformative advancement in RL for language models.