GRPO is a reinforcement learning technique that offers a streamlined alternative to traditional methods such as Proximal Policy Optimization (PPO).
DeepSeek’s Group Relative Policy Optimization (GRPO) represents a paradigm shift in reinforcement learning (RL) for large language models, addressing key limitations of Proximal Policy Optimization (PPO) through innovative simplifications and efficiency gains. Here’s why GRPO stands out:
Core Innovations in GRPO
1. Elimination of the Critic Model
GRPO removes the need for a separate value function (critic model) required in PPO, reducing memory and computational overhead by ~50%. Instead of training two models (policy + critic), GRPO uses:
- Group-based reward baselines: Multiple responses per prompt are generated, with their average reward serving as a dynamic baseline.
- Monte Carlo estimation: Advantages are calculated directly from sampled completions, avoiding complex value-function training (a minimal sketch of this calculation follows below).
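To make this concrete, here is a minimal sketch of the group-relative advantage computation, assuming scalar rewards have already been assigned to each sampled completion; the function and variable names are illustrative and not taken from DeepSeek's code:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Compute GRPO-style advantages for one prompt.

    rewards: shape (G,) -- scalar reward for each of the G sampled completions.
    Returns one advantage per completion: the reward normalized against the
    group mean (and std), so no learned critic/value model is needed.
    """
    baseline = rewards.mean()                      # group mean acts as the baseline
    scale = rewards.std(unbiased=False) + eps      # normalize to stabilize updates
    return (rewards - baseline) / scale

# Example: 4 completions sampled for the same prompt
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
print(group_relative_advantages(rewards))          # positive for above-average completions
```

Because the baseline comes from the group itself, no value network ever needs to be trained or stored.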
2. Stability Through Relative Ranking
GRPO introduces a comparative framework:
- Responses are ranked within groups, emphasizing relative performance over absolute rewards.
- A simplified reward system combining accuracy (correct answers) and format compliance (structured reasoning traces) reduces reward-engineering complexity (a toy reward function is sketched after this list).
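For illustration only, a rule-based reward along these lines might look like the sketch below; the tag name, answer-extraction logic, and weights are assumptions rather than DeepSeek's actual reward implementation:

```python
import re

def reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: accuracy plus format compliance.

    - format: the completion wraps its chain of thought in <reasoning>...</reasoning>
    - accuracy: the final answer after the reasoning block matches the reference
    The weights (1.0 for accuracy, 0.5 for format) are illustrative, not from the paper.
    """
    format_ok = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
    # Assume the final answer is whatever follows the closing tag.
    answer = completion.split("</reasoning>")[-1].strip() if format_ok else completion.strip()
    accuracy = 1.0 if answer == reference_answer.strip() else 0.0
    return accuracy + (0.5 if format_ok else 0.0)

print(reward("<reasoning>2+2=4</reasoning>4", "4"))   # 1.5: correct and well-formatted
print(reward("maybe 5", "4"))                          # 0.0: wrong and unformatted
```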
3. Memory and Cost Efficiency
| Feature | PPO (e.g., OpenAI o1) | GRPO (DeepSeek-R1) |
| --- | --- | --- |
| Models trained | 2 (policy + critic) | 1 (policy only) |
| Training speed | Slower | 2-3× faster |
| VRAM requirements | High | Up to 50% lower |
| Scalability | Limited | Optimized for large-scale training |
Real-world results show GRPO enables training a 1B-parameter reasoning model with just 16GB VRAM, democratizing RL training for smaller organizations.
Technical Advantages Over PPO
- Simplified advantage calculation: Uses mean group rewards instead of value function outputs.
- Built-in KL divergence control: Directly penalizes deviations from a reference policy without additional components (see the objective sketched after this list).
- Structured reasoning enforcement: Mandates step-by-step explanations within `<reasoning>` tags, improving interpretability and self-verification.
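Putting these pieces together, the GRPO objective is commonly written roughly as follows (a sketch using notation in the style of the DeepSeekMath paper, not a verbatim reproduction):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\Bigg[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \Big( \min\big( r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t} \big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \Bigg],
$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the policy ratio (clipped exactly as in PPO) and $\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}$ is the group-relative advantage shared by all tokens of completion $o_i$. The $\beta$-weighted KL term penalizes drift from the reference policy inside the objective itself, which is the "built-in KL divergence control" noted above.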
Impact on Model Performance
DeepSeek-R1, trained with GRPO, demonstrates:
- State-of-the-art mathematical reasoning (first demonstrated with DeepSeekMath)
- Improved factual accuracy through self-verification mechanisms
- 93% cost reduction compared to equivalent PPO-based training
While GRPO builds on PPO’s foundation—retaining clipping and policy constraints—its group-relative approach and critic elimination make RL training more accessible. This breakthrough enables efficient scaling while maintaining stability, positioning GRPO as a transformative advancement in RL for language models.