DeepSeek’s GRPO is the biggest breakthrough since transformers

GRPO is a new reinforcement learning technique that replaces traditional methods like Proximal Policy Optimization (PPO)

DeepSeek’s Group Relative Policy Optimization (GRPO) represents a paradigm shift in reinforcement learning (RL) for large language models, addressing key limitations of Proximal Policy Optimization (PPO) through innovative simplifications and efficiency gains. Here’s why GRPO stands out:

Core Innovations in GRPO

1. Elimination of the Critic Model
GRPO removes the need for a separate value function (critic model) required in PPO, reducing memory and computational overhead by ~50%. Instead of training two models (policy + critic), GRPO uses:

  • Group-based reward baselines: Multiple responses per prompt are generated, with their average reward serving as a dynamic baseline.
  • Monte Carlo estimation: Advantages are calculated directly from sampled completions, avoiding complex value function training (a minimal sketch of this computation follows below).
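For concreteness, here is a minimal sketch of how a group-relative advantage can be computed. The function name, the standard-deviation normalization, and the example rewards are assumptions made for illustration, not DeepSeek's exact implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for one prompt's group of sampled completions.

    Each reward is baselined against the group mean, so no learned
    critic/value model is required; dividing by the group standard
    deviation (a common, assumed choice) keeps the scale stable.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()                  # group-based reward baseline
    advantages = rewards - baseline            # Monte Carlo advantage estimate
    return advantages / (rewards.std() + eps)  # normalize within the group

# Example: rewards for 4 completions sampled from the same prompt
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```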

2. Stability Through Relative Ranking
GRPO introduces a comparative framework:

  • Responses are ranked within groups, emphasizing relative performance over absolute rewards.
  • A simplified reward system combining accuracy (correct answers) and format compliance (structured reasoning traces) reduces reward engineering complexity (an illustrative reward sketch follows below).
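To illustrate how small such a reward can be, the sketch below scores a completion on accuracy plus format compliance. The <answer> tag and the 1.0/0.5 weights are assumptions made for this example, not DeepSeek's published reward function.

```python
import re

def simple_reward(completion: str, reference_answer: str) -> float:
    """Toy reward = accuracy (1.0) + format compliance (0.5); weights are illustrative."""
    # Format compliance: a structured reasoning trace plus a tagged final answer
    # (the <answer> tag and the weights are assumptions for this sketch).
    has_reasoning = re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL) is not None
    answer_match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    format_score = 0.5 if has_reasoning and answer_match else 0.0

    # Accuracy: exact match of the extracted answer against the reference
    extracted = answer_match.group(1).strip() if answer_match else ""
    accuracy_score = 1.0 if extracted == reference_answer.strip() else 0.0

    return accuracy_score + format_score

sample = "<reasoning>2 + 2 = 4</reasoning><answer>4</answer>"
print(simple_reward(sample, "4"))  # -> 1.5
```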

3. Memory and Cost Efficiency

| Feature | PPO (e.g., OpenAI o1) | GRPO (DeepSeek R1) |
| --- | --- | --- |
| Models Trained | 2 (policy + critic) | 1 (policy only) |
| Training Speed | Slower | 2-3× faster |
| VRAM Requirements | High | Up to 50% lower |
| Scalability | Limited | Large-scale optimized |

Real-world results show GRPO enables training a 1B-parameter reasoning model with just 16GB VRAM, democratizing RL training for smaller organizations.

Technical Advantages Over PPO

  • Simplified advantage calculation: Uses mean group rewards instead of value function outputs.
  • Built-in KL divergence control: Directly penalizes deviations from a reference policy without additional components (see the loss sketch after this list).
  • Structured reasoning enforcement: Mandates step-by-step explanations within <reasoning> tags, improving interpretability and self-verification.
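Putting these pieces together, the sketch below shows a GRPO-style per-token loss: the PPO-style clipped surrogate applied to group-relative advantages, plus a direct KL penalty against a frozen reference policy. The tensor shapes, hyperparameter values, and the particular low-variance KL estimator used here are assumptions chosen for illustration, not a reproduction of DeepSeek's training code.

```python
import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps=0.2, kl_beta=0.04):
    """GRPO-style loss: clipped surrogate + KL penalty to a reference policy.

    logp_new / logp_old / logp_ref: (batch, seq_len) log-probs of the sampled
    tokens under the current, rollout, and frozen reference policies.
    advantages: (batch,) group-relative advantages, broadcast over tokens.
    clip_eps and kl_beta are illustrative values; padding masks are omitted.
    """
    adv = advantages.unsqueeze(-1)                        # broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)                # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # Nonnegative, low-variance estimator of KL(pi_theta || pi_ref)
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    return -(surrogate - kl_beta * kl).mean()

# Toy usage: 4 completions of 8 tokens each, with group-relative advantages
B, T = 4, 8
lp_new, lp_old, lp_ref = (torch.randn(B, T) * 0.1 - 2.0 for _ in range(3))
adv = torch.tensor([1.2, -0.4, 0.9, -1.7])
print(grpo_token_loss(lp_new, lp_old, lp_ref, adv))
```

Because the baseline comes from the group itself and the KL term is folded directly into the loss, no second (critic) model ever has to be trained or held in memory.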

Impact on Model Performance

DeepSeek-R1, trained with GRPO, demonstrates:

  • State-of-the-art mathematical reasoning (GRPO was first demonstrated with the DeepSeekMath model)
  • Improved factual accuracy through self-verification mechanisms
  • 93% cost reduction compared to equivalent PPO-based training

While GRPO builds on PPO’s foundation—retaining clipping and policy constraints—its group-relative approach and critic elimination make RL training more accessible. This breakthrough enables efficient scaling while maintaining stability, positioning GRPO as a transformative advancement in RL for language models.

Shailesh Manjrekar
Shailesh Manjrekar, Chief Marketing Officer, is responsible for CloudFabrix's AI and SaaS product thought leadership, marketing, and go-to-market strategy for the Data Observability and AIOps market. A seasoned IT professional with over two decades of experience building and managing emerging global businesses, he brings an established background in product and solutions marketing, product management, and strategic alliances spanning AI and deep learning, FinTech, and life sciences SaaS solutions. Manjrekar is an avid speaker at AI conferences such as NVIDIA GTC and the Storage Developer Conference, and has been a contributor to the Forbes Technology Council, an invitation-only organization of leading CxOs and technology executives, since 2020.