Minimal-GRPO

A minimal and customizable implementation for fine-tuning open LLMs (LLaMA, Qwen, etc.) on reasoning tasks with verifiable rewards, using two approaches: GRPO (Group Relative Policy Optimization) and ES (Evolution Strategies).

This repo currently includes GSM8K (by OpenAI) and MathExpr datasets, and can be adapted to other tasks.

Features

Setup

  1. Clone the repository:
    git clone https://github.com/yourusername/Minimal-GRPO.git
    cd Minimal-GRPO
    
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Optional (recommended for efficiency):
    pip install flash-attn --no-build-isolation
    

Quick Start

  1. Configure your training in grpo_config.yml or es_config.yml

  2. Run the corresponding training script (grpo_train.py or es_train.py), e.g.:
    python grpo_train.py
    
  3. Monitor with TensorBoard:
    tensorboard --logdir=grpo_logs
    

Adapting to Your Own Tasks

To adapt this code to your own dataset and task (a minimal sketch follows this list):

  1. Implement your dataset in datasets.py (follow the GSM8K or MathExpr examples)
  2. Define your reward function in reward.py to match your task’s success criteria
  3. Adjust the system prompt in the training scripts (grpo_train.py or es_train.py) to match your task format
  4. Update the DataLoader in the training script to use your new dataset
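
As a rough illustration of steps 1 and 2, the sketch below shows one possible shape for a dataset and a reward function. The names (MyTaskDataset, my_reward) and their signatures are assumptions, not the repo's actual interfaces, so follow the GSM8K and MathExpr examples in datasets.py and reward.py for the real ones.

    # Hypothetical sketch; the real interfaces live in datasets.py and reward.py.
    from torch.utils.data import Dataset


    class MyTaskDataset(Dataset):
        """Minimal question/answer dataset, mirroring the GSM8K-style examples."""

        def __init__(self, examples):
            # examples: list of {"question": str, "answer": str} dicts
            self.examples = examples

        def __len__(self):
            return len(self.examples)

        def __getitem__(self, idx):
            item = self.examples[idx]
            return {"prompt": item["question"], "target": item["answer"]}


    def my_reward(completion: str, target: str) -> float:
        """Verifiable reward: 1.0 if the model's final answer matches the target."""
        # Replace this check with whatever "success" means for your task.
        predicted = completion.strip().split()[-1] if completion.strip() else ""
        return 1.0 if predicted == target.strip() else 0.0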

GRPO (Group Relative Policy Optimization)

GRPO is a gradient-based reinforcement learning algorithm that samples a group of completions for each prompt, scores each completion with a verifiable reward, normalizes the rewards within the group to obtain advantages (so no learned value model is needed), and updates the policy with a clipped, PPO-style objective, typically with a KL penalty toward a reference model.
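
As a rough illustration of the group-relative step (not the exact code in grpo_train.py), the sketch below standardizes each completion's reward against the other completions sampled for the same prompt; the function name and tensor shapes are assumptions.

    # Illustrative sketch of group-relative advantages; see grpo_train.py for the real code.
    import torch


    def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """rewards: (num_prompts, group_size) rewards for the completions sampled per prompt.

        Each completion's advantage is its reward normalized against the other
        completions in the same group, so no learned critic is required.
        """
        mean = rewards.mean(dim=1, keepdim=True)
        std = rewards.std(dim=1, keepdim=True)
        return (rewards - mean) / (std + eps)


    # Example: 2 prompts, 4 completions each, binary correctness rewards.
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                            [0.0, 0.0, 0.0, 1.0]])
    print(group_relative_advantages(rewards))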

ES (Evolution Strategies)

Recent research demonstrates that ES can successfully fine-tune LLMs with billions of parameters (Qiu et al., 2025), outperforming RL methods in sample efficiency, robustness, and stability, particularly on tasks with sparse, outcome-only rewards. ES is inspired by natural evolution: rather than backpropagating gradients, it perturbs the model's parameters with random noise, evaluates each perturbed copy with the task reward, and combines the reward-weighted perturbations into a parameter update.
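
A minimal sketch of a single ES update on a flat parameter vector is shown below, assuming antithetic Gaussian perturbations and a black-box reward function; it is illustrative only and not the implementation in es_train.py.

    # Illustrative ES step on a flat parameter vector (not the exact es_train.py code).
    import numpy as np


    def es_step(theta, reward_fn, pop_size=16, sigma=0.02, lr=0.002, rng=None):
        """One Evolution Strategies update.

        theta: current parameters (1-D array); reward_fn: maps parameters -> scalar reward.
        Perturb theta with antithetic Gaussian noise, evaluate each candidate's reward,
        and move theta along the reward-weighted average of the noise (no gradients).
        """
        if rng is None:
            rng = np.random.default_rng(0)
        noise = rng.standard_normal((pop_size, theta.size))
        noise = np.concatenate([noise, -noise])          # antithetic pairs reduce variance
        rewards = np.array([reward_fn(theta + sigma * n) for n in noise])
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # standardize rewards
        return theta + lr / (2 * pop_size * sigma) * noise.T @ rewards


    # Toy usage: maximize -||theta||^2, so theta's norm should shrink markedly.
    theta = np.ones(4)
    for _ in range(100):
        theta = es_step(theta, lambda p: -np.sum(p ** 2))
    print(theta)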

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

References

Acknowledgments