# Webscale-RL Dataset
[Paper](https://huggingface.co/papers/2510.06499) | [Code](https://github.com/SalesforceAIResearch/PretrainRL-pipeline)
## Dataset Description
**Webscale-RL** is a large-scale reinforcement learning dataset designed to address the fundamental bottleneck in LLM RL training: the scarcity of high-quality, diverse RL data.
While pretraining leverages **>1T diverse web tokens**, existing RL datasets remain limited to **<10B tokens** with constrained diversity. Webscale-RL bridges this gap by converting pretraining corpora into verifiable query and ground-truth answer pairs, effectively scaling RL data to pretraining levels while preserving the diversity of the original sources.

**Note**: This dataset was generated using GPT and should not be used to develop models that compete with OpenAI.
## Data Pipeline
The pretraining-to-RL data pipeline includes four stages:
1. **Filter**: Pre-processes and filters raw materials for quality
2. **Identifier**: Identifies the source domain and a target persona for each document
3. **Generator**: Creates question-answer pairs based on identified personas
4. **Checker**: Validates generated content for quality and correctness
More details can be found in [PretrainRL-pipeline](https://github.com/SalesforceAIResearch/PretrainRL-pipeline).
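
To make the flow concrete, below is a minimal, hypothetical sketch of how the four stages could compose. The function bodies are illustrative stubs only; the actual, LLM-driven implementation lives in the repository linked above.

```python
# Hypothetical sketch of the four pipeline stages; stub logic only,
# not the actual PretrainRL-pipeline implementation.

def quality_filter(text: str) -> bool:
    # Stage 1 (Filter): keep documents that are substantial enough to question.
    return len(text.split()) > 50

def identify(text: str) -> tuple[str, str]:
    # Stage 2 (Identifier): classify the domain and pick a reader persona.
    # The real pipeline prompts an LLM here; this stub returns fixed values.
    return ("Web text", "curious general reader")

def generate_qa(text: str, domain: str, persona: str) -> tuple[str, str]:
    # Stage 3 (Generator): produce a verifiable question/answer pair
    # grounded in the source text (also LLM-driven in the real pipeline).
    return ("What is the main topic of the passage?", "placeholder answer")

def check(text: str, question: str, answer: str) -> bool:
    # Stage 4 (Checker): verify the pair is answerable from the text and correct.
    return bool(question) and bool(answer)

def pretraining_to_rl(text: str) -> dict | None:
    if not quality_filter(text):
        return None
    domain, persona = identify(text)
    question, answer = generate_qa(text, domain, persona)
    if not check(text, question, answer):
        return None
    return {"pretraining_text": text, "domain": domain, "persona": persona,
            "question": question, "answer": answer}
```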
## Dataset Sources
We release ~1.1M samples in the Webscale-RL dataset. In principle, the data pipeline can scale the dataset further, toward pretraining levels. The Webscale-RL dataset is constructed from the pretraining corpora below, following the data recipe of [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23).
| Source | Samples | Domain |
|--------|------|--------|
| DCLM | ~550K | Web text |
| Wikipedia | ~300K | Encyclopedia |
| MegaMath | ~100K | Mathematics |
| OpenMathReasoning | ~100K | Math reasoning |
| OpenCodeReasoning | ~50K | Code reasoning |
**Note**: OpenMathReasoning and OpenCodeReasoning are also included in the SmolLM3 pretraining recipe. See [pretraining datasets](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) for more details.
## Dataset Structure
Each sample in the dataset contains:
- `pretraining_text`: The original text from the source material
- `domain`: The domain of the source material
- `persona`: The persona of the source material
- `question`: A verifiable question or prompt extracted from the source material
- `answer`: The ground-truth answer
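
For illustration, a single record has roughly the following shape; the values here are invented placeholders, not actual dataset content.

```python
# Illustrative record shape; all values are invented placeholders.
sample = {
    "pretraining_text": "The Pacific Ocean is the largest of Earth's oceanic divisions...",
    "domain": "Encyclopedia",
    "persona": "geography student",
    "question": "Which of Earth's oceanic divisions is the largest?",
    "answer": "The Pacific Ocean",
}
```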
## Usage
```python
from datasets import load_dataset

dataset = load_dataset("Salesforce/Webscale-RL")

# Example of accessing data
for sample in dataset["train"]:
    print(f"Pretraining Text: {sample['pretraining_text']}")
    print(f"Question: {sample['question']}")
    print(f"Answer: {sample['answer']}")
    break  # remove to iterate over the full split
```
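
Because the train split contains ~1.1M samples, streaming and filtering can be more convenient than a full download. The sketch below assumes the `domain` field stores labels like those in the source table above (e.g. containing "Math" for math-derived samples).

```python
from datasets import load_dataset

# Stream the dataset instead of downloading all ~1.1M samples at once.
streamed = load_dataset("Salesforce/Webscale-RL", split="train", streaming=True)

# Keep only math-related samples (assumes the `domain` field contains "Math"
# for math-derived sources; adjust the predicate to your needs).
math_only = streamed.filter(lambda s: "Math" in s["domain"])

for sample in math_only.take(3):
    print(sample["question"])
```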
## Citation
If you use this dataset in your research, please cite:
```bibtex
@article{cen2025webscalerl,
  title={Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels},
  author={Zhepeng Cen and Haolin Chen and Shiyu Wang and Zuxin Liu and Zhiwei Liu and Ding Zhao and Silvio Savarese and Caiming Xiong and Huan Wang and Weiran Yao},
  journal={arXiv preprint arXiv:2510.06499},
  year={2025},
}
```
