# Webscale-RL Dataset

[Paper](https://huggingface.co/papers/2510.06499) | [Code](https://github.com/SalesforceAIResearch/PretrainRL-pipeline)

## Dataset Description

**Webscale-RL** is a large-scale reinforcement learning dataset designed to address the fundamental bottleneck in LLM RL training: the scarcity of high-quality, diverse RL data. While pretraining leverages **>1T diverse web tokens**, existing RL datasets remain limited to **<10B tokens** with constrained diversity. Webscale-RL bridges this gap by converting pretraining corpora into verifiable query and ground-truth answer pairs, effectively scaling RL data to pretraining levels while preserving the diversity of the original sources.

![Webscale-RL Pipeline](assets/webscale-rl-pipeline.png)

**Note**: This dataset was generated using GPT and should not be used to develop models that compete with OpenAI.

## Data Pipeline

The pretraining-to-RL data pipeline includes four stages:

1. **Filter**: Pre-processes and filters the raw material for quality
2. **Identifier**: Identifies the domain classification and target personas
3. **Generator**: Creates question-answer pairs based on the identified personas
4. **Checker**: Validates the generated content for quality and correctness

More details can be found in [PretrainRL-pipeline](https://github.com/SalesforceAIResearch/PretrainRL-pipeline).

## Dataset Sources

We release ~1.1M samples in the Webscale-RL dataset. In principle, the data pipeline can be used to scale the dataset further, toward pretraining-level size. The Webscale-RL dataset is constructed from the pretraining corpora below, with the construction following the recipe of [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23).

| Source | Size | Domain |
|--------|------|--------|
| DCLM | ~550K | Web text |
| Wikipedia | ~300K | Encyclopedia |
| MegaMath | ~100K | Mathematics |
| OpenMathReasoning | ~100K | Math reasoning |
| OpenCodeReasoning | ~50K | Code reasoning |

**Note**: OpenMathReasoning and OpenCodeReasoning are also included in the SmolLM3 pretraining recipe. See [pretraining datasets](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) for more details.

## Dataset Structure

Each sample in the dataset contains:

- `pretraining_text`: The original text from the source material
- `domain`: The domain of the source material
- `persona`: The target persona identified for the source material
- `question`: A verifiable question or prompt extracted from the source material
- `answer`: The ground-truth answer

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Salesforce/Webscale-RL")

# Example of accessing data
for sample in dataset["train"]:
    print(f"Pretraining Text: {sample['pretraining_text']}")
    print(f"Question: {sample['question']}")
    print(f"Answer: {sample['answer']}")
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@article{cen2025webscalerl,
  title={Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels},
  author={Zhepeng Cen and Haolin Chen and Shiyu Wang and Zuxin Liu and Zhiwei Liu and Ding Zhao and Silvio Savarese and Caiming Xiong and Huan Wang and Weiran Yao},
  journal={arXiv preprint arXiv:2510.06499},
  year={2025},
}
```
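## Using the Answers as Verifiable Rewards (Sketch)

Because each sample pairs a verifiable `question` with a ground-truth `answer`, one simple way to obtain an RL reward is to compare a policy model's response against `answer`. The snippet below is a minimal, illustrative sketch of that idea and is not part of the released pipeline: `normalize` and `exact_match_reward` are hypothetical helpers, and the stand-in `model_response` should be replaced by the output of your own policy model.

```python
from datasets import load_dataset


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient string comparison (hypothetical helper)."""
    return " ".join(text.lower().split())


def exact_match_reward(model_response: str, ground_truth: str) -> float:
    """Return 1.0 if the normalized response matches the ground-truth answer, else 0.0."""
    return float(normalize(model_response) == normalize(ground_truth))


dataset = load_dataset("Salesforce/Webscale-RL", split="train")

sample = dataset[0]
prompt = sample["question"]      # query shown to the policy model
ground_truth = sample["answer"]  # verifiable target used for the reward

# model_response = policy.generate(prompt)  # placeholder: your policy model's output
model_response = ground_truth                # stand-in so the sketch runs end to end

print(f"Prompt: {prompt}")
print(f"Reward: {exact_match_reward(model_response, ground_truth)}")
```

In practice, a stricter or domain-specific verifier (numeric comparison for math, execution-based checking for code, or an LLM judge) may be preferable to plain string matching.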

## Reviews

  • rockingdingo 2025-10-20 12:02
    Interesting: 5, Helpfulness: 5, Correctness: 5

    This is a web-scale dataset useful for various tasks, but I am wondering: since this is an RL dataset, where do I get the reward or label? Or is the reward derived directly from the ground-truth answer?
