# Webscale-RL Dataset
[Paper](https://huggingface.co/papers/2510.06499) | [Code](https://github.com/SalesforceAIResearch/PretrainRL-pipeline)
## Dataset Description
**Webscale-RL** is a large-scale reinforcement learning dataset designed to address the fundamental bottleneck in LLM RL training: the scarcity of high-quality, diverse RL data.
While pretraining leverages **>1T diverse web tokens**, existing RL datasets remain limited to **<10B tokens** with constrained diversity. Webscale-RL bridges this gap by converting pretraining corpora into verifiable query and ground-truth answer pairs, effectively scaling RL data to pretraining levels while preserving the diversity of the original sources.

**Note**: This dataset was generated using GPT and should not be used to develop models that compete with OpenAI.
## Data Pipeline
The pretraining-to-RL data pipeline includes four stages:
1. **Filter**: Pre-processes and filters raw materials for quality
2. **Identifier**: Identifies the source domain and a target persona for each document
3. **Generator**: Creates question-answer pairs based on identified personas
4. **Checker**: Validates generated content for quality and correctness
More details can be found in [PretrainRL-pipeline](https://github.com/SalesforceAIResearch/PretrainRL-pipeline).
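
To make the flow concrete, below is a minimal, hypothetical sketch of how the four stages could compose. The function bodies are illustrative stubs only; the actual, LLM-driven implementation lives in the repository linked above.

```python
# Hypothetical sketch of the four pipeline stages; stub logic only,
# not the actual PretrainRL-pipeline implementation.

def quality_filter(text: str) -> bool:
    # Stage 1 (Filter): keep documents that are substantial enough to question.
    return len(text.split()) > 50

def identify(text: str) -> tuple[str, str]:
    # Stage 2 (Identifier): classify the domain and pick a reader persona.
    # The real pipeline prompts an LLM here; this stub returns fixed values.
    return ("Web text", "curious general reader")

def generate_qa(text: str, domain: str, persona: str) -> tuple[str, str]:
    # Stage 3 (Generator): produce a verifiable question/answer pair
    # grounded in the source text (also LLM-driven in the real pipeline).
    return ("What is the main topic of the passage?", "placeholder answer")

def check(text: str, question: str, answer: str) -> bool:
    # Stage 4 (Checker): verify the pair is answerable from the text and correct.
    return bool(question) and bool(answer)

def pretraining_to_rl(text: str) -> dict | None:
    if not quality_filter(text):
        return None
    domain, persona = identify(text)
    question, answer = generate_qa(text, domain, persona)
    if not check(text, question, answer):
        return None
    return {"pretraining_text": text, "domain": domain, "persona": persona,
            "question": question, "answer": answer}
```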
## Dataset Sources
We release ~1.1M samples in the Webscale-RL dataset. In principle, the data pipeline can scale the dataset further, toward pretraining levels. The Webscale-RL dataset is constructed from the pretraining corpora below, following the data recipe of [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23).
| Source | Samples | Domain |
|--------|------|--------|
| DCLM | ~550K | Web text |
| Wikipedia | ~300K | Encyclopedia |
| MegaMath | ~100K | Mathematics |
| OpenMathReasoning | ~100K | Math reasoning |
| OpenCodeReasoning | ~50K | Code reasoning |
**Note**: OpenMathReasoning and OpenCodeReasoning are also included in the SmolLM3 pretraining recipe. See [pretraining datasets](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) for more details.
## Dataset Structure
Each sample in the dataset contains:
- `pretraining_text`: The original text from the source material
- `domain`: The domain of the source material
- `persona`: The persona of the source material
- `question`: A verifiable question or prompt extracted from the source material
- `answer`: The ground-truth answer
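
For illustration, a single record has roughly the following shape; the values here are invented placeholders, not actual dataset content.

```python
# Illustrative record shape; all values are invented placeholders.
sample = {
    "pretraining_text": "The Pacific Ocean is the largest of Earth's oceanic divisions...",
    "domain": "Encyclopedia",
    "persona": "geography student",
    "question": "Which of Earth's oceanic divisions is the largest?",
    "answer": "The Pacific Ocean",
}
```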
## Usage
```python
from datasets import load_dataset

dataset = load_dataset("Salesforce/Webscale-RL")

# Example of accessing data
for sample in dataset["train"]:
    print(f"Pretraining Text: {sample['pretraining_text']}")
    print(f"Question: {sample['question']}")
    print(f"Answer: {sample['answer']}")
    break  # remove to iterate over the full split
```
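
Because the train split contains ~1.1M samples, streaming and filtering can be more convenient than a full download. The sketch below assumes the `domain` field stores labels like those in the source table above (e.g. containing "Math" for math-derived samples).

```python
from datasets import load_dataset

# Stream the dataset instead of downloading all ~1.1M samples at once.
streamed = load_dataset("Salesforce/Webscale-RL", split="train", streaming=True)

# Keep only math-related samples (assumes the `domain` field contains "Math"
# for math-derived sources; adjust the predicate to your needs).
math_only = streamed.filter(lambda s: "Math" in s["domain"])

for sample in math_only.take(3):
    print(sample["question"])
```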
## Citation
If you use this dataset in your research, please cite:
```bibtex
@article{cen2025webscalerl,
  title={Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels},
  author={Zhepeng Cen and Haolin Chen and Shiyu Wang and Zuxin Liu and Zhiwei Liu and Ding Zhao and Silvio Savarese and Caiming Xiong and Huan Wang and Weiran Yao},
  journal={arXiv preprint arXiv:2510.06499},
  year={2025},
}
```
