# FineWeb-Mask

[DATAMASK Paper](https://arxiv.org/abs/2512.24265) | [GitHub Repository](https://github.com/ByteDance-Seed/DATAMASK) | [FineWeb-Mask Dataset](https://huggingface.co/datasets/DATA-MASK/FineWeb-Mask)

## Introduction

**FineWeb-Mask** is a 1.5-trillion-token pre-training dataset curated with the **DATAMASK** framework. Developed by the **ByteDance Seed team**, DATAMASK addresses a fundamental tension in large-scale data selection: the trade-off between **high quality** and **high diversity**. By modeling data selection as a **mask learning** problem, it produces a subset of the original [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) corpus.

FineWeb-Mask is designed to remove semantic redundancy while keeping the highest-quality samples, so that models can reach better performance with significantly less data.

## The Problem: The Quality-Diversity Trap

In large language model (LLM) pre-training, developers usually face two suboptimal choices, compounded by a third, practical obstacle:

1. **The Quality Trap:** Filtering solely by quality scores leads to diminishing returns. The selected samples become highly clustered, resulting in severe semantic redundancy.
2. **The Diversity Trap:** Filtering solely for diversity often discards high-quality samples, leading to worse performance than the original raw dataset.
3. **The Compute Bottleneck:** Traditional diversity algorithms (such as greedy selection) are computationally prohibitive for trillion-token datasets.

## Highlights: The DATAMASK Framework

DATAMASK breaks this deadlock through a "joint harvesting" strategy:

* **Joint Optimization:** Uses policy gradient algorithms to optimize quality and diversity metrics within a unified framework.
* **Extreme Acceleration:** Through probability relaxation and specialized optimization techniques, DATAMASK reduces computation time by **98.9%** compared to traditional greedy algorithms, making trillion-token selection feasible.
* **The "Balancer":** Includes a tunable parameter that lets developers set the balance (the "Golden Ratio") between quality and diversity for their specific needs.
* **Semantic De-redundancy:** Visual analysis shows that FineWeb-Mask samples are distributed evenly across high-quality regions rather than being tightly clustered.

## Evaluation Results

FineWeb-Mask shows that combining quality and diversity selection is worth more than the sum of its parts (**1+1 > 2**). With a selected subset that represents only ~10% of the original scale in specific experiments, we observed:

* **Dense Models:** A **3.2% average improvement** across 12 benchmarks for 1.5B dense models.
* **MoE Models:** A **1.9% improvement** for 7B Mixture-of-Experts (MoE) models.
* **Length Bias Correction:** Quality filters favor long text and diversity filters favor short text; DATAMASK selects a length distribution that sits between the two.

| Model Size | Dataset | Avg. Score Change vs. Baseline (12 Benchmarks) |
| --- | --- | --- |
| 1.5B Dense | FineWeb (Original) | Baseline |
| 1.5B Dense | **FineWeb-Mask** | **+3.2%** |
| 7B MoE | FineWeb (Original) | Baseline |
| 7B MoE | **FineWeb-Mask** | **+1.9%** |

## Acknowledgements

FineWeb-Mask is built upon the foundational work of the [HuggingFace FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) team. We are grateful to the open-source community for providing the raw corpora that made this optimization possible.

## Citation

If you find our dataset or the DATAMASK framework useful, please cite our work:

```bibtex
@misc{fan2025jointselectionlargescalepretraining,
  title={Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning},
  author={Ziqing Fan and Yuqiao Xian and Yan Sun and Li Shen},
  year={2025},
  eprint={2512.24265},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.24265},
}
```

## License

This dataset is released under the **Apache 2.0** license. Users should also adhere to the original license terms of the FineWeb dataset and its constituent sources.

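## Appendix: A Toy Sketch of Mask Learning

The joint optimization described under Highlights can be illustrated on a four-item toy problem. The sketch below is not the DATAMASK implementation. The scoring function `joint_score`, the `budget` penalty, the name `lam` (standing in for the tunable balance parameter), and the toy data are all assumptions made for illustration. It learns independent Bernoulli keep-probabilities with a plain REINFORCE-style policy gradient update, which may differ from the relaxation and acceleration techniques used in the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def joint_score(mask, quality, emb, lam, budget):
    """Hypothetical objective (an assumption, not the paper's):
    (1 - lam) * mean quality + lam * mean pairwise dissimilarity,
    minus a penalty for missing the selection budget."""
    chosen = [i for i, keep in enumerate(mask) if keep]
    q = sum(quality[i] for i in chosen) / len(chosen) if chosen else 0.0
    pairs = [(i, j) for a, i in enumerate(chosen) for j in chosen[a + 1:]]
    d = (sum(1.0 - cosine(emb[i], emb[j]) for i, j in pairs) / len(pairs)
         if pairs else 0.0)
    penalty = abs(len(chosen) - budget) / len(mask)
    return (1.0 - lam) * q + lam * d - penalty


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def learn_mask(quality, emb, lam, budget, steps=2000, lr=0.5, seed=0):
    """REINFORCE over independent Bernoulli keep-probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(quality)
    baseline = 0.0  # running mean reward, used to reduce variance
    for _ in range(steps):
        probs = [sigmoid(t) for t in logits]
        mask = [1 if rng.random() < p else 0 for p in probs]
        reward = joint_score(mask, quality, emb, lam, budget)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # For a Bernoulli policy, d log pi / d logit = (mask - prob).
        # Logits are clamped to keep the sigmoid numerically safe.
        logits = [max(-20.0, min(20.0, t + lr * advantage * (m - p)))
                  for t, m, p in zip(logits, mask, probs)]
    return [sigmoid(t) for t in logits]


# Toy data: items 0 and 1 are high quality and share an embedding,
# item 2 is a distinct medium-quality item, item 3 is low quality.
QUALITY = [0.9, 0.9, 0.6, 0.2]
EMB = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]

if __name__ == "__main__":
    for lam in (0.0, 0.5):
        dup = joint_score([1, 1, 0, 0], QUALITY, EMB, lam, budget=2)
        mixed = joint_score([1, 0, 1, 0], QUALITY, EMB, lam, budget=2)
        print(f"lam={lam}: duplicate pair={dup:.3f}, diverse pair={mixed:.3f}")
    print(learn_mask(QUALITY, EMB, lam=0.5, budget=2))
```

Under this toy objective, `lam=0.0` (quality only) scores the redundant pair of items 0 and 1 at 0.9, above the diverse pair of items 0 and 2 at 0.75, which mirrors the quality trap. With `lam=0.5` the ranking flips: the diverse pair scores 0.875 and the redundant pair 0.45. The learned probabilities depend on the seed and step count, and a sampled policy gradient can settle in a local optimum, so they should be read as a demonstration of the update rule rather than as a result.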
## Contact

- Ziqing Fan: zqfan_knight@sjtu.edu.cn
- Yuqiao Xian: ericxian1997@gmail.com
