# FineWeb-Mask
[DATAMASK Paper](https://arxiv.org/abs/2512.24265) | [GitHub Repository](https://github.com/ByteDance-Seed/DATAMASK) | [FineWeb-Mask Dataset](https://huggingface.co/datasets/DATA-MASK/FineWeb-Mask)
## Introduction
**FineWeb-Mask** is a 1.5-trillion-token pre-training dataset curated for training efficiency with the **DATAMASK** framework. Developed by the **ByteDance Seed team**, DATAMASK addresses a fundamental tension in large-scale data selection: the trade-off between **high quality** and **high diversity**.
DATAMASK models data selection as a **Mask Learning** problem, and FineWeb-Mask is the resulting derivative of the original [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) corpus. It is designed to eliminate semantic redundancy while preserving the highest-quality samples, so that models can reach better performance with significantly less data.
## The Problem: The Quality-Diversity Trap
In large language model (LLM) pre-training, developers usually face two suboptimal choices, compounded by a practical constraint:
1. **The Quality Trap:** Filtering solely by quality scores leads to "diminishing returns." Samples become highly clustered, resulting in severe semantic redundancy.
2. **The Diversity Trap:** Filtering solely for diversity often discards high-value quality samples, leading to worse performance than the original raw dataset.
3. **The Compute Bottleneck:** Traditional diversity algorithms (like greedy selection) are computationally prohibitive for trillion-token datasets.
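
The three points above can be illustrated on synthetic data. The sketch below is a toy example and not taken from the DATAMASK paper: the 2-D "embeddings", the quality scores, and the farthest-point greedy selector are all assumptions chosen for illustration, and the greedy baseline used in the paper may differ. It shows that top-k selection by quality collapses into a tight cluster, that a purely diversity-driven selection gives up quality, and that the greedy selector needs on the order of `n * k` distance evaluations.

```python
import numpy as np

def mean_pairwise_distance(x):
    """Average Euclidean distance between all distinct pairs of rows."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    n = len(x)
    return dists.sum() / (n * (n - 1))

def top_k_by_quality(quality, k):
    """Quality-only filter: keep the k highest-scoring samples."""
    return np.argsort(-quality)[:k]

def greedy_farthest_point(emb, k):
    """Diversity-only filter (illustrative): greedy max-min selection.

    Every step scans all n points, so the cost is O(n * k) distance
    evaluations, which grows quickly with corpus size.
    """
    selected = [0]
    min_dist = np.linalg.norm(emb - emb[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))  # point farthest from the current selection
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return np.array(selected)

# Synthetic corpus (assumption): 200 high-quality near-duplicates in a tight
# cluster plus 800 lower-quality samples spread over a wide area.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.05, size=(200, 2))
spread = rng.uniform(-5.0, 5.0, size=(800, 2))
emb = np.vstack([cluster, spread])
quality = np.concatenate([rng.uniform(0.8, 1.0, 200), rng.uniform(0.0, 0.7, 800)])

k = 100
q_idx = top_k_by_quality(quality, k)
d_idx = greedy_farthest_point(emb, k)

print("quality-only  : mean quality %.2f, mean pairwise distance %.2f"
      % (quality[q_idx].mean(), mean_pairwise_distance(emb[q_idx])))
print("diversity-only: mean quality %.2f, mean pairwise distance %.2f"
      % (quality[d_idx].mean(), mean_pairwise_distance(emb[d_idx])))
```

In this toy setup the quality-only subset is far more redundant, while the diversity-only subset has a much lower average quality score.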
## Highlights: The DATAMASK Framework
DATAMASK breaks this deadlock through a "joint harvesting" strategy:
* **Joint Optimization:** Uses Policy Gradient algorithms to optimize both quality and diversity metrics within a unified framework.
* **Extreme Acceleration:** Through probability relaxation and specialized optimization techniques, DATAMASK reduces computation time by **98.9%** compared to traditional greedy algorithms, making trillion-token selection feasible.
* **The "Balancer":** Includes a tunable parameter that allows developers to define the "Golden Ratio" between quality and diversity for their specific needs.
* **Semantic De-redundancy:** Visual analysis shows that FineWeb-Mask samples are distributed evenly across high-quality regions rather than being rigidly clustered.
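
To make the idea of policy-gradient mask learning concrete, here is a minimal sketch under stated assumptions. It is not the DATAMASK implementation: the reward function, the cluster-coverage diversity proxy, and the names `lam` (standing in for the balancer), `budget`, and `beta` are hypothetical choices for illustration. Each sample gets an independent keep-probability, binary masks are sampled from those probabilities, and a REINFORCE update with a batch-mean baseline moves the probabilities toward masks with higher reward.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward(mask, quality, cluster_id, n_clusters, lam, budget, beta=10.0):
    """Toy reward (assumption): lam * mean quality + (1 - lam) * cluster
    coverage, minus a penalty for deviating from the selection budget."""
    n_sel = mask.sum()
    if n_sel == 0:
        q_term, d_term = 0.0, 0.0
    else:
        q_term = quality[mask].mean()
        d_term = len(np.unique(cluster_id[mask])) / n_clusters
    penalty = beta * (n_sel / len(mask) - budget) ** 2
    return lam * q_term + (1.0 - lam) * d_term - penalty

def learn_mask(quality, cluster_id, n_clusters, lam, budget,
               steps=300, batch=64, lr=5.0, seed=0):
    """REINFORCE over independent Bernoulli keep-probabilities."""
    rng = np.random.default_rng(seed)
    n = len(quality)
    theta = np.full(n, np.log(budget / (1.0 - budget)))  # start at p = budget
    for _ in range(steps):
        p = sigmoid(theta)
        masks = rng.random((batch, n)) < p  # sample a batch of binary masks
        rewards = np.array([reward(m, quality, cluster_id, n_clusters, lam, budget)
                            for m in masks])
        advantage = rewards - rewards.mean()  # batch-mean baseline
        # For m ~ Bernoulli(sigmoid(theta)): d log p(m) / d theta = m - p
        grad = (advantage[:, None] * (masks.astype(float) - p)).mean(axis=0)
        theta += lr * grad  # gradient ascent on expected reward
    return sigmoid(theta)

def expected_reward(p, quality, cluster_id, n_clusters, lam, budget,
                    samples=2000, seed=123):
    """Monte Carlo estimate of the expected reward under keep-probabilities p."""
    rng = np.random.default_rng(seed)
    masks = rng.random((samples, len(p))) < p
    return float(np.mean([reward(m, quality, cluster_id, n_clusters, lam, budget)
                          for m in masks]))

# Synthetic data (assumption): 40 samples in 8 clusters with random quality scores.
rng = np.random.default_rng(1)
n, n_clusters, budget, lam = 40, 8, 0.25, 0.5
cluster_id = np.arange(n) % n_clusters
quality = rng.uniform(0.0, 1.0, n)

p_init = np.full(n, budget)
p_learned = learn_mask(quality, cluster_id, n_clusters, lam, budget)
before = expected_reward(p_init, quality, cluster_id, n_clusters, lam, budget)
after = expected_reward(p_learned, quality, cluster_id, n_clusters, lam, budget)
print(f"expected reward: {before:.3f} -> {after:.3f}")
```

Setting `lam=1.0` in this sketch reduces it to quality-only selection and `lam=0.0` to coverage-only selection; intermediate values trade the two off, which is the role the tunable balancer plays in the description above.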
## Evaluation Results
FineWeb-Mask demonstrates that jointly optimizing quality and diversity yields more than either criterion alone (**1+1 > 2**). In experiments where the selected subset was only ~10% of the original scale, we observed:
* **Dense Models:** A **3.2% average improvement** across 12 benchmarks for 1.5B dense models.
* **MoE Models:** A **1.9% improvement** for 7B Mixture-of-Experts (MoE) models.
* **Length Bias Correction:** While quality filters favor long text and diversity filters favor short text, DATAMASK strikes a balance between the two.

| Model Size | Dataset | Avg. Score Change vs. Baseline (12 Benchmarks) |
| --- | --- | --- |
| 1.5B Dense | FineWeb (Original) | Baseline |
| 1.5B Dense | **FineWeb-Mask** | **+3.2%** |
| 7B MoE | FineWeb (Original) | Baseline |
| 7B MoE | **FineWeb-Mask** | **+1.9%** |
## Acknowledgements
FineWeb-Mask is built upon the incredible foundational work of the [HuggingFace FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) team. We are grateful to the open-source community for providing the raw corpora that made this optimization possible.
## Citation
If you find our dataset or the DATAMASK framework useful, please cite our work:
```bibtex
@misc{fan2025jointselectionlargescalepretraining,
      title={Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning},
      author={Ziqing Fan and Yuqiao Xian and Yan Sun and Li Shen},
      year={2025},
      eprint={2512.24265},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.24265},
}
```
## License
This dataset is released under the **Apache 2.0** license. Users should also adhere to the original license terms of the FineWeb dataset and its constituent sources.
## Contact
- Ziqing Fan: zqfan_knight@sjtu.edu.cn
- Yuqiao Xian: ericxian1997@gmail.com