# SmolLM-Corpus

This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our [SmolLM blog post](https://huggingface.co/blog/smollm).

# Dataset subsets

## Cosmopedia v2

Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). Most of the samples are generated by prompting the model to generate content on specific topics using a web page referred to as a "seed sample," as shown in Figure 1. We use web samples to increase diversity and expand the range of prompts. You can find more details in this [blog post](https://huggingface.co/blog/smollm).

### Dataset Features

* `prompt (string)`: The input prompt used to generate the text.
* `text (string)`: The generated text content.
* `token_length (int64)`: The length of the text in tokens (Mistral-7B tokenizer).
* `audience (string)`: The intended audience for the content.
* `format (string)`: The format of the content (e.g., textbook, story).
* `seed_data (string)`: The seed sample used to generate the text.

### Loading the dataset

```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train", num_proc=16)
print(ds[0])
```

## Python-Edu

The `python-edu` subset consists of Python files that were scored 4 or more by the [educational code model](https://huggingface.co/HuggingFaceTB/python-edu-scorer). The files were extracted from the [`stack-v2-train`](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids) dataset.

### Dataset Features

* `blob_id (string)`: Software Heritage (SWH) ID of the file on AWS S3.
* `repo_name (string)`: Repository name on GitHub.
* `path (string)`: The file path within the repository.
* `length_bytes (int64)`: Length of the file content in UTF-8 bytes.
* `score (float32)`: The output of the educational scoring model.
* `int_score (uint8)`: The rounded educational score.

### Downloading the data

The file contents are downloaded from Software Heritage's S3 bucket to ensure data compliance. Please refer to [the-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids) for the data license.

When running on a 16-core AWS `us-east-1` instance, this script takes ~6 hours to download the files:

```python
import boto3
import gzip
from datasets import load_dataset
from botocore.exceptions import ClientError

num_proc = 16
s3 = boto3.client('s3')
bucket_name = "softwareheritage"

def download_contents(blob_id):
    # Each file is stored under content/<blob_id> in the bucket
    key = f"content/{blob_id}"
    try:
        obj = s3.get_object(Bucket=bucket_name, Key=key)
        # Objects are gzip-compressed; decompress and decode as UTF-8,
        # dropping any bytes that cannot be decoded
        with gzip.GzipFile(fileobj=obj['Body']) as fin:
            content = fin.read().decode("utf-8", errors="ignore")
        return {"text": content, "download_success": True}
    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchKey':
            # Flag missing files so they can be filtered out afterwards
            print(f"File not found: {key}")
            return {"text": "", "download_success": False}
        else:
            raise

ds = load_dataset("HuggingFaceTB/smollm-corpus", "python-edu", split="train", num_proc=num_proc)
ds = ds.map(download_contents, input_columns="blob_id", num_proc=num_proc)

# Filter out failed downloads
ds = ds.filter(lambda x: x['download_success'])

# Optionally, print the first example to verify the data
print(ds[0])
```

## FineWeb-Edu (deduplicated)

FineWeb-Edu-Dedup is a deduplicated subset of the [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset, containing 220 billion tokens of educational web pages. The source dataset was filtered using an educational quality classifier to retain only the highest quality educational content.
For more information, refer to the [FineWeb-v1 blog post](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).

### Dataset Features

* `text (string)`: The web page's text content.
* `id (string)`: Unique ID of the web page.
* `metadata (struct)`: Metadata about the web page, including:
  * `dump (string)`: The source CommonCrawl dump.
  * `url (string)`: The URL of the web page.
  * `date (timestamp[s])`: The date the web page was captured.
  * `file_path (string)`: The file path of the CommonCrawl snapshot.
  * `language (string)`: The language of the web page.
  * `language_score (float64)`: The language probability.
  * `token_count (int64)`: The token count of the web page (gpt2 tokenizer).
  * `score (float64)`: The educational quality score.
  * `int_score (int64)`: The rounded educational quality score.

### Loading the dataset

```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup", split="train", num_proc=16)
print(ds[0])
```

## Citation

```
@software{benallal2024smollmcorpus,
  author = {Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
  title = {SmolLM-Corpus},
  month = July,
  year = 2024,
  url = {https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus}
}
```
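
## Usage sketch

The Cosmopedia v2 feature list describes flat records with `format` and `token_length` fields. The following is a minimal sketch, not part of the original card, of how such records might be selected and summarized. The helper names (`select_by_format`, `tokens_per_format`) and the toy records, including their `audience` and `seed_data` values, are hypothetical and only mirror the documented column names.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def select_by_format(records: Iterable[Record], wanted_format: str,
                     max_token_length: int) -> List[Record]:
    """Keep records with the given `format` and at most `max_token_length` tokens."""
    return [
        r for r in records
        if r["format"] == wanted_format and r["token_length"] <= max_token_length
    ]


def tokens_per_format(records: Iterable[Record]) -> Counter:
    """Sum `token_length` per `format` value."""
    totals: Counter = Counter()
    for r in records:
        totals[r["format"]] += r["token_length"]
    return totals


# Hypothetical toy records that only mimic the documented Cosmopedia v2 columns.
toy_records: List[Record] = [
    {"prompt": "p1", "text": "t1", "token_length": 120,
     "audience": "general", "format": "textbook", "seed_data": "sample_a"},
    {"prompt": "p2", "text": "t2", "token_length": 800,
     "audience": "general", "format": "story", "seed_data": "sample_b"},
    {"prompt": "p3", "text": "t3", "token_length": 950,
     "audience": "general", "format": "textbook", "seed_data": "sample_c"},
]

short_textbooks = select_by_format(toy_records, "textbook", max_token_length=500)
totals = tokens_per_format(toy_records)
print(len(short_textbooks), dict(totals))
```

With the `datasets` library, the same condition could presumably be passed to `ds.filter`, for example `ds.filter(lambda r: r["format"] == "textbook" and r["token_length"] <= 500)`, assuming the loaded records expose these columns as described above.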
