You Are What You Eat: Curating Data for LLM Pre-training

How strategic data curation underpins LLM performance, cost efficiency, and knowledge coverage.

In the world of Large Language Models (LLMs), ‘You are what you eat’ has never been truer. An LLM’s capabilities hinge directly on the quality and composition of its pre-training data. Strategic data preparation is paramount: it directly influences computational cost (often $100M+ for frontier training runs), accuracy, and the breadth of knowledge an LLM acquires. Training runs consume vast resources, so understanding and optimizing pre-training data is a critical driver of progress and viability in AI. At Pearls of Wisdom (prls.co), we apply these cutting-edge principles to ensure your company’s information is optimally integrated into LLMs.

1. Key Drivers for Data Curation Strategy

The extensive computational and financial costs associated with LLM pre-training necessitate strategic data curation. Key motivations include:

  • Computational Resource Optimization: Reducing dataset size directly curtails computational load (FLOPs). Studies indicate compute savings ranging from 70% to nearly 90%. For instance, the DataComp for Language Models (DCLM) framework shows that training on a thoughtfully selected 1.58% of the available data produced models that cost 6.6x less to train, while the resulting DCLM-BASELINE still achieved a state-of-the-art MMLU score (64%) among open-data models. DatologyAI’s DAIT dataset enabled reaching target accuracy with 86.9% less compute compared to RedPajama V1, a 7.7x training speedup (a quick sanity-check calculation follows this list).
  • Shorter Optimization Path: Enhanced data quality, achieved by removing noise and irrelevant content, accelerates model convergence and improves final performance. The Ask-LLM methodology demonstrated that models trained on quality-filtered data can converge up to 70% faster, even when rejecting 90% of the original dataset, and still outperform models trained on the full dataset. Similarly, model-based filtering using fastText (with OH-2.5 + ELI5 reference data) was identified as a key component of the DCLM-BASELINE dataset’s success.
  • Acquisition of Specialized and Long-Tail Knowledge: To develop LLMs with nuanced understanding and capabilities in specialized domains, incorporating high-quality “long-tail” data is crucial. Datasets like Nemotron-CC, with 4.4 trillion globally deduplicated real tokens plus an additional 1.9 trillion synthetic tokens, are designed to capture a broader range of information. This extensive pool of unique tokens has proven essential for long-horizon training, enabling an 8B parameter model trained for 15T tokens (7.2T from Nemotron-CC) to achieve +5 MMLU points over Llama 3.1 8B. Nemotron-CC-HQ (a 1.1T token subset) showed a +5.6 MMLU gain over DCLM when training 8B models for 1T tokens. This is precisely the “knowledge gap” that Pearls of Wisdom (prls.co) addresses for individual companies and their proprietary offerings, ensuring your specific information is integrated into the LLM’s knowledge base, not just generic facts.
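
A quick way to sanity-check how these compute reductions map onto the reported speedups: if wall-clock time scales linearly with FLOPs on fixed hardware, cutting compute by a fraction r speeds training up by a factor of 1/(1 - r). The snippet below is purely illustrative arithmetic under that assumption and is not taken from any of the cited papers.

```python
# Illustrative arithmetic: a fractional compute reduction r implies a 1/(1 - r) speedup,
# assuming wall-clock time scales linearly with FLOPs on fixed hardware.
def speedup(reduction: float) -> float:
    return 1.0 / (1.0 - reduction)

print(f"86.9% less compute -> {speedup(0.869):.1f}x faster")  # ~7.6x, close to the reported 7.7x
print(f"a 6.6x cost reduction corresponds to ~{1 - 1 / 6.6:.0%} less compute")  # ~85%
```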

2. Foundational Datasets & Methodologies: Learning from the Open

While most leading models (like GPT, Gemini, Claude, and Grok from xAI) keep their data preparation pipelines proprietary, and even many open-weight models (like Llama, Mistral, Qwen, and DeepSeek) don’t fully detail their datasets, some, like OLMo from the Allen Institute for AI (AI2) and Nemotron from NVIDIA, offer valuable transparency. Notably, the OLMo 2 models perform on par with comparable open-weight models at the ~7B and ~32B scales, and Nemotron Ultra 253B approaches the performance of SOTA models from OpenAI, Google, xAI, and Anthropic.

Benchmarks and practical usage suggest there isn’t a vast difference between many open and closed models. Furthermore, based on conversations with engineers and researchers in the Bay Area, core pre-training data preparation techniques appear highly similar across the industry. The primary differentiator for closed-source models often lies in the extensive use of paid, professionally labeled, and specially prepared data from services like Scale AI and SurgeHQ.

The techniques discussed in open literature form the foundation used by academia and industry alike, including our team at Pearls of Wisdom (prls.co). This blog post relies heavily on insights from AI2’s Dolma and NVIDIA’s Nemotron-CC papers, alongside other foundational works. These papers help build intuition about what data goes into models, what gets filtered out, what is synthesized, and how it’s all mixed together. Here are those key publications:

  • Trafilatura (2021). Size: tool. Sources: web pages (HTML, XML). Impact: open-source tool for robust text/metadata extraction and boilerplate removal from web pages; an F1 score of 0.912 is reported.
  • C4, the Colossal Clean Crawled Corpus (2020). Size: ~750 GB of text / ~156B tokens from the April 2019 Common Crawl scrape. Sources: filtered Common Crawl (heuristics such as line length, terminal punctuation, bad-word lists). Impact: introduced a large, cleaner version of Common Crawl for pre-training T5; widely adopted as a baseline.
  • The Pile (2020). Size: ~825 GiB / 387B tokens (v1). Sources: diverse mix of 22 datasets, including Pile-CC (Common Crawl), PubMed Central, Books3, OpenWebText2, arXiv, GitHub, and Wikipedia. Impact: championed dataset diversity for general-purpose LLMs; open and thoroughly documented; used to train many open models.
  • RefinedWeb, for the Falcon LLMs (2023). Size: 5T tokens in full (from >90 CC dumps); a 600B-token extract released publicly. Sources: extensively filtered (URL, document-wise, line-wise) and deduplicated (MinHash, exact substring) Common Crawl. Impact: showed meticulously curated web data alone can produce SOTA models, outperforming models trained on mixed curated corpora like The Pile.
  • Textbooks Are All You Need, phi-1 (2023). Size: 1.3B-parameter model trained on <7B tokens (<6B filtered web, <1B synthetic textbooks, ~180M exercises). Sources: filtered “textbook-quality” web data (The Stack, StackOverflow); synthetically generated Python textbooks and exercises (GPT-3.5). Impact: demonstrated that smaller, high-quality datasets can yield remarkable performance (phi-1: 50.6% HumanEval, 55.5% MBPP), emphasizing data quality over quantity.
  • Dolma (Feb 2024). Size: 3T Llama tokens (~11.5 TB from ~200 TB raw). Sources: Common Crawl (25 snapshots, 2020-2023), GitHub (The Stack), Reddit (Pushshift), Semantic Scholar (peS2o), Project Gutenberg, Wikipedia. Impact: large-scale, open, multi-source corpus with an open-source toolkit (filtering, mixing, deduplication) for transparency and reproducibility in data curation research.
  • Ask-LLM (Feb 2024). Size: curated from existing datasets (e.g., C4) by rejecting up to 90% of examples. Sources: uses instruction-tuned LLMs (e.g., Flan-T5) to directly assess the quality of pre-training examples. Impact: showed models trained on quality-filtered data (rejecting up to 90%) converge up to 70% faster and outperform full-data training.
  • FineWeb & FineWeb-Edu (June 2024). Size: FineWeb 15T tokens; FineWeb-Edu 1.3T tokens. Sources: FineWeb draws on 96 Common Crawl snapshots, filtered (URL, language, MassiveText, C4-inspired, and custom heuristics) and per-snapshot MinHash-deduplicated (5-grams, 75% similarity); FineWeb-Edu is a subset selected with a Llama-3-70B-Instruct-based educational classifier. Impact: very large, meticulously documented web dataset; FineWeb-Edu showed significant MMLU (+12% rel.) and ARC (+24% rel.) gains from LLM-annotated educational-content filtering.
  • DataComp-LM, DCLM-BASELINE (June 2024). Size: DCLM-POOL 240T tokens; the DCLM-BASELINE 7B model was trained on 2.6T tokens drawn from this pool. Sources: Common Crawl (DCLM-POOL); DCLM-BASELINE uses resiliparse extraction, RefinedWeb heuristics, Bloom-filter deduplication, and fastText (OH-2.5 + ELI5) model-based filtering. Impact: established a benchmark for data curation; DCLM-BASELINE achieved SOTA MMLU (64%) among open-data models with 6.6x less compute than Llama 3 8B, highlighting the efficacy of fastText model-based filtering.
  • Nemotron-CC (Dec 2024). Size: 6.3T tokens (4.4T real, globally deduplicated + 1.9T synthetic). Sources: 99 Common Crawl snapshots (jusText extraction); synthetic data (rephrasing, QA, etc.) generated with Mistral NeMo 12B. Impact: balanced quality and quantity for long-horizon training; advanced curation (classifier ensembling, synthetic data, reduced heuristics) yielding SOTA results (+5 MMLU vs. Llama 3.1 8B at 15T-token training).
  • Nemotron-H, data aspects (Apr 2025). Size: pre-trained on up to 20T tokens (Nemotron-H-56B). Sources: Nemotron-CC; curated math/code/academic data; extensive synthetic data for math (OpenWebMath expansion), code (problem/solution generation), and SFT-style data (OpenMathInstruct-2, Genetic Instruct). Impact: demonstrates a large curated corpus (Nemotron-CC) plus diverse synthetic datasets with phased blending (curriculum learning) for training SOTA hybrid Mamba-Transformer models.

3. The LLM Pre-training Data Pipeline

The creation of a high-caliber pre-training dataset involves several critical stages:

  • Data Acquisition
  • Text Extraction
  • Quality & Content Filtering
  • Deduplication Strategies
  • Synthetic Data Generation
  • Corpus Assembly & Mixing
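
Before walking through each stage, the toy sketch below strings the stages together so the overall flow is concrete. Every function name and rule here is our own simplification for illustration, not the API of Dolma's toolkit or any other pipeline discussed in this post, and the synthetic-generation stage is omitted.

```python
# Toy, illustrative skeleton of a pre-training data pipeline (names and rules are
# simplifications for exposition, not the actual toolkits discussed in this post).
import re
from typing import Iterable, List

def acquire(pages: List[str]) -> Iterable[str]:
    """Stand-in for data acquisition (e.g., pulling HTML out of Common Crawl WARC files)."""
    yield from pages

def extract_text(raw_docs: Iterable[str]) -> Iterable[str]:
    """Crude tag stripping; real pipelines use Trafilatura, jusText, or resiliparse."""
    for doc in raw_docs:
        text = re.sub(r"<[^>]+>", " ", doc)
        yield " ".join(text.split())  # collapse whitespace left behind by removed tags

def filter_quality(docs: Iterable[str], min_words: int = 5) -> Iterable[str]:
    """Toy heuristic filter; real pipelines layer heuristics and model-based classifiers."""
    for doc in docs:
        if len(doc.split()) >= min_words:
            yield doc

def deduplicate(docs: Iterable[str]) -> Iterable[str]:
    """Exact dedup with a seen-set; at scale this becomes MinHash or Bloom filters."""
    seen = set()
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            yield doc

def assemble(*streams: Iterable[str]) -> List[str]:
    """Mix curated streams into a final corpus (proportional source mixing omitted here)."""
    return [doc for stream in streams for doc in stream]

corpus = assemble(deduplicate(filter_quality(extract_text(acquire([
    "<html><body><p>An example page about curating data for LLM pre-training.</p></body></html>",
])))))
print(corpus)
```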

3.1 Data Acquisition

The initial step involves amassing vast quantities of raw textual data.

  • Dominant Source: Common Crawl (CC) is the predominant source. It provides petabytes of web data and underpins most major open datasets, including Dolma (25 CC snapshots, 2020-2023), FineWeb (96 CC snapshots), DCLM-POOL (240T tokens from all CC data prior to 2023), and Nemotron-CC (99 CC snapshots, 2013-2024); a minimal acquisition sketch follows this list.
  • Challenges with Raw Web Data: Raw CC data is inherently noisy, containing boilerplate, ads, and non-textual content. Pearls of Wisdom (prls.co) recognizes these challenges and leverages proprietary AI agents to meticulously analyze and clean your existing content.
  • Intelligent Crawling: Efforts like CRAW4LLM aim for more efficient crawling by prioritizing URLs that are most influential for LLM pre-training, potentially reaching the same downstream performance while crawling only 21% of the URLs.
  • Auxiliary High-Quality Sources: Beyond web crawls, pipelines often incorporate curated sources like GitHub code (Dolma, Nemotron-H), scientific papers (e.g., Semantic Scholar for Dolma, Nemotron-H academic data), books (Project Gutenberg for Dolma, Nemotron-H academic data), Reddit discussions (Dolma), and encyclopedic content like Wikipedia (Dolma, Nemotron-H math data).
  • Proprietary Sources: Commercial LLMs often augment public data with extensively labeled and specially prepared datasets from third-party vendors. This is where Pearls of Wisdom (prls.co) excels, providing a critical source of structured, accurate data for specific companies.
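
As an illustration of the acquisition step referenced above, the sketch below lists the WARC files for a single Common Crawl snapshot via the public index at data.commoncrawl.org. The crawl ID is just one published snapshot chosen as an example; a production pipeline would download these paths in bulk (e.g., from S3) rather than enumerate them interactively.

```python
# Minimal sketch: list the WARC file paths for one Common Crawl snapshot.
# The crawl ID below is illustrative; see commoncrawl.org for the current list of snapshots.
import gzip
import io

import requests

CRAWL_ID = "CC-MAIN-2023-50"  # example snapshot ID (assumption: any published crawl works)
INDEX_URL = f"https://data.commoncrawl.org/crawl-data/{CRAWL_ID}/warc.paths.gz"

resp = requests.get(INDEX_URL, timeout=60)
resp.raise_for_status()

# The index is a gzipped text file with one WARC path per line.
with gzip.open(io.BytesIO(resp.content), "rt") as fh:
    warc_paths = [line.strip() for line in fh if line.strip()]

print(f"{len(warc_paths)} WARC files in {CRAWL_ID}")
print("first file:", f"https://data.commoncrawl.org/{warc_paths[0]}")
```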

3.2 Text Extraction

The raw data, often acquired as WARC (Web ARChive) files, primarily contains raw HTML content alongside HTTP headers. The crucial text extraction stage aims to convert this structured, often noisy, HTML into clean, plain text suitable for model training. This involves:

  • Boilerplate & Tag Removal: The primary challenge is stripping away non-content elements. Tools like Trafilatura (F1 score 0.912, used by FineWeb) and jusText (used by Nemotron-CC) are employed to remove navigation menus, advertisements, footers, and other boilerplate (a minimal extraction sketch follows this list). DCLM used resiliparse for text extraction from HTML, finding it improved CORE by at least 2.5 points over WET extraction while running 8x faster than Trafilatura. This process often includes removing most HTML tags, even structural ones like headings (<h1>, <h2>) and lists (<li>, <ol>), to isolate the core prose. The choice of tool matters; for instance, Nemotron-CC found jusText yielded 28.6% more high-quality tokens than Trafilatura. Pearls of Wisdom (prls.co) leverages advanced Natural Language Processing (NLP) to go beyond simple extraction, understanding the context and relationships within your content.
  • Noise & Symbol Reduction: Beyond tags, this step typically involves eliminating excessive whitespace, non-printing characters, and special symbols that don’t contribute significant semantic value to the text, ensuring a cleaner input stream for the model.
  • Language Identification: Early filtering for target languages (predominantly English in many SOTA datasets) is common, using tools like fastText or pycld2. Dolma used fastText with an English score threshold of ≥ 0.5, removing 61.7% of data by byte size. FineWeb and Nemotron-CC also use language classifiers.
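
As noted above, here is a minimal sketch combining Trafilatura-based extraction with fastText language identification, mirroring the ≥ 0.5 English threshold Dolma reports. The HTML string and the model path are illustrative assumptions, and the lid.176.bin language-ID model must be downloaded separately from the fastText website.

```python
# Sketch: boilerplate removal with Trafilatura, then English filtering with fastText.
# Requires: pip install trafilatura fasttext, plus the lid.176.bin language-ID model
# downloaded from the fastText website (the path below is an assumption).
import fasttext
import trafilatura

html = "<html><body><nav>Menu</nav><p>Curating data for LLM pre-training.</p></body></html>"

# Strip tags, menus, and other boilerplate; returns None if nothing useful is found
# (very short toy pages like this one may be rejected by Trafilatura's length heuristics).
text = trafilatura.extract(html, include_comments=False, include_tables=False)

lang_model = fasttext.load_model("lid.176.bin")

def is_english(doc: str, threshold: float = 0.5) -> bool:
    """Keep documents whose top predicted language is English with score >= threshold."""
    labels, scores = lang_model.predict(doc.replace("\n", " "))
    return labels[0] == "__label__en" and scores[0] >= threshold

if text and is_english(text):
    print("kept:", text)
```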

3.3 Quality & Content Filtering

Beyond basic cleanliness, enhancing data quality involves sophisticated methods to identify and retain the most valuable content for LLM training. This stage is critical for model performance and convergence speed. The proprietary AI agents at Pearls of Wisdom (prls.co) are specifically designed for this semantic winnowing, ensuring only the most relevant and accurate information about your company is prepared for LLM ingestion.
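
To make model-based quality filtering concrete, below is a minimal sketch in the spirit of the fastText classifier used for DCLM-BASELINE: a binary classifier scores each document and only the top-scoring ones are kept. The training-file name, labels, and the keep fraction shown here are illustrative assumptions, not the exact DCLM recipe.

```python
# Sketch of model-based quality filtering with a fastText classifier, in the spirit of
# DCLM's OH-2.5 + ELI5 filter. File names, labels, and the keep fraction are assumptions.
import fasttext

# quality_train.txt (hypothetical) holds lines such as:
#   __label__hq <text from a high-quality reference set>
#   __label__lq <text sampled from raw Common Crawl>
model = fasttext.train_supervised(input="quality_train.txt", epoch=5, wordNgrams=2)

def quality_score(doc: str) -> float:
    """Probability the classifier assigns to the high-quality label."""
    labels, scores = model.predict(doc.replace("\n", " "), k=2)
    return float(dict(zip(labels, scores)).get("__label__hq", 0.0))

docs = [
    "Transformer language models are trained on large curated text corpora.",
    "click here buy now best deals!!!",
]
scores = [quality_score(d) for d in docs]

# Keep only the top-scoring fraction of documents (the exact fraction is a tunable choice).
keep_fraction = 0.1
k = max(1, int(keep_fraction * len(docs)))
cutoff = sorted(scores, reverse=True)[k - 1]
kept = [d for d, s in zip(docs, scores) if s >= cutoff]
print(kept)
```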

  • Heuristic-Based Filtering: Initial passes often use rules based on document characteristics. Examples include Gopher and C4 rules (Dolma used Gopher All + C4 N