A Guide to Pre-training Large Language Models
Pre-training is the most expensive part of building an LLM. We explain the data pipeline, the 'Ne...
TLDR: Pre-training is the phase where an LLM learns "Language" and "World Knowledge" by reading trillions of tokens of text. It uses Self-Supervised Learning to predict the next word in a sentence. This creates the "Base Model" which is later fine-tuned.
The Library Metaphor: What Pre-training Actually Does
Imagine teaching a child to read.
- Pre-training: You lock the child in a library for ten years. They read every book: grammar, history, math, code, recipes. They absorb the structure and content of human knowledge. But they have no social skills; they can't follow instructions or hold a polite conversation.
- Fine-tuning (next step): You hire a tutor to teach manners ("don't say harmful things") and specific tasks ("summarize this document").
Pre-training creates the Base Model: a powerful but raw artifact. Fine-tuning shapes it into a product (ChatGPT, Claude, Gemini).
Deep Dive: Pre-Training Fundamentals
Before an LLM can write poetry, translate languages, or explain code, it needs to understand the basic patterns of human text. Pre-training is the process of building that foundational understanding from scratch, with no hand-labeled data required.
Self-supervised learning is the key insight that makes this scale. Instead of asking humans to annotate millions of examples, the model creates its own training signal: given the words so far, predict the next one. The labels are already in the data: you just cover up the next word and ask the model to guess it. Correct it when it's wrong. Repeat for trillions of tokens.
Tokenization turns raw text into sequences of integers the model can process. Most modern LLMs use Byte-Pair Encoding (BPE) or SentencePiece, which break text into subword units. The word "unbelievable" might become three tokens: un, believ, able. A typical vocabulary has 32,000–128,000 unique tokens, enough to cover most languages, programming syntax, and scientific notation without exploding in size.
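To make the BPE idea concrete, here is a minimal pure-Python sketch (a toy illustration, not a production tokenizer): training repeatedly finds the most frequent adjacent symbol pair and fuses it into a new subword unit.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent symbol pair in the token stream."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters and learn one merge.
tokens = list("aaabdaaabac")
pair = most_frequent_pair(tokens)   # ('a', 'a') occurs most often
tokens = merge_pair(tokens, pair)
print(tokens)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Running this loop thousands of times over a real corpus yields the merge table that tokenizers ship with; frequent words end up as single tokens while rare words split into subwords.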
The training corpus is the collection of text the model learns from. The mix and quality of sources shapes model strengths more than almost any other design decision:
| Source | Typical share | What it contributes |
| --- | --- | --- |
| Common Crawl (web text) | ~60–80% | Broad language coverage, diverse topics |
| Books and long-form writing | ~5–15% | Multi-paragraph coherence and reasoning |
| GitHub code repositories | ~5–10% | Programming ability and logical structure |
| Wikipedia / arXiv / papers | ~5–10% | Factual accuracy and technical depth |
A carefully filtered 300B-token corpus will produce a stronger model than a carelessly collected 3T-token one. Data curation is a competitive differentiator.
The Pre-Training Workflow: From Raw Data to Base Model
Pre-training follows a structured pipeline. Each stage is essential; skipping or rushing any one of them measurably degrades the final model quality.
```mermaid
graph TD
    A["Raw Text Sources: Web / Books / Code / Papers"] --> B["Deduplication & Quality Filtering"]
    B --> C["Tokenization (BPE / SentencePiece)"]
    C --> D["Packed Sequences (Fill Context Windows)"]
    D --> E["GPU / TPU Cluster: Forward Pass + Backpropagation"]
    E --> F{Checkpoint?}
    F -->|Every N steps| G[Save Checkpoint to Storage]
    F -->|Continue training| E
    G --> H[Base Model]
```
What each stage does:
- Dedup + filter: Remove near-duplicate web pages and low-quality HTML. Training on repeated text causes the model to memorize rather than generalize. Quality filtering is often more impactful than simply adding more raw tokens.
- Tokenize: Convert text into integer token IDs using BPE; pack multiple shorter sequences together to fill each context window completely, maximizing GPU utilization.
- Train: Run the transformer forward pass to produce token predictions, compute cross-entropy loss against the true next tokens, and backpropagate gradients to update all model weights.
- Checkpoint: Save model weights to persistent storage every few thousand steps. Multi-week training runs on thousands of GPUs are prone to hardware failures; checkpointing is the safety net that prevents catastrophic loss of progress.
- Base model: The final artifact: a transformer whose weights encode grammar, world facts, code patterns, and reasoning structures absorbed from the entire corpus.
Next-Token Prediction: The Self-Supervised Training Signal
The entire pre-training game is one question repeated billions of times:
"Given the text so far, what is the most likely next token?"
Input: "The capital of France is"
Target: "Paris"
This is self-supervised because the labels are already in the data: you just mask the next word. No human annotation needed.
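The point that the data labels itself can be shown in a couple of lines (a toy sketch; real pipelines operate on integer token IDs, not words):

```python
def next_token_pairs(tokens):
    """Every prefix of the sequence becomes an input; the token that
    immediately follows it is the training label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

examples = next_token_pairs(["The", "capital", "of", "France", "is", "Paris"])
print(examples[-1])  # (['The', 'capital', 'of', 'France', 'is'], 'Paris')
```

One sentence of six tokens yields five (input, label) pairs for free, which is why a trillion-token corpus needs no annotators.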
The Loss Function: Cross-Entropy
$$L = -\sum_{t} \log P(x_t \mid x_{<t})$$
- $x_t$: the correct next token at step $t$
- $x_{<t}$: all preceding tokens
- $P(x_t \mid x_{<t})$: the model's probability for the correct token
The model is penalized for assigning low probability to the correct next word. Minimizing $L$ over trillions of tokens forces the model to learn grammar, facts, and reasoning patterns.
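A toy numeric example of this loss (hypothetical three-word vocabulary with made-up probabilities):

```python
import math

def cross_entropy(step_probs, targets):
    """Mean negative log-probability assigned to the correct next token."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, targets)) / len(targets)

# Toy vocabulary: 0 = "Paris", 1 = "London", 2 = "Rome".
step_probs = [
    [0.7, 0.2, 0.1],   # model is fairly confident in "Paris"
    [0.1, 0.1, 0.8],   # model is fairly confident in "Rome"
]
targets = [0, 2]       # the correct next tokens at each step
loss = cross_entropy(step_probs, targets)
print(round(loss, 4))  # 0.2899 -- low, because both guesses were good
```

Had the model put only 0.01 on "Paris", that single step would contribute -log(0.01) ≈ 4.6 to the loss: the gradient pressure is strongest exactly where the model is most wrong.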
Next-Token Prediction Training Loop
```mermaid
sequenceDiagram
    participant D as Data Loader
    participant T as Tokenizer
    participant M as Transformer
    participant L as Loss Function
    participant O as Optimizer
    D->>T: Raw text sequence
    T->>M: Token IDs [x1, x2, ..., xN]
    M->>M: Forward pass (attention layers)
    M->>L: Predicted logits for each position
    L->>L: Cross-entropy vs true next token
    L->>O: Backpropagate gradients
    O->>M: Update all weights
    M->>D: Ready for next batch
```
This sequence diagram traces one complete training step from raw text to updated model weights. The data loader feeds a text sequence to the tokenizer, which converts it to integer IDs; the transformer runs a forward pass to produce logits for every position; the loss function computes cross-entropy against the true next tokens; and the optimizer uses the resulting gradients to update all model weights. The key takeaway is that no human annotation is needed anywhere in this loop: the next token in the sequence is the label, and this self-supervised cycle repeats automatically across trillions of training tokens.
The Data Pipeline: From the Web to a Training Run
```mermaid
flowchart LR
    A["Common Crawl / Books3 / GitHub / arXiv"] --> B[Deduplication]
    B --> C[Quality Filtering]
    C --> D[Tokenization]
    D --> E["Packed Sequences (2K–128K tokens)"]
    E --> F[Training Shards on Object Storage]
    F --> G[GPU Cluster]
```
This pipeline shows the journey from raw internet crawl data to packed training batches ready for a GPU cluster. Near-duplicate removal and quality filtering happen early because noisy or repeated text degrades model quality far more than simply adding more volume. The final packing step is critical for GPU efficiency: sequences are concatenated to fill each context window completely, ensuring no compute cycles are wasted on padding tokens.
| Stage | What happens | Why it matters |
| --- | --- | --- |
| Deduplication | Remove near-duplicate pages | Prevents memorization of repeated text |
| Quality filter | Remove boilerplate, low-quality HTML | Improves token efficiency |
| Tokenization | BPE / SentencePiece | Compresses text; handles rare words |
| Packing | Fill context windows to capacity | Maximizes GPU utilization |
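The packing stage in the table above can be sketched in a few lines (a simplified sketch: real loaders stream shards and typically carry the remainder into the next batch rather than dropping it):

```python
def pack_sequences(docs, context_len, sep_token=0):
    """Concatenate tokenized documents, separated by `sep_token`, then slice
    the stream into completely filled context windows (no padding needed)."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_token)
    n_windows = len(stream) // context_len
    return [stream[i * context_len:(i + 1) * context_len] for i in range(n_windows)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_sequences(docs, context_len=4))
# [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that a window can straddle two documents (the `[4, 5, 0, 6]` window above); the separator token tells the model where one document ends and the next begins.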
Training data typically includes Common Crawl (web text), Books3, GitHub code, arXiv papers, and Wikipedia. The mix ratio shapes model strengths.
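Near-duplicate detection in these pipelines is commonly built on n-gram fingerprints. Here is a toy Jaccard-similarity version (production systems use MinHash/LSH so they never compare all pairs; the 0.3 threshold below is illustrative):

```python
def shingles(text, n=3):
    """The set of word n-grams acts as a cheap document fingerprint."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap of two fingerprint sets: 1.0 means identical, 0.0 disjoint."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over a lazy dog"
sim = jaccard(shingles(doc1), shingles(doc2))
print(round(sim, 2))           # 0.4 -- one changed word still shares many trigrams
is_near_duplicate = sim > 0.3  # threshold is a tunable pipeline parameter
```

Exact-duplicate removal is easy (hash the whole page); it is this fuzzy matching of boilerplate-heavy, lightly edited pages that does most of the work at web scale.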
Deep Dive: Inside the Training Loop (Loss, Gradients, and Checkpoints)
The training loop looks like this in pseudocode:
```python
for step, batch in enumerate(training_data):
    tokens = tokenize(batch)
    logits = model(tokens[:-1])               # predict at every position
    loss = cross_entropy(logits, tokens[1:])  # compare to true next tokens
    loss.backward()                           # compute gradients
    optimizer.step()                          # update weights
    optimizer.zero_grad()
    if step % checkpoint_interval == 0:
        save_checkpoint(model)
```
In practice, training runs on thousands of GPUs or TPUs for weeks to months, using advanced parallelism strategies (data parallelism, tensor parallelism, pipeline parallelism).
Internals
Pre-training minimizes cross-entropy loss over next-token prediction: $L = -\sum_{t} \log P(x_t \mid x_{<t})$. The training corpus is tokenized and packed into fixed-length sequences (e.g., 2048 or 4096 tokens); documents are concatenated with separator tokens to maximize GPU utilization. Learning rate follows a warmup-then-cosine-decay schedule: warmup prevents early instability; cosine decay improves final perplexity by 2–5%.
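The warmup-then-cosine schedule can be written directly (a sketch; the constants below are illustrative, not taken from any specific training run):

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps  # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative values: peak 3e-4, 2k warmup steps, 100k total steps.
print(lr_at(0, 3e-4, 2000, 100_000))        # tiny LR at the very first step
print(lr_at(2000, 3e-4, 2000, 100_000))     # peak LR right after warmup
print(lr_at(100_000, 3e-4, 2000, 100_000))  # decayed to min_lr at the end
```

The ramp keeps early gradient updates small while optimizer statistics are still noisy; the slow cosine tail lets the model settle into a sharper minimum.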
Performance Analysis
Training LLaMA-2 7B requires ~184,000 A100-GPU-hours on 2T tokens, roughly $6M at cloud rates. Chinchilla scaling laws show that compute-optimal training scales parameters and training tokens in equal proportion, at roughly 20 tokens per parameter: a 7B model is optimally trained on ~140B tokens. Perplexity on standard benchmarks drops roughly logarithmically with compute, so doubling the GPU budget yields diminishing returns beyond the Chinchilla point.
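These figures follow from two widely used rules of thumb, sketched here: the Chinchilla ratio of roughly 20 training tokens per parameter, and the common estimate that training costs about 6 FLOPs per parameter per token (forward plus backward):

```python
def chinchilla_optimal_tokens(n_params):
    """Hoffmann et al. (2022): roughly 20 training tokens per parameter."""
    return 20 * n_params

def training_flops(n_params, n_tokens):
    """Common estimate: ~6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

n_params = 7e9                                # a 7B-parameter model
tokens = chinchilla_optimal_tokens(n_params)  # 1.4e11, i.e. ~140B tokens
flops = training_flops(n_params, tokens)      # ~5.9e21 FLOPs
print(f"{tokens:.1e} tokens, {flops:.1e} FLOPs")
```

Both rules are approximations (they ignore attention FLOPs, data quality, and inference-time considerations), but they are accurate enough to budget a training run before committing hardware.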
Decision Guide: What a Base Model Can and Cannot Do
| Can do | Cannot do |
| --- | --- |
| Complete text in context | Follow instructions reliably |
| Summarize if prompted cleverly | Refuse harmful requests |
| Write code that is syntactically plausible | Admit when it doesn't know |
| Translate languages | Have a consistent helpful persona |
A base model will happily continue any text you give it, including harmful content. Fine-tuning with RLHF or SFT shapes it into a helpful, harmless assistant.
Trade-offs & Failure Modes: Cost, Carbon, and Scaling
Training a frontier-scale LLM (GPT-4, LLaMA 3 70B) requires:
- Compute: thousands of H100 GPUs running for months
- Cost: $50M–$100M+ per frontier run
- Energy: significant carbon footprint
The key trade-offs:
- Data scale vs data quality: more tokens help, but noisy corpora have diminishing returns
- Larger model vs smaller but high-quality: a well-filtered 7B model can outperform a poorly trained 70B
- Pre-training breadth vs fine-tuning depth: broad pre-training creates a flexible base; fine-tuning sharpens it for specific tasks
Few organizations can afford to pre-train from scratch. Most practitioners work with open base models (LLaMA, Mistral, Qwen) and apply LoRA fine-tuning.
Pre-Training Data Pipeline
```mermaid
flowchart LR
    Web["Raw Web (Common Crawl)"]
    Books["Books3 / arXiv / GitHub / Wiki"]
    Dedup["Near-Duplicate Removal"]
    Filter["Quality Filter (heuristics + classifiers)"]
    Mix["Domain Mix (web 70%, code 15%, ...)"]
    Tok["Tokenization (BPE / SentencePiece)"]
    Pack["Pack Sequences (fill context windows)"]
    Shards["Training Shards (object storage)"]
    Train["GPU Cluster: Forward + Backprop"]
    Web --> Dedup
    Books --> Dedup
    Dedup --> Filter --> Mix --> Tok --> Pack --> Shards --> Train
```
This diagram shows the full data curation pipeline, from heterogeneous raw sources (web crawl, books, code, academic papers) through deduplication, quality filtering, and domain mixing, to the packed training shards consumed by the GPU cluster. The domain mix stage is where practitioners make decisions that permanently shape model strengths: increasing the code share improves programming ability; increasing scientific paper share improves technical reasoning. The key insight is that these curation choices cannot be undone after training: they are baked into the base model's capabilities for all downstream use cases.
Real-World Applications: Where Pre-trained Models Power the World
Pre-trained language models are the invisible backbone of dozens of products in use today. The same foundational technique, next-token prediction on massive corpora, powers everything from consumer chat apps to scientific research tools.
| Product / System | Pre-trained Base | What it enables |
| --- | --- | --- |
| ChatGPT / GPT-4 | OpenAI internal base model | Conversational AI, coding, long-form writing |
| Claude 3 | Anthropic base model | Safety-focused long-context assistant |
| Gemini 1.5 | Google DeepMind base | Multimodal: text, images, audio, video |
| GitHub Copilot | Codex / GPT-4 family | In-editor code completion and generation |
| LLaMA 3 / Mistral | Open-weight base models | Community fine-tuning and research platform |
| AlphaFold 2 | Pre-trained on protein sequences | 3D protein structure prediction |
Beyond chat assistants, pre-trained models drive progress across many fields:
- Search engines: Google and Bing use LLMs to improve query understanding and surface direct answers within results.
- Legal and finance: Domain-specific models read and summarize contracts, regulatory filings, and earnings calls in seconds.
- Drug discovery: Models pre-trained on biochemical literature assist researchers in generating and filtering hypotheses.
- Education: Tutoring tools use pre-trained bases fine-tuned for pedagogy, adapting explanations to the student's level.
The pattern is consistent: pre-train broadly on diverse data, fine-tune narrowly for a specific task. The expensive, reusable asset is always the base model.
Practical Considerations for Practitioners
Here is what pre-training means for your day-to-day work as a practitioner:
You almost certainly will not pre-train from scratch. Training a frontier model requires thousands of H100 GPUs for weeks and costs $50M–$100M+. Even well-funded startups typically start from an open base model and fine-tune from there.
Open base models available today give you a strong starting point at zero training cost:
- LLaMA 3 8B / 70B (Meta): Strong general-purpose base; commercially licensed for most use cases.
- Mistral 7B / Mixtral 8x7B: Efficient architectures with excellent performance-per-compute ratio.
- Qwen 2.5 (Alibaba): Strong multilingual capabilities and coding performance.
Your practical levers after choosing a base model:
| Approach | Compute needed | When to use |
| --- | --- | --- |
| Prompt engineering | Inference only | Task is well-defined; no labeled training data available |
| LoRA fine-tuning | 1–2 consumer GPUs | Custom tone, domain vocabulary, specific task style |
| Full fine-tuning | 4–16 GPUs | Deep behavior change with a large labeled dataset |
| Pre-training from scratch | 1,000+ GPUs | Novel domain not covered by any existing model |
The LoRA shortcut: LoRA (Low-Rank Adaptation) freezes the base model weights and trains tiny adapter matrices inserted into each transformer attention layer. This reduces the number of trainable parameters by 100–10,000× while preserving most of the fine-tuning quality. It is the most practical approach for teams without a large GPU budget.
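A quick back-of-the-envelope check of that reduction factor (illustrative shapes loosely modeled on a 7B-class transformer; exact numbers vary by architecture and by which matrices are adapted):

```python
def full_finetune_params(d_model, n_layers, mats_per_layer=4):
    """Parameters in the attention projection matrices alone
    (q/k/v/o, each d_model x d_model)."""
    return n_layers * mats_per_layer * d_model * d_model

def lora_params(d_model, n_layers, rank, mats_per_layer=4):
    """LoRA trains two low-rank factors per adapted matrix:
    A (rank x d_model) and B (d_model x rank)."""
    return n_layers * mats_per_layer * 2 * d_model * rank

full = full_finetune_params(d_model=4096, n_layers=32)
lora = lora_params(d_model=4096, n_layers=32, rank=8)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
# full: 2,147,483,648  lora: 8,388,608  reduction: 256x
```

At rank 8 the reduction is 256×, comfortably inside the 100–10,000× range quoted above; lowering the rank or adapting fewer matrices pushes it higher.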
Hugging Face Transformers: The Trainer API and DataCollatorForLanguageModeling
Hugging Face Transformers is the standard open-source toolkit for working with pre-trained language models: it provides AutoModel classes for every major architecture, Trainer for training loops, and data utilities like DataCollatorForLanguageModeling that set up the causal LM training objective (next-token prediction) without boilerplate.
```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

model_name = "gpt2"  # swap in a larger base (e.g., a Mistral-7B checkpoint) for bigger runs
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load and tokenize a text corpus (e.g., domain-specific pre-training data)
raw_dataset = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])

# DataCollatorForLanguageModeling with mlm=False:
# - causal LM mode (next-token prediction, not masked LM)
# - copies input_ids into labels (padding masked out as -100); the one-position
#   shift for next-token prediction happens inside the model's forward pass
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./pretrained-output",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=5e-5,
    bf16=True,
    logging_steps=100,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("./pretrained-output/final")
```
DataCollatorForLanguageModeling with mlm=False is the key piece: it builds batches for causal next-token prediction by copying input_ids into labels (with padding positions masked out as -100), and the model's forward pass then shifts by one position internally so each position is scored against the true next token. This is the same training objective described in the cross-entropy section above, just abstracted away from manual label construction.
For a full deep-dive on Hugging Face Transformers' Trainer API and pre-training pipelines, a dedicated follow-up post is planned.
Lessons from Pre-Training at Scale
Years of large-scale pre-training experiments have surfaced insights that are not obvious from first principles, and that matter for anyone building on top of these models:
Scaling laws are predictable. Kaplan et al. (2020) showed that validation loss decreases as a smooth power law with compute, data size, and parameter count. You can reliably forecast how much a larger model will improve before committing to the expensive training run.
The Chinchilla lesson: you are probably undertraining. The 2022 Chinchilla paper (Hoffmann et al.) showed that most pre-2022 LLMs used too many parameters relative to their training token count. The compute-optimal recipe is roughly 20 tokens per parameter: a 7B model should train on ~140B tokens. Well-trained smaller models routinely outperform larger models that were trained on too little data.
Data quality beats data quantity. Filtering noisy web text, deduplicating aggressively, and curating high-quality sources (books, code, peer-reviewed papers) often produces better results than simply dumping more raw tokens into training. LLaMA 3's heavily filtered corpus approach demonstrated this at scale.
Emergent capabilities appear suddenly. Some abilities (multi-step arithmetic, in-context few-shot learning, chain-of-thought reasoning) are nearly absent at small scale and appear almost discontinuously at certain parameter or data thresholds. This emergence phenomenon remains an active research area and has direct implications for evaluating safety before deployment.
A base model is not safe to deploy directly. It will complete any prompt without guardrails, including requests for harmful content. Reinforcement Learning from Human Feedback (RLHF) and Supervised Fine-Tuning (SFT) are required to transform a raw base model into a safe, helpful product.
TLDR: Summary & Key Takeaways
- The loss is cross-entropy; minimizing it forces the model to learn grammar, facts, and reasoning.
- The result is a Base Model: capable but unaligned. Fine-tuning is required for product use.
- The data pipeline (dedup → filter → tokenize → pack) is as important as the model architecture.
- Most practitioners never pre-train from scratch; they fine-tune existing open models.
Written by
Abstract Algorithms
@abstractalgorithms