
A Guide to Pre-training Large Language Models


Abstract Algorithms · 5 min read

TLDR: Pre-training is the phase where an LLM learns language and world knowledge by reading trillions of tokens of text. It uses self-supervised learning to predict the next token in a sequence. This creates the "Base Model", which is later fine-tuned.


📖 The Library Metaphor: What Pre-training Actually Does

Imagine teaching a child to read.

  • Pre-training: You lock the child in a library for ten years. They read every book: grammar, history, math, code, recipes. They absorb the structure and content of human knowledge. But they have no social skills; they can't follow instructions or hold a polite conversation.
  • Fine-tuning (next step): You hire a tutor to teach manners ("don't say harmful things") and specific tasks ("summarize this document").

Pre-training creates the Base Model, a powerful but raw artifact. Fine-tuning shapes it into a product (ChatGPT, Claude, Gemini).


🔢 Next-Token Prediction: The Self-Supervised Training Signal

The entire pre-training game is one question repeated billions of times:

"Given the text so far, what is the most likely next token?"

Input:   "The capital of France is"
Target:  "Paris"

This is self-supervised because the labels are already in the data: the target at each position is simply the next token in the text. No human annotation is needed.
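Concretely, the input/target pairs fall out of the raw token stream by shifting it one position (token strings here stand in for integer token IDs):

```python
# Self-supervised labels: the target at each position is simply the
# next token in the text. No annotation step is required.
tokens = ["The", "capital", "of", "France", "is", "Paris"]

inputs = tokens[:-1]   # what the model sees
targets = tokens[1:]   # what it must predict

for context, target in zip(inputs, targets):
    print(f"...{context} -> {target}")
```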

The Loss Function: Cross-Entropy

$$L = -\sum_{t} \log P(x_t \mid x_{<t})$$

  • $x_t$: the correct next token at step $t$
  • $x_{<t}$: all preceding tokens
  • $P(x_t \mid x_{<t})$: the model's probability for the correct token

The model is penalized for assigning low probability to the correct next word. Minimizing $L$ over trillions of tokens forces the model to learn grammar, facts, and reasoning patterns.
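A tiny numeric example makes the penalty concrete (the vocabulary and probabilities below are made up):

```python
import math

# Cross-entropy at a single position is -log P(correct token).
# Toy distribution over a three-word vocabulary:
probs = {"Paris": 0.70, "Lyon": 0.20, "London": 0.10}

loss_confident = -math.log(probs["Paris"])   # correct token got high probability -> small loss
loss_surprised = -math.log(probs["London"])  # loss if "London" had been the correct token

print(loss_confident)  # ~0.357
print(loss_surprised)  # ~2.303
```

Assigning only 10% to the true token costs roughly six times more loss than assigning 70%, which is exactly the pressure that drives the model toward accurate predictions.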


⚙️ The Data Pipeline: From the Web to a Training Run

```mermaid
flowchart LR
    A[Common Crawl\nBooks3 / GitHub / arXiv] --> B[Deduplication]
    B --> C[Quality Filtering]
    C --> D[Tokenization]
    D --> E[Packed Sequences\n2K–128K tokens]
    E --> F[Training Shards\non Object Storage]
    F --> G[GPU Cluster]
```
| Stage | What happens | Why it matters |
| --- | --- | --- |
| Deduplication | Remove near-duplicate pages | Prevents memorization of repeated text |
| Quality filter | Remove boilerplate, low-quality HTML | Improves token efficiency |
| Tokenization | BPE / SentencePiece | Compresses text; handles rare words |
| Packing | Fill context windows to capacity | Maximizes GPU utilization |

Training data typically includes Common Crawl (web text), Books3, GitHub code, arXiv papers, and Wikipedia. The mix ratio shapes model strengths.
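As a rough illustration of the deduplication stage, an exact-match pass over normalized text might look like this (production pipelines use fuzzier schemes such as MinHash; the pages below are invented):

```python
import hashlib

# Exact dedup: hash normalized text and keep only the first copy.
# Near-duplicates that differ only in casing/whitespace collapse together.
pages = [
    "The capital of France is Paris.",
    "the capital of france is paris.",   # duplicate after normalization
    "Rust is a systems programming language.",
]

seen, kept = set(), []
for page in pages:
    key = hashlib.sha256(page.lower().strip().encode()).hexdigest()
    if key not in seen:
        seen.add(key)
        kept.append(page)

print(len(kept))  # 2: the normalized duplicate was dropped
```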


🧠 Inside the Training Loop: Loss, Gradients, and Checkpoints

The training loop looks like this in pseudocode:

```python
for step, batch in enumerate(training_data):
    tokens = tokenize(batch)
    logits = model(tokens[:-1])               # predict every position from its prefix
    loss = cross_entropy(logits, tokens[1:])  # compare to the true next token
    loss.backward()                           # compute gradients
    optimizer.step()                          # update weights
    optimizer.zero_grad()
    if step % checkpoint_interval == 0:
        save_checkpoint(model)
```

In practice, training runs on thousands of GPUs or TPUs for weeks to months, using advanced parallelism strategies (data parallelism, tensor parallelism, pipeline parallelism).
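Of those strategies, data parallelism is the simplest to sketch: each replica computes gradients on its own shard of the batch, and an all-reduce averages them so every replica applies the identical update (toy scalar weights stand in for real tensors and collectives here):

```python
# Data parallelism in miniature: per-replica gradients, then an
# all-reduce (here a plain average) before the shared weight update.
def local_gradient(weight, shard):
    # toy gradient of mean squared error between weight and shard values
    return sum(2 * (weight - x) for x in shard) / len(shard)

weight = 1.0
shards = [[1.5, 2.0], [0.5, 1.0], [3.0, 2.5]]   # one data shard per "GPU"

grads = [local_gradient(weight, s) for s in shards]
avg_grad = sum(grads) / len(grads)               # the all-reduce step

lr = 0.1
weight -= lr * avg_grad                          # identical update on every replica
print(weight)  # 1.15
```

Real systems (e.g. PyTorch DistributedDataParallel) fuse this averaging into the backward pass, but the math is the same.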


🧩 What a Base Model Can and Cannot Do

| Can do | Cannot do |
| --- | --- |
| Complete text in context | Follow instructions reliably |
| Summarize if prompted cleverly | Refuse harmful requests |
| Write code that is syntactically plausible | Admit when it doesn't know |
| Translate languages | Have a consistent helpful persona |

A base model will happily continue any text you give it, including harmful content. Fine-tuning with RLHF or SFT shapes it into a helpful, harmless assistant.


⚖️ Cost, Carbon, and the Scaling Trap

Training a frontier-scale LLM (GPT-4, LLaMA 3 70B) requires:

  • Compute: thousands of H100 GPUs running for months
  • Cost: $50M–$100M+ per frontier run
  • Energy: significant carbon footprint

The key trade-offs:

  • Data scale vs data quality: more tokens help, but noisy corpora have diminishing returns
  • Larger model vs smaller but better-trained: a well-filtered 7B model can outperform a poorly trained 70B
  • Pre-training breadth vs fine-tuning depth: broad pre-training creates a flexible base; fine-tuning sharpens it for specific tasks
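For a rough sense of scale, the widely used approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens) turns these trade-offs into arithmetic. The model size, token count, and per-GPU throughput below are illustrative assumptions, not measurements:

```python
# Back-of-envelope training compute via C ~ 6 * N * D.
N = 70e9    # assumed: 70B-parameter model
D = 15e12   # assumed: 15T training tokens

flops = 6 * N * D            # total training FLOPs
gpu_flops_per_s = 1e15       # assumed: ~1 PFLOP/s sustained per GPU (optimistic)

gpu_seconds = flops / gpu_flops_per_s
gpu_days = gpu_seconds / 86400

print(f"{flops:.2e} FLOPs, about {gpu_days:,.0f} GPU-days")
```

Dividing those GPU-days across a cluster of thousands of GPUs gives the weeks-to-months wall-clock times quoted above.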

Few organizations can afford to pre-train from scratch. Most practitioners work with open base models (LLaMA, Mistral, Qwen) and apply LoRA fine-tuning.
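To see why LoRA is so much cheaper than pre-training, here is a dependency-free sketch of the core idea: the frozen base weight W is augmented with a trainable low-rank update, so only a tiny fraction of parameters ever receives gradients (all shapes and values below are toy):

```python
import random

# LoRA sketch: effective weight W' = W + (alpha / r) * B @ A,
# where A is (r x d_in), B is (d_out x r), and only A, B are trained.
random.seed(0)

d_in, d_out, r, alpha = 4, 4, 2, 8

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero-init

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

delta = matmul(B, A)                 # low-rank update (zero at init)
scale = alpha / r
W_eff = [[W[i][j] + scale * delta[i][j] for j in range(d_in)]
         for i in range(d_out)]

# Because B starts at zero, the adapted model initially equals the base model.
print(W_eff == W)  # True
```

Training touches only A and B (2 · r · d values instead of d_out · d_in), which is why a single GPU can fine-tune a model that took a cluster to pre-train.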


📌 Key Takeaways

  • Pre-training = self-supervised learning on massive text corpora using next-token prediction.
  • The loss is cross-entropy; minimizing it forces the model to learn grammar, facts, and reasoning.
  • The result is a Base Model: capable but unaligned. Fine-tuning is required for product use.
  • The data pipeline (dedup → filter → tokenize → pack) is as important as the model architecture.
  • Most practitioners never pre-train from scratch; they fine-tune existing open models.

🧩 Test Your Understanding

  1. Why is next-token prediction called "self-supervised"?
  2. What does the cross-entropy loss penalize the model for?
  3. Why is deduplication important before training?
  4. What is the difference between a base model and a fine-tuned assistant?

Written by Abstract Algorithms (@abstractalgorithms)