What are Logits in Machine Learning and Why They Matter
TLDR: Logits are the raw, unnormalized scores produced by the final layer of a neural network, before any probability transformation. Softmax converts them to probabilities. Temperature scales them before Softmax to control output randomness.
The Confidence Meter Before Calibration
A doctor looks at an X-ray and assigns a raw "gut score": Cancer likelihood = 8.2, Normal = 1.4. These numbers are not percentages; they haven't been normalized yet. To get percentages, she runs them through a calibration formula.
Logits are those raw gut scores. The Softmax function is the calibration formula.
What Exactly Is a Logit?
The word logit comes from statistics; it's short for "log-odds unit." In modern deep learning, the term is used more loosely to mean the raw, unnormalized score produced by the final linear layer of a neural network before any activation function is applied.
Think of a neural network as a pipeline: raw inputs pass through multiple layers of transformations, and at the very end, a linear layer projects everything into a vector of numbers, one number per possible output class. Those numbers are the logits.
Key properties of logits:
- They can be any real number: positive, negative, or zero.
- A higher logit means the network is more confident in that class relative to others.
- They have no guaranteed scale: a logit of 8.0 from one model isn't necessarily more confident than a logit of 3.0 from another model.
- They must be normalized (via Softmax, Sigmoid, etc.) before being interpreted as probabilities.
Understanding logits is essential for working with loss functions, temperature sampling, and model calibration.
From Network Output to Prediction
A classification network predicting image labels outputs raw scores:
Raw output (logits):
Cat: 4.5
Dog: -1.2
Bird: 0.8
These numbers have no fixed scale. A "4.5" just means "more confident Cat than the others."
Softmax converts logits to probabilities:
$$P_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Applied to the example: compute e^4.5 = 90.02, e^-1.2 = 0.30, e^0.8 = 2.23, then divide each by the sum 92.55:
| Class | Logit | e^z | Probability |
| --- | --- | --- | --- |
| Cat | 4.5 | 90.02 | ~97% |
| Dog | -1.2 | 0.30 | ~0.3% |
| Bird | 0.8 | 2.23 | ~2.4% |
The model is 97% confident it is a cat.
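The arithmetic can be checked in a few lines of plain Python; the class names and logit values are taken from the worked example above:

```python
import math

# Logits from the worked example above
logits = {"Cat": 4.5, "Dog": -1.2, "Bird": 0.8}

# Softmax: exponentiate each logit, then divide by the sum of exponentials
exps = {cls: math.exp(z) for cls, z in logits.items()}
total = sum(exps.values())
probs = {cls: e / total for cls, e in exps.items()}

for cls, p in probs.items():
    print(f"{cls}: {p:.4f}")
# Cat: 0.9727, Dog: 0.0033, Bird: 0.0240
```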
Logits to Token Pipeline
flowchart LR
LG[Raw Logits] --> SM[Softmax]
SM --> PR[Probabilities]
PR --> TK[Top-K Filter]
TK --> SP[Sample Token]
SP --> OUT[Output Token]
This flowchart traces the journey of a raw logit vector from model output to a sampled output token. Starting with the raw scores produced by the final linear layer, Softmax converts them into a probability distribution, Top-K filtering narrows the candidates to the most likely tokens, and sampling draws the final output token stochastically. The key insight is that the raw logit values drive the entire downstream selection process: changing the logits (via temperature scaling) shifts the probability mass and therefore changes which token gets selected.
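A minimal sketch of the Top-K filter and sampling steps. In practice the filter is usually applied in logit space before Softmax, which is equivalent to filtering the probabilities and renormalizing; the 6-token vocabulary and logit values here are made up for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical logits over a tiny 6-token vocabulary
logits = torch.tensor([2.0, 1.5, 0.3, -0.5, -1.0, -2.0])

# Top-K filter: keep the k largest logits, mask the rest to -inf
k = 3
topk_vals, topk_idx = torch.topk(logits, k)
filtered = torch.full_like(logits, float("-inf"))
filtered[topk_idx] = topk_vals

# Softmax turns -inf into exactly 0 probability, then we sample
probs = F.softmax(filtered, dim=0)
token = torch.multinomial(probs, num_samples=1).item()
print(token)  # always one of the top-3 indices: 0, 1, or 2
```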
Why Not Output Probabilities Directly?
- Training stability: Working with logits and applying log-Softmax is numerically more stable than computing probabilities first and then taking their logarithm for the cross-entropy loss, the standard training objective that penalizes the model in proportion to how far its predicted probability distribution is from the true label.
- Flexibility: The same logits can be processed differently (Softmax for classification, sigmoid for multi-label, raw for regression).
- Temperature scaling: You can reshape the distribution from the logits before applying Softmax, which you couldn't do if the network directly output probabilities.
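The flexibility point can be illustrated directly: the same logit vector can be read as one multi-class distribution or as independent per-label probabilities (values reuse the cat/dog/bird example):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.5, -1.2, 0.8])

# Multi-class reading: Softmax couples the scores into one distribution
multi_class = F.softmax(logits, dim=0)   # sums to 1

# Multi-label reading: Sigmoid treats each logit independently
multi_label = torch.sigmoid(logits)      # each in (0, 1), no sum constraint

print(multi_class.sum().item())
print(multi_label.tolist())
```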
The Logit Pipeline: From Input to Prediction
Here is how logits flow through the full inference pipeline in a classification or language model:
flowchart TD
Input["Input Data (text, image, etc.)"] --> Encoder["Encoder / Hidden Layers (feature extraction)"]
Encoder --> Linear["Final Linear Layer (produces raw logits z_i)"]
Linear --> Temp["Temperature T (optional scaling)"]
Temp --> Softmax["Softmax probability distribution P_i"]
Softmax --> Decision["Argmax or Sampling predicted class / next token"]
Step-by-step:
- Input: raw data (tokens, pixels) enters the network.
- Hidden layers: extract abstract features through multiple transformations.
- Linear layer: maps features to a logit per class (no activation, just weights × input + bias).
- Temperature scaling: optionally divides each logit by T to sharpen or flatten the distribution.
- Softmax: converts logits to a valid probability distribution summing to 1.
- Decision: argmax picks the highest-probability class; sampling draws stochastically from the distribution.
This pipeline is the same whether you're classifying images, generating text, or running a sentiment model.
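The pipeline can be sketched end to end in a few lines of PyTorch; the layer sizes, input, and temperature here are arbitrary stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

encoder = nn.Linear(10, 32)   # stand-in for the hidden layers (feature extraction)
head = nn.Linear(32, 3)       # final linear layer: one logit per class

x = torch.randn(1, 10)                  # input data
features = torch.relu(encoder(x))       # feature extraction
logits = head(features)                 # raw logits z_i (no activation)

T = 1.5                                 # optional temperature scaling
probs = F.softmax(logits / T, dim=-1)   # valid distribution summing to 1

prediction = probs.argmax(dim=-1)       # greedy decision (argmax)...
sample = torch.multinomial(probs, 1)    # ...or stochastic sampling
print(prediction.item(), sample.item())
```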
Deep Dive: How Temperature Reshapes the Logit Distribution
In language models, the vocabulary logits are scaled by a temperature $T$ before Softmax:
$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Effect of temperature on the cat/dog/bird example (logits: 4.5, -1.2, 0.8):
| Temperature | Cat P | Dog P | Bird P | Interpretation |
| --- | --- | --- | --- | --- |
| T = 1.0 | ~97% | ~0.3% | ~2.4% | Standard |
| T = 0.5 | ~99.9% | ~0% | ~0.1% | Even more peaked on Cat |
| T = 2.0 | ~82% | ~5% | ~13% | Flatter: more uncertainty expressed |
| T → 0 | 100% | 0% | 0% | Greedy (argmax) |
Low T: Sharp distribution. High confidence. Repetitive in language models.
High T: Flat distribution. Diverse/random. Creative in language models.
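The temperature sweep can be reproduced directly (logits from the running example):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.5, -1.2, 0.8])  # Cat, Dog, Bird

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=0)
    print(f"T={T}: {[round(p, 4) for p in probs.tolist()]}")
# Lower T concentrates the mass on Cat; higher T spreads it toward Bird and Dog
```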
flowchart LR
Hidden["Hidden Layer Output"] --> Logits["Logit Layer (raw z_i scores)"]
Logits --> Temp["Temperature T"]
Temp --> Softmax["Softmax probabilities P_i"]
Softmax --> Sample["Sample or Argmax predicted class / next token"]
Real-World Applications of Logits
Logits appear in almost every modern ML system. Here are concrete examples of where they matter:
1. Large Language Models (ChatGPT, Gemini, Claude): Every time an LLM generates the next word, it produces a logit for every token in its vocabulary (often 50,000+ tokens). Temperature and top-k/top-p sampling are applied to these logits before selecting the next token.
2. Image Classification (ResNet, ViT): The final layer of an image classifier outputs one logit per class. For ImageNet (1,000 classes), that's a 1,000-dimensional logit vector, converted via Softmax to pick the most likely object.
3. Sentiment Analysis: A binary sentiment classifier outputs two logits: one for "positive" and one for "negative." Softmax (or Sigmoid on a single logit) converts these into a probability like "82% positive."
4. Model Calibration: Well-calibrated models have logits that map to accurate probabilities. Temperature scaling is a common post-training calibration technique: a single temperature parameter T is tuned on a validation set to make predicted probabilities match actual frequencies.
5. Retrieval and Ranking: In contrastive learning (CLIP, sentence embeddings), similarity logits measure how well query-document pairs match, then Softmax ranks results.
Next Token Prediction Loop
sequenceDiagram
participant C as Context Tokens
participant M as LLM
participant L as Logits
participant S as Sampler
C->>M: input sequence
M->>L: compute logits
L->>S: apply temperature
S-->>C: append new token
C->>M: next iteration
This sequence diagram illustrates the autoregressive token generation loop of a large language model. The LLM receives the accumulated context tokens, computes logits for the next token position, passes them to the sampler (which applies temperature and any top-k or top-p filters), and appends the selected token back to the context before the next iteration. The loop continues until an end-of-sequence token is generated or a maximum length is reached, making the logit-to-token pipeline repeat for every single token in the model's output.
Hands-On: Logits in PyTorch
Let's make logits concrete with runnable code.
import torch
import torch.nn.functional as F
# Simulated logits from a 3-class classifier
logits = torch.tensor([4.5, -1.2, 0.8])
# Standard Softmax (T=1)
probs = F.softmax(logits, dim=0)
print("Probabilities:", probs)
# tensor([0.9727, 0.0033, 0.0240])
# Temperature scaling: T=2.0 (flatter distribution)
T = 2.0
probs_hot = F.softmax(logits / T, dim=0)
print("T=2.0:", probs_hot)
# tensor([0.8230, 0.0476, 0.1294])
# CrossEntropyLoss expects raw logits, NOT softmax output
target = torch.tensor([0]) # true class is index 0 (Cat)
loss = torch.nn.CrossEntropyLoss()(logits.unsqueeze(0), target)
print("Loss:", loss.item())  # 0.0277
Key takeaways:
- `F.softmax(logits)` converts raw scores to probabilities.
- Dividing by temperature before Softmax flattens or sharpens the distribution.
- `CrossEntropyLoss` applies `LogSoftmax` internally; never pass already-softmaxed values or you'll get incorrect gradients.
Trade-offs & Failure Modes: Logits in Different Contexts
| Context | Output Layer | Applied After | Purpose |
| --- | --- | --- | --- |
| Multi-class classification | Logit vector (one per class) | Softmax | Pick one class |
| Multi-label classification | Logit vector | Sigmoid (per logit) | Multiple binary predictions |
| Language model | Logit vector (one per vocab token) | Softmax + Temperature | Sample next token |
| Binary classification | Single logit | Sigmoid | P(positive class) |
| Regression | Raw value (no normalization) | None | Continuous output |
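The binary row hides a useful identity: Sigmoid on a single logit z is exactly Softmax over the pair [z, 0], which is why binary classifiers can get away with one output instead of two. A quick check (z = 1.5 is an arbitrary value):

```python
import torch
import torch.nn.functional as F

z = 1.5  # a single binary-classification logit (arbitrary value)

p_sigmoid = torch.sigmoid(torch.tensor(z))
p_softmax = F.softmax(torch.tensor([z, 0.0]), dim=0)[0]

print(p_sigmoid.item(), p_softmax.item())  # identical up to float precision
```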
Decision Guide: When to Use Logits vs. Probabilities
| You want to... | Use |
| --- | --- |
| Compute loss during training | Raw logits → CrossEntropyLoss (never Softmax first) |
| Compare confidence across predictions | Softmax probabilities, not raw logits |
| Control output diversity at inference | Divide logits by temperature T before Softmax |
| Rank candidates in retrieval or search | Similarity logits (higher = more relevant) |
| Calibrate model confidence post-training | Tune temperature T on a validation set to align probabilities with observed frequencies |
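A minimal sketch of the last row, temperature scaling as post-hoc calibration: tune a single scalar T to minimize NLL on held-out logits. The validation data below is random stand-in data, so the tuned T simply flattens the distribution; on real model outputs, the same procedure aligns confidence with accuracy:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical validation logits and labels (random stand-ins for real data)
val_logits = torch.randn(200, 5) * 3.0
val_labels = torch.randint(0, 5, (200,))

# Optimize log T so the temperature stays positive
log_T = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([log_T], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
    loss.backward()
    opt.step()

T = log_T.exp().item()
print(f"Tuned temperature: {T:.2f}")
```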
PyTorch and Hugging Face Transformers: Logits in Real Model Pipelines
PyTorch is the foundational tensor computation library that underlies virtually every modern neural network; its torch.nn.functional module provides the canonical softmax, log_softmax, and cross_entropy functions that operate directly on logit tensors. Hugging Face Transformers builds on top of PyTorch and exposes raw logits from any supported model through its outputs.logits tensor, making it straightforward to inspect, temperature-scale, or re-sample from the full vocabulary distribution.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
# Encode a prompt and run a forward pass
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
outputs = model(input_ids)
# outputs.logits shape: [batch=1, seq_len, vocab_size=50257]
next_token_logits = outputs.logits[0, -1, :] # logits for the NEXT token
# Standard softmax โ probability distribution
probs = F.softmax(next_token_logits, dim=-1)
# Temperature scaling: T=0.7 sharpens the distribution
T = 0.7
probs_scaled = F.softmax(next_token_logits / T, dim=-1)
# Top predicted token
top_token_id = torch.argmax(probs).item()
print(f"Most likely next token: '{tokenizer.decode([top_token_id])}'")  # → ' Paris'
print(f"Probability: {probs[top_token_id]:.4f}")
This pattern (get outputs.logits, slice the last position, apply temperature, then sample) is the exact inference loop used inside every HuggingFace model.generate() call, just made explicit here for learning purposes.
For a full deep-dive on PyTorch and Hugging Face Transformers, dedicated follow-up posts are planned.
Key Lessons and Common Pitfalls
Lesson 1: Don't double-apply Softmax.
The most common beginner mistake is passing Softmax-normalized probabilities into CrossEntropyLoss. The loss function already applies LogSoftmax internally. Applying Softmax twice produces wrong gradients and silent training failures.
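The mistake is easy to demonstrate with the running example; both calls return a number, but only the first is the true cross-entropy:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[4.5, -1.2, 0.8]])
target = torch.tensor([0])  # true class: Cat
loss_fn = torch.nn.CrossEntropyLoss()

correct = loss_fn(logits, target)                   # raw logits: right
wrong = loss_fn(F.softmax(logits, dim=1), target)   # Softmax applied twice: wrong

print(correct.item())  # ~0.0277, i.e. -log(0.9727)
print(wrong.item())    # much larger, and its gradients are wrong too
```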
Lesson 2: Logits are not comparable across models. A logit of 5.0 from Model A and a logit of 5.0 from Model B do not mean the same thing. The scale of logits depends on weight initialization and training dynamics. Always compare probabilities, not raw logits, when evaluating different models.
Lesson 3: Temperature is a powerful control knob. During inference, temperature is one of the cheapest hyperparameters to tune. No retraining required: just divide the logits before Softmax. Start with T=1.0, lower it for precision tasks (code generation), and raise it for creative tasks (story writing).
Lesson 4: Large logit gaps cause probability collapse. If one logit is much larger than the rest (e.g., [12.0, 0.1, 0.2]), Softmax will assign ~100% to the top class and ~0% to others. This can cause vanishing gradients in early training. Weight initialization strategies (Xavier, He) are designed to keep logits in a reasonable range at the start of training.
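The collapse is easy to see with the gap example from the lesson:

```python
import torch
import torch.nn.functional as F

gap_logits = torch.tensor([12.0, 0.1, 0.2])
probs = F.softmax(gap_logits, dim=0)
print(probs)  # the top class takes essentially all the probability mass
```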
Lesson 5: Log-Softmax is more numerically stable than Softmax followed by log.
Always use F.log_softmax + NLLLoss (or equivalently CrossEntropyLoss directly) rather than F.softmax followed by torch.log, to avoid floating-point underflow when probabilities are very small.
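The underflow is reproducible with an extreme (but illustrative) logit gap: in float32, exp(-200) is exactly zero, so the naive route produces -inf while log_softmax stays finite:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([200.0, 0.0])

p = F.softmax(logits, dim=0)
unstable = torch.log(p)                  # p[1] underflowed to 0 -> log gives -inf
stable = F.log_softmax(logits, dim=0)    # computed in log space: stays finite

print(unstable[1].item(), stable[1].item())  # -inf vs -200.0
```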
TLDR: Summary & Key Takeaways
- Logits = raw, unnormalized scores from the final linear layer of a neural network.
- Softmax converts logits to a probability distribution summing to 1.
- Temperature divides logits before Softmax: low T → peaked (confident), high T → flat (diverse).
- `CrossEntropyLoss` in PyTorch/TensorFlow takes logits directly; don't apply Softmax before passing to the loss function, as it's included internally for numerical stability.
Related Posts
- How GPT/LLM Works
- Text Decoding Strategies: Greedy, Beam Search, and Sampling
- Embeddings Explained
- LLM Hyperparameters Guide: Temperature, Top-P, and Top-K Explained
- Tokenization Explained: How LLMs Understand Text
