
What are Logits in Machine Learning and Why They Matter


Abstract Algorithms · 11 min read

AI-assisted content.

TLDR: Logits are the raw, unnormalized scores produced by the final layer of a neural network, before any probability transformation. Softmax converts them to probabilities. Temperature scales them before Softmax to control output randomness.


📖 The Confidence Meter Before Calibration

A doctor looks at an X-ray and assigns a raw "gut score": Cancer likelihood = 8.2, Normal = 1.4. These numbers are not percentages; they haven't been normalized yet. To get percentages, she runs them through a calibration formula.

Logits are those raw gut scores. The Softmax function is the calibration formula.


🔍 What Exactly Is a Logit?

The word logit comes from statistics, where it is short for "log-odds unit." In modern deep learning, the term is used more loosely to mean the raw, unnormalized score produced by the final linear layer of a neural network before any activation function is applied.

Think of a neural network as a pipeline: raw inputs pass through multiple layers of transformations, and at the very end, a linear layer projects everything into a vector of numbers, one number per possible output class. Those numbers are the logits.

Key properties of logits:

  • They can be any real number: positive, negative, or zero.
  • A higher logit means the network is more confident in that class relative to others.
  • They have no guaranteed scale: a logit of 8.0 from one model isn't necessarily more confident than a logit of 3.0 from another model.
  • They must be normalized (via Softmax, Sigmoid, etc.) before being interpreted as probabilities.

Understanding logits is essential for working with loss functions, temperature sampling, and model calibration.


🔢 From Network Output to Prediction

A classification network predicting image labels outputs raw scores:

Raw output (logits):
  Cat:  4.5
  Dog: -1.2
  Bird: 0.8

These numbers have no fixed scale. A "4.5" just means "more confident Cat than the others."

Softmax converts logits to probabilities:

$$P_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Applied to the example, computing e^4.5 = 90.02, e^−1.2 = 0.30, e^0.8 = 2.23, then dividing each by the sum 92.55:

| Class | Logit | e^z   | Probability |
|-------|-------|-------|-------------|
| Cat   | 4.5   | 90.02 | ~97%        |
| Dog   | -1.2  | 0.30  | ~0.3%       |
| Bird  | 0.8   | 2.23  | ~2.4%       |

The model is 97% confident it is a cat.
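The arithmetic above can be verified with a few lines of plain Python (no ML framework needed):

```python
import math

# The cat/dog/bird logits from the example above
logits = {"Cat": 4.5, "Dog": -1.2, "Bird": 0.8}

# Softmax: exponentiate each logit, then normalize by the sum
exps = {c: math.exp(z) for c, z in logits.items()}
total = sum(exps.values())                      # ~92.55
probs = {c: e / total for c, e in exps.items()}

for c, p in probs.items():
    print(f"{c}: e^z = {exps[c]:.2f}, P = {p:.1%}")
```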

📊 Logits to Token Pipeline

flowchart LR
    LG[Raw Logits] --> SM[Softmax]
    SM --> PR[Probabilities]
    PR --> TK[Top-K Filter]
    TK --> SP[Sample Token]
    SP --> OUT[Output Token]

This flowchart traces the journey of a raw logit vector from model output to a sampled output token. Starting with the raw scores produced by the final linear layer, Softmax converts them into a probability distribution, Top-K filtering narrows the candidates to the most likely tokens, and sampling draws the final output token stochastically. The key insight is that the raw logit values drive the entire downstream selection process: changing the logits (via temperature scaling) shifts the probability mass and therefore changes which token gets selected.


โš™๏ธ Why Not Output Probabilities Directly?

  1. Training stability: Working with logits and applying log-Softmax is numerically more stable than computing probabilities first and then taking their logarithm for the cross-entropy loss (the standard training objective, which penalizes the model in proportion to how wrong its predicted probability distribution is compared to the true label).
  2. Flexibility: The same logits can be processed differently (Softmax for classification, Sigmoid for multi-label, raw for regression).
  3. Temperature scaling: You can reshape the distribution from the logits before applying Softmax, which you couldn't do if the network directly output probabilities.
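Point 2 (flexibility) is easy to see in a minimal, framework-free sketch: the same logit vector can feed either a Softmax head or a per-logit Sigmoid head.

```python
import math

logits = [4.5, -1.2, 0.8]

# Multi-class head: Softmax couples all scores into one distribution
exps = [math.exp(z) for z in logits]
softmax = [e / sum(exps) for e in exps]

# Multi-label head: Sigmoid squashes each logit independently
sigmoid = [1 / (1 + math.exp(-z)) for z in logits]

print([round(p, 3) for p in softmax])  # sums to 1.0
print([round(p, 3) for p in sigmoid])  # each in (0, 1); no sum constraint
```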

📊 The Logit Pipeline: From Input to Prediction

Here is how logits flow through the full inference pipeline in a classification or language model:

flowchart TD
    Input["Input Data (text, image, etc.)"] --> Encoder["Encoder / Hidden Layers (feature extraction)"]
    Encoder --> Linear["Final Linear Layer (produces raw logits z_i)"]
    Linear --> Temp["÷ Temperature T (optional scaling)"]
    Temp --> Softmax["Softmax → probability distribution P_i"]
    Softmax --> Decision["Argmax or Sampling → predicted class / next token"]

Step-by-step:

  1. Input: raw data (tokens, pixels) enters the network.
  2. Hidden layers: extract abstract features through multiple transformations.
  3. Linear layer: maps features to a logit per class (no activation, just weights × input + bias).
  4. Temperature scaling โ€” optionally divides each logit by T to sharpen or flatten the distribution.
  5. Softmax โ€” converts logits to a valid probability distribution summing to 1.
  6. Decision โ€” argmax picks the highest-probability class; sampling draws stochastically from the distribution.

This pipeline is the same whether you're classifying images, generating text, or running a sentiment model.
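The six steps above can be sketched end to end in plain Python; the tiny weight matrix and feature vector below are invented purely for illustration.

```python
import math

# Hypothetical final linear layer: 2 features -> 3 class logits
W = [[2.0, -1.0], [-1.5, 0.5], [0.3, 0.8]]  # one row of weights per class
b = [0.1, -0.2, 0.0]
features = [1.2, -0.4]                       # stand-in for hidden-layer output

# Step 3: linear layer -> logits (weights x input + bias, no activation)
logits = [sum(w * x for w, x in zip(row, features)) + bias
          for row, bias in zip(W, b)]

# Step 4: optional temperature scaling
T = 1.0
scaled = [z / T for z in logits]

# Step 5: Softmax -> probability distribution (max subtracted for stability)
m = max(scaled)
exps = [math.exp(z - m) for z in scaled]
probs = [e / sum(exps) for e in exps]

# Step 6: argmax decision
prediction = probs.index(max(probs))
print(logits, probs, prediction)  # class 0 wins here
```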


🧠 Deep Dive: Temperature Reshapes the Logit Distribution

In language models, the vocabulary logits are scaled by a temperature $T$ before Softmax:

$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Effect of temperature on the cat/dog/bird example (logits: 4.5, -1.2, 0.8):

| Temperature | Cat P  | Dog P | Bird P | Interpretation                      |
|-------------|--------|-------|--------|-------------------------------------|
| T = 1.0     | 97%    | 0.3%  | 2.4%   | Standard                            |
| T = 0.5     | ~99.9% | ~0%   | ~0.1%  | Even more peaked on Cat             |
| T = 2.0     | ~82%   | ~5%   | ~13%   | Flatter; more uncertainty expressed |
| T → 0       | 100%   | 0%    | 0%     | Greedy (argmax)                     |

Low T: Sharp distribution. High confidence. Repetitive in language models.
High T: Flat distribution. Diverse/random. Creative in language models.
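The table's numbers can be reproduced with a short temperature sweep in plain Python:

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled Softmax: divide logits by T before normalizing."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [4.5, -1.2, 0.8]  # Cat, Dog, Bird
for T in (0.5, 1.0, 2.0):
    print(f"T={T}:", [round(p, 3) for p in softmax_t(logits, T)])
# T=0.5: [0.999, 0.0, 0.001]
# T=1.0: [0.973, 0.003, 0.024]
# T=2.0: [0.823, 0.048, 0.129]
```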

flowchart LR
    Hidden[Hidden Layer Output] --> Logits["Logit Layer (raw z_i scores)"]
    Logits --> Temp["÷ Temperature T"]
    Temp --> Softmax["Softmax → probabilities P_i"]
    Softmax --> Sample["Sample or Argmax → predicted class / next token"]

๐ŸŒ Real-World Applications of Logits

Logits appear in almost every modern ML system. Here are concrete examples of where they matter:

1. Large Language Models (ChatGPT, Gemini, Claude). Every time an LLM generates the next word, it produces a logit for every token in its vocabulary (often 50,000+ tokens). Temperature and top-k/top-p sampling are applied to these logits before selecting the next token.

2. Image Classification (ResNet, ViT). The final layer of an image classifier outputs one logit per class. For ImageNet (1,000 classes), that's a 1,000-dimensional logit vector, converted via Softmax to pick the most likely object.

3. Sentiment Analysis. A binary sentiment classifier outputs two logits: one for "positive" and one for "negative." Softmax (or Sigmoid on a single logit) converts these into a probability like "82% positive."

4. Model Calibration. Well-calibrated models have logits that map to accurate probabilities. Temperature scaling is a common post-training calibration technique: a single temperature parameter T is tuned on a validation set to make predicted probabilities match actual frequencies.

5. Retrieval and Ranking. In contrastive learning (CLIP, sentence embeddings), similarity logits measure how well query-document pairs match, then Softmax ranks results.
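Example 4 (calibration) can be sketched as a toy grid search over T. The "validation" logits and labels below are fabricated for illustration, not from a real model:

```python
import math

def avg_nll(logit_rows, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        exps = [math.exp(z / T) for z in logits]
        total -= math.log(exps[y] / sum(exps))
    return total / len(labels)

# Fabricated overconfident validation logits; the model is wrong on half of them
val_logits = [[6.0, 0.0, 0.0], [5.5, 0.2, 0.1], [4.8, 0.0, 0.3], [5.0, 0.1, 0.0]]
val_labels = [0, 0, 1, 2]

# Pick the T (from a coarse grid) that minimizes validation NLL
grid = [t / 10 for t in range(5, 51)]
best_T = min(grid, key=lambda T: avg_nll(val_logits, val_labels, T))
print("best T:", best_T)  # > 1.0: flattening corrects the overconfidence
```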

📊 Next Token Prediction Loop

sequenceDiagram
    participant C as Context Tokens
    participant M as LLM
    participant L as Logits
    participant S as Sampler
    C->>M: input sequence
    M->>L: compute logits
    L->>S: apply temperature
    S-->>C: append new token
    C->>M: next iteration

This sequence diagram illustrates the autoregressive token generation loop of a large language model. The LLM receives the accumulated context tokens, computes logits for the next token position, passes them to the sampler (which applies temperature and any top-k or top-p filters), and appends the selected token back to the context before the next iteration. The loop continues until an end-of-sequence token is generated or a maximum length is reached, making the logit-to-token pipeline repeat for every single token in the model's output.


🧪 Hands-On: Logits in PyTorch

Let's make logits concrete with runnable code.

import torch
import torch.nn.functional as F

# Simulated logits from a 3-class classifier
logits = torch.tensor([4.5, -1.2, 0.8])

# Standard Softmax (T=1)
probs = F.softmax(logits, dim=0)
print("Probabilities:", probs)
# tensor([0.9727, 0.0033, 0.0240])

# Temperature scaling: T=2.0 (flatter distribution)
T = 2.0
probs_hot = F.softmax(logits / T, dim=0)
print("T=2.0:", probs_hot)
# tensor([0.8230, 0.0476, 0.1294])

# CrossEntropyLoss expects raw logits, NOT Softmax output
target = torch.tensor([0])  # true class is index 0 (Cat)
loss = torch.nn.CrossEntropyLoss()(logits.unsqueeze(0), target)
print("Loss:", loss.item())  # ~0.0277

Key takeaways:

  • F.softmax(logits) converts raw scores to probabilities.
  • Dividing by temperature before Softmax flattens or sharpens the distribution.
  • CrossEntropyLoss applies LogSoftmax internally; never pass already-softmaxed values or you'll get incorrect gradients.
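The last bullet is worth seeing numerically. A framework-free sketch of what CrossEntropyLoss computes shows how badly a double Softmax distorts the loss:

```python
import math

def softmax(v):
    m = max(v)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in v]
    return [e / sum(exps) for e in exps]

def cross_entropy(scores, target):
    """What CrossEntropyLoss does: LogSoftmax over scores, then -log p[target]."""
    return -math.log(softmax(scores)[target])

logits = [4.5, -1.2, 0.8]
target = 0  # true class: Cat

correct = cross_entropy(logits, target)          # raw logits: ~0.0277
buggy = cross_entropy(softmax(logits), target)   # Softmax applied twice: ~0.57
print(correct, buggy)
```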

โš–๏ธ Trade-offs & Failure Modes: Logits in Different Contexts

| Context                    | Output Layer                      | Applied After         | Purpose                      |
|----------------------------|-----------------------------------|-----------------------|------------------------------|
| Multi-class classification | Logit vector (one per class)      | Softmax               | Pick one class               |
| Multi-label classification | Logit vector                      | Sigmoid (per logit)   | Multiple binary predictions  |
| Language model             | Logit vector (one per vocab token)| Softmax + Temperature | Sample next token            |
| Binary classification      | Single logit                      | Sigmoid               | P(positive class)            |
| Regression                 | Raw value (no normalization)      | None                  | Continuous output            |

🧭 Decision Guide: When to Use Logits vs. Probabilities

| You want to...                            | Use                                                                           |
|-------------------------------------------|-------------------------------------------------------------------------------|
| Compute loss during training              | Raw logits → CrossEntropyLoss (never Softmax first)                           |
| Compare confidence across predictions     | Softmax probabilities, not raw logits                                         |
| Control output diversity at inference     | Divide logits by temperature T before Softmax                                 |
| Rank candidates in retrieval or search    | Similarity logits (higher = more relevant)                                    |
| Calibrate model confidence post-training  | Tune temperature T on a validation set to align probabilities with observed frequencies |

๐Ÿ› ๏ธ PyTorch and Hugging Face Transformers: Logits in Real Model Pipelines

PyTorch is the tensor computation library underlying most modern neural networks; its torch.nn.functional module provides the canonical softmax, log_softmax, and cross_entropy functions that operate directly on logit tensors. Hugging Face Transformers builds on top of PyTorch and exposes raw logits from any supported model through its outputs.logits tensor, making it straightforward to inspect, temperature-scale, or re-sample from the full vocabulary distribution.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Encode a prompt and run a forward pass
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model(input_ids)

# outputs.logits shape: [batch=1, seq_len, vocab_size=50257]
next_token_logits = outputs.logits[0, -1, :]  # logits for the NEXT token

# Standard softmax โ†’ probability distribution
probs = F.softmax(next_token_logits, dim=-1)

# Temperature scaling: T=0.7 sharpens the distribution
T = 0.7
probs_scaled = F.softmax(next_token_logits / T, dim=-1)

# Top predicted token
top_token_id = torch.argmax(probs).item()
print(f"Most likely next token: '{tokenizer.decode([top_token_id])}'")  # → ' Paris'
print(f"Probability: {probs[top_token_id]:.4f}")

This pattern (get outputs.logits, slice the last position, apply temperature, then sample) is essentially the inference loop inside every Hugging Face model.generate() call, just made explicit here for learning purposes.
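For completeness, here is a framework-free sketch of the sampling step that follows the logit computation; top-k size, temperature, and the RNG seed here are illustrative choices, not library defaults.

```python
import math
import random

def sample_top_k(logits, k, T, rng):
    """Keep the k largest logits, apply temperature + Softmax, draw one index."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i] / T) for i in top]
    weights = [e / sum(exps) for e in exps]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)                     # fixed seed for reproducibility
logits = [4.5, -1.2, 0.8]                  # Cat, Dog, Bird
picks = [sample_top_k(logits, k=2, T=1.0, rng=rng) for _ in range(1000)]
print(picks.count(0), picks.count(1), picks.count(2))
# Dog (index 1) never survives top-2 filtering; Cat dominates the draws
```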

For a full deep-dive on PyTorch and Hugging Face Transformers, dedicated follow-up posts are planned.


📚 Key Lessons and Common Pitfalls

Lesson 1: Don't double-apply Softmax. The most common beginner mistake is passing Softmax-normalized probabilities into CrossEntropyLoss. The loss function already applies LogSoftmax internally. Applying Softmax twice produces wrong gradients and silent training failures.

Lesson 2: Logits are not comparable across models. A logit of 5.0 from Model A and a logit of 5.0 from Model B do not mean the same thing. The scale of logits depends on weight initialization and training dynamics. Always compare probabilities, not raw logits, when evaluating different models.

Lesson 3: Temperature is a powerful control knob. During inference, temperature is one of the cheapest hyperparameters to tune. No retraining required: just divide the logits before Softmax. Start with T=1.0, lower it for precision tasks (code generation), and raise it for creative tasks (story writing).

Lesson 4: Large logit gaps cause probability collapse. If one logit is much larger than the rest (e.g., [12.0, 0.1, 0.2]), Softmax will assign ~100% to the top class and ~0% to others. This can cause vanishing gradients in early training. Weight initialization strategies (Xavier, He) are designed to keep logits in a reasonable range at the start of training.
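A quick plain-Python sketch of the collapse in Lesson 4:

```python
import math

logits = [12.0, 0.1, 0.2]   # one logit dominates
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
print([f"{p:.6f}" for p in probs])  # ['0.999986', '0.000007', '0.000008']
```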

Lesson 5: Log-Softmax is more numerically stable than Softmax followed by log. Always use F.log_softmax + NLLLoss (or equivalently CrossEntropyLoss directly) rather than F.softmax followed by torch.log, to avoid floating-point underflow when probabilities are very small.
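The instability in Lesson 5 is easy to trigger; the stable fix is the log-sum-exp trick, sketched here without a framework:

```python
import math

logits = [1000.0, 990.0]   # large logits: exp() overflows immediately

# Naive route: Softmax first, then log
try:
    exps = [math.exp(z) for z in logits]   # math.exp(1000.0) raises OverflowError
except OverflowError:
    print("naive softmax overflowed")

# Stable route: log-Softmax via log-sum-exp (subtract the max first)
m = max(logits)
lse = m + math.log(sum(math.exp(z - m) for z in logits))
log_probs = [z - lse for z in logits]
print(log_probs)  # finite: ~[-0.0000454, -10.0000454]
```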


📌 TLDR: Summary & Key Takeaways

  • Logits = raw, unnormalized scores from the final linear layer of a neural network.
  • Softmax converts logits to a probability distribution summing to 1.
  • Temperature divides logits before Softmax: low T → peaked (confident), high T → flat (diverse).
  • CrossEntropyLoss in PyTorch/TensorFlow takes logits directly; don't apply Softmax before passing to the loss function, since it's included internally for numerical stability.


Written by Abstract Algorithms (@abstractalgorithms)