
What are Logits in Machine Learning and Why They Matter


Abstract Algorithms · 5 min read

TLDR: Logits are the raw, unnormalized scores produced by the final layer of a neural network, before any probability transformation. Softmax converts them to probabilities. Temperature scales them before Softmax to control output randomness.


📖 The Confidence Meter Before Calibration

A doctor looks at an X-ray and assigns a raw "gut score": Cancer likelihood = 8.2, Normal = 1.4. These numbers are not percentages; they haven't been normalized yet. To get percentages, she runs them through a calibration formula.

Logits are those raw gut scores. The Softmax function is the calibration formula.


🔢 From Network Output to Prediction

A classification network predicting image labels outputs raw scores:

Raw output (logits):
  Cat:  4.5
  Dog: -1.2
  Bird: 0.8

These numbers have no fixed scale. A "4.5" just means "more confident Cat than the others."

Softmax converts logits to probabilities:

$$P_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Applied to the example:

| Class | Logit | e^z   | Probability |
|-------|-------|-------|-------------|
| Cat   | 4.5   | 90.02 | 97.3%       |
| Dog   | -1.2  | 0.30  | 0.3%        |
| Bird  | 0.8   | 2.23  | 2.4%        |

(e^4.5 = 90.02, e^-1.2 = 0.30, e^0.8 = 2.23; the denominator is 90.02 + 0.30 + 2.23 = 92.55, so Cat gets 90.02 / 92.55 ≈ 97.3%.)

The model is about 97% confident the image is a cat.
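The worked example above can be reproduced in a few lines. This is a minimal sketch using NumPy; the `softmax` helper and the class ordering (Cat, Dog, Bird) are just the conventions from this post.

```python
import numpy as np

def softmax(z):
    # Subtract the max logit before exponentiating; softmax is invariant to
    # adding a constant to every logit, and this avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([4.5, -1.2, 0.8])  # Cat, Dog, Bird
probs = softmax(logits)
print(probs.round(3))  # ~[0.973, 0.003, 0.024]
```

Note that the probabilities sum to exactly 1, and the ranking of the logits (Cat > Bird > Dog) is preserved.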


โš™๏ธ Why Not Output Probabilities Directly?

  1. Training stability: Working with logits and applying log-Softmax is numerically more stable than computing probabilities first, then taking log for cross-entropy loss.
  2. Flexibility: The same logits can be processed differently (Softmax for classification, sigmoid for multi-label, raw for regression).
  3. Temperature scaling: You can reshape the distribution from the logits before applying Softmax, which you couldn't do if the network directly output probabilities.
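Point 1 is easy to see numerically. The sketch below (NumPy, with deliberately extreme logits chosen for illustration) shows the naive route of exponentiating first and then taking the log blowing up, while the log-sum-exp form of log-Softmax stays exact:

```python
import numpy as np

logits = np.array([1000.0, 0.0])  # extreme logits, chosen to force overflow

# Naive route: Softmax first, then log. exp(1000) overflows to inf.
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    naive = np.log(np.exp(logits) / np.exp(logits).sum())
print(naive)  # [nan, -inf]

# Stable route: log_softmax(z) = z - max(z) - log(sum(exp(z - max(z))))
m = logits.max()
log_softmax = logits - m - np.log(np.exp(logits - m).sum())
print(log_softmax)  # [0., -1000.]
```

This is exactly why loss functions prefer to consume logits and fold the log in themselves.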

🧠 Temperature: Reshaping the Logit Distribution

In language models, the vocabulary logits are scaled by a temperature $T$ before Softmax:

$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Effect of temperature on the cat/dog/bird example (logits: 4.5, -1.2, 0.8):

| Temperature | Cat P  | Dog P | Bird P | Interpretation                     |
|-------------|--------|-------|--------|------------------------------------|
| T = 1.0     | 97.3%  | 0.3%  | 2.4%   | Standard                           |
| T = 0.5     | ~99.9% | ~0%   | ~0.1%  | Even more peaked on Cat            |
| T = 2.0     | ~82%   | ~5%   | ~13%   | Flatter: more uncertainty expressed |
| T → 0       | 100%   | 0%    | 0%     | Greedy (argmax)                    |

Low T: Sharp distribution. High confidence. Repetitive in language models.
High T: Flat distribution. Diverse/random. Creative in language models.
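A temperature-scaled Softmax is a one-line change to the standard one. A minimal NumPy sketch, run on the same Cat/Dog/Bird logits as the table:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    # Divide logits by T, then apply an ordinary (stabilized) Softmax.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # stability shift; does not change the result
    e = np.exp(z)
    return e / e.sum()

logits = [4.5, -1.2, 0.8]  # Cat, Dog, Bird
for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
```

Lowering T sharpens the distribution toward the top logit; raising T flattens it, matching the rows of the table above.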

```mermaid
flowchart LR
    Hidden["Hidden Layer Output"] --> Logits["Logit Layer\n(raw z_i scores)"]
    Logits --> Temp["÷ Temperature T"]
    Temp --> Softmax["Softmax\n→ probabilities P_i"]
    Softmax --> Sample["Sample or Argmax\n→ predicted class / next token"]
```

โš–๏ธ Logits in Different Contexts

| Context                    | Output Layer                      | Applied After         | Purpose                      |
|----------------------------|-----------------------------------|-----------------------|------------------------------|
| Multi-class classification | Logit vector (one per class)      | Softmax               | Pick one class               |
| Multi-label classification | Logit vector                      | Sigmoid (per logit)   | Multiple binary predictions  |
| Language model             | Logit vector (one per vocab token)| Softmax + Temperature | Sample next token            |
| Binary classification      | Single logit                      | Sigmoid               | P(positive class)            |
| Regression                 | Raw value (no normalization)      | None                  | Continuous output            |
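The Softmax vs. sigmoid rows differ in one key way: Softmax couples all logits into a single distribution, while sigmoid squashes each logit independently. A small NumPy sketch on the same example logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([4.5, -1.2, 0.8])

# Multi-class: Softmax couples the scores into one distribution (sums to 1).
softmax_p = np.exp(logits - logits.max())
softmax_p /= softmax_p.sum()

# Multi-label: sigmoid treats each logit independently (need not sum to 1).
sigmoid_p = sigmoid(logits)
print(softmax_p.sum())  # 1.0
print(sigmoid_p)        # each in (0, 1), independent of the others
```

So the same logit vector can answer "which one class?" (Softmax) or "which labels apply?" (sigmoid), which is exactly the flexibility argument from earlier.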

📌 Summary

  • Logits = raw, unnormalized scores from the final linear layer of a neural network.
  • Softmax converts logits to a probability distribution summing to 1.
  • Temperature divides logits before Softmax: low T → peaked (confident), high T → flat (diverse).
  • CrossEntropyLoss in PyTorch/TensorFlow takes logits directly; don't apply Softmax before passing to the loss function, since it's included internally for numerical stability.
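To make the last bullet concrete, here is a from-scratch sketch of what a cross-entropy-from-logits loss computes for a single sample. This is NumPy mimicking the semantics of PyTorch's nn.CrossEntropyLoss, not its actual implementation:

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    # loss = -log_softmax(logits)[target], using the log-sum-exp trick
    # so the log and the Softmax are fused for numerical stability.
    z = np.asarray(logits, dtype=float)
    m = z.max()
    log_probs = z - m - np.log(np.exp(z - m).sum())
    return -log_probs[target]

logits = [4.5, -1.2, 0.8]  # Cat, Dog, Bird
print(cross_entropy_from_logits(logits, target=0))  # small: model already favors Cat
print(cross_entropy_from_logits(logits, target=1))  # large: Dog was nearly ruled out
```

Passing Softmax output into such a loss would effectively apply Softmax twice, which both distorts the gradient and reintroduces the underflow problem the fused form avoids.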

๐Ÿ“ Practice Quiz

  1. A model outputs logits [4.5, -1.2, 0.8] for classes Cat, Dog, Bird. After Softmax, which class has the highest probability?

    • A) Dog, because -1.2 is the outlier.
    • B) Cat: the highest logit (4.5) maps to the highest Softmax probability.
    • C) Bird โ€” the middle logit is most "normal."
      Answer: B
  2. You apply temperature T=0.1 to language model logits before Softmax. What effect does this have?

    • A) The model becomes more creative and unpredictable.
    • B) The distribution becomes very peaked: the highest-logit token gets nearly all probability mass; output is near-deterministic.
    • C) It has no effect when T < 1.
      Answer: B
  3. Why does PyTorch's nn.CrossEntropyLoss expect raw logits rather than Softmax-normalized probabilities?

    • A) Logits require less memory.
    • B) Computing log(Softmax(logits)) directly (LogSoftmax) is numerically more stable than computing Softmax first and then taking the log, avoiding floating-point underflow.
    • C) CrossEntropyLoss cannot process probabilities between 0 and 1.
      Answer: B
