
Softmax Function Explained: From Raw Scores to Probabilities

Why every neural network classifier converts its raw output through exponential normalization before calling it a prediction


TLDR: Softmax converts a vector of raw scores (logits) into a valid probability distribution by exponentiating each value and dividing by the total. Subtracting the max before exponentiating prevents floating-point overflow. Temperature scaling controls how "peaked" or "flat" the distribution is. Softmax appears twice in every transformer (inside scaled dot-product attention and at the vocabulary output head), making it one of the most consequential two-line functions in all of deep learning.


📖 When Confidence Scores Aren't Probabilities

A neural network for image classification finishes its forward pass and outputs three numbers for a photo of a cat: Cat: 4.5, Dog: -1.2, Bird: 0.8. Is this model "97% sure it's a cat", or just "somewhat more sure than the others"? Looking at the raw numbers alone, you genuinely cannot tell.

These numbers are called logits: the raw, unnormalized output of the network's final linear layer. They have no fixed scale, no guaranteed range, and they do not sum to one. A logit of 4.5 from one model trained on ImageNet is not the same confidence level as 4.5 from a fine-tuned model trained on medical images. The numbers are meaningless in isolation.

To make predictions useful, you need to convert those raw scores into a proper probability distribution: a set of values that are all positive, sum to exactly 1.0, and can be read as "the model believes there is a P% chance this belongs to class X." That conversion is exactly what the Softmax function does.

Understanding Softmax deeply is not a theoretical exercise. Knowing when it overflows (and why), how temperature scaling changes its behavior, where it lives inside transformer attention layers, and why the log-softmax trick exists are the skills that separate practitioners who can debug training instabilities from those who cannot.


๐Ÿ” What Softmax Does: Turning Competing Scores Into a Shared Probability

Think of Softmax like an election with proportional representation driven by exponential votes. Five candidates (classes) compete. Their raw scores are: A = 3.0, B = 1.0, C = 0.2, D = −0.5, E = 2.5. Instead of assigning votes linearly, Softmax assigns votes proportional to $e^{\text{score}}$, the exponential of each score. Candidate A does not get 3x the votes of Candidate B (score ratio); they get $e^{3.0} / e^{1.0} = e^2 \approx 7.4\times$ the votes (exponential ratio). The winner is amplified. The losers are suppressed.
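As a quick illustration, here is a minimal NumPy sketch (not part of any model, just the election analogy in code) computing the exponential "votes" and the resulting shares:

import numpy as np

# Raw scores for candidates A through E from the election analogy
scores = np.array([3.0, 1.0, 0.2, -0.5, 2.5])

# Exponential "votes" and their normalized shares -- this is exactly Softmax
votes = np.exp(scores)
shares = votes / votes.sum()

for name, v, s in zip("ABCDE", votes, shares):
    print(f"{name}: votes={v:6.2f}  share={s:.1%}")
# A receives roughly e^2 ~ 7.4x the votes of B, and the shares sum to 100%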

Four key properties make Softmax the right tool here:

| Property | Why It Matters |
| --- | --- |
| All outputs are positive | $e^x > 0$ for any real $x$, so no negative probabilities |
| Outputs sum to exactly 1 | Division by the total enforces a valid probability distribution |
| Differentiable everywhere | Smooth gradients flow through Softmax during backpropagation |
| Preserves relative ordering | The class with the highest logit always gets the highest probability |

This is also why it is called soft-max: it behaves like the argmax function (putting all mass on the winning class) when score differences are large, but remains smooth and differentiable, which hard argmax is not. That differentiability is what makes it trainable via gradient descent.


โš™๏ธ The Softmax Formula: Why Exponentials and How Normalization Works

Given a vector of $K$ logits $\mathbf{z} = [z_1, z_2, \ldots, z_K]$, Softmax produces output probabilities $\mathbf{p} = [p_1, p_2, \ldots, p_K]$ defined as:

$$p_i = \frac{e^{z_i}}{\displaystyle\sum_{j=1}^{K} e^{z_j}}$$

Let's walk through this with the image classifier example: logits Cat = 4.5, Dog = -1.2, Bird = 0.8.

| Class | Logit $z_i$ | $e^{z_i}$ | $p_i = e^{z_i} / \Sigma$ |
| --- | --- | --- | --- |
| Cat | 4.5 | 90.02 | 90.02 / 92.55 ≈ 97.3% |
| Dog | −1.2 | 0.30 | 0.30 / 92.55 ≈ 0.3% |
| Bird | 0.8 | 2.23 | 2.23 / 92.55 ≈ 2.4% |
| Sum |  | 92.55 | 100.0% |

The model is nearly certain it is a cat. The raw gap of 4.5 vs 0.8 (a difference of 3.7) became a probability gap of 97.3% vs 2.4%: exponential amplification at work.

Why not just use a linear normalization? If you divided each logit by the sum of all logits, negative logits would produce negative "probabilities": for the example above the sum is 4.1, so Cat would get 4.5 / 4.1 ≈ 110% and Dog would get −1.2 / 4.1 ≈ −29%, which is not a distribution at all. A linear ratio also fails to amplify confident predictions the way the exponential does, so a large score margin would translate into only a modest probability margin. The exponential ensures confident predictions stay confident.
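To make the comparison concrete, here is a small sketch contrasting linear normalization with Softmax on the cat/dog/bird logits (the linear version is shown only to demonstrate why it fails):

import numpy as np

logits = np.array([4.5, -1.2, 0.8])       # Cat, Dog, Bird

# Linear "normalization": divide each logit by the sum of raw logits
linear = logits / logits.sum()
print("Linear: ", np.round(linear, 3))    # [ 1.098 -0.293  0.195] -- negative and >1, not a distribution

# Softmax: exponentiate first, then divide by the sum
softmax = np.exp(logits) / np.exp(logits).sum()
print("Softmax:", np.round(softmax, 3))   # [0.973 0.003 0.024] -- valid probabilities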


🧠 Deep Dive: Competitive Suppression, Gradients, and the Cross-Entropy Connection

The Internals: Winner-Take-More and How Softmax Amplifies Score Margins

The most important intuition about Softmax that textbooks frequently understate is that it does not just "normalize"; it actively suppresses the losing classes, more aggressively as the winning margin grows.

Consider: logits [1.0, 0.0] → Softmax → [73%, 27%]. Now add 10 to both: logits [11.0, 10.0] → Softmax → still [73%, 27%]. Softmax is invariant to adding a constant to all logits. This is a foundational property: it means the absolute scale of logits does not matter, only their relative differences.
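The shift invariance is easy to verify directly; a minimal check:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # the max subtraction is itself justified by shift invariance
    return e / e.sum()

print(softmax(np.array([1.0, 0.0])))     # ~[0.731, 0.269]
print(softmax(np.array([11.0, 10.0])))   # identical: ~[0.731, 0.269]
print(softmax(np.array([-4.0, -5.0])))   # still identical -- only the gap of 1.0 matters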

Now watch what happens as the margin between classes grows:

| Logits | Winner Probability | Loser Probability |
| --- | --- | --- |
| [1.0, 0.0] | 73.1% | 26.9% |
| [2.0, 0.0] | 88.1% | 11.9% |
| [5.0, 0.0] | 99.3% | 0.7% |
| [10.0, 0.0] | 99.995% | 0.005% |

As the logit margin increases, Softmax increasingly resembles argmax: all probability mass flows to the winner. This is why a well-trained classifier with a large margin is appropriately very confident, while an undertrained or uncertain model has a flatter distribution.

The gradient of Softmax combined with cross-entropy loss is elegant. Cross-entropy loss for a correct class $c$ is:

$$\mathcal{L} = -\log(p_c)$$

The gradient of this loss with respect to logit $z_i$ simplifies to a strikingly clean form:

$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i$$

where $y_i = 1$ for the correct class and $y_i = 0$ for all others. In plain English: the gradient is just prediction minus truth. For the correct class, the gradient is negative (push the logit higher). For incorrect classes, the gradient is positive (push them lower). No complicated chain-rule matrices needed. This is why Softmax + cross-entropy is the universal default for multi-class classification: it is both numerically well-behaved and mathematically beautiful.

Mathematical Model: Tracing the Formula Through a Full Training Step

Let's trace a three-class example end to end with logits $\mathbf{z} = [2.0, 1.0, 0.5]$ and the true label being class 0.

Forward pass:

  1. Exponentiate: $[e^{2.0}, e^{1.0}, e^{0.5}] = [7.389, 2.718, 1.649]$
  2. Sum: $7.389 + 2.718 + 1.649 = 11.756$
  3. Probabilities: $\mathbf{p} = [0.629, 0.231, 0.140]$

Loss: $\mathcal{L} = -\log(0.629) = 0.464$

Backward pass (gradients):

  • $\partial \mathcal{L} / \partial z_0 = 0.629 - 1 = -0.371$ → backprop will push $z_0$ higher ✓
  • $\partial \mathcal{L} / \partial z_1 = 0.231 - 0 = +0.231$ → backprop will push $z_1$ lower ✓
  • $\partial \mathcal{L} / \partial z_2 = 0.140 - 0 = +0.140$ → backprop will push $z_2$ lower ✓

After one gradient step, the correct class logit increases and the incorrect ones decrease, which is exactly the behavior that drives training forward.
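A quick way to confirm the hand-traced numbers is to let PyTorch recompute them via autograd; a small sanity check, assuming PyTorch is installed:

import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 1.0, 0.5], requires_grad=True)
target = torch.tensor([0])                     # true class is class 0

# cross_entropy applies log-softmax internally, so we pass raw logits
loss = F.cross_entropy(z.unsqueeze(0), target)
loss.backward()

print(f"loss  = {loss.item():.3f}")            # ~0.464
print(f"dL/dz = {z.grad.tolist()}")            # ~[-0.371, 0.231, 0.140], i.e. p - y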

Performance Analysis: Softmax Complexity and Bottlenecks in Large Vocabularies

Computing Softmax over a vector of $K$ values takes $O(K)$ time: one pass to exponentiate, one to sum, one to divide. For image classification with 1,000 ImageNet classes, this is essentially free.

The bottleneck becomes significant in language models. GPT-style models have vocabularies of 50,000 to 100,000 tokens. At inference time, every token generation step requires a Softmax over the full vocabulary. For a batch of 32 sequences of length 512 tokens each, that is 16,384 separate Softmax operations per forward pass, each over 50,000 values.

Additionally, the sum reduction ($\sum_j e^{z_j}$) requires synchronization across all $K$ values; it cannot be parallelized as trivially as a matrix multiplication. This is why large-vocabulary training sometimes uses sampled softmax (an approximation that scores only a subset of negative classes during training), and why inference uses vocabulary truncation strategies like top-k or nucleus sampling that restrict the candidate set before the final Softmax rather than after.
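As a sketch of "truncate first, then normalize," here is what top-k sampling over a vocabulary-sized logit vector looks like (sizes are illustrative; real decoders add top-p filtering, temperature, and batching):

import torch
import torch.nn.functional as F

vocab_logits = torch.randn(50_000)          # simulated next-token logits
k = 50

# Keep only the k highest logits, then apply Softmax over that small subset
topk_vals, topk_idx = vocab_logits.topk(k)
topk_probs = F.softmax(topk_vals, dim=-1)   # Softmax over 50 values instead of 50,000

# Sample a position within the top-k set and map it back to a vocabulary id
choice = torch.multinomial(topk_probs, num_samples=1)
next_token_id = topk_idx[choice]
print(next_token_id.item())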


๐ŸŒก๏ธ Temperature Scaling: Sharpening and Flattening the Distribution

Standard Softmax has no "confidence dial." Temperature scaling adds one by dividing the logits by a scalar $T$ before applying Softmax:

$$p_i = \frac{e^{z_i / T}}{\displaystyle\sum_{j=1}^{K} e^{z_j / T}}$$

The parameter $T$ is the temperature, borrowed from statistical physics (the Boltzmann distribution):

| Temperature | Effect | Practical Use |
| --- | --- | --- |
| $T = 1$ | Standard Softmax, no change | Default for classification |
| $T < 1$ (e.g., 0.3) | Sharpens: amplifies differences, winner takes more | Confident generation, structured outputs, greedy decoding |
| $T > 1$ (e.g., 1.5) | Flattens: compresses differences, losing classes get more probability | Creative generation, diverse sampling, exploration |
| $T \to 0$ | Approaches hard argmax: all mass on the winner | Equivalent to greedy decoding |
| $T \to \infty$ | Approaches a uniform distribution: all classes equally likely | Random sampling, maximum diversity |

This is the exact mechanism behind the "temperature" setting in ChatGPT, Claude, and other LLMs. When you set temperature to 0.2, you're sharpening the token probability distribution, so the model follows its most likely path. At temperature 1.5, the model samples more broadly from its distribution, producing more varied but sometimes less coherent text.

Temperature scaling happens before Softmax, applied directly to the logits. It does not change which class has the highest probability, only by how much. The argmax prediction is temperature-invariant; only the sampling behavior changes.
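A few lines are enough to check that claim (a small sketch using the same stable formulation as elsewhere in this post):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.0, 1.5, 0.5])
for T in [0.3, 1.0, 5.0]:
    p = softmax(logits / T)
    print(f"T={T}: argmax={p.argmax()}, top prob={p.max():.3f}")
# argmax is class 0 at every temperature; only the top probability (0.993, 0.766, 0.426) changes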


🆚 Softmax vs Sigmoid: When Multi-Class Means Something Different

Sigmoid and Softmax are frequently confused. Both map real numbers to (0, 1), but they serve fundamentally different purposes:

| Dimension | Softmax | Sigmoid |
| --- | --- | --- |
| Output constraint | Outputs sum to exactly 1 across all classes | Each output independent; outputs may sum to anything |
| Use case | Mutually exclusive multi-class: exactly one class is correct | Multi-label: any combination of classes can be true |
| Formula | $p_i = e^{z_i} / \sum_j e^{z_j}$ | $\sigma(z) = 1 / (1 + e^{-z})$ |
| Interactions | Outputs compete; increasing one decreases others | Outputs independent; changing one does not affect others |
| Binary classification | Reduces to sigmoid when $K = 2$ | Natural binary choice |
| Example | "Is this photo a cat, dog, or bird?" | "Which of these tags apply to this article?" |

The single most common mistake in building classifiers is using Softmax for a multi-label problem. If you're classifying movie genres and a film can be both "Action" and "Comedy," using Softmax forces the probabilities to compete: adding mass to "Action" necessarily removes mass from "Comedy." Use sigmoid with binary cross-entropy per label instead.
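A minimal PyTorch sketch of the multi-label setup (the genre names, logits, and 0.5 threshold here are illustrative assumptions):

import torch
import torch.nn as nn

genres = ["Action", "Comedy", "Drama", "Horror"]
logits = torch.tensor([[2.3, 1.1, -0.4, -3.0]])    # one film, one logit per genre
targets = torch.tensor([[1.0, 1.0, 0.0, 0.0]])     # the film is both Action and Comedy

# Independent per-label probabilities -- they do not sum to 1 and do not compete
probs = torch.sigmoid(logits)
predicted = [g for g, p in zip(genres, probs[0]) if p > 0.5]
print(predicted)                                    # ['Action', 'Comedy']

# BCEWithLogitsLoss applies sigmoid internally, so pass raw logits
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(f"Multi-label loss: {loss.item():.4f}")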

Note that binary Softmax and sigmoid are mathematically equivalent. For two classes with logits $z_0$ and $z_1$, the Softmax probability of class 0 is $e^{z_0} / (e^{z_0} + e^{z_1})$. Dividing numerator and denominator by $e^{z_1}$ gives $1 / (1 + e^{-(z_0 - z_1)})$: the sigmoid of the logit difference. Setting $z_1 = 0$ recovers the familiar $1 / (1 + e^{-z_0})$.
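The equivalence is easy to confirm numerically; a tiny sketch with arbitrary logits:

import numpy as np

z0, z1 = 1.7, -0.4

# Two-class Softmax probability of class 0
p_softmax = np.exp(z0) / (np.exp(z0) + np.exp(z1))

# Sigmoid of the logit difference
p_sigmoid = 1.0 / (1.0 + np.exp(-(z0 - z1)))

print(p_softmax, p_sigmoid)        # both ~0.891 -- identical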


โš ๏ธ The Overflow Trap: Why Naive Softmax Explodes and How the Max Trick Saves It

Here is a problem that bites beginners immediately upon moving from toy examples to real models. Consider a logit vector where one value is large: $\mathbf{z} = [1000, 999, 998]$.

Computing $e^{1000}$ overflows IEEE 754 float64, whose maximum representable value is approximately $e^{709}$. The result is inf, and inf / (inf + inf) is nan. Your entire training step becomes meaningless.

The max subtraction trick fixes this with one line of algebra. Because Softmax is invariant to adding a constant to all logits (as established earlier), we subtract the maximum logit value first:

$$p_i = \frac{e^{z_i - \max(\mathbf{z})}}{\displaystyle\sum_{j=1}^{K} e^{z_j - \max(\mathbf{z})}}$$

For $\mathbf{z} = [1000, 999, 998]$, we subtract 1000: shifted vector is $[0, -1, -2]$. Now $e^0 = 1$, $e^{-1} = 0.368$, $e^{-2} = 0.135$. No overflow. The result is exactly the same probability distribution.

Log-Softmax takes this further. During training, you almost always take the log of Softmax output for the cross-entropy loss. Computing log(softmax(z)) naively is both numerically unstable (a probability that underflows to 0 produces log(0) = −inf) and wasteful (you exponentiate and then take the log, undoing the exponentiation). The numerically stable log-softmax is:

$$\log(p_i) = z_i - \max(\mathbf{z}) - \log\!\left(\sum_j e^{z_j - \max(\mathbf{z})}\right)$$

This avoids both overflow (in the exponentials) and underflow (in the log). PyTorch exposes this as torch.nn.functional.log_softmax, and the loss function nn.NLLLoss expects log-probabilities as input. Equivalently, nn.CrossEntropyLoss combines log-softmax and NLL loss in a single numerically stable operation, which is why you should never manually apply Softmax before CrossEntropyLoss in PyTorch. It will apply log-softmax internally, and you'll be computing softmax twice (plus introducing instability).


📊 Visualizing Softmax Flow Through a Classification Network

The diagram below traces how data moves through a classification neural network from raw input to final prediction, with Softmax acting as the bridge between learned representations and interpretable probabilities.

flowchart TD
    A["Input Data"] --> B["Embedding and Hidden Layers"]
    B --> C["Final Linear Projection"]
    C --> D["Raw Logits Vector"]
    D --> E["Subtract Max for Numerical Stability"]
    E --> F["Exponentiate Each Logit"]
    F --> G["Sum All Exponentiated Values"]
    G --> H["Divide Each Value by Sum"]
    H --> I["Probability Distribution over Classes"]
    I --> J["argmax for Deterministic Prediction"]
    I --> K["Sampling for Stochastic Prediction"]
    I --> L["Cross-Entropy Loss during Training"]
    L --> M["Backpropagation"]
    M --> C

The diagram reveals several important design decisions at once. First, the stability step (subtracting max) happens before exponentiation; in a well-implemented production system this is never skipped. Second, there are two ways to consume the probability distribution at inference time: deterministic argmax (greedy decoding) and stochastic sampling (for generative models). Third, the arrow from Cross-Entropy Loss back to the Final Linear Projection is the gradient signal: the layer immediately before Softmax receives the clean $p_i - y_i$ gradient that makes Softmax + cross-entropy so training-friendly. Hidden layers further back receive their gradients through the chain rule from there.

Notice that the diagram draws the backward arrow to the linear projection rather than to Softmax itself: in practice the Softmax and cross-entropy gradients are fused into the single $p_i - y_i$ expression, and at inference time Softmax is purely a read-out layer. The model's learned knowledge lives in the weight matrices of the linear projection and all preceding layers.


๐ŸŒ Softmax in Real Production Systems: Image Classifiers, LLMs, and Transformers

Softmax appears in far more places than just "the final layer of a classifier." Understanding where it lives in production architectures changes how you read model outputs and debug failure modes.

Image and text classification. The canonical use case: a ResNet classifier for 1,000 ImageNet categories applies Softmax to its 1,000-dimensional logit output. A sentiment classifier for positive/negative/neutral applies Softmax to 3 logits. The predicted class is the argmax of the resulting distribution. During training, the cross-entropy loss uses log-softmax internally.

Transformer self-attention: the inner Softmax. Inside every transformer layer, the scaled dot-product attention mechanism computes a compatibility score between each pair of tokens (query × key). These scores are passed through Softmax along the key dimension to produce attention weights:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$

Here Softmax converts raw token-pair similarity scores into a distribution that says "when processing token $i$, how much attention should I pay to token $j$?" The division by $\sqrt{d_k}$ acts as a temperature scaling with $T = \sqrt{d_k}$: it prevents the dot products from growing so large that Softmax saturates into near-one-hot distributions, which would kill the gradients.
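A minimal single-head version of that formula, with illustrative shapes (real implementations add masking, multiple heads, and batching):

import math
import torch
import torch.nn.functional as F

seq_len, d_k = 6, 64
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)        # raw token-pair compatibility, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=-1)      # each row is a distribution over the keys
output = weights @ V                     # weighted mixture of value vectors

print(weights.sum(dim=-1))               # every row sums to 1.0
print(output.shape)                      # torch.Size([6, 64])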

LLM output head: the vocabulary Softmax. After all transformer layers, a final linear projection maps the hidden state to a vector of size $V$ (vocabulary size, typically 32,000 to 100,000). Softmax converts this into a probability distribution over all possible next tokens. During training, cross-entropy loss is applied. During inference, the model samples from this distribution (or takes the argmax for greedy decoding) to select the next token, one at a time.

The fact that Softmax appears at both the attention layer and the output head means a single transformer forward pass involves dozens of Softmax operations: one per attention head per layer, plus one at the output. Understanding the numerical stability and gradient behavior of Softmax is therefore not optional for anyone building or fine-tuning transformers.


โš–๏ธ Softmax Failure Modes: Overconfidence, Logit Collapse, and Dense Attention

Softmax is ubiquitous but not without problems. Three failure modes appear repeatedly in production systems.

Overconfident predictions. A well-trained model can have logit margins so large that Softmax produces probabilities extremely close to 1.0 for the predicted class. When the model encounters out-of-distribution inputs, ones unlike anything in training, it will still produce a high-confidence prediction rather than signaling uncertainty. This is a structural limitation of Softmax: it always produces a probability distribution, even when "I don't know" is the right answer. Calibration techniques like temperature scaling, label smoothing, and conformal prediction are used to address this.
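Post-hoc temperature scaling is the simplest of those fixes: learn a single scalar $T$ on held-out validation logits by minimizing cross-entropy, then divide test-time logits by it. A rough sketch, where the validation tensors are random stand-ins for real held-out data:

import torch
import torch.nn.functional as F

# Stand-ins for held-out validation logits and labels from a trained classifier
val_logits = torch.randn(1000, 10) * 4.0      # deliberately large, overconfident logits
val_labels = torch.randint(0, 10, (1000,))

log_T = torch.zeros(1, requires_grad=True)    # optimize log(T) so T stays positive
optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

def closure():
    optimizer.zero_grad()
    loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
    loss.backward()
    return loss

optimizer.step(closure)
print(f"Learned temperature T = {log_T.exp().item():.2f}")
# At test time: calibrated_probs = F.softmax(test_logits / log_T.exp(), dim=-1)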

Logit collapse in early training. At initialization, all logits are near zero, so Softmax outputs a near-uniform distribution (approximately 1/K for each class). This is usually fine. But if the learning rate is too high or weight initialization is poor, logits can explode in magnitude early in training. The resulting Softmax produces near-one-hot outputs before the network has learned meaningful representations; gradients essentially stop flowing to the losing classes, stalling convergence.
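One facet of that stall is easy to observe: once the Softmax output is near one-hot, the $p_i - y_i$ gradients it sends back become vanishingly small. A small illustration:

import torch
import torch.nn.functional as F

for scale in [1.0, 10.0, 50.0]:
    z = (torch.tensor([2.0, 1.0, 0.5]) * scale).requires_grad_(True)
    loss = F.cross_entropy(z.unsqueeze(0), torch.tensor([0]))
    loss.backward()
    print(f"scale={scale:>4}: grad={z.grad.tolist()}")
# scale=1.0:  gradients are a healthy fraction of 1
# scale=50.0: gradients are ~0 -- the Softmax has saturated and little learning signal remains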

Quadratic attention density in long sequences. In transformer attention, Softmax produces attention weights over all $n$ tokens for each of the $n$ query positions. This is an $n \times n$ matrix, and because Softmax forces weights to sum to 1, each token attends to all others, even when most have zero relevance. For sequences of length 8,192 or 32,768 (common in modern long-context models), this dense attention is both memory-expensive and semantically noisy. Alternatives like Sparsemax (which produces sparse distributions with exact zeros), local attention, and linear attention all try to address this, though none has fully replaced Softmax in production transformers as of 2025.


🧭 When to Use Softmax and When to Look for Alternatives

| Situation | Recommendation |
| --- | --- |
| Mutually exclusive multi-class classification | Use Softmax; the mutual exclusivity constraint is a feature, not a bug |
| Multi-label classification | Use Sigmoid per class; classes do not compete |
| Binary classification | Use Sigmoid or Softmax with 2 outputs; both are equivalent, Sigmoid is simpler |
| LLM token generation with greedy decoding | Use argmax directly on logits; skip Softmax entirely at inference time |
| Sampling with temperature | Apply temperature scaling to logits, then Softmax, then sample |
| Training with PyTorch CrossEntropyLoss | Do NOT apply Softmax before passing logits to the loss; it applies log-softmax internally |
| Long sequences in transformers | Consider Flash Attention, Sparsemax, or linear attention variants |
| Need calibrated uncertainty estimates | Train with label smoothing; apply post-hoc temperature scaling on the logits |

The single most actionable rule: if you are using nn.CrossEntropyLoss in PyTorch, pass raw logits and never pass softmax outputs. This mistake is silent (the model will still train, just less efficiently and with worse numerical stability) and is one of the most common bugs in ML codebases.


🧪 Practical Examples: Classifying Sentiment and Sampling the Next Token

These two examples demonstrate Softmax in the contexts where practitioners encounter it most often: building a classifier and understanding LLM token generation.

Example 1: Stable NumPy Softmax and the max trick

This demonstrates both the naive (broken) implementation and the numerically stable version. Notice how the naive version fails silently on large logits while the stable version handles them correctly.

import numpy as np

def softmax_naive(z):
    """Naive implementation โ€” overflows for large logits."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_stable(z):
    """Numerically stable implementation using the max subtraction trick."""
    z_shifted = z - z.max()        # shift to prevent overflow
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum()

def log_softmax_stable(z):
    """Numerically stable log-softmax โ€” avoids log(very small number) underflow."""
    z_shifted = z - z.max()
    return z_shifted - np.log(np.exp(z_shifted).sum())

# Typical classification logits -- both implementations agree
typical_logits = np.array([4.5, -1.2, 0.8])
print("Typical logits:")
print("  Naive:  ", np.round(softmax_naive(typical_logits), 4))
print("  Stable: ", np.round(softmax_stable(typical_logits), 4))

# Large logits -- naive overflows, stable is fine
large_logits = np.array([1000.0, 999.0, 998.0])
print("\nLarge logits (overflow scenario):")
print("  Naive:  ", softmax_naive(large_logits))   # [nan, nan, nan]
print("  Stable: ", np.round(softmax_stable(large_logits), 4))  # [0.6652, 0.2447, 0.0900]

# Temperature scaling: control confidence
logits = np.array([3.0, 1.5, 0.5])
for temp in [0.3, 1.0, 2.0]:
    probs = softmax_stable(logits / temp)
    print(f"  T={temp}: {np.round(probs, 3)}")
# T=0.3: [0.993, 0.007, 0.000]  -- very sharp, decisive
# T=1.0: [0.766, 0.171, 0.063]  -- standard
# T=2.0: [0.569, 0.269, 0.163]  -- flat, exploratory

Example 2: PyTorch Softmax in a classifier and the CrossEntropyLoss gotcha

This example shows the correct way to use Softmax in a PyTorch training loop and highlights the most common bug: applying Softmax before CrossEntropyLoss.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Simulated output from a sentiment classifier: 3 classes (negative, neutral, positive)
logits = torch.tensor([[2.1, -0.5, 3.8],   # batch item 0
                        [0.2,  1.3, 0.1]])  # batch item 1
true_labels = torch.tensor([2, 1])  # class indices: positive, neutral

# --- Correct: pass raw logits to CrossEntropyLoss ---
# CrossEntropyLoss internally applies log_softmax + NLLLoss
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, true_labels)
print(f"Correct loss: {loss.item():.4f}")

# --- Wrong: applying softmax before CrossEntropyLoss ---
probs = F.softmax(logits, dim=-1)
wrong_loss = loss_fn(probs, true_labels)  # Loss is still computed but wrong magnitude
print(f"Wrong loss (softmax applied twice internally): {wrong_loss.item():.4f}")

# --- Correct: use softmax only for reading probabilities at inference ---
with torch.no_grad():
    probs = F.softmax(logits, dim=-1)
    predictions = probs.argmax(dim=-1)
    print(f"\nPredicted classes: {predictions.tolist()}")   # [2, 1]
    print(f"Confidences:       {probs.max(dim=-1).values.tolist()}")

# --- Temperature-scaled sampling (LLM token selection) ---
vocab_logits = torch.randn(50000)  # simulate vocabulary-size logit vector
temperature = 0.8
scaled_probs = F.softmax(vocab_logits / temperature, dim=-1)
sampled_token = torch.multinomial(scaled_probs, num_samples=1)
print(f"\nSampled token ID at T=0.8: {sampled_token.item()}")

# --- Using log_softmax + NLLLoss (equivalent to CrossEntropyLoss) ---
log_probs = F.log_softmax(logits, dim=-1)
nll_loss = nn.NLLLoss()(log_probs, true_labels)
print(f"NLLLoss with log_softmax: {nll_loss.item():.4f}")  # matches loss above

๐Ÿ› ๏ธ PyTorch and NumPy: Softmax APIs Worth Knowing

PyTorch provides production-grade Softmax implementations that handle all the numerical stability concerns described in this post.

torch.nn.functional.softmax(input, dim) computes standard Softmax over the specified dimension. Omitting dim is deprecated in modern PyTorch and triggers a warning, so always specify it explicitly. For a batch of logits with shape [batch_size, num_classes], use dim=-1 (the class dimension).

torch.nn.functional.log_softmax(input, dim) computes log-softmax in a single numerically stable pass; it does not compute Softmax and then take the log. Use this when you need log-probabilities for downstream computations (NLL loss, beam search scoring, sequence probability calculations).

torch.nn.CrossEntropyLoss combines log_softmax + NLLLoss internally. This is the correct loss for multi-class classification. Pass raw logits; do not pre-apply Softmax or log-softmax.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Batch of logits: shape [batch=4, classes=5]
logits = torch.randn(4, 5)

# Standard softmax over class dimension
probs = F.softmax(logits, dim=-1)
assert torch.allclose(probs.sum(dim=-1), torch.ones(4), atol=1e-6), "Should sum to 1"

# Log-softmax: numerically stable, use for NLLLoss or manual CE
log_probs = F.log_softmax(logits, dim=-1)

# CrossEntropyLoss: the right way to train a classifier
labels = torch.randint(0, 5, (4,))
loss = nn.CrossEntropyLoss()(logits, labels)   # pass raw logits -- NOT probs

# Label smoothing: reduces overconfidence by mixing target with uniform distribution
smooth_loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, labels)

print(f"Standard CE loss:       {loss.item():.4f}")
print(f"Label-smoothed CE loss: {smooth_loss.item():.4f}")

For a full deep-dive on training classifiers with PyTorch including learning rate schedules, mixed precision, and validation loops, see the Machine Learning Fundamentals series.


📚 Lessons Learned from Softmax Bugs in Production

1. Never apply Softmax before CrossEntropyLoss. This is the single most common PyTorch bug. The loss function applies log-softmax internally. Passing softmax outputs means the model is computing softmax twice, which degrades gradients and inflates the loss to incorrect values. The model will still train; it will just be wrong in a way that is difficult to diagnose.

2. The numerical stability fix is one line and has no downside. Subtracting the max before exponentiating is mathematically equivalent to standard Softmax (Softmax is shift-invariant) and prevents float overflow with zero trade-off. There is no reason to use the naive version anywhere in production code.

3. Temperature is a logit-space operation, not a probability-space operation. Dividing probabilities by a constant is not the same as dividing logits. Temperature scaling must happen on the raw logits before Softmax, not on the Softmax output.

4. A confident model is not a correct model. Softmax always produces a distribution. High Softmax probability (97%) does not mean the model is 97% likely to be correct; it means the model has assigned 97% of its probability mass to that class relative to the others it was trained to distinguish. On out-of-distribution inputs, a well-trained model will still produce confident-looking outputs. Treat Softmax outputs as relative confidence scores, not calibrated probabilities, unless you have explicitly applied temperature calibration using a held-out validation set.

5. Log-softmax is underutilized. Practitioners who need log-probabilities (beam search, sequence scoring, CTC) often compute log(softmax(z)) manually. This is both numerically unstable and slower. F.log_softmax should always be used instead.


📌 TLDR: Softmax in Six Bullets

  • What it does: Converts a vector of raw logits into a probability distribution that sums to 1, using the formula $p_i = e^{z_i} / \sum_j e^{z_j}$.
  • Why exponentials: They ensure all outputs are positive, amplify differences between classes, and are differentiable everywhere, making Softmax compatible with gradient descent.
  • Numerical stability: Always subtract the maximum logit before exponentiating. Use log_softmax when you need log-probabilities. Never call softmax before CrossEntropyLoss in PyTorch.
  • Temperature scaling: Divide logits by $T$ before Softmax. $T < 1$ sharpens (more decisive), $T > 1$ flattens (more exploratory). This is the "temperature" knob in LLM APIs.
  • Softmax vs Sigmoid: Softmax for mutually exclusive multi-class (outputs compete, sum to 1). Sigmoid for multi-label or binary (outputs independent, each between 0 and 1).
  • In transformers: Softmax appears in the attention layer (to normalize attention scores) and at the output head (to produce token probabilities). Both use it for different purposes; attention Softmax uses $\sqrt{d_k}$ temperature scaling to prevent saturation.
