Calculus of Backpropagation: Mathematical Derivation from Scratch with Matrix Derivatives
Derive and implement the backpropagation algorithm from scratch in Python using raw matrix calculus.

Abstract Algorithms
Helping engineers master software engineering topics.
TLDR: Backpropagation is the engine behind deep learning. It uses the chain rule of calculus to calculate the gradient of a loss function with respect to the network's weights. This guide derives the backpropagation equations using matrix calculus, analyzes vanishing gradients, and provides a runnable Python implementation using only raw NumPy.
π Concept: The Mathematics of Neural Network Training
Supervised machine learning relies on optimization. A neural network is a mathematical function $f(x; W)$ that maps an input vector $x$ to a predicted output vector $\hat{y}$ using a set of weights and biases, collectively denoted as $W$. Training the network means finding the parameters $W$ that minimize a loss function $L(y, \hat{y})$, which measures the error between our predictions and the ground-truth targets $y$.
To minimize the loss function, we use Gradient Descent. Gradient descent updates the weights in the opposite direction of the gradient of the loss function:
$$ W \leftarrow W - \alpha \frac{\partial L}{\partial W} $$
where $\alpha$ is the learning rate. Calculating the partial derivatives of the loss with respect to every weight in a deep neural network is complex. The Backpropagation algorithm solves this by using the chain rule of calculus to recursively calculate gradients from the output layer back to the input layer.
βοΈ Mechanics: Forward and Backward Propagation Passes
A neural network alternates between two passes:
- Forward Pass: Input data propagates forward through the layers. For each layer $l$, we calculate the weighted input $z^{[l]}$ and the activation $a^{[l]}$:
$$ z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} $$
$$ a^{[l]} = \sigma(z^{[l]}) $$
where $\sigma$ is an activation function (like Sigmoid or ReLU). - Backward Pass: We compute the gradient of the loss with respect to the activations, weighted inputs, and parameters of each layer. We do this by calculating the error term $\delta^{[l]} = \frac{\partial L}{\partial z^{[l]}}$ for each layer and using it to find $\frac{\partial L}{\partial W^{[l]}}$ and $\frac{\partial L}{\partial b^{[l]}}$.
π Flow: High-Level Gradient Propagation Sequence
The flowchart below visualizes the transition between forward and backward propagation through a two-layer neural network:
flowchart LR
X[Input x] -->|Forward Pass: W1, b1| Z1[Z1 = W1*x + b1]
Z1 -->|Activation: Sigmoid| A1[A1 = sig Z1]
A1 -->|Forward Pass: W2, b2| Z2[Z2 = W2*A1 + b2]
Z2 -->|Activation: Sigmoid| A2[Output y_hat]
A2 -->|Compute Loss| Loss[Loss L]
Loss -->|Backward Pass: dLoss/dZ2| Delta2[Delta 2]
Delta2 -->|dLoss/dW2| GradW2[dLoss/dW2]
Delta2 -->|dLoss/dA1| Delta1[Delta 1]
Delta1 -->|dLoss/dW1| GradW1[dLoss/dW1]
The table below traces the variables calculated at each step of the forward and backward passes:
| Step | Operation Type | Calculated Variable | Equation |
| 1 | Forward | Weighted Input $z^{[1]}$ | $z^{[1]} = W^{[1]} x + b^{[1]}$ |
| 2 | Forward | Activation $a^{[1]}$ | $a^{[1]} = \sigma(z^{[1]})$ |
| 3 | Forward | Weighted Input $z^{[2]}$ | $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$ |
| 4 | Forward | Output $\hat{y} = a^{[2]}$ | $a^{[2]} = \sigma(z^{[2]})$ |
| 5 | Backward | Output Error $\delta^{[2]}$ | $\delta^{[2]} = \frac{\partial L}{\partial a^{[2]}} \odot \sigma'(z^{[2]})$ |
| 6 | Backward | Layer 2 Gradients | $\frac{\partial L}{\partial W^{[2]}} = \delta^{[2]} (a^{[1]})^T$, $\frac{\partial L}{\partial b^{[2]}} = \delta^{[2]}$ |
| 7 | Backward | Hidden Error $\delta^{[1]}$ | $\delta^{[1]} = ((W^{[2]})^T \delta^{[2]}) \odot \sigma'(z^{[1]})$ |
| 8 | Backward | Layer 1 Gradients | $\frac{\partial L}{\partial W^{[1]}} = \delta^{[1]} x^T$, $\frac{\partial L}{\partial b^{[1]}} = \delta^{[1]}$ |
π§ Deep Dive: Structural Evolution & Performance Profiling
Let us examine the mathematical foundations and performance characteristics of matrix backpropagation.
Neural Network Layer Internals
A neural network layer consists of a weight matrix $W$ of size $n^{[l]} \times n^{[l-1]}$, where $n^{[l]}$ is the number of neurons in the current layer and $n^{[l-1]}$ is the number of neurons in the previous layer. The bias is a vector $b$ of size $n^{[l]} \times 1$. Storing these parameters in contiguous memory allocations using NumPy arrays allows the CPU to perform vector matrix multiplications in parallel using BLAS (Basic Linear Algebra Subprograms) libraries.
Mathematical Model of Backpropagation Calculus
Let us derive the exact matrix derivatives for a network of $L$ layers. We define the error term at layer $l$ as: $$ \delta^{[l]} = \frac{\partial L}{\partial z^{[l]}} $$
For the final layer $L$, using the chain rule: $$ \delta^{[L]} = \frac{\partial L}{\partial a^{[L]}} \odot \sigma'(z^{[L]}) $$ where $\odot$ denotes the element-wise Hadamard product. For a mean squared error loss $L = \frac{1}{2} | a^{[L]} - y |^2$, the derivative is: $$ \frac{\partial L}{\partial a^{[L]}} = a^{[L]} - y $$
To derive the error term for a hidden layer $l$ from the succeeding layer $l+1$, we use the multivariate chain rule: $$ \delta^{[l]} = \frac{\partial L}{\partial z^{[l]}} = \left( \frac{\partial z^{[l+1]}}{\partial z^{[l]}} \right)^T \frac{\partial L}{\partial z^{[l+1]}} $$ Since $z^{[l+1]} = W^{[l+1]} \sigma(z^{[l]}) + b^{[l+1]}$, we have: $$ \frac{\partial z^{[l+1]}}{\partial z^{[l]}} = W^{[l+1]} \text{diag}(\sigma'(z^{[l]})) $$ Substituting this back gives our final vector recurrence relation: $$ \delta^{[l]} = \left( (W^{[l+1]})^T \delta^{[l+1]} \right) \odot \sigma'(z^{[l]}) $$
Once we have $\delta^{[l]}$, we calculate the partial derivatives for the weights and biases: $$ \frac{\partial L}{\partial W^{[l]}} = \frac{\partial L}{\partial z^{[l]}} \frac{\partial z^{[l]}}{\partial W^{[l]}} = \delta^{[l]} (a^{[l-1]})^T $$ $$ \frac{\partial L}{\partial b^{[l]}} = \delta^{[l]} $$
These equations form the complete mathematical model of backpropagation.
Performance Analysis of NumPy Vectorization
Implementing these equations in Python using loops would result in slow execution times because Python interpreter loops add significant overhead.
Instead, we use Vectorization. We pack $m$ training examples into a design matrix $X$ of size $n^{[0]} \times m$. The forward and backward equations become matrix-matrix multiplications: $$ Z^{[1]} = W^{[1]} X + b^{[1]} $$ $$ dW^{[l]} = \frac{1}{m} \delta^{[l]} (A^{[l-1]})^T $$ By vectorizing the batch, NumPy delegates the heavy computations to C-based linear algebra libraries (like OpenBLAS or MKL), accelerating training throughput by several orders of magnitude.
ποΈ Advanced Concepts: Vanishing Gradients and Activation Functions
A classic problem when training deep networks using backpropagation is the Vanishing Gradient Problem.
Consider the derivative of the Sigmoid activation function: $$ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $$ The maximum value of the Sigmoid derivative is $0.25$. As gradients are backpropagated through many layers, the term $\delta^{[l]} = \left( (W^{[l+1]})^T \delta^{[l+1]} \right) \odot \sigma'(z^{[l]})$ is multiplied by these small derivatives repeatedly. For a 5-layer network, the gradient of the first layer is scaled by at least $(0.25)^5 \approx 0.00097$. This causes the weights of the early layers to update very slowly, halting learning.
To prevent this, modern networks use the ReLU (Rectified Linear Unit) activation function: $$ f(x) = \max(0, x) \quad \implies \quad f'(x) = \begin{cases} 1 & x > 0 \ 0 & x \le 0 \end{cases} $$ Because the derivative of ReLU is 1 for positive inputs, gradients propagate through deep networks without vanishing.
π Applications: From Perceptrons to Transformers
- Transformer Training: Large Language Models (LLMs) use backpropagation to learn attention weight projections during training.
- Convolutional Networks (CNNs): Image recognition networks use backpropagation to update spatial filter matrices.
- Reinforcement Learning: Policy networks use backpropagation scaled by environment rewards to adjust agent actions.
βοΈ Trade-offs and Failure Modes
- Exploding Gradients: If the weight values are initialized too large, the matrix multiplication $((W^{[l+1]})^T \delta^{[l+1]})$ can cause gradients to grow exponentially, leading to numerical overflow (
NaNvalues). - Mitigation: Use Xavier (Glorot) or He initialization to scale initial weights based on layer sizes.
- Dead ReLUs: If a large gradient updates a neuron such that it always outputs negative values, the ReLU derivative will always be 0. That neuron will permanently stop learning.
- Mitigation: Use a Leaky ReLU activation function or a smaller learning rate.
π§ Decision Guide: Custom Autograd vs. Frameworks
| Scenario | Recommendation | Alternative |
| Learning neural network theory or interviewing | Write Custom Backpropagation | Solidifies understanding of matrix calculus. |
| Building production-grade deep learning models | PyTorch / TensorFlow | Autograd automatically constructs the computation graph and compiles it to GPU/TPU kernels. |
| Deploying lightweight models on resource-constrained devices | ONNX Runtime | Avoids the overhead of large training frameworks. |
π§ͺ Practical Implementation: NumPy MLP from Scratch
Here is a complete, runnable Python implementation of a 2-layer Neural Network with backpropagation from scratch.
import numpy as np
class NumPyNeuralNetwork:
def __init__(self, input_dim, hidden_dim, output_dim):
# Seed for reproducibility
np.random.seed(42)
# He initialization for hidden layer, Xavier for output layer
self.W1 = np.random.randn(hidden_dim, input_dim) * np.sqrt(2.0 / input_dim)
self.b1 = np.zeros((hidden_dim, 1))
self.W2 = np.random.randn(output_dim, hidden_dim) * np.sqrt(1.0 / hidden_dim)
self.b2 = np.zeros((output_dim, 1))
def _sigmoid(self, z):
return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
def _sigmoid_derivative(self, z):
s = self._sigmoid(z)
return s * (1.0 - s)
def forward(self, X):
# X shape: (input_dim, batch_size)
self.A0 = X
self.Z1 = np.dot(self.W1, self.A0) + self.b1
self.A1 = self._sigmoid(self.Z1)
self.Z2 = np.dot(self.W2, self.A1) + self.b2
self.A2 = self._sigmoid(self.Z2)
return self.A2
def backward(self, Y):
# Y shape: (output_dim, batch_size)
m = Y.shape[1]
# Output layer error
self.dZ2 = (self.A2 - Y) * self._sigmoid_derivative(self.Z2)
self.dW2 = (1.0 / m) * np.dot(self.dZ2, self.A1.T)
self.db2 = (1.0 / m) * np.sum(self.dZ2, axis=1, keepdims=True)
# Hidden layer error (propagated back)
self.dZ1 = np.dot(self.W2.T, self.dZ2) * self._sigmoid_derivative(self.Z1)
self.dW1 = (1.0 / m) * np.dot(self.dZ1, self.A0.T)
self.db1 = (1.0 / m) * np.sum(self.dZ1, axis=1, keepdims=True)
def update_parameters(self, learning_rate):
self.W1 -= learning_rate * self.dW1
self.b1 -= learning_rate * self.db1
self.W2 -= learning_rate * self.dW2
self.b2 -= learning_rate * self.db2
def train(self, X, Y, epochs, learning_rate):
for epoch in range(epochs):
predictions = self.forward(X)
self.backward(Y)
self.update_parameters(learning_rate)
if epoch % 100 == 0:
loss = np.mean(0.5 * (predictions - Y) ** 2)
print(f"Epoch {epoch:04d} - Loss: {loss:.6f}")
# Example worked dataset: Binary XOR Logic Gate
if __name__ == "__main__":
# 4 training examples, 2 input features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T
# XOR outputs
Y = np.array([[0], [1], [1], [0]]).T
# Network configuration: 2 inputs, 4 hidden units, 1 output
nn = NumPyNeuralNetwork(input_dim=2, hidden_dim=4, output_dim=1)
print("Training NumPy MLP on XOR gate dataset:")
nn.train(X, Y, epochs=1000, learning_rate=0.5)
# Final predictions
final_output = nn.forward(X)
print("\nFinal XOR Predictions:")
for i in range(X.shape[1]):
print(f"Input: {X[:, i]} -> Target: {Y[0, i]} -> Predicted: {final_output[0, i]:.4f}")
π Lessons Learned: Common Backpropagation Bugs
- Incorrect Matrix Transpositions: When multiplying matrix derivatives, always verify the shape dimensions. If you get a
ValueError: shapes not aligned, double-check your transpositions (e.g.np.dot(dZ2, A1.T)instead ofnp.dot(dZ2, A1)). - Forgetting to Normalize Gradients by Batch Size ($m$): If you do not divide the accumulated weight gradients by $m$ (the number of training examples), the gradient magnitude will scale with batch size. This forces you to modify your learning rate whenever the batch size changes.
- Mismatched Element-wise vs Matrix Multiplication: The transition equation $\delta^{[l]} = \left( (W^{[l+1]})^T \delta^{[l+1]} \right) \odot \sigma'(z^{[l]})$ uses a matrix multiplication for the weights, followed by an element-wise (Hadamard) multiplication for the activation derivative. Replacing the Hadamard product with a matrix dot product will yield incorrect shape results.
π Summary: The Backpropagation Cheatsheet
- Forward Pass: Compute linear combinations and activations from input to output.
- Backward Pass: Propagate error gradients backward using the chain rule.
- Error Term: $\delta^{[l]} = \frac{\partial L}{\partial z^{[l]}}$.
- Hidden Error Formula: $\delta^{[l]} = ((W^{[l+1]})^T \dots) \odot \sigma'(z^{[l]})$.
- Parameter Gradients: $dW^{[l]} = \delta^{[l]} (A^{[l-1]})^T$ and $db^{[l]} = \delta^{[l]}$.
- Vanishing Gradients: Caused by multiplying multiple small activation derivatives in deep networks; mitigated by using ReLU.
- Vectorization: Processing batches as matrices wraps computations in high-performance C libraries, eliminating Python interpreter overhead.
AI-generated article quiz
Test your understanding
Ready to test what you just learned?
Generate four focused questions from this article. Answers include immediate explanations.
Guided series path
Machine Learning Fundamentals
Reader feedback
Was this article useful?
Rate it if it helped, then continue with the next deep dive when you are ready.
Sign in to save your rating.
Article metadata