Home/Blog/Python/Calculus of Backpropagation: Mathematical Derivation from Scratch with Matrix Derivatives

PythonIntermediate•10 min read•Jun 18, 2026

Calculus of Backpropagation: Mathematical Derivation from Scratch with Matrix Derivatives

Derive and implement the backpropagation algorithm from scratch in Python using raw matrix calculus.

Abstract Algorithms

Helping engineers master software engineering topics.

TLDR: Backpropagation is the engine behind deep learning. It uses the chain rule of calculus to calculate the gradient of a loss function with respect to the network's weights. This guide derives the backpropagation equations using matrix calculus, analyzes vanishing gradients, and provides a runnable Python implementation using only raw NumPy.

📖 Concept: The Mathematics of Neural Network Training

Supervised machine learning relies on optimization. A neural network is a mathematical function $f(x; W)$ that maps an input vector $x$ to a predicted output vector $\hat{y}$ using a set of weights and biases, collectively denoted as $W$. Training the network means finding the parameters $W$ that minimize a loss function $L(y, \hat{y})$, which measures the error between our predictions and the ground-truth targets $y$.

To minimize the loss function, we use Gradient Descent. Gradient descent updates the weights in the opposite direction of the gradient of the loss function:

$$ W \leftarrow W - \alpha \frac{\partial L}{\partial W} $$

where $\alpha$ is the learning rate. Calculating the partial derivatives of the loss with respect to every weight in a deep neural network is complex. The Backpropagation algorithm solves this by using the chain rule of calculus to recursively calculate gradients from the output layer back to the input layer.

⚙️ Mechanics: Forward and Backward Propagation Passes

A neural network alternates between two passes:

Forward Pass: Input data propagates forward through the layers. For each layer $l$, we calculate the weighted input $z^{[l]}$ and the activation $a^{[l]}$:
$$ z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} $$
$$ a^{[l]} = \sigma(z^{[l]}) $$
where $\sigma$ is an activation function (like Sigmoid or ReLU).
Backward Pass: We compute the gradient of the loss with respect to the activations, weighted inputs, and parameters of each layer. We do this by calculating the error term $\delta^{[l]} = \frac{\partial L}{\partial z^{[l]}}$ for each layer and using it to find $\frac{\partial L}{\partial W^{[l]}}$ and $\frac{\partial L}{\partial b^{[l]}}$.

📊 Flow: High-Level Gradient Propagation Sequence

The flowchart below visualizes the transition between forward and backward propagation through a two-layer neural network:

flowchart LR
    X[Input x] -->|Forward Pass: W1, b1| Z1[Z1 = W1*x + b1]
    Z1 -->|Activation: Sigmoid| A1[A1 = sig Z1]
    A1 -->|Forward Pass: W2, b2| Z2[Z2 = W2*A1 + b2]
    Z2 -->|Activation: Sigmoid| A2[Output y_hat]
    A2 -->|Compute Loss| Loss[Loss L]

    Loss -->|Backward Pass: dLoss/dZ2| Delta2[Delta 2]
    Delta2 -->|dLoss/dW2| GradW2[dLoss/dW2]
    Delta2 -->|dLoss/dA1| Delta1[Delta 1]
    Delta1 -->|dLoss/dW1| GradW1[dLoss/dW1]

The table below traces the variables calculated at each step of the forward and backward passes:

Step	Operation Type	Calculated Variable	Equation
1	Forward	Weighted Input $z^{[1]}$	$z^{[1]} = W^{[1]} x + b^{[1]}$
2	Forward	Activation $a^{[1]}$	$a^{[1]} = \sigma(z^{[1]})$
3	Forward	Weighted Input $z^{[2]}$	$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
4	Forward	Output $\hat{y} = a^{[2]}$	$a^{[2]} = \sigma(z^{[2]})$
5	Backward	Output Error $\delta^{[2]}$	$\delta^{[2]} = \frac{\partial L}{\partial a^{[2]}} \odot \sigma'(z^{[2]})$
6	Backward	Layer 2 Gradients	$\frac{\partial L}{\partial W^{[2]}} = \delta^{[2]} (a^{[1]})^T$, $\frac{\partial L}{\partial b^{[2]}} = \delta^{[2]}$
7	Backward	Hidden Error $\delta^{[1]}$	$\delta^{[1]} = ((W^{[2]})^T \delta^{[2]}) \odot \sigma'(z^{[1]})$
8	Backward	Layer 1 Gradients	$\frac{\partial L}{\partial W^{[1]}} = \delta^{[1]} x^T$, $\frac{\partial L}{\partial b^{[1]}} = \delta^{[1]}$

🧠 Deep Dive: Structural Evolution & Performance Profiling

Let us examine the mathematical foundations and performance characteristics of matrix backpropagation.

Neural Network Layer Internals

A neural network layer consists of a weight matrix $W$ of size $n^{[l]} \times n^{[l-1]}$, where $n^{[l]}$ is the number of neurons in the current layer and $n^{[l-1]}$ is the number of neurons in the previous layer. The bias is a vector $b$ of size $n^{[l]} \times 1$. Storing these parameters in contiguous memory allocations using NumPy arrays allows the CPU to perform vector matrix multiplications in parallel using BLAS (Basic Linear Algebra Subprograms) libraries.

Mathematical Model of Backpropagation Calculus

Let us derive the exact matrix derivatives for a network of $L$ layers. We define the error term at layer $l$ as: $$ \delta^{[l]} = \frac{\partial L}{\partial z^{[l]}} $$

For the final layer $L$, using the chain rule: $$ \delta^{[L]} = \frac{\partial L}{\partial a^{[L]}} \odot \sigma'(z^{[L]}) $$ where $\odot$ denotes the element-wise Hadamard product. For a mean squared error loss $L = \frac{1}{2} | a^{[L]} - y |^2$, the derivative is: $$ \frac{\partial L}{\partial a^{[L]}} = a^{[L]} - y $$

To derive the error term for a hidden layer $l$ from the succeeding layer $l+1$, we use the multivariate chain rule: $$ \delta^{[l]} = \frac{\partial L}{\partial z^{[l]}} = \left( \frac{\partial z^{[l+1]}}{\partial z^{[l]}} \right)^T \frac{\partial L}{\partial z^{[l+1]}} $$ Since $z^{[l+1]} = W^{[l+1]} \sigma(z^{[l]}) + b^{[l+1]}$, we have: $$ \frac{\partial z^{[l+1]}}{\partial z^{[l]}} = W^{[l+1]} \text{diag}(\sigma'(z^{[l]})) $$ Substituting this back gives our final vector recurrence relation: $$ \delta^{[l]} = \left( (W^{[l+1]})^T \delta^{[l+1]} \right) \odot \sigma'(z^{[l]}) $$

Once we have $\delta^{[l]}$, we calculate the partial derivatives for the weights and biases: $$ \frac{\partial L}{\partial W^{[l]}} = \frac{\partial L}{\partial z^{[l]}} \frac{\partial z^{[l]}}{\partial W^{[l]}} = \delta^{[l]} (a^{[l-1]})^T $$ $$ \frac{\partial L}{\partial b^{[l]}} = \delta^{[l]} $$

These equations form the complete mathematical model of backpropagation.

Performance Analysis of NumPy Vectorization

Implementing these equations in Python using loops would result in slow execution times because Python interpreter loops add significant overhead.

Instead, we use Vectorization. We pack $m$ training examples into a design matrix $X$ of size $n^{[0]} \times m$. The forward and backward equations become matrix-matrix multiplications: $$ Z^{[1]} = W^{[1]} X + b^{[1]} $$ $$ dW^{[l]} = \frac{1}{m} \delta^{[l]} (A^{[l-1]})^T $$ By vectorizing the batch, NumPy delegates the heavy computations to C-based linear algebra libraries (like OpenBLAS or MKL), accelerating training throughput by several orders of magnitude.

🏗️ Advanced Concepts: Vanishing Gradients and Activation Functions

A classic problem when training deep networks using backpropagation is the Vanishing Gradient Problem.

Consider the derivative of the Sigmoid activation function: $$ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $$ The maximum value of the Sigmoid derivative is $0.25$. As gradients are backpropagated through many layers, the term $\delta^{[l]} = \left( (W^{[l+1]})^T \delta^{[l+1]} \right) \odot \sigma'(z^{[l]})$ is multiplied by these small derivatives repeatedly. For a 5-layer network, the gradient of the first layer is scaled by at least $(0.25)^5 \approx 0.00097$. This causes the weights of the early layers to update very slowly, halting learning.

To prevent this, modern networks use the ReLU (Rectified Linear Unit) activation function: $$ f(x) = \max(0, x) \quad \implies \quad f'(x) = \begin{cases} 1 & x > 0 \ 0 & x \le 0 \end{cases} $$ Because the derivative of ReLU is 1 for positive inputs, gradients propagate through deep networks without vanishing.

🌍 Applications: From Perceptrons to Transformers

Transformer Training: Large Language Models (LLMs) use backpropagation to learn attention weight projections during training.
Convolutional Networks (CNNs): Image recognition networks use backpropagation to update spatial filter matrices.
Reinforcement Learning: Policy networks use backpropagation scaled by environment rewards to adjust agent actions.

⚖️ Trade-offs and Failure Modes

Exploding Gradients: If the weight values are initialized too large, the matrix multiplication $((W^{[l+1]})^T \delta^{[l+1]})$ can cause gradients to grow exponentially, leading to numerical overflow (NaN values).
Mitigation: Use Xavier (Glorot) or He initialization to scale initial weights based on layer sizes.
Dead ReLUs: If a large gradient updates a neuron such that it always outputs negative values, the ReLU derivative will always be 0. That neuron will permanently stop learning.
Mitigation: Use a Leaky ReLU activation function or a smaller learning rate.

🧭 Decision Guide: Custom Autograd vs. Frameworks

Scenario	Recommendation	Alternative
Learning neural network theory or interviewing	Write Custom Backpropagation	Solidifies understanding of matrix calculus.
Building production-grade deep learning models	PyTorch / TensorFlow	Autograd automatically constructs the computation graph and compiles it to GPU/TPU kernels.
Deploying lightweight models on resource-constrained devices	ONNX Runtime	Avoids the overhead of large training frameworks.

🧪 Practical Implementation: NumPy MLP from Scratch

Here is a complete, runnable Python implementation of a 2-layer Neural Network with backpropagation from scratch.

import numpy as np

class NumPyNeuralNetwork:
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Seed for reproducibility
        np.random.seed(42)

        # He initialization for hidden layer, Xavier for output layer
        self.W1 = np.random.randn(hidden_dim, input_dim) * np.sqrt(2.0 / input_dim)
        self.b1 = np.zeros((hidden_dim, 1))

        self.W2 = np.random.randn(output_dim, hidden_dim) * np.sqrt(1.0 / hidden_dim)
        self.b2 = np.zeros((output_dim, 1))

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def _sigmoid_derivative(self, z):
        s = self._sigmoid(z)
        return s * (1.0 - s)

    def forward(self, X):
        # X shape: (input_dim, batch_size)
        self.A0 = X
        self.Z1 = np.dot(self.W1, self.A0) + self.b1
        self.A1 = self._sigmoid(self.Z1)

        self.Z2 = np.dot(self.W2, self.A1) + self.b2
        self.A2 = self._sigmoid(self.Z2)
        return self.A2

    def backward(self, Y):
        # Y shape: (output_dim, batch_size)
        m = Y.shape[1]

        # Output layer error
        self.dZ2 = (self.A2 - Y) * self._sigmoid_derivative(self.Z2)
        self.dW2 = (1.0 / m) * np.dot(self.dZ2, self.A1.T)
        self.db2 = (1.0 / m) * np.sum(self.dZ2, axis=1, keepdims=True)

        # Hidden layer error (propagated back)
        self.dZ1 = np.dot(self.W2.T, self.dZ2) * self._sigmoid_derivative(self.Z1)
        self.dW1 = (1.0 / m) * np.dot(self.dZ1, self.A0.T)
        self.db1 = (1.0 / m) * np.sum(self.dZ1, axis=1, keepdims=True)

    def update_parameters(self, learning_rate):
        self.W1 -= learning_rate * self.dW1
        self.b1 -= learning_rate * self.db1
        self.W2 -= learning_rate * self.dW2
        self.b2 -= learning_rate * self.db2

    def train(self, X, Y, epochs, learning_rate):
        for epoch in range(epochs):
            predictions = self.forward(X)
            self.backward(Y)
            self.update_parameters(learning_rate)

            if epoch % 100 == 0:
                loss = np.mean(0.5 * (predictions - Y) ** 2)
                print(f"Epoch {epoch:04d} - Loss: {loss:.6f}")

# Example worked dataset: Binary XOR Logic Gate
if __name__ == "__main__":
    # 4 training examples, 2 input features
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T
    # XOR outputs
    Y = np.array([[0], [1], [1], [0]]).T

    # Network configuration: 2 inputs, 4 hidden units, 1 output
    nn = NumPyNeuralNetwork(input_dim=2, hidden_dim=4, output_dim=1)

    print("Training NumPy MLP on XOR gate dataset:")
    nn.train(X, Y, epochs=1000, learning_rate=0.5)

    # Final predictions
    final_output = nn.forward(X)
    print("\nFinal XOR Predictions:")
    for i in range(X.shape[1]):
        print(f"Input: {X[:, i]} -> Target: {Y[0, i]} -> Predicted: {final_output[0, i]:.4f}")

📚 Lessons Learned: Common Backpropagation Bugs

Incorrect Matrix Transpositions: When multiplying matrix derivatives, always verify the shape dimensions. If you get a ValueError: shapes not aligned, double-check your transpositions (e.g. np.dot(dZ2, A1.T) instead of np.dot(dZ2, A1)).
Forgetting to Normalize Gradients by Batch Size ($m$): If you do not divide the accumulated weight gradients by $m$ (the number of training examples), the gradient magnitude will scale with batch size. This forces you to modify your learning rate whenever the batch size changes.
Mismatched Element-wise vs Matrix Multiplication: The transition equation $\delta^{[l]} = \left( (W^{[l+1]})^T \delta^{[l+1]} \right) \odot \sigma'(z^{[l]})$ uses a matrix multiplication for the weights, followed by an element-wise (Hadamard) multiplication for the activation derivative. Replacing the Hadamard product with a matrix dot product will yield incorrect shape results.

📌 Summary: The Backpropagation Cheatsheet

Forward Pass: Compute linear combinations and activations from input to output.
Backward Pass: Propagate error gradients backward using the chain rule.
Error Term: $\delta^{[l]} = \frac{\partial L}{\partial z^{[l]}}$.
Hidden Error Formula: $\delta^{[l]} = ((W^{[l+1]})^T \dots) \odot \sigma'(z^{[l]})$.
Parameter Gradients: $dW^{[l]} = \delta^{[l]} (A^{[l-1]})^T$ and $db^{[l]} = \delta^{[l]}$.
Vanishing Gradients: Caused by multiplying multiple small activation derivatives in deep networks; mitigated by using ReLU.
Vectorization: Processing batches as matrices wraps computations in high-performance C libraries, eliminating Python interpreter overhead.

AI-generated article quiz

Test your understanding

🧠

Ready to test what you just learned?

Generate four focused questions from this article. Answers include immediate explanations.

Guided series path

Machine Learning Fundamentals

View all lessons →

Lesson 14 of 19

← Previous lessonReinforcement Learning: Agents, Environments, and Rewards in PracticeIntermediate · 14 min Next lesson →Deep Learning Architectures: CNNs, RNNs, and TransformersIntermediate · 13 min

Reader feedback

Was this article useful?

Rate it if it helped, then continue with the next deep dive when you are ready.

Article metadata