Deep Learning Optimizers: Deriving Momentum, RMSProp, and AdamW mathematically

Derive and implement modern deep learning optimizers (SGD, Momentum, RMSProp, and AdamW) from scratch in Python.

Abstract Algorithms

Helping engineers master software engineering topics.

TLDR: Optimizers determine how we update neural network weights during training. While Stochastic Gradient Descent (SGD) updates parameters along the negative gradient direction, modern models use adaptive learning rates and momentum. This guide derives the mathematical models behind Momentum, RMSProp, Adam, and AdamW, and implements them from scratch in Python using raw NumPy.


📖 Concept: The Need for Adaptive Learning Rates

When training deep neural networks, the optimizer must navigate a complex, high-dimensional loss landscape to find parameters with low loss. Because the landscape is non-convex, reaching a global minimum is not guaranteed, so in practice we aim for a good local minimum. The simplest optimization algorithm, Stochastic Gradient Descent (SGD), updates the parameters $W$ by taking a step in the direction of the negative gradient of the loss:

$$ W_{t+1} = W_t - \alpha \nabla L(W_t) $$

where $\alpha$ is a fixed learning rate and $\nabla L(W_t)$ is the gradient vector.
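As a minimal sketch of this update rule, the snippet below applies it to a toy loss $L(W) = \sum W^2$. The loss, the starting point, and the helper name `sgd_step` are illustrative assumptions, not part of the article's later script.

```python
import numpy as np

def sgd_step(W, grad, alpha):
    """One plain SGD update: W_{t+1} = W_t - alpha * grad."""
    return W - alpha * grad

# Toy loss L(W) = sum(W^2), whose gradient is 2 * W (an illustrative choice).
W = np.array([3.0, -2.0])
for _ in range(50):
    W = sgd_step(W, 2.0 * W, alpha=0.1)

print(W)  # both entries have shrunk toward the minimum at 0
```

With $\alpha = 0.1$ each step multiplies every weight by $1 - 2\alpha = 0.8$, so the iterates decay geometrically toward zero.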

However, SGD exhibits major vulnerabilities in complex loss landscapes:

  • Oscillations in Canyons: If the loss landscape is steep in one direction and gentle in another (forming a ravine or canyon), SGD will oscillate wildly across the steep walls, making very slow progress along the valley floor toward the minimum.
  • Saddle Points and Local Minima: SGD can easily stall in flat regions (saddle points) where the gradient is close to zero, halting weight updates.
  • Fixed Learning Rate: Applying the same learning rate $\alpha$ to all parameters is inefficient. Frequently occurring features need smaller updates, while rare features need larger updates to capture their patterns.
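The canyon problem from the first bullet can be reproduced on a small example. The sketch below assumes a hypothetical quadratic $L(x, y) = \frac{1}{2}(100x^2 + y^2)$ and a learning rate chosen just below the stability limit of the steep direction; both are illustrative choices.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: L(x, y) = 0.5 * (100 * x**2 + y**2).
# The x direction is the steep canyon wall, the y direction is the gentle floor.
curvature = np.array([100.0, 1.0])

def grad(W):
    """Gradient of the quadratic above."""
    return curvature * W

W = np.array([1.0, 1.0])
alpha = 0.019   # stable, but close to the limit 2 / 100 for the steep direction
steps = 100
sign_flips = 0  # how often x jumps to the opposite canyon wall

for _ in range(steps):
    W_next = W - alpha * grad(W)
    if W_next[0] * W[0] < 0:
        sign_flips += 1
    W = W_next

print("sign flips in x:", sign_flips)
print("remaining distance along the valley floor (y):", W[1])
```

Here $x$ is multiplied by $1 - 100\alpha = -0.9$ on every step, so it changes sign each time, while $y$ is multiplied by only $1 - \alpha = 0.981$ and creeps toward zero. A single learning rate cannot serve both directions well.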

To overcome these issues, researchers developed optimizers that incorporate history (momentum) and adapt the learning rate for each individual weight dynamically.


⚙️ Mechanics: Gradient Descent Step Modifications

Each modern optimizer extends the standard SGD update with running statistics of past gradients:

  1. Momentum: Accelerates gradient descent by accumulating a moving average of past gradients. This acts like a heavy ball rolling down a hill, carrying momentum that helps it skip past local minima and dampen oscillations.
  2. RMSProp (Root Mean Square Propagation): Dampens oscillations by dividing the learning rate for each parameter by the square root of a running average of its recent squared gradients. This scales down updates for parameters with large gradients and scales up updates for parameters with small gradients.
  3. Adam (Adaptive Moment Estimation): Combines the ideas of Momentum and RMSProp. It maintains both a running average of the gradients (first moment) and a running average of the squared gradients (second moment).
  4. AdamW: Fixes how weight decay interacts with Adam's adaptive scaling. Instead of adding the L2 regularization term to the gradient, where it gets rescaled per parameter, AdamW applies weight decay directly to the parameter values.

📊 Flow: Optimization Path Transitions

The flowchart below visualizes the mathematical progression and logical updates introduced by each optimizer stage:

flowchart TD
    SGD["Stochastic Gradient Descent<br/>Standard step: W - alpha * grad"] -->|Add Velocity/History| Mom["Momentum<br/>Accumulate past step directions"]
    Mom -->|Combine with Squared Gradients| Adam["Adam<br/>First and second moments tracking"]
    SGD -->|Divide by Squared Gradients| RMS["RMSProp<br/>Normalize step size by gradient magnitude"]
    RMS -->|Combine with Velocity| Adam
    Adam -->|Decouple Weight Decay| AdamW["AdamW<br/>Subtract weight decay directly from weights"]

The table below summarizes the key hyperparameters used to configure these mathematical update paths:

| Hyperparameter | Mathematical Symbol | Role | Standard Value |
| --- | --- | --- | --- |
| Learning Rate | $\alpha$ | Controls the step size taken in the update direction. | 0.001 - 0.01 |
| Momentum Decay | $\beta$ or $\beta_1$ | Scales the contribution of past gradients (first moment). | 0.9 |
| RMSProp Decay | $\beta_2$ | Scales the contribution of past squared gradients (second moment). | 0.999 |
| Epsilon | $\epsilon$ | Small constant added to prevent division by zero. | 1e-8 |
| Weight Decay | $\lambda$ | Scales the direct subtraction of parameter values. | 0.01 |

🧠 Deep Dive: Vectorized updates and Weight Decay

Let us derive the exact equations and analyze the performance constraints of vectorized optimizers.

Optimizer State Update Internals

During backpropagation, we calculate the gradient matrix $g_t = \nabla L(W_t)$. To run updates, the optimizer must keep historical matrices of the same shape as $W_t$ in memory.

  • For Momentum, we maintain a velocity matrix $v_t$.
  • For RMSProp, we maintain a squared gradient cache matrix $s_t$.
  • For Adam, we maintain both first moment $m_t$ and second moment $v_t$ matrices.

Allocating these state variables as contiguous arrays lets NumPy run its vectorized loops over them efficiently.
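One way to allocate such state is sketched below. The weight shape, the `float32` dtype, and the dictionary keys `"m"` and `"v"` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical weight matrix for a single layer.
W = np.random.default_rng(0).normal(size=(256, 128)).astype(np.float32)

# One zero-initialized array per statistic, matching W's shape and dtype.
state = {"m": np.zeros_like(W), "v": np.zeros_like(W)}

for name, arr in state.items():
    # C_CONTIGUOUS reports whether the array is one contiguous row-major block.
    print(name, arr.shape, arr.dtype, arr.flags["C_CONTIGUOUS"], arr.nbytes)
```

`np.zeros_like` copies the shape and dtype of `W`, so each state array occupies exactly as many bytes as the weights it tracks.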

Mathematical Model of Optimization Trajectories

Let $g_t$ be the gradient at step $t$. We derive the recurrence relations for each optimizer from scratch:

1. Momentum Derivation

We maintain a running average of past gradients:

$$ v_t = \beta v_{t-1} + (1 - \beta) g_t $$

$$ W_{t+1} = W_t - \alpha v_t $$

If $\beta = 0.9$, the velocity vector $v_t$ averages gradients over approximately the last 10 steps.
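The "last 10 steps" rule of thumb can be checked numerically. The sketch below unrolls the recurrence into explicit weights and then runs the update on a constant, made-up gradient; the helper name `momentum_step` and the values used are illustrative assumptions.

```python
import numpy as np

beta = 0.9

# The recurrence unrolls to v_t = sum_k (1 - beta) * beta**k * g_{t-k},
# so the gradient from k steps ago carries weight (1 - beta) * beta**k.
k = np.arange(200)
weights = (1.0 - beta) * beta ** k

recent_mass = weights[:10].sum()  # share of v_t coming from the last 10 gradients
mean_age = (weights * k).sum()    # average age, in steps, of the gradients in v_t

def momentum_step(W, g, v, alpha=0.1, beta=0.9):
    """One update following the two momentum equations."""
    v = beta * v + (1.0 - beta) * g
    return W - alpha * v, v

# With a constant gradient, the velocity converges to that gradient.
g = np.array([2.0, -1.0])
W, v = np.zeros(2), np.zeros(2)
for _ in range(100):
    W, v = momentum_step(W, g, v)

print(round(recent_mass, 3), round(mean_age, 2), v)
```

The ten most recent gradients carry $1 - 0.9^{10} \approx 0.65$ of the total weight, which is why $1 / (1 - \beta)$ is usually quoted as the effective averaging window rather than a hard cutoff.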

2. RMSProp Derivation

We maintain a running average of squared gradients:

$$ s_t = \beta_2 s_{t-1} + (1 - \beta_2) g_t^2 $$

$$ W_{t+1} = W_t - \frac{\alpha}{\sqrt{s_t} + \epsilon} \odot g_t $$

where $g_t^2$ is the element-wise square of the gradient vector, and $\odot$ is the Hadamard product.
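To see the normalization at work, the sketch below feeds two parameters constant gradients that differ by four orders of magnitude. The gradient values, and the choice of $\beta_2 = 0.9$ so the average settles within 100 steps, are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(W, g, s, alpha=0.01, beta2=0.9, eps=1e-8):
    """One update following the two RMSProp equations."""
    s = beta2 * s + (1.0 - beta2) * g ** 2
    W = W - alpha / (np.sqrt(s) + eps) * g
    return W, s

# Hypothetical constant gradients: one very large, one very small.
g = np.array([100.0, 0.01])
W, s = np.zeros(2), np.zeros(2)

for _ in range(100):
    W_prev = W
    W, s = rmsprop_step(W, g, s)

# Size of the most recent step taken by each parameter.
step_sizes = np.abs(W - W_prev)
print(step_sizes)
```

Once $s_t$ has settled near $g_t^2$, both parameters move by roughly $\alpha$ per step, even though their raw gradients differ by a factor of 10,000.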

3. Adam Derivation

We calculate both moments:

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$

$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$

Since $m_t$ and $v_t$ are typically initialized to 0, they are biased toward zero, especially during early steps. To correct this, we compute bias-corrected moments:

$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad \text{and} \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$

The final parameter update is:

$$ W_{t+1} = W_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t $$
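The effect of bias correction is easiest to see at $t = 1$. The sketch below evaluates the equations for a single made-up gradient and compares the update direction with and without the correction; the gradient values are assumptions.

```python
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
g = np.array([0.5, -3.0])  # hypothetical first gradient
m = np.zeros_like(g)
v = np.zeros_like(g)

t = 1
m = beta1 * m + (1.0 - beta1) * g        # equals 0.1 * g at t = 1
v = beta2 * v + (1.0 - beta2) * g ** 2   # equals 0.001 * g^2 at t = 1

m_hat = m / (1.0 - beta1 ** t)           # recovers g
v_hat = v / (1.0 - beta2 ** t)           # recovers g^2

corrected = m_hat / (np.sqrt(v_hat) + eps)  # per-parameter update in units of alpha
uncorrected = m / (np.sqrt(v) + eps)        # same ratio without bias correction

print(corrected, uncorrected)
```

With correction, the first step has magnitude close to $\alpha$ for every parameter. Without it, the ratio is $0.1 / \sqrt{0.001} \approx 3.16$, so the first step would be about three times larger than intended.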

Performance Analysis of Vectorized Gradients

Computing these formulas element by element in a Python loop would introduce massive interpreter overhead. By vectorizing the equations with NumPy, each array operation runs in precompiled C loops instead.

However, maintaining the state variables ($m_t$ and $v_t$) for every weight adds two extra arrays the size of the model, which triples the memory needed for parameters plus optimizer state. For a model with 1 billion parameters (~4 GB in 32-bit float), storing Adam's optimizer state requires an additional 8 GB of memory, which can lead to GPU out-of-memory errors if not managed correctly.
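The arithmetic behind these figures can be written as a small helper. The function name `adam_memory_gb` is hypothetical, and the estimate deliberately ignores gradients, activations, and framework overhead.

```python
def adam_memory_gb(num_params, bytes_per_value=4):
    """Rough memory estimate in decimal GB for parameters and Adam state.

    Assumes one value per parameter for the weights and two more (m and v)
    for the optimizer state, all stored at the same precision.
    """
    params_bytes = num_params * bytes_per_value
    state_bytes = 2 * params_bytes
    return params_bytes / 1e9, state_bytes / 1e9

params_gb, state_gb = adam_memory_gb(1_000_000_000)
print(f"parameters: {params_gb} GB, Adam state: {state_gb} GB")
```

For 1 billion 32-bit parameters this gives 4 GB of weights and 8 GB of optimizer state, 12 GB in total before any gradients or activations are counted.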


🏗️ Advanced Concepts: Decoupling Weight Decay in AdamW

L2 regularization (commonly called weight decay) adds a penalty term $\frac{1}{2} \lambda \|W\|^2$ to the loss function to prevent overfitting. In standard gradient descent, the gradient of the regularized loss is:

$$ \nabla L_{reg}(W_t) = g_t + \lambda W_t $$

In Adam, if we feed this regularized gradient into the moment updates:

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) (g_t + \lambda W_t) $$

the regularization term is divided by the square root of the running second moment, $\sqrt{\hat{v}_t}$, along with the rest of the gradient. If a parameter has historically large gradients, its $v_t$ will be large, which suppresses the weight decay penalty. Conversely, parameters with small gradients will experience stronger weight decay.

To restore correct regularization, Loshchilov and Hutter (2017) proposed AdamW, which decouples weight decay from the gradient moments entirely, subtracting the penalty directly from the parameter values: $$ W_{t+1} = W_t - \alpha \lambda W_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t $$ This ensures that all weights decay at a rate proportional to their value, improving model generalization.
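The difference between the two schemes can be sketched for two weights of equal size but very different gradient scales. The snippet looks at step $t = 1$, where the bias-corrected second moment equals the squared regularized gradient, and isolates the part of each update that comes from the decay term; all numeric values are illustrative assumptions.

```python
import numpy as np

alpha, lam, eps = 0.001, 0.01, 1e-8
W = np.array([1.0, 1.0])
g = np.array([10.0, 0.001])  # hypothetical loss gradients: one large, one small

# Coupled L2 (Adam): the decay term joins the gradient and is then
# divided by sqrt(v_hat) together with it.
g_reg = g + lam * W
v_hat = g_reg ** 2           # bias-corrected second moment at t = 1
l2_shrink = alpha * lam * W / (np.sqrt(v_hat) + eps)

# Decoupled (AdamW): the decay term is applied to the weights directly.
adamw_shrink = alpha * lam * W

print("coupled L2 shrink:", l2_shrink)
print("AdamW shrink:     ", adamw_shrink)
```

Under coupled L2 the weight with the large gradient is barely decayed while the weight with the small gradient is decayed heavily. Under AdamW both weights shrink by the same $\alpha \lambda W$.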


🌍 Applications: From Neural Networks to NLP Transformers

  1. Large Language Model Training: Most modern Transformers (including GPT-style models and Llama) are trained with AdamW or a closely related Adam variant with weight decay, because it provides stable convergence for deep architectures.
  2. Computer Vision (ResNet): Often trained using SGD with Momentum, as it can achieve slightly higher generalization accuracy when tuned correctly.
  3. Generative Adversarial Networks (GANs): Use Adam to stabilize the competitive training loops between generator and discriminator networks.

⚖️ Trade-offs and Failure Modes

  • Hyperparameter Sensitivity: Adam and AdamW introduce several hyperparameters ($\beta_1$, $\beta_2$, $\epsilon$, $\lambda$). Selecting incorrect values can lead to unstable training or divergence.
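  • Memory Constraints: Adam and AdamW store two extra arrays per parameter, so parameters plus optimizer state take roughly three times the memory of the parameters alone, while plain SGD stores no optimizer state. This limits model and batch sizes on hardware accelerators.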

🧭 Decision Guide: SGD vs. Adam vs. AdamW

| Metric | SGD | Adam | AdamW |
| --- | --- | --- | --- |
| Convergence Speed | Slow | Fast | Fast |
| Generalization | Excellent (when tuned) | Good | Excellent (with weight decay) |
| Memory Footprint | Low (no state) | High ($2 \times$ parameters) | High ($2 \times$ parameters) |
| Primary Use Case | Classic CNNs, simple MLPs | Basic exploratory deep learning | State-of-the-art Transformers |

🧪 Practical Implementation: NumPy Optimizers Comparison

Here is a complete, runnable Python script implementing SGD, Momentum, RMSProp, and AdamW from scratch.

import numpy as np

class Optimizers:
    def __init__(self, W_shape):
        self.W_shape = W_shape
        # Momentum state
        self.v_momentum = np.zeros(W_shape)
        # RMSProp state
        self.s_rmsprop = np.zeros(W_shape)
        # Adam/AdamW state
        self.m_adam = np.zeros(W_shape)
        self.v_adam = np.zeros(W_shape)
        self.t = 0

    def sgd(self, W, dW, lr):
        return W - lr * dW

    def momentum(self, W, dW, lr, beta=0.9):
        # Accumulate velocity
        self.v_momentum = beta * self.v_momentum + (1.0 - beta) * dW
        return W - lr * self.v_momentum

    def rmsprop(self, W, dW, lr, beta2=0.999, eps=1e-8):
        # Accumulate squared gradients
        self.s_rmsprop = beta2 * self.s_rmsprop + (1.0 - beta2) * (dW ** 2)
        return W - (lr / (np.sqrt(self.s_rmsprop) + eps)) * dW

    def adamw(self, W, dW, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
        self.t += 1

        # 1. Update biased first moment estimate
        self.m_adam = beta1 * self.m_adam + (1.0 - beta1) * dW
        # 2. Update biased second raw moment estimate
        self.v_adam = beta2 * self.v_adam + (1.0 - beta2) * (dW ** 2)

        # 3. Compute bias-corrected first moment estimate
        m_hat = self.m_adam / (1.0 - beta1 ** self.t)
        # 4. Compute bias-corrected second raw moment estimate
        v_hat = self.v_adam / (1.0 - beta2 ** self.t)

        # 5. Apply decoupled weight decay and parameter updates
        W_decayed = W - lr * weight_decay * W
        return W_decayed - (lr / (np.sqrt(v_hat) + eps)) * m_hat

# Compare optimizers on a simple quadratic loss function: L(W) = W^2
if __name__ == "__main__":
    initial_W = np.array([10.0])  # the minimum of L(W) = W^2 is at W = 0
    epochs = 200
    lr = 0.1

    print("Comparing Optimizer convergence paths on L(W) = W^2:")

    for name in ["SGD", "Momentum", "RMSProp", "AdamW"]:
        W = np.copy(initial_W)
        opt = Optimizers(W.shape)

        for epoch in range(epochs):
            # Gradient of L(W) = W^2 is 2*W
            dW = 2.0 * W

            if name == "SGD":
                W = opt.sgd(W, dW, lr)
            elif name == "Momentum":
                W = opt.momentum(W, dW, lr)
            elif name == "RMSProp":
                W = opt.rmsprop(W, dW, lr)
            elif name == "AdamW":
                W = opt.adamw(W, dW, lr)

        print(f"Optimizer: {name:10s} -> Initial W: {initial_W[0]:.2f} -> Final W: {W[0]:.6f}")

📚 Lessons Learned: Common Optimization Pitfalls

  1. Confusing L2 Regularization with Decoupled Weight Decay: Standard deep learning libraries (like PyTorch) historically named their L2 regularization parameter weight_decay in the Adam optimizer class. This caused confusion because it implemented the coupled L2 variant, in which the penalty is rescaled by the adaptive denominator and so weakens for parameters with large gradients. Always use the AdamW class explicitly if you want correct decoupled weight decay.
  2. Missing Bias Corrections in Adam: During early steps (when $t$ is small), the moment variables $m_t$ and $v_t$ are heavily biased toward their zero initialization. Skipping $\hat{m}_t$ and $\hat{v}_t$ mis-scales the first updates: with the defaults $\beta_1 = 0.9$ and $\beta_2 = 0.999$, $v_t$ is shrunk far more than $m_t$, so the very first step comes out roughly three times larger than intended, which can destabilize early training.
  3. Sharing Optimizer States Across Different Layers: Each layer matrix (weight and bias) must have its own dedicated moment history arrays. Sharing state parameters across different weight matrices will mix gradients, corrupting the historical tracking variables.
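The third pitfall can be avoided by keying optimizer state on the parameter name. The class below is a minimal sketch under that assumption; `PerParamAdamW`, its dictionary layout, and the toy parameters are hypothetical and not taken from any library.

```python
import numpy as np

class PerParamAdamW:
    """AdamW sketch that keeps a separate (m, v) pair for every parameter array."""

    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.01):
        self.params = params  # dict: name -> float ndarray, updated in place
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.eps, self.weight_decay = eps, weight_decay
        self.t = 0
        # Dedicated moment arrays per parameter, so histories never mix.
        self.state = {
            name: {"m": np.zeros_like(p), "v": np.zeros_like(p)}
            for name, p in params.items()
        }

    def step(self, grads):
        """Apply one AdamW update; grads maps each name to its gradient."""
        self.t += 1
        for name, p in self.params.items():
            g, st = grads[name], self.state[name]
            st["m"] = self.beta1 * st["m"] + (1.0 - self.beta1) * g
            st["v"] = self.beta2 * st["v"] + (1.0 - self.beta2) * g ** 2
            m_hat = st["m"] / (1.0 - self.beta1 ** self.t)
            v_hat = st["v"] / (1.0 - self.beta2 ** self.t)
            p -= self.lr * self.weight_decay * p               # decoupled decay
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)  # adaptive step

# Toy layer with a weight matrix and a bias vector (illustrative values).
params = {"W": np.ones((2, 3)), "b": np.zeros(3)}
grads = {"W": np.ones((2, 3)), "b": np.ones(3)}
opt = PerParamAdamW(params)
opt.step(grads)
print(params["W"][0, 0], params["b"][0])
```

Because each entry in `state` has the shape of its own parameter, the weight matrix and the bias vector accumulate independent histories. This sketch applies weight decay to every array, including the bias, purely to keep the code short.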

📌 Summary: The Optimizers Cheatsheet

  • SGD: Baseline optimizer; updates weights along the negative gradient vector.
  • Momentum: Dampens oscillations by adding a running average of past gradients ($v_t$).
  • RMSProp: Prevents scaling issues by dividing the learning rate by the running root-mean-square of gradients ($s_t$).
  • Adam: Combines Momentum and RMSProp, maintaining both first and second moments.
  • AdamW: Decouples weight decay from the gradient moments, applying the penalty directly to the parameter weights.
  • Memory Constraint: Adaptive optimizers require storing additional state matrices in memory, increasing the GPU footprint.
