All Posts

Neural Networks Explained: From Neurons to Deep Learning

How do computers learn? We start with a single neuron (Perceptron) and build up to Deep Neural Networks.

Abstract Algorithms · 14 min read

TLDR: A neural network is a stack of simple "neurons" that turn raw inputs into predictions by learning the right weights and biases. Training means repeatedly nudging those numbers via back-propagation until the error shrinks. Master the basics and you can build everything from spam filters to image classifiers.


πŸ“– What Is a Neuron, and Why Do We Stack Them?

A single neuron in a neural network is just a weighted sum plus an activation function:

output = relu(w1*x1 + w2*x2 + bias)

String billions of these together with the right weights, and you get systems like GPT-4. The entire architecture of modern AI, from spam filters to self-driving cars, reduces to this one line, repeated at massive scale with learned weights. This post explains how.

Think of each neuron as one expert in a committee deciding whether to approve a loan. Each expert looks at different evidence (credit score, income, debt-to-income ratio) and casts a weighted vote. The final decision is the weighted sum of those votes, passed through a threshold rule. That's one artificial neuron.
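In plain Python, that one line looks like this (the weights and inputs below are made-up illustrative numbers, not learned values):

```python
# A single artificial neuron: weighted sum of inputs, plus bias, through ReLU.
def relu(z):
    return max(0.0, z)

def neuron(x1, x2, w1, w2, bias):
    return relu(w1 * x1 + w2 * x2 + bias)

# Two features, hand-picked weights: 0.5*1.0 - 0.25*2.0 + 0.1 = 0.1
print(neuron(1.0, 2.0, 0.5, -0.25, 0.1))   # 0.1
# A negative weighted sum is clipped to zero by ReLU
print(neuron(1.0, 1.0, -1.0, -1.0, 0.0))   # 0.0
```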

| Aspect | Neural Network | Traditional Linear Model |
| --- | --- | --- |
| Non-linearity | Introduced by activation functions | None |
| Feature engineering | Network learns features automatically | Heavy manual design |
| Interpretability | Low (distributed weights) | High (explicit coefficients) |
| Scalability | Handles millions of parameters | Limited by feature count |

A single neuron can only draw a straight line between classes. The moment we stack neurons into layers, the network can represent curves, spirals, and arbitrary decision boundaries.


πŸ” The Perceptron: Neural Network Building Blocks

A single artificial neuron computes:

$$y = f\!\left(\sum_{i=1}^{n} w_i x_i + b \right)$$

where $\mathbf{x}$ are input features, $\mathbf{w}$ are learnable weights, $b$ is a bias, and $f$ is an activation function (ReLU, sigmoid, or tanh).

Toy dataset: loan approval

| ID | Age | Salary ($k) | Owns House | Label (Approve?) |
| --- | --- | --- | --- | --- |
| 1 | 25 | 45 | No | No |
| 2 | 40 | 80 | Yes | Yes |
| 3 | 30 | 60 | No | No |
| 4 | 50 | 120 | Yes | Yes |

A perceptron learns a weighted combination of age, salary, and house ownership to separate approvals from rejections. But it can only learn linear boundaries.
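The learning rule can be sketched from scratch. The snippet below trains a perceptron on the four-row loan table using the classic error-driven update; the feature scaling, learning rate, and epoch count are arbitrary choices for this sketch:

```python
# Toy loan data from the table above: (age, salary_k, owns_house) -> approve?
data = [
    ((25, 45, 0), 0),
    ((40, 80, 1), 1),
    ((30, 60, 0), 0),
    ((50, 120, 1), 1),
]

def scale(x):
    # Bring features to comparable ranges so no single one dominates
    age, salary, house = x
    return (age / 50.0, salary / 120.0, float(house))

w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(100):                       # more than enough passes to converge
    for x, y in data:
        xs = scale(x)
        pred = 1 if sum(wi * xi for wi, xi in zip(w, xs)) + b > 0 else 0
        err = y - pred                     # classic perceptron update rule
        w = [wi + 0.1 * err * xi for wi, xi in zip(w, xs)]
        b += 0.1 * err

correct = sum(
    (1 if sum(wi * xi for wi, xi in zip(w, scale(x))) + b > 0 else 0) == y
    for x, y in data
)
print(f"training accuracy: {correct}/4")
```

Because this toy dataset is linearly separable (house ownership alone predicts the label), the perceptron convergence theorem guarantees it reaches 4/4.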

Going deeper: the Multi-Layer Perceptron

Stacking perceptrons into layers lets the network model non-linear functions. The XOR problem, famously unlearnable by a single perceptron, requires just two layers:

input → [Linear → ReLU] → [Linear → Sigmoid] → output

Each hidden layer transforms the representation; the final layer makes a decision.
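Two layers really do suffice for XOR. Rather than training, the sketch below hand-picks weights for a two-neuron hidden layer; a trained network could equally well discover an arrangement like this:

```python
# XOR cannot be separated by one straight line, but two ReLU hidden neurons
# plus a thresholded output solve it with hand-set weights.
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    # Hidden layer: each neuron fires on exactly one of the mixed inputs
    h1 = relu(x1 - x2)   # fires only for (1, 0)
    h2 = relu(x2 - x1)   # fires only for (0, 1)
    # Output layer: sum of hidden activations, then threshold
    return 1 if (h1 + h2) > 0.5 else 0

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))   # 0, 1, 1, 0
```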

πŸ“Š Fully-Connected Network: Inputs, Hidden Neurons, and Output

flowchart LR
    subgraph Input Layer
        I1[x1]
        I2[x2]
        I3[x3]
    end
    subgraph Hidden Layer 1
        H1[N1]
        H2[N2]
        H3[N3]
        H4[N4]
    end
    subgraph Output Layer
        O1[y]
    end
    I1 --> H1 & H2 & H3 & H4
    I2 --> H1 & H2 & H3 & H4
    I3 --> H1 & H2 & H3 & H4
    H1 & H2 & H3 & H4 --> O1

This diagram shows a fully connected (dense) network architecture: every input neuron connects to every hidden neuron, and every hidden neuron connects to the output. The three input nodes (x1, x2, x3) fan out to four hidden neurons, each computing a weighted sum followed by an activation function, before converging to a single output prediction. The key takeaway is that "fully connected" means every pair of adjacent-layer neurons shares a weight, which is why the number of parameters grows as the product of layer widths and why matrix multiplication is the natural operation for the forward pass.


βš™οΈ How a Neural Network Actually Trains

Training a network is a three-step cycle:

Forward pass: data flows through the layers, producing a predicted output.

Loss computation: the prediction is compared to the true label using a loss function (e.g., cross-entropy for classification).

Backward pass: gradients of the loss are computed through the network via the chain rule, and weights are nudged in the direction that reduces loss.
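The cycle can be demonstrated on the smallest possible "network": a single weight fitting y = 2x, with the gradient derived by hand (the learning rate and step count below are arbitrary choices):

```python
w = 0.0                        # the single learnable parameter
lr = 0.05
x, y_true = 3.0, 6.0           # one training example of y = 2x

for step in range(20):
    y_pred = w * x                       # 1. forward pass
    loss = (y_pred - y_true) ** 2        # 2. loss computation (squared error)
    grad = 2 * (y_pred - y_true) * x     # 3. backward pass (chain rule by hand)
    w -= lr * grad                       # weight update: nudge w downhill

print(round(w, 4))   # converges to 2.0
```

Each update multiplies the remaining error by a constant factor below 1, so w closes in on the true slope of 2 exponentially fast.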

graph TD
    A[Input Vector x] --> B[Linear Layer 1]
    B --> C[ReLU activation]
    C --> D[Linear Layer 2]
    D --> E[Softmax]
    E --> F[Prediction ŷ]
    F --> G[Loss L vs true y]
    G --> H[Backward pass: compute gradients]
    H --> I[Optimizer: Adam / SGD]
    I --> J[Update weights]
    J --> B

Forward pass shape transformations

| Step | Operation | Shape |
| --- | --- | --- |
| Input | raw tensor | (batch, features) |
| Linear 1 | nn.Linear(features, hidden) | (batch, hidden) |
| ReLU | element-wise max(0, x) | (batch, hidden) |
| Linear 2 | nn.Linear(hidden, classes) | (batch, classes) |
| Softmax | normalize to probabilities | (batch, classes) |
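The shape transformations can be verified numerically. The sketch below uses NumPy matrix multiplies standing in for nn.Linear layers (all sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, features, hidden, classes = 32, 10, 64, 3

x  = rng.normal(size=(batch, features))    # Input:           (batch, features)
W1 = rng.normal(size=(features, hidden))   # Linear 1 weights
W2 = rng.normal(size=(hidden, classes))    # Linear 2 weights

h = np.maximum(0, x @ W1)                  # Linear 1 + ReLU: (batch, hidden)
logits = h @ W2                            # Linear 2:        (batch, classes)
# Softmax, with the row max subtracted for numerical stability
e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)   # Softmax:         (batch, classes)

print(x.shape, h.shape, logits.shape, probs.shape)
# (32, 10) (32, 64) (32, 3) (32, 3)
```

Every row of probs sums to 1, which is exactly what "normalize to probabilities" means.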

πŸ“Š Forward Pass and Backpropagation Signal Flow

sequenceDiagram
    participant Input
    participant HiddenLayer
    participant Output
    participant Loss
    Input->>HiddenLayer: Forward: activations
    HiddenLayer->>Output: Forward: predictions
    Output->>Loss: Compute loss
    Loss->>Output: Backprop: dL/dO
    Output->>HiddenLayer: Backprop: dL/dH
    HiddenLayer->>Input: Backprop: dL/dW
    Input->>HiddenLayer: Update weights

This diagram traces the bidirectional signal flow in a neural network: the forward pass moves activations from Input through HiddenLayer to Output, where the loss is computed; the backward pass reverses direction, propagating gradient signals from Loss back through each layer to update weights. Each backward arrow carries a partial derivative (dL/dO, dL/dH, dL/dW) computed by the chain rule, the mathematical mechanism that makes deep learning tractable. The key takeaway is that backpropagation is simply a systematic application of the chain rule in reverse, and every weight in the network receives its own gradient signal regardless of how many layers separate it from the output.


🧠 Deep Dive: Why Activation Functions Are the Key to Learning

Without non-linear activations, stacking layers is mathematically equivalent to a single linear transformation: adding more layers gains nothing. Activations like ReLU break this linearity so the network can learn curved, complex decision boundaries.
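This collapse is easy to verify numerically: composing two purely linear layers is exactly one matrix product. A quick NumPy check (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # layer 1: 4 -> 8
W2 = rng.normal(size=(8, 2))   # layer 2: 8 -> 2
x  = rng.normal(size=(5, 4))   # batch of 5 inputs

two_layers = (x @ W1) @ W2     # stacked linear layers, no activation between
one_layer  = x @ (W1 @ W2)     # a single equivalent linear map

print(np.allclose(two_layers, one_layer))   # True
```

Inserting a ReLU between the two multiplies breaks this equivalence, which is precisely why activations make depth worthwhile.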

| Activation | Formula | Common use |
| --- | --- | --- |
| Sigmoid | 1 / (1 + e⁻ˣ) | Binary output layer |
| Tanh | (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) | Hidden layers in older RNNs |
| ReLU | max(0, x) | Default for most hidden layers |
| Softmax | exp(xᵢ) / Σⱼ exp(xⱼ) | Multi-class output layer |

πŸ“Š Choosing the Right Activation Function for Each Layer

flowchart TD
    A[Choose Activation Fn] --> B{Layer Type?}
    B -- Hidden --> C{Deep Network?}
    B -- Output --> D{Task Type?}
    C -- Yes --> E[ReLU / Leaky ReLU]
    C -- No --> F[Sigmoid / Tanh]
    D -- Binary Class --> G[Sigmoid]
    D -- Multi-class --> H[Softmax]
    D -- Regression --> I[Linear / None]

This decision tree shows how to choose the correct activation function based on where a layer sits in the network and what task it performs. Hidden layers in deep networks default to ReLU (or Leaky ReLU) to avoid the vanishing gradient problem, while shallower or older architectures may use Sigmoid or Tanh. Output layers are always determined by the loss function: Sigmoid pairs with binary cross-entropy, Softmax pairs with categorical cross-entropy, and no activation (linear) pairs with MSE for regression. The key takeaway is that the output activation is not a hyperparameter to tune: it is dictated by the task.

πŸ“Š Training Flow: How a Neural Network Learns Step by Step

The following diagram shows the complete training cycle. Each iteration moves data forward through the network to produce a prediction, then propagates error signals backward to adjust every weight. This cycle repeats for every mini-batch in the dataset.

graph TD
    A[Load Mini-Batch] --> B[Forward Pass Through Layers]
    B --> C[Compute Predictions]
    C --> D[Calculate Loss: how wrong is the prediction?]
    D --> E[Backward Pass: compute gradients via chain rule]
    E --> F[Optimizer: Adam / SGD / RMSProp]
    F --> G[Update All Weights]
    G --> A

Understanding each step individually makes debugging far easier. If loss is not decreasing, the issue is either the forward pass (wrong architecture), the loss function (wrong objective), or the optimizer (wrong hyperparameters). Identifying which stage is broken narrows the problem immediately.


πŸ§ͺ Building a Simple Classifier in PyTorch

This example implements the complete training loop for a two-layer neural network using PyTorch's nn.Sequential API on a toy 3-feature binary classification problem. It was chosen because it maps directly onto every concept covered so far: nn.Linear layers perform the matrix multiply, nn.ReLU introduces non-linearity, CrossEntropyLoss applies probability-based loss, and loss.backward() runs backpropagation. As you read the code, focus on the inner loop: zero_grad → forward → loss.backward() → optimizer.step(). That sequence is the foundation of every neural network training run, from this toy example to production-scale models.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes)
        )

    def forward(self, x):
        return self.net(x)

model = SimpleNet(n_features=3, n_hidden=16, n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop (one epoch sketch)
for X_batch, y_batch in dataloader:
    optimizer.zero_grad()
    logits = model(X_batch)
    loss = criterion(logits, y_batch)
    loss.backward()
    optimizer.step()

The two lines loss.backward() and optimizer.step() are where all the learning happens. PyTorch automatically computes the gradient of every parameter.


🌍 Real-World Applications: Neural Networks Running in Production Today

The same training loop from the PyTorch example above scales to every major AI product in production today:

| Product | Architecture | What it predicts |
| --- | --- | --- |
| Gmail spam filter | Multi-layer MLP | Spam vs. not-spam from email features |
| Netflix recommendations | Embedding + MLP | Watch probability given user history |
| Face ID (iPhone) | Deep CNN | Match probability against enrolled face |
| GPT-4 | Deep decoder-only Transformer | Next token in a sequence |

Each trains by the same loop: forward pass → loss → backward pass → weight update. The difference is data volume, model depth, and compute scale.

ReLU is preferred in hidden layers because it avoids the vanishing gradient problem that plagued sigmoid and tanh in deep networks: gradient magnitudes stay healthy for positive activations, allowing learning signals to reach early layers.
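The saturation is easy to quantify: sigmoid's derivative σ(x)(1 − σ(x)) peaks at 0.25 and collapses toward zero for large |x|, while ReLU's derivative is a constant 1 for any positive input. A small sketch:

```python
import math

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:4.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

At x = 10, sigmoid's gradient is about 0.000045; multiplied across many layers, that signal effectively dies, which is the vanishing gradient problem in one number.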


βš–οΈ Trade-offs & Failure Modes: Common Mistakes and How to Avoid Them

| Mistake | Symptom | Fix |
| --- | --- | --- |
| No normalization | Loss diverges, slow convergence | Normalize inputs to zero mean, unit variance |
| Learning rate too high | Loss explodes or oscillates | Start at 1e-3 with Adam; use a scheduler |
| Too many layers for small dataset | Overfits quickly | Add dropout, reduce capacity, or use transfer learning |
| Forgetting optimizer.zero_grad() | Gradients accumulate across batches | Call it at the start of every batch |
| Using sigmoid output + BCELoss | Numerical instability | Use raw logits + BCEWithLogitsLoss |
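The last row is worth seeing in code: BCEWithLogitsLoss computes the same quantity as sigmoid followed by BCELoss, but fuses the sigmoid into a numerically stable form internally, so it is the safe choice when logits get extreme. The values below are illustrative:

```python
import torch
import torch.nn as nn

logits  = torch.tensor([2.0, -1.0])   # raw model outputs (no sigmoid applied)
targets = torch.tensor([1.0, 0.0])

# Preferred: feed raw logits directly
stable = nn.BCEWithLogitsLoss()(logits, targets)
# Equivalent but fragile for extreme logits: sigmoid first, then BCELoss
naive  = nn.BCELoss()(torch.sigmoid(logits), targets)

print(stable.item(), naive.item())    # identical for moderate logits
```

For moderate logits the two agree to float precision; for very large magnitudes, sigmoid saturates to exactly 0 or 1 and the separate-step version loses precision, while the fused version stays well-behaved.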

🧭 Decision Guide: When Neural Networks Are the Right Tool

  • Use neural networks when your data is high-dimensional and unstructured (images, audio, raw text) and you have thousands of labeled examples.
  • Start with a pre-trained model rather than training from scratch: transfer learning typically outperforms scratch models with 10× less data.
  • Prefer gradient-boosted trees (XGBoost, LightGBM) for tabular data with fewer than ~50k rows: they are faster to tune and often more accurate.
  • Monitor overfitting early: if train loss keeps dropping but validation loss rises, add dropout or reduce model size before training longer.

🎯 What to Learn Next


πŸ› οΈ PyTorch: The Research-to-Production Neural Network Framework

PyTorch is an open-source deep learning framework from Meta AI with dynamic computation graphs, GPU acceleration, and an ecosystem of pre-trained models; it is the dominant framework for neural network research and increasingly the standard for production deployment via TorchScript and ONNX export.

The post already shows a basic PyTorch training loop. Here is a more complete example that adds proper data loading, validation tracking, and a learning-rate scheduler, the key additions that take a toy example toward a production-ready training pipeline:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# --- Data ---
X = torch.randn(200, 10)               # 200 examples, 10 features
y = (X[:, 0] + X[:, 1] > 0).long()    # binary label

dataset  = TensorDataset(X, y)
train_ds, val_ds = torch.utils.data.random_split(dataset, [160, 40])
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)
val_dl   = DataLoader(val_ds,   batch_size=16)

# --- Model with Dropout (regularization) ---
class MLPWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 2)
        )
    def forward(self, x): return self.net(x)

model     = MLPWithDropout()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = nn.CrossEntropyLoss()

# --- Training loop with validation ---
for epoch in range(20):
    model.train()
    for X_b, y_b in train_dl:
        optimizer.zero_grad()
        loss = criterion(model(X_b), y_b)
        loss.backward()
        optimizer.step()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_correct = sum((model(X_b).argmax(1) == y_b).sum().item()
                          for X_b, y_b in val_dl)
    if epoch % 5 == 0:
        print(f"Epoch {epoch:3d} | val_acc={val_correct/40:.0%} | lr={scheduler.get_last_lr()[0]:.1e}")

model.eval() + torch.no_grad() during validation ensures Dropout is disabled and no gradients are stored; these are the two most common omissions that cause incorrect validation metrics.

For a full deep-dive on PyTorch, a dedicated follow-up post is planned.


πŸ› οΈ TensorFlow/Keras: High-Level Neural Network Training

TensorFlow is an open-source deep learning framework from Google with tf.keras as its high-level API; it dominates production deployments through TensorFlow Serving, TFLite for mobile, and TensorFlow.js for browser inference, making it the framework of choice when the model must run on edge devices or be served at scale via a REST endpoint.

Keras's model.fit() handles the entire training loop in one call, including validation, callbacks, and metric tracking:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Same architecture as the PyTorch example, so directly comparable
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(10,)),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax")
])

model.compile(optimizer=keras.optimizers.AdamW(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# EarlyStopping callback, equivalent to monitoring val_acc in the PyTorch loop
early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                            patience=5, restore_best_weights=True)

import numpy as np
X_np = tf.random.normal((200, 10)).numpy()
y_np = (X_np[:, 0] + X_np[:, 1] > 0).astype(int)

history = model.fit(X_np, y_np, epochs=50, batch_size=16,
                    validation_split=0.2, callbacks=[early_stop], verbose=0)
print(f"Best val accuracy: {max(history.history['val_accuracy']):.0%}")

TensorFlow's SavedModel format exports the trained model for serving: model.save("./saved_model") produces a directory deployable directly to TensorFlow Serving without any code changes.

For a full deep-dive on TensorFlow/Keras, a dedicated follow-up post is planned.


πŸ“š Lessons from Building Neural Networks

Building neural networks teaches a set of hard-won lessons that no amount of theory fully prepares you for.

Debug on a single batch first. Before running a full training loop, verify your model can overfit to a single mini-batch of 4–8 examples. If it cannot memorize a tiny dataset, something is fundamentally broken in the architecture, the loss function, or the data pipeline.
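That sanity check takes only a few lines in PyTorch. The sketch below (sizes, seed, and step count are arbitrary choices) loops over the same tiny batch until the loss collapses; even random labels should be memorizable by a healthy model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(8, 5)                      # one tiny batch of 8 examples
y = torch.randint(0, 2, (8,))              # random labels: memorization test

model = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):                    # loop over the SAME batch every step
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")    # should be near zero
```

If the final loss is not near zero, no amount of extra data or epochs will save the full training run; fix the architecture, loss, or data pipeline first.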

Gradients are your feedback signal. Monitoring gradient norms reveals instability before it manifests as exploding or vanishing loss. A healthy gradient norm stays roughly stable across training. A gradient that grows exponentially signals an unstable learning rate; one that collapses to zero signals dead neurons or a vanishing gradient.
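In PyTorch, gradient norms are available for inspection right after loss.backward(); a minimal monitoring sketch (the model and data below are placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x, y = torch.randn(16, 4), torch.randn(16, 1)

loss = nn.MSELoss()(model(x), y)
loss.backward()                            # populates p.grad on every parameter

# Global gradient norm: root of the summed squared per-parameter norms
total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(f"global gradient norm: {total_norm.item():.4f}")
```

Logging this one number every few steps is usually enough: exponential growth flags an unstable learning rate, and a collapse toward zero flags dead neurons or vanishing gradients.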

Architecture first, hyperparameters second. Start with the simplest possible architecture that could theoretically solve the problem. Add complexity only when you have evidence the simple model has hit a ceiling. Most beginners add layers when they should be checking data quality.

Transfer learning beats training from scratch. For most practical applications, fine-tuning a pre-trained model outperforms training a custom architecture from scratch with far less data and compute.

Use the simplest architecture that works. Practitioners consistently underestimate how far a two-layer MLP with good preprocessing can go. Complexity has a real cost: harder debugging, longer training, and greater risk of overfitting on small datasets. Add layers only when experiments confirm a simpler model has reached its ceiling. Always benchmark against well-tuned linear baselines before scaling up the model architecture.


πŸ“Œ TLDR: Summary & Key Takeaways

  • A neural network is a composition of simple linear transformations and non-linear activations.
  • Training = forward pass → loss → backward pass → weight update, repeated until the model is good enough.
  • ReLU is the default activation; softmax turns raw logits into probabilities for classification.
  • Common failures (vanishing gradients, overfitting, unstable learning) each have well-known remedies.
  • Start small, verify the training loop is working on a tiny batch before scaling up.

πŸ“ Practice Quiz

  1. Why can a single perceptron only solve linearly separable problems?

    • A) It has no activation function
    • B) It can only draw a single straight decision boundary
    • C) It does not use backpropagation

    Correct Answer: B. A single neuron applies a linear transformation followed by an activation threshold, which produces a hyperplane (a straight line in 2D) as its decision boundary. Non-linear patterns like XOR require at least one hidden layer.

  2. What does loss.backward() compute in PyTorch?

    • A) It updates the weights directly
    • B) It computes gradients of the loss with respect to all parameters
    • C) It resets the optimizer state

    Correct Answer: B. loss.backward() runs the reverse-mode automatic differentiation pass, computing the gradient of the scalar loss with respect to every parameter in the model via the chain rule.

  3. Why is ReLU preferred over sigmoid in hidden layers?

    • A) ReLU is always faster
    • B) Sigmoid output is always negative
    • C) ReLU alleviates the vanishing gradient problem in deep networks

    Correct Answer: C. Sigmoid saturates near 0 and 1, causing gradients to vanish in deep networks. ReLU passes gradients unchanged for positive inputs, allowing learning signals to propagate effectively through many layers.



Written by Abstract Algorithms (@abstractalgorithms)