Neural Networks Explained: From Neurons to Deep Learning
How do computers learn? We start with a single neuron (Perceptron) and build up to Deep Neural Networks.
Abstract Algorithms
TLDR: A neural network is a stack of simple "neurons" that turn raw inputs into predictions by learning the right weights and biases. Training means repeatedly nudging those numbers via back-propagation until the error shrinks. Master the basics and you can build everything from spam filters to image classifiers.
## What Is a Neuron, and Why Do We Stack Them?

A single neuron in a neural network is just a weighted sum plus an activation function:

```
output = relu(w1*x1 + w2*x2 + bias)
```

Stack enough of these together with the right weights and you get GPT-4. The entire architecture of modern AI, from spam filters to self-driving cars, reduces to this one line, repeated at massive scale with learned weights. This post explains how.

Think of each neuron as one expert on a committee deciding whether to approve a loan. Each expert looks at different evidence (credit score, income, debt-to-income ratio) and casts a weighted vote. The final decision is the weighted sum of those votes, passed through a threshold rule. That's one artificial neuron.
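The committee analogy maps directly onto code. Here is a minimal sketch in plain Python; the evidence values and weights are made up for illustration, not learned:

```python
def relu(z):
    """Threshold rule: negative evidence is clipped to zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    # Weighted vote: each piece of evidence times its weight, plus a bias
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(total)

# Hypothetical loan decision: credit score, income, debt ratio (normalized)
evidence = [0.72, 0.55, 0.30]
weights  = [1.5, 1.0, -2.0]   # debt counts against approval
bias     = -0.5
print(neuron(evidence, weights, bias))  # one neuron's "vote": 0.53
```

Learning, covered below, is nothing more than adjusting `weights` and `bias` until the votes come out right.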
| Aspect | Neural Network | Traditional Linear Model |
|---|---|---|
| Non-linearity | Introduced by activation functions | None |
| Feature engineering | Network learns features automatically | Heavy manual design |
| Interpretability | Low (distributed weights) | High (explicit coefficients) |
| Scalability | Handles millions of parameters | Limited by feature count |
A single neuron can only draw a straight line between classes. The moment we stack neurons into layers, the network can represent curves, spirals, and arbitrary decision boundaries.
## The Perceptron: Neural Network Building Blocks
A single artificial neuron computes:
$$y = f\!\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
where $\mathbf{x}$ are input features, $\mathbf{w}$ are learnable weights, $b$ is a bias, and $f$ is an activation function (ReLU, sigmoid, or tanh).
Toy dataset: loan approval

| ID | Age | Salary ($k) | Owns House | Label (Approve?) |
|---|---|---|---|---|
| 1 | 25 | 45 | No | No |
| 2 | 40 | 80 | Yes | Yes |
| 3 | 30 | 60 | No | No |
| 4 | 50 | 120 | Yes | Yes |
A perceptron learns a weighted combination of age, salary, and house ownership to separate approvals from rejections. But it can only learn linear boundaries.
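The perceptron learning rule itself is only a few lines: predict, compare with the label, and nudge each weight by the error times the input. Here is a sketch on the toy table above, assuming the classic Rosenblatt update and features scaled to roughly [0, 1] (age/100, salary/150, owns-house as 0/1):

```python
def perceptron_train(X, y, lr=0.1, epochs=20):
    """Rosenblatt update: w += lr * (label - prediction) * x."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            err = yi - pred                      # -1, 0, or +1
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b

# Toy loan table, normalized: [age/100, salary/150, owns house]
X = [[0.25, 0.30, 0], [0.40, 0.53, 1], [0.30, 0.40, 0], [0.50, 0.80, 1]]
y = [0, 1, 0, 1]
w, b = perceptron_train(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
print(preds)  # [0, 1, 0, 1] -- matches the labels
```

Because this toy data is linearly separable (house ownership alone splits it), the rule converges in a few epochs. On XOR-style data it would cycle forever, which is exactly the limitation the next section addresses.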
### Going Deeper: The Multi-Layer Perceptron

Stacking perceptrons into layers lets the network model non-linear functions. XOR, famously unlearnable by a single perceptron, needs just two layers:

```
input → [Linear → ReLU] → [Linear → Sigmoid] → output
```
Each hidden layer transforms the representation; the final layer makes a decision.
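To see why two layers suffice for XOR, here is a hand-weighted two-layer network (the weights are chosen by hand for illustration, not learned): the hidden layer re-represents the inputs so the output layer can separate them with a plain linear read-out.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    # Hidden layer: two ReLU neurons re-encode the inputs
    h1 = relu(x1 + x2)        # counts how many inputs are on
    h2 = relu(x1 + x2 - 1)    # fires only when both are on
    # Output layer: a linear combination of the hidden features
    return h1 - 2 * h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))  # 0, 1, 1, 0
```

No single line in the input space separates XOR's classes, but in the (h1, h2) space the classes become linearly separable — that re-representation is what each hidden layer buys you.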
## Fully-Connected Network: Inputs, Hidden Neurons, and Output
```mermaid
flowchart LR
    subgraph Input Layer
        I1[x1]
        I2[x2]
        I3[x3]
    end
    subgraph Hidden Layer 1
        H1[N1]
        H2[N2]
        H3[N3]
        H4[N4]
    end
    subgraph Output Layer
        O1[y]
    end
    I1 --> H1 & H2 & H3 & H4
    I2 --> H1 & H2 & H3 & H4
    I3 --> H1 & H2 & H3 & H4
    H1 & H2 & H3 & H4 --> O1
```
This diagram shows a fully connected (dense) network architecture: every input neuron connects to every hidden neuron, and every hidden neuron connects to the output. The three input nodes (x1, x2, x3) fan out to four hidden neurons, each computing a weighted sum followed by an activation function, before converging to a single output prediction. The key takeaway is that "fully connected" means every pair of adjacent-layer neurons shares a weight, which is why the number of parameters grows as the product of layer widths and why matrix multiplication is the natural operation for the forward pass.
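That last point, the forward pass as matrix multiplication, is easy to verify in NumPy. A sketch of the 3-4-1 network in the diagram with random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))                     # one example, 3 features

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden: 3*4 = 12 weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output: 4*1 = 4 weights

h = np.maximum(0, x @ W1 + b1)                  # hidden activations (ReLU)
y = h @ W2 + b2                                 # output prediction
print(h.shape, y.shape)                         # (1, 4) (1, 1)
```

Note that the parameter counts (12 and 4) are exactly the products of adjacent layer widths, as described above.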
## How a Neural Network Actually Trains

Training a network is a three-step cycle:

1. Forward pass: data flows through the layers, producing a predicted output.
2. Loss computation: the prediction is compared to the true label using a loss function (e.g., cross-entropy for classification).
3. Backward pass: gradients of the loss are computed through the network via the chain rule, and weights are nudged in the direction that reduces loss.
```mermaid
graph TD
    A[Input Vector x] --> B[Linear Layer 1]
    B --> C[ReLU activation]
    C --> D[Linear Layer 2]
    D --> E[Softmax]
    E --> F[Prediction ŷ]
    F --> G[Loss L vs true y]
    G --> H[Backward pass: compute gradients]
    H --> I[Optimizer: Adam / SGD]
    I --> J[Update weights]
    J --> B
```
### Forward Pass Shape Transformations

| Step | Operation | Shape |
|---|---|---|
| Input | raw tensor | (batch, features) |
| Linear 1 | `nn.Linear(features, hidden)` | (batch, hidden) |
| ReLU | element-wise `max(0, x)` | (batch, hidden) |
| Linear 2 | `nn.Linear(hidden, classes)` | (batch, classes) |
| Softmax | normalize to probabilities | (batch, classes) |
## Forward Pass and Backpropagation Signal Flow
```mermaid
sequenceDiagram
    participant Input
    participant HiddenLayer
    participant Output
    participant Loss
    Input->>HiddenLayer: Forward: activations
    HiddenLayer->>Output: Forward: predictions
    Output->>Loss: Compute loss
    Loss->>Output: Backprop: dL/dO
    Output->>HiddenLayer: Backprop: dL/dH
    HiddenLayer->>Input: Backprop: dL/dW
    Input->>HiddenLayer: Update weights
```
This diagram traces the bidirectional signal flow in a neural network: the forward pass moves activations from Input through HiddenLayer to Output, where the loss is computed; the backward pass reverses direction, propagating gradient signals from Loss back through each layer to update weights. Each backward arrow carries a partial derivative (dL/dO, dL/dH, dL/dW) computed by the chain rule, the mathematical mechanism that makes deep learning tractable. The key takeaway is that backpropagation is simply a systematic application of the chain rule in reverse: every weight in the network receives its own gradient signal regardless of how many layers separate it from the output.
## Deep Dive: Why Activation Functions Are the Key to Learning

Without non-linear activations, stacking layers is mathematically equivalent to a single linear transformation; adding more layers gains nothing. Activations like ReLU break this linearity so the network can learn curved, complex decision boundaries.

| Activation | Formula | Common use |
|---|---|---|
| Sigmoid | 1 / (1 + exp(−x)) | Binary output layer |
| Tanh | (exp(x) − exp(−x)) / (exp(x) + exp(−x)) | Hidden layers in older RNNs |
| ReLU | max(0, x) | Default for most hidden layers |
| Softmax | exp(xᵢ) / Σⱼ exp(xⱼ) | Multi-class output layer |
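The collapse-to-linear claim is worth checking directly. In NumPy, two stacked linear layers with no activation between them are exactly one matrix away from a single layer; insert a ReLU and the equivalence breaks (random matrices here, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 3))
x = rng.normal(size=(10, 5))

stacked   = (x @ W1) @ W2         # two linear layers, no activation between
collapsed = x @ (W1 @ W2)         # one equivalent linear layer
print(np.allclose(stacked, collapsed))   # True: the extra layer bought nothing

nonlinear = np.maximum(0, x @ W1) @ W2   # same layers with a ReLU in between
print(np.allclose(nonlinear, collapsed)) # False: now depth adds expressive power
```

The first check is just associativity of matrix multiplication; the ReLU is what prevents the two weight matrices from being folded into one.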
## Choosing the Right Activation Function for Each Layer
```mermaid
flowchart TD
    A[Choose Activation Fn] --> B{Layer Type?}
    B -- Hidden --> C{Deep Network?}
    B -- Output --> D{Task Type?}
    C -- Yes --> E[ReLU / Leaky ReLU]
    C -- No --> F[Sigmoid / Tanh]
    D -- Binary Class --> G[Sigmoid]
    D -- Multi-class --> H[Softmax]
    D -- Regression --> I[Linear / None]
```
This decision tree shows how to choose the correct activation function based on where a layer sits in the network and what task it performs. Hidden layers in deep networks default to ReLU (or Leaky ReLU) to avoid the vanishing gradient problem, while shallower or older architectures may use Sigmoid or Tanh. Output layers are always determined by the loss function: Sigmoid pairs with binary cross-entropy, Softmax pairs with categorical cross-entropy, and no activation (linear) pairs with MSE for regression. The key takeaway is that the output activation is not a hyperparameter to tune; it is dictated by the task.
## Training Flow: How a Neural Network Learns Step by Step
The following diagram shows the complete training cycle. Each iteration moves data forward through the network to produce a prediction, then propagates error signals backward to adjust every weight. This cycle repeats for every mini-batch in the dataset.
```mermaid
graph TD
    A[Load Mini-Batch] --> B[Forward Pass Through Layers]
    B --> C[Compute Predictions]
    C --> D[Calculate Loss: how wrong is the prediction?]
    D --> E[Backward Pass: compute gradients via chain rule]
    E --> F[Optimizer: Adam / SGD / RMSProp]
    F --> G[Update All Weights]
    G --> A
```
Understanding each step individually makes debugging far easier. If loss is not decreasing, the issue is either the forward pass (wrong architecture), the loss function (wrong objective), or the optimizer (wrong hyperparameters). Identifying which stage is broken narrows the problem immediately.
## Building a Simple Classifier in PyTorch

This example implements the complete training loop for a two-layer neural network using PyTorch's `nn.Sequential` API on a toy 3-feature binary classification problem. It was chosen because it maps directly onto every concept covered so far: `nn.Linear` layers perform the matrix multiply, `nn.ReLU` introduces non-linearity, `CrossEntropyLoss` applies probability-based loss, and `loss.backward()` runs backpropagation. As you read the code, focus on the inner loop: `zero_grad` → forward → `loss.backward()` → `optimizer.step()`. That sequence is the foundation of every neural network training run, from this toy example to production-scale models.

```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes)
        )

    def forward(self, x):
        return self.net(x)

model = SimpleNet(n_features=3, n_hidden=16, n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop (one epoch sketch; assumes a DataLoader yielding (X, y) batches)
for X_batch, y_batch in dataloader:
    optimizer.zero_grad()
    logits = model(X_batch)
    loss = criterion(logits, y_batch)
    loss.backward()
    optimizer.step()
```
The pair `loss.backward()` → `optimizer.step()` is where all the learning happens: PyTorch automatically computes the gradient of every parameter, and the optimizer applies the updates.
## Real-World Applications: Neural Networks Running in Production Today

The same training loop from the PyTorch example above scales to every major AI product in production today:

| Product | Architecture | What it predicts |
|---|---|---|
| Gmail spam filter | 3-layer MLP | Spam vs. not-spam from email features |
| Netflix recommendations | Embedding + MLP | Watch probability given user history |
| Face ID (iPhone) | Deep CNN | Match probability against enrolled face |
| GPT-4 | Deep Transformer | Next token in a sequence |

Each trains by the same loop: forward pass → loss → backward pass → weight update. The difference is data volume, model depth, and compute scale.
ReLU is preferred in hidden layers because it avoids the vanishing gradient problem that plagued sigmoid and tanh in deep networks: gradient magnitudes stay healthy for positive activations, allowing learning signals to reach early layers.
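The vanishing-gradient effect is easy to quantify: sigmoid's derivative peaks at 0.25 and decays toward zero for large inputs, so chaining one such factor per layer shrinks the signal exponentially with depth. A quick NumPy check:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
sig_grad = sigmoid(z) * (1 - sigmoid(z))   # derivative of sigmoid
relu_grad = (z > 0).astype(float)          # derivative of ReLU

print(sig_grad)      # tiny at |z| = 10; at most 0.25 at z = 0
print(relu_grad)     # exactly 1.0 for any positive input
print(0.25 ** 10)    # best-case sigmoid gradient after 10 layers: ~1e-6
```

Even in the best case, ten sigmoid layers attenuate the gradient by a factor of about a million, while a chain of active ReLUs passes it through at full strength.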
## Trade-offs & Failure Modes: Common Mistakes and How to Avoid Them

| Mistake | Symptom | Fix |
|---|---|---|
| No normalization | Loss diverges, slow convergence | Normalize inputs to zero mean, unit variance |
| Learning rate too high | Loss explodes or oscillates | Start at 1e-3 with Adam; use a scheduler |
| Too many layers for small dataset | Overfits quickly | Add dropout, reduce capacity, or use transfer learning |
| Forgetting `optimizer.zero_grad()` | Gradients accumulate across batches | Call it at the start of every batch |
| Using sigmoid output + BCELoss | Numerical instability | Use raw logits + `BCEWithLogitsLoss` |
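The sigmoid + BCELoss instability is worth a demonstration. With an extreme logit, float32 sigmoid saturates to exactly 1.0, so `BCELoss` hits log(0) (which PyTorch clamps at -100 per element), while `BCEWithLogitsLoss` computes the same quantity stably in log space. The logit values below are made up to force the failure:

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 30.0])   # 30.0: a confidently wrong prediction
targets = torch.tensor([1.0, 0.0, 0.0])    # the 30.0 logit should be class 0

stable = nn.BCEWithLogitsLoss()(logits, targets)
unstable = nn.BCELoss()(torch.sigmoid(logits), targets)

print(f"with logits:   {stable.item():.2f}")    # correct mean loss, ~10.1
print(f"after sigmoid: {unstable.item():.2f}")  # clamped log(0), ~33.5: wrong
```

The stable version reports the true loss (about 30 for the bad sample); the unstable one silently substitutes PyTorch's clamp value, corrupting both the loss and its gradients.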
## Decision Guide: When Neural Networks Are the Right Tool

- Use neural networks when your data is high-dimensional and unstructured (images, audio, raw text) and you have thousands of labeled examples.
- Start with a pre-trained model rather than training from scratch; transfer learning typically outperforms scratch models with 10× less data.
- Prefer gradient-boosted trees (XGBoost, LightGBM) for tabular data with fewer than ~50k rows; they are faster to tune and often more accurate.
- Monitor overfitting early: if train loss keeps dropping but validation loss rises, add dropout or reduce model size before training longer.
## What to Learn Next
- Deep Learning Architectures: CNNs, RNNs, and Transformers
- Machine Learning Fundamentals
- Mathematics for Machine Learning
## PyTorch: The Research-to-Production Neural Network Framework

PyTorch is an open-source deep learning framework from Meta AI with dynamic computation graphs, GPU acceleration, and an ecosystem of pre-trained models. It is the dominant framework for neural network research and increasingly the standard for production deployment via TorchScript and ONNX export.

The post already shows a basic PyTorch training loop. Here is a more complete example that adds proper data loading, validation tracking, and a learning-rate scheduler, the key additions that take a toy example toward a production-ready training pipeline:
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# --- Data ---
X = torch.randn(200, 10)             # 200 examples, 10 features
y = (X[:, 0] + X[:, 1] > 0).long()   # binary label
dataset = TensorDataset(X, y)
train_ds, val_ds = torch.utils.data.random_split(dataset, [160, 40])
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=16)

# --- Model with Dropout (regularization) ---
class MLPWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 2)
        )

    def forward(self, x):
        return self.net(x)

model = MLPWithDropout()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = nn.CrossEntropyLoss()

# --- Training loop with validation ---
for epoch in range(20):
    model.train()
    for X_b, y_b in train_dl:
        optimizer.zero_grad()
        loss = criterion(model(X_b), y_b)
        loss.backward()
        optimizer.step()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_correct = sum((model(X_b).argmax(1) == y_b).sum().item()
                          for X_b, y_b in val_dl)
    if epoch % 5 == 0:
        print(f"Epoch {epoch:3d} | val_acc={val_correct/40:.0%} | lr={scheduler.get_last_lr()[0]:.1e}")
```
`model.eval()` plus `torch.no_grad()` during validation ensures Dropout is disabled and no gradients are stored; these are the two most common omissions that cause incorrect validation metrics.
For a full deep-dive on PyTorch, a dedicated follow-up post is planned.
## TensorFlow/Keras: High-Level Neural Network Training

TensorFlow is an open-source deep learning framework from Google with tf.keras as its high-level API. It dominates production deployments through TensorFlow Serving, TFLite for mobile, and TensorFlow.js for browser inference, making it the framework of choice when the model must run on edge devices or be served at scale via a REST endpoint.

Keras's `model.fit()` handles the entire training loop in one call, including validation, callbacks, and metric tracking:
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Same architecture as the PyTorch example -- directly comparable
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(10,)),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax")
])
model.compile(optimizer=keras.optimizers.AdamW(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# EarlyStopping callback -- equivalent to monitoring val_acc in the PyTorch loop
early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                           patience=5, restore_best_weights=True)

X_np = tf.random.normal((200, 10)).numpy()
y_np = (X_np[:, 0] + X_np[:, 1] > 0).astype(int)

history = model.fit(X_np, y_np, epochs=50, batch_size=16,
                    validation_split=0.2, callbacks=[early_stop], verbose=0)
print(f"Best val accuracy: {max(history.history['val_accuracy']):.0%}")
```
TensorFlow's SavedModel format exports the trained model for serving: `model.save("./saved_model")` produces a directory deployable directly to TensorFlow Serving without any code changes.
For a full deep-dive on TensorFlow/Keras, a dedicated follow-up post is planned.
## Lessons from Building Neural Networks
Building neural networks teaches a set of hard-won lessons that no amount of theory fully prepares you for.
**Debug on a single batch first.** Before running a full training loop, verify your model can overfit a single mini-batch of 4–8 examples. If it cannot memorize a tiny dataset, something is fundamentally broken in the architecture, the loss function, or the data pipeline.
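A sketch of that sanity check in PyTorch (random data and a small MLP stand in for a real pipeline here): if a few hundred optimizer steps cannot drive the loss on one fixed batch toward zero, stop and debug before training on the full dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(8, 10)                 # one fixed mini-batch
y = torch.randint(0, 2, (8,))

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

first_loss = None
for step in range(300):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    if first_loss is None:
        first_loss = loss.item()
    loss.backward()
    opt.step()

print(f"{first_loss:.3f} -> {loss.item():.4f}")  # should collapse toward zero
```

A healthy model memorizes these 8 examples easily; if the final loss plateaus near the initial value, inspect the data pipeline and loss wiring first.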
**Gradients are your feedback signal.** Monitoring gradient norms reveals instability before it manifests as exploding or vanishing loss. A healthy gradient norm stays roughly stable across training. A gradient that grows exponentially signals an unstable learning rate; one that collapses to zero signals dead neurons or a vanishing gradient.
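Computing the global gradient norm takes one line after `loss.backward()`. A minimal sketch with a random batch (the model and data are placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))

loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# Global L2 norm over every parameter's gradient
total_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"grad norm: {total_norm.item():.3f}")  # log this every N steps
```

Logged over time, this single scalar is often the earliest warning of an unstable learning rate (norm climbing) or dead units (norm collapsing).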
**Architecture first, hyperparameters second.** Start with the simplest possible architecture that could theoretically solve the problem. Add complexity only when you have evidence the simple model has hit a ceiling. Most beginners add layers when they should be checking data quality.
**Transfer learning beats training from scratch.** For most practical applications, fine-tuning a pre-trained model outperforms training a custom architecture from scratch with far less data and compute.
**Use the simplest architecture that works.** Practitioners consistently underestimate how far a two-layer MLP with good preprocessing can go. Complexity has a real cost: harder debugging, longer training, and greater risk of overfitting on small datasets. Add layers only when experiments confirm a simpler model has reached its ceiling. Always benchmark against well-tuned linear baselines before scaling up the model architecture.
## TLDR: Summary & Key Takeaways

- A neural network is a composition of simple linear transformations and non-linear activations.
- Training = forward pass → loss → backward pass → weight update, repeated until the model is good enough.
- ReLU is the default activation; softmax turns raw logits into probabilities for classification.
- Common failures (vanishing gradients, overfitting, unstable learning) each have well-known remedies.
- Start small: verify the training loop works on a tiny batch before scaling up.
## Practice Quiz

**Why can a single perceptron only solve linearly separable problems?**
- A) It has no activation function
- B) It can only draw a single straight decision boundary
- C) It does not use backpropagation

Correct Answer: B. A single neuron applies a linear transformation followed by an activation threshold, which produces a hyperplane (a straight line in 2D) as its decision boundary. Non-linear patterns like XOR require at least one hidden layer.
**What does `loss.backward()` compute in PyTorch?**
- A) It updates the weights directly
- B) It computes gradients of the loss with respect to all parameters
- C) It resets the optimizer state

Correct Answer: B. `loss.backward()` runs the reverse-mode automatic differentiation pass, computing the gradient of the scalar loss with respect to every parameter in the model via the chain rule.

**Why is ReLU preferred over sigmoid in hidden layers?**
- A) ReLU is always faster
- B) Sigmoid output is always negative
- C) ReLU alleviates the vanishing gradient problem in deep networks
Correct Answer: C. Sigmoid saturates near 0 and 1, causing gradients to vanish in deep networks. ReLU passes gradients unchanged for positive inputs, allowing learning signals to propagate effectively through many layers.
## Related Posts
- Deep Learning Architectures: CNNs, RNNs, and Transformers
- Machine Learning Fundamentals
- Supervised Learning Algorithms

Written by Abstract Algorithms (@abstractalgorithms)