Deep Learning Architectures: CNNs, RNNs, and Transformers
Choosing the right deep learning architecture matters more than model size. Learn when to use CNNs, RNNs, and Transformers.
Abstract Algorithms
TLDR: CNNs, RNNs, and Transformers solve different kinds of pattern problems. CNNs are great for spatial data like images, RNNs handle ordered sequences, and Transformers shine when long-range context matters. Choosing the right architecture often matters more than adding more layers.
The Right Tool for the Job: Why Architecture Choice Matters
At ImageNet 2012, AlexNet won the image recognition challenge with a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points. The team hadn't built better features by hand; they'd trained deep convolutional layers to learn the features automatically from raw pixels. That single result launched the deep learning era and showed that architecture design can matter more than manual feature engineering.
A deep learning architecture is the structural blueprint of a neural network: how data flows through it, what operations repeat, and what kinds of patterns the model learns best. Pick the wrong architecture and you pay in training time, inference cost, and accuracy, even with enormous compute.
Think of the three main families as specialized tools:
- CNN: a camera lens that detects local visual patterns, building up from edges to shapes to objects.
- RNN: a timeline tracker that processes sequences step by step, updating a hidden memory.
- Transformer: a context graph where every token can directly attend to every other token.
| Architecture | Best for | Core intuition |
| --- | --- | --- |
| CNN | Images, grids, local spatial patterns | Learn local features, then compose globally |
| RNN / LSTM / GRU | Time-ordered streams, short-to-medium sequences | Update hidden memory one step at a time |
| Transformer | Long-context text, code, multimodal | Attention relates any two positions directly |
Data Shape Drives Architecture Selection
The single most important rule: match architecture to data structure.
| Data shape | Typical example | Good default |
| --- | --- | --- |
| 2D / 3D grid | Image, video frame | CNN |
| Short ordered stream | Sensor telemetry, short text | RNN / LSTM |
| Long-context sequence | Document, code file, conversation | Transformer |
Practical planning order
- Pick the architecture family that matches your data shape.
- Choose model size within that family.
- Tune the training loop last.
A larger model in the wrong family still underperforms a smaller model with the right inductive bias. Do not skip step 1.
CNN, RNN, and Transformer Internals Side by Side
Every architecture follows the same high-level loop: input tensor → architecture-specific blocks → prediction head → loss → backpropagation → weight update.

```mermaid
flowchart TD
    A[Input Tensor] --> B[Architecture Block]
    B --> C[Prediction Head]
    C --> D[Loss Function]
    D --> E[Backpropagation]
    E --> F[Updated Weights]
    F --> B
```
CNN block: small filters slide across a spatial grid. Early layers detect edges; later layers detect shapes and objects. Weight sharing across positions gives CNNs parameter efficiency on image data.
CNN Layer Stack: From Pixels to Predictions

```mermaid
flowchart LR
    A[Input Image] --> B[Conv Layer]
    B --> C[ReLU]
    C --> D[Pooling Layer]
    D --> E[Conv Layer 2]
    E --> F[Pooling Layer 2]
    F --> G[Flatten]
    G --> H[Dense Layer]
    H --> I[Softmax Output]
```

Each convolutional layer detects increasingly abstract features: edges at layer 1, shapes at layer 2, objects at deeper layers. Pooling reduces spatial size while preserving the dominant activations.
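The edge-to-shape intuition above can be sketched numerically. Below is a minimal, illustrative convolution and max-pooling pass in plain NumPy; the `conv2d` and `max_pool` helpers are ad-hoc teaching code (not a library API), and the hand-set vertical-edge kernel lights up exactly where image brightness changes.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in deep learning libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the strongest activation per window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A vertical-edge detector applied to an image with one vertical edge
image = np.zeros((6, 6))
image[:, 3:] = 1.0                 # right half bright, left half dark
kernel = np.array([[-1., 1.],      # responds where brightness rises left-to-right
                   [-1., 1.]])
feature_map = conv2d(image, kernel)  # shape (5, 5); peaks along the edge column
pooled = max_pool(feature_map)       # shape (2, 2); dominant activations survive
```

The same weight-sharing idea scales up in real CNNs: one small kernel is reused at every spatial position instead of learning a separate weight per pixel.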
RNN block: each time step takes in a new element, updates a hidden state $h_t = f(W_h h_{t-1} + W_x x_t + b)$, and optionally emits an output. Sequential execution is the price for temporal order.
RNN Unrolled Through Time Steps

```mermaid
flowchart LR
    A[x_t-1] --> B[RNN Cell t-1]
    B --> C[h_t-1]
    C --> D[RNN Cell t]
    E[x_t] --> D
    D --> F[h_t]
    F --> G[RNN Cell t+1]
    H[x_t+1] --> G
    G --> I[Output]
```

The hidden state h carries a compressed memory of all prior inputs, but it's a fixed-size vector, so long-range context eventually gets diluted. LSTMs and GRUs add gating to partially address this.
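The recurrence above is only a few lines of code. Here is a hedged NumPy sketch of a vanilla RNN step with randomly initialized weights (all names are illustrative, not from any library); note that the memory `h` stays the same size no matter how long the sequence grows.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, features = 4, 3
W_h = rng.normal(scale=0.1, size=(hidden, hidden))    # recurrent weights
W_x = rng.normal(scale=0.1, size=(hidden, features))  # input weights
b = np.zeros(hidden)

def rnn_step(h_prev, x_t):
    """One recurrence: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(hidden)                      # fixed-size memory, regardless of sequence length
sequence = rng.normal(size=(10, features))
for x_t in sequence:                      # strictly sequential: step t needs step t-1
    h = rnn_step(h, x_t)
```

The loop is the parallelism bottleneck the comparison table below calls out: step t cannot start until step t-1 has finished.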
Transformer block: multi-head self-attention computes a weighted sum over all positions simultaneously, so long-range dependencies cost no more than short ones, but memory scales as $O(n^2)$ in sequence length.
| Dimension | CNN | RNN | Transformer |
| --- | --- | --- | --- |
| Parallelism during training | High | Low (sequential) | High |
| Long-range context | Moderate | Moderate | Strong |
| Memory at long sequences | Efficient | Efficient per step | Can be heavy |
| Typical training stability | Strong | Unstable on long seqs | Strong with correct setup |
Deep Dive: How Attention Replaces Recurrence
The key Transformer innovation is scaled dot-product attention: each token computes query, key, and value vectors. Attention scores are dot products of queries against all keys, scaled by √d_k so that large dot products don't push the softmax into regions with vanishing gradients, then softmaxed into weights that blend the value vectors. This lets the model attend to any position in one step, with no sequential bottleneck.
| Step | Operation | Purpose |
| --- | --- | --- |
| Q, K, V projection | Linear transform | Encode what to ask, match, and return |
| QKᵀ / √d_k | Dot product + scale | Score token compatibility |
| Softmax | Normalize scores | Produce attention weights |
| Weighted sum of V | Aggregation | Build context-aware token representation |
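The four steps in the table above map almost line-for-line onto code. Below is a minimal single-head, NumPy-only sketch of scaled dot-product attention (no mask, no multi-head split, no learned projections); the `(n, n)` score matrix is where the quadratic cost comes from.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise compatibility: the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax, rows sum to 1
    return weights @ V, weights       # context-aware blend of value vectors

rng = np.random.default_rng(0)
n, d = 5, 8                           # 5 tokens, dimension 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, weights = attention(Q, K, V)     # out: (5, 8), weights: (5, 5)
```

In a real Transformer, Q, K, and V come from learned linear projections of the token embeddings, and several such heads run in parallel before being concatenated.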
Transformer Encoder: From Tokens to Contextual Output

```mermaid
flowchart TD
    A[Input Tokens] --> B[Token Embedding]
    B --> C[Positional Encoding]
    C --> D[Multi-Head Attention]
    D --> E["Add & Norm"]
    E --> F[Feed Forward]
    F --> G["Add & Norm"]
    G --> H[Repeat N Layers]
    H --> I[Output Projection]
    I --> J[Softmax]
```
Each encoder layer refines token representations by mixing context from all other tokens via attention, then applying a position-wise feed-forward transformation. The Add & Norm steps stabilize training.
Internals
CNNs use shared-weight kernels that slide across spatial dimensions, reducing parameters from O(n²) to O(k²·c_in·c_out). RNNs maintain a hidden state h_t = tanh(W_h·h_{t-1} + W_x·x_t), whose fixed size creates a bottleneck for long sequences. Transformers replace recurrence with self-attention: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, allowing all positions to attend to each other in parallel.
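The O(k²·c_in·c_out) claim is easy to sanity-check with arithmetic. The numbers below are an illustrative comparison for one hypothetical layer, not a benchmark:

```python
# Parameter count for one layer mapping a 64x64, 3-channel input to 64 output channels.
k, c_in, c_out, h, w = 3, 3, 64, 64, 64

# Convolution: one shared k x k kernel per (input, output) channel pair, plus biases.
conv_params = k * k * c_in * c_out + c_out                          # 1,792

# A fully connected layer over the same tensor shapes, for contrast.
dense_params = (h * w * c_in) * (h * w * c_out) + (h * w * c_out)   # ~3.2 billion

print(f"conv: {conv_params:,}  dense: {dense_params:,}")
```

Six orders of magnitude separate the two: this is why weight sharing makes CNNs practical on image data.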
Performance Analysis
A ResNet-50 CNN reaches roughly 76% top-1 accuracy on ImageNet; a full training run takes on the order of days on a single V100, or hours across a multi-GPU node. LSTM training on long sequences (>500 tokens) can be 5–10× slower than a comparable Transformer because of the sequential dependency. Fine-tuning a small pretrained Transformer such as GPT-2 (117M parameters) for classification often takes minutes on a modern GPU, versus days or weeks for pretraining from scratch.
Choosing an Architecture: A Decision Flow

```mermaid
graph LR
    A[Raw Data] --> B{Data type?}
    B -->|Image or grid| C[CNN Pipeline]
    B -->|Short ordered stream| D[RNN / LSTM / GRU]
    B -->|Long-context sequence| E[Transformer]
    C --> F[Prediction]
    D --> F
    E --> F
```

This flow operationalizes the core rule of this post (match architecture to data shape) into a single branching decision. Starting from Raw Data, the data-type question routes you to a CNN for spatial grids, an RNN/LSTM/GRU for time-ordered streams, or a Transformer for long-context sequences. All three paths converge at Prediction, reinforcing that the architecture choice changes the journey, not the destination.
| Condition | Start here |
| --- | --- |
| Local spatial patterns dominate | CNN |
| Strict time-order and moderate context window | RNN / LSTM |
| Long-range dependencies dominate | Transformer |
| Low data and need fast iteration | Transfer learning (any family) |
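As a sketch, the routing above fits in one small function. The category labels here are made up for illustration; adapt them to your own data taxonomy:

```python
def pick_architecture(data_shape: str) -> str:
    """Map a data shape to a default family, mirroring the decision flow above."""
    defaults = {
        "grid": "CNN",                   # images, video frames
        "short_stream": "RNN/LSTM/GRU",  # telemetry, short ordered text
        "long_sequence": "Transformer",  # documents, code, conversations
    }
    # Unknown shapes: start with transfer learning and compare small baselines.
    return defaults.get(data_shape, "transfer learning + baseline comparison")

print(pick_architecture("grid"))           # CNN
print(pick_architecture("long_sequence"))  # Transformer
```

The point is not the lookup table itself but that the first branch of your decision should be data shape, not model fashion.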
Real-World Applications: Where Each Architecture Lives in Production
CNNs: defect detection in manufacturing, medical imaging triage, OCR preprocessing, real-time video classification on edge devices.
RNNs: forecasting and streaming anomaly detection in telemetry pipelines; still competitive on constrained hardware where Transformer overhead is prohibitive.
Transformers: search ranking, document understanding, code completion, multilingual NLP, any task requiring understanding of long-form text.
Hybrid pipelines are common. A support-ticket router might use a Transformer encoder for semantic features and a lightweight recurrent branch for short metadata sequences, combining both views for the final routing decision.
Trade-offs and Failure Modes
| Failure mode | Symptom | Typical cause | Mitigation |
| --- | --- | --- | --- |
| Overfitting | Strong train, poor test metrics | Model too complex for data volume | Regularization + augmentation |
| Underfitting | Weak train and test | Architecture too weak | Increase capacity or switch family |
| Gradient instability | Spiky / diverging loss | Poor optimizer settings, long sequences | LR tuning + gradient clipping |
| Context truncation | Missing key information | Sequence exceeds model budget | Chunking + retrieval / context windows |
The accuracy-latency trade-off is real. A larger Transformer may win on benchmark scores while violating your inference SLA. Measure latency early, not after model quality peaks.
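Measuring latency early can be as simple as a `perf_counter` loop. This is a framework-agnostic sketch: `predict` and `batch` are placeholders for your own model callable and input, and the warm-up count is an assumption you should tune:

```python
import time

def p95_latency_ms(predict, batch, warmup=10, runs=100):
    """Measure p95 wall-clock latency of a prediction callable, in milliseconds."""
    for _ in range(warmup):              # discard warm-up runs (caches, JIT, allocator)
        predict(batch)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[int(0.95 * len(samples)) - 1]

# Trivial stand-in "model" just to show the call shape:
latency = p95_latency_ms(lambda x: sum(x), list(range(1000)))
```

Report a tail percentile rather than the mean: SLAs are usually violated by the slowest requests, not the average one.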
Infrastructure constraints often decide architecture more than benchmark scores:
| Constraint | Implication |
| --- | --- |
| Edge device memory limits | Favor compact CNN or small recurrent model |
| Real-time streaming | Recurrent or temporal CNN pipelines |
| Long documents server-side | Transformer with context management |
| Limited labeled data | Transfer learning before scaling |
Decision Guide: Picking the Right Deep Learning Architecture
- Choose CNN when your data has local spatial structure (images, audio spectrograms, fixed-length feature grids).
- Choose RNN/LSTM when sequence order matters and sequences are short to medium length (time series, speech with alignment constraints).
- Choose Transformer for most NLP tasks, long sequences, or when pre-trained weights (BERT, GPT) give a head start.
- Skip deep learning when you have fewer than a few thousand labeled examples; gradient-boosted trees or logistic regression will likely win.
Three Quick Experiments to Validate Architecture Selection
These three experiments turn the architecture-selection framework from the decision flow above into concrete validation steps: a CNN baseline for spatial image data, a Transformer baseline for long-text classification, and a systematic comparison playbook for when the right family isn't obvious. The most common mistake practitioners make is committing to an architecture based on its reputation rather than evidence; this section gives you a repeatable process that lets the data decide. In Experiment 3, focus on the three-way evaluation criteria (validation quality, latency, and serving cost): optimizing for benchmark score alone is exactly the failure mode this playbook is designed to prevent.
Experiment 1: CNN baseline for image classification
```python
if task_type == "image":
    model = build_cnn_backbone()  # placeholder: 3 conv blocks, softmax head
    # Evaluate: accuracy + confusion matrix
```
Experiment 2: Transformer for ticket triage
```python
elif task_type == "long_text":
    model = build_transformer_encoder()  # placeholder: pretrained + fine-tuned head
    # Track: macro-F1 across imbalanced classes
```
Experiment 3: Architecture playbook
When unsure which family to use:
- Build one small baseline per candidate architecture.
- Hold the training budget constant for a fair comparison.
- Compare validation quality and latency and serving cost.
- Select on the full objective, not benchmark score alone.
This prevents the common mistake of choosing the most "modern" architecture when a smaller, simpler model is equally accurate and far cheaper to operate.
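The playbook above can be wired into a small harness. Everything here is hypothetical scaffolding: `train_and_eval`, the metric names, and the penalty weights in the score are placeholders you would replace with your own objective:

```python
def select_architecture(candidates, train_and_eval, max_epochs=5):
    """Return the best (score, name, metrics) under a shared training budget."""
    results = []
    for name, build_fn in candidates:
        metrics = train_and_eval(build_fn(), max_epochs=max_epochs)  # same budget for all
        # Full objective: quality minus penalties for latency and serving cost.
        score = (metrics["val_f1"]
                 - 0.1 * metrics["p95_ms"] / 100
                 - 0.05 * metrics["cost_unit"])
        results.append((score, name, metrics))
    results.sort(reverse=True)
    return results[0]

# Stub run showing the shape of the comparison (fake metrics, stand-in builders):
fake_eval = lambda model, max_epochs: {"val_f1": 0.8, "p95_ms": 50, "cost_unit": 1.0}
best = select_architecture([("cnn", dict), ("transformer", dict)], fake_eval)
```

The penalty weights encode your SLA and budget; changing them can legitimately flip the winner, which is the whole point of selecting on the full objective.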
PyTorch: Implementing All Three Architectures with One Consistent API
PyTorch is an open-source deep learning framework, originally developed by Meta AI and now governed by the PyTorch Foundation, that provides nn.Conv2d, nn.LSTM, and nn.MultiheadAttention as first-class building blocks, letting you implement CNNs, RNNs, and Transformers with a single consistent training loop.
The key teaching point of this post (match architecture to data shape) maps directly to which module you instantiate in PyTorch:
```python
import torch
import torch.nn as nn

# --- CNN: spatial data (image classification) ---
class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # spatial downsampling
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4)
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):  # x: (B, 1, 28, 28)
        return self.classifier(self.features(x).flatten(1))

# --- RNN: ordered streams (time-series) ---
class SimpleRNN(nn.Module):
    def __init__(self, input_size=1, hidden=64, num_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (B, seq_len, features)
        _, (h_n, _) = self.rnn(x)
        return self.head(h_n[-1])

# --- Transformer: long-context sequences ---
class SimpleTransformer(nn.Module):
    def __init__(self, vocab=1000, d_model=128, nhead=4, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):  # tokens: (B, seq_len)
        x = self.embed(tokens)
        x = self.encoder(x)
        return self.head(x[:, 0, :])  # first-token (CLS-style) pooling

# Training loop is identical for all three:
for model, name in [(SimpleCNN(10), "CNN"), (SimpleRNN(), "RNN"), (SimpleTransformer(), "Transformer")]:
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    print(f"{name} parameters: {sum(p.numel() for p in model.parameters()):,}")
```
All three share the same optimizer.zero_grad() → loss.backward() → optimizer.step() loop; the architecture only determines what happens in forward().
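Assuming PyTorch is installed, one iteration of that shared loop looks like this. A bare nn.Linear stands in for any of the three architectures above, purely to keep the sketch short:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                  # stand-in for SimpleCNN / SimpleRNN / SimpleTransformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 8)                   # batch of 16 examples, 8 features each
y = torch.randint(0, 2, (16,))           # integer class labels

optimizer.zero_grad()                    # 1. clear old gradients
loss = loss_fn(model(x), y)              # 2. forward pass + loss
loss.backward()                          # 3. backpropagation
optimizer.step()                         # 4. weight update
```

Swapping the model swaps the forward pass; the four-step loop never changes.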
For a full deep-dive on PyTorch, a dedicated follow-up post is planned.
Keras: High-Level Architecture Prototyping
Keras (bundled with TensorFlow as tf.keras and available standalone via the keras package) provides high-level Sequential and Functional APIs that let you build and swap CNN, RNN, and Transformer architectures in minutes: ideal for the rapid baseline comparison strategy recommended in this post.
```python
import keras
from keras import layers

# CNN baseline
cnn = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
], name="cnn_baseline")

# Transformer baseline (using built-in MultiHeadAttention)
inputs = keras.Input(shape=(128,), dtype="int32")
x = layers.Embedding(input_dim=5000, output_dim=64)(inputs)
x = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
transformer = keras.Model(inputs, outputs, name="transformer_baseline")

# Both compile identically: swap the model variable to compare
for model in [cnn, transformer]:
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    print(model.name, "->", model.count_params(), "params")
```

Keras's model.summary() and keras.utils.plot_model() make architecture comparison visual and auditable, directly supporting the three-experiment validation strategy from the practical section.
For a full deep-dive on Keras, a dedicated follow-up post is planned.
What to Learn Next
Start with the baseline that fits your data shape, measure rigorously across validation quality, inference latency, and serving cost, and only move to more complex architectures when simpler ones demonstrably fall short.
- Neural Networks Explained: From Neurons to Deep Learning
- Supervised Learning Algorithms: Regression and Classification
- Large Language Models: The Generative AI Revolution
TLDR: Summary & Key Takeaways
- CNN, RNN, and Transformer families solve different pattern-learning problems; data shape drives the choice.
- Matching architecture to data structure reduces both training time and operational risk.
- Benchmark scores alone are not enough: measure latency and serving cost before committing.
- Document architecture assumptions (data shape, latency budget, model size cap) so future contributors can reason about trade-offs without guessing history.
Practice Quiz
Which architecture is usually the best first choice for image classification?
A) RNN B) CNN C) Naive Bayes D) Linear regression
Correct Answer: B. CNNs use convolutional filters to detect local spatial patterns (edges, textures, shapes), making them the strongest default for grid-based image tasks.
Why can Transformers be expensive on very long sequences?
A) They do not support batching B) Attention computation and memory can grow with sequence length C) They cannot use GPUs D) They require labeled data
Correct Answer: B. Self-attention computes pairwise token interactions, so memory and compute can scale with the square of sequence length in naive implementations.
When does an RNN remain a practical choice over a Transformer?
A) For all tasks regardless of data shape B) For strict ordered streams with tight latency and memory budgets C) Only for image segmentation D) Never; Transformers are always better
Correct Answer: B. RNNs process inputs one step at a time with bounded per-step memory, making them efficient for streaming data on constrained hardware.
You have limited labeled data and unstable training loss. What is the best first step?
A) Scale model size 10x immediately B) Apply transfer learning and regularization on a strong baseline C) Remove the validation split D) Switch to a different deep learning framework
Correct Answer: B. Transfer learning reuses pretrained representations to reduce data requirements; regularization (dropout, weight decay) helps prevent overfitting on small datasets.
- Open-ended challenge: You are building a system to classify medical images AND generate diagnostic reports from them. Design an architecture that combines CNN and Transformer components: what would each part handle, and how would you connect them?