
Deep Learning Architectures: CNNs, RNNs, and Transformers

Choosing the right deep learning architecture matters more than model size. Learn when to use CNNs, RNNs, and Transformers.

Abstract Algorithms · 13 min read

TLDR: CNNs, RNNs, and Transformers solve different kinds of pattern problems. CNNs are great for spatial data like images, RNNs handle ordered sequences, and Transformers shine when long-range context matters. Choosing the right architecture often matters more than adding more layers.


📖 The Right Tool for the Job: Why Architecture Choice Matters

At ImageNet 2012, AlexNet won the image recognition challenge with roughly 85% top-5 accuracy (a 15.3% top-5 error rate), beating the best hand-engineered computer vision systems by more than 10 percentage points. The team hadn't built better features by hand; they'd trained deep convolutional layers to learn the features automatically from raw pixels. That single result launched the deep learning era and showed that architecture design can matter more than manual feature engineering.

A deep learning architecture is the structural blueprint of a neural network — how data flows through it, what operations repeat, and what kinds of patterns the model learns best. Pick the wrong architecture and you pay in training time, inference cost, and accuracy, even with enormous compute.

Think of the three main families as specialized tools:

  • CNN — a camera lens that detects local visual patterns, building up from edges to shapes to objects.
  • RNN — a timeline tracker that processes sequences step by step, updating a hidden memory.
  • Transformer — a context graph where every token can directly attend to every other token.
| Architecture | Best for | Core intuition |
|---|---|---|
| CNN | Images, grids, local spatial patterns | Learn local features, then compose globally |
| RNN / LSTM / GRU | Time-ordered streams, short-to-medium sequences | Update hidden memory one step at a time |
| Transformer | Long-context text, code, multimodal | Attention relates any two positions directly |

πŸ” Data Shape Drives Architecture Selection

The single most important rule: match architecture to data structure.

| Data shape | Typical example | Good default |
|---|---|---|
| 2D / 3D grid | Image, video frame | CNN |
| Short ordered stream | Sensor telemetry, short text | RNN / LSTM |
| Long-context sequence | Document, code file, conversation | Transformer |

Practical planning order

  1. Pick the architecture family that matches your data shape.
  2. Choose model size within that family.
  3. Tune the training loop last.

A larger model in the wrong family still underperforms a smaller model with the right inductive bias. Do not skip step 1.
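The planning order above starts with a simple mapping from data shape to architecture family. A minimal sketch of that step-1 lookup; the shape labels and the fallback string are illustrative, not a real API:

```python
# Hypothetical helper for step 1 of the planning order: pick the
# architecture family from the data shape, before sizing anything.
def default_family(data_shape: str) -> str:
    """Return a starting architecture family for a given data shape."""
    table = {
        "grid": "CNN",                    # images, video frames
        "short_sequence": "RNN/LSTM",     # telemetry, short text
        "long_sequence": "Transformer",   # documents, code, conversations
    }
    # Shapes outside the three families fall back to a non-deep baseline.
    return table.get(data_shape, "baseline (non-deep) model")

print(default_family("grid"))           # CNN
print(default_family("long_sequence"))  # Transformer
print(default_family("tabular"))        # baseline (non-deep) model
```

Only after this lookup do model size (step 2) and training-loop tuning (step 3) enter the picture.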


βš™οΈ CNN, RNN, and Transformer Internals Side by Side

Every architecture follows the same high-level loop: Input tensor → architecture-specific blocks → prediction head → loss → backpropagation → weight update.

flowchart TD
    A[Input Tensor] --> B[Architecture Block]
    B --> C[Prediction Head]
    C --> D[Loss Function]
    D --> E[Backpropagation]
    E --> F[Updated Weights]
    F --> B

CNN block: small filters slide across a spatial grid. Early layers detect edges; later layers detect shapes and objects. Weight sharing across positions gives CNNs parameter efficiency on image data.
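The sliding-filter idea can be sketched without any framework. In this toy example (the 5×5 image and the classic Sobel-style kernel are chosen purely for illustration), the same 3×3 kernel is reused at every position, which is exactly the weight sharing described above, and it responds most strongly where pixel intensity changes:

```python
# Naive 2D convolution: no padding, no stride, no framework.
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every position:
            # this is the weight sharing that makes CNNs parameter-efficient.
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

# Toy image: dark left columns (0), bright right columns (1) -> a vertical edge.
img = [[0, 0, 1, 1, 1] for _ in range(5)]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # vertical-edge detector
response = conv2d(img, sobel_x)
print(response[0])  # [4, 4, 0]: strong response at the edge, zero on flat regions
```

A trained conv layer learns many such kernels at once; here one hand-picked kernel is enough to show the local-pattern mechanic.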

📊 CNN Layer Stack: From Pixels to Predictions

flowchart LR
    A[Input Image] --> B[Conv Layer]
    B --> C[ReLU]
    C --> D[Pooling Layer]
    D --> E[Conv Layer 2]
    E --> F[Pooling Layer 2]
    F --> G[Flatten]
    G --> H[Dense Layer]
    H --> I[Softmax Output]

Each convolutional layer detects increasingly abstract features — edges at layer 1, shapes at layer 2, objects at deeper layers. Pooling reduces spatial size while preserving dominant activations.

RNN block: at each time step the cell consumes a new element, updates a hidden state $h_t = f(W_h h_{t-1} + W_x x_t + b)$, and optionally emits an output. Sequential execution is the price for temporal order.

📊 RNN Unrolled Through Time Steps

flowchart LR
    A[x_t-1] --> B[RNN Cell t-1]
    B --> C[h_t-1]
    C --> D[RNN Cell t]
    E[x_t] --> D
    D --> F[h_t]
    F --> G[RNN Cell t+1]
    H[x_t+1] --> G
    G --> I[Output]

The hidden state h carries a compressed memory of all prior inputs — but it's a fixed-size vector, so long-range context eventually gets diluted. LSTMs and GRUs add gating to partially address this.
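A minimal single-unit sketch of the update rule makes that dilution visible. The scalar weights stand in for the matrices W_h and W_x, and the values are chosen purely for illustration: a one-off input spike fades from the hidden state within a few steps.

```python
import math

# One RNN cell with scalar weights: h_t = tanh(w_h * h_prev + w_x * x_t + b).
def rnn_step(h_prev, x_t, w_h=0.5, w_x=1.0, b=0.0):
    return math.tanh(w_h * h_prev + w_x * x_t + b)

h = 0.0
history = []
for x_t in [1.0, 0.0, 0.0, 0.0]:  # a single spike, then silence
    h = rnn_step(h, x_t)
    history.append(round(h, 4))

# The spike's influence shrinks at every step: the fixed-size state
# dilutes old context, which is the long-range weakness gating mitigates.
print(history)
```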

Transformer block: multi-head self-attention computes a weighted sum over all positions simultaneously, so long-range dependencies cost no more than short ones — but memory scales as $O(n^2)$ in sequence length.

| Dimension | CNN | RNN | Transformer |
|---|---|---|---|
| Parallelism during training | High | Low (sequential) | High |
| Long-range context | Moderate | Moderate | Strong |
| Memory at long sequences | Efficient | Efficient per step | Can be heavy |
| Typical training stability | Strong | Unstable on long sequences | Strong with correct setup |
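"Can be heavy" is easy to quantify with back-of-envelope arithmetic. This sketch assumes fp32 activations and GPT-2-small-like head and layer counts (12 heads, 12 layers, chosen only for illustration) and counts just the naive n × n attention matrices:

```python
# Memory for the raw n x n attention matrices: one per head per layer, fp32.
def attn_matrix_mb(n_tokens, n_heads=12, n_layers=12, bytes_per=4):
    return n_tokens * n_tokens * n_heads * n_layers * bytes_per / 1e6

for n in (512, 2048, 8192):
    print(n, round(attn_matrix_mb(n), 1), "MB")
# Quadrupling the sequence length multiplies this memory by 16 (O(n^2)).
```

Real implementations use tricks (fused kernels, attention variants) to reduce this, but the quadratic growth is why long contexts are the Transformer's main cost.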

🧠 Deep Dive: How Attention Replaces Recurrence

The key Transformer innovation is scaled dot-product attention: each token computes query, key, and value vectors. Attention scores are dot products of queries against all keys, scaled by √d_k to keep large dot products from saturating the softmax (which would shrink gradients), then softmaxed into weights that blend the value vectors. This lets the model attend to any position in a single step, with no sequential bottleneck.

| Step | Operation | Purpose |
|---|---|---|
| Q, K, V projection | Linear transform | Encode what to ask, match, and return |
| QKᵀ / √d_k | Dot product + scale | Score token compatibility |
| Softmax | Normalize scores | Produce attention weights |
| Weighted sum of V | Aggregation | Build context-aware token representation |
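The four steps in the table translate almost line-for-line into code. A minimal NumPy sketch, assuming toy 2-token inputs with d_k = 4 and skipping the learned Q/K/V projections (a real layer learns those, plus multiple heads):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # dot product + scale
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(2, 4))
V = rng.normal(size=(2, 4))
out = attention(Q, K, V)
print(out.shape)  # (2, 4): one context-aware vector per token
```

Every token's output mixes all value vectors at once, which is the "no sequential bottleneck" property recurrence lacks.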

📊 Transformer Encoder: From Tokens to Contextual Output

flowchart TD
    A[Input Tokens] --> B[Token Embedding]
    B --> C[Positional Encoding]
    C --> D[Multi-Head Attention]
    D --> E[Add & Norm]
    E --> F[Feed Forward]
    F --> G[Add & Norm]
    G --> H[Repeat N Layers]
    H --> I[Output Projection]
    I --> J[Softmax]

Each encoder layer refines token representations by mixing context from all other tokens via attention, then applying a position-wise feed-forward transformation. The Add & Norm steps stabilize training.

🔬 Internals

CNNs use shared weight kernels that slide across spatial dimensions, reducing parameters from O(n²) to O(k²·c_in·c_out). RNNs maintain a hidden state h_t = tanh(W_h·h_{t-1} + W_x·x_t), which creates a single bottleneck for long sequences. Transformers replace recurrence with self-attention: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, allowing all positions to attend to each other in parallel.
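A quick back-of-envelope calculation shows why that parameter reduction matters. The input size and channel counts here are illustrative, comparing a shared 3×3 kernel against a hypothetical fully connected mapping between the same feature maps:

```python
# Illustrative sizes: 224x224 feature maps, 64 input / 128 output channels.
h = w = 224
c_in, c_out, k = 64, 128, 3

# Fully connected mapping between the two feature maps: no weight sharing.
dense_params = (h * w * c_in) * (h * w * c_out)
# One shared k x k kernel per (input channel, output channel) pair.
conv_params = k * k * c_in * c_out

print(f"dense: {dense_params:,}")   # tens of trillions of weights
print(f"conv : {conv_params:,}")    # 73,728 shared weights
```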

⚡ Performance Analysis

A ResNet-50 CNN reaches roughly 76% top-1 accuracy on ImageNet; training takes on the order of days on a single V100, though large multi-GPU clusters have brought it down to about an hour. LSTM training on long sequences (over roughly 500 tokens) is often 5–10× slower than comparable Transformer training because of the sequential dependency. GPT-2 (117M parameters) can be fine-tuned for many classification tasks in minutes on an A100, versus days or longer for pretraining from scratch.

📊 Choosing an Architecture: A Decision Flow

graph LR
    A[Raw Data] --> B{Data type?}
    B -->|Image or grid| C[CNN Pipeline]
    B -->|Short ordered stream| D[RNN / LSTM / GRU]
    B -->|Long-context sequence| E[Transformer]
    C --> F[Prediction]
    D --> F
    E --> F

This flow operationalizes the core rule from this post — match architecture to data shape — into a single branching decision. Starting from Raw Data, the data-type question routes you to CNN for spatial grids, RNN/LSTM/GRU for time-ordered streams, or Transformer for long-context sequences. All three paths converge at Prediction, reinforcing that the architecture choice changes the journey, not the destination.

| Condition | Start here |
|---|---|
| Local spatial patterns dominate | CNN |
| Strict time-order and moderate context window | RNN / LSTM |
| Long-range dependencies dominate | Transformer |
| Low data and need fast iteration | Transfer learning (any family) |

🌍 Real-World Applications: Where Each Architecture Lives in Production

CNNs: defect detection in manufacturing, medical imaging triage, OCR preprocessing, real-time video classification on edge devices.

RNNs: forecasting and streaming anomaly detection in telemetry pipelines; still competitive on constrained hardware where Transformer overhead is prohibitive.

Transformers: search ranking, document understanding, code completion, multilingual NLP, any task requiring understanding of long-form text.

Hybrid pipelines are common. A support-ticket router might use a Transformer encoder for semantic features and a lightweight recurrent branch for short metadata sequences — combining both views for the final routing decision.


βš–οΈ Trade-offs & Failure Modes: Trade-offs and Failure Modes

| Failure mode | Symptom | Typical cause | Mitigation |
|---|---|---|---|
| Overfitting | Strong train, poor test metrics | Model too complex for data volume | Regularization + augmentation |
| Underfitting | Weak train and test | Architecture too weak | Increase capacity or switch family |
| Gradient instability | Spiky / diverging loss | Poor optimizer settings, long sequences | LR tuning + gradient clipping |
| Context truncation | Missing key information | Sequence exceeds model budget | Chunking + retrieval / context windows |
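The gradient-instability mitigation (LR tuning plus gradient clipping) takes only a few lines in PyTorch. A sketch on a toy model; the model, random data, learning rate, and max_norm value are all illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)                 # toy stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # start low, tune up

x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their global L2 norm never exceeds max_norm;
# returns the norm as it was BEFORE clipping, useful for monitoring spikes.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print(f"pre-clip gradient norm: {float(total_norm):.3f}")
```

Logging the pre-clip norm over time is a cheap way to detect the spiky-loss failure mode before it diverges.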

The accuracy–latency trade-off is real. A larger Transformer may win on benchmark scores while violating your inference SLA. Measure latency early, not after model quality peaks.

Infrastructure constraints often decide architecture more than benchmark scores:

| Constraint | Implication |
|---|---|
| Edge device memory limits | Favor compact CNN or small recurrent model |
| Real-time streaming | Recurrent or temporal CNN pipelines |
| Long documents server-side | Transformer with context management |
| Limited labeled data | Transfer learning before scaling |

🧭 Decision Guide: Picking the Right Deep Learning Architecture

  • Choose CNN when your data has local spatial structure (images, audio spectrograms, fixed-length feature grids).
  • Choose RNN/LSTM when sequence order matters and sequences are short to medium length (time series, speech with alignment constraints).
  • Choose Transformer for most NLP tasks, long sequences, or when pre-trained weights (BERT, GPT) give a head start.
  • Skip deep learning when you have fewer than a few thousand labeled examples — gradient-boosted trees or logistic regression will likely win.

🧪 Three Quick Experiments to Validate Architecture Selection

These three experiments translate the architecture-selection framework from the decision flow above into concrete validation steps: a CNN baseline for spatial image data, a Transformer baseline for long-text classification, and a systematic comparison playbook for when the right family isn't obvious. They were chosen because the most common mistake practitioners make is committing to an architecture based on its reputation rather than evidence — this section gives you a repeatable process to let data decide. As you read Experiment 3, focus on the three-way evaluation criteria (validation quality, latency, and serving cost): optimizing for benchmark score alone is the failure mode this playbook is designed to prevent.

Experiment 1: CNN baseline for image classification

if task_type == "image":
    model = build_cnn_backbone()   # 3 conv blocks, softmax head
    # Evaluate: accuracy + confusion matrix

Experiment 2: Transformer for ticket triage

elif task_type == "long_text":
    model = build_transformer_encoder()  # pretrained + fine-tuned head
    # Track: macro-F1 across imbalanced classes

Experiment 3: Architecture playbook

When unsure which family to use:

  1. Build one small baseline per candidate architecture.
  2. Hold the training budget constant for a fair comparison.
  3. Compare validation quality and latency and serving cost.
  4. Select on the full objective — not benchmark score alone.

This prevents the common mistake of choosing the most "modern" architecture when a smaller, simpler model is equally accurate and far cheaper to operate.
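One way to make step 4 concrete is a single scoring function over the three criteria. This is a hypothetical sketch: the weights, the latency budget, and all candidate numbers are invented for illustration, and a real deployment would calibrate them against its own SLA and cost model.

```python
# Combine validation quality, latency, and serving cost into one objective.
def full_objective(val_f1, p95_latency_ms, cost_per_1k, latency_budget_ms=50):
    if p95_latency_ms > latency_budget_ms:
        return float("-inf")  # an SLA violation disqualifies outright
    # Illustrative weights: quality dominates, latency and cost penalize.
    return val_f1 - 0.001 * p95_latency_ms - 0.01 * cost_per_1k

# Made-up results from three equal-budget baselines (step 2 of the playbook).
results = {
    "cnn_small":   full_objective(0.88, 12, 0.4),
    "lstm_small":  full_objective(0.86, 20, 0.3),
    "transformer": full_objective(0.91, 80, 2.5),  # best F1, but misses the SLA
}
best = max(results, key=results.get)
print(best)  # cnn_small: the Transformer's benchmark win fails the full objective
```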


πŸ› οΈ PyTorch: Implementing All Three Architectures with One Consistent API

PyTorch is an open-source deep learning framework maintained by Meta AI that provides nn.Conv2d, nn.LSTM, and nn.MultiheadAttention as first-class building blocks — letting you implement CNNs, RNNs, and Transformers with a single consistent training loop.

The key teaching point from this post β€” match architecture to data shape β€” maps directly to which module you instantiate in PyTorch:

import torch
import torch.nn as nn

# --- CNN: spatial data (image classification) ---
class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # spatial downsampling
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4)
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):                        # x: (B, 1, 28, 28)
        return self.classifier(self.features(x).flatten(1))

# --- RNN: ordered streams (time-series) ---
class SimpleRNN(nn.Module):
    def __init__(self, input_size=1, hidden=64, num_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                        # x: (B, seq_len, features)
        _, (h_n, _) = self.rnn(x)
        return self.head(h_n[-1])

# --- Transformer: long-context sequences ---
class SimpleTransformer(nn.Module):
    def __init__(self, vocab=1000, d_model=128, nhead=4, num_classes=2):
        super().__init__()
        self.embed  = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):                   # tokens: (B, seq_len)
        x = self.embed(tokens)
        x = self.encoder(x)
        return self.head(x[:, 0, :])             # first-token pooling (no dedicated CLS token added here)

# Training loop is identical for all three:
for model, name in [(SimpleCNN(10), "CNN"), (SimpleRNN(), "RNN"), (SimpleTransformer(), "Transformer")]:
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    print(f"{name} parameters: {sum(p.numel() for p in model.parameters()):,}")

All three share the same optimizer.zero_grad() → loss.backward() → optimizer.step() loop — the architecture only determines what happens in forward().

For a full deep-dive on PyTorch, a dedicated follow-up post is planned.


πŸ› οΈ Keras: High-Level Architecture Prototyping

Keras (now bundled with TensorFlow as tf.keras and available standalone via keras) provides a high-level Sequential and Functional API that lets you build and swap CNN, RNN, and Transformer architectures in minutes — ideal for the rapid baseline comparison strategy recommended in this post.

import keras
from keras import layers

# CNN baseline
cnn = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
], name="cnn_baseline")

# Transformer baseline (using built-in MultiHeadAttention)
inputs = keras.Input(shape=(128,), dtype="int32")
x = layers.Embedding(input_dim=5000, output_dim=64)(inputs)
x = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
transformer = keras.Model(inputs, outputs, name="transformer_baseline")

# Both compile identically — swap model variable to compare
for model in [cnn, transformer]:
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    print(model.name, "→", model.count_params(), "params")

Keras's model.summary() and keras.utils.plot_model() make architecture comparison visual and auditable — directly supporting the three-experiment validation strategy from the practical section.

For a full deep-dive on Keras, a dedicated follow-up post is planned.


📚 What to Learn Next

Start with the baseline that fits your data shape, measure rigorously across validation quality, inference latency, and serving cost, and only move to more complex architectures when simpler ones demonstrably fall short.


📌 TLDR: Summary & Key Takeaways

  • CNN, RNN, and Transformer families solve different pattern-learning problems — data shape drives the choice.
  • Matching architecture to data structure reduces both training time and operational risk.
  • Benchmark scores alone are not enough: measure latency and serving cost before committing.
  • Document architecture assumptions (data shape, latency budget, model size cap) so future contributors can reason about trade-offs without guessing history.

πŸ“ Practice Quiz

  1. Which architecture is usually the best first choice for image classification?

    A) RNN B) CNN C) Naive Bayes D) Linear regression

    Correct Answer: B — CNNs use convolutional filters to detect local spatial patterns (edges, textures, shapes), making them the strongest default for grid-based image tasks.

  2. Why can Transformers be expensive on very long sequences?

    A) They do not support batching B) Attention computation and memory can grow with sequence length C) They cannot use GPUs D) They require labeled data

    Correct Answer: B — Self-attention computes pairwise token interactions, so memory and compute can scale with the square of sequence length in naive implementations.

  3. When does an RNN remain a practical choice over a Transformer?

    A) For all tasks regardless of data shape B) For strict ordered streams with tight latency and memory budgets C) Only for image segmentation D) Never — Transformers are always better

    Correct Answer: B — RNNs process inputs one step at a time with bounded per-step memory, making them efficient for streaming data on constrained hardware.

  4. You have limited labeled data and unstable training loss. What is the best first step?

    A) Scale model size 10x immediately B) Apply transfer learning and regularization on a strong baseline C) Remove the validation split D) Switch to a different deep learning framework

    Correct Answer: B — Transfer learning reuses pretrained representations to reduce data requirements; regularization (dropout, weight decay) helps prevent overfitting on small datasets.

  5. Open-ended challenge: You are building a system to classify medical images AND generate diagnostic reports from them. Design an architecture that combines CNN and Transformer components — what would each part handle, and how would you connect them?


Written by Abstract Algorithms (@abstractalgorithms)