Deep Learning Architectures: CNNs, RNNs, and Transformers
Choosing the right deep learning architecture matters more than model size. Learn when to use CNNs, RNNs, and Transformers.
Abstract Algorithms
TLDR: CNNs, RNNs, and Transformers solve different kinds of pattern problems. CNNs are great for spatial data like images, RNNs handle ordered sequences, and Transformers shine when long-range context matters. Choosing the right architecture often matters more than adding more layers.
The Right Tool for the Job: Why Architecture Choice Matters
At ImageNet 2012, AlexNet won the image recognition challenge with a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points. The team hadn't built better features by hand; they'd trained deep convolutional layers to learn the features automatically from raw pixels. That single result launched the deep learning era and showed that architecture design can matter more than manual feature engineering.
A deep learning architecture is the structural blueprint of a neural network: how data flows through it, what operations repeat, and what kinds of patterns the model learns best. Pick the wrong architecture and you pay in training time, inference cost, and accuracy, even with enormous compute.
Think of the three main families as specialized tools:
- CNN: a camera lens that detects local visual patterns, building up from edges to shapes to objects.
- RNN: a timeline tracker that processes sequences step by step, updating a hidden memory.
- Transformer: a context graph where every token can directly attend to every other token.
| Architecture | Best for | Core intuition |
| --- | --- | --- |
| CNN | Images, grids, local spatial patterns | Learn local features, then compose globally |
| RNN / LSTM / GRU | Time-ordered streams, short-to-medium sequences | Update hidden memory one step at a time |
| Transformer | Long-context text, code, multimodal | Attention relates any two positions directly |
Data Shape Drives Architecture Selection
The single most important rule: match architecture to data structure.
| Data shape | Typical example | Good default |
| --- | --- | --- |
| 2D / 3D grid | Image, video frame | CNN |
| Short ordered stream | Sensor telemetry, short text | RNN / LSTM |
| Long-context sequence | Document, code file, conversation | Transformer |
Practical planning order
- Pick the architecture family that matches your data shape.
- Choose model size within that family.
- Tune the training loop last.
A larger model in the wrong family still underperforms a smaller model with the right inductive bias. Do not skip step 1.
CNN, RNN, and Transformer Internals Side by Side
Every architecture follows the same high-level loop: input tensor → architecture-specific blocks → prediction head → loss → backpropagation → weight update.

```mermaid
flowchart TD
    A[Input Tensor] --> B[Architecture Block]
    B --> C[Prediction Head]
    C --> D[Loss Function]
    D --> E[Backpropagation]
    E --> F[Updated Weights]
    F --> B
```
CNN block: small filters slide across a spatial grid. Early layers detect edges; later layers detect shapes and objects. Weight sharing across positions gives CNNs parameter efficiency on image data.
CNN Layer Stack: From Pixels to Predictions

```mermaid
flowchart LR
    A[Input Image] --> B[Conv Layer]
    B --> C[ReLU]
    C --> D[Pooling Layer]
    D --> E[Conv Layer 2]
    E --> F[Pooling Layer 2]
    F --> G[Flatten]
    G --> H[Dense Layer]
    H --> I[Softmax Output]
```

Each convolutional layer detects increasingly abstract features: edges at layer 1, shapes at layer 2, objects at deeper layers. Pooling reduces spatial size while preserving the dominant activations.
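The edge-to-shape intuition above can be sketched numerically. Below is a minimal, illustrative convolution and max-pooling pass in plain NumPy; the `conv2d` and `max_pool` helpers are ad-hoc teaching code (not a library API), and the hand-set vertical-edge kernel lights up exactly where image brightness changes.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in deep learning libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the strongest activation per window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A vertical-edge detector applied to an image with one vertical edge
image = np.zeros((6, 6))
image[:, 3:] = 1.0                 # right half bright, left half dark
kernel = np.array([[-1., 1.],      # responds where brightness rises left-to-right
                   [-1., 1.]])
feature_map = conv2d(image, kernel)  # shape (5, 5); peaks along the edge column
pooled = max_pool(feature_map)       # shape (2, 2); dominant activations survive
```

The same weight-sharing idea scales up in real CNNs: one small kernel is reused at every spatial position instead of learning a separate weight per pixel.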
RNN block: each time step takes in a new element, updates a hidden state $h_t = f(W_h h_{t-1} + W_x x_t + b)$, and optionally emits an output. Sequential execution is the price for temporal order.
RNN Unrolled Through Time Steps

```mermaid
flowchart LR
    A[x_t-1] --> B[RNN Cell t-1]
    B --> C[h_t-1]
    C --> D[RNN Cell t]
    E[x_t] --> D
    D --> F[h_t]
    F --> G[RNN Cell t+1]
    H[x_t+1] --> G
    G --> I[Output]
```

The hidden state h carries a compressed memory of all prior inputs, but it's a fixed-size vector, so long-range context eventually gets diluted. LSTMs and GRUs add gating to partially address this.
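The recurrence above is only a few lines of code. Here is a hedged NumPy sketch of a vanilla RNN step with randomly initialized weights (all names are illustrative, not from any library); note that the memory `h` stays the same size no matter how long the sequence grows.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, features = 4, 3
W_h = rng.normal(scale=0.1, size=(hidden, hidden))    # recurrent weights
W_x = rng.normal(scale=0.1, size=(hidden, features))  # input weights
b = np.zeros(hidden)

def rnn_step(h_prev, x_t):
    """One recurrence: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(hidden)                      # fixed-size memory, regardless of sequence length
sequence = rng.normal(size=(10, features))
for x_t in sequence:                      # strictly sequential: step t needs step t-1
    h = rnn_step(h, x_t)
```

The loop is the parallelism bottleneck the comparison table below calls out: step t cannot start until step t-1 has finished.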
Transformer block: multi-head self-attention computes a weighted sum over all positions simultaneously, so long-range dependencies cost no more than short ones, but memory scales as $O(n^2)$ in sequence length.
| Dimension | CNN | RNN | Transformer |
| --- | --- | --- | --- |
| Parallelism during training | High | Low (sequential) | High |
| Long-range context | Moderate | Moderate | Strong |
| Memory at long sequences | Efficient | Efficient per step | Can be heavy |
| Typical training stability | Strong | Unstable on long seqs | Strong with correct setup |
Deep Dive: How Attention Replaces Recurrence
The key Transformer innovation is scaled dot-product attention: each token computes query, key, and value vectors. Attention scores are dot products of queries against all keys, scaled by √d_k so that large dot products don't push the softmax into regions with vanishing gradients, then softmaxed into weights that blend the value vectors. This lets the model attend to any position in one step, with no sequential bottleneck.
| Step | Operation | Purpose |
| --- | --- | --- |
| Q, K, V projection | Linear transform | Encode what to ask, match, and return |
| QKᵀ / √d_k | Dot product + scale | Score token compatibility |
| Softmax | Normalize scores | Produce attention weights |
| Weighted sum of V | Aggregation | Build context-aware token representation |
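The four steps in the table above map almost line-for-line onto code. Below is a minimal single-head, NumPy-only sketch of scaled dot-product attention (no mask, no multi-head split, no learned projections); the `(n, n)` score matrix is where the quadratic cost comes from.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise compatibility: the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax, rows sum to 1
    return weights @ V, weights       # context-aware blend of value vectors

rng = np.random.default_rng(0)
n, d = 5, 8                           # 5 tokens, dimension 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, weights = attention(Q, K, V)     # out: (5, 8), weights: (5, 5)
```

In a real Transformer, Q, K, and V come from learned linear projections of the token embeddings, and several such heads run in parallel before being concatenated.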
Transformer Encoder: From Tokens to Contextual Output

```mermaid
flowchart TD
    A[Input Tokens] --> B[Token Embedding]
    B --> C[Positional Encoding]
    C --> D[Multi-Head Attention]
    D --> E["Add & Norm"]
    E --> F[Feed Forward]
    F --> G["Add & Norm"]
    G --> H[Repeat N Layers]
    H --> I[Output Projection]
    I --> J[Softmax]
```
Each encoder layer refines token representations by mixing context from all other tokens via attention, then applying a position-wise feed-forward transformation. The Add & Norm steps stabilize training.
Internals
CNNs use shared-weight kernels that slide across spatial dimensions, reducing parameters from O(n²) to O(k²·c_in·c_out). RNNs maintain a hidden state h_t = tanh(W_h·h_{t-1} + W_x·x_t), whose fixed size creates a bottleneck for long sequences. Transformers replace recurrence with self-attention: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, allowing all positions to attend to each other in parallel.
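The O(k²·c_in·c_out) claim is easy to sanity-check with arithmetic. The numbers below are an illustrative comparison for one hypothetical layer, not a benchmark:

```python
# Parameter count for one layer mapping a 64x64, 3-channel input to 64 output channels.
k, c_in, c_out, h, w = 3, 3, 64, 64, 64

# Convolution: one shared k x k kernel per (input, output) channel pair, plus biases.
conv_params = k * k * c_in * c_out + c_out                          # 1,792

# A fully connected layer over the same tensor shapes, for contrast.
dense_params = (h * w * c_in) * (h * w * c_out) + (h * w * c_out)   # ~3.2 billion

print(f"conv: {conv_params:,}  dense: {dense_params:,}")
```

Six orders of magnitude separate the two: this is why weight sharing makes CNNs practical on image data.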
Performance Analysis
A ResNet-50 CNN reaches roughly 76% top-1 accuracy on ImageNet; a full training run takes on the order of days on a single V100, or hours across a multi-GPU node. LSTM training on long sequences (>500 tokens) can be 5–10× slower than a comparable Transformer because of the sequential dependency. Fine-tuning a small pretrained Transformer such as GPT-2 (117M parameters) for classification often takes minutes on a modern GPU, versus days or weeks for pretraining from scratch.
Choosing an Architecture: A Decision Flow

```mermaid
graph LR
    A[Raw Data] --> B{Data type?}
    B -->|Image or grid| C[CNN Pipeline]
    B -->|Short ordered stream| D[RNN / LSTM / GRU]
    B -->|Long-context sequence| E[Transformer]
    C --> F[Prediction]
    D --> F
    E --> F
```

This flow operationalizes the core rule of this post (match architecture to data shape) into a single branching decision. Starting from Raw Data, the data-type question routes you to a CNN for spatial grids, an RNN/LSTM/GRU for time-ordered streams, or a Transformer for long-context sequences. All three paths converge at Prediction, reinforcing that the architecture choice changes the journey, not the destination.
| Condition | Start here |
| --- | --- |
| Local spatial patterns dominate | CNN |
| Strict time-order and moderate context window | RNN / LSTM |
| Long-range dependencies dominate | Transformer |
| Low data and need fast iteration | Transfer learning (any family) |
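As a sketch, the routing above fits in one small function. The category labels here are made up for illustration; adapt them to your own data taxonomy:

```python
def pick_architecture(data_shape: str) -> str:
    """Map a data shape to a default family, mirroring the decision flow above."""
    defaults = {
        "grid": "CNN",                   # images, video frames
        "short_stream": "RNN/LSTM/GRU",  # telemetry, short ordered text
        "long_sequence": "Transformer",  # documents, code, conversations
    }
    # Unknown shapes: start with transfer learning and compare small baselines.
    return defaults.get(data_shape, "transfer learning + baseline comparison")

print(pick_architecture("grid"))           # CNN
print(pick_architecture("long_sequence"))  # Transformer
```

The point is not the lookup table itself but that the first branch of your decision should be data shape, not model fashion.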
Real-World Applications: Where Each Architecture Lives in Production
CNNs: defect detection in manufacturing, medical imaging triage, OCR preprocessing, real-time video classification on edge devices.
RNNs: forecasting and streaming anomaly detection in telemetry pipelines; still competitive on constrained hardware where Transformer overhead is prohibitive.
Transformers: search ranking, document understanding, code completion, multilingual NLP, any task requiring understanding of long-form text.
Hybrid pipelines are common. A support-ticket router might use a Transformer encoder for semantic features and a lightweight recurrent branch for short metadata sequences, combining both views for the final routing decision.
Trade-offs and Failure Modes
| Failure mode | Symptom | Typical cause | Mitigation |
| --- | --- | --- | --- |
| Overfitting | Strong train, poor test metrics | Model too complex for data volume | Regularization + augmentation |
| Underfitting | Weak train and test | Architecture too weak | Increase capacity or switch family |
| Gradient instability | Spiky / diverging loss | Poor optimizer settings, long sequences | LR tuning + gradient clipping |
| Context truncation | Missing key information | Sequence exceeds model budget | Chunking + retrieval / context windows |
The accuracy-latency trade-off is real. A larger Transformer may win on benchmark scores while violating your inference SLA. Measure latency early, not after model quality peaks.
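Measuring latency early can be as simple as a `perf_counter` loop. This is a framework-agnostic sketch: `predict` and `batch` are placeholders for your own model callable and input, and the warm-up count is an assumption you should tune:

```python
import time

def p95_latency_ms(predict, batch, warmup=10, runs=100):
    """Measure p95 wall-clock latency of a prediction callable, in milliseconds."""
    for _ in range(warmup):              # discard warm-up runs (caches, JIT, allocator)
        predict(batch)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[int(0.95 * len(samples)) - 1]

# Trivial stand-in "model" just to show the call shape:
latency = p95_latency_ms(lambda x: sum(x), list(range(1000)))
```

Report a tail percentile rather than the mean: SLAs are usually violated by the slowest requests, not the average one.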
Infrastructure constraints often decide architecture more than benchmark scores:
| Constraint | Implication |
| --- | --- |
| Edge device memory limits | Favor compact CNN or small recurrent model |
| Real-time streaming | Recurrent or temporal CNN pipelines |
| Long documents server-side | Transformer with context management |
| Limited labeled data | Transfer learning before scaling |
Decision Guide: Picking the Right Deep Learning Architecture
- Choose CNN when your data has local spatial structure (images, audio spectrograms, fixed-length feature grids).
- Choose RNN/LSTM when sequence order matters and sequences are short to medium length (time series, speech with alignment constraints).
- Choose Transformer for most NLP tasks, long sequences, or when pre-trained weights (BERT, GPT) give a head start.
- Skip deep learning when you have fewer than a few thousand labeled examples; gradient-boosted trees or logistic regression will likely win.
Three Quick Experiments to Validate Architecture Selection
These three experiments turn the architecture-selection framework from the decision flow above into concrete validation steps: a CNN baseline for spatial image data, a Transformer baseline for long-text classification, and a systematic comparison playbook for when the right family isn't obvious. The most common mistake practitioners make is committing to an architecture based on its reputation rather than evidence; this section gives you a repeatable process that lets the data decide. In Experiment 3, focus on the three-way evaluation criteria (validation quality, latency, and serving cost): optimizing for benchmark score alone is exactly the failure mode this playbook is designed to prevent.
Experiment 1: CNN baseline for image classification
```python
if task_type == "image":
    model = build_cnn_backbone()  # placeholder: 3 conv blocks, softmax head
    # Evaluate: accuracy + confusion matrix
```
Experiment 2: Transformer for ticket triage
```python
elif task_type == "long_text":
    model = build_transformer_encoder()  # placeholder: pretrained + fine-tuned head
    # Track: macro-F1 across imbalanced classes
```
Experiment 3: Architecture playbook
When unsure which family to use:
- Build one small baseline per candidate architecture.
- Hold the training budget constant for a fair comparison.
- Compare validation quality and latency and serving cost.
- Select on the full objective, not benchmark score alone.
This prevents the common mistake of choosing the most "modern" architecture when a smaller, simpler model is equally accurate and far cheaper to operate.
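The playbook above can be wired into a small harness. Everything here is hypothetical scaffolding: `train_and_eval`, the metric names, and the penalty weights in the score are placeholders you would replace with your own objective:

```python
def select_architecture(candidates, train_and_eval, max_epochs=5):
    """Return the best (score, name, metrics) under a shared training budget."""
    results = []
    for name, build_fn in candidates:
        metrics = train_and_eval(build_fn(), max_epochs=max_epochs)  # same budget for all
        # Full objective: quality minus penalties for latency and serving cost.
        score = (metrics["val_f1"]
                 - 0.1 * metrics["p95_ms"] / 100
                 - 0.05 * metrics["cost_unit"])
        results.append((score, name, metrics))
    results.sort(reverse=True)
    return results[0]

# Stub run showing the shape of the comparison (fake metrics, stand-in builders):
fake_eval = lambda model, max_epochs: {"val_f1": 0.8, "p95_ms": 50, "cost_unit": 1.0}
best = select_architecture([("cnn", dict), ("transformer", dict)], fake_eval)
```

The penalty weights encode your SLA and budget; changing them can legitimately flip the winner, which is the whole point of selecting on the full objective.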
PyTorch: Implementing All Three Architectures with One Consistent API
PyTorch is an open-source deep learning framework, originally developed by Meta AI and now governed by the PyTorch Foundation, that provides nn.Conv2d, nn.LSTM, and nn.MultiheadAttention as first-class building blocks, letting you implement CNNs, RNNs, and Transformers with a single consistent training loop.
The key teaching point of this post (match architecture to data shape) maps directly to which module you instantiate in PyTorch:
```python
import torch
import torch.nn as nn

# --- CNN: spatial data (image classification) ---
class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # spatial downsampling
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4)
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):  # x: (B, 1, 28, 28)
        return self.classifier(self.features(x).flatten(1))

# --- RNN: ordered streams (time-series) ---
class SimpleRNN(nn.Module):
    def __init__(self, input_size=1, hidden=64, num_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (B, seq_len, features)
        _, (h_n, _) = self.rnn(x)
        return self.head(h_n[-1])

# --- Transformer: long-context sequences ---
class SimpleTransformer(nn.Module):
    def __init__(self, vocab=1000, d_model=128, nhead=4, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):  # tokens: (B, seq_len)
        x = self.embed(tokens)
        x = self.encoder(x)
        return self.head(x[:, 0, :])  # first-token (CLS-style) pooling

# Training loop is identical for all three:
for model, name in [(SimpleCNN(10), "CNN"), (SimpleRNN(), "RNN"), (SimpleTransformer(), "Transformer")]:
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    print(f"{name} parameters: {sum(p.numel() for p in model.parameters()):,}")
```
All three share the same optimizer.zero_grad() → loss.backward() → optimizer.step() loop; the architecture only determines what happens in forward().
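Assuming PyTorch is installed, one iteration of that shared loop looks like this. A bare nn.Linear stands in for any of the three architectures above, purely to keep the sketch short:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                  # stand-in for SimpleCNN / SimpleRNN / SimpleTransformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 8)                   # batch of 16 examples, 8 features each
y = torch.randint(0, 2, (16,))           # integer class labels

optimizer.zero_grad()                    # 1. clear old gradients
loss = loss_fn(model(x), y)              # 2. forward pass + loss
loss.backward()                          # 3. backpropagation
optimizer.step()                         # 4. weight update
```

Swapping the model swaps the forward pass; the four-step loop never changes.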
For a full deep-dive on PyTorch, a dedicated follow-up post is planned.
Keras: High-Level Architecture Prototyping
Keras (bundled with TensorFlow as tf.keras and available standalone via the keras package) provides high-level Sequential and Functional APIs that let you build and swap CNN, RNN, and Transformer architectures in minutes: ideal for the rapid baseline comparison strategy recommended in this post.
```python
import keras
from keras import layers

# CNN baseline
cnn = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
], name="cnn_baseline")

# Transformer baseline (using built-in MultiHeadAttention)
inputs = keras.Input(shape=(128,), dtype="int32")
x = layers.Embedding(input_dim=5000, output_dim=64)(inputs)
x = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
transformer = keras.Model(inputs, outputs, name="transformer_baseline")

# Both compile identically: swap the model variable to compare
for model in [cnn, transformer]:
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    print(model.name, "->", model.count_params(), "params")
```

Keras's model.summary() and keras.utils.plot_model() make architecture comparison visual and auditable, directly supporting the three-experiment validation strategy from the practical section.
For a full deep-dive on Keras, a dedicated follow-up post is planned.
What to Learn Next
Start with the baseline that fits your data shape, measure rigorously across validation quality, inference latency, and serving cost, and only move to more complex architectures when simpler ones demonstrably fall short.
- Neural Networks Explained: From Neurons to Deep Learning
- Supervised Learning Algorithms: Regression and Classification
- Large Language Models: The Generative AI Revolution
TLDR: Summary & Key Takeaways
- CNN, RNN, and Transformer families solve different pattern-learning problems; data shape drives the choice.
- Matching architecture to data structure reduces both training time and operational risk.
- Benchmark scores alone are not enough: measure latency and serving cost before committing.
- Document architecture assumptions (data shape, latency budget, model size cap) so future contributors can reason about trade-offs without guessing history.
Practice Quiz
Which architecture is usually the best first choice for image classification?
A) RNN B) CNN C) Naive Bayes D) Linear regression
Correct Answer: B. CNNs use convolutional filters to detect local spatial patterns (edges, textures, shapes), making them the strongest default for grid-based image tasks.
Why can Transformers be expensive on very long sequences?
A) They do not support batching B) Attention computation and memory can grow with sequence length C) They cannot use GPUs D) They require labeled data
Correct Answer: B. Self-attention computes pairwise token interactions, so memory and compute can scale with the square of sequence length in naive implementations.
When does an RNN remain a practical choice over a Transformer?
A) For all tasks regardless of data shape B) For strict ordered streams with tight latency and memory budgets C) Only for image segmentation D) Never; Transformers are always better
Correct Answer: B. RNNs process inputs one step at a time with bounded per-step memory, making them efficient for streaming data on constrained hardware.
You have limited labeled data and unstable training loss. What is the best first step?
A) Scale model size 10x immediately B) Apply transfer learning and regularization on a strong baseline C) Remove the validation split D) Switch to a different deep learning framework
Correct Answer: B. Transfer learning reuses pretrained representations to reduce data requirements; regularization (dropout, weight decay) helps prevent overfitting on small datasets.
- Open-ended challenge: You are building a system to classify medical images AND generate diagnostic reports from them. Design an architecture that combines CNN and Transformer components: what would each part handle, and how would you connect them?