
Unsupervised Learning: Clustering and Dimensionality Reduction Explained

Unsupervised learning finds structure in unlabeled data using clustering and dimensionality reduction techniques.

Abstract Algorithms · 13 min read

TLDR: Unsupervised learning helps you find patterns when you do not have labels. Clustering groups similar data points into segments, and dimensionality reduction compresses large feature spaces into smaller, useful representations for visualization, search, and downstream models.


📖 Finding Structure Where There Are No Labels

Spotify's "Discover Weekly" playlist groups listeners into micro-genres like "Indie Stomp and Holler" and "Vapor Twitch" — without anyone ever labeling the data. Nobody told the algorithm what genres exist or how many to create. It found them by clustering millions of listening patterns, discovering structure that nobody had explicitly defined. That's unsupervised learning at scale.

Unsupervised learning is the part of machine learning that works with unlabeled data. Instead of predicting a known target, you ask the algorithm to discover structure on its own — grouping, compressing, and revealing patterns that weren't explicitly defined.

Two main techniques:

Imagine a box of mixed screws, nails, and bolts. You naturally group similar shapes together — that's clustering. Now imagine each piece has 200 measurements; you compress them into 2–3 meaningful axes to see the structure — that's dimensionality reduction.

| Technique | What it answers | Output |
| --- | --- | --- |
| Clustering | "Which points look similar?" | Cluster IDs |
| Dimensionality reduction | "Can I represent this with fewer variables?" | Lower-dimensional vectors |
| Density estimation | "Where is data common vs. rare?" | Likelihood scores |

📊 Unsupervised Learning Taxonomy: Algorithms by Technique Family

```mermaid
flowchart TD
    A[Unsupervised Learning] --> B[Clustering]
    A --> C[Dimensionality Reduction]
    A --> D[Generative Models]
    B --> E[K-Means]
    B --> F[DBSCAN]
    B --> G[Hierarchical]
    C --> H[PCA]
    C --> I[t-SNE]
    D --> J[VAE]
    D --> K[GAN]
```

This taxonomy maps the three major families of unsupervised learning algorithms and their most widely-used members. Clustering (K-Means, DBSCAN, Hierarchical) groups data points by similarity; dimensionality reduction (PCA, t-SNE) compresses feature spaces into lower-dimensional representations; generative models (VAE, GAN) learn the underlying data distribution. For practical tabular ML work, start with the clustering and dimensionality reduction branches — the generative model branch requires substantially more data, compute, and tuning and is out of scope for most everyday unsupervised tasks.


🔍 The Unsupervised Input: Unlabeled Feature Matrices

Unsupervised pipelines start with a feature matrix X — rows are examples, columns are features — but no label column exists.

Toy dataset — user segments:

| user_id | monthly_spend | sessions/week | avg_order_value |
| --- | --- | --- | --- |
| u1 | 210 | 5 | 42 |
| u2 | 225 | 6 | 40 |
| u3 | 18 | 1 | 18 |
| u4 | 24 | 1 | 24 |
| u5 | 95 | 3 | 31 |

Even without labels, three groups are visible: high-spend frequent users, low-spend infrequent users, and a mid-range group.

Preprocessing matters: Distance-based methods are sensitive to scale. If monthly_spend ranges 0–1000 and sessions/week ranges 0–10, spend will dominate Euclidean distance unless features are standardized first.


โš™๏ธ Clustering Algorithms: K-Means, DBSCAN, and Hierarchical

K-Means alternates two steps until convergence:

  1. Assign each point to the nearest centroid.
  2. Recompute each centroid as the mean of its assigned points.

K-Means minimizes: $$\min_{\{c_k\}} \sum_{i=1}^{n} \| x_i - c_{k(i)} \|^2$$

It is fast but assumes roughly spherical clusters and requires choosing $k$ in advance.
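The two-step loop fits in a few lines of NumPy. This is an illustrative from-scratch sketch (use sklearn's KMeans in practice); the fixed initial centroid indices are an assumption for reproducibility:

```python
import numpy as np

def kmeans(X, k, init_idx, n_iter=100):
    """Minimal K-Means: alternate assignment and update until centroids stop moving."""
    centroids = X[list(init_idx)].copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: no centroid moved this iteration
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])
labels, centroids = kmeans(X, k=2, init_idx=[0, 20])
print(labels)
```

On well-separated blobs this converges in one or two iterations; on harder data the result depends on initialization, which is exactly why production libraries restart from multiple seeds.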

DBSCAN groups points by density instead of a fixed $k$:

  • Marks points as core, border, or noise based on neighborhood density.
  • Handles irregular cluster shapes naturally.
  • Sensitive to epsilon (neighborhood radius) and min_samples.
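A quick illustration using scikit-learn's make_moons (synthetic data; the eps and min_samples values are assumptions tuned for this dataset): K-Means would bisect the moons with a straight boundary, while DBSCAN follows the density and flags the far-away point as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons plus one obvious outlier
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0]]])

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Label -1 marks noise; cluster count excludes it
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("outlier label:", labels[-1])
```

No k was specified anywhere: DBSCAN discovered the two moons from density alone, and the isolated point at (3, 3) came back as noise.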

Hierarchical clustering builds a merge tree (dendrogram):

  • Useful when you want multi-level groupings ("customers → categories → segments").
  • More expensive for large datasets.
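A sketch with SciPy's linkage (the points and cut levels here are illustrative assumptions): one fitted dendrogram can be cut at different heights to yield coarse segments or finer sub-segments.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points: two tight pairs form group A, a third pair is far-away group B
X = np.array([[0.0, 0.0], [0.1, 0.0],    # group A, sub-pair 1
              [1.0, 0.0], [1.1, 0.0],    # group A, sub-pair 2
              [8.0, 0.0], [8.1, 0.0]])   # group B

# Build the full merge tree (dendrogram) with Ward linkage
Z = linkage(X, method="ward")

# Cut the same tree at two levels for multi-level groupings
coarse = fcluster(Z, t=2, criterion="maxclust")  # 2 top-level segments
fine = fcluster(Z, t=3, criterion="maxclust")    # 3 sub-segments
print(coarse, fine)
```

The coarse cut separates group A from group B; the finer cut splits group A into its two sub-pairs, all from one linkage computation.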

| Algorithm | Best for | Weakness |
| --- | --- | --- |
| K-Means | Large data, roughly spherical clusters | Needs $k$ upfront; poor on irregular shapes |
| DBSCAN | Irregular shapes, noise handling | Sensitive to hyperparameters |
| Hierarchical | Multi-level structure exploration | Expensive at scale |

🧠 Deep Dive: How K-Means Converges Step by Step

K-Means alternates two steps until assignments stop changing. In the assignment step, each point is assigned to the nearest centroid by Euclidean distance. In the update step, each centroid moves to the mean of its assigned points. Total inertia (within-cluster sum of squares) can only decrease or stay flat each iteration — it never increases — guaranteeing convergence, though not necessarily to the global optimum.

| Step | Action | Convergence signal |
| --- | --- | --- |
| Assignment | Each point → nearest centroid | Assignments unchanged between iterations |
| Update | Centroid → mean of its cluster points | Centroid positions stop moving |

📊 K-Means Convergence Loop: Assign, Update, Repeat

```mermaid
flowchart LR
    A[Initialize K Centroids] --> B[Assign Points to Cluster]
    B --> C[Recompute Centroids]
    C --> D{Centroids Changed?}
    D -- Yes --> B
    D -- No --> E[Final Clusters]
```

This flowchart traces the iterative convergence loop at the heart of K-Means: initialize centroids, assign every point to its nearest centroid, recompute each centroid as the mean of its assigned cluster, then check whether any centroid moved. The loop runs until the "Centroids Changed?" check returns No — assignments have stabilised and the algorithm has converged. The critical insight is that convergence is guaranteed (inertia can only decrease each iteration), but the final result depends heavily on initialisation; this is why running with n_init=10 and selecting the best-inertia result is standard practice.


🔬 Internals

K-Means alternates between assignment (each point to nearest centroid) and update (recompute centroids) until convergence — guaranteed to converge but not to a global optimum. PCA computes eigenvectors of the covariance matrix; the top-k eigenvectors define the projection that maximizes retained variance. t-SNE minimizes KL divergence between high-dimensional and low-dimensional pairwise similarity distributions, making it powerful for visualization but non-deterministic.

⚡ Performance Analysis

K-Means with k=10 clusters on 1M points runs in under 60 seconds on a modern CPU using mini-batch variants. PCA on a 10,000×500 matrix completes in ~2 seconds with NumPy's SVD. t-SNE on the same matrix takes 5–10 minutes and does not scale well beyond ~50K samples — use UMAP for larger datasets.

📊 Dimensionality Reduction: PCA, t-SNE, and Autoencoders

```mermaid
flowchart TD
    A[Raw Unlabeled Data] --> B[Clean and Scale Features]
    B --> C{Goal?}
    C -->|Grouping| D[Run Clustering]
    C -->|Compression or Viz| E[Run Dimensionality Reduction]
    D --> F[Validate Cluster Quality]
    E --> G[Validate Information Retention]
    F --> H[Business Interpretation]
    G --> H
```

PCA (Principal Component Analysis):

  • Linear technique — finds the directions of maximum variance in the data.
  • Great baseline for tabular numeric data.
  • "Variance captured by first 2 components = 87%" is easy to communicate.
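One convenient detail worth knowing: sklearn's PCA accepts a float n_components, keeping the smallest number of components that reaches that variance share. A sketch on synthetic data (the two-latent-factor setup is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 200 samples, 10 observed features driven by only 2 latent factors + small noise
factors = rng.normal(size=(200, 2))
X = factors @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

X_scaled = StandardScaler().fit_transform(X)

# Float n_components means "keep enough components for 95% of the variance"
pca = PCA(n_components=0.95).fit(X_scaled)
print("components kept:", pca.n_components_)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```

PCA correctly recovers that two components suffice, which is exactly the "variance captured" number you report to stakeholders.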

t-SNE and UMAP:

  • Non-linear embeddings that preserve neighborhood relationships.
  • Best for visualizing high-dimensional data in 2D/3D — not for production feature inputs (unstable embeddings).

Autoencoders:

  • Neural networks that compress input through a bottleneck and reconstruct it.
  • Best for very high-dimensional data: images, log embeddings, mixed feature types.
  • Require more compute and tuning than PCA.
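A real autoencoder is built in a deep learning framework with non-linear activations; purely to illustrate the reconstruction-through-a-bottleneck objective, the sketch below abuses scikit-learn's MLPRegressor by training it to reproduce its own input through a narrow hidden layer (an illustrative stand-in, not production practice):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# 400 samples in 20 dimensions that secretly live on a 3-D subspace
Z = rng.normal(size=(400, 3))
X = Z @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(400, 20))

# Train the network to reproduce its own input through a 3-unit bottleneck;
# the bottleneck activations are the learned compressed representation
ae = MLPRegressor(hidden_layer_sizes=(10, 3, 10), activation="identity",
                  max_iter=3000, random_state=0)
ae.fit(X, X)

mse = np.mean((ae.predict(X) - X) ** 2)
baseline = np.mean((X - X.mean(axis=0)) ** 2)  # error of predicting the mean
print(f"reconstruction MSE: {mse:.4f} (mean-baseline: {baseline:.4f})")
```

The reconstruction error lands well below the predict-the-mean baseline, showing the bottleneck preserved most of the signal; with identity activations this is essentially PCA, and the non-linear variants are where autoencoders earn their extra compute.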

| Method | Complexity | Best use |
| --- | --- | --- |
| K-Means | Fast | Customer segmentation baseline |
| DBSCAN | Moderate | Geospatial clustering, anomaly detection |
| PCA | Fast | Feature compression, noise reduction |
| t-SNE / UMAP | Expensive | Visualization only |
| Autoencoder | High | Images, logs, complex feature spaces |

📊 PCA Step-by-Step: From High-Dimensional Data to Reduced Components

```mermaid
flowchart TD
    A[High-Dim Data] --> B[Standardize Features]
    B --> C[Compute Covariance Matrix]
    C --> D[Eigen Decomposition]
    D --> E[Sort Eigenvectors]
    E --> F[Select Top K Vectors]
    F --> G[Project Data]
    G --> H[Reduced Dimensions]
```

This flowchart shows the seven-step PCA pipeline from raw high-dimensional data to a reduced representation. The standardisation step (node B) is mandatory — without it, features with large numeric ranges dominate the covariance matrix and distort the principal components. The eigen decomposition step (node D) finds the directions of maximum variance, and selecting only the top-k eigenvectors (node F) is where you control the compression ratio versus retained information trade-off. Check explained_variance_ratio_.sum() after projection to confirm your chosen k retains enough signal for downstream use.


๐ŸŒ Real-World Applications: Unsupervised Learning in Production

Customer segmentation: cluster purchase behavior to identify high-value, at-risk, and new users. Feed cluster labels into targeted marketing campaigns.

Anomaly detection: DBSCAN marks sparse points as noise — those are your anomalies in log data or sensor streams.

Search and recommendation: PCA or autoencoders compress item feature vectors into dense embeddings. Approximate nearest-neighbor search over those embeddings powers "similar items."

Pre-training / representation learning: compress raw features into lower-dimensional space before passing to a supervised model — often dramatically improves downstream accuracy with limited labels.


โš–๏ธ Trade-offs & Failure Modes: Trade-offs and Common Mistakes

MistakeSymptomFix
Skip normalizationDominant features absorb all distance-based signalStandardScaler before any distance method
Wrong $k$ in K-MeansUnnatural splits or merged true clustersElbow method or silhouette score
Using t-SNE embeddings as model featuresUnstable; different runs produce different layoutsUse PCA or autoencoders for production features
Evaluating clusters by training loss onlyClusters look tight but have no business meaningHuman-in-the-loop validation + silhouette score
Ignoring outliersDBSCAN marks them as noise; K-Means absorbs themDecide up front: treat as noise or own class

🧭 Decision Guide: Choosing the Right Method

| Your situation | Start with |
| --- | --- |
| Tabular data, unknown number of clusters | K-Means (with elbow plot) + PCA for visualization |
| Irregular cluster shapes or noise | DBSCAN |
| Multi-level grouping needed | Hierarchical clustering |
| High-dimensional data, need compression | PCA baseline → autoencoder if PCA insufficient |
| Visualization for stakeholders | t-SNE or UMAP (2D scatter) |


🧪 Hands-On: Customer Segmentation Step by Step

This example implements a complete five-step unsupervised pipeline on the user engagement dataset introduced earlier in this post, making the K-Means and silhouette score concepts tangible in working code. Customer segmentation was chosen as the scenario because it is the most common first application of unsupervised learning in industry — the same five-step pattern applies to product categorisation, fraud-ring detection, and patient cohort analysis. As you work through each step, pay attention to how decisions compound: the k you choose in Step 2 drives the cluster labels in Step 3, which determines whether the silhouette score in Step 4 is meaningful or misleading.

Step 1 — Standardize features: apply StandardScaler so monthly_spend and sessions/week contribute equally to distance calculations.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[['monthly_spend', 'sessions_per_week', 'avg_order_value']])
```

Step 2 — Choose k with the elbow method: fit K-Means for k = 2 through 10 and plot inertia (within-cluster sum of squares). The "elbow" — where adding another cluster yields diminishing returns — is your starting k.
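The elbow loop might look like this (synthetic three-blob data stands in for the full user table, since only five example rows appear in the post):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: three blobs shaped like the spend/session segments
X = np.vstack([rng.normal([210, 5, 42], [10, 1.0, 3], (50, 3)),
               rng.normal([20, 1, 20], [5, 0.5, 3], (50, 3)),
               rng.normal([95, 3, 31], [8, 0.8, 3], (50, 3))])
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

# Inertia always decreases as k grows; the "elbow" is where the drop flattens
for k, v in inertias.items():
    print(k, round(v, 1))
```

On this data the curve drops sharply from k=2 to k=3 and then flattens, pointing at k=3 as the starting choice.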

Step 3 — Fit and label clusters:

```python
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)
```

Step 4 — Validate with silhouette score: a score near 1.0 means clusters are dense and well-separated; below 0.3 suggests poor cluster definition.

Step 5 — Interpret with domain knowledge: attach cluster labels to the original data and compute per-cluster medians. Name each cluster: "High-value loyal," "At-risk," and "New casual." Feed these labels into targeted marketing campaigns.
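Putting the five steps together on the toy user table from earlier (cluster numbering can differ between runs; the segment names are the analyst's interpretation, not model output):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The five-user toy dataset from earlier in the post
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5"],
    "monthly_spend": [210, 225, 18, 24, 95],
    "sessions_per_week": [5, 6, 1, 1, 3],
    "avg_order_value": [42, 40, 18, 24, 31],
})
features = ["monthly_spend", "sessions_per_week", "avg_order_value"]

X_scaled = StandardScaler().fit_transform(df[features])
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Per-cluster medians are what you inspect and name ("High-value loyal", ...)
profile = df.groupby("cluster")[features].median()
print(profile)
```

The medians separate cleanly: u1/u2 land together as the high-spend segment, u3/u4 as the low-spend segment, and u5 sits in its own mid-range group.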

This five-step pattern applies to most tabular clustering tasks. Replace K-Means with DBSCAN at Step 3 if your data has irregular shapes or significant noise.


๐Ÿ› ๏ธ scikit-learn: Clustering and Dimensionality Reduction Out of the Box

scikit-learn is the go-to open-source Python library for classical machine learning โ€” it implements KMeans, DBSCAN, PCA, and dozens of other algorithms behind a consistent fit / transform / predict API, making it the first tool to reach for on any tabular unsupervised task.

The library solves the problem in this post directly: you get battle-tested implementations with sensible defaults, helpers like silhouette_score and GridSearchCV for validation, and a preprocessing pipeline that prevents the normalization mistakes described in the trade-offs section.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# --- Sample data: user engagement features ---
X = np.array([
    [210, 5, 42],
    [225, 6, 40],
    [18,  1, 18],
    [24,  1, 24],
    [95,  3, 31],
    [200, 7, 45],
    [15,  1, 16],
    [100, 4, 33],
])

# Step 1 — Normalize so all features contribute equally to distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2 — KMeans with k=3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)
print("KMeans clusters:", kmeans_labels)
# → [0 0 1 1 2 0 1 2]  (cluster IDs assigned to each user)

score = silhouette_score(X_scaled, kmeans_labels)
print(f"Silhouette score: {score:.3f}")  # → ~0.55 means decent separation

# Step 3 — DBSCAN for irregular shapes / noise detection
dbscan = DBSCAN(eps=0.8, min_samples=2)
dbscan_labels = dbscan.fit_predict(X_scaled)
print("DBSCAN labels:", dbscan_labels)
# → Label -1 = noise point (potential anomaly)

# Step 4 — PCA: compress 3 features to 2 for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.1%}")
# → ~95%: 2 components retain almost all signal
print("2D projections:\n", X_2d.round(2))
```

The silhouette_score output guides k selection; the PCA explained_variance_ratio_ output tells you how much information survives the compression. Both are standard checkpoints in any unsupervised pipeline.

For a full deep-dive on scikit-learn, a dedicated follow-up post is planned.


📚 Lessons from Deploying Unsupervised Models

Teams that ship unsupervised learning to production converge on the same hard-won lessons.

Cluster stability is fragile: K-Means is sensitive to initialization. Always run multiple seeds (n_init=10) and pick the run with the lowest inertia. Autoencoder embeddings can shift between training runs, breaking downstream pipelines that depend on stable representations.

Silhouette score ≠ business value: mathematically tight clusters can still be meaningless. Always involve a domain expert to validate that clusters correspond to actionable segments before building campaigns or models on top of them.

Normalization is not optional: skipping StandardScaler before K-Means or PCA is one of the most common mistakes in ML pipelines. A single high-magnitude feature silently dominates all distance calculations.

Dimensionality reduction is a pre-processing step, not a replacement for features: PCA and autoencoders compress existing information; they do not create signal that was not in the original data. If raw features are noisy or irrelevant, dimensionality reduction will compress noise, not remove it.

Version your cluster definitions: if you use cluster IDs as labels in a downstream model, a re-fit of the clustering algorithm may reassign cluster 1 and cluster 2. Pin the centroid coordinates or use a fixed mapping table.
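One way to pin definitions is to freeze the validated centroid coordinates and assign new points by nearest centroid, independent of any future re-fit. A minimal sketch (the centroid values are hypothetical, in the standardized feature space):

```python
import numpy as np

# Hypothetical centroids from a validated clustering run, frozen at deploy time
PINNED_CENTROIDS = np.array([
    [ 1.1,  1.1,  1.1],   # segment 0: "High-value loyal"
    [-1.0, -1.0, -1.0],   # segment 1: "At-risk"
    [ 0.0,  0.0,  0.0],   # segment 2: "New casual"
])

def assign_segment(x_scaled):
    """Nearest-centroid assignment against pinned coordinates, so segment IDs
    stay stable even if the clustering model is later re-fit."""
    dists = np.linalg.norm(x_scaled[None, :, :] - PINNED_CENTROIDS[:, None, :], axis=2)
    return dists.argmin(axis=0)

new_users = np.array([[1.0, 0.9, 1.2], [-0.8, -1.1, -0.9]])
segments = assign_segment(new_users)
print(segments)  # stable IDs: [0 1]
```

Downstream models and campaigns keyed on segment 0 keep meaning "High-value loyal" no matter how many times the clustering itself is retrained.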


📌 TLDR: Summary & Key Takeaways

  • Unsupervised learning discovers structure in data without labels — clustering for grouping, dimensionality reduction for compression.
  • K-Means is fast and effective for spherical clusters; DBSCAN handles irregular shapes and noise.
  • PCA is the first baseline for compression; t-SNE and UMAP are for visualization only.
  • Always normalize features before distance-based methods.
  • Validate clusters with domain knowledge — silhouette score tells you mathematical quality, not business relevance.

๐Ÿ“ Practice Quiz

  1. Q1: Why must features be normalized before running K-Means?

    • A) Normalization speeds up the algorithm
    • B) Features with large ranges dominate Euclidean distance, leading to biased clusters
    • C) K-Means requires all features to be binary

    Correct Answer: B

  2. Q2: Why is t-SNE not recommended as input features for a downstream model?

    • A) t-SNE is too slow to compute
    • B) t-SNE embeddings are non-deterministic and not stable across runs
    • C) t-SNE only works on binary data

    Correct Answer: B

  3. Q3: What is the primary advantage of DBSCAN over K-Means?

    • A) DBSCAN is always faster
    • B) DBSCAN can find irregularly shaped clusters and label outliers as noise
    • C) DBSCAN does not require any hyperparameters

    Correct Answer: B

  4. Open-ended challenge: Your e-commerce platform needs to group 5 million products into meaningful categories with no labeled data. Compare using K-Means versus a neural embedding + clustering approach — what are the practical trade-offs in data preparation, compute, and interpretability?


Written by Abstract Algorithms (@abstractalgorithms)