Chapter 31 — Unsupervised

Unsupervised Learning

No labels, just structure. Clustering (K-Means, DBSCAN, hierarchical), choosing k, and dimensionality reduction (PCA, t-SNE, UMAP) for compression and visualization.

Unsupervised learning finds structure when there's no target: customer segments, anomalies, topic groups, and lower-dimensional views of high-dimensional data.

31.1 Two main jobs

what do you want?

No labels — goal?
│
├── Group similar rows ──────────► Clustering
└── Reduce # of features ────────► Dimensionality reduction
        ├── for modeling/compression ► PCA
        └── for visualization ───────► t-SNE / UMAP

31.2 Clustering algorithms

Algorithm	Use when	Avoid when
K-Means	Round, similar-size clusters; you can pick k	Odd shapes, outliers, varying density
DBSCAN	Arbitrary shapes, finds outliers, k unknown	Clusters of very different density
Hierarchical	Want a dendrogram / nested structure	Large datasets (slow)
Gaussian Mixture	Soft assignment, elliptical clusters	Non-Gaussian shapes
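To make the "odd shapes" row concrete, here is a minimal sketch comparing K-Means and DBSCAN on scikit-learn's synthetic two-moons data. The dataset, `eps=0.3`, and `min_samples=5` are illustrative assumptions, not recommended defaults. The adjusted Rand index is only available here because the synthetic data ships with true labels.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: two interleaved crescents, i.e. clearly non-spherical clusters.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# K-Means is told k=2 but can only draw a straight boundary between centroids.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# DBSCAN follows dense regions; eps and min_samples are assumed values for this toy data.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Agreement with the known generating labels (1.0 = perfect, about 0 = random).
km_ari = adjusted_rand_score(y_true, km_labels)
db_ari = adjusted_rand_score(y_true, db_labels)

# DBSCAN marks points it considers outliers with the label -1.
n_noise = int((db_labels == -1).sum())
n_db_clusters = len(set(db_labels) - {-1})

print(f"K-Means ARI: {km_ari:.2f}")
print(f"DBSCAN  ARI: {db_ari:.2f} ({n_db_clusters} clusters, {n_noise} noise points)")
```

On real data there are no true labels to score against, and `eps` usually needs tuning per dataset.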

K-Means uses distances — scale your features first (Chapter 15.4), or the largest-range column dominates the clustering.
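The scaling warning can be demonstrated with a small sketch. The two columns below are made up for illustration: a small-range column that actually separates the groups, and a large-range column that is pure noise. `StandardScaler` is one common choice of scaler, assumed here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
groups = np.repeat([0, 1], n)  # known group membership (synthetic data only)

# Small-range column that carries the real group structure.
signal = np.concatenate([rng.normal(0.0, 0.1, n), rng.normal(1.0, 0.1, n)])
# Large-range column unrelated to the groups (think: a value in dollars).
noise = rng.normal(50_000, 15_000, 2 * n)
X = np.column_stack([signal, noise])

# Unscaled: Euclidean distance is dominated by the large-range column.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Scaled: both columns contribute on a comparable footing.
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

raw_ari = adjusted_rand_score(groups, raw_labels)
scaled_ari = adjusted_rand_score(groups, scaled_labels)
print(f"ARI without scaling: {raw_ari:.2f}")
print(f"ARI with scaling:    {scaled_ari:.2f}")
```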

31.3 Choosing k for K-Means

how many clusters?

Run k = 2..10, then...
│
├── Elbow: plot inertia vs k, find the bend
├── Silhouette: pick k with highest avg silhouette
└── Business sense: do the segments mean something?

python

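from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X_scaled: the standardized feature matrix (see the scaling note in 31.2)
for k in range(2, 11):  # k = 2..10, matching the range suggested above
    km = KMeans(n_clusters=k, n_init='auto', random_state=42).fit(X_scaled)
    # Average silhouette lies in [-1, 1]; higher means tighter, better-separated clusters
    print(k, silhouette_score(X_scaled, km.labels_))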
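The elbow branch of the tree can be sketched the same way, using the fitted model's `inertia_` attribute (within-cluster sum of squared distances). The `make_blobs` data below is a stand-in so the snippet runs on its own; with real data you would reuse your own scaled matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption for this sketch): 4 synthetic blobs.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Inertia for each candidate k.
inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

# Inertia generally keeps falling as k grows, so look at how much each extra
# cluster buys you; the "bend" is where the drop becomes small.
for k in range(3, 11):
    drop = inertias[k - 1] - inertias[k]
    print(f"k={k:2d}  inertia={inertias[k]:8.2f}  drop from k-1={drop:8.2f}")
```

Plotting `inertias` against k (for example with matplotlib) gives the usual elbow chart; the printed drops are a text-only substitute.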

31.4 Dimensionality reduction

Method	Use for	Note
PCA	Compression, denoising, ML input	Linear, fast, preserves variance
t-SNE	2D visualization of clusters	Slow; distances between clusters not meaningful
UMAP	Visualization; tends to preserve more global structure than t-SNE	Usually faster than t-SNE, great default (separate umap-learn package)

python

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)         # keep 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_, "components kept")
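To see what `n_components=0.95` is doing, the following sketch fits a full PCA and finds the smallest number of components whose cumulative explained variance reaches 95%. It uses scikit-learn's bundled digits dataset purely as an example input.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example input (an assumption for this sketch): 64 pixel features per digit image.
X = load_digits().data
X_scaled = StandardScaler().fit_transform(X)

# Full PCA: one explained-variance ratio per component, sorted descending.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.95.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

# The float shortcut should arrive at the same count.
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print("components needed for 95% variance:", n_needed)
print("PCA(n_components=0.95) kept:       ", pca_95.n_components_)
```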

t-SNE/UMAP are for seeing, not measuring. Cluster sizes and between-cluster distances in those plots are distorted — don't read quantities off them.
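A minimal 2D embedding sketch with scikit-learn's `TSNE` follows. The digits subset, `perplexity=30`, and `init="pca"` are assumptions chosen to keep the example small and fast. UMAP lives in the separate `umap-learn` package and is not shown here.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Small subset so the example runs quickly; t-SNE slows down as rows increase.
digits = load_digits()
X = digits.data[:300]
y = digits.target[:300]  # used only to color the plot, never to fit
X_scaled = StandardScaler().fit_transform(X)

# Project to 2 dimensions for plotting only.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
embedding = tsne.fit_transform(X_scaled)

print(embedding.shape)  # one (x, y) pair per input row
```

A scatter plot of `embedding[:, 0]` against `embedding[:, 1]`, colored by `y`, shows which points sit near each other. The axes have no units, and changing `perplexity` or `random_state` can rearrange the picture.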

31.5 Evaluating clusters

Silhouette score — how tight/separated clusters are (no labels needed)
Davies-Bouldin / Calinski-Harabasz — internal quality indices
Profile each cluster — describe it in business terms (the real test)
Stability — do clusters persist on a new sample?
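The internal indices and a simple stability check can be sketched as below. The blob data is synthetic, and the stability procedure (refit on a random half, then compare assignments with the adjusted Rand index) is one possible approach assumed for illustration, not the only way to test persistence.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for this sketch.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
labels = km.labels_

# Internal indices: none of them need true labels.
sil = silhouette_score(X_scaled, labels)         # range [-1, 1], higher is better
dbi = davies_bouldin_score(X_scaled, labels)     # >= 0, lower is better
chi = calinski_harabasz_score(X_scaled, labels)  # higher is better

# Stability: refit on a random half, assign every row with that model,
# and measure agreement with the original assignment. ARI ignores label names,
# so it does not matter if cluster 0 in one fit is cluster 2 in the other.
rng = np.random.default_rng(0)
half = rng.choice(len(X_scaled), size=len(X_scaled) // 2, replace=False)
km_half = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled[half])
stability = adjusted_rand_score(labels, km_half.predict(X_scaled))

print(f"silhouette={sil:.3f}  davies_bouldin={dbi:.3f}  calinski_harabasz={chi:.1f}")
print(f"stability (ARI vs half-sample refit)={stability:.3f}")
```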

Professional recommendation

First try: K-Means on scaled data

Odd shapes: DBSCAN

Pick k: Silhouette + business sense

Visualize: UMAP to 2D

Common mistakes to avoid

Running K-Means on unscaled features
Reading cluster sizes/distances off a t-SNE plot as if they're real
Picking k from the elbow alone with no business validation
Using K-Means for non-spherical or varying-density clusters
Never profiling the clusters — segments nobody can describe are useless
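Profiling is often just a group-by on the original, unscaled columns. The sketch below uses a made-up customer table; the column names `annual_spend` and `visits_per_month` and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical customer table: 100 low-activity and 100 high-activity customers.
df = pd.DataFrame({
    "annual_spend": np.concatenate([rng.normal(500, 80, 100), rng.normal(3000, 400, 100)]),
    "visits_per_month": np.concatenate([rng.normal(2, 0.5, 100), rng.normal(9, 1.5, 100)]),
})

# Cluster on scaled features, but keep the original units for reporting.
X_scaled = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# One row per cluster: size plus typical values in business units.
profile = df.groupby("cluster").agg(
    n_customers=("annual_spend", "size"),
    mean_spend=("annual_spend", "mean"),
    mean_visits=("visits_per_month", "mean"),
).round(1)

print(profile)
```

If a row of `profile` cannot be given a plain-language name (for example "frequent high spenders"), that is a sign the segmentation needs another look.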

Quick cheatsheet

KMeans(n_clusters=k) -> spherical clusters

DBSCAN(eps, min_samples) -> shape + outliers

silhouette_score() -> cluster quality

PCA(n_components=0.95) -> keep 95% variance

umap.UMAP() -> 2D visualization