Chapter 31 — Unsupervised
Unsupervised Learning
No labels, just structure. Clustering (K-Means, DBSCAN, hierarchical), choosing k, and dimensionality reduction (PCA, t-SNE, UMAP) for compression and visualization.
Unsupervised learning finds structure when there's no target: customer segments, anomalies, topic groups, and lower-dimensional views of high-dimensional data.
31.1 Two main jobs
what do you want?
No labels: what is the goal?
│
├── Group similar rows ────────► Clustering
└── Reduce # of features ──────► Dimensionality reduction
    ├── for modeling/compression ► PCA
    └── for visualization ───────► t-SNE / UMAP
31.2 Clustering algorithms
| Algorithm | Use when | Avoid when |
|---|---|---|
| K-Means | Round, similar-size clusters; you can pick k | Odd shapes, outliers, varying density |
| DBSCAN | Arbitrary shapes, finds outliers, k unknown | Clusters of very different density |
| Hierarchical | Want a dendrogram / nested structure | Large datasets (slow) |
| Gaussian Mixture | Soft assignment, elliptical clusters | Non-Gaussian shapes |
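To make the "odd shapes" row concrete, here is a minimal sketch that runs K-Means and DBSCAN on the same data. The two-moons dataset and the `eps=0.3` / `min_samples=5` settings are assumptions chosen for illustration; on real data `eps` needs tuning.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic two-moons data (an illustrative assumption, not from the chapter).
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# K-Means splits the plane with a straight boundary; DBSCAN follows density.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Adjusted Rand index: 1.0 means perfect agreement with the known groups.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN  ARI: {dbscan_ari:.2f}")

# DBSCAN labels points it considers outliers as -1.
print("DBSCAN noise points:", int((dbscan_labels == -1).sum()))
```

The known labels are used here only to score the result. In a real unsupervised setting they are unavailable, so the comparison would rest on internal indices and cluster profiles instead.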
K-Means uses distances — scale your features first (Chapter 15.4), or the largest-range column dominates the clustering.
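A small sketch of why scaling matters, under invented data: two small-range columns carry the real group signal, and one large-range column is pure noise. The column scales and the helper name `cluster_ari` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
y_true = np.repeat([0, 1], n)

# Two small-range columns separate the groups (means 0 vs 1, std 0.1).
signal = rng.normal(loc=y_true[:, None] * 1.0, scale=0.1, size=(2 * n, 2))
# One large-range column is unrelated to the groups.
noise = rng.normal(loc=50_000, scale=10_000, size=(2 * n, 1))
X = np.hstack([signal, noise])


def cluster_ari(features):
    """Fit 2-cluster K-Means and score agreement with the known groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(features)
    return adjusted_rand_score(y_true, labels)


ari_raw = cluster_ari(X)                                     # noise column dominates
ari_scaled = cluster_ari(StandardScaler().fit_transform(X))  # all columns comparable
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

Without scaling, Euclidean distance is almost entirely the large-range column, so K-Means splits on noise. After standardization each column contributes on the same footing.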
31.3 Choosing k for K-Means
how many clusters?
Run k = 2..10, then...
│
├── Elbow: plot inertia vs k, find the bend
├── Silhouette: pick k with highest avg silhouette
└── Business sense: do the segments mean something?
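```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X_scaled: the scaled feature matrix (see Chapter 15.4)
for k in range(2, 11):  # k = 2..10
    km = KMeans(n_clusters=k, n_init='auto', random_state=42).fit(X_scaled)
    # Average silhouette over all rows; higher is better
    print(k, silhouette_score(X_scaled, km.labels_))
```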
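The elbow check from the diagram can be sketched the same way, using the fitted model's `inertia_` attribute (within-cluster sum of squared distances). The blob data here is synthetic and only stands in for a real scaled feature matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 groups (an illustrative assumption).
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

ks = list(range(2, 11))  # k = 2..10
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
    for k in ks
]

# Inertia shrinks as k grows; the "bend" is where the drops become small.
previous = None
for k, inertia in zip(ks, inertias):
    drop = "" if previous is None else f"  drop={previous - inertia:7.1f}"
    print(f"k={k:2d}  inertia={inertia:8.1f}{drop}")
    previous = inertia
```

Plotting `ks` against `inertias` gives the usual elbow chart. Because inertia keeps falling as k increases, the bend is a judgment call, which is why it is paired with silhouette and business sense.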
31.4 Dimensionality reduction
| Method | Use for | Note |
|---|---|---|
| PCA | Compression, denoising, ML input | Linear, fast, preserves variance |
| t-SNE | 2D visualization of clusters | Slow; distances between clusters not meaningful |
| UMAP | Visualization; retains more global structure than t-SNE | Usually faster than t-SNE, good default; needs the separate umap-learn package |
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_, "components kept")
```
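To see what a fractional `n_components` actually keeps, a sketch like the following inspects the cumulative explained variance and the reconstruction error. It assumes scikit-learn's bundled digits dataset as a stand-in for real data.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 64 pixel features per image (bundled dataset, used here only as an example).
X = load_digits().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95).fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, "components kept out of", X.shape[1])
print("variance retained:", round(float(cumulative[-1]), 3))

# Compress, then map back to the original feature space to measure what was lost.
X_reduced = pca.transform(X_scaled)
X_restored = pca.inverse_transform(X_reduced)
mse = float(np.mean((X_scaled - X_restored) ** 2))
print("reconstruction MSE:", round(mse, 4))
```

The cumulative curve is also a reasonable way to pick a fixed number of components when a variance threshold feels arbitrary.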
t-SNE/UMAP are for seeing, not measuring. Cluster sizes and between-cluster distances in those plots are distorted — don't read quantities off them.
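One way to see the distortion is to embed the same data twice with different settings. This sketch uses scikit-learn's `TSNE` so it runs without extra packages; the digits subset and the two perplexity values are arbitrary choices for illustration. With umap-learn installed, `umap.UMAP(n_components=2).fit_transform(X)` plays the same role.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Small subset of the bundled digits data to keep the run short.
X = StandardScaler().fit_transform(load_digits().data[:500])

# Same data, two perplexity settings: the layouts and axis scales differ.
embeddings = {
    perplexity: TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=42
    ).fit_transform(X)
    for perplexity in (5, 50)
}

for perplexity, emb in embeddings.items():
    axis_range = emb.max(axis=0) - emb.min(axis=0)
    print(f"perplexity={perplexity}: shape={emb.shape}, axis ranges={np.round(axis_range, 1)}")
```

Since the axis ranges and spacing shift with a tuning parameter, the coordinates carry no units. Use the plot to spot groupings, then measure anything quantitative in the original feature space.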
31.5 Evaluating clusters
- Silhouette score: how tight and well separated the clusters are, from -1 to 1, higher is better (no labels needed)
- Davies-Bouldin (lower is better) / Calinski-Harabasz (higher is better): internal quality indices
- Profile each cluster: describe it in business terms (the real test)
- Stability: do the clusters persist on a new sample?
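The four checks above can be sketched together. The customer table below is invented: the column names `monthly_spend` and `visits_per_month`, the group centers, and the bootstrap-based stability check are all assumptions for illustration, not a prescribed procedure.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data: three groups with invented centers and spreads.
rng = np.random.default_rng(0)
centers = np.array([[20.0, 2.0], [60.0, 8.0], [120.0, 4.0]])
spreads = np.array([5.0, 1.0])
raw = np.vstack([rng.normal(c, spreads, size=(200, 2)) for c in centers])
df = pd.DataFrame(raw, columns=["monthly_spend", "visits_per_month"])

X_scaled = StandardScaler().fit_transform(df)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
df["cluster"] = km.labels_

# Internal indices: computed from the data and labels alone.
sil = silhouette_score(X_scaled, km.labels_)         # higher is better
db = davies_bouldin_score(X_scaled, km.labels_)      # lower is better
ch = calinski_harabasz_score(X_scaled, km.labels_)   # higher is better
print(f"silhouette={sil:.3f}  davies_bouldin={db:.3f}  calinski_harabasz={ch:.1f}")

# Profile: cluster means in the original units, plus cluster sizes.
profile = df.groupby("cluster").mean()
profile["size"] = df.groupby("cluster").size()
print(profile.round(1))

# Stability: refit on a bootstrap resample, then compare assignments on all rows.
idx = rng.choice(len(X_scaled), size=len(X_scaled), replace=True)
km_boot = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X_scaled[idx])
stability = adjusted_rand_score(km.labels_, km_boot.predict(X_scaled))
print(f"stability (ARI vs bootstrap refit): {stability:.3f}")
```

The adjusted Rand index ignores how clusters are numbered, so it can compare two fits directly. A value near 1 suggests the segments are not an artifact of one particular sample.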
Professional recommendation
- First try: K-Means on scaled data
- Odd shapes: DBSCAN
- Pick k: Silhouette + business sense
- Visualize: UMAP to 2D
Common mistakes to avoid
- Running K-Means on unscaled features
- Reading cluster sizes/distances off a t-SNE plot as if they're real
- Picking k from the elbow alone with no business validation
- Using K-Means for non-spherical or varying-density clusters
- Never profiling the clusters — segments nobody can describe are useless
Quick cheatsheet
- KMeans(n_clusters=k) -> spherical clusters
- DBSCAN(eps, min_samples) -> shape + outliers
- silhouette_score() -> cluster quality
- PCA(n_components=0.95) -> keep 95% variance
- umap.UMAP() -> 2D visualization