Chapter 31 — Unsupervised

Unsupervised Learning

No labels, just structure. Clustering (K-Means, DBSCAN, hierarchical), choosing k, and dimensionality reduction (PCA, t-SNE, UMAP) for compression and visualization.

Unsupervised learning finds structure when there's no target: customer segments, anomalies, topic groups, and lower-dimensional views of high-dimensional data.
31.1 Two main jobs
what do you want?
No labels — goal?
│
├── Group similar rows ──────────► Clustering
└── Reduce # of features ────────► Dimensionality reduction
        ├── for modeling/compression ► PCA
        └── for visualization ───────► t-SNE / UMAP
31.2 Clustering algorithms
Algorithm        | Use when                                     | Avoid when
K-Means          | Round, similar-size clusters; you can pick k | Odd shapes, outliers, varying density
DBSCAN           | Arbitrary shapes, finds outliers, k unknown  | Clusters of very different density
Hierarchical     | Want a dendrogram / nested structure         | Large datasets (slow)
Gaussian Mixture | Soft assignment, elliptical clusters         | Non-Gaussian shapes
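The "odd shapes" trade-off between K-Means and DBSCAN can be checked on toy data. This is a minimal sketch, assuming scikit-learn's synthetic `make_moons` dataset as a stand-in for real data; the `eps` and `min_samples` values are illustrative choices for this toy set, not general defaults.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two interleaved half-moons, i.e. non-round clusters.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# K-Means assigns each point to the nearest centroid, so it splits the
# moons with a straight boundary instead of following their shape.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# DBSCAN grows clusters through dense regions and needs no k.
# eps and min_samples here are tuned by eye for this toy data.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Adjusted Rand index: 1.0 means perfect agreement with the true moons.
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_dbscan = adjusted_rand_score(y_true, db_labels)
n_noise = int((db_labels == -1).sum())  # DBSCAN marks outliers with label -1

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}, noise points: {n_noise}")
```

On real data there are no true labels to compare against, so `eps` usually has to be tuned by inspecting the resulting clusters and the share of points labeled as noise.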
K-Means uses distances — scale your features first (Chapter 15.4), or the largest-range column dominates the clustering.
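A small experiment makes the scaling warning concrete. This sketch uses made-up data: the column meanings ("visits per week", "yearly income") and all numbers are hypothetical, chosen so that the real grouping sits in the small-range column while the large-range column is pure noise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical toy data: two true groups, visible only in the small-range column.
group = np.repeat([0, 1], n // 2)
visits = np.where(group == 0, 1.0, 5.0) + rng.normal(0, 0.3, n)  # small range, informative
income = rng.normal(50_000, 15_000, n)                           # large range, uninformative
X = np.column_stack([visits, income])


def kmeans_labels(data):
    """Cluster into two groups with fixed settings so runs are comparable."""
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)


# Without scaling, distances are dominated by income, so the split follows noise.
ari_raw = adjusted_rand_score(group, kmeans_labels(X))

# After standardizing, both columns contribute on the same scale.
X_scaled = StandardScaler().fit_transform(X)
ari_scaled = adjusted_rand_score(group, kmeans_labels(X_scaled))

print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```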
31.3 Choosing k for K-Means
how many clusters?
Run k = 2..10, then...
│
├── Elbow: plot inertia vs k, find the bend
├── Silhouette: pick k with highest avg silhouette
└── Business sense: do the segments mean something?
python
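from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X_scaled: feature matrix, already standardized (see 31.2)
for k in range(2, 11):  # k = 2..10
    km = KMeans(n_clusters=k, n_init='auto', random_state=42).fit(X_scaled)
    # Average silhouette lies in [-1, 1]; higher means better-separated clusters
    print(k, silhouette_score(X_scaled, km.labels_))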
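The elbow branch of the diagram can be sketched the same way, using the fitted model's `inertia_` attribute. This is only an illustration: `make_blobs` with four centers is an assumed stand-in for `X_scaled`, and in practice the bend is usually read off a plot rather than a printed table.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled (assumption): four compact blobs.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances

# Inertia tends to keep shrinking as k grows, so the minimum is not the goal.
# The elbow is the k after which each extra cluster buys only a small drop.
for k in range(3, 11):
    drop = inertias[k - 1] - inertias[k]
    print(f"k={k}: inertia={inertias[k]:.1f}, drop from k={k - 1}: {drop:.1f}")
```

When the curve has no clear bend, the silhouette scores and the business-sense check carry more weight.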
31.4 Dimensionality reduction
Method | Use for                                               | Note
PCA    | Compression, denoising, ML input                      | Linear, fast, keeps the highest-variance directions
t-SNE  | 2D visualization of clusters                          | Slow; distances between clusters not meaningful
UMAP   | Visualization; keeps more global structure than t-SNE | Usually faster than t-SNE, a good default
python
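from sklearn.decomposition import PCA

# A float in (0, 1) keeps the fewest components that explain at least that share of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)  # X_scaled: standardized feature matrix
print(pca.n_components_, "components kept")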
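To see where a 95% cutoff comes from, it can help to compute the cumulative explained variance by hand. A minimal sketch, assuming scikit-learn's bundled digits dataset (64 features) as stand-in data; the manual count is expected to match what `PCA(n_components=0.95)` selects on the same input.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 8x8 digit images flattened to 64 features.
X = load_digits().data
X_scaled = StandardScaler().fit_transform(X)

# Fit every component, then accumulate each component's share of variance.
full = PCA().fit(X_scaled)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%.
n_needed = int(np.argmax(cumulative >= 0.95) + 1)

# Let PCA pick the count itself from the same threshold.
auto = PCA(n_components=0.95).fit(X_scaled)

print("components needed (manual):", n_needed)
print("components kept (PCA):     ", auto.n_components_)
print(f"variance explained: {cumulative[n_needed - 1]:.3f}")
```

Plotting `cumulative` against the component index is a common way to judge whether 95% is a sensible threshold or whether far fewer components already capture most of the variance.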
t-SNE/UMAP are for seeing, not measuring. Cluster sizes and between-cluster distances in those plots are distorted — don't read quantities off them.
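Producing the 2D coordinates for such a plot takes only a few lines. This sketch uses scikit-learn's `TSNE` rather than UMAP so that it needs no extra package (UMAP lives in the separate `umap-learn` library); the digits subset and the perplexity value are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the first 500 rows of the 64-feature digits dataset.
digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data[:500])

# perplexity roughly controls how many neighbors each point attends to;
# different values can produce visibly different layouts.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_scaled)

# One (x, y) pair per row, ready for a scatter plot colored by cluster label
# or, for this dataset, by the known class in digits.target[:500].
print(X_2d.shape)
```

Rerunning with another `perplexity` or `random_state` and comparing the plots is a quick way to see which visual patterns are stable and which are artifacts of the embedding.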
31.5 Evaluating clusters

Professional recommendation

First try  | K-Means on scaled data
Odd shapes | DBSCAN
Pick k     | Silhouette + business sense
Visualize  | UMAP to 2D
Common mistakes to avoid
Quick cheatsheet
KMeans(n_clusters=k) -> spherical clusters
DBSCAN(eps, min_samples) -> shape + outliers
silhouette_score() -> cluster quality
PCA(n_components=0.95) -> keep 95% variance
umap.UMAP() -> 2D visualization