Chapter 19 — Troubleshooting

Data Science Troubleshooting Guide

When something breaks or looks wrong, find the symptom here. Each entry gives the likely cause and the concrete fix — for errors, modeling problems, and silent data issues.

Search this chapter for an error message or symptom. Entries follow the same pattern throughout: Symptom → Cause → Fix.

19.1 Common error messages

Error	Cause	Fix
`could not convert string to float`	Text values in a numeric column (e.g. "N/A", "$1,200")	Strip symbols such as `$` and `,`, then `pd.to_numeric(col, errors='coerce')` and impute the remaining NaN
`SettingWithCopyWarning`	Chained indexing on a slice	Use `.loc[rows, col] = value` or `.copy()`
`ValueError: Input contains NaN`	Model received missing values	Impute before `.fit()`; re-check for NaN after merges
`shapes not aligned` / dimension mismatch	Train and predict feature columns differ	Align columns; fit encoder on train, reuse on test
`MemoryError` on load	Dataset too large for RAM	`dtype` downcast, `usecols`, `chunksize`, or Polars/Dask
`ConvergenceWarning` (LogReg)	Unscaled features or too few iterations	Scale features; raise `max_iter`
`Found unknown categories` (OneHot)	Test set has a category not seen in train	`OneHotEncoder(handle_unknown='ignore')`
`KeyError: 'column'`	Column was renamed or its name has stray whitespace	`df.columns = df.columns.str.strip()`; inspect `df.columns`
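
The `could not convert string to float` and `KeyError` rows can be sketched together. This is a minimal example on made-up data: the frame `raw` and its `price` column are hypothetical, and the regular expression only handles `$` and `,`, so other currency formats would need their own cleanup.

```python
import pandas as pd

# Hypothetical raw frame: a padded column name and text mixed into a numeric column.
raw = pd.DataFrame({" price ": ["$1,200", "N/A", "350", "12.5"]})

# Strip whitespace from column names so raw["price"] no longer raises KeyError.
raw.columns = raw.columns.str.strip()

# Naive coercion: anything that is not a plain number becomes NaN, including "$1,200".
naive = pd.to_numeric(raw["price"], errors="coerce")

# Remove currency symbols and thousands separators first, then coerce.
cleaned = pd.to_numeric(
    raw["price"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

print(naive.isna().sum(), "values lost by naive coercion")
print(cleaned.isna().sum(), "values lost after cleaning")
```

Coercing without cleaning silently discards the recoverable value `"$1,200"`, while cleaning first leaves only the genuine `"N/A"` to impute.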


19.2 Overfitting

Symptom

Train accuracy = 99%, Test accuracy = 62%. The model memorised the training data and fails on new data.

Fix

Cross-validation · regularization (L1/L2, max_depth) · fewer features · more data · early stopping · simpler model.

python

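# Detect the gap with cross-validation, not a single split.
# Assumes an unfitted binary classifier `model` and existing X_train, y_train.
from sklearn.model_selection import cross_validate
cv = cross_validate(model, X_train, y_train, cv=5, scoring='roc_auc', return_train_score=True)
print(f"Train AUC: {cv['train_score'].mean():.3f}")
print(f"CV AUC: {cv['test_score'].mean():.3f} ± {cv['test_score'].std():.3f}")
# Big train-vs-CV gap = overfitting. Add regularization / reduce complexity.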
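
To see one of the fixes in action, the sketch below compares an unconstrained decision tree with a depth-limited one. It assumes synthetic data from `make_classification` with deliberately noisy labels. The dataset, the choice of a decision tree, and the depth of 3 are illustrative only, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (flip_y mislabels a fraction of rows) stands in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

gaps = {}
for depth in (None, 3):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    gaps[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.2f} "
          f"test={test_acc:.2f} gap={train_acc - test_acc:.2f}")
```

The unconstrained tree fits the noisy training labels perfectly, so its train-test gap is the one to watch. Limiting `max_depth` trades some training accuracy for a smaller gap.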

19.3 Underfitting

Symptom

Train and test scores are both low. The model is too simple to capture the pattern.

Fix

More/better features · less regularization · a more powerful model · train longer · reduce feature noise.
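
A minimal sketch of the "better features" fix, assuming a synthetic quadratic relationship. The data and the degree-2 expansion are assumptions chosen so the effect is easy to see, and both models are scored on the data they were fit on for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=500)  # quadratic signal plus noise

# A straight line cannot represent a symmetric curve: low R² even on its own training data.
linear = LinearRegression().fit(X, y)

# Adding a squared feature gives the same linear model enough capacity.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

linear_r2 = linear.score(X, y)
quadratic_r2 = quadratic.score(X, y)
print(f"linear R²: {linear_r2:.2f}, with squared feature: {quadratic_r2:.2f}")
```

A low score on the training data itself is the signature of underfitting. More data would not help here, but a feature that matches the shape of the signal does.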

19.4 Silent data problems (no error, wrong answer)

Symptom	Likely cause	Fix
Test score suspiciously perfect (≈1.0)	Data leakage — target info in features	Audit features; see Chapter 20
Row count grew after a merge	Join key is not unique on one or both sides	Check key uniqueness; pass `validate='1:1'` (or `'m:1'`) to `merge`
Aggregates double the expected total	Duplicate rows	`df.duplicated().sum()`, then `df.drop_duplicates()`
Model predicts one class only	Severe imbalance / wrong threshold	Class weights, resampling, tune threshold
Great offline, poor in production	Train/serve skew or drift	Match preprocessing; monitor drift (Ch 21)
Feature importance dominated by an ID	Leaky identifier learned by the model	Drop IDs from features
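
The merge and duplicate rows of the table can be reproduced on a toy example. The `orders` and `customers` frames below are hypothetical. The sketch assumes the lookup table is meant to have one row per `customer_id`, which is why `validate='m:1'` is used rather than `'1:1'`.

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 20.0, 5.0]})
# Hypothetical lookup table with an accidental duplicate key (customer 2 appears twice).
customers = pd.DataFrame({"customer_id": [1, 2, 2],
                          "region": ["north", "south", "south"]})

merged = orders.merge(customers, on="customer_id", how="left")
print(len(orders), "->", len(merged), "rows")                  # row count grows silently
print(orders["amount"].sum(), "->", merged["amount"].sum())    # totals inflate with it

# Check key uniqueness on the lookup side before joining.
print("duplicate keys:", customers["customer_id"].duplicated().sum())

# validate="m:1" turns the silent fan-out into an explicit error.
try:
    orders.merge(customers, on="customer_id", how="left", validate="m:1")
    caught = False
except pd.errors.MergeError as exc:
    caught = True
    print("merge rejected:", exc)

# After deduplicating the lookup table the same validated merge succeeds.
fixed = orders.merge(customers.drop_duplicates("customer_id"),
                     on="customer_id", how="left", validate="m:1")
print(len(fixed), "rows after fix")
```

Leaving `validate` in place after the fix is cheap and makes the join fail loudly if duplicate keys reappear in a later data refresh.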

19.5 Performance & scaling

Symptom	Cause	Fix
Pandas operation very slow	Row-wise `apply` / Python loops	Vectorize; use built-in pandas/numpy ops
Notebook RAM keeps growing	Copies of large frames kept alive	`del` + downcast dtypes; process in chunks
Training takes hours	Too many estimators / unscaled SVM/KNN	LightGBM, subsample, fewer trees, GPU; scale features for SVM/KNN
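
The first two rows can be illustrated with a short sketch on random data. The `price` and `qty` columns and the threshold of 5 are made up for the example. Whether `float32` is acceptable depends on how much precision the downstream work needs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.uniform(1, 100, 10_000),
                   "qty": rng.integers(1, 10, 10_000)})

# Slow pattern: a Python function called once per row.
slow = df.apply(lambda row: row["price"] * row["qty"] if row["qty"] >= 5 else 0.0, axis=1)

# Vectorized equivalent: whole-column arithmetic plus np.where for the condition.
fast = pd.Series(np.where(df["qty"] >= 5, df["price"] * df["qty"], 0.0), index=df.index)

# Downcasting shrinks memory when the value range allows it.
before = df.memory_usage(deep=True).sum()
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")  # int64 -> int8 here
df["price"] = df["price"].astype("float32")               # halves the column, loses precision
after = df.memory_usage(deep=True).sum()

print("same result:", np.allclose(slow, fast))
print(f"memory: {before} -> {after} bytes")
```

Both versions compute the same column. The vectorized one avoids the per-row Python call, which is where the time goes on large frames.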

When stuck: isolate the problem on a small sample, print shapes and dtypes at each step, and verify one row by hand. Most "model" bugs are actually data bugs.
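
One way to follow that routine is a small pass-through helper. The sketch below is hypothetical: `checkpoint` is not a pandas function, and the `full` frame is invented so that one row can be checked by hand.

```python
import pandas as pd

def checkpoint(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Print shape, dtypes, and missing-value count, then return df unchanged."""
    print(f"[{label}] shape={df.shape} missing={int(df.isna().sum().sum())}")
    print(df.dtypes.to_string())
    return df

# Hypothetical data; in practice this would be the frame that misbehaves.
full = pd.DataFrame({"id": range(1000), "value": [str(i) for i in range(1000)]})

# Isolate the problem on a small, reproducible sample.
sample = checkpoint(full.sample(n=50, random_state=0), "sample")

# Re-check after each transformation step.
sample = checkpoint(sample.assign(value=pd.to_numeric(sample["value"])), "after cast")

# Verify one row by hand: value was built as str(id), so after the cast they must match.
row = sample.iloc[0]
print(row.to_dict())
```

Because the helper returns the frame it was given, it can be dropped between any two steps of an existing pipeline and removed again without changing results.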

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets