Chapter 19 — Troubleshooting

Data Science Troubleshooting Guide

When something breaks or looks wrong, find the symptom here. Each entry gives the likely cause and the concrete fix — for errors, modeling problems, and silent data issues.

Search this chapter (sidebar search) for an error message or symptom. Format is always Symptom → Cause → Fix.
19.1 Common error messages
Error | Cause | Fix
could not convert string to float | Text values in a numeric column (e.g. "N/A", "$1,200") | Strip currency symbols and thousands separators, then pd.to_numeric(col, errors='coerce'), then impute
SettingWithCopyWarning | Chained indexing on a slice | Use .loc[rows, col] = value, or take an explicit .copy() of the slice first
ValueError: Input contains NaN | Model received missing values | Impute before .fit(); re-check for NaN after merges
shapes not aligned / dimension mismatch | Train and predict feature columns differ | Align columns; fit the encoder on train, reuse it on test
MemoryError on load | Dataset too large for RAM | dtype downcast, usecols, chunksize, or Polars/Dask
ConvergenceWarning (LogReg) | Unscaled features or too few iterations | Scale features; raise max_iter
Found unknown categories (OneHot) | Test set has a category not seen in train | OneHotEncoder(handle_unknown='ignore')
KeyError: 'column' | Renamed column or stray whitespace in the name | df.columns = df.columns.str.strip(); check df.columns
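Several of the fixes in this table can be sketched together. The example below is a minimal illustration, not a definitive recipe: the DataFrame, its column names, and its values are made up for the demonstration, and median imputation is just one simple choice.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical raw data: padded column name, currency strings, an "N/A" placeholder
raw = pd.DataFrame({
    " price ": ["$1,200", "950", "N/A"],
    "city": ["Oslo", "Lima", "Oslo"],
})

# KeyError: 'column' -> strip whitespace and assign the cleaned names back
raw.columns = raw.columns.str.strip()

# could not convert string to float -> remove "$" and "," first, then coerce
cleaned = raw["price"].str.replace(r"[$,]", "", regex=True)
raw["price"] = pd.to_numeric(cleaned, errors="coerce")  # "N/A" becomes NaN

# Input contains NaN -> impute before fitting (median is an assumption here)
raw["price"] = raw["price"].fillna(raw["price"].median())

# SettingWithCopyWarning -> take an explicit copy, then assign with .loc
oslo = raw[raw["city"] == "Oslo"].copy()
oslo.loc[:, "price"] = oslo["price"] * 2

# Found unknown categories -> fit on train only, ignore unseen values later
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(raw[["city"]])
test = pd.DataFrame({"city": ["Lima", "Cairo"]})  # "Cairo" was not in train
encoded = encoder.transform(test).toarray()
print(encoded)  # the unseen category is encoded as an all-zero row
```

Stripping symbols before coercing matters: without that step, "$1,200" would be turned into NaN and then imputed, silently replacing a valid value.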
19.2 Overfitting
Symptom
Train accuracy is far above test accuracy (e.g. train = 99%, test = 62%). The model memorised the training data and fails on new data.
Fix
Cross-validation · regularization (L1/L2, max_depth) · fewer features · more data · early stopping · simpler model.
python
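# Detect the gap with cross-validation, not a single split.
# Assumes `model`, `X_train`, `y_train` already exist and the target is binary.
from sklearn.model_selection import cross_validate
cv = cross_validate(model, X_train, y_train, cv=5, scoring='roc_auc',
                    return_train_score=True)
print(f"Train AUC: {cv['train_score'].mean():.3f}")
print(f"CV AUC: {cv['test_score'].mean():.3f} ± {cv['test_score'].std():.3f}")
# Big train-vs-CV gap = overfitting. Add regularization / reduce complexity.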
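To make the effect of one of the listed fixes concrete, here is a small sketch of how limiting max_depth narrows the train-vs-validation gap. It assumes synthetic data from make_classification and a decision tree; the dataset sizes, the depth of 3, and the helper name train_cv_gap are illustrative choices, not values from this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: flip_y=0.2 assigns random labels to about 20% of samples
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

def train_cv_gap(model):
    """Return (mean train accuracy, mean validation accuracy) over 5 folds."""
    res = cross_validate(model, X, y, cv=5, return_train_score=True)
    return res["train_score"].mean(), res["test_score"].mean()

# Unconstrained tree: free to memorise the training folds
deep_train, deep_cv = train_cv_gap(DecisionTreeClassifier(random_state=0))

# Regularized tree: depth capped at 3 (hypothetical setting for the demo)
shallow_train, shallow_cv = train_cv_gap(
    DecisionTreeClassifier(max_depth=3, random_state=0)
)

print(f"deep:    train={deep_train:.3f}  cv={deep_cv:.3f}")
print(f"shallow: train={shallow_train:.3f}  cv={shallow_cv:.3f}")
```

The number to watch is the difference between the two columns, not the training score alone: a perfect training score paired with a much lower validation score is the overfitting signature described above.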
19.3 Underfitting
Symptom
Train and test scores are both low. The model is too simple to capture the pattern.
Fix
More/better features · less regularization · a more powerful model · train longer · reduce feature noise.
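A minimal sketch of the "more/better features" fix, assuming a synthetic quadratic relationship: a plain linear model scores poorly on both train and test, and adding a squared feature removes the underfit. The data and the degree-2 choice are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y depends on x squared, plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Too simple: a straight line cannot follow a parabola
linear = LinearRegression().fit(X_tr, y_tr)

# Better features: add x^2 so the same linear learner can fit the curve
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quadratic.fit(X_tr, y_tr)

print(f"linear    R2 train={linear.score(X_tr, y_tr):.3f} test={linear.score(X_te, y_te):.3f}")
print(f"quadratic R2 train={quadratic.score(X_tr, y_tr):.3f} test={quadratic.score(X_te, y_te):.3f}")
```

Low scores on both splits, as the linear model shows here, point to underfitting rather than overfitting, so adding regularization or more data would not help in this case.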
19.4 Silent data problems (no error, wrong answer)
Symptom | Likely cause | Fix
Test score suspiciously perfect (≈1.0) | Data leakage: target info in features | Audit features; see Chapter 20
Row count grew after a merge | Many-to-many join on a non-unique key | Check key uniqueness; validate='1:1'
Aggregates double the expected total | Duplicate rows | df.duplicated().sum() then drop
Model predicts one class only | Severe imbalance / wrong threshold | Class weights, resampling, tune threshold
Great offline, poor in production | Train/serve skew or drift | Match preprocessing; monitor drift (Ch 21)
Feature importance dominated by an ID | Leaky identifier learned by the model | Drop IDs from features
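The merge and duplicate rows in this table are easy to reproduce. The sketch below uses two small hypothetical tables (orders and customers are invented names) to show the row count growing silently, then uses the validate argument of merge to turn the silent problem into an error. It uses validate='m:1' rather than '1:1' because several orders per customer are expected here; only the lookup side must be unique.

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 20.0, 5.0]})
# Hypothetical lookup table where customer 2 was accidentally entered twice
customers = pd.DataFrame({"customer_id": [1, 2, 2], "region": ["N", "S", "S"]})

# Silent problem: no error, but rows and totals are inflated
merged = orders.merge(customers, on="customer_id")
print(len(orders), "->", len(merged))
print(orders["amount"].sum(), "->", merged["amount"].sum())

# Check for fully duplicated rows in the lookup table
print("duplicate rows:", customers.duplicated().sum())

# validate makes pandas raise instead of silently multiplying rows
rejected = False
try:
    orders.merge(customers, on="customer_id", validate="m:1")
except pd.errors.MergeError as exc:
    rejected = True
    print("merge rejected:", exc)

# Fix: deduplicate the lookup side, then merge with the check left on
customers_unique = customers.drop_duplicates()
fixed = orders.merge(customers_unique, on="customer_id", validate="m:1")
print(len(fixed), fixed["amount"].sum())
```

Leaving validate in place after the fix costs little and guards against the duplicate reappearing in a later data refresh.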
19.5 Performance & scaling
Symptom | Cause | Fix
Pandas operation very slow | Row-wise apply / Python loops | Vectorize; use built-in pandas/numpy ops
Notebook RAM keeps growing | Copies of large frames kept alive | del + downcast dtypes; process in chunks
Training takes hours | Too many estimators / unscaled SVM/KNN | LightGBM, subsample, fewer trees, GPU
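The first two rows can be illustrated with a short sketch, assuming a synthetic frame with made-up column names (qty and price). It compares a row-wise apply with the equivalent vectorized expression, then downcasts the numeric columns and measures memory before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "qty": rng.integers(1, 100, size=10_000),    # int64 by default
    "price": rng.uniform(1, 50, size=10_000),    # float64 by default
})

# Slow: a Python function is called once per row
slow = df.apply(lambda row: row["qty"] * row["price"], axis=1)

# Fast: one vectorized operation over whole columns, same result
fast = df["qty"] * df["price"]
print("same result:", np.allclose(slow, fast))

# Downcast to the smallest dtype that can hold the values
before = df.memory_usage(deep=True).sum()
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes.to_string())
print(f"memory: {before} -> {after} bytes")
```

Downcasting floats trades precision for memory, so it is worth confirming that the reduced precision is acceptable for the column before applying it broadly.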
When stuck: isolate the problem on a small sample, print shapes and dtypes at each step, and verify one row by hand. Most "model" bugs are actually data bugs.
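One way to apply this advice is a small pass-through helper used with DataFrame.pipe. The function name checkpoint and the sample frame are hypothetical; this is a sketch of the idea, not a standard pandas utility.

```python
import pandas as pd

def checkpoint(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Print shape and dtypes, then return the frame unchanged."""
    print(f"[{label}] shape={df.shape}")
    print(df.dtypes.to_string())
    return df

# Hypothetical data with one missing value
sample = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", None, "z"]})

result = (
    sample.head(3)                      # isolate the problem on a small sample
    .pipe(checkpoint, "after head")
    .dropna()
    .pipe(checkpoint, "after dropna")   # shape should drop from (3, 2) to (2, 2)
)

# Verify one row by hand against the input
print(result.iloc[0].to_dict())
```

Because the helper returns its input unchanged, it can be added to or removed from a method chain without altering the result.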
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets