Chapter 19 — Troubleshooting
Data Science Troubleshooting Guide
When something breaks or looks wrong, find the symptom here. Each entry gives the likely cause and the concrete fix — for errors, modeling problems, and silent data issues.
Search this chapter (sidebar search) for an error message or symptom. Format is always Symptom → Cause → Fix.
19.1 Common error messages
| Error | Cause | Fix |
|---|---|---|
| could not convert string to float | Text values in a numeric column (e.g. "N/A", "$1,200") | Strip symbols such as $ and commas, then pd.to_numeric(col, errors='coerce'); impute the remaining NaN |
| SettingWithCopyWarning | Chained indexing on a slice | Use df.loc[rows, col] = value, or take an explicit .copy() of the slice |
| ValueError: Input contains NaN | Model received missing values | Impute before .fit(); check after merges |
| shapes not aligned / dimension mismatch | Train and predict feature columns differ | Align columns; fit encoder on train, reuse on test |
| MemoryError on load | Dataset too large for RAM | dtype downcast, usecols, chunksize, or Polars/Dask |
| ConvergenceWarning (LogReg) | Unscaled features or too few iterations | Scale features; raise max_iter |
| Found unknown categories (OneHot) | Test set has a category not seen in train | OneHotEncoder(handle_unknown='ignore') |
| KeyError: 'column' | Column was renamed or has stray whitespace in its name | df.columns = df.columns.str.strip(); check df.columns |
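The "could not convert string to float" fix can be sketched as follows. This is a minimal illustration, assuming a hypothetical column of mixed price strings; the sample values and the choice of median imputation are assumptions, not part of the guide. Note that coercing directly would turn "$1,200" into NaN, so the symbols are stripped first.

```python
import pandas as pd

# Hypothetical column mixing numbers, currency strings, and placeholders.
raw = pd.Series(["1200", "$1,200", "N/A", "87.5", " 40 "])

# Strip currency symbols, thousands separators, and whitespace first, so
# values like "$1,200" are parsed instead of being coerced to NaN.
stripped = raw.str.replace(r"[$,\s]", "", regex=True)

# Anything still non-numeric (here only "N/A") becomes NaN.
price = pd.to_numeric(stripped, errors="coerce")
print("missing after coercion:", price.isna().sum())

# Median imputation is only one option; choose what suits the column.
price_filled = price.fillna(price.median())
print(price_filled.tolist())
```

Counting the NaN values right after coercion is a cheap way to see how much data the cleaning step could not recover before anything is imputed.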
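The "Found unknown categories" and "dimension mismatch" rows share one remedy: fit the encoder on the training data only and reuse it on the test data. A minimal sketch, assuming a hypothetical single `city` column (the frames and values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "Oslo" appears only in the test set.
train = pd.DataFrame({"city": ["Paris", "Lima", "Paris"]})
test = pd.DataFrame({"city": ["Lima", "Oslo"]})

# Fit on train only; unknown test categories are encoded as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["city"]])

# Reusing the fitted encoder guarantees the same columns in both sets.
X_train = encoder.transform(train[["city"]]).toarray()
X_test = encoder.transform(test[["city"]]).toarray()

print(encoder.categories_)
print(X_train.shape, X_test.shape)
```

Because the same fitted encoder produces both matrices, the column count and order match by construction, which is what the model needs at predict time.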
19.2 Overfitting
Symptom
Train accuracy = 99%, Test accuracy = 62%. The model memorised the training data and fails on new data.
Fix
Cross-validation · regularization (L1/L2, max_depth) · fewer features · more data · early stopping · simpler model.
```python
# Detect the gap with cross-validation, not a single split.
# Assumes model, X_train, and y_train are already defined.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
# Big train-vs-CV gap = overfitting. Add regularization / reduce complexity.
```
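To put a number on the train-vs-CV gap, one option is `cross_validate` with `return_train_score=True`. The sketch below is illustrative only: the synthetic dataset, the decision tree, and the helper name `train_cv_gap` are assumptions chosen to make the effect visible, not a recommendation for any particular model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomises a share of labels).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

def train_cv_gap(model):
    """Mean train accuracy minus mean validation accuracy over 5 folds."""
    res = cross_validate(
        model, X, y, cv=5, scoring="accuracy", return_train_score=True
    )
    return res["train_score"].mean() - res["test_score"].mean()

# An unrestricted tree can memorise the training folds.
deep_gap = train_cv_gap(DecisionTreeClassifier(random_state=0))

# Limiting max_depth is one form of regularization.
shallow_gap = train_cv_gap(DecisionTreeClassifier(max_depth=2, random_state=0))

print(f"gap without depth limit: {deep_gap:.3f}")
print(f"gap with max_depth=2:    {shallow_gap:.3f}")
```

A shrinking gap after regularization suggests the change is helping; whether the validation score itself also improves still has to be checked separately.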
19.3 Underfitting
Symptom
Train and test scores are both low. The model is too simple to capture the pattern.
Fix
More/better features · less regularization · a more powerful model · train longer · reduce feature noise.
19.4 Silent data problems (no error, wrong answer)
| Symptom | Likely cause | Fix |
|---|---|---|
| Test score suspiciously perfect (≈1.0) | Data leakage — target info in features | Audit features; see Chapter 20 |
| Row count grew after a merge | Many-to-many join on a non-unique key | Check key uniqueness; validate='1:1' |
| Aggregates double the expected total | Duplicate rows | df.duplicated().sum() then drop |
| Model predicts one class only | Severe imbalance / wrong threshold | Class weights, resampling, tune threshold |
| Great offline, poor in production | Train/serve skew or drift | Match preprocessing; monitor drift (Ch 21) |
| Feature importance dominated by an ID | Leaky identifier learned by the model | Drop IDs from features |
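The merge and duplicate rows in the table above can be reproduced on a toy example. This is a sketch under stated assumptions: the `orders` and `customers` frames are hypothetical, and because one customer can legitimately have several orders, it uses `validate="m:1"` rather than the `'1:1'` variant named in the table.

```python
import pandas as pd

# Hypothetical tables: customer 2 appears twice in `customers` by mistake.
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 20.0, 5.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 2], "segment": ["A", "B", "B"]})

# Unchecked merge: every order for customer 2 matches both customer rows.
merged = orders.merge(customers, on="customer_id")
print(len(orders), "->", len(merged), "rows; total amount", merged["amount"].sum())

# Diagnose: is the lookup key unique, and are there fully duplicated rows?
print("key unique:", customers["customer_id"].is_unique)
print("duplicate rows:", customers.duplicated().sum())

# Guard: validate makes pandas raise instead of silently multiplying rows.
try:
    orders.merge(customers, on="customer_id", validate="m:1")
    guard_raised = False
except pd.errors.MergeError:
    guard_raised = True
print("guard raised:", guard_raised)

# Fix: drop the duplicates, then merge with the guard in place.
fixed = orders.merge(customers.drop_duplicates(), on="customer_id", validate="m:1")
print(len(fixed), "rows; total amount", fixed["amount"].sum())
```

Comparing row counts and a known total before and after each merge catches this class of bug early, since no exception is raised without `validate`.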
19.5 Performance & scaling
| Symptom | Cause | Fix |
|---|---|---|
| Pandas operation very slow | Row-wise apply / Python loops | Vectorize; use built-in pandas/numpy ops |
| Notebook RAM keeps growing | Copies of large frames kept alive | del + downcast dtypes; process in chunks |
| Training takes hours | Too many estimators / unscaled SVM/KNN | Scale features for SVM/KNN; subsample; fewer trees; LightGBM or GPU |
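The first two rows of the table can be illustrated together: replacing a row-wise `apply` with a vectorized expression, and downcasting dtypes to cut memory. A minimal sketch on synthetic data; the column names and sizes are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for a large dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "qty": rng.integers(1, 100, size=100_000),
    "unit_price": rng.random(100_000) * 50,
})

# Slow pattern: a Python function called once per row (shown on 1,000 rows).
slow = df.head(1000).apply(lambda row: row["qty"] * row["unit_price"], axis=1)

# Vectorized equivalent: one operation over whole columns.
df["revenue"] = df["qty"] * df["unit_price"]

# Downcast to the smallest dtype that still holds the values.
before = df.memory_usage(deep=True).sum()
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")
df["unit_price"] = pd.to_numeric(df["unit_price"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"memory: {before:,} -> {after:,} bytes")
```

Downcasting floats trades precision for memory, so it is worth confirming that the reduced precision is acceptable for the column before applying it everywhere.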
When stuck: isolate the problem on a small sample, print shapes and dtypes at each step, and verify one row by hand. Most "model" bugs are actually data bugs.
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
Quick cheatsheet
- df.info() -> Structure and non-null counts
- df.describe() -> Numeric summary statistics
- df.isnull().sum() -> Missing-value counts by column
- df.groupby() -> Segmented aggregation
- pd.merge() -> Join multiple datasets