Chapter 04 — Cleaning

Data Cleaning

Handle missing values, duplicates, outliers, incorrect types, and messy strings. This is often 60–70% of the work.

Always backup first: df_raw = df.copy() — run this before any cleaning!
4.0 Cleaning decision guide
Cleaning methodUse whenWhy use itSkip when
Drop rowsOnly a small number of rows are missing or invalidFast and simple when data loss is tinySkip when missingness is large, random, or concentrated in important columns
Fill with meanNumeric data is roughly symmetric and outliers are not severeKeeps the overall average stableSkip for skewed numeric columns like income, prices, or counts with strong outliers
Fill with medianNumeric data is skewed or has outliersMore robust than mean for messy real-world valuesSkip only when exact average behavior is required
Fill with mode / "Unknown"Categorical data has missing valuesPreserves the category structureSkip when missingness itself is meaningful and should be analyzed separately
Forward fill / backward fillTime series or ordered recordsUses nearby values where continuity is expectedSkip for unrelated rows, shuffled data, or data without a natural order
Drop duplicatesDuplicate rows are accidental copiesPrevents double countingSkip when repeated rows are valid events, transactions, or measurements
IQR clippingYou want to reduce the impact of extreme numeric outliersProtects analysis from extreme values without removing rowsSkip when extreme values are real business events that must stay visible
Z-score filteringData is close to normal and you need a statistical outlier ruleGood for controlled numeric dataSkip for strongly skewed data or small samples where Z-scores are unstable
String standardizationText values have spaces, case differences, or symbolsMakes categories consistentSkip only if the original text format must be preserved exactly
4.1 Handle missing values
python
# Drop rows where ALL values are missing
df.dropna(how='all', inplace=True)

# Drop rows if KEY columns are missing
df.dropna(subset=['id', 'date'], inplace=True)

# Fill with a constant value
df['col'].fillna(0, inplace=True)
df['category'].fillna('Unknown', inplace=True)

# Fill numeric with mean or median
df['age'].fillna(df['age'].mean(), inplace=True)
df['salary'].fillna(df['salary'].median(), inplace=True)

# Fill by group mean (smarter)
df['salary'] = df.groupby('department')['salary'].transform(
    lambda x: x.fillna(x.mean())
)

# Forward fill / backward fill (for time series)
df['price'].fillna(method='ffill', inplace=True)
4.2 Fix data types
python
# Convert to numeric (coerce turns errors to NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Convert to category (saves memory)
df['status'] = df['status'].astype('category')

# Convert to integer
df['count'] = df['count'].astype('int64')
4.3 Clean string / text columns
python
# Strip whitespace and lowercase
df['name'] = df['name'].str.strip().str.lower()

# Remove currency symbols and commas
df['price'] = df['price'].str.replace('$', '').str.replace(',', '')

# Remove non-numeric characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

# Standardize inconsistent values
df['gender'] = df['gender'].replace({
    'm': 'Male', 'M': 'Male', 'male': 'Male',
    'f': 'Female', 'F': 'Female', 'female': 'Female'
})
4.4 Remove duplicates
python
# Remove exact duplicates
df.drop_duplicates(inplace=True)

# Remove duplicates based on key column (keep latest)
df.drop_duplicates(subset=['customer_id'], keep='last', inplace=True)
4.5 Handle outliers
python
# IQR method (recommended)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Option A: Remove outlier rows
df_clean = df[(df['salary'] >= lower) & (df['salary'] <= upper)]

# Option B: Cap/clip outliers (safer)
df['salary'] = df['salary'].clip(lower, upper)

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary'].dropna()))
df_clean = df[z_scores < 3]
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets