Chapter 04 — Cleaning

Data Cleaning

Handle missing values, duplicates, outliers, incorrect types, and messy strings. This is often 60–70% of the work.

Always backup first: df_raw = df.copy() — run this before any cleaning!

4.0 Cleaning decision guide

Cleaning method	Use when	Why use it	Skip when
Drop rows	Only a small number of rows are missing or invalid	Fast and simple when data loss is tiny	Skip when missingness is large, random, or concentrated in important columns
Fill with mean	Numeric data is roughly symmetric and outliers are not severe	Keeps the overall average stable	Skip for skewed numeric columns like income, prices, or counts with strong outliers
Fill with median	Numeric data is skewed or has outliers	More robust than mean for messy real-world values	Skip only when exact average behavior is required
Fill with mode / "Unknown"	Categorical data has missing values	Preserves the category structure	Skip when missingness itself is meaningful and should be analyzed separately
Forward fill / backward fill	Time series or ordered records	Uses nearby values where continuity is expected	Skip for unrelated rows, shuffled data, or data without a natural order
Drop duplicates	Duplicate rows are accidental copies	Prevents double counting	Skip when repeated rows are valid events, transactions, or measurements
IQR clipping	You want to reduce the impact of extreme numeric outliers	Protects analysis from extreme values without removing rows	Skip when extreme values are real business events that must stay visible
Z-score filtering	Data is close to normal and you need a statistical outlier rule	Good for controlled numeric data	Skip for strongly skewed data or small samples where Z-scores are unstable
String standardization	Text values have spaces, case differences, or symbols	Makes categories consistent	Skip only if the original text format must be preserved exactly

DataXForgeClean without code: CSV Cleaner · Missing Value Cleaner · Remove Duplicate Rows · Outlier Detector · Data Standardizer.

4.1 Handle missing values

python

# Drop rows where ALL values are missing
df.dropna(how='all', inplace=True)

# Drop rows if KEY columns are missing
df.dropna(subset=['id', 'date'], inplace=True)

# Fill with a constant value
df['col'].fillna(0, inplace=True)
df['category'].fillna('Unknown', inplace=True)

# Fill numeric with mean or median
df['age'].fillna(df['age'].mean(), inplace=True)
df['salary'].fillna(df['salary'].median(), inplace=True)

# Fill by group mean (smarter)
df['salary'] = df.groupby('department')['salary'].transform(
    lambda x: x.fillna(x.mean())
)

# Forward fill / backward fill (for time series)
df['price'].fillna(method='ffill', inplace=True)

4.2 Fix data types

python

# Convert to numeric (coerce turns errors to NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Convert to category (saves memory)
df['status'] = df['status'].astype('category')

# Convert to integer
df['count'] = df['count'].astype('int64')

4.3 Clean string / text columns

python

# Strip whitespace and lowercase
df['name'] = df['name'].str.strip().str.lower()

# Remove currency symbols and commas
df['price'] = df['price'].str.replace('$', '').str.replace(',', '')

# Remove non-numeric characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

# Standardize inconsistent values
df['gender'] = df['gender'].replace({
    'm': 'Male', 'M': 'Male', 'male': 'Male',
    'f': 'Female', 'F': 'Female', 'female': 'Female'
})

4.4 Remove duplicates

python

# Remove exact duplicates
df.drop_duplicates(inplace=True)

# Remove duplicates based on key column (keep latest)
df.drop_duplicates(subset=['customer_id'], keep='last', inplace=True)

4.5 Handle outliers

python

# IQR method (recommended)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Option A: Remove outlier rows
df_clean = df[(df['salary'] >= lower) & (df['salary'] <= upper)]

# Option B: Cap/clip outliers (safer)
df['salary'] = df['salary'].clip(lower, upper)

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary'].dropna()))
df_clean = df[z_scores < 3]

Common mistakes to avoid

Dropping null values blindly without checking missingness pattern
Converting dtypes without validating failed coercions
Deleting all outliers even when they are real business events

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets