Chapter 04 — Cleaning
Data Cleaning
Handle missing values, duplicates, outliers, incorrect types, and messy strings. This is often 60–70% of the work.
Always backup first:
df_raw = df.copy() — run this before any cleaning!4.0 Cleaning decision guide
| Cleaning method | Use when | Why use it | Skip when |
|---|---|---|---|
| Drop rows | Only a small number of rows are missing or invalid | Fast and simple when data loss is tiny | Skip when missingness is large, random, or concentrated in important columns |
| Fill with mean | Numeric data is roughly symmetric and outliers are not severe | Keeps the overall average stable | Skip for skewed numeric columns like income, prices, or counts with strong outliers |
| Fill with median | Numeric data is skewed or has outliers | More robust than mean for messy real-world values | Skip only when exact average behavior is required |
| Fill with mode / "Unknown" | Categorical data has missing values | Preserves the category structure | Skip when missingness itself is meaningful and should be analyzed separately |
| Forward fill / backward fill | Time series or ordered records | Uses nearby values where continuity is expected | Skip for unrelated rows, shuffled data, or data without a natural order |
| Drop duplicates | Duplicate rows are accidental copies | Prevents double counting | Skip when repeated rows are valid events, transactions, or measurements |
| IQR clipping | You want to reduce the impact of extreme numeric outliers | Protects analysis from extreme values without removing rows | Skip when extreme values are real business events that must stay visible |
| Z-score filtering | Data is close to normal and you need a statistical outlier rule | Good for controlled numeric data | Skip for strongly skewed data or small samples where Z-scores are unstable |
| String standardization | Text values have spaces, case differences, or symbols | Makes categories consistent | Skip only if the original text format must be preserved exactly |
DataXForgeClean without code: CSV Cleaner · Missing Value Cleaner · Remove Duplicate Rows · Outlier Detector · Data Standardizer.
4.1 Handle missing values
python
# Drop rows where ALL values are missing df.dropna(how='all', inplace=True) # Drop rows if KEY columns are missing df.dropna(subset=['id', 'date'], inplace=True) # Fill with a constant value df['col'].fillna(0, inplace=True) df['category'].fillna('Unknown', inplace=True) # Fill numeric with mean or median df['age'].fillna(df['age'].mean(), inplace=True) df['salary'].fillna(df['salary'].median(), inplace=True) # Fill by group mean (smarter) df['salary'] = df.groupby('department')['salary'].transform( lambda x: x.fillna(x.mean()) ) # Forward fill / backward fill (for time series) df['price'].fillna(method='ffill', inplace=True)
4.2 Fix data types
python
# Convert to numeric (coerce turns errors to NaN) df['price'] = pd.to_numeric(df['price'], errors='coerce') # Convert to datetime df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d') # Convert to category (saves memory) df['status'] = df['status'].astype('category') # Convert to integer df['count'] = df['count'].astype('int64')
4.3 Clean string / text columns
python
# Strip whitespace and lowercase df['name'] = df['name'].str.strip().str.lower() # Remove currency symbols and commas df['price'] = df['price'].str.replace('$', '').str.replace(',', '') # Remove non-numeric characters df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True) # Standardize inconsistent values df['gender'] = df['gender'].replace({ 'm': 'Male', 'M': 'Male', 'male': 'Male', 'f': 'Female', 'F': 'Female', 'female': 'Female' })
4.4 Remove duplicates
python
# Remove exact duplicates df.drop_duplicates(inplace=True) # Remove duplicates based on key column (keep latest) df.drop_duplicates(subset=['customer_id'], keep='last', inplace=True)
4.5 Handle outliers
python
# IQR method (recommended) Q1 = df['salary'].quantile(0.25) Q3 = df['salary'].quantile(0.75) IQR = Q3 - Q1 lower = Q1 - 1.5 * IQR upper = Q3 + 1.5 * IQR # Option A: Remove outlier rows df_clean = df[(df['salary'] >= lower) & (df['salary'] <= upper)] # Option B: Cap/clip outliers (safer) df['salary'] = df['salary'].clip(lower, upper) # Z-score method from scipy import stats z_scores = np.abs(stats.zscore(df['salary'].dropna())) df_clean = df[z_scores < 3]
Common mistakes to avoid
- Dropping null values blindly without checking missingness pattern
- Converting dtypes without validating failed coercions
- Deleting all outliers even when they are real business events
Quick cheatsheet
df.info() -> Structure and non-null countsdf.describe() -> Numeric summary statisticsdf.isnull().sum() -> Missing-value counts by columndf.groupby() -> Segmented aggregationpd.merge() -> Join multiple datasets