Chapter 03 — EDA
Exploratory Data Analysis
Understand the structure, shape, and content of your dataset before doing anything else. Never skip this step.
3.0 EDA decision rules
| Check | When to prioritize | Why it matters | Can skip? |
|---|---|---|---|
| Shape + dtypes | Always, first 1-2 minutes | Defines what operations are valid | Never skip |
| Missingness map | Any real-world dataset | Drives cleaning strategy | Skip only if you have already confirmed zero nulls |
| Duplicate checks | Logs, transactions, joins | Avoids double-counting bias | Skip only for guaranteed unique IDs |
| Distribution review | Before outlier handling/statistics | Decides mean vs median and test choice | Do not skip for numeric analysis |
| Category balance | Classification or segmentation tasks | Detects class imbalance and rare groups | Skip for pure regression without categories |
If time is short, never skip: shape, dtypes, null rate, duplicates, and one distribution check per key numeric column. This minimum catches the most common data problems before they distort the analysis.
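The minimum set above can be bundled into one helper. This is a minimal sketch, not a built-in pandas function: the name `minimum_checks` and the toy DataFrame are hypothetical, and skewness is used here as just one possible quick distribution check.

```python
import pandas as pd

def minimum_checks(df: pd.DataFrame) -> dict:
    """Collect the never-skip checks in one dictionary (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,                                        # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),                # column -> dtype name
        "null_rate_pct": (df.isnull().mean() * 100).round(2).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),             # full-row duplicates
        "skew": numeric.skew().round(2).to_dict(),                # quick distribution check
    }

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [10.0, 12.5, 12.5, None],
    "channel": ["web", "store", "store", "web"],
})
report = minimum_checks(toy)
print(report)
```

On this toy frame the helper reports one duplicate row and a 25% null rate for `amount`. On real data, a skew far from zero is a hint to plot the column before choosing summary statistics.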
3.1 First look — always run these first
python
df.shape              # (rows, columns)
df.head(10)           # first 10 rows
df.tail(5)            # last 5 rows
df.sample(5)          # random sample
df.columns.tolist()   # column names as list
df.dtypes             # data type of each column
3.2 Data types and memory usage
python
df.info()  # shows: column names, non-null count, dtype, memory usage
memory_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory: {memory_mb:.1f} MB")
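To see where the memory goes, the same call can be broken down per column. A small sketch on made-up data; the DataFrame `toy` and its column names are assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "user_id": range(1000),
    "country": ["DE", "FR", "US", "IN"] * 250,
})

# Per-column memory in bytes; deep=True also counts the contents of object columns
per_column = toy.memory_usage(deep=True, index=False)
print(per_column.sort_values(ascending=False))

# Shallow usage skips that inspection, so it can understate text columns
shallow = toy.memory_usage(deep=False, index=False)
print((per_column - shallow).rename("extra_bytes_from_deep"))
```

The size of the gap between the two numbers depends on the pandas version and on how strings are stored, so treat the difference as indicative rather than exact.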
3.3 Summary statistics
python
df.describe()                                        # numeric columns only
df.describe(include='all')                           # all columns including text
df.describe(percentiles=[.05, .25, .5, .75, .95])    # custom percentiles
3.4 Missing values — full overview
python
missing = pd.DataFrame({
'count': df.isnull().sum(),
'percent': (df.isnull().sum() / len(df) * 100).round(2)
})
missing[missing['count'] > 0].sort_values('percent', ascending=False)
3.5 Unique values and duplicates
python
df.nunique()                                  # unique value count per column
df['category'].value_counts()                 # value frequencies, most common first
df['category'].value_counts(normalize=True)   # as proportions (0-1); multiply by 100 for percentages
df.duplicated().sum()                         # how many duplicate rows
df[df.duplicated()]                           # show duplicate rows (all but the first occurrence)
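With no arguments, `df.duplicated()` only flags rows that match in every column. Duplicates on a business key can be checked with the `subset` parameter. A minimal sketch on made-up data; the column names `order_id`, `amount`, and `loaded_at` are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration: order 102 was logged twice
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [20.0, 35.0, 35.0, 15.0],
    "loaded_at": ["09:00", "09:05", "09:06", "09:10"],
})

full_row_dupes = int(orders.duplicated().sum())                  # identical in every column
key_dupes = int(orders.duplicated(subset=["order_id"]).sum())    # repeated order_id values

# keep=False marks every member of a duplicate group, which helps when inspecting them
print(orders[orders.duplicated(subset=["order_id"], keep=False)])

# Compare an aggregate before and after dropping key duplicates
total_raw = orders["amount"].sum()
total_dedup = orders.drop_duplicates(subset=["order_id"])["amount"].sum()
print(full_row_dupes, key_dupes, total_raw, total_dedup)
```

Here the full-row check finds nothing because the timestamps differ, while the key check finds one repeat and the raw total (105.0) overstates the deduplicated total (70.0). Whether dropping the repeat is correct depends on what a row means in your data.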
Write down your observations! Note columns with issues, unexpected types, or high missing rates before proceeding.
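One way to keep those observations is a per-column table saved alongside the analysis. A minimal sketch: the helper name `column_report`, the 20% missing-rate threshold, and the toy data are all assumptions, not fixed rules.

```python
import pandas as pd

def column_report(df: pd.DataFrame, max_missing_pct: float = 20.0) -> pd.DataFrame:
    """One row per column: dtype, missing rate, unique count, and a review flag."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isnull().mean() * 100).round(2),
        "n_unique": df.nunique(),
    })
    # Flag columns above the (arbitrary) missing-rate threshold or with a single value
    report["needs_review"] = (
        (report["missing_pct"] > max_missing_pct) | (report["n_unique"] <= 1)
    )
    return report

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "age": [34, None, 29, None],
    "country": ["DE", "DE", "DE", "DE"],
    "signup": ["2024-01-03", "2024-02-11", "2024-02-12", "2024-03-01"],
})
report = column_report(toy)
print(report)
```

In this example `age` is flagged for its 50% missing rate and `country` for being constant. The table is a starting point for written notes, not a replacement for them.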
Common mistakes to avoid
- Using mean on skewed data without checking distribution first
- Ignoring duplicates and then trusting wrong aggregates
- Assuming correlation means causation
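The first mistake is easy to reproduce. A sketch on made-up values with one extreme observation; the numbers and the Series name are illustrative only.

```python
import pandas as pd

# Toy data, made up for illustration: nine typical values and one extreme one
income = pd.Series([30, 32, 35, 36, 38, 40, 41, 43, 45, 500], name="income_k")

print("skew:  ", round(income.skew(), 2))   # strongly positive: long right tail
print("mean:  ", income.mean())             # pulled up by the single extreme value
print("median:", income.median())           # stays near the typical values
```

The mean comes out at 84.0 while the median is 39.0, so reporting the mean alone would describe a value that almost no row in the data resembles.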
Quick cheatsheet
- df.info() -> Structure and non-null counts
- df.describe() -> Numeric summary statistics
- df.isnull().sum() -> Missing-value counts by column
- df.groupby() -> Segmented aggregation
- pd.merge() -> Join multiple datasets