Chapter 03 — EDA

Exploratory Data Analysis

Understand the structure, shape, and content of your dataset before doing anything else. Never skip this step.

3.0 EDA decision rules
Check | When to prioritize | Why it matters | Can skip?
Shape + dtypes | Always, first 1-2 minutes | Defines what operations are valid | Never skip
Missingness map | Any real-world dataset | Drives cleaning strategy | Skip only if you confirmed zero nulls
Duplicate checks | Logs, transactions, joins | Avoids double-counting bias | Skip only for guaranteed unique IDs
Distribution review | Before outlier handling/statistics | Decides mean vs median and test choice | Do not skip for numeric analysis
Category balance | Classification or segmentation tasks | Detects class imbalance and rare groups | Skip for pure regression without categories
If time is short, never skip: shape, dtypes, null rate, duplicates, and one distribution check per key numeric column. That minimum catches the most common sources of analysis mistakes: wrong types, hidden nulls, double-counted rows, and skewed values.
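The minimum set of checks can be bundled into one helper. The sketch below is only an illustration: the function name `quick_eda`, the toy columns, and the choice of quantiles as the "one distribution check" are assumptions, not part of the original chapter.

```python
import pandas as pd

def quick_eda(df: pd.DataFrame) -> dict:
    """Collect the minimum checks: shape, dtypes, null rate, duplicates, quantiles."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_rate_pct": (df.isnull().mean() * 100).round(2).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        # One distribution check per numeric column: min, quartiles, max
        "quantiles": numeric.quantile([0, 0.25, 0.5, 0.75, 1]).to_dict(),
    }

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, 25.5, 25.5, None],
    "region": ["north", "south", "south", None],
})

report = quick_eda(toy)
print(report["shape"])
print(report["duplicate_rows"])
print(report["null_rate_pct"])
```

Returning a plain dictionary keeps the result easy to log or compare between dataset versions; a real project might prefer a DataFrame or a profiling library instead.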
3.1 First look — always run these first
python
import pandas as pd  # df below is assumed to be an already-loaded DataFrame

df.shape            # (rows, columns)
df.head(10)         # first 10 rows
df.tail(5)          # last 5 rows
df.sample(5)        # random sample
df.columns.tolist() # column names as list
df.dtypes           # data type of each column
3.2 Data types and memory usage
python
df.info()
# Shows: column names, non-null count, dtype, memory usage

memory_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory: {memory_mb:.1f} MB")
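To see which columns account for the memory total, the same `memory_usage` call can be inspected per column. This is a rough sketch on an invented toy frame; the exact byte counts depend on the pandas version and string storage, and the conversion to `category` is shown only as one possible follow-up, not as a recommendation from the original text.

```python
import pandas as pd

# Toy frame, invented for illustration: one numeric and one repetitive text column
toy = pd.DataFrame({
    "value": range(1000),
    "city": ["Amsterdam", "Berlin"] * 500,
})

# deep=False reports only the array buffers; for object-dtype text that means
# pointer sizes, not the strings themselves. deep=True measures the strings too.
shallow = toy.memory_usage(deep=False)
deep = toy.memory_usage(deep=True)

per_column = pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
print(per_column.sort_values("deep_bytes", ascending=False))

# Low-cardinality text is often smaller when stored as a categorical
as_category = toy["city"].astype("category")
before = toy["city"].memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(f"city column: {before} bytes -> {after} bytes as category")
```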
3.3 Summary statistics
python
df.describe()                     # numeric columns only
df.describe(include='all')       # all columns including text
df.describe(percentiles=[.05, .25, .5, .75, .95])  # custom percentiles
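One practical use of the summary is comparing the mean with the median (the `50%` row) to spot skew before choosing a statistic. The example below is a minimal sketch with made-up numbers; the column name `amount` and the single extreme value are assumptions for illustration.

```python
import pandas as pd

# Toy data, invented for illustration: six typical values and one extreme one
toy = pd.DataFrame({"amount": [10, 12, 11, 13, 12, 11, 500]})

stats = toy["amount"].describe(percentiles=[.05, .25, .5, .75, .95])
mean, median = stats["mean"], stats["50%"]
skewness = toy["amount"].skew()

print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}")

# A mean far above the median suggests a right-skewed column,
# where the median is usually the safer summary.
if mean > median:
    print("Mean exceeds median: check for right skew or outliers.")
```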
3.4 Missing values — full overview
python
missing = pd.DataFrame({
    'count': df.isnull().sum(),
    'percent': (df.isnull().sum() / len(df) * 100).round(2)
})
missing[missing['count'] > 0].sort_values('percent', ascending=False)
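The same overview can be tried on a small frame to see what the output looks like, together with a row-level count that shows whether gaps cluster in the same records. The data and column names below are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [34, None, 29, None],
    "income": [52000, 48000, None, None],
    "city": ["Oslo", "Lyon", "Graz", "Bern"],
})

# Column-level view: count and percentage of missing values
missing = pd.DataFrame({
    "count": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(2),
})
print(missing[missing["count"] > 0].sort_values("percent", ascending=False))

# Row-level view: how many fields are missing in each row
rows_missing = toy.isnull().sum(axis=1)
print(rows_missing.value_counts().sort_index())
```

Here `isnull().mean()` is used as a shorthand for dividing the null count by the number of rows; both give the same percentage.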
3.5 Unique values and duplicates
python
df.nunique()                          # unique value count per column
df['category'].value_counts()         # count of each value, most frequent first
df['category'].value_counts(normalize=True)  # as proportions (multiply by 100 for percentages)

df.duplicated().sum()                 # how many duplicate rows
df[df.duplicated()]                   # show duplicate rows (first occurrences excluded; keep=False shows all copies)
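Full-row duplicates and duplicates on a key column are different questions, and `duplicated` can answer both through its `subset` and `keep` parameters. The following is a small sketch on invented data; `order_id` is a hypothetical key column.

```python
import pandas as pd

# Toy data, invented for illustration: order 102 is an exact repeat,
# order 103 appears twice with different amounts
toy = pd.DataFrame({
    "order_id": [101, 102, 102, 103, 103],
    "amount": [20.0, 35.0, 35.0, 15.0, 18.0],
})

full_dupes = int(toy.duplicated().sum())                     # identical in every column
key_dupes = int(toy.duplicated(subset=["order_id"]).sum())   # repeated key, any amount

# keep=False marks every copy, including the first occurrence
all_copies = toy[toy.duplicated(subset=["order_id"], keep=False)]

print(f"full-row duplicates: {full_dupes}, key duplicates: {key_dupes}")
print(all_copies)
```

Checking the key separately matters because a repeated ID with conflicting values is not caught by the full-row check.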
Write down your observations! Note columns with issues, unexpected types, or high missing rates before proceeding.
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets
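The last two cheatsheet entries often appear together: join, then aggregate by segment. The sketch below uses invented tables and column names, and adds the `validate` argument of `pd.merge` as one way to guard against the double counting that joins can introduce.

```python
import pandas as pd

# Toy tables, invented for illustration only
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 15.0, 7.5]})
customers = pd.DataFrame({"customer_id": [1, 2], "segment": ["retail", "wholesale"]})

# validate="many_to_one" raises pandas.errors.MergeError if customer_id
# is not unique in the right-hand table, which would inflate row counts.
joined = pd.merge(orders, customers, on="customer_id", how="left",
                  validate="many_to_one")
print(len(orders), len(joined))  # row count should not change in a many-to-one left join

# Segmented aggregation on the joined result
totals = joined.groupby("segment")["amount"].sum()
print(totals)
```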