Chapter 03 — EDA

Exploratory Data Analysis

Understand the structure, shape, and content of your dataset before doing anything else. Never skip this step.

3.0 EDA decision rules

Check	When to prioritize	Why it matters	Can skip?
Shape + dtypes	Always, first 1-2 minutes	Defines what operations are valid	Never skip
Missingness map	Any real-world dataset	Drives cleaning strategy	Skip only if you confirmed zero nulls
Duplicate checks	Logs, transactions, joins	Avoids double counting bias	Skip only for guaranteed unique IDs
Distribution review	Before outlier handling/statistics	Decides mean vs median and test choice	Do not skip for numeric analysis
Category balance	Classification or segmentation tasks	Detects class imbalance and rare groups	Skip for pure regression without categories

If time is short, never skip these five checks: shape, dtypes, null rate, duplicates, and one distribution check per key numeric column. That minimum catches the most common data-quality problems before they distort your results.
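
As a rough sketch, the minimum checks can be gathered into one helper. The function name `minimum_eda` and the toy data are hypothetical and exist only for illustration. Comparing mean and median stands in for the "one distribution check"; you may prefer a histogram instead.

```python
import pandas as pd

def minimum_eda(df: pd.DataFrame) -> dict:
    """Collect the minimum checks into one dictionary (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_rate": df.isnull().mean().round(3).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        # Mean vs median per numeric column: a large gap hints at skew
        "mean_vs_median": numeric.agg(["mean", "median"]).T.to_dict("index"),
    }

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [10.0, 12.0, 12.0, None],
    "region": ["north", "south", "south", "north"],
})
report = minimum_eda(toy)
print(report)
```

On this toy frame the report flags one duplicate row and a 25% null rate in `amount`, which is the kind of finding worth writing down before any analysis.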

3.1 First look — always run these first

python

df.shape            # (rows, columns)
df.head(10)         # first 10 rows
df.tail(5)          # last 5 rows
df.sample(5)        # random sample
df.columns.tolist() # column names as list
df.dtypes           # data type of each column

3.2 Data types and memory usage

python

df.info()
# Shows: column names, non-null counts, dtypes, and estimated memory usage
# (pass memory_usage='deep' for an accurate figure on text columns)

memory_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Memory: {memory_mb:.1f} MB")
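
To see which columns account for the memory, a per-column breakdown can help. The sketch below uses hypothetical toy data and compares the default (shallow) estimate with `deep=True`. The gap is usually largest for object (text) columns; with newer string dtypes the two figures may be close or identical.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "id": range(1000),
    "city": ["Lisbon", "Osaka"] * 500,
})

shallow = toy.memory_usage(index=False)          # bytes, column containers only
deep = toy.memory_usage(index=False, deep=True)  # bytes, including string contents

memory_by_column = pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
print(memory_by_column.sort_values("deep_bytes", ascending=False))
```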

3.3 Summary statistics

python

df.describe()                     # numeric columns only
df.describe(include='all')       # all columns including text
df.describe(percentiles=[.05, .25, .5, .75, .95])  # custom percentiles
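
The summary statistics also feed the mean-vs-median decision from the rules table. Here is a minimal sketch on hypothetical right-skewed data. The threshold of 1 on absolute skewness is a common rule of thumb and an assumption here, not a pandas default.

```python
import pandas as pd

# Hypothetical right-skewed data: most values are small, one is very large
income = pd.Series([30, 32, 35, 38, 40, 42, 45, 400], name="income")

summary = pd.Series({
    "mean": income.mean(),      # pulled upward by the single large value
    "median": income.median(),  # barely affected by it
    "skew": income.skew(),      # sample skewness; positive means a long right tail
})
print(summary.round(2))

# Rule of thumb (assumption): treat |skew| > 1 as strongly skewed
prefer_median = bool(abs(summary["skew"]) > 1)
print("Report the median:", prefer_median)
```

Here the mean (82.75) is more than twice the median (39.0), so the median describes a typical value better.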

3.4 Missing values — full overview

python

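import pandas as pd  # df is assumed to be loaded already

null_counts = df.isnull().sum()   # missing values per column
missing = pd.DataFrame({
    'count': null_counts,
    'percent': (null_counts / len(df) * 100).round(2)
})
# Keep only columns with missing values, highest percentage first
missing[missing['count'] > 0].sort_values('percent', ascending=False)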
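
The same overview can be wrapped in a small reusable function. The name `missing_report` and the toy data below are hypothetical; this is one possible packaging of the snippet above, not a pandas feature.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, only for affected columns."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "count": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report[report["count"] > 0].sort_values("percent", ascending=False)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Lima", "Oslo", None, "Pune"],
    "score": [1.0, 2.0, 3.0, 4.0],
})
result = missing_report(toy)
print(result)
```

`score` has no missing values, so it does not appear in the report; `age` (50%) is listed before `city` (25%).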

3.5 Unique values and duplicates

python

df.nunique()                                 # unique value count per column (NaN excluded)
df['category'].value_counts()                # count per value, most frequent first
df['category'].value_counts(normalize=True)  # as proportions (0-1); multiply by 100 for percent

df.duplicated().sum()                 # number of rows that repeat an earlier row
df[df.duplicated()]                   # the repeats only; duplicated(keep=False) shows every copy
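
A short sketch of why duplicates matter for aggregates, using hypothetical order data. It also shows two options of `duplicated` that are often useful: `keep=False` to see every copy, and `subset` to check a key column on its own.

```python
import pandas as pd

# Hypothetical toy data: order 102 was loaded twice
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [20.0, 35.0, 35.0, 50.0],
})

later_copies = orders[orders.duplicated()]             # repeats only (1 row)
all_copies = orders[orders.duplicated(keep=False)]     # every copy (2 rows)
key_dupes = int(orders.duplicated(subset=["order_id"]).sum())  # repeats by key

inflated_total = orders["amount"].sum()                   # counts order 102 twice
clean_total = orders.drop_duplicates()["amount"].sum()    # counts it once
print(inflated_total, clean_total)
```

The total drops from 140.0 to 105.0 once the repeated row is removed, which is the double-counting bias the decision rules warn about.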

Write down your observations! Note columns with issues, unexpected types, or high missing rates before proceeding.

Common mistakes to avoid

Using the mean on skewed data without checking the distribution first
Ignoring duplicates and then trusting the inflated aggregates
Assuming correlation means causation

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets