Chapter 05 — Transform

Data Transformation

Reshape, aggregate, merge, and manipulate your data into the form needed for analysis.

5.0 Transformation strategy (when / why / skip)

Method	Use when	Why use it	Skip when
Filter rows	You need scope-specific analysis	Removes irrelevant records early	Skip permanent filtering until you save a full backup slice
GroupBy aggregation	You need KPIs by segment	Converts row-level noise into decision-level metrics	Skip when event-level timing is critical
Pivot table	Matrix-style summaries for reporting	Fast comparison across two dimensions	Skip for high-cardinality columns with huge sparse matrices
Merge / join	Metrics are split across sources	Builds a complete analytical view	Skip before checking key uniqueness and join cardinality
Custom apply/map	Business logic is not built-in	Encodes domain rules explicitly	Skip row-wise apply on very large data when vectorized options exist

Before any merge, check one-to-one vs one-to-many key behavior. Wrong join cardinality can silently duplicate rows and break all downstream metrics.

DataXForgeReshape data visually: Filter Rows · Sort Data · Merge Datasets · Rename Columns · ETL Studio for full pipelines.

5.1 Select and filter

python

# Select specific columns
df[['name', 'age', 'salary']]

# Filter rows by condition
df[df['age'] > 30]
df[(df['age'] > 30) & (df['salary'] > 50000)]
df[df['city'].isin(['Dhaka', 'Chittagong'])]
df[~df['status'].isin(['Inactive'])]   # NOT in list

# Query syntax (cleaner for complex filters)
df.query('age > 30 and salary > 50000 and city == "Dhaka"')

5.2 Sort and rank

python

# Sort by one column
df.sort_values('salary', ascending=False)

# Sort by multiple columns
df.sort_values(['department', 'salary'], ascending=[True, False])

# Add rank column
df['rank'] = df['salary'].rank(ascending=False).astype('int')
df['pct_rank'] = df['salary'].rank(pct=True)

5.3 Group and aggregate

python

# Simple groupby
df.groupby('department')['salary'].mean()

# Multiple aggregations (named)
summary = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    min_salary=('salary', 'min'),
    headcount=('id', 'count')
).reset_index()

# Pivot table
pivot = df.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0
)

5.4 Merge and join datasets

python

# Inner join (only matching rows in both)
merged = pd.merge(df1, df2, on='customer_id', how='inner')

# Left join (keep all rows from df1)
merged = pd.merge(df1, df2, on='id', how='left')

# Join on different column names
merged = pd.merge(df1, df2, left_on='cust_id', right_on='id')

# Stack rows (same columns)
combined = pd.concat([df_2023, df_2024], ignore_index=True)

5.5 Apply custom transformations

python

# Apply to a column
df['name_upper'] = df['name'].apply(lambda x: x.upper() if pd.notna(x) else x)

# Apply across columns (axis=1)
df['full_name'] = df.apply(lambda row: row['first'] + ' ' + row['last'], axis=1)

# Map values
status_map = {1: 'Active', 0: 'Inactive'}
df['status_label'] = df['status'].map(status_map)

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets