Chapter 05 — Transform

Data Transformation

Reshape, aggregate, merge, and manipulate your data into the form needed for analysis.

5.0 Transformation strategy (when / why / skip)
MethodUse whenWhy use itSkip when
Filter rowsYou need scope-specific analysisRemoves irrelevant records earlySkip permanent filtering until you save a full backup slice
GroupBy aggregationYou need KPIs by segmentConverts row-level noise into decision-level metricsSkip when event-level timing is critical
Pivot tableMatrix-style summaries for reportingFast comparison across two dimensionsSkip for high-cardinality columns with huge sparse matrices
Merge / joinMetrics are split across sourcesBuilds a complete analytical viewSkip before checking key uniqueness and join cardinality
Custom apply/mapBusiness logic is not built-inEncodes domain rules explicitlySkip row-wise apply on very large data when vectorized options exist
Before any merge, check one-to-one vs one-to-many key behavior. Wrong join cardinality can silently duplicate rows and break all downstream metrics.
DataXForgeReshape data visually: Filter Rows · Sort Data · Merge Datasets · Rename Columns · ETL Studio for full pipelines.
5.1 Select and filter
python
# Select specific columns
df[['name', 'age', 'salary']]

# Filter rows by condition
df[df['age'] > 30]
df[(df['age'] > 30) & (df['salary'] > 50000)]
df[df['city'].isin(['Dhaka', 'Chittagong'])]
df[~df['status'].isin(['Inactive'])]   # NOT in list

# Query syntax (cleaner for complex filters)
df.query('age > 30 and salary > 50000 and city == "Dhaka"')
5.2 Sort and rank
python
# Sort by one column
df.sort_values('salary', ascending=False)

# Sort by multiple columns
df.sort_values(['department', 'salary'], ascending=[True, False])

# Add rank column
df['rank'] = df['salary'].rank(ascending=False).astype('int')
df['pct_rank'] = df['salary'].rank(pct=True)
5.3 Group and aggregate
python
# Simple groupby
df.groupby('department')['salary'].mean()

# Multiple aggregations (named)
summary = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    min_salary=('salary', 'min'),
    headcount=('id', 'count')
).reset_index()

# Pivot table
pivot = df.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
5.4 Merge and join datasets
python
# Inner join (only matching rows in both)
merged = pd.merge(df1, df2, on='customer_id', how='inner')

# Left join (keep all rows from df1)
merged = pd.merge(df1, df2, on='id', how='left')

# Join on different column names
merged = pd.merge(df1, df2, left_on='cust_id', right_on='id')

# Stack rows (same columns)
combined = pd.concat([df_2023, df_2024], ignore_index=True)
5.5 Apply custom transformations
python
# Apply to a column
df['name_upper'] = df['name'].apply(lambda x: x.upper() if pd.notna(x) else x)

# Apply across columns (axis=1)
df['full_name'] = df.apply(lambda row: row['first'] + ' ' + row['last'], axis=1)

# Map values
status_map = {1: 'Active', 0: 'Inactive'}
df['status_label'] = df['status'].map(status_map)
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets