Chapter 05 — Transform
Data Transformation
Reshape, aggregate, merge, and manipulate your data into the form needed for analysis.
5.0 Transformation strategy (when / why / skip)
| Method | Use when | Why use it | Skip when |
|---|---|---|---|
| Filter rows | You need scope-specific analysis | Removes irrelevant records early | Skip permanent filtering until you save a full backup slice |
| GroupBy aggregation | You need KPIs by segment | Converts row-level noise into decision-level metrics | Skip when event-level timing is critical |
| Pivot table | Matrix-style summaries for reporting | Fast comparison across two dimensions | Skip for high-cardinality columns with huge sparse matrices |
| Merge / join | Metrics are split across sources | Builds a complete analytical view | Skip before checking key uniqueness and join cardinality |
| Custom apply/map | Business logic is not built-in | Encodes domain rules explicitly | Skip row-wise apply on very large data when vectorized options exist |
Before any merge, check one-to-one vs one-to-many key behavior. Wrong join cardinality can silently duplicate rows and break all downstream metrics.
DataXForgeReshape data visually: Filter Rows · Sort Data · Merge Datasets · Rename Columns · ETL Studio for full pipelines.
5.1 Select and filter
python
# Select specific columns df[['name', 'age', 'salary']] # Filter rows by condition df[df['age'] > 30] df[(df['age'] > 30) & (df['salary'] > 50000)] df[df['city'].isin(['Dhaka', 'Chittagong'])] df[~df['status'].isin(['Inactive'])] # NOT in list # Query syntax (cleaner for complex filters) df.query('age > 30 and salary > 50000 and city == "Dhaka"')
5.2 Sort and rank
python
# Sort by one column df.sort_values('salary', ascending=False) # Sort by multiple columns df.sort_values(['department', 'salary'], ascending=[True, False]) # Add rank column df['rank'] = df['salary'].rank(ascending=False).astype('int') df['pct_rank'] = df['salary'].rank(pct=True)
5.3 Group and aggregate
python
# Simple groupby df.groupby('department')['salary'].mean() # Multiple aggregations (named) summary = df.groupby('department').agg( avg_salary=('salary', 'mean'), max_salary=('salary', 'max'), min_salary=('salary', 'min'), headcount=('id', 'count') ).reset_index() # Pivot table pivot = df.pivot_table( values='sales', index='region', columns='product', aggfunc='sum', fill_value=0 )
5.4 Merge and join datasets
python
# Inner join (only matching rows in both) merged = pd.merge(df1, df2, on='customer_id', how='inner') # Left join (keep all rows from df1) merged = pd.merge(df1, df2, on='id', how='left') # Join on different column names merged = pd.merge(df1, df2, left_on='cust_id', right_on='id') # Stack rows (same columns) combined = pd.concat([df_2023, df_2024], ignore_index=True)
5.5 Apply custom transformations
python
# Apply to a column df['name_upper'] = df['name'].apply(lambda x: x.upper() if pd.notna(x) else x) # Apply across columns (axis=1) df['full_name'] = df.apply(lambda row: row['first'] + ' ' + row['last'], axis=1) # Map values status_map = {1: 'Active', 0: 'Inactive'} df['status_label'] = df['status'].map(status_map)
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
Quick cheatsheet
df.info() -> Structure and non-null countsdf.describe() -> Numeric summary statisticsdf.isnull().sum() -> Missing-value counts by columndf.groupby() -> Segmented aggregationpd.merge() -> Join multiple datasets