Chapter 06 — Features

Feature Engineering

Create new meaningful columns from existing data to improve your analysis and model performance.

6.0 Feature design rules

Feature type	Use when	Why use it	Skip when
Date parts	Seasonality or weekly patterns matter	Captures time behavior for trends and demand	Timestamp has no relation to the target
Binning	You need interpretable tiers	Makes insights easier for business users	Bins would discard predictive detail
One-hot encoding	Nominal categories	Prepares non-ordered labels for ML	Very high-cardinality columns, unless reduced first
Label/ordinal encoding	Truly ordered categories	Preserves rank meaning	Unordered categories (creates a fake order)
Scaling/log transforms	Distance-based models or skewed features	Improves model stability and convergence	Scaling for tree-based models, unless pipeline consistency is required

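Keep a feature log that records each feature's name, formula, why it helps, and leakage risk. If a feature uses information that would not be available at prediction time (for example, values recorded after the outcome), do not use it for prediction.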
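
One lightweight way to keep such a log is a small list of records stored next to the code. The sketch below is only an illustration: the class name FeatureLogEntry, its field names, the risk levels, and the two example entries are all assumptions, not a standard schema.

```python
from dataclasses import dataclass, asdict

import pandas as pd


@dataclass
class FeatureLogEntry:
    # Field names are illustrative, not a standard schema
    name: str
    formula: str
    why_it_helps: str
    leakage_risk: str  # hypothetical levels: 'none', 'low', 'high'


feature_log = [
    FeatureLogEntry(
        name='is_weekend',
        formula="df['date'].dt.dayofweek >= 5",
        why_it_helps='Separates weekend from weekday behavior',
        leakage_risk='none',
    ),
    FeatureLogEntry(
        name='days_since',
        formula="(reference_date - df['date']).dt.days",
        why_it_helps='Captures recency',
        leakage_risk='low',
    ),
]

# Turn the log into a table for review or export
log_df = pd.DataFrame([asdict(entry) for entry in feature_log])

# Keep only features that are not flagged as high leakage risk
usable = log_df.loc[log_df['leakage_risk'] != 'high', 'name'].tolist()
```

Because the log is an ordinary DataFrame, it can be saved with the dataset or filtered before model training.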

DataXForge tools for prepping columns without code: Convert Data Types · Date Format Normalizer · Data Type Detector.

6.1 Date and time features

python

import pandas as pd

# Make sure the date column is datetime first
df['date'] = pd.to_datetime(df['date'])

# Extract date parts
df['year']     = df['date'].dt.year
df['month']    = df['date'].dt.month
df['day']      = df['date'].dt.day
df['weekday']  = df['date'].dt.day_name()           # e.g. 'Monday'
df['quarter']  = df['date'].dt.quarter
df['week_num'] = df['date'].dt.isocalendar().week   # ISO week number (1-53)

# Business logic features
df['is_weekend']   = df['date'].dt.dayofweek >= 5   # Monday=0, so 5 and 6 are Sat/Sun
df['is_month_end'] = df['date'].dt.is_month_end
# Depends on the run date; use a fixed reference date if results must be reproducible
df['days_since']   = (pd.Timestamp.today() - df['date']).dt.days
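
As a quick check of how these features behave, here is a minimal sketch on three made-up dates. The sample values and the name reference_date are assumptions for illustration. It uses a fixed reference date instead of today's date so that days_since gives the same result on every run.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'date': ['2024-01-31', '2024-02-03', '2024-02-05']})
df['date'] = pd.to_datetime(df['date'])

df['weekday'] = df['date'].dt.day_name()
df['is_weekend'] = df['date'].dt.dayofweek >= 5
df['is_month_end'] = df['date'].dt.is_month_end

# A fixed reference date keeps days_since stable across runs
reference_date = pd.Timestamp('2024-03-01')
df['days_since'] = (reference_date - df['date']).dt.days
```

Only the middle row (a Saturday) is flagged as a weekend, and only the first row is a month end.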

6.2 Binning / bucketing continuous values

python

# Custom bin edges (intervals are right-inclusive: (0, 18], (18, 35], ...)
# Values outside the edges, including age 0, become NaN
df['age_group'] = pd.cut(df['age'],
    bins=[0, 18, 35, 60, 100],
    labels=['Minor', 'Young Adult', 'Middle Age', 'Senior']
)

# Equal-frequency bins (quartiles)
df['salary_tier'] = pd.qcut(df['salary'], q=4,
    labels=['Low', 'Mid', 'High', 'Top']
)
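
The edge behavior of pd.cut is easy to miss, so the following sketch runs both functions on small made-up series. The sample values are assumptions. It shows that values on or below the lowest edge and above the highest edge become NaN by default, that include_lowest=True brings the lowest edge into the first bin, and that pd.qcut splits by rank rather than by value range.

```python
import pandas as pd

labels = ['Minor', 'Young Adult', 'Middle Age', 'Senior']
edges = [0, 18, 35, 60, 100]

# Hypothetical ages, chosen to sit on and outside the bin edges
ages = pd.Series([0, 18, 19, 35, 60, 100, 120])

# Default: right-inclusive intervals, lowest edge excluded
default_bins = pd.cut(ages, bins=edges, labels=labels)

# include_lowest=True makes the first interval include its left edge (0)
with_lowest = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)

# qcut puts roughly equal numbers of rows in each bin, even with an outlier
salaries = pd.Series([30, 40, 50, 60, 70, 80, 90, 1000])
tiers = pd.qcut(salaries, q=4, labels=['Low', 'Mid', 'High', 'Top'])
```

Checking default_bins.isna() after binning is a cheap way to catch values that fell outside the edges.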

6.3 Encode categorical variables

python

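# Label encoding: integer codes are assigned in sorted (alphabetical) order, not by rank.
# scikit-learn intends LabelEncoder for target labels; on a nominal feature such as
# gender the codes imply an order that does not exist.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender_enc'] = le.fit_transform(df['gender'].fillna('Unknown'))

# One-hot encoding (use for nominal/unordered categories)
# drop_first=True drops the first level of each column to avoid a redundant column
df = pd.get_dummies(df, columns=['city', 'product_type'], drop_first=True)

# Manual ordinal mapping (use for ordered categories)
size_order = {'Small': 1, 'Medium': 2, 'Large': 3, 'XL': 4}
df['size_num'] = df['size'].map(size_order)  # values missing from the mapping become NaN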
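
Two details of the snippets above are worth verifying on real data: .map() silently returns NaN for any value missing from the mapping, and drop_first=True removes the alphabetically first level. The sketch below demonstrates both on a made-up frame. The sample values, including the unmapped size 'Huge', are assumptions.

```python
import pandas as pd

# Hypothetical sample data; 'Huge' is deliberately absent from the mapping
df = pd.DataFrame({
    'size': ['Small', 'XL', 'Medium', 'Huge'],
    'city': ['Paris', 'Lima', 'Paris', 'Oslo'],
})

size_order = {'Small': 1, 'Medium': 2, 'Large': 3, 'XL': 4}
df['size_num'] = df['size'].map(size_order)

# Values missing from the mapping become NaN, so list them explicitly
unmapped = sorted(df.loc[df['size_num'].isna(), 'size'].unique())

# One-hot encode a nominal column; drop_first removes the first level ('Lima')
encoded = pd.get_dummies(df, columns=['city'], drop_first=True)
```

If unmapped is not empty, extend the mapping or decide how to treat those rows before modeling.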

6.4 Scaling numeric features

python

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

cols_to_scale = ['age', 'salary', 'experience']

# Log transform (for heavily skewed data like income, prices)
# Apply it to the raw values, before scaling overwrites them
df['salary_log'] = np.log1p(df['salary'])  # log1p handles zeros

# Choose ONE scaler per column: running both in sequence keeps only the second result

# Option A: standardize (mean=0, std=1), a common choice for distance-based models
scaler = StandardScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Option B: normalize to the 0-1 range
# minmax = MinMaxScaler()
# df[cols_to_scale] = minmax.fit_transform(df[cols_to_scale])
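
When the scaled features feed a predictive model, fitting the scaler on the full dataset lets test-set statistics leak into training. A minimal sketch of the usual remedy follows, assuming a simple random split. The synthetic data, the column choice, and the split ratio are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age': rng.integers(18, 65, size=100).astype(float),
    'salary': rng.lognormal(mean=10, sigma=1, size=100),
})
cols_to_scale = ['age', 'salary']

train, test = train_test_split(df, test_size=0.2, random_state=0)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn the mean and std from the training rows only
train[cols_to_scale] = scaler.fit_transform(train[cols_to_scale])
# Reuse the training statistics on the test rows; do not refit
test[cols_to_scale] = scaler.transform(test[cols_to_scale])
```

The test columns will not have exactly mean 0 and std 1, which is expected: they are scaled with the training statistics.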

Feature engineering is where domain knowledge matters most. Think about what combinations or transformations would be meaningful for your specific problem.

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets
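
The calls in the cheatsheet can be tried together on a toy example. The two small tables below, orders and customers, are made up for illustration.

```python
import pandas as pd

# Hypothetical sample tables
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 3],
    'amount': [20.0, None, 15.0, 40.0],
})
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'city': ['Paris', 'Lima', 'Paris'],
})

orders.info()                      # prints structure and non-null counts
summary = orders.describe()        # numeric summary statistics
missing = orders.isnull().sum()    # missing-value counts by column

# Join the two tables, then aggregate by segment
merged = pd.merge(orders, customers, on='customer_id', how='left')
by_city = merged.groupby('city')['amount'].sum()  # sum() skips missing values
```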