Chapter 06 — Features

Feature Engineering

Create new meaningful columns from existing data to improve your analysis and model performance.

6.0 Feature design rules
Feature type | Use when | Why use it | Skip when
Date parts | Seasonality/week patterns matter | Captures time behavior for trends and demand | Skip if timestamp has no relation to target
Binning | Need interpretable tiers | Makes insights easier for business users | Skip excessive binning that removes predictive detail
One-hot encoding | Nominal categories | Prepares non-ordered labels for ML | Skip for very high-cardinality columns unless reduced first
Label/ordinal encoding | True ordered categories | Preserves rank meaning | Skip for unordered categories (creates fake order)
Scaling/log transforms | Distance-based models or skewed features | Improves model stability and convergence | Skip scaling for tree-based models unless pipeline consistency is required
Keep a feature log: feature name, formula, why it helps, and leakage risk. If a feature uses future information, do not use it for prediction.
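A feature log can be as simple as a small table kept next to the code. The sketch below is one possible layout, not a required format: the column names, the example rows, and the `leakage_risk` labels are all hypothetical.

```python
import pandas as pd

# Hypothetical feature log: one row per engineered feature
feature_log = pd.DataFrame([
    {"feature": "is_weekend",
     "formula": "date.dt.dayofweek >= 5",
     "why": "Weekend demand may differ from weekday demand",
     "leakage_risk": "low"},
    {"feature": "days_since",
     "formula": "reference_date - date",
     "why": "Recency of the record",
     "leakage_risk": "medium"},  # value depends on the chosen reference date
    {"feature": "next_month_sales",
     "formula": "sales.shift(-1)",
     "why": "Strong signal, but uses future information",
     "leakage_risk": "high"},
])

# Keep only features that are safe to use at prediction time
safe_features = feature_log.loc[
    feature_log["leakage_risk"] != "high", "feature"
].tolist()
print(safe_features)
```

Filtering on the log makes the "no future information" rule something you can check in code rather than remember.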
6.1 Date and time features
python
import pandas as pd

# Make sure date column is datetime first
df['date'] = pd.to_datetime(df['date'])

# Extract date parts
df['year']     = df['date'].dt.year
df['month']    = df['date'].dt.month
df['day']      = df['date'].dt.day
df['weekday']  = df['date'].dt.day_name()
df['quarter']  = df['date'].dt.quarter
df['week_num'] = df['date'].dt.isocalendar().week

# Business logic features
df['is_weekend']  = df['date'].dt.dayofweek >= 5
df['is_month_end'] = df['date'].dt.is_month_end
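# days_since changes with the run date; use a fixed reference date for reproducible results
df['days_since']   = (pd.Timestamp.today().normalize() - df['date']).dt.days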
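Because `days_since` depends on the day the code runs, the same data can produce different feature values on different days. A minimal sketch of a reproducible variant, assuming a hypothetical fixed `reference_date` and three made-up dates:

```python
import pandas as pd

# Made-up sample dates (2024-01-06 is a Saturday)
df = pd.DataFrame({"date": pd.to_datetime(["2024-01-06", "2024-01-31", "2024-02-05"])})

# Hypothetical fixed reference date, e.g. the data snapshot date
reference_date = pd.Timestamp("2024-03-01")

df["is_weekend"] = df["date"].dt.dayofweek >= 5      # Monday=0 ... Sunday=6
df["is_month_end"] = df["date"].dt.is_month_end
df["days_since"] = (reference_date - df["date"]).dt.days

print(df)
# is_weekend   -> True, False, False
# is_month_end -> False, True, False
# days_since   -> 55, 30, 25
```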
6.2 Binning / bucketing continuous values
python
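# Custom bin edges (for true equal-width bins, pass an integer: bins=4)
# Intervals are right-closed: (0, 18], (18, 35], ...; include_lowest keeps age 0
df['age_group'] = pd.cut(df['age'],
    bins=[0, 18, 35, 60, 100],
    labels=['Minor', 'Young Adult', 'Middle Age', 'Senior'],
    include_lowest=True
)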

# Equal-frequency bins (quartiles)
df['salary_tier'] = pd.qcut(df['salary'], q=4,
    labels=['Low', 'Mid', 'High', 'Top']
)
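`pd.cut` treats bins as right-closed by default, so a value that sits exactly on an edge can land in a different bin than expected, and values outside the edges become missing. A small sketch with made-up ages shows how the `include_lowest` and `right` parameters change this:

```python
import pandas as pd

ages = pd.Series([0, 18, 19, 35, 60, 100, 101])
edges = [0, 18, 35, 60, 100]
names = ["Minor", "Young Adult", "Middle Age", "Senior"]

# Default: (0, 18], (18, 35], (35, 60], (60, 100]
default = pd.cut(ages, bins=edges, labels=names)

# include_lowest=True also captures the lowest edge (age 0)
with_lowest = pd.cut(ages, bins=edges, labels=names, include_lowest=True)

# right=False flips to left-closed: [0, 18), [18, 35), [35, 60), [60, 100)
left_closed = pd.cut(ages, bins=edges, labels=names, right=False)

print(pd.DataFrame({"age": ages, "default": default,
                    "with_lowest": with_lowest, "left_closed": left_closed}))
```

In this sketch, age 0 is missing under the default settings, age 18 moves from Minor to Young Adult when `right=False`, and age 101 is missing in all three variants because it lies beyond the last edge.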
6.3 Encode categorical variables
python
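# Label encoding: assigns integer codes in sorted (alphabetical) order.
# LabelEncoder is designed for target labels; its codes carry no real rank,
# so for ordered categories use an explicit mapping (see below) instead.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender_enc'] = le.fit_transform(df['gender'].fillna('Unknown'))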

# One-hot encoding (use for nominal/unordered categories)
df = pd.get_dummies(df, columns=['city', 'product_type'], drop_first=True)

# Manual ordinal mapping (preserves the real order; unmapped values become NaN)
size_order = {'Small': 1, 'Medium': 2, 'Large': 3, 'XL': 4}
df['size_num'] = df['size'].map(size_order)
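The design rules advise against one-hot encoding very high-cardinality columns unless they are reduced first. One common way to reduce them is to collapse rare categories into a single bucket before encoding. The sketch below assumes a made-up `city` column and a hypothetical `min_count` threshold; the right threshold depends on your data.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Pune", "Delhi", "Delhi", "Delhi", "Agra", "Kochi"]})

min_count = 2  # hypothetical threshold: categories seen fewer times become "Other"

counts = df["city"].value_counts()
rare = counts[counts < min_count].index

# Keep frequent categories, replace rare ones with a single "Other" label
df["city_reduced"] = df["city"].where(~df["city"].isin(rare), "Other")

encoded = pd.get_dummies(df, columns=["city_reduced"])
print(encoded.columns.tolist())
```

Here four distinct cities become three dummy columns. On a real dataset, compute the rare-category list on the training data only and reuse it when encoding new data.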
6.4 Scaling numeric features
python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

cols_to_scale = ['age', 'salary', 'experience']

# Log transform first, on the raw values (for heavily skewed data like income, prices)
df['salary_log'] = np.log1p(df['salary'])  # log1p handles zeros

# Pick ONE scaler: applying both to the same columns overwrites the first result.
# Option A: standardize (mean=0, std=1), common for distance-based and linear models
scaler = StandardScaler()
df_std = df.copy()
df_std[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Option B: normalize to 0-1 range
minmax = MinMaxScaler()
df_minmax = df.copy()
df_minmax[cols_to_scale] = minmax.fit_transform(df[cols_to_scale])
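Fitting a scaler on the full dataset lets statistics from the test rows influence the training features, which is a mild form of the leakage the design rules warn about. A minimal sketch of the usual remedy, assuming made-up data and a simple random split: fit on the training rows only, then reuse the fitted scaler on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration
df = pd.DataFrame({
    "age":        [22, 35, 47, 51, 29, 63, 38, 44],
    "salary":     [28000, 52000, 61000, 75000, 40000, 90000, 58000, 66000],
    "experience": [1, 10, 20, 25, 5, 35, 12, 18],
})
cols_to_scale = ["age", "salary", "experience"]

train, test = train_test_split(df, test_size=0.25, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[cols_to_scale])  # learn mean/std from train only
test_scaled = scaler.transform(test[cols_to_scale])        # reuse the train statistics

print(train_scaled.mean(axis=0).round(6))  # approximately 0 for each column
print(test_scaled.shape)
```

The test columns will generally not have mean 0 and standard deviation 1 after this step. That is expected, because they are scaled with the training statistics.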
Feature engineering is where domain knowledge matters most. Think about what combinations or transformations would be meaningful for your specific problem.
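As one illustration of a domain-driven combination, a ratio of two existing columns can express something neither column shows on its own. The sketch below assumes made-up `salary` and `experience` columns; `salary_per_year_exp` is a hypothetical feature name, and whether it is meaningful depends on your problem.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary":     [50000, 80000, 60000],
    "experience": [5, 0, 10],
})

# Ratio feature; zero experience is turned into NaN to avoid division by zero
df["salary_per_year_exp"] = df["salary"] / df["experience"].replace(0, np.nan)

print(df)
# salary_per_year_exp -> 10000.0, NaN, 6000.0
```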
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets