Chapter 06 — Features
Feature Engineering
Create meaningful new columns from existing data to improve your analysis and model performance.
6.0 Feature design rules
| Feature type | Use when | Why use it | Skip when |
|---|---|---|---|
| Date parts | Seasonality/week patterns matter | Captures time behavior for trends and demand | Skip if timestamp has no relation to target |
| Binning | Need interpretable tiers | Makes insights easier for business users | Skip excessive binning that removes predictive detail |
| One-hot encoding | Nominal categories | Prepares non-ordered labels for ML | Skip for very high-cardinality columns unless reduced first |
| Label/ordinal encoding | True ordered categories | Preserves rank meaning | Skip for unordered categories (creates fake order) |
| Scaling/log transforms | Distance-based models or skewed features | Improves model stability and convergence | Skip scaling for tree-based models unless pipeline consistency is required |
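Keep a feature log that records each feature's name, formula, why it helps, and leakage risk. If a feature relies on information that would not be available at prediction time (future values, or anything derived from the target), do not use it for prediction.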
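The leakage rule is easiest to see with a rolling average. The following is a minimal sketch, assuming a hypothetical daily table with `date` and `sales` columns and a model that predicts each day's `sales`; the column names, the values, and the log entry wording are illustrative and not taken from this chapter's data.

```python
import pandas as pd

# Hypothetical daily sales data, for illustration only
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=6, freq='D'),
    'sales': [10, 12, 9, 15, 11, 14],
})
sales = sales.sort_values('date').reset_index(drop=True)

# Leaky: the window ends on the current row, so it includes the value being predicted
sales['avg_3d_leaky'] = sales['sales'].rolling(window=3).mean()

# Safer: shift by one row first, so the window only covers earlier days
sales['avg_3d_prior'] = sales['sales'].shift(1).rolling(window=3).mean()

# One feature-log entry with the four fields: name, formula, why it helps, leakage risk
feature_log = [{
    'feature': 'avg_3d_prior',
    'formula': 'sales.shift(1).rolling(3).mean()',
    'why': 'captures the recent demand level',
    'leakage_risk': 'low: uses only rows before the prediction date',
}]
```

The shifted version leaves the first three rows empty because no full three-day history exists for them yet; whether to drop or impute those rows depends on the problem.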
6.1 Date and time features
```python
import pandas as pd

# Make sure the date column is datetime first
df['date'] = pd.to_datetime(df['date'])

# Extract date parts
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['week_num'] = df['date'].dt.isocalendar().week  # ISO week number (1-53)

# Business logic features
df['is_weekend'] = df['date'].dt.dayofweek >= 5  # Monday=0, so 5 and 6 are Sat/Sun
df['is_month_end'] = df['date'].dt.is_month_end
df['days_since'] = (pd.Timestamp.today() - df['date']).dt.days  # changes with the run date
```
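Because `pd.Timestamp.today()` gives a different answer on every run, a "days since" feature can be made reproducible by measuring from a fixed reference date. The sketch below assumes a hypothetical `orders` table and an arbitrary reference date of 2024-03-01; it also uses a numeric weekday, which is usually easier to feed to a model than day names.

```python
import pandas as pd

# Hypothetical order dates, for illustration only
orders = pd.DataFrame({'date': ['2024-01-31', '2024-02-03', '2024-02-04']})
orders['date'] = pd.to_datetime(orders['date'])

# Fixed reference date (an assumption): results no longer depend on the run date
reference_date = pd.Timestamp('2024-03-01')
orders['days_since'] = (reference_date - orders['date']).dt.days

# Numeric weekday: Monday=0 ... Sunday=6
orders['weekday_num'] = orders['date'].dt.dayofweek
orders['is_weekend'] = orders['weekday_num'] >= 5
orders['is_month_end'] = orders['date'].dt.is_month_end
```

A natural choice for the reference date is the snapshot or extraction date of the dataset, so the feature means the same thing whenever the notebook is rerun.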
6.2 Binning / bucketing continuous values
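```python
import pandas as pd

# Custom bin edges (intervals are right-closed: (0, 18], (18, 35], ...)
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 60, 100],
    labels=['Minor', 'Young Adult', 'Middle Age', 'Senior'],
    include_lowest=True,  # keep age 0 in the first bin instead of NaN
)

# Equal-frequency bins (quartiles)
df['salary_tier'] = pd.qcut(
    df['salary'],
    q=4,
    labels=['Low', 'Mid', 'High', 'Top'],
)
```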
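Bin edges are a common source of silent missing values, so it helps to check how boundary values are assigned. This is a small sketch on made-up ages and salaries (the `people` and `salaries` names are hypothetical) showing how `pd.cut` treats the lowest edge and how `pd.qcut` splits rows evenly.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin boundaries
people = pd.DataFrame({'age': [0, 18, 19, 35, 60, 61, 100]})

edges = [0, 18, 35, 60, 100]
labels = ['Minor', 'Young Adult', 'Middle Age', 'Senior']

# Default: intervals are (0, 18], (18, 35], ... so age 0 falls outside and becomes NaN
default_bins = pd.cut(people['age'], bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 18]
inclusive_bins = pd.cut(people['age'], bins=edges, labels=labels, include_lowest=True)

# qcut picks edges from the data so each bin holds roughly the same number of rows
salaries = pd.Series([30, 40, 50, 60, 70, 80, 90, 100])
tiers = pd.qcut(salaries, q=4, labels=['Low', 'Mid', 'High', 'Top'])
```

Values above the last edge (an age of 101 here) would also become NaN, so the outer edges should cover the full range of the data.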
6.3 Encode categorical variables
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Label encoding: assigns integer codes in sorted (alphabetical) order.
# The codes carry no rank meaning, and scikit-learn intends LabelEncoder for target labels.
le = LabelEncoder()
df['gender_enc'] = le.fit_transform(df['gender'].fillna('Unknown'))

# One-hot encoding (use for nominal/unordered categories)
df = pd.get_dummies(df, columns=['city', 'product_type'], drop_first=True)

# Manual ordinal mapping (use for ordered categories)
size_order = {'Small': 1, 'Medium': 2, 'Large': 3, 'XL': 4}
df['size_num'] = df['size'].map(size_order)  # values missing from the dict become NaN
```
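The design table advises reducing very high-cardinality columns before one-hot encoding. One possible way to do that, sketched here under the assumption of a hypothetical `customers` table and an arbitrary `min_count` threshold, is to fold rare categories into a single `Other` bucket first.

```python
import pandas as pd

# Hypothetical city column with a few frequent and several rare values
customers = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Paris', 'Lyon', 'Lyon', 'Nice', 'Lille', 'Metz'],
})

# Keep categories seen at least min_count times (the threshold is an assumption to tune)
min_count = 2
counts = customers['city'].value_counts()
frequent = counts[counts >= min_count].index

# Replace every other value with a single 'Other' bucket
customers['city_reduced'] = customers['city'].where(
    customers['city'].isin(frequent), 'Other'
)

# One-hot encode the reduced column: 3 indicator columns instead of 5
encoded = pd.get_dummies(customers, columns=['city_reduced'])
```

When the encoded column feeds a model, the list of frequent categories should be derived from the training rows only and then reused on new data.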
6.4 Scaling numeric features
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

cols_to_scale = ['age', 'salary', 'experience']

# Log transform (for heavily skewed data like income, prices).
# Apply it to the raw values, before any scaling overwrites them.
df['salary_log'] = np.log1p(df['salary'])  # log1p handles zeros, but not values <= -1

# Option A: standardize (mean=0, std=1); suits distance-based and linear models
scaler = StandardScaler()
df[[c + '_std' for c in cols_to_scale]] = scaler.fit_transform(df[cols_to_scale])

# Option B: normalize to the 0-1 range
minmax = MinMaxScaler()
df[[c + '_minmax' for c in cols_to_scale]] = minmax.fit_transform(df[cols_to_scale])
```
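Scaling statistics are themselves learned from data, so fitting a scaler on all rows lets information from the test set leak into training. The sketch below, which uses randomly generated data and a hypothetical 80/20 split, fits the scaler on the training rows and only applies it to the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data, generated for illustration only
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'age': rng.integers(18, 65, size=100),
    'salary': rng.lognormal(mean=10, sigma=0.5, size=100),
})

train, test = train_test_split(data, test_size=0.2, random_state=42)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)

# Reuse the training statistics on the test rows; do not call fit again
test_scaled = scaler.transform(test)
```

The test columns will not have a mean of exactly 0 or a standard deviation of exactly 1 after this step, which is expected: they are expressed in the training set's units.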
Feature engineering is where domain knowledge matters most. Think about what combinations or transformations would be meaningful for your specific problem.
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
Quick cheatsheet
- df.info() -> Structure and non-null counts
- df.describe() -> Numeric summary statistics
- df.isnull().sum() -> Missing-value counts by column
- df.groupby() -> Segmented aggregation
- pd.merge() -> Join multiple datasets