Chapter 07 — Statistics
Statistical Analysis
Use descriptive and inferential statistics to understand distributions, relationships, and test hypotheses.
7.0 Test selection matrix
| Question type | Recommended method | Why | Skip / alternative |
|---|---|---|---|
| Compare two numeric group means | T-test | Directly tests mean difference | If non-normal + small sample, use Mann-Whitney U |
| Compare 3+ numeric groups | ANOVA | Controls error better than many t-tests | If assumptions fail, use Kruskal-Wallis |
| Association of two categorical vars | Chi-square | Tests independence in contingency table | If expected counts are tiny, use Fisher exact (2x2) |
| Linear numeric relationship | Pearson correlation | Measures linear association strength | Use Spearman for monotonic non-linear / ranked data |
| Pre/post on same subjects | Paired t-test | Accounts for within-subject pairing | Use Wilcoxon signed-rank if non-normal |
Always check assumptions before claiming significance: independence, sample size, outliers, and distribution shape. Statistical significance is not business impact.
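The "Skip / alternative" column for the two-group row can be sketched as a small helper. This is a minimal illustration, not a definitive rule: the function name `compare_two_groups`, the `small_n=30` cutoff, and the use of Shapiro-Wilk as the normality check are all assumptions made for the example.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b, alpha=0.05, small_n=30):
    """Pick a two-sample test with a simple heuristic; return (method, p_value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]  # drop missing values

    small = min(len(a), len(b)) < small_n
    # Shapiro-Wilk: a low p-value suggests the sample is not normal.
    # Only checked for small samples (hypothetical cutoff: small_n).
    non_normal = small and any(stats.shapiro(g)[1] < alpha for g in (a, b))

    if non_normal:
        _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
        return "mann-whitney-u", p
    # Welch's t-test does not assume equal variances
    _, p = stats.ttest_ind(a, b, equal_var=False)
    return "welch-t-test", p


skewed = [1, 2, 2, 3, 3, 3, 4, 100]   # small sample with one extreme value
other = [5, 6, 6, 7, 7, 8, 8, 9]
method, p_value = compare_two_groups(skewed, other)
print(method, round(p_value, 4))
```

A helper like this only automates the first pass. Independence and outliers still need a manual look.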
7.1 Descriptive statistics
Use descriptive statistics to understand the shape of the data before drawing conclusions. Mean and standard deviation work well for roughly symmetric numeric data. Median and percentiles are better when the data is skewed, because extreme values barely move them. Mode is the natural choice for categories or frequently repeated values. If you are not sure which one to use, check the distribution first and decide based on whether the data is skewed or has outliers.
```python
# Assumes df is a pandas DataFrame with a numeric 'salary' column

# Central tendency
df['salary'].mean()
df['salary'].median()
df['salary'].mode()[0]   # mode() returns a Series; [0] takes the first mode

# Spread / dispersion
df['salary'].std()
df['salary'].var()
df['salary'].quantile([.10, .25, .5, .75, .90])
df['salary'].max() - df['salary'].min()  # range

# Shape of distribution
df['salary'].skew()      # >0 right-skewed, <0 left-skewed
df['salary'].kurtosis()  # excess kurtosis: >0 = heavier tails than a normal distribution
```
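The advice to choose between mean and median based on skew can be sketched as follows. The helper name `summarize` and the `skew_threshold=1.0` cutoff are assumptions for illustration (a common rule of thumb, not a fixed standard), and the salary figures are made-up sample values.

```python
import pandas as pd


def summarize(values, skew_threshold=1.0):
    """Report median/IQR for strongly skewed data, mean/std otherwise."""
    s = pd.Series(values, dtype=float).dropna()
    skew = s.skew()
    if abs(skew) > skew_threshold:
        iqr = s.quantile(0.75) - s.quantile(0.25)
        return {"center": "median", "value": s.median(), "spread": iqr, "skew": skew}
    return {"center": "mean", "value": s.mean(), "spread": s.std(), "skew": skew}


# Made-up salaries (in thousands) with one extreme value
salaries = [40, 42, 45, 47, 50, 52, 55, 400]
summary = summarize(salaries)
print(summary["center"], summary["value"])        # the median resists the outlier
print("mean for comparison:", pd.Series(salaries).mean())
```

Here the single large salary pulls the mean far above what most people earn, while the median stays near the middle of the bulk of the data.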
7.2 Correlation analysis
```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Pearson correlation matrix (linear relationships)
corr = df.select_dtypes(include='number').corr()

# Visualize as heatmap
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide upper triangle and diagonal
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, mask=mask, square=True)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

# Correlation with target variable only
corr['target_col'].sort_values(ascending=False)
```
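The test selection matrix suggests Spearman when a relationship is monotonic but not linear. A minimal sketch of the difference, using synthetic data generated for the example (the exponential curve is an assumption chosen to make the gap visible):

```python
import numpy as np
from scipy import stats

# Synthetic data: y always increases with x, but not along a straight line
x = np.arange(1, 21)
y = np.exp(x / 3)

pearson_r, pearson_p = stats.pearsonr(x, y)     # strength of linear association
spearman_r, spearman_p = stats.spearmanr(x, y)  # strength of monotonic association (rank-based)

print(f"Pearson r  = {pearson_r:.3f}")
print(f"Spearman r = {spearman_r:.3f}")
```

Spearman works on ranks, so it reports a perfect monotonic relationship here, while Pearson is lower because the points do not fall on a line. With a DataFrame, `df.corr(method='spearman')` gives the rank-based version of the matrix above.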
7.3 Hypothesis testing
Use tests when you need to compare groups or check whether a pattern is statistically reliable. T-tests compare two groups, ANOVA compares three or more groups, and chi-square tests are for categorical relationships. Do not use these tests just because they exist. Use them only when the question is about a difference, relationship, or association that you want to verify.
```python
import pandas as pd
from scipy import stats

# T-test: compare means of two groups
group_a = df[df['group'] == 'A']['score'].dropna()
group_b = df[df['group'] == 'B']['score'].dropna()
# Default assumes equal variances; pass equal_var=False for Welch's t-test
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"t={t_stat:.3f}, p={p_val:.4f}")
print("Significant!" if p_val < 0.05 else "Not significant")

# Chi-square test: categorical association
ct = pd.crosstab(df['gender'], df['purchased'])
chi2, p, dof, expected = stats.chi2_contingency(ct)
print(f"Chi2={chi2:.3f}, p={p:.4f}, dof={dof}")

# Normality test (null hypothesis: the data is normally distributed)
stat, p = stats.normaltest(df['salary'].dropna())
# p > 0.05 means normality is not rejected; it does not prove the data is normal
print("Consistent with normality?", "Yes" if p > 0.05 else "No")
```
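The chi-square approximation becomes unreliable when expected cell counts are very small, which is why the selection matrix points to Fisher's exact test for 2x2 tables. One way to sketch that fallback is shown below. The helper name `check_association` and the `min_expected=5` cutoff are assumptions for illustration (5 is a widely used rule of thumb).

```python
import numpy as np
from scipy import stats


def check_association(table, min_expected=5):
    """Chi-square test, falling back to Fisher's exact test for sparse 2x2 tables."""
    observed = np.asarray(table)
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    if observed.shape == (2, 2) and (expected < min_expected).any():
        _, p_exact = stats.fisher_exact(observed)
        return "fisher-exact", p_exact
    return "chi-square", p


sparse = [[3, 1], [1, 3]]       # expected count is 2 in every cell
dense = [[30, 20], [20, 30]]    # expected count is 25 in every cell
print(check_association(sparse))
print(check_association(dense))
```

The same function accepts a `pd.crosstab` result, since it is converted to a NumPy array first.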
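A p-value below 0.05 is conventionally called statistically significant at the 5% level (α = 0.05); below 0.01, at the 1% level. The p-value is the probability of seeing data at least this extreme if the null hypothesis were true. It is not the probability that the effect is real, and it says nothing about how large the effect is.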
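Because a small p-value does not say how large a difference is, it can help to report an effect size next to it. The sketch below computes Cohen's d with a pooled standard deviation. The function name `cohens_d` and the sample scores are assumptions made for the example, and the usual "small / medium / large" labels for d are conventions rather than rules.

```python
import numpy as np


def cohens_d(a, b):
    """Standardized mean difference between two independent groups (pooled SD)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


scores_a = [2, 4, 6, 8, 10]   # made-up scores, mean 6
scores_b = [1, 3, 5, 7, 9]    # made-up scores, mean 5
d = cohens_d(scores_a, scores_b)
print(f"Cohen's d = {d:.3f}")
```

Reporting d (or the raw mean difference in business units) alongside the p-value makes it easier to judge whether a statistically significant result matters in practice.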
7.4 ANOVA — compare 3+ groups
```python
from scipy import stats

# One-way ANOVA: build one array of salaries per department
groups = [df.loc[df['dept'] == d, 'salary'].dropna()
          for d in df['dept'].dropna().unique()]
f_stat, p_val = stats.f_oneway(*groups)
print(f"F={f_stat:.3f}, p={p_val:.4f}")
```
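The selection matrix recommends Kruskal-Wallis when ANOVA's assumptions fail. A rough sketch of that fallback follows, using a per-group Shapiro-Wilk check as the trigger. The helper name `compare_many_groups`, the choice of Shapiro-Wilk, and the sample values are assumptions for illustration. A fuller check would also look at equal variances and independence.

```python
import numpy as np
from scipy import stats


def compare_many_groups(groups, alpha=0.05):
    """One-way ANOVA, or Kruskal-Wallis if any group looks non-normal."""
    arrays = [np.asarray(g, dtype=float) for g in groups]
    # Shapiro-Wilk per group: a low p-value suggests non-normality
    non_normal = any(stats.shapiro(g)[1] < alpha for g in arrays)
    if non_normal:
        _, p = stats.kruskal(*arrays)
        return "kruskal-wallis", p
    _, p = stats.f_oneway(*arrays)
    return "anova", p


# Made-up samples: evenly spread groups with different means
clean = [range(10, 18), range(12, 20), range(20, 28)]
# Made-up samples: the first group has one extreme value
messy = [[1, 2, 2, 3, 3, 3, 4, 100], [5, 6, 6, 7, 7, 8, 8, 9], [2, 3, 3, 4, 4, 5, 5, 6]]

print(compare_many_groups(clean))
print(compare_many_groups(messy))
```

Either test only says that at least one group differs. Finding which groups differ requires a follow-up (post-hoc) comparison.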
Common mistakes to avoid
- Skipping business context before running technical steps
- Not writing assumptions and limitations explicitly
- Treating one metric as the full story
Quick cheatsheet
- df.info() -> Structure and non-null counts
- df.describe() -> Numeric summary statistics
- df.isnull().sum() -> Missing-value counts by column
- df.groupby() -> Segmented aggregation
- pd.merge() -> Join multiple datasets