Chapter 07 — Statistics

Statistical Analysis

Use descriptive and inferential statistics to understand distributions and relationships, and to test hypotheses.

7.0 Test selection matrix
| Question type | Recommended method | Why | Skip / alternative |
|---|---|---|---|
| Compare two numeric group means | T-test | Directly tests mean difference | If non-normal + small sample, use Mann-Whitney U |
| Compare 3+ numeric groups | ANOVA | Controls error better than many t-tests | If assumptions fail, use Kruskal-Wallis |
| Association of two categorical vars | Chi-square | Tests independence in contingency table | If expected counts are tiny, use Fisher exact (2x2) |
| Linear numeric relationship | Pearson correlation | Measures linear association strength | Use Spearman for monotonic non-linear / ranked data |
| Pre/post on same subjects | Paired t-test | Accounts for within-subject pairing | Use Wilcoxon signed-rank if non-normal |

Always check assumptions before claiming significance: independence, sample size, outliers, and distribution shape. Statistical significance is not business impact.
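
As a rough illustration of the first row of the matrix, the sketch below checks normality before choosing between a t-test and Mann-Whitney U. The helper name `compare_two_groups`, the use of Shapiro-Wilk as the normality check, and the simulated data are all assumptions made for this example. A formal normality test is only one input, so treat it as a starting point rather than a rule.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: pick a two-sample test from a normality check."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shapiro-Wilk: a small p-value is evidence against normality
    looks_normal = all(stats.shapiro(g).pvalue > alpha for g in (a, b))
    if looks_normal:
        # Welch's t-test does not assume equal variances
        result = stats.ttest_ind(a, b, equal_var=False)
        name = "Welch t-test"
    else:
        # Rank-based alternative for non-normal data
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        name = "Mann-Whitney U"
    return name, float(result.statistic), float(result.pvalue)

# Simulated data, for illustration only
rng = np.random.default_rng(0)
samples = {
    "roughly normal": (rng.normal(50, 5, 40), rng.normal(55, 5, 40)),
    "right-skewed": (rng.exponential(1.0, 40), rng.exponential(2.0, 40)),
}
for label, (a, b) in samples.items():
    name, stat, p = compare_two_groups(a, b)
    print(f"{label}: {name}, statistic={stat:.3f}, p={p:.4f}")
```

The same pattern could be extended to the other rows, for example falling back from ANOVA to Kruskal-Wallis.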
7.1 Descriptive statistics
Use descriptive statistics to understand the shape of the data before drawing conclusions. Mean and standard deviation summarize roughly symmetric numeric data well. Median and percentiles are more robust when the data is skewed or contains outliers. Mode suits categorical data or values that repeat often. If you are not sure which one to use, check the distribution first (histogram, skewness) and decide based on how skewed it is and whether it has outliers.
python
# Central tendency
df['salary'].mean()
df['salary'].median()
df['salary'].mode()[0]

# Spread / dispersion
df['salary'].std()
df['salary'].var()
df['salary'].quantile([.10, .25, .5, .75, .90])
df['salary'].max() - df['salary'].min()   # range

# Shape of distribution
df['salary'].skew()     # >0 right-skewed, <0 left-skewed
df['salary'].kurtosis() # excess kurtosis: normal = 0, >0 = heavier tails than normal
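
To see why the median is preferred for skewed data, here is a small sketch with made-up salary figures (not taken from any real dataset). A single extreme value moves the mean a long way but barely shifts the median.

```python
import pandas as pd

# Made-up salaries for illustration
salaries = pd.Series([40_000, 42_000, 45_000, 47_000, 50_000])
# Same data plus one extreme value
with_outlier = pd.concat([salaries, pd.Series([1_000_000])], ignore_index=True)

print(f"mean={salaries.mean():,.0f}  median={salaries.median():,.0f}")
# mean=44,800  median=45,000
print(f"mean={with_outlier.mean():,.0f}  median={with_outlier.median():,.0f}")
# mean=204,000  median=46,000
```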
7.2 Correlation analysis
python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Pearson correlation matrix (linear relationships)
corr = df.select_dtypes(include='number').corr()

# Visualize as heatmap
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide upper triangle and diagonal
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, mask=mask, square=True)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

# Correlation with target variable only ('target_col' must be a numeric column)
corr['target_col'].drop('target_col').sort_values(ascending=False)
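
Pearson only captures linear association, so a perfectly monotonic but curved relationship scores below 1. The sketch below uses a synthetic cubic relationship (an assumption for illustration) to compare Pearson with Spearman through the `method` argument of `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

# Synthetic data: y grows monotonically with x, but not linearly
x = np.arange(1, 21)
demo = pd.DataFrame({"x": x, "y": x ** 3})

pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]  # rank-based
print(f"Pearson={pearson:.3f}, Spearman={spearman:.3f}")
```

Spearman works on ranks, so it reaches 1 for any strictly increasing relationship, while Pearson stays below 1 here.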
7.3 Hypothesis testing
Use tests when you need to compare groups or check whether a pattern is statistically reliable. T-tests compare two groups, ANOVA compares three or more groups, and chi-square tests are for categorical relationships. Do not use these tests just because they exist. Use them only when the question is about a difference, relationship, or association that you want to verify.
python
import pandas as pd
from scipy import stats

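# T-test: compare means of two groups (drop NaNs, otherwise the result is NaN)
group_a = df[df['group'] == 'A']['score'].dropna()
group_b = df[df['group'] == 'B']['score'].dropna()
# equal_var=False runs Welch's t-test, which does not assume equal variances
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t={t_stat:.3f}, p={p_val:.4f}")
print("Significant at alpha=0.05" if p_val < 0.05 else "Not significant at alpha=0.05")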

# Chi-square test: categorical association
ct = pd.crosstab(df['gender'], df['purchased'])
chi2, p, dof, expected = stats.chi2_contingency(ct)
print(f"Chi2={chi2:.3f}, p={p:.4f}, dof={dof}")

# Normality test (D'Agostino-Pearson): a small p-value is evidence against normality
stat, p = stats.normaltest(df['salary'].dropna())
print("Evidence against normality" if p < 0.05 else "No evidence against normality")
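A p-value below 0.05 is conventionally called statistically significant at the 5% level (alpha = 0.05); below 0.01, at the 1% level. The p-value is the probability of seeing data at least this extreme if the null hypothesis were true. It is not the probability that the null hypothesis is true, and 1 - p is not a confidence level for the finding.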
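
The block above covers independent groups. For pre/post measurements on the same subjects, the test selection matrix points to a paired t-test, with Wilcoxon signed-rank as the fallback. A minimal sketch, assuming eight subjects with made-up before/after values:

```python
import numpy as np
from scipy import stats

# Made-up measurements for the same eight subjects, in the same order
before = np.array([72.0, 68.5, 80.1, 75.3, 69.9, 77.4, 71.2, 74.8])
after = np.array([70.1, 66.0, 78.9, 72.5, 69.0, 74.1, 70.5, 71.6])

# Paired t-test: tests whether the mean within-subject difference is zero
t_stat, p_paired = stats.ttest_rel(before, after)
print(f"paired t={t_stat:.3f}, p={p_paired:.4f}")

# Wilcoxon signed-rank: rank-based alternative when differences look non-normal
w_stat, p_wilcoxon = stats.wilcoxon(before, after)
print(f"Wilcoxon W={w_stat:.1f}, p={p_wilcoxon:.4f}")
```

Both calls require the two arrays to have the same length and the same subject order.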
7.4 ANOVA — compare 3+ groups
python
from scipy import stats

# One-way ANOVA: one salary array per department (NaNs dropped)
groups = [df[df['dept'] == d]['salary'].dropna() for d in df['dept'].dropna().unique()]
f_stat, p_val = stats.f_oneway(*groups)
print(f"F={f_stat:.3f}, p={p_val:.4f}")
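
When ANOVA's assumptions look doubtful, the test selection matrix suggests Kruskal-Wallis. The sketch below uses simulated, right-skewed salaries (the `toy` frame and its department names are assumptions for illustration). It runs Levene's test as one possible equal-variance check, then the rank-based Kruskal-Wallis test.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated right-skewed salaries for three hypothetical departments
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "dept": np.repeat(["eng", "ops", "sales"], 30),
    "salary": np.concatenate([
        rng.lognormal(11.0, 0.4, 30),
        rng.lognormal(11.2, 0.4, 30),
        rng.lognormal(11.6, 0.4, 30),
    ]),
})
groups = [g["salary"].to_numpy() for _, g in toy.groupby("dept")]

# Levene's test: a small p-value suggests the group variances differ
lev_stat, lev_p = stats.levene(*groups)
print(f"Levene W={lev_stat:.3f}, p={lev_p:.4f}")

# Kruskal-Wallis: rank-based alternative to one-way ANOVA
h_stat, kw_p = stats.kruskal(*groups)
print(f"Kruskal-Wallis H={h_stat:.3f}, p={kw_p:.4f}")
```

Like ANOVA, Kruskal-Wallis only indicates that at least one group differs, so pairwise follow-up comparisons would still be needed to find which one.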
Common mistakes to avoid
Quick cheatsheet
df.info() -> Structure and non-null counts
df.describe() -> Numeric summary statistics
df.isnull().sum() -> Missing-value counts by column
df.groupby() -> Segmented aggregation
pd.merge() -> Join multiple datasets