Chapter 07 — Statistics

Statistical Analysis

Use descriptive and inferential statistics to understand distributions, relationships, and test hypotheses.

7.0 Test selection matrix

Question type	Recommended method	Why	Skip / alternative
Compare two numeric group means	T-test	Directly tests mean difference	If non-normal + small sample, use Mann-Whitney U
Compare 3+ numeric groups	ANOVA	Controls error better than many t-tests	If assumptions fail, use Kruskal-Wallis
Association of two categorical vars	Chi-square	Tests independence in contingency table	If expected counts are tiny, use Fisher exact (2x2)
Linear numeric relationship	Pearson correlation	Measures linear association strength	Use Spearman for monotonic non-linear / ranked data
Pre/post on same subjects	Paired t-test	Accounts for within-subject pairing	Use Wilcoxon signed-rank if non-normal
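
The matrix names several alternatives that have no code later in this chapter. The sketch below runs each parametric test next to its rank-based alternative. It uses synthetic data from a seeded random generator, so the group names, sizes, and means are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # synthetic data, for illustration only

# Two independent groups: t-test vs. its rank-based alternative
group_a = rng.normal(loc=50, scale=10, size=30)
group_b = rng.normal(loc=55, scale=10, size=30)
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# Same subjects measured twice: paired t-test vs. Wilcoxon signed-rank
before = rng.normal(loc=70, scale=8, size=25)
after = before + rng.normal(loc=2, scale=3, size=25)
t_pair, p_pair = stats.ttest_rel(before, after)
w_stat, p_w = stats.wilcoxon(before, after)

print(f"Welch t-test p={p_t:.4f} | Mann-Whitney U p={p_u:.4f}")
print(f"Paired t-test p={p_pair:.4f} | Wilcoxon p={p_w:.4f}")
```

When the two tests in a pair disagree sharply, that is usually a hint to look at outliers and distribution shape before trusting either result.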

Always check assumptions before claiming significance: independence, sample size, outliers, and distribution shape. Statistical significance is not business impact.
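
One way to keep significance and impact apart is to report an effect size next to the p-value. The sketch below computes Cohen's d with a pooled standard deviation. The helper name `cohens_d`, the `control` and `variant` arrays, and all the numbers are hypothetical and serve only as illustration.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(42)  # synthetic data, for illustration only
control = rng.normal(loc=100.0, scale=10.0, size=20_000)
variant = rng.normal(loc=100.3, scale=10.0, size=20_000)  # tiny true difference

t_stat, p_val = stats.ttest_ind(variant, control, equal_var=False)
d = cohens_d(variant, control)
print(f"p={p_val:.4f}, Cohen's d={d:.3f}")
```

With samples this large, a gap of a small fraction of a standard deviation can still yield a low p-value. The effect size states how big the gap is in standard-deviation units, which is the number to weigh against business impact.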


7.1 Descriptive statistics

Use descriptive statistics to understand the shape of the data before drawing conclusions. Mean and standard deviation summarize roughly symmetric numeric data well. Median and percentiles are more robust when the data is skewed or contains outliers. Mode suits categorical data or values that repeat often. If you are not sure which one to use, check the distribution first and decide based on skew and outliers.

python

# Central tendency
df['salary'].mean()
df['salary'].median()
df['salary'].mode()[0]

# Spread / dispersion
df['salary'].std()
df['salary'].var()
df['salary'].quantile([.10, .25, .5, .75, .90])
df['salary'].max() - df['salary'].min()   # range

# Shape of distribution
df['salary'].skew()     # >0 right-skewed, <0 left-skewed
df['salary'].kurtosis() # excess kurtosis: normal = 0, >0 = heavier tails
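
To see why the median holds up better on skewed data, here is a small sketch with a single extreme value. The `salary` figures are made up for illustration, and the 1.5 x IQR fence is a common rule of thumb rather than a method prescribed by this chapter.

```python
import pandas as pd

# Hypothetical salaries: nine typical values plus one extreme outlier
salary = pd.Series([40_000, 42_000, 45_000, 47_000, 50_000,
                    52_000, 55_000, 58_000, 60_000, 500_000])

mean = salary.mean()      # pulled upward by the outlier
median = salary.median()  # barely affected by it

# Flag outliers with the 1.5 x IQR rule of thumb
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = salary[(salary < q1 - 1.5 * iqr) | (salary > q3 + 1.5 * iqr)]

print(f"mean={mean:.0f}, median={median:.0f}")  # mean=94900, median=51000
print(outliers.tolist())                        # [500000]
```

The mean lands far above every typical salary in the list, while the median stays in the middle of them, so the median is the better summary here.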

7.2 Correlation analysis

python

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Pearson correlation matrix (linear relationships)
corr = df.select_dtypes(include='number').corr()

# Visualize as heatmap
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide upper triangle
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, mask=mask, square=True)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

# Correlation with target variable only
corr['target_col'].sort_values(ascending=False)
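
Pearson only measures linear association, so a relationship that is monotonic but curved can look weaker than it is. The sketch below compares Pearson and Spearman on a made-up exponential relationship. The `demo` frame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 21)
demo = pd.DataFrame({'x': x, 'y': np.exp(x / 3)})  # monotonic but non-linear

pearson = demo.corr(method='pearson').loc['x', 'y']    # linear association
spearman = demo.corr(method='spearman').loc['x', 'y']  # rank-based association

print(f"Pearson={pearson:.3f}, Spearman={spearman:.3f}")
```

Because `y` always increases with `x`, the ranks agree perfectly and Spearman equals 1, while Pearson stays below 1 because the points do not fall on a straight line.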

7.3 Hypothesis testing

Use tests when you need to compare groups or check whether a pattern is statistically reliable. T-tests compare two groups, ANOVA compares three or more groups, and chi-square tests are for categorical relationships. Do not use these tests just because they exist. Use them only when the question is about a difference, relationship, or association that you want to verify.

python

import pandas as pd
from scipy import stats

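# T-test: compare means of two groups (missing values dropped first)
group_a = df[df['group'] == 'A']['score'].dropna()
group_b = df[df['group'] == 'B']['score'].dropna()
# equal_var=False runs Welch's t-test, which does not assume equal variances
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t={t_stat:.3f}, p={p_val:.4f}")
print("Significant at alpha=0.05" if p_val < 0.05 else "Not significant at alpha=0.05")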

# Chi-square test: categorical association
ct = pd.crosstab(df['gender'], df['purchased'])
chi2, p, dof, expected = stats.chi2_contingency(ct)
print(f"Chi2={chi2:.3f}, p={p:.4f}, dof={dof}")

# Normality test (D'Agostino-Pearson); a large p-value does not prove normality
stat, p = stats.normaltest(df['salary'].dropna())
print("Reject normality" if p < 0.05 else "No evidence against normality")

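p-value < 0.05 = statistically significant at the 5% significance level (alpha = 0.05). p-value < 0.01 = significant at the 1% level. A p-value is the probability of seeing data at least this extreme if the null hypothesis were true; it is not the probability that the finding is correct.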
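
The chi-square approximation becomes unreliable when expected counts are tiny, which is where Fisher's exact test comes in for 2x2 tables. A minimal sketch follows, using a hypothetical table of counts. The threshold of 5 is a common rule of thumb and an assumption here, not a rule stated in this chapter.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table with small counts
table = np.array([[8, 2],
                  [1, 5]])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Rule of thumb (assumption): every expected count should be at least 5
if (expected < 5).any():
    odds_ratio, p_value = stats.fisher_exact(table)
    method = "Fisher exact"
else:
    p_value = p_chi2
    method = "Chi-square"

print(f"{method}: p={p_value:.4f}")
print(expected.round(3))
```

Here the expected counts include values below 5, so the sketch falls back to Fisher's exact test instead of reporting the chi-square p-value.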

7.4 ANOVA — compare 3+ groups

python

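from scipy import stats

# One-way ANOVA: one array of salaries per department, missing values dropped
groups = [df.loc[df['dept'] == d, 'salary'].dropna() for d in df['dept'].dropna().unique()]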
f_stat, p_val = stats.f_oneway(*groups)
print(f"F={f_stat:.3f}, p={p_val:.4f}")
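
One-way ANOVA assumes roughly equal variances across groups. The sketch below checks that with Levene's test and runs Kruskal-Wallis alongside as the rank-based fallback. The three groups are synthetic, and their means and sizes are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # synthetic data, for illustration only
groups = [rng.normal(loc=m, scale=5, size=20) for m in (50, 52, 60)]

lev_stat, p_levene = stats.levene(*groups)   # low p suggests unequal variances
f_stat, p_anova = stats.f_oneway(*groups)    # parametric comparison of means
h_stat, p_kruskal = stats.kruskal(*groups)   # rank-based alternative

print(f"Levene p={p_levene:.4f}")
print(f"ANOVA F={f_stat:.3f}, p={p_anova:.4f}")
print(f"Kruskal-Wallis H={h_stat:.3f}, p={p_kruskal:.4f}")
```

A significant result from either test only says that at least one group differs. It does not say which one, so a follow-up pairwise comparison is still needed.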

Common mistakes to avoid

Skipping business context before running technical steps
Not writing assumptions and limitations explicitly
Treating one metric as the full story

Quick cheatsheet

df.info() -> Structure and non-null counts

df.describe() -> Numeric summary statistics

df.isnull().sum() -> Missing-value counts by column

df.groupby() -> Segmented aggregation

pd.merge() -> Join multiple datasets