Day 29

Math 216: Statistical Thinking

Dr. Bastola

Independent Sampling Overview

When samples cannot be paired, they are treated as independent. For two independent samples:

  • Sample 1: Size \(n_1\), mean \(\bar{x}_1\), variance \(s_1^2\)
  • Sample 2: Size \(n_2\), mean \(\bar{x}_2\), variance \(s_2^2\)

Conditions to Check:

  1. Sample Sizes: \(n_1 \geq 30\), \(n_2 \geq 30\) (for large samples).
  2. Normality: For smaller samples, ensure the data is approximately normally distributed.
  3. Variances: Known variances or assume equal variance under normality.

Sampling Distribution Properties

For \(\bar{X}_1 - \bar{X}_2\):

  • Expected Value: \(E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2\)

  • Standard Error:

    \[\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{ \underbrace{\frac{\sigma_1^2}{n_1}}_{\text{Group 1 variability}} + \underbrace{\frac{\sigma_2^2}{n_2}}_{\text{Group 2 variability}}}\]

  • Distribution Shape:

    • Exactly normal if populations are normal
    • Approximately normal via CLT for \(n \geq 30\)

Variance Homogeneity Testing

  • Formal Tests: Levene’s test
  • Practical Approach:
    • Compare variance ratios (\(s_1^2/s_2^2\))
    • Consider sample sizes (unequal n makes it harder to find effects)

Pooled Variance

Used when assuming equal population variances (\(\sigma_1^2 = \sigma_2^2\)):

\[s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}\]

  • Weighted Average: Combines variances proportionally to sample sizes
  • Degrees of Freedom: \(df = n_1 + n_2 - 2\) (total observations minus groups)
Component Meaning
\((n_1-1)s_1^2\) Scaled variability from Group 1
\((n_2-1)s_2^2\) Scaled variability from Group 2
Denominator Total degrees of freedom

Hypothesis Testing Framework

  • Null Hypothesis: \(H_0: \mu_1 - \mu_2 = 0\)
  • Alternatives:
    • \(H_a: \mu_1 - \mu_2 \neq 0\) (Two-tailed)
    • \(H_a: \mu_1 - \mu_2 > 0\) (One-tailed)

Case 1: Equal Variances (Pooled) \[t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\] \[s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\] \[df = n_1 + n_2 - 2\]

Case 2: Unequal Variances (Welch) \[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\] \[df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}\]

Confidence Interval

General Form

\[(\bar{x}_1 - \bar{x}_2) \pm t^*_{\alpha/2} \cdot SE\]

Pooled Variance CI: \[SE = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\]

Welch’s CI: \[SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

Interpretation Example

“With 95% confidence, the true mean difference lies between [−3.2, 5.8]. As this interval contains 0, we cannot reject the null hypothesis at α=0.05.”

Practical Considerations

When to Use Pooled Variance

  • Small samples with strong evidence of equal variances
  • Experimental designs with enforced variance control
  • Historical data showing consistent variance ratios

When to Use Welch’s Test

  • Default approach for unknown population variances
  • Unequal sample sizes (\(n_1 \neq n_2\))
  • Variance ratio > 4:1 between groups
  • Conservative error rate control