Day 27

Math 216: Statistical Thinking

Bastola

Recap

graph TD
  A[Start] --> B{"σ known?"}
  B -->|Yes| C["Use z-test/z-interval"]
  B -->|No| D{"n ≥ 30?"}
  D -->|Yes| E["CLT: Use t-test (z ≈ t)"]
  D -->|No| F["Normal? QQ-plot/test"]
  F -->|Yes| G[Use t-test]
  F -->|No| H[Non-parametric test]
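
The flowchart can also be read as a small decision function. The sketch below is only an illustration: `choose_test()` and its argument names are hypothetical (not from any package), and the n ≥ 30 cutoff is the rule of thumb used in the chart.

```r
# Hypothetical helper mirroring the flowchart above (not from any package).
choose_test <- function(sigma_known, n, looks_normal = NA) {
  if (sigma_known) return("z-test / z-interval")
  if (n >= 30) return("t-test (CLT)")
  # Small n with unknown sigma: the normality check decides
  if (isTRUE(looks_normal)) "t-test" else "non-parametric test"
}

choose_test(sigma_known = FALSE, n = 15, looks_normal = FALSE)
```

An unknown normality status (`NA`) is treated like a failed check, which matches the cautious reading of the chart for small samples.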

With small samples (n < 30), normality checks become critical. Let’s examine real data from the Davis dataset (carData package, loaded with car), which records measured and self-reported weights:

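library(car)      # loads carData, which provides Davis
library(nortest)  # provides ad.test() (Anderson-Darling normality test)
data(Davis)
small_sample <- Davis$weight[1:15]  # Small subsample: first 15 weights (kg)
ad.test(small_sample)$p.value       # Tiny p-value: reject normality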
[1] 7.911712e-06

Challenges with Non-normal Distributions

What if the population data is decidedly non-normal?

  • Small Sample Sizes and Non-normality: When the sample is small (\(n < 30\)) and the data are non-normal, traditional tests like the t-test may become unreliable. The Type I error rate, the chance of incorrectly rejecting the null hypothesis (\(H_0\)) when it is true, can rise above the nominal level.

  • Nonparametric Statistics: These tests do not assume a normal distribution. Instead, they work with the ranks or signs of the observations, which makes them robust to outliers and extreme values.
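
A quick way to see the robustness claim is to replace one value with an outlier and compare summaries. The numbers below are made up purely for illustration.

```r
# Toy data (made up for illustration): same values, one replaced by an outlier
x     <- c(4, 5, 6, 7, 8)
x_out <- c(4, 5, 6, 7, 80)

c(mean = mean(x),     median = median(x))      # mean 6,    median 6
c(mean = mean(x_out), median = median(x_out))  # mean 20.4, median 6

# Signs relative to a reference value of 5.5 are identical in both vectors
identical(x > 5.5, x_out > 5.5)                # TRUE
```

The mean moves with the outlier, while the median and the above/below pattern used by sign-based tests do not.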

Visual Diagnostics: The Illusion of Normality (QQ plot)

Example: 15-weight sample from Davis dataset:

# qqPlot() from car draws with base graphics, so a ggplot2 theme cannot be added with `+`
qqPlot(small_sample, main="QQ-Plot: Small Sample")

Visual Diagnostics: The Illusion of Normality (Histogram)

Example: 15-weight sample from Davis dataset:

library(ggplot2)
library(tibble)
ggplot(tibble(x=small_sample), aes(x)) +
  geom_histogram(fill="#1f77b4", bins=5) +
  geom_vline(xintercept=57, color="red") +  # hypothesized median (57 kg)
  labs(title="Seemingly Normal? (n=15)")

Case Study 1: Davis Weight Data (n=15)

Population Context: Full dataset (N=200) has median=57kg, but our sample (first 15 obs) has median=68kg:

library(BSDA)  # provides SIGN.test()
SIGN.test(small_sample, md=57)$p.value
[1] 0.03515625
t.test(small_sample, mu=57)$p.value     
[1] 0.05803929

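Resolution: At the 5% level, the sign test rejects a median of 57 (sample median 68), while the t-test does not, because it is affected by:

  • Right skew (γ₁ = 1.2)
  • Outlier (166 kg) inflating the mean (71.7 vs median 68) and the standard deviation, which widens the t-test's standard error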
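
The effect of the outlier can be checked directly. The sketch below retypes the 15 weights that these slides print later in the sign-test walkthrough, so it runs without the car package; the by-hand t statistic uses the usual one-sample formula.

```r
# The 15 weights (kg), retyped from the sample printed later in these slides
small_sample <- c(77, 58, 53, 68, 59, 76, 76, 69, 71, 65, 70, 166, 51, 64, 52)

mean(small_sample)        # 71.67: pulled up by the 166 kg value
median(small_sample)      # 68
mean(small_sample[-12])   # mean without the 166 kg observation

# t statistic by hand: (mean - mu0) / (s / sqrt(n))
n      <- length(small_sample)
t_stat <- (mean(small_sample) - 57) / (sd(small_sample) / sqrt(n))
p_t    <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
c(t = t_stat, p = p_t)
```

The single extreme value raises the mean, but it raises the standard deviation even more, so the t statistic stays just short of significance.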

Case Study 2: Simulated Skewed Data (n=15)

Population: Lognormal distribution (median=7.38, mean=12.18)
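
For reference, the quoted population values follow from the standard lognormal formulas with log-scale parameters \(\mu = 2\) and \(\sigma = 1\) (the `rnorm(1000, mean=2)` call uses the default `sd = 1`):

\[ \begin{aligned} \text{median} &= e^{\mu} = e^{2} \approx 7.389 \\ \text{mean} &= e^{\mu + \sigma^2/2} = e^{2.5} \approx 12.18 \end{aligned} \]

The code on these slides uses 7.38 for the median, which is this value truncated to two decimals. These are properties of the theoretical distribution; the 1,000 simulated values in `skewed_pop` will have an empirical median and mean that differ slightly.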

set.seed(123)
skewed_pop <- exp(rnorm(1000, mean=2))  # True median=7.38
samp <- sample(skewed_pop, 15)

# Wrong approach: the t-test targets the mean (12.18), not the median
t.test(samp, mu=7.38)$p.value    
[1] 0.626605
# Right approach: Sign test
SIGN.test(samp, md=7.38)$p.value
[1] 0.6072388

Type I Error Rates (10,000 Simulations)

When H₀ is TRUE (testing median=7.38 in lognormal population):

set.seed(456)
err_rates <- replicate(10000, {
  samp <- sample(skewed_pop, 15)
  c(
    t = t.test(samp, mu = 7.38)$p.value < 0.05,
    sign = SIGN.test(samp, md = 7.38)$p.value < 0.05
  )
})

# Get one error rate per method:
rowMeans(err_rates)  
     t   sign 
0.0956 0.0354 

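Results:

  1. T-test falsely rejects 9.6% of the time, nearly double the nominal 5% (inflated Type I error). It targets the mean (12.18), not the median, so it answers the wrong question here.
  2. Sign test rejects 3.5% of the time, keeping the error rate below the nominal 5%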
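
The 3.5% figure is close to what the binomial model predicts. Assuming `SIGN.test()` uses the exact two-sided binomial p-value (which is consistent with the p-value it reports elsewhere in these slides), a test with n = 15 can only reject at α = 0.05 when 3 or fewer, or 12 or more, observations fall above the hypothesized median. A minimal check in base R:

```r
n <- 15
# Two-sided rejection region at alpha = 0.05: X <= 3 or X >= 12
size_k3 <- 2 * pbinom(3, n, 0.5)   # 0.03515625
# Widening the region to X <= 4 or X >= 11 would overshoot 5%
size_k4 <- 2 * pbinom(4, n, 0.5)   # 0.1184692
c(size_k3 = size_k3, size_k4 = size_k4)
```

Because the binomial distribution is discrete, the attainable error rate jumps from about 3.5% to about 11.8%, so the sign test at n = 15 is somewhat conservative rather than exactly 5%.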

Recommendations

  1. Small n: Use sign test unless strong evidence of normality
  2. Visual Cues:
    • Always pair histograms (≤5 bins) with QQ-plots
    • Treat “normal-looking” plots with skepticism
  3. Test Alignment:
    • Means → t-test (requires approximate normality when n is small)
    • Medians → sign test (requires only whether each value lies above or below the hypothesized median)

How P-values are Calculated: Sign Test

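Binomial Foundation: Under \(H_0\): median \(= \eta_0\), each observation independently has a 50% chance of falling above (or below) \(\eta_0\), so the number of observations above \(\eta_0\) follows a Binomial(\(n\), 0.5) distribution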

Davis Example (\(H_0\): \(\eta = 57\) kg):

small_sample
 [1]  77  58  53  68  59  76  76  69  71  65  70 166  51  64  52
above <- sum(small_sample > 57)  
above
[1] 12
n <- sum(small_sample != 57)  # sample size after dropping ties with 57 (none here)
n
[1] 15

Exact Binomial Formula:

\[ \begin{aligned} \text{p-value} &= 2 \times P(X \geq 12) \\ &= 2 \times \sum_{k=12}^{15} \binom{15}{k} (0.5)^{15} \\ &= 2 \times (0.01389 + 0.00320 + 0.00046 + 0.00003) \\ &= 0.03516 \end{aligned} \]
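
The individual terms in the sum can be reproduced with `dbinom()`, which evaluates the binomial probability mass function. This is a small base-R check of the arithmetic above:

```r
k     <- 12:15
terms <- dbinom(k, size = 15, prob = 0.5)  # choose(15, k) * 0.5^15
round(terms, 5)      # 0.01389 0.00320 0.00046 0.00003
2 * sum(terms)       # 0.03515625
```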

R Calculation:

2 * pbinom(11, 15, 0.5, lower.tail=FALSE)  # Matches SIGN.test()
[1] 0.03515625
SIGN.test(small_sample, md=57)
    One-sample Sign-Test

data:  small_sample
s = 12, p-value = 0.03516
alternative hypothesis: true median is not equal to 57
95 percent confidence interval:
 58.17817 75.10916
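
The confidence interval is not derived on these slides. One reading, offered here as an assumption about how the output arises rather than a statement of the package's internals, is that it interpolates between two order-statistic intervals whose exact coverage comes from the same Binomial(15, 0.5) model. The sketch below reproduces the printed endpoints under that assumption; `ci_for()` is a hypothetical helper.

```r
small_sample <- c(77, 58, 53, 68, 59, 76, 76, 69, 71, 65, 70, 166, 51, 64, 52)
x <- sort(small_sample)
n <- length(x)

# Interval (x[k+1], x[n-k]) covers the median with probability 1 - 2*P(X <= k)
ci_for <- function(k) c(lower = x[k + 1], upper = x[n - k],
                        coverage = 1 - 2 * pbinom(k, n, 0.5))
outer_ci <- ci_for(3)   # (58, 76), coverage about 0.965
inner_ci <- ci_for(4)   # (59, 71), coverage about 0.882

# Linear interpolation in coverage to reach a nominal 95% level
w <- (0.95 - inner_ci["coverage"]) / (outer_ci["coverage"] - inner_ci["coverage"])
interp <- inner_ci[c("lower", "upper")] +
  unname(w) * (outer_ci[c("lower", "upper")] - inner_ci[c("lower", "upper")])
round(interp, 3)
```

No order-statistic interval has exactly 95% coverage at n = 15, which is why the reported endpoints are not observed data values.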