Day 33

Math 216: Statistical Thinking

Bastola

Introduction to Categorical Data Analysis

  • Quantitative vs. Qualitative Data: Transition from numerical to categorical data, which includes discrete categories like opinions or health status.
  • Examples: Opinions range from Excellent to Poor; Health Status is either Sick or Healthy; Vaccination Status can be Yes or No.
  • Data Presentation: Often shown in tables counting occurrences across categories.

Categorical Data Representation

Table 1: Opinions on Pricing

Category Observed
Pay much more 218
Pay somewhat more 497
Pay the same 425
Pay less 15

Table 2: Vaccination vs. Flu Status

Status No Vaccine One Shot Two Shot Total
Flu 24 9 13 46
No flu 289 100 565 954
Total 313 109 578 1000

Hypothesis Setup

  • Scenario: Evaluating gender distribution in a university STEM program against a regional expectation of a 50:50 male-to-female ratio.
  • Null Hypothesis (\(H_0\)): \(p_f = p_m = \frac{1}{2}\)
  • Alternative Hypothesis (\(H_a\)): \(p_f \neq p_m\)

Data Overview

  • Total Students: 500 (200 female, 300 male)
  • Expected under \(H_0\): Each gender = \(500 \times \frac{1}{2} = 250\)

Chi-Square Test Statistic

  • Test Statistic (\(T\)): A test statistic is one number, computed from the data, which we can use to assess the null hypothesis. Measures divergence of observed counts from expected counts under \(H_0\).
  • Calculation: Sum of the squared differences between observed and expected counts, normalized by expected counts.
\[\begin{aligned} \chi^2 = \sum{\frac{(observed - expected)^2}{expected}} = \sum{\frac{(O - E)^2}{E}} \end{aligned}\]

Chi-Square Distribution


Chi-Square Test Table

\[\begin{array}{|c|c|c|c|c|} \hline \textbf{Category} & \hat{p}_j & \textbf{Expected, } E_j & \textbf{Observed, } O_j \\ \hline \text{Female} & \frac{1}{2} & 250 & 200 \\ \hline \text{Male} & \frac{1}{2} & 250 & 300 \\ \hline \end{array}\]

Chi-Square Test Data Table

  • This table displays the categorical distribution of data for hypothesis testing.
    • \(j\): Category index
    • Category: Description of each category
    • \(\hat{p}_j\): Estimated proportion for each category
    • Expected, \(E_j\): Expected count under null hypothesis = \(n \hat{p}_j\)
    • Observed, \(O_j\): Observed count in sample
\(j\) Category \(\hat{p}_j\) Expected, \(E_j\) Observed, \(O_j\)
1 \(\mathbf{C 1}\) \(\hat{p}_1\) \(n \hat{p}_1\) \(n_1\)
2 \(\mathbf{C 2}\) \(\hat{p}_2\) \(n \hat{p}_2\) \(n_2\)
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
\(k\) \(\mathbf{C k}\) \(\hat{p}_k\) \(n \hat{p}_k\) \(n_k\)

Summary of the Steps

  • State Hypothesis
  • Calculate a test statistic, based on your sample data
  • Create a distribution of this statistic, as it would be observed if the null hypothesis were true
  • Measure how extreme your test statistic is, as compared to the distribution generated under null
  • Decision: Reject or fail to reject \(H_0\) based on the Chi-Square test value compared to critical values from Chi-Square distribution tables.
  • Implication: Determines if there is a statistically significant difference in gender distribution compared to regional expectations.

Example: Testing the Fairness of a Die

Outcome (Category) 1 2 3 4 5 6
Observed 12 7 14 15 4 8
  • Objective: Test whether the outcome of die rolls matches a fair die’s expected distribution.
  • Method: Use the Chi-Square test to compare observed outcomes against the expected frequencies for a fair die.

Testing the Fairness of a Die: R-code

# Assuming the die is fair, each of the 6 sides should appear with equal probability.
observed_counts <- c(12, 7, 14, 15, 4, 8)
expected_probabilities <- rep(1/6, 6)
chisq.test(x=observed_counts, p=expected_probabilities)

    Chi-squared test for given probabilities

data:  observed_counts
X-squared = 9.4, df = 5, p-value = 0.09413
  • Expected: Each side of the die should appear 10 times out of 60 rolls under the assumption of fairness.
  • Analysis: Perform the Chi-Square test and review the p-value to determine if there is significant evidence to reject the fairness assumption