Day 4

Math 216: Statistical Thinking

Bastola

Recap: Student Survey

Dataset on 362 responses to a student survey given at one college.

Year       Gender  Smoke  Height  Weight  SAT   Pulse
Senior     M       No     71      180     1210  54
Sophomore  F       Yes    66      120     1150  66
FirstYear  M       No     72      208     1110  130
Junior     M       No     63      110     1120  78
Sophomore  F       No     65      150     1170  40
Sophomore  F       No     65      114     1150  80
FirstYear  F       No     66      128     1320  94
Sophomore  M       No     74      235     1370  77
Junior     F       No     61      NA      1100  60
FirstYear  F       No     60      115     1370  94
Sophomore  F       No     65      140     1170  63
FirstYear  M       No     63      200     1180  72
Sophomore  M       No     68      162     1150  54
Junior     F       No     68      135     1300  66
FirstYear  M       No     68      193     1350  72
FirstYear  F       No     63      110     1200  59
FirstYear  F       No     63      99      1200  88
Sophomore  M       No     72      165     1350  59
Sophomore  F       No     62      120     1410  64
Sophomore  F       No     67      154     1000  72

Histogram

ggplot(survey, aes(x = Pulse)) +
  geom_histogram(binwidth = 5, col = "gold", fill = "maroon") +
  labs(
    title = "Histogram of pulse rates",
    x = "Pulse",
    y = "Count"
  )

Histogram: Shape

  • Histogram: Aggregates values into bins and counts how many cases fall into each bin.
  • Pulse rates are roughly symmetric around a center of about 70 beats per minute.
  • Symmetric distributions are “centered” around a mean and median that are roughly the same in value.
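The binning step can be sketched in base R. This is only an illustration, assuming the 20 Pulse values from the recap table and bin edges every 5 beats per minute starting at 35; geom_histogram() may place its bin edges differently, so the counts need not match the plotted bars exactly.

```r
# Pulse values from the 20 survey rows shown in the recap table
pulse <- c(54, 66, 130, 78, 40, 80, 94, 77, 60, 94,
           63, 72, 54, 66, 72, 59, 88, 59, 64, 72)

# Assumed bin edges: every 5 bpm from 35 to 130.
# cut() assigns each value to an interval of the form (lower, upper].
breaks <- seq(35, 130, by = 5)
bins <- cut(pulse, breaks = breaks)

# Count the cases in each bin; these counts are the bar heights
bin_counts <- table(bins)
print(bin_counts[bin_counts > 0])
```

Widening or narrowing the bins (the `by` argument here, `binwidth` in geom_histogram()) changes how smooth or jagged the histogram looks without changing the data.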

Shape and Stats

  • Mean and standard deviation are good summary stats of a symmetric distribution.
  • Variation is similar to the left and right of the mean, so a single measure of spread (the SD) describes both sides.
survey %>% summarize(mean = mean(Pulse), sd = sd(Pulse))
      mean       sd
1 69.63611 12.10508

Shape: Data Distribution

If a distribution of data is approximately bell-shaped, about 95% of the data should fall within two standard deviations of the sample mean.

  • For a sample: about 95% of values lie between \(\bar{y} - 2s\) and \(\bar{y} + 2s\)
  • For a population: about 95% of values lie between \(\mu - 2\sigma\) and \(\mu + 2\sigma\)
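As a quick worked example, the rule can be applied to the pulse summary reported earlier (mean 69.63611, SD 12.10508). The variable names are introduced here for illustration only.

```r
# Sample mean and SD of Pulse, copied from the summarize() output
pulse_mean <- 69.63611
pulse_sd <- 12.10508

# Two standard deviations below and above the sample mean
lower <- pulse_mean - 2 * pulse_sd
upper <- pulse_mean + 2 * pulse_sd
print(c(lower = lower, upper = upper))
```

If the pulse distribution is approximately bell-shaped, about 95% of students should have pulse rates between roughly 45 and 94 beats per minute.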

Think-Pair-Share

  • Objective: Understand symmetric distributions through real-world examples.
  • Instructions: Think of an example, discuss with a partner, and share with the class.

Bell-shaped Distribution

The standard deviation for math SAT scores is closest to

  1. 100
  2. 75
  3. 200
  4. 25

Standardizing Data: Z-score

The z-score of a data value, x, tells us how many standard deviations the value is above or below the mean:

\[ z = \dfrac{x - \text{mean}}{\text{SD}} \]

  • E.g., if a value \(x\) has \(z = -1.5\), then the value \(x\) is 1.5 standard deviations below the mean.

If we standardize all values in a bell-shaped distribution, about 95% of the z-scores fall between -2 and 2.
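A minimal sketch of standardizing in R, assuming only the 20 Pulse values from the recap table (so the mean and SD here are those of this small subset, not of the full survey):

```r
# Pulse values from the 20 survey rows shown in the recap table
pulse <- c(54, 66, 130, 78, 40, 80, 94, 77, 60, 94,
           63, 72, 54, 66, 72, 59, 88, 59, 64, 72)

# z = (x - mean) / SD, applied to every value at once
z <- (pulse - mean(pulse)) / sd(pulse)
print(round(z, 2))

# Proportion of z-scores between -2 and 2
prop_within_2 <- mean(abs(z) <= 2)
print(prop_within_2)
```

In this subset only the pulse of 130 lies more than two standard deviations from the mean (its z-score is about 3), so 19 of the 20 z-scores fall between -2 and 2. With so few values, agreement with the 95% figure is not guaranteed.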

Quiz

  • Objective: Assess understanding of z-scores.
  • Instructions: Answer the quiz question.

Shape: Left Skew & Right Skew (Histograms)

mean(survey$GPA, na.rm = TRUE)
[1] 3.157942
median(survey$GPA, na.rm = TRUE)
[1] 3.2

mean(survey$Exercise, na.rm = TRUE)
[1] 9.048747
median(survey$Exercise, na.rm = TRUE)
[1] 8
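A common rule of thumb links these comparisons to skew: a long left tail pulls the mean below the median, and a long right tail pulls the mean above it. The sketch below encodes that heuristic; `skew_direction()` is a hypothetical helper written for this illustration, and it is not a formal test of skewness.

```r
# Hypothetical helper: guess the skew direction from mean vs. median
skew_direction <- function(mean_value, median_value) {
  if (mean_value < median_value) {
    "left"
  } else if (mean_value > median_value) {
    "right"
  } else {
    "symmetric"
  }
}

# Values copied from the output above
gpa_skew <- skew_direction(3.157942, 3.2)     # mean below median
exercise_skew <- skew_direction(9.048747, 8)  # mean above median
print(c(GPA = gpa_skew, Exercise = exercise_skew))
```

By this heuristic GPA looks left-skewed and Exercise looks right-skewed. A histogram or boxplot should still be checked, since a small gap between mean and median can also occur in a nearly symmetric distribution.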

Shape: Boxplots

A boxplot is a graphical summary of a dataset's distribution, showing the median, the quartiles, and potential outliers.

  • Box: Represents the interquartile range (IQR) between the 1st quartile (Q1) and the 3rd quartile (Q3).
  • Median: The middle value of the dataset, represented by a line inside the box.
  • Whiskers: Extend from the box to the smallest and largest data points that lie within 1.5 times the IQR of the quartiles (no lower than Q1 − 1.5 × IQR and no higher than Q3 + 1.5 × IQR).
  • Outliers: Data points outside of the whiskers, often represented as individual points.
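The quantities behind the box can be computed directly. This sketch assumes the 20 Pulse values from the recap table and R's default quantile method, so the quartiles may differ slightly from hand calculations that use another convention.

```r
# Pulse values from the 20 survey rows shown in the recap table
pulse <- c(54, 66, 130, 78, 40, 80, 94, 77, 60, 94,
           63, 72, 54, 66, 72, 59, 88, 59, 64, 72)

# Minimum, Q1, median, Q3, maximum
five_num <- quantile(pulse, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)

# Width of the box: IQR = Q3 - Q1
pulse_iqr <- IQR(pulse)
print(pulse_iqr)
```

For this subset the box runs from Q1 = 59.75 to Q3 = 78.5, with the median line at 69 and an IQR of 18.75.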

Shape: Boxplots

Symmetry: If the median is roughly centered within the box, and the whiskers are of similar length, the distribution is likely symmetric.

Skewness:

  • Left-skewed: The median is closer to the upper quartile (Q3), and the left whisker is longer than the right whisker.
  • Right-skewed: The median is closer to the lower quartile (Q1), and the right whisker is longer than the left whisker.

Elements of a Box Plot: Inner Fences

Lower Inner Fence: \[ \text{Lower inner fence} = Q_L - 1.5 \times \text{IQR} \]

Upper Inner Fence: \[ \text{Upper inner fence} = Q_U + 1.5 \times \text{IQR} \]

Here \(Q_L\) and \(Q_U\) are the lower and upper quartiles (Q1 and Q3), and \(\text{IQR} = Q_U - Q_L\). Inner fences are used to flag potential mild outliers: points beyond a fence that are unusual but not extreme. The 1.5 multiplier stretches the fences far enough beyond the box to cover ordinary variability in the data while still flagging more extreme values.
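The fence formulas can be checked numerically. The sketch below assumes the 20 Pulse values from the recap table and R's default quantile method; all variable names are illustrative.

```r
# Pulse values from the 20 survey rows shown in the recap table
pulse <- c(54, 66, 130, 78, 40, 80, 94, 77, 60, 94,
           63, 72, 54, 66, 72, 59, 88, 59, 64, 72)

q_lower <- unname(quantile(pulse, 0.25))  # Q1 = 59.75
q_upper <- unname(quantile(pulse, 0.75))  # Q3 = 78.5
iqr <- q_upper - q_lower                  # 18.75

# Inner fences: 1.5 IQRs beyond each quartile
lower_fence <- q_lower - 1.5 * iqr        # 31.625
upper_fence <- q_upper + 1.5 * iqr        # 106.625

# Points beyond a fence are flagged as potential outliers
outliers <- pulse[pulse < lower_fence | pulse > upper_fence]

# Whiskers end at the most extreme points still inside the fences
whisker_ends <- range(pulse[pulse >= lower_fence & pulse <= upper_fence])

print(outliers)      # 130
print(whisker_ends)  # 40 94
```

Only the pulse of 130 falls outside the fences, so a boxplot of this subset would draw it as an individual point and stop the upper whisker at 94.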

Adding a Qualitative Variable: Stats

Use group_by() and summarise() from the dplyr package to compute summary statistics and compare distributions across the levels of a qualitative variable.

survey %>%
  group_by(Smoke) %>%
  summarise(Min = min(Pulse),
            Q1 = quantile(Pulse, 0.25),
            Median = median(Pulse),
            Mean = mean(Pulse),
            Q3 = quantile(Pulse, 0.75),
            Max = max(Pulse),
            SD = sd(Pulse),
            N = n()) %>% 
  knitr::kable(caption = "Summary Stats of Smoke Categories")
Summary Stats of Smoke Categories
Smoke  Min  Q1  Median  Mean      Q3  Max  SD        N
No     35   61  69      69.34069  77  130  12.14980  317
Yes    42   65  72      71.81395  79  96   11.67671  43

Smokers have a slightly higher mean pulse rate than non-smokers (71.8 vs. 69.3).

Side-by-side Histogram

ggplot(survey, aes(x = Pulse)) + 
  geom_histogram(binwidth = 4, fill = "maroon", col = "gold") +
  facet_wrap(~Smoke)

Side-by-side Boxplot

ggplot(survey, aes(x = Smoke, y = Pulse)) +
  geom_boxplot(fill = c("maroon", "purple"), col = "darkgreen") +
  labs(x = "Smoking Status", y = "Pulse Rate")

Stacked Bar Graph (Proportions)

ggplot(survey, aes(x = Year, fill = Gender)) + 
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("maroon", "gold")) +
  labs(title = "Proportional Distribution by Year and Gender", 
       y = "proportion", 
       x = "year", 
       fill = "Gender")
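With position = "fill", every bar is rescaled to height 1, so each segment shows the proportion of that gender within a year. With position = "stack", the segments keep their raw counts. The numbers behind both views can be sketched in base R, assuming only the 20 rows shown in the recap table:

```r
# Year and Gender from the 20 survey rows shown in the recap table
year <- c("Senior", "Sophomore", "FirstYear", "Junior", "Sophomore",
          "Sophomore", "FirstYear", "Sophomore", "Junior", "FirstYear",
          "Sophomore", "FirstYear", "Sophomore", "Junior", "FirstYear",
          "FirstYear", "FirstYear", "Sophomore", "Sophomore", "Sophomore")
gender <- c("M", "F", "M", "M", "F", "F", "F", "M", "F", "F",
            "F", "M", "M", "F", "M", "F", "F", "M", "F", "F")

# Two-way table of counts: what a stacked bar graph of counts displays
counts <- table(Year = year, Gender = gender)
print(counts)

# Row proportions: what position = "fill" displays (each year sums to 1)
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

In this subset 6 of the 9 sophomores are female, so the "F" segment of the Sophomore bar would fill two thirds of its height in the proportional version.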

Stacked Bar Graph (Counts)

ggplot(survey, aes(x = Year, fill = Gender)) + 
  geom_bar(position = "stack") + 
  scale_fill_manual(values = c("maroon", "gold")) +
  labs(title = "Counts by Year and Gender", 
       y = "count", 
       x = "year", 
       fill = "Gender")