Worksheet 03: Exploring the partitioning of variance in cluster sampling

Published

April 14, 2025

library(tidyverse);library(knitr)
library(sampling); library(survey)

A student wants to estimate the average GPA in their dorm. Obtaining a listing of all students in the hall and conducting an SRS would take a lot of time. Instead, since each of the 100 suites in the hall has 4 students, the student randomly samples 5 suites and collects GPA data for every student in each sampled suite. These data are used in Examples 5.2 and 5.4. Let's explore them.

What is contained in each row?

gpa.data <- readr::read_csv(here::here("data", "gpa.csv"))
head(gpa.data)

# A tibble: 6 × 3
  suite   gpa    wt
  <dbl> <dbl> <dbl>
1     1  3.08    20
2     1  2.6     20
3     1  3.44    20
4     1  3.04    20
5     2  2.36    20
6     2  3.04    20
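As a quick check on the `wt` column, the arithmetic of the design can be sketched as follows. This assumes `wt` is the sampling weight for a one-stage cluster sample, so each student's weight is the inverse of the suite's selection probability; the variable names are illustrative and not part of the worksheet.

```r
# Design described in the scenario: 100 suites, 4 students each, 5 suites sampled
N <- 100  # suites in the population
n <- 5    # suites sampled
M <- 4    # students per suite

# Assumption: every student in a sampled suite is observed, so a student's
# inclusion probability equals the suite's, n / N, and the weight is N / n
wt <- N / n

# The weights of all sampled students should add up to the population size
sum_of_weights <- sum(rep(wt, n * M))

print(c(weight = wt, sum_of_weights = sum_of_weights, population = N * M))
```

Each sampled student therefore stands in for `wt` students in the hall.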

What is the explanatory variable? What is the response variable?

Recreate the ANOVA table in 5.4.

aov(gpa ~ suite, data=gpa.data) |> summary()

            Df Sum Sq Mean Sq F value Pr(>F)
suite        1  0.008 0.00784   0.028  0.869
Residuals   18  5.023 0.27908

What went wrong? Explain how you detected this, how you fixed it, and rerun the ANOVA with the correct data.
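One way to approach this kind of diagnosis is to compare the degrees of freedom reported for the grouping term with the number of groups. The sketch below uses a small made-up data frame (`toy`, with columns `cluster` and `y`; these are hypothetical values, not the GPA data) to show how `aov()` treats a grouping variable stored as a number versus as a factor.

```r
# Hypothetical toy data (NOT the worksheet's GPA values): 3 clusters of 4 units
toy <- data.frame(
  cluster = rep(1:3, each = 4),
  y = c(3.1, 2.9, 3.4, 3.0, 2.4, 2.8, 2.6, 2.5, 3.6, 3.3, 3.8, 3.5)
)

# Numeric cluster id: aov() fits a single slope, so the term uses 1 df
fit_numeric <- aov(y ~ cluster, data = toy)
df_numeric <- summary(fit_numeric)[[1]][["Df"]]

# Factor cluster id: one mean per cluster, so the term uses (clusters - 1) df
fit_factor <- aov(y ~ factor(cluster), data = toy)
df_factor <- summary(fit_factor)[[1]][["Df"]]

print(rbind(numeric = df_numeric, factor = df_factor))
```

Checking `class()` or `str()` on the grouping column before fitting is a cheap way to catch this.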

From the ANOVA table, calculate the unbiased estimate of the population variance $S^2$ and interpret this number: $$\hat{S}^2 = \frac{(N - 1)\,\widehat{MSB} + N (M - 1)\,\widehat{MSW}}{N M - 1}$$
Calculate the ICC and $R^2$.
By how much does the variance increase when cluster sampling is used instead of an SRS?
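The three calculations above can be organized as small helper functions. This is a minimal sketch under stated assumptions: the function names are illustrative, the ICC and adjusted $R^2$ use the usual equal-cluster-size definitions ($ICC = 1 - \frac{M}{M-1}\frac{SSW}{SSTO}$ and $R_a^2 = 1 - MSW/S^2$), "R2" is taken to mean the adjusted version, and the variance ratio of cluster sampling to an SRS of the same number of students is taken as $MSB/S^2$. The mean squares below are placeholders, not the answers to this worksheet.

```r
# msb, msw: between- and within-cluster mean squares from an ANOVA table
# N: number of clusters in the population, M: units per cluster

# Estimate of the population variance S^2 from the mean squares
s2_hat <- function(msb, msw, N, M) {
  ((N - 1) * msb + N * (M - 1) * msw) / (N * M - 1)
}

# Assumed definition: ICC = 1 - M/(M - 1) * SSW/SSTO,
# rewritten with SSW = N(M - 1) MSW and SSTO = (NM - 1) S^2
icc_hat <- function(msw, s2, N, M) {
  1 - (M * N * msw) / ((N * M - 1) * s2)
}

# Assumed definition: adjusted R^2 = 1 - MSW / S^2
r2a_hat <- function(msw, s2) 1 - msw / s2

# Variance of the cluster-sample estimator relative to an SRS of the same size
var_ratio <- function(msb, s2) msb / s2

# Placeholder mean squares (hypothetical, NOT the worksheet's answers)
msb <- 0.5
msw <- 0.25
N <- 100
M <- 4

s2 <- s2_hat(msb, msw, N, M)
results <- c(
  S2 = s2,
  ICC = icc_hat(msw, s2, N, M),
  R2a = r2a_hat(msw, s2),
  ratio = var_ratio(msb, s2)
)
print(round(results, 3))
```

A ratio above 1 means the cluster sample is less precise than an SRS with the same number of students, which is what happens when students within a suite resemble each other (positive ICC).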