Cluster Sampling

When units are not necessarily nicely defined, even when the population is.
Author

DRAFT

Published

April 1, 2025

library(tidyverse);library(knitr)
library(sampling); library(survey)

Introduction

Sampling Units

A psu or psus (plural) is the shortening of the term, primary sampling unit(s). These are the sampling units in a cluster sample. An alternate term is cluster.

A ssu or ssus (plural) is the shortening of the term, secondary sampling unit(s). These are the observation units observed within our clusters.

Example of Clustering:

When taking a survey of a community, rather than do an SRS of households, you might break up the community into blocks and take an SRS of blocks and sample all or some of the households that fall within those blocks. The blocks are your psus, and the households are your ssus. This method might be less costly if you were doing an in person survey, but might also be less accurate.

Why use Cluster Sampling?

Clustering is most common for its application.The purpose of stratifying a sample before was to better your prediction or accuracy when the variance varied quite differently among the population. It is not quite the same for Cluster Sampling, for it actually decreases in precision due to some positive correlation between unmeasurable factors that effect certain specific clusters.

We use Cluster Sampling when: 1. It may be unfeasible or expensive to construct a sampling frame of the population, and breaking into groups may make sampling doable. 2. In some populations, it’s less costly to sample in clusters, in terms of travel and time.

Textbook Figure 5.1. Contrasting stratified sampling with one-stage cluster sampling.

When clusters are Ignored

When clusters are ignored in the design or analysis phase, the sample can mistakenly be treated as if it were using the simple random sampling design. When they are ignored they can lead to incorrect formulas that have inaccurate conclusions. For example, when we analyze our sample as an SRS the variability can be underestimated on account of the clusters being designed to be more similar and homogeneous compared to an SRS and stratified sample. This will lead to values that are too small and confidence intervals that can be too narrow.

Example: Studies in education

Many studies in education (either involving teachers, students, other faculty, etc.) use cluster sampling for data collection. Clusters can include entire schools or even just individual classrooms. This is much more cost effective and logistically simpler than taking an SRS of individuals (students, staff, etc.).

One known example [conducted by Basow and Silberg (1987) and Basow and Martin (2012)] examined whether students rated professors differently based on gender (female vs male professors).

  • In this study they took 16 female and 16 male professors and matched them by their subjects, experience, and tenure status.
  • They then sampled each of their classrooms full of students for data collection.
  • They, in total, collected 1029 individual student responses but since we are using cluster sampling the sample size will be n = 32 as there are 32 total clusters. The analysis would be wrong if we ignored that it was a cluster sample and treated it as an SRS of size 1029.

This is a great real world example of how cluster sampling can streamline collecting important data while still gaining valuable insight.

Notation & Formulas

In SRS, each individual element is the sampling unit, and has an equal chance to be sampled. In stratified sampling, the population is divided into H known strata, and a sample is taken from each. Our sampling unit in the cluster design is the entire cluster, or primary sampling unit psus, and not the individual elements.

In stratified sampling we have our entire population being N total population of each individual element, whereas our entire universe in cluster sampling is the number of clusters, that is N is the number of psus.

⚠️ Note notation change!

  • yij: measurement for the jth element in ith psu
  • N: the number of clusters (psus) in the population
    • n: the number of psus from the sample
  • Mi: number of ssus in psu i in the population
    • mi: the number of ssu’s in psu i from the sample
  • M0: total number of ssus in the population
Population Quantity Estimator (θ^) Estimated variance of (θ^)
Total in psu i: ti=jyij t^i=Mimijyij -
Variance of psu totals : St2=1N1i(titN)2 st2=1n1i(tit^N)2 AKA variance between psu
Overall Total: t=iti t^=Nnit^i N2(1nN)st2n
mean in psu i: μi=1Mijyij y¯i=1mijyij si2=1mi1j(yijy¯i)2 (AKA variance within psu)
Overall mean: μ=1M0ijyij y¯=t^NM (1nN)st2nM2

Overall there are many differences between clustering, stratification, and SRS, that are important to note.


One-Stage Clustering Sampling

Equal size clusters: Estimation

One-stage clustering is the type of clustering you’re probably already familiar with - either all (or none) of the observations inside of a cluster (psu) are sampled.

While this is an unrealistic scenario for most statistical work in the social sciences, agricultural or industrial surveys may be able to take advantage of the much simpler process that equal size clustering brings.

Because the psus are selected through an SRS, we can use the per-cluster estimates as single observations in an SRS. There’s no need to consider the individual ssus within each cluster. Important distinction: we treat this as an SRS of n psus, not an SRS of nM observational units.

In this setup, we typically aim to estimate the population total or mean of a variable of interest based on the totals from each sampled cluster.

Example 5.2: Average GPA in a dorm

Perhaps we want to estimate the average GPA in a college dormitory with a one-stage cluster design. The dorm consists of 100 units, each with 4 students. Here’s the process:

  1. An SRS is conducted to select 5 clusters.
  2. All 4 students (ssus) per dormitory unit (psus) have their GPAs measured
  3. Calculate your estimates based on your psu statistics, not your ssu measurements.

(Slightly modified version of Example 5.2 from Lohr’s R Companion to Sampling 3rd ed. Page 57. DOI: 10.1201/9781003228196-5)

gpa <- readr::read_csv(here::here("data", "gpa.csv"))
N <- 100 # total dorm population (this won't be NROW(gpa)!)
n <- 5   # total number of clusters/psus selected
M <- 4   # observational units/ssus selected per psu

Using equations

To estimate the mean, we first estimate the total (because it’s easier math).

The estimated total is calculated by a weighted sum of the cluster totals. t^=NniSti

We can use the tapply function to calculate the sum of gpa across each suite.

(suitesum <- tapply(gpa$gpa, gpa$suite, sum))
    1     2     3     4     5 
12.16 11.36  8.96 12.96 11.08 
# gpa %>% group_by(suite) %>% summarize(t_i = sum(gpa)) # alternative method

The variance across these totals can be found using the standard variance formula: st2=1n1iS(tit^N)2

(st2 <- round(var(suitesum), digits = 3)) 
[1] 2.256

Now we can convert these estimates of the total to estimates of the mean using:

y^=t^NM and SE(y^)=1M(1nN)st2n

t.hat <- sum(suitesum)*(N/n)
(y.bar.hat <- t.hat/(M*N))
[1] 2.826
(se.ybar.yat <- (1/M)*sqrt((1-n/N)*(st2/n)))
[1] 0.1636765

Using the survey package

Create a survey design for one-stage clustering

OSC_gpa <- svydesign(id = ~suite,
                     weights = ~wt,#already on the data set
                     fpc = ~rep(100,20),
                     data = gpa)

calculate the mean GPA with SE

svymean(~gpa, OSC_gpa)
     mean     SE
gpa 2.826 0.1637

Sampling Weights

In one-stage cluster sampling with equal-sized clusters, each cluster is treated as a single observation. The sampling weight for each observation unit j in PSU i is based on the inverse probability of that PSU being selected. This results in a self-weighting sample, meaning every sampled element gets the same weight.

wij=Nn

The total sampling weight across all sampled units equals the total number of units in the population (NM). Make sure to always verify that the sum of your weights matches the population size. Population total and mean can also be estimated using these weights by summing the weighted values or averaging them appropriately.

This weighting approach allows you to use one-stage cluster sampling estimates similarly to how you’d handle stratified or SRS data, despite the cluster structure.


Equal size clusters: Theory

Comparing a cluster sample with an SRS of the same size

ANalysis Of VAriance

Don’t recall how ANOVA works to break up the variance into components? See this great video from the Open Intro Statistics author, Mine Cetinkaya-Rundel

MSB (Mean Squared Between psus)

The sum of squares between psus (SSB) measures the variability across the psu means. The mean squared between psus (MSB) is the SSB divided by the number of psus -1.

SSB=i(y¯iy¯)2MSB=SSBN1

If MSB is relatively large, it means that there is more of a difference between data points in different clusters than within the same clusters

For cluster sampling, the variability of the unbiased estimator of t depends entirely on between the psus rather than within psus.

V(t^cluster)=N2(1nN)(M(MSB)n)

MSW (Mean Squared Within psus)

The sum of squares within psus (SSB) measures the variability within the psus. The MSW is the SSW divided by the total number of ssus minus the number of psus.

SSW=ij(yijy¯i)2MSW=SSWN(M1)

The MSW is not used for calculating the variance in cluster sampling like it is in stratified sampling. While stratified sampling is best when there is a lot of variance within strata, cluster sampling is less effective if there’s a lot of variance within psus.

If the ratio of MSB/MSW is large, it means that stratified sampling will be more precise and cluster sampling will be less precise.

ICC (Intraclass Correlation Coefficient)

The ICC measures how similar items are within the same group or cluster. It helps determine the usefulness of cluster sampling for a specific sample.

A high ICC means that items in the same cluster are very similar, indicating a high level of homogeneity. A low ICC means that items within the cluster are more different from each other.

The higher the ICC, the less efficient the sample is compared to simple random sampling (SRS). This is because high ICC indicates redundancy within clusters. The loss in precision due to clustering can be estimated by comparing the variance of cluster sampling over the variance of the SRS.

ICC is typically positive in naturally occurring clusters. For example, wealthy families living in the same neighborhood, or crops in a field exposed to the same pesticide levels, are likely to show positive ICC due to environmental or social factors.

In contrast, ICC can be negative in artificially constructed or systematic clusters. A negative ICC indicates that items within a cluster are more dispersed than in a randomly selected group. This often results in cluster means being roughly equal, and in such cases, cluster sampling can actually be more efficient than SRS.

Adjusted R Squared

Adjusted R^2 can be used as an alternative measure of homogeneity due to it being very close to the ICC in many populations. The adj R^2 can only be used as a replacement in cases where all primary sampling units (psus) are the same size.

1MSWS2

Example 5.3. Consider two artificial populations

Refer to the textbook for numeric specifics

  • A & B both come from the same population, and when sampled each has 3 clusters and 3 elements (ssus) inside each cluster.
  • Even though they share the same S2 and y¯ there variances differ in a real specific way.
    • Population A has a large variance within each cluster which is reflected in the negative ICC and R2.
    • Whereas, in Population B the variance occurs mostly between clusters.
Example 5.5. When is a cluster not a cluster?

When it’s the whole population!

In Example 4.5 (in Chapter 4), we look at data from a study evaluating the effects of feral pig activity and drought on the native vegetation in Santa Cruz Island in California.

In that example, the sampling unit was a single tree, the observational unit was a single seedling by the tree, and the population of interest was all the trees on Santa Cruz Island. Then, an SRS was conducted!

But we’re doing cluster sampling right now. So, suppose that the researcher was instead intersted in seedling survival across all of California. The researcher would have divided the regions into equal-sized areas, randomly-selected some of those areas to study, and recorded the measurements.

The sampling unit is no longer a tree, it’s an area!

So, if Santa Cruz Island was selected as one of those areas in our study about all of California, then we could no longer treat the trees from Santa Cruz Island as though they were part of an SRS from all trees in California. They’re all similar trees, from the same island, that experience identical climates, in almost the same soil. We would expect all ten trees on Santa Cruz Island to experience, as a group, different environmental factors (such as weather conditions and numbers of seedling eaters) than the ten trees selected in the Santa Ynez Valley on the mainland. Thus the ICC within each cluster (area) would likely be positive.

But…

What if we were only interested in the seedlings from the 10th tree on Santa Cruz Island? Then, the population is all seedlings from the 10th tree and the psu is the seedling. At this point, we’re not doing cluster sampling, we’re doing a census of the population!

Unequal size clusters

Motivating Example: Census of 1937

In the Enumerative Check Census of 1937, they took a 2% sample of postal routes (psus), and questionnaires were give to every household(ssus) on each chosen postal routes. Since not all postal routes had the same number of households, this survey was a clustered sample with clusters of unequal sizes. The variable they were investigating, the total number of unemployed people, was impacted within each cluster by the size of the cluster. The more people on a route, the more unemployed people on the postal route. This means that the between variance of the cluster totals would be higher. Something else to note about this survey is that they only knew the number of houses in each of the selected groups.

The unbiased estimator(s) for the total are: t^unb=NniSti with SE(t^unb)=N(1nN)(st2n)

  • Note that the error of the estimator is dependent on the variance of the cluster totals (st2)
  • The variation among the individual cluster totals ti is likely to be large when clusters have different sizes.

M0=iNMi

y¯^unb=t^unb/M0

SE(y¯^unb)=SE(t^unb)/M0

  • The size of every psu is needed to be known for this formula.
  • The unbiased estimator for the mean can be inefficient when values of M_i’s are unequal for the same reasons the unbiased estimate of the total can be inefficient
  • A ratio estimator of the mean often has smaller variance.

Ratio Estimation

We usually expect ti to be positively correlated with Mi.

  • Example: If the primary sampling units (PSUs) are counties, we would expect the total number of households living in poverty in county i(ti) to be roughly proportional to the total number of households in the country i(Mi)

Therefore our population mean is a ratio where ti and mi are positively correlated. We define that now as:

yr¯^=t^umbM0^=NniStiNniSMi=iSMiyi¯iSMi

  • The estimator yr¯^ is the quantity B in 4.1. (Our sample’s ratio mean.)

The denominator is a random quantity that depends on which particular clusters are included in the sample. If the Mi are unequal and a different cluster sample of size n is taken, the denominator will likely be different. From taking ei=tiMiyr¯^=Mi(y¯iyi¯^) we have:

SE(yr¯^)=(1nN)sr2nM2

and,

sr2=1n1iSMi2(y¯iyi¯^)2

  • M¯=M0^N represents the average size of the PSUs in the sample.
  • The variance of the ratio estimator depends on the variability of the means per element in the PSUs and can be much smaller than that of the unbiased estimator y¯^unb.
  • This can also be done when we know our cluster sizes.

Estimation using weights

w_{ij} = \frac{1}{\mbox{probability of w_{ij} being in the sample}} = \frac{N}{n}

  • One stage clustering has self-weighting samples no matter whether the clusters are equal sizes or not
  • In clusters of equal sizes, the weights sum to the population size but in clusters of unequal sizes the sum of the weights estimates the population size.

M^0=iSjSiwij=NniS

  • M^0 is a random variable, different for every sample.

Using weights:

t^unb=iSjSiwijyij

yr¯^=iSjSiwijyijiSjSiwij

Example. High school Algebra

Consider a population of 187 high school algebra classes in a city. An investigator takes an SRS of classes, and gives each student in the sampled classes a test about function knowledge. The (hypothetical) data are given in the file algebra.csv.

  1. Load in the data and look at the first few rows. Explain what the values in each column mean.
algebra <- readr::read_csv(here::here("data", "algebra.csv"))
head(algebra)
# A tibble: 6 × 3
  class    Mi score
  <dbl> <dbl> <dbl>
1    23    20    57
2    23    20    90
3    23    20    56
4    23    20    57
5    23    20    46
6    23    20    55
  • Each row is a student, with three columns: class, Mi, and score.
  • class is a number representing which class the student is in. These are the psus.
  • Mi is the number of students in that specific class. For class 23, there are 20 students.
  • score is the score the student earned on the given test.
  1. Plot the distribution of test score by class. Interpret your graph by comparing the medians, ranges, and variances across classes. What do you notice?
algebra$class <- as.factor(algebra$class)
ggplot(algebra, aes(x = class, y = score)) + geom_boxplot()

algebra %>% group_by(class) %>% summarise(median = median(score),
                                          min = min(score),
                                          max = max(score),
                                          var = var(score))
# A tibble: 12 × 5
   class median   min   max   var
   <fct>  <dbl> <dbl> <dbl> <dbl>
 1 23      58      25    90  310.
 2 37      64      24   100  461.
 3 38      58.5    24   100  420.
 4 39      54      32    89  200.
 5 41      58.5    19    87  188.
 6 44      64.5    27   100  193.
 7 46      49      34    88  253.
 8 51      71.5    22   100  329.
 9 58      51      28   100  546.
10 62      71      31    99  282.
11 106     67.5    14    94  355.
12 108     69.5    31   100  227.

There is a large range between the minimum and the maximum score in each class, with all of the minimums being between 14 and 34 and all of the maximums being between 87 and 100. We can also see some pretty large differences in median between classes, for example Class 46 has a median score of 49.0 and Class 51 has a median score of 71.5. For variance, the lowest class is Class 41 with about 188, and the largest class is Class 58 with about 546.

  1. How many psus are there? What is the psu with the highest number of ssus? Which one has the least? You must answer this question using R functions such asunique or table. Don’t count this by hand (practice for when you have a billion rows)
(n <- length(unique(algebra$class))) # Total number of psu
[1] 12
psu <- table(algebra$class)

(psu.max.ssu<- names(which.max(psu))) # PSU with most SSUs
[1] "39"
(max.ssu<- max(algebra$Mi)) #This shows the max SSU
[1] 34
(psu.min.ssu<- names(which.min(psu))) # PSU with fewest SSUs
[1] "58"
(min.ssu<-min(algebra$Mi)) #This shows the minimum SSU
[1] 17

There are 12 psus which are the classes. The psu with the highest number of ssus is class 39 with 34 ssus, and the smallest ssus is class 58 with 17 ssus.

  1. Add variables containing the weights and the fpc to this data set.
N <- 187
algebra <- algebra %>%
  mutate(wt = N/n, fpc=rep(187,299))
  1. Using the proper functions from the survey package, estimate the population average algebra score, with a 95% interval and interpret your interval. Be sure to NOT assume a Z-distribution for this confidence interval but use the degf function to get the proper degrees of freedom out of the survey design object and pass it to the confint function.
# This is without weights, do we want them?

design <- svydesign(
  id = ~class,
  fpc = ~Mi,
  data = algebra
)

# the mean
mean_estimate <- svymean(~score, design)

# degrees of freedom function
df <- degf(design)

#conf. interval, t distribution
ci <- confint(mean_estimate, df = df)

(mean_estimate)
        mean     SE
score 62.864 1.0913
(ci)
         2.5 %   97.5 %
score 60.46203 65.26596

Two-Stage Clustering Sampling

Textbook Figure 5.2. Differences between one-stage and two-stage cluster sampling.

Unbiased estimator of population total.

Survey weights

Variance estimation for two-stage cluster samples.

Simplified version

Example 5.7: More estimation of math scores.

The file schools.csv contains data from a two-stage sample of students from hypothetical classes. The final weights are provided in the variable finalwt.

  1. Create side by side boxplots for the math score for sampled schools. Which do you think is larger, the within or between cluster variance?
  2. Create an ANOVA table for this example. Is your interpretation from the graph upheld by this analysis?
  3. Estimate with proper 95% CI the average math score for the population of students under consideration. Do not include a value for the fpc in this example.
You Try It: The case of the six-legged puppy.

Chapter Summary