library(tidyverse)
library(knitr)
library(sampling)
library(survey)
Cluster Sampling
Introduction
Example of Clustering:
When taking a survey of a community, rather than do an SRS of households, you might break up the community into blocks, take an SRS of blocks, and sample all or some of the households that fall within those blocks. The blocks are your psus, and the households are your ssus. This method might be less costly if you were doing an in-person survey, but might also be less accurate.
Why use Cluster Sampling?
Clustering is popular mostly for practical reasons. The purpose of stratifying a sample was to improve precision when the variance differed substantially across subgroups of the population. Cluster sampling is not quite the same: it usually decreases precision, because of positive correlation between unmeasured factors that affect the units within a cluster.
We use cluster sampling when: 1. It may be infeasible or expensive to construct a sampling frame of the population, and breaking the population into groups may make sampling doable. 2. In some populations, it is less costly to sample in clusters, in terms of travel and time.
When clusters are Ignored
When clusters are ignored in the design or analysis phase, the sample may mistakenly be treated as if it came from a simple random sampling design, which leads to incorrect formulas and inaccurate conclusions. For example, when we analyze a cluster sample as if it were an SRS, the variability is typically underestimated, because units within a cluster tend to be more similar (homogeneous) than units chosen by an SRS or a stratified design. Standard errors will be too small and confidence intervals too narrow.
- In this study they took 16 female and 16 male professors and matched them by their subjects, experience, and tenure status.
- They then sampled each of their classrooms full of students for data collection.
- In total, they collected 1029 individual student responses, but since this is a cluster sample the sample size is n = 32, as there are 32 total clusters. The analysis would be wrong if we ignored that it was a cluster sample and treated it as an SRS of size 1029.
This is a great real world example of how cluster sampling can streamline collecting important data while still gaining valuable insight.
Notation & Formulas
In SRS, each individual element is the sampling unit and has an equal chance to be sampled. In stratified sampling, the population is divided into strata and an SRS is taken within every stratum. In cluster sampling, the population is divided into clusters (psus); we take an SRS of clusters and then observe some or all of the secondary sampling units (ssus) within the selected clusters.
⚠️ Note notation change!
- $y_{ij}$: measurement for the $j$th element in the $i$th psu
- $N$: the number of clusters (psus) in the population
- $n$: the number of psus in the sample
- $M_i$: number of ssus in psu $i$ in the population
- $m_i$: the number of ssus in psu $i$ in the sample
- $M_0 = \sum_{i=1}^{N} M_i$: total number of ssus in the population
| Population Quantity | Estimator | Estimated variance of estimator |
|---|---|---|
| Total in psu $i$: $t_i = \sum_{j=1}^{M_i} y_{ij}$ | $\hat{t}_i = \dfrac{M_i}{m_i}\sum_{j \in S_i} y_{ij}$ | - |
| Variance of psu totals (AKA variance between psus): $S_t^2 = \dfrac{1}{N-1}\sum_{i=1}^{N}\left(t_i - \dfrac{t}{N}\right)^2$ | $s_t^2 = \dfrac{1}{n-1}\sum_{i \in S}\left(\hat{t}_i - \dfrac{\hat{t}}{N}\right)^2$ | - |
| Overall total: $t = \sum_{i=1}^{N} t_i$ | $\hat{t} = \dfrac{N}{n}\sum_{i \in S}\hat{t}_i$ | $\hat{V}(\hat{t}) = N^2\left(1-\dfrac{n}{N}\right)\dfrac{s_t^2}{n} + \dfrac{N}{n}\sum_{i \in S}\left(1-\dfrac{m_i}{M_i}\right)M_i^2\dfrac{s_i^2}{m_i}$ |
| Mean in psu $i$: $\bar{y}_{iU} = \dfrac{t_i}{M_i}$ | $\bar{y}_i = \dfrac{\hat{t}_i}{M_i}$ | - |
| Overall mean: $\bar{y}_U = \dfrac{t}{M_0}$ | $\hat{\bar{y}} = \dfrac{\hat{t}}{M_0}$ | $\hat{V}(\hat{\bar{y}}) = \dfrac{\hat{V}(\hat{t})}{M_0^2}$ |

Here $s_i^2 = \frac{1}{m_i-1}\sum_{j \in S_i}(y_{ij} - \bar{y}_i)^2$ is the sample variance within psu $i$; in one-stage sampling ($m_i = M_i$) the second term of $\hat{V}(\hat{t})$ drops out.
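To make the table concrete, here is a minimal helper sketch (not from the text) for the one-stage case, where $\hat{t}_i = t_i$ and the within-psu term of $\hat{V}(\hat{t})$ is zero:

```r
# Sketch: unbiased one-stage estimate of the population total and its SE,
# given the vector of sampled psu totals ti and the number N of psus in the
# population (here s_t^2 is just the sample variance of the psu totals).
cluster_total_est <- function(ti, N) {
  n     <- length(ti)
  t.hat <- N / n * sum(ti)                        # estimated population total
  se    <- N * sqrt((1 - n / N) * var(ti) / n)    # SE of the estimated total
  c(total = t.hat, se = se)
}
```

Dividing both outputs by $M_0$ gives the corresponding estimate and standard error for the overall mean.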
Overall, there are many differences between cluster sampling, stratified sampling, and SRS that are important to note.
One-Stage Cluster Sampling
Equal size clusters: Estimation
One-stage clustering is the type of clustering you’re probably already familiar with: either all or none of the observations inside a cluster (psu) are sampled.
While this is an unrealistic scenario for most statistical work in the social sciences, agricultural or industrial surveys may be able to take advantage of the much simpler process that equal size clustering brings.
Because the psus are selected through an SRS, we can use the per-cluster estimates as single observations in an SRS. There’s no need to consider the individual ssus within each cluster. Important distinction: we treat this as an SRS of $n$ cluster totals (one observation per psu), not an SRS of the individual ssus.
In this setup, we typically aim to estimate the population total or mean of a variable of interest based on the totals from each sampled cluster.
Perhaps we want to estimate the average GPA in a college dormitory with a one-stage cluster design. The dorm consists of 100 units, each with 4 students. Here’s the process:
- An SRS is conducted to select 5 clusters.
- All 4 students (ssus) in each selected dormitory unit (psu) have their GPAs measured.
- Calculate your estimates based on your psu statistics, not your ssu measurements.
(Slightly modified version of Example 5.2 from Lohr’s R Companion to Sampling 3rd ed. Page 57. DOI: 10.1201/9781003228196-5)
gpa <- readr::read_csv(here::here("data", "gpa.csv"))
N <- 100 # total dorm population (this won't be NROW(gpa)!)
n <- 5   # total number of clusters/psus selected
M <- 4   # observational units/ssus selected per psu
Using equations
To estimate the mean, we first estimate the total (because it’s easier math).
The estimated total is calculated by a weighted sum of the cluster totals.
We can use the `tapply` function to calculate the sum of `gpa` across each `suite`.
(suitesum <- tapply(gpa$gpa, gpa$suite, sum))
1 2 3 4 5
12.16 11.36 8.96 12.96 11.08
# gpa %>% group_by(suite) %>% summarize(t_i = sum(gpa)) # alternative method
The variance across these totals can be found using the standard variance formula:
(st2 <- round(var(suitesum), digits = 3))
[1] 2.256
Now we can convert these estimates of the total to estimates of the mean using $\hat{\bar{y}} = \hat{t}/(NM)$, where $\hat{t} = \frac{N}{n}\sum_{i \in S} t_i$, with standard error $SE(\hat{\bar{y}}) = \frac{1}{M}\sqrt{\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}}$:
t.hat <- sum(suitesum)*(N/n)
(y.bar.hat <- t.hat/(M*N))
[1] 2.826
(se.ybar.yat <- (1/M)*sqrt((1-n/N)*(st2/n)))
[1] 0.1636765
Using the `survey` package
Create a survey design for one-stage clustering
OSC_gpa <- svydesign(id = ~suite,
                     weights = ~wt, # already on the data set
                     fpc = ~rep(100, 20),
                     data = gpa)
Calculate the mean GPA with SE:
svymean(~gpa, OSC_gpa)
mean SE
gpa 2.826 0.1637
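To see the earlier point about ignoring clusters, a quick illustrative comparison (not part of Lohr's example) is to pretend the 20 students were an SRS of size 20 from the 400 students in the dorm; with a positive ICC the resulting standard error will typically be too small:

```r
# Illustrative only: treat the 20 sampled students as if they were an SRS
# from the 400 students in the dorm and compare this (typically too-small)
# SE with the cluster-based SE reported above.
naive_srs <- svydesign(id = ~1, fpc = ~rep(400, 20), data = gpa)
svymean(~gpa, naive_srs)
```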
Sampling Weights
In one-stage cluster sampling with equal-sized clusters, each cluster is treated as a single observation. The sampling weight for each observation unit is the reciprocal of its probability of inclusion; since every ssu in a sampled psu is observed, this is $w_{ij} = N/n$ for every sampled unit. The total sampling weight across all sampled units equals the total number of ssus in the population ($NM = M_0$).
This weighting approach allows you to use one-stage cluster sampling estimates similarly to how you’d handle stratified or SRS data, despite the cluster structure.
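As a quick check (not part of the original example), the weights stored in the design object above should all equal $N/n = 100/5 = 20$ and sum to $NM = 400$:

```r
# Check on the OSC_gpa design from above: every sampled student should carry
# weight N/n = 20, and the weights should sum to NM = 400 ssus.
table(weights(OSC_gpa))
sum(weights(OSC_gpa))
```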
Equal size clusters: Theory
Comparing a cluster sample with an SRS of the same size
For cluster sampling, the variability of the unbiased estimator of the total $\hat{t}$ is driven by the variability between the psu totals; in ANOVA terms it depends on the mean square between psus (MSB), not the mean square within psus (MSW).
The MSW is not used for calculating the variance in cluster sampling like it is in stratified sampling. While stratified sampling is best when there is a lot of variance within strata, cluster sampling is less effective if there’s a lot of variance within psus.
If the ratio of MSB/MSW is large, it means that stratified sampling will be more precise and cluster sampling will be less precise.
A high ICC means that items in the same cluster are very similar, indicating a high level of homogeneity. A low ICC means that items within the cluster are more different from each other.
The higher the ICC, the less efficient the sample is compared to simple random sampling (SRS). This is because a high ICC indicates redundancy within clusters. The loss in precision due to clustering can be estimated by comparing the variance under cluster sampling to the variance under an SRS of the same size; for one-stage clustering with clusters of equal size $M$, this ratio (the design effect) is approximately $1 + (M-1)\,\mathrm{ICC}$.
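Here is a rough sketch (an illustration, using the `gpa` data and `M` defined above) of estimating the ICC from the ANOVA sums of squares and plugging it into the approximate design effect:

```r
# Rough ICC estimate from the sampled suites via the ANOVA decomposition
# ICC = 1 - M/(M-1) * SSW/SSTO, then the approximate design effect.
ssto <- sum((gpa$gpa - mean(gpa$gpa))^2)                  # total SS
ssw  <- sum(tapply(gpa$gpa, gpa$suite,
                   function(y) sum((y - mean(y))^2)))     # within-suite SS
icc  <- 1 - (M / (M - 1)) * ssw / ssto
deff <- 1 + (M - 1) * icc   # approximate variance inflation relative to SRS
c(icc = icc, deff = deff)
```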
ICC is typically positive in naturally occurring clusters. For example, wealthy families living in the same neighborhood, or crops in a field exposed to the same pesticide levels, are likely to show positive ICC due to environmental or social factors.
In contrast, ICC can be negative in artificially constructed or systematic clusters. A negative ICC indicates that items within a cluster are more dispersed than in a randomly selected group. This often results in cluster means being roughly equal, and in such cases, cluster sampling can actually be more efficient than SRS.
In that example, the sampling unit was a single tree, the observational unit was a single seedling by the tree, and the population of interest was all the trees on Santa Cruz Island. Then, an SRS was conducted!
But we’re doing cluster sampling right now. So, suppose that the researcher was instead interested in seedling survival across all of California. The researcher would have divided the regions into equal-sized areas, randomly selected some of those areas to study, and recorded the measurements.
The sampling unit is no longer a tree, it’s an area!
So, if Santa Cruz Island was selected as one of those areas in our study about all of California, then we could no longer treat the trees from Santa Cruz Island as though they were part of an SRS from all trees in California. They’re all similar trees, from the same island, that experience identical climates, in almost the same soil. We would expect all ten trees on Santa Cruz Island to experience, as a group, different environmental factors (such as weather conditions and numbers of seedling eaters) than the ten trees selected in the Santa Ynez Valley on the mainland. Thus the ICC within each cluster (area) would likely be positive.
But…
What if we were only interested in the seedlings from the 10th tree on Santa Cruz Island? Then, the population is all seedlings from the 10th tree and the psu is the seedling. At this point, we’re not doing cluster sampling, we’re doing a census of the population!
Unequal size clusters
The unbiased estimator for the total is $\hat{t}_{unb} = \frac{N}{n}\sum_{i \in S} t_i$, with estimated variance $\hat{V}(\hat{t}_{unb}) = N^2\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}$ in the one-stage case.
- Note that the error of the estimator depends on the variance of the cluster totals ($s_t^2$).
- The variation among the individual cluster totals $t_i$ is likely to be large when clusters have different sizes.
- The size of every psu must be known in order to use this formula.
- The unbiased estimator for the mean can be inefficient when the $M_i$ are unequal, for the same reasons the unbiased estimator of the total can be inefficient.
- A ratio estimator of the mean often has smaller variance.
Ratio Estimation
We usually expect the psu total $t_i$ to be roughly proportional to the psu size $M_i$.
- Example: If the primary sampling units (PSUs) are counties, we would expect the total number of households living in poverty in county $i$ to be roughly proportional to the total number of households in county $i$.
Therefore our population mean is a ratio, $\bar{y}_U = \dfrac{t}{M_0} = \dfrac{\sum_{i=1}^{N} t_i}{\sum_{i=1}^{N} M_i}$, where both the numerator and the denominator are population totals.
- The estimator $\hat{\bar{y}}_r = \dfrac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i}$ is the quantity in 4.1. (Our sample’s ratio mean.)
The denominator is a random quantity that depends on which particular clusters are included in the sample: if the $M_i$ are unequal, $\sum_{i \in S} M_i$ changes from sample to sample, and $\frac{1}{n}\sum_{i \in S} M_i$ represents the average size of the PSUs in the sample.
- The variance of the ratio estimator depends on the variability of the means per element in the PSUs, and can be much smaller than that of the unbiased estimator above.
- The same ratio approach can also be used to estimate the total when the cluster sizes (and hence $M_0$) are known.
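A minimal sketch (one-stage case; the helper name is mine, not from the text) of the ratio estimator of the mean and its standard error, given the sampled cluster totals and sizes:

```r
# Sketch: one-stage ratio estimator of the mean and its SE, given the sampled
# cluster totals ti, the corresponding cluster sizes Mi, and the number N of
# psus in the population. mean(Mi) could be replaced by M0/N when M0 is known.
ratio_mean_est <- function(ti, Mi, N) {
  n      <- length(ti)
  ybar_r <- sum(ti) / sum(Mi)                           # ratio estimate of the mean
  Mbar   <- mean(Mi)                                    # average sampled psu size
  sr2    <- sum(Mi^2 * (ti / Mi - ybar_r)^2) / (n - 1)  # variability of psu means
  se     <- sqrt((1 - n / N) * sr2 / (n * Mbar^2))
  c(mean = ybar_r, se = se)
}
```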
Estimation using weights
- One-stage cluster sampling yields self-weighting samples whether or not the clusters are of equal size.
- In clusters of equal sizes, the weights sum to the population size, but in clusters of unequal sizes the sum of the weights only estimates the population size: $\sum_{i \in S}\sum_{j \in S_i} w_{ij} = \frac{N}{n}\sum_{i \in S} M_i$ is a random variable, different for every sample.
Using weights, the estimator of the mean is $\hat{\bar{y}} = \dfrac{\sum_{i \in S}\sum_{j \in S_i} w_{ij}\, y_{ij}}{\sum_{i \in S}\sum_{j \in S_i} w_{ij}}$.
- Load in the data and look at the first few rows. Explain what the values in each column mean.
algebra <- readr::read_csv(here::here("data", "algebra.csv"))
head(algebra)
# A tibble: 6 × 3
class Mi score
<dbl> <dbl> <dbl>
1 23 20 57
2 23 20 90
3 23 20 56
4 23 20 57
5 23 20 46
6 23 20 55
- Each row is a student, with three columns: `class`, `Mi`, and `score`.
  - `class` is a number representing which class the student is in. These are the psus.
  - `Mi` is the number of students in that specific class. For class 23, there are 20 students.
  - `score` is the score the student earned on the given test.
- Plot the distribution of test score by class. Interpret your graph by comparing the medians, ranges, and variances across classes. What do you notice?
algebra$class <- as.factor(algebra$class)
ggplot(algebra, aes(x = class, y = score)) + geom_boxplot()
algebra %>% group_by(class) %>% summarise(median = median(score),
                                          min = min(score),
                                          max = max(score),
                                          var = var(score))
# A tibble: 12 × 5
class median min max var
<fct> <dbl> <dbl> <dbl> <dbl>
1 23 58 25 90 310.
2 37 64 24 100 461.
3 38 58.5 24 100 420.
4 39 54 32 89 200.
5 41 58.5 19 87 188.
6 44 64.5 27 100 193.
7 46 49 34 88 253.
8 51 71.5 22 100 329.
9 58 51 28 100 546.
10 62 71 31 99 282.
11 106 67.5 14 94 355.
12 108 69.5 31 100 227.
There is a large range between the minimum and the maximum score in each class, with all of the minimums between 14 and 34 and all of the maximums between 87 and 100. We can also see some fairly large differences in median between classes; for example, Class 46 has a median score of 49.0 while Class 51 has a median score of 71.5. For variance, the class with the smallest variance is Class 41 (about 188) and the class with the largest is Class 58 (about 546).
- How many psus are there? What is the psu with the highest number of ssus? Which one has the least? You must answer this question using R functions such as `unique` or `table`. Don’t count this by hand (practice for when you have a billion rows).
(n <- length(unique(algebra$class))) # Total number of psus
[1] 12
psu <- table(algebra$class)
(psu.max.ssu <- names(which.max(psu))) # PSU with most SSUs
[1] "39"
(max.ssu <- max(algebra$Mi)) # This shows the max SSU
[1] 34
(psu.min.ssu <- names(which.min(psu))) # PSU with fewest SSUs
[1] "58"
(min.ssu <- min(algebra$Mi)) # This shows the minimum SSU
[1] 17
There are 12 psus, which are the classes. The psu with the highest number of ssus is class 39 with 34 ssus, and the one with the fewest is class 58 with 17 ssus.
- Add variables containing the weights and the `fpc` to this data set.
N <- 187
algebra <- algebra %>%
  mutate(wt = N/n, fpc = rep(187, 299))
- Using the proper functions from the `survey` package, estimate the population average algebra score, with a 95% interval, and interpret your interval. Be sure to NOT assume a Z-distribution for this confidence interval but use the `degf` function to get the proper degrees of freedom out of the survey design object and pass it to the `confint` function.
# This is without weights, do we want them?
design <- svydesign(
  id = ~class,
  fpc = ~Mi,
  data = algebra
)
# the mean
mean_estimate <- svymean(~score, design)

# degrees of freedom function
df <- degf(design)

# conf. interval, t distribution
ci <- confint(mean_estimate, df = df)
(mean_estimate)
mean SE
score 62.864 1.0913
(ci)
2.5 % 97.5 %
score 60.46203 65.26596
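As a sanity check (not part of the original exercise), with a constant weight of $N/n$ for every student the weighted estimate of the mean reduces to the plain sample mean of all 299 scores, matching the point estimate above:

```r
# Hand computation using the wt column added earlier; with a constant weight
# this is the same as the unweighted mean of all 299 scores.
with(algebra, sum(wt * score) / sum(wt))
mean(algebra$score)
```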
Two-Stage Cluster Sampling
Unbiased estimator of population total.
Survey weights
Variance estimation for two-stage cluster samples.
Simplified version
The file `schools.csv` contains data from a two-stage sample of students from hypothetical classes. The final weights are provided in the variable `finalwt`.
- Create side by side boxplots for the math score for sampled schools. Which do you think is larger, the within or between cluster variance?
- Create an ANOVA table for this example. Is your interpretation from the graph upheld by this analysis?
- Estimate with proper 95% CI the average math score for the population of students under consideration. Do not include a value for the `fpc` in this example.
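One possible starting point for this exercise is sketched below. Only `finalwt` is named in the prompt, so the column names used here for the school identifier and the math score (`schoolid`, `math`) are assumptions; adjust them to match the actual file.

```r
# Sketch only: weight-based design with no fpc, as the exercise asks.
# The schoolid and math column names are assumed, not taken from the file.
schools <- readr::read_csv(here::here("data", "schools.csv"))

des2 <- svydesign(id = ~schoolid, weights = ~finalwt, data = schools)

mean_math <- svymean(~math, des2)
confint(mean_math, df = degf(des2))  # t-based interval via degf()
```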