Homework 05: Cluster Sampling

Homework

Published

April 13, 2025

Template

Use this HackMD file as a template for this assignment.

library(tidyverse);library(sampling)
library(survey)

A. Introductory Exercise

1. City enagagement

A city council of a small city wants to know the proportion of eligible voters that oppose having an incinerator of Phoenix garbage opened just outside of the city limits. They randomly select 100 residential numbers from the city’s telephone book that contains 3,000 such numbers. Each selected residence is then called and asked for (a) the total number of eligible voters and (b) the number of voters opposed to the incinerator. A total of 157 voters were surveyed; of these, 23 refused to answer the question. Of the remaining 134 voters, 112 opposed the incinerator, so the council estimates the proportion by

$\hat{p} = \frac{112}{134} = 0.83482$

with

$\hat{V} (\hat{p}) = 0.83582 (1 - 0.83582) / 134 = 0.00102$ Are these estimates valid? Why, or why not

2. Childrens’ access to firearms

Senturia et al. (1994) described a survey taken to study how many children have access to guns in their households. Questionnaires were distributed to all parents who attended selected clinics in the Chicago area during a one-week period for well or sick child visits.

Suppose that the quantity of interest is the percentage of households containing children that own at least one gun. Describe why this is a cluster sample. What is the psu? The ssu? Is it a one-stage or two-stage cluster sample?
What is the sampled population for this study? Do you think this sampling procedure results in a representative sample of households with children? Why, or why not?

3. Wetland management

Kleppel et al. (2004) reported on a study of wetlands in upstate New York. Four wetlands were selected for the study: Two of the wetlands drain watersheds from small towns and the other two drain suburban watersheds. Quantities such as pH were measured at two to four randomly selected sites within each of the four wetlands.

Describe why this is a cluster sample. What are the psus? The ssus? How would you estimate the average pH in the suburban wetlands?
The authors used Student’s two-sample t test to compare the average pH from the sites in the suburban wetlands with the average pH from the sites in the small town wetlands, treating all sites as independent. Is this analysis appropriate? Why, or why not?

B. Working with Survey Data

1. Error rate in accounting An accounting firm is interested in estimating the error rate in a compliance audit it is conducting. The population contains 828 claims, and the firm audits an SRS of 85 of those claims. In each of the 85 sampled claims, 215 fields are checked for errors. One claim has errors in 4 of the 215 fields, 1 claim has 3 errors, 4 claims have 2 errors, 22 claims have 1 error, and the remaining 57 claims have no errors. (Data courtesy of Fritz Scheuren.)

Treating the claims as psus and the observations for each field as ssus, estimate the error rate, defined to be the average number of errors per field, along with the standard error for your estimate.
Estimate (with standard error) the total number of errors in the 828 claims.

2. Measles should be eradicated

The file measles.csv contains data consistent with that obtained in a survey of parents whose children had not been immunized for measles during a recent campaign to immunize all children between the ages of 11 and 15. During the campaign, 7,633 children from the 46 schools in the area were immunized; 9,962 children whose records showed no previous immunization were not immunized. In a follow-up survey to explore why the children had not been immunized during the campaign, Roberts et al. (1995) sent questionnaires to the parents of a cluster sample of the 9,962 children. Ten schools were randomly selected, then a sample of the $M_{i}$ nonimmunized children from each school was selected, and the parents of those children were sent a questionnaire. Not all parents replied to the survey.

One measure of interest is whether or not the parent returned an immunization consent form to the school (variable returnf). Separately for each school estimate the percentage of parents who returned a consent form. For this exercise, treat the “no answer” responses (value 9) as not returned.

R advice: Create a new binary indicator for a returned consent form where it has a value of 0 when returnf is 9 or 0.

Using the number of respondents in school $i$ as $m_{i}$ , construct the sampling weight for each observation. (Hint: Use the existing variables mi and Mitotal)
Estimate the overall percentage of parents who received a consent form along with a 95% CI.

R advice: You can extract just the CI from a svyciprop object by calling confint() on the result. E.g. a <- svyciprop(~x, dsgn); confint(a)

How does your estimate and interval in part (c.) compare with the results you would have obtained if you had ignored the clustering and analyzed the data as an SRS? Find the ratio: $\frac{estimated variance from (c)}{estimated variance if the data were analyzed as an SRS}$ What is the effect of clustering?

R advice: You can extract the variance from a svyciprop object by extracting it’s attributes. E.g. a <- svyciprop(~x, dsgn); attr(a, "var")

D. Projects and Activities - Baseball Data

Use the population in the file baseball.csv to take a one-stage cluster sample with the teams as the psus. Your sample should have approximately 150 players altogether, as in the SRS from Exercise 37 of Chapter 2. Describe how you selected your sample. Don’t forget to set a seed.

baseball <- read_csv(here::here("data", "baseball.csv"))

$Definition$ : The logarithmic functionlog() can be used to express a percent of change in salary.

Using the log() function, create the variable logsal. Draw side-by-side boxplots for logsal for the teams in your sample. Note, log(0) is an undefined value, so records with missing or 0 salary must be dropped from the data set prior to creating the logsal variable.

baseball <- baseball %>%
  filter(!is.na(salary) & salary > 0) %>% #Removes zeros in salaries column 
  mutate(logsal = log(salary), # creates logsal 
         is.pitcher = ifelse(pos=="P",1, 0)) # indicator of being a pitcher


N <- NROW(baseball)
n <- 150

Using your sample, estimate the mean of logsal with a 95% CI.
Using your sample, estimate the proportion of players in the data set who are pitchers with a 95% CI.
Compare the estimates you calculated to the ones provided below from an SRS. Which one has smaller confidence intervals? Why do you think that is?

Code provided below for reference

set.seed(123)  # For reproducibility
# Take an SRS of 150 players
srs.idx <- srswor(n, N)
bbl <- getdata(baseball, srs.idx)
bbl$wt <- N/n
bbl.srs.dsgn <- svydesign(id = ~1, weights = ~wt, fpc = rep(N,n), data = bbl)

svymean(~logsal, bbl.srs.dsgn)

         mean     SE
logsal 13.904 0.0935

svymean(~logsal, bbl.srs.dsgn) |> confint()

          2.5 %   97.5 %
logsal 13.72093 14.08733

Mean of logsal: 13.9 95% CI: (13.7, 14.1)

svyciprop(~is.pitcher, bbl.srs.dsgn)

                  2.5% 97.5%
is.pitcher 0.480 0.408 0.553

Proportion of pitchers: 0.48 95% CI: (0.41, 0.56)