library(tidyverse);library(sampling)
library(survey)
Homework 05: Cluster Sampling
A. Introductory Exercise
1. City engagement
A city council of a small city wants to know the proportion of eligible voters who oppose having an incinerator of Phoenix garbage opened just outside of the city limits. They randomly select 100 residential numbers from the city’s telephone book, which contains 3,000 such numbers. Each selected residence is then called and asked for (a) the total number of eligible voters and (b) the number of voters opposed to the incinerator. A total of 157 voters were surveyed; of these, 23 refused to answer the question. Of the remaining 134 voters, 112 opposed the incinerator, so the council estimates the proportion by $\hat{p} = 112/134 \approx 0.836$
with
2. Children’s access to firearms
Senturia et al. (1994) described a survey taken to study how many children have access to guns in their households. Questionnaires were distributed to all parents who attended selected clinics in the Chicago area during a one-week period for well or sick child visits.
Suppose that the quantity of interest is the percentage of households containing children that own at least one gun. Describe why this is a cluster sample. What is the psu? The ssu? Is it a one-stage or two-stage cluster sample?
What is the sampled population for this study? Do you think this sampling procedure results in a representative sample of households with children? Why, or why not?
3. Wetland management
Kleppel et al. (2004) reported on a study of wetlands in upstate New York. Four wetlands were selected for the study: Two of the wetlands drain watersheds from small towns and the other two drain suburban watersheds. Quantities such as pH were measured at two to four randomly selected sites within each of the four wetlands.
Describe why this is a cluster sample. What are the psus? The ssus? How would you estimate the average pH in the suburban wetlands?
The authors used Student’s two-sample t test to compare the average pH from the sites in the suburban wetlands with the average pH from the sites in the small town wetlands, treating all sites as independent. Is this analysis appropriate? Why, or why not?
B. Working with Survey Data
1. Error rate in accounting
An accounting firm is interested in estimating the error rate in a compliance audit it is conducting. The population contains 828 claims, and the firm audits an SRS of 85 of those claims. In each of the 85 sampled claims, 215 fields are checked for errors. One claim has errors in 4 of the 215 fields, 1 claim has 3 errors, 4 claims have 2 errors, 22 claims have 1 error, and the remaining 57 claims have no errors. (Data courtesy of Fritz Scheuren.)
Treating the claims as psus and the observations for each field as ssus, estimate the error rate, defined to be the average number of errors per field, along with the standard error for your estimate.
Estimate (with standard error) the total number of errors in the 828 claims.
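Because every sampled claim has the same number of fields, the claim-level error counts can be treated like an SRS of cluster totals. The following is a minimal sketch of that arithmetic in base R, not a definitive solution; the object names (`errors`, `t_bar`, and so on) are illustrative choices, and the only numbers used are the counts stated in the exercise.

```r
# Errors per sampled claim, built from the counts stated in the exercise
errors <- c(4, 3, rep(2, 4), rep(1, 22), rep(0, 57))

N <- 828              # claims in the population
n <- length(errors)   # sampled claims
M <- 215              # fields checked per claim

# Equal-sized clusters: SRS formulas applied to the claim totals
t_bar    <- mean(errors)                          # mean errors per claim
se_t_bar <- sqrt((1 - n / N) * var(errors) / n)   # with finite population correction

rate_hat  <- t_bar / M       # estimated errors per field
rate_se   <- se_t_bar / M
total_hat <- N * t_bar       # estimated errors in all N claims
total_se  <- N * se_t_bar

round(c(rate = rate_hat, rate_se = rate_se), 5)
round(c(total = total_hat, total_se = total_se), 1)
```

The same numbers could also be obtained from `svydesign()` with the claim as the cluster id; the hand calculation is shown because it makes the role of the equal cluster size explicit.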
2. Measles should be eradicated
The file measles.csv contains data consistent with that obtained in a survey of parents whose children had not been immunized for measles during a recent campaign to immunize all children between the ages of 11 and 15. During the campaign, 7,633 children from the 46 schools in the area were immunized; 9,962 children whose records showed no previous immunization were not immunized. In a follow-up survey to explore why the children had not been immunized during the campaign, Roberts et al. (1995) sent questionnaires to the parents of a cluster sample of the 9,962 children. Ten schools were randomly selected, then a sample of the
- One measure of interest is whether or not the parent returned an immunization consent form to the school (variable returnf). Separately for each school, estimate the percentage of parents who returned a consent form. For this exercise, treat the “no answer” responses (value 9) as not returned.
R advice: Create a new binary indicator for a returned consent form that has a value of 0 when returnf is 9 or 0.
- Using the number of respondents in school $i$ as $m_i$, construct the sampling weight for each observation. (Hint: Use the existing variables mi and Mitotal.)
- Estimate the overall percentage of parents who received a consent form along with a 95% CI.
R advice: You can extract just the CI from a svyciprop object by calling confint() on the result. E.g. a <- svyciprop(~x, dsgn); confint(a)
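The weight construction hinted at above can be sketched as follows. This is a minimal illustration under stated assumptions, not the required answer: the rows of `toy` are made up, `mi` is assumed to hold the number of respondents in the school and `Mitotal` the number of eligible children in the school, and `N_schools`, `n_schools`, and `wt` are names chosen for the example. The values 46 and 10 are the school counts given in the exercise.

```r
# Hypothetical stand-in for a few rows of measles.csv (values are made up)
toy <- data.frame(
  school  = c(1, 1, 2, 2, 2),
  Mitotal = c(40, 40, 75, 75, 75),  # assumed: eligible children in the school
  mi      = c(2, 2, 3, 3, 3)        # assumed: respondents in the school
)

N_schools <- 46  # schools in the area
n_schools <- 10  # schools selected

# Two-stage weight: 1 / (P(school selected) * P(child responds | school selected))
toy$wt <- (N_schools / n_schools) * (toy$Mitotal / toy$mi)
toy
```

A column built this way can then be supplied to `svydesign()` through its `weights` argument, with the school as the cluster id.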
- How do your estimate and interval in part (c.) compare with the results you would have obtained if you had ignored the clustering and analyzed the data as an SRS? Find the ratio:
What is the effect of clustering?
R advice: You can extract the variance from a svyciprop object by extracting its attributes. E.g. a <- svyciprop(~x, dsgn); attr(a, "var")
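One way to quantify the effect of clustering is to compare the estimated variance under the cluster design with the estimated variance when the same responses are treated as an SRS. The sketch below assumes that this variance ratio is the comparison intended, which the text does not confirm. The data frame `toy`, its weights, and all object names are hypothetical, and the SRS design here ignores both clustering and weights.

```r
library(survey)

# Made-up respondents from 4 sampled schools (hypothetical data)
toy <- data.frame(
  school   = rep(1:4, each = 5),
  returned = c(1, 1, 1, 1, 0,  1, 0, 0, 0, 0,  1, 1, 1, 0, 0,  0, 0, 0, 0, 1),
  wt       = rep(c(10, 8, 12, 6), each = 5),  # hypothetical sampling weights
  srs_wt   = 1                                # equal weights for the SRS analysis
)

# Design that respects the clustering (schools as psus)
clu_dsgn <- svydesign(id = ~school, weights = ~wt, data = toy)
# Design that treats every respondent as an independent, equally weighted draw
srs_dsgn <- svydesign(id = ~1, weights = ~srs_wt, data = toy)

p_clu <- svyciprop(~returned, clu_dsgn)
p_srs <- svyciprop(~returned, srs_dsgn)

# Variances are stored as attributes on the svyciprop objects
v_clu <- as.numeric(attr(p_clu, "var"))
v_srs <- as.numeric(attr(p_srs, "var"))

v_clu / v_srs  # ratio above 1 means clustering inflated the variance
```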
D. Projects and Activities - Baseball Data
- Use the population in the file baseball.csv to take a one-stage cluster sample with the teams as the psus. Your sample should have approximately 150 players altogether, as in the SRS from Exercise 37 of Chapter 2. Describe how you selected your sample. Don’t forget to set a seed.
baseball <- read_csv(here::here("data", "baseball.csv"))
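A one-stage cluster sample with teams as psus amounts to drawing an SRS of teams and keeping every player on the selected teams. The sketch below shows one possible way to do that in base R. It runs on a made-up stand-in (`toy`) rather than the real file, the column name `team` is an assumption about baseball.csv, and the seed and team sizes are arbitrary.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# Hypothetical stand-in for the population: 30 teams with 26 or 27 players each
toy <- data.frame(team = rep(sprintf("T%02d", 1:30), times = rep(c(26, 27), 15)))

target_n   <- 150
team_sizes <- table(toy$team)

# Number of teams needed so the expected sample size is close to the target
n_teams <- round(target_n / mean(team_sizes))

# One-stage cluster sample: SRS of teams, then keep every player on those teams
sampled_teams <- sample(names(team_sizes), size = n_teams)
clus <- toy[toy$team %in% sampled_teams, , drop = FALSE]

# Every team has the same inclusion probability, so all players share one weight
clus$wt <- length(team_sizes) / n_teams

nrow(clus)  # realized sample size; varies with which teams are drawn
```

Because whole teams are taken, the realized number of players will usually differ a little from 150; the `cluster()` function in the sampling package is an alternative way to draw the same kind of sample.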
log() can be used to express a percent change in salary.
- Using the log() function, create the variable logsal. Draw side-by-side boxplots for logsal for the teams in your sample. Note: log(0) is undefined, so records with missing or 0 salary must be dropped from the data set prior to creating the logsal variable.
baseball <- baseball %>%
  filter(!is.na(salary) & salary > 0) %>%         # removes missing and zero salaries
  mutate(logsal = log(salary),                    # creates logsal
         is.pitcher = ifelse(pos == "P", 1, 0))   # indicator of being a pitcher
N <- NROW(baseball)
n <- 150
- Using your sample, estimate the mean of logsal with a 95% CI.
- Using your sample, estimate the proportion of players in the data set who are pitchers with a 95% CI.
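For a one-stage cluster sample, the design object names the team as the cluster id, and the same estimation calls used for an SRS then produce cluster-aware standard errors. The following is a minimal sketch on made-up data, not results from baseball.csv: the data frame `clus`, the column `n_teams_pop`, and the figure of 30 teams in the population are all hypothetical.

```r
library(survey)

# Made-up one-stage cluster sample: 3 sampled teams, 4 players each (hypothetical)
clus <- data.frame(
  team        = rep(c("A", "B", "C"), each = 4),
  logsal      = c(13.1, 14.2, 15.0, 13.6, 12.9, 13.3, 14.8, 13.9, 14.4, 15.2, 13.0, 13.7),
  is.pitcher  = c(1, 0, 1, 0,  1, 1, 1, 0,  0, 0, 0, 1),
  n_teams_pop = 30   # hypothetical number of teams in the population
)
clus$wt <- 30 / 3    # teams in population / teams sampled

# Teams are the psus; fpc gives the number of psus in the population
clus_dsgn <- svydesign(id = ~team, weights = ~wt, fpc = ~n_teams_pop, data = clus)

svymean(~logsal, clus_dsgn) |> confint()   # 95% CI for mean logsal
svyciprop(~is.pitcher, clus_dsgn)          # proportion of pitchers with 95% CI
```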
Compare the estimates you calculated to the ones provided below from an SRS. Which one has smaller confidence intervals? Why do you think that is?
Code provided below for reference
set.seed(123) # For reproducibility
# Take an SRS of 150 players
srs.idx <- srswor(n, N)
bbl <- getdata(baseball, srs.idx)
bbl$wt <- N/n
bbl.srs.dsgn <- svydesign(id = ~1, weights = ~wt, fpc = rep(N,n), data = bbl)
svymean(~logsal, bbl.srs.dsgn)
mean SE
logsal 13.904 0.0935
svymean(~logsal, bbl.srs.dsgn) |> confint()
2.5 % 97.5 %
logsal 13.72093 14.08733
Mean of logsal: 13.9 95% CI: (13.7, 14.1)
svyciprop(~is.pitcher, bbl.srs.dsgn)
2.5% 97.5%
is.pitcher 0.480 0.408 0.553
Proportion of pitchers: 0.48 95% CI: (0.41, 0.55)