Homework 04: Stratified Sampling

Homework
Published: March 29, 2025

Part 1: Conceptual and calculation exercises

1. What stratification variable(s) would you use in each of the following situations? Pick 3 to answer.

  1. A political poll to estimate the percentage of registered voters in Arizona who approve of the governor’s performance.

  2. An e-mail survey of students at your university, to estimate the total amount of money students spend on textbooks in a term.

  3. A sample of high schools in New York City to estimate the number of schools that offer one or more classes in computer programming.

  4. A sample of public libraries in California to study the availability and usage of computer resources.

  5. A survey of fishers visiting a freshwater lake, to learn about which species of fish are preferred.

  6. An aerial survey to estimate the number of walrus in the pack ice near Alaska between 173° East and 154° West longitude.

  7. A survey of businesses to learn about policies for paid leave.

2. Lohr Ch 3, question 2. Consider the hypothetical population and stratification presented in the textbook. Let $N_1 = N_2 = 4$, and consider the stratified sampling design in which $n_1 = n_2 = 2$. Find the expected value and variance of $\hat{t}_{str}$.

Here is some guidance on an approach:

  • Write out all possible SRSs of size 2 from stratum 1, and find the probability of each sample. Do the same for stratum 2.
  • Calculate $\hat{t}_j$ for every possible sample in stratum $j$: take the mean of the two sampled values and multiply it by $N_j = 4$, the number of units in the stratum the sample was drawn from.
  • Then find the probability of observing each unique value of $\hat{t}_j$ within each stratum.
  • Put these together as a sampling distribution (a table or data frame that pairs each value of $\hat{t}_j$ with its probability) within each stratum.
  • Based on how $\hat{t}_{str}$ is calculated, identify all possible values of $\hat{t}_{str}$ and the probability of observing each value.
  • Now calculate $E(\hat{t}_{str})$ and $V(\hat{t}_{str})$.
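The steps above can be sketched in R roughly as follows. This is only an illustration of the enumeration approach: the stratum values `y1` and `y2` are made-up placeholders, not the textbook population, and the helper `stratum_dist()` is a hypothetical name introduced here.

```r
# Hypothetical stratum values (placeholders, NOT the textbook population)
y1 <- c(2, 3, 5, 6)       # stratum 1, N1 = 4
y2 <- c(10, 12, 12, 15)   # stratum 2, N2 = 4

# Sampling distribution of the estimated stratum total over all SRSs of size n
stratum_dist <- function(y, n) {
  N <- length(y)
  samples <- combn(y, n)            # one column per possible sample
  t_hat <- N * colMeans(samples)    # t_hat_j = N_j * (sample mean)
  data.frame(t_hat = t_hat, p = 1 / ncol(samples))  # each SRS is equally likely
}

d1 <- stratum_dist(y1, 2)
d2 <- stratum_dist(y2, 2)

# Strata are sampled independently, so pair every stratum-1 sample
# with every stratum-2 sample and multiply the probabilities
grid  <- expand.grid(i = seq_len(nrow(d1)), j = seq_len(nrow(d2)))
pairs <- data.frame(
  t_str = d1$t_hat[grid$i] + d2$t_hat[grid$j],
  p     = d1$p[grid$i] * d2$p[grid$j]
)

# Collapse duplicate values of t_str into one row each
dist_str <- aggregate(p ~ t_str, data = pairs, FUN = sum)

# Mean and variance of the sampling distribution
E_tstr <- sum(dist_str$t_str * dist_str$p)
V_tstr <- sum((dist_str$t_str - E_tstr)^2 * dist_str$p)
```

A useful sanity check is that `E_tstr` should equal the true population total, `sum(y1) + sum(y2)`, since the stratified estimator is unbiased.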

3. Proportional allocation was used in the stratified sample in Example 3.2. It was noted, however, that variability was much higher in the West than in the other regions. Using the estimated variances displayed below, and assuming that the sampling costs are the same in each stratum, find an optimal allocation for a stratified sample of size 300.

| Region | Number of counties in population | Number of counties in sample | Sample average in region | Sample variance in region |
|---|---|---|---|---|
| Northeast | 220 | 21 | 97,629.8 | 7,647,472,708 |
| North Central | 1054 | 103 | 300,504.2 | 29,618,183,543 |
| South | 1382 | 135 | 211,315.0 | 53,587,487,856 |
| West | 422 | 41 | 662,295.5 | 396,185,950,266 |
| Total | 3078 | 300 | | |
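When sampling costs are equal across strata, optimal allocation reduces to Neyman allocation, where each stratum's share of the sample is proportional to $N_h S_h$. A minimal sketch of that calculation is below. The function name `neyman_allocation()` and the input numbers are hypothetical placeholders, not the values from the table.

```r
# Neyman allocation (optimal allocation with equal per-unit costs):
#   n_h = n * N_h * S_h / sum(N_l * S_l)
neyman_allocation <- function(N_h, s2_h, n) {
  w <- N_h * sqrt(s2_h)   # stratum size times stratum standard deviation
  n * w / sum(w)
}

# Hypothetical inputs (placeholders, not the values in the table)
N_h  <- c(A = 200, B = 500, C = 300)    # stratum population sizes
s2_h <- c(A = 100, B = 400, C = 2500)   # estimated stratum variances

alloc <- neyman_allocation(N_h, s2_h, n = 100)
round(alloc)
```

Rounded allocations do not always sum exactly to `n`, and an allocation can exceed a stratum's population size, so both are worth checking before using the result.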

Part 2: Analyzing data

The U.S. government conducts a Census of Agriculture every five years, collecting data on all farms (defined as any place from which $1000 or more of agricultural products were produced and sold). We previously used the agpop.csv data for cn03 while studying Simple Random Samples. Now, we’re going to do a similar analysis using stratification. The data file agstrat.csv contains a sample of 300 total farms, stratified by region. It also contains information on different variables.

1. Assume that the data in agpop.csv is representative of the population. Are the strata in agstrat.csv proportionally allocated? Justify your answer using numbers.
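One way to check for proportional allocation is to compare each stratum's share of the population with its share of the sample, or equivalently to see whether the sampling fraction $n_h / N_h$ is about the same in every stratum. The sketch below uses small toy data frames in place of agpop.csv and agstrat.csv; the column name `region` is an assumption about those files.

```r
# Toy stand-ins for agpop.csv and agstrat.csv; `region` is an assumed column name
pop  <- data.frame(region = rep(c("NE", "NC", "S", "W"), times = c(20, 100, 140, 40)))
samp <- data.frame(region = rep(c("NE", "NC", "S", "W"), times = c(2, 10, 14, 4)))

N_h <- table(pop$region)
n_h <- table(samp$region)[names(N_h)]   # put strata in the same order as N_h

check <- data.frame(
  region        = names(N_h),
  N_h           = as.vector(N_h),
  n_h           = as.vector(n_h),
  pop_share     = as.vector(N_h) / sum(N_h),   # stratum share of the population
  samp_share    = as.vector(n_h) / sum(n_h),   # stratum share of the sample
  sampling_frac = as.vector(n_h) / as.vector(N_h)
)
check
```

Under proportional allocation, `pop_share` and `samp_share` agree up to rounding, and `sampling_frac` is roughly constant across strata.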

2. Is stratification doing what it’s supposed to? That is, which allocation provides a more precise estimate?

3. Draw another stratified sample of 300 total farms from the agpop.csv data. We’ll use the sampling package to do this. Choose an appropriate allocation and justify its use. For each of the following quantities, report an estimate along with its standard error, using the appropriate svy functions.

  1. Average size of a farm in acres in 1992
  2. Total number of acres devoted to farming in 1992
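The workflow for this question might look something like the sketch below, which draws a stratified sample with `strata()` from the sampling package and then estimates a mean and total with the survey package. It runs on a simulated toy population rather than agpop.csv; the column names `region` and `acres92` are assumptions, and proportional allocation is used only for illustration.

```r
library(sampling)
library(survey)

# Toy population standing in for agpop.csv (assumed columns: region, acres92)
set.seed(2025)
pop <- data.frame(
  region  = rep(c("NC", "NE", "S", "W"), times = c(100, 20, 140, 40)),
  acres92 = c(rnorm(100, 300, 50), rnorm(20, 100, 20),
              rnorm(140, 200, 60), rnorm(40, 650, 200))
)
pop <- pop[order(pop$region), ]   # strata() expects data sorted by stratum
rownames(pop) <- NULL

N_h <- table(pop$region)
n_h <- round(30 * N_h / sum(N_h))   # proportional allocation, for illustration only

# Draw an SRS without replacement inside each stratum;
# `size` must follow the order in which strata appear in the data
s <- strata(pop, stratanames = "region", size = as.vector(n_h), method = "srswor")

samp     <- pop[s$ID_unit, ]              # rows selected by strata()
samp$wt  <- 1 / s$Prob                    # sampling weight = 1 / inclusion probability
samp$N_h <- as.vector(N_h[samp$region])   # stratum population size, used as the fpc

dsgn <- svydesign(ids = ~1, strata = ~region, weights = ~wt, fpc = ~N_h, data = samp)

est_mean  <- svymean(~acres92, dsgn)    # estimated mean with SE
est_total <- svytotal(~acres92, dsgn)   # estimated total with SE
est_mean
est_total
```

Supplying `fpc` tells `svydesign()` how large each stratum is, so the standard errors include the finite population correction.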

4. Compare your answers with those we found in cn03-srs (don’t worry, they’re listed below). Which sampling plan was better? (Hint: there is code on the course website that can help you out with this.)

In cn03-srs, we calculated (using a Simple Random Sample):

  • Average size of a farm in acres in 1992: 326,732 acres with SE of 21,908 acres.
  • Total number of acres devoted to farming in 1992: 1,005,681,527 acres with SE of about 67,433,000 acres.
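One simple way to compare the two plans, when both use the same total sample size, is the ratio of the estimated variances. The snippet below is only a sketch: `se_srs` is the SRS standard error of the mean listed above, while `se_str` is a hypothetical placeholder to be replaced with the standard error from your own stratified sample.

```r
se_srs <- 21908   # SE of the mean from cn03-srs, as listed above
se_str <- 18000   # hypothetical placeholder: replace with your stratified SE

# Ratio of estimated variances; values below 1 favor the stratified design
rel_var <- (se_str / se_srs)^2
rel_var
```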

Part 3: Conduct your own stratified survey.

Don’t take a census!

Remember to sample from your sampling frame first, then measure only the selected items. Use feedback (if present) from the last homework as a guide.

Take a small stratified random sample of something you are interested in. The data collection for this exercise should not take a great deal of effort, as you are surrounded by things waiting to be sampled. Some examples: average weights of 1-pound bags of beans or rice from different brands at the supermarket, proportion of unpopped kernels after cooking popcorn from different brands, total number of multi-family housing units in different neighborhoods of Chico.

  1. Explain what you decided to measure. Be explicit about what parameter you are trying to estimate.
  2. Define the following terms for your sample:
    • Target population: The broader group you want to make inferences about.
    • Sampled population: The group you actually have access to.
    • Strata: The subgroups within your population that you stratify by.
    • Observational unit: The unit on which measurements are taken.
  3. Create your sampling frame in some type of spreadsheet (Google Sheets, Excel, CSV). Describe how you set up this sampling frame, read it into R, and show the top few rows using head().
  4. Set a seed and then use the strata function from the sampling package to generate your sample to be measured. Save this resulting data frame as my_sample and print out the results.
  5. Collect your data, do your measurements, and record the data in a new variable/vector in the my_sample data set you just generated.
  6. Calculate the stratum means and variances.
  7. Using svydesign and the appropriate svy* functions, calculate BOTH a point estimate and a confidence interval for the parameter of interest.
  8. Interpret your results in a complete English sentence that includes the point estimate and CI.
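For steps 6 and 7, it can help to compute the stratum summaries and the stratified estimate by hand in base R as a cross-check on the survey package output. The sketch below assumes a made-up measured sample: the column names `brand` and `y`, the data values, and the stratum sizes `N_h` are all placeholders.

```r
# Hypothetical measured sample; names and values are placeholders
my_sample <- data.frame(
  brand = rep(c("A", "B", "C"), each = 4),
  y     = c(454, 458, 451, 456, 449, 452, 455, 450, 460, 457, 462, 459)
)
N_h <- c(A = 20, B = 30, C = 10)   # stratum sizes in the sampling frame (assumed)

# Step 6: stratum sample sizes, means, and variances
n_h    <- tapply(my_sample$y, my_sample$brand, length)
ybar_h <- tapply(my_sample$y, my_sample$brand, mean)
s2_h   <- tapply(my_sample$y, my_sample$brand, var)

# Stratified estimate of the population mean and its standard error
N_h      <- N_h[names(ybar_h)]   # align stratum order with the tapply output
W_h      <- N_h / sum(N_h)       # stratum weights N_h / N
ybar_str <- sum(W_h * ybar_h)
se_str   <- sqrt(sum(W_h^2 * (1 - n_h / N_h) * s2_h / n_h))

# 95% CI using a t quantile with n - H degrees of freedom
df <- sum(n_h) - length(n_h)
ci <- ybar_str + c(-1, 1) * qt(0.975, df) * se_str
```

The point estimate and standard error should match `svymean()` on a design built with the same strata and stratum sizes. The interval may differ slightly from `confint()`, which uses a normal quantile unless degrees of freedom are supplied.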