Homework 04: Stratified Sampling
Part 1: Conceptual and calculation exercises
1. What stratification variable(s) would you use for each of the following situations. Pick 3 to answer.
A political poll to estimate the percentage of registered voters in Arizona who approve of the governor’s performance.
An e-mail survey of students at your university, to estimate the total amount of money students spend on textbooks in a term.
A sample of high schools in New York City to estimate the number of schools that offer one or more classes in computer programming.
A sample of public libraries in California to study the availability and usage of computer resources.
A survey of fishers visiting a freshwater lake, to learn about which species of fish are preferred.
An aerial survey to estimate the number of walrus in the pack ice near Alaska between 173◦ East and 154◦ West longitude.
A survey of businesses to learn about policies for paid leave.
2. Lohr Ch 3, question 2. Consider the hypothetical population and stratification presented in the textbook. Let N1 = N2 = 4, and consider the stratified sampling design in which n1 = n2 = 2. Find the Expected Value and Variance of
Here is some guidance on an approach:
- Write out all possible SRSs of size 2 from stratum 1, and find the probability of each sample. Do the same for stratum 2.
- Calculate
from all possible samples by finding the mean of each combination in the stratum * 4(the amount of values the combination is made from) - Then assign the probability of observing that each unique
within each stratum. - Put these together as a sampling distribution (table or data frame that combines t with p(t)) within each strata
- Based on how
is calculated, identify all possible values of , and then the probability of observing those values. - Now calculate
and
3. Proportional allocation was used in the stratified sample in Example 3.2. It was noted, however, that variability was much higher in the West than in the other regions. Using the estimated variances displayed below, and assuming that the sampling costs are the same in each stratum, find an optimal allocation for a stratified sample of size 300.
region | Number of Counties in population | Number of Counties in sample | Sample average in region | Sample variance in region |
---|---|---|---|---|
Northeast | 220 | 21 | 97,629.8 | 7,647,472,708 |
North Central | 1054 | 103 | 300,504.2 | 29,618,183,543 |
South | 1382 | 135 | 211,315.0 | 53,587,487,856 |
West | 422 | 41 | 662,295.5 | 396,185,950,266 |
Total | 3078 | 300 |
Part 2: Analyzing data
The U.S. government conducts a Census of Agriculture every five years, collecting data on all farms (defined as any place from which $1000 or more of agricultural products were produced and sold). We previously used the agpop.csv
data for cn03 while studying Simple Random Samples. Now, we’re going to do a similar analysis using stratification. The data file agstrat.csv
contains a sample of 300 total farms, stratified by region
. It also contains information on different variables.
1. Assume that the data in agpop.csv
is representative of the population. Are the strata in agstrat.csv
proportionally allocated? Justify your answer using numbers.
2. Is stratification doing what it’s supposed to? That is, which allocation provides a more precise estimate?
3. Draw another stratified sample of 300 total farms from the agpop.csv
data. We’ll use the sampling
package to do this. Choose an appropriate allocation and justify its use. For each of the following quantities, report an estimate along with its standard error, using the appropriate svy
functions.
- Average size of a farm in acres in 1992
- Total number of acres devoted to farming in 1992
4. Compare your answers with those we found in cn03-srs
(don’t worry, they’re listed below) - which sampling plan was better? (Hint: there is code on the course website that can help you out with this)
In cn03-srs, we calculated (using a Simple Random Sample):
- Average size of a farm in acres in 1992: 326,732 acres with SE of 21,908 acres.
- Total number of acres devoted to farming in 1992: 1,005,681,527 acres with SE of 6,743,308 acres.
Part 3: Conduct your own stratified survey.
Take a small stratified random sample of something you are interested in. The data collection for this exercise should not take a great deal of effort, as you are surrounded by things waiting to be sampled. Some examples: average weights of 1-pound bags of beans or rice from different brands at the supermarket, proportion of unpopped kernels after cooking popcorn from different brands, total number of multi-family housing in different neighborhoods of Chico.
- Explain what it is you decide to measure. Be explicit about what parameter you are trying to estimate.
- Define the following terms for your sample:
- Target population: The broader group you want to make inferences about.
- Sampled population: The group you actually have access to.
- Strata: The subgroups within your population that you stratify by.
- Observational unit: The unit on which measurements are taken.
- Create your sampling frame in some type of spreadsheet (google sheets, excel, csv). Describe how you set up this sampling frame, read it into R and show the top few rows using
head()
- Set a seed and then use the
strata
function from thesampling
package to generate your sample to be measured. Save this resulting data frame asmy_sample
and print out the results. - Collect your data, do your measurements, and record the data in a new variable/vector in the
my_sample
data set you just generated. - Calculate the stratum means and variances.
- Using the
svydesign
and appropriatesvy*
functions, calculate BOTH a point estimate and confidence interval for the parameter of interest. - Interpret your results in a full complete english sentence with point estimate and CI.