Simple Random Sampling

SRS is the basic form of probability sampling, and serves as the basis for more complicated forms.
Author

COMPLETE

Published

February 20, 2023

These notes use functions from the sampling, and survey packages.

The best way to learn a new package is to reference the help file and vignette often for examples. The R Companion for the book (free pdf from book website) is also very helpful.

Introduction

This section will aim to answer the following questions:

  • How can we draw a simple random sample of observations from a data set both with, and without replacement:
  • What is the finite population correction (fpc) and why do we need it?
  • What are sampling weights, why do we need them, and how are they created?
  • How do we calculate parameter estimates from an SRS that account for both the fpc and sampling weights?

Drawing a Simple Random Sample (Lohr Ch 2.3)

Recall there are two ways to draw simple random samples, with and without replacement.

Definition: Simple random sample (SRSWR) with replacement:

A SRSWR of size \(n\) from a population of size \(N\) can be thought of as drawing \(n\) independent samples of size 1. Each unit has the same probability of selection: \(\delta_{i} = \frac{1}{N}\)

The procedure is repeated until the sample has \(n\) units, which may include duplicates.

Definition: Simple random sample (SRS) without replacement:

A SRS of size \(n\) is selected so that every possible subset of \(n\) distinct units in the population has the same probability of being selected as the sample. There are \(\binom{N}{n}\) possible samples, resulting in a selection probability for an individual unit \(\delta_{i} = \frac{n}{N}\). (See Lohr 2.10 and Appendix A for derivation)

Intentionality in sampling

Random does not mean haphazard, contrary it’s actually quite intentional. Avoid selecting a sample that you “feel” is random or representative of the population. These practices can lead to bias and lack of generalizability.

To avoid the haphazard nature of “blindly choosing”, or worse looking at what was sampled and changing it because “it doesn’t look random enough”, we use techniques that leverage pseudo-random number generating algorithms.

Drawing a SRS using a computer

Process
  1. Generate a list of all observational units in the population (sampling frame).
  2. Assign each observational unit a unique number, from 1 to the size of the sampling frame \(N\).
  3. Use a computer to draw \(n\) numbers from 1 to \(N\) without replacement.
  4. Subset the data to keep only the selected rows.
Example: Sampling Statistics PhD programs.

To demonstrate this example, lets only grab the first 10 programs in the StatisticsPhD data set. From that we’ll draw a sample of \(n=4\) programs out of these 10.

stats <- readr::read_csv(here::here("data", "StatisticsPhD.csv"))
(stats10 <- stats[1:10,])
# A tibble: 10 × 3
   University                      Department    FTGradEnrollment
   <chr>                           <chr>                    <dbl>
 1 Baylor University               Statistics                  26
 2 Boston University               Biostatistics               39
 3 Brown University                Biostatistics               21
 4 Carnegie Mellon University      Statistics                  39
 5 Case Western Reserve University Statistics                  11
 6 Colorado State University       Statistics                  14
 7 Columbia University             Biostatistics               64
 8 Columbia University             Statistics                 196
 9 Cornell University              Statistics                  78
10 Duke University                 Statistics                  31

Using the sample() function

without replacement

pop.idx <- 1:NROW(stats10)  # Steps 1, 2
(idx <- sample(pop.idx, 4)) # Step 3 
[1] 8 6 3 9
stats10[idx,]               # Step 4
# A tibble: 4 × 3
  University                Department    FTGradEnrollment
  <chr>                     <chr>                    <dbl>
1 Columbia University       Statistics                 196
2 Colorado State University Statistics                  14
3 Brown University          Biostatistics               21
4 Cornell University        Statistics                  78
  • Rows 8, 6, 3, 9 were chosen.

with replacement

set.seed(4134) # again, for demonstration of duplicate records
(idx.with.dups <- sample(pop.idx, 4, replace=TRUE)) #3 
[1]  8 10  4  4
stats10[idx.with.dups,] #4
# A tibble: 4 × 3
  University                 Department FTGradEnrollment
  <chr>                      <chr>                 <dbl>
1 Columbia University        Statistics              196
2 Duke University            Statistics               31
3 Carnegie Mellon University Statistics               39
4 Carnegie Mellon University Statistics               39
  • Rows 8, 10, 4, 4 were chosen.
  • Carnegie (row 4) was sampled twice.

Using the sampling package

  • The functions srswr(n,N) and srswor(n,N) draw SRS with, and without replacement respectively.
  • Each take two arguments: \(n\) the sample size, and \(N\) the population size.
    • The vector returned is a vector of length \(N\) that indicates how many times that position in the vector is selected.
      Then the getdata() function is used to extract the values from the population the indicated number of times.

without replacement

(choose.these.wor <- srswor(4, 10)) # Steps 1,2,3
 [1] 1 0 0 1 0 0 0 0 1 1
getdata(stats10, choose.these.wor)  # Steps 4
  ID_unit                 University Department FTGradEnrollment
1       1          Baylor University Statistics               26
2       4 Carnegie Mellon University Statistics               39
3       9         Cornell University Statistics               78
4      10            Duke University Statistics               31

with replacement

set.seed(4134)
(choose.these.wr <- srswr(4,10)) 
 [1] 0 1 0 1 0 0 0 2 0 0
getdata(stats10, choose.these.wr)
# A tibble: 4 × 3
  University                 Department    FTGradEnrollment
  <chr>                      <chr>                    <dbl>
1 Boston University          Biostatistics               39
2 Carnegie Mellon University Statistics                  39
3 Columbia University        Statistics                 196
4 Columbia University        Statistics                 196
  • Again, notice Columbia was chosen twice when sampling with replacement.
  • Interestingly, the result from sampling without replacement added a column called ID_unit on it that denotes the row it was sampled from.
You try it

The U.S. government conducts a Census of Agriculture every five years, collecting data on all farms (defined as any place from which $1000 or more of agricultural products were produced and sold). The file agpop.csv (textbook data) contains historical information from 1982, 1987, and 1992 on the number of farms, total acreage devoted to farms, number of farms with fewer than 9 acres, and number of farms with more than 1000 acres for the population consisting of the \(N=3078\) counties and county-equivalents in the United States.

Draw a sample of 300 farms without replacement using both sample and srswor. Save one of these data frames with the name ag.srs for later use.

Import the data and define our sample and population sizes.

ag <- readr::read_csv(here::here("data", "agpop.csv")) 
N <- NROW(ag)
n <- 300

Using the sample() function

Generate list of numbers for each observational unit, then draw 5 numbers without replacement.

sampling.frame <- 1:N #1, 2
sample.idx <- sample(sampling.frame, n, replace=FALSE) #3
head(sample.idx)
[1] 1788   38  450 1019 2976 1031

Create a subset data frame with only the rows that were chosen and stored in the vector get.these.

sample.ag <- ag[sample.idx, ]  #4
head(sample.ag[,1:5]) # only showing first 5 columns as an example
# A tibble: 6 × 5
  county         state acres92 acres87 acres82
  <chr>          <chr>   <dbl>   <dbl>   <dbl>
1 BUFFALO COUNTY NE     587595  581861  567657
2 HALE COUNTY    AL     167583  154581  179618
3 LIBERTY COUNTY GA      15583   18248   19965
4 HICKMAN COUNTY KY      99066   95560  105051
5 PIERCE COUNTY  WI     272876  269644  298538
6 LEE COUNTY     KY      20803   23097   22866

Using the srswor() function

set.seed(1067)
srswor.idx <- srswor(n, N)
ag.srs <- getdata(ag, srswor.idx) 
Get in the habit of opening the data set and visually looking at your process at each step. The best way to learn is to check that you know exactly what was done at each step.

Formulas for Estimation

Below is a table of common statistics and how they are estimated under the SRS framework. This table can also be found on the formulas page.

Measure Unbiased Estimate \((\hat{\theta})\) Estimated variance of \((\hat{\theta})\)
Mean \(\bar{y} = \frac{1}{n}\sum_{i\in S} y_{i}\) \(\hat{V}(\bar{y}) = (1-\frac{n}{N})\frac{s^{2}}{n}\)
Total \(\hat{\tau} = N\bar{y}\) \(\hat{V}(\hat{\tau}) = N^{2}\hat{V}(\bar{y})\)
Proportion \(\hat{p} = \bar{y}\) \(\hat{V}(\hat{p}) = (1-\frac{n}{N})\frac{\hat{p}(1-\hat{p})}{n-1}\)
  • \(i \in S\) : Unit \(i\) is an element in the sample \(S\)

Did you notice something different about the formula for the variances?

Finite Population Correction

\[\Big(1-\frac{n}{N}\Big)\]

The larger % of the population that you include in your sample (sampling fraction = \(\frac{n}{N}\)), the closer you are to a census, the smaller the variability your estimate will have.

  • Most samples that are taken from a very large population, the fpc is close to 1.
  • So the variance is more determined by the size of the sample, not the % of the population sampled.
Example

Calculate the estimated standard deviation of the sample mean \(\sqrt{\hat{V}(\bar{y})}\) of the the number of acres devoted to farms in 1992 (variable acres92). Interpret this number.

y.bar <- mean(ag.srs$acres92)
s2 <- sum((ag.srs$acres92-y.bar)^2)/(n-1)
(sd.ybar <- sqrt((1-n/N)*(s2/n)))
[1] 24924.16

Sample means generated from samples of size 300 will vary from sample to sample by 24,924.16 acres.

Inline R code: prettyNum(sd.ybar, big.mark=',')
You try it

Draw a sample of size 500, and estimate the standard deviation of the sample proportion of the number of farms with less than 200,000 acres.

n2 <- 500
srswor.idx2 <- srswor(n2, NROW(ag))
sample.ag.500 <- getdata(ag, srswor.idx2)

p.hat <- mean(sample.ag.500$acres92<200000)
(s.phat <- sqrt((1-n2/N)*(p.hat * (1-p.hat))/(n2-1)))
[1] 0.02046818

The estimated proportion of number of farms with less than 200k acres varies by 2.04% from sample to sample.

As we go through these notes, we create a lot of the same named objects, like n and N, or idx and idx2. Every once in a while its a good idea to clean out your global environment to not get confused. Use the rm() function to remove everything except the population ag, our sample of 300 ag.srs, and their respective sample sizes: n and N.

rm(sample.ag, choose.these.wor, choose.these.wr, my.animals, sample.idx, sampling.frame, 
   n2, p.hat, s.phat, s2, y.bar, srswor.idx, srswor.idx2, sample.ag.500, sd.ybar)

Sampling Weights (Lohr Ch 2.4)

Recall that a goal of sampling is to obtain a representative sample, one that is similar to the true unknown population. Thus, conceptually if we duplicate certain units from our sample a certain amount of times, we could “reconstruct” what the population looks like. That is, we could create \(w_{i}\) copies of unit \(i\) for each unit in the sample.

Definition: Sampling Weight (Design weight)

Inverse of the inclusion/selection probability for unit \(i\).

\[w_{i} = \frac{1}{\delta_i}\]

Also interpreted as the number of population units represented by unit \(i\).

In an SRS, each unit has an inclusion probability of \(\delta_{i} = \frac{n}{N}\), so the sampling weights are all \(w_{i} =\frac{N}{n}\).

We don’t actually make \(w_{i}\) copies of record \(i\), but use these weights as a multiplier in our estimation calculations.

We did do this in cn03 with the quiz scores when using the rep function to repeat the data values \(y_{i}\), \(f_{i}\) times. ref
Population size Total Mean
\(\hat{N} = \sum_{i \in S}w_{i}\) \(\hat{\tau} = \sum_{i \in S}w_{i}y_{i}\) \(\bar{y} = \frac{\hat{\tau}}{\hat{N}}\)

These weighted estimators are used in all probability sampling designs.

Example: Calculating weighted estimates

Estimate the total and average number of acres devoted to farms in 1992 using both weighted and unweighted estimates. Then compare these values to the parameter.

The sampling weights are \(w_{i} = \frac{3078}{300}\) for each unit \(i\) in the sample, so we’ll add that on as a new column before calculating the weighted estimates.

See Lohr Table 2.1 for a nicer visual of the data frame with weights.
ag.srs$wt <- N/n
(N.hat <- sum(ag.srs$wt)) # just to confirm to myself
[1] 3078

Calculate weighted and unweighted estimates, then the pop parameters.

tau.hat.wt <- sum(ag.srs$acres92*ag.srs$wt)
y.bar.wt   <- tau.hat.wt/N.hat
tau.hat.nowt <- sum(ag.srs$acres92)
y.bar.nowt   <- mean(ag.srs$acres92)
mu  <- mean(ag$acres92)
tau <- sum(ag$acres92)

Package it in a data frame for easier viewing.

data.frame(
  Measure = c("Total", "Mean"), 
  Parameter = c(tau,mu),
  Unweighted = c(tau.hat.nowt, y.bar.nowt), 
  Weighted = c(tau.hat.wt, y.bar.wt)
  ) |> knitr::kable(align = 'lccc', 
             caption = "Comparing weighted and unweighted estimates")
Comparing weighted and unweighted estimates
Measure Parameter Unweighted Weighted
Total 943951718 90345498.0 926944809.5
Mean 306677 301151.7 301151.7

Note that when we have an SRS with equal sampling weights \(w_{i} = \frac{N}{n}\), the weighted mean is equal to the weighted mean. :::{.callout-warning icon=false} ### You try it Calculate the proportion of farms with less than 200k acres, with, and without weights. Compare to the population proportion. :::

ag.srs$lt200k <- 1*(ag.srs$acres92<200000)
(p.hat.nowt <- mean(ag.srs$lt200k))
[1] 0.5533333
(p.hat.wt   <- sum(ag.srs$lt200k * ag.srs$wt)/N)
[1] 0.5533333
(p          <- mean(ag$acres92<200000))
[1] 0.5175439

Analysis using the survey package (Lohr Ch 2.6)

Survey designs are specified using the svydesign function. The main arguments to the the function are id to specify sampling units, weights to specify sampling weights, and fpc to specify finite population size corrections. These arguments should be given as formulas, referring to columns in a data frame given as the data argument.

Similar to using group_by in dplyr, that adds a “flag” to the data set indicting that all subsequent actions are to be done to each group separately, the svydesign function adds information to the data set so subsequent actions know how to incorporate measures like weights and strata
library(survey)
ag.srs.dsgn <- svydesign(id = ~1, weights = ~wt, fpc = rep(N,n), data = ag.srs)

The id argument specifies the clusters. We are not using any clustering, so each unit has it’s own id.

Point and interval estimates

Survey design adjusted estimates can be obtained using svymean and svytotal.

svymean(~acres92, ag.srs.dsgn)
          mean    SE
acres92 301152 24924
svytotal(~acres92, ag.srs.dsgn)
            total       SE
acres92 926944810 76716552
svymean(~lt200k, ag.srs.dsgn)
          mean     SE
lt200k 0.55333 0.0273

Estimates for multiple variables can be obtained at the same time.

svymean(~acres92+farms92, ag.srs.dsgn)
             mean        SE
acres92 301151.66 24924.156
farms92    582.28    24.156

You can pass the results to construct confidence intervals.

svytotal(~acres92+farms92, ag.srs.dsgn) |> confint()
            2.5 %     97.5 %
acres92 776583131 1077306488
farms92   1646531    1937985

We can be 95% confident that the interval (1,646,531, 1,937,985) covers the true number of farms in the united states in 1992.

You try it

Using your sample of 300 farms, estimate the total number of farms in the United States in 1987 and the average farm size in acres for the same year. Report both point and interval estimates.

Determining sample size (Lohr Ch 2.7)

This topic will be covered later and include various sampling designs.

Topic Summary

👥 Summarize your understanding

Using the notes in this document, and Lohr Ch 2.11, summarize your understanding of this section. See Canvas for details.