SRS is the basic form of probability sampling, and serves as the basis for more complicated forms.
Published
February 24, 2025
These notes use functions from the sampling, survey and knitr packages. See the formulas page for links to vignettes and handbooks.
The best way to learn a new package is to reference the help file and vignette often for examples. The R Companion for the book (free pdf from book website) is also very helpful.
Introduction
This section will aim to answer the following questions:
How can we draw a simple random sample of observations from a data set, both with and without replacement?
What is the finite population correction (fpc) and why do we need it?
What are sampling weights, why do we need them, and how are they created?
How do we calculate parameter estimates from an SRS that account for both the fpc and sampling weights?
Drawing a Simple Random Sample (Lohr Ch 2.3)
Recall there are two ways to draw simple random samples, with and without replacement.
Definition: Simple random sample with replacement (SRSWR):
A SRSWR of size $n$ from a population of size $N$ can be thought of as drawing $n$ independent samples of size 1. Each unit has the same probability of selection: $P(\text{unit } i \text{ selected}) = \frac{1}{N}$.
The procedure is repeated until the sample has $n$ units, which may include duplicates.
Definition: Simple random sample (SRS) without replacement:
A SRS of size $n$ is selected so that every possible subset of $n$ distinct units in the population has the same probability of being selected as the sample. There are $\binom{N}{n}$ possible samples, resulting in a selection probability of $P(\text{unit } i \text{ in sample}) = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}$ for an individual unit $i$. (See Lohr 2.10 and Appendix A for the derivation.)
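As a quick numeric sanity check (not from the notes): among the $\binom{N}{n}$ equally likely samples, $\binom{N-1}{n-1}$ contain any given unit $i$, and the ratio simplifies to $n/N$.

```r
# Verify P(unit i selected) = n/N for N = 10, n = 4
N <- 10
n <- 4
p <- choose(N - 1, n - 1) / choose(N, n)  # samples containing unit i / all samples
p       # 0.4
n / N   # 0.4 -- the same value
```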
Intentionality in sampling
Random does not mean haphazard; on the contrary, random sampling is quite intentional. Avoid selecting a sample that you “feel” is random or representative of the population. These practices can lead to bias and a lack of generalizability.
To avoid the haphazard nature of “blindly choosing”, or worse, looking at what was sampled and changing it because “it doesn’t look random enough”, we use techniques that leverage pseudo-random number generating algorithms.
Drawing a SRS using a computer
Process
Generate a list of all observational units in the population (sampling frame).
Assign each observational unit a unique number, from 1 to the size of the sampling frame $N$.
Use a computer to draw $n$ numbers from 1 to $N$ without replacement.
Subset the data to keep only the selected rows.
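A minimal sketch of these four steps in base R (the data frame pop, the seed, and the sizes are made up for illustration):

```r
pop <- data.frame(id = 1:100, y = rnorm(100))       # 1. sampling frame of N = 100 units
sampling.frame <- 1:nrow(pop)                       # 2. number the units 1..N
set.seed(1)                                         # reproducibility
idx <- sample(sampling.frame, 10, replace = FALSE)  # 3. draw n = 10 numbers w/o replacement
srs <- pop[idx, ]                                   # 4. keep only the selected rows
```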
Example: Sampling Statistics PhD programs.
To demonstrate this example, let's grab only the first 10 programs in the StatisticsPhD data set. From those 10 we'll draw a sample of $n = 4$ programs.
# A tibble: 10 × 3
University Department FTGradEnrollment
<chr> <chr> <dbl>
1 Baylor University Statistics 26
2 Boston University Biostatistics 39
3 Brown University Biostatistics 21
4 Carnegie Mellon University Statistics 39
5 Case Western Reserve University Statistics 11
6 Colorado State University Statistics 14
7 Columbia University Biostatistics 64
8 Columbia University Statistics 196
9 Cornell University Statistics 78
10 Duke University Statistics 31
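The code that drew the sample shown next is not included above; it was presumably along these lines (a sketch -- the object name pop.idx matches the with-replacement code below, but the exact seed is unknown):

```r
pop.idx <- 1:10                               # steps 1-2: number the 10 programs
(idx <- sample(pop.idx, 4, replace = FALSE))  # step 3: draw 4 rows without replacement
# stats10[idx, ]                              # step 4: subset (stats10 is the tibble above)
```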
# A tibble: 4 × 3
University Department FTGradEnrollment
<chr> <chr> <dbl>
1 Duke University Statistics 31
2 Carnegie Mellon University Statistics 39
3 Brown University Biostatistics 21
4 Columbia University Statistics 196
Rows 10, 4, 3, 8 were chosen.
with replacement
set.seed(4134) # again, for demonstration of duplicate records
(idx.with.dups <- sample(pop.idx, 4, replace=TRUE)) #3
[1] 8 10 4 4
stats10[idx.with.dups,] #4
# A tibble: 4 × 3
University Department FTGradEnrollment
<chr> <chr> <dbl>
1 Columbia University Statistics 196
2 Duke University Statistics 31
3 Carnegie Mellon University Statistics 39
4 Carnegie Mellon University Statistics 39
Rows 8, 10, 4, 4 were chosen.
Carnegie (row 4) was sampled twice.
Using the sampling package
The functions srswr(n,N) and srswor(n,N) draw an SRS with and without replacement, respectively.
Each takes two arguments: the sample size and the population size.
The vector returned has length $N$ and indicates how many times each position in the population was selected.
Then the getdata() function is used to extract the values from the population the indicated number of times.
without replacement
(choose.these.wor <- srswor(4, 10)) # Steps 1,2,3
[1] 1 0 0 1 0 0 0 0 1 1
getdata(stats10, choose.these.wor) # Step 4
ID_unit University Department FTGradEnrollment
1 1 Baylor University Statistics 26
2 4 Carnegie Mellon University Statistics 39
3 9 Cornell University Statistics 78
4 10 Duke University Statistics 31
with replacement
set.seed(4134)
(choose.these.wr <- srswr(4, 10))
[1] 0 1 0 1 0 0 0 2 0 0
getdata(stats10, choose.these.wr)
# A tibble: 4 × 3
University Department FTGradEnrollment
<chr> <chr> <dbl>
1 Boston University Biostatistics 39
2 Carnegie Mellon University Statistics 39
3 Columbia University Statistics 196
4 Columbia University Statistics 196
Again, notice Columbia was chosen twice when sampling with replacement.
Notice also that the result from sampling without replacement includes an added column, ID_unit, that denotes the row each unit was sampled from.
You try it
The U.S. government conducts a Census of Agriculture every five years, collecting data on all farms (defined as any place from which $1000 or more of agricultural products were produced and sold). The file agpop.csv (textbook data) contains historical information from 1982, 1987, and 1992 on the number of farms, total acreage devoted to farms, number of farms with fewer than 9 acres, and number of farms with more than 1000 acres for the population consisting of the counties and county-equivalents in the United States.
Draw a sample of 300 counties without replacement using both sample and srswor. Save one of these data frames with the name ag.srs for later use.
Solution
Import the data and define our sample and population sizes.
ag <- readr::read_csv(here::here("data", "agpop.csv"))
N <- NROW(ag)
n <- 300
Using the sample() function
Generate a list of numbers for each observational unit, then draw $n = 300$ numbers without replacement.
sampling.frame <- 1:N  #1, 2
sample.idx <- sample(sampling.frame, n, replace=FALSE)  #3
head(sample.idx)
[1] 1788 38 450 1019 2976 1031
Create a subset data frame containing only the rows whose numbers were chosen and stored in the vector sample.idx.
sample.ag <- ag[sample.idx, ]  #4
head(sample.ag[, 1:5])  # only showing first 5 columns as an example
# A tibble: 6 × 5
county state acres92 acres87 acres82
<chr> <chr> <dbl> <dbl> <dbl>
1 BUFFALO COUNTY NE 587595 581861 567657
2 HALE COUNTY AL 167583 154581 179618
3 LIBERTY COUNTY GA 15583 18248 19965
4 HICKMAN COUNTY KY 99066 95560 105051
5 PIERCE COUNTY WI 272876 269644 298538
6 LEE COUNTY KY 20803 23097 22866
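The exercise also asked for the srswor version; a sketch (the sampling-package calls are commented out because they need ag, n, and N from above; below them is a base-R illustration of the 0/1 flag vector that srswor returns):

```r
# With the sampling package (assumes ag, n, N exist):
# choose.ag <- srswor(n, N)            # 0/1 vector of length N
# ag.srs    <- getdata(ag, choose.ag)  # keep the flagged rows
# Base-R illustration of the same idea on a tiny scale:
flags <- integer(10)
flags[sample(1:10, 4)] <- 1  # flag 4 of 10 positions, each at most once
sum(flags)                   # 4 rows flagged
```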
Get in the habit of opening the data set and visually inspecting the result of each step. The best way to learn is to confirm that you know exactly what was done at every step.
Formulas for Estimation
Below is a table of common statistics and how they are estimated under the SRS framework. This table can also be found on the formulas page.
| Measure | Unbiased Estimate | Estimated Variance of the Estimate |
|---|---|---|
| Mean | $\bar{y} = \frac{1}{n}\sum_{i \in S} y_i$ | $\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}$ |
| Total | $\hat{t} = N\bar{y}$ | $\hat{V}(\hat{t}) = N^2\left(1 - \frac{n}{N}\right)\frac{s^2}{n}$ |
| Proportion | $\hat{p} = \frac{1}{n}\sum_{i \in S} y_i$, with $y_i$ a 0/1 indicator | $\hat{V}(\hat{p}) = \left(1 - \frac{n}{N}\right)\frac{\hat{p}(1-\hat{p})}{n-1}$ |

$i \in S$: unit $i$ is an element in the sample.
Did you notice something different about the formula for the variances?
Finite Population Correction
The larger the percentage of the population included in your sample (the sampling fraction $\frac{n}{N}$), the closer you are to a census, and the smaller the variability your estimate will have. The finite population correction is the factor $\left(1 - \frac{n}{N}\right)$ appearing in each variance formula above.
For most samples taken from a very large population, the fpc is close to 1.
So the variance is determined more by the size of the sample than by the percentage of the population sampled.
Example
Calculate the estimated standard deviation of the sample mean of the number of acres devoted to farms in 1992 (variable acres92). Interpret this number.
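A by-hand sketch of this calculation. The acres92 values below are just the six rows printed in the head() output above, standing in for the full sample; with the real ag.srs you would use the whole column and n = 300:

```r
acres92 <- c(587595, 167583, 15583, 99066, 272876, 20803)  # stand-in values
n <- length(acres92)                        # 300 with the real sample
N <- 3078                                   # number of counties in agpop
SE <- sqrt((1 - n / N) * var(acres92) / n)  # fpc-adjusted SE of the sample mean
SE
```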
The estimated proportion of farms with fewer than 200,000 acres varies by 2.04% from sample to sample.
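The proportion version follows the same pattern; lt200k is assumed here to be an indicator for acres92 < 200,000 (the variable name appears later with svymean), and the acreage values are again the six printed rows standing in for the full sample:

```r
acres92 <- c(587595, 167583, 15583, 99066, 272876, 20803)   # stand-in values
lt200k  <- as.numeric(acres92 < 200000)                     # 0/1 indicator
n <- length(lt200k)
N <- 3078
p.hat <- mean(lt200k)
SE.p  <- sqrt((1 - n / N) * p.hat * (1 - p.hat) / (n - 1))  # fpc-adjusted SE
```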
Clean up
As we go through these notes, we create a lot of objects with the same names, like n and N, or idx and idx2. Every once in a while it's a good idea to clean out your global environment so you don't get confused. Use the rm() function to remove everything except the population ag, our sample of 300 ag.srs, and their respective sizes: n and N.
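One way to do this (a sketch; setdiff keeps the listed names and removes everything else in the global environment):

```r
# Remove every object except the population, the sample, and the sizes
rm(list = setdiff(ls(), c("ag", "ag.srs", "n", "N")))
```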
Recall that a goal of sampling is to obtain a representative sample, one that is similar to the true unknown population. Thus, conceptually, if we duplicate certain units from our sample a certain number of times, we could “reconstruct” what the population looks like. That is, we could create $w_i$ copies of unit $i$ for each unit in the sample.
Definition: Sampling Weight (Design weight)
The inverse of the inclusion/selection probability for unit $i$: $w_i = \frac{1}{\pi_i}$.
Also interpreted as the number of population units represented by unit $i$.
In an SRS, each unit has an inclusion probability of $\pi_i = \frac{n}{N}$, so the sampling weights are all $w_i = \frac{N}{n}$.
We don’t actually make copies of record $i$; instead we use these weights as a multiplier in our estimation calculations.
Population size: $\hat{N} = \sum_{i \in S} w_i$
Total: $\hat{t} = \sum_{i \in S} w_i y_i$
Mean: $\hat{\bar{y}} = \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i}$
These weighted estimators are used in all probability sampling designs.
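A tiny worked example using the four programs sampled earlier (enrollments 31, 39, 21, 196, with $N = 10$ and $n = 4$, so every weight is $10/4 = 2.5$):

```r
y <- c(31, 39, 21, 196)  # FT enrollments of the 4 sampled programs
w <- rep(10 / 4, 4)      # SRS weights w_i = N/n
sum(w)                   # N-hat = 10, the population size
sum(w * y)               # t-hat = 717.5, estimated total enrollment
sum(w * y) / sum(w)      # weighted mean = 71.75 (equals mean(y) under SRS)
```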
Example: Calculating weighted estimates
Estimate the total and average number of acres devoted to farms in 1992 using both weighted and unweighted estimates. Then compare these values to the parameter.
The sampling weights are $w_i = N/n = 3078/300 = 10.26$ for each unit in the sample, so we'll add that on as a new column before calculating the weighted estimates.
See Lohr Table 2.1 for a nicer visual of the data frame with weights.
ag.srs$wt <- N/n
(N.hat <- sum(ag.srs$wt)) # just to confirm to myself
[1] 3078
Calculate weighted and unweighted estimates, then the pop parameters.
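The comparison code itself is not shown; here is a sketch with a synthetic population standing in for ag (with the real data you would compare sum(ag.srs$wt * ag.srs$acres92) and mean(ag.srs$acres92) against sum(ag$acres92) and mean(ag$acres92)):

```r
set.seed(1)
pop <- rnorm(1000, mean = 300, sd = 50)  # stand-in for ag$acres92
smp <- sample(pop, 100)                  # SRS of n = 100 from N = 1000
wt  <- rep(1000 / 100, 100)              # sampling weights N/n
sum(wt * smp)        # weighted estimate of the population total
mean(smp)            # unweighted sample mean, unbiased under SRS
sum(pop); mean(pop)  # the true parameters, for comparison
```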
Survey designs are specified using the svydesign function. The main arguments to the function are id to specify sampling units, weights to specify sampling weights, and fpc to specify finite population size corrections. These arguments should be given as formulas, referring to columns in a data frame given as the data argument.
Similar to group_by in dplyr, which adds a “flag” to the data set indicating that all subsequent actions are to be done to each group separately, the svydesign function adds information to the data set so that subsequent actions know how to incorporate design features like weights and strata.
library(survey)
ag.srs.dsgn <- svydesign(id = ~1, weights = ~wt, fpc = rep(N, n), data = ag.srs)
The id argument specifies the clusters. We are not using any clustering, so each unit has its own id.
Point and interval estimates
Survey design adjusted estimates can be obtained using svymean and svytotal.
svymean(~acres92, ag.srs.dsgn)
mean SE
acres92 301152 24924
svytotal(~acres92, ag.srs.dsgn)
total SE
acres92 926944810 76716552
svymean(~lt200k, ag.srs.dsgn)
mean SE
lt200k 0.55333 0.0273
Estimates for multiple variables can be obtained at the same time.
svymean(~acres92+farms92, ag.srs.dsgn)
mean SE
acres92 301151.66 24924.156
farms92 582.28 24.156
You can pass the results to confint() to construct confidence intervals.
We can be 95% confident that the interval (1,646,531, 1,937,985) covers the true number of farms in the United States in 1992.
You try it
Using your sample of 300 counties, estimate the total number of farms in the United States in 1987 and the average farm size in acres for the same year. Report both point and interval estimates.
Based on this sample of 300, the estimated total number of farms in the US in 1987 is 1,957,310, with 95% CI (1,800,099, 2,114,521),
and the average acres for those farms is 307,899.7, with 95% CI (258,523.3, 357,276.1).
Fancy table
data.frame(Measure = c("Total number of farms", "Average farm size"),
           Estimate = prettyNum(c(pt.est.t[1], pt.est.m[1]), big.mark = ','),
           LCL = c(ci.t[1], ci.m[1]),
           UCL = c(ci.t[2], ci.m[2])) |>
  knitr::kable(align = 'lccc', row.names = FALSE,
               caption = "Point estimates and confidence intervals for the total number of farms in the US in 1987 and the average farm size in that same year.")
Point estimates and confidence intervals for the total number of farms in the US in 1987 and the average farm size in that same year.