Formulas and Definitions

Equations and R code

Parameters and Statistics

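| Measure | Population ($\theta$) | Sample ($\hat{\theta}$) |
|---|---|---|
| Total | $\tau = \sum_{i=1}^{N} y_i$ | $\hat{\tau} = \sum_{i=1}^{n} y_i$ |
| Mean | $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$ | $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ |
| Variance | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \mu)^2$ | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$ |
| Proportion | $p = \mu$ | $\hat{p} = \bar{y}$ |
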
  • $N$: total population size
  • $n$: total sample size
  • $y_i$: value of measurement $y$ on unit $i$
  • For proportions, $y_i$ is a binary indicator of success, $y_i \in \{0, 1\}$; i.e., $I(y_i = 1)$.
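
As a quick numerical check of the definitions above, the following base R sketch computes each population parameter and sample statistic. The population values `y.pop` and the sampled positions `idx` are made up for illustration.

```r
# Hypothetical population of N = 6 measurements (illustrative values only)
y.pop <- c(3, 5, 7, 9, 11, 13)
N <- length(y.pop)

# Population parameters
tau    <- sum(y.pop)                 # total
mu     <- tau / N                    # mean
sigma2 <- sum((y.pop - mu)^2) / N    # variance (divisor N)

# Hypothetical sample: units 1, 3 and 6
idx  <- c(1, 3, 6)
y    <- y.pop[idx]
n    <- length(y)
ybar <- sum(y) / n                    # sample mean
s2   <- sum((y - ybar)^2) / (n - 1)   # sample variance (divisor n - 1)

# For a proportion, y is a 0/1 indicator, so p-hat is just the mean
y.bin <- as.numeric(y > 6)
p.hat <- mean(y.bin)
```

R's built-in `var()` uses the $n - 1$ divisor, so `var(y)` matches `s2` but `var(y.pop)` does not match `sigma2`.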

Expected Value and Variance

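$E(\hat{\theta}) = \sum \hat{\theta}\, p(\hat{\theta})$

$V(\hat{\theta}) = \sum [\hat{\theta} - E(\hat{\theta})]^2\, p(\hat{\theta})$ -or- $V(\hat{\theta}) = E(\hat{\theta}^2) - [E(\hat{\theta})]^2$

  • Sums are over all possible values of $\hat{\theta}$.
  • $p(\hat{\theta})$ is the probability of $\hat{\theta}$ occurring.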
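
One way to see these definitions at work is to enumerate every possible sample from a tiny population. The sketch below assumes a made-up population of four values and simple random sampling without replacement with $n = 2$, so each sample is equally likely. Summing over samples is equivalent to summing over the distinct values of $\hat{\theta}$ with their probabilities pooled.

```r
# Hypothetical population (illustrative values only)
y.pop <- c(1, 2, 4, 9)
mu <- mean(y.pop)

# All possible SRSWOR samples of size n = 2; each is equally likely
samples   <- combn(y.pop, 2)                         # one sample per column
theta.hat <- colMeans(samples)                       # the estimate from each sample
p         <- rep(1 / ncol(samples), ncol(samples))   # probability of each sample

E.theta <- sum(theta.hat * p)                  # E(theta-hat)
V.def   <- sum((theta.hat - E.theta)^2 * p)    # definitional form
V.short <- sum(theta.hat^2 * p) - E.theta^2    # E(theta-hat^2) - [E(theta-hat)]^2
```

Both variance forms agree, and here `E.theta` equals `mu`, which is what unbiasedness of the sample mean under this design looks like numerically.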

Definition: Bias, Variance, Accuracy

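$\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$

$V(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$

$\text{MSE}(\hat{\theta}) = V(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2$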
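
The decomposition can be checked by enumeration. This sketch assumes a made-up population of four values, SRSWOR with $n = 3$, and uses the sample median as a deliberately biased estimator of the population mean. The last line compares the result with the usual direct definition $E[(\hat{\theta} - \theta)^2]$.

```r
# Hypothetical population (illustrative values only)
y.pop <- c(1, 2, 4, 9)
theta <- mean(y.pop)                      # target parameter: the population mean

samples   <- combn(y.pop, 3)              # all SRSWOR samples of size 3, equally likely
theta.hat <- apply(samples, 2, median)    # sample median used as the estimator

E.theta <- mean(theta.hat)                # expected value under equal probabilities
bias    <- E.theta - theta
V.theta <- mean((theta.hat - E.theta)^2)
mse     <- V.theta + bias^2

mse.direct <- mean((theta.hat - theta)^2) # E[(theta-hat - theta)^2]
```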

Sample Weights

The sampling weights $w_i$ are the reciprocal of the inclusion probability $\pi_i$: $w_i = \frac{1}{\pi_i}$.

  • SRS: $w_i = \frac{1}{n}$
  • SRSWOR: $w_i = \frac{N}{n}$
  • Stratified: $w_{hj} = \frac{N_h}{n_h}$
  • One stage cluster: $w_{ij} = \frac{N}{n}$
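
As an illustration of the reciprocal relationship, the sketch below computes stratified weights from made-up stratum sizes. The names `N.h` and `n.h` and their values are assumptions for the example only.

```r
# Hypothetical stratified design: population and sample sizes per stratum
N.h <- c(A = 100, B = 300, C = 600)
n.h <- c(A = 10,  B = 15,  C = 20)

pi.h <- n.h / N.h     # inclusion probability for a unit in stratum h (SRSWOR within strata)
w.h  <- 1 / pi.h      # sampling weight, equal to N_h / n_h

# Each sampled unit represents w.h population units, so the weights
# summed over the whole sample recover the population size N
sum(n.h * w.h)
```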

Simple Random Sample

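| Population Quantity | Estimator ($\hat{\theta}$) | Estimated variance of ($\hat{\theta}$) |
|---|---|---|
| Mean: $\mu = \frac{\tau}{N}$ | $\frac{\hat{\tau}}{N} = \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i} = \bar{y}$ | $\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}$ |
| Total: $\tau = \sum_{i=1}^{N} y_i$ | $\hat{\tau} = \sum_{i \in S} w_i y_i = N\bar{y}$ | $\hat{V}(\hat{\tau}) = N^2\,\hat{V}(\bar{y})$ |
| Proportion: $p$ | $\hat{p} = \bar{y}$ | $\hat{V}(\hat{p}) = \left(1 - \frac{n}{N}\right)\frac{\hat{p}(1 - \hat{p})}{n - 1}$ |

  • $i \in S$: unit $i$ is an element in the sample $S$
  • The weights here are the SRSWOR weights $w_i = \frac{N}{n}$.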
  • The standard error of the estimate is the square root of the estimated variance.
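
The simple random sample formulas can be worked through by hand in base R. The sketch below assumes a made-up sample of $n = 5$ values drawn without replacement from a population of $N = 100$.

```r
# Hypothetical SRS of n = 5 from a population of N = 100
y <- c(2, 4, 6, 8, 10)
n <- length(y)
N <- 100
w <- rep(N / n, n)               # SRSWOR sampling weights

ybar    <- sum(w * y) / sum(w)   # weighted mean, equals mean(y)
tau.hat <- sum(w * y)            # estimated total, equals N * ybar
fpc     <- 1 - n / N             # finite population correction

V.ybar  <- fpc * var(y) / n      # estimated variance of the mean
V.tau   <- N^2 * V.ybar          # estimated variance of the total
se.ybar <- sqrt(V.ybar)          # standard errors
se.tau  <- sqrt(V.tau)

# Proportion of sampled values greater than 5 (a 0/1 indicator)
p.hat <- mean(y > 5)
V.p   <- fpc * p.hat * (1 - p.hat) / (n - 1)
```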

Stratified Random Sample

See section-04 for notation.

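| Population Quantity | Estimator ($\hat{\theta}$) | Estimated variance of ($\hat{\theta}$) |
|---|---|---|
| Within strata total: $\tau_h = \sum_j y_{hj}$ | $\hat{\tau}_h = N_h \bar{y}_h$ | $\hat{V}(\hat{\tau}_h) = \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h}$ |
| Overall total: $\tau = \sum_h \tau_h$ | $\hat{\tau}_{str} = \sum_h \hat{\tau}_h$ | $\hat{V}(\hat{\tau}_{str}) = \sum_h \hat{V}(\hat{\tau}_h)$ |
| Within strata mean: $\mu_h = \frac{\tau_h}{N_h}$ | $\bar{y}_h = \frac{1}{n_h}\sum_j y_{hj}$ | $\hat{V}(\bar{y}_h) = \frac{1}{N_h^2}\hat{V}(\hat{\tau}_h)$ |
| Overall mean: $\mu = \frac{\tau}{N}$ | $\bar{y}_{str} = \sum_h \frac{N_h}{N}\bar{y}_h$ | $\hat{V}(\bar{y}_{str}) = \frac{1}{N^2}\hat{V}(\hat{\tau}_{str})$ |
| Within strata proportion: $p_h = \mu_h$ | $\hat{p}_h = \bar{y}_h$ | $\hat{V}(\hat{p}_h) = \left(1 - \frac{n_h}{N_h}\right)\frac{\hat{p}_h(1 - \hat{p}_h)}{n_h - 1}$ |
| Overall proportion: $p = \mu$ | $\hat{p}_{str} = \sum_h \frac{N_h}{N}\hat{p}_h$ | $\hat{V}(\hat{p}_{str}) = \sum_h \left(1 - \frac{n_h}{N_h}\right)\left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h(1 - \hat{p}_h)}{n_h - 1}$ |

  • For a 0/1 variable, $s_h^2 = \frac{n_h}{n_h - 1}\hat{p}_h(1 - \hat{p}_h)$, which gives the within strata proportion variance in the table.
  • $\sum_h$ is shorthand for $\sum_{h=1}^{H}$ and $\sum_j$ is shorthand for $\sum_{j=1}^{N_h}$; in sample estimators the sum over $j$ runs over the $n_h$ sampled units in stratum $h$.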
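
A small worked example may help connect the within-stratum and overall rows. The sketch below assumes two made-up strata with known population sizes `N.h` and a handful of sampled values in each.

```r
# Hypothetical stratified sample with two strata (illustrative values only)
N.h <- c(A = 40, B = 60)                             # stratum population sizes
y   <- list(A = c(2, 4, 6), B = c(10, 12, 14, 16))   # sampled values by stratum
N   <- sum(N.h)

n.h    <- sapply(y, length)    # stratum sample sizes
ybar.h <- sapply(y, mean)      # stratum sample means
s2.h   <- sapply(y, var)       # stratum sample variances

tau.h   <- N.h * ybar.h                            # within-stratum totals
V.tau.h <- (1 - n.h / N.h) * N.h^2 * s2.h / n.h    # and their estimated variances

tau.str  <- sum(tau.h)               # overall total
V.tau    <- sum(V.tau.h)             # variance of the overall total
ybar.str <- sum(N.h / N * ybar.h)    # overall mean
V.ybar   <- V.tau / N^2              # variance of the overall mean
```

Because strata are sampled independently, the variances of the stratum totals simply add.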

Cluster Random Sample

⚠️ Note notation change!

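  • $y_{ij}$: measurement for the $j$th element in the $i$th psu
  • $N$: the number of clusters (psus) in the population
    • $n$: the number of psus in the sample
  • $M_i$: the number of ssus in psu $i$ in the population
    • $m_i$: the number of ssus in psu $i$ in the sample
  • $M_0$: the total number of ssus in the population
  • $M$: the common psu size when all psus are the same size, so that $M_0 = NM$

| Population Quantity | Estimator ($\hat{\theta}$) | Estimated variance of ($\hat{\theta}$) |
|---|---|---|
| Total in psu $i$: $t_i = \sum_j y_{ij}$ | $\hat{t}_i = \frac{M_i}{m_i}\sum_j y_{ij}$ | - |
| Variance of psu totals: $S_t^2 = \frac{1}{N-1}\sum_i \left(t_i - \frac{t}{N}\right)^2$ | $s_t^2 = \frac{1}{n-1}\sum_i \left(t_i - \frac{\hat{t}}{N}\right)^2$ | AKA variance between psus |
| Overall total: $t = \sum_i t_i$ | $\hat{t} = \frac{N}{n}\sum_i \hat{t}_i$ | $\hat{V}(\hat{t}) = N^2\left(1 - \frac{n}{N}\right)\frac{s_t^2}{n}$ |
| Mean in psu $i$: $\mu_i = \frac{1}{M_i}\sum_j y_{ij}$ | $\bar{y}_i = \frac{1}{m_i}\sum_j y_{ij}$ | $s_i^2 = \frac{1}{m_i - 1}\sum_j (y_{ij} - \bar{y}_i)^2$ (AKA variance within psu) |
| Overall mean: $\mu = \frac{1}{M_0}\sum_i \sum_j y_{ij}$ | $\bar{y} = \frac{\hat{t}}{NM}$ | $\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s_t^2}{nM^2}$ |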
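
For a one-stage cluster sample, every ssu in a sampled psu is measured, so $m_i = M_i$ and $\hat{t}_i = t_i$. The sketch below works through the total and its standard error under that assumption, using made-up data for $n = 3$ psus out of $N = 10$.

```r
# Hypothetical one-stage cluster sample: n = 3 psus drawn from N = 10,
# with every ssu in a sampled psu measured
N <- 10
y <- list(c(1, 2, 3), c(4, 6), c(5, 5, 5, 5))   # y_ij listed by sampled psu
n <- length(y)

t.i   <- sapply(y, sum)                         # psu totals
t.hat <- N / n * sum(t.i)                       # estimated population total
s2.t  <- sum((t.i - t.hat / N)^2) / (n - 1)     # variance between psu totals
V.t   <- N^2 * (1 - n / N) * s2.t / n           # estimated variance of t-hat
se.t  <- sqrt(V.t)                              # standard error
```

Since $\hat{t}/N$ is the mean of the sampled psu totals, `s2.t` is the same as `var(t.i)`.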

R commands

This is a quick reference list. See the R companion for the textbook, the package help files, vignettes or other tutorials listed at the bottom of this page for more information.

A note on missing data

If any of the functions below returns NA, your variable may have missing values. Add the na.rm=TRUE argument to svymean or svytotal to calculate a complete-case mean or total.

Analysis

The survey package supports the analysis of data collected using complex survey designs.

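Specify survey design: svydesign

  • Function call: svydesign(id = , strata = , weights = , fpc = , data = )
  • id = variable that identifies clusters (use ~1 when there are no clusters)
  • strata = variable that identifies strata (stratified designs only)
  • weights = variable that holds the sampling weights
  • fpc = finite population correction, given as the population size or the sampling fraction. Typically defined in the function call.
  • data = data frame that holds the sample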

The argument details can be found on the specified pages in the R companion for the book, and in the respective sections of these notes.

  • SRS: pg 21
  • Stratified Random Sample: pg 34
  • Cluster sampling: pg 57

Estimators

  • mean: svymean(~x, design)
  • total: svytotal(~x, design)
  • proportion: Use svytotal and divide by N
  • CI for the mean or total: Use confint() after calculating the point estimate
  • CI for proportion: svyciprop(~x, design). This also prints $\hat{p}$.
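
The calls above can be tried on a toy data set. The sketch below is an illustration under stated assumptions, not the companion's own example: the data frame `samp` and its columns `y`, `wt` and `pop` are made up, and the design is an SRS of $n = 5$ from $N = 100$. The `requireNamespace()` guard simply skips the example if the survey package is not installed.

```r
# Hypothetical SRS of n = 5 from a population of N = 100 (illustrative values)
samp <- data.frame(y = c(2, 4, 6, 8, 10))
samp$wt  <- 100 / nrow(samp)   # sampling weight N/n
samp$pop <- 100                # population size, used for the fpc

if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)
  dsgn <- svydesign(id = ~1, weights = ~wt, fpc = ~pop, data = samp)

  est.mean  <- svymean(~y, dsgn)    # point estimate and SE of the mean
  est.total <- svytotal(~y, dsgn)   # point estimate and SE of the total
  print(est.mean)
  print(confint(est.mean))          # CI built from the estimate and its SE

  # Cross-check the SE against the SRS formula sqrt((1 - n/N) * s^2 / n)
  by.hand <- sqrt((1 - 5 / 100) * var(samp$y) / 5)
  print(c(survey = as.numeric(SE(est.mean)), by.hand = by.hand))
}
```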

Calculating stratum means and variances

  • The first argument of svyby is the formula for the variable(s) for which statistics are desired.
  • by = is a formula for the variable that defines the groups.
  • Next comes the design object,
  • followed by the name of the function that calculates the statistics.
  • Set keep.var=TRUE to display the standard errors for the statistics.
svyby(~acres92, by=~region, agstrat.dsgn, svytotal, keep.var = TRUE)

Sampling

The sampling package allows you to take random samples from a sampling frame using different sampling frameworks in a reproducible manner.

  1. Set up your sampling frame in a spreadsheet. This example uses Google Sheets and the googlesheets4 package.

  2. Import your sampling frame into R.

library(googlesheets4)
frame <- read_sheet("https://docs.google.com/spreadsheets/d/17bg__F6Cq0zBnbPtMBsNCKNM-pyybVnhujvI2J66n_4")
  3. Use functions from the sampling package to draw random samples according to your design. See the package help files for more details on what the arguments mean.
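library(sampling)
set.seed(12345)                     # make the draws reproducible
# Simple random sample without replacement: n = 4 units from the frame
srs.idx <- srswor(4, nrow(frame))   # 0/1 selection indicator, one entry per frame row
getdata(frame, srs.idx)             # extract the selected rows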
  ID_unit unit_id group
1      14      14     B
2      16      16     B
3      26      26     C
4      28      28     C
library(dplyr)
# Stratified random sample: the frame must be sorted by the stratum variable first
frame <- frame %>% arrange(group)
strata.idx <- sampling::strata(data = frame,            # data set
                               stratanames = "group",   # variable name
                               size = c(2,3,2,1,2),     # stratum sample sizes, in the order the strata appear
                               method = "srswor")       # method for selecting within strata
getdata(frame, strata.idx)
   unit_id group ID_unit      Prob Stratum
2        2     A       2 0.2500000       1
8        8     A       8 0.2500000       1
10      10     B      10 0.3333333       2
14      14     B      14 0.3333333       2
16      16     B      16 0.3333333       2
23      23     C      23 0.1428571       3
28      28     C      28 0.1428571       3
38      38     D      38 0.1428571       4
39      39     E      39 0.1666667       5
48      48     E      48 0.1666667       5
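
The Prob column is the inclusion probability $n_h / N_h$, so its reciprocal is the sampling weight. The sketch below reproduces that column in base R. The stratum sizes in `N.h` are not stated in these notes; they are inferred from the printed probabilities and the 50-unit frame, so treat them as an assumption.

```r
# Stratum sizes inferred from the Prob column (assumption), and the
# stratum sample sizes passed to strata()
N.h <- c(A = 8, B = 9, C = 14, D = 7, E = 12)
n.h <- c(A = 2, B = 3, C = 2,  D = 1, E = 2)

prob <- n.h / N.h     # inclusion probability within each stratum
wt   <- 1 / prob      # sampling weight for each sampled unit in the stratum
round(prob, 7)

# Weights summed over all sampled units recover the frame size
sum(n.h * wt)
```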

One stage cluster

onestage.idx <- sampling::cluster(data=frame,         # Data set
                  clustername = "group",  # variable name containing clusters 
                  size = 3,               # number of clusters
                  method = "srswor",      # how to draw clusters 
                  description = TRUE)     # show descriptive output
Number of selected clusters: 3 
Number of units in the population and number of selected units: 50 24 
getdata(frame, onestage.idx)
   unit_id group ID_unit Prob
1        3     A       3  0.6
2        2     A       2  0.6
3        7     A       7  0.6
4        4     A       4  0.6
5        1     A       1  0.6
6        6     A       6  0.6
7        8     A       8  0.6
8        5     A       5  0.6
9       12     B      12  0.6
10       9     B       9  0.6
11      11     B      11  0.6
12      16     B      16  0.6
13      13     B      13  0.6
14      10     B      10  0.6
15      15     B      15  0.6
16      17     B      17  0.6
17      14     B      14  0.6
18      38     D      38  0.6
19      32     D      32  0.6
20      33     D      33  0.6
21      34     D      34  0.6
22      35     D      35  0.6
23      36     D      36  0.6
24      37     D      37  0.6

Two stage cluster

mstage.idx <- sampling::mstage(data=frame, 
                 stage = c("cluster", ""),  # sampling method for each stage, blank means SRS
                 varnames = list("group", "unit_id"),  # variable names for each stage
                 size = list(3, c(5,5,5)), # 3 psus, 5 ssus from each psu
                 method = c("srswor", "srswor"))

getdata(frame, mstage.idx)
[[1]]
   unit_id group ID_unit Prob_ 1 _stage
1        8     A       8            0.6
2        6     A       6            0.6
3        7     A       7            0.6
4        3     A       3            0.6
5        5     A       5            0.6
6        1     A       1            0.6
7        2     A       2            0.6
8        4     A       4            0.6
9       25     C      25            0.6
10      18     C      18            0.6
11      19     C      19            0.6
12      20     C      20            0.6
13      21     C      21            0.6
14      22     C      22            0.6
15      23     C      23            0.6
16      24     C      24            0.6
17      29     C      29            0.6
18      26     C      26            0.6
19      27     C      27            0.6
20      28     C      28            0.6
21      30     C      30            0.6
22      31     C      31            0.6
23      38     D      38            0.6
24      32     D      32            0.6
25      33     D      33            0.6
26      34     D      34            0.6
27      35     D      35            0.6
28      36     D      36            0.6
29      37     D      37            0.6

[[2]]
   unit_id group ID_unit Prob_ 2 _stage      Prob
1        6     A       6      0.6250000 0.3750000
2        7     A       7      0.6250000 0.3750000
3        3     A       3      0.6250000 0.3750000
4        5     A       5      0.6250000 0.3750000
5        1     A       1      0.6250000 0.3750000
6       18     C      18      0.3571429 0.2142857
7       20     C      20      0.3571429 0.2142857
8       29     C      29      0.3571429 0.2142857
9       27     C      27      0.3571429 0.2142857
10      31     C      31      0.3571429 0.2142857
11      38     D      38      0.7142857 0.4285714
12      32     D      32      0.7142857 0.4285714
13      33     D      33      0.7142857 0.4285714
14      34     D      34      0.7142857 0.4285714
15      37     D      37      0.7142857 0.4285714
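
In the second list element, Prob is the product of the two stage probabilities, and its reciprocal is the sampling weight for each selected ssu. The sketch below reproduces those values. The psu sizes in `M.i` are inferred from the stage 1 output rather than stated in the notes, so treat them as an assumption.

```r
# Stage 1: 3 of the 5 psus (groups) are drawn by SRSWOR
prob.1 <- 3 / 5

# Stage 2: 5 ssus are drawn by SRSWOR within each selected psu;
# psu sizes are inferred from the stage 1 output (assumption)
M.i    <- c(A = 8, C = 14, D = 7)
prob.2 <- 5 / M.i

prob <- prob.1 * prob.2    # overall inclusion probability (the Prob column)
wt   <- 1 / prob           # sampling weight for each selected ssu
round(prob, 7)
```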

Vignettes and handbooks