Formulas and Definitions

Equations and R code

Parameters and Statistics

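| Measure | Population ($\theta$) | Sample ($\hat{\theta}$) |
|---|---|---|
| Total | $\tau = \sum_{i=1}^{N} y_i$ | $\hat{\tau} = \sum_{i=1}^{n} y_i$ |
| Mean | $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$ | $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ |
| Variance | $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu)^2$ | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ |
| Proportion | $p = \mu$ | $\hat{p} = \bar{y}$ |
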
  • $N$ = total population size
  • $n$ = sample size
  • $y_i$ = value of measurement $y$ on unit $i$
  • For proportions, $y_i$ is a binary indicator of success: $y_i \in \{0, 1\}$, i.e. $I(y_i = 1)$.
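
As a minimal sketch of these definitions in base R, the block below computes each population parameter and its sample counterpart. The population values and the choice of sampled units are made up for illustration; note that the population variance uses divisor $N$, while R's `var()` uses $n - 1$.

```r
# Hypothetical population of N = 6 measurements (values are made up)
y_pop <- c(3, 7, 4, 9, 5, 8)
N <- length(y_pop)

# Population parameters
tau    <- sum(y_pop)                 # total
mu     <- tau / N                    # mean
sigma2 <- sum((y_pop - mu)^2) / N    # variance with divisor N (not var())

# Hypothetical sample: units 1, 4 and 6
y_samp <- y_pop[c(1, 4, 6)]
n <- length(y_samp)

# Sample statistics
y_bar <- sum(y_samp) / n             # sample mean
s2    <- var(y_samp)                 # sample variance, divisor n - 1

# Proportion: recode y as a binary indicator, then take its mean
success <- as.numeric(y_pop > 5)     # I(y_i > 5), an arbitrary definition of success
p <- mean(success)
```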

Expected Value and Variance

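$$E(\hat{\theta}) = \sum \hat{\theta}\, p(\hat{\theta})$$

$$V(\hat{\theta}) = \sum [\hat{\theta} - E(\hat{\theta})]^2\, p(\hat{\theta}) \quad \text{or} \quad V(\hat{\theta}) = E(\hat{\theta}^2) - [E(\hat{\theta})]^2$$

  • Sums are over all possible values of $\hat{\theta}$.
  • $p(\hat{\theta})$ is the probability of $\hat{\theta}$ occurring.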
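
One way to see these sums at work is to list every possible sample from a tiny population. The sketch below assumes simple random sampling without replacement, so every sample is equally likely; summing over samples with those probabilities is equivalent to summing over the distinct values of $\hat{\theta}$. The population values are made up.

```r
# Hypothetical population (made-up values) and sample size
y_pop <- c(3, 7, 4, 9, 5, 8)
n <- 2

# Every possible sample of size n: one sample per column
samples <- combn(y_pop, n)

# The estimator (here the sample mean) evaluated on each possible sample
y_bar <- colMeans(samples)

# Under SRSWOR each sample has the same probability
p_s <- rep(1 / ncol(samples), ncol(samples))

# Expected value and the two equivalent variance formulas
E_ybar <- sum(y_bar * p_s)
V_def  <- sum((y_bar - E_ybar)^2 * p_s)
V_alt  <- sum(y_bar^2 * p_s) - E_ybar^2
```

Both variance expressions agree, and the expected value of the sample mean equals the population mean.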

Definition: Bias, Variance, Accuracy

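$$\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

$$V(\hat{\theta}) = E\left[(\hat{\theta} - E[\hat{\theta}])^2\right]$$

$$\text{MSE}(\hat{\theta}) = V(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2$$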
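
The following sketch compares an unbiased estimator of $\mu$ (the sample mean) with a deliberately biased one (the sample maximum) by enumerating all samples of size 3. The helper `summarise_estimator()` is a hypothetical name introduced here, the population values are made up, and equally likely samples (SRSWOR) are assumed.

```r
# Hypothetical population (made-up values)
y_pop <- c(3, 7, 4, 9, 5, 8)
mu <- mean(y_pop)

# All possible SRSWOR samples of size 3, each equally likely
samples <- combn(y_pop, 3)
p_s <- rep(1 / ncol(samples), ncol(samples))

# Hypothetical helper: bias, variance and MSE of an estimator,
# given its value on every possible sample
summarise_estimator <- function(est, theta, prob) {
  e    <- sum(est * prob)            # E(theta-hat)
  bias <- e - theta                  # Bias = E(theta-hat) - theta
  v    <- sum((est - e)^2 * prob)    # V(theta-hat)
  c(bias = bias, variance = v, mse = v + bias^2)
}

est_mean <- colMeans(samples)        # sample mean on each sample
est_max  <- apply(samples, 2, max)   # sample maximum on each sample

res_mean <- summarise_estimator(est_mean, mu, p_s)
res_max  <- summarise_estimator(est_max, mu, p_s)

# MSE computed directly as E[(theta-hat - theta)^2], for comparison
mse_direct_max <- sum((est_max - mu)^2 * p_s)
```

The directly computed MSE matches variance plus squared bias, which is the decomposition given above.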

Sample Weights

The sampling weights $w_i$ are the reciprocal of the inclusion probability $\pi_i$.

  • SRS: $w_i = \frac{1}{n}$
  • SRSWOR: $w_i = \frac{N}{n}$
  • Stratified: $w_{hj} = \frac{N_h}{n_h}$
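
A short sketch of the reciprocal relationship for the SRSWOR and stratified cases, using made-up population and sample sizes. A useful check is that the weights of all sampled units add up to the population size.

```r
# SRSWOR with hypothetical sizes N = 50, n = 10
N <- 50
n <- 10
pi_i <- rep(n / N, n)        # inclusion probability of each sampled unit
w_i  <- 1 / pi_i             # sampling weight = N / n
sum(w_i)                     # weights sum to N

# Stratified sample with two hypothetical strata
N_h <- c(A = 40, B = 60)     # stratum population sizes
n_h <- c(A = 4,  B = 5)      # stratum sample sizes
w_h <- N_h / n_h             # weight shared by every unit in stratum h
sum(n_h * w_h)               # again sums to the population size
```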

Simple Random Sample

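| Population Quantity | Estimator ($\hat{\theta}$) | Estimated variance of $\hat{\theta}$ |
|---|---|---|
| Mean: $\mu = \frac{\tau}{N}$ | $\frac{\hat{\tau}}{N} = \frac{\sum_{i \in S} w_i y_i}{\sum_{i \in S} w_i} = \bar{y}$ | $\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}$ |
| Total: $\tau = \sum_{i=1}^{N} y_i$ | $\hat{\tau} = \sum_{i \in S} w_i y_i = N\bar{y}$ | $\hat{V}(\hat{\tau}) = N^2\, \hat{V}(\bar{y})$ |
| Proportion: $p$ | $\hat{p} = \bar{y}$ | $\hat{V}(\hat{p}) = \left(1 - \frac{n}{N}\right)\frac{\hat{p}(1 - \hat{p})}{n - 1}$ |
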
  • $i \in S$: unit $i$ is an element of the sample $S$
  • The standard error of the estimate is the square root of the estimated variance.
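
The estimators in this table can be computed by hand in base R. The sketch below assumes an SRSWOR of $n = 8$ units from a population of $N = 100$, with weights $w_i = N/n$; the data values and the definition of "success" are made up.

```r
# Hypothetical SRSWOR sample from a population of N = 100 units
N <- 100
y <- c(12, 15, 9, 20, 14, 10, 18, 16)
n <- length(y)
w <- rep(N / n, n)                    # SRSWOR weights
fpc <- 1 - n / N                      # finite population correction

# Mean
y_bar   <- sum(w * y) / sum(w)        # weighted mean, equals mean(y)
v_ybar  <- fpc * var(y) / n
se_ybar <- sqrt(v_ybar)               # standard error = sqrt of estimated variance

# Total
tau_hat <- sum(w * y)                 # equals N * y_bar
v_tau   <- N^2 * v_ybar
se_tau  <- sqrt(v_tau)

# Proportion of units with y >= 15 (arbitrary definition of success)
x     <- as.numeric(y >= 15)
p_hat <- mean(x)
v_p   <- fpc * p_hat * (1 - p_hat) / (n - 1)
se_p  <- sqrt(v_p)
```

The proportion variance is the mean variance applied to the 0/1 indicator, since for binary data $s^2 = \frac{n}{n-1}\hat{p}(1-\hat{p})$.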

Stratified Random Sample

See section-04 for notation.

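| Population Quantity | Estimator ($\hat{\theta}$) | Estimated variance of $\hat{\theta}$ |
|---|---|---|
| Within strata total: $\tau_h = \sum_j y_{hj}$ | $\hat{\tau}_h = N_h \bar{y}_h$ | $\hat{V}(\hat{\tau}_h) = \left(1 - \frac{n_h}{N_h}\right) N_h^2 \frac{s_h^2}{n_h}$ |
| Overall total: $\tau = \sum_h \tau_h$ | $\hat{\tau}_{str} = \sum_h \hat{\tau}_h$ | $\hat{V}(\hat{\tau}_{str}) = \sum_h \hat{V}(\hat{\tau}_h)$ |
| Within strata mean: $\mu_h = \frac{\tau_h}{N_h}$ | $\bar{y}_h = \frac{1}{n_h}\sum_j y_{hj}$ | $\hat{V}(\bar{y}_h) = \frac{1}{N_h^2}\hat{V}(\hat{\tau}_h)$ |
| Overall mean: $\mu = \frac{\tau}{N}$ | $\bar{y}_{str} = \sum_h \frac{N_h}{N}\bar{y}_h$ | $\hat{V}(\bar{y}_{str}) = \frac{1}{N^2}\hat{V}(\hat{\tau}_{str})$ |
| Within strata proportion: $p_h = \mu_h$ | $\hat{p}_h = \bar{y}_h$ | $\hat{V}(\hat{p}_h) = \left(1 - \frac{n_h}{N_h}\right)\frac{\hat{p}_h(1 - \hat{p}_h)}{n_h - 1}$ |
| Overall proportion: $p = \mu$ | $\hat{p}_{str} = \sum_h \frac{N_h}{N}\hat{p}_h$ | $\hat{V}(\hat{p}_{str}) = \sum_h \left(1 - \frac{n_h}{N_h}\right)\left(\frac{N_h}{N}\right)^2 \frac{\hat{p}_h(1 - \hat{p}_h)}{n_h - 1}$ |
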
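  • $\sum_h$ is shorthand for $\sum_{h=1}^{H}$ and $\sum_j$ is shorthand for $\sum_{j=1}^{N_h}$. For sample quantities such as $\bar{y}_h$, the sum over $j$ runs over the $n_h$ sampled units in stratum $h$.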
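
A base R sketch of the stratified estimators for totals and means, assuming two strata with SRSWOR inside each. The stratum sizes and sample values are made up.

```r
# Hypothetical stratum population sizes and within-stratum samples
N_h  <- c(A = 40, B = 60)
samp <- list(A = c(10, 12, 14, 16),
             B = c(20, 22, 27, 23, 28))
N <- sum(N_h)

# Within-stratum sample summaries
n_h    <- sapply(samp, length)
ybar_h <- sapply(samp, mean)
s2_h   <- sapply(samp, var)

# Within-stratum totals and their estimated variances
tau_h   <- N_h * ybar_h
v_tau_h <- (1 - n_h / N_h) * N_h^2 * s2_h / n_h

# Overall total: sum the stratum estimates and their variances
tau_str    <- sum(tau_h)
v_tau_str  <- sum(v_tau_h)
se_tau_str <- sqrt(v_tau_str)

# Overall mean: weight each stratum mean by N_h / N
ybar_str    <- sum(N_h / N * ybar_h)
v_ybar_str  <- v_tau_str / N^2
se_ybar_str <- sqrt(v_ybar_str)
```

Because the strata are sampled independently, the variance of the overall total is simply the sum of the within-stratum variances.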

R commands

This is a quick reference list. See the R companion for the textbook, the package help files, vignettes or other tutorials listed at the bottom of this page for more information.

A note on missing data

If any of the functions below returns NA, your variable probably has missing values. Add the na.rm=TRUE argument to svymean or svytotal to calculate a complete-case mean or total.
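
To illustrate, the sketch below uses the `apisrs` example data that ships with the survey package (not the data set used in these notes) and sets a few values to missing on purpose.

```r
library(survey)
data(api)                              # example data shipped with survey

# Introduce missing values on purpose, for illustration only
apisrs$api00[1:5] <- NA

dsgn <- svydesign(id = ~1, weights = ~pw, fpc = ~fpc, data = apisrs)

m_na <- svymean(~api00, dsgn)                  # result is NA
m_cc <- svymean(~api00, dsgn, na.rm = TRUE)    # complete-case mean
```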

Analysis

The survey package supports the analysis of data collected using complex survey designs.

Specify survey design: svydesign

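  • Function call: svydesign(id = , strata = , weights = , fpc = , data = )
  • id = variable that identifies clusters (use ~1 when there are no clusters)
  • strata = variable that identifies strata (only needed for stratified designs)
  • weights = variable that holds the sampling weights
  • fpc = finite population correction, supplied as the population size or the sampling fraction. Typically defined in the function call.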

The argument details can be found on the specified pages in the R companion for the book, and in the respective sections of these notes.

  • SRS: pg 21
  • Stratified Random Sample: pg 34
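
As a sketch of what these calls can look like, the block below defines an SRS design and a stratified design using the `api` example data that ships with the survey package. These data sets and their column names (`pw`, `fpc`, `stype`) belong to that package's examples, not to the textbook data.

```r
library(survey)
data(api)    # loads apisrs, apistrat and related example data

# SRS: no clusters, so id = ~1; pw holds the weights, fpc the population size
srs_dsgn <- svydesign(id = ~1, weights = ~pw, fpc = ~fpc, data = apisrs)

# Stratified random sample: strata are defined by school type (stype)
strat_dsgn <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                        fpc = ~fpc, data = apistrat)

summary(strat_dsgn)   # prints inclusion probabilities and stratum sizes
```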

Estimators

  • mean: svymean(~x, design)
  • total: svytotal(~x, design)
  • proportion: Use svytotal and divide by N
  • CI for the mean or total: Use confint() after calculating the point estimate
  • CI for proportion: svyciprop(~x, design), which also prints $\hat{p}$
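
The estimators above can be sketched end to end as follows, again assuming the `apisrs` example data from the survey package. The indicator column `elem` is a hypothetical variable created here for illustration.

```r
library(survey)
data(api)

# Hypothetical binary indicator: 1 if the school is an elementary school
apisrs$elem <- as.numeric(apisrs$stype == "E")
N <- apisrs$fpc[1]                     # population size stored in the fpc column

dsgn <- svydesign(id = ~1, weights = ~pw, fpc = ~fpc, data = apisrs)

# Mean and total, each with a confidence interval
m   <- svymean(~api00, dsgn)
tot <- svytotal(~enroll, dsgn)
confint(m)
confint(tot)

# Proportion: estimated total of the indicator divided by N
p_hat <- coef(svytotal(~elem, dsgn)) / N

# Confidence interval for the proportion (also prints the estimate)
svyciprop(~elem, dsgn)
```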

Calculating stratum means and variances

  • The first argument of svyby is the formula for the variable(s) for which statistics are desired
  • (by=) is the variable that defines the groups.
  • Then list the design object
  • and the name of the function that calculates the statistics.
  • Set keep.var=TRUE to display the standard errors for the statistics.
svyby(~acres92, by=~region, agstrat.dsgn, svytotal, keep.var = TRUE)

Sampling

The sampling package allows you to take random samples from a sampling frame using different sampling frameworks in a reproducible manner.

  1. Set up your sampling frame in a spreadsheet. This example uses Google Sheets and the googlesheets4 package.

  2. Import your sampling frame into R.

library(googlesheets4)
frame <- read_sheet("https://docs.google.com/spreadsheets/d/13t_2a1nymS-RfAdDN1lq_WrpLD2xjy2rNsol95rD9VA")
  3. Use functions from the sampling package to draw random samples according to your design. See the links for more details on what the arguments mean.
library(sampling)
set.seed(12345)
srs.idx <- srswor(4, length(frame$unit_id)) 
getdata(frame, srs.idx)
  ID_unit group unit_id
1      14     B       7
2      16     B       9
3      26     D       1
4      28     D       3
library(dplyr)
frame <- frame %>% arrange(group) # sort first
strata.idx <- sampling::strata(data = frame,      # data set
                 stratanames = "group", # variable name
                 size = c(2,3,2,1,2),      # stratum sample sizes     
                 method = "srswor")     # method for selecting within strata
getdata(frame, strata.idx)
   unit_id group ID_unit       Prob Stratum
2        2     A       2 0.28571429       1
5        5     A       5 0.28571429       1
9        2     B       9 0.30000000       2
13       6     B      13 0.30000000       2
15       8     B      15 0.30000000       2
20       3     C      20 0.25000000       3
23       6     C      23 0.25000000       3
32       7     D      32 0.07692308       4
39       1     E      39 0.16666667       5
48      10     E      48 0.16666667       5
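
The Prob column in this output is the within-stratum inclusion probability $n_h / N_h$. The sketch below reproduces it in base R; the stratum population sizes are inferred from the output in these notes (groups A to E with 7, 10, 8, 13 and 12 units), so treat them as an assumption about the frame.

```r
# Stratum population sizes, inferred from the sampling frame output
N_h <- c(A = 7, B = 10, C = 8, D = 13, E = 12)

# Stratum sample sizes passed to strata() via size =
n_h <- c(A = 2, B = 3, C = 2, D = 1, E = 2)

# Inclusion probability for every unit in stratum h
prob_h <- n_h / N_h
round(prob_h, 8)

# Sampling weight = reciprocal of the inclusion probability
w_h <- 1 / prob_h
sum(n_h * w_h)      # weights of the sampled units sum to the frame size
```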

One stage cluster

onestage.idx <- sampling::cluster(data=frame,         # Data set
                  clustername = "group",  # variable name containing clusters 
                  size = 3,               # number of clusters
                  method = "srswor",      # how to draw clusters 
                  description = TRUE)     # show descriptive output
Number of selected clusters: 3 
Number of units in the population and number of selected units: 50 30 
getdata(frame, onestage.idx)
   unit_id group ID_unit Prob
1        4     A       4  0.6
2        1     A       1  0.6
3        2     A       2  0.6
4        3     A       3  0.6
5        7     A       7  0.6
6        5     A       5  0.6
7        6     A       6  0.6
8        6     B      13  0.6
9        5     B      12  0.6
10      10     B      17  0.6
11       7     B      14  0.6
12       8     B      15  0.6
13       9     B      16  0.6
14       1     B       8  0.6
15       2     B       9  0.6
16       3     B      10  0.6
17       4     B      11  0.6
18      13     D      38  0.6
19       1     D      26  0.6
20       2     D      27  0.6
21       3     D      28  0.6
22       4     D      29  0.6
23       5     D      30  0.6
24       6     D      31  0.6
25       7     D      32  0.6
26       8     D      33  0.6
27       9     D      34  0.6
28      10     D      35  0.6
29      11     D      36  0.6
30      12     D      37  0.6

Two stage cluster

mstage.idx <- sampling::mstage(data=frame, 
                 stage = c("cluster", ""),  # sampling method for each stage, blank means SRS
                 varnames = list("group", "unit_id"),  # variable names for each stage
                 size = list(3, c(5,5,5)), # 3 psus, 5 ssus from each psu
                 method = c("srswor", "srswor"))

getdata(frame, mstage.idx)
[[1]]
   unit_id group ID_unit Prob_ 1 _stage
1        6     A       6            0.6
2        7     A       7            0.6
3        3     A       3            0.6
4        5     A       5            0.6
5        1     A       1            0.6
6        2     A       2            0.6
7        4     A       4            0.6
8        8     C      25            0.6
9        1     C      18            0.6
10       2     C      19            0.6
11       3     C      20            0.6
12       4     C      21            0.6
13       5     C      22            0.6
14       6     C      23            0.6
15       7     C      24            0.6
16      13     D      38            0.6
17       1     D      26            0.6
18       2     D      27            0.6
19       3     D      28            0.6
20       4     D      29            0.6
21       5     D      30            0.6
22       6     D      31            0.6
23       7     D      32            0.6
24       8     D      33            0.6
25       9     D      34            0.6
26      10     D      35            0.6
27      11     D      36            0.6
28      12     D      37            0.6

[[2]]
   group unit_id ID_unit Prob_ 2 _stage      Prob
1      A       7       7      0.7142857 0.4285714
2      A       3       3      0.7142857 0.4285714
3      A       5       5      0.7142857 0.4285714
4      A       1       1      0.7142857 0.4285714
5      A       2       2      0.7142857 0.4285714
6      C       1      18      0.6250000 0.3750000
7      C       2      19      0.6250000 0.3750000
8      C       3      20      0.6250000 0.3750000
9      C       5      22      0.6250000 0.3750000
10     C       7      24      0.6250000 0.3750000
11     D       3      28      0.3846154 0.2307692
12     D       7      32      0.3846154 0.2307692
13     D       8      33      0.3846154 0.2307692
14     D      11      36      0.3846154 0.2307692
15     D      12      37      0.3846154 0.2307692

Vignettes and handbooks