Formulas and Definitions

Equations and R code

Parameters and Statistics

Measure Population \((\theta)\) Sample \((\hat{\theta})\)
Total \(\tau = \sum_{i=1}^N y_{i}\) \(\hat{\tau} = \sum_{i=1}^n y_{i}\)
Mean \(\mu = \frac{1}{N}\sum_{i=1}^N y_{i}\) \(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_{i}\)
Variance \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i-\mu)^2\) \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2\)
Proportion\(^*\) \(p = \mu\) \(\hat{p} = \bar{y}\)
  • \(N\): total population size
  • \(n\): total sample size
  • \(y_{i}\): value of measurement \(y\) on unit \(i\)
  • \(^*\)For proportions \(y_{i}\) is a binary indicator of success. \(y \in {[0,1]}\). I.e. \(I(y_{i}=1)\).

Expected Value and Variance

\[E(\hat{\theta}) = \sum \hat{\theta}*p(\hat{\theta})\] \[ V(\hat{\theta}) = \sum [\hat{\theta} - E(\hat{\theta})]^2*p(\hat{\theta}) \qquad \mbox{ -or- }\qquad V(\hat{\theta}) = E(\hat{\theta}^2) - E(\hat{\theta})^2 \]

  • Sums are over all possible values of \(\hat{\theta}\).
  • \(p(\hat{\theta})\) is the probability of \(\hat{\theta}\) occurring.

Definition: Bias, Variance, Accuracy

\[\mbox{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta\]

\[V(\hat{\theta}) = E \Big[(\hat{\theta} - E[\hat{\theta}])^2\Big]\]

\[\mbox{MSE}(\hat{\theta}) = V(\hat{\theta}) + [Bias(\hat{\theta})]^2\]

Sample Weights

The sampling weights \(w_{i}\)are the reciprocal of the inclusion probability \(\pi_{i}\)

  • SRS: \(w_{i} =\frac{1}{n}\)
  • SRSWOR: \(w_{i} =\frac{N}{n}\)
  • Stratified: \(w_{hj}=\frac{N_h}{n_h}\)
  • One stage cluster: \(w_{ij}=\frac{N}{n}\)

Simple Random Sample

Population Quantity Estimator \((\hat{\theta})\) Estimated variance of \((\hat{\theta})\)
Mean: \(\mu = \frac{\tau}{N}\) \(\frac{\hat{t}}{N} = \frac{\sum_{i\in S} w_{i}y_{i}}{\sum_{i\in S}w_{i}} = \bar{y}\) \(\hat{V}(\bar{y}) = (1-\frac{n}{N})\frac{s^{2}}{n}\)
Total: \(\tau = \sum_{i=1}^{N} y_{i}\) \(\hat{\tau} = \frac{1}{n}\sum_{i\in S} w_{i}y_{i} = N\bar{y}\) \(\hat{V}(\hat{\tau}) = N^{2}\hat{V}(\bar{y})\)
Proportion: \(p\) \(\hat{p} = \bar{y}\) \(\hat{V}(\hat{p}) = (1-\frac{n}{N})\frac{\hat{p}(1-\hat{p})}{n-1}\)
  • \(i \in S\) : Unit \(i\) is an element in the sample \(S\)
  • The standard error of the estimate is the square root of the estimated variance.

Stratified Random Sample

See section-04 for notation.

Population Quantity Estimator \((\hat{\theta})\) Estimated variance of \((\hat{\theta})\)
Within strata total: \(\tau_{h} = \sum_{j} y_{hj}\) \(\hat{\tau}_{h} = N_{h}\bar{y_h}\) \(\hat{V}(\hat{\tau}_{h}) = (1-\frac{n_h}{N_h})N_{h}^2 \frac{s^2_h}{n_h}\)
Overall total: \(\tau = \sum_{h} \tau_{h}\) \(\hat{\tau}_{str} = \sum_{h}\hat{\tau}_{h}\) \(\hat{V}(\hat{\tau}_{str}) = \sum_{h} \hat{V}(\hat{\tau}_{h})\)
Within strata mean: \(\mu_{h} = \frac{\tau_{h}}{N_{h}}\) \(\bar{y}_{h} = \frac{1}{n_{h}}\sum_{j} y_{hj}\) \(\hat{V}(\bar{y}_{h}) = \frac{1}{N^2}V(\hat{\tau}_{h})\)
Overall mean:\(\mu = \frac{\tau}{N}\) \(\bar{y}_{str} = \sum_{h} \frac{N_{h}}{N}\bar{y}_{h}\) \(\hat{V}(\bar{y}_{str}) = \frac{1}{N^2}\hat{V}(\hat{\tau}_{str})\)
Within strata proportion: \(p_{h} = \mu_{h}\) \(\hat{p}_{h} = \bar{y}_{h}\) \(\hat{V}(\hat{p}_{h}) = (\frac{n_h}{n_h-1})\hat{p}_h(1-\hat{p}_h)\)
Overall proportion: \(p = \mu\) \(\hat{p}_{str} = \sum_{h}\frac{N_{h}}{N}\hat{p}_{h}\) \(\hat{V}(\hat{p}_{str}) = \sum_{h} (1-\frac{n_h}{N_h})\big(\frac{N_{h}}{N}\big)^2\Big(\frac{\hat{p}_h(1-\hat{p}_h)}{n_h-1}\Big)\)
  • \(\sum_{h}\) is a simplified version of \(\sum_{h=1}^{H}\) and \(\sum_{j}\) is a simplified version of \(\sum_{j=1}^{N_{h}}\)

Cluster Random Sample

⚠️ Note notation change!

  • \(y_{ij}\): measurement for the \(j\)th element in \(i\)th psu
  • \(N\): the number of clusters (psus) in the population
    • \(n\): the number of psus from the sample
  • \(M_{i}\): number of ssus in psu \(i\) in the population
    • \(m_i\): the number of ssu’s in psu \(i\) from the sample
  • \(M_{0}\): total number of ssus in the population
Population Quantity Estimator \((\hat{\theta})\) Estimated variance of \((\hat{\theta})\)
Total in psu \(i\): \(t_{i}=\sum_{j} y_{ij}\) \(\hat{t}_{i} = \frac{M_{i}}{m_i}\sum_{j} y_{ij}\) -
Variance of psu totals : \(S_t^2 = \frac{1}{N-1} \sum_{i} \left( t_i - \frac{t}{N} \right)^2\) \(s_t^2 = \frac{1}{n-1} \sum_{i} \left( t_i - \frac{\hat{t}}{N} \right)^2\) AKA variance between psu
Overall Total: \(t = \sum_{i} t_i\) \(\hat{t} = \frac{N}{n}\sum_{i} \hat{t}_i\) \(N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}\)
mean in psu \(i\): \(\mu_{i} = \frac{1}{M_i} \sum_{j} y_{ij}\) \(\bar{y}_{i} = \frac{1}{m_i} \sum_{j} y_{ij}\) \(s_{i}^{2} = \frac{1}{m_i-1} \sum_{j} (y_{ij}-\bar{y}_{i})^2\) (AKA variance within psu)
Overall mean: \(\mu = \frac{1}{M_0} \sum_{i} \sum_{j} y_{ij}\) \(\bar{y} = \frac{\hat{t}}{NM}\) \(\left( 1 - \frac{n}{N} \right) \frac{s_t^2}{nM^2}\)

R commands

This is a quick reference list. See the R companion for the textbook, the package help files, vignettes or other tutorials listed at the bottom of this page for more information.

A note on missing data

If the result of any of the below functions is NA, this may indicate that your variable has missing values. Add the na.rm=TRUE argument to the svymean or svytotal functions and that will calculate a complete-case mean/total.

Analysis

The survey package supports the analysis of data collected using complex survey designs.

Specify survey design svydesign

  • Function call: svydesign(id = , weights=, fpc= , data = )
  • id = variable that identifies clusters
  • weights = variable that holds the sampling weights
  • fpc = finite population correction. Typically defined in the function call.

The argument details can be found on the specified pages in the R companion for the book, and in the respective sections of these notes.

  • SRS: pg 21
  • Stratified Random Sample: pg 34
  • Cluster sampling: p57

Estimators

  • mean: svymean(~x, design)
  • total: svytotal(~x, design)
  • proportion: Use svytotal and divide by N
  • CI for the mean or total: Use confint() after calculating the point estimate
  • CI for proportion: svyciprop(~x, design) Will also print out \(\hat{p}\)

Calculating stratum means and variances

  • The first argument of svyby is the formula for the variable(s) for which statistics are desired
  • (by=) is the variable that defines the groups.
  • Then list the design object
  • and the name of the function that calculates the statistics.
  • Set keep.var=TRUE to display the standard errors for the statistics.
svyby(~acres92, by=~region, agstrat.dsgn, svytotal, keep.var = TRUE)

Sampling

The sampling package allows you to take random samples from a sampling frame using different sampling frameworks in a reproducible manner.

  1. Setup your sampling frame in a spreadsheet. This example uses google sheets and the googlesheets4 package.

  2. Import your sampling frame into R.

library(googlesheets4)
frame <- read_sheet("https://docs.google.com/spreadsheets/d/17bg__F6Cq0zBnbPtMBsNCKNM-pyybVnhujvI2J66n_4")
  1. Use functions from the sampling package to draw random samples according to your design. See the links for more details on what the arguments mean.
library(sampling)
set.seed(12345)
srs.idx <- srswor(4, length(frame$unit_id)) 
getdata(frame, srs.idx)
  ID_unit unit_id group
1      14      14     B
2      16      16     B
3      26      26     C
4      28      28     C
library(dplyr)
frame <- frame %>% arrange(group) # sort first
strata.idx <- sampling::strata(data = frame,      # data set
                 stratanames = "group", # variable name
                 size = c(2,3,2,1,2),      # stratum sample sizes     
                 method = "srswor")     # method for selecting within strata
getdata(frame, strata.idx)
   unit_id group ID_unit      Prob Stratum
2        2     A       2 0.2500000       1
8        8     A       8 0.2500000       1
10      10     B      10 0.3333333       2
14      14     B      14 0.3333333       2
16      16     B      16 0.3333333       2
23      23     C      23 0.1428571       3
28      28     C      28 0.1428571       3
38      38     D      38 0.1428571       4
39      39     E      39 0.1666667       5
48      48     E      48 0.1666667       5

One stage cluster

onestage.idx <- sampling::cluster(data=frame,         # Data set
                  clustername = "group",  # variable name containing clusters 
                  size = 3,               # number of clusters
                  method = "srswor",      # how to draw clusters 
                  description = TRUE)     # show descriptive output
Number of selected clusters: 3 
Number of units in the population and number of selected units: 50 24 
getdata(frame, onestage.idx)
   unit_id group ID_unit Prob
1        3     A       3  0.6
2        2     A       2  0.6
3        7     A       7  0.6
4        4     A       4  0.6
5        1     A       1  0.6
6        6     A       6  0.6
7        8     A       8  0.6
8        5     A       5  0.6
9       12     B      12  0.6
10       9     B       9  0.6
11      11     B      11  0.6
12      16     B      16  0.6
13      13     B      13  0.6
14      10     B      10  0.6
15      15     B      15  0.6
16      17     B      17  0.6
17      14     B      14  0.6
18      38     D      38  0.6
19      32     D      32  0.6
20      33     D      33  0.6
21      34     D      34  0.6
22      35     D      35  0.6
23      36     D      36  0.6
24      37     D      37  0.6

Two stage cluster

mstage.idx <- sampling::mstage(data=frame, 
                 stage = c("cluster", ""),  # sampling method for each stage, blank means SRS
                 varnames = list("group", "unit_id"),  # variable names for each stage
                 size = list(3, c(5,5,5)), # 3 psus, 5 ssus from each psu
                 method = c("srswor", "srswor"))

getdata(frame, mstage.idx)
[[1]]
   unit_id group ID_unit Prob_ 1 _stage
1        8     A       8            0.6
2        6     A       6            0.6
3        7     A       7            0.6
4        3     A       3            0.6
5        5     A       5            0.6
6        1     A       1            0.6
7        2     A       2            0.6
8        4     A       4            0.6
9       25     C      25            0.6
10      18     C      18            0.6
11      19     C      19            0.6
12      20     C      20            0.6
13      21     C      21            0.6
14      22     C      22            0.6
15      23     C      23            0.6
16      24     C      24            0.6
17      29     C      29            0.6
18      26     C      26            0.6
19      27     C      27            0.6
20      28     C      28            0.6
21      30     C      30            0.6
22      31     C      31            0.6
23      38     D      38            0.6
24      32     D      32            0.6
25      33     D      33            0.6
26      34     D      34            0.6
27      35     D      35            0.6
28      36     D      36            0.6
29      37     D      37            0.6

[[2]]
   unit_id group ID_unit Prob_ 2 _stage      Prob
1        6     A       6      0.6250000 0.3750000
2        7     A       7      0.6250000 0.3750000
3        3     A       3      0.6250000 0.3750000
4        5     A       5      0.6250000 0.3750000
5        1     A       1      0.6250000 0.3750000
6       18     C      18      0.3571429 0.2142857
7       20     C      20      0.3571429 0.2142857
8       29     C      29      0.3571429 0.2142857
9       27     C      27      0.3571429 0.2142857
10      31     C      31      0.3571429 0.2142857
11      38     D      38      0.7142857 0.4285714
12      32     D      32      0.7142857 0.4285714
13      33     D      33      0.7142857 0.4285714
14      34     D      34      0.7142857 0.4285714
15      37     D      37      0.7142857 0.4285714

Vignettes and handbooks