Formulas and Definitions

Equations and R code

Parameters and Statistics

Measure	Population $(θ)$	Sample $(\hat{θ})$
Total	$τ = \sum_{i = 1}^{N} y_{i}$	$\hat{τ} = \sum_{i = 1}^{n} y_{i}$
Mean	$μ = \frac{1}{N} \sum_{i = 1}^{N} y_{i}$	$\bar{y} = \frac{1}{n} \sum_{i = 1}^{n} y_{i}$
Variance	$σ^{2} = \frac{1}{N} \sum_{i = 1}^{N} (y_{i} - μ)^{2}$	$s^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}$
Proportion $^{*}$	$p = μ$	$\hat{p} = \bar{y}$

$N$ : total population size
$n$ : total sample size
$y_{i}$ : value of measurement $y$ on unit $i$
$^{*}$ For proportions $y_{i}$ is a binary indicator of success. $y \in [0, 1]$ . I.e. $I (y_{i} = 1)$ .

Expected Value and Variance

$E (\hat{θ}) = \sum \hat{θ} * p (\hat{θ})$ $V (\hat{θ}) = \sum [\hat{θ} - E (\hat{θ})]^{2} * p (\hat{θ}) -or- V (\hat{θ}) = E ({\hat{θ}}^{2}) - E (\hat{θ})^{2}$

Sums are over all possible values of $\hat{θ}$ .
$p (\hat{θ})$ is the probability of $\hat{θ}$ occurring.

Definition: Bias, Variance, Accuracy

$Bias (\hat{θ}) = E (\hat{θ}) - θ$

$V (\hat{θ}) = E [(\hat{θ} - E [\hat{θ}])^{2}]$

$MSE (\hat{θ}) = V (\hat{θ}) + [B i a s (\hat{θ})]^{2}$

Sample Weights

The sampling weights $w_{i}$ are the reciprocal of the inclusion probability $π_{i}$

SRS: $w_{i} = \frac{1}{n}$
SRSWOR: $w_{i} = \frac{N}{n}$
Stratified: $w_{h j} = \frac{N_{h}}{n_{h}}$
One stage cluster: $w_{i j} = \frac{N}{n}$

Simple Random Sample

Population Quantity	Estimator $(\hat{θ})$	Estimated variance of $(\hat{θ})$
Mean: $μ = \frac{τ}{N}$	$\frac{\hat{t}}{N} = \frac{\sum_{i \in S} w_{i} y_{i}}{\sum_{i \in S} w_{i}} = \bar{y}$	$\hat{V} (\bar{y}) = (1 - \frac{n}{N}) \frac{s^{2}}{n}$
Total: $τ = \sum_{i = 1}^{N} y_{i}$	$\hat{τ} = \frac{1}{n} \sum_{i \in S} w_{i} y_{i} = N \bar{y}$	$\hat{V} (\hat{τ}) = N^{2} \hat{V} (\bar{y})$
Proportion: $p$	$\hat{p} = \bar{y}$	$\hat{V} (\hat{p}) = (1 - \frac{n}{N}) \frac{\hat{p} (1 - \hat{p})}{n - 1}$

$i \in S$ : Unit $i$ is an element in the sample $S$
The standard error of the estimate is the square root of the estimated variance.

Stratified Random Sample

See section-04 for notation.

Population Quantity	Estimator $(\hat{θ})$	Estimated variance of $(\hat{θ})$
Within strata total: $τ_{h} = \sum_{j} y_{h j}$	${\hat{τ}}_{h} = N_{h} \bar{y_{h}}$	$\hat{V} ({\hat{τ}}_{h}) = (1 - \frac{n_{h}}{N_{h}}) N_{h}^{2} \frac{s_{h}^{2}}{n_{h}}$
Overall total: $τ = \sum_{h} τ_{h}$	${\hat{τ}}_{s t r} = \sum_{h} {\hat{τ}}_{h}$	$\hat{V} ({\hat{τ}}_{s t r}) = \sum_{h} \hat{V} ({\hat{τ}}_{h})$
Within strata mean: $μ_{h} = \frac{τ_{h}}{N_{h}}$	${\bar{y}}_{h} = \frac{1}{n_{h}} \sum_{j} y_{h j}$	$\hat{V} ({\bar{y}}_{h}) = \frac{1}{N^{2}} V ({\hat{τ}}_{h})$
Overall mean: $μ = \frac{τ}{N}$	${\bar{y}}_{s t r} = \sum_{h} \frac{N_{h}}{N} {\bar{y}}_{h}$	$\hat{V} ({\bar{y}}_{s t r}) = \frac{1}{N^{2}} \hat{V} ({\hat{τ}}_{s t r})$
Within strata proportion: $p_{h} = μ_{h}$	${\hat{p}}_{h} = {\bar{y}}_{h}$	$\hat{V} ({\hat{p}}_{h}) = (\frac{n_{h}}{n_{h} - 1}) {\hat{p}}_{h} (1 - {\hat{p}}_{h})$
Overall proportion: $p = μ$	${\hat{p}}_{s t r} = \sum_{h} \frac{N_{h}}{N} {\hat{p}}_{h}$	$\hat{V} ({\hat{p}}_{s t r}) = \sum_{h} (1 - \frac{n_{h}}{N_{h}}) (\frac{N_{h}}{N})^{2} (\frac{{\hat{p}}_{h} (1 - {\hat{p}}_{h})}{n_{h} - 1})$

$\sum_{h}$ is a simplified version of $\sum_{h = 1}^{H}$ and $\sum_{j}$ is a simplified version of $\sum_{j = 1}^{N_{h}}$

Cluster Random Sample

⚠️ Note notation change!

$y_{i j}$ : measurement for the $j$ th element in $i$ th psu
$N$ : the number of clusters (psus) in the population
- $n$ : the number of psus from the sample
$M_{i}$ : number of ssus in psu $i$ in the population
- $m_{i}$ : the number of ssu’s in psu $i$ from the sample
$M_{0}$ : total number of ssus in the population

Population Quantity	Estimator $(\hat{θ})$	Estimated variance of $(\hat{θ})$
Total in psu $i$ : $t_{i} = \sum_{j} y_{i j}$	${\hat{t}}_{i} = \frac{M_{i}}{m_{i}} \sum_{j} y_{i j}$	-
Variance of psu totals : $S_{t}^{2} = \frac{1}{N - 1} \sum_{i} {(t_{i} - \frac{t}{N})}^{2}$	$s_{t}^{2} = \frac{1}{n - 1} \sum_{i} {(t_{i} - \frac{\hat{t}}{N})}^{2}$	AKA variance between psu
Overall Total: $t = \sum_{i} t_{i}$	$\hat{t} = \frac{N}{n} \sum_{i} {\hat{t}}_{i}$	$N^{2} (1 - \frac{n}{N}) \frac{s_{t}^{2}}{n}$
mean in psu $i$ : $μ_{i} = \frac{1}{M_{i}} \sum_{j} y_{i j}$	${\bar{y}}_{i} = \frac{1}{m_{i}} \sum_{j} y_{i j}$	$s_{i}^{2} = \frac{1}{m_{i} - 1} \sum_{j} (y_{i j} - {\bar{y}}_{i})^{2}$ (AKA variance within psu)
Overall mean: $μ = \frac{1}{M_{0}} \sum_{i} \sum_{j} y_{i j}$	$\bar{y} = \frac{\hat{t}}{N M}$	$(1 - \frac{n}{N}) \frac{s_{t}^{2}}{n M^{2}}$

R commands

This is a quick reference list. See the R companion for the textbook, the package help files, vignettes or other tutorials listed at the bottom of this page for more information.

A note on missing data

If the result of any of the below functions is NA, this may indicate that your variable has missing values. Add the na.rm=TRUE argument to the svymean or svytotal functions and that will calculate a complete-case mean/total.

Analysis

The survey package supports the analysis of data collected using complex survey designs.

Specify survey design `svydesign`

Function call: svydesign(id = , weights=, fpc= , data = )
id = variable that identifies clusters
weights = variable that holds the sampling weights
fpc = finite population correction. Typically defined in the function call.

The argument details can be found on the specified pages in the R companion for the book, and in the respective sections of these notes.

SRS: pg 21
Stratified Random Sample: pg 34
Cluster sampling: p57

Estimators

mean: svymean(~x, design)
total: svytotal(~x, design)
proportion: Use svytotal and divide by N
CI for the mean or total: Use confint() after calculating the point estimate
CI for proportion: svyciprop(~x, design) Will also print out $\hat{p}$

Calculating stratum means and variances

The first argument of svyby is the formula for the variable(s) for which statistics are desired
(by=) is the variable that defines the groups.
Then list the design object
and the name of the function that calculates the statistics.
Set keep.var=TRUE to display the standard errors for the statistics.

svyby(~acres92, by=~region, agstrat.dsgn, svytotal, keep.var = TRUE)

Sampling

The sampling package allows you to take random samples from a sampling frame using different sampling frameworks in a reproducible manner.

Setup your sampling frame in a spreadsheet. This example uses google sheets and the googlesheets4 package.
Import your sampling frame into R.

library(googlesheets4)
frame <- read_sheet("https://docs.google.com/spreadsheets/d/17bg__F6Cq0zBnbPtMBsNCKNM-pyybVnhujvI2J66n_4")

Use functions from the sampling package to draw random samples according to your design. See the links for more details on what the arguments mean.

SRSWOR.

library(sampling)
set.seed(12345)
srs.idx <- srswor(4, length(frame$unit_id)) 
getdata(frame, srs.idx)

  ID_unit unit_id group
1      14      14     B
2      16      16     B
3      26      26     C
4      28      28     C

Stratified

library(dplyr)
frame <- frame %>% arrange(group) # sort first
strata.idx <- sampling::strata(data = frame,      # data set
                 stratanames = "group", # variable name
                 size = c(2,3,2,1,2),      # stratum sample sizes     
                 method = "srswor")     # method for selecting within strata
getdata(frame, strata.idx)

   unit_id group ID_unit      Prob Stratum
2        2     A       2 0.2500000       1
8        8     A       8 0.2500000       1
10      10     B      10 0.3333333       2
14      14     B      14 0.3333333       2
16      16     B      16 0.3333333       2
23      23     C      23 0.1428571       3
28      28     C      28 0.1428571       3
38      38     D      38 0.1428571       4
39      39     E      39 0.1666667       5
48      48     E      48 0.1666667       5

Cluster

One stage cluster

onestage.idx <- sampling::cluster(data=frame,         # Data set
                  clustername = "group",  # variable name containing clusters 
                  size = 3,               # number of clusters
                  method = "srswor",      # how to draw clusters 
                  description = TRUE)     # show descriptive output

Number of selected clusters: 3 
Number of units in the population and number of selected units: 50 24

getdata(frame, onestage.idx)

   unit_id group ID_unit Prob
1        3     A       3  0.6
2        2     A       2  0.6
3        7     A       7  0.6
4        4     A       4  0.6
5        1     A       1  0.6
6        6     A       6  0.6
7        8     A       8  0.6
8        5     A       5  0.6
9       12     B      12  0.6
10       9     B       9  0.6
11      11     B      11  0.6
12      16     B      16  0.6
13      13     B      13  0.6
14      10     B      10  0.6
15      15     B      15  0.6
16      17     B      17  0.6
17      14     B      14  0.6
18      38     D      38  0.6
19      32     D      32  0.6
20      33     D      33  0.6
21      34     D      34  0.6
22      35     D      35  0.6
23      36     D      36  0.6
24      37     D      37  0.6

Two stage cluster

mstage.idx <- sampling::mstage(data=frame, 
                 stage = c("cluster", ""),  # sampling method for each stage, blank means SRS
                 varnames = list("group", "unit_id"),  # variable names for each stage
                 size = list(3, c(5,5,5)), # 3 psus, 5 ssus from each psu
                 method = c("srswor", "srswor"))

getdata(frame, mstage.idx)

[[1]]
   unit_id group ID_unit Prob_ 1 _stage
1        8     A       8            0.6
2        6     A       6            0.6
3        7     A       7            0.6
4        3     A       3            0.6
5        5     A       5            0.6
6        1     A       1            0.6
7        2     A       2            0.6
8        4     A       4            0.6
9       25     C      25            0.6
10      18     C      18            0.6
11      19     C      19            0.6
12      20     C      20            0.6
13      21     C      21            0.6
14      22     C      22            0.6
15      23     C      23            0.6
16      24     C      24            0.6
17      29     C      29            0.6
18      26     C      26            0.6
19      27     C      27            0.6
20      28     C      28            0.6
21      30     C      30            0.6
22      31     C      31            0.6
23      38     D      38            0.6
24      32     D      32            0.6
25      33     D      33            0.6
26      34     D      34            0.6
27      35     D      35            0.6
28      36     D      36            0.6
29      37     D      37            0.6

[[2]]
   unit_id group ID_unit Prob_ 2 _stage      Prob
1        6     A       6      0.6250000 0.3750000
2        7     A       7      0.6250000 0.3750000
3        3     A       3      0.6250000 0.3750000
4        5     A       5      0.6250000 0.3750000
5        1     A       1      0.6250000 0.3750000
6       18     C      18      0.3571429 0.2142857
7       20     C      20      0.3571429 0.2142857
8       29     C      29      0.3571429 0.2142857
9       27     C      27      0.3571429 0.2142857
10      31     C      31      0.3571429 0.2142857
11      38     D      38      0.7142857 0.4285714
12      32     D      32      0.7142857 0.4285714
13      33     D      33      0.7142857 0.4285714
14      34     D      34      0.7142857 0.4285714
15      37     D      37      0.7142857 0.4285714