Formulas and Definitions

Equations and R code
Published

January 20, 2023

Parameters and Statistics

Measure Population \((\theta)\) Sample \((\hat{\theta})\)
Total \(\tau = \sum_{i=1}^N y_{i}\) \(\hat{\tau} = \sum_{i=1}^n y_{i}\)
Mean \(\mu = \frac{1}{N}\sum_{i=1}^N y_{i}\) \(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_{i}\)
Variance \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i-\mu)^2\) \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2\)
Proportion\(^*\) \(p = \mu\) \(\hat{p} = \bar{y}\)
  • \(N\) = total population size

  • \(y_{i}\) = Value of measurement \(y\) on unit \(i\)

  • \(^*\)For proportions \(y_{i}\) is a binary indicator of success. \(y \in {[0,1]}\). I.e. \(I(y_{i}=1)\).

Expected Value and Variance of estimates

\[E(\hat{\theta}) = \sum\hat{\theta}*p(\hat{\theta}) \qquad \qquad V(\hat{\theta}) = \sum [\hat{\theta} - E(\hat{\theta})]^2*p(\hat{\theta})\]

  • Sums are over all possible values of \(\hat{\theta}\).
  • \(p(\hat{\theta})\) is the probability of \(\hat{\theta}\) occurring.
    • This is similar to what we’re calling \(\delta_{i}\), the probability that population unit \(i\) appears in the sample. The book uses the symbol \(\pi_{i}\)

Weighted Estimators for Probability sampling designs

Survey weights can be used in all probability sampling designs and calculated as the reciprocal of the inclusion probability \(\delta_{i}\). Weights \(w_{i}\) are applied directly to the data value \(y\) before use in any formulas.

Population size Total Mean
\(\hat{N} = \sum_{i \in S}w_{i}\) \(\hat{\tau} = \sum_{i \in S}w_{i}y_{i}\) \(\bar{y} = \frac{\hat{\tau}}{\hat{N}}\)
  • SRS: \(w_{i} =\frac{N}{n}\)
  • Stratified: \(w_{hj}=\frac{N_h}{n_h}\)
  • One stage cluster: is an SRS of clusters.
  • Two stage cluster: \(w_{ij} = \frac{N}{n}\frac{M_{i}}{m_{i}}\)
    • Stage 1: \(P\)(\(i\)th psu selected) = \(\frac{n}{N}\)
    • Stage 2: \(P\)(\(j\)th ssu selected | \(i\)th psu selected) = \(\frac{m_i}{M_i}\)

Estimators under different Sampling Designs

Simple Random Sample

Measure Unbiased Estimate \((\hat{\theta})\) Estimated variance of \((\hat{\theta})\)
Mean \(\bar{y} = \frac{1}{n}\sum_{i\in S} y_{i}\) \(\hat{V}(\bar{y}) = (1-\frac{n}{N})\frac{s^{2}}{n}\)
Total \(\hat{\tau} = N\bar{y}\) \(\hat{V}(\hat{\tau}) = N^{2}\hat{V}(\bar{y})\)
Proportion \(\hat{p} = \bar{y}\) \(\hat{V}(\hat{p}) = (1-\frac{n}{N})\frac{\hat{p}(1-\hat{p})}{n-1}\)
  • \(i \in S\) : Unit \(i\) is an element in the sample \(S\)
  • \((1-\frac{n}{N})\): Finite population correction (fpc)

Stratified Random Sample

  • CI’s should use a \(t_{n-H}\) distribution instead of using the normal assumption.
Measure Population \((\theta)\) Sample \((\hat{\theta})\)
Sample size in strata \(h\) \(N_{h}\) \(n_{h}\)
Overall sample size \(N = \sum^{H}_{h=1} N_{h}\) \(n = \sum^{H}_{h=1} n_{h}\)
Var(y) in strata \(h\) \(\sigma_{h}^2 = \frac{1}{N_{h}-1}\sum_{j=1}^{N_{h}}(y_{hj}-\mu_{h})^2\) \(s_{h}^2 = \frac{1}{n_{h}-1}\sum_{j}(y_{hj}-\bar{y}_{h})^2\)
Total in strata \(h\) \(\tau_{h} = \sum_{j=1}^{N_{h}} y_{hj}\) \(\hat{\tau}_{h} = N_{h}\bar{y_h}\)
Overall total \(\tau = \sum_{h=1}^{H} \tau_{h}\) \(\hat{\tau}_{str} = \sum_{h=1}^{H}N_{h}\bar{y}_{h}\)
Mean in strata \(h\) \(\mu_{h} = \frac{\tau_{h}}{N_{h}}\) \(\bar{y}_{h} = \frac{1}{n_{h}}\sum_{j} y_{hj}\)
Overall mean \(\mu = \frac{\tau}{N}\) \(\bar{y}_{str} = \sum_{h=1}^{H} \frac{N_{h}}{N}\bar{y}_{h}\)
Proportion in strata \(h\) \(p_{h} = \mu_h\) \(\bar{y}_h = \hat{p}_h\)
Variance of \(p_{h}\) \(\sigma_h^2=\frac{N_h}{N_h-1} p_h(1-p_h)\) \(s_h^2=\frac{n_h}{n_h-1} \hat{p}_h(1-\hat{p}_h)\)
Overall proportion \(p_{str} = \sum_{h=1}^{H} \frac{N_h}{N} p_h\) \(\hat{p}_{str} = \sum_{h=1}^{H} \frac{N_h}{N}\hat{p}_h\)

Estimated variances;

  • Within strata total \(\hat{V}(\hat{\tau}_{h}) = (1-\frac{n_h}{N_h})N^2 \frac{s^2_h}{n_h}\)
  • Within strata mean \(\hat{V}(\bar{y}_{h}) = \frac{1}{N^2}V(\hat{\tau}_{h})\)
  • Overall population total \(\hat{V}(\hat{\tau}_{str}) = \sum_{h=1}^H (1-\frac{n_h}{N_h})N^2 \frac{s^2_h}{n_h}\)
  • Population proportion \(\hat{V}(\hat{p}_{str}) = \sum_{h=1}^H (1-\frac{n_h}{N_h}) (\frac{N_h}{N})^2 \frac{\hat{p}_h (1-\hat{p}_h)}{n_h-1}\)
  • Estimated total number of population units having a specified characteristic: \(\hat{t}_{\it{str}} = \sum_{h=1}^H N_h \hat{p}_h\)

Cluster Random Sample

Symbol Formula Description
\(y_{ij}\) measurement for the \(j\)th element in \(i\)th psu
\(N\) number of clusters (psus) in the population
\(M_{i}\) number of ssus in psu \(i\)
\(M_{0}\) \(\sum_{i=1}^{N} M_{i}\) total number of ssus in the population
\(t_{i}\) \(\sum_{j=1}^{M_{i}} y_{ij}\) total in psu \(i\)
\(t\) \(\sum_{i=1}^{N} t_{i}= \sum_{i=1}^{N} \sum_{j=1}^{M_{i}} y_{ij}\) population total
\(S_{t}^{2}\) \(\frac{1}{N-1} \sum_{i=1}^{N} (t_{i} - \frac{t}{N})^2\) population variance of the psu totals
Symbol Formula Description
\(\bar{y}_{\mu}\) \(\frac{1}{M_0} \sum_{i=1}^N \sum_{j=1}^{M_i} y_{ij}\) population mean
\(\bar{y}_{i\mu}\) \(\frac{1}{M_i} \sum_{j=1}^{M_i} y_{ij} = \frac{t_i}{M_i}\) population mean in psu \(i\)
\(S^2\) \(\frac{1}{M_0-1} \sum_{i=1}^N \sum_{j=1}^{M_i} (y_{ij}-\bar{y}_{\mu})^2\) population variance (per ssu)
\(S_{i}^{2}\) \(\frac{1}{M_i-1} \sum_{j=1}^{M_i} (y_{ij}-\bar{y}_{\mu})^2\) population variance within psu \(i\)
Symbol Formula Description
\(y_{ij}\) Measurement for\(j\)th element in \(i\)ht psu
\(N\) Number of clusters (psus) in the population
\(M_i\) Number of ssus in psu \(i\)
\(M_{o}\) \(\sum_{i=1}^NM_i\) Total number of psus in the popultion
\(t_i\) \(\sum_{j=1}^{M_i}y_{ij}\) Total in psu \(i\)
\(t\) \(\sum_{i=1}^Nt_i = \sum_{i=1}^N\sum_{j=i}^{M_i}y_{ij}\) Popultion Total
\(S_i^2\) \(\frac{1}{N-1} \sum_{i=1}^N(t_i-\frac{t}{N})^2\) Popultion variance of the psu totals

Ratio Estimation in SRS

Let \(y_{i}\) be the measure of interest, and \(x_{i}\) be the auxiliary variable. In a population of size \(N\) the totals for each measure are

\[ t_{y} = \sum_{i=1}^Ny_{i} \qquad t_{x} = \sum_{i=1}^Nx_{i} \]

and their ratio is calculated as \[ B = \frac{t_{y}}{t_{x}} = \frac{\mu_{y}}{\mu_{x}}. \]

Regression Estimation in SRS

Regression estimation uses the method of ordinary least squares to calculate estimates for the slope \(B_1\) and intercept \(B_{0}\) of a ‘line of best fit’ between the outcome measure of interest \(y\) and the auxiliary variable \(x\).

\[ y = B_0 + B_1x \]

where

\[ \hat{B}_1 = \frac{rs_y}{s_x} \qquad \hat{B}_0 = \bar{y} - \hat{B}_1\bar{x} \]

where \(r\) is the sample correlation coefficient between \(x\) and \(y\), and \(s_x\) and \(s_y\) are sample standard deviations of \(x\) and \(y\) respectively.


R commands

This is a quick reference list. See the R companion for the textbook, the survey package help file or vignette for more information.

Specify survey design svydesign

  • Function call: svydesign(id = , weights=, fpc= , data = )
  • id = variable that identifies clusters
  • weights = variable that holds the sampling weights
  • fpc = finite population correction. Typically defined in the function call.

The argument details can be found on the specified pages in the R companion for the book.

  • SRS: pg 21
  • Stratified Random Sample: pg 34
  • Cluster sampling:
  • Ratio estimates: pg 41
  • Regression estimates: pg 44

Estimators

  • mean: svymean(~x, design)
  • total: svytotal(~x, design)
  • proportion: Use svytotal and divide by N
  • CI for the mean or total: Use confint() after calculating the point estimate
  • CI for proportion: svyciprop(~x, design) Will also print out \(\hat{p}\)
  • Ratio estimate: svyratio(~numerator, ~denominator, design)
  • Regression estimate: svyglm(y~x, design)

Calculating stratum means and variances

  • The first argument of svyby is the formula for the variable(s) for which statistics are desired
  • (by=) is the variable that defines the groups.
  • Then list the design object
  • and the name of the function that calculates the statistics.
  • Set keep.var=TRUE to display the standard errors for the statistics.
svyby(~acres92, by=~region, agstrat.dsgn, svytotal, keep.var = TRUE)

Manage data collection

Setup a data collection sheet using random sampling from a sampling frame

  1. Setup your sampling frame in a spreadsheet. See this example using google sheets.
  2. Import your sampling frame into R. More info can be found on the googlesheets4 package vignette.
library(googlesheets4)
sheet.url <- "https://docs.google.com/spreadsheets/d/1L0voUNhw5CsZ3YPWsuTJCBYm963L0zFIhDKS5HUjmms"
frame.srs <- read_sheet(sheet.url, sheet = "srs") 
frame.str <- read_sheet(sheet.url, sheet = "stratified") 
frame.clu <- read_sheet(sheet.url, sheet = "cluster") 
  1. Use functions from the sampling package to draw random samples according to your design. See the links for more details on what the arguments mean.
library(sampling)
set.seed(12345)
srs.idx <- srswor(8, length(frame.srs$unit_id)) 
getdata(frame.srs, srs.idx)
  ID_unit unit_id
1       2       2
2       6       6
3       8       8
4      11      11
5      14      14
6      16      16
7      18      18
8      19      19
library(dplyr)
frame.str <- frame.str %>% arrange(group) # sort first
strata.idx <- sampling::strata(data = frame.str,      # data set
                 stratanames = "group", # variable name
                 size = c(6,4,5),      # stratum sample sizes     
                 method = "srswor")     # method for selecting within strata
getdata(frame.str, strata.idx)
   unit_id group ID_unit      Prob Stratum
1        1     A       1 0.5454545       1
2        2     A       2 0.5454545       1
6        6     A       6 0.5454545       1
7        7     A       7 0.5454545       1
10      10     A      10 0.5454545       1
11      11     A      11 0.5454545       1
12       1     B      12 0.5000000       2
17       6     B      17 0.5000000       2
18       7     B      18 0.5000000       2
19       8     B      19 0.5000000       2
21       2     C      21 0.8333333       3
22       3     C      22 0.8333333       3
23       4     C      23 0.8333333       3
24       5     C      24 0.8333333       3
25       6     C      25 0.8333333       3
  • One stage cluster
onestage.idx <- sampling::cluster(data=frame.clu,         # Data set
                  clustername = "group",  # variable name containing clusters 
                  size = 3,               # number of clusters
                  method = "srswor",      # how to draw clusters 
                  description = TRUE)     # show descriptive output
Number of selected clusters: 3 
Number of units in the population and number of selected units: 50 25 
getdata(frame.clu, onestage.idx)
   unit_id group ID_unit Prob
1        3     A       3  0.6
2        1     A       1  0.6
3        2     A       2  0.6
4        7     A       7  0.6
5        4     A       4  0.6
6        5     A       5  0.6
7        6     A       6  0.6
8        5     B      12  0.6
9        1     B       8  0.6
10       2     B       9  0.6
11       3     B      10  0.6
12       4     B      11  0.6
13       9     B      16  0.6
14       6     B      13  0.6
15       7     B      14  0.6
16       8     B      15  0.6
17      10     B      17  0.6
18       8     C      25  0.6
19       1     C      18  0.6
20       2     C      19  0.6
21       3     C      20  0.6
22       4     C      21  0.6
23       5     C      22  0.6
24       6     C      23  0.6
25       7     C      24  0.6
  • Two stage cluster
mstage.idx <- sampling::mstage(data=frame.clu, 
                 stage = c("cluster", ""),  # sampling method for each stage, blank means SRS
                 varnames = list("group", "unit_id"),  # variable names for each stage
                 size = list(3, c(5,5,5)), # 3 psus, 5 ssus from each psu
                 method = c("srswor", "srswor"))

getdata(frame.clu, mstage.idx)
[[1]]
   unit_id group ID_unit Prob_ 1 _stage
1        6     B      13            0.6
2       10     B      17            0.6
3        7     B      14            0.6
4        8     B      15            0.6
5        9     B      16            0.6
6        1     B       8            0.6
7        2     B       9            0.6
8        3     B      10            0.6
9        4     B      11            0.6
10       5     B      12            0.6
11       1     D      26            0.6
12       2     D      27            0.6
13       3     D      28            0.6
14       4     D      29            0.6
15       5     D      30            0.6
16       6     D      31            0.6
17       7     D      32            0.6
18       8     D      33            0.6
19       9     D      34            0.6
20      10     D      35            0.6
21      11     D      36            0.6
22      12     D      37            0.6
23      13     D      38            0.6
24       1     E      39            0.6
25       2     E      40            0.6
26       3     E      41            0.6
27       4     E      42            0.6
28       5     E      43            0.6
29       6     E      44            0.6
30       7     E      45            0.6
31       8     E      46            0.6
32       9     E      47            0.6
33      10     E      48            0.6
34      11     E      49            0.6
35      12     E      50            0.6

[[2]]
   group unit_id ID_unit Prob_ 2 _stage      Prob
1      B      10      17      0.5000000 0.3000000
2      B       7      14      0.5000000 0.3000000
3      B       8      15      0.5000000 0.3000000
4      B       2       9      0.5000000 0.3000000
5      B       3      10      0.5000000 0.3000000
6      D       4      29      0.3846154 0.2307692
7      D       6      31      0.3846154 0.2307692
8      D       8      33      0.3846154 0.2307692
9      D       9      34      0.3846154 0.2307692
10     D      13      38      0.3846154 0.2307692
11     E       1      39      0.4166667 0.2500000
12     E       3      41      0.4166667 0.2500000
13     E       5      43      0.4166667 0.2500000
14     E       7      45      0.4166667 0.2500000
15     E       9      47      0.4166667 0.2500000

See ?mstage for more examples of multistage sampling.

  1. Make a new spreadsheet containing the sampled rows, and add a column for \(y\) the thing you are going to measure on each unit.

  2. Go collect data!

Other R help References