svyby(~acres92, by=~region, agstrat.dsgn, svytotal, keep.var = TRUE)
Formulas and Definitions
Parameters and Statistics
Measure | Population \((\theta)\) | Sample \((\hat{\theta})\) |
---|---|---|
Total | \(\tau = \sum_{i=1}^N y_{i}\) | \(\hat{\tau} = \sum_{i=1}^n y_{i}\) |
Mean | \(\mu = \frac{1}{N}\sum_{i=1}^N y_{i}\) | \(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_{i}\) |
Variance | \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i-\mu)^2\) | \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2\) |
Proportion\(^*\) | \(p = \mu\) | \(\hat{p} = \bar{y}\) |
\(N\) = total population size
\(y_{i}\) = Value of measurement \(y\) on unit \(i\)
\(^*\)For proportions \(y_{i}\) is a binary indicator of success. \(y \in {[0,1]}\). I.e. \(I(y_{i}=1)\).
Expected Value and Variance of estimates
\[E(\hat{\theta}) = \sum\hat{\theta}*p(\hat{\theta}) \qquad \qquad V(\hat{\theta}) = \sum [\hat{\theta} - E(\hat{\theta})]^2*p(\hat{\theta})\]
- Sums are over all possible values of \(\hat{\theta}\).
- \(p(\hat{\theta})\) is the probability of \(\hat{\theta}\) occurring.
- This is similar to what we’re calling \(\delta_{i}\), the probability that population unit \(i\) appears in the sample. The book uses the symbol \(\pi_{i}\)
Weighted Estimators for Probability sampling designs
Survey weights can be used in all probability sampling designs and calculated as the reciprocal of the inclusion probability \(\delta_{i}\). Weights \(w_{i}\) are applied directly to the data value \(y\) before use in any formulas.
Population size | Total | Mean |
---|---|---|
\(\hat{N} = \sum_{i \in S}w_{i}\) | \(\hat{\tau} = \sum_{i \in S}w_{i}y_{i}\) | \(\bar{y} = \frac{\hat{\tau}}{\hat{N}}\) |
- SRS: \(w_{i} =\frac{N}{n}\)
- Stratified: \(w_{hj}=\frac{N_h}{n_h}\)
- One stage cluster: is an SRS of clusters.
- Two stage cluster: \(w_{ij} = \frac{N}{n}\frac{M_{i}}{m_{i}}\)
- Stage 1: \(P\)(\(i\)th psu selected) = \(\frac{n}{N}\)
- Stage 2: \(P\)(\(j\)th ssu selected | \(i\)th psu selected) = \(\frac{m_i}{M_i}\)
Estimators under different Sampling Designs
Simple Random Sample
Measure | Unbiased Estimate \((\hat{\theta})\) | Estimated variance of \((\hat{\theta})\) |
---|---|---|
Mean | \(\bar{y} = \frac{1}{n}\sum_{i\in S} y_{i}\) | \(\hat{V}(\bar{y}) = (1-\frac{n}{N})\frac{s^{2}}{n}\) |
Total | \(\hat{\tau} = N\bar{y}\) | \(\hat{V}(\hat{\tau}) = N^{2}\hat{V}(\bar{y})\) |
Proportion | \(\hat{p} = \bar{y}\) | \(\hat{V}(\hat{p}) = (1-\frac{n}{N})\frac{\hat{p}(1-\hat{p})}{n-1}\) |
- \(i \in S\) : Unit \(i\) is an element in the sample \(S\)
- \((1-\frac{n}{N})\): Finite population correction (fpc)
Stratified Random Sample
- CI’s should use a \(t_{n-H}\) distribution instead of using the normal assumption.
Measure | Population \((\theta)\) | Sample \((\hat{\theta})\) |
---|---|---|
Sample size in strata \(h\) | \(N_{h}\) | \(n_{h}\) |
Overall sample size | \(N = \sum^{H}_{h=1} N_{h}\) | \(n = \sum^{H}_{h=1} n_{h}\) |
Var(y) in strata \(h\) | \(\sigma_{h}^2 = \frac{1}{N_{h}-1}\sum_{j=1}^{N_{h}}(y_{hj}-\mu_{h})^2\) | \(s_{h}^2 = \frac{1}{n_{h}-1}\sum_{j}(y_{hj}-\bar{y}_{h})^2\) |
Total in strata \(h\) | \(\tau_{h} = \sum_{j=1}^{N_{h}} y_{hj}\) | \(\hat{\tau}_{h} = N_{h}\bar{y_h}\) |
Overall total | \(\tau = \sum_{h=1}^{H} \tau_{h}\) | \(\hat{\tau}_{str} = \sum_{h=1}^{H}N_{h}\bar{y}_{h}\) |
Mean in strata \(h\) | \(\mu_{h} = \frac{\tau_{h}}{N_{h}}\) | \(\bar{y}_{h} = \frac{1}{n_{h}}\sum_{j} y_{hj}\) |
Overall mean | \(\mu = \frac{\tau}{N}\) | \(\bar{y}_{str} = \sum_{h=1}^{H} \frac{N_{h}}{N}\bar{y}_{h}\) |
Proportion in strata \(h\) | \(p_{h} = \mu_h\) | \(\bar{y}_h = \hat{p}_h\) |
Variance of \(p_{h}\) | \(\sigma_h^2=\frac{N_h}{N_h-1} p_h(1-p_h)\) | \(s_h^2=\frac{n_h}{n_h-1} \hat{p}_h(1-\hat{p}_h)\) |
Overall proportion | \(p_{str} = \sum_{h=1}^{H} \frac{N_h}{N} p_h\) | \(\hat{p}_{str} = \sum_{h=1}^{H} \frac{N_h}{N}\hat{p}_h\) |
Estimated variances;
- Within strata total \(\hat{V}(\hat{\tau}_{h}) = (1-\frac{n_h}{N_h})N^2 \frac{s^2_h}{n_h}\)
- Within strata mean \(\hat{V}(\bar{y}_{h}) = \frac{1}{N^2}V(\hat{\tau}_{h})\)
- Overall population total \(\hat{V}(\hat{\tau}_{str}) = \sum_{h=1}^H (1-\frac{n_h}{N_h})N^2 \frac{s^2_h}{n_h}\)
- Population proportion \(\hat{V}(\hat{p}_{str}) = \sum_{h=1}^H (1-\frac{n_h}{N_h}) (\frac{N_h}{N})^2 \frac{\hat{p}_h (1-\hat{p}_h)}{n_h-1}\)
- Estimated total number of population units having a specified characteristic: \(\hat{t}_{\it{str}} = \sum_{h=1}^H N_h \hat{p}_h\)
Cluster Random Sample
Symbol | Formula | Description |
---|---|---|
\(y_{ij}\) | measurement for the \(j\)th element in \(i\)th psu | |
\(N\) | number of clusters (psus) in the population | |
\(M_{i}\) | number of ssus in psu \(i\) | |
\(M_{0}\) | \(\sum_{i=1}^{N} M_{i}\) | total number of ssus in the population |
\(t_{i}\) | \(\sum_{j=1}^{M_{i}} y_{ij}\) | total in psu \(i\) |
\(t\) | \(\sum_{i=1}^{N} t_{i}= \sum_{i=1}^{N} \sum_{j=1}^{M_{i}} y_{ij}\) | population total |
\(S_{t}^{2}\) | \(\frac{1}{N-1} \sum_{i=1}^{N} (t_{i} - \frac{t}{N})^2\) | population variance of the psu totals |
Symbol | Formula | Description |
---|---|---|
\(\bar{y}_{\mu}\) | \(\frac{1}{M_0} \sum_{i=1}^N \sum_{j=1}^{M_i} y_{ij}\) | population mean |
\(\bar{y}_{i\mu}\) | \(\frac{1}{M_i} \sum_{j=1}^{M_i} y_{ij} = \frac{t_i}{M_i}\) | population mean in psu \(i\) |
\(S^2\) | \(\frac{1}{M_0-1} \sum_{i=1}^N \sum_{j=1}^{M_i} (y_{ij}-\bar{y}_{\mu})^2\) | population variance (per ssu) |
\(S_{i}^{2}\) | \(\frac{1}{M_i-1} \sum_{j=1}^{M_i} (y_{ij}-\bar{y}_{\mu})^2\) | population variance within psu \(i\) |
Symbol | Formula | Description |
---|---|---|
\(y_{ij}\) | Measurement for\(j\)th element in \(i\)ht psu | |
\(N\) | Number of clusters (psus) in the population | |
\(M_i\) | Number of ssus in psu \(i\) | |
\(M_{o}\) | \(\sum_{i=1}^NM_i\) | Total number of psus in the popultion |
\(t_i\) | \(\sum_{j=1}^{M_i}y_{ij}\) | Total in psu \(i\) |
\(t\) | \(\sum_{i=1}^Nt_i = \sum_{i=1}^N\sum_{j=i}^{M_i}y_{ij}\) | Popultion Total |
\(S_i^2\) | \(\frac{1}{N-1} \sum_{i=1}^N(t_i-\frac{t}{N})^2\) | Popultion variance of the psu totals |
Ratio Estimation in SRS
Let \(y_{i}\) be the measure of interest, and \(x_{i}\) be the auxiliary variable. In a population of size \(N\) the totals for each measure are
\[ t_{y} = \sum_{i=1}^Ny_{i} \qquad t_{x} = \sum_{i=1}^Nx_{i} \]
and their ratio is calculated as \[ B = \frac{t_{y}}{t_{x}} = \frac{\mu_{y}}{\mu_{x}}. \]
Regression Estimation in SRS
Regression estimation uses the method of ordinary least squares to calculate estimates for the slope \(B_1\) and intercept \(B_{0}\) of a ‘line of best fit’ between the outcome measure of interest \(y\) and the auxiliary variable \(x\).
\[ y = B_0 + B_1x \]
where
\[ \hat{B}_1 = \frac{rs_y}{s_x} \qquad \hat{B}_0 = \bar{y} - \hat{B}_1\bar{x} \]
where \(r\) is the sample correlation coefficient between \(x\) and \(y\), and \(s_x\) and \(s_y\) are sample standard deviations of \(x\) and \(y\) respectively.
R commands
This is a quick reference list. See the R companion for the textbook, the survey
package help file or vignette for more information.
Specify survey design svydesign
- Function call:
svydesign(id = , weights=, fpc= , data = )
id
= variable that identifies clustersweights
= variable that holds the sampling weightsfpc
= finite population correction. Typically defined in the function call.
The argument details can be found on the specified pages in the R companion for the book.
- SRS: pg 21
- Stratified Random Sample: pg 34
- Cluster sampling:
- Ratio estimates: pg 41
- Regression estimates: pg 44
Estimators
- mean:
svymean(~x, design)
- total:
svytotal(~x, design)
- proportion: Use
svytotal
and divide byN
- CI for the mean or total: Use
confint()
after calculating the point estimate - CI for proportion:
svyciprop(~x, design)
Will also print out \(\hat{p}\) - Ratio estimate:
svyratio(~numerator, ~denominator, design)
- Regression estimate:
svyglm(y~x, design)
Calculating stratum means and variances
- The first argument of
svyby
is the formula for the variable(s) for which statistics are desired (by=)
is the variable that defines the groups.- Then list the
design
object - and the name of the function that calculates the statistics.
- Set
keep.var=TRUE
to display the standard errors for the statistics.
Manage data collection
Setup a data collection sheet using random sampling from a sampling frame
- Setup your sampling frame in a spreadsheet. See this example using google sheets.
- Import your sampling frame into R. More info can be found on the googlesheets4 package vignette.
library(googlesheets4)
<- "https://docs.google.com/spreadsheets/d/1L0voUNhw5CsZ3YPWsuTJCBYm963L0zFIhDKS5HUjmms"
sheet.url <- read_sheet(sheet.url, sheet = "srs")
frame.srs <- read_sheet(sheet.url, sheet = "stratified")
frame.str <- read_sheet(sheet.url, sheet = "cluster") frame.clu
- Use functions from the
sampling
package to draw random samples according to your design. See the links for more details on what the arguments mean.
- SRS.
library(sampling)
set.seed(12345)
<- srswor(8, length(frame.srs$unit_id))
srs.idx getdata(frame.srs, srs.idx)
ID_unit unit_id
1 2 2
2 6 6
3 8 8
4 11 11
5 14 14
6 16 16
7 18 18
8 19 19
library(dplyr)
<- frame.str %>% arrange(group) # sort first
frame.str <- sampling::strata(data = frame.str, # data set
strata.idx stratanames = "group", # variable name
size = c(6,4,5), # stratum sample sizes
method = "srswor") # method for selecting within strata
getdata(frame.str, strata.idx)
unit_id group ID_unit Prob Stratum
1 1 A 1 0.5454545 1
2 2 A 2 0.5454545 1
6 6 A 6 0.5454545 1
7 7 A 7 0.5454545 1
10 10 A 10 0.5454545 1
11 11 A 11 0.5454545 1
12 1 B 12 0.5000000 2
17 6 B 17 0.5000000 2
18 7 B 18 0.5000000 2
19 8 B 19 0.5000000 2
21 2 C 21 0.8333333 3
22 3 C 22 0.8333333 3
23 4 C 23 0.8333333 3
24 5 C 24 0.8333333 3
25 6 C 25 0.8333333 3
- One stage cluster
<- sampling::cluster(data=frame.clu, # Data set
onestage.idx clustername = "group", # variable name containing clusters
size = 3, # number of clusters
method = "srswor", # how to draw clusters
description = TRUE) # show descriptive output
Number of selected clusters: 3
Number of units in the population and number of selected units: 50 25
getdata(frame.clu, onestage.idx)
unit_id group ID_unit Prob
1 3 A 3 0.6
2 1 A 1 0.6
3 2 A 2 0.6
4 7 A 7 0.6
5 4 A 4 0.6
6 5 A 5 0.6
7 6 A 6 0.6
8 5 B 12 0.6
9 1 B 8 0.6
10 2 B 9 0.6
11 3 B 10 0.6
12 4 B 11 0.6
13 9 B 16 0.6
14 6 B 13 0.6
15 7 B 14 0.6
16 8 B 15 0.6
17 10 B 17 0.6
18 8 C 25 0.6
19 1 C 18 0.6
20 2 C 19 0.6
21 3 C 20 0.6
22 4 C 21 0.6
23 5 C 22 0.6
24 6 C 23 0.6
25 7 C 24 0.6
- Two stage cluster
<- sampling::mstage(data=frame.clu,
mstage.idx stage = c("cluster", ""), # sampling method for each stage, blank means SRS
varnames = list("group", "unit_id"), # variable names for each stage
size = list(3, c(5,5,5)), # 3 psus, 5 ssus from each psu
method = c("srswor", "srswor"))
getdata(frame.clu, mstage.idx)
[[1]]
unit_id group ID_unit Prob_ 1 _stage
1 6 B 13 0.6
2 10 B 17 0.6
3 7 B 14 0.6
4 8 B 15 0.6
5 9 B 16 0.6
6 1 B 8 0.6
7 2 B 9 0.6
8 3 B 10 0.6
9 4 B 11 0.6
10 5 B 12 0.6
11 1 D 26 0.6
12 2 D 27 0.6
13 3 D 28 0.6
14 4 D 29 0.6
15 5 D 30 0.6
16 6 D 31 0.6
17 7 D 32 0.6
18 8 D 33 0.6
19 9 D 34 0.6
20 10 D 35 0.6
21 11 D 36 0.6
22 12 D 37 0.6
23 13 D 38 0.6
24 1 E 39 0.6
25 2 E 40 0.6
26 3 E 41 0.6
27 4 E 42 0.6
28 5 E 43 0.6
29 6 E 44 0.6
30 7 E 45 0.6
31 8 E 46 0.6
32 9 E 47 0.6
33 10 E 48 0.6
34 11 E 49 0.6
35 12 E 50 0.6
[[2]]
group unit_id ID_unit Prob_ 2 _stage Prob
1 B 10 17 0.5000000 0.3000000
2 B 7 14 0.5000000 0.3000000
3 B 8 15 0.5000000 0.3000000
4 B 2 9 0.5000000 0.3000000
5 B 3 10 0.5000000 0.3000000
6 D 4 29 0.3846154 0.2307692
7 D 6 31 0.3846154 0.2307692
8 D 8 33 0.3846154 0.2307692
9 D 9 34 0.3846154 0.2307692
10 D 13 38 0.3846154 0.2307692
11 E 1 39 0.4166667 0.2500000
12 E 3 41 0.4166667 0.2500000
13 E 5 43 0.4166667 0.2500000
14 E 7 45 0.4166667 0.2500000
15 E 9 47 0.4166667 0.2500000
See ?mstage
for more examples of multistage sampling.
Make a new spreadsheet containing the sampled rows, and add a column for \(y\) the thing you are going to measure on each unit.
Go collect data!