This vignette describes the mathematical method used to estimate confidence intervals for the structural survey and for the mobility and transport survey (MZMV/MRMT), both conducted by the Swiss Federal Statistical Office (FSO).
The FSO provides formulas to estimate populations and variances of the structural survey in German (Section 6).
The estimator of variable \(y\) depends on the type of the variable and the desired statistic:
| Variable type | Estimate type | Estimate |
|---|---|---|
| Categorical | Total | \(\hat{y} = \sum_k w_k I_c(y_k)\) |
| Continuous | Total | \(\hat{y} = \sum_k w_k y_k\) |
| Categorical | Proportion | \(\bar y = \frac{\sum_k w_k I_c(y_k)} {\sum _k w_k}\) |
| Continuous | Mean | \(\bar y = \frac {\sum_k w_k y_k} {\sum _k w_k}\) |
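where \(w_k\) is the weight of respondent \(k\) and \(I_c(y_k)\) is the indicator function, equal to 1 if \(y_k\) fulfills condition \(c\) (for example, belongs to category \(c\)) and 0 otherwise.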
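The four estimators in the table can be sketched in a few lines of Python. This is only an illustration: the helper names (`weighted_total`, `weighted_mean`, `indicator`) and the data are made up for this example and are not part of any FSO tool or package.

```python
def weighted_total(weights, values):
    """Estimated total: sum of w_k * y_k."""
    return sum(w * y for w, y in zip(weights, values))


def weighted_mean(weights, values):
    """Estimated mean or proportion: weighted total divided by the sum of weights."""
    return weighted_total(weights, values) / sum(weights)


def indicator(values, category):
    """I_c(y_k): 1 if y_k equals the category c, else 0."""
    return [1 if y == category else 0 for y in values]


# Made-up data for four respondents (hypothetical weights and answers).
weights = [10.0, 20.0, 30.0, 40.0]
rent = [1000.0, 1500.0, 2000.0, 1200.0]
tenure = ["owner", "tenant", "tenant", "owner"]

is_tenant = indicator(tenure, "tenant")
total_tenants = weighted_total(weights, is_tenant)  # categorical total
share_tenants = weighted_mean(weights, is_tenant)   # categorical proportion
total_rent = weighted_total(weights, rent)          # continuous total
mean_rent = weighted_mean(weights, rent)            # continuous mean
```

A proportion is handled as the weighted mean of a 0/1 indicator, which is why a single `weighted_mean` helper covers both the proportion and the mean rows of the table.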
The variance of the estimator of the variable \(y\) is approximated by the variance of the estimate of variable \(z\) defined as:
\[\hat z = \sum_{k} w_k z_k\]
where the transformation \(z_k\) depends on both the type of variable \(y\) and the desired statistic:
| Variable type | Estimate type | Transformation \(z_k\) |
|---|---|---|
| Categorical | Total | \(z_k = I_c(y_k)\) |
| Continuous | Total | \(z_k = y_k\) |
| Categorical | Proportion | \(z_k = \frac{I_c(y_k) - \bar y}{\sum_i w_i}\) |
| Continuous | Mean | \(z_k=\frac{y_k - \bar y} {\sum _i w_i}\) |
The variance estimator for the estimator \(\hat{z}\) is given by:
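\[\hat V(\hat z) = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2\] where \(h\) indexes the strata, \(r_h\) is the set of respondents in stratum \(h\), \(m_h\) is the number of respondents in \(r_h\), \(N_h\) is the population size of stratum \(h\), and \(\hat z_h = \sum_{i \in r_h} w_i z_i\) is the estimated total of \(z\) in stratum \(h\).
The confidence interval is given by: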
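\[ \text{CI} = \sqrt{\hat{V}(\hat{z})} \times \text{qnorm}\left(1 - \frac{\alpha}{2}\right) \] where \(\alpha\) is the significance level, for example \(\alpha = 0.05\) for a 95% confidence interval, and qnorm is the quantile function of the standard normal distribution. \(\text{CI}\) is the half-width of the interval, so the interval itself is the estimate plus or minus \(\text{CI}\).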
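The variance estimator and the confidence interval above can be sketched as follows. This is a minimal illustration, not the FSO implementation: the function names, the data, and the choice to raise an error for strata with fewer than two respondents are assumptions of this example. `NormalDist().inv_cdf` from the Python standard library plays the role of qnorm.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist


def stratified_variance(weights, z, strata, pop_sizes):
    """Sum over strata h of m_h/(m_h-1) * (1 - m_h/N_h) * sum_i (w_i z_i - z_hat_h/m_h)^2."""
    wz_by_stratum = defaultdict(list)
    for w, z_i, h in zip(weights, z, strata):
        wz_by_stratum[h].append(w * z_i)

    variance = 0.0
    for h, wz in wz_by_stratum.items():
        m_h = len(wz)
        if m_h < 2:
            # Assumption of this sketch: a single respondent gives no variance estimate.
            raise ValueError(f"stratum {h!r} needs at least two respondents")
        z_hat_h = sum(wz)
        inner = sum((x - z_hat_h / m_h) ** 2 for x in wz)
        variance += m_h / (m_h - 1) * (1 - m_h / pop_sizes[h]) * inner
    return variance


def ci_half_width(variance, alpha=0.05):
    """sqrt(V) * qnorm(1 - alpha/2)."""
    return sqrt(variance) * NormalDist().inv_cdf(1 - alpha / 2)


# Made-up example: a categorical total, so z_k is the 0/1 indicator.
weights = [10.0, 10.0, 20.0, 20.0]
z = [1, 0, 1, 1]
strata = ["A", "A", "B", "B"]
pop_sizes = {"A": 20, "B": 40}  # hypothetical N_h

variance = stratified_variance(weights, z, strata, pop_sizes)
half_width = ci_half_width(variance)
```

In this toy data set stratum B contributes nothing to the variance because both of its respondents have the same weighted value, so all of the uncertainty comes from stratum A.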
The estimated total for a condition \(c\) is given by:
\[\hat{N}_c = \sum_{i \in r} w_i I_c\] with corresponding variance estimate:
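\[\hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2\] where \(I_c\) equals 1 if respondent \(i\) fulfills condition \(c\) and 0 otherwise, and \(\hat{N}_{hc} = \sum_{i \in r_h} w_i I_c\) is the estimated total for condition \(c\) in stratum \(h\).
For condition \(c\), the inner sum over \(r_h\) becomes: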
\[ \begin{aligned} \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2 &= \sum_{i \in r_h \setminus r_{hc}} \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \\ &= \left(m_h - m_{hc}\right) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \end{aligned} \]
where \(r_{hc}\) is the set of respondents in stratum \(h\) who fulfill condition \(c\), and \(m_{hc}\) is the number of respondents in \(r_{hc}\).
Thus, the original variance estimate equation becomes:
\[ \hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \left[(m_h - m_{hc}) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2\right] \]
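As a numerical check of this simplification, the sketch below computes the variance of a condition total twice: once with the sum over all respondents of a stratum, and once with the split into respondents who fulfill the condition and those who do not. The function names and the data are made up for this example, and every stratum is assumed to contain at least two respondents.

```python
from collections import defaultdict


def group_by_stratum(weights, flags, strata):
    """Collect (weight, indicator) pairs per stratum."""
    groups = defaultdict(list)
    for w, flag, h in zip(weights, flags, strata):
        groups[h].append((w, flag))
    return groups


def variance_direct(weights, flags, strata, pop_sizes):
    """Inner sum taken over every respondent in r_h."""
    total = 0.0
    for h, rows in group_by_stratum(weights, flags, strata).items():
        m_h = len(rows)
        n_hc = sum(w * flag for w, flag in rows)
        inner = sum((w * flag - n_hc / m_h) ** 2 for w, flag in rows)
        total += m_h / (m_h - 1) * (1 - m_h / pop_sizes[h]) * inner
    return total


def variance_simplified(weights, flags, strata, pop_sizes):
    """Inner sum split into r_hc and the remaining m_h - m_hc respondents."""
    total = 0.0
    for h, rows in group_by_stratum(weights, flags, strata).items():
        m_h = len(rows)
        selected = [w for w, flag in rows if flag == 1]  # weights of r_hc
        m_hc = len(selected)
        n_hc = sum(selected)
        inner = (m_h - m_hc) * (n_hc / m_h) ** 2
        inner += sum((w - n_hc / m_h) ** 2 for w in selected)
        total += m_h / (m_h - 1) * (1 - m_h / pop_sizes[h]) * inner
    return total


# Made-up data: five respondents in two strata, flag = 1 if condition c holds.
weights = [10.0, 12.0, 8.0, 20.0, 25.0]
flags = [1, 0, 1, 1, 0]
strata = ["A", "A", "A", "B", "B"]
pop_sizes = {"A": 30, "B": 50}  # hypothetical N_h

v_direct = variance_direct(weights, flags, strata, pop_sizes)
v_simplified = variance_simplified(weights, flags, strata, pop_sizes)
```

Both functions should agree up to floating-point rounding. The simplified form only needs the weights of the respondents who fulfill the condition plus the stratum counts, which can be convenient when the data are already filtered by condition.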
The estimate of the mean of a continuous variable \(y\), for example the average rent (variable rentnet), is given by the weighted mean:
\[\bar y = \frac{\sum_k w_k y_k}{\sum_k w_k}\]
The variance of \(\bar y\) is approximated by that of the total \(\hat{z} = \sum_k w_k z_k\), where: \[z_k = \frac{y_k - \bar y}{\sum_i w_i}\]
In other words:
\[\begin{aligned} \hat V(\bar y) & = \hat V(\hat z) \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2 \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\sum_{j \in r_h} w_j z_j}{m_h}\right)^2 \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i \frac{y_i - \bar y}{\sum_{k \in r} w_k} - \frac{1}{m_h}\sum_{j \in r_h} w_j \frac{y_j - \bar y}{\sum_{k \in r} w_k}\right)^2 \end{aligned}\]
From the survey (MZMV/MRMT) data, mzmv_mean() estimates means and proportions, while mzmv_mean_map() additionally uses grouping variables.
A proportion is the mean of a 0/1 variable, so mzmv_mean() alone covers both proportions and means, as shown below.
The FSO provides formulas to estimate variances of the MZMV/MRMT.
The estimated mean is:
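\[\hat{Y} = \frac{1}{\sum\limits_{i\in r} w_i}\sum_{i \in r} w_i y_i\] where \(r\) is the set of respondents, \(w_i\) is the weight of respondent \(i\), and \(y_i\) is the value observed for respondent \(i\).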
The confidence interval of the estimated mean is:
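\[ \begin{aligned}\text{CI} &= 1.14\times Z_{\alpha}\frac{\hat{\sigma}_{y}}{\sqrt{n}}\\ &= 1.14 \times \frac{\hat{\sigma}_{y}}{\sqrt{n}} \times \text{qnorm}\left(1 - \frac{\alpha}{2}\right) \end{aligned}\]
where \(\hat{\sigma}_{y}\) is the estimated standard deviation of \(y\), \(n\) is the sample size (number of respondents), and \(Z_{\alpha} = \text{qnorm}\left(1 - \frac{\alpha}{2}\right)\) is the corresponding quantile of the standard normal distribution.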
The (sample) variance of variable \(y\) is estimated by:
\[\hat{\sigma}_{y}^2 = \frac{\sum\limits_{i\in r} w_i \left(y_i - \bar{y}\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}\] where \(\bar{y}\) is the estimated mean \(\hat{Y}\).
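The weighted mean, the weighted sample variance and the resulting confidence interval can be sketched as below. This is an illustration only, not the code behind mzmv_mean(): the function name `mean_with_ci` and the data are made up, \(n\) is assumed to be the number of respondents, and the constant 1.14 is taken as-is from the formula above.

```python
from math import sqrt
from statistics import NormalDist


def mean_with_ci(weights, values, alpha=0.05, factor=1.14):
    """Return (weighted mean, CI half-width) following the formulas above."""
    w_sum = sum(weights)
    mean = sum(w * y for w, y in zip(weights, values)) / w_sum

    # Weighted sample variance with (sum of weights - 1) in the denominator.
    variance = sum(w * (y - mean) ** 2 for w, y in zip(weights, values)) / (w_sum - 1)

    n = len(values)  # assumption: n is the number of respondents
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # counterpart of qnorm(1 - alpha/2)
    half_width = factor * sqrt(variance) / sqrt(n) * z_alpha
    return mean, half_width


# Made-up data: four respondents with hypothetical weights.
weights = [100.0, 200.0, 300.0, 400.0]
values = [10.0, 20.0, 30.0, 40.0]

mean, half_width = mean_with_ci(weights, values)
lower, upper = mean - half_width, mean + half_width
```

Note that the weights enter the mean and the variance, while the division by \(\sqrt{n}\) uses the unweighted number of respondents.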
If \(y_i \in \{0, 1\}\), for example possession of a car, then the mean estimate becomes the proportion estimate:
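\[p = \frac{1}{\sum\limits_{i \in r} w_i} \sum_{i \in r} w_i I_c\] where \(I_c\) equals 1 if respondent \(i\) fulfills condition \(c\) (for example, possesses a car) and 0 otherwise.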
The sample variance in the previous section then becomes:
\[\hat{\sigma}_{p}^2 = \frac{\sum\limits_{i\in r} w_i \left(I_c - p\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}\]
Noting that \(I_c^2 = I_c\) and \(\sum\limits_i w_i I_c = p \sum\limits_i w_i\), the numerator then becomes:
\[ \begin{aligned} \sum\limits_{i\in r} w_i \left(I_c - p\right)^2 &= \sum_i w_i \left(I_c^2 +p^2 -2pI_c\right) \\ &= \sum_i w_i I_c + p^2 \sum_i w_i -2p\sum_i w_i I_c\\ &= p \sum_i w_i + p^2 \sum_i w_i - 2p^2 \sum_i w_i\\ &= p \sum_i w_i - p^2 \sum_i w_i\\ &= p(1-p) \sum_i w_i \end{aligned} \]
Therefore, the estimated sample variance becomes:
\[\hat{\sigma}_{p}^2 = \frac{p(1-p) \sum\limits_{i} w_i}{\left(\sum\limits_{i} w_i \right)- 1}\]
which, when \(\sum\limits_i w_i \gg 1\), can be approximated by:
\[\hat{\sigma}_{p}^2 \approx p(1-p)\]
The confidence interval for proportions can therefore be approximated by:
\[\text{CI} \approx 1.14 \times \sqrt{\frac{p(1-p)}{n}} \times \text{qnorm}\left(1 - \frac{\alpha}{2}\right)\] where \(p\) is the estimated proportion and \(n\) is the sample size (number of respondents).
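To see how close the approximation is, the sketch below computes the interval for a proportion twice: once with the weighted sample variance and once with \(p(1-p)\). The function name and the data are hypothetical, and \(n\) is assumed to be the number of respondents.

```python
from math import sqrt
from statistics import NormalDist


def proportion_with_ci(weights, flags, alpha=0.05, factor=1.14):
    """Return (p, half-width with exact variance, half-width with p(1-p))."""
    w_sum = sum(weights)
    n = len(flags)  # assumption: n is the number of respondents
    p = sum(w * f for w, f in zip(weights, flags)) / w_sum

    # Weighted sample variance; algebraically equal to p(1-p) * w_sum / (w_sum - 1).
    var_exact = sum(w * (f - p) ** 2 for w, f in zip(weights, flags)) / (w_sum - 1)
    var_approx = p * (1 - p)

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # counterpart of qnorm(1 - alpha/2)
    half_exact = factor * sqrt(var_exact / n) * z_alpha
    half_approx = factor * sqrt(var_approx / n) * z_alpha
    return p, half_exact, half_approx


# Made-up data: 1 = possesses a car, 0 = does not.
weights = [250.0, 250.0, 250.0, 250.0]
owns_car = [1, 0, 0, 1]

p, half_exact, half_approx = proportion_with_ci(weights, owns_car)
relative_gap = abs(half_exact - half_approx) / half_approx
```

With survey weights that sum to a population-sized number, the factor \(\sum_i w_i / (\sum_i w_i - 1)\) is very close to 1, so the two half-widths differ only marginally.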
A confidence interval is a range of plausible values for a population parameter, calculated from sample data. A 95% confidence interval means that if the same sampling procedure were repeated many times, approximately 95% of the resulting intervals would contain the true population value. This does not imply that there is a 95% probability that the true value lies within any single interval; rather, it reflects the reliability of the estimation method across repeated samples.