This vignette describes the mathematical method for estimating confidence intervals of the structural survey and mobility and transport survey conducted by the Swiss Federal Statistical Office (FSO).

Structural survey

The FSO provides formulas to estimate populations and variances of the structural survey in German (Section 6).

The estimator depends on:

The type of variable:
- Categorical: a factor-like variable, e.g., gender, country of birth.
- Continuous: a numeric variable, e.g., income, household size.
The type of estimate:
- Total (sum across the population).
- Proportion (relative frequency) or mean (average).

Population Estimator

The estimator of variable \(y\) depends on the type of the variable and the desired statistic:

Variable type	Estimate type	Estimate
Categorical	Total	\(\hat{y} = \sum_k w_k I_c(y_k)\)
Continuous	Total	\(\hat{y} = \sum_k w_k y_k\)
Categorical	Proportion	\(\bar y = \frac{\sum_k w_k I_c(y_k)} {\sum _k w_k}\)
Continuous	Mean	\(\bar y = \frac {\sum_k w_k y_k} {\sum _k w_k}\)

where:

\(w_k\) is the sampling weight for respondent \(k\),
\(I_c = 1\) if condition(s) \(c\) is true, 0 otherwise,
\(y_k\) is the observed value for respondent \(k\).

The variance of the estimator of the variable \(y\) is approximated by the variance of the estimate of variable \(z\) defined as:

\[\hat z = \sum_{k} w_k z_k\]

where the transformation \(z_k\) depends on both the type of variable \(y\) and the desired statistic:

Variable type	Estimate type	Transformation \(z_k\)
Categorical	Total	\(z_k = I_c(y_k)\)
Continuous	Total	\(z_k = y_k\)
Categorical	Proportion	\(z_ k = \frac{ y _k - \bar y} {\sum _i w_i}\)
Continuous	Mean	\(z_k=\frac{y_k - \bar y} {\sum _i w_i}\)

Variance Estimator

The variance estimator for the estimator \(\hat{z}\) is given by:

\[\hat V(\hat z) = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2\] where:

\(h\) is index stratum (zone),
\(r_h\) is the set of respondents in stratum \(h\),
\(m_h\) is the number of respondents in \(r_h\),
\(N_h = \sum_{i \in r_h} w_i\) is the estimated population size in stratum \(h\),
\(w_i\) is the sampling weight for respondent \(i\),
\(z_i\) is a transformation of \(y_i\).
\(\hat{z}_h\) is the estimate of variable \(z\) in stratum \(h\).

The confidence interval is given by:

\[ \text{CI} = \sqrt{\hat{V}(\hat{z})} \times \text{qnorm}\left(1 - \frac{\alpha}{2}\right) \] where \(\alpha\) is the significance level, for example \(\alpha = 0.05\) for confidence interval 95%.

Simplification of Variance Estimates

Total of Categorical Variable

The estimated total for a condition c is given by:

\[\hat{N}_c = \sum_{i \in r} w_i I_c\] with corresponding variance estimate:

\[\hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2\] where:

\(\hat{N}_c\) is the total estimate of condition \(c\),
\(\hat{N}_{hc}\) is the total estimate of conditions \(c\) in stratum \(h\),

For condition \(c\), this term becomes:

\[ \begin{aligned} \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2 &= \sum_{i \notin r_{hc}} \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \\ &= \left(m_h - m_{hc}\right) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \end{aligned} \]

where \(r_{hc}\) is the set of respondents in stratum \(h\) who fulfill condition \(c\), and \(m_{hc}\) is the number of respondents in \(r_{hc}\).

Thus, the original variance estimate equation becomes:

\[ \hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \left[(m_h - m_{hc}) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2\right] \]

Mean of a Continuous Variable

The estimate of the mean of a continuous variable \(y\), for example the average rent rentnet, is given by the weighted mean:

\[\bar y = \frac{\sum_k w_k y_k}{\sum_k w_k}\]

Variance of \(\bar y\) is approximated by that of the total of variable \(\hat{z} = \sum_k w_k z_k\) where: \[z_k = \frac{y_k - \bar y}{\sum_i w_i}\]

In other words:

\[\begin{align*} \hat V(\bar y) & = \hat V(\hat z) \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2 \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\sum_{j \in r_h} w_j z_j}{m_h}\right)^2 \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i \frac{y_i - \bar y}{\sum_{j \in r_h} w_j} - \frac{\sum_{j \in r_h} w_j \left(\frac{y_j - \bar y}{\sum_{k \in r_h} w_k}\right)}{m_h}\right)^2 \end{align*}\]

Mobility and Transport Survey

From the survey (MZMV/MRMT) data, mzmv_mean() estimates:

mean or proportion of a variable in the real population: weighted mean of sub-population of interest,
confidence interval of the estimate with significance level \(\alpha\),

while mzmv_mean_map() additionally uses grouping variables.

Note that one can simply use mzmv_mean() to estimate both proportions and means, as shown below.

The FSO provides formulas to estimate variances of the MZMV/MRMT.

Means

The estimated mean is:

\[\hat{Y} = \frac{1}{\sum\limits_{i\in r} w_i}\sum_{i \in r} w_i y_i\] where:

\(w_i\) is the weight for participant \(i\),
\(y_i\) is the response of participant \(i\),
\(r\) is the set of respondents.

The confidence interval of the estimated mean is:

\[ \begin{aligned}\text{CI} &= 1.14\times Z_{\alpha}\frac{\hat{\sigma}_{y}}{\sqrt{n}}\\ &= 1.14 \times \frac{\hat{\sigma}_{y}}{\sqrt{n}} \times \text{qnorm}(1 - \frac{\alpha}2) \end{aligned}\]

where:

1.14 is a correction factor,
\(\alpha\) is the significance level, for example 0.05 for confidence interval 95%,
\(Z_{\alpha}\) is the Z-value for the desired confidence level (\(Z_{0.05} = 1.96\) for double-sided 95% confidence interval),
\(n\) is the size of set \(r\), i.e. number of respondents,
\(\hat{\sigma}_{y}^2\) is the variance of variable \(Y\) estimated with sample \(r\).

The (sample) variance of variable \(Y\) is estimated by:

\[\hat{\sigma}_{y}^2 = \frac{\sum\limits_{i\in r} w_i \left(y_i - \bar{y}\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}\] where \(\bar{y}\) is the estimated mean \(\hat{Y}\).

Proportions

If \(y_i \in \{0, 1\}\), for example possession of a car, then the mean estimate becomes the proportion estimate:

\[p = \frac{1}{\sum\limits_{i \in r} w_i} \sum_{i \in r} w_i I_c\] where:

\(w_i\) is the weight for participant \(i\),
\(I_c = 1\) if condition \(c\) is true (\(y_i = 1\)), 0 otherwise (\(y_i = 0\)),
\(r\) is the set of participants.

The sample variance in the previous section then becomes:

\[\hat{\sigma}_{p}^2 = \frac{\sum\limits_{i\in r} w_i \left(I_c - p\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}\]

Noting that \(I_c^2 = I_c\) and \(\sum\limits_i w_i I_c = p \sum\limits_i w_i\), the nominator then becomes:

\[ \begin{aligned} \sum\limits_{i\in r} w_i \left(I_c - p\right)^2 &= \sum_i w_i \left(I_c^2 +p^2 -2pI_c\right) \\ &= \sum_i w_i I_c + p^2 \sum_i w_i -2p\sum_i w_i I_c\\ &= p \sum_i w_i + p^2 \sum_i w_i - 2p^2 \sum_i w_i\\ &= p \sum_i w_i - p^2 \sum_i w_i\\ &= p(1-p) \sum_i w_i \end{aligned} \]

Therefore, the estimated sample variance becomes:

\[\hat{\sigma}_{p}^2 = \frac{p(1-p) \sum\limits_{i} w_i}{\left(\sum\limits_{i} w_i \right)- 1}\]

which when \(\sum\limits_i w_i >> 1\) can be approximated with:

\[\hat{\sigma}_{p}^2 \approx p(1-p)\]

The confidence interval for proportions could therefore be approximated with:

\[\text{CI} \approx 1.14 \times \sqrt{\frac{p(1-p)}{n}} \times \text{qnorm}(1 - \frac{\alpha} 2)\] where:

\(\alpha\) is the significance level,
\(\text{qnorm}\) outputs the Z-score for the required significance level \(\alpha\),
\(n\) is the size of set \(r\), i.e. number of respondents.

Confidence Interval - Definition

A confidence interval is a range of plausible values for a population parameter, calculated from sample data. A 95% confidence interval means that if the same sampling procedure were repeated many times, approximately 95% of the resulting intervals would contain the true population value. This does not imply that there is a 95% probability that the true value lies within any single interval, rather, it reflects the reliability of the estimation method across repeated samples.

Mathematical Method