--- title: "Mathematical Method" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Method} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette describes the mathematical method for estimating confidence intervals of the structural survey and mobility and transport survey conducted by the Swiss Federal Statistical Office (FSO). # Structural survey The FSO provides [formulas to estimate populations and variances of the structural survey](https://www.bfs.admin.ch/bfs/de/home/statistiken/bevoelkerung/erhebungen/se/methodische-grundlagen-forschung-regionale-partner.assetdetail.11187024.html) in German (Section 6). The estimator depends on: - The type of variable: - Categorical: a factor-like variable, e.g., gender, country of birth. - Continuous: a numeric variable, e.g., income, household size. - The type of estimate: - Total (sum across the population). - Proportion (relative frequency) or mean (average). ## Population Estimator The estimator of variable $y$ depends on the type of the variable and the desired statistic: | Variable type | Estimate type | Estimate | |------------------------|------------------------|------------------------| | Categorical | Total | $\hat{y} = \sum_k w_k I_c(y_k)$ | | Continuous | Total | $\hat{y} = \sum_k w_k y_k$ | | Categorical | Proportion | $\bar y = \frac{\sum_k w_k I_c(y_k)} {\sum _k w_k}$ | | Continuous | Mean | $\bar y = \frac {\sum_k w_k y_k} {\sum _k w_k}$ | where: - $w_k$ is the sampling weight for respondent $k$, - $I_c = 1$ if condition(s) $c$ is true, 0 otherwise, - $y_k$ is the observed value for respondent $k$. The variance of the estimator of the variable $y$ is approximated by the variance of the estimate of variable $z$ defined as: $$\hat z = \sum_{k} w_k z_k$$ where the transformation $z_k$ depends on both the type of variable $y$ and the desired statistic: | Variable type | Estimate type | Transformation $z_k$ | |------------------------|------------------------|------------------------| | Categorical | Total | $z_k = I_c(y_k)$ | | Continuous | Total | $z_k = y_k$ | | Categorical | Proportion | $z_ k = \frac{ y _k - \bar y} {\sum _i w_i}$ | | Continuous | Mean | $z_k=\frac{y_k - \bar y} {\sum _i w_i}$ | ## Variance Estimator The variance estimator for the estimator $\hat{z}$ is given by: $$\hat V(\hat z) = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2$$ where: - $h$ is index stratum (`zone`), - $r_h$ is the set of respondents in stratum $h$, - $m_h$ is the number of respondents in $r_h$, - $N_h = \sum_{i \in r_h} w_i$ is the estimated population size in stratum $h$, - $w_i$ is the sampling weight for respondent $i$, - $z_i$ is a transformation of $y_i$. - $\hat{z}_h$ is the estimate of variable $z$ in stratum $h$. The confidence interval is given by: $$ \text{CI} = \sqrt{\hat{V}(\hat{z})} \times \text{qnorm}\left(1 - \frac{\alpha}{2}\right) $$ where $\alpha$ is the significance level, for example $\alpha = 0.05$ for [confidence interval](#confidence-interval) 95%. ## Simplification of Variance Estimates ### Total of Categorical Variable The estimated total for a condition c is given by: $$\hat{N}_c = \sum_{i \in r} w_i I_c$$ with corresponding variance estimate: $$\hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2$$ where: - $\hat{N}_c$ is the total estimate of condition $c$, - $\hat{N}_{hc}$ is the total estimate of conditions $c$ in stratum $h$, For condition $c$, this term becomes: $$ \begin{aligned} \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2 &= \sum_{i \notin r_{hc}} \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \\ &= \left(m_h - m_{hc}\right) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \end{aligned} $$ where $r_{hc}$ is the set of respondents in stratum $h$ who fulfill condition $c$, and $m_{hc}$ is the number of respondents in $r_{hc}$. Thus, the original variance estimate equation becomes: $$ \hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \left[(m_h - m_{hc}) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2\right] $$ ### Mean of a Continuous Variable The estimate of the mean of a continuous variable $y$, for example the average rent `rentnet`, is given by the weighted mean: $$\bar y = \frac{\sum_k w_k y_k}{\sum_k w_k}$$ Variance of $\bar y$ is approximated by that of the total of variable $\hat{z} = \sum_k w_k z_k$ where: $$z_k = \frac{y_k - \bar y}{\sum_i w_i}$$ In other words: \begin{align*} \hat V(\bar y) & = \hat V(\hat z) \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2 \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\sum_{j \in r_h} w_j z_j}{m_h}\right)^2 \\ & = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i \frac{y_i - \bar y}{\sum_{j \in r_h} w_j} - \frac{\sum_{j \in r_h} w_j \left(\frac{y_j - \bar y}{\sum_{k \in r_h} w_k}\right)}{m_h}\right)^2 \end{align*} # Mobility and Transport Survey From the survey (MZMV/MRMT) data, `mzmv_mean()` estimates: - mean or proportion of a variable in the real population: weighted mean of sub-population of interest, - confidence interval of the estimate with significance level $\alpha$, while `mzmv_mean_map()` additionally uses grouping variables. Note that one can simply use `mzmv_mean()` to estimate both proportions and means, as shown below. The FSO provides [formulas to estimate variances of the MZMV/MRMT](https://www.bfs.admin.ch/bfs/fr/home/statistiques/mobilite-transports/enquetes/mzmv.assetdetail.4262242.html). ## Means The estimated mean is: $$\hat{Y} = \frac{1}{\sum\limits_{i\in r} w_i}\sum_{i \in r} w_i y_i$$ where: - $w_i$ is the weight for participant $i$, - $y_i$ is the response of participant $i$, - $r$ is the set of respondents. The [confidence interval of the estimated mean](https://www.bfs.admin.ch/bfs/fr/home/statistiques/mobilite-transports/enquetes/mzmv.assetdetail.4262242.html) is: $$ \begin{aligned}\text{CI} &= 1.14\times Z_{\alpha}\frac{\hat{\sigma}_{y}}{\sqrt{n}}\\ &= 1.14 \times \frac{\hat{\sigma}_{y}}{\sqrt{n}} \times \text{qnorm}(1 - \frac{\alpha}2) \end{aligned}$$ where: - 1.14 is a correction factor, - $\alpha$ is the significance level, for example 0.05 for confidence interval 95%, - $Z_{\alpha}$ is the Z-value for the desired confidence level ($Z_{0.05} = 1.96$ for double-sided 95% confidence interval), - $n$ is the size of set $r$, i.e. number of respondents, - $\hat{\sigma}_{y}^2$ is the variance of variable $Y$ estimated with sample $r$. The (sample) variance of variable $Y$ is estimated by: $$\hat{\sigma}_{y}^2 = \frac{\sum\limits_{i\in r} w_i \left(y_i - \bar{y}\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}$$ where $\bar{y}$ is the estimated mean $\hat{Y}$. ## Proportions If $y_i \in \{0, 1\}$, for example possession of a car, then the mean estimate becomes the proportion estimate: $$p = \frac{1}{\sum\limits_{i \in r} w_i} \sum_{i \in r} w_i I_c$$ where: - $w_i$ is the weight for participant $i$, - $I_c = 1$ if condition $c$ is true ($y_i = 1$), 0 otherwise ($y_i = 0$), - $r$ is the set of participants. The sample variance in the previous section then becomes: $$\hat{\sigma}_{p}^2 = \frac{\sum\limits_{i\in r} w_i \left(I_c - p\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}$$ Noting that $I_c^2 = I_c$ and $\sum\limits_i w_i I_c = p \sum\limits_i w_i$, the nominator then becomes: $$ \begin{aligned} \sum\limits_{i\in r} w_i \left(I_c - p\right)^2 &= \sum_i w_i \left(I_c^2 +p^2 -2pI_c\right) \\ &= \sum_i w_i I_c + p^2 \sum_i w_i -2p\sum_i w_i I_c\\ &= p \sum_i w_i + p^2 \sum_i w_i - 2p^2 \sum_i w_i\\ &= p \sum_i w_i - p^2 \sum_i w_i\\ &= p(1-p) \sum_i w_i \end{aligned} $$ Therefore, the estimated sample variance becomes: $$\hat{\sigma}_{p}^2 = \frac{p(1-p) \sum\limits_{i} w_i}{\left(\sum\limits_{i} w_i \right)- 1}$$ which when $\sum\limits_i w_i >> 1$ can be approximated with: $$\hat{\sigma}_{p}^2 \approx p(1-p)$$ The confidence interval for proportions could therefore be approximated with: $$\text{CI} \approx 1.14 \times \sqrt{\frac{p(1-p)}{n}} \times \text{qnorm}(1 - \frac{\alpha} 2)$$ where: - $\alpha$ is the significance level, - $\text{qnorm}$ outputs the Z-score for the required significance level $\alpha$, - $n$ is the size of set $r$, i.e. number of respondents. # Confidence Interval - Definition {#confidence-interval} A confidence interval is a range of plausible values for a population parameter, calculated from sample data. A 95% confidence interval means that if the same sampling procedure were repeated many times, approximately 95% of the resulting intervals would contain the true population value. This does not imply that there is a 95% probability that the true value lies within any single interval, rather, it reflects the reliability of the estimation method across repeated samples.