---
title: "Mathematical Method"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Method}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette describes the mathematical method used to estimate confidence intervals for the structural survey and for the mobility and transport survey, both conducted by the Swiss Federal Statistical Office (FSO).

# Structural Survey

The FSO provides [formulas to estimate populations and variances of the structural survey](https://www.bfs.admin.ch/bfs/de/home/statistiken/bevoelkerung/erhebungen/se/methodische-grundlagen-forschung-regionale-partner.assetdetail.11187024.html) in German (Section 6).

The estimator depends on:

-   The type of variable:
    -   Categorical: a factor-like variable, e.g., gender, country of birth.
    -   Continuous: a numeric variable, e.g., income, household size.
-   The type of estimate:
    -   Total (sum across the population).
    -   Proportion (relative frequency) or mean (average).

## Population Estimator

The estimator of variable $y$ depends on the type of the variable and the desired statistic:

| Variable type | Estimate type | Estimate |
|------------------------|------------------------|------------------------|
| Categorical | Total | $\hat{y} = \sum_k w_k I_c(y_k)$ |
| Continuous | Total | $\hat{y} = \sum_k w_k y_k$ |
| Categorical | Proportion | $\bar y = \frac{\sum_k w_k I_c(y_k)} {\sum _k w_k}$ |
| Continuous | Mean | $\bar y = \frac {\sum_k w_k y_k} {\sum _k w_k}$ |

where:

-   $w_k$ is the sampling weight for respondent $k$,
-   $I_c(y_k) = 1$ if $y_k$ fulfills condition(s) $c$, 0 otherwise,
-   $y_k$ is the observed value for respondent $k$.
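
The four estimators in the table can be sketched in base R as follows. The data frame `survey`, its columns, and all values are hypothetical and only serve as an illustration; they are not taken from the FSO data.

```r
# Hypothetical toy sample: five respondents with a sampling weight,
# a categorical variable (gender) and a continuous variable (income).
survey <- data.frame(
  w      = c(10, 20, 15, 5, 50),
  gender = c("f", "m", "f", "f", "m"),
  income = c(4000, 5200, 6100, 3000, 4500)
)

# Indicator I_c(y_k): 1 if the condition holds, 0 otherwise
is_female <- as.numeric(survey$gender == "f")

total_female <- sum(survey$w * is_female)                  # categorical total
total_income <- sum(survey$w * survey$income)              # continuous total
prop_female  <- sum(survey$w * is_female) / sum(survey$w)  # proportion
mean_income  <- sum(survey$w * survey$income) / sum(survey$w)  # weighted mean
```

With these toy values, the estimated total of women is 30 and the estimated proportion is 0.3, since the weights sum to 100.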

The variance of the estimator of the variable $y$ is approximated by the variance of the estimate of variable $z$ defined as:

$$\hat z = \sum_{k} w_k z_k$$

where the transformation $z_k$ depends on both the type of variable $y$ and the desired statistic:

| Variable type | Estimate type | Transformation $z_k$                         |
|------------------------|------------------------|------------------------|
| Categorical   | Total         | $z_k = I_c(y_k)$                             |
| Continuous    | Total         | $z_k = y_k$                                  |
| Categorical   | Proportion    | $z_k = \frac{I_c(y_k) - \bar y}{\sum_i w_i}$ |
| Continuous    | Mean          | $z_k=\frac{y_k - \bar y} {\sum _i w_i}$      |

## Variance Estimator

The variance estimator for the estimator $\hat{z}$ is given by:

$$\hat V(\hat z) =  \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2$$ where:

-   $h$ is the stratum index (`zone`),
-   $r_h$ is the set of respondents in stratum $h$,
-   $m_h$ is the number of respondents in $r_h$,
-   $N_h = \sum_{i \in r_h} w_i$ is the estimated population size in stratum $h$,
-   $w_i$ is the sampling weight for respondent $i$,
-   $z_i$ is the transformation of $y_i$ defined above,
-   $\hat{z}_h = \sum_{i \in r_h} w_i z_i$ is the estimate of variable $z$ in stratum $h$.
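
A minimal base R sketch of this variance estimator is given below. The function name `variance_z` and the toy inputs are hypothetical and do not correspond to a function of this package. The sketch assumes that every stratum contains at least two respondents, since the factor $\frac{m_h}{m_h - 1}$ is undefined otherwise.

```r
# Stratified variance estimator for z_hat = sum(w * z).
# w: sampling weights, z: transformed values,
# stratum: stratum label of each respondent (assumed >= 2 respondents each).
variance_z <- function(w, z, stratum) {
  per_stratum <- vapply(split(seq_along(w), stratum), function(idx) {
    m_h <- length(idx)   # number of respondents in stratum h
    N_h <- sum(w[idx])   # estimated population size of stratum h
    wz  <- w[idx] * z[idx]
    z_h <- sum(wz)       # estimate of z in stratum h
    m_h / (m_h - 1) * (1 - m_h / N_h) * sum((wz - z_h / m_h)^2)
  }, numeric(1))
  sum(per_stratum)
}

# Hypothetical example: two strata with two respondents each
v_hat <- variance_z(
  w       = c(10, 20, 30, 40),
  z       = c(1, 0, 1, 1),
  stratum = c("a", "a", "b", "b")
)
```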

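The half-width of the confidence interval (margin of error) is given by:

$$
\text{CI} = \sqrt{\hat{V}(\hat{z})} \times \text{qnorm}\left(1 - \frac{\alpha}{2}\right)
$$ where $\alpha$ is the significance level, for example $\alpha = 0.05$ for a 95% [confidence interval](#confidence-interval), and $\text{qnorm}$ is the quantile function of the standard normal distribution. The interval itself is the estimate $\pm\,\text{CI}$.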
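
As a small numeric illustration of this formula, the following sketch turns a variance estimate into interval bounds. The values of `v_hat` and `estimate` are made up for the example.

```r
v_hat    <- 190.5   # hypothetical variance estimate of z_hat
estimate <- 80      # hypothetical point estimate
alpha    <- 0.05    # significance level for a 95% interval

# Standard error times the standard normal quantile
ci <- sqrt(v_hat) * qnorm(1 - alpha / 2)

bounds <- c(lower = estimate - ci, upper = estimate + ci)
```

For $\alpha = 0.05$, `qnorm(1 - alpha / 2)` is approximately 1.96.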

## Simplification of Variance Estimates

### Total of Categorical Variable

The estimated total for a condition $c$ is given by:

$$\hat{N}_c = \sum_{i \in r} w_i I_c$$ with corresponding variance estimate:

$$\hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2$$ where:

-   $\hat{N}_c$ is the total estimate of condition $c$,
-   $\hat{N}_{hc}$ is the total estimate of condition $c$ in stratum $h$.

Since $I_c = 0$ for respondents who do not fulfill condition $c$, the inner sum over stratum $h$ splits into two parts:

$$
\begin{aligned}
\sum_{i \in r_h} \left(w_i I_c - \frac{\hat{N}_{hc}}{m_h}\right)^2 &= \sum_{i \notin r_{hc}} \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2 \\
&= \left(m_h - m_{hc}\right) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2
\end{aligned}
$$

where $r_{hc}$ is the set of respondents in stratum $h$ who fulfill condition $c$, and $m_{hc}$ is the number of respondents in $r_{hc}$.

Thus, the original variance estimate equation becomes:

$$
\hat{V}(\hat{N}_c) = \sum_h \frac{m_h}{m_h - 1} \left(1 - \frac{m_h}{N_h}\right) \left[(m_h - m_{hc}) \left(\frac{\hat{N}_{hc}}{m_h}\right)^2 + \sum_{i \in r_{hc}} \left(w_i - \frac{\hat{N}_{hc}}{m_h}\right)^2\right]
$$
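
The equivalence of the two forms of the inner sum can be checked numerically. The sketch below uses a single hypothetical stratum with made-up weights and indicator values.

```r
# Hypothetical stratum h: weights and condition indicator I_c
w   <- c(12, 8, 20, 15, 10)
i_c <- c(1, 0, 1, 0, 0)

m_h  <- length(w)     # respondents in the stratum
m_hc <- sum(i_c)      # respondents fulfilling condition c
N_hc <- sum(w * i_c)  # estimated total of condition c in the stratum

# Inner sum as written in the general variance estimator
original <- sum((w * i_c - N_hc / m_h)^2)

# Simplified form: respondents outside r_hc contribute a constant term
simplified <- (m_h - m_hc) * (N_hc / m_h)^2 +
  sum((w[i_c == 1] - N_hc / m_h)^2)

same <- isTRUE(all.equal(original, simplified))
```

The simplified form only needs the weights of the respondents who fulfill the condition, plus the counts $m_h$ and $m_{hc}$.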

### Mean of a Continuous Variable

The estimate of the mean of a continuous variable $y$, for example the average rent `rentnet`, is given by the weighted mean:

$$\bar y = \frac{\sum_k w_k y_k}{\sum_k w_k}$$

The variance of $\bar y$ is approximated by that of the total $\hat{z} = \sum_k w_k z_k$, where: $$z_k = \frac{y_k - \bar y}{\sum_i w_i}$$

In other words:

$$
\begin{aligned}
\hat V(\bar y) & = \hat V(\hat z) \\
& = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\hat z_h}{m_h}\right)^2 \\
& = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i z_i - \frac{\sum_{j \in r_h} w_j z_j}{m_h}\right)^2 \\
& = \sum_h \frac{m_h}{m_h - 1}\left(1 - \frac{m_h}{N_h}\right) \sum_{i \in r_h}\left(w_i \frac{y_i - \bar y}{\sum_{k \in r} w_k} - \frac{\sum_{j \in r_h} w_j \left(\frac{y_j - \bar y}{\sum_{k \in r} w_k}\right)}{m_h}\right)^2
\end{aligned}
$$

where $r$ is the set of all respondents, so that the denominator $\sum_{k \in r} w_k$ matches the definition of $z_k$ above.
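
The following base R sketch applies this derivation to a toy example. The rent values, weights, and strata are hypothetical, and the denominator of $z_k$ is taken as the sum of the weights over all respondents, as in the definition of $z_k$.

```r
# Hypothetical sample: weights, net rent (`rentnet`) and stratum labels
w       <- c(10, 20, 30, 40)
rent    <- c(1200, 1500, 900, 2000)
stratum <- c("a", "a", "b", "b")

y_bar <- sum(w * rent) / sum(w)   # weighted mean
z     <- (rent - y_bar) / sum(w)  # linearised variable z_k

# Variance of z_hat, summed over strata (assumes >= 2 respondents each)
v_hat <- sum(vapply(split(seq_along(w), stratum), function(idx) {
  m_h <- length(idx)
  N_h <- sum(w[idx])
  wz  <- w[idx] * z[idx]
  m_h / (m_h - 1) * (1 - m_h / N_h) * sum((wz - sum(wz) / m_h)^2)
}, numeric(1)))

se_mean <- sqrt(v_hat)  # standard error of the weighted mean
```

Note that $\hat z = \sum_k w_k z_k$ is zero by construction; only its variance is of interest.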

# Mobility and Transport Survey

From the survey (MZMV/MRMT) data, `mzmv_mean()` estimates:

-   mean or proportion of a variable in the real population: weighted mean of sub-population of interest,
-   confidence interval of the estimate with significance level $\alpha$,

while `mzmv_mean_map()` additionally uses grouping variables.

Note that one can simply use `mzmv_mean()` to estimate both proportions and means, as shown below.

The FSO provides [formulas to estimate variances of the MZMV/MRMT](https://www.bfs.admin.ch/bfs/fr/home/statistiques/mobilite-transports/enquetes/mzmv.assetdetail.4262242.html).

## Means

The estimated mean is:

$$\hat{Y} = \frac{1}{\sum\limits_{i\in r} w_i}\sum_{i \in r} w_i y_i$$ where:

-   $w_i$ is the weight for participant $i$,
-   $y_i$ is the response of participant $i$,
-   $r$ is the set of respondents.

The [confidence interval of the estimated mean](https://www.bfs.admin.ch/bfs/fr/home/statistiques/mobilite-transports/enquetes/mzmv.assetdetail.4262242.html) is:

$$
\begin{aligned}\text{CI} &= 
1.14\times Z_{\alpha}\frac{\hat{\sigma}_{y}}{\sqrt{n}}\\
&= 1.14 \times \frac{\hat{\sigma}_{y}}{\sqrt{n}} \times \text{qnorm}(1 - \frac{\alpha}2)
\end{aligned}$$

where:

-   1.14 is a correction factor,
-   $\alpha$ is the significance level, for example 0.05 for a 95% confidence interval,
-   $Z_{\alpha}$ is the Z-value for the desired confidence level ($Z_{0.05} = 1.96$ for a two-sided 95% confidence interval),
-   $n$ is the size of set $r$, i.e. number of respondents,
-   $\hat{\sigma}_{y}^2$ is the variance of variable $Y$ estimated with sample $r$.

The (sample) variance of variable $Y$ is estimated by:

$$\hat{\sigma}_{y}^2 = \frac{\sum\limits_{i\in r} w_i \left(y_i - \bar{y}\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}$$ where $\bar{y}$ is the estimated mean $\hat{Y}$.
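
The formulas of this section can be sketched in base R as follows. The function `ci_mean_sketch` is a hypothetical illustration written for this vignette, not the implementation of `mzmv_mean()`, and the input values are made up.

```r
# Weighted mean and confidence interval half-width, following the
# formulas above. y: responses, w: weights, alpha: significance level.
ci_mean_sketch <- function(y, w, alpha = 0.05) {
  n     <- length(y)                               # number of respondents
  y_bar <- sum(w * y) / sum(w)                     # estimated mean
  var_y <- sum(w * (y - y_bar)^2) / (sum(w) - 1)   # weighted sample variance
  ci    <- 1.14 * sqrt(var_y / n) * qnorm(1 - alpha / 2)
  c(mean = y_bar, ci = ci)
}

# Hypothetical daily distances (km) and weights of five respondents
res <- ci_mean_sketch(
  y = c(12, 30, 5, 48, 22),
  w = c(100, 150, 80, 120, 50)
)
```

Here `res[["mean"]]` is the weighted mean and `res[["ci"]]` is the half-width to add to and subtract from it.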

## Proportions

If $y_i \in \{0, 1\}$, for example possession of a car, then the mean estimate becomes the proportion estimate:

$$p = \frac{1}{\sum\limits_{i \in r} w_i} \sum_{i \in r} w_i I_c$$ where:

-   $w_i$ is the weight for participant $i$,
-   $I_c = 1$ if condition $c$ is true ($y_i = 1$), 0 otherwise ($y_i = 0$),
-   $r$ is the set of participants.

The sample variance in the previous section then becomes:

$$\hat{\sigma}_{p}^2 = \frac{\sum\limits_{i\in r} w_i \left(I_c - p\right)^2}{\left(\sum\limits_{i \in r} w_i \right)- 1}$$

Noting that $I_c^2 = I_c$ and $\sum\limits_i w_i I_c = p \sum\limits_i w_i$, the numerator then becomes:

$$
\begin{aligned} \sum\limits_{i\in r} w_i \left(I_c - p\right)^2 &=
\sum_i w_i \left(I_c^2 +p^2 -2pI_c\right) \\
&= \sum_i w_i I_c + p^2 \sum_i w_i -2p\sum_i w_i I_c\\
&= p \sum_i w_i + p^2 \sum_i w_i - 2p^2 \sum_i w_i\\
&= p \sum_i w_i - p^2 \sum_i w_i\\
&= p(1-p) \sum_i w_i
\end{aligned}
$$

Therefore, the estimated sample variance becomes:

$$\hat{\sigma}_{p}^2 = \frac{p(1-p) \sum\limits_{i} w_i}{\left(\sum\limits_{i} w_i \right)- 1}$$

which, when $\sum\limits_i w_i \gg 1$, can be approximated with:

$$\hat{\sigma}_{p}^2 \approx p(1-p)$$

The confidence interval for proportions could therefore be approximated with:

$$\text{CI} \approx 1.14 \times \sqrt{\frac{p(1-p)}{n}} \times \text{qnorm}(1 - \frac{\alpha} 2)$$ where:

-   $\alpha$ is the significance level,
-   $\text{qnorm}$ outputs the Z-score for the required significance level $\alpha$,
-   $n$ is the size of set $r$, i.e. number of respondents.
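
The quality of the approximation $\hat{\sigma}_{p}^2 \approx p(1-p)$ can be checked with a short base R sketch. The weights and the car ownership indicator below are hypothetical.

```r
# Hypothetical sample: weights and car ownership indicator (1 = owns a car)
w     <- c(100, 150, 80, 120, 50)
car   <- c(1, 0, 1, 1, 0)
alpha <- 0.05
n     <- length(car)

p <- sum(w * car) / sum(w)   # weighted proportion

var_exact  <- sum(w * (car - p)^2) / (sum(w) - 1)  # weighted sample variance
var_approx <- p * (1 - p)                          # approximation for large sum(w)

ci_exact  <- 1.14 * sqrt(var_exact / n)  * qnorm(1 - alpha / 2)
ci_approx <- 1.14 * sqrt(var_approx / n) * qnorm(1 - alpha / 2)
```

With a weight sum of 500, the exact and approximate variances differ only by the factor $500 / 499$, so the two half-widths are nearly identical.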

# Confidence Interval - Definition {#confidence-interval}

A confidence interval is a range of plausible values for a population parameter, calculated from sample data. A 95% confidence interval means that if the same sampling procedure were repeated many times, approximately 95% of the resulting intervals would contain the true population value. This does not imply that there is a 95% probability that the true value lies within any single interval; rather, it reflects the reliability of the estimation method across repeated samples.
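
The repeated-sampling interpretation can be illustrated with a small simulation. This sketch assumes simple random samples from a normal population with a known mean, which is a deliberate simplification and not the design of the FSO surveys; all parameter values are arbitrary.

```r
set.seed(42)        # fixed seed so that the sketch is reproducible
true_mean <- 10     # known population mean (assumption of the simulation)
n_rep     <- 2000   # number of repeated samples
n         <- 50     # respondents per sample

# For each sample, check whether the 95% interval contains the true mean
covered <- replicate(n_rep, {
  y          <- rnorm(n, mean = true_mean, sd = 2)
  half_width <- qnorm(0.975) * sd(y) / sqrt(n)
  abs(mean(y) - true_mean) <= half_width
})

coverage <- mean(covered)  # share of intervals containing the true mean
```

The value of `coverage` should be close to 0.95: most intervals contain the true mean, but any single interval either does or does not.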