---
title: "chensus"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{chensus}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, warning=FALSE, message=FALSE}
library(chensus)
library(dplyr)
```

# Introduction

The `chensus` package estimates population frequencies, means, proportions and confidence intervals from surveys conducted by the Federal Statistical Office (FSO):

-   structural survey: *Strukturerhebung* (SE) / *relevé structurel* (RS),
-   mobility and transport survey: *Mikrozensus Mobilität und Verkehr* (MZMV) / *Microrecensement mobilité et transports* (MRMT).

In this vignette, we demonstrate the main features of the package using the built-in `nhanes` dataset, which contains a subset of data from the [National Health and Nutrition Examination Survey](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm) for the period 2015-2016 (more with `?nhanes` and `vignette("nhanes")`). Its structure is similar to FSO survey data in that it contains `strata` and `weights` columns and demographic features such as `gender` and `household_size`.

# Structural Survey

## Total Estimates

Suppose we want to estimate the population in the `nhanes` data set by gender and birth country. We can use the main analysis function `se_total()`:

```{r}
se_total(
  data = nhanes,
  weight = weights,
  strata = strata,
  gender, birth_country
)
```

Column names can be passed programmatically with the help of `rlang`'s `!!sym()` and `!!!syms()` in the function call:

```{r}
w <- "weights"
s <- "strata"
v <- c("gender", "birth_country")

se_total(
  data = nhanes,
  strata = !!sym(s),
  weight = !!sym(w),
  !!!syms(v)
)
```

We can also estimate population in parallel for multiple groups:

```{r}
se_total_map(
  nhanes,
  weight = weights,
  strata = strata,
  gender, birth_country
)
```

If we wish to estimate population for all combinations of grouping variables including no or partial grouping, we can use `se_total_ogd()`, a wrapper function for the main `se_total()` function:

```{r}
se_total_ogd(nhanes, strata = strata, weight = weights, gender, birth_country)
```

## Proportion Estimates

We can also estimate the proportion of males and females by birth country in the `nhanes` survey:

```{r}
se_prop(
  data = nhanes,
  gender,
  birth_country,
  weight = weights,
  strata = strata
)
```

and we can display total and proportion estimates in a single table using the FSO format. The FSO publication format qualifies the reliability of estimates and hides confidential estimates (fewer than five observations):

```{r}
se_total_prop(
  data = nhanes,
  gender,
  birth_country,
  weight = weights,
  strata = strata
) |>
  fso_flag_mask()
```
## Mean Estimates

If on the other hand we wish to estimate the mean household size then we can use the function `se_mean()`:

```{r}
se_mean(
  data = nhanes,
  variable = household_size,
  strata = strata,
  weight = weights
)
```

or the wrapper function `se_mean_ogd()` for all possible combinations of grouping variables `gender` and `interview_lang`:

```{r}
se_mean_ogd(
  nhanes,
  variable = household_size,
  strata = strata,
  weight = weights,
  gender, interview_lang
)
```

and with FSO format:
```{r}
nhanes |>
  se_mean_ogd(
    variable = household_size,
    gender, birth_country,
    strata = strata,
    weight = weights,
  ) |>
  fso_flag_mask(lang = "en") # Default is "de", further possibilities: "fr", "it"
```

# Mobility Survey

If we want to estimate the mean household income then we can use `mzmv_mean()`:

```{r}
mzmv_mean(
  data = nhanes,
  variable = annual_household_income,
  weight = weights
)
```

and grouped by gender (note the variable argument must be quoted here):

```{r}
mzmv_mean_map(
  data = nhanes,
  variable = "annual_household_income",
  gender,
  weight = weights
)
```

# Flagging Estimate Reliability

`fso_flag_mask` applies FSO's reliability rules for survey estimates, based on the number of observations (`occ`). It flags low reliability estimates and masks them when sample size is too small (occ \<= 4) as follows:

|              |                             |
|--------------|-----------------------------|
| `occ <= 4`  | No estimate (confidential)  |
| `occ <= 49` | Estimate of low reliability |
| `occ > 49`  | Reliable estimate           |

```{r}
results <- nhanes |>
  se_total(
    strata = strata,
    weight = weights,
    gender,
    birth_country,
    interview_lang,
    edu_level
  )
results |>
  filter(occ < 60) |>
  fso_flag_mask() |>
  select(gender, birth_country, interview_lang, occ, total, ci, obs_status)
```