--- title: "Prepare data" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Prepare data} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 84 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Overview To run projections with `{propop}`, you need a starting population and projection parameters. If you already have this information, you need to ensure that the input files have the required structure. If you don't have the relevant data, you can download them from the Federal Statistical Office (FSO). This vignette explains how to get the data. You'll also learn how to prepare the relevant information to run population projections `{propop}`. # Required data If you don't have the information and data required to run `propop::propop()` (or `propop::project_raw()`), you can download most of the data from [STAT-TAB](https://www.pxweb.bfs.admin.ch/pxweb). More specifically, the information from the following tables are needed: +-------------------+--------------------------------+----------------------------+ | Table ID | Parameters expressed as... | Variables required for | | | | projection | +===================+================================+============================+ | px- | number of people (*reference* | - Inter-cantonal | | x-0104020000_101 | scenario) | immigration | | | | - Inter-cantonal | | | | emigration | | | | - International | | | | immigration | | | | - International | | | | emigration | | | | - end of year population | | | | size | +-------------------+--------------------------------+----------------------------+ | px- | number of people (*high | - same as 101 | | x-0104020000_102 | growth* scenario) | | +-------------------+--------------------------------+----------------------------+ | px- | number of people (*low growth* | - same as 101 | | x-0104020000_103 | scenario) | | +-------------------+--------------------------------+----------------------------+ | px- | rates / probabilities (five | - Births per mother | | x-0104020000_109 | scenarios) | | | | | - Mortality | | | | | | | | - International | | | | emigration | | | | | | | | - Inter-cantonal | | | | emigration | | | | | | | | - Acquisition of Swiss | | | | citizenship | +-------------------+--------------------------------+----------------------------+ | px | share of newborns with Swiss | - Live newborns | | -x-0104020000_106 | nationality born to non-Swiss | | | | mothers | - Live births by age and | | | | nationality of the | | | | mother (varies between | | | | cantons) | +-------------------+--------------------------------+----------------------------+ | Constant | | - Start (16) and end (50)| | parameters | | of the fertile age of | | **not** directly | | women | | available from | | | | STAT-TAB must be | | - Proportion of newborns | | provided as | | with female sex | | arguments | | (100/205) | +-------------------+--------------------------------+----------------------------+ : Overview of required FSO tables (STAT-TAB) # Convenient way to get FSO data The `propop` package provides two convenience functions to download data from the FSO. To get the starting population for a spatial unit, you must use the spelling defined in the corresponding FSO table. The entries in the FSO tables may contain special characters. The spelling may also vary between FSO tables. `BFS::bfs_get_metadata()` is helpful to identify the required spelling(s) (see further down on this page). Here's an example of how to get the population for the canton of Aargau: ```{r eval = FALSE} library(propop) ag_population <- get_population( number_fso = "px-x-0102010000_101", year_first = 2022, year_last = 2022, spatial_units = "- Aargau" ) ``` Get the parameters for a sample canton (mind using the same spelling as in the FSO tables; see comment above): ```{r get ag params, eval = FALSE} ag_parameters <- get_parameters( year_first = 2023, year_last = 2026, spatial_units = c("Aargau") ) ``` The projection can be run as follows: ```{r eval = FALSE} # select reference scenario ag_parameters_ref <- ag_parameters |> dplyr::filter(scen == "reference") propop( parameters = ag_parameters_ref, year_first = 2023, year_last = 2026, age_groups = 101, fert_first = 16, fert_last = 50, share_born_female = 100 / 205, population = ag_population, subregional = FALSE, binational = TRUE ) ``` **Note of caution:** As long as the FSO's API interface and the underlying data structure remain stable, the functions will work. However, changes in the API are likely to break the functions. # Manual way to get FSO data In case the above shouldn't work or if you want to retrace the necessary steps manually, we also provide a step-by-step description of how to get the population data and the projection parameters from the FSO. # Packages To download the data, we need the following packages: ```{r load packages, , echo= T, eval=T, results='hide', message=FALSE, warning=FALSE} library(BFS) # to download FSO data library(dplyr) # to process data library(tidyr) # to wrangle data library(propop) # to run population projection ``` # Prepare queries and download data To make the data download faster, save disk space, and avoid filtering after the download, it is advisable to specify and download only the information that we really need. To prepare such a customised, reduced data download, the [instructions](https://felixluginbuhl.com/BFS/#query-specific-elements) from the `BFS` package are very helpful. Following these instructions, we can use the `text` and `valueTexts` variables to generate a **query** dimension object for each table and to **download** the data (see following subsections). To illustrate, for table `px-x-0104020000_101`, we can obtain the meta data as follows: ```{r get meta data, echo= T, eval=TRUE, message=FALSE} metadata <- BFS::bfs_get_metadata(number_bfs = "px-x-0104020000_101") metadata_tidy <- metadata |> select(-valueTexts) |> unnest_longer(values) |> dplyr::mutate( valueTexts = metadata |> select(valueTexts) |> unnest_longer(valueTexts) |> pull(valueTexts) ) |> select(code, text, values, valueTexts, everything()) head(metadata_tidy) ``` Although the structure of the first three tables should be identical, the low growth scenario (`_103`) contains different meta information and requires some changes. ## Get "number of people" parameters Some of FSO's expectations are expressed in "number of people" parameters (first three entries in the table). These parameters indicate FSO expectations about how many people do certain things (e.g., how many 64-year old Swiss men will emigrate to another country in 2043). To prepare the download of these parameters, we can specify the following query: ```{r filter dimensions, echo = T, eval = FALSE, results='hide', message=FALSE, warning=FALSE} # Specify the elements to download dim1 <- metadata_tidy |> dplyr::filter( text == "Kanton" & # Canton valueTexts %in% c("Aargau") ) dim2 <- metadata_tidy |> dplyr::filter( text == "Geschlecht" & # sex valueTexts %in% c( "Mann", # male "Frau" ) ) # female dim3 <- metadata_tidy |> dplyr::filter( text == "Alter" & # get each age group !(valueTexts %in% "Alter - Total") ) # but exclude "Total" dim4 <- metadata_tidy |> dplyr::filter(text == "Jahr") # get all years # adapt to the different structure of the "low" scenario table dim4_103 <- metadata_tidy |> dplyr::filter( text == "Jahr" ) |> # get all years dplyr::mutate(values = as.character(0:31)) dim5 <- metadata_tidy |> dplyr::filter( text == "Staatsangehörigkeit (Kategorie)" & # nationality valueTexts %in% c( "Schweiz", # Swiss "Ausland" ) ) # Foreign / international dim6 <- metadata_tidy |> dplyr::filter( text == "Beobachtungseinheit" & # parameters for projection valueTexts %in% c( "Einwanderungen", # international immigration "Auswanderungen", # international emigration "Interkantonale Zuwanderungen", # inter-cantonal immigration "Interkantonale Abwanderungen" # inter-cantonal emigration ) ) # build dimensions list object dimensions <- list( dim1$values, dim2$values, dim3$values, dim4$values, dim5$values, dim6$values ) # add names names(dimensions) <- c( unique(dim1$code), unique(dim2$code), unique(dim3$code), unique(dim4$code), unique(dim5$code), unique(dim6$code) ) # version for _103 # build dimensions list object dimensions_103 <- list( dim1$values, dim2$values, dim3$values, dim4_103$values, dim5$values, dim6$values ) # add names names(dimensions_103) <- c( unique(dim1$code), unique(dim2$code), unique(dim3$code), unique(dim4_103$code), unique(dim5$code), unique(dim6$code) ) ``` Using the above specifications, we can **download** the FSO "number of people" parameters as follows: ```{r query, eval = FALSE, echo = TRUE, message=FALSE} # reference scenario fso_numbers_r <- BFS::bfs_get_data( number_bfs = "px-x-0104020000_101", query = dimensions ) |> rename(value = paste0( "Szenarien zur Bevölkerungsentwicklung der Kantone 2020-2050,", " Referenzszenario AR-00-2020 - zukünftige Bevölkerungsentwicklung" )) |> dplyr::mutate(scen = "reference") # high growth scenario fso_numbers_h <- BFS::bfs_get_data( number_bfs = "px-x-0104020000_102", query = dimensions ) |> rename(value = paste0( "Szenarien zur Bevölkerungsentwicklung der Kantone 2020-2050,", " 'hohes' Szenario BR-00-2020 - zukünftige Bevölkerungsentwicklung" )) |> dplyr::mutate(scen = "high") # low growth scenario fso_numbers_l <- BFS::bfs_get_data( number_bfs = "px-x-0104020000_103", query = dimensions_103 ) |> rename(value = paste0( "Szenarien zur Bevölkerungsentwicklung der Kantone 2020-2050,", " 'tiefes' Szenario CR-00-2020 - zukünftige Bevölkerungsentwicklung" )) |> dplyr::mutate(scen = "low") # combine into a single data frame fso_numbers_raw <- full_join(fso_numbers_r, fso_numbers_h) |> full_join(fso_numbers_l) ``` ```{r process, eval = FALSE, echo = FALSE} # Bring variable names and factor levels into the format required later fso_numbers <- fso_numbers_raw |> dplyr::rename( nat = "Staatsangehörigkeit (Kategorie)", sex = Geschlecht, age = Alter, year = Jahr, fso_parameter = Beobachtungseinheit ) |> # change factor levels dplyr::mutate( fso_parameter = case_match( fso_parameter, "Auswanderungen" ~ "emi_n", "Einwanderungen" ~ "imm_int", "Interkantonale Abwanderungen" ~ "interc_emi", "Interkantonale Zuwanderungen" ~ "interc_imm" ), nat = case_match( nat, "Schweiz" ~ "ch", "Ausland" ~ "int" ), sex = case_when( sex == "Mann" ~ "m", sex == "Frau" ~ "f" ), age = as.numeric(stringr::str_extract(age, "\\d+")) ) ``` ```{r quick-check-numbers, echo = F, eval=FALSE} # quick check to see if figures match STAAG input table for projection fso_numbers |> dplyr::filter( year == 2029 & scen == "reference" & nat == "int" & sex == "f" & age == 0 ) ``` ## Get "rates" and "probabilities" The FSO indicates some of its expectations as "rates" or "probabilities" (row four in the overview table at the top). To illustrate, these parameters could indicate the likelihood of 24-year old Swiss women to have a child in the year 2034. Before we can download the data, we again need the metadata: ```{r get meta data2, echo= T, eval=TRUE, message=FALSE, warning=FALSE} metadata <- BFS::bfs_get_metadata(number_bfs = "px-x-0104020000_109") metadata_tidy <- metadata |> select(-valueTexts) |> unnest_longer(values) |> dplyr::mutate( valueTexts = metadata |> select(valueTexts) |> unnest_longer(valueTexts) |> pull(valueTexts) ) |> select(code, text, values, valueTexts, everything()) head(metadata_tidy) ``` To **download** the "rate" and "probability" parameters (last row in the table), we can use the following specifications: ```{r filter dimensions2, echo = T, eval = FALSE, results='hide', message=FALSE, warning=FALSE} # Specify the elements to download dim1 <- metadata_tidy |> dplyr::filter( text == "Kanton" & # Canton valueTexts %in% c("Aargau") ) dim2 <- metadata_tidy |> dplyr::filter( text == "Szenario-Variante" & # sex valueTexts %in% c( "Referenzszenario AR-00-2020", # reference scenario "'hohes' Szenario BR-00-2020", # high growth "'tiefes' Szenario CR-00-2020" ) ) # low growth dim3 <- metadata_tidy |> dplyr::filter( text == "Staatsangehörigkeit (Kategorie)" & # nationality valueTexts %in% c( "Schweiz", # Swiss "Ausland" ) ) # Foreign / international dim4 <- metadata_tidy |> dplyr::filter( text == "Geschlecht" & # sex valueTexts %in% c( "Mann", # male "Frau" ) ) # female dim5 <- metadata_tidy |> dplyr::filter( text == "Alter" & # all 1-year age groups !(valueTexts %in% "Alter - Total") ) # but exclude "Total" dim6 <- metadata_tidy |> dplyr::filter( text == "Jahr" ) # get all years dim7 <- metadata_tidy |> dplyr::filter( text == "Beobachtungseinheit" & # type of parameter types valueTexts %in% c( "Geburtenziffern", # births "Prospektive Sterbewahrscheinlichkeiten", # mortality "Auswanderungsziffern", # international emigration "Interkantonale Abwanderungsziffern", # inter-cantonal emigration "Einbürgerungsziffern" ) ) # acquisition of Swiss citizenship # build dimensions list object dimensions <- list( dim1$values, dim2$values, dim3$values, dim4$values, dim5$values, dim6$values, dim7$values ) # add names names(dimensions) <- c( unique(dim1$code), unique(dim2$code), unique(dim3$code), unique(dim4$code), unique(dim5$code), unique(dim6$code), unique(dim7$code) ) ``` Using the above specifications, we can download the FSO "rate" parameters as follows: ```{r query2, eval = FALSE, message=FALSE,results='hide'} # Download rate parameters fso_rates_raw <- BFS::bfs_get_data( number_bfs = "px-x-0104020000_109", query = dimensions ) ``` We need to process the data to ensure that the structure of the rate parameters conforms to the expectations of the projection function: ```{r process query2, eval = FALSE, message=FALSE} # Bring variable names and factor levels into the format required later fso_rates <- fso_rates_raw |> dplyr::rename( nat = "Staatsangehörigkeit (Kategorie)", sex = Geschlecht, age = Alter, year = Jahr, fso_parameter = Beobachtungseinheit, scen = "Szenario-Variante", value = "Szenarien zur Bevölkerungsentwicklung der Kantone 2020-2050 - Ziffern" ) |> # change factor levels dplyr::mutate( scen = case_match( scen, "Referenzszenario AR-00-2020" ~ "reference", "'hohes' Szenario BR-00-2020" ~ "high", "'tiefes' Szenario CR-00-2020" ~ "low" ), nat = case_match( nat, "Schweiz" ~ "ch", "Ausland" ~ "int" ), sex = case_when( sex == "Mann" ~ "m", sex == "Frau" ~ "f" ), age = as.numeric(stringr::str_extract(age, "\\d+")), fso_parameter = case_match( fso_parameter, "Prospektive Sterbewahrscheinlichkeiten" ~ "mor", "Auswanderungsziffern" ~ "emi", "Interkantonale Abwanderungsziffern" ~ "intercant", "Einbürgerungsziffern" ~ "acq", "Geburtenziffern" ~ "birth_rate" ) ) ``` ## Get share of newborns with Swiss nationality born to non-Swiss mothers Mothers who do not have Swiss citizenship may have a Swiss partner. Their children will be Swiss nationals. To use this information in the projection, we need to download another table and compute the corresponding parameter. Before we can download the data, we again need the metadata: ```{r get meta data3, echo= T, eval=FALSE, message=FALSE, warning=FALSE} # Get meta data to determine what to download metadata <- BFS::bfs_get_metadata(number_bfs = "px-x-0104020000_106") metadata_tidy <- metadata |> dplyr::select(-valueTexts) |> tidyr::unnest_longer(values) |> dplyr::mutate( valueTexts = metadata |> dplyr::select(valueTexts) |> tidyr::unnest_longer(valueTexts) |> dplyr::pull(valueTexts) ) |> dplyr::select(code, text, values, valueTexts, everything()) ``` To **download** the "rate" and "probability" parameters (last row in the table), we can use the following specifications: ```{r filter dimensions3, echo = T, eval = FALSE, results='hide', message=FALSE, warning=FALSE} # Specify the elements to download dim1 <- metadata_tidy |> dplyr::filter( text == "Kanton" & # Canton valueTexts %in% c("Aargau") ) dim2 <- metadata_tidy |> dplyr::filter( text == "Szenario-Variante" & # scenario valueTexts %in% c( "Referenzszenario AR-00-2020", "'hohes' Szenario BR-00-2020", "'tiefes' Szenario CR-00-2020" ) ) dim3 <- metadata_tidy |> dplyr::filter( text == "Staatsangehörigkeit (Kategorie)" & # nationality valueTexts %in% "Ausland" ) dim4 <- metadata_tidy |> dplyr::filter( text == "Geschlecht" & # sex valueTexts %in% "Geschlecht - Total" ) dim5 <- metadata_tidy |> dplyr::filter( text == "Altersklasse" & # age valueTexts %in% "Altersklasse - Total" ) dim6 <- metadata_tidy |> dplyr::filter( text == "Jahr" ) # get all years dim7 <- metadata_tidy |> dplyr::filter( text == "Beobachtungseinheit" & # type of parameter types valueTexts %in% c( # live births of international newborns to international mothers "Lebendgeburten", "Lebendgeburten nach Alter und Staatsangehörigkeit der Mutter" ) ) # all live newborns to international mothers # build dimensions list object dimensions <- list( dim1$values, dim2$values, dim3$values, dim4$values, dim5$values, dim6$values, dim7$values ) # add names names(dimensions) <- c( unique(dim1$code), unique(dim2$code), unique(dim3$code), unique(dim4$code), unique(dim5$code), unique(dim6$code), unique(dim7$code) ) ``` Using the above specifications, we can download the parameter as follows: ```{r query3, eval = FALSE, message=FALSE,results='hide'} # Download rate parameters fso_births_int_ch_raw <- BFS::bfs_get_data( number_bfs = "px-x-0104020000_106", query = dimensions ) ``` We need to process the data to ensure that the structure of the parameter conforms to the expectations of the projection function: ```{r process query3, eval = FALSE, message=FALSE} # Bring variable names and factor levels into the format required later # Process data fso_births_int_ch <- fso_births_int_ch_raw |> # Compute share of Swiss newborns to international mothers tidyr::pivot_wider( names_from = Beobachtungseinheit, values_from = paste0( "Szenarien zur Bevölkerungsentwicklung der Kantone 2020-2050", " - zukünftige Bevölkerungsentwicklung" ) ) |> # use shorter, clearer names dplyr::rename( # all live births from international mothers live_birth_total = "Lebendgeburten nach Alter und Staatsangehörigkeit der Mutter", # live births of international newborns to international mothers live_birth_int = Lebendgeburten ) |> dplyr::mutate( births_int_ch = (live_birth_total - live_birth_int) / live_birth_total ) |> # Bring variable names and factor levels into the format required later dplyr::rename( scen = "Szenario-Variante", age = Altersklasse, year = Jahr ) |> # change factor levels dplyr::mutate( scen = dplyr::case_match( scen, "Referenzszenario AR-00-2020" ~ "reference", "'hohes' Szenario BR-00-2020" ~ "high", "'tiefes' Szenario CR-00-2020" ~ "low" ) ) |> # remove unnecessary variables select(year, scen, births_int_ch) |> dplyr::arrange(year, scen) ``` ## All FSO parameters Now we can merge "number of people" and "rate" parameters, make the data frame wider, and compute the required parameter `inter-cantonal net migration`: ```{r all in one, echo = T, eval=FALSE, message=F} projection_parameters <- dplyr::full_join(fso_rates, fso_numbers) |> tidyr::pivot_wider(names_from = fso_parameter, values_from = value) |> # compute inter-cantonal net migration dplyr::mutate(mig_ch = interc_imm - interc_emi) |> left_join(fso_births_int_ch, by = c("year", "scen")) |> # add mandatory column spatial_unit dplyr::mutate(spatial_unit = "Aargau") |> # remove unnecessary variables dplyr::select(-c(Kanton, intercant, emi_n, interc_imm, interc_emi)) |> dplyr::arrange(year) ``` ``` {r save-data, message = FALSE, echo=FALSE, eval = FALSE} # Save data once save(projection_parameters, file='../vignettes/prepare_projection_parameters.rda') ``` ``` {r load-data, message = FALSE, echo=FALSE, eval = TRUE} # Load data load(file='prepare_projection_parameters.rda') ``` Show parameters for one demographic group for the year 2024: ```{r all in one2, echo = T, eval=TRUE, message=F} projection_parameters |> dplyr::filter(year == 2024 & sex == "f" & nat == "int" & age == 0) |> DT::datatable() ``` ## Population In addition to the parameters, the projection function `propop` also requires a starting population. To prepare the corresponding query, we again start with the metadata: ```{r get meta data population, echo= T, eval=FALSE, results='hide', message=FALSE, warning=FALSE} metadata_pop <- BFS::bfs_get_metadata(number_bfs = "px-x-0102010000_101") metadata_pop_tidy <- metadata_pop |> select(-valueTexts) |> unnest_longer(values) |> mutate( valueTexts = metadata_pop |> select(valueTexts) |> unnest_longer(valueTexts) |> pull(valueTexts) ) |> select(code, text, values, valueTexts, everything()) ``` We can now specify which levels of the variables we want: ```{r filter dimensions 3, echo = T, eval = FALSE, results='hide', message=FALSE, warning=FALSE} # Specify the elements to download dim1 <- metadata_pop_tidy |> dplyr::filter( text == "Kanton (-) / Bezirk (>>) / Gemeinde (......)" & # Canton valueTexts %in% c("- Aargau") ) dim2 <- metadata_pop_tidy |> dplyr::filter( text == "Jahr" & # year valueTexts %in% c("2018") ) dim3 <- metadata_pop_tidy |> dplyr::filter( text == "Bevölkerungstyp" & # permanent valueTexts %in% "Ständige Wohnbevölkerung" ) dim4 <- metadata_pop_tidy |> dplyr::filter( text == "Staatsangehörigkeit (Kategorie)" & # nationality valueTexts %in% c("Schweiz", "Ausland") ) dim5 <- metadata_pop_tidy |> dplyr::filter( text == "Geschlecht" & # sex valueTexts %in% c("Mann", "Frau") ) dim6 <- metadata_pop_tidy |> dplyr::filter( text == "Alter" & # age !(valueTexts %in% "Alter - Total") ) # exclude "Total" # build dimensions list object dimensions <- list( dim1$values, dim2$values, dim3$values, dim4$values, dim5$values, dim6$values ) # add names names(dimensions) <- c( unique(dim1$code), unique(dim2$code), unique(dim3$code), unique(dim4$code), unique(dim5$code), unique(dim6$code) ) ``` Using the above specifications, we can **download** the FSO "population" as follows: ```{r get pop, eval = FALSE, message=FALSE} # Download population fso_pop_raw <- BFS::bfs_get_data( number_bfs = "px-x-0102010000_101", # reference scenario query = dimensions ) ``` ``` {r save-data2, message = FALSE, echo=FALSE, eval = FALSE} # Save data once save(fso_pop_raw, file='../vignettes/prepare_fso_pop_raw.rda') ``` ``` {r load-data2, message = FALSE, echo=FALSE, eval = TRUE} # Load data load(file='prepare_fso_pop_raw.rda') ``` We now process the data to ensure that the population data conforms to the structure expected in `propop::propop()`: ```{r prep pop, eval = TRUE, message=FALSE} # Bring variable names and factor levels into the format required later starting_population <- fso_pop_raw |> dplyr::select(-"Bevölkerungstyp") |> dplyr::rename( year = Jahr, Kanton = "Kanton (-) / Bezirk (>>) / Gemeinde (......)", nat = "Staatsangehörigkeit (Kategorie)", sex = Geschlecht, age = Alter, n = "Ständige und nichtständige Wohnbevölkerung" ) |> # change factor levels mutate( Kanton = stringr::str_remove_all(Kanton, "- "), nat = case_match( nat, "Schweiz" ~ "ch", "Ausland" ~ "int" ), sex = case_when( sex == "Mann" ~ "m", sex == "Frau" ~ "f" ), age = as.numeric(stringr::str_extract(age, "\\d+")) ) |> dplyr::rename(spatial_unit = Kanton) starting_population |> DT::datatable() ``` \ # Run population projections Now that the parameters and the starting population are available, we can run the population projections (see vignette \`run_projections\` for more details). The result is shown for one demographic group. \ ```{r project, eval = TRUE, message=FALSE} # only keep reference scenario projection_parameters_ref <- projection_parameters |> filter(scen == "reference") # run propop with data from prepare vignette to make sure vignette is okay results_clean <- propop( parameters = projection_parameters_ref, year_first = 2019, year_last = 2030, age_groups = 101, fert_first = 16, fert_last = 50, share_born_female = 100 / 205, population = starting_population, subregional = FALSE, binational = TRUE ) results_clean |> # select demographic group dplyr::filter(sex == "f" & nat == "int" & age == 49) |> dplyr::mutate(across(n, \(x) sprintf(fmt = "%.0f", x))) |> DT::datatable(filter = "top") ``` \