--- title: "Projections for subregions" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Projections for subregions} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 84 chunk_output_type: console --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, warning=FALSE, message = FALSE, echo = FALSE} library(propop) ``` ```{r setup-display, eval = FALSE} library(propop) ``` # Overview This vignette explains how to use `propop::propop()` to perform population projections for multiple regions and particularly for subregions within a larger spatial entity (e.g., municipalities within a canton). It discusses two **challenges** that arise when conducting such projections and presents potential strategies for addressing them: 1. How to obtain the **required input data** for spatial entities below the level of cantons? 2. How to account for **migration between subregions**? For more background and general information about the required input data, see [this vignette](project_single_region.html)). # Projection parameters for subregions Numeric information about demographic processes such as births, migration, or deaths are an essential prerequisite for population projections. However, in Switzerland this information is typically only available for cantons but not for spatial entities at smaller scales. Supplying input data for spatial units at the sub-cantonal level (e.g. municipalities) can be straightforward for data expressed as rates (e.g. mortality rate). The simplest approach is to use the same figures for the subregions as for the canton (unless the figures are implausible for some theoretical or empirical reason). The task is more demanding, though, if you want to alter rates for subregions or when you need to downscale input data expressed as "number of people". While we don't offer a solution to adjust rates (yet), the next two sections show two possibilities to distribute the cantonal "number of people" estimates among subregions. ## Distribution of people according to population size The simplest approach to allocating canton-wide "number of people" estimates to subregions is to determine each subregion's population size relative to the canton's total population and distribute the numbers accordingly. To put it somewhat simplistically, if a municipality represents 10% of the canton's population, it should proportionally receive 10% of the canton's incoming immigrants. The approach described below is more sophisticated in that it uses the spatial units' shares per demographic group but the core idea is the same. Let us look at a concrete, numeric example. For the sake of simplicity, we use the data included in the package to create small, fictitious input data with five regions: Input parameters (after this step, the rates and numbers for the subregions are still identical to those for the whole canton): ```{r data-parameters} # FSO parameters for fictitious subregions fso_parameters_sub <- fso_parameters |> # duplicating rows 5 times tidyr::uncount(5) |> # create 5 subregions dplyr::mutate(spatial_unit = rep(1:5, times = nrow(fso_parameters))) |> dplyr::mutate(spatial_unit = as.character(spatial_unit)) ``` Population data: ```{r data-population} # Generate 5 random "cuts" to distribute the original population; # avoid extreme values with a range of 0.1 to 0.5 cut_1 <- { set.seed(1) round(runif(1, min = 0.1, max = 0.5), digits = 2) } cut_2 <- { set.seed(2) round(runif(1, min = 0.1, max = 0.5), digits = 2) } cut_3 <- { set.seed(3) round(runif(1, min = 0.1, max = 0.5), digits = 2) } cut_4 <- { set.seed(4) round(runif(1, min = 0.1, max = 0.5), digits = 2) } # make sure everything adds up to 100% cut_5 <- 1 - cut_1 - cut_2 - cut_3 - cut_4 # Generate population data for five subregions df_population <- fso_population |> # duplicating rows 5 times tidyr::uncount(5) |> # create 5 subregions dplyr::mutate( spatial_unit = as.character(rep(1:5, times = nrow(fso_population))) ) |> dplyr::mutate( # Distribute original population according to "cuts" n = dplyr::case_match( spatial_unit, "1" ~ round(n * cut_1), "2" ~ round(n * cut_2), "3" ~ round(n * cut_3), "4" ~ round(n * cut_4), "5" ~ round(n * cut_5), .default = NA ), .keep = "all" ) ``` To calculate the shares, we count the number of people in each demographic group across all spatial units (`sum_n`). Next, the `share` is obtained by dividing the spatial unit's `n` by the sum of all people across all spatial units in this demographic group (`sum_n`). Note that the sum of all five region's shares is 1 (e.g., 0.21 + 0.17 + 0.17 + 0.33 + 0.12 = `r 0.21 + 0.17 + 0.17 + 0.33 + 0.12`). ```{r shares-pop-size} # Calculate shares df_population_shares <- df_population |> dplyr::mutate(sum_n = sum(n), .by = c(nat, sex, age)) |> dplyr::mutate(share = n / sum_n) # Display table df_population_shares |> dplyr::mutate(share = round(share, 3)) |> DT::datatable() |> DT::formatStyle(c("share"), backgroundColor = DT::styleRow(c(1:5), "#96D4FF", default = "") ) ``` \ Now all required data are available and we can distribute projection parameters expressed as "number of people" between subregions. Let's do this with immigration from other countries (`imm_int_n`) and immigration from other cantons (`imm_nat_n`). We first join the data frame containing the projection parameters (`fso_parameters_sub`) and the data frame containing the shares (`df_population_shares`). Identifiers for demographic groups must be present in both data frames. The actual distribution involves only a single line per parameter in which the canton-wide number of immigrants (`imm_int_n` and `imm_nat_n`) is multiplied by the share (`share`), which in this approach is identical for both types of immigration. ```{r distribute-pop-size} parameters_sub_size <- fso_parameters_sub |> dplyr::left_join( df_population_shares |> dplyr::select("spatial_unit", "nat", "sex", "age", "share"), by = c("spatial_unit", "nat", "sex", "age") ) |> dplyr::mutate( # Calculate number of incoming people per demographic group and spatial unit imm_int_n_distr = imm_int_n * share, imm_nat_n_distr = imm_nat_n * share ) ``` Let's take a closer look at the result of this, focusing on immigration from other countries. In the table below, the blue columns `check` and `difference` show that the sum of the distributed parameter (`imm_int_n_distr`) adds up to the total number of people (`imm_int_n`; i.e., the original figures provided by the FSO for the whole canton). ```{r distr-pop-size-check} parameters_sub_size |> dplyr::mutate( check = round(sum(imm_int_n_distr), 0), .by = c(year, nat, sex, age) ) |> dplyr::filter(sex == "m" & nat == "int") |> dplyr::mutate(across(c( "int_mothers":"emi_nat", "acq", "share":"imm_nat_n_distr" ), \(x) round(x, 3))) |> dplyr::mutate(difference = imm_int_n - check) |> dplyr::select(-scen) |> DT::datatable() |> DT::formatStyle( "imm_int_n", backgroundColor = "#ffcc8f" ) |> DT::formatStyle( c("check", "difference"), backgroundColor = "#96D4FF" ) ``` \ To proceed with the projection, the columns with the distributed immigration (`imm_int_n_distr` and `imm_nat_n_distr`) need to become the new `imm_int_n` and `imm_nat_n`. Otherwise, `propop()` won't recognize the parameters. ``` {r cleaning} parameters_sub_size_clean <- parameters_sub_size |> dplyr::mutate( imm_int_n = imm_int_n_distr, imm_nat_n = imm_nat_n_distr ) |> dplyr::select(-share, -imm_int_n_distr) ``` Now we can run the projection: ```{r project-1} propop( parameters = parameters_sub_size_clean, year_first = 2019, year_last = 2020, age_groups = 101, fert_first = 16, fert_last = 50, share_born_female = 100 / 205, population = df_population, binational = TRUE, subregional = FALSE ) ``` ## Distribution of people according to past migration The second approach to distribute "number of people" estimates among subregions uses historical migration records. An advantage of this approach is that it can differentiate between different types of migration (or any number-of-people input) and thereby enables adjusting each parameter independently (rather than using the same share for adapting all parameters). This is great, for example, if immigration from other cantons differs from immigration from other countries. To illustrate this approach, we use international immigration as an example and create a fictitious data frame with five regions. For pedagogical reasons, we modify the data a little bit. Spatial unit 1 won't have any international immigration among children aged 0-4 years. Consequently, the value in the column `hist_imm_int` contains zeros for these entries. This doesn't reflect realistic conditions but helps to illustrate the suggested procedure in cases where zero observations exist for demographic groups. With real data, the same can be achieved by summarizing the historical migration data. Since real migration usually varies from year to year, we advise calculating an average across several years. The fictitious data assumes that we have already performed this step. ```{r data} set.seed(145) # Immigration records df_hist_imm <- tibble::tibble( # five fictitious spatial units spatial_unit = rep(c("1", "2", "3", "4", "5"), each = 101 * 4), # two nationalities nat = rep(rep(c("ch", "int"), each = 2 * 101), times = 5), # two levels for sex sex = rep(rep(c("m", "f"), each = 101), times = 5 * 2), # age groups from 0 to 100 age = rep(0:100, times = 5 * 4) ) |> dplyr::mutate( # random numbers between zero and 50 are created to mimic historical # immigration records observed in Aargau hist_imm_nat = sample(0:50, dplyr::n(), replace = TRUE), hist_imm_int = sample(0:50, dplyr::n(), replace = TRUE), # for one spatial unit, we artificially assign zeros to all children # between 0-4 years hist_imm_int = ifelse(spatial_unit == 1 & age %in% c(0:4), 0, hist_imm_int) ) ``` The resulting data frame includes two columns with (fictitious) immigration from abroad (` hist_imm_nat`) and from other cantons (`hist_imm_nat`), aggregated per demographic group: ```{r data-historical-immigration} df_hist_imm |> DT::datatable() ``` \ A challenge is that records often vary considerably between years. If the patterns are too uneven, future trends may become erratic, especially in groups with few people (e.g., small municipalities or small age groups). A first step to mitigate this issue should already have happened at this stage, namely considering several years and using the arithmetic mean as an estimate of each demographic group's past migration. `propop::calculate_shares()` offers an additional remedy. Instead of only using shares based on 1-year age groups, it is possible to resort to 5-year and 10-year age groups, which further smooths out irregular patterns. To use this function, you need to provide a data frame containing a column with the number of immigrants per demographic group (e.g., `hist_imm_nat` or `hist_imm_int` as calculated above). The function assigns each 1-year age group to a 5-year and a 10-year age group (e.g., 2-year olds become part of the age_groups 0-4 and 0-9), sums up all observations within the respective 5-year and 10-year age group (e.g., 0-4 and 0-9 year old Swiss males). These sums (`sum_5`and `sum_10` in the table below) are then divided by five or ten, respectively, and proportionally assigned to each member of the age group (`prop_5` and `prop_10` in the table below). To **distribute the canton-wide numbers** among subregions, `propop::calculate_shares()` uses the 1-year age group if its mean share is larger than zero. If no immigration was recorded for a particular 1-year age group (e.g., the 76-year old Swiss males in spatial unit 1), the mean share of the corresponding 5-year age group is used (i.e., 75-79-year old Swiss males in the case of 76-year old). If the share across the 5-year age group is also zero, the 10-year age group is used (this could be 70-79 year old Swiss males). In this case, the `prop_10` share of the respective 10-year age group is used for both 5-year age groups within the 10-year age group (e.g., the 70-74 and 75-79 year olds). The variable `use_age_group` indicates which share is suggested by this default algorithm. The final outcome `share` indicates the proportion of the historical total per demographic group (`n_sum`) that is to be allocated to the respective spatial unit. (`share = n / n_sum`). ```{r result-historical-immigration} data_distr_hist_int <- df_hist_imm |> calculate_shares(col = "hist_imm_int", age_group = "age_group_5") data_distr_hist_int |> # Display rounded numbers to save space dplyr::mutate(share = round(share, 3)) |> DT::datatable() |> DT::formatStyle( "n", backgroundColor = "#96D4FF" ) |> DT::formatStyle( "n_sum", backgroundColor = "#96D4FF" ) |> DT::formatStyle( "share", backgroundColor = "#007AB8" ) ``` \ Note that summarizing the shares of a demographic group across all spatial units always adds up to 1: ```{r share-check} data_distr_hist_int |> dplyr::summarise(sum_share = round( sum(share, na.rm = TRUE), digits = 1), .by = c(nat, sex, age)) |> DT::datatable() ``` \ From now on, the procedure is identical to the first approach that distributed the number of people according to population size. That is, the share is multiplied with the numbers that the FSO estimated for the whole canton. ```{r parameters-distributed} # In addition to international immigration shares and numbers # we also need the shares and numbers for national immigration data_distr_hist_nat <- df_hist_imm |> calculate_shares(col = "hist_imm_nat") |> # Use unambiguous name dplyr::rename(share_imm_nat = share) |> # Drop unnecessary variables dplyr::select(-c( "hist_imm_nat", "hist_imm_int", "age_group_5", "sum_5", "prop_5", "age_group_10", "sum_10", "prop_10", "use_age_group", "n", "n_sum" )) # Join both data frames holding shares data_distr_hist <- data_distr_hist_int |> # Use unambiguous name dplyr::rename(share_imm_int = share) |> # Drop unnecessary variables dplyr::select(-c( "hist_imm_nat", "hist_imm_int", "age_group_5", "age_group_10", "sum_10", "prop_10", "use_age_group", "n", "n_sum" )) |> dplyr::left_join(data_distr_hist_nat, by = c("spatial_unit", "nat", "sex", "age") ) # Add shares to the data frame that holds the projection parameters fso_parameters_sub_distr_hist <- fso_parameters_sub |> dplyr::left_join( data_distr_hist |> dplyr::select( "spatial_unit", "nat", "sex", "age", "share_imm_int", "share_imm_nat" ), by = c("spatial_unit", "nat", "sex", "age"), relationship = "many-to-one" ) |> # Compute `n` for subregions, assign values directly to imm_int_n and imm_nat_n dplyr::mutate( imm_int_n = imm_int_n * share_imm_int, imm_nat_n = imm_nat_n * share_imm_nat ) |> # Remove unnecessary variables dplyr::select(-c("share_imm_int", "share_imm_nat")) # Show result fso_parameters_sub_distr_hist |> # Remove variables for better overview dplyr::select(-fso_projection_n, -scen) |> head(100) |> dplyr::mutate(across( c("int_mothers":"imm_nat_n"), \(x) sprintf(fmt = "%.3f", x) )) |> DT::datatable() ``` Now everything is ready to run the projection: ```{r project-2} propop( parameters = fso_parameters_sub_distr_hist, year_first = 2019, year_last = 2020, age_groups = 101, fert_first = 16, fert_last = 50, share_born_female = 100 / 205, population = df_population, binational = TRUE, subregional = FALSE ) ``` # Migration between subregions `propop` offers the possibility to account for migration between subregions. To adjust the population size in each subregion according to past migration, the column `mig_sub` (= migration in subregions) is required in the parameter data frame. Here we add `mig_sub` as a *fictitious* parameter the parameter input file. (In real life you use population register records instead). ```{r, eval = TRUE} parameters_sub_mig <- fso_parameters_sub_distr_hist |> # Create fictitious migration parameters dplyr::mutate( mig_sub = dplyr::case_when( # Four regions with emigration, 1 region with immigration spatial_unit == 1 ~ { set.seed(1) round(rnorm(1, mean = 0, sd = 0.2), digits = 4) }, spatial_unit == 2 ~ { set.seed(2) round(rnorm(1, mean = 0, sd = 0.2), digits = 4) }, spatial_unit == 3 ~ { set.seed(25) round(rnorm(1, mean = 0, sd = 0.2), digits = 4) }, spatial_unit == 4 ~ { set.seed(12) round(rnorm(1, mean = 0, sd = 0.2), digits = 4) }, TRUE ~ NA ) ) |> dplyr::mutate( mig_sub = dplyr::case_when( spatial_unit == 5 ~ 0 - sum(mig_sub, na.rm = TRUE), TRUE ~ mig_sub ), # check = sum(mig_sub, na.rm = TRUE), .by = c("nat", "sex", "age", "year", "scen") ) |> dplyr::select( nat, sex, age, year, scen, spatial_unit, birthrate, int_mothers, mor, emi_int, emi_nat, imm_int_n, imm_nat_n, acq, emi_nat_n, mig_nat_n, mig_sub ) ``` Now that all required input files are available, we can set `subregional` to `TRUE` and use `propop::propop()`: ```{r project-3, eval = TRUE} propop( parameters = parameters_sub_mig, year_first = 2019, year_last = 2020, age_groups = 101, fert_first = 16, fert_last = 50, share_born_female = 100 / 205, population = df_population, binational = TRUE, subregional = TRUE ) ```