Projections for subregions

library(propop)

Overview

This vignette explains how to use propop::propop() to perform population projections for multiple regions and particularly for subregions within a larger spatial entity (e.g., municipalities within a canton). It discusses two challenges that arise when conducting such projections and presents potential strategies for addressing them:

  1. How to obtain the required input data for spatial entities below the level of cantons?
  2. How to account for migration between subregions?

For more background and general information about the required input data, see this vignette.

Projection parameters for subregions

Numeric information about demographic processes such as births, migration, or deaths is an essential prerequisite for population projections. In Switzerland, however, this information is typically available only for cantons and not for smaller spatial entities.

Supplying input data for spatial units at the sub-cantonal level (e.g., municipalities) can be straightforward for data expressed as rates (e.g., mortality rate). The simplest approach is to use the same figures for the subregions as for the canton (unless the figures are implausible for some theoretical or empirical reason). The task is more demanding, though, if you want to alter rates for subregions or need to downscale input data expressed as “number of people”. While we don’t offer a solution for adjusting rates (yet), the next two sections show two ways to distribute the cantonal “number of people” estimates among subregions.

Distribution of people according to population size

The simplest approach to allocating canton-wide “number of people” estimates to subregions is to determine each subregion’s population size relative to the canton’s total population and distribute the numbers accordingly. To put it somewhat simplistically, if a municipality represents 10% of the canton’s population, it should proportionally receive 10% of the canton’s incoming immigrants. The approach described below is more sophisticated in that it uses the spatial units’ shares per demographic group but the core idea is the same.
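The proportional rule can be sketched in a few lines of base R. The figures and object names below (`canton_imm`, `pop`) are invented for illustration and are not part of propop or the FSO data.

```r
# Hypothetical canton-wide number of immigrants (assumption, not FSO data)
canton_imm <- 1000

# Hypothetical population sizes of three municipalities
pop <- c(A = 5000, B = 30000, C = 15000)

# Each municipality's share of the canton's population
share <- pop / sum(pop)

# Distribute the canton-wide figure proportionally
imm_distributed <- canton_imm * share
imm_distributed
```

Municipality A holds 10% of the population in this toy example and therefore receives 10% of the immigrants; the distributed figures add up to the canton-wide total again.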

Let us look at a concrete, numeric example. For the sake of simplicity, we use the data included in the package to create small, fictitious input data with five regions:

Input parameters (after this step, the rates and numbers for the subregions are still identical to those for the whole canton):

# FSO parameters for fictitious subregions
fso_parameters_sub <- fso_parameters |>
  # duplicating rows 5 times
  tidyr::uncount(5) |>
  # create 5 subregions
  dplyr::mutate(spatial_unit = rep(1:5, times = nrow(fso_parameters))) |>
  dplyr::mutate(spatial_unit = as.character(spatial_unit))

Population data:

# Generate 4 random "cuts" to distribute the original population
# (the fifth cut is the remainder, see below);
# avoid extreme values with a range of 0.1 to 0.5
cut_1 <- {
  set.seed(1)
  round(runif(1, min = 0.1, max = 0.5), digits = 2)
}
cut_2 <- {
  set.seed(2)
  round(runif(1, min = 0.1, max = 0.5), digits = 2)
}
cut_3 <- {
  set.seed(3)
  round(runif(1, min = 0.1, max = 0.5), digits = 2)
}
cut_4 <- {
  set.seed(4)
  round(runif(1, min = 0.1, max = 0.5), digits = 2)
}
# make sure everything adds up to 100%
cut_5 <- 1 - cut_1 - cut_2 - cut_3 - cut_4

# Generate population data for five subregions
df_population <- fso_population |>
  # duplicating rows 5 times
  tidyr::uncount(5) |>
  # create 5 subregions
  dplyr::mutate(
    spatial_unit = as.character(rep(1:5, times = nrow(fso_population)))
  ) |>
  dplyr::mutate(
    # Distribute original population according to "cuts"
    n = dplyr::case_match(
      spatial_unit,
      "1" ~ round(n * cut_1),
      "2" ~ round(n * cut_2),
      "3" ~ round(n * cut_3),
      "4" ~ round(n * cut_4),
      "5" ~ round(n * cut_5),
      .default = NA
    ),
    .keep = "all"
  )

To calculate the shares, we first sum the number of people in each demographic group across all spatial units (sum_n). The share is then obtained by dividing each spatial unit’s n by sum_n. Note that the five regions’ shares add up to 1 (e.g., 0.21 + 0.17 + 0.17 + 0.33 + 0.12 = 1).

# Calculate shares
df_population_shares <- df_population |>
  dplyr::mutate(sum_n = sum(n), .by = c(nat, sex, age)) |>
  dplyr::mutate(share = n / sum_n)

# Display table
df_population_shares |>
  dplyr::mutate(share = round(share, 3)) |>
  DT::datatable() |>
  DT::formatStyle(c("share"),
                  backgroundColor = DT::styleRow(c(1:5), "#96D4FF", default = "")
  )
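
The grouped share calculation can also be reproduced on a toy data frame with base R only. This is a minimal sketch with invented numbers; the object `toy` is hypothetical and `ave()` is used here in place of the grouped `dplyr::mutate()` call.

```r
# Toy data: two demographic groups (age 0 and age 1) in three spatial units
toy <- data.frame(
  spatial_unit = rep(c("1", "2", "3"), times = 2),
  age = rep(c(0, 1), each = 3),
  n = c(20, 30, 50, 10, 10, 20)
)

# Sum of people per demographic group across all spatial units
toy$sum_n <- ave(toy$n, toy$age, FUN = sum)

# Share of each spatial unit within its demographic group
toy$share <- toy$n / toy$sum_n

# The shares within each group add up to 1
tapply(toy$share, toy$age, sum)
```

One caveat: if a demographic group has no people in any spatial unit, `n / sum_n` evaluates to `0 / 0`, which is `NaN` in R. Such groups would need separate handling before the shares are used.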

 

Now all required data are available and we can distribute projection parameters expressed as “number of people” between subregions.

Let’s do this with immigration from other countries (imm_int_n) and immigration from other cantons (imm_nat_n). We first join the data frame containing the projection parameters (fso_parameters_sub) and the data frame containing the shares (df_population_shares). Identifiers for demographic groups must be present in both data frames. The actual distribution involves only a single line per parameter in which the canton-wide number of immigrants (imm_int_n and imm_nat_n) is multiplied by the share (share), which in this approach is identical for both types of immigration.

parameters_sub_size <- fso_parameters_sub |>
  dplyr::left_join(
    df_population_shares |>
      dplyr::select("spatial_unit", "nat", "sex", "age", "share"),
    by = c("spatial_unit", "nat", "sex", "age")
  ) |>
  dplyr::mutate(
    # Calculate number of incoming people per demographic group and spatial unit
    imm_int_n_distr = imm_int_n * share,
    imm_nat_n_distr = imm_nat_n * share
  )

Let’s take a closer look at the result, focusing on immigration from other countries. In the table below, the blue columns check and difference show that the distributed parameter (imm_int_n_distr) adds up to the total number of people (imm_int_n; i.e., the original figures provided by the FSO for the whole canton).

parameters_sub_size |>
  dplyr::mutate(
    check = round(sum(imm_int_n_distr), 0),
    .by = c(year, nat, sex, age)
  ) |>
  dplyr::filter(sex == "m" & nat == "int") |>
  dplyr::mutate(dplyr::across(c(
    "int_mothers":"emi_nat", "acq", "share":"imm_nat_n_distr"
  ), \(x) round(x, 3))) |>
  dplyr::mutate(difference = imm_int_n - check) |>
  dplyr::select(-scen) |>
  DT::datatable() |>
  DT::formatStyle(
    "imm_int_n",
    backgroundColor = "#ffcc8f"
  ) |>
  DT::formatStyle(
    c("check", "difference"),
    backgroundColor = "#96D4FF"
  )


To proceed with the projection, the columns with the distributed immigration (imm_int_n_distr and imm_nat_n_distr) need to become the new imm_int_n and imm_nat_n. Otherwise, propop() won’t recognize the parameters.

parameters_sub_size_clean <- parameters_sub_size |>
  dplyr::mutate(
    imm_int_n = imm_int_n_distr,
    imm_nat_n = imm_nat_n_distr
  ) |>
  dplyr::select(-share, -imm_int_n_distr, -imm_nat_n_distr)

Now we can run the projection:

propop(
  parameters = parameters_sub_size_clean,
  year_first = 2019,
  year_last = 2020,
  age_groups = 101,
  fert_first = 16,
  fert_last = 50,
  share_born_female = 100 / 205,
  population = df_population,
  binational = TRUE,
  subregional = FALSE
)
#> Running projection for: 1
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 2
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 3
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 4
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 5
#> ✔ Year: 2019
#> ✔ Year: 2020
#> 
#> ── Settings used for the projection ────────────────────────────────────────────
#> Year of starting population: 2018
#> Number of age groups: 101
#> Fertile period: 16-50
#> Share of female newborns: 0.488
#> Size of starting population: 678203
#> Projection period: 2019-2020
#> Projected population size (2020): 686742
#> Nationality-specific projection: "yes"
#> Subregional migration: "no"
#> ────────────────────────────────────────────────────────────────────────────────
#> # A tibble: 4,040 × 16
#>     year spatial_unit   age sex   nat   n_jan births   mor emi_int emi_nat
#>    <dbl> <fct>        <dbl> <fct> <fct> <dbl>  <dbl> <dbl>   <dbl>   <dbl>
#>  1  2019 1                0 m     ch        0   521. 2.29     1.26    4.41
#>  2  2019 1                1 m     ch      526     0  0.415    3.15   15.3 
#>  3  2019 1                2 m     ch      544     0  0.210    2.94   12.8 
#>  4  2019 1                3 m     ch      539     0  0.206    2.73   10.3 
#>  5  2019 1                4 m     ch      553     0  0        2.31    8.81
#>  6  2019 1                5 m     ch      569     0  0        2.31    7.56
#>  7  2019 1                6 m     ch      548     0  0        2.10    6.30
#>  8  2019 1                7 m     ch      539     0  0        1.68    5.46
#>  9  2019 1                8 m     ch      548     0  0        1.47    4.83
#> 10  2019 1                9 m     ch      553     0  0        1.26    4.41
#> # ℹ 4,030 more rows
#> # ℹ 6 more variables: imm_int <dbl>, imm_nat <dbl>, acq <dbl>, n_dec <dbl>,
#> #   delta_n <dbl>, delta_perc <dbl>

Distribution of people according to past migration

The second approach to distributing “number of people” estimates among subregions uses historical migration records. An advantage of this approach is that it can differentiate between types of migration (or any other number-of-people input) and thereby enables adjusting each parameter independently (rather than applying the same share to all parameters). This is useful, for example, if immigration from other cantons differs from immigration from other countries.

To illustrate this approach, we use international immigration as an example and create a fictitious data frame with five regions. For pedagogical reasons, we modify the data slightly: spatial unit 1 won’t have any international immigration among children aged 0-4 years. Consequently, the column hist_imm_int contains zeros for these entries. This doesn’t reflect realistic conditions but helps to illustrate the suggested procedure in cases where zero observations exist for demographic groups.

With real data, the same can be achieved by summarizing the historical migration data. Since real migration usually varies from year to year, we advise calculating an average across several years. The fictitious data assumes that we have already performed this step.
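
As a rough illustration of that averaging step, the following base R sketch computes the mean across years for each spatial unit and demographic group. The data frame `records`, its column `imm_int`, and all values are hypothetical; real register data will likely need additional cleaning.

```r
# Hypothetical yearly immigration records for one demographic group
# (values are invented for illustration)
records <- data.frame(
  year = rep(2015:2017, each = 2),
  spatial_unit = rep(c("1", "2"), times = 3),
  nat = "ch",
  sex = "m",
  age = 30,
  imm_int = c(4, 10, 0, 14, 8, 12)
)

# Arithmetic mean across years per spatial unit and demographic group
hist_avg <- aggregate(
  imm_int ~ spatial_unit + nat + sex + age,
  data = records,
  FUN = mean
)

# Rename the averaged column to match the naming used in this vignette
names(hist_avg)[names(hist_avg) == "imm_int"] <- "hist_imm_int"
hist_avg
```

Averaging keeps a single year without immigration (here 2016 in spatial unit 1) from producing a zero in the historical record.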

set.seed(145)

# Immigration records
df_hist_imm <- tibble::tibble(
  # five fictitious spatial units
  spatial_unit = rep(c("1", "2", "3", "4", "5"), each = 101 * 4),
  # two nationalities
  nat = rep(rep(c("ch", "int"), each = 2 * 101), times = 5),
  # two levels for sex
  sex = rep(rep(c("m", "f"), each = 101), times = 5 * 2),
  # age groups from 0 to 100
  age = rep(0:100, times = 5 * 4)
) |>
  dplyr::mutate(
    # random numbers between zero and 50 are created to mimic historical
    # immigration records observed in Aargau
    hist_imm_nat = sample(0:50, dplyr::n(), replace = TRUE),
    hist_imm_int = sample(0:50, dplyr::n(), replace = TRUE),
    # for one spatial unit, we artificially assign zeros to all children
    # between 0-4 years
    hist_imm_int = ifelse(spatial_unit == "1" & age %in% 0:4, 0, hist_imm_int)
  )

The resulting data frame includes two columns with (fictitious) immigration from abroad (hist_imm_int) and from other cantons (hist_imm_nat), aggregated per demographic group:

df_hist_imm |>
  DT::datatable()

 

A challenge is that records often vary considerably between years. If the patterns are too uneven, future trends may become erratic, especially in groups with few people (e.g., small municipalities or small age groups).

A first step to mitigate this issue should already have happened at this stage, namely considering several years and using the arithmetic mean as an estimate of each demographic group’s past migration.

propop::calculate_shares() offers an additional remedy. Instead of only using shares based on 1-year age groups, it is possible to resort to 5-year and 10-year age groups, which further smooths out irregular patterns.

To use this function, you need to provide a data frame containing a column with the number of immigrants per demographic group (e.g., hist_imm_nat or hist_imm_int as created above). The function assigns each 1-year age group to a 5-year and a 10-year age group (e.g., 2-year-olds become part of the age groups 0-4 and 0-9) and sums up all observations within the respective 5-year and 10-year age group (e.g., 0-4 and 0-9 year old Swiss males). These sums (sum_5 and sum_10 in the table below) are then divided by five or ten, respectively, and proportionally assigned to each member of the age group (prop_5 and prop_10 in the table below).

To distribute the canton-wide numbers among subregions, propop::calculate_shares() uses the 1-year age group if its mean share is larger than zero. If no immigration was recorded for a particular 1-year age group (e.g., the 76-year-old Swiss males in spatial unit 1), the mean share of the corresponding 5-year age group is used (i.e., 75-79-year-old Swiss males in the case of 76-year-olds). If the share across the 5-year age group is also zero, the 10-year age group is used (this could be 70-79-year-old Swiss males). In this case, the prop_10 share of the respective 10-year age group is used for both 5-year age groups within the 10-year age group (e.g., the 70-74 and 75-79-year-olds). The variable use_age_group indicates which share is suggested by this default algorithm.

The final outcome share indicates the proportion of the historical total per demographic group (n_sum) that is to be allocated to the respective spatial unit (share = n / n_sum).
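
The fallback logic can be approximated in base R as follows. This is a simplified sketch and not the implementation used in propop::calculate_shares(): it applies the fallback row by row for a single spatial unit and demographic group, the input values are invented, and the column `n_1` is a hypothetical name for the 1-year counts.

```r
# Hypothetical averaged immigration counts for ages 70-79 in one
# spatial unit and one demographic group (invented values)
d <- data.frame(
  age = 70:79,
  n_1 = c(3, 2, 4, 1, 5, 0, 0, 0, 0, 0)
)

# Assign each 1-year age group to its 5-year and 10-year age group
d$age_group_5 <- d$age %/% 5 * 5
d$age_group_10 <- d$age %/% 10 * 10

# Group sums, spread evenly over the member ages
d$prop_5 <- ave(d$n_1, d$age_group_5, FUN = sum) / 5
d$prop_10 <- ave(d$n_1, d$age_group_10, FUN = sum) / 10

# Fallback: 1-year value if positive, else 5-year, else 10-year value
d$n <- ifelse(
  d$n_1 > 0,
  d$n_1,
  ifelse(d$prop_5 > 0, d$prop_5, d$prop_10)
)
d
```

In this toy example, ages 75-79 have no recorded immigration in either the 1-year or the 5-year group, so they fall back to the 10-year value (15 / 10 = 1.5). Dividing each resulting n by the total across all spatial units (n_sum) would then give the share.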

data_distr_hist_int <- df_hist_imm |>
  calculate_shares(col = "hist_imm_int", age_group = "age_group_5")

data_distr_hist_int |>
  # Display rounded numbers to save space
  dplyr::mutate(share = round(share, 3)) |>
  DT::datatable() |>
  DT::formatStyle(
    "n",
    backgroundColor = "#96D4FF"
  ) |>
  DT::formatStyle(
    "n_sum",
    backgroundColor = "#96D4FF"
  ) |>
  DT::formatStyle(
    "share",
    backgroundColor = "#007AB8"
  )

 

Note that the shares of a demographic group, summed across all spatial units, always add up to 1:

data_distr_hist_int |>
  dplyr::summarise(sum_share = round(
    sum(share, na.rm = TRUE),
    digits = 1), .by = c(nat, sex, age)) |>
  DT::datatable()


From now on, the procedure is identical to the first approach that distributed the number of people according to population size. That is, the share is multiplied with the numbers that the FSO estimated for the whole canton.

# In addition to international immigration shares and numbers
# we also need the shares and numbers for national immigration
data_distr_hist_nat <- df_hist_imm |>
  calculate_shares(col = "hist_imm_nat") |>
  # Use unambiguous name
  dplyr::rename(share_imm_nat = share) |>
  # Drop unnecessary variables
  dplyr::select(-c(
    "hist_imm_nat", "hist_imm_int", "age_group_5", "sum_5", "prop_5", 
    "age_group_10", "sum_10", "prop_10", "use_age_group", "n", "n_sum"
  ))

# Join both data frames holding shares
data_distr_hist <- data_distr_hist_int |>
  # Use unambiguous name
  dplyr::rename(share_imm_int = share) |>
  # Drop unnecessary variables
  dplyr::select(-c(
    "hist_imm_nat", "hist_imm_int", "age_group_5", "age_group_10", "sum_10",
    "prop_10", "use_age_group", "n", "n_sum"
  )) |>
  dplyr::left_join(data_distr_hist_nat,
                   by = c("spatial_unit", "nat", "sex", "age")
  )

# Add shares to the data frame that holds the projection parameters
fso_parameters_sub_distr_hist <- fso_parameters_sub |>
  dplyr::left_join(
    data_distr_hist |>
      dplyr::select(
        "spatial_unit", "nat", "sex", "age", "share_imm_int", "share_imm_nat"
      ),
    by = c("spatial_unit", "nat", "sex", "age"),
    relationship = "many-to-one"
  ) |>
  # Compute `n` for subregions, assign values directly to imm_int_n and imm_nat_n
  dplyr::mutate(
    imm_int_n = imm_int_n * share_imm_int,
    imm_nat_n = imm_nat_n * share_imm_nat
  ) |>
  # Remove unnecessary variables
  dplyr::select(-c("share_imm_int", "share_imm_nat"))

# Show result
fso_parameters_sub_distr_hist |>
  # Remove variables for better overview
  dplyr::select(-fso_projection_n, -scen) |>
  head(100) |>
  dplyr::mutate(dplyr::across(
    c("int_mothers":"imm_nat_n"), \(x) sprintf(fmt = "%.3f", x)
  )) |>
  DT::datatable()

Now everything is ready to run the projection:

propop(
  parameters = fso_parameters_sub_distr_hist,
  year_first = 2019,
  year_last = 2020,
  age_groups = 101,
  fert_first = 16,
  fert_last = 50,
  share_born_female = 100 / 205,
  population = df_population,
  binational = TRUE,
  subregional = FALSE
)
#> Running projection for: 1
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 2
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 3
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 4
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 5
#> ✔ Year: 2019
#> ✔ Year: 2020
#> 
#> ── Settings used for the projection ────────────────────────────────────────────
#> Year of starting population: 2018
#> Number of age groups: 101
#> Fertile period: 16-50
#> Share of female newborns: 0.488
#> Size of starting population: 678203
#> Projection period: 2019-2020
#> Projected population size (2020): 686746
#> Nationality-specific projection: "yes"
#> Subregional migration: "no"
#> ────────────────────────────────────────────────────────────────────────────────
#> # A tibble: 4,040 × 16
#>     year spatial_unit   age sex   nat   n_jan births   mor emi_int emi_nat
#>    <dbl> <fct>        <dbl> <fct> <fct> <dbl>  <dbl> <dbl>   <dbl>   <dbl>
#>  1  2019 1                0 m     ch        0   518. 2.26     1.25    4.38
#>  2  2019 1                1 m     ch      526     0  0.413    3.15   15.3 
#>  3  2019 1                2 m     ch      544     0  0.207    2.94   12.8 
#>  4  2019 1                3 m     ch      539     0  0.206    2.73   10.3 
#>  5  2019 1                4 m     ch      553     0  0        2.31    8.81
#>  6  2019 1                5 m     ch      569     0  0        2.31    7.56
#>  7  2019 1                6 m     ch      548     0  0        2.10    6.30
#>  8  2019 1                7 m     ch      539     0  0        1.68    5.46
#>  9  2019 1                8 m     ch      548     0  0        1.47    4.83
#> 10  2019 1                9 m     ch      553     0  0        1.26    4.41
#> # ℹ 4,030 more rows
#> # ℹ 6 more variables: imm_int <dbl>, imm_nat <dbl>, acq <dbl>, n_dec <dbl>,
#> #   delta_n <dbl>, delta_perc <dbl>

Migration between subregions

propop offers the possibility to account for migration between subregions. To adjust the population size in each subregion according to past migration, the column mig_sub (= migration in subregions) is required in the parameter data frame.

Here we add mig_sub as a fictitious parameter to the parameter data frame. (In real life, you would use population register records instead.)

parameters_sub_mig <- fso_parameters_sub_distr_hist |>
  # Create fictitious migration parameters
  dplyr::mutate(
    mig_sub = dplyr::case_when(
      # Four regions with emigration, 1 region with immigration
      spatial_unit == 1 ~ {
        set.seed(1)
        round(rnorm(1, mean = 0, sd = 0.2), digits = 4)
      },
      spatial_unit == 2 ~ {
        set.seed(2)
        round(rnorm(1, mean = 0, sd = 0.2), digits = 4)
      },
      spatial_unit == 3 ~ {
        set.seed(25)
        round(rnorm(1, mean = 0, sd = 0.2), digits = 4)
      },
      spatial_unit == 4 ~ {
        set.seed(12)
        round(rnorm(1, mean = 0, sd = 0.2), digits = 4)
      },
      TRUE ~ NA
    )
  ) |>
  dplyr::mutate(
    mig_sub = dplyr::case_when(
      spatial_unit == 5 ~ 0 - sum(mig_sub, na.rm = TRUE), TRUE ~ mig_sub
    ),
    # check = sum(mig_sub, na.rm = TRUE),
    .by = c("nat", "sex", "age", "year", "scen")
  ) |>
  dplyr::select(
    nat, sex, age, year, scen, spatial_unit, birthrate, int_mothers, mor,
    emi_int, emi_nat, imm_int_n, imm_nat_n, acq, emi_nat_n, mig_nat_n, mig_sub
  )
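
In the fictitious example above, the value for spatial unit 5 is chosen so that mig_sub sums to zero across the five subregions within each demographic group. The base R sketch below isolates that balancing step; the numbers are invented, and whether real mig_sub values should balance in this way is an assumption made for illustration.

```r
# Hypothetical migration values for four subregions (invented)
mig_sub <- c("1" = -0.13, "2" = -0.18, "3" = -0.04, "4" = -0.30)

# The fifth subregion absorbs the balance so that the total is zero
mig_sub["5"] <- -sum(mig_sub)

# Check: migration between subregions cancels out (up to rounding error)
sum(mig_sub)
```

Because of floating-point arithmetic, the sum may differ from zero by a tiny amount, so it is safer to compare it against a small tolerance than to test for exact equality.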

Now that all required input data are available, we can set subregional to TRUE and use propop::propop():

propop(
  parameters = parameters_sub_mig,
  year_first = 2019,
  year_last = 2020,
  age_groups = 101,
  fert_first = 16,
  fert_last = 50,
  share_born_female = 100 / 205,
  population = df_population,
  binational = TRUE,
  subregional = TRUE
)
#> Running projection for: 1
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 2
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 3
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 4
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 5
#> ✔ Year: 2019
#> ✔ Year: 2020
#> 
#> ── Settings used for the projection ────────────────────────────────────────────
#> Year of starting population: 2018
#> Number of age groups: 101
#> Fertile period: 16-50
#> Share of female newborns: 0.488
#> Size of starting population: 678203
#> Projection period: 2019-2020
#> Projected population size (2020): 686746
#> Nationality-specific projection: "yes"
#> Subregional migration: "yes"
#> ────────────────────────────────────────────────────────────────────────────────
#> # A tibble: 4,040 × 17
#>     year spatial_unit   age sex   nat   n_jan births   mor emi_int emi_nat
#>    <dbl> <fct>        <dbl> <fct> <fct> <dbl>  <dbl> <dbl>   <dbl>   <dbl>
#>  1  2019 1                0 m     ch        0   518. 2.26     1.25    4.38
#>  2  2019 1                1 m     ch      526     0  0.413    3.15   15.3 
#>  3  2019 1                2 m     ch      544     0  0.207    2.94   12.8 
#>  4  2019 1                3 m     ch      539     0  0.206    2.73   10.3 
#>  5  2019 1                4 m     ch      553     0  0        2.31    8.81
#>  6  2019 1                5 m     ch      569     0  0        2.31    7.56
#>  7  2019 1                6 m     ch      548     0  0        2.10    6.30
#>  8  2019 1                7 m     ch      539     0  0        1.68    5.46
#>  9  2019 1                8 m     ch      548     0  0        1.47    4.83
#> 10  2019 1                9 m     ch      553     0  0        1.26    4.41
#> # ℹ 4,030 more rows
#> # ℹ 7 more variables: imm_int <dbl>, imm_nat <dbl>, acq <dbl>, mig_sub <dbl>,
#> #   n_dec <dbl>, delta_n <dbl>, delta_perc <dbl>