This vignette explains how to use propop::propop()
to
perform population projections for multiple regions and particularly for
subregions within a larger spatial entity (e.g., municipalities within a
canton). It discusses two challenges that arise when
conducting such projections and presents potential strategies for
addressing them:
For more background and general information about the required input data, see this vignette).
Numeric information about demographic processes such as births, migration, or deaths are an essential prerequisite for population projections. However, in Switzerland this information is typically only available for cantons but not for spatial entities at smaller scales.
Supplying input data for spatial units at the sub-cantonal level (e.g. municipalities) can be straightforward for data expressed as rates (e.g. mortality rate). The simplest approach is to use the same figures for the subregions as for the canton (unless the figures are implausible for some theoretical or empirical reason). The task is more demanding, though, if you want to alter rates for subregions or when you need to downscale input data expressed as “number of people”. While we don’t offer a solution to adjust rates (yet), the next two sections show two possibilities to distribute the cantonal “number of people” estimates among subregions.
The simplest approach to allocating canton-wide “number of people” estimates to subregions is to determine each subregion’s population size relative to the canton’s total population and distribute the numbers accordingly. To put it somewhat simplistically, if a municipality represents 10% of the canton’s population, it should proportionally receive 10% of the canton’s incoming immigrants. The approach described below is more sophisticated in that it uses the spatial units’ shares per demographic group but the core idea is the same.
Let us look at a concrete, numeric example. For the sake of simplicity, we use the data included in the package to create small, fictitious input data with five regions:
Input parameters (after this step, the rates and numbers for the subregions are still identical to those for the whole canton):
# FSO parameters for fictitious subregions
fso_parameters_sub <- fso_parameters |>
# duplicating rows 5 times
tidyr::uncount(5) |>
# create 5 subregions
dplyr::mutate(spatial_unit = rep(1:5, times = nrow(fso_parameters))) |>
dplyr::mutate(spatial_unit = as.character(spatial_unit))
Population data:
# Generate 5 random "cuts" to distribute the original population;
# avoid extreme values with a range of 0.1 to 0.5
cut_1 <- {
set.seed(1)
round(runif(1, min = 0.1, max = 0.5), digits = 2)
}
cut_2 <- {
set.seed(2)
round(runif(1, min = 0.1, max = 0.5), digits = 2)
}
cut_3 <- {
set.seed(3)
round(runif(1, min = 0.1, max = 0.5), digits = 2)
}
cut_4 <- {
set.seed(4)
round(runif(1, min = 0.1, max = 0.5), digits = 2)
}
# make sure everything adds up to 100%
cut_5 <- 1 - cut_1 - cut_2 - cut_3 - cut_4
# Generate population data for five subregions
df_population <- fso_population |>
# duplicating rows 5 times
tidyr::uncount(5) |>
# create 5 subregions
dplyr::mutate(
spatial_unit = as.character(rep(1:5, times = nrow(fso_population)))
) |>
dplyr::mutate(
# Distribute original population according to "cuts"
n = dplyr::case_match(
spatial_unit,
"1" ~ round(n * cut_1),
"2" ~ round(n * cut_2),
"3" ~ round(n * cut_3),
"4" ~ round(n * cut_4),
"5" ~ round(n * cut_5),
.default = NA
),
.keep = "all"
)
To calculate the shares, we count the number of people in each
demographic group across all spatial units (sum_n
). Next,
the share
is obtained by dividing the spatial unit’s
n
by the sum of all people across all spatial units in this
demographic group (sum_n
). Note that the sum of all five
region’s shares is 1 (e.g., 0.21 + 0.17 + 0.17 + 0.33 + 0.12 = 1).
# Calculate shares
df_population_shares <- df_population |>
dplyr::mutate(sum_n = sum(n), .by = c(nat, sex, age)) |>
dplyr::mutate(share = n / sum_n)
# Display table
df_population_shares |>
dplyr::mutate(share = round(share, 3)) |>
DT::datatable() |>
DT::formatStyle(c("share"),
backgroundColor = DT::styleRow(c(1:5), "#96D4FF", default = "")
)
Now all required data are available and we can distribute projection parameters expressed as “number of people” between subregions.
Let’s do this with immigration from other countries
(imm_int_n
) and immigration from other cantons
(imm_nat_n
). We first join the data frame containing the
projection parameters (fso_parameters_sub
) and the data
frame containing the shares (df_population_shares
).
Identifiers for demographic groups must be present in both data frames.
The actual distribution involves only a single line per parameter in
which the canton-wide number of immigrants (imm_int_n
and
imm_nat_n
) is multiplied by the share (share
),
which in this approach is identical for both types of immigration.
parameters_sub_size <- fso_parameters_sub |>
dplyr::left_join(
df_population_shares |>
dplyr::select("spatial_unit", "nat", "sex", "age", "share"),
by = c("spatial_unit", "nat", "sex", "age")
) |>
dplyr::mutate(
# Calculate number of incoming people per demographic group and spatial unit
imm_int_n_distr = imm_int_n * share,
imm_nat_n_distr = imm_nat_n * share
)
Let’s take a closer look at the result of this, focusing on
immigration from other countries. In the table below, the blue columns
check
and difference
show that the sum of the
distributed parameter (imm_int_n_distr
) adds up to the
total number of people (imm_int_n
; i.e., the original
figures provided by the FSO for the whole canton).
parameters_sub_size |>
dplyr::mutate(
check = round(sum(imm_int_n_distr), 0),
.by = c(year, nat, sex, age)
) |>
dplyr::filter(sex == "m" & nat == "int") |>
dplyr::mutate(across(c(
"int_mothers":"emi_nat", "acq", "share":"imm_nat_n_distr"
), \(x) round(x, 3))) |>
dplyr::mutate(difference = imm_int_n - check) |>
dplyr::select(-scen) |>
DT::datatable() |>
DT::formatStyle(
"imm_int_n",
backgroundColor = "#ffcc8f"
) |>
DT::formatStyle(
c("check", "difference"),
backgroundColor = "#96D4FF"
)
To proceed with the projection, the columns with the distributed
immigration (imm_int_n_distr
and
imm_nat_n_distr
) need to become the new
imm_int_n
and imm_nat_n
. Otherwise,
propop()
won’t recognize the parameters.
parameters_sub_size_clean <- parameters_sub_size |>
dplyr::mutate(
imm_int_n = imm_int_n_distr,
imm_nat_n = imm_nat_n_distr
) |>
dplyr::select(-share, -imm_int_n_distr)
Now we can run the projection:
propop(
parameters = parameters_sub_size_clean,
year_first = 2019,
year_last = 2020,
age_groups = 101,
fert_first = 16,
fert_last = 50,
share_born_female = 100 / 205,
population = df_population,
binational = TRUE,
subregional = FALSE
)
#> Running projection for: 1
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 2
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 3
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 4
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 5
#> ✔ Year: 2019
#> ✔ Year: 2020
#>
#> ── Settings used for the projection ────────────────────────────────────────────
#> Year of starting population: 2018
#> Number of age groups: 101
#> Fertile period: 16-50
#> Share of female newborns: 0.488
#> Size of starting population: 678203
#> Projection period: 2019-2020
#> Projected population size (2020): 686742
#> Nationality-specific projection: "yes"
#> Subregional migration: "no"
#> ────────────────────────────────────────────────────────────────────────────────
#> # A tibble: 4,040 × 16
#> year spatial_unit age sex nat n_jan births mor emi_int emi_nat
#> <dbl> <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2019 1 0 m ch 0 521. 2.29 1.26 4.41
#> 2 2019 1 1 m ch 526 0 0.415 3.15 15.3
#> 3 2019 1 2 m ch 544 0 0.210 2.94 12.8
#> 4 2019 1 3 m ch 539 0 0.206 2.73 10.3
#> 5 2019 1 4 m ch 553 0 0 2.31 8.81
#> 6 2019 1 5 m ch 569 0 0 2.31 7.56
#> 7 2019 1 6 m ch 548 0 0 2.10 6.30
#> 8 2019 1 7 m ch 539 0 0 1.68 5.46
#> 9 2019 1 8 m ch 548 0 0 1.47 4.83
#> 10 2019 1 9 m ch 553 0 0 1.26 4.41
#> # ℹ 4,030 more rows
#> # ℹ 6 more variables: imm_int <dbl>, imm_nat <dbl>, acq <dbl>, n_dec <dbl>,
#> # delta_n <dbl>, delta_perc <dbl>
The second approach to distribute “number of people” estimates among subregions uses historical migration records. An advantage of this approach is that it can differentiate between different types of migration (or any number-of-people input) and thereby enables adjusting each parameter independently (rather than using the same share for adapting all parameters). This is great, for example, if immigration from other cantons differs from immigration from other countries.
To illustrate this approach, we use international immigration as an
example and create a fictitious data frame with five regions. For
pedagogical reasons, we modify the data a little bit. Spatial unit 1
won’t have any international immigration among children aged 0-4 years.
Consequently, the value in the column hist_imm_int
contains
zeros for these entries. This doesn’t reflect realistic conditions but
helps to illustrate the suggested procedure in cases where zero
observations exist for demographic groups.
With real data, the same can be achieved by summarizing the historical migration data. Since real migration usually varies from year to year, we advise calculating an average across several years. The fictitious data assumes that we have already performed this step.
set.seed(145)
# Immigration records
df_hist_imm <- tibble::tibble(
# five fictitious spatial units
spatial_unit = rep(c("1", "2", "3", "4", "5"), each = 101 * 4),
# two nationalities
nat = rep(rep(c("ch", "int"), each = 2 * 101), times = 5),
# two levels for sex
sex = rep(rep(c("m", "f"), each = 101), times = 5 * 2),
# age groups from 0 to 100
age = rep(0:100, times = 5 * 4)
) |>
dplyr::mutate(
# random numbers between zero and 50 are created to mimic historical
# immigration records observed in Aargau
hist_imm_nat = sample(0:50, dplyr::n(), replace = TRUE),
hist_imm_int = sample(0:50, dplyr::n(), replace = TRUE),
# for one spatial unit, we artificially assign zeros to all children
# between 0-4 years
hist_imm_int = ifelse(spatial_unit == 1 & age %in%
c(0:4), 0, hist_imm_int)
)
The resulting data frame includes two columns with (fictitious)
immigration from abroad (hist_imm_nat
) and from other
cantons (hist_imm_nat
), aggregated per demographic
group:
A challenge is that records often vary considerably between years. If the patterns are too uneven, future trends may become erratic, especially in groups with few people (e.g., small municipalities or small age groups).
A first step to mitigate this issue should already have happened at this stage, namely considering several years and using the arithmetic mean as an estimate of each demographic group’s past migration.
propop::calculate_shares()
offers an additional remedy.
Instead of only using shares based on 1-year age groups, it is possible
to resort to 5-year and 10-year age groups, which further smooths out
irregular patterns.
To use this function, you need to provide a data frame containing a
column with the number of immigrants per demographic group (e.g.,
hist_imm_nat
or hist_imm_int
as calculated
above). The function assigns each 1-year age group to a 5-year and a
10-year age group (e.g., 2-year olds become part of the age_groups 0-4
and 0-9), sums up all observations within the respective 5-year and
10-year age group (e.g., 0-4 and 0-9 year old Swiss males). These sums
(sum_5
and sum_10
in the table below) are then
divided by five or ten, respectively, and proportionally assigned to
each member of the age group (prop_5
and
prop_10
in the table below).
To distribute the canton-wide numbers among
subregions, propop::calculate_shares()
uses the 1-year age
group if its mean share is larger than zero. If no immigration was
recorded for a particular 1-year age group (e.g., the 76-year old Swiss
males in spatial unit 1), the mean share of the corresponding 5-year age
group is used (i.e., 75-79-year old Swiss males in the case of 76-year
old). If the share across the 5-year age group is also zero, the 10-year
age group is used (this could be 70-79 year old Swiss males). In this
case, the prop_10
share of the respective 10-year age group
is used for both 5-year age groups within the 10-year age group (e.g.,
the 70-74 and 75-79 year olds). The variable use_age_group
indicates which share is suggested by this default algorithm.
The final outcome share
indicates the proportion of the
historical total per demographic group (n_sum
) that is to
be allocated to the respective spatial unit.
(share = n / n_sum
).
data_distr_hist_int <- df_hist_imm |>
calculate_shares(col = "hist_imm_int", age_group = "age_group_5")
data_distr_hist_int |>
# Display rounded numbers to save space
dplyr::mutate(share = round(share, 3)) |>
DT::datatable() |>
DT::formatStyle(
"n",
backgroundColor = "#96D4FF"
) |>
DT::formatStyle(
"n_sum",
backgroundColor = "#96D4FF"
) |>
DT::formatStyle(
"share",
backgroundColor = "#007AB8"
)
Note that summarizing the shares of a demographic group across all spatial units always adds up to 1:
data_distr_hist_int |>
dplyr::summarise(sum_share = round(
sum(share, na.rm = TRUE),
digits = 1), .by = c(nat, sex, age)) |>
DT::datatable()
From now on, the procedure is identical to the first approach that
distributed the number of people according to population size. That is,
the share is multiplied with the numbers that the FSO estimated for the
whole canton.
# In addition to international immigration shares and numbers
# we also need the shares and numbers for national immigration
data_distr_hist_nat <- df_hist_imm |>
calculate_shares(col = "hist_imm_nat") |>
# Use unambiguous name
dplyr::rename(share_imm_nat = share) |>
# Drop unnecessary variables
dplyr::select(-c(
"hist_imm_nat", "hist_imm_int", "age_group_5", "sum_5", "prop_5",
"age_group_10", "sum_10", "prop_10", "use_age_group", "n", "n_sum"
))
# Join both data frames holding shares
data_distr_hist <- data_distr_hist_int |>
# Use unambiguous name
dplyr::rename(share_imm_int = share) |>
# Drop unnecessary variables
dplyr::select(-c(
"hist_imm_nat", "hist_imm_int", "age_group_5", "age_group_10", "sum_10",
"prop_10", "use_age_group", "n", "n_sum"
)) |>
dplyr::left_join(data_distr_hist_nat,
by = c("spatial_unit", "nat", "sex", "age")
)
# Add shares to the data frame that holds the projection parameters
fso_parameters_sub_distr_hist <- fso_parameters_sub |>
dplyr::left_join(
data_distr_hist |>
dplyr::select(
"spatial_unit", "nat", "sex", "age", "share_imm_int", "share_imm_nat"
),
by = c("spatial_unit", "nat", "sex", "age"),
relationship = "many-to-one"
) |>
# Compute `n` for subregions, assign values directly to imm_int_n and imm_nat_n
dplyr::mutate(
imm_int_n = imm_int_n * share_imm_int,
imm_nat_n = imm_nat_n * share_imm_nat
) |>
# Remove unnecessary variables
dplyr::select(-c("share_imm_int", "share_imm_nat"))
# Show result
fso_parameters_sub_distr_hist |>
# Remove variables for better overview
dplyr::select(-fso_projection_n, -scen) |>
head(100) |>
dplyr::mutate(across(
c("int_mothers":"imm_nat_n"), \(x) sprintf(fmt = "%.3f", x)
)) |>
DT::datatable()
Now everything is ready to run the projection:
propop(
parameters = fso_parameters_sub_distr_hist,
year_first = 2019,
year_last = 2020,
age_groups = 101,
fert_first = 16,
fert_last = 50,
share_born_female = 100 / 205,
population = df_population,
binational = TRUE,
subregional = FALSE
)
#> Running projection for: 1
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 2
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 3
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 4
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 5
#> ✔ Year: 2019
#> ✔ Year: 2020
#>
#> ── Settings used for the projection ────────────────────────────────────────────
#> Year of starting population: 2018
#> Number of age groups: 101
#> Fertile period: 16-50
#> Share of female newborns: 0.488
#> Size of starting population: 678203
#> Projection period: 2019-2020
#> Projected population size (2020): 686746
#> Nationality-specific projection: "yes"
#> Subregional migration: "no"
#> ────────────────────────────────────────────────────────────────────────────────
#> # A tibble: 4,040 × 16
#> year spatial_unit age sex nat n_jan births mor emi_int emi_nat
#> <dbl> <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2019 1 0 m ch 0 518. 2.26 1.25 4.38
#> 2 2019 1 1 m ch 526 0 0.413 3.15 15.3
#> 3 2019 1 2 m ch 544 0 0.207 2.94 12.8
#> 4 2019 1 3 m ch 539 0 0.206 2.73 10.3
#> 5 2019 1 4 m ch 553 0 0 2.31 8.81
#> 6 2019 1 5 m ch 569 0 0 2.31 7.56
#> 7 2019 1 6 m ch 548 0 0 2.10 6.30
#> 8 2019 1 7 m ch 539 0 0 1.68 5.46
#> 9 2019 1 8 m ch 548 0 0 1.47 4.83
#> 10 2019 1 9 m ch 553 0 0 1.26 4.41
#> # ℹ 4,030 more rows
#> # ℹ 6 more variables: imm_int <dbl>, imm_nat <dbl>, acq <dbl>, n_dec <dbl>,
#> # delta_n <dbl>, delta_perc <dbl>
propop
offers the possibility to account for migration
between subregions. To adjust the population size in each subregion
according to past migration, the column mig_sub
(=
migration in subregions) is required in the parameter data frame.
Here we add mig_sub
as a fictitious parameter
the parameter input file. (In real life you use population register
records instead).
parameters_sub_mig <- fso_parameters_sub_distr_hist |>
# Create fictitious migration parameters
dplyr::mutate(
mig_sub = dplyr::case_when(
# Four regions with emigration, 1 region with immigration
spatial_unit == 1 ~ {
set.seed(1)
round(rnorm(1, mean = 0, sd = 0.2), digits = 4)
},
spatial_unit == 2 ~ {
set.seed(2)
round(rnorm(1, mean = 0, sd = 0.2), digits = 4)
},
spatial_unit == 3 ~ {
set.seed(25)
round(rnorm(1, mean = 0, sd = 0.2), digits = 4)
},
spatial_unit == 4 ~ {
set.seed(12)
round(rnorm(1, mean = 0, sd = 0.2), digits = 4)
},
TRUE ~ NA
)
) |>
dplyr::mutate(
mig_sub = dplyr::case_when(
spatial_unit == 5 ~ 0 - sum(mig_sub, na.rm = TRUE), TRUE ~ mig_sub
),
# check = sum(mig_sub, na.rm = TRUE),
.by = c("nat", "sex", "age", "year", "scen")
) |>
dplyr::select(
nat, sex, age, year, scen, spatial_unit, birthrate, int_mothers, mor,
emi_int, emi_nat, imm_int_n, imm_nat_n, acq, emi_nat_n, mig_nat_n, mig_sub
)
Now that all required input files are available, we can set
subregional
to TRUE
and use
propop::propop()
:
propop(
parameters = parameters_sub_mig,
year_first = 2019,
year_last = 2020,
age_groups = 101,
fert_first = 16,
fert_last = 50,
share_born_female = 100 / 205,
population = df_population,
binational = TRUE,
subregional = TRUE
)
#> Running projection for: 1
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 2
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 3
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 4
#> ✔ Year: 2019
#> ✔ Year: 2020
#> Running projection for: 5
#> ✔ Year: 2019
#> ✔ Year: 2020
#>
#> ── Settings used for the projection ────────────────────────────────────────────
#> Year of starting population: 2018
#> Number of age groups: 101
#> Fertile period: 16-50
#> Share of female newborns: 0.488
#> Size of starting population: 678203
#> Projection period: 2019-2020
#> Projected population size (2020): 686746
#> Nationality-specific projection: "yes"
#> Subregional migration: "yes"
#> ────────────────────────────────────────────────────────────────────────────────
#> # A tibble: 4,040 × 17
#> year spatial_unit age sex nat n_jan births mor emi_int emi_nat
#> <dbl> <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2019 1 0 m ch 0 518. 2.26 1.25 4.38
#> 2 2019 1 1 m ch 526 0 0.413 3.15 15.3
#> 3 2019 1 2 m ch 544 0 0.207 2.94 12.8
#> 4 2019 1 3 m ch 539 0 0.206 2.73 10.3
#> 5 2019 1 4 m ch 553 0 0 2.31 8.81
#> 6 2019 1 5 m ch 569 0 0 2.31 7.56
#> 7 2019 1 6 m ch 548 0 0 2.10 6.30
#> 8 2019 1 7 m ch 539 0 0 1.68 5.46
#> 9 2019 1 8 m ch 548 0 0 1.47 4.83
#> 10 2019 1 9 m ch 553 0 0 1.26 4.41
#> # ℹ 4,030 more rows
#> # ℹ 7 more variables: imm_int <dbl>, imm_nat <dbl>, acq <dbl>, mig_sub <dbl>,
#> # n_dec <dbl>, delta_n <dbl>, delta_perc <dbl>