Title: | Estimate and Manage Empirical Distributions |
---|---|
Description: | Tools to estimate and manage empirical distributions, which should work with survey data. A main feature is the creation of data cubes of estimated statistics that include all combinations of the variables of interest (see, for example, the functions dcc5() and dcc6()). |
Authors: | Sandro Burri [aut, cre] |
Maintainer: | Sandro Burri <[email protected]> |
License: | GPL-2 |
Version: | 0.0.6.9000 |
Built: | 2025-01-22 05:57:26 UTC |
Source: | https://github.com/gibonet/distrr |
Generate all combinations of the elements of a character vector
combn_char(x)
x |
a character vector |
a nested list. A list whose elements are lists containing the character vectors with the combinations of their elements.
combn_char(c("gender", "sector"))
combn_char(c("gender", "sector", "education"))
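The nested-list shape described above can be approximated in base R. The following is a minimal sketch, not the package's implementation: the helper name `all_combinations` is hypothetical, and it simply groups the output of `combn()` by combination size.

```r
# Hypothetical helper (not part of distrr): all combinations of the
# elements of a character vector, grouped by combination size.
all_combinations <- function(x) {
  lapply(seq_along(x), function(k) combn(x, k, simplify = FALSE))
}

res <- all_combinations(c("gender", "sector", "education"))

# One outer element per combination size (1, 2, 3); each holds a list of
# character vectors.
lengths(res)
res[[2]]
```

The outer list has one element per combination size, which is one plausible reading of "a list whose elements are lists"; the exact layout returned by combn_char() may differ.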
Data cube creation (dcc)
dcc(.data, .variables, .fun = jointfun_, ...)

dcc2(.data, .variables, .fun = jointfun_, order_type = extract_unique2, ...)

dcc5(
  .data,
  .variables,
  .fun = jointfun_,
  .total = "Totale",
  order_type = extract_unique4,
  .all = TRUE,
  ...
)
.data |
data frame to be processed |
.variables |
variables to split data frame by, as a
character vector ( |
.fun |
function to apply to each piece
(default: |
... |
additional functions passed to |
order_type |
a function like |
.total |
character string with the name to give to the subset of data
that includes all the observations of a variable (default: |
.all |
logical, indicating whether the functions have to be evaluated on the complete dataset. |
a data cube, with a column for each categorical variable used, and a row for each combination of all the categorical variables' modalities. In addition to all the modalities, each variable will also have a "Total" modality, which includes all the others. The data cube will contain marginal, conditional and joint empirical distributions...
data("invented_wages")
str(invented_wages)

tmp <- dcc(
  .data = invented_wages,
  .variables = c("gender", "sector"),
  .fun = jointfun_
)
tmp
str(tmp)

tmp2 <- dcc2(
  .data = invented_wages,
  .variables = c("gender", "education"),
  .fun = jointfun_,
  order_type = extract_unique2
)
tmp2
str(tmp2)

# dcc5 works like dcc2, but has an additional optional argument, .total,
# that can be added to give a name to the groups that include all the
# observations of a variable.
tmp5 <- dcc5(
  .data = invented_wages,
  .variables = c("gender", "education"),
  .fun = jointfun_,
  .total = "TOTAL",
  order_type = extract_unique2
)
tmp5
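The idea of a cube that holds joint counts together with a total for each variable can be mimicked in base R. This is only a sketch on a made-up toy data frame, not how the dcc functions are implemented: `addmargins()` labels the totals "Sum" rather than "Totale", and it keeps combinations with zero counts, which the package may treat differently.

```r
# Toy data (hypothetical values, column names borrowed from invented_wages).
df <- data.frame(
  gender = c("men", "women", "men", "women", "men"),
  sector = c("secondary", "tertiary", "tertiary", "tertiary", "secondary")
)

# Joint counts plus a "Sum" level for every variable, in long format.
cube <- as.data.frame(addmargins(table(df)), responseName = "n")
cube

# Rows where every variable is "Sum" describe the complete dataset.
cube[cube$gender == "Sum" & cube$sector == "Sum", ]
```

With two variables of two modalities each, the result has (2 + 1) * (2 + 1) rows: the joint cells, the margins of each variable, and the grand total.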
Data cube creation
dcc6(
  .data,
  .variables,
  .funs_list = list(n = ~dplyr::n()),
  .total = "Totale",
  order_type = extract_unique4,
  .all = TRUE
)

dcc6_fixed(
  .data,
  .variables,
  .funs_list = list(n = ~dplyr::n()),
  .total = "Totale",
  order_type = extract_unique5,
  .all = TRUE,
  fixed_variable = NULL
)
.data |
data frame to be processed. |
.variables |
variables to split data frame by, as a character vector
( |
.funs_list |
a list of function calls, each in the form of a right-hand (one-sided) formula. |
.total |
character string with the name to give to the subset of data
that includes all the observations of a variable (default: |
order_type |
a function like |
.all |
logical, indicating if functions have to be evaluated on the complete dataset. |
fixed_variable |
name of the variable for which you do not want to estimate the total |
dcc6(
  invented_wages,
  .variables = c("gender", "sector"),
  .funs_list = list(n = ~dplyr::n()),
  .all = TRUE
)

dcc6(
  invented_wages,
  .variables = c("gender", "sector"),
  .funs_list = list(n = ~dplyr::n()),
  .all = FALSE
)
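A one-sided formula such as `~dplyr::n()` stores a call without evaluating it. The sketch below shows, in base R only, how a named list of such formulas can be evaluated against a data frame. How dcc6() evaluates `.funs_list` internally is not stated here, so this is an illustration of the formula mechanism under that assumption; the toy data and the `mean_wage` entry are hypothetical.

```r
# Toy data; the column names mirror invented_wages, the values are made up.
df <- data.frame(wage = c(10, 20, 30), sample_weights = c(1, 1, 2))

# A named list of one-sided formulas, in the style of .funs_list.
funs_list <- list(
  n = ~length(wage),
  mean_wage = ~weighted.mean(wage, sample_weights)
)

# f[[2]] is the unevaluated call on the right-hand side of the formula;
# eval() runs it with the data frame columns in scope.
results <- lapply(funs_list, function(f) eval(f[[2]], envir = df))
str(results)
```

The names of the list become the names of the computed statistics, which is why each entry of `.funs_list` is named.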
Functions to be used in conjunction with the 'dcc' family
extract_unique(df)
extract_unique2(df)
extract_unique3(df)
extract_unique4(df)
extract_unique5(df)
df |
a data frame |
a list whose elements are character vectors of the unique values of each column
data("invented_wages")
tmp <- extract_unique(df = invented_wages[, c("gender", "sector")])
tmp
str(tmp)
Weighted empirical cumulative distribution function (ecdf), conditional on one or more variables
Fhat_conditional_(.data, .variables, x, weights)
.data |
a data frame |
.variables |
a character vector with one or more column names |
x |
character vector of length one, with the name of the numeric column whose conditional ecdf has to be estimated |
weights |
character vector of length one, indicating the name of the positive numeric column of weights, which will be used in the estimation of the conditional ecdf |
a data frame, with the variables used to condition, the x variable, and the columns wsum (aggregated sum of weights, based on unique values of x) and Fhat (the estimated conditional ecdf). In addition to data.frame, the object has classes grouped_df, tbl_df and tbl (from package dplyr)
Fhat_conditional_( mtcars, .variables = c("vs", "am"), x = "mpg", weights = "cyl" )
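For intuition, a conditional weighted ecdf can be computed by hand in base R: sort by x within each group, then divide the running sum of weights by the group's total weight. This is a simplified sketch, not the package's code. Unlike the wsum column described above, it does not aggregate tied x values within a group, so tied rows get separate cumulative values.

```r
# Same columns as the example above; mtcars ships with base R.
d <- mtcars[order(mtcars$vs, mtcars$am, mtcars$mpg), c("vs", "am", "mpg", "cyl")]

# One group per observed combination of the conditioning variables.
g <- interaction(d$vs, d$am, drop = TRUE)

# Running share of the weights (cyl) within each group, in order of mpg.
d$Fhat <- ave(d$cyl, g, FUN = function(w) cumsum(w) / sum(w))
head(d)

# Within every group the ecdf ends at 1.
tapply(d$Fhat, g, max)
```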
Weighted empirical cumulative distribution function (data frame version)
Fhat_df_(.data, x, weights)
.data |
a data frame |
x |
name of the numeric column (as character) |
weights |
name of the weight column (as character) |
a data frame with columns: x, wcum and Fhat
data(invented_wages)
Fhat_df_(invented_wages, "wage", "sample_weights")
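The three output columns can be reproduced on a small vector to see what they mean. The function below is a hypothetical base-R sketch (the name `weighted_ecdf` is not part of the package) and assumes that wcum is the cumulative sum of weights over the sorted unique values of x, with Fhat being wcum divided by the total weight.

```r
# Hypothetical helper: weighted ecdf evaluated at the unique values of x.
weighted_ecdf <- function(x, w) {
  ux <- sort(unique(x))
  # Total weight attached to each unique value of x.
  wsum <- vapply(ux, function(v) sum(w[x == v]), numeric(1))
  wcum <- cumsum(wsum)
  data.frame(x = ux, wcum = wcum, Fhat = wcum / sum(w))
}

res <- weighted_ecdf(x = c(1, 2, 2, 3), w = c(1, 1, 2, 4))
res
```

Here the two observations with x = 2 are merged into one row carrying their combined weight of 3, so Fhat jumps from 0.125 to 0.5 at that value.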
This dataset has been completely invented, in order to do some examples with the package.
invented_wages
A data frame (tibble) with 1000 rows and 5 variables:
gender
gender of the worker (men or women)
sector
economic sector where the worker is employed (secondary or tertiary)
education
educational level of the worker (I, II or III)
wage
monthly wage of the worker (in an invented currency)
sample_weights
sampling weights
Every row of the dataset represents a fictitious individual worker. For each individual, the dataset records gender, the economic sector of employment, the level of education and the wage. A further column holds the sampling weights.
A minimal function which counts the number of observations by groups in a data frame
jointfun_(.data, .variables, ...)
.data |
data frame to be processed |
.variables |
variables to split data frame by, as a character vector ( |
... |
additional function calls to be applied on the .data |
a data frame, with a column for each categorical variable used, and a row for each combination of all the categorical variables' modalities.
data("invented_wages")
tmp <- jointfun_(.data = invented_wages, .variables = c("gender", "sector"))
tmp
str(tmp)
Removes all the rows where variables have value .total.
only_joint(.cube, .total = "Totale", .variables = NULL)
.cube |
a datacube with 'Totale' modalities |
.total |
modality to eliminate (filter out) (default: "Totale") |
.variables |
a character vector with the names of the categorical variables |
a subset of the data cube with only the combinations of all the variables' modalities, without the "margins".
data(invented_wages)
str(invented_wages)

vars <- c("gender", "education")
tmp <- dcc2(
  .data = invented_wages,
  .variables = vars,
  .fun = jointfun_,
  order_type = extract_unique2
)
tmp
str(tmp)

only_joint(tmp, .variables = vars)

# Compare dimensions (number of groups)
dim(tmp)
dim(only_joint(tmp, .variables = vars))
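The filtering step can be pictured as dropping every row in which at least one categorical variable equals the total label. Below is a base-R sketch on a hand-written toy cube; the data are hypothetical and this is not the package's implementation.

```r
# Toy cube with a single education level, so the margins are easy to check.
cube <- data.frame(
  gender    = c("men", "women", "Totale", "men", "women", "Totale"),
  education = c("I", "I", "I", "Totale", "Totale", "Totale"),
  n         = c(3, 2, 5, 3, 2, 5)
)
vars <- c("gender", "education")
total <- "Totale"

# TRUE for rows where any of the variables is the total label (a margin).
is_margin <- Reduce(`|`, lapply(cube[vars], function(col) col == total))

# Keep only the purely joint combinations.
joint <- cube[!is_margin, ]
joint
```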
Empirical weighted quantile
wq(x, weights, probs = c(0.5))
x |
A numeric vector |
weights |
A vector of (positive) sample weights |
probs |
a numeric vector with the desired quantile levels (default 0.5, the median) |
The weighted quantile (a numeric vector)
Ferrez, J., Graf, M. (2007). Enquête suisse sur la structure des salaires. Programmes R pour l'intervalle de confiance de la médiane. (Rapport de méthodes No. 338-0045). Neuchâtel: Office fédéral de statistique.
wq(x = rnorm(100), weights = runif(100))
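One common definition of a weighted quantile is the smallest value of x at which the weighted ecdf reaches the requested level. The sketch below implements that definition in base R under a hypothetical name, `weighted_quantile`. It is an assumption that this matches wq(): the method in the reference above may interpolate or handle ties differently, so results need not agree exactly.

```r
# Hypothetical helper: inverse of the weighted ecdf, without interpolation.
weighted_quantile <- function(x, w, probs = 0.5) {
  o <- order(x)
  x <- x[o]
  w <- w[o]
  Fhat <- cumsum(w) / sum(w)
  # First sorted value whose cumulative weight share reaches each level.
  vapply(probs, function(p) x[which(Fhat >= p)[1]], numeric(1))
}

res <- weighted_quantile(c(10, 20, 30, 40), w = c(1, 1, 1, 5), probs = c(0.25, 0.5))
res
```

In this toy input the last observation carries five eighths of the total weight, so the weighted median is 40 even though the unweighted median would lie between 20 and 30.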