---
title: "NHANES Survey Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{nhanes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette demonstrates how the NHANES 2015–2016 demographic data included in this package were obtained, processed, and are intended to be used. The data are adapted from the [National Health and Nutrition Examination Survey NHANES](https://www.cdc.gov/nchs/nhanes/), conducted by the National Center for Health Statistics (NCHS), Centers for Disease Control and Prevention (CDC). 

> **Disclaimer**: The data sets provided in this package are derived from the NHANES database and have been adapted for educational purposes. As such, they are NOT suitable for use as a research database. For research purposes, you should download original data files from the NHANES website and follow the analysis instructions given there.

## Data Preparation

The raw NHANES data were downloaded in SAS transport format (.xpt) and processed using R, with the following key steps:

-   Reading the demographic file (DEMO_I.xpt) using the haven package.
-   Selecting and renaming key demographic variables (e.g., gender, age, education, income) and survey design variables (strata, weights, PSU).
-   Recoding categorical variables using external code files for clarity (e.g., marital status, education level).
-   Labelling missing values and infrequent categories appropriately.
-   Saving the processed data frame as `nhanes`, which is then loaded with the package for easy access.

## Data Structure

The included `nhanes` data frame contains 9,971 participants and 13 variables. Below is a summary of the variables:

| Variable                | Description                    | Original Name |
|-------------------------|--------------------------------|---------------|
| PSU                     | Masked variance pseudo-PSU     | SDMVPSU       |
| weights                 | 2-year interview weight        | WTINT2YR      |
| strata                  | Masked variance pseudo-stratum | SDMVSTRA      |
| gender                  | Gender (Male/Female)           | RIAGENDR      |
| age                     | Age in years at screening      | RIDAGEYR      |
| birth_country           | Country of birth               | DMDBORN4      |
| marital_status          | Marital status                 | DMDMARTL      |
| interview_lang          | Interview language             | SIALANG       |
| edu_level               | Education level                | DMDHREDU      |
| household_size          | Number of people in household  | DMDHHSIZ      |
| family_size             | Number of people in family     | DMDFMSIZ      |
| annual_household_income | Annual household income        | INDHHIN2      |
| annual_family_income    | Annual family income           | INDFMIN2      |

## Example Usage

```{r message=FALSE}
library(chensus)
library(dplyr)
```

```{r}
# View the structure of the data
glimpse(nhanes)

# Count participants by education level
nhanes |>
  count(edu_level)
```

## Best Practices and References

- For research: Always download the latest, official data directly from the [NHANES website](https://www.cdc.gov/nchs/nhanes/).
- Documentation: Refer to the official NHANES code books for detailed variable definitions and survey methodology.
- Acknowledgment: Data were obtained from the National Health and Nutrition Examination Survey (NHANES), conducted by the National Center for Health Statistics (NCHS), Centers for Disease Control and Prevention (CDC).

## Further Information

- [NHANES main website](https://www.cdc.gov/nchs/nhanes/)
- [NHANES 2015–2016 Data Page](https://wwwn.cdc.gov/nchs/nhanes/search/DataPage.aspx?Component=Demographics&Cycle=2015-2016)

**Note**: This vignette is intended to ensure transparency and proper attribution for the use of NHANES data in this package. Always consult the official NHANES documentation for authoritative guidance.