Starting with the 36-variable .csv file of Behavioral Risk Factor Surveillance System (BRFSS) data prepared HERE, this article will demonstrate how to test, statistically, whether value distributions of a categorical variable: 1) vary across age groups, or 2) have statistically significantly changed over time.
There is a great video on the underlying math provided by Khan Academy on YouTube, but let’s look at a practical application. To decide on a couple of variables, I headed to the R Shiny App that I created in the above article wherein I describe how to prepare the data. (It can take 30 seconds to load if the app was sleeping, because the raw data file is a bit large, and it’s just meant for us data scientists who'd prefer to look at it on my server than running the provided code locally 😊.)
Question: Do you still have asthma? (right now)
Let's look at asthma on the survey. As a follow-up to an initial question of whether respondents were ever told they had asthma, survey administrators followed up with the question: "Do you still have asthma?" Above is a line graph of pooled respondents 18-49 years of age living in California or New York who answered "Yes" or "No" to this question. Interestingly, there seems to have been a spike of self-reported asthma over the same time period as our COVID-19 pandemic, followed by a subsequent decline.
But was the spike statistically significant? Is there a statistically significant difference in the amount of self-reported asthma ("Yes" or "No") in this population segment between 2019 and 2023?
Chi-Square Tests are a statistical method that can be used to answer this question properly, and they are relatively easy to perform. They work by comparing relative proportions of responses across categories, percentages of each value, and determining whether those proportions have statistically significantly changed. Let's start off by looking at just the relative proportions of "Yes" and "No" answers across years.
1. Load the Data
# Load the dataframe
data <- read.csv("D:/d4ph/brfss/analysis/data/brfss_36_2019to2023.csv")
# Load the RDS file containing mappings
label_map <- readRDS("D:/d4ph/brfss/analysis/data/BRFSS_label_map_2011-2023.rds")
- Dataset: The sample dataframe
brfss_36_2019to2023.csv
is loaded. - Mappings:
BRFSS_label_map_2011-2023.rds
contains descriptive labels for coded values in the dataset.
2. Focus on Relevant Data
# View column names and structure
print("Structure of Data:")
str(data)
# Focus on ASTHNOW (asthma status) and year (time)
data_selected <- data %>%
select(year, ASTHNOW) %>%
filter(!is.na(ASTHNOW), !is.na(year))
- The script narrows down the analysis to:
year
(the time variable).ASTHNOW
(asthma status).
- Rows with missing values in either
year
or ASTHNOW
are removed to ensure clean analysis.
3. Map Labels
# If mapping exists, replace codes with descriptive labels
# Assuming `label_map` is a list with variable-value mappings
asthma_labels <- label_map[["ASTHNOW"]]
data_selected <- data_selected %>%
mutate(ASTHNOW = factor(ASTHNOW, levels = names(asthma_labels), labels = asthma_labels))
- Descriptive labels for
ASTHNOW
codes (e.g., "Yes", "No") are applied using the mappings from label_map
. - The
ASTHNOW
variable is transformed into a factor to ensure correct ordering and display.
4. Contingency Table
# Create a table of ASTHNOW responses by YEAR
asthma_table <- data_selected %>%
tabyl(year, ASTHNOW) %>%
adorn_totals(c("row", "col")) # Add totals
# Remove "Don't know/Not sure" and "Refused" columns from table (for demonstration purposes)
asthma_table <- asthma_table[, -c(4, 5)]
print("Contingency Table of ASTHNOW Over Years:")
print(asthma_table)
- A contingency table (cross-tabulation) of
ASTHNOW
responses by year is created using tabyl()
from the janitor
library. - Totals for rows and columns are added for clarity.
- Demonstration: "Don't know/Not sure" and "Refused" responses are removed to focus on key values.
5. Pairwise Chi-Square Tests
# Prepare a function to run chi-square test for each combination of years
run_chi_square <- function(data, year1, year2) {
subset_data <- data %>% filter(year %in% c(year1, year2))
chi_table <- table(subset_data$year, subset_data$ASTHNOW)
chisq.test(chi_table)
}
# Get unique years
years <- unique(data_selected$year)
# Initialize a list to store results
chi_results <- list()
# Run chi-square tests for each combination of years
for (i in 1:(length(years) - 1)) {
for (j in (i + 1):length(years)) {
year1 <- years[i]
year2 <- years[j]
chi_result <- run_chi_square(data_selected, year1, year2)
chi_results[[paste(year1, "vs", year2)]] <- chi_result
}
}
# Output Chi-Square test results
chi_results_table <- tibble::tibble(
Comparison = names(chi_results),
p_value = sapply(chi_results, function(x) x$p.value),
statistic = sapply(chi_results, function(x) x$statistic)
)
# If p values are < 0.001, set them to 0.001 without trailing 0s (for readability)
chi_results_table$p_value <- ifelse(chi_results_table$p_value < 0.001,
"p < 0.001",
formatC(chi_results_table$p_value, format = "f", digits = 3))
print("Chi-Square Test Results for Each Combination of Years:")
#print(chi_results_table)
gt(chi_results_table)
- The script runs Chi-Square tests for all pairwise combinations of years.
- A helper function
run_chi_square()
:- Filters the data for two specific years.
- Creates a contingency table.
- Runs the Chi-Square test.
- Results are stored in a table summarising:
- Comparison: "2019 vs 2020", etc.
- p-value: Statistical significance.
- Test statistic: Chi-Square statistic.
6. Results Display Using gt
# Display contingency table and Chi-Square results using gt
asthma_gt <- gt(asthma_table) %>%
tab_header(
title = "Contingency Table: ASTHNOW Over Years",
subtitle = "Chi-Square Test for Change in Asthma Reporting Over Time"
)
print(asthma_gt)
- The contingency table is formatted into a polished table using the
gt
library. - A header is added for clarity:
- Title: Contingency Table: ASTHNOW Over Years
- Subtitle: Chi-Square Test for Change in Asthma Reporting Over Time
7. Visualization
# Plot asthma proportions over years
asthma_plot <- data_selected %>%
group_by(year, ASTHNOW) %>%
summarize(count = n(), .groups = "drop") %>%
group_by(year) %>%
mutate(proportion = count / sum(count)) %>%
ggplot(aes(x = factor(year), y = proportion, fill = ASTHNOW)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Proportion of ASTHNOW Responses Over Years",
x = "Year",
y = "Proportion",
fill = "Asthma Status"
) +
theme_minimal()
print(asthma_plot)
- A stacked bar chart is created to show proportions of ASTHNOW responses over time:
- X-axis: Years.
- Y-axis: Proportions.
- Fill: ASTHNOW categories ("Yes", "No").
- This helps visually assess any trends or changes across the years.
Key Features:
- Clean, minimalistic design.
- Proportions make year-to-year comparisons intuitive.
Outputs & Interpretation
Of those 18-49 years of age who had ever reported being told they had asthma in California and New York, there are statistically significant differences in levels of self-reported current asthma (Yes or No) between almost every combination of years from 2019 to 2023. However, at an alpha of 0.05, there was not a statistically significant difference in self-reported asthma within this cohort from 2020 to 2021 (p=0.152) or from 2022 to 2023 (p=0.369).
But what is this actually telling us? After all, according to our above logic, we already excluded anyone who hadn't reported being told at some point that they did have asthma. So, this isn't suggesting a change in asthma prevalence. Not necessarily. Rather, these data merely indicate whether included respondents with a prior asthma diagnosis were differently likely between years to report experiencing asthma symptoms.
A simple look at the 2023 BRFSS Codebook underscores this point.
According to this, of the 433,323 respondents represented on the table (total of the Frequency column), a full 368,957 of them (~85.1%) weren't asked the question because they did not answer in the affirmative to having been told at some point in their lives they had asthma.
And there's more. These respondents represent various combinations of, and this is from CDC's article on weighting the BRFSS data (or "weightning" it, if you go by the file name), "sex by age group, race/ethnicity, education, marital status, tenure, sex by race/ethnicity, age group by race/ethnicity, and phone ownership." So, to generalize to the two states whose respondents were included above, California and New York, the correct weights would have to be applied to correctly represent the respective population subsegments in each state.
Suffice it to say, statistical tests, including Chi-Squares, are tremendously helpful, but they do not simplify the sticky, messy realities of mapping public health data to the real world.
You'll notice, perhaps, my mention of the above being pairwise chi-square tests. This is because there were only 2 variables being compared. But, what if we wanted to assess whether distributions were statistically significantly different across more than 2 groups?
Let's step it up a couple notches, tackling a Chi-Square Test with multiple categories and also applying appropriate weights to generalize to the whole United States. Our question: How has self-reported general health changed across the United States from 2019 to 2023? General values for variable of interest, GENHLTH, in 2019 and 2023 can be explored via the R Shiny app referenced above:
But again, these are sums of samples across the United States that must be variously weighted to represent, according to these CDC survey data, the general population of the country.
To do this, let's start by importing a dataframe with requisite columns from base 2023 BRFSS data in the prepared .csv file, offered in my post HERE, keep only the necessary columns for this analysis (GENHLTH and LLCPWT), add a "year" variable, and combine the datasets.
# Import base BRFSS data for 2019 and 2023
brfss_2019 <- read.csv("D:/d4ph/brfss/analysis/data/2019_342.csv")
brfss_2023 <- read.csv("D:/d4ph/brfss/analysis/data/2023_350.csv")
# Only keep the columns of interest (GENHLTH, x_LLCPWT)
brfss_2019 <- brfss_2019[, c("GENHLTH", "x_LLCPWT")]
brfss_2023 <- brfss_2023[, c("GENHLTH", "x_LLCPWT")]
# Add a column for year
brfss_2019$year <- 2019
brfss_2023$year <- 2023
# Combine the datasets
data <- rbind(brfss_2019, brfss_2023)
Now, we can map GENHLTH numerical values to their meanings and create a basic contingency table.
# Map the GENHLTH values to labels
genhlth_labels <- label_map[["GENHLTH"]]
data$GENHLTH <- factor(data$GENHLTH, levels = names(genhlth_labels), labels = genhlth_labels)
# Create a contingency table of GENHLTH responses by year
genhlth_table <- data %>%
tabyl(year, GENHLTH) %>%
adorn_totals(c("row", "col"))
# Print the contingency table
gt(genhlth_table)
Notice that the piddling numbers in "Don't know/Not Sure", "Refused", and "NA_" (null) values are so small as to be irrelevant to our analysis. (Rerunning everything below with them included confirms this.) Remember, too, that these are merely sums of respondent, whereas we want to know the size of the respective, projected populations that these survey responses represent.
We can drop those 3 unnecessary columns and look at the sums of our variable weights (x_LLCPWT) in a straightforward way by utilizing a couple additional libraries, tidyverse and srvyr :
# Load required libraries
library(tidyverse)
library(srvyr)
# Combine the datasets
data <- rbind(brfss_2019, brfss_2023)
# Map the GENHLTH values to labels
genhlth_labels <- label_map[["GENHLTH"]]
data$GENHLTH <- factor(data$GENHLTH, levels = names(genhlth_labels), labels = genhlth_labels)
# Step 1: Filter valid GENHLTH responses
data_clean <- data %>%
filter(GENHLTH %in% c("Excellent", "Very good", "Good", "Fair", "Poor"))
# Step 2: Convert to survey design
data_svy <- data_clean %>%
as_survey_design(weights = x_LLCPWT)
# Step 3: Summarize only the weighted sums
genhlth_table_wt <- data_svy %>%
group_by(GENHLTH, year) %>%
summarise(weighted_sum = survey_total(vartype = NULL), .groups = "drop") %>% # Drop SE calculation
pivot_wider(
names_from = year,
values_from = weighted_sum,
values_fill = 0 # Fill blanks with 0
)
# Step 4: Format the table
genhlth_table_wt %>%
gt() %>%
cols_label(
GENHLTH = "General Health",
`2019` = "2019 Weighted Sum",
`2023` = "2023 Weighted Sum"
) %>%
fmt_number(
columns = c(`2019`, `2023`),
decimals = 0
) %>%
tab_header(
title = "Weighted Sums of General Health Responses",
subtitle = "Comparison of 2019 and 2023 BRFSS Data"
)
One of the coolest things about both R and Python is the community of users who donate awesome code to make life easier!
This code gives us the projected weighted sums of our variable:
But are the differences statistically significant?
So, let's use a Chi-Square Test to answer this question, first by manually building the matrix and then by extracting what we need from our existing tibble of data. (It's useful to know how to do both.)
Manually Performing the Chi-Square Test
# Manual calculation of chi-square test for weighted data (using the weighted sums)
# Create a contingency table with weighted sums
weighted_sums <- matrix(
c(43798579, 40236000, # Excellent
79226332, 78790995, # Very good
81467949, 85239202, # Good
35556252, 37778702, # Fair
11811356, 11291592), # Poor
nrow = 5, byrow = TRUE,
dimnames = list(
c("Excellent", "Very good", "Good", "Fair", "Poor"), # Rows
c("2019", "2023") # Columns
)
)
# Print the contingency table
print("Contingency Table of Weighted Sums:")
print(weighted_sums)
# Conduct the Chi-square test
chi_square_test <- chisq.test(weighted_sums)
# Print the test results
print("Chi-square Test Results:")
print(chi_square_test)
Programmatically Conduct a Chi-Square Test from Our Table
# Step 1: Extract the weighted sums
# Assuming genhlth_table_wt is your tibble with weighted sums
weighted_sums_table <- genhlth_table_wt %>%
select(GENHLTH, `2019`, `2023`) %>% # Keep only relevant columns
column_to_rownames("GENHLTH") %>% # Use GENHLTH as row names
as.matrix() # Convert to matrix format
# Step 2: Print the contingency table
print("Contingency Table of Weighted Sums:")
print(weighted_sums_table)
# Step 3: Perform the Chi-square test
chi_square_test <- chisq.test(weighted_sums_table)
# Step 4: Print the results
print("Chi-square Test Results:")
print(chi_square_test)
As you can see, both the manual and programmatic approach give us the same result. (Khan Academy also has an awesome video about the math involved with this multiple variable calculation.)
The p-value is wayyy less than 0.05 (and 0.001, for that matter). We can therefore reject the null hypothesis and conclude that there is a statistically significant difference in self-reported general health between 2019 and 2023.
BUT WHAT IS THAT DIFFERENCE!?
Chi-Square Tests tell us that a difference is there, but that is it. So, in what way is general health across these years different? In what categories did the self-reported general health of the population change across these categories from 2019 to 2023?
To determine this, post-hoc tests are necessary. We need to compare differences between the actual values and what the anticipate values would be were there no differences between proportions across groups. One way is to conduct pairwise proportion tests for confirmation, comparing our groups directly.
Manually Performing the Post Hoc Bonferroni-adjusted Pairwise Proportion Test
# Step 1: Create contingency table
weighted_sums <- matrix(
c(43798579, 40236000, # Excellent
79226332, 78790995, # Very good
81467949, 85239202, # Good
35556252, 37778702, # Fair
11811356, 11291592), # Poor
nrow = 5, byrow = TRUE,
dimnames = list(
c("Excellent", "Very good", "Good", "Fair", "Poor"), # General Health
c("2019", "2023") # Years
)
)
# Step 2: Total counts for each year
total_2019 <- sum(weighted_sums[, "2019"])
total_2023 <- sum(weighted_sums[, "2023"])
# Step 3: Run pairwise proportion tests for each General Health group
pairwise_pvals <- data.frame(
General_Health = rownames(weighted_sums),
p_value = sapply(1:nrow(weighted_sums), function(i) {
if (weighted_sums[i, "2019"] > 0 & weighted_sums[i, "2023"] > 0) { # Ensure valid counts
prop.test(
x = c(weighted_sums[i, "2019"], weighted_sums[i, "2023"]),
n = c(total_2019, total_2023),
correct = FALSE
)$p.value
} else {
NA # Return NA if data is insufficient
}
})
)
# Step 4: Adjust p-values for multiple comparisons
pairwise_pvals$p_adjusted <- p.adjust(pairwise_pvals$p_value, method = "bonferroni")
# Step 5: Display the results in a clean table
library(gt)
pairwise_pvals %>%
gt() %>%
fmt_number(columns = c(p_value, p_adjusted), decimals = 4) %>%
tab_header(
title = "Post-hoc Pairwise Proportion Tests",
subtitle = "Bonferroni-Adjusted p-values for General Health Between 2019 and 2023"
) %>%
data_color(
columns = p_adjusted,
fn = scales::col_numeric(
palette = c("lightblue", "white", "red"),
domain = c(0, 0.05)
)
)
Programmatically Performing the Post Hoc Bonferroni from the Table
# Step 1: Extract observed values from the contingency table
observed_counts <- weighted_sums_table
# Step 2: Calculate row totals and column totals
row_totals <- rowSums(observed_counts)
col_totals <- colSums(observed_counts)
grand_total <- sum(observed_counts)
# Step 3: Calculate expected values manually
expected_values <- outer(row_totals, col_totals) / grand_total
# Step 4: Perform pairwise comparisons using prop.test
# Prepare an empty data frame to store the results
pairwise_results <- data.frame(
Comparison = character(),
p_value = numeric(),
stringsAsFactors = FALSE
)
# Step 5: Loop through row pairs for pairwise tests
row_names <- rownames(observed_counts)
for (i in 1:(nrow(observed_counts) - 1)) {
for (j in (i + 1):nrow(observed_counts)) {
# Extract counts for the two groups being compared
group1 <- observed_counts[i, ]
group2 <- observed_counts[j, ]
# Combine observed counts for pairwise test
test_result <- prop.test(
x = c(group1[1], group2[1], group1[2], group2[2]), # Counts for 2019 and 2023
n = c(sum(group1), sum(group2), sum(group1), sum(group2)), # Totals
correct = FALSE
)
# Store the p-value
pairwise_results <- rbind(pairwise_results, data.frame(
Comparison = paste(row_names[i], "vs", row_names[j]),
p_value = test_result$p.value
))
}
}
# Step 6: Adjust p-values for multiple comparisons using Bonferroni correction
pairwise_results$p_adjusted <- p.adjust(pairwise_results$p_value, method = "bonferroni")
# Step 7: Display the results
library(gt)
pairwise_results %>%
gt() %>%
fmt_number(columns = c(p_value, p_adjusted), decimals = 4) %>%
tab_header(
title = "Post-hoc Pairwise Comparisons (Bonferroni Adjusted)",
subtitle = "Statistical Differences Between General Health Categories"
) %>%
data_color(
columns = p_adjusted,
fn = scales::col_numeric(
palette = c("lightblue", "white", "red"),
domain = c(0, 0.05)
)
)
Both approaches yield exactly the same result:
Interpretation:
This analysis confirms that the self-reported general health of the United States statistically significantly changed across all health categories between 2019 and 2023 -- greater proportions of the population reported Very Good, Good, and Fair health. Conversely, lower proportions reported being in Excellent and Poor health (p < 0.001 for all categories).
Why? Well, there's the rub. These results do not indicate why these changes occurred. Fewer self-reporting "Poor" general health responses could reflect higher mortality among people with poor health during the global COVID-19 pandemic. Alternatively, the population may have experienced improvements in health outcomes, leading to fewer individuals reporting "Poor" health.
The independent samples from 2019 and 2023 mean that we cannot directly track changes for individuals over time. Instead, we observe changes in the overall population proportions reported by two distinct, independently sampled groups. The data show a statistically significant shift in proportions, but we cannot use them to determine causal factors or track individual health trajectories.
HERE is a single .R file with all of the above code, broken up into sections with notes indicating what the code is doing.