Home
Posts

U.S. Population (Census)

By Jimmy Fisher
Nov 02, 2024
in Coding Projects

1K views

To study public health, we first need a population of people as well as understand some baseline distribution of characteristics across groups. The three most popular categories to delineate people groups at a population level are age, sex, and geographic location. Race and/or ethnicity is a fourth dimension of differentiation, though groupings of human genetic variation are often unclear and inconsistently applied across data sources. Less common--but situationally useful--dimensions of population discernment are socioeconomic status, educational attainment, marital status, employment, disability, and primary language spoken at home.

There are other ways we divide ourselves into groups, too, but let's get started on our U.S. Census data project. You can look through the available tables at https://data.census.gov/table, but I will start here with age and sex data (table S0101) from the American Community Survey (ACS). Both 1-year and 5-year ACS data are available (differences HERE), but I'm interested in ACS 1-year estimates.

To leverage the API and download these data directly into an open source data editor, I started R and used the tidycensus library to make it easier.

# Load necessary libraries
library(tidycensus)
library(dplyr)
# Set Census API key census_api_key("YOUR_CENSUS_API_KEY")

# Define the years you want to pull data for years <- c(2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020,
2021, 2022, 2023)

# Define the variables for total population, age, and sex brackets
variables <- c(
"B01003_001", # Total population
"B01001_003", # Male: Under 5 years
"B01001_004", # Male: 5-9 years
"B01001_005", # Male: 10-14 years
"B01001_006", # Male: 15-17 years
"B01001_007", # Male: 18 and 19 years
"B01001_008", # Male: 20 years
"B01001_009", # Male: 21 years
"B01001_010", # Male: 22-24 years
"B01001_011", # Male: 25-29 years
"B01001_012", # Male: 30-34 years
"B01001_013", # Male: 35-39 years
"B01001_014", # Male: 40-44 years
"B01001_015", # Male: 45-49 years
"B01001_016", # Male: 50-54 years
"B01001_017", # Male: 55-59 years
"B01001_018", # Male: 60 and 61 years
"B01001_019", # Male: 62-64 years
"B01001_020", # Male: 65 and 66 years
"B01001_021", # Male: 67-69 years
"B01001_022", # Male: 70-74 years
"B01001_023", # Male: 75-79 years
"B01001_024", # Male: 80-84 years
"B01001_025", # Male: 85 years and over
"B01001_027", # Female: Under 5 years
"B01001_028", # Female: 5-9 years
"B01001_029", # Female: 10-14 years
"B01001_030", # Female: 15-17 years
"B01001_031", # Female: 18 and 19 years
"B01001_032", # Female: 20 years
"B01001_033", # Female: 21 years
"B01001_034", # Female: 22-24 years
"B01001_035", # Female: 25-29 years
"B01001_036", # Female: 30-34 years
"B01001_037", # Female: 35-39 years
"B01001_038", # Female: 40-44 years
"B01001_039", # Female: 45-49 years
"B01001_040", # Female: 50-54 years
"B01001_041", # Female: 55-59 years
"B01001_042", # Female: 60 and 61 years
"B01001_043", # Female: 62-64 years
"B01001_044", # Female: 65 and 66 years
"B01001_045", # Female: 67-69 years
"B01001_046", # Female: 70-74 years
"B01001_047", # Female: 75-79 years
"B01001_048", # Female: 80-84 years
"B01001_049" # Female: 85 years and over
)

# Create function to get the data
get_acs_data <- function(state, year, variables, survey) {
get_acs(
geography = ifelse(state == "US", "us", "state"),
variables = variables,
survey = survey,
state = ifelse(state == "US", NULL, state), # If state is "US", set state to NULL
year = year
)
}

# Load fips_codes data
data(fips_codes, package = "tidycensus")

# Create the list of desired states for output
states <- unique(fips_codes$state)[1:51] # Get full list of U.S. states

# Pull data for each state and year in a tryCatch block to skip unavailable # state-year combinations)
acs_data <- lapply(states, function(state) {
lapply(years, function(year) {
tryCatch({
data <- get_acs_data(state, year, variables, survey)
data <- data %>% mutate(year = year) # Add year column to the data
}, error = function(e) {
message(paste("Skipping state", state, "year", year, "due to error:", e$message))
data.frame()
})
}) %>%
bind_rows()
}) %>%
bind_rows() # Combine the data into a single data frame

(Keep in mind that slight alterations to this code will work to download any Census data here.)

This code iterates through each year in the years array and each state to pull the specified variables, dumping outputs into an R object of tibbles/dataframes as acs_data. I placed the code within in a tryCatch bracket to prevent any missing year-state combo from throwing a function-ending error. For example, due to low (COVID-related) survey response rates in 2020, there are no 1-year estimates available for that year.

This yields the following tibble:

As is often the case on account of my limited time, I exported this as a .csv file to pick up the following day, at which point I made 2 variable lists to convert the variable column into age and sex variables (reference):

# Import acs_data.csv
acs_data <- read.csv("acs_data.csv") # (because I continued from a different session)

# Print the first few rows of the acs_data
head(acs_data)

# Add sex column-the data
sex_map <- list(
Male = c("B01001_002", "B01001_003", "B01001_004", "B01001_005",
"B01001_006", "B01001_007", "B01001_008", "B01001_009",
"B01001_010", "B01001_011", "B01001_012", "B01001_013",
"B01001_014", "B01001_015", "B01001_016", "B01001_017",
"B01001_018", "B01001_019", "B01001_020", "B01001_021",
"B01001_022", "B01001_023", "B01001_024"),
Female = c("B01001_026", "B01001_027", "B01001_028", "B01001_029",
"B01001_030", "B01001_031", "B01001_032", "B01001_033",
"B01001_034", "B01001_035", "B01001_036", "B01001_037",
"B01001_038", "B01001_039", "B01001_040", "B01001_041",
"B01001_042", "B01001_043", "B01001_044", "B01001_045",
"B01001_046", "B01001_047", "B01001_048")
)

age_map <- list(
"B01003_001" = "total",
"B01001_003" = "0-4",
"B01001_004" = "5-9",
"B01001_005" = "10-14",
"B01001_006" = "15-17",
"B01001_007" = "18-19",
"B01001_008" = "20",
"B01001_009" = "21",
"B01001_010" = "22-24 years",
"B01001_011" = "25-29 years",
"B01001_012" = "30-34 years",
"B01001_013" = "35-39 years",
"B01001_014" = "40-44 years",
"B01001_015" = "45-49 years",
"B01001_016" = "50-54 years",
"B01001_017" = "55-59 years",
"B01001_018" = "60-61 years",
"B01001_019" = "62-64 years",
"B01001_020" = "65-66 years",
"B01001_021" = "67-69 years",
"B01001_022" = "70-74 years",
"B01001_023" = "75-79 years",
"B01001_024" = "80-84 years",
"B01001_025" = "85+",
"B01001_027" = "0-4",
"B01001_028" = "5-9",
"B01001_029" = "10-14",
"B01001_030" = "15-17",
"B01001_031" = "18-19",
"B01001_032" = "20",
"B01001_033" = "21",
"B01001_034" = "22-24",
"B01001_035" = "25-29",
"B01001_036" = "30-34",
"B01001_037" = "35-39",
"B01001_038" = "40-44",
"B01001_039" = "45-49",
"B01001_040" = "50-54",
"B01001_041" = "55-59",
"B01001_042" = "60-61",
"B01001_043" = "62-64",
"B01001_044" = "65-66",
"B01001_045" = "67-69",
"B01001_046" = "70-74",
"B01001_047" = "75-79",
"B01001_048" = "80-84",
"B01001_049" = "85+"
)

# Add the 'sex' column based on the variable values
acs_data <- acs_data %>%
mutate(sex = case_when(
variable %in% sex_map$Male ~ "M",
variable %in% sex_map$Female ~ "F",
TRUE ~ NA_character_
))

# Add the age column-acs_data based on the age_map
acs_data$age <- sapply(acs_data$variable, function(v) age_map[[v]])

# Rename GEOID column to st_fips
acs_data <- acs_data %>% rename(st_fips = GEOID)

# Rename NAME column to state
acs_data <- acs_data %>% rename(state = NAME)

# Drop variable and x columns
acs_data <- acs_data %>%
select(-variable, -X)

# View the updated acs_data
head(acs_data)

Now, the head of ACS 1-year dataframe looks like this:

Perfect. I built a very simple R Shiny app to explore these data. Note that margin of error (moe) should used in statistical calculations, but let's keep it simple for now:

Code (on GitHub) <- includes data prep
Deployment (on shinyapps.io)

"This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau."

So, now we have a base U.S. population, on top of which we can overlay other available public data. Since our families are the fundamental driver of human civilization, that was my exploratory step with Census data (HERE).

Super Admin

Jimmy Fisher

previous post Correlation is not Causation

next post Intro to Data Projects

you may also like

by Jimmy Fisher
Nov 10, 2024

U.S. Families (Census Data)

by Jimmy Fisher
Nov 16, 2024

U.S. Mortality Data (NCHS)

by Jimmy Fisher
Dec 01, 2024

Wrangling BRFSS (2011-2023)

by Jimmy Fisher
Dec 17, 2024

Chi-Square Tests & BRFSS Weights

by Jimmy Fisher
Dec 18, 2024

Mental Health, MLR, & One-Hot Encoding (BRFSS)

by Jimmy Fisher
Dec 18, 2024

Mental Health, MLR, & One-Hot Encoding (BRFSS)

by Jimmy Fisher
Dec 17, 2024

Chi-Square Tests & BRFSS Weights

by Jimmy Fisher
Dec 14, 2024

No Skepticism, No Science

Coding Projects

by Jimmy Fisher
Dec 01, 2024

Wrangling BRFSS (2011-2023)

The Behavioral Risk Factor surveillance System (BRFSS) is a health-related telephone survey establishe...
read more