Not only does this protect against multicollinearity, but it also keeps the feature space manageable: if we one-hot encoded every current column, we'd end up with around 1,800 columns, which would be impractical and slow, so this approach is better.
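For context, that figure can be approximated directly from the data. The snippet below is a minimal sketch (it assumes data_filtered is the working DataFrame from the previous step and treats the continuous columns listed in the code below as the only non-categorical ones); it sums the number of unique levels across the remaining columns to estimate how wide a fully one-hot-encoded design matrix would be.
# Sketch: estimate how many columns full one-hot encoding would produce
continuous_cols = ['MAXVO21_', 'FC601_', 'x_BMI5', 'DROCDY4_', 'x_DRNKWK2', 'PHYSHLTH']
one_hot_width = sum(
    data_filtered[col].nunique()  # One dummy column per category level
    for col in data_filtered.columns
    if col not in continuous_cols
)
print(f"Estimated one-hot encoded width: {one_hot_width} columns")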
# Perform iterative correlation testing to determine which columns to keep
import numpy as np  # Used for the upper-triangle mask below
import pandas as pd  # Used for crosstabs and the Cramér's V matrix below
from scipy.stats import chi2_contingency  # Chi-square test behind Cramér's V
# Step 1: Compute correlation matrix for continuous predictors
continuous_vars = ['MAXVO21_', 'FC601_', 'x_BMI5', 'DROCDY4_', 'x_DRNKWK2', 'PHYSHLTH'] # Known continuous predictors
categorical_vars = [col for col in data_filtered.columns if col not in continuous_vars] # For Cramer's V (next)
corr_matrix = data_filtered[continuous_vars].corr().abs() # Absolute correlations
# Step 2: Remove duplicate correlations (upper triangle)
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Step 3: Identify predictors to drop based on threshold
threshold = 0.7 # Set correlation threshold to avoid high redundancy
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > threshold)]
print(f"Columns to drop due to high correlation: {to_drop}")
Iterative correlation testing on our continuous variables identified one column to drop (FC601_); after removing it, along with the low-variance columns flagged earlier, we have 109 columns left, including our dependent variable (MENTHLTH).
# Print number of columns before removal
print(f"Columns before removal: {data_filtered.shape[1]}")
# Remove the columns identified above
data_filtered = data_filtered.drop(to_drop + low_variance_vars, axis=1)
# Remove dropped variables from lists
continuous_vars = [col for col in continuous_vars if col not in to_drop + low_variance_vars]
categorical_vars = [col for col in categorical_vars if col not in to_drop + low_variance_vars]
# Print number of columns after removal
print(f"Columns after removal: {data_filtered.shape[1]}")
# Function to compute Cramér's V
def cramers_v(x, y):
    """Calculate Cramér's V for two categorical variables."""
    confusion_matrix = pd.crosstab(x, y)  # Create a contingency table
    chi2, _, _, _ = chi2_contingency(confusion_matrix)  # Perform Chi-Square test
    n = confusion_matrix.sum().sum()  # Total number of observations
    phi2 = chi2 / n  # Compute phi-squared
    r, k = confusion_matrix.shape  # Number of rows and columns in the table
    phi2_corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))  # Bias-corrected phi-squared
    r_corr = max(1, r - ((r - 1) ** 2) / (n - 1))  # Bias-corrected row count (floored at 1)
    k_corr = max(1, k - ((k - 1) ** 2) / (n - 1))  # Bias-corrected column count (floored at 1)
    denominator = min((k_corr - 1), (r_corr - 1))
    if denominator <= 0:  # Avoid division by zero for degenerate tables
        return np.nan
    return np.sqrt(phi2_corr / denominator)  # Return bias-corrected Cramér's V
# Create a matrix to store Cramér's V values
cramers_matrix = pd.DataFrame(index=categorical_vars, columns=categorical_vars)
# Compute Cramér's V for all pairs of categorical variables
for col1 in categorical_vars:
    for col2 in categorical_vars:
        if col1 == col2:
            cramers_matrix.loc[col1, col2] = 0.0  # Self-correlation set to 0 for comparison
        else:
            cramers_matrix.loc[col1, col2] = cramers_v(data_filtered[col1], data_filtered[col2])
# Convert matrix values to floats for further processing
cramers_matrix = cramers_matrix.astype(float)
# Print Cramér's V matrix
print("Cramér's V Matrix:")
print(cramers_matrix)
# Identify variables with high association (V > 0.7)
threshold = 0.7 # Set threshold for high association
to_drop_cat = []
# Iterate over columns and ignore self-correlation (diagonal values)
for col in cramers_matrix.columns:
    high_association = cramers_matrix[col][cramers_matrix.index != col] > threshold  # Exclude diagonal
    if any(high_association):
        to_drop_cat.append(col)
print(f"Categorical variables to drop (V > {threshold}): {to_drop_cat}")