Sunday, December 22, 2024

Mental Health, MLR, & One-Hot Encoding (BRFSS)


In this article, I cover how to conduct a Multiple Linear Regression (MLR) on Behavioral Risk Factor Surveillance System (BRFSS) data representing the population of the United States to predict self-reported mental health. This is a continuation of a series of posts that began with cleaning and exploring separate years' base data (2011-2023) and follows one on how to conduct Chi-Square Tests (and post hoc Bonferroni tests) on 2019 and 2023 data using the programming language R.

To mix it up a bit, I will use Python this time to prepare our data and build our MLR model below, though R can just as easily be used to the same ends.

If you're completely unfamiliar with linear regression techniques, I highly recommend THIS series on it by Khan Academy to get grounded in the basic idea.
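
In a nutshell, a multiple linear regression model estimates an outcome as a weighted sum of several predictors plus an error term:

y = b0 + b1*x1 + b2*x2 + ... + bp*xp + e

where each b is a coefficient estimated from the data and e captures whatever the predictors cannot explain.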

My research question: Can we use other variables in the 2023 BRFSS data to predict the number of days respondents reported experiencing poor mental health in the 30 days before they were surveyed?

Let's begin by looking at the 2023 BRFSS Codebook to assess the nature of our dependent variable, MENTHLTH:


Cumulatively, around 2% refused to answer the question or answered "Don't know/Not sure", and for purposes here we can safely exclude them. Only 3 of the 433,323 respondents represented in the Frequency column have missing values, so we'll ignore those as well, focusing on the roughly 98% for whom we have MENTHLTH data.

The next hurdle we must clear before running any calculations is defining which independent (predictor) variables we would like to consider in the analysis. The easy answer is "all of them," but it is not always so simple, as many questions in the BRFSS were asked only of subsets of the total surveyed population.

After excluding records with 77s, 99s, and blanks in our MENTHLTH column, let's assess data completeness for all variables to consider our available data.

# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load data (available at https://dataforpublichealth.com/post-details/wrangling-brfss-2011-2023)
data = pd.read_csv('data/2023_350.csv', low_memory=False)

# Exclude rows of dependent variable (MENTHLTH) with missing values, 99, and 77
data = data[data['MENTHLTH'] != '77']
data = data[data['MENTHLTH'] != '99']
data = data.dropna(subset=['MENTHLTH'])
Make sure to replace all empty strings with NaN:
# Replace empty strings, spaces, or other patterns with NaN
data.replace(r'^\s*$', np.nan, regex=True, inplace=True)

And finally...
# Count missing values per column
missing_summary = data.isnull().sum().sort_values(ascending=False)
missing_summary = missing_summary[missing_summary > 0]  # Only show columns with missing values
missing_summary = pd.DataFrame({'Missing Values': missing_summary,
                                'Percentage': (missing_summary / len(data)) * 100})
# Plot to visually inspect
missing_summary.plot(kind='bar', figsize=(15, 5), legend=False)
plt.title("Missing Data by Columns")
plt.ylabel("Count of Missing Values")
plt.xticks([])
plt.show()

The height of each bar indicates the number of missing rows, with the ones at the left missing 100% of all rows (i.e., HIDDEN data, according to the codebook). One major reason for the rest of the missing values is that several questions are asked only of segments of the surveyed population. For example, as covered in an earlier post, the variable ASTHMA3 shows whether a person self-reported ever having been told they had asthma, and only if they said yes were they asked whether they still had asthma. State-level variations in questionnaires account for some of these missing values as well.

But, whatever the reasons for these missing values, we cannot predict with data that we do not have. So, let's set a threshold cutoff and delimit our consideration to a complete segment of data, excluding those for whom data are missing. 

Here's how to programmatically set a threshold missing-value percentage, excluding any variables that do not pass muster. I'm going to exclude any variable that is missing more than 25% of the total records in our dataset.
## Set a threshold for allowable missing values and exclude columns
## with more than the threshold

# Set the threshold percentage
threshold = 25 / 100  # 25% as a fraction

# Calculate the percentage of missing values per column
missing_percentage = data.isnull().mean()

# Filter columns: Keep only those with less than or equal to 25% missing data
columns_to_keep = missing_percentage[missing_percentage <= threshold].index

# Create a new DataFrame with the filtered columns
data_filtered = data[columns_to_keep]

# Output
print(f"Original shape: {data.shape}")
print(f"Filtered shape (after excluding >25% missing): {data_filtered.shape}")


We also know that several of these 139 variables aren't going to be predictive, such as record weights, survey dates, sequence numbers (the dataset's primary key), and flags for cell phone or land line. A quick pass through the codebook helped me identify the obvious ones, and I removed them. 
# Drop non-informative columns
data_filtered = data_filtered.drop(['FMONTH', 'IDATE', 'IMONTH', 'IDAY', 'IYEAR', 'DISPCODE',
                                    'SEQNO', 'x_PSU', 'SAFETIME', 'CTELNUM1', 'CELLFON5', 'CADULT1',
                                    'CELLSEX2', 'CSTATE1', 'LANDLINE', 'x_STSTR', 'x_STRWT',
                                    'x_WT2RAKE', 'x_LLCPWT2', 'x_LLCPWT'], axis=1)

# Output
print(f"Shape after dropping non-informative columns: {data_filtered.shape}")

We went from 350 variables in the initial .csv to 110, including our outcome variable, MENTHLTH. If I missed any non-informative columns, I'll let the model whittle them out on account of their intrinsic lack of predictive power.

Moving on, each column's missing values are spread across different records. So, if we exclude all rows missing any of these 110 variables, we get...
# Drop all rows with missing values
data_filtered = data_filtered.dropna()

print(f"Shape after dropping rows with missing values: {data_filtered.shape}")
...189,278 remaining records, a decrease of 235,940 (~55%).

We now need to handle numerical values in our remaining continuous variables that indicate data are unavailable, such as '77' for "Don't know/Not sure", '99' for "Refused", and other codes stipulated in the 2023 BRFSS Codebook. Also, in some variables the number '88' indicates 0, so it needs to be recoded for correlations involving these continuous variables to be computed correctly. If we do not handle these records, we will inadvertently skew our analysis.

## Recode 88 ('None') to 0 in MENTHLTH, PHYSHLTH, CHILDREN, & SMOKE100
data_filtered['MENTHLTH'] = data_filtered['MENTHLTH'].replace(88, 0)
data_filtered['PHYSHLTH'] = data_filtered['PHYSHLTH'].replace(88, 0)
data_filtered['CHILDREN'] = data_filtered['CHILDREN'].replace(88, 0)
data_filtered['SMOKE100'] = data_filtered['SMOKE100'].replace(88, 0)

# Remove records where PHYSHLTH, HHADULT, CHILDREN, or SMOKE100 is 77 or 99
data_filtered = data_filtered[~data_filtered['PHYSHLTH'].isin([77, 99])]
data_filtered = data_filtered[~data_filtered['HHADULT'].isin([77, 99])]
data_filtered = data_filtered[~data_filtered['CHILDREN'].isin([77, 99])]
data_filtered = data_filtered[~data_filtered['SMOKE100'].isin([77, 99])]

# Remove records where MAXVO21_, FC601_, or x_DRNKWK2 is 99900
data_filtered = data_filtered[~data_filtered['MAXVO21_'].isin([99900])]
data_filtered = data_filtered[~data_filtered['FC601_'].isin([99900])]
data_filtered = data_filtered[~data_filtered['x_DRNKWK2'].isin([99900])]

# Remove records where STRFREQ_ is '99000'
data_filtered = data_filtered[~data_filtered['STRFREQ_'].isin([99000])]

# Remove records where DROCDY4_ is 900
data_filtered = data_filtered[~data_filtered['DROCDY4_'].isin([900])]

# Print the shape of the data after removing records
print(f"Shape after removing records: {data_filtered.shape}")

This excludes an additional 7,217 records for which valid numerical data were not available, yielding a final 182,061 records for analysis.

Before building our multiple regression model, we will delimit our predictor variables and finish preparing the data.

  1. Leverage iterative correlation testing (and Cramér’s V for categorical variables) to reduce the number of variables prior to one-hot encoding. Not only does this protect against multicollinearity, but one-hot encoding every current column would leave us with around 1,800 columns, which is impractical and slow, so this approach is better.



    # Perform iterative correlation testing to determine which columns to keep

    # Step 1: Compute correlation matrix for continuous predictors
    continuous_vars = ['MAXVO21_', 'FC601_', 'x_BMI5', 'DROCDY4_', 'x_DRNKWK2', 'PHYSHLTH']  # Known continuous predictors
    categorical_vars = [col for col in data_filtered.columns
                        if col not in continuous_vars + ['MENTHLTH']]  # For Cramer's V (next); exclude the outcome

    corr_matrix = data_filtered[continuous_vars].corr().abs()  # Absolute correlations

    # Step 2: Remove duplicate correlations (upper triangle)
    upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Step 3: Identify predictors to drop based on threshold
    threshold = 0.7  # Set correlation threshold to avoid high redundancy
    to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > threshold)]

    print(f"Columns to drop due to high correlation: {to_drop}")

    Iterative correlation testing on our continuous variables identified 1 to drop (FC601_), and we now have 109 columns left, including our dependent variable (MENTHLTH).
    # Print number of columns before removal
    print(f"Columns before removal: {data_filtered.shape[1]}")

    # Remove the columns identified above
    data_filtered = data_filtered.drop(to_drop, axis=1)

    # Remove dropped variables from lists
    continuous_vars = [col for col in continuous_vars if col not in to_drop]
    categorical_vars = [col for col in categorical_vars if col not in to_drop]

    # Print number of columns after removal
    print(f"Columns after removal: {data_filtered.shape[1]}")



    Noticing that the computed BRFSS variable x_MENT14D is a straight recoding of our dependent variable (MENTHLTH), I removed that as well, because keeping a recoded copy of the outcome among the predictors would only leak the answer into the model, and we can bin MENTHLTH ourselves later.

    Binning is a way to group numbers into ranges or "bins" to make them easier to analyze. Instead of looking at every single number, we group them into categories like 0-10, 11-20, and so on. For example, if you have test scores, binning can group them into ranges like "low," "medium," and "high," helping you see patterns more clearly.
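
    As a quick toy illustration (hypothetical values, not BRFSS data, using the pandas alias pd imported earlier), pd.cut does this kind of grouping directly:

    # Hypothetical example: bin day counts (0-30) into three labeled ranges
    days = pd.Series([0, 4, 12, 18, 25, 30])
    binned = pd.cut(days, bins=[-1, 10, 20, 30], labels=['low', 'medium', 'high'])
    print(binned.tolist())  # ['low', 'low', 'medium', 'medium', 'high', 'high']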

    # Remove x_MENT14D from data_filtered
    data_filtered = data_filtered.drop('x_MENT14D', axis=1)

    # Remove x_MENT14D from categorical_vars
    categorical_vars = [col for col in categorical_vars if col != 'x_MENT14D']

    Down to 109 columns!

    Assessing associations between categorical predictor variables is rather straightforward using Cramér's V. I used a threshold of 0.7 and a simple function to iterate through all remaining categorical variables, flagging any with a V higher than 0.7 for exclusion on account of being sufficiently redundant.
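
    For reference, Cramér's V for an r × k contingency table is

        V = sqrt( (χ² / n) / min(k − 1, r − 1) )

    where χ² is the chi-square statistic for the table and n is the total number of observations. The function below computes this, applying a bias correction to χ²/n, r, and k before taking the square root.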

    from scipy.stats import chi2_contingency

    # Function to compute Cramér's V
    def cramers_v(x, y):
        """Calculate Cramér's V for two categorical variables."""
        confusion_matrix = pd.crosstab(x, y)  # Create a contingency table
        chi2, _, _, _ = chi2_contingency(confusion_matrix)  # Perform Chi-Square test
        n = confusion_matrix.sum().sum()  # Total number of observations
        phi2 = chi2 / n  # Compute phi-squared
        r, k = confusion_matrix.shape  # Number of rows and columns in the table
        phi2_corr = max(0, phi2 - ((k-1)*(r-1)) / (n-1))  # Bias correction
        r_corr = max(1, r - ((r-1)**2) / (n-1))  # Ensure corrected rows > 0
        k_corr = max(1, k - ((k-1)**2) / (n-1))  # Ensure corrected columns > 0
        denominator = min((k_corr - 1), (r_corr - 1))
        if denominator <= 0:  # Avoid division by zero
            return np.nan
        return np.sqrt(phi2_corr / denominator)  # Return Cramér's V

    # Create a matrix to store Cramér's V values
    cramers_matrix = pd.DataFrame(index=categorical_vars, columns=categorical_vars)

    # Compute Cramér's V for all pairs of categorical variables
    for col1 in categorical_vars:
        for col2 in categorical_vars:
            if col1 == col2:
                cramers_matrix.loc[col1, col2] = 0.0  # Self-correlation set to 0 for comparison
            else:
                cramers_matrix.loc[col1, col2] = cramers_v(data_filtered[col1], data_filtered[col2])

    # Convert matrix values to floats for further processing
    cramers_matrix = cramers_matrix.astype(float)

    # Print Cramér's V matrix
    print("Cramér's V Matrix:")
    print(cramers_matrix)

    # Identify variables with high association (V > 0.7)
    threshold = 0.7  # Set threshold for high association
    to_drop_cat = []

    # Iterate over columns and ignore self-correlation (diagonal values)
    for col in cramers_matrix.columns:
        high_association = cramers_matrix[col][cramers_matrix.index != col] > threshold  # Exclude diagonal
        if any(high_association):
            to_drop_cat.append(col)

    print(f"Categorical variables to drop (V > {threshold}): {to_drop_cat}")

    This identified an additional 64 categorical variables sufficiently associated with one another (V > 0.7) to warrant removal. This is quite a finding in itself, and a very good thing to know about BRFSS data in general. Let's drop 'em!

    # Remove the categorical variables identified above with Cramer's V
    data_filtered = data_filtered.drop(to_drop_cat, axis=1)

    # print the shape of the data after removing columns
    print(f"Shape after removing columns: {data_filtered.shape}")

    We now have 182,061 records spanning 44 predictor variables and our outcome variable, MENTHLTH.

    Useful Tip: There is one last technique you can use to narrow down predictors prior to one-hot encoding large numbers of categorical variables: running a series of quick bivariate analyses. Bivariate analysis examines the relationship between two variables to understand how one affects or is associated with the other. This is relevant far beyond multiple linear regression and is a useful "trick."

    For continuous variables, we will look at the bivariate correlation of each continuous predictor with our outcome variable. For categorical variables, we'll use ANOVA p-values to determine whether each categorical predictor, on its own, is statistically significantly associated with differences in our outcome variable.

    For our continuous predictors:
    # Make all variables numeric
    data_filtered = data_filtered.apply(pd.to_numeric, errors='ignore')
    # Quick Bivariate Analyses to further narrow down predictors
    # Target column in data_filtered
    target_col = 'MENTHLTH'

    # Compute correlations for continuous variables
    correlations = {}
    for col in continuous_vars:
        try:
            # Compute correlation between the continuous variable and the target
            correlations[col] = data_filtered[target_col].corr(data_filtered[col])
        except Exception as e:
            print(f"Error with variable {col}: {e}")

    # Filter predictors based on correlation threshold
    threshold = 0.2
    selected_continuous = [col for col, corr in correlations.items() if abs(corr) > threshold]

    print("Selected Continuous Variables:", selected_continuous)



    And for our categorical predictors:

    from scipy.stats import f_oneway

    # Target column in data_filtered
    target_col = 'MENTHLTH'

    # Perform ANOVA for each categorical variable
    anova_results = {}
    for col in categorical_vars:  # Ensure categorical_vars contains original column names
        try:
            # Group the target variable (MENTHLTH) by each level of the categorical variable
            groups = [
                data_filtered.loc[data_filtered[col] == level, target_col]  # Subset target by category levels
                for level in data_filtered[col].dropna().unique()
            ]
            if len(groups) > 1:  # Ensure at least two groups exist
                anova_results[col] = f_oneway(*groups).pvalue
        except KeyError:
            print(f"Variable {col} is not in data_filtered and was skipped.")
        except Exception as e:
            print(f"Error with variable {col}: {e}")

    # Filter categorical predictors based on p-value threshold
    pval_threshold = 0.05
    selected_categorical = [col for col, pval in anova_results.items() if pval < pval_threshold]

    print("Selected Categorical Variables:", selected_categorical)


    One of our continuous predictors (PHYSHLTH) met our correlation threshold of .2, and our ANOVAs were statistically significant for 34 of our categorical variables. Down to 35 predictors for our 1 dependent variable (total = 36).

    # Keep only the selected predictors (plus the outcome) and update the variable lists
    data_filtered = data_filtered[selected_continuous + selected_categorical + [target_col]]
    continuous_vars, categorical_vars = selected_continuous, selected_categorical
    print(f"Shape of the reduced dataset: {data_filtered.shape}")

  2. One-Hot Encode categorical variables. One-Hot Encoding is a method used to convert categorical variables into a numerical format so that they can be used in machine learning or statistical models. This process ensures that the relatively few continuous predictors are retained while preparing all categorical variables for regression analysis.

    One-hot encoding is a way to turn categories into numbers so computers can understand them. For each category, it creates a new column where a "1" means that category is present and "0" means it isn’t. For example, if you have colors like red, blue, and green, one-hot encoding makes three columns: one for red, one for blue, and one for green, with only one column marked as "1" for each row.
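
    Here's a toy illustration (hypothetical colors, not BRFSS data, again using the pandas alias pd imported earlier):

    # Hypothetical example: one-hot encode a single color column
    colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
    print(pd.get_dummies(colors, columns=['color']))
    # Produces one indicator column per category: color_blue, color_green, color_red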

    ## One-Hot Encoding

    # Step 1: Break data into target and predictors
    y = data_filtered['MENTHLTH']
    X = data_filtered.drop(columns=['MENTHLTH'])

    # Step 2: One-Hot Encode Remaining Categorical Variables
    X_encoded = pd.get_dummies(X, columns=categorical_vars, drop_first=True)

    # Step 3: Combine Continuous and Encoded Variables
    # Ensure all continuous variables remain intact
    X_final = X_encoded[continuous_vars].join(X_encoded.drop(columns=continuous_vars, errors='ignore'))

    # Step 4: Verify Final Data
    print("Final Predictor Variables:")
    print(X_final.head())
    print("Shape:", X_final.shape)

    This turns our 35 predictors into 165 columns. Nice. 
    Now, let's build our model.

  3. Fit a linear regression model using statsmodels or scikit-learn. There are a few ways to do this, but I'm going to use a machine learning approach that splits our data randomly into two sections: a training set with a random 80% of the data and a separate test set with the remaining 20%. The accuracy of our MLR model will be assessed on how well it predicts MENTHLTH in the test data, because the model will have been trained without encountering that segment at all.

    The following code took almost 41 minutes to run on my computer!  😂 Iterative approaches are a cornerstone of machine learning techniques. This one identified the 67 features, out of 168, that contributed to the model's predictive power.

    import statsmodels.api as sm
    from sklearn.model_selection import train_test_split

    # Step 1: Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)

    # Step 2: Define Stepwise Selection Function
    def stepwise_selection(X, y, threshold_in=0.05, threshold_out=0.05, verbose=True):
        """Perform stepwise selection to find the best predictors for the MLR model."""
        initial_list = []
        included = list(initial_list)
        while True:
            changed = False
           
            # Forward step: Add variable with the lowest p-value
            excluded = list(set(X.columns) - set(included))
            new_pval = pd.Series(index=excluded, dtype=float)
            for new_column in excluded:
                try:
                    model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
                    new_pval[new_column] = model.pvalues[new_column]
                except Exception as e:
                    print(f"Error with variable {new_column}: {e}")
                    new_pval[new_column] = np.nan
            best_pval = new_pval.min()
            if best_pval < threshold_in:
                best_feature = new_pval.idxmin()
                included.append(best_feature)
                changed = True
                if verbose:
                    print(f"Adding {best_feature} with p-value {best_pval}")

            # Backward step: Remove variable with the highest p-value
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
            pvalues = model.pvalues.iloc[1:]  # Exclude intercept
            worst_pval = pvalues.max()
            if worst_pval > threshold_out:
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)
                changed = True
                if verbose:
                    print(f"Dropping {worst_feature} with p-value {worst_pval}")

            if not changed:
                break

        return included


    # Step 3: Run Stepwise Selection
    selected_features = stepwise_selection(X_train, y_train)


    This code implements a stepwise selection process to identify the best predictors for a Multiple Linear Regression (MLR) model. One by one, the code iterates through all predictors not currently in the model (the "excluded" predictors) and evaluates their statistical significance if added. Then, it iteratively adds the most significant variable (based on the p-value threshold) or removes the least significant variable already included, ensuring the model retains only the strongest predictors. This process continues until no further changes improve the model.

    Finally, we use the selected features to build a regression model and evaluate its performance on our test data segment. For contrast, this code, which actually builds and tests the model, took under a second to complete.
    from sklearn.metrics import accuracy_score # For classification accuracy

    # Step 4: Build Final Model with Selected Features
    X_train_selected = X_train[selected_features]
    X_test_selected = X_test[selected_features]

    final_model = sm.OLS(y_train, sm.add_constant(X_train_selected)).fit()
    print("\nFinal Model Summary:")
    print(final_model.summary())

    # Step 5: Test the Model on the Test Set
    y_pred = final_model.predict(sm.add_constant(X_test_selected))
    from sklearn.metrics import mean_squared_error, r2_score
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred.round())
    print(f"Accuracy of the model on test set: {accuracy * 100:.2f}%")
    print(f"\nMean Squared Error: {mse}\nR-squared: {r2}")



    The outcome variable, MENTHLTH, asked respondents: "Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?"

    Answers could be 0 or any whole number from 1 to 30. The model wasn't trained on the test data, but using other variables in the survey, our MLR model correctly guessed the exact number of poor mental health days respondents reported a bit over 1/8 of the time (~13.89% accuracy).

    Binning continuous variables like MENTHLTH into groups often improves model performance, and it's relatively easy to do. According to the codebook, over half of respondents reported poor mental health for 0 days out of the last 30, so I'll first treat the outcome as binary: 0 days ('0') versus 1+ days ('1') of self-reported poor mental health, and attempt to predict that. Then I'll bin our dependent variable into 3, 4, 5, and 6 groups, sequentially, and see whether predictive performance improves. The function below performs all of these binnings and outputs the performance metrics as a single table to make comparison easier.

    # Define a function to bin y, train an MLR model, and compute metrics
    def run_mlr_with_bins(X, y, num_bins_list):
        results = []  # To store model metrics

        for num_bins in num_bins_list:
            # Bin the target variable
            if num_bins == 2:  # Special case for binary (0 vs. 1+)
                y_binned = (y > 0).astype(int)
            else:
                y_binned = pd.cut(y, bins=num_bins, labels=range(num_bins), include_lowest=True).astype(int)

            # Split the data into training and testing sets
            X_train, X_test, y_train, y_test = train_test_split(X, y_binned, test_size=0.2, random_state=42)

            # Train a multiple linear regression model
            X_train = sm.add_constant(X_train)  # Add constant for intercept
            X_test = sm.add_constant(X_test)    # Add constant for intercept

            model = sm.OLS(y_train, X_train).fit()

            # Test the model
            y_pred = model.predict(X_test)

            # Calculate performance metrics
            mse = mean_squared_error(y_test, y_pred)
            r2 = r2_score(y_test, y_pred)
            # Classification-style accuracy: round each prediction to the nearest bin
            accuracy = accuracy_score(y_test, y_pred.round().clip(0, num_bins - 1).astype(int))

            # Store the results
            results.append({
                'Bins': num_bins,
                'MSE': mse,
                'R-squared': r2,
                'Accuracy': accuracy
            })

        # Convert results to a DataFrame for a clean table
        results_df = pd.DataFrame(results)
        return results_df

    # Run the function on your dataset
    num_bins_list = [2, 3, 4, 5, 6]  # Define the binning segments
    X = X_final  # Predictor variables
    results_table = run_mlr_with_bins(X, y, num_bins_list)

    # Display the results
    print("Model Metrics for Different Binnings (MLR):")
    print(results_table)



    Much better! Here we are, using one-hot encoding for mostly categorical variables in an MLR model, and we have predicted the self-reported number of days with poor mental health:

    * The exact number (0-30) of poor mental health days with ~14% accuracy.

    * Whether respondents reported 0 or 1+ days with ~73% accuracy.

    * Whether the respondent was in the bottom, middle, or top third of reported poor mental health days with ~85% accuracy.

    * The correct quartile at ~75% accuracy (3 in 4), the quintile at ~67% (2 in 3), and the sextile at ~66% (2 in 3).


    We've tackled a lot in this project. After preparing a subset of 2023 BRFSS data for analysis and using several techniques to identify predictor variables for our MLR model, we employed one-hot encoding so that we could apply this statistical modeling method to our categorical variables. Our resulting model predicts the exact number of days each BRFSS respondent reported poor mental health in the last 30 days with almost 14% accuracy. We then binned our outcome variable and found the highest predictive accuracy with three bins, covering the bottom (0-9), middle (10-19), and top (20-30) values in our range for MENTHLTH, achieving 85.3% predictive accuracy.

    The above code is reusable and is directly accessible in this Jupyter Notebook on GitHub.

    Jimmy Fisher


