Sunday, December 22, 2024

Logistic Regression


Logistic regression is a crucial technique in data science and statistics, primarily used for binary classification tasks. This method predicts the probability of a binary outcome based on one or more predictor variables, making it a widely used tool in various industries such as finance and healthcare. The essence of logistic regression lies in its ability to model the probability of class membership using the logistic function, ensuring outputs remain between 0 and 1. 

The Logistic Function


The logistic function at the core of this technique is expressed as:

P(Y = 1) = 1 / (1 + e^(-(β0 + β1X1 + β2X2 + … + βnXn)))

Let's look at this equation's parts...

  1. Core Structure:

     The linear predictor z = β0 + β1X1 + β2X2 + … + βnXn combines the intercept with the weighted predictors, and the fraction 1 / (1 + e^(-z)) maps z to a probability.
  2. Exponential Component:

     e^(-z) shrinks toward 0 as z grows large and grows without bound as z becomes very negative, which is what keeps the output strictly between 0 and 1.
  3. Coefficients and Variables:

     β0: Intercept

     β1, β2, …, βn: Coefficients for each variable

     X1, X2, …, Xn: Independent variables

In practice, this equation gives the probability of a binary outcome from the combined, weighted influence of the independent variables; each coefficient describes one predictor's effect while the other predictors are held constant.
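To make the arithmetic concrete, here is a minimal Python sketch of the equation. The coefficient values and the single observation below are made up purely for illustration; they do not come from any fitted model.

import math

# Hypothetical coefficients, chosen only to illustrate the formula
beta_0 = -8.0    # β0: intercept
beta_age = 0.05  # β1: coefficient for Age
beta_bmi = 0.10  # β2: coefficient for BMI
beta_bp = 0.02   # β3: coefficient for BloodPressure

# One hypothetical observation
age, bmi, bp = 45, 31.0, 85.0

# Linear predictor: z = β0 + β1X1 + β2X2 + β3X3
z = beta_0 + beta_age * age + beta_bmi * bmi + beta_bp * bp

# Logistic function maps z to a probability between 0 and 1
p = 1 / (1 + math.exp(-z))
print(f"z = {z:.2f}, P(Y = 1) = {p:.3f}")  # z = -0.95, P(Y = 1) ≈ 0.279

Changing any one input (say, raising BMI) changes z linearly but changes the probability nonlinearly, since the logistic curve flattens near 0 and 1.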




Example in Python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('diabetes.csv')

# Define feature columns and the target variable
X = data[['Age', 'BMI', 'BloodPressure']]
y = data['Diabetes']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')


This example demonstrates how scikit-learn simplifies building and evaluating a logistic regression model: load the data, split it, fit the model, and score it, all in a few lines.




Example in R

# Load necessary libraries
library(readr)
library(caret) # For train-test splitting and accuracy calculation
library(dplyr) # For the select function and %>% pipe used below

# Load the dataset
data <- read_csv('diabetes.csv')

# Define feature columns and the target variable
X <- data %>% select(Age, BMI, BloodPressure)
y <- data$Diabetes

# Combine X and y into a single data frame for caret
dataset <- cbind(X, Diabetes = y)

# Split the data into training and testing sets
set.seed(42)
trainIndex <- createDataPartition(dataset$Diabetes, p = 0.7, list = FALSE)
train_data <- dataset[trainIndex, ]
test_data <- dataset[-trainIndex, ]

# Train the logistic regression model
model <- glm(Diabetes ~ Age + BMI + BloodPressure, data = train_data, family = binomial)

# Predict and evaluate the model
y_pred <- predict(model, newdata = test_data, type = "response")
y_pred_class <- ifelse(y_pred > 0.5, 1, 0) # Convert probabilities to binary outcomes
accuracy <- mean(y_pred_class == test_data$Diabetes)
cat(sprintf("Accuracy: %.2f%%", accuracy * 100))


This code is a rough equivalent of the Python code above. One useful thing to know is that Large Language Models (LLMs) such as ChatGPT can help you write code and even explain code you find online. For example, copy-pasting this R script into OpenAI's 4o model and asking for an explanation yields the following:

This code implements a logistic regression model in R to predict whether a person has diabetes based on age, BMI, and blood pressure. Here’s a step-by-step explanation:

  1. Load Libraries:

    • readr: Used for reading CSV files.
    • caret: Provides functions for splitting data into training and testing sets, as well as calculating model accuracy.
    • dplyr: Used for data manipulation (the select function and %>% pipe come from it).
  2. Load the Dataset:

    • The dataset is loaded from a file called diabetes.csv using read_csv and assigned to the variable data.
  3. Define Features and Target:

    • The features (predictor variables) are selected from the columns Age, BMI, and BloodPressure.
    • The target variable (y), which represents whether a person has diabetes, is extracted from the column Diabetes.
  4. Combine Features and Target:

    • cbind combines the feature matrix X and target variable y into a single data frame called dataset for easier handling in the caret functions.
  5. Train-Test Split:

    • createDataPartition splits dataset into training (70%) and testing (30%) sets.
    • set.seed(42) ensures reproducibility so that the same data split is achieved every time the code runs.
  6. Train the Logistic Regression Model:

    • glm trains a logistic regression model using Age, BMI, and BloodPressure as predictors for Diabetes. The family = binomial parameter specifies logistic regression.
  7. Prediction and Evaluation:

    • predict generates predictions on the test data (test_data). Setting type = "response" returns probabilities.
    • ifelse converts these probabilities into binary predictions (1 for diabetes, 0 for no diabetes), based on a threshold of 0.5.
    • The accuracy of the model is calculated by comparing predictions (y_pred_class) with actual outcomes in the test set.
    • Finally, cat displays the model’s accuracy as a percentage.

In summary, the code reads a dataset, splits it, trains a logistic regression model on a subset of the data, makes predictions, and evaluates the model’s accuracy.


Pretty good output from my simple prompt of "Explain this code!"



Logistic regression is extraordinarily useful for binary classification, where the outcome variable has two possible values (e.g., yes/no, 0/1). It is valued for its simplicity and interpretability, as it provides coefficients that describe the contribution of each predictor variable to the log odds of the outcome. Data scientists can leverage this method to create predictive models that quantify how each variable influences the probability of a specific outcome. Logistic regression builds on concepts from linear regression but is adapted for classification by modeling the log odds of an outcome’s probability rather than the outcome itself.
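As a quick illustration of that interpretability, here is a minimal Python sketch that reads the coefficients off the scikit-learn model fitted in the earlier example and exponentiates them into odds ratios; it assumes model and the three feature columns from that example are still in scope.

import numpy as np

# Assumes model from the scikit-learn example above is in scope
feature_names = ['Age', 'BMI', 'BloodPressure']

for name, coef in zip(feature_names, model.coef_[0]):
    # Each coefficient is the change in the log odds per one-unit
    # increase in that predictor; exp(coef) is the odds ratio
    print(f"{name}: log-odds = {coef:.3f}, odds ratio = {np.exp(coef):.3f}")

# Intercept: the log odds of the outcome when all predictors are zero
print(f"Intercept: {model.intercept_[0]:.3f}")

An odds ratio above 1 means a one-unit increase in that predictor raises the odds of the outcome; below 1, it lowers them.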

