Logistic regression

In this tutorial, we will demonstrate how to perform logistic regression on heart disease data using crandas. We will cover the following steps:

Import necessary libraries
Load the dataset
Preprocess the dataset
Train the logistic regression model
Make predictions
Compute metrics

First, let’s import the necessary libraries and set our VDL session variables.

import crandas as cd
import pandas as pd
from crandas.base import session
from timeit import timeit
from crandas.crlearn.logistic_regression import LogisticRegression
from crandas.crlearn.model_selection import train_test_split
from crandas.crlearn.utils import min_max_normalize
from crandas.crlearn.metrics import classification_accuracy, tjur_r2, mcfadden_r2, precision_recall
from pathlib import Path

force_reupload = False

session.base_path = Path('../../../vdl-instance/vdl/secrets')
session.endpoint = 'https://localhost:9820/api/v1'


dataset_size = 100  # number of rows to upload; the maximum is 10000 (which will take 20 minutes)

Next, we’ll load the heart_outcome dataset with pandas and upload it to the VDL using crandas.upload_pandas_dataframe().

df = pd.read_csv('../data/logreg_data/heart_outcome.csv')
df.head()
heart_outcome = cd.upload_pandas_dataframe(df.head(dataset_size), name='heart_outcome')

Now, let’s preprocess the heart_2020_cleaned dataset by normalizing numeric values, creating dummy variables for categorical data, and mapping Yes/No values to 1/0. Then, we’ll upload the preprocessed dataset to the VDL.

Note

The following steps, which upload the heart_data table, are carried out by the second party.

#create normalization function
def normalize(col):
    return (col - col.min()) / (col.max() - col.min())

#load the CSV into Pandas DF
df = pd.read_csv('../data/logreg_data/heart_2020_cleaned.csv')

#normalize numeric values
fields = ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
for field in fields:
    df[f"{field}_normalized"] = normalize(df[field])

#create dummy variables for categoric data
df = pd.get_dummies(df, columns=['Sex', 'Diabetic', 'GenHealth', 'AgeCategory'], dtype='int64')
df = df.rename(columns={'Diabetic_No, borderline diabetes':'Diabetic_No_Borderline', 'Diabetic_Yes (during pregnancy)':'Diabetic_Yes_Pregnancy'})

#convert Yes/No to 1/0
fields = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer']
for field in fields:
    df[field] = df[field].map({'Yes':1 ,'No':0})

heart_data = cd.upload_pandas_dataframe(df.head(dataset_size), name='heart_data')

We now have two CDataFrames with which to train our logistic regression: heart_outcome and heart_data.
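The preprocessing above combines three small transformations: min-max scaling, one-hot (dummy) encoding, and a Yes/No to 1/0 mapping. The sketch below reproduces each of them in plain Python on made-up values, so the effect is easy to inspect. The helper names `min_max_normalize_values` and `one_hot` are hypothetical and are not part of crandas or pandas. The guard for constant columns is an addition: with the `normalize` function above, a column whose minimum equals its maximum would produce NaN values in pandas.

```python
def min_max_normalize_values(values):
    """Rescale a list of numbers to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has a zero denominator; map it to all zeros instead.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return one 0/1 indicator list per distinct category (hypothetical helper)."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


sleep_time = [5.0, 7.0, 8.0, 12.0]  # made-up values
print(min_max_normalize_values(sleep_time))

print(one_hot(["Female", "Male", "Female"]))
# {'Female': [1, 0, 1], 'Male': [0, 1, 0]}

yes_no = {"Yes": 1, "No": 0}
print([yes_no[v] for v in ["Yes", "No", "No"]])
# [1, 0, 0]
```

The smallest value always maps to 0 and the largest to 1, which keeps features with different units on a comparable scale.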

1. Train the model

#Merge data
merged = cd.merge(heart_data, heart_outcome, on='ID')

#Define training data
X = merged[['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'PhysicalActivity', 'SleepTime', 'Asthma', 'KidneyDisease', 'SkinCancer', 'BMI_normalized', 'PhysicalHealth_normalized', 'MentalHealth_normalized', 'SleepTime_normalized', 'Sex_Female', 'Sex_Male', 'Diabetic_No', 'Diabetic_No_Borderline', 'Diabetic_Yes', 'Diabetic_Yes_Pregnancy', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good']]
y = merged[['HeartDisease']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#Train logistic regression model
BinomialModel = LogisticRegression(max_iter=10, warm_start=True)
crlearn_model = BinomialModel.fit(X_train, y_train)
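To make the effect of test_size=0.3 and random_state=42 concrete, here is a minimal plain-Python sketch of a seeded shuffle-and-split. It is illustrative only: `split_indices` is a hypothetical helper, and the way crandas shuffles and rounds inside the VDL is not shown in this tutorial and may differ.

```python
import random


def split_indices(n_rows, test_size=0.3, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_rows))
    # A dedicated Random instance makes the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 70 30
```

With a fixed seed the same rows end up in the test set on every run, which makes results comparable between experiments.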

>>> print("Trained logistic regression model on:", len(X_train), "rows")
    Trained logistic regression model on: 70 rows

crlearn_model.get_beta(mode='open')

	intercept	Smoking	AlcoholDrinking	Stroke	DiffWalking	PhysicalActivity	SleepTime	Asthma	KidneyDisease	SkinCancer	LungCOPD	Obesity	HighBloodPressure	Sex_Male	Diabetic_No	Diabetic_No_Borderline	Diabetic_Yes	Diabetic_Yes_Pregnancy	GenHealth_Excellent	GenHealth_Fair	GenHealth_Good	GenHealth_Poor	GenHealth_Very good
0	-1.002275	-0.171445	0.0	0.155548	0.306121	-0.870124	0.071704	-0.374236	0.03623	-0.127964	0.18351	0.451123	-0.073792	-0.371922	-0.945653	-0.208014	0.24253	0.0	-0.227025	-0.299601	-0.345784	1.018995	-1.057688

1 rows × 25 columns
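A fitted logistic regression turns a row of features into a probability by applying the logistic (sigmoid) function to the intercept plus the weighted sum of the features. The sketch below shows this in plain Python, using the intercept and two coefficients copied from the output above purely for illustration. `predict_probability` and the two example rows are hypothetical, all other features are treated as 0, and this is not a description of how crandas evaluates the model internally.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, features):
    """Probability of the positive class for one row (hypothetical helper)."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return sigmoid(z)


# Values taken from the coefficient table above; every other feature is
# assumed to be 0 to keep the example small.
intercept = -1.002275
coefficients = {"Stroke": 0.155548, "PhysicalActivity": -0.870124}

active = predict_probability(intercept, coefficients, {"Stroke": 0, "PhysicalActivity": 1})
inactive = predict_probability(intercept, coefficients, {"Stroke": 0, "PhysicalActivity": 0})
print(active < inactive)  # True: a negative coefficient lowers the probability
```

A positive coefficient pushes the predicted probability up when the feature increases, and a negative coefficient pushes it down.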

2. Make predictions

Now that our model is trained, we can use it to make predictions on the test dataset.

class_probabilities_test = crlearn_model.predict_proba(X_test)
class_predictions_test = crlearn_model.predict(X_test)
class_probabilities_train = crlearn_model.predict_proba(X_train)
class_probabilities_test

Name: 13366A5A46D25213A4FEDBB179883B28DD3F9A044E66B12E346BD60389F46D33
Size: 30 rows x 2 columns
CIndex([Col("predictions (probabilities) 0", "fp", 1), Col("predictions (probabilities) 1", "fp", 1)])
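The probability table has one column per class, while predict returns class labels. A common way to derive labels from probabilities is a cut-off of 0.5 on the positive-class probability. Whether crandas applies exactly this rule is an assumption here, and `to_class_predictions` is a hypothetical helper operating on made-up values.

```python
def to_class_predictions(positive_probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels (assumed 0.5 cut-off)."""
    return [1 if p >= threshold else 0 for p in positive_probabilities]


probs = [0.08, 0.31, 0.49, 0.62]  # made-up probabilities for class 1
print(to_class_predictions(probs))  # [0, 0, 0, 1]
```

Lowering the threshold produces more positive predictions, which usually raises recall at the cost of precision.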

3. Compute metrics

Finally, let’s compute several metrics to evaluate our model’s performance.

We’ll start with the McFadden R^2.

>>> mcfadden_r2(BinomialModel, X_test, y_test).open()
    0.10343170166015625

Note

McFadden R^2: This is a goodness-of-fit measure for logistic regression models. It compares the log-likelihood of the fitted model to the log-likelihood of a null model that only includes the intercept. Values typically lie between 0 and 1, and a higher McFadden R^2 indicates a better fit. It is particularly useful in situations where traditional R^2 is not applicable, such as with logistic regression.
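As a rough illustration of the definition, the following plain-Python sketch computes 1 minus the ratio of the model log-likelihood to the null log-likelihood on made-up data. The names `log_likelihood` and `mcfadden_r2_plain` are hypothetical, the null model here simply predicts the observed share of positives for every row, and the crandas implementation may differ in detail.

```python
import math


def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood; assumes every probability is strictly in (0, 1)."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))


def mcfadden_r2_plain(y_true, p_pred):
    """1 - ln L(model) / ln L(null), with an intercept-only null model."""
    base_rate = sum(y_true) / len(y_true)  # what an intercept-only model predicts
    ll_model = log_likelihood(y_true, p_pred)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1.0 - ll_model / ll_null


y = [0, 0, 0, 1, 1]            # made-up labels
p = [0.1, 0.2, 0.3, 0.7, 0.8]  # made-up predicted probabilities for class 1
print(mcfadden_r2_plain(y, p))
```

A model that predicts the base rate for every row scores exactly 0, and the value approaches 1 as the predicted probabilities approach the true labels.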

>>> classification_accuracy(y_test, class_predictions_test).open()
    0.966679573059082

Note

Classification Accuracy: This is the proportion of correct predictions among the total number of predictions made. It is a common measure of model performance in classification tasks. A higher classification accuracy indicates a better-performing model.
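Accuracy should be read with class balance in mind. The value reported above is close to 29/30, and one scenario that would produce such a value is a test set with a single positive case and a model that predicts the negative class for every row. This scenario is an assumption, since the test labels are not shown here. The sketch below uses a hypothetical `accuracy` helper on made-up labels to show the effect.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [0] * 29 + [1]      # made-up labels: one positive among 30 rows
always_negative = [0] * 30   # a model that never predicts the positive class
print(accuracy(y_true, always_negative))  # 29/30, about 0.967
```

A high accuracy can therefore coexist with a model that never detects the positive class, which is why the remaining metrics are worth checking.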

>>> tjur_r2(y_test, class_probabilities_test).open()
    0.0916992188

Note

Tjur R^2: This is another goodness-of-fit measure for logistic regression models. It is the difference between the average predicted probability of a positive outcome among the actual positive cases and that among the actual negative cases. A higher Tjur R^2 indicates a better fit, with values typically ranging from 0 to 1. Like McFadden R^2, it is useful in situations where traditional R^2 is not applicable.
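The definition translates almost directly into code. The sketch below is a plain-Python illustration on made-up data. `tjur_r2_plain` is a hypothetical helper and not the crandas function, and it assumes both classes occur at least once.

```python
def tjur_r2_plain(y_true, p_pred):
    """Mean predicted probability for positives minus the mean for negatives."""
    positives = [p for y, p in zip(y_true, p_pred) if y == 1]
    negatives = [p for y, p in zip(y_true, p_pred) if y == 0]
    # Assumes both classes are present; otherwise a mean would divide by zero.
    return sum(positives) / len(positives) - sum(negatives) / len(negatives)


y = [0, 0, 0, 1, 1]            # made-up labels
p = [0.1, 0.2, 0.3, 0.7, 0.8]  # made-up predicted probabilities for class 1
print(round(tjur_r2_plain(y, p), 2))  # 0.55
```

A model that assigns the same probability to every row scores 0, because the two group averages coincide.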

pr = precision_recall(y_test, class_predictions_test)
precision, recall = pr.open()

>>> print("Precision", precision)
    Precision 1.0

>>> print("Recall", recall)
    Recall 0.0

Note

Precision and Recall: These are two complementary metrics for evaluating a classification model. They are particularly informative on imbalanced datasets, where accuracy alone can be misleading.

Precision: The proportion of true positives among all positive predictions (true positives + false positives). A higher precision means that fewer of the model’s positive predictions are false positives.

Recall: The proportion of true positives among all actual positives (true positives + false negatives). A higher recall means that the model misses fewer actual positives. A recall of 0.0, as in the output above, means that the model identified none of the actual positives in the test set.
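Both metrics follow from three counts: true positives, false positives, and false negatives. The plain-Python sketch below uses a hypothetical `precision_recall_plain` helper on made-up labels. It returns None when a denominator is zero; how crandas handles that case is not stated in this tutorial, so the convention used here is an assumption.

```python
def precision_recall_plain(y_true, y_pred):
    """Return (precision, recall); None marks an undefined value (0 denominator)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # made-up labels
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # made-up predictions: 2 TP, 1 FP, 1 FN
print(precision_recall_plain(y_true, y_pred))

# A model that never predicts the positive class: recall is 0.0 and
# precision is undefined because there are no positive predictions.
print(precision_recall_plain([0, 0, 1], [0, 0, 0]))  # (None, 0.0)
```

In the first call both precision and recall equal 2/3, since two of the three positive predictions are correct and two of the three actual positives are found.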

This concludes our tutorial on performing logistic regression with heart disease data using crandas. By following these steps, you can create a predictive model, make predictions, and evaluate the performance of your model using a variety of metrics.