.. _logregdemo:
Logistic regression
=======================

In this tutorial, we will demonstrate how to perform logistic regression on heart disease data using crandas. We will cover the following steps:

- Import the necessary libraries
- Load the dataset
- Preprocess the dataset
- Train the logistic regression model
- Make predictions
- Compute metrics

First, let's import the necessary libraries and set our engine session variables.
.. code:: python

    import crandas as cd
    import pandas as pd
    from crandas.base import session
    from timeit import timeit
    from crandas.crlearn.logistic_regression import LogisticRegression
    from crandas.crlearn.model_selection import train_test_split
    from crandas.crlearn.utils import min_max_normalize
    from crandas.crlearn.metrics import classification_accuracy, tjur_r2, mcfadden_r2, precision_recall
    from pathlib import Path

    force_reupload = False
    dataset_size = 100  # max 10000 (will take ~20 minutes)

Next, we'll load the ``heart_outcome`` dataset using pandas and upload it to the engine using :func:`.crandas.upload_pandas_dataframe`.
.. code:: python

    df = pd.read_csv('../data/logreg_data/heart_outcome.csv')
    df.head()

    heart_outcome = cd.upload_pandas_dataframe(df.head(dataset_size), name='heart_outcome')

Now, let's preprocess the ``heart_2020_cleaned`` dataset by normalizing numeric values, creating dummy variables for categorical data, and mapping Yes/No values to 1/0. Then, we'll upload the preprocessed dataset to the engine.
.. note::

    The following actions, which upload the ``heart_data`` table, would be performed by the second party.

.. code:: python

    # Create a min-max normalization function
    def normalize(col):
        return (col - col.min()) / (col.max() - col.min())

    # Load the CSV into a pandas DataFrame
    df = pd.read_csv('../data/logreg_data/heart_2020_cleaned.csv')

    # Normalize numeric values
    fields = ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
    for field in fields:
        df[f"{field}_normalized"] = normalize(df[field])

    # Create dummy variables for categorical data
    df = pd.get_dummies(df, columns=['Sex', 'Diabetic', 'GenHealth', 'AgeCategory'], dtype='int64')
    df = df.rename(columns={'Diabetic_No, borderline diabetes': 'Diabetic_No_Borderline',
                            'Diabetic_Yes (during pregnancy)': 'Diabetic_Yes_Pregnancy'})

    # Convert Yes/No to 1/0
    fields = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer']
    for field in fields:
        df[field] = df[field].map({'Yes': 1, 'No': 0})

    heart_data = cd.upload_pandas_dataframe(df.head(dataset_size), name='heart_data')

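To make the preprocessing concrete, here is a plain-pandas sketch of the same three transformations on a tiny, hypothetical frame (the column values are made up for illustration):

```python
import pandas as pd

# Tiny stand-in for heart_2020_cleaned (values are hypothetical)
df = pd.DataFrame({
    'BMI': [20.0, 25.0, 30.0],
    'Smoking': ['Yes', 'No', 'Yes'],
    'Sex': ['Female', 'Male', 'Female'],
})

# Min-max normalization maps a column onto [0, 1]
df['BMI_normalized'] = (df['BMI'] - df['BMI'].min()) / (df['BMI'].max() - df['BMI'].min())

# One 0/1 dummy column per category value: Sex_Female, Sex_Male
df = pd.get_dummies(df, columns=['Sex'], dtype='int64')

# Map Yes/No strings to 1/0 integers
df['Smoking'] = df['Smoking'].map({'Yes': 1, 'No': 0})

print(df[['BMI_normalized', 'Smoking', 'Sex_Female', 'Sex_Male']])
```

On this toy frame, ``BMI_normalized`` becomes ``[0.0, 0.5, 1.0]``, ``Smoking`` becomes ``[1, 0, 1]``, and each ``Sex`` value gets its own indicator column.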
So now we have two :class:`CDataFrames<.CDataFrame>` to train our logistic regression on: ``heart_outcome`` and ``heart_data``.
1. Train the model
~~~~~~~~~~~~~~~~~~~
.. code:: python

    # Merge the two tables on their shared ID column
    merged = cd.merge(heart_data, heart_outcome, on='ID')

    # Define the training data
    X = merged[['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'PhysicalActivity', 'SleepTime', 'Asthma', 'KidneyDisease', 'SkinCancer', 'BMI_normalized', 'PhysicalHealth_normalized', 'MentalHealth_normalized', 'SleepTime_normalized', 'Sex_Female', 'Sex_Male', 'Diabetic_No', 'Diabetic_No_Borderline', 'Diabetic_Yes', 'Diabetic_Yes_Pregnancy', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good']]
    y = merged[['HeartDisease']]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train the logistic regression model
    BinomialModel = LogisticRegression(max_iter=10, warm_start=True)
    crlearn_model = BinomialModel.fit(X_train, y_train)

>>> print("Trained logistic regression model on:", len(X), "rows")
Trained logistic regression model on: 100 rows

.. code:: python

    crlearn_model.get_beta(mode='open')

.. csv-table::
    :header: , intercept, Smoking, AlcoholDrinking, Stroke, DiffWalking, PhysicalActivity, SleepTime, Asthma, KidneyDisease, SkinCancer, LungCOPD, Obesity, HighBloodPressure, Sex_Male, Diabetic_No, Diabetic_No_Borderline, Diabetic_Yes, Diabetic_Yes_Pregnancy, GenHealth_Excellent, GenHealth_Fair, GenHealth_Good, GenHealth_Poor, GenHealth_Very good
    :widths: 4, 10, 12, 15, 10, 12, 18, 10, 10, 15, 12, 10, 10, 17, 10, 15, 24, 15, 24, 19, 18, 15, 15, 20
    :stub-columns: 1

    **0**, -1.002275, -0.171445, 0.0, 0.155548, 0.306121, -0.870124, 0.071704, -0.374236, 0.03623, -0.127964, 0.18351, 0.451123, -0.073792, -0.371922, -0.945653, -0.208014, 0.24253, 0.0, -0.227025, -0.299601, -0.345784, 1.018995, -1.057688

.. parsed-literal::

    1 rows × 25 columns

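Each entry of the beta vector is a coefficient on the log-odds scale, so exponentiating it gives an odds ratio. As a quick sketch, reusing two of the opened coefficients above (the interpretation assumes the standard logistic model):

```python
import math

# Two coefficients from the opened beta vector above (log-odds scale)
beta_smoking = -0.171445
beta_genhealth_poor = 1.018995

# exp(beta) is the multiplicative change in the odds of the outcome
# per unit increase of the corresponding feature
odds_smoking = math.exp(beta_smoking)
odds_poor_health = math.exp(beta_genhealth_poor)

print(round(odds_smoking, 3))      # ≈ 0.842 (odds multiplied by < 1)
print(round(odds_poor_health, 3))  # ≈ 2.77  (odds multiplied by > 1)
```

Keep in mind that the normalized and dummy-coded features here make raw coefficient magnitudes only loosely comparable across columns.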
2. Make predictions
~~~~~~~~~~~~~~~~~~~~
Now that our model is trained, we can use it to make predictions on the test dataset.
.. code:: python

    class_probabilities_test = crlearn_model.predict_proba(X_test)
    class_predictions_test = crlearn_model.predict(X_test)
    class_probabilities_train = crlearn_model.predict_proba(X_train)

    class_probabilities_test

.. parsed-literal::

    Name: 13366A5A46D25213A4FEDBB179883B28DD3F9A044E66B12E346BD60389F46D33
    Size: 30 rows x 2 columns
    CIndex([Col("predictions (probabilities) 0", "fp", 1), Col("predictions (probabilities) 1", "fp", 1)])

3. Compute metrics
~~~~~~~~~~~~~~~~~~~
Finally, let's compute several metrics to evaluate how well our model performs.

We'll start with the McFadden R^2:
>>> mcfadden_r2(BinomialModel, X_test, y_test).open()
0.10343170166015625
.. note::

    **McFadden R^2:**
    This is a goodness-of-fit measure for logistic regression models.
    It compares the likelihood of the fitted model to the likelihood of a null model that only includes the intercept.
    A higher McFadden R^2 indicates a better fit, with values ranging from 0 to 1.
    It is particularly useful in situations where traditional R^2 is not applicable, such as with logistic regression.

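For intuition, the McFadden R^2 can be computed by hand from the model and null log-likelihoods. This plain-Python sketch uses hypothetical outcomes and probabilities (crandas evaluates the metric on secret-shared data; this only illustrates the formula):

```python
import math

def log_likelihood(y, p):
    # Bernoulli log-likelihood of outcomes y under predicted probabilities p
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 0, 1, 0]                   # hypothetical test outcomes
p_model = [0.8, 0.2, 0.3, 0.7, 0.1]   # hypothetical fitted probabilities

# The null model predicts the overall positive rate for every row
p_null = [sum(y) / len(y)] * len(y)

# McFadden R^2 = 1 - (model log-likelihood / null log-likelihood)
mcfadden = 1 - log_likelihood(y, p_model) / log_likelihood(y, p_null)
print(round(mcfadden, 3))  # ≈ 0.624
```

The closer the fitted probabilities track the true outcomes, the smaller the model's (negative) log-likelihood relative to the null's, and the closer the ratio pushes the metric toward 1.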
>>> classification_accuracy(y_test, class_predictions_test).open()
0.966679573059082
.. note::

    **Classification Accuracy:**
    This is the proportion of correct predictions among the total number of predictions made.
    It is a common measure of model performance in classification tasks.
    A higher classification accuracy indicates a better-performing model.

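The definition is simple enough to sketch in a couple of lines of plain Python on hypothetical labels (the crandas version computes the same ratio under encryption):

```python
# Hypothetical test labels and predictions
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]

# Fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))  # 4 of 6 correct → 0.667
```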
>>> tjur_r2(y_test, class_probabilities_test).open()
0.0916992188
.. note::

    **Tjur R^2:**
    This is another goodness-of-fit measure for logistic regression models.
    It is based on the difference in average predicted probabilities between the two groups (i.e., positive and negative outcomes).
    A higher Tjur R^2 indicates a better fit, with values ranging from 0 to 1.
    Like McFadden R^2, it is useful in situations where traditional R^2 is not applicable.

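That "difference in average predicted probabilities" reduces to a one-line formula, sketched here on hypothetical outcomes and probabilities:

```python
# Hypothetical outcomes and fitted probabilities
y = [1, 1, 0, 0]
p = [0.9, 0.7, 0.4, 0.2]

# Tjur R^2: mean predicted probability among actual positives
# minus mean predicted probability among actual negatives
mean_pos = sum(pi for yi, pi in zip(y, p) if yi == 1) / sum(y)
mean_neg = sum(pi for yi, pi in zip(y, p) if yi == 0) / (len(y) - sum(y))
print(round(mean_pos - mean_neg, 3))  # 0.8 - 0.3 = 0.5
```

A perfectly separating model would assign probability 1 to every positive and 0 to every negative, giving a Tjur R^2 of 1.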
.. code:: python

    pr = precision_recall(y_test, class_predictions_test)
    [precision, recall] = pr.open()

>>> print("Precision", precision)
Precision 1.0
>>> print("Recall", recall)
Recall 0.0

.. note::

    **Precision and Recall:**
    These are two complementary metrics for evaluating the performance of a classification model, particularly when dealing with imbalanced datasets.

    **Precision:** It measures the proportion of true positives among all positive predictions (true positives + false positives). A higher precision indicates that the model is better at identifying true positives without generating too many false positives.

    **Recall:** It measures the proportion of true positives among all actual positives (true positives + false negatives). A higher recall indicates that the model is better at identifying true positives without missing too many actual positives.

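Both metrics come straight from the confusion counts; a recall of 0.0, as in the run above, means the model recovered none of the actual positives. A plain-Python sketch on hypothetical labels:

```python
# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(round(precision, 3), round(recall, 3))  # 0.5 0.333
```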
This concludes our tutorial on performing logistic regression with heart disease data using crandas. By following these steps, you can create a predictive model, make predictions, and evaluate the performance of your model using a variety of metrics.