.. _logregdemo:

Logistic regression
=======================

In this tutorial, we will demonstrate how to perform logistic regression on heart disease data using crandas. We will cover the following steps:

- Import necessary libraries
- Load the dataset
- Preprocess the dataset
- Train the logistic regression model
- Make predictions
- Compute metrics

First, let's import the necessary libraries and set our VDL session variables.

.. code:: python

    import crandas as cd
    import pandas as pd
    from crandas.base import session
    from timeit import timeit
    from crandas.crlearn.logistic_regression import LogisticRegression
    from crandas.crlearn.model_selection import train_test_split
    from crandas.crlearn.utils import min_max_normalize
    from crandas.crlearn.metrics import classification_accuracy, tjur_r2, mcfadden_r2, precision_recall
    from pathlib import Path

    force_reupload = False

    session.base_path = Path('../../../vdl-instance/vdl/secrets')
    session.endpoint = 'https://localhost:9820/api/v1'

    dataset_size = 100 #max 10000 (will take 20 minutes)

Next, we'll load the ``heart_outcome`` dataset using pandas and upload it to the VDL using :func:`.crandas.upload_pandas_dataframe`.

.. code:: python

    df = pd.read_csv('../data/logreg_data/heart_outcome.csv')
    df.head()
    heart_outcome = cd.upload_pandas_dataframe(df.head(dataset_size), name='heart_outcome')

Now, let's preprocess the ``heart_2020_cleaned`` dataset by normalizing numeric values, creating dummy variables for categorical data, and mapping Yes/No values to 1/0. Then, we'll upload the preprocessed dataset to the VDL.

.. note::

    The following steps, which upload the ``heart_data`` table, are performed by the second party.

.. code:: python

    # Min-max normalization: rescale a column to the [0, 1] range
    def normalize(col):
        return (col - col.min()) / (col.max() - col.min())

    # Load the CSV into a pandas DataFrame
    df = pd.read_csv('../data/logreg_data/heart_2020_cleaned.csv')

    # Normalize numeric values
    fields = ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
    for field in fields:
        df[f"{field}_normalized"] = normalize(df[field])

    # Create dummy variables for categorical data
    df = pd.get_dummies(df, columns=['Sex', 'Diabetic', 'GenHealth', 'AgeCategory'], dtype='int64')
    df = df.rename(columns={'Diabetic_No, borderline diabetes':'Diabetic_No_Borderline', 'Diabetic_Yes (during pregnancy)':'Diabetic_Yes_Pregnancy'})

    # Convert Yes/No to 1/0
    fields = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer']
    for field in fields:
        df[field] = df[field].map({'Yes':1 ,'No':0})

    heart_data = cd.upload_pandas_dataframe(df.head(dataset_size), name='heart_data')

We now have two :class:`CDataFrames<.CDataFrame>` with which to train our logistic regression: ``heart_outcome`` and ``heart_data``.

1. Train the model
~~~~~~~~~~~~~~~~~~~

.. code:: python

    # Merge the two tables on their shared ID column
    merged = cd.merge(heart_data, heart_outcome, on='ID')

    # Define the predictors (X) and the target (y)
    X = merged[['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'PhysicalActivity',
                'SleepTime', 'Asthma', 'KidneyDisease', 'SkinCancer', 'BMI_normalized',
                'PhysicalHealth_normalized', 'MentalHealth_normalized', 'SleepTime_normalized',
                'Sex_Female', 'Sex_Male', 'Diabetic_No', 'Diabetic_No_Borderline', 'Diabetic_Yes',
                'Diabetic_Yes_Pregnancy', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good',
                'GenHealth_Poor', 'GenHealth_Very good']]
    y = merged[['HeartDisease']]

    # Hold out 30% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Train the logistic regression model
    BinomialModel = LogisticRegression(max_iter=10, warm_start=True)
    crlearn_model = BinomialModel.fit(X_train, y_train)

>>> print("Trained logistic regression model on:", len(X), "rows")
Trained logistic regression model on: 100 rows

.. code:: python

    crlearn_model.get_beta(mode='open')

.. csv-table::
    :header: , intercept, Smoking, Alcohol-Drinking, Stroke, DiffWalking, PhysicalActivity, SleepTime, Asthma, KidneyDisease, SkinCancer, LungCOPD, Obesity, HighBloodPressure, Sex_Male, Diabetic_No, Diabetic_No_Borderline, Diabetic_Yes, Diabetic_Yes_Pregnancy, GenHealth_Excellent, GenHealth_Fair, GenHealth_Good, GenHealth_Poor, GenHealth_Very good
    :widths: 4, 10, 12, 15, 10, 12, 18, 10, 10, 15, 12, 10, 10, 17, 10, 15, 24, 15, 24, 19, 18, 15, 15, 20
    :stub-columns: 1

    **0**, -1.002275, -0.171445, 0.0, 0.155548, 0.306121, -0.870124, 0.071704, -0.374236, 0.03623, -0.127964, 0.18351, 0.451123, -0.073792, -0.371922, -0.945653, -0.208014, 0.24253, 0.0, -0.227025, -0.299601, -0.345784, 1.018995, -1.057688

.. parsed-literal::

    1 rows × 25 columns

2. Make predictions
~~~~~~~~~~~~~~~~~~~~

Now that our model is trained, we can use it to make predictions on the test dataset.

.. code:: python

    class_probabilities_test = BinomialModel.predict_proba(X_test)
    class_predictions_test = crlearn_model.predict(X_test)
    class_probabilities_train = crlearn_model.predict_proba(X_train)
    class_probabilities_test

.. parsed-literal::

    Name: 13366A5A46D25213A4FEDBB179883B28DD3F9A044E66B12E346BD60389F46D33
    Size: 30 rows x 2 columns
    CIndex([Col("predictions (probabilities) 0", "fp", 1), Col("predictions (probabilities) 1", "fp", 1)])

3. Compute metrics
~~~~~~~~~~~~~~~~~~~

Finally, let's compute several metrics to evaluate our model's performance. We'll start with the McFadden R^2.

>>> mcfadden_r2(BinomialModel, X_test, y_test).open()
0.10343170166015625

.. note::

    **McFadden R^2:** This is a goodness-of-fit measure for logistic regression models. It compares the likelihood of the fitted model to the likelihood of a null model that only includes the intercept. A higher McFadden R^2 indicates a better fit, with values ranging from 0 to 1. It is particularly useful in situations where traditional R^2 is not applicable, such as with logistic regression.

>>> classification_accuracy(y_test, class_predictions_test).open()
0.966679573059082

.. note::

    **Classification Accuracy:** This is the proportion of correct predictions among the total number of predictions made. It is a common measure of model performance in classification tasks. A higher classification accuracy indicates a better-performing model.

>>> tjur_r2(y_test, class_probabilities_test).open()
0.0916992188

.. note::

    **Tjur R^2:** This is another goodness-of-fit measure for logistic regression models. It is the difference between the average predicted probability of the positive outcome in the two groups (i.e., rows with positive and rows with negative outcomes). A higher Tjur R^2 indicates a better fit, with values ranging from 0 to 1. Like McFadden R^2, it is useful in situations where traditional R^2 is not applicable.

.. code:: python

    pr = precision_recall(y_test, class_predictions_test)
    [precision, recall] = pr.open()

>>> print("Precision", precision)
Precision 1.0
>>> print("Recall", recall)
Recall 0.0

.. note::

    **Precision and Recall:** These are two complementary metrics for evaluating the performance of a classification model, particularly when dealing with imbalanced datasets.

    **Precision:** It measures the proportion of true positives among all positive predictions (true positives + false positives). A higher precision indicates that the model generates fewer false positives.

    **Recall:** It measures the proportion of true positives among all actual positives (true positives + false negatives). A higher recall indicates that the model misses fewer actual positives.

This concludes our tutorial on performing logistic regression with heart disease data using crandas. By following these steps, you can create a predictive model, make predictions, and evaluate the performance of your model using a variety of metrics.
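
As a supplement to the metric descriptions in step 3, the sketch below computes the same quantities in plain Python on a small, made-up set of labels and predicted probabilities. It does not use crandas and is not the crandas implementation; the function names (``accuracy``, ``precision_recall_plain``, ``tjur_r2_plain``, ``mcfadden_r2_plain``), the toy data, the 0.5 decision threshold, and the handling of empty denominators are all assumptions made for illustration.

.. code:: python

    import math

    # Hypothetical toy data: true labels and predicted probabilities of class 1
    y_true = [1, 0, 1, 0, 0]
    probs = [0.8, 0.3, 0.4, 0.6, 0.1]
    # Assumed decision threshold of 0.5
    preds = [1 if p >= 0.5 else 0 for p in probs]

    def accuracy(y_true, y_pred):
        """Proportion of predictions that match the true label."""
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    def precision_recall_plain(y_true, y_pred):
        """Return (precision, recall) from true/false positive and negative counts."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        # Assumption: report 0.0 when a denominator is empty
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall

    def tjur_r2_plain(y_true, probs):
        """Mean predicted probability for positives minus that for negatives."""
        pos = [p for t, p in zip(y_true, probs) if t == 1]
        neg = [p for t, p in zip(y_true, probs) if t == 0]
        return sum(pos) / len(pos) - sum(neg) / len(neg)

    def log_likelihood(y_true, probs):
        """Bernoulli log-likelihood of the labels under the given probabilities."""
        return sum(math.log(p) if t == 1 else math.log(1 - p)
                   for t, p in zip(y_true, probs))

    def mcfadden_r2_plain(y_true, probs):
        """1 - LL(model) / LL(null), where the null model is intercept-only."""
        base_rate = sum(y_true) / len(y_true)
        ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
        return 1 - log_likelihood(y_true, probs) / ll_null

    print("Accuracy:", accuracy(y_true, preds))
    print("Precision, recall:", precision_recall_plain(y_true, preds))
    print("Tjur R^2:", tjur_r2_plain(y_true, probs))
    print("McFadden R^2:", mcfadden_r2_plain(y_true, probs))

Working through a small example like this can help when interpreting the values returned by the crandas metric functions, for instance why a classifier that rarely predicts the positive class can combine high accuracy with low recall on an imbalanced dataset.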