.. _logregdemo:

Logistic regression
=======================

In this tutorial, we will demonstrate how to perform logistic regression on heart disease data using crandas. We will cover the following steps:

- Import necessary libraries
- Load the dataset
- Preprocess the dataset
- Train the logistic regression model
- Make predictions
- Compute metrics

First, let's import the necessary libraries:

.. code:: python

    import crandas as cd
    import pandas as pd
    from timeit import timeit
    from crandas.crlearn.logistic_regression import LogisticRegression
    from crandas.crlearn.model_selection import train_test_split
    from crandas.crlearn.metrics import classification_accuracy, tjur_r2, mcfadden_r2, precision_recall
    from pathlib import Path

    force_reupload = False

    dataset_size = 100 #max 10000 (will take 20 minutes)

Next, we'll load the ``heart_outcome`` dataset using pandas and upload it to the engine using :func:`.crandas.upload_pandas_dataframe`.

.. code:: python

    df = pd.read_csv('../data/logreg_data/heart_outcome.csv')
    df.head()
    heart_outcome = cd.upload_pandas_dataframe(df.head(dataset_size), name='heart_outcome')

Now, let's preprocess the heart_2020_cleaned dataset by normalizing numeric values, creating dummy variables for categorical data, and mapping Yes/No values to 1/0. Then, we'll upload the preprocessed dataset to the engine.

.. note::

    The following steps, which upload the ``heart_data`` table, are performed by the second party.

.. code:: python

    #create normalization function
    def normalize(col):
        return (col - col.min()) / (col.max() - col.min())

    #load the CSV into pandas DF
    df = pd.read_csv('../data/logreg_data/heart_2020_cleaned.csv')

    #normalize numeric values
    fields = ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
    for field in fields:
        df[f"{field}_normalized"] = normalize(df[field])

    #create dummy variables for categoric data
    df = pd.get_dummies(df, columns=['Sex', 'Diabetic', 'GenHealth', 'AgeCategory'], dtype='int64')
    df = df.rename(columns={'Diabetic_No, borderline diabetes':'Diabetic_No_Borderline', 'Diabetic_Yes (during pregnancy)':'Diabetic_Yes_Pregnancy'})

    #convert Yes/No to 1/0
    fields = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer']
    for field in fields:
        df[field] = df[field].map({'Yes':1 ,'No':0})

    heart_data = cd.upload_pandas_dataframe(df.head(dataset_size), name='heart_data')
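
The ``normalize`` function above applies min-max scaling, which maps the smallest value of a column to 0 and the largest to 1. The following is a minimal plain-Python sketch of the same formula on made-up values. The helper name ``min_max_normalize`` and the handling of constant columns are assumptions added for illustration: with the pandas version above, a column whose values are all equal would produce ``NaN`` because the denominator is zero.

.. code:: python

    def min_max_normalize(values):
        """Scale a list of numbers to the range [0, 1] (illustrative helper)."""
        lowest, highest = min(values), max(values)
        spread = highest - lowest
        if spread == 0:
            # Assumption: a constant column is mapped to 0.0
            # instead of dividing by zero.
            return [0.0 for _ in values]
        return [(value - lowest) / spread for value in values]

    sleep_hours = [4, 6, 8, 12]  # made-up values, not taken from the dataset
    normalized = min_max_normalize(sleep_hours)
    print(normalized)  # [0.0, 0.25, 0.5, 1.0]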

We now have two :class:`CDataFrames<.CDataFrame>` with which to train our logistic regression: ``heart_outcome`` and ``heart_data``.

1. Train the model
~~~~~~~~~~~~~~~~~~~

.. code:: python

    #Merge data
    merged = cd.merge(heart_data, heart_outcome, on='ID')

    #Define training data
    X = merged[['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'PhysicalActivity', 'SleepTime', 'Asthma', 'KidneyDisease', 'SkinCancer', 'BMI_normalized', 'PhysicalHealth_normalized', 'MentalHealth_normalized', 'SleepTime_normalized', 'Sex_Female', 'Sex_Male', 'Diabetic_No', 'Diabetic_No_Borderline', 'Diabetic_Yes', 'Diabetic_Yes_Pregnancy', 'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good', 'GenHealth_Poor', 'GenHealth_Very good']]
    y = merged[['HeartDisease']]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    #Train logistic regression model
    BinomialModel = LogisticRegression(max_iter=10, warm_start=True)
    crlearn_model = BinomialModel.fit(X_train, y_train)


>>> print("Trained logistic regression model on:", len(X), "rows")
    Trained logistic regression model on: 100 rows

>>> crlearn_model.open()["beta_"]
    [-0.8638057708740234, 0.015638351440429688, 0.0, -0.07614994049072266,
     0.41289520263671875, -0.5652246475219727, -0.007580757141113281,
     -0.04671955108642578, 0.06763076782226562, -0.45177268981933594,
     -0.11643314361572266, 0.09277820587158203, -0.2556333541870117,
     0.041253089904785156, -0.6205329895019531, -0.17079925537109375,
     -0.7881975173950195, -0.09045124053955078,  0.087310791015625,
     0.0, -0.1735095977783203, 0.1998453140258789, -0.37598609924316406,
     0.19045639038085938, -0.632145881652832]
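
A logistic regression model turns its coefficients into a probability by passing a weighted sum of the features through the logistic (sigmoid) function. The sketch below illustrates that calculation in plain Python. The function name and all numbers are hypothetical. The list above has 25 entries for 24 feature columns, which suggests that one entry is the intercept, but which position holds it is not stated here, so the sketch takes the intercept as a separate argument.

.. code:: python

    import math

    def predict_probability(intercept, coefficients, features):
        """Return P(y = 1) for one row under a logistic model (illustrative)."""
        # Weighted sum of the features plus the intercept
        linear_score = intercept + sum(
            c * x for c, x in zip(coefficients, features)
        )
        # The sigmoid maps any real number to the interval (0, 1)
        return 1.0 / (1.0 + math.exp(-linear_score))

    # Hypothetical values, not taken from the trained model
    intercept = -1.0
    coefficients = [0.5, -0.25]
    features = [2.0, 4.0]

    probability = predict_probability(intercept, coefficients, features)
    print(round(probability, 4))  # 0.2689

A negative coefficient lowers the predicted probability as the corresponding feature grows, and a positive one raises it.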

2. Make predictions
~~~~~~~~~~~~~~~~~~~~

Now that our model is trained, we can use it to make predictions on the test dataset.

.. code:: python

    class_probabilities_test = crlearn_model.predict_proba(X_test)
    class_predictions_test = crlearn_model.predict(X_test)
    class_probabilities_train = crlearn_model.predict_proba(X_train)
    class_probabilities_test

.. parsed-literal::

    Name: 13366A5A46D25213A4FEDBB179883B28DD3F9A044E66B12E346BD60389F46D33
    Size: 30 rows x 2 columns
    CIndex([Col("predictions (probabilities) 0", "fp", 1), Col("predictions (probabilities) 1", "fp", 1)])


3. Compute metrics
~~~~~~~~~~~~~~~~~~~

Finally, let's compute several metrics to evaluate our model's performance.

We'll start with the McFadden R^2:

>>> mcfadden_r2(BinomialModel, X_test, y_test).open()
    0.10343170166015625


.. note::

    **McFadden R^2:**
    This is a goodness-of-fit measure for logistic regression models.
    It compares the likelihood of the fitted model to the likelihood of a null model that only includes the intercept.
    A higher McFadden R^2 indicates a better fit, with values ranging from 0 to 1.
    It is particularly useful in situations where traditional R^2 is not applicable, such as with logistic regression.
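
To make the definition concrete, here is a small plain-Python sketch of the McFadden R^2 formula, one minus the ratio of the model's log-likelihood to that of an intercept-only model. The helper names and the data are made up for illustration, and this is not necessarily how ``mcfadden_r2`` is implemented in crandas. The sketch assumes that both classes occur in the labels and that all probabilities lie strictly between 0 and 1.

.. code:: python

    import math

    def log_likelihood(labels, probabilities):
        """Bernoulli log-likelihood of 0/1 labels given P(y = 1) per row."""
        return sum(
            math.log(p) if y == 1 else math.log(1.0 - p)
            for y, p in zip(labels, probabilities)
        )

    def mcfadden_r2_sketch(labels, probabilities):
        """1 - LL(model) / LL(null), where the null model predicts the base rate."""
        base_rate = sum(labels) / len(labels)
        null_probabilities = [base_rate] * len(labels)
        return 1.0 - (
            log_likelihood(labels, probabilities)
            / log_likelihood(labels, null_probabilities)
        )

    # Made-up labels and predicted probabilities
    labels = [1, 0, 0, 0]
    probabilities = [0.6, 0.2, 0.1, 0.3]

    score = mcfadden_r2_sketch(labels, probabilities)
    print(score)

A model that predicts the base rate for every row scores 0, and the score approaches 1 as the predicted probabilities move towards the true labels.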


>>> classification_accuracy(y_test, class_predictions_test).open()
    0.966679573059082

.. note::

    **Classification Accuracy:**
    This is the proportion of correct predictions among the total number of predictions made.
    It is a common measure of model performance in classification tasks.
    A higher classification accuracy indicates a better-performing model.

>>> tjur_r2(y_test, class_probabilities_test).open()
    0.0916992188

.. note::

    **Tjur R^2:**
    This is another goodness-of-fit measure for logistic regression models.
    It is based on the difference in average predicted probabilities between the two groups (i.e., positive and negative outcomes).
    A higher Tjur R^2 indicates a better fit, with values ranging from 0 to 1.
    Like McFadden R^2, it is useful in situations where traditional R^2 is not applicable.
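
The Tjur R^2 definition can be sketched in a few lines of plain Python: average the predicted probability over the rows with a positive outcome, do the same for the rows with a negative outcome, and subtract. The function name and the data below are made up for illustration, and the sketch assumes that both outcome groups are non-empty.

.. code:: python

    def tjur_r2_sketch(labels, probabilities):
        """Mean predicted probability of positives minus that of negatives."""
        positives = [p for y, p in zip(labels, probabilities) if y == 1]
        negatives = [p for y, p in zip(labels, probabilities) if y == 0]
        return sum(positives) / len(positives) - sum(negatives) / len(negatives)

    # Made-up labels and predicted probabilities
    labels = [1, 1, 0, 0]
    probabilities = [0.75, 0.25, 0.5, 0.0]

    score = tjur_r2_sketch(labels, probabilities)
    print(score)  # 0.25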

.. code:: python

    pr = precision_recall(y_test, class_predictions_test)
    [precision, recall] = pr.open()

>>> print("Precision", precision)
    Precision 1.0

>>> print("Recall", recall)
    Recall 0.0

.. note::

    **Precision and Recall:**
    These are two complementary metrics for evaluating the performance of a classification model, particularly when dealing with imbalanced datasets.

    **Precision:** It measures the proportion of true positives among all positive predictions (true positives + false positives). A higher precision indicates that fewer of the model's positive predictions are false positives.

    **Recall:** It measures the proportion of true positives among all actual positives (true positives + false negatives). A higher recall indicates that the model misses fewer of the actual positives.

    In the output above, the recall of 0.0 means that the model identified none of the actual positive cases in the test set, so the high classification accuracy mainly reflects how rare positive cases are in this small sample.
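
The relationship between accuracy, precision and recall is easiest to see from the four confusion-matrix counts. The plain-Python sketch below uses a hypothetical test set of 30 rows with a single positive case and a model that predicts "negative" for every row, which is one situation that would produce numbers like those above. The helper names are made up, and reporting precision as 1.0 when no positive predictions are made is an assumption chosen to mirror the output shown, not a confirmed property of ``precision_recall``.

.. code:: python

    def confusion_counts(labels, predictions):
        """Return (tp, fp, fn, tn) for 0/1 labels and 0/1 predictions."""
        pairs = list(zip(labels, predictions))
        tp = sum(1 for y, p in pairs if y == 1 and p == 1)
        fp = sum(1 for y, p in pairs if y == 0 and p == 1)
        fn = sum(1 for y, p in pairs if y == 1 and p == 0)
        tn = sum(1 for y, p in pairs if y == 0 and p == 0)
        return tp, fp, fn, tn

    def classification_report_sketch(labels, predictions):
        """Return (precision, recall, accuracy) from the confusion counts."""
        tp, fp, fn, tn = confusion_counts(labels, predictions)
        # Assumption: with no positive predictions, report precision as 1.0
        precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        accuracy = (tp + tn) / len(labels)
        return precision, recall, accuracy

    # Hypothetical test set: one positive case, everything predicted negative
    labels = [1] + [0] * 29
    predictions = [0] * 30

    precision, recall, accuracy = classification_report_sketch(labels, predictions)
    print(precision, recall, round(accuracy, 4))  # 1.0 0.0 0.9667

This is why accuracy alone can be misleading on imbalanced data: a model that never predicts the positive class can still score highly.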

This concludes our tutorial on performing logistic regression with heart disease data using crandas. By following these steps, you can create a predictive model, make predictions, and evaluate the performance of your model using a variety of metrics.