.. _logreg:

Binomial logistic regression
############################

The binomial logistic regression model is a statistical model that models the probability
of a certain event taking place based on one or more features. For instance, one
might want to analyze the probability of an individual having a certain disease,
based on their age, weight, blood pressure etc. Alternatively, one might want to
analyze the correlation between the features and the response. These
functionalities are provided through the logistic regression implementation in
crandas. For more information about logistic regression, see `Wikipedia <https://en.wikipedia.org/wiki/Logistic_regression>`_.

This guide explains all of the steps to fit a logistic regression model on a
shared dataset, along with their options and considerations. This is done
through an example with the Titanic dataset (more about this dataset later).

The logistic regression API in crandas follows the `scikit-learn API <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn-linear-model-logisticregression>`_
wherever possible. It is often possible to run existing scikit-learn code with
minimal modifications in crandas. That being said, there are still some
differences, so it is a good idea to go through this guide once.


Setup
=====

Before moving on with the guide, it is necessary to import a few modules
and functions from crandas first:

.. code:: python

    import crandas as cd
    from crandas.crlearn.logistic_regression import LogisticRegression
    from crandas.crlearn.model_selection import train_test_split
    from crandas.crlearn.metrics import mcfadden_r2, tjur_r2, classification_accuracy, precision_recall
    from crandas.crlearn.utils import min_max_normalize


Reading the data
================

In this example, the dataset is read from a local CSV:

.. code:: python

    tab = cd.read_csv("tutorials/data/logreg_test_data/titanic_scaled.csv")

This imports the Titanic dataset, which contains records on the passengers of
the RMS Titanic, with details such as age, gender, class among others, along
with a variable indicating whether the passenger survived the disaster or not.
This guide demonstrates how logistic regression can be used to predict whether
a passenger survives based on these features.

The dataset looks as following:

>>> print(tab.open().head())
       Survived  Pclass  Sex       Age  SibSp  Parch      Fare
    0         0     1.0    1  0.271173    0.2    0.0  0.027567
    1         1     0.0    0  0.472229    0.2    0.0  0.271039
    2         1     1.0    0  0.321438    0.0    0.0  0.030133
    3         1     0.0    0  0.434531    0.2    0.0  0.201901
    4         0     1.0    1  0.434531    0.0    0.0  0.030608


Note that the features are all either binary or numerical
(the ``Pclass``, ``Age``, ``SibSp``, ``Parch``, ``Fare`` columns).

.. note::
    See `Kaggle <https://www.kaggle.com/c/titanic/data>`_ for more information
    about this dataset.

    The Titanic dataset provided here is a slightly modified variant
    of the original dataset. Specifically, it excludes a few columns and rows
    containing NaN values.


Preparing the data
==================

Getting rid of null values
---------------------------

The logistic regression can only be executed on a :class:`.CDataFrame` without null values (specifically, without nullable columns).
If the dataset contains any missing values, one can get rid of all rows with null values using :meth:`.CDataFrame.dropna`.

.. code:: python

    tab = tab.dropna()

An alternative to deleting the rows with null values is performing `data imputation <https://en.wikipedia.org/wiki/Imputation_(statistics)>`_ using :meth:`.CSeries.fillna`. However, this might introduce bias and is not recommended in the general case.


Normalizing
-----------

If the dataset contains any numerical values (e.g. ``Age`` in this example), these
first need to be normalized to values between 0 and 1. The way in which this is
commonly done is through `Min-Max-Normalization <https://en.wikipedia.org/wiki/Feature_scaling#Rescaling_(min-max_normalization)>`_.

.. code:: python

    tab_normalized = min_max_normalize(tab, columns=['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'])

Here, ``columns`` can be used to specify which columns need to be normalized. The
remaining columns will remain untouched.

.. attention::
    It is **essential** to normalize your numerical features to within [0, 1]
    **before** you fit the model. Otherwise, fitting will **not** work correctly,
    and will return erroneous results.

    Negative values are also **not** allowed.

Splitting into train and test sets
----------------------------------

First, split the predictor variables from the response variable:

.. code:: python

    X = tab_normalized[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
    y = tab_normalized[['Survived']]

Next, split the data into train and test sets (70% train, 30% test):

.. code:: python
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

.. note::

    The ``train_test_split`` function is the crandas-analogue of the identically
    named `scikit-learn function <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split>`_

    The ``random_state`` argument can be any integer seed used to generate the random 
    assignment between the two sets. Using the same seed will lead to the same split
    over multiple runs


Creating the model
==================

The logistic regression functionality in crandas is made accessible through
the :class:`.LogisticRegression` class, which contains the actual model,
and which can be used to fit the model and make predictions. This model can be
created using:

.. code:: python

    model = LogisticRegression(solver='lbfgs')


.. note::
    The ``solver`` argument indicates which numerical solver the model should
    use to fit the model. Currently, the available options are:

    - ``lbfgs`` (which stands for `Limited-memory BFGS <https://en.wikipedia.org/wiki/Limited-memory_BFGS>`_)
    - ``gd`` (which stands for `Gradient Descent <https://en.wikipedia.org/wiki/Gradient_descent>`_)

    The ``lbfgs`` solver gives better results and fits the model faster. As such,
    there is normally no reason to deviate from it. 


Fitting the model
=================

Now that the data has been prepared and the model has been created, the model
can be fitted to the training set:

.. code:: python

    model.fit(X_train, y_train, max_iter=10)

Here, the ``max_iter`` argument specifies how many iterations the numerical
solver should perform to fit the model. The default of 10 is sufficient in some
cases (for this example it is), but sometimes it is necessary to increase this
number in order for the model to fully converge.

.. note::

    Fitting a logistic regression model in crandas can take quite some time,
    depending on the number of records in the dataset, the number of features,
    and the number of iterations that you specify.

The fitted model parameters can now be accessed as following:

.. code:: python

    beta = model.get_beta()

The first column is used for the intercept terms, while the remaining columns
correspond to each of the features.


Predicting
==========

Now that the model has been fitted, it can be used to make predictions. We
distinguish two different types in crandas:

- probabilities: the model can predict the probability of each class being associated with the record
- classes: the model can predict the class with the highest likelihood


Probabilities
-------------

First, to predict the probabilities corresponding to each record of the test
dataset:

.. code:: python

    y_pred_probabilities = model.predict_proba(X_test)

This returns a table with two columns, with the probability for the first and
second classes respectively. These sum up to one.


Classes
-------

Alternatively, if you are interested in making actual class predictions rather
than the probabilities, you can directly predict the classes through:

.. code:: python

    y_pred_classes = model.predict(X_test)


.. tip::
    Predicting *classes* is a quick operation, that takes *significantly* less
    time than predicting probabilities.


Assessing prediction quality
============================

After fitting the model, it is important to assess the quality of the model and
its predictions. crandas provides a couple of methods for doing this, namely:

- Classification Accuracy
- `Precision & Recall <https://en.wikipedia.org/wiki/Precision_and_recall>`_
- `Tjur R^2 <https://en.wikipedia.org/wiki/Pseudo-R-squared#R2T_by_Tjur>`_


Accuracy
--------

To compute the accuracy of the (class) predictions, you can use:

.. code:: python

    accuracy = classification_accuracy(y_test, y_pred_classes)
    print("Classification Accuracy:", accuracy.open())

Precision & Recall
------------------

The precision and recall can be computed simultaneously (on the class predictions) through:

.. code:: python

    pr = precision_recall(y_test, y_pred_classes)
    [precision, recall] = pr.open()
    print("Precision", precision)
    print("Recall", recall)


Tjur R^2
--------

Finally, the Tjur R^2 is a metric that can be used to evaluate how well the
model can distinguish the two classes from each other:

.. code:: python

    tjur = tjur_r2(y_test, y_pred_probabilities)
    print("Tjur R^2:", tjur.open())

Note that it is evaluated on the predicted probabilities, not the predicted
classes. This returns a number typically between 0 and 1, with 1 indicating that the model
can perfectly distinguish the classes, and a number close to 0 indicating that the
model can hardly distinguish the classes.


.. note::
    
    In theory, it is possible for the Tjur R^2 value to become negative, as low as -1. This happens
    when the model makes mostly 'opposite' predictions, i.e. it predicts 1 when the class is 0, and 0
    when the class is actually 1.
    
    In practice, this does not occur often.

For more information on the Tjur R^2, the following reference is relevant:

    Tue Tjur. “Coefficients of determination in logistic regression models—
    A new proposal: The coefficient of discrimination”. In: The American
    Statistician 63.4 (2009), pp. 366-372.


Assessing model quality
=======================

Finally, from a mathematical perspective, it can also be interesting to evaluate
specifically the fit of the model, independent of any predictions or any test dataset. This
can be done through the McFadden R^2 metric, which is computed as following:

.. code:: python

    mcfadden = mcfadden_r2(model, X, y)
    print("McFadden R^2:", mcfadden.open())

This returns a value between 0 and 1, indicating how effective the logistic regression
model is at using the features to predict the response variable.

.. note::
    The ``mcfadden R^2`` only says something about the quality of the proposed
    model in relation to the training dataset. It does not imply anything about
    future predictions (e.g. on new data).

.. attention::
    Observe that the ``mcfadden R^2`` is computed on the *training* dataset, **not**
    the test dataset.
    
    This is because it purely relates the resulting model to the dataset on which it
    was fitted.

For more information on the McFadden R^2, the following reference is relevant:

    McFadden, Daniel. "Conditional logit analysis of qualitative choice
    behavior." (1973).

.. seealso::
    The `Wikipedia Section <https://en.wikipedia.org/wiki/Logistic_regression#Goodness_of_fit_summary>`_ on
    goodness of fit measures.

For another tutorial of our logistic regression functionality, go to :ref:`Tutorial: Logistic Regression<logregdemo>`.