Binomial logistic regression

The binomial logistic regression model is a statistical model that models the probability of a certain event taking place based on one or more features. For instance, one might want to analyze the probability of an individual having a certain disease, based on their age, weight, blood pressure etc. Alternatively, one might want to analyze the correlation between the features and the response. These functionalities are provided through the logistic regression implementation in crandas. For more information about logistic regression, see Wikipedia.

This guide explains all of the steps to fit a logistic regression model on a shared dataset, along with their options and considerations. This is done through an example with the Titanic dataset (more about this dataset later).

The logistic regression API in crandas follows the scikit-learn API wherever possible. It is often possible to run existing scikit-learn code with minimal modifications in crandas. That being said, there are still some differences, so it is a good idea to go through this guide once.

Setup

Before moving on with the guide, it is necessary to import a few modules and functions from crandas first:

import crandas as cd
from crandas.crlearn.logistic_regression import LogisticRegression
from crandas.crlearn.model_selection import train_test_split
from crandas.crlearn.metrics import (
    mcfadden_r2,
    tjur_r2,
    classification_accuracy,
    precision_recall,
)
from crandas.crlearn.utils import MinMaxScaler

Reading the data

In this example, the dataset is read from a local CSV:

tab = cd.read_csv("tutorials/data/logreg_test_data/titanic_scaled.csv")

This imports the Titanic dataset, which contains records on the passengers of the RMS Titanic, with details such as age, gender, class among others, along with a variable indicating whether the passenger survived the disaster or not. This guide demonstrates how logistic regression can be used to predict whether a passenger survives based on these features.

The dataset looks as following:

>>> print(tab.open().head())
       Survived  Pclass  Sex       Age  SibSp  Parch      Fare
    0         0     1.0    1  0.271173    0.2    0.0  0.027567
    1         1     0.0    0  0.472229    0.2    0.0  0.271039
    2         1     1.0    0  0.321438    0.0    0.0  0.030133
    3         1     0.0    0  0.434531    0.2    0.0  0.201901
    4         0     1.0    1  0.434531    0.0    0.0  0.030608

Note that the features are all either binary or numerical (the Pclass, Age, SibSp, Parch, Fare columns).

Note

See Kaggle for more information about this dataset.

The Titanic dataset provided here is a slightly modified variant of the original dataset. Specifically, it excludes a few columns and rows containing NaN values.

Preparing the data

Getting rid of null values

The logistic regression can only be executed on a DataFrame without null values (specifically, without nullable columns). If the dataset contains any missing values, one can get rid of all rows with null values using .dropna().

tab = tab.dropna()

An alternative to deleting the rows with null values is performing data imputation using .fillna(). However, this might introduce bias and is not recommended in the general case.

Splitting into train and test sets

First, split the predictor variables from the response variable:

X = tab[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = tab[['Survived']]

Next, split the data into train and test sets (\(70\%\) train, \(30\%\) test):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Note

The train_test_split function is the crandas-analogue of the identically named scikit-learn function

The random_state argument can be any integer seed used to generate the random assignment between the two sets. Using the same seed will lead to the same split over multiple runs

Normalizing

If the dataset contains any numerical values (e.g. Age in this example), these first need to be normalized to values between 0 and 1. The way in which this is commonly done is through Min-Max-Normalization.

scaler = MinMaxScaler()
X_train_normalized = scaler.fit_transform(
    X_train, features=["Pclass", "Age", "SibSp", "Parch", "Fare"]
)

Here, features can be used to specify which columns need to be normalized. The remaining columns will not be normalized (but will be converted to fixed point).

Attention

It is essential to normalize your numerical features to within [0, 1] before you fit the model. Otherwise, fitting will not work correctly, and will return errors or erroneous results.

Negative values are also not allowed.

Creating the model

The logistic regression functionality in crandas is made accessible through the LogisticRegression class, which contains the actual model, and which can be used to fit the model and make predictions. This model can be created using:

model = LogisticRegression(optimizer='lbfgs', max_iter=10)

Note

The optimizer argument indicates which numerical optimizer the model should use to fit the model. Currently, the available options are:

lbfgs (which stands for Limited-memory BFGS)
gd (which stands for Gradient Descent)

The lbfgs optimizer gives better results and fits the model faster. As such, there is normally no reason to deviate from it.

The max_iter argument specifies how many iterations the numerical optimizer should perform to fit the model. The default of 10 is sufficient in some cases (for this example it is), but sometimes it is necessary to increase this number in order for the model to fully converge.

Fitting the model

Now that the data has been prepared and the model has been created, the model can be fitted to the training set:

model.fit(X_train_normalized, y_train)

Note

Fitting a logistic regression model in crandas can take quite some time, depending on the number of records in the dataset, the number of features, and the number of iterations that you specify.

The fitted model parameters can now be accessed as following:

beta = model.open()["beta_"]

The first value is the intercept term. The remaining values are coefficients corresponding to each of the features.

Predicting

Now that the model has been fitted, it can be used to make predictions. To make predictions, the test dataset needs to be normalized by applying the MinMaxScaler that was also used when fitting.

We distinguish two different types in crandas:

probabilities: the model can predict the probability of each class being associated with the record
classes: the model can predict the class with the highest likelihood

Probabilities

First, to predict the probabilities corresponding to each record of the test dataset:

y_pred_probabilities = model.predict_proba(scaler.transform(X_test))

This returns a table with two columns, with the probability for the first and second classes respectively. These sum up to one.

Classes

Alternatively, if you are interested in making actual class predictions rather than the probabilities, you can directly predict the classes through:

y_pred_classes = model.predict(scaler.transform(X_test))

Tip

Predicting classes is a quick operation, that takes significantly less time than predicting probabilities.

Assessing prediction quality

After fitting the model, it is important to assess the quality of the model and its predictions. crandas provides a couple of methods for doing this, namely:

Classification Accuracy
Precision & Recall
Tjur \(R^2\)

Accuracy

To compute the accuracy of the (class) predictions, you can use:

accuracy = classification_accuracy(y_test, y_pred_classes)
print("Classification Accuracy:", accuracy.open())

Precision & Recall

The precision and recall can be computed simultaneously (on the class predictions) through:

pr = precision_recall(y_test, y_pred_classes)
[precision, recall] = pr.open()
print("Precision", precision)
print("Recall", recall)

Tjur \(R^2\)

Finally, the Tjur \(R^2\) is a metric that can be used to evaluate how well the model can distinguish the two classes from each other:

tjur = tjur_r2(y_test, y_pred_probabilities)
print("Tjur $R^2$:", tjur.open())

Note that it is evaluated on the predicted probabilities, not the predicted classes. This returns a number typically between 0 and 1, with 1 indicating that the model can perfectly distinguish the classes, and a number close to 0 indicating that the model can hardly distinguish the classes.

Note

In theory, it is possible for the Tjur \(R^2\) ¹ value to become negative, as low as -1. This happens when the model makes mostly opposite predictions, i.e. it predicts 1 when the class is 0, and 0 when the class is actually 1.

In practice, this does not occur often.

Assessing model quality

Finally, from a mathematical perspective, it can also be interesting to evaluate specifically the fit of the model, independent of any predictions or any test dataset. This can be done through the McFadden \(R^2\) metric, which is computed as following:

mcfadden = mcfadden_r2(model, X_train_normalized, y_train)
print("McFadden R^2:", mcfadden.open())

This returns a value between 0 and 1, indicating how effective the logistic regression model is at using the features to predict the response variable.

Note

The McFadden \(R^2\)`² only says something about the quality of the proposed model in relation to the training dataset. It does not imply anything about future predictions (e.g. on new data).

Attention

Observe that the McFadden \(R^2\) is computed on the training dataset, not the test dataset.

This is because it purely relates the resulting model to the dataset on which it was fitted.