Ordinal logistic regression
Ordinal logistic regression is a statistical model that relates one or
more features to an ordinal response variable. That is, the response
can take on more than two categories, and these categories have a
natural ordering. An example is a rating with the categories bad,
acceptable and good. For more information about ordinal logistic
regression, see Wikipedia.
Setup
Before moving on with the guide, it is necessary to import a few modules and functions from crandas first:
import crandas as cd
from crandas.crlearn.logistic_regression import LogisticRegression
from crandas.crlearn.metrics import classification_accuracy
from crandas.crlearn.utils import MinMaxScaler
from crandas.crlearn.metrics import confusion_matrix
Reading the data
In this example, the dataset is read from a local CSV file into a table named tab.
This imports the White wine quality dataset, which contains records of
white wines with features such as the pH or the alcohol content,
along with the quality. The quality column is an ordinal variable
with a value from 0 to 4, indicating the graded quality of the wine.
This guide demonstrates how ordinal logistic regression can be used to
predict the quality of a wine based on these features.
The dataset looks as follows:
>>> print(tab.open().head())
   quality  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide   density        pH  sulphates   alcohol
0        3       0.428572             0.150     0.303030        0.029126   0.131673             0.214022              0.473283  0.253345  0.662921   0.351352  0.303572
1        2       0.515873             0.070     0.242424        0.288026   0.391459             0.391144              0.641221  0.373327  0.359550   0.256757  0.321428
2        2       0.412699             0.170     0.191919        0.190939   0.128114             0.701107              0.629771  0.280967  0.370787   0.121622  0.303572
3        2       0.476191             0.140     0.353536        0.009708   0.153025             0.236162              0.519084  0.253345  0.696630   0.256757  0.285714
4        2       0.317460             0.170     0.272727        0.268608   0.149467             0.295203              0.480916  0.300820  0.471910   0.378378  0.410714
Note that the features are all numerical.
Note
See Kaggle for more information about this dataset.
The Wine quality dataset provided here is a slightly modified variant of the original dataset. Specifically, the features have already been normalized, some classes that are sparsely represented have been excluded, and the dataset has been sampled for a more balanced class distribution.
Preparing the data
Getting rid of null values
Ordinal regression can only be run on a DataFrame without null values
(specifically, without nullable columns). If the dataset contains any
missing values, all rows with null values can be removed using
.dropna().
An alternative to deleting the rows with null values is to impute the
missing data using .fillna(). However, this might introduce bias and is
not recommended in the general case.
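To make this trade-off concrete, here is a minimal plain-Python sketch (not crandas code) on made-up values, where None stands in for a null. It compares dropping incomplete records with imputing the column mean. The helper names drop_nulls and fill_with_mean are hypothetical and exist only for this illustration.

```python
from statistics import mean, pvariance

# Made-up pH measurements; None marks a missing value.
ph = [0.2, 0.9, None, 0.4, None, 0.7]

def drop_nulls(values):
    """Keep only the observed values (the effect of dropping rows with nulls)."""
    return [v for v in values if v is not None]

def fill_with_mean(values):
    """Replace every missing value by the mean of the observed values."""
    observed_mean = mean(drop_nulls(values))
    return [observed_mean if v is None else v for v in values]

dropped = drop_nulls(ph)
imputed = fill_with_mean(ph)

# Dropping shrinks the dataset; imputing keeps its size.
print(len(dropped), len(imputed))
# Mean imputation keeps the mean but shrinks the variance, which is one
# way in which imputation can distort a dataset.
print(pvariance(dropped), pvariance(imputed))
```

In this toy example the imputed column has the same mean as the observed values but a smaller variance, which is the kind of distortion the warning about bias refers to.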
Splitting into predictors and response
First, split the predictor variables from the response variable:
X = tab[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar','chlorides',
'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']]
y = tab[['quality']]
Normalizing
If the predictor variables contain any numerical values (e.g.
fixed acidity in this example), these first need to be normalized to
values between 0 and 1. This is commonly done through min-max
normalization:
scaler = MinMaxScaler()
X = scaler.fit_transform(X, features=['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'])
Here, the features argument specifies which columns need to be
normalized. All other columns are left untouched.
Attention
It is essential to normalize your numerical features to the range [0, 1] before you fit the model. Otherwise, fitting will not work correctly and will return erroneous results.
In particular, negative values are not allowed.
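As an illustration of what min-max normalization does to a single column, the following plain-Python sketch rescales made-up values so that the smallest becomes 0 and the largest becomes 1. It is not the crandas MinMaxScaler, and the function name min_max_scale is hypothetical.

```python
def min_max_scale(column):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Made-up raw alcohol percentages (not taken from the dataset).
alcohol = [8.8, 9.5, 10.1, 12.4, 14.0]
scaled = min_max_scale(alcohol)
print(scaled)

# Negative inputs also end up inside [0, 1] after scaling.
print(min_max_scale([-3.0, 0.0, 1.0]))  # [0.0, 0.75, 1.0]
```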
Creating the model
The logistic regression functionality in crandas is made accessible
through the LogisticRegression class,
which can be used to fit the model and make predictions. The model can
be created using:
Here, the multi_class argument specifies the type of regression to be
performed, in this case ordinal. The n_classes argument specifies
the number of classes in the dataset.
Note
The optimizer argument indicates which numerical solver is used to fit
the model. Currently, the available options are:
- lbfgs (which stands for Limited-memory BFGS)
- gd (which stands for Gradient Descent)
The lbfgs optimizer gives better results and fits the model faster. As
such, there is normally no reason to deviate from it.
The max_iter argument specifies how many iterations the numerical
optimizer should perform to fit the model. The default of 10 is
sufficient in some cases, but sometimes this number has to be increased
for the model to fully converge. The wine quality dataset in this guide
needs 20 iterations.
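To show why the iteration count matters, here is a small plain-Python sketch of gradient descent on a binary logistic model with one feature. It is a simplified stand-in, not the crandas solver and not an ordinal model. The data, the learning rate and the names fit_gd and log_loss are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    """Mean negative log-likelihood of a one-feature binary logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def fit_gd(xs, ys, max_iter, learning_rate=1.0):
    """Plain gradient descent, stopped after max_iter steps."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_iter):
        # Gradients of the mean log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up, already normalized feature values with binary labels.
xs = [0.1, 0.2, 0.35, 0.5, 0.6, 0.8, 0.9]
ys = [0, 0, 0, 1, 0, 1, 1]

# More iterations bring the loss closer to its minimum.
for iterations in (10, 20, 200):
    w, b = fit_gd(xs, ys, iterations)
    print(iterations, log_loss(w, b, xs, ys))
```

The printed loss keeps decreasing as the iteration count grows, which illustrates that a model stopped too early has not yet converged.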
Attention
It is required to specify the number of classes in the dataset. Unlike in scikit-learn, the crandas model does not detect this automatically due to the dataset being secret-shared.
Fitting the model
Now that the data has been prepared and the model has been created, the model can be fitted to the training set:
Note
Fitting a logistic regression model in crandas can take quite some time, depending on the number of records in the dataset, the number of features, and the number of iterations that you specify.
The fitted model parameters can now be accessed as follows:
The first n_classes - 1 values are the thresholds that separate
consecutive classes. The remaining values are the coefficients, one for
each feature.
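The following plain-Python sketch shows how a parameter vector with this layout can be turned into class probabilities under the standard proportional-odds (cumulative logit) formulation. Whether crandas uses exactly this sign convention is not stated here, so P(y <= k) = sigmoid(threshold_k - w . x) is an assumption of the sketch. The parameter values and the function name ordinal_probabilities are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probabilities(params, x, n_classes):
    """Class probabilities under a proportional-odds model.

    params holds n_classes - 1 thresholds followed by one coefficient per
    feature. Assumed convention: P(y <= k) = sigmoid(threshold_k - w . x).
    """
    thresholds = params[: n_classes - 1]
    coefficients = params[n_classes - 1 :]
    score = sum(w * v for w, v in zip(coefficients, x))
    # Cumulative probabilities P(y <= k); for the last class this is 1.
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    # Differences of consecutive cumulative values give per-class probabilities.
    return [cumulative[0]] + [
        cumulative[k] - cumulative[k - 1] for k in range(1, n_classes)
    ]

# Made-up parameters: 4 increasing thresholds (5 classes) and 2 coefficients.
params = [-1.0, 0.5, 1.5, 3.0, 2.0, -0.5]
probs = ordinal_probabilities(params, x=[0.6, 0.2], n_classes=5)
print(probs)
print(sum(probs))
```

Because the thresholds are increasing, every class receives a positive probability, and the probabilities add up to one.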
Predicting
Now that the model has been fitted, it can be used to make predictions. crandas distinguishes two types of prediction:
- probabilities: the model can predict the probability of each class being associated with the record
- classes: the model can predict the class with the highest likelihood
Probabilities
First, to predict the probabilities corresponding to each record of the test dataset:
This returns a table with five columns (one for each class), containing the probability of each class for each record. The probabilities in each row sum to one.
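To illustrate how such a row of probabilities relates to a single predicted class, the plain-Python sketch below picks the class with the highest probability for each record. The probability values are made up and the helper most_likely_class is hypothetical. It is not a crandas call.

```python
# Made-up probability rows for three records and five classes (0 to 4).
probabilities = [
    [0.05, 0.10, 0.60, 0.20, 0.05],
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.02, 0.08, 0.20, 0.30, 0.40],
]

def most_likely_class(row):
    """Return the index of the largest probability in the row."""
    return max(range(len(row)), key=lambda k: row[k])

predicted_classes = [most_likely_class(row) for row in probabilities]
print(predicted_classes)  # [2, 0, 4]
```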
Classes
Alternatively, if you are interested in making actual class predictions rather than the probabilities, you can directly predict the classes through:
Assessing prediction quality
After fitting the model, it is important to assess the quality of the model and its predictions. crandas provides a couple of methods for doing this, namely:
- Classification Accuracy
- Confusion Matrix
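As a plain-Python illustration of what these two metrics measure, the sketch below computes both on made-up labels in the clear. It is not the crandas implementation, and the function names classification_accuracy_plain and confusion_matrix_plain are hypothetical. Rows of the matrix are indexed by the true class and columns by the predicted class.

```python
def classification_accuracy_plain(y_true, y_pred):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion_matrix_plain(y_true, y_pred, n_classes):
    """Count matrix: row index is the true class, column index the predicted class."""
    # The size is fixed by n_classes, not derived from the labels.
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels for eight records with classes 0 to 4.
y_true = [0, 1, 2, 2, 3, 4, 4, 1]
y_pred = [0, 2, 2, 2, 3, 3, 4, 1]

print(classification_accuracy_plain(y_true, y_pred))  # 0.75
for row in confusion_matrix_plain(y_true, y_pred, n_classes=5):
    print(row)
```

The diagonal of the matrix counts the correct predictions, so the accuracy equals the sum of the diagonal divided by the number of records.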
Accuracy
To compute the accuracy of the (class) predictions, you can use:
accuracy = classification_accuracy(y, y_pred_classes, n_classes=5)
print("Classification Accuracy:", accuracy.open())
Attention
It is required to specify the number of classes in the dataset. The function does not detect this automatically, due to the dataset being secret-shared.
Confusion Matrix
The confusion matrix visualizes the relation between the predicted classes and the actual classes. The Y-axis represents the true class, while the X-axis represents the class predicted by the model. To compute the confusion matrix obliviously, you can use: