.. _ordinallogreg:

Ordinal logistic regression
###########################

The ordinal logistic regression model is a statistical model that models the
relationship between one or more features and an ordinal response variable.
That is, the response can take on more than two categories, but there exists
an ordering between these categories. For instance, this could be a ranking
consisting of ``bad``, ``acceptable``, and ``good`` as its categories. For
more information about ordinal logistic regression, see
`Wikipedia <https://en.wikipedia.org/wiki/Ordinal_regression>`_.

Setup
=====

Before moving on with the guide, it is necessary to import a few modules and
functions from crandas first:

.. code:: python

    import crandas as cd
    from crandas.crlearn.logistic_regression import LogisticRegression
    from crandas.crlearn.metrics import classification_accuracy, confusion_matrix
    from crandas.crlearn.utils import min_max_normalize

Reading the data
================

In this example, the dataset is read from a local CSV:

.. code:: python

    tab = cd.read_csv("../../test/logreg_test_data/white_wine_quality_scaled.csv")

This imports the *White wine quality* dataset, which contains records of white
wines with features such as the ``pH`` or the ``alcohol`` content, along with
the ``quality``. The ``quality`` column is an ordinal variable with a value
from 0 to 4, indicating the graded quality of the wine. This guide
demonstrates how ordinal logistic regression can be used to predict the
quality of a wine based on these features.

The dataset looks as follows:

>>> print(tab.open().head())
   quality  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide   density        pH  sulphates   alcohol
0        3       0.428572             0.150     0.303030        0.029126   0.131673             0.214022              0.473283  0.253345  0.662921   0.351352  0.303572
1        2       0.515873             0.070     0.242424        0.288026   0.391459             0.391144              0.641221  0.373327  0.359550   0.256757  0.321428
2        2       0.412699             0.170     0.191919        0.190939   0.128114             0.701107              0.629771  0.280967  0.370787   0.121622  0.303572
3        2       0.476191             0.140     0.353536        0.009708   0.153025             0.236162              0.519084  0.253345  0.696630   0.256757  0.285714
4        2       0.317460             0.170     0.272727        0.268608   0.149467             0.295203              0.480916  0.300820  0.471910   0.378378  0.410714

Note that the features are all numerical.

.. note::

    See the *Wine quality* dataset on Kaggle for more information about this
    dataset. The variant provided here is a slightly modified version of the
    original dataset. Specifically, the features have already been normalized,
    some classes that are sparsely represented have been excluded, and the
    dataset has been sampled for a more balanced class distribution.

Preparing the data
==================

Getting rid of null values
--------------------------

The ordinal regression can only be executed on a :class:`.CDataFrame` without
null values (specifically, without nullable columns). If the dataset contains
any missing values, one can get rid of all rows with null values using
:meth:`.CDataFrame.dropna`:

.. code:: python

    tab = tab.dropna()

An alternative to deleting the rows with null values is performing `data
imputation <https://en.wikipedia.org/wiki/Imputation_(statistics)>`_ using
:meth:`.CSeries.fillna`. However, this might introduce bias and is not
recommended in the general case.
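For illustration, such an imputation could look as follows. This is a minimal
sketch: it assumes the ``pH`` column is nullable, and the fill value ``0.5``
is arbitrary and chosen purely for illustration.

.. code:: python

    # Hypothetical sketch: replace missing pH values with a fixed constant.
    # The fill value 0.5 is arbitrary; assign() overwrites the existing column
    # with its imputed, non-nullable version.
    tab = tab.assign(pH=tab["pH"].fillna(0.5))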
Normalizing
-----------

If the dataset contains any numerical values (e.g. ``fixed acidity`` in this
example), these first need to be normalized to values between 0 and 1. This is
commonly done through `min-max normalization
<https://en.wikipedia.org/wiki/Feature_scaling>`_:

.. code:: python

    tab_normalized = min_max_normalize(tab, columns=['fixed acidity',
        'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',
        'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH',
        'sulphates', 'alcohol'])

Here, ``columns`` can be used to specify which columns need to be normalized.
The remaining columns are left untouched.

.. attention::

    It is **essential** to normalize your numerical features to within [0, 1]
    **before** you fit the model. Otherwise, fitting will **not** work
    correctly and will return erroneous results. Negative values are also
    **not** allowed.

Splitting into predictors and response
--------------------------------------

First, split the predictor variables from the response variable:

.. code:: python

    X = tab_normalized[['fixed acidity', 'volatile acidity', 'citric acid',
        'residual sugar', 'chlorides', 'free sulfur dioxide',
        'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']]
    y = tab_normalized[['quality']]

Creating the model
==================

The logistic regression functionality in crandas is made accessible through
the :class:`.LogisticRegression` class, which can be used to fit the model and
make predictions. The model can be created using:

.. code:: python

    model = LogisticRegression(solver='lbfgs', multi_class='ordinal', n_classes=5)

Here, the ``multi_class`` argument specifies the type of regression to be
performed, in this case ``ordinal``. The ``n_classes`` argument specifies the
number of classes in the dataset.

.. note::

    The ``solver`` argument indicates which numerical solver the model should
    use to fit the model. Currently, the available options are:

    - ``lbfgs`` (which stands for `Limited-memory BFGS
      <https://en.wikipedia.org/wiki/Limited-memory_BFGS>`_)
    - ``gd`` (which stands for `gradient descent
      <https://en.wikipedia.org/wiki/Gradient_descent>`_)

    The ``lbfgs`` solver gives better results and fits the model faster. As
    such, there is normally no reason to deviate from it.

.. attention::

    It is **required** to specify the number of classes in the dataset. Unlike
    in scikit-learn, the crandas model does not detect this automatically, due
    to the dataset being secret-shared.

Fitting the model
=================

Now that the data has been prepared and the model has been created, the model
can be fitted to the training set:

.. code:: python

    model.fit(X, y, max_iter=20)

Here, the ``max_iter`` argument specifies how many iterations the numerical
solver should perform to fit the model. The default of 10 is sufficient in
some cases, but it is sometimes necessary to increase this number for the
model to fully converge. For the wine quality dataset, 20 iterations are
needed.

.. note::

    Fitting a logistic regression model in crandas can take quite some time,
    depending on the number of records in the dataset, the number of features,
    and the number of iterations that you specify.

The fitted model parameters can now be accessed as follows:

.. code:: python

    beta = model.get_beta()

The first ``n_classes - 1`` columns represent the threshold values that
distinguish each of the classes. The remaining columns correspond to each of
the features.
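To inspect the fitted parameters, the result can be revealed like other
crandas results. A minimal sketch, assuming that ``get_beta`` returns an
object that supports ``open()`` like a regular :class:`.CDataFrame`:

.. code:: python

    # Reveal the fitted thresholds and feature coefficients
    # (opening secret-shared values may require approval in a real deployment)
    print(beta.open())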
Predicting
==========

Now that the model has been fitted, it can be used to make predictions. We
distinguish two different types in crandas:

- probabilities: the model can predict the probability of each class being
  associated with the record
- classes: the model can predict the class with the highest likelihood

Probabilities
-------------

First, to predict the probabilities corresponding to each record of the
dataset:

.. code:: python

    y_pred_probabilities = model.predict_proba(X)

This returns a table with five columns (one for each class), containing the
point probability for each class. These probabilities sum to one for each
record.

Classes
-------

Alternatively, if you are interested in making actual class predictions rather
than the probabilities, you can directly predict the classes through:

.. code:: python

    y_pred_classes = model.predict(X)

Assessing prediction quality
============================

After fitting the model, it is important to assess the quality of the model
and its predictions. crandas provides a couple of methods for doing this,
namely:

- classification accuracy
- `confusion matrix <https://en.wikipedia.org/wiki/Confusion_matrix>`_

Accuracy
--------

To compute the accuracy of the (class) predictions, you can use:

.. code:: python

    accuracy = classification_accuracy(y, y_pred_classes, n_classes=5)
    print("Classification Accuracy:", accuracy.open())

.. attention::

    It is **required** to specify the number of classes in the dataset. The
    function does not detect this automatically, due to the dataset being
    secret-shared.

Confusion Matrix
----------------

The confusion matrix visualizes the relation between the predicted classes and
the actual classes. The Y-axis represents the true class, while the X-axis
represents the class predicted by the model. To compute the confusion matrix
obliviously, you can use:

.. code:: python

    matrix = confusion_matrix(y, y_pred_classes, n_classes=5)

>>> matrix.open()
[[ 85,  15,  60,   0,   3],
 [ 51,  21,  94,   0,   5],
 [ 11,  13, 154,   0,  47],
 [  0,   4,  56,   0,  44],
 [  2,   2,  77,   0,  94]]
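As a sanity check, the overall accuracy can also be derived from the opened
confusion matrix: correct predictions lie on the diagonal, so the accuracy is
the sum of the diagonal divided by the total number of records. A minimal
sketch, assuming ``matrix.open()`` returns the nested list shown above:

.. code:: python

    import numpy as np

    m = np.array(matrix.open())
    # Accuracy = correctly classified records (the diagonal) / all records
    accuracy_check = np.trace(m) / m.sum()
    print(accuracy_check)  # should match classification_accuracy up to rounding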