.. _linreg:

.. meta::
    :description: Perform Linear and Ridge regression on encrypted data with crandas. Build secure predictive models for finance, healthcare, and EU-compliant analytics.
    :keywords: crandas, Roseman Labs, linear regression, User Guide, Machine Learning

Linear regression
###################

.. versionadded:: 1.6

    Linear regression

Introduction
============
Linear regression is a statistical tool used to model the linear relationship between a set of variables and a corresponding group of target values.

This guide explains how to create, train and score a linear regression model. [1]_ We follow the notation and structure of `scikit-learn <https://scikit-learn.org/stable/modules/linear_model.html>`__, a popular Python machine learning package. Default parameters are also consistent with those in scikit-learn.

Beyond the "standard" approach to linear regression, OLS or ordinary least squares, we have implemented Ridge regression, a different estimation technique that is especially useful whenever the independent variables are highly correlated. This property, known as multicollinearity, causes difficulties in estimating separate or unique effects of individual features. As crandas works with private data from multiple sources, it might be hard to know whether variables are related, making this method especially useful.
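
As a sketch in standard textbook notation (not taken from the crandas implementation), the two estimators differ only in a penalty term. Here :math:`\alpha \ge 0` is the regularization strength, and we assume the scikit-learn convention in which the intercept is left unpenalized:

.. math::

    \hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2,
    \qquad
    \hat{\beta}_{\mathrm{Ridge}} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2

For centered data, the Ridge solution is :math:`(X^TX + \alpha I)^{-1}X^Ty`. The added :math:`\alpha I` keeps the matrix well-conditioned even when columns of :math:`X` are nearly collinear.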

To demonstrate the linear regression functionality, we will work with a diabetes dataset and model the relationship between disease progression and features such as age, body mass index and blood pressure.


.. [1] For more information on linear regression, see `Wikipedia <https://en.wikipedia.org/wiki/Linear_regression>`__.

Setup
=====
Before delving into the main guide, we need to import the necessary modules:

.. code:: python

    import crandas as cd
    from crandas.crlearn.linear_model import LinearRegression, Ridge
    from crandas.crlearn.model_selection import train_test_split
    from crandas.crlearn.metrics import score_r2

Reading the data
================
We start by uploading the data, which is stored in a CSV file.

.. code:: python

    tab = cd.read_csv("tutorials/data/linreg_data/ncsu_diabetes_dataset.csv")

This dataset contains the variables ``age``, ``sex``, body mass index (``bmi``), average blood pressure (``bp``), and six blood serum measurements (``s1`` to ``s6``). Our goal is to model the relationship between those variables and ``Target``, which indicates the progression of the disease.

Getting rid of null values
---------------------------

Linear regression can only be run on a :class:`.DataFrame` without null values (specifically, without nullable columns).
If the dataset contains missing values, you can remove all rows with null values using :meth:`.DataFrame.dropna`.

.. code:: python

    tab = tab.dropna()

An alternative to deleting the rows with null values is performing `data imputation <https://en.wikipedia.org/wiki/Imputation_(statistics)>`__ using :meth:`.CSeries.fillna`. However, this might introduce bias and is not recommended in the general case.


Splitting into train and test sets
==================================
Now we need to define the predictor and target variables, and then split the dataset into a training set and a test set.
Below we use a test size of 0.3 (so 70% of the data is used for training and 30% for testing):

.. code:: python

    # Set the predictor variables
    X = tab[['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']]

    # Set the target variable
    y = tab[['Target']]

    # Split the data into training and test sets - you can also use the random_state variable to set a fixed seed if desired (e.g. random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

This step ensures that we have a separate set of data to evaluate the performance of our model after training.
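
As a plain-Python illustration of what a 70/30 split with a fixed seed does (this is not how crandas performs the split on protected data), the sketch below shuffles row indices and cuts them at the requested test size. The helper name ``split_indices`` is hypothetical.

.. code:: python

    import random


    def split_indices(n_rows, test_size=0.3, seed=42):
        """Return (train_idx, test_idx) for a shuffled split of range(n_rows)."""
        rng = random.Random(seed)  # fixed seed, analogous to random_state
        idx = list(range(n_rows))
        rng.shuffle(idx)
        n_test = round(n_rows * test_size)
        # The first n_test shuffled indices form the test set, the rest the training set
        return idx[n_test:], idx[:n_test]


    train_idx, test_idx = split_indices(10)

With the same seed, repeated calls return the same split, which is what makes results reproducible.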

Creating the model
==================
Linear regression is implemented in the :class:`.linear_model` module. From here we can create a typical OLS linear regression (through :class:`LinearRegression<.crlearn.linear_model.LinearRegression>`) or a Ridge regression (through :func:`Ridge<.crlearn.linear_model.Ridge>`).


.. attention::

    Unlike logistic regression, for linear regression **you should never normalize or scale** the data before fitting, as this will result in a model that performs poorly.
    Correct predictions of a linear regression model depend on the magnitude of each feature.
    The engine performs the necessary normalization (and its inverse) internally, in such a way that the resulting model is not affected by it.

.. code:: python

    model = LinearRegression()

    # solver is an optional parameter; if omitted, it defaults to 'cholesky'
    modelR = Ridge(solver='cholesky')

    # alpha is also optional; if omitted, Ridge uses alpha=1
    modelRalpha = Ridge(solver='cholesky', alpha=0.5)

The ``solver`` argument specifies the method used in the calculation. If not specified, it defaults to ``cholesky``, which is currently the only supported solver.
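
To give an idea of what a Cholesky-based solver does, here is a minimal plaintext NumPy sketch. It is an illustration under textbook assumptions, not the engine's implementation, and the helper name ``ridge_cholesky`` is hypothetical. It solves the Ridge normal equations on centered data and shows the effect of ``alpha``.

.. code:: python

    import numpy as np


    def ridge_cholesky(X, y, alpha=1.0):
        """Solve (Xc^T Xc + alpha * I) beta = Xc^T yc via a Cholesky factorization."""
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc, yc = X - x_mean, y - y_mean  # centering leaves the intercept unpenalized
        A = Xc.T @ Xc + alpha * np.eye(X.shape[1])  # positive definite for alpha > 0
        L = np.linalg.cholesky(A)  # A = L @ L.T with L lower triangular
        # Two solves with the triangular factors (a general solver is used for brevity)
        z = np.linalg.solve(L, Xc.T @ yc)
        beta = np.linalg.solve(L.T, z)
        intercept = y_mean - x_mean @ beta
        return beta, intercept


    # Synthetic data, for illustration only
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)

    beta_half, intercept_half = ridge_cholesky(X, y, alpha=0.5)
    beta_one, intercept_one = ridge_cholesky(X, y, alpha=1.0)

A larger ``alpha`` shrinks the coefficient vector more strongly, so ``beta_one`` has a smaller norm than ``beta_half``.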

Fitting the model
=================
The model can now be fit to the training set that we created earlier (70% of the data):

.. code:: python

    model = model.fit(X_train, y_train)

As the notation is the same for both the :class:`LinearRegression<.crlearn.linear_model.LinearRegression>` and :func:`Ridge<.crlearn.linear_model.Ridge>` model, we'll work with only the first one in this guide.

Obtain the model coefficients and standard error
================================================
Following the fitting of the model on the training data, we can obtain the model coefficients or beta coefficients. These represent the influence of each feature on the target variable.
This can be done as follows:

.. code:: python

    beta = model.open()["beta_"]

.. note::

    During fitting, a matrix inverse is computed implicitly. If the matrix to be inverted is singular or nearly singular, the result can be numerically unstable.
    If potential singularity is encountered, this is reflected in the opened result.
    It can be accessed as follows:

    .. code:: python

        singular = model.singular_


For linear regression without Ridge regularization (ordinary least squares), the standard error of each regression coefficient is also computed (not to be confused with the residual standard error, or the standard error of regression).

The standard error of each coefficient is given by :math:`\mathrm{SE}(\beta_j) = \sqrt{\mathrm{Var}(\beta_j) \cdot \mathrm{MSE}}`, where :math:`\mathrm{MSE}` stands for the mean squared error and the (unscaled) variance is derived from the training data as :math:`\mathrm{Var}(\beta_j) = \left((X^TX)^{-1}\right)_{jj}`. Together with the coefficients, the standard errors can be used for statistical tests, p-values and confidence intervals.
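
As a plaintext NumPy sketch of this computation on synthetic data, the code below uses the textbook definition: the square root of the MSE times the corresponding diagonal entry of :math:`(X^TX)^{-1}`. The degrees-of-freedom choice for the MSE is an assumption, and the code does not reflect how the engine computes on protected data.

.. code:: python

    import numpy as np

    # Synthetic data, for illustration only
    rng = np.random.default_rng(0)
    n, p = 200, 2
    X = rng.normal(size=(n, p))
    y = 1.0 + X @ np.array([2.0, -3.0]) + rng.normal(scale=0.5, size=n)

    Xd = np.column_stack([np.ones(n), X])  # design matrix with an intercept column
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y  # OLS coefficients, intercept first

    residuals = y - Xd @ beta
    # Assumption: MSE uses n - p - 1 degrees of freedom, as in most statistics texts
    mse = residuals @ residuals / (n - Xd.shape[1])
    standard_error = np.sqrt(mse * np.diag(XtX_inv))

    # Ratio of each coefficient to its standard error, the basis for t-tests
    t_values = beta / standard_error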

The standard error of each coefficient can be obtained as follows:

.. code:: python

    standard_error = model.open()["standard_error_"]


.. note::

    As in the well-known statistical computing package R, we **do not** compute the standard errors for Ridge regression,
    as these values are not very meaningful for regularized estimates (such as Ridge) because of the bias that regularization introduces.
    (See `here <https://cran.r-project.org/web/packages/penalized/vignettes/penalized.pdf>`_).


Predicting
==========
Once the model has been fitted, it can also be used to make predictions of the target variable. Using the :meth:`predict<.crlearn.linear_model.LinearRegression.predict>` method on the test set (``X_test``), we can predict the values of our ``Target`` variable.

.. code:: python

    # Create predictions for y-values (disease progression) based on X-values (predictor variables)
    y_test_pred = model.predict(X_test)

>>> y_test_pred.open().head()
    Target
0	205.030249
1	176.979178
2	122.074872
3	213.149721
4	174.968525

Assessing prediction quality
============================
Assessing the quality of the model is a critical step after fitting. We can score the model by finding the *R-squared* coefficient using the :func:`score_r2<.crlearn.metrics.score_r2>` function:

>>> score_r2(y_test, y_test_pred).open()
0.5176897048950195

If we are not interested in the predictions themselves, only in the R-squared score, we can skip a step by using the :meth:`score<.crlearn.linear_model.LinearRegression.score>` method. Here we use it with our Ridge model ``modelR``:

.. code:: python

    # We need to fit the model first
    modelR = modelR.fit(X_train, y_train)
    scoreR = modelR.score(X_test, y_test)

>>> scoreR.open()
0.4838247299194336

Both approaches return the R-squared score. The two values above differ because they come from different models: the first from the OLS model ``model``, the second from the Ridge model ``modelR``.

Currently, R-squared is the only metric for linear regression.
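
For reference, R-squared is conventionally defined as one minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below writes out that standard definition. The helper name ``r2_plain`` is hypothetical, and the code is not the crandas implementation.

.. code:: python

    def r2_plain(y_true, y_pred):
        """R-squared = 1 - SS_res / SS_tot."""
        mean_y = sum(y_true) / len(y_true)
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        ss_tot = sum((t - mean_y) ** 2 for t in y_true)
        return 1.0 - ss_res / ss_tot


    y_true = [3.0, 5.0, 7.0, 9.0]
    perfect = r2_plain(y_true, y_true)       # predictions equal to the targets
    mean_only = r2_plain(y_true, [6.0] * 4)  # always predicting the mean of y_true

A perfect prediction scores 1, while always predicting the mean of the targets scores 0. Models that do worse than the mean get a negative score.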