Linear regression
=====================

This tutorial shows how to use :ref:`linreg` in crandas. We will fit both Ordinary Least Squares (OLS) and Ridge regression models on a dummy weather dataset, modelling the relationship between temperature and other weather attributes such as humidity and wind speed.
Let's dive into the steps needed to create, train, and score a multiple linear regression model using crandas.

.. code:: python

    # import the necessary modules
    import crandas as cd
    from crandas.crlearn.linear_model import LinearRegression, Ridge
    from crandas.crlearn.model_selection import train_test_split
    from crandas.crlearn.metrics import score_r2
    
    # In a Jupyter environment provided by RosemanLabs, session variables are set automatically in the background.
    # In any other environment, set session.base_path and session.endpoint manually:
    # uncomment the following 4 commented lines and change their values.
    
    # from crandas.base import session
    # from pathlib import Path
    
    # session.base_path = Path('base/path/to/vdl/secrets') 
    # session.endpoint = 'https://localhost:9820/api/v1'
    
    # read the data from a csv file
    tab = cd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv", auto_bounds=True)

Next, we define the predictor variables (``X``) and the target variable (``y``).
The data is then split into training and test sets using a 70/30 split (``test_size=0.3``).

.. code:: python

    # define the predictor variables
    X = tab[['Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
             'Wind Bearing (degrees)', 'Visibility (km)', 'Cloud Cover',
             'Pressure (millibars)']]
    
    # define the target variable
    y = tab[['Temperature (C)']]
    
    # split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
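
To make the 70/30 split concrete, here is a minimal plain-Python sketch of what such a split does on ordinary lists. It is an illustration only and not how crandas implements ``train_test_split``; the helper ``split_xy``, its ``seed`` parameter, and the toy data are hypothetical.

.. code:: python

    import random

    def split_xy(X, y, test_size=0.3, seed=0):
        """Shuffle row indices once and apply the same split to X and y."""
        indices = list(range(len(X)))
        random.Random(seed).shuffle(indices)
        n_test = round(len(X) * test_size)
        test_idx, train_idx = indices[:n_test], indices[n_test:]

        def pick(data, idx):
            return [data[i] for i in idx]

        return (pick(X, train_idx), pick(X, test_idx),
                pick(y, train_idx), pick(y, test_idx))

    # hypothetical toy data: ten rows with one predictor, target = 2 * predictor
    X = [[i] for i in range(10)]
    y = [2 * i for i in range(10)]

    X_train, X_test, y_train, y_test = split_xy(X, y, test_size=0.3)
    print(len(X_train), len(X_test))

The point of sharing one index permutation is that every predictor row stays paired with its own target value after the split.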

A simple ordinary least squares (OLS) linear regression model is trained on the training data in the
next code block. The code then prints the fitted coefficients (beta), the first five
predictions on the test set, and the model's R2 score on the test set.

.. code:: python

    # create a linear regression model (OLS)
    ols_model = LinearRegression()
    ols_model = ols_model.fit(X_train, y_train)
    beta_ols = ols_model.get_beta()
    print(f"OLS Beta:\n{beta_ols.open()}")
    y_test_pred_ols = ols_model.predict(X_test)
    print(f"First five OLS predictions:\n{y_test_pred_ols.open().head()}")
    ols_score = score_r2(y_test, y_test_pred_ols).open()
    print(f"OLS R2 Score: {ols_score}")


.. parsed-literal::

    OLS Beta:
       intercept  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
    0   1.552305                  1.021051  0.584511          -0.048673
    
       Wind Bearing (degrees)  Visibility (km)  Cloud Cover  Pressure (millibars)
    0                0.004935         0.101491         0.0               0.006493
    First five OLS predictions:
       predictions
    0    19.993599
    1    21.993005
    2    18.009869
    3    21.004709
    4    19.017549
    OLS R2 Score: 0.9999761581420898
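
For readers who want to see what these quantities mean, the following dependency-free sketch fits OLS on a tiny plaintext dataset by solving the normal equations and then computes the R2 score as one minus the ratio of residual to total sum of squares. It is a textbook illustration under stated assumptions, not the crandas implementation; the helpers ``fit_ols``, ``predict`` and ``r2_score`` and the toy data are hypothetical.

.. code:: python

    def fit_ols(X, y):
        """Return [intercept, coef_1, ...] minimising the squared error."""
        A = [[1.0] + list(row) for row in X]  # prepend an intercept column
        p = len(A[0])
        # normal equations (A^T A) beta = A^T y, stored as an augmented matrix
        M = [[sum(r[i] * r[j] for r in A) for j in range(p)]
             + [sum(r[i] * t for r, t in zip(A, y))] for i in range(p)]
        # Gauss-Jordan elimination with partial pivoting
        for col in range(p):
            pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
            M[col], M[pivot] = M[pivot], M[col]
            for r in range(p):
                if r != col:
                    factor = M[r][col] / M[col][col]
                    M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
        return [M[i][p] / M[i][i] for i in range(p)]

    def predict(beta, X):
        """Apply intercept plus weighted sum of the predictors to each row."""
        return [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in X]

    def r2_score(y_true, y_pred):
        """Coefficient of determination: 1 - SS_res / SS_tot."""
        mean = sum(y_true) / len(y_true)
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        ss_tot = sum((t - mean) ** 2 for t in y_true)
        return 1.0 - ss_res / ss_tot

    # hypothetical toy data generated from y = 1 + 2*x1 + 0.5*x2 (no noise)
    X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
    y = [1 + 2 * x1 + 0.5 * x2 for x1, x2 in X]

    beta = fit_ols(X, y)
    r2 = r2_score(y, predict(beta, X))
    print("beta:", [round(b, 6) for b in beta])
    print("R2:", round(r2, 6))

Because the toy target is an exact linear function of the predictors, the fit recovers the generating coefficients up to rounding and the R2 score is essentially 1. On real data the score is lower, and it should be computed on held-out test rows as in the tutorial.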
    
The next two code blocks train Ridge regression models instead.
The first block uses the Cholesky solver and does not set ``alpha``, so the default regularization strength applies.
The second block uses the same solver with the regularization parameter ``alpha`` set to 0.5.
For each model, the coefficients, the first five test-set predictions, and the R2 score are displayed.

.. code:: python

    # create a Ridge regression model with 'cholesky' solver
    ridge_model = Ridge(solver='cholesky')
    ridge_model = ridge_model.fit(X_train, y_train)
    beta_ridge = ridge_model.get_beta()
    print(f"Ridge Beta:\n{beta_ridge.open()}")
    y_test_pred_ridge = ridge_model.predict(X_test)
    print(f"First five Ridge predictions:\n{y_test_pred_ridge.open().head()}")
    ridge_score = score_r2(y_test, y_test_pred_ridge).open()
    print(f"Ridge R2 Score: {ridge_score}")


.. parsed-literal::

    Ridge Beta:
       intercept  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
    0   0.000175                  1.024868  0.691266          -0.057512
    
       Wind Bearing (degrees)  Visibility (km)  Cloud Cover  Pressure (millibars)
    0                0.005835          0.11996         0.0               0.007699
    First five Ridge predictions:
       predictions
    0    20.003505
    1    22.002791
    2    18.022610
    3    21.016407
    4    19.031838
    Ridge R2 Score: 0.9999094009399414
    

.. code:: python

    # create a Ridge regression model with 'cholesky' solver and alpha=0.5
    ridge_alpha_model = Ridge(solver='cholesky', alpha=0.5)
    ridge_alpha_model = ridge_alpha_model.fit(X_train, y_train)
    beta_ridge_alpha = ridge_alpha_model.get_beta()
    print(f"Ridge (alpha=0.5) Beta:\n{beta_ridge_alpha.open()}")
    y_test_pred_ridge_alpha = ridge_alpha_model.predict(X_test)
    print(f"First five Ridge (alpha=0.5) predictions:\n{y_test_pred_ridge_alpha.open().head()}")
    ridge_alpha_score = score_r2(y_test, y_test_pred_ridge_alpha).open()
    print(f"Ridge (alpha=0.5) R2 Score: {ridge_alpha_score}")


.. parsed-literal::

    Ridge (alpha=0.5) Beta:
       intercept  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
    0   0.000351                   1.02478  0.690762          -0.057468
    
       Wind Bearing (degrees)  Visibility (km)  Cloud Cover  Pressure (millibars)
    0                0.005813         0.119828         0.0               0.007677
    First five Ridge (alpha=0.5) predictions:
       predictions
    0    19.975530
    1    21.974325
    2    17.994225
    3    20.988381
    4    19.003291
    Ridge (alpha=0.5) R2 Score: 0.9999275207519531
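
As background on what a Cholesky-based ridge solver computes, the sketch below solves the textbook ridge system ``(X^T X + alpha * I) b = X^T y`` on plaintext lists using a Cholesky factorization followed by forward and back substitution. This is an assumption-laden illustration: it omits the intercept, it does not reflect how crandas performs the computation, and the helpers ``cholesky``, ``solve_cholesky`` and ``fit_ridge`` and the toy data are hypothetical.

.. code:: python

    import math

    def cholesky(A):
        """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
        n = len(A)
        L = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1):
                s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
                L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
        return L

    def solve_cholesky(A, b):
        """Solve A x = b via A = L L^T."""
        L = cholesky(A)
        n = len(b)
        z = [0.0] * n
        for i in range(n):  # forward substitution: L z = b
            z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
        x = [0.0] * n
        for i in reversed(range(n)):  # back substitution: L^T x = z
            x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
        return x

    def fit_ridge(X, y, alpha):
        """Coefficients (no intercept) minimising ||y - Xb||^2 + alpha * ||b||^2."""
        p = len(X[0])
        gram = [[sum(r[i] * r[j] for r in X) + (alpha if i == j else 0.0)
                 for j in range(p)] for i in range(p)]
        rhs = [sum(r[i] * t for r, t in zip(X, y)) for i in range(p)]
        return solve_cholesky(gram, rhs)

    # hypothetical toy data generated from y = 2*x1 + 0.5*x2 (no noise, no intercept)
    X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
    y = [2 * x1 + 0.5 * x2 for x1, x2 in X]

    coefs = {alpha: fit_ridge(X, y, alpha) for alpha in (0.0, 0.5, 10.0)}
    for alpha, b in coefs.items():
        norm = math.sqrt(sum(v * v for v in b))
        print(f"alpha={alpha}: coefficients={[round(v, 4) for v in b]}, norm={norm:.4f}")

With ``alpha`` set to 0 the system reduces to ordinary least squares, and larger values of ``alpha`` pull the coefficient vector towards zero, which is the trade-off the ``alpha`` parameter controls.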