Linear regression

This tutorial shows how to perform linear regression with crandas, covering both Ordinary Least Squares (OLS) and Ridge regression. We use a dummy weather dataset and model the relationship between temperature and other weather attributes, such as humidity and wind speed. The steps below create, train, and score a multiple linear regression model.

# import the necessary modules
import crandas as cd
from crandas.crlearn.linear_model import LinearRegression, Ridge
from crandas.crlearn.model_selection import train_test_split
from crandas.crlearn.metrics import score_r2

# In a Jupyter environment provided by RosemanLabs, session variables are set automatically in the background.
# In any other environment, set session.base_path and session.endpoint manually:
# uncomment the following 4 commented lines and change their values.

# from crandas.base import session
# from pathlib import Path

# session.base_path = Path('base/path/to/vdl/secrets')
# session.endpoint = 'https://localhost:9820/api/v1'

# read the data from a csv file
tab = cd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv", auto_bounds=True)

Next, we define the predictor variables (X) and the target variable (y). The data is then split into a training set (70%) and a test set (30%).

# define the predictor variables
X = tab[['Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
         'Wind Bearing (degrees)', 'Visibility (km)', 'Cloud Cover',
         'Pressure (millibars)']]

# define the target variable
y = tab[['Temperature (C)']]

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
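
To make the 70-30 split concrete, here is a minimal plain-Python sketch of what a random train/test split does on an ordinary list. It is an illustration only and says nothing about how crandas implements `train_test_split` on secret-shared data; the helper `simple_split` and its `seed` parameter are hypothetical names introduced for this example.

```python
import random

def simple_split(rows, test_size=0.3, seed=0):
    """Shuffle row indices and hold out a fraction `test_size` for testing.

    Hypothetical helper for illustration; not part of crandas.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)    # seeded so the shuffle is reproducible
    n_test = round(len(rows) * test_size)   # number of held-out rows
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(10))                      # ten dummy rows
train, test = simple_split(rows)
print(len(train), len(test))                # 7 training rows, 3 test rows
```

Every row ends up in exactly one of the two sets, so the test set can be used to score the model on data it has not seen during training.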

The next code block trains a simple OLS linear regression model on the training data. It then displays the model’s coefficients (beta), the first five predictions on the test set, and the model’s R2 score on the test set.

# create a linear regression model (OLS)
ols_model = LinearRegression()
ols_model = ols_model.fit(X_train, y_train)
beta_ols = ols_model.get_beta()
print(f"OLS Beta:\n{beta_ols.open()}")
y_test_pred_ols = ols_model.predict(X_test)
print(f"First five OLS predictions:\n{y_test_pred_ols.open().head()}")
ols_score = score_r2(y_test, y_test_pred_ols).open()
print(f"OLS R2 Score: {ols_score}")
OLS Beta:
   intercept  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
0   1.552305                  1.021051  0.584511          -0.048673

   Wind Bearing (degrees)  Visibility (km)  Cloud Cover  Pressure (millibars)
0                0.004935         0.101491         0.0               0.006493
First five OLS predictions:
   predictions
0    19.993599
1    21.993005
2    18.009869
3    21.004709
4    19.017549
OLS R2 Score: 0.9999761581420898
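
As a reminder of what the R2 score measures, the sketch below computes the standard coefficient of determination, 1 - SS_res / SS_tot, in plain Python. It assumes that `score_r2` follows this standard definition, which the tutorial does not state explicitly; the function name `r2_sketch` and the sample values are hypothetical.

```python
def r2_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (standard definition)."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [18.0, 20.0, 22.0, 24.0]          # made-up temperatures
perfect = r2_sketch(y_true, y_true)        # predictions equal the targets
baseline = r2_sketch(y_true, [21.0] * 4)   # always predict the mean (21.0)
print(perfect, baseline)                   # 1.0 0.0
```

A score close to 1, as in the output above, means the predictions explain almost all of the variance in the test targets; a score of 0 is what a model that always predicts the mean would achieve.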

To use Ridge regression instead, see the following two code blocks. The first trains a Ridge regression model with the Cholesky solver and the default regularization strength. The second trains a Ridge regression model with the Cholesky solver and a regularization parameter alpha of 0.5. For each model, the coefficients, the first five predictions, and the R2 score are displayed.

# create a Ridge regression model with 'cholesky' solver
ridge_model = Ridge(solver='cholesky')
ridge_model = ridge_model.fit(X_train, y_train)
beta_ridge = ridge_model.get_beta()
print(f"Ridge Beta:\n{beta_ridge.open()}")
y_test_pred_ridge = ridge_model.predict(X_test)
print(f"First five Ridge predictions:\n{y_test_pred_ridge.open().head()}")
ridge_score = score_r2(y_test, y_test_pred_ridge).open()
print(f"Ridge R2 Score: {ridge_score}")
Ridge Beta:
   intercept  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
0   0.000175                  1.024868  0.691266          -0.057512

   Wind Bearing (degrees)  Visibility (km)  Cloud Cover  Pressure (millibars)
0                0.005835          0.11996         0.0               0.007699
First five Ridge predictions:
   predictions
0    20.003505
1    22.002791
2    18.022610
3    21.016407
4    19.031838
Ridge R2 Score: 0.9999094009399414

# create a Ridge regression model with 'cholesky' solver and alpha=0.5
ridge_alpha_model = Ridge(solver='cholesky', alpha=0.5)
ridge_alpha_model = ridge_alpha_model.fit(X_train, y_train)
beta_ridge_alpha = ridge_alpha_model.get_beta()
print(f"Ridge (alpha=0.5) Beta:\n{beta_ridge_alpha.open()}")
y_test_pred_ridge_alpha = ridge_alpha_model.predict(X_test)
print(f"First five Ridge (alpha=0.5) predictions:\n{y_test_pred_ridge_alpha.open().head()}")
ridge_alpha_score = score_r2(y_test, y_test_pred_ridge_alpha).open()
print(f"Ridge (alpha=0.5) R2 Score: {ridge_alpha_score}")
Ridge (alpha=0.5) Beta:
   intercept  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  \
0   0.000351                   1.02478  0.690762          -0.057468

   Wind Bearing (degrees)  Visibility (km)  Cloud Cover  Pressure (millibars)
0                0.005813         0.119828         0.0               0.007677
First five Ridge (alpha=0.5) predictions:
   predictions
0    19.975530
1    21.974325
2    17.994225
3    20.988381
4    19.003291
Ridge (alpha=0.5) R2 Score: 0.9999275207519531
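
To illustrate what a Cholesky-based Ridge solver computes, the following plain-Python sketch solves the regularized normal equations (X^T X + alpha * I) beta = X^T y with a Cholesky factorization. It is a textbook version under simplifying assumptions: there is no intercept term, the data is tiny and made up, and the function name `ridge_cholesky` is hypothetical. It is not the crandas implementation, which operates on secret-shared data.

```python
import math

def ridge_cholesky(X, y, alpha=1.0):
    """Solve (X^T X + alpha * I) beta = X^T y via Cholesky (no intercept)."""
    n, p = len(X), len(X[0])
    # Build the regularized normal equations A beta = b.
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (alpha if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Cholesky factorization A = L L^T, with L lower triangular.
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution: solve L z = b.
    z = [0.0] * p
    for i in range(p):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    # Back substitution: solve L^T beta = z.
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (z[i] - sum(L[k][i] * beta[k] for k in range(i + 1, p))) / L[i][i]
    return beta

# Made-up data generated exactly by y = 2 * x1 + 3 * x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 3.0, 5.0, 7.0]

beta_ols = ridge_cholesky(X, y, alpha=0.0)    # alpha = 0 reduces to OLS
beta_ridge = ridge_cholesky(X, y, alpha=0.5)  # alpha > 0 shrinks the coefficients
print(beta_ols)
print(beta_ridge)
```

With alpha set to 0 the sketch recovers the generating coefficients (2 and 3, up to rounding), while alpha = 0.5 pulls the coefficient vector towards zero. This is the trade-off that the alpha parameter controls in the Ridge models above.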