K-nearest neighbors¶
Added in version 1.9.
Introduction¶
K-nearest neighbors is a prediction algorithm that makes predictions based on the closest data points in the training data. It can be used for both classification and regression, but crandas currently only supports regression. For a new data point, the algorithm looks at the k nearest neighbors according to a distance function and predicts the target variable as the average of that variable over those neighbors. For more information about k-nearest neighbors, see Wikipedia.
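As an illustration of the algorithm itself (not of how crandas implements it internally), a plain k-nearest neighbors regression can be sketched in a few lines of NumPy:
import numpy as np

def knn_regress(X_train, y_train, x_new, k=5):
    # Distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Pick the k closest training points
    nearest = np.argsort(dists)[:k]
    # The prediction is the average target value of those neighbors
    return y_train[nearest].mean()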
This guide explains all of the steps to set up a prediction model based on a shared dataset, along with their options and considerations. This is done through an example with the weather dataset, the same one used in Linear regression.
The k-nearest neighbors API in crandas follows the scikit-learn API wherever possible. It is often possible to run existing scikit-learn code with minimal modifications in crandas. That being said, there are still some differences, so it is a good idea to go through this guide once.
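For reference, the equivalent plain scikit-learn code for the model built in this guide would look roughly like this (non-private, on regular pandas data):
from sklearn.neighbors import KNeighborsRegressor

neigh = KNeighborsRegressor(n_neighbors=5)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)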
Setup¶
Before delving into the main guide, we need to import the necessary modules:
import crandas as cd
from crandas.crlearn.neighbors import KNeighborsRegressor
from crandas.crlearn.model_selection import train_test_split
Reading the data¶
We start by uploading the data, which is stored in a CSV file.
tab = cd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv")
This dataset contains various weather features like wind speed, pressure, and humidity. Our goal is to predict the temperature (Temperature (C)) based on these variables.
Note
Currently, only integer columns can be used with k-nearest neighbors.
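If your source data contains fractional columns (such as Humidity below), one possible workaround is to rescale them to integers in pandas before uploading. A sketch, assuming the data is uploaded with cd.upload_pandas_dataframe:
import pandas as pd

df = pd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv")
# Rescale the fractional Humidity column (0.0-1.0) to integer percentages
df["Humidity"] = (df["Humidity"] * 100).round().astype(int)
tab = cd.upload_pandas_dataframe(df)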
Splitting into train and test sets¶
Now we need to define the predictor and target variables, and then the dataset needs to be split into a training set and a test set. Below we will use a test size of 3.
# Set the predictor variables
X = tab[[
'Apparent Temperature (C)',
#'Humidity', # Fixed points are not yet supported
'Wind Speed (km/h)',
'Wind Bearing (degrees)',
'Visibility (km)',
'Cloud Cover',
'Pressure (millibars)'
]]
# Set the target variable
y = tab[['Temperature (C)']]
# Split the data into training and test sets - you can also use the random_state variable to set a fixed seed if desired (e.g. random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3)
Creating the model¶
K-nearest neighbors regression is implemented in the class KNeighborsRegressor.
# Predict with k=5 and default settings
neigh = KNeighborsRegressor(n_neighbors=5)
# Predict with k=10 and p=1, where p denotes the power for the Minkowski metric
# p=1 corresponds to the Manhattan distance
neigh_10 = KNeighborsRegressor(n_neighbors=10, p=1)
# It is also possible to set weights for the columns,
# indicating by what factor the differences should be multiplied
weights = cd.DataFrame(
{
"Apparent Temperature (C)": [0],
"Humidity": [1],
"Wind Speed (km/h)": [3],
"Wind Bearing (degrees)": [1],
"Visibility (km)": [10],
"Cloud Cover": [1],
"Pressure (millibars)": [1],
},
auto_bounds=True,
)
neigh_weights = KNeighborsRegressor(n_neighbors=10, p=1, metric_weights=weights)
Note
The parameter metric_weights takes a CDataFrame, so it must be uploaded to the VDL first.
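For intuition, the distance that these parameters define can be sketched in plain Python as follows. This is an illustration of the formula only, not of how crandas computes it internally:
# Weighted Minkowski distance between two rows, where each per-column
# difference is first multiplied by that column's weight
def minkowski_distance(u, v, weights, p):
    return sum((w * abs(a - b)) ** p for a, b, w in zip(u, v, weights)) ** (1 / p)

# p=1 (Manhattan): distances add up coordinate-wise
minkowski_distance([1, 2, 3], [2, 0, 3], [1, 1, 1], p=1)  # 3.0
# p=2 (Euclidean): the familiar straight-line distance
minkowski_distance([1, 2, 3], [2, 0, 3], [1, 1, 1], p=2)  # ~2.24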
Fitting the model¶
The model can now be fit to the training set that we created earlier:
neigh = neigh.fit(X_train, y_train)
Predicting¶
Once the model has been fitted, it can be used to make predictions on the target variable. Using the predict_value method on the data points in the test set (X_test), we can predict the values of our Temperature (C) variable.
# Create predictions for y-values (temperature) based on X-values (predictor variables)
# predict_value handles one row at a time, so we call it once per test row
y_test_pred = []
for i in range(3):
    y_test_pred.append(neigh.predict_value(X_test[i : i + 1]).open())
>>> y_test_pred
[19, 23, 17]
Note
Currently, the method predict_value only works on single values, so you need to call it once per row of the test set.
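To get a rough sense of accuracy, the opened predictions can be compared against the actual test values. A minimal sketch, assuming you are permitted to open y_test (opening reveals the underlying data):
# Reveal the actual temperatures of the test rows
actual = list(y_test.open()["Temperature (C)"])
# Mean absolute error over the test set
mae = sum(abs(p - a) for p, a in zip(y_test_pred, actual)) / len(actual)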