.. _knn:

K-nearest neighbors
###################

.. versionadded:: 1.9

Introduction
============

K-nearest neighbors is an algorithm that makes predictions based on the closest data points in the training data. It can be used for both classification and regression, but currently only regression is supported. For a new data point, it finds the *k* nearest neighbors according to a distance function and predicts the target variable as the average of the target values of those neighbors. For more information about k-nearest neighbors, see `Wikipedia `_.

This guide explains all of the steps needed to set up a prediction model based on a shared dataset, along with their options and considerations. It does so through an example with the weather dataset, the same one used in :ref:`linreg`.

The k-nearest neighbors API in crandas follows the `scikit-learn API `_ wherever possible. Existing scikit-learn code can often be run in crandas with minimal modifications. That being said, there are still some differences, so it is a good idea to go through this guide once.

Setup
=====

Before delving into the main guide, we need to import the necessary modules:

.. code:: python

    import crandas as cd
    from crandas.crlearn.neighbors import KNeighborsRegressor
    from crandas.crlearn.model_selection import train_test_split

Reading the data
================

We start by uploading the data, which is stored in a CSV file.

.. code:: python

    tab = cd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv")

This dataset contains various weather features such as wind speed, pressure and humidity. Our goal is to predict the temperature (``Temperature (C)``) based on these variables.

.. note::

    Currently, only *integer* columns can be used with k-nearest neighbors.

Splitting into train and test sets
==================================

Next, we define the predictor and target variables and split the dataset into a training set and a test set. Below we use a test size of 3.

.. code:: python

    # Set the predictor variables
    X = tab[[
        'Apparent Temperature (C)',
        #'Humidity', # Fixed points are not yet supported
        'Wind Speed (km/h)',
        'Wind Bearing (degrees)',
        'Visibility (km)',
        'Cloud Cover',
        'Pressure (millibars)'
    ]]

    # Set the target variable
    y = tab[['Temperature (C)']]

    # Split the data into training and test sets.
    # You can also pass random_state to set a fixed seed if desired (e.g. random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3)

Creating the model
==================

K-nearest neighbors regression is implemented in the class :class:`KNeighborsRegressor<.crlearn.neighbors.KNeighborsRegressor>`.

.. code:: python

    # Predict with k=5 and default settings
    neigh = KNeighborsRegressor(n_neighbors=5)

    # Predict with k=10 and p=1, where p denotes the power for the Minkowski metric
    # p=1 corresponds to the Manhattan distance
    neigh_10 = KNeighborsRegressor(n_neighbors=10, p=1)

    # It is also possible to set weights for the columns,
    # indicating by what factor the differences should be multiplied
    weights = cd.DataFrame(
        {
            "Apparent Temperature (C)": [0],
            "Humidity": [1],
            "Wind Speed (km/h)": [3],
            "Wind Bearing (degrees)": [1],
            "Visibility (km)": [10],
            "Cloud Cover": [1],
            "Pressure (millibars)": [1],
        },
        auto_bounds=True,
    )
    neigh_weights = KNeighborsRegressor(n_neighbors=10, p=1, metric_weights=weights)

.. note::

    The parameter ``metric_weights`` takes a :class:`CDataFrame<.CDataFrame>`, so it must be uploaded to the VDL first.

Fitting the model
=================

The model can now be fitted to the training set that we created earlier:

.. code:: python

    neigh = neigh.fit(X_train, y_train)

Predicting
==========

Once the model has been fitted, it can be used to predict the target variable. By calling the :meth:`predict_value<.crlearn.neighbors.KNeighborsRegressor.predict_value>` method on each data point in the test set (``X_test``), we can predict the values of our ``Temperature (C)`` variable.

.. code:: python

    # Create predictions for y-values (temperature) based on X-values (predictor variables)
    y_test_pred = []
    y_test_pred.append(neigh.predict_value(X_test[0:1]).open())
    y_test_pred.append(neigh.predict_value(X_test[1:2]).open())
    y_test_pred.append(neigh.predict_value(X_test[2:3]).open())

    >>> y_test_pred
    [19, 23, 17]

.. note::

    Currently, the method :meth:`predict_value<.crlearn.neighbors.KNeighborsRegressor.predict_value>` only works on single values, so you need to call the method once per row of the test set.
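
To make the roles of ``n_neighbors``, ``p`` and ``metric_weights`` more concrete, the following is a minimal plaintext sketch of the prediction rule described in the introduction. It does not use crandas or the VDL, and it is not the crandas implementation: the function ``knn_regress``, the toy data, and the choice to multiply each per-column difference by its weight before raising it to the power ``p`` are assumptions made for illustration only. Details such as rounding and tie-breaking may differ in crandas.

.. code:: python

    # Plaintext illustration only (hypothetical helper, not part of crandas).
    def knn_regress(train_X, train_y, query, n_neighbors=5, p=2, metric_weights=None):
        """Predict the target for `query` as the mean target of its nearest rows."""
        if metric_weights is None:
            metric_weights = [1] * len(query)

        def distance(row):
            # Weighted Minkowski distance. Assumption: each per-column difference
            # is multiplied by its weight before being raised to the power p.
            total = sum(
                (w * abs(a - b)) ** p for w, a, b in zip(metric_weights, row, query)
            )
            return total ** (1 / p)

        # Sort training row indices by distance to the query and keep the k closest
        order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i]))
        nearest = order[:n_neighbors]

        # The prediction is the average target value of the selected neighbors
        return sum(train_y[i] for i in nearest) / len(nearest)


    # Toy data: two integer predictor columns and an integer target
    train_X = [[10, 1000], [12, 1010], [20, 990], [22, 1005]]
    train_y = [9, 11, 19, 21]

    # p=1 (Manhattan distance) with equal weights: both columns count
    print(knn_regress(train_X, train_y, [19, 1003], n_neighbors=2, p=1))  # 15.0

    # A weight of 0 on the second column means only the first column counts
    print(knn_regress(train_X, train_y, [19, 1003], n_neighbors=2, p=1,
                      metric_weights=[1, 0]))  # 20.0

In this sketch, a weight of 0 removes a column from the distance entirely, while a larger weight makes differences in that column count more heavily when the neighbors are selected.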