.. _knn:

K-nearest neighbors
###################

.. versionadded:: 1.9

Introduction
============
K-nearest neighbors is an algorithm that makes predictions from the closest
data points in the training data. It can be used for both classification and
regression, but currently only regression is supported. For a new data point, it
finds the *k* nearest neighbors according to a distance function and predicts the
target variable as the average of the target values of those neighbors. For more
information about k-nearest neighbors, see `Wikipedia <https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm>`_.
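
The idea can be sketched in a few lines of plain Python. This is an illustration
on unencrypted lists, not the crandas implementation: the function names, the toy
data and the default of ``p=2`` (Euclidean distance) are assumptions made for
this example.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length numeric sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(X_train, y_train, query, k=5, p=2):
    """Predict the target for `query` as the mean of its k nearest neighbors."""
    # Pair each training point's distance to the query with its target value
    distances = [
        (minkowski_distance(row, query, p), target)
        for row, target in zip(X_train, y_train)
    ]
    # Sort by distance only and keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy data, made up for illustration: one feature, target = 10 * feature
X_toy = [[1], [2], [3], [10]]
y_toy = [10, 20, 30, 100]

# The three points closest to [2] are [2], [1] and [3]
prediction = knn_predict(X_toy, y_toy, query=[2], k=3)
```

Here the outlier ``[10]`` does not influence the prediction, because it is not
among the three nearest neighbors of the query point.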

This guide explains each step of setting up a prediction model on a
shared dataset, along with the available options and considerations. It does so
through an example using the weather dataset, the same one used in :ref:`linreg`.

The k-nearest neighbors API in crandas follows the `scikit-learn API <https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html>`_
wherever possible. It is often possible to run existing scikit-learn code with
minimal modifications in crandas. That being said, there are still some
differences, so it is a good idea to go through this guide once.


Setup
=====
Before delving into the main guide, we need to import the necessary modules:

.. code:: python

    import crandas as cd
    from crandas.crlearn.neighbors import KNeighborsRegressor
    from crandas.crlearn.model_selection import train_test_split

Reading the data
================
We start by uploading the data, which is stored in a CSV file.

.. code:: python

    tab = cd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv")

This dataset contains various weather features such as wind speed, pressure and
humidity. Our goal is to predict the temperature (``Temperature (C)``) based on
these variables.

.. note::

   Currently, only *integer* columns can be used with k-nearest neighbors.

Splitting into train and test sets
==================================
Now we need to define the predictor and target variables, and then split the
dataset into a training set and a test set. Below we use a test size of 3, so
the test set contains three rows.

.. code:: python

    # Set the predictor variables
    X = tab[[
        'Apparent Temperature (C)',
        #'Humidity', # Fixed points are not yet supported
        'Wind Speed (km/h)',
        'Wind Bearing (degrees)',
        'Visibility (km)',
        'Cloud Cover',
        'Pressure (millibars)'
    ]]

    # Set the target variable
    y = tab[['Temperature (C)']]

    # Split the data into training and test sets.
    # Optionally pass random_state (e.g. random_state=42) to fix the seed.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3)
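
To make the role of an integer ``test_size`` and of ``random_state`` concrete,
the following plain-Python sketch splits two parallel lists the same way. It is
a hypothetical stand-in that runs on ordinary lists: the helper ``split_xy`` is
not part of crandas, and how crandas shuffles rows internally is not described
in this guide.

```python
import random


def split_xy(X, y, test_size, random_state=None):
    """Split parallel lists X and y, putting `test_size` rows in the test set."""
    rng = random.Random(random_state)  # a fixed seed gives a reproducible split
    indices = list(range(len(X)))
    rng.shuffle(indices)
    test_idx, train_idx = indices[:test_size], indices[test_size:]

    def take(data, idx):
        return [data[i] for i in idx]

    # Same return order as train_test_split: X_train, X_test, y_train, y_test
    return take(X, train_idx), take(X, test_idx), take(y, train_idx), take(y, test_idx)


# Toy data, made up for illustration: y is always 10 times the single feature
X_rows = [[i] for i in range(10)]
y_rows = [10 * i for i in range(10)]
Xtr, Xte, ytr, yte = split_xy(X_rows, y_rows, test_size=3, random_state=42)
```

The same indices are used for ``X`` and ``y``, so every row of predictors stays
paired with its target value after the split.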

Creating the model
==================
K-nearest neighbors regression is implemented in the class :class:`KNeighborsRegressor<.crlearn.neighbors.KNeighborsRegressor>`.

.. code:: python

    # Create a regressor with k=5 and default settings
    neigh = KNeighborsRegressor(n_neighbors=5)

    # Create a regressor with k=10 and p=1, where p is the power of the
    # Minkowski metric; p=1 corresponds to the Manhattan distance
    neigh_10 = KNeighborsRegressor(n_neighbors=10, p=1)

    # It is also possible to set weights for the columns,
    # indicating by what factor the differences should be multiplied
    weights = cd.DataFrame(
        {
            "Apparent Temperature (C)": [0],
            "Humidity": [1],
            "Wind Speed (km/h)": [3],
            "Wind Bearing (degrees)": [1],
            "Visibility (km)": [10],
            "Cloud Cover": [1],
            "Pressure (millibars)": [1],
        },
        auto_bounds=True,
    )
    neigh_weights = KNeighborsRegressor(n_neighbors=10, p=1, metric_weights=weights)

.. note::

   The parameter ``metric_weights`` takes a :class:`CDataFrame<.CDataFrame>`, so it must be uploaded to the engine first.
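
The effect of ``p`` and of per-column weights can be illustrated in plain
Python. The sketch below assumes that each weight multiplies the absolute
difference in its column before the power ``p`` is applied. That reading follows
the description above, but the exact formula used by crandas is not stated in
this guide, and the function name and sample values are made up.

```python
def weighted_minkowski(a, b, weights, p=2):
    """Minkowski distance with each coordinate difference scaled by a weight."""
    scaled = (w * abs(x - y) for x, y, w in zip(a, b, weights))
    return sum(d ** p for d in scaled) ** (1 / p)


# Two made-up data points with three integer features each
a = [20, 5, 180]
b = [18, 8, 170]

# p=1 with unit weights is the Manhattan distance: 2 + 3 + 10
manhattan = weighted_minkowski(a, b, weights=[1, 1, 1], p=1)

# A weight of 0 ignores a column and a weight of 3 triples its difference: 0 + 9 + 10
weighted = weighted_minkowski(a, b, weights=[0, 3, 1], p=1)
```

Under this assumption, a weight of 0 removes a column from the distance
entirely, while larger weights make differences in that column count more when
neighbors are selected.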

Fitting the model
=================
The model can now be fit to the training set that we created earlier:

.. code:: python

    neigh = neigh.fit(X_train, y_train)

Predicting
==========
Once the model has been fitted, it can be used to predict the target
variable. Calling the :meth:`predict_value<.crlearn.neighbors.KNeighborsRegressor.predict_value>`
method on each data point in the test set (``X_test``) gives the predicted
values for our ``Temperature (C)`` variable.

.. code:: python

    # Create predictions for y-values (temperature) based on X-values (predictor variables)
    y_test_pred = []
    y_test_pred.append(neigh.predict_value(X_test[0:1]).open())
    y_test_pred.append(neigh.predict_value(X_test[1:2]).open())
    y_test_pred.append(neigh.predict_value(X_test[2:3]).open())

>>> y_test_pred
[19, 23, 17]

.. note::

   Currently, the method :meth:`predict_value<.crlearn.neighbors.KNeighborsRegressor.predict_value>`
   only works on a single row at a time, so you need to call the method once per row of the test set.