September 16, 2021

[Algorithms] Understanding K-Nearest Neighbor (Regression)

In our previous post about algorithms, we covered the K-Nearest Neighbors classification model. The same algorithm can also be used for regression.

In a regression problem, each data point has one or more input features and a continuous target value. In the simplest case there is a single feature, so every point has an x value and a corresponding y value. Taken together, the data points follow a trend; in the example used here, that trend is roughly linear.

The image below shows a synthetic dataset that illustrates this idea.

The next two images show the predictions made by the k-NN regression algorithm for k = 1 and k = 3. In these plots, the orange circles are the training points, and the blue triangles are the outputs of the k-nearest neighbors regressor for a range of input values of x.

How did the nearest neighbors regressor compute this value?

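Given a query point x for which we want to predict a target value, the regressor finds the training points whose x values are closest to the query point. With k = 1, the prediction is simply the target value of the single nearest training point.

For classification, we would take a majority vote among the k neighbors. In regression there are no class labels to vote on, because the targets are continuous values, so the prediction is instead the mean of the target values of the k nearest neighbors.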
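To make the idea concrete, here is a minimal sketch of that computation written with plain NumPy. The function name knn_regress and the toy data are made up for illustration; this is not how scikit-learn implements it internally, only the same idea under the assumption of Euclidean distance and equal weights for all neighbors.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k):
    """Predict a target for x_query as the mean target of its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Equal-weight average of the neighbors' target values
    return y_train[nearest].mean()

# Tiny made-up dataset: one feature, five points
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([10.0, 12.0, 20.0, 31.0, 40.0])

query = np.array([1.2])
pred_k1 = knn_regress(X_toy, y_toy, query, k=1)  # nearest point is x = 1.0
pred_k3 = knn_regress(X_toy, y_toy, query, k=3)  # nearest points are x = 1.0, 2.0, 0.0
print(pred_k1, pred_k3)
```

With k = 1 the prediction copies the target of the closest point (12.0), while with k = 3 it is the average of the three closest targets, (12 + 20 + 10) / 3 = 14.0.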

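Because the target values in a regression problem are continuous, rather than the discrete labels a classifier predicts, accuracy is not a suitable measure of performance. To assess how well a regression model fits the data, we use a regression score called R-squared (the coefficient of determination). I will explain it in another post, but for now: a value of 1 corresponds to the best possible performance, a value of 0 corresponds to a model that always predicts the mean of the target values, and the score can be negative for a model that does worse than that.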
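As a rough preview, the score can be computed by hand. The sketch below uses the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares; the helper name r_squared and the sample values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Squared error of the model's predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Squared error of always predicting the mean of the targets
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                       # predictions match exactly
baseline = r_squared(y_true, np.full(4, y_true.mean()))   # always predict the mean
worse = r_squared(y_true, y_true[::-1])                   # predictions in reversed order
print(perfect, baseline, worse)
```

The three cases give 1.0, 0.0, and -3.0, which shows why the score has an upper bound of 1 but no lower bound of 0.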

Let's see this with code and some plots.

Let's generate new synthetic data for the example.

This can be done easily with the make_regression function from the sklearn.datasets module.

import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# synthetic dataset for simple regression:
# 100 samples, one input feature, a noisy linear target
X_R1, y_R1 = make_regression(n_samples = 100, n_features=1,
                            n_informative=1, bias = 150.0,
                            noise = 30, random_state=0)

# scatter plot of the input feature against the target
plt.figure()
plt.title('Sample regression problem with one input variable')
plt.scatter(X_R1, y_R1, marker= 'o', s=50)
plt.show()

This will generate the following plot with the synthetic data points.

Next, we split the data into training and test sets and fit a KNeighborsRegressor with 5 neighbors.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1, random_state = 0)

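knnreg = KNeighborsRegressor(n_neighbors = 5).fit(X_train, y_train)

# predicted target values for the test set
print(knnreg.predict(X_test))
# R-squared (coefficient of determination) on the test set
print('R-squared test score: {:.3f}'
     .format(knnreg.score(X_test, y_test)))

This prints the predicted values for the test set, followed by the R-squared test score: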

[231.70974697 148.35572605 150.58852659 150.58852659  72.14859259
 166.50590948 141.90634426 235.57098756 208.25897836 102.10462746
 191.31852674 134.50044902 228.32181403 148.35572605 159.16911306
 113.46875166 144.03646012 199.23189853 143.19242433 166.50590948
 231.70974697 208.25897836 128.01545355 123.14247619 141.90634426]
R-squared test score: 0.425

A plot makes this easier to see.

In the code below, we create two subplots, one with n_neighbors = 1 and the other with n_neighbors = 3. To keep the plots readable, only every fifth sample of the dataset is used.

import numpy as np

fig, subaxes = plt.subplots(1, 2, figsize=(8,4))
# evenly spaced input values at which to query the regressor
X_predict_input = np.linspace(-3, 3, 50).reshape(-1,1)
# keep every fifth sample so individual points are easy to see
X_train, X_test, y_train, y_test = train_test_split(X_R1[0::5], y_R1[0::5], random_state = 0)

for thisaxis, K in zip(subaxes, [1, 3]):
    knnreg = KNeighborsRegressor(n_neighbors = K).fit(X_train, y_train)
    y_predict_output = knnreg.predict(X_predict_input)
    thisaxis.set_xlim([-2.5, 0.75])
    # triangles: predictions; circles: training points
    thisaxis.plot(X_predict_input, y_predict_output, '^', markersize = 10,
                 label='Predicted', alpha=0.8)
    thisaxis.plot(X_train, y_train, 'o', label='True Value', alpha=0.8)
    thisaxis.set_xlabel('Input feature')
    thisaxis.set_ylabel('Target value')
    thisaxis.set_title('KNN regression (K={})'.format(K))
    thisaxis.legend()

plt.tight_layout()
plt.show()

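As you can see, when the number of neighbors is 1, the predictions line up exactly with the training points, because each prediction simply copies the target of the single closest training point. That close match reflects how tightly the model follows the training data, not necessarily how well it will do on unseen data. With 3 neighbors, each prediction is an average of three targets, so the output is smoother and less sensitive to individual noisy points.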
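One way to check how different values of K behave beyond the plot is to compare the R-squared score on the training set with the score on the test set. The sketch below is self-contained and regenerates the same synthetic data; the choice of K values (1, 3, 5, 15) and the use of the full 100-sample dataset instead of the every-fifth-sample subset are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Same synthetic data as in the post
X_R1, y_R1 = make_regression(n_samples=100, n_features=1,
                             n_informative=1, bias=150.0,
                             noise=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1, random_state=0)

scores = {}
for K in [1, 3, 5, 15]:
    knnreg = KNeighborsRegressor(n_neighbors=K).fit(X_train, y_train)
    # R-squared on the data the model was fit on, and on held-out data
    train_r2 = knnreg.score(X_train, y_train)
    test_r2 = knnreg.score(X_test, y_test)
    scores[K] = (train_r2, test_r2)
    print('K={:2d}  train R-squared: {:.3f}  test R-squared: {:.3f}'
          .format(K, train_r2, test_r2))
```

With K = 1 the training score is perfect by construction, since every training point is its own nearest neighbor, so the test score is the more informative number when choosing K.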

I hope you found this blog post interesting. Feel free to post any questions below.