September 8, 2021

[Algorithms] Understanding K-nearest neighbors algorithm

Let's talk today about the K-Nearest Neighbors classifier.


From Wikipedia
In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric classification method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in the data set. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.
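
Before translating this into plain English, here is a tiny, self-contained sketch of what that plurality vote and that average actually compute. The data here is made up purely for illustration; it is not the fruit dataset used later in the post.

import numpy as np

# toy 1-D training data, invented just for this illustration
train_X = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
train_y_class = np.array(['lemon', 'lemon', 'lemon', 'apple', 'apple', 'apple'])
train_y_value = np.array([1.1, 2.1, 2.9, 9.8, 11.2, 12.1])

def knn_by_hand(x_new, k=3):
    # distance from the new point to every training example
    distances = np.abs(train_X - x_new)
    nearest = np.argsort(distances)[:k]  # indices of the k closest examples
    # classification: plurality vote among the k nearest neighbors
    labels, counts = np.unique(train_y_class[nearest], return_counts=True)
    predicted_class = labels[np.argmax(counts)]
    # regression: average of the k nearest neighbors' values
    predicted_value = train_y_value[nearest].mean()
    return predicted_class, predicted_value

print(knn_by_hand(2.5))  # the 3 closest points are lemons, so the vote says 'lemon'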

Now, what does it mean in plain English for mere mortals?

The easiest way to explain this is with examples and charts.

Let's assume you have a dataset of fruits with the following features:
1. Color
2. Height
3. Width
4. Weight

And the label is the fruit name.

Basically, each row in the dataset holds the characteristics of one fruit plus its label (apple, lemon, mandarin, orange, etc.).

Your job is to be able to input an unknown fruit and get a prediction.
So you measure its height, width, weight and color.

Based on that, the algorithm will give you an output.

Without any machine learning algorithm, we can already state that almost all lemons have a similar size and weight, and that most of them are green; depending on shelf life, some may be a bit more yellow.

With classification algorithms, what we are trying to do is to group the data points into different groups based on common features.

The difficulty here is that some feature values may conflict between groups. For example, the size of a mandarin and a lemon may be the same, and the color of a mandarin and an orange is almost the same.

However, when all features are taken into account at the same time, we can group the fruits much better and make predictions for new data points.

K-nearest neighbors

The K in the name of this classifier represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this classifier implements learning based on the k nearest neighbors.

When the algorithm learns from the data, if we were able to draw two of the features in 2 dimensions, it would try to limit each class to a specific zone depending on the height and width, as seen in the image below.

When we get a new fruit, we can measure its width and height, and then, based on the distance to its nearest neighbors, we should be able to classify that new fruit into one of the groups.

However, real-life problems usually have more than 2 features, and this is where the algorithm really shines.
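
As a small illustration of the "distance to the nearest neighbors" idea, the sketch below uses scikit-learn's KNeighborsClassifier on a handful of made-up (width, height) pairs; its kneighbors method reports which training points are closest to a new fruit and how far away they are. The numbers are invented for this example and are not taken from the dataset used later.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up (width, height) values in cm, just for illustration
X_toy = np.array([[6.0, 7.5], [6.2, 7.8],   # apples
                  [5.5, 8.5], [5.8, 8.8]])  # lemons
y_toy = np.array(['apple', 'apple', 'lemon', 'lemon'])

knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

new_fruit = np.array([[5.9, 8.6]])
distances, indices = knn_toy.kneighbors(new_fruit)  # the 3 closest training points
print(distances, indices)
print(knn_toy.predict(new_fruit))                   # majority vote among those 3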

Show me the code

In the code below, we will build a K-nearest neighbors classifier using 4 features and the fruit label.

Let's import the needed libraries.

%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Setting the print precision options for NumPy.

np.set_printoptions(precision=2)

Let's get the data.

Dataset can be downloaded here:
https://www.kaggle.com/mjamilmoughal/fruits-with-colors-dataset

fruits = pd.read_table('readonly/fruit_data_with_colors.txt')
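
Before going further, a quick optional sanity check (not part of the original code) is to peek at what was loaded; the exact columns depend on the downloaded file.

print(fruits.shape)   # number of rows (fruits) and columns (features + labels)
print(fruits.head())  # first few rows of the table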

In this step we create a list with the feature names; then we build X_fruits, a DataFrame with the feature values for each observation, and y_fruits, a Series with the label of each observation.

At the end, and as seen in other videos, we need to split the data into a train set and a test set.

feature_names_fruits = ['height', 'width', 'mass', 'color_score']
X_fruits = fruits[feature_names_fruits]
y_fruits = fruits['fruit_label']
target_names_fruits = ['apple', 'mandarin', 'orange', 'lemon']

X_fruits_2d = fruits[['height', 'width']]
y_fruits_2d = fruits['fruit_label']

X_train, X_test, y_train, y_test = train_test_split(X_fruits, y_fruits, random_state=0)
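
By default, train_test_split holds out 25% of the rows for the test set. If you want to confirm the split sizes, a quick optional check is:

print(X_train.shape, X_test.shape)  # roughly 75% / 25% of the rows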

As you already know, each feature has values on a different scale. In ML it is often necessary to "normalize" those values; with the MinMaxScaler we will bring all numeric features onto the same scale.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# we must apply the same scaling, fitted on the training set, to the test set
X_test_scaled = scaler.transform(X_test)
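
If you want to see what the scaler did, an optional check is to print the per-column minimum and maximum of the scaled training data; after MinMaxScaler they should be 0 and 1.

print(X_train_scaled.min(axis=0))  # every column of the training set now starts at 0
print(X_train_scaled.max(axis=0))  # and ends at 1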

Finally, we initialize the KNeighborsClassifier with n_neighbors=5. Feel free to play with this parameter.

Then, with the .fit method, we train the algorithm.

The last 2 lines of code print the accuracy of this model on the training and test sets.

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train_scaled, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train_scaled, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test_scaled, y_test)))
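
Since the post suggests playing with the number of neighbors, here is a small optional loop (a sketch, not part of the original code) that re-trains the classifier for a few values of k and prints the test accuracy of each, which makes it easy to see how sensitive the model is to this parameter.

for k in [1, 3, 5, 7, 9]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train_scaled, y_train)
    print('k = {}: test accuracy = {:.2f}'.format(k, knn_k.score(X_test_scaled, y_test)))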

Last but not least, we need to test.
We create a new fruit, with some random values for each feature.
We have to scale it with the same scaler as explained above.
And finally, we use our model to predict.

example_fruit = [[5.5, 2.2, 10, 0.70]]
example_fruit_scaled = scaler.transform(example_fruit)
print('Predicted fruit type for ', example_fruit, ' is ', 
          target_names_fruits[knn.predict(example_fruit_scaled)[0]-1])
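
One small caveat: depending on your scikit-learn version, transforming a plain Python list after fitting the scaler on a DataFrame may print a warning about missing feature names. If that happens, one way around it (an optional tweak, not in the original code) is to wrap the example in a DataFrame with the same columns before scaling:

example_fruit_df = pd.DataFrame(example_fruit, columns=feature_names_fruits)
example_fruit_scaled = scaler.transform(example_fruit_df)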

Running the entire code of this post produces the following output.

Accuracy of K-NN classifier on training set: 0.95
Accuracy of K-NN classifier on test set: 1.00
Predicted fruit type for  [[5.5, 2.2, 10, 0.7]]  is  mandarin