Implementing K Nearest Neighbors from scratch in Python

And comparing our model’s performance with the Scikit-learn implementation

Retin P Kumar
Dev Genius

--

Photo by Derick McKinney on Unsplash

K Nearest Neighbors (KNN) is one of the simplest supervised machine learning algorithms. The algorithm was initially developed for classification tasks but was later extended to handle regression tasks as well.

Unlike other supervised machine learning algorithms, KNN doesn’t build an underlying model; it works solely on the stored training data.

KNN predicts values based on the distance of each test data point from the points in the training set. Hence, the algorithm makes no assumptions about the features or the output of the data.

Because the algorithm starts working and makes a prediction only after test data is introduced, it is also known as a lazy learner.

Apart from classification and regression tasks, the KNN algorithm is also used for missing value imputation, resampling datasets, outlier detection, etc. Industrial applications of KNN include recommender systems, handwriting recognition, and image recognition.

There are several approaches to finding the K nearest neighbors, such as the K-dimensional (k-d) tree, which partitions the feature space into rectangular regions, and the ball tree, which partitions it into spherical regions. In this article, we will implement the brute force approach to KNN using Python from scratch.

The Algorithm

So, the steps for creating a KNN model are as follows:

  1. We need an optimal value for K to start with.
  2. Calculate the distance of each data point in the test set from each point in the training set.
  3. Sort the calculated distances, along with the corresponding target values from the training data, in ascending order.
  4. From these sorted values, select the top K entries.
  5. For classification tasks, the mode of the target values corresponding to the top K entries is the predicted output; for regression tasks, the mean (or median) is the predicted output.

Implementation from scratch

Let’s create the basic structure of our KNN algorithm. We will follow the Scikit-learn approach and define a class with a fit and predict method.
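A minimal sketch of that structure could look like this (we will fill in each method below):

    import numpy as np
    from collections import Counter

    class BruteForceKNN:
        """Brute force K Nearest Neighbors for classification and regression."""

        def __init__(self, n_neighbors=5, problem=1, metric=0):
            pass  # filled in below

        def fit(self, X_train, y_train):
            pass  # filled in below

        def predict(self, X_test):
            pass  # filled in below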


We will initialize our class instance using the number of neighbors, the problem type, and the distance metric.

The problem type decides whether it is a regression or a classification task, and the metric decides which distance measure is used for calculating the distances.

For now, we will assign problem type 0 for regression and 1 for classification, and metric 0 for Euclidean distance and 1 for Manhattan distance.
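A sketch of the constructor with those integer codes (the default values here are my own choice):

    # inside the BruteForceKNN class
    def __init__(self, n_neighbors=5, problem=1, metric=0):
        self.n_neighbors = n_neighbors  # the 'K': number of neighbors to consider
        self.problem = problem          # 0 = regression, 1 = classification
        self.metric = metric            # 0 = Euclidean, 1 = Manhattan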


In our fit method, we fit the algorithm by simply storing the training data features and target; since KNN is a lazy learner, there is nothing else to train.
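A sketch of that method:

    # inside the BruteForceKNN class
    def fit(self, X_train, y_train):
        # Lazy learner: no training step, just memorize the data.
        self.X_train = np.asarray(X_train)
        self.y_train = np.asarray(y_train)
        return self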


Inside our predict method, which takes the test feature data as its argument, we have three additional steps to perform.

  • First, we calculate the distances.
  • Then we calculate the neighbors.
  • Finally, we make the prediction.

Each of these steps translates into just a few lines of code, as sketched below.
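Here is one way to write it, with a small distance helper (the helper name _distance is mine):

    # inside the BruteForceKNN class
    def _distance(self, a, b):
        if self.metric == 0:
            return np.sqrt(np.sum((a - b) ** 2))  # Euclidean
        return np.sum(np.abs(a - b))              # Manhattan

    def predict(self, X_test):
        predictions = []
        for x in np.asarray(X_test):
            # 1. Distances from x to every training point
            distances = [self._distance(x, x_tr) for x_tr in self.X_train]
            # 2. Indices of the K nearest neighbors
            k_idx = np.argsort(distances)[: self.n_neighbors]
            k_targets = self.y_train[k_idx]
            # 3. Mode for classification, mean for regression
            if self.problem == 1:
                predictions.append(Counter(k_targets).most_common(1)[0][0])
            else:
                predictions.append(np.mean(k_targets))
        return np.array(predictions)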


Putting it all together, we have our BruteForceKNN algorithm.
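Assembled into a single, self-contained class, a complete sketch might look like this:

    import numpy as np
    from collections import Counter

    class BruteForceKNN:
        """Brute force KNN.
        problem: 0 = regression, 1 = classification.
        metric:  0 = Euclidean,  1 = Manhattan.
        """

        def __init__(self, n_neighbors=5, problem=1, metric=0):
            self.n_neighbors = n_neighbors
            self.problem = problem
            self.metric = metric

        def fit(self, X_train, y_train):
            # Lazy learner: just store the training data.
            self.X_train = np.asarray(X_train)
            self.y_train = np.asarray(y_train)
            return self

        def _distance(self, a, b):
            if self.metric == 0:
                return np.sqrt(np.sum((a - b) ** 2))  # Euclidean
            return np.sum(np.abs(a - b))              # Manhattan

        def predict(self, X_test):
            predictions = []
            for x in np.asarray(X_test):
                distances = [self._distance(x, x_tr) for x_tr in self.X_train]
                k_idx = np.argsort(distances)[: self.n_neighbors]
                k_targets = self.y_train[k_idx]
                if self.problem == 1:
                    predictions.append(Counter(k_targets).most_common(1)[0][0])
                else:
                    predictions.append(np.mean(k_targets))
            return np.array(predictions)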


Let’s now implement KNN using the Scikit-learn library and calculate the classification metrics. Then we will run our BruteForceKNN on the same data and compare its performance against the Scikit-learn implementation.

Testing with Scikit-learn implementation

Let’s import the necessary libraries and dependencies.
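A plausible set of imports for the experiment below:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report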


Now, let’s load the iris data from the Scikit-learn datasets. Then we will split the data and use standard scaling to scale it.
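A sketch of this step; the test_size and random_state values here are my own choices:

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    # Fit the scaler on the training data only, then apply it to both sets
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)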


Now, we need to find the optimal value of ‘k’.

For that, we will instantiate a base KNN classifier with default settings and make predictions for values of ‘k’ ranging from 1 to 30.

Then we will calculate the train and test errors and plot the error curve. With the help of the error curve, we will find the optimal value for ‘k’.
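One way to write that loop, measuring error as 1 minus accuracy and assuming k runs over range(1, 31):

    train_errors, test_errors = [], []
    k_values = range(1, 31)
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        train_errors.append(1 - knn.score(X_train, y_train))
        test_errors.append(1 - knn.score(X_test, y_test))

    # Plot the error curve to pick the k with the lowest, most stable test error
    plt.plot(k_values, train_errors, label="train error")
    plt.plot(k_values, test_errors, label="test error")
    plt.xlabel("k")
    plt.ylabel("error rate")
    plt.legend()
    plt.show()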


From the error curve, we find that k=14 suits our data well. So, we instantiate a KNN object with k=14, fit the training data, and predict on the test data.
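A minimal version of this step:

    knn = KNeighborsClassifier(n_neighbors=14)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))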


Now, let’s plot the confusion matrix and check other classification metrics.
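Using the predictions from above:

    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))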


Testing using our Brute Force KNN

Now, we will instantiate our BruteForceKNN class and make predictions with the same data.
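Using the same k, with problem type 1 (classification) and metric 0 (Euclidean) to mirror the Scikit-learn setup:

    bf_knn = BruteForceKNN(n_neighbors=14, problem=1, metric=0)
    bf_knn.fit(X_train, y_train)
    bf_pred = bf_knn.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, bf_pred))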


This is great: our model’s accuracy matches the Scikit-learn implementation. Now let’s check the confusion matrix.
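Again with the same helpers, this time on our model’s predictions:

    print(confusion_matrix(y_test, bf_pred))
    print(classification_report(y_test, bf_pred))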


We find that our BruteForceKNN model’s performance matches the Scikit-learn implementation of the KNN classifier exactly.

With that, we have reached the end of this article.

I hope you’ve enjoyed learning and implementing the KNN algorithm from scratch in Python and comparing its performance with the Scikit-learn API.

If you liked this article, do follow me for more interesting and actionable posts.
