Example Application and Optimization of k-Nearest Neighbors Regression

In recent years, there has been an explosion of data science and machine learning tools, which has made them much more accessible to users. The spread of languages like Python, with dedicated scientific libraries, has made data analysis far more approachable. One of the most straightforward data science techniques for classification or regression is the k-nearest neighbors (kNN) method. I have seen several online posts explaining how kNN can be used as a classification tool, but there is not much information out there about kNN regression.

Given how widespread the use of regression methods is in multiple branches of science, it is surprising how uncommon it is to find tutorials on a method as simple (yet useful) as kNN regression. This tutorial contains simple examples that data science beginners can follow to use this method successfully. I provide the complete code used throughout this tutorial, so more advanced readers can still get something out of it and reuse snippets for their own applications of kNN. All code, images, etc. are also provided in a public GitHub repository[1].



Tutorial overview

Anyone with basic Python knowledge can follow this tutorial. Complete Python code is shown to help readers understand the specific implementation. We will use Python's scikit-learn[2] library, which also provides useful documentation about the kNN regressor[3]. This tutorial will:

  • Provide a simple example of how to prepare data for regression.
  • Show how to perform kNN regression and optimize k.
  • Exemplify how kNN regression works.
  • Show how different weight functions can affect kNN prediction.
  • Discuss some applications and limitations of kNN regression.

Brief introduction to kNN regression

The goal of any machine learning model is to detect trends in a target property whose value is determined by a set of descriptors (also referred to as features), and then to predict the target property for new descriptor values. For example, one could build a machine learning model trained on how the expected salary (target property) changes as a function of different descriptors (industry sector, location, worker's age, etc.), and then predict the average salary a worker would likely have under specific conditions. kNN is an example of a supervised learning method that can perform both classification and regression tasks.

Classification tasks have a discrete target property. For example, a machine learning model that tries to predict the outcome of sports events has a target property that only takes the discrete values win, loss and possibly draw; one cannot, for example, define a value between win and draw.

Regression tasks have a continuous target property. For example, a machine learning model that tries to predict how a specific stock price will change in the future can, in principle, output any value.

In a nutshell, when trying to predict the target property of a new point, kNN takes a look at neighboring points and averages them to estimate the new target property value. There are numerous posts online on how kNN classification works – see for example recent posts by Italo José[4] and Afroz Chakure[5]. In this tutorial, I will deal with kNN regression, which seems to get less attention. I will discuss some specifics about how kNN regression does predictions, using simple but complete examples.
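
To make that averaging idea concrete before moving on to scikit-learn, here is a minimal sketch of kNN regression written with plain numpy. The toy training points and the helper function knn_predict are purely illustrative and are not part of the code used later in this tutorial.

#!/usr/bin/env python3
# Minimal, illustrative sketch of the kNN regression idea (no scikit-learn yet)
import numpy as np
# A few toy training points following y = exp(x)
x_train = np.array([5.0, 5.2, 5.4, 5.6, 5.8])
y_train = np.exp(x_train)
def knn_predict(x_new, k=2):
    # Distances from the new point to every training point
    distances = np.abs(x_train - x_new)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Uniform weights: the prediction is the plain average of their targets
    return y_train[nearest].mean()
# Predict at x=5.3: roughly the average of exp(5.2) and exp(5.4)
print(knn_predict(5.3, k=2))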


Generate data

To simplify things, we will create a one-dimensional dataset: a single descriptor x and a target property f(x). As a generic example, I will use a small database formed by ten points that follow the function y = f(x) = e^x.

The points considered in this example are shown in the Table and Figure below. The complete code to generate this data is also provided below.

  x_n (arbitrary units)    y_n (arbitrary units)
  5.00                     148.41
  5.20                     181.27
  5.40                     221.41
  5.60                     270.43
  5.80                     330.30
  6.00                     403.43
  6.20                     492.75
  6.40                     601.85
  6.60                     735.10
  6.80                     897.85

Table 1. Ten (x, y) points forming our database.
Figure 1. Graphical representation of the data points. The dashed line indicates the underlying function used to generate the database.
#!/usr/bin/env python3
# Import libraries
import math
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator)
import numpy as np
# Initialize lists
list_x = []
list_y = []
# Generate dataset as 10 points from x=5 to x=6.8, with y=exp(x)
for x in np.arange(5, 7, 0.2):
    y = math.exp(x)
    list_x.append(x)
    list_y.append(y)
    print("%.2f, %.2f" %(x, y))
# Transform lists to numpy arrays
list_x = np.array(list_x).reshape(-1, 1)
list_y = np.array(list_y)
# Create arrays with function y=exp(x)
function_x = np.arange(4.9, 7.0, 0.01)
function_y = [math.exp(x) for x in function_x]
# Plot points in dataset plus dashed line with function
plt.plot(function_x,function_y,color='C0',linestyle='dashed',linewidth=1)
plt.scatter(list_x, list_y,color='C0')
# Set axis labels
plt.xlabel('$x$',fontsize=15)
plt.ylabel('$y$',fontsize=15)
# Set axis ticks
plt.xticks(np.arange(5,7,0.2))
plt.xlim(4.92,6.88)
plt.ylim(100,1000)
# Set minor ticks
axes = plt.gca()
axes.xaxis.set_minor_locator(MultipleLocator(0.05))
# Save plot into Figure_1.png
file_name='Figure_1.png'
plt.savefig(file_name,format='png',dpi=600)
plt.close()

kNN parameter optimization

With kNN regression, the y value at any given x is predicted as a (weighted) average of its nearest neighbors:

y_n = \frac{\sum^k_{i=1} w_i \cdot y_i}{\sum^k_{i=1} w_i}

where w_i is the weight of the ith neighbor, with target property value y_i. The weights for different neighbors can be calculated in different ways, but we will initially consider them to be uniform (w_i = 1 for all neighbors), in which case the prediction reduces to the plain average y_n = \frac{1}{k} \sum^k_{i=1} y_i.
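
As a quick worked example with uniform weights and k = 2: to predict the value at x = 5.1, the two nearest neighbors in Table 1 are x = 5.00 and x = 5.20, so

y_{5.1} = \frac{1}{2}(148.41 + 181.27) = 164.84

which is simply the midpoint of the two neighboring target values.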

When using kNN regression, one has to take into account two considerations:

  1. The number k of nearest neighbors considered.
  2. How the nearest neighbors are weighted.

Cross-validation to optimize k

Regarding the first point, a simple way to select the optimum number of nearest neighbors k for accurate predictions is a grid search. A grid search involves trying different k values and choosing the one that minimizes the prediction error. As with any machine learning model, one then has to decide which data will be used to train the model and which data will be used to test its accuracy.

Since we have very little data here (only N=10 points), we will use leave-one-out (LOO) cross-validation (for more complex datasets, one may opt for k-fold cross-validation). For a dataset with N samples, one trains the machine learning model on N-1 points and uses the remaining point as a test. This procedure is repeated N times, so each point is used exactly once as a test, as shown in the figure below.

Figure 2. Scheme of the leave-one-out cross-validation.

As a result, one ends up with a predicted value for each point, which can then be compared with its actual value. As the error metric, we use the root mean square error (rmse). Finally, we compare the rmse obtained for different k values and select as the optimum the k that minimizes it. The code implementing this grid search is shown below, together with its output.
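
For reference, if y'_n denotes the LOO prediction for point n and y_n its actual value, the rmse over the N = 10 points is

rmse = \sqrt{\frac{1}{N} \sum^N_{n=1} \left( y'_n - y_n \right)^2}

which is exactly what the code below computes with mean_squared_error followed by a square root.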

#!/usr/bin/env python3
# Import libraries
import math
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut
from matplotlib.ticker import (MultipleLocator)
# Initialize list
list_x = []
list_y = []
# Generate dataset as 10 points from x=5 to x=6.8, with y=exp(x)
for x in np.arange(5, 7, 0.2):
    y = math.exp(x)
    list_x.append(x)
    list_y.append(y)
# Transform lists to numpy arrays
list_x = np.array(list_x).reshape(-1, 1)
list_y = np.array(list_y)
# Initialize grid search values
best_rmse = 0
best_k = 0
# Select which k neighbors will be studied in grid
possible_k = [1,2,3,4,5,6,7,8,9]
# Start loop for grid search
for k in possible_k:
    # Initialize lists with predicted values
    y_pred = []
    x_pred = []
    # Start LOO loop 
    for train_index, test_index in LeaveOneOut().split(list_x):
        # Assign train and test data
        X_train, X_test = list_x[train_index], list_x[test_index]
        y_train, y_test = list_y[train_index], list_y[test_index]
        # Scale data (not needed here, since just 1 descriptor, but it is good practice)
        scaler = preprocessing.StandardScaler().fit(X_train)
        X_train_scaled = scaler.transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        # kNN regressor
        knn = neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform')
        # Train kNN model and predict values for X_test_scaled
        pred = knn.fit(X_train_scaled, y_train).predict(X_test_scaled)
        # Assign predicted value in each LOO step
        y_pred.append(pred)
        x_pred.append(X_test)
    # Calculate rmse between actual y values and predicted values
    mse = mean_squared_error(y_pred,list_y)
    rmse = np.sqrt(mse)
    # Print k and rmse of each grid point
    print('k: %i, rmse: %.2f' %(k, rmse))
    # If this value is better than previous one (or it is first step), update best value
    if rmse < best_rmse or k==possible_k[0]:
        best_rmse = rmse
        best_k = k
# Print final best k value and rmse
print("Optimum: kNN, k=%i, rmse: %.2f" %(best_k,best_rmse))
k: 1, rmse: 95.55
k: 2, rmse: 74.87
k: 3, rmse: 105.13
k: 4, rmse: 122.74
k: 5, rmse: 152.07
k: 6, rmse: 173.34
k: 7, rmse: 202.04
k: 8, rmse: 230.51
k: 9, rmse: 264.38
Optimum: kNN, k=2, rmse: 74.87
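
As a side note for readers familiar with scikit-learn's model selection utilities, the manual LOO loop above can be written more compactly with a Pipeline and cross_val_predict. The sketch below is only an equivalent alternative (with the same per-fold scaling), not the code used to produce the results above, but it should print the same rmse values.

#!/usr/bin/env python3
# Compact, equivalent version of the LOO grid search (sketch)
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.pipeline import Pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import mean_squared_error
# Same 10-point dataset as before
list_x = np.arange(5, 7, 0.2).reshape(-1, 1)
list_y = np.exp(list_x).ravel()
for k in range(1, 10):
    # The pipeline scales the descriptor inside each LOO fold, as in the manual loop
    model = Pipeline([('scaler', preprocessing.StandardScaler()),
                      ('knn', neighbors.KNeighborsRegressor(n_neighbors=k, weights='uniform'))])
    y_pred = cross_val_predict(model, list_x, list_y, cv=LeaveOneOut())
    rmse = np.sqrt(mean_squared_error(list_y, y_pred))
    print('k: %i, rmse: %.2f' % (k, rmse))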

We can use this optimum case, k=2, to visualize how the kNN regression algorithm works. In the Figure below, we show how, when trying to estimate the value at x_1, the predicted value y'_1 is calculated as the average of its two nearest neighbors, y_2 and y_3; y'_2 is predicted as the average of its two nearest neighbors, y_1 and y_3; and so on.
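
Using the values in Table 1, the first leave-one-out step gives, for example,

y'_1 = \frac{1}{2}(y_2 + y_3) = \frac{1}{2}(181.27 + 221.41) = 201.34

to be compared with the actual value y_1 = 148.41.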

Figure 3. Points in database indicated in blue. Each panel corresponds to a new leave-one-out step, where one data point (labeled in orange as y’n) is used as a test set, whose value is estimated as an average of the two nearest neighbors.

In the Figure above, one can see that the predicted values behave reasonably well over the whole range, except at the extremes. This is because predictions at the extremes are biased by the limits of the interval, i.e. there are only neighbors on one side of an extreme point. There is also an additional issue that cannot be observed in the previous Figure, since our x_n values are equidistant. To illustrate it, we can train the model with all ten points and make predictions for every value inside the studied range.

In the Figure below, we can see that the predicted values in between the training points x_n are constant, corresponding exactly to the average of the k nearest neighbors. This is caused by all k nearest neighbors having the same weight, which we specified with weights='uniform' (the default option in scikit-learn's kNN regressor). However, one can clearly see that, in a case like this, it would be more accurate to give a larger weight to the neighbors that are closer. The code necessary to reproduce these results and the Figure is shown below, preceded by a quick check of this plateau behavior.

Figure 4. Training points indicated in blue. Orange line corresponds to estimated values across the studied range, as the average of the two nearest points in the training set.
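
Before the full plotting script, here is a short, self-contained check of the plateau behavior described above. The probe values between x = 5.0 and x = 5.2 are arbitrary, and this snippet is only an illustrative sketch, separate from the script that follows.

#!/usr/bin/env python3
# Illustrative check: with uniform weights, predictions are constant between training points
import math
import numpy as np
from sklearn import preprocessing, neighbors
# Same 10-point dataset as before
list_x = np.arange(5, 7, 0.2).reshape(-1, 1)
list_y = np.array([math.exp(x) for x in list_x.ravel()])
# Fit scaler and kNN regressor (k=2, uniform weights) on all ten points
scaler = preprocessing.StandardScaler().fit(list_x)
knn = neighbors.KNeighborsRegressor(n_neighbors=2, weights='uniform')
knn.fit(scaler.transform(list_x), list_y)
# Probe three points between x=5.0 and x=5.2: all three predictions equal (148.41+181.27)/2
probe = np.array([[5.02], [5.05], [5.08]])
print(knn.predict(scaler.transform(probe)))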
#!/usr/bin/env python3
import math
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import mean_squared_error
from sklearn import neighbors
from matplotlib.ticker import (MultipleLocator)
### 1) Generate data
# Initialize lists
list_x = []
list_y = []
# Generate dataset as 10 points from x=5 to x=6.8, with y=exp(x)
for x in np.arange(5, 7, 0.2):
    y = math.exp(x)
    list_x.append(x)
    list_y.append(y)
# Transform lists to numpy arrays
list_x = np.array(list_x).reshape(-1, 1)
list_y = np.array(list_y)
# Create arrays with function y=exp(x)
basic_x = np.arange(4.9, 7.0, 0.01)
basic_y = [math.exp(x) for x in basic_x]
# Set optimum k obtained previously
best_k = 2
### 2) Do kNN regression
X_train = np.array(list_x).reshape(-1, 1)
y_train = np.array(list_y)
X_test = np.arange(5.0, 6.81, 0.01).reshape(-1, 1)
# scale data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn = neighbors.KNeighborsRegressor(n_neighbors=best_k, weights='uniform')
file_name='Fig3_1.png'
label_kNN = "k-NN regression (uniform)"
# kNN regression
y_test = knn.fit(X_train_scaled, y_train).predict(X_test_scaled)
# Plot graph
plt.plot(basic_x,basic_y,color='C0',linestyle='dashed',linewidth=1)
plt.scatter(list_x, list_y,color='C0',label='$x_n$')
plt.plot(X_test, y_test,color='C1',label=label_kNN)
plt.legend()
plt.xlabel('$x$',fontsize=15)
plt.ylabel('$f(x)$',fontsize=15)
plt.xticks(np.arange(5,7,0.2))
plt.xlim(4.92,6.88)
plt.ylim(100,1000)
axes = plt.gca()
axes.xaxis.set_minor_locator(MultipleLocator(0.05))
# Save plot into png
plt.savefig(file_name,format='png',dpi=600)
plt.close()

Different weights for nearest neighbors

When using kNN regression, one also has to select a sensible weight function. In our case, it would clearly be better to give a larger weight to points that are closer. This can be achieved by simply using the weights='distance' option, which calculates weights as the inverse of the distance between the studied point x_n and neighbor i in the training set: w_i = \frac{1}{d_i}, where d_i is the Euclidean distance between x_i and x_n. The result of using these new weights is shown in the Figure below, where we can observe that the model behaves much better over the whole studied range.

EDIT: Note that one could also use other, non-Euclidean metrics, which might be more beneficial in specific cases. See my other post on Towards Data Science on how to use custom user-defined metrics.
Figure 5. Training points indicated in blue. Orange line corresponds to estimated values across the studied range, as the distance-weighted average of the two nearest points in the training set.

One should then repeat the previous steps to select the optimum number of nearest neighbors, now using weights='distance'. Making this change in the grid search code provided before, we again obtain that the optimum number of nearest neighbors is two, as shown below.

k: 1, rmse: 95.55
k: 2, rmse: 67.59
k: 3, rmse: 83.71
k: 4, rmse: 93.96
k: 5, rmse: 107.86
k: 6, rmse: 118.00
k: 7, rmse: 130.29
k: 8, rmse: 141.99
k: 9, rmse: 154.95
Optimum: kNN, k=2, rmse: 67.59
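
For completeness, here is a minimal sketch of the only change required in the regression scripts (switching to inverse-distance weights), together with a manual check of the weighted mean for a single, arbitrarily chosen new point x_new = 5.05, using the values from Table 1.

# One-line change in the previous scripts: inverse-distance weights instead of uniform
# knn = neighbors.KNeighborsRegressor(n_neighbors=best_k, weights='distance')

# Manual check of the weighted mean y = sum(w_i*y_i)/sum(w_i), with w_i = 1/d_i
import numpy as np
x_new = 5.05
x_neigh = np.array([5.0, 5.2])           # two nearest training points
y_neigh = np.array([148.41, 181.27])     # their target values (Table 1)
d = np.abs(x_neigh - x_new)              # distances d_i
w = 1.0 / d                              # weights w_i = 1/d_i
print(np.sum(w * y_neigh) / np.sum(w))   # ~156.6, pulled towards the closer neighbor

Note that scikit-learn computes the distances in the scaled descriptor space, but with a single descriptor the scaling multiplies all distances by the same constant, so the normalized weighted mean is unchanged.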

kNN regression applications

The example used here is extremely simple, and we have seen that kNN can make very accurate predictions within the studied range. However, it is worth noting that even an algorithm as simple as kNN can make accurate predictions (comparable to those of much more sophisticated regression algorithms) for real cases in different branches of science.

For example, in 2014, researchers from the Medtronic Energy and Component Center used kNN to build machine learning models to estimate the capacity of lithium-ion batteries[6]. In 2019, researchers from my group at the University of Liverpool also used kNN to predict the power conversion efficiency of solar cells[7].


Limitations of kNN regression

It is worth mentioning a type of problem where kNN performs particularly poorly: extrapolation, i.e., predicting values that lie outside the range of the training set. In general, extrapolation is a challenge for all machine learning algorithms, but kNN does particularly badly, since it relies purely on the information of neighbors in the training set, and there are no training points in that region.
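
As a quick, self-contained illustration of this failure mode (uniform weights, k=2; the probe values x=8 and x=10 are arbitrary choices far outside the training range):

#!/usr/bin/env python3
# Illustrative check: far outside the training range, kNN returns a constant value
import math
import numpy as np
from sklearn import preprocessing, neighbors
# Same 10-point dataset as before
list_x = np.arange(5, 7, 0.2).reshape(-1, 1)
list_y = np.array([math.exp(x) for x in list_x.ravel()])
scaler = preprocessing.StandardScaler().fit(list_x)
knn = neighbors.KNeighborsRegressor(n_neighbors=2, weights='uniform')
knn.fit(scaler.transform(list_x), list_y)
# Both predictions equal (735.10 + 897.85)/2 = 816.5, nowhere near exp(8) or exp(10)
print(knn.predict(scaler.transform(np.array([[8.0], [10.0]]))))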

The Figure below shows how kNN regression (with different weights) performs for regions outside the training range. We can see that kNN completely fails to predict the exponential behavior expected outside the training range; it merely returns the (weighted) average value of the two nearest neighbors. The code used to generate this Figure is also shown below.

Figure 6. Training points are indicated in blue. The orange line corresponds to the estimated values in and out of the training range. The left panel corresponds to the uniformly-weighted average of the nearest two neighbors, and the right panel corresponds to the distance-weighted average of the nearest two neighbors.
#!/usr/bin/env python3
# Import libraries
import math
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import mean_squared_error
from sklearn import neighbors
from matplotlib.ticker import (MultipleLocator)
# Set optimum k optimized previously
best_k = 2
# Initialize lists
list_x = []
list_y = []
# Generate dataset as 10 points from x=5 to x=6.8, with y=exp(x)
for x in np.arange(5, 7, 0.2):
    y = math.exp(x)
    list_x.append(x)
    list_y.append(y)
# Transform lists to numpy arrays
list_x = np.array(list_x).reshape(-1, 1)
list_y = np.array(list_y)
# Create arrays with function y=exp(x)
function_x = np.arange(3.9, 7.81, 0.01)
function_y = [math.exp(x) for x in function_x]
# Assign train data as all 10 points in dataset
X_train = np.array(list_x).reshape(-1, 1)
y_train = np.array(list_y)
# Assign prediction X to intermediate x values 
X_pred = np.arange(4.0, 7.81, 0.01).reshape(-1, 1)
# Scale data (not needed here, since just 1 descriptor, but it is good practice)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_pred_scaled = scaler.transform(X_pred)
# kNN regressor
knn = neighbors.KNeighborsRegressor(n_neighbors=best_k, weights='uniform')
label_kNN = "k-NN regression (uniform)"
# Train kNN model and predict values for X_pred_scaled
y_pred = knn.fit(X_train_scaled, y_train).predict(X_pred_scaled)
# Plot points in dataset plus dashed line with function
plt.plot(function_x,function_y,color='C0',linestyle='dashed',linewidth=1)
plt.scatter(list_x, list_y,color='C0',label='Training points')
# Plot predicted values with kNN regressor
plt.plot(X_pred, y_pred,color='C1',label=label_kNN)
# Plot legend
plt.legend()
# Set axis labels
plt.xlabel('$x$',fontsize=15)
plt.ylabel('$y$',fontsize=15)
# Set axis ticks and limits
plt.xticks(np.arange(4,8,0.4))
plt.xlim(3.92,7.88)
plt.ylim(0,1200)
# Set minor ticks
axes = plt.gca()
axes.xaxis.set_minor_locator(MultipleLocator(0.05))
# Save plot into png
file_name='Figure6_1.png'
plt.savefig(file_name,format='png',dpi=600)
plt.close()

Conclusions

In this tutorial, we have first seen a brief introduction to k-nearest neighbors regression. We have generated a simple one-dimensional database and visualized it. Then, we have covered the two main features to be tuned in k-nearest neighbors regression: i) the number of neighbors k, and ii) the choice of weight function. Finally, we have seen how k-nearest neighbors regression can be very accurate within the training range but still has severe limitations when performing extrapolation tasks.

Was this tutorial helpful to you? Let me know if you were able to successfully use k-nearest neighbors regression!


References

1. https://github.com/marcosdelcueto/Tutorial_kNN_regression

2. https://scikit-learn.org/stable/index.html

3. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

4. https://towardsdatascience.com/knn-k-nearest-neighbors-1-a4707b24bd1d

5. https://towardsdatascience.com/k-nearest-neighbors-knn-algorithm-bd375d14eec7

6. C Hu et al, Applied Energy 129, 49 (2014) DOI: https://doi.org/10.1016/j.apenergy.2014.04.077

7. D Padula, J Simpson and A Troisi, Materials Horizons 6, 343 (2019) DOI: https://doi.org/10.1039/C8MH01135D



Marcos del Cueto

I am a research associate at the University of Liverpool. I have a PhD in Theoretical Chemistry, and I am now diving into the applications of machine learning to materials discovery.
