K-Nearest Neighbors Examples¶

The k-Nearest Neighbors algorithm (k-NN) is a fundamental classification algorithm. This notebook walks through a few examples using scikit-learn and some synthetic data sets.

The basic idea behind the algorithm is very simple: given a set of points with known classes, we classify a new point by looking at the $k$ closest points and taking the majority vote. We can adjust the algorithm in several ways, including:

• altering the number of neighbors $k$ used to classify
• changing our notion of distance (e.g. Euclidean, taxicab/Manhattan, and many others)
• weighting the votes of the known points by distance
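
To make the majority-vote rule concrete, here is a minimal pure-Python sketch (the helper name `knn_predict` is ours, not scikit-learn's; it uses Euclidean distance and unweighted votes):

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Indices of training points, sorted by Euclidean distance to the query
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    # Tally the labels of the k closest points and take the most common
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ['r', 'r', 'r', 'b', 'b', 'b']
print(knn_predict(points, labels, (1, 1), k=3))    # near the red cluster
print(knn_predict(points, labels, (5, 5.5), k=3))  # near the blue cluster
```

scikit-learn's `KNeighborsClassifier`, used below, implements the same idea with efficient spatial data structures.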

First we load a few libraries.

In [4]:
%matplotlib inline
import random

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import neighbors

# Make our plots a bit larger
matplotlib.rcParams["figure.figsize"] = (8, 8)


The following function generates some random data that lies within a ring (annulus).

In [5]:
def annulus(inner_radius, outer_radius, n=30, color='b'):
    """Generate n points of class `color` between the inner and outer radius."""
    data = []
    diff = outer_radius - inner_radius
    for _ in range(n):
        # Pick a random angle and radius
        angle = 2 * np.pi * random.random()
        r = inner_radius + diff * random.random()
        x = r * np.cos(angle)
        y = r * np.sin(angle)
        data.append((x, y))
    # Return a data frame for convenience
    xs, ys = zip(*data)
    df = pd.DataFrame()
    df["x"] = xs
    df["y"] = ys
    df["color"] = color
    return df


Let's try it out!

In [6]:
df = annulus(4, 10, 300)

plt.scatter(df['x'], df['y'])
plt.show()

In [7]:
df1 = annulus(2, 5, 100, color='r')
df2 = annulus(6, 10, 400, color='b')
df = pd.concat([df1, df2])

plt.scatter(df['x'], df['y'], c=df['color'])
plt.show()


Now that we have some labeled data, let's generate new points and see how they are classified. For the first example we'll use the five nearest neighbors, that is, $k=5$.

In [8]:
df_test = annulus(2, 10, 600, color=None)

In [9]:
X = df[['x', 'y']]
y = df['color']

k = 5

clf = neighbors.KNeighborsClassifier(k, weights="uniform", metric="euclidean", algorithm="ball_tree")
clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])
df_test['color'] = predictions
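
If we want to see which training points actually cast the votes for a query, `KNeighborsClassifier.kneighbors` returns the distances to, and indices of, the $k$ closest points. A small self-contained sketch on toy data (not the annulus frames above):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy labeled set: three points near the origin ('r'), three far away ('b')
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array(['r', 'r', 'r', 'b', 'b', 'b'])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# kneighbors gives the distances and indices of the k closest training
# points -- exactly the points that cast the votes for this query
dist, idx = clf.kneighbors([[0.5, 0.5]])
print(y_train[idx[0]])            # the voters' classes
print(clf.predict([[0.5, 0.5]]))  # the resulting majority vote
```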


Let's plot the newly classified data:

In [10]:
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()
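
As the bullet list at the top noted, we can also change the distance metric, and k-NN's answer can genuinely depend on that choice. Here is a hand-picked toy case (two points and one query, chosen for this purpose) where the nearest neighbor under Euclidean distance differs from the nearest under Manhattan distance:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two labeled points chosen so the metrics disagree about which is closer
# to the query: Euclidean favors the diagonal point 'b', Manhattan 'a'.
X = [[0.0, 0.0], [1.4, 1.5]]
y = ['a', 'b']
query = [[2.0, 0.0]]

for metric in ("euclidean", "manhattan"):
    clf = KNeighborsClassifier(n_neighbors=1, metric=metric).fit(X, y)
    print(metric, clf.predict(query)[0])
```

Euclidean distance to 'b' is $\sqrt{0.6^2 + 1.5^2} \approx 1.62 < 2$, but its Manhattan distance is $0.6 + 1.5 = 2.1 > 2$, so the two metrics pick different neighbors.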


Next we try an example where there is a bit of overlap in the known data points, and a larger value of $k$.

In [11]:
df1 = annulus(2, 6, 100, color='r')
df2 = annulus(4, 8, 300, color='b')
df = pd.concat([df1, df2])

df_test = annulus(1, 10, 500, color=None)

In [12]:
X = df[['x', 'y']]
y = df['color']

k = 30

clf = neighbors.KNeighborsClassifier(k, weights="uniform", metric="euclidean", algorithm="ball_tree")
clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])
df_test['color'] = predictions

plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()
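
The other knob mentioned at the top is vote weighting. With `weights="distance"` each neighbor's vote is scaled by the inverse of its distance, so a few close points can outvote a distant majority. A deterministic one-dimensional sketch (toy data constructed for this purpose):

```python
from sklearn.neighbors import KNeighborsClassifier

# Three 'r' points hug the query; five 'b' points sit much farther out.
X = [[0.1], [0.2], [0.3], [2.0], [2.1], [2.2], [2.3], [2.4]]
y = ['r', 'r', 'r', 'b', 'b', 'b', 'b', 'b']
query = [[0.0]]

# With k equal to the whole data set, a plain vote is 5-3 for 'b', but
# inverse-distance weighting lets the three nearby 'r' points win.
for w in ("uniform", "distance"):
    clf = KNeighborsClassifier(n_neighbors=8, weights=w).fit(X, y)
    print(w, clf.predict(query)[0])
```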


Exercises:

• What happens when the inner circle has 1000 points and $k$ is larger, say 30? Why does this happen?
• What about when $k$ is small?
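
One way to explore these questions systematically is to score a range of $k$ values with cross-validation. A sketch on stand-in synthetic data (two overlapping Gaussian blobs rather than the annuli above; the seed is arbitrary):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two overlapping Gaussian blobs as a stand-in for the overlapping annuli
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(1.5, 1.0, (100, 2))])
y = np.array(['r'] * 100 + ['b'] * 100)

# Score each k with 5-fold cross-validation: very small k overfits the
# noise in the overlap region, very large k washes out local structure.
for k in (1, 5, 15, 45):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:2d}  mean accuracy={scores.mean():.3f}")
```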

Now let's try some different shapes for our known data points.

In [14]:
def generate_block_data(n):
    """Generate n points in the unit square (class 'r') and n points in its reflection through the origin (class 'b')."""
    xs = []
    ys = []
    colors = []
    for _ in range(n):
        x = random.random()
        y = random.random()
        xs.append(x)
        ys.append(y)
        colors.append('r')
        x = random.random()
        y = random.random()
        xs.append(-x)
        ys.append(-y)
        colors.append('b')
    df = pd.DataFrame()
    df['x'] = xs
    df['y'] = ys
    df['color'] = colors
    return df


Let's take a look and classify some random points.

In [15]:
# Generate labeled block data
n = 100

df = generate_block_data(n)

X = df[['x', 'y']]
y = df['color']

df_test = annulus(0, 2, 200, color=None)

k = 30

clf = neighbors.KNeighborsClassifier(k, weights="uniform", metric="euclidean", algorithm="ball_tree")
clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])

df_test['color'] = predictions

plt.scatter(df['x'], df['y'], c=df['color'], s=30)
plt.show()

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.plot([-1.5, 0, 1.5], [1.5, 0, -1.5], linewidth=2, color='black')
plt.show()


In this case our classifier effectively creates a boundary that separates the data along the line $y=-x$. Next we expand on the example by creating a checkerboard of data.
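
We can make that boundary visible by evaluating the classifier on a dense grid of points and coloring each grid cell by its predicted class. A self-contained sketch that regenerates block-like data directly with NumPy (the seed and grid resolution are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Class 0 in the unit square, class 1 in its reflection through the origin
X = np.vstack([rng.random((100, 2)), -rng.random((100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = KNeighborsClassifier(n_neighbors=30)
clf.fit(X, y)

# Predict on a dense grid; the color change traces the decision boundary
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=np.where(y == 0, 'r', 'b'), s=10)
plt.show()
```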

In [16]:
def generate_checkerboard_data(n=2000, l=5):
    """Generate n points in an l-by-l checkerboard, colored by the square they land in."""
    xs = []
    ys = []
    colors = []
    for _ in range(n):
        x = l * random.random()
        y = l * random.random()
        xs.append(x)
        ys.append(y)
        if (int(x) + int(y)) % 2 == 0:
            colors.append('r')
        else:
            colors.append('b')
    df = pd.DataFrame()
    df['x'] = xs
    df['y'] = ys
    df['color'] = colors
    return df

In [210]:
# Generate labeled checkerboard data
l = 4

df = generate_checkerboard_data(n=1000, l=l)

X = df[['x', 'y']]
y = df['color']

df_test = generate_checkerboard_data(n=500, l=l)

k = 5

clf = neighbors.KNeighborsClassifier(k, weights="uniform", metric="euclidean", algorithm="ball_tree")
clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])

df_test['color'] = predictions

plt.scatter(df['x'], df['y'], c=df['color'], s=30)
plt.show()

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()


In this case the classification boundary is much more complex than a line. For comparison, the scikit-learn documentation's classifier-comparison example shows the decision boundaries of many other classifiers on similar synthetic data sets.

Exercises:

• Try out a few other classifiers in sklearn and compare the decision boundaries.

For example, here's a decision tree classifier:

In [19]:
# Fit a decision tree on the same checkerboard data
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)

clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])
df_test['color'] = predictions

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()