K-Nearest Neighbors Examples

The k-Nearest Neighbors algorithm (k-NN) is a fundamental classification algorithm. This notebook walks through a few examples using scikit-learn and some synthetic data sets.

The basic idea behind the algorithm is very simple: given a set of points with known classes, we classify a new point by looking at the $k$ closest points and taking the majority vote (a short from-scratch sketch follows this list). We can adjust the algorithm in several ways, including:

  • altering the number of neighbors $k$ used to classify
  • changing our notion of distance (e.g. Euclidean, taxicab/Manhattan, and many others)
  • weighting the votes of the known points by distance
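
To make the majority vote concrete, here is a minimal from-scratch sketch (plain NumPy; knn_predict is a hypothetical helper name, and the rest of this notebook uses scikit-learn's optimized implementation instead):

In [ ]:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, point, k=5):
    """Classify `point` by majority vote among its k nearest training points.

    X_train is an (n, d) array of known points; y_train is a length-n sequence of labels.
    """
    # Euclidean distance from `point` to every training point
    dists = np.linalg.norm(X_train - point, axis=1)
    # Take the k closest points and let the most common label win
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

Distance-weighted voting (the last bullet above) corresponds to passing weights="distance" to scikit-learn's KNeighborsClassifier.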

First we load a few libraries.

In [4]:
%matplotlib inline
import random

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import neighbors

# Make our plots a bit larger
matplotlib.rcParams["figure.figsize"] = (8, 8)

The following function generates some random data that lies within a ring (annulus).

In [5]:
def annulus(inner_radius, outer_radius, n=30, color='b'):
    """Generate n points with class `color` between the inner radius and the outer radius."""
    data = []
    diff = outer_radius - inner_radius
    for _ in range(n):
        # Pick an angle and radius
        angle = 2 * np.pi * random.random()
        r = inner_radius + diff * random.random()
        x = r * np.cos(angle)
        y = r * np.sin(angle)
        data.append((x, y))
    # Return a data frame for convenience
    xs, ys = zip(*data)
    df = pd.DataFrame()
    df["x"] = xs
    df["y"] = ys
    df["color"] = color
    return df

Let's try it out!

In [6]:
df = annulus(4, 10, 300)

plt.scatter(df['x'], df['y'])
plt.show()
In [7]:
df1 = annulus(2, 5, 100, color='r')
df2 = annulus(6, 10, 400, color='b')
df = pd.concat([df1, df2])

plt.scatter(df['x'], df['y'], c=df['color'])
plt.show()

Now that we've got some classified data, let's generate some new data points and see how they are classified. For the first example we'll use the five nearest neighbors, that is, $k = 5$.

In [8]:
df_test = annulus(2, 10, 600, color=None)
In [9]:
X = df[['x', 'y']]
y = df['color']

k = 5

clf = neighbors.KNeighborsClassifier(k, weights="uniform", metric="euclidean", algorithm="ball_tree")
clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])
df_test['color'] = predictions
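
As a sanity check, the classifier's kneighbors method lets us inspect the neighbors behind a single prediction (this assumes the fitted clf from the cell above):

In [ ]:
# Distances and positional indices of the 5 training points nearest the first test point
dists, idxs = clf.kneighbors(df_test[['x', 'y']].iloc[[0]])
print(dists[0])                # distances to the 5 nearest neighbors
print(y.iloc[idxs[0]].values)  # their classes; the majority vote is the prediction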

Let's plot the newly classified data:

In [10]:
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()

Next we try an example where there is a bit of overlap in the known data points, and a larger value of $k$.

In [11]:
df1 = annulus(2, 6, 100, color='r')
df2 = annulus(4, 8, 300, color='b')
df = pd.concat([df1, df2])

df_test = annulus(1, 10, 500, color=None)
In [12]:
X = df[['x', 'y']]
y = df['color']

k = 30

clf = neighbors.KNeighborsClassifier(k, weights="uniform", metric="euclidean", algorithm="ball_tree")
clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])
df_test['color'] = predictions

plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()

Exercises:

  • What happens when the inner annulus has 1000 points and $k$ is larger, say 30? Why does this happen? (See the scaffold after this list.)
  • What about when $k$ is small?
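
Here is a scaffold for experimenting with these questions (the sizes and $k$ below are hypothetical starting points; vary them freely):

In [ ]:
# Rerun the overlapping-annuli experiment with a denser inner class and a large k
df = pd.concat([annulus(2, 6, 1000, color='r'), annulus(4, 8, 300, color='b')])

clf = neighbors.KNeighborsClassifier(n_neighbors=30)
clf.fit(df[['x', 'y']], df['color'])

df_test = annulus(1, 10, 500, color=None)
df_test['color'] = clf.predict(df_test[['x', 'y']])

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()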

Now let's try some different shapes for our known data points.

In [14]:
def generate_block_data(n):
    """Generate n red points in the unit square and n blue points in its reflection through the origin."""
    xs = []
    ys = []
    colors = []
    for _ in range(n):
        x = random.random()
        y = random.random()
        xs.append(x)
        ys.append(y)
        colors.append('r')
        x = random.random()
        y = random.random()
        xs.append(-x)
        ys.append(-y)
        colors.append('b')
    df = pd.DataFrame()
    df['x'] = xs
    df['y'] = ys
    df['color'] = colors
    return df

Let's take a look and classify some random points.

In [15]:
# Block data: one class in the unit square, the other in its reflection

n = 100

df = generate_block_data(n)


X = df[['x', 'y']]
y = df['color']

df_test = annulus(0, 2, 200, color=None)

k = 30

clf = neighbors.KNeighborsClassifier(k, weights="uniform", metric="euclidean", algorithm="ball_tree")
clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])

df_test['color'] = predictions

plt.scatter(df['x'], df['y'], c=df['color'], s=30)
plt.show()

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.plot([-1.5, 0, 1.5], [1.5, 0, -1.5], linewidth=2, color='black')
plt.show()
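
The black line above was drawn by hand. We can also recover the boundary the classifier actually learned by evaluating it on a dense grid (a sketch, assuming the fitted clf and df from the cell above):

In [ ]:
# Shade each class's predicted region to reveal the decision boundary
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['x', 'y'])
zz = (clf.predict(grid) == 'r').astype(int).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.2, cmap='bwr')
plt.scatter(df['x'], df['y'], c=df['color'], s=10)
plt.show()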

In this case our classifier effectively learns a decision boundary along the line $y = -x$, separating the two classes. Next we expand on the example by creating a checkerboard of data.

In [16]:
def generate_checkerboard_data(n=2000, l=5):
    """Generate n points in the l-by-l square, colored in a unit checkerboard pattern."""
    xs = []
    ys = []
    colors = []
    for _ in range(n):
        x = l * random.random()
        y = l * random.random()
        xs.append(x)
        ys.append(y)
        if (int(x) + int(y)) % 2 == 0:
            colors.append('r')
        else:
            colors.append('b')
    df = pd.DataFrame()
    df['x'] = xs
    df['y'] = ys
    df['color'] = colors
    return df
In [210]:
# Checkerboard training data, plus test points to classify

l = 4
n = 1000

df = generate_checkerboard_data(n=n, l=l)

X = df[['x', 'y']]
y = df['color']

df_test = generate_checkerboard_data(n=500, l=l)

k = 5

clf = neighbors.KNeighborsClassifier(k, weights="uniform", metric="euclidean", algorithm="ball_tree")
clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])
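
# The test points were generated with their true checkerboard colors, so we
# can measure accuracy here, before overwriting them with the predictions:
print("accuracy:", (predictions == df_test['color']).mean())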

df_test['color'] = predictions

plt.scatter(df['x'], df['y'], c=df['color'], s=30)
plt.show()

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()

In this case the classification boundary is much more complex than a line. It is instructive to compare this with the decision boundaries that other classifiers produce on the same data.

Exercises:

  • Try out a few other classifiers in sklearn and compare the decision boundaries.

For example, here's a decision tree classifier:

In [19]:
# A decision tree on the same checkerboard data, for comparison
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)

clf.fit(X, y)

predictions = clf.predict(df_test[['x', 'y']])
df_test['color'] = predictions

plt.scatter(df['x'], df['y'], c=df['color'], alpha=0.5)
plt.scatter(df_test['x'], df_test['y'], c=df_test['color'], marker='+', s=30)
plt.show()