Scikit-Learn Regularization and Gradient Descent

We're using the Iris data set to examine various linear models with scikit-learn.

In [136]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model, metrics

matplotlib.rcParams['figure.figsize'] = (8, 8)
In [137]:
iris = sns.load_dataset("iris")
sns.pairplot(data=iris, hue="species")
plt.show()

Let's take a look at the petal length and sepal length and try to fit a linear model.

In [12]:
sns.pairplot(data=iris, vars=['petal_length', 'sepal_length'], hue="species")
plt.show()

Linear Model

In [140]:
X = iris[['petal_length']]
y = iris['sepal_length']

model = linear_model.LinearRegression()
model.fit(X, y)
print "R^2:", model.score(X, y)
print "SSE", metrics.mean_squared_error(y, model.predict(X))

# Plot the best fit line
plt.scatter(iris["petal_length"], iris["sepal_length"])
plt.plot(iris["petal_length"], model.predict(X), color='g')
plt.xlabel("Petal Length")
plt.ylabel("Sepal Length")
plt.show()
R^2: 0.759954645773
MSE: 0.163500225106
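
For reference, the fitted coefficients can be read directly off the estimator. Here is a small sketch (assuming model, X, and y from the cell above are still in scope) that cross-checks them against numpy's np.polyfit:

# Inspect the OLS fit from the cell above and sanity-check it with numpy.
print("sklearn  intercept=%.4f  slope=%.4f" % (model.intercept_, model.coef_[0]))

# np.polyfit returns coefficients highest power first: [slope, intercept]
slope, intercept = np.polyfit(iris['petal_length'], iris['sepal_length'], 1)
print("polyfit  intercept=%.4f  slope=%.4f" % (intercept, slope))

Both should report the same line as the green fit plotted above.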

Ridge Model

In [141]:
from sklearn import grid_search

alphas = np.logspace(-10, 10, 100)
gs = grid_search.GridSearchCV(
    cv=3,
    estimator=linear_model.Ridge(),
    param_grid={'alpha': alphas},
    scoring='mean_squared_error')

gs.fit(X, y)

print "SSE:", metrics.mean_squared_error(y, gs.predict(X))
print "Alpha:", gs.best_params_

# Plot the best fit line
plt.scatter(iris["petal_length"], iris["sepal_length"])
plt.plot(iris["petal_length"], model.predict(X), color='g', label="Best fit Line")
plt.plot(iris["petal_length"], gs.predict(X), color='b', label="Ridge Line")
plt.xlabel("Petal Length")
plt.ylabel("Sepal Length")
plt.legend()
plt.show()
MSE: 0.168775845706
Alpha: {'alpha': 52.14008287999674}
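
(Aside: in later scikit-learn releases grid_search was folded into sklearn.model_selection, and the MSE scorer was renamed to its negated form because scorers follow a "greater is better" convention. On a recent version the equivalent search would look roughly like this sketch, reusing the X, y, and alphas defined above:)

# Same Ridge grid search on a modern scikit-learn (>= 0.20).
from sklearn.model_selection import GridSearchCV

gs_modern = GridSearchCV(
    estimator=linear_model.Ridge(),
    param_grid={'alpha': alphas},
    scoring='neg_mean_squared_error',  # negated MSE: higher is better
    cv=3)
gs_modern.fit(X, y)
print("Alpha: %s" % gs_modern.best_params_)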

Lasso Model

In [142]:
alphas = np.logspace(-10, 10, 100)
gs2 = grid_search.GridSearchCV(
    cv=3,
    estimator=linear_model.Lasso(),
    param_grid={'alpha': alphas},
    scoring='mean_squared_error')

gs2.fit(X, y)

print "SSE:", metrics.mean_squared_error(y, gs2.predict(X))
print "Alpha:", gs2.best_params_

# Plot the best fit line
plt.scatter(iris["petal_length"], iris["sepal_length"])
plt.plot(iris["petal_length"], model.predict(X), color='g', label="Best fit Line")
plt.plot(iris["petal_length"], gs.predict(X), color='b', label="Ridge Line")
plt.plot(iris["petal_length"], gs2.predict(X), label="Lasso Line", color='r')
plt.xlabel("Petal Length")
plt.ylabel("Sepal Length")
plt.legend()
plt.show()
MSE: 0.175949022631
Alpha: {'alpha': 0.19630406500402683}
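
Lasso's L1 penalty shrinks coefficients toward zero and, for a large enough alpha, sets them to exactly zero. A quick sketch (hypothetical alpha values, same X and y as above) makes that visible:

# How the lasso slope shrinks, eventually to exactly zero, as alpha grows.
for alpha in [0.001, 0.1, 1.0, 10.0]:
    lasso = linear_model.Lasso(alpha=alpha)
    lasso.fit(X, y)
    print("alpha=%-8g slope=%.4f" % (alpha, lasso.coef_[0]))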

Gradient Descent

In [145]:
X = iris[['petal_length']]
y = iris['sepal_length']

model = linear_model.LinearRegression()
model.fit(X, y)
model2 = linear_model.SGDRegressor(n_iter=15)
model2.fit(X, y)
print "LM R^2:", model.score(X, y)
print "LM SSE", metrics.mean_squared_error(y, model.predict(X))

print "GD R^2:", model2.score(X, y)
print "GD SSE", metrics.mean_squared_error(y, model2.predict(X))
print model2.intercept_, model2.coef_

# Plot the best fit line
plt.scatter(iris["petal_length"], iris["sepal_length"])
plt.plot(iris["petal_length"], model.predict(X), color='b', label="Best Fit Line")
plt.plot(iris["petal_length"], model2.predict(X), color='g', label="Gradient Descent")
plt.xlabel("Petal Length")
plt.ylabel("Sepal Length")
plt.legend(loc=2)
plt.show()
LM R^2: 0.759954645773
LM MSE: 0.163500225106
GD R^2: -0.228993998235
GD MSE: 0.837095123175
[ 2.37676709] [ 0.84543996]

It looks like gradient descent didn't converge to the overall best fit: the line it found runs through the cluster on the right. Since the squared-error loss for a linear model is convex, there is no local minimum to get stuck in; the model simply didn't take large enough steps or run for enough iterations. If we raise the initial learning rate eta0 and/or the number of iterations n_iter, we can get a better overall fit that passes through both clusters of points.
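
To see what SGDRegressor is doing under the hood, here is a minimal hand-rolled gradient descent on the same one-feature problem. It uses full-batch gradients (SGDRegressor updates per sample), and the step size and iteration count are hypothetical choices, not values taken from the model above:

# Plain gradient descent on the MSE of the line a + b * x.
x_arr = iris['petal_length'].values
y_arr = iris['sepal_length'].values

a, b = 0.0, 0.0     # intercept, slope
eta = 0.01          # learning rate (hypothetical)
for _ in range(5000):
    err = (a + b * x_arr) - y_arr
    a -= eta * 2 * err.mean()            # d(MSE)/da
    b -= eta * 2 * (err * x_arr).mean()  # d(MSE)/db

print("intercept=%.3f  slope=%.3f" % (a, b))  # should approach the OLS line above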

In [147]:
X = iris[['petal_length']]
y = iris['sepal_length']

model = linear_model.LinearRegression()
model.fit(X, y)
model2 = linear_model.SGDRegressor(eta0=0.1, n_iter=50)
model2.fit(X, y)
print "R^2:", model.score(X, y)
print "SSE", metrics.mean_squared_error(y, model.predict(X))

print "R^2:", model2.score(X, y)
print "SSE", metrics.mean_squared_error(y, model2.predict(X))
print model2.intercept_, model2.coef_

# Plot the best fit line
plt.scatter(iris["petal_length"], iris["sepal_length"])
plt.plot(iris["petal_length"], model.predict(X), color='b', label="Best Fit Line")
plt.plot(iris["petal_length"], model2.predict(X), color='g', label="Gradient Descent")
plt.legend(loc=2)
plt.xlabel("Petal Length")
plt.ylabel("Sepal Length")
plt.show()
LM R^2: 0.759954645773
LM MSE: 0.163500225106
GD R^2: 0.710667490754
GD MSE: 0.197070801658
[ 4.28724869] [ 0.36903519]
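
SGDRegressor is also sensitive to the scale of its inputs, so another way to get a good fit without hand-tuning eta0 is to standardize the feature first. A sketch using a pipeline (the scaler is an addition here, not part of the analysis above):

# Standardizing the input usually lets SGD converge with its default learning rate.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# n_iter is called max_iter in newer scikit-learn releases
sgd_pipe = make_pipeline(StandardScaler(), linear_model.SGDRegressor(n_iter=50))
sgd_pipe.fit(X, y)
print("R^2 with scaling: %.3f" % sgd_pipe.score(X, y))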

The Loss Function

We can look at the landscape that gradient descent is trying to minimize by computing the loss function over a grid of candidate intercepts and slopes.

In [126]:
# This may take a minute or so
domain = np.arange(-10, 10, 0.05)
l = len(domain)
mses = np.zeros((l, l))

def f(x, a, b):
    return a + b * x

X = iris['petal_length']
y = iris['sepal_length']

data = zip(X, y)

def compute_mse(a, b, data=data):
    s = 0.
    for (x, y) in data:
        s += (y - f(x, a, b)) ** 2
    return s
        
for i, a in enumerate(domain):
    for j, b in enumerate(domain):
        mses[i][j] = np.log(compute_mse(a, b) / len(data))  # log of the mean squared error
In [150]:
sep = 20

plt.matshow(mses, cmap='viridis')
domain = map(round, domain)
plt.title("Loss Function for this dataset")
plt.yticks(range(len(domain))[::sep], domain[::sep])
plt.xticks(range(len(domain))[::sep], domain[::sep])
plt.ylabel("Intercept")
plt.xlabel("Slope")
plt.colorbar()
plt.show()
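
The double Python loop above is slow; the same grid can be filled in a fraction of a second with numpy broadcasting. A sketch (it rebuilds its own copy of the grid, since domain was rounded in the plotting cell above):

# Vectorized version of the loss-surface computation.
dom = np.arange(-10, 10, 0.05)
A, B = np.meshgrid(dom, dom, indexing='ij')  # A = intercepts (rows), B = slopes (columns)

sse = np.zeros_like(A)
for xi, yi in zip(iris['petal_length'].values, iris['sepal_length'].values):
    sse += (yi - (A + B * xi)) ** 2          # accumulate squared error over all 150 points

mses_fast = np.log(sse / len(iris))          # same log-MSE surface as above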

The SGD model above converged to an intercept of about 4.287 and a slope of about 0.369, which is roughly the location of the darkest (lowest-loss) spot on the heatmap.

We can also see how regularization changes the landscape. The two penalized loss functions below add alpha * (|a| + |b|) (an L1, lasso-style penalty) and alpha * sqrt(a^2 + b^2) (an L2-norm penalty; ridge proper penalizes the squared norm, alpha * (a^2 + b^2)). With the large alpha used here, the changes the two penalties make to the landscape end up looking fairly similar.

In [153]:
# Now our loss function takes a parameter alpha, and this changes the heatmap
domain = np.arange(-10, 10, 0.05)
l = len(domain)
mses = np.zeros((l, l))

def f(x, a, b):
    return a + b * x

X = iris['petal_length']
y = iris['sepal_length']

data = zip(X, y)

def compute_mse(a, b, data=data, alpha=10**3):
    s = 0.
    for (x, y) in data:
        s += (y - f(x, a, b)) ** 2
    s += alpha * (abs(a) + abs(b))  # L1 (lasso-style) penalty
    return s

for i, a in enumerate(domain):
    for j, b in enumerate(domain):
        mses[i][j] = np.log(compute_mse(a, b) / len(data))
In [154]:
sep = 20

plt.matshow(mses, cmap='viridis')
domain = map(round, domain)
plt.title("Loss Function for this dataset")
plt.yticks(range(len(domain))[::sep], domain[::sep])
plt.xticks(range(len(domain))[::sep], domain[::sep])
plt.ylabel("Intercept")
plt.xlabel("Slope")
plt.colorbar()
plt.show()
In [157]:
# Now our loss function takes a parameter alpha, and this changes the heatmap
domain = np.arange(-10, 10, 0.05)
l = len(domain)
mses = np.zeros((l, l))

def f(x, a, b):
    return a + b * x

X = iris['petal_length']
y = iris['sepal_length']

data = zip(X, y)

def compute_mse(a, b, data=data, alpha=10**3):
    s = 0.
    for (x, y) in data:
        s += (y - f(x, a, b)) ** 2
    s += alpha * (a**2 + b**2)**(0.5)  # L2-norm penalty (ridge proper would use alpha * (a**2 + b**2))
    return s

for i, a in enumerate(domain):
    for j, b in enumerate(domain):
        mses[i][j] = np.log(compute_mse(a, b) / len(data))
In [158]:
sep = 20

plt.matshow(mses, cmap='viridis')
domain = map(round, domain)
plt.title("Loss Function for this dataset")
plt.yticks(range(len(domain))[::sep], domain[::sep])
plt.xticks(range(len(domain))[::sep], domain[::sep])
plt.ylabel("Intercept")
plt.xlabel("Slope")
plt.colorbar()
plt.show()