Logistic Regression with Scikit-Learn

This notebook gives a quick example of a logistic regression using scikit-learn. Despite the name, logistic regression is typically used as a classifier: the regression predicts the probability that a data point belongs to one of two classes, and that probability is thresholded to make a binary choice.
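Under the hood, the model passes a linear combination of the features through the sigmoid (logistic) function to squash it into a probability. A minimal sketch of that function, with made-up coefficients purely for illustration:

import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for the probability of one class given a measurement x
b0, b1 = -10.0, 0.5
print(sigmoid(b0 + b1 * 30.0))  # ~0.99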

First let's generate and plot some data.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import seaborn as sns

%matplotlib inline
In [2]:
distA = stats.norm(30, 5)
distB = stats.norm(15, 4)

data = []
for i in range(100):
    data.append((distA.rvs(), "A"))
    data.append((distB.rvs(), "B"))
    
df = pd.DataFrame(data, columns=["measurement", "class"])
df.head()
Out[2]:
measurement class
0 22.241336 A
1 7.913350 B
2 33.913087 A
3 18.627316 B
4 25.655871 A
In [3]:
sns.violinplot(x="class", y="measurement", data=df)
plt.show()
In [4]:
sns.histplot(df[df["class"] == "A"]["measurement"], kde=True)
sns.histplot(df[df["class"] == "B"]["measurement"], kde=True)
plt.show()

Now let's convert the classes to numerical values of 0 and 1 and re-plot.

In [5]:
df["class_num"] = df['class'].apply(lambda x: 1 if x == 'A' else 0 )
df.head()
Out[5]:
measurement class class_num
0 22.241336 A 1
1 7.913350 B 0
2 33.913087 A 1
3 18.627316 B 0
4 25.655871 A 1
In [6]:
plt.scatter(df["measurement"], df["class_num"])
plt.show()

We could try to use a linear regression to separate the classes: fit a line, then label each point as one class or the other depending on whether its fitted value falls above or below a cutoff (a sketch of this follows the cell below). This works reasonably well (better than no classifier at all), but it has a lot of drawbacks, and logistic regression typically gives a better solution.

In [7]:
from sklearn import linear_model
X = df[["measurement"]]
y = df["class_num"]
model = linear_model.LinearRegression()
model.fit(X, y)

plt.scatter(df["measurement"], df["class_num"])
plt.plot(df["measurement"], model.predict(X), color="r")
plt.show()
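To turn the linear fit into class labels we threshold its output, here at 0.5 (the cutoff is our choice for illustration). A quick sketch reusing `model`, `X`, and `y` from the cell above:

import numpy as np

# Assign class 1 when the fitted value is above the 0.5 cutoff
linear_labels = (model.predict(X) > 0.5).astype(int)
print("Linear-threshold accuracy:", np.mean(linear_labels == y))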

A logistic regression separates the two classes much more sharply: plotting the predicted class labels shows an abrupt step at the decision boundary.

In [8]:
from sklearn import linear_model

df.sort_values(by="measurement", inplace=True)

X = df[["measurement"]]
y = df["class_num"]
model = linear_model.LogisticRegression()
model.fit(X, y)

plt.scatter(df["measurement"], df["class_num"])
plt.plot(df["measurement"], model.predict(X), color="r")
plt.xlabel("Measurement")
plt.show()
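The jump in the red line sits at the model's decision boundary, i.e. the measurement where the predicted probability crosses 0.5. With a single feature this is where `coef_ * x + intercept_ = 0`, so we can recover it from the attributes scikit-learn exposes on the fitted model:

# Solve coef * x + intercept = 0 for x
boundary = -model.intercept_[0] / model.coef_[0][0]
print("Decision boundary at measurement =", boundary)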

We can also plot the predicted probabilities and check the accuracy of the model.

In [9]:
from sklearn import linear_model

df.sort_values(by="measurement", inplace=True)

X = df[["measurement"]]
y = df["class_num"]
model = linear_model.LogisticRegression()
model.fit(X, y)

plt.scatter(df["measurement"], df["class_num"])
plt.plot(df["measurement"], model.predict_proba(X)[:, 1], color="r")
plt.xlabel("Measurement")
plt.ylabel("Probability of being in class A")
plt.show()

print "Accuracy", model.score(X, y)
Accuracy 0.96
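Accuracy alone doesn't show which class the errors come from. As a further check, scikit-learn's metrics module can break the result down; a sketch reusing `model`, `X`, and `y` from the cell above:

from sklearn import metrics

predicted = model.predict(X)
print(metrics.confusion_matrix(y, predicted))
print(metrics.classification_report(y, predicted))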

Now let's try a set of data that is not so well separated.

In [12]:
distA = stats.norm(22, 5)
distB = stats.norm(15, 3)

data = []
for i in range(100):
    data.append((distA.rvs(), "A"))
    data.append((distB.rvs(), "B"))
    
df = pd.DataFrame(data, columns=["measurement", "class"])
df["class_num"] = df['class'].apply(lambda x: 1 if x == 'A' else 0 )

sns.histplot(df[df["class"] == "A"]["measurement"], kde=True)
sns.histplot(df[df["class"] == "B"]["measurement"], kde=True)
plt.show()
In [13]:
from sklearn import linear_model

df.sort_values(by="measurement", inplace=True)

X = df[["measurement"]]
y = df["class_num"]
model = linear_model.LogisticRegression()
model.fit(X, y)

plt.scatter(df["measurement"], df["class_num"])
plt.plot(df["measurement"], model.predict_proba(X)[:, 1], color="r")
plt.show()

print "Accuracy", model.score(X, y)
Accuracy 0.81

In this case the class distributions overlap, so the accuracy is lower, but the classifier is still clearly better than guessing.
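Note also that we scored the model on the same data we trained it on, which tends to overstate performance. A minimal sketch of a held-out evaluation (the 80/20 split and random_state are arbitrary choices):

from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for scoring
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = linear_model.LogisticRegression()
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))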
