Linear Regression with Statsmodels and Scikit-Learn

There are many ways to fit a linear regression in Python; I find myself most commonly using scikit-learn and statsmodels. This notebook demos some common tasks using these two libraries:

  • Linear regressions in both
  • Using dummy variables
  • Multilinear regression
  • Quadratic and polynomial regressions
  • Exponential regressions

You can also create polynomial fits with numpy and more general curve fits with scipy.
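For instance, here's a minimal sketch of each approach (the data below is made up purely for illustration):

import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(0, 10, 50)
y = 2 + 0.5 * x + np.random.normal(0, 0.5, size=50)

# numpy: least-squares polynomial fit; degree 1 is a straight line
slope, intercept = np.polyfit(x, y, 1)   # coefficients, highest power first

# scipy: fit any function you can write down
def line(x, a, b):
    return a + b * x

params, cov = curve_fit(line, x, y)      # best-fit parameters and their covariance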

To get started let's load in the libraries that we'll need.

In [61]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
from sklearn import linear_model

For the first few examples we'll use the famous Iris dataset. Seaborn provides a few data sets including this one. Let's load the data and take a quick look using Seaborn's pairplot.

In [48]:
# Load the data into a pandas dataframe
iris = sns.load_dataset("iris")
iris.head()
Out[48]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [11]:
# Quick plot of the data using seaborn
sns.pairplot(iris, hue="species")
plt.show()

You can see a pretty strong linear relationship between petal_length and petal_width, so let's fit a linear regression. Seaborn can plot the data with a regression line, so let's do that first (though it doesn't give us much in the way of statistics).

In [12]:
sns.lmplot(x="petal_length", y="petal_width", data=iris)
plt.show()

Now let's use scikit-learn to find the best fit line.

In [14]:
# from sklearn import linear_model

X = iris[["petal_length"]]
y = iris["petal_width"]

# Fit the linear model
model = linear_model.LinearRegression()
results = model.fit(X, y)

# Print the coefficients
print(results.intercept_, results.coef_)
-0.363075521319 [ 0.41575542]

This means that our best fit line is: $$y = a + b x$$ where $a = -0.363075521319$ and $b = 0.41575542$.
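As a quick sanity check, we can evaluate the fitted line both by hand and with the model's predict method (a small sketch; the petal length of 4.0 is just an arbitrary test value):

x_new = pd.DataFrame({"petal_length": [4.0]})        # arbitrary test point
print(results.predict(x_new))                        # prediction from the fitted model
print(results.intercept_ + results.coef_[0] * 4.0)   # the same value computed by hand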

Next let's use statsmodels.

In [15]:
# import statsmodels.api as sm

# Note the argument order: statsmodels expects (y, X), the reverse of scikit-learn's fit(X, y)
model = sm.OLS(y, X)
results = model.fit()
# Statsmodels gives R-like statistical output
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            petal_width   R-squared:                       0.967
Model:                            OLS   Adj. R-squared:                  0.967
Method:                 Least Squares   F-statistic:                     4417.
Date:                Tue, 14 Jun 2016   Prob (F-statistic):          1.22e-112
Time:                        22:19:00   Log-Likelihood:                -8.7179
No. Observations:                 150   AIC:                             19.44
Df Residuals:                     149   BIC:                             22.45
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
petal_length     0.3365      0.005     66.463      0.000         0.327     0.347
==============================================================================
Omnibus:                       19.720   Durbin-Watson:                   0.857
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               23.498
Skew:                           0.957   Prob(JB):                     7.90e-06
Kurtosis:                       3.311   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

If you look closely you'll notice that, unlike scikit-learn, statsmodels' OLS doesn't include an intercept by default. One easy way to add one is with numpy's Vandermonde matrix function, numpy.vander.

In [17]:
X = [1, 2, 3]
np.vander(X, 3)
Out[17]:
array([[1, 1, 1],
       [4, 2, 1],
       [9, 3, 1]])

As you can see, np.vander gives us columns of successive powers of the input array, highest power first. We can use it simply to add a constant column for the intercept (the zeroth power) or to build design matrices for polynomial fits (higher powers). First let's redo the statsmodels fit with an intercept.

In [20]:
X = iris["petal_length"]
X = np.vander(X, 2) # add a constant column for the intercept
y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            petal_width   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.927
Method:                 Least Squares   F-statistic:                     1882.
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           4.68e-86
Time:                        22:23:17   Log-Likelihood:                 24.796
No. Observations:                 150   AIC:                            -45.59
Df Residuals:                     148   BIC:                            -39.57
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.4158      0.010     43.387      0.000         0.397     0.435
const         -0.3631      0.040     -9.131      0.000        -0.442    -0.285
==============================================================================
Omnibus:                        5.765   Durbin-Watson:                   1.455
Prob(Omnibus):                  0.056   Jarque-Bera (JB):                5.555
Skew:                           0.359   Prob(JB):                       0.0622
Kurtosis:                       3.611   Cond. No.                         10.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Note that the coefficients are essentially identical to what we saw before with scikit-learn, and the fit is quite good ($R^2=0.927$). Let's see whether adding in the species helps. Since that feature is categorical, we need to encode it with dummy variables.

In [49]:
dummies = pd.get_dummies(iris["species"])
# Add to the original dataframe
iris = pd.concat([iris, dummies], axis=1)
iris.head()
Out[49]:
sepal_length sepal_width petal_length petal_width species setosa versicolor virginica
0 5.1 3.5 1.4 0.2 setosa 1.0 0.0 0.0
1 4.9 3.0 1.4 0.2 setosa 1.0 0.0 0.0
2 4.7 3.2 1.3 0.2 setosa 1.0 0.0 0.0
3 4.6 3.1 1.5 0.2 setosa 1.0 0.0 0.0
4 5.0 3.6 1.4 0.2 setosa 1.0 0.0 0.0

Now we perform a multilinear regression with the dummy variables added.

In [51]:
X = iris[["petal_length", "setosa", "versicolor", "virginica"]]
X = sm.add_constant(X) # another way to add a constant column for an intercept
y = iris["petal_width"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            petal_width   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.944
Method:                 Least Squares   F-statistic:                     845.5
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           4.88e-92
Time:                        22:46:01   Log-Likelihood:                 46.704
No. Observations:                 150   AIC:                            -85.41
Df Residuals:                     146   BIC:                            -73.37
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
const            0.2501      0.098      2.561      0.011         0.057     0.443
petal_length     0.2304      0.034      6.691      0.000         0.162     0.298
setosa          -0.3410      0.051     -6.655      0.000        -0.442    -0.240
versicolor       0.0944      0.054      1.751      0.082        -0.012     0.201
virginica        0.4967      0.096      5.150      0.000         0.306     0.687
==============================================================================
Omnibus:                        6.210   Durbin-Watson:                   1.736
Prob(Omnibus):                  0.045   Jarque-Bera (JB):                9.578
Skew:                          -0.110   Prob(JB):                      0.00832
Kurtosis:                       4.218   Cond. No.                     1.69e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.68e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
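The condition-number warning above is the dummy-variable trap: the constant column plus the three dummy columns are perfectly collinear (the dummies sum to one in every row), so the design matrix is singular and the individual coefficients aren't uniquely determined. One common fix is to drop one dummy level; here's a minimal sketch using pandas' drop_first option (a hypothetical refit, not part of the analysis above):

dummies2 = pd.get_dummies(iris["species"], drop_first=True)  # setosa becomes the baseline level
X2 = sm.add_constant(pd.concat([iris[["petal_length"]], dummies2], axis=1))
print(sm.OLS(iris["petal_width"], X2).fit().summary())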

In this case it looks like we got a slight improvement from including the dummy variables ($R^2$ rises from 0.927 to 0.946). The dummy variables have a bigger impact on a fit of sepal_length against petal_length, as the next two regressions show.

In [57]:
X = iris[["petal_length"]]
X = sm.add_constant(X)
y = iris["sepal_length"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           sepal_length   R-squared:                       0.760
Model:                            OLS   Adj. R-squared:                  0.758
Method:                 Least Squares   F-statistic:                     468.6
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           1.04e-47
Time:                        22:49:10   Log-Likelihood:                -77.020
No. Observations:                 150   AIC:                             158.0
Df Residuals:                     148   BIC:                             164.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
const            4.3066      0.078     54.939      0.000         4.152     4.462
petal_length     0.4089      0.019     21.646      0.000         0.372     0.446
==============================================================================
Omnibus:                        0.207   Durbin-Watson:                   1.867
Prob(Omnibus):                  0.902   Jarque-Bera (JB):                0.346
Skew:                           0.069   Prob(JB):                        0.841
Kurtosis:                       2.809   Cond. No.                         10.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [58]:
X = iris[["petal_length", "setosa", "versicolor", "virginica"]]
y = iris["sepal_length"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           sepal_length   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.833
Method:                 Least Squares   F-statistic:                     249.4
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           3.10e-57
Time:                        22:49:23   Log-Likelihood:                -48.116
No. Observations:                 150   AIC:                             104.2
Df Residuals:                     146   BIC:                             116.3
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
petal_length     0.9046      0.065     13.962      0.000         0.777     1.033
setosa           3.6835      0.106     34.719      0.000         3.474     3.893
versicolor       2.0826      0.280      7.435      0.000         1.529     2.636
virginica        1.5659      0.363      4.315      0.000         0.849     2.283
==============================================================================
Omnibus:                        0.578   Durbin-Watson:                   1.802
Prob(Omnibus):                  0.749   Jarque-Bera (JB):                0.672
Skew:                           0.140   Prob(JB):                        0.715
Kurtosis:                       2.830   Cond. No.                         71.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Quadratic Fit

Next let's generate a synthetic data set that calls for a quadratic fit, then do both a linear and a quadratic fit and compare.

In [76]:
func = lambda x: 2 + 0.5 * x + 3 * x ** 2 + 5 * stats.norm.rvs(0, 10)
df = pd.DataFrame()
df["x"] = list(range(0, 30))
df["y"] = map(func, df["x"])
df.plot.scatter(x='x', y='y')
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3d1003050>
In [105]:
# Linear Fit
X = df["x"]
X = np.vander(X, 2) # add a constant column for the intercept
y = df["y"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.926
Model:                            OLS   Adj. R-squared:                  0.923
Method:                 Least Squares   F-statistic:                     351.0
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           2.23e-17
Time:                        23:03:30   Log-Likelihood:                -203.34
No. Observations:                  30   AIC:                             410.7
Df Residuals:                      28   BIC:                             413.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            86.9246      4.640     18.736      0.000        77.421    96.428
const       -394.8666     78.347     -5.040      0.000      -555.353  -234.380
==============================================================================
Omnibus:                        3.221   Durbin-Watson:                   0.175
Prob(Omnibus):                  0.200   Jarque-Bera (JB):                2.444
Skew:                           0.553   Prob(JB):                        0.295
Kurtosis:                       2.145   Cond. No.                         33.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [106]:
# Quadratic Fit
X2 = df['x']
X2 = np.vander(X2, 3) # columns: x**2, x, 1 (quadratic, linear, constant)
y = df["y"]

model2 = sm.OLS(y, X2)
results2 = model2.fit()
print(results2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.997
Model:                            OLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                     4003.
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           4.06e-34
Time:                        23:03:42   Log-Likelihood:                -156.99
No. Observations:                  30   AIC:                             320.0
Df Residuals:                      27   BIC:                             324.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             3.1033      0.130     23.798      0.000         2.836     3.371
x2            -3.0725      3.914     -0.785      0.439       -11.103     4.958
const         25.1198     24.517      1.025      0.315       -25.186    75.425
==============================================================================
Omnibus:                        0.146   Durbin-Watson:                   2.670
Prob(Omnibus):                  0.929   Jarque-Bera (JB):                0.021
Skew:                           0.039   Prob(JB):                        0.990
Kurtosis:                       2.897   Cond. No.                     1.10e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

We see that the quadratic fit is much better ($R^2$ of 0.997 versus 0.926). We can also plot each model's predictions against the true values to see how far off each one is.

In [107]:
# Plot true values versus the predictions
plt.scatter(df['y'], results.predict(), color="g", label="Linear")
plt.scatter(df['y'], results2.predict(), color="b", label="Quadratic")
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.show()

Although it's a little hard to tell from the plot (since both fits are good), the blue (quadratic) points lie closer to the line $y = x$, indicating closer agreement between the model's predictions and the true values.
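We can also compare the two fits numerically rather than by eye; here's a quick sketch using each model's $R^2$ and mean squared residual:

print("linear:    R^2 = %.3f  MSE = %.1f" % (results.rsquared, (results.resid ** 2).mean()))
print("quadratic: R^2 = %.3f  MSE = %.1f" % (results2.rsquared, (results2.resid ** 2).mean()))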

Higher-order polynomial regressions are as easy as increasing the second argument of numpy.vander (the number of columns, which is one more than the degree of the polynomial).
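For example, a cubic fit to the same data would look like this sketch (np.vander(x, 4) produces the columns $x^3, x^2, x, 1$):

X3 = np.vander(df["x"], 4)           # columns: x**3, x**2, x, 1
results3 = sm.OLS(df["y"], X3).fit()
print(results3.params)               # coefficients for x**3, x**2, x, and the constant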

Exponential functions

We can also transform our data before applying linear regression. This allows us to fit functions such as exponentials of the form $y=A e^{k x}$ using linear regression. Here's some data generated from an exponential function with a bit of added noise.

In [109]:
func = lambda x: 2.5 * np.exp(0.5 * x) + stats.norm.rvs(0, 0.3)
df = pd.DataFrame()
df["x"] = np.arange(-1, 4, 0.1)
df["y"] = map(func, df["x"])
df.plot.scatter(x='x', y='y')
Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3d9d10250>

If we take the log of the y-variable we get something more linear.

In [110]:
df["log-y"] = np.log(df["y"])
df.plot.scatter(x='x', y='log-y')
Out[110]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3dd31a9d0>

We can then use linear regression to determine $k$ and $\log A$, since taking the $\log$ of both sides of $y = A e^{k x}$ gives us $\log y = \log A + k x$.

In [111]:
X = df["x"]
X = sm.add_constant(X)
y = df["log-y"]

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  log-y   R-squared:                       0.980
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     2403.
Date:                Tue, 14 Jun 2016   Prob (F-statistic):           1.17e-42
Time:                        23:11:11   Log-Likelihood:                 39.452
No. Observations:                  50   AIC:                            -74.90
Df Residuals:                      48   BIC:                            -71.08
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          0.8160      0.022     36.278      0.000         0.771     0.861
x              0.5390      0.011     49.021      0.000         0.517     0.561
==============================================================================
Omnibus:                        6.788   Durbin-Watson:                   2.358
Prob(Omnibus):                  0.034   Jarque-Bera (JB):               11.226
Skew:                           0.105   Prob(JB):                      0.00365
Kurtosis:                       5.312   Cond. No.                         3.29
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As you can see, the fit is very good ($R^2 = 0.98$).
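We can also recover estimates of the original parameters: $k$ is the fitted slope, and $A = e^{\text{intercept}}$. A small sketch (the exact values will vary with the random noise, and since the noise was added to $y$ rather than to $\log y$ the recovered $A$ will be a bit off):

k = results.params["x"]
A = np.exp(results.params["const"])
print(A, k)   # roughly 2.5 and 0.5, the values used to generate the data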
