Mauna Loa CO2 Data

In this notebook we'll be analyzing the Mauna Loa Obsevatory CO2 measurements. The data is available here.

Let's try to model the time series. It has both a seasonal component and an increasing mean.

In [1]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

plt.rcParams['figure.figsize'] = (8, 8)

Load the data

It's a little tricky to load this dataset but with some extra optional arguments to read_csv we can get the job done.

In [2]:
filename = "co2_mm_mlo.txt"
columns = ["year", "month", "decimal date", "average", "interpolated", "trend", "?"]

co2 = pd.read_csv(filename, delim_whitespace=True, skiprows=72, names=columns)
co2.dropna(inplace=True)
co2.head()
Out[2]:
year month decimal date average interpolated trend ?
0 1958 3 1958.208 315.71 315.71 314.62 -1
1 1958 4 1958.292 317.45 317.45 315.29 -1
2 1958 5 1958.375 317.50 317.50 314.71 -1
3 1958 6 1958.458 -99.99 317.10 314.85 -1
4 1958 7 1958.542 315.86 315.86 314.98 -1

Next we have to mangle the date format.

In [3]:
def make_date(row):
    year = str(int(row['year']))
    month = str(int(row['month']))
    if len(month) < 2:
        month = '0' + month
    day = '01'
    return "{}-{}-{}".format(year, month, day)

co2['Date'] = co2.apply(make_date, axis=1)
co2['Date'] = pd.to_datetime(co2['Date'], format='%Y-%m-%d')
co2['date'] = co2['Date']
co2.set_index('Date', inplace=True)
co2.head()
Out[3]:
year month decimal date average interpolated trend ? date
Date
1958-03-01 1958 3 1958.208 315.71 315.71 314.62 -1 1958-03-01
1958-04-01 1958 4 1958.292 317.45 317.45 315.29 -1 1958-04-01
1958-05-01 1958 5 1958.375 317.50 317.50 314.71 -1 1958-05-01
1958-06-01 1958 6 1958.458 -99.99 317.10 314.85 -1 1958-06-01
1958-07-01 1958 7 1958.542 315.86 315.86 314.98 -1 1958-07-01

Let's plot the data

In [4]:
plt.plot(co2.index, co2["interpolated"])
plt.ylabel("CO_2")
plt.xlabel("Date")
plt.show()

Build a DataFrame

In [5]:
df = pd.DataFrame()
df['y'] = co2["interpolated"]
df['x'] = co2["date"].apply(lambda x: x.toordinal())
df['Date'] = co2["date"]
df.set_index('Date', inplace=True)

df.head()
Out[5]:
y x
Date
1958-03-01 315.71 714839
1958-04-01 317.45 714870
1958-05-01 317.50 714900
1958-06-01 317.10 714931
1958-07-01 315.86 714961

Modeling

It turns out that a polynomial (quadratic) model of the log of the CO2 versus the date works fairly well.

In [6]:
X = np.vander(df['x'], 3)
y = np.log(df['y'])

from sklearn import linear_model

model = linear_model.LinearRegression()

model.fit(X, y)

plt.plot(df['x'], df['y'], label="Raw Data")
plt.plot(df['x'], np.exp(model.predict(X)), label="Predicted", color='r', linewidth='2')
plt.ylabel("CO_2")
plt.xlabel("Date")
plt.legend()
Out[6]:
<matplotlib.legend.Legend at 0x7fce5d1430d0>

Subtract off the predicted trend

To get at the oscilliatory portion of the time series we subtract off the predictions from the quadratic model.

In [7]:
df['y2'] = df['y'] - np.exp(model.predict(X))
plt.plot(df['x'], df['y2'])
Out[7]:
[<matplotlib.lines.Line2D at 0x7fce5d0677d0>]

Analyzing the Time Series

We can use autocorrelation to measure the self-similarity of the time series. In this case we expect a yearly cycle.

In [8]:
# Monthly
df['y2'].autocorr(lag=1)
Out[8]:
0.84705794674370127
In [9]:
# Yearly
df['y2'].autocorr(lag=12)
Out[9]:
0.96559155885857939

As you can see there is a very strong yearly autocorrelation, as expected. We can take a look at a range of lag values with a plot from statsmodels.

In [10]:
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(df['y2'], lags=13)
plt.show()

Forecasting with ARMA and ARIMA

We could extrapolate the patterns we found about or we could use a time series forecasting model.

First we try an ARMA model on the reduced time series.

In [11]:
from statsmodels.tsa.arima_model import ARMA

model = ARMA(df['y2'], (1, 1)).fit()
model.summary()
Out[11]:
ARMA Model Results
Dep. Variable: y2 No. Observations: 694
Model: ARMA(1, 1) Log Likelihood -866.163
Method: css-mle S.D. of innovations 0.841
Date: Tue, 26 Jul 2016 AIC 1740.326
Time: 21:17:44 BIC 1758.496
Sample: 03-01-1958 HQIC 1747.353
- 12-01-2015
coef std err z P>|z| [95.0% Conf. Int.]
const 0.0111 0.253 0.044 0.965 -0.485 0.507
ar.L1.y2 0.7864 0.024 32.924 0.000 0.740 0.833
ma.L1.y2 0.7023 0.021 32.914 0.000 0.661 0.744
1.2715 +0.0000j 1.2715 0.0000 -1.4238 +0.0000j 1.4238 0.5000
Roots
Real Imaginary Modulus Frequency
AR.1
MA.1
In [12]:
model.resid.plot()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce67a5a290>
In [13]:
model.plot_predict(1, 500)
plt.show()
/home/user/anaconda2/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py:1724: FutureWarning: TimeSeries is deprecated. Please use Series
  forecast = TimeSeries(forecast, index=self.data.predict_dates)

That's a pretty good fit! We could combine this model with our model for the mean to make predictions, but let's go ahead and try an ARIMA model on the raw data. This will handle the increasing mean as part of the model.

In [14]:
from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(co2['interpolated'], (2, 1, 2)).fit()
model.summary()
Out[14]:
ARIMA Model Results
Dep. Variable: D.interpolated No. Observations: 693
Model: ARIMA(2, 1, 2) Log Likelihood -657.754
Method: css-mle S.D. of innovations 0.624
Date: Tue, 26 Jul 2016 AIC 1327.508
Time: 21:17:45 BIC 1354.754
Sample: 04-01-1958 HQIC 1338.045
- 12-01-2015
coef std err z P>|z| [95.0% Conf. Int.]
const 0.1255 0.007 16.902 0.000 0.111 0.140
ar.L1.D.interpolated 1.5587 0.023 66.515 0.000 1.513 1.605
ar.L2.D.interpolated -0.8615 0.023 -38.155 0.000 -0.906 -0.817
ma.L1.D.interpolated -0.9243 0.051 -18.203 0.000 -1.024 -0.825
ma.L2.D.interpolated 0.0178 0.048 0.370 0.712 -0.077 0.112
0.9046 -0.5852j 1.0774 -0.0914 0.9046 +0.5852j 1.0774 0.0914 1.1055 +0.0000j 1.1055 0.0000 50.6990 +0.0000j 50.6990 0.0000
Roots
Real Imaginary Modulus Frequency
AR.1
AR.2
MA.1
MA.2
In [15]:
model.resid.plot()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce5721c110>
In [16]:
model.plot_predict(1, 500)
plt.show()
/home/user/anaconda2/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py:1847: FutureWarning: TimeSeries is deprecated. Please use Series
  forecast = TimeSeries(forecast, index=self.data.predict_dates)

Our model matches the raw data quite well!

Finally, let's predict the future.

In [17]:
model.plot_predict(1, 1200)
plt.xlabel("Date")
plt.ylabel("Atmospheric CO2")
plt.show()
In [ ]: