# Mauna Loa CO2 Data¶

In this notebook we'll be analyzing the Mauna Loa Obsevatory CO2 measurements. The data is available here.

Let's try to model the time series. It has both a seasonal component and an increasing mean.

In [1]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

plt.rcParams['figure.figsize'] = (8, 8)


It's a little tricky to load this dataset but with some extra optional arguments to read_csv we can get the job done.

In [2]:
filename = "co2_mm_mlo.txt"
columns = ["year", "month", "decimal date", "average", "interpolated", "trend", "?"]

co2 = pd.read_csv(filename, delim_whitespace=True, skiprows=72, names=columns)
co2.dropna(inplace=True)

Out[2]:
year month decimal date average interpolated trend ?
0 1958 3 1958.208 315.71 315.71 314.62 -1
1 1958 4 1958.292 317.45 317.45 315.29 -1
2 1958 5 1958.375 317.50 317.50 314.71 -1
3 1958 6 1958.458 -99.99 317.10 314.85 -1
4 1958 7 1958.542 315.86 315.86 314.98 -1

Next we have to mangle the date format.

In [3]:
def make_date(row):
year = str(int(row['year']))
month = str(int(row['month']))
if len(month) < 2:
month = '0' + month
day = '01'
return "{}-{}-{}".format(year, month, day)

co2['Date'] = co2.apply(make_date, axis=1)
co2['Date'] = pd.to_datetime(co2['Date'], format='%Y-%m-%d')
co2['date'] = co2['Date']
co2.set_index('Date', inplace=True)

Out[3]:
year month decimal date average interpolated trend ? date
Date
1958-03-01 1958 3 1958.208 315.71 315.71 314.62 -1 1958-03-01
1958-04-01 1958 4 1958.292 317.45 317.45 315.29 -1 1958-04-01
1958-05-01 1958 5 1958.375 317.50 317.50 314.71 -1 1958-05-01
1958-06-01 1958 6 1958.458 -99.99 317.10 314.85 -1 1958-06-01
1958-07-01 1958 7 1958.542 315.86 315.86 314.98 -1 1958-07-01

## Let's plot the data¶

In [4]:
plt.plot(co2.index, co2["interpolated"])
plt.ylabel("CO_2")
plt.xlabel("Date")
plt.show()


## Build a DataFrame¶

In [5]:
df = pd.DataFrame()
df['y'] = co2["interpolated"]
df['x'] = co2["date"].apply(lambda x: x.toordinal())
df['Date'] = co2["date"]
df.set_index('Date', inplace=True)


Out[5]:
y x
Date
1958-03-01 315.71 714839
1958-04-01 317.45 714870
1958-05-01 317.50 714900
1958-06-01 317.10 714931
1958-07-01 315.86 714961

## Modeling¶

It turns out that a polynomial (quadratic) model of the log of the CO2 versus the date works fairly well.

In [6]:
X = np.vander(df['x'], 3)
y = np.log(df['y'])

from sklearn import linear_model

model = linear_model.LinearRegression()

model.fit(X, y)

plt.plot(df['x'], df['y'], label="Raw Data")
plt.plot(df['x'], np.exp(model.predict(X)), label="Predicted", color='r', linewidth='2')
plt.ylabel("CO_2")
plt.xlabel("Date")
plt.legend()

Out[6]:
<matplotlib.legend.Legend at 0x7fce5d1430d0>

## Subtract off the predicted trend¶

To get at the oscilliatory portion of the time series we subtract off the predictions from the quadratic model.

In [7]:
df['y2'] = df['y'] - np.exp(model.predict(X))
plt.plot(df['x'], df['y2'])

Out[7]:
[<matplotlib.lines.Line2D at 0x7fce5d0677d0>]

## Analyzing the Time Series¶

We can use autocorrelation to measure the self-similarity of the time series. In this case we expect a yearly cycle.

In [8]:
# Monthly
df['y2'].autocorr(lag=1)

Out[8]:
0.84705794674370127
In [9]:
# Yearly
df['y2'].autocorr(lag=12)

Out[9]:
0.96559155885857939

As you can see there is a very strong yearly autocorrelation, as expected. We can take a look at a range of lag values with a plot from statsmodels.

In [10]:
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(df['y2'], lags=13)
plt.show()


## Forecasting with ARMA and ARIMA¶

We could extrapolate the patterns we found about or we could use a time series forecasting model.

First we try an ARMA model on the reduced time series.

In [11]:
from statsmodels.tsa.arima_model import ARMA

model = ARMA(df['y2'], (1, 1)).fit()
model.summary()

Out[11]:
Dep. Variable: No. Observations: y2 694 ARMA(1, 1) -866.163 css-mle 0.841 Tue, 26 Jul 2016 1740.326 21:17:44 1758.496 03-01-1958 1747.353 - 12-01-2015
coef std err z P>|z| [95.0% Conf. Int.] 0.0111 0.253 0.044 0.965 -0.485 0.507 0.7864 0.024 32.924 0.000 0.740 0.833 0.7023 0.021 32.914 0.000 0.661 0.744
1.2715 +0.0000j 1.2715 0.0000 -1.4238 +0.0000j 1.4238 0.5000
In [12]:
model.resid.plot()

Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce67a5a290>
In [13]:
model.plot_predict(1, 500)
plt.show()

/home/user/anaconda2/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py:1724: FutureWarning: TimeSeries is deprecated. Please use Series
forecast = TimeSeries(forecast, index=self.data.predict_dates)


That's a pretty good fit! We could combine this model with our model for the mean to make predictions, but let's go ahead and try an ARIMA model on the raw data. This will handle the increasing mean as part of the model.

In [14]:
from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(co2['interpolated'], (2, 1, 2)).fit()
model.summary()

Out[14]:
Dep. Variable: No. Observations: D.interpolated 693 ARIMA(2, 1, 2) -657.754 css-mle 0.624 Tue, 26 Jul 2016 1327.508 21:17:45 1354.754 04-01-1958 1338.045 - 12-01-2015
coef std err z P>|z| [95.0% Conf. Int.] 0.1255 0.007 16.902 0.000 0.111 0.140 1.5587 0.023 66.515 0.000 1.513 1.605 -0.8615 0.023 -38.155 0.000 -0.906 -0.817 -0.9243 0.051 -18.203 0.000 -1.024 -0.825 0.0178 0.048 0.370 0.712 -0.077 0.112
0.9046 -0.5852j 1.0774 -0.0914 0.9046 +0.5852j 1.0774 0.0914 1.1055 +0.0000j 1.1055 0.0000 50.6990 +0.0000j 50.6990 0.0000
In [15]:
model.resid.plot()

Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fce5721c110>
In [16]:
model.plot_predict(1, 500)
plt.show()

/home/user/anaconda2/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py:1847: FutureWarning: TimeSeries is deprecated. Please use Series
forecast = TimeSeries(forecast, index=self.data.predict_dates)


Our model matches the raw data quite well!

Finally, let's predict the future.

In [17]:
model.plot_predict(1, 1200)
plt.xlabel("Date")
plt.ylabel("Atmospheric CO2")
plt.show()

In [ ]: