Analyzing and predicting the number of air passengers

Adapted from an initial version prepared by Balázs Kégl (LAL/CNRS), Alex Gramfort (LTCI/Telecom ParisTech), Djalel Benbouzid (UPMC), and Mehdi Cherti (LAL/CNRS) for RAMP

Introduction

The data set was donated by an unnamed company handling flight ticket reservations. The data is thin; it contains only:

  • the date of departure
  • the departure airport
  • the arrival airport
  • the mean and standard deviation, across reservations, of the number of weeks between booking and the departure date
  • a field called log_PAX, which is related to the number of passengers (the actual numbers were transformed for privacy reasons)

The goal is to predict the log_PAX column. The prediction quality is measured by RMSE.

The challenge is thus to predict flight traffic with a learned regressor.
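For reference, RMSE on the log-scale target takes a couple of lines of NumPy. Below is a minimal sketch; the rmse helper is ours, not part of the challenge kit:

In [ ]:
import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error of predictions on the log_PAX scale
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))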

In [23]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
In [24]:
# optional
import seaborn as sns; sns.set()

Fetch the data and load it in pandas

In [25]:
data = pd.read_csv("public_train.csv")
In [26]:
print(data['DateOfDeparture'].min())
print(data['DateOfDeparture'].max())
2011-09-01
2013-03-05
In [27]:
data.head()
Out[27]:
DateOfDeparture Departure Arrival WeeksToDeparture log_PAX std_wtd
0 2012-10-21 DFW SFO 14.600000 10.757779 11.575837
1 2012-09-13 LAX ATL 14.730769 11.808097 13.364304
2 2012-09-04 ORD IAH 8.470588 10.865349 5.885551
3 2012-08-13 DEN PHX 8.200000 10.710562 6.292853
4 2012-09-10 ORD SEA 12.090909 11.498355 9.138662
In [28]:
data['Departure'].unique()
Out[28]:
array(['DFW', 'LAX', 'ORD', 'DEN', 'MCO', 'IAH', 'MIA', 'ATL', 'LGA',
       'SEA', 'PHX', 'CLT', 'DTW', 'LAS', 'EWR', 'MSP', 'BOS', 'PHL',
       'SFO', 'JFK'], dtype=object)
In [29]:
data.hist(column='log_PAX', bins=50);
In [30]:
data.hist('std_wtd', bins=50);
In [31]:
data.hist('WeeksToDeparture', bins=50);
In [32]:
data.describe()
Out[32]:
WeeksToDeparture log_PAX std_wtd
count 11128.000000 11128.000000 11128.000000
mean 11.466842 10.997728 8.635226
std 2.807131 0.998242 2.145805
min 2.625000 3.878108 2.160247
25% 9.523810 10.427055 7.109619
50% 11.304348 11.008268 8.582272
75% 13.259259 11.590093 10.162525
max 23.163265 14.224277 17.035677
In [33]:
data.dtypes
Out[33]:
DateOfDeparture      object
Departure            object
Arrival              object
WeeksToDeparture    float64
log_PAX             float64
std_wtd             float64
dtype: object
In [34]:
data.shape
Out[34]:
(11128, 6)
In [35]:
print(data['log_PAX'].mean())
print(data['log_PAX'].std())
10.99772803448557
0.9982419545897369
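
Before any modeling, note that a constant predictor that always outputs this mean would score an RMSE equal to the (population) standard deviation of log_PAX, so roughly 1.0 is the number to beat. A quick sanity check, not in the original notebook:

In [ ]:
# RMSE of always predicting the mean equals the population std of the target
baseline_rmse = np.sqrt(np.mean((data['log_PAX'] - data['log_PAX'].mean()) ** 2))
print("constant-mean baseline RMSE: {:.4f}".format(baseline_rmse))  # ~0.998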

Preprocessing for prediction

Getting dates into numerical columns is a common operation when time series are analyzed with non-parametric predictors. The code below makes all possible choices: ordered columns for the year, month, day, weekday, week, and the number of days since 1970-01-01, and one-hot columns for year, month, day, weekday, and week.

The departure and arrival airports are also converted into one-hot columns.

In [36]:
data_encoded = data.copy()  # work on a copy so the original frame stays untouched

data_encoded = data_encoded.join(pd.get_dummies(data_encoded['Departure'], prefix='d'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['Arrival'], prefix='a'))
data_encoded = data_encoded.drop('Departure', axis=1)
data_encoded = data_encoded.drop('Arrival', axis=1)

# following http://stackoverflow.com/questions/16453644/regression-with-date-variable-using-scikit-learn
data_encoded['DateOfDeparture'] = pd.to_datetime(data_encoded['DateOfDeparture'])
data_encoded['year'] = data_encoded['DateOfDeparture'].dt.year
data_encoded['month'] = data_encoded['DateOfDeparture'].dt.month
data_encoded['day'] = data_encoded['DateOfDeparture'].dt.day
data_encoded['weekday'] = data_encoded['DateOfDeparture'].dt.weekday
data_encoded['week'] = data_encoded['DateOfDeparture'].dt.isocalendar().week.astype(int)  # .dt.week was removed in pandas 2.0
data_encoded['n_days'] = (data_encoded['DateOfDeparture'] - pd.Timestamp("1970-01-01")).dt.days  # days since the epoch, vectorized

data_encoded = data_encoded.join(pd.get_dummies(data_encoded['year'], prefix='y'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['month'], prefix='m'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['day'], prefix='d'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['weekday'], prefix='wd'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['week'], prefix='w'))
In [37]:
data_encoded.tail(5)
Out[37]:
DateOfDeparture WeeksToDeparture log_PAX std_wtd d_ATL d_BOS d_CLT d_DEN d_DFW d_DTW d_EWR d_IAH d_JFK d_LAS d_LAX d_LGA d_MCO d_MIA d_MSP d_ORD d_PHL d_PHX d_SEA d_SFO a_ATL a_BOS a_CLT a_DEN a_DFW a_DTW a_EWR a_IAH a_JFK a_LAS a_LAX a_LGA a_MCO a_MIA a_MSP a_ORD a_PHL a_PHX a_SEA a_SFO year month day weekday week n_days y_2011 y_2012 y_2013 m_1 m_2 m_3 m_4 m_5 m_6 m_7 m_8 m_9 m_10 m_11 m_12 d_1 d_2 d_3 d_4 d_5 d_6 d_7 d_8 d_9 d_10 d_11 d_12 d_13 d_14 d_15 d_16 d_17 d_18 d_19 d_20 d_21 d_22 d_23 d_24 d_25 d_26 d_27 d_28 d_29 d_30 d_31 wd_0 wd_1 wd_2 wd_3 wd_4 wd_5 wd_6 w_1 w_2 w_3 w_4 w_5 w_6 w_7 w_8 w_9 w_10 w_11 w_12 w_13 w_14 w_15 w_16 w_17 w_18 w_19 w_20 w_21 w_22 w_23 w_24 w_25 w_26 w_27 w_28 w_29 w_30 w_31 w_32 w_33 w_34 w_35 w_36 w_37 w_38 w_39 w_40 w_41 w_42 w_43 w_44 w_45 w_46 w_47 w_48 w_49 w_50 w_51 w_52
11123 2011-11-22 11.636364 10.334322 8.888438 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2011 11 22 1 47 15300 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
11124 2011-12-14 13.541667 12.459632 10.341891 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2011 12 14 2 50 15322 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
11125 2013-01-28 5.583333 11.381124 3.752777 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2013 1 28 0 5 15733 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11126 2011-09-03 12.730769 11.308812 7.997788 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2011 9 3 5 35 15220 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11127 2012-09-24 14.250000 11.619697 9.508768 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2012 9 24 0 39 15607 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

A linear regressor baseline

We drop the target column and the original date column.

In [38]:
features = data_encoded.drop(['log_PAX','DateOfDeparture'], axis=1)
X_columns = data_encoded.columns.drop(['log_PAX','DateOfDeparture'])
X = features.values
y = data_encoded['log_PAX'].values
In [39]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

The linear model below already gives a pretty nice improvement over the constant-mean baseline computed above.

In [40]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()

scores = cross_val_score(reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print("log RMSE: {:.4f} +/-{:.4f}".format(
    np.mean(np.sqrt(-scores)), np.std(np.sqrt(-scores))))
log RMSE: 0.6392 +/-0.0258

Exercise: Visualize the coefficients and try to make sense of them.
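
One way to start the exercise (a sketch, not an official solution; it reuses the reg instance and X_columns from above):

In [ ]:
# fit on the training split and plot the 30 largest-magnitude coefficients
reg.fit(X_train, y_train)
coefs = pd.Series(reg.coef_, index=X_columns)
top = coefs.reindex(coefs.abs().sort_values(ascending=False).index)[:30]
top.plot(kind='bar', figsize=(15, 5));

Keep in mind that the one-hot blocks are collinear (each block sums to one), so the raw coefficients can be unstable; refitting with a Ridge penalty makes them easier to interpret.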

Random Forests

In [41]:
%%time
from sklearn.ensemble import RandomForestRegressor

n_estimators = 10
max_depth = 10
max_features = 10
reg = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth,
                            max_features=max_features)

scores = cross_val_score(reg, X_train, y_train, cv=5,
                         scoring='neg_mean_squared_error', n_jobs=3)
print("log RMSE: {:.4f} +/-{:.4f}".format(
    np.mean(np.sqrt(-scores)), np.std(np.sqrt(-scores))))
log RMSE: 0.7513 +/-0.0242
Wall time: 1.72 s
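
With these deliberately small settings (10 shallow trees, 10 candidate features per split) the forest actually scores worse than the linear baseline. A small grid search is one way to look for better settings; the sketch below uses scikit-learn's GridSearchCV, and the grid values are only illustrative:

In [ ]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [10, 20, None],
    'max_features': [10, 50, 'sqrt'],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("best log RMSE: {:.4f}".format(np.sqrt(-search.best_score_)))
print(search.best_params_)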

Variable importances

In [42]:
reg.fit(X_train, y_train)
Out[42]:
RandomForestRegressor(max_depth=10, max_features=10, n_estimators=10)
In [43]:
len(X_columns)
Out[43]:
153
In [44]:
plt.figure(figsize=(15, 5))

# indices of the 50 most important features, in decreasing order of importance
ordering = np.argsort(reg.feature_importances_)[::-1][:50]

importances = reg.feature_importances_[ordering]
feature_names = X_columns[ordering]

x = np.arange(len(feature_names))
plt.bar(x, importances)
plt.xticks(x, feature_names, rotation=90, fontsize=15);
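
The earlier train/test split left X_test and y_test unused; as a final sanity check (a sketch, not in the original notebook), the fitted forest can be scored on the held-out 20%:

In [ ]:
from sklearn.metrics import mean_squared_error

y_pred = reg.predict(X_test)
print("test log RMSE: {:.4f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))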