Training a Decision Tree or a Random Forest on a classification problem

Author: Pr Fabien MOUTARDE, Robotics Lab, MINES ParisTech, PSL Research University

1. Decision Trees with SciKit-Learn on a very simple dataset

We will first work on a very simple classic dataset: Iris, a classification problem in which the goal is to determine the iris flower sub-species from a few geometric characteristics of the flower.

Please FIRST READ the Iris DATASET DESCRIPTION. In this classification problem, there are 3 classes, with a total of 150 examples (each one with 4 input features). Please now execute the code cell below to load and view the dataset.

In [ ]:
import numpy as np

from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap

from sklearn import preprocessing 
from sklearn.preprocessing import StandardScaler

# Load Iris classification dataset
from sklearn.datasets import load_iris
iris = load_iris()

# Print all 150 examples
print("(Number_of_examples, example_size) = " , iris.data.shape, "\n")
for i in range(0, 150):
    print('Input = ', iris.data[i], ' , Label = ', iris.target[i] )

Building, training and evaluating a simple Decision Tree classifier

The SciKit-learn class for Decision Tree classifiers is sklearn.tree.DecisionTreeClassifier.

Please FIRST READ (and understand!) the DecisionTreeClassifier DOCUMENTATION to understand all the parameters of the constructor.

You can then begin by running the code block below, in which a default set of parameter values is used. If the graphical view works, look at the structure of the learnt decision tree.

Then, check the influence of the MAIN parameters of the Decision Tree classifier (a small parameter-sweep sketch is given after the code cell below), i.e.:

  • homogeneity criterion ('gini' or 'entropy')
  • max_depth
  • min_samples_split

NB: post-training PRUNING is unfortunately NOT implemented in older versions of SciKit-Learn Decision-Trees; recent versions (>= 0.22) do provide cost-complexity post-pruning through the ccp_alpha parameter.

In [ ]:
from sklearn.model_selection import train_test_split

# Split dataset into training and test part
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

# Learn a Decision Tree
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=5, 
                                  min_samples_split=2, min_samples_leaf=1, 
                                  min_weight_fraction_leaf=0.0, max_features=None, 
                                  random_state=None, max_leaf_nodes=None, 
                                  min_impurity_decrease=0.0, class_weight=None)
clf = clf.fit(X_train, y_train)

# Graphical view of learnt Decision Tree
#
#import pydotplus 
#dot_data = tree.export_graphviz(clf, out_file=None) 
#graph = pydotplus.graph_from_dot_data(dot_data) 
#graph.write_pdf("iris.pdf")
#from IPython.display import Image 
#Image(graph.create_png()) 
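
# Alternative graphical view (assumes scikit-learn >= 0.21): plot the learnt tree
# directly with matplotlib, without needing pydotplus/graphviz
plt.figure(figsize=(12, 8))
tree.plot_tree(clf, feature_names=iris.feature_names,
               class_names=list(iris.target_names), filled=True)
plt.show()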

# Evaluate accuracy on test data
print(clf)
score = clf.score(X_test, y_test)
print("Accuracy (on test set) = ", score)
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
y_true, y_pred = y_test, clf.predict(X_test)
print( classification_report(y_true, y_pred) )
print("\n CONFUSION MATRIX")
print( confusion_matrix(y_true, y_pred) )
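To explore the influence of the main hyper-parameters mentioned above, here is a minimal sweep sketch. It simply reuses the X_train/X_test split from the cell above; the grid of values is only an illustrative assumption, and the exact accuracies will vary with the random split.

In [ ]:
# Quick sweep over two of the main Decision Tree hyper-parameters (criterion and max_depth),
# reusing the X_train/X_test split defined in the cell above
for criterion in ['gini', 'entropy']:
    for max_depth in [1, 2, 3, 5, 10, None]:
        dt = tree.DecisionTreeClassifier(criterion=criterion, max_depth=max_depth,
                                         min_samples_split=2)
        dt.fit(X_train, y_train)
        print("criterion=%-7s  max_depth=%-4s  test accuracy=%.3f"
              % (criterion, str(max_depth), dt.score(X_test, y_test)))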

2. Decision Trees on a MORE REALISTIC DATASET: HANDWRITTEN DIGITS

Please FIRST READ the Digits DATASET DESCRIPTION.

In this classification problem, there are 10 classes, with a total of 1797 examples (each one being a 64D vector corresponding to an 8x8 pixmap). Please now execute the code cell below to load the dataset, visualize a typical example, and train a Decision Tree on it. The original code uses a SUBOPTIMAL set of learning hyper-parameter values. Try to play with them in order to improve accuracy.

Finally, find a somewhat optimized setting of the 3 main hyper-parameters for Decision Tree learning, by using CROSS-VALIDATION (see the cross-validation example from the Multi-Layer Perceptron notebook used in an earlier practical session; a minimal GridSearchCV sketch is also given after the code cell below).

Look at the final accuracy statistics, and also at the confusion matrix: which digits are most often confused with each other?

In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()
n_samples = len(digits.images)
print("Number_of-examples = ", n_samples)

import matplotlib.pyplot as plt
print("\n Plot of first example")
plt.gray() 
plt.matshow(digits.images[0]) 
plt.show() 

# Flatten the images, to turn the data into a (samples, features) matrix:
data = digits.images.reshape((n_samples, -1))

# Split dataset into training and test part
X = data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Create and train a Decision Tree Classifier
clf = tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=5, 
                                  min_samples_split=4, min_samples_leaf=1, 
                                  min_weight_fraction_leaf=0.0, max_features=None, 
                                  random_state=None, max_leaf_nodes=None, 
                                  min_impurity_decrease=0.0, class_weight=None)
clf = clf.fit(X_train, y_train)

# Evaluate accuracy on test data
print(clf)
score = clf.score(X_test, y_test)
print("Accuracy (on test set) = ", score)
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
y_true, y_pred = y_test, clf.predict(X_test)
print( classification_report(y_true, y_pred) )
print("\n CONFUSION MATRIX")
print( confusion_matrix(y_true, y_pred) )
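As suggested above, here is a minimal cross-validation sketch using GridSearchCV on the digits training set (the grid of values below is an illustrative assumption, not an optimal choice):

In [ ]:
from sklearn.model_selection import GridSearchCV

# Cross-validated grid-search over the 3 main Decision Tree hyper-parameters
# (grid values chosen for illustration only)
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [5, 10, 15, None],
              'min_samples_split': [2, 4, 8]}
grid = GridSearchCV(tree.DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best hyper-parameters found by cross-validation: ", grid.best_params_)
print("Best cross-validation accuracy = ", grid.best_score_)
print("Accuracy on test set = ", grid.score(X_test, y_test))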

3. Building, training and evaluating a Random Forest classifier

The SciKit-learn class for Random Forest classifiers is sklearn.ensemble.RandomForestClassifier.

Please FIRST READ (and understand!) the RandomForestClassifier DOCUMENTATION to understand all the parameters of the constructor.

Then you can begin by running the code block below, in which a default set of parameter values is used. As you will see, a Random Forest (even a rather small one) can easily outperform a single Decision Tree.

Then, check the influence of the MAIN parameters of the Random Forest classifier, i.e.:

  • n_estimators (number of trees in forest)
  • max_depth
  • max_features (maximum number of features considered when looking for each split)

Finally, find a somewhat optimized setting of the above 3 main parameters, by using CROSS-VALIDATION (a minimal sketch is given after the code cell below).

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# Create and train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,
                             min_samples_split=2, min_samples_leaf=1, 
                             min_weight_fraction_leaf=0.0, max_features='sqrt',
                             max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True,
                             oob_score=False, n_jobs=1, random_state=None, 
                             verbose=0, warm_start=False, class_weight=None)
clf = clf.fit(X_train, y_train)
print("n_estimators=", clf.n_estimators, " max_depth=",clf.max_depth,
      "max_features=", clf.max_features)

# Evaluate accuracy on test data
print(clf)
score = clf.score(X_test, y_test)
print("Accuracy (on test set) = ", score)
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
y_true, y_pred = y_test, clf.predict(X_test)
print( classification_report(y_true, y_pred) )
print("\n CONFUSION MATRIX")
print( confusion_matrix(y_true, y_pred) )
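As mentioned above, here is a minimal cross-validation sketch for the 3 main Random Forest hyper-parameters (the grid of values is an illustrative assumption, and the search may take a little while to run):

In [ ]:
from sklearn.model_selection import GridSearchCV

# Cross-validated grid-search over the 3 main Random Forest hyper-parameters
# (grid values chosen for illustration only)
param_grid = {'n_estimators': [10, 30, 100],
              'max_depth': [5, 10, None],
              'max_features': ['sqrt', 16, 64]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best hyper-parameters found by cross-validation: ", grid.best_params_)
print("Best cross-validation accuracy = ", grid.best_score_)
print("Accuracy on test set = ", grid.score(X_test, y_test))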

4. Building, training and evaluating an AdaBoost classifier

The SciKit-learn class for adaBoost is sklearn.ensemble.AdaBoostClassifier.

Please FIRST READ (and understand!) the AdaBoostClassifier DOCUMENTATION to understand all the parameters of the constructor.

Then begin by running the code block below, in which a default set of parameter values is used. Look at the training curves: you can see that the training error goes down to zero rather quickly, while the test error keeps decreasing with further boosting iterations.

Then, check the influence of MAIN parameters for adaBoost classifier, i.e.:

  • base_estimator (i.e. the type of Weak Classifier/Learner; renamed to estimator in recent scikit-learn versions)
  • n_estimators (number of boosting iterations, and therefore also number of weak classifiers)
  • algorithm

In particular, check which other types of classifiers can be used as Weak Classifier with the AdaBoost implementation of SciKit-Learn (a small check sketch is given after the code cell below).

NB: in principle it is possible to use MLP classifiers as weak classifiers, but not with the SciKit-Learn implementation of MLPClassifier (because its fit method does not support example weighting).

In [ ]:
from sklearn.ensemble import AdaBoostClassifier

# Create and train an adaBoost classifier using SMALL Decision Trees as weak classifiers
weak_learner = tree.DecisionTreeClassifier(max_depth=6)
clf = AdaBoostClassifier(weak_learner, n_estimators=60, learning_rate=1.0, algorithm='SAMME', 
                         random_state=None)
clf = clf.fit(X_train, y_train)
print("Weak_learner:", clf.base_estimator)
print("Weights of weak classifiers: ", clf.estimator_weights_)
      
# Plot training curves (error = f(iterations))
n_iter = clf.n_estimators
from sklearn.metrics import zero_one_loss
ada_train_err = np.zeros((clf.n_estimators,))
for i, y_pred in enumerate(clf.staged_predict(X_train)):
    ada_train_err[i] = zero_one_loss(y_pred, y_train)
ada_test_err = np.zeros((clf.n_estimators,))
for i, y_pred in enumerate(clf.staged_predict(X_test)):
    ada_test_err[i] = zero_one_loss(y_pred, y_test)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(np.arange(n_iter) + 1, ada_train_err,
        label='Training Error',
        color='green')
ax.plot(np.arange(n_iter) + 1, ada_test_err,
        label='Test Error',
        color='orange')
ax.set_ylim((0.0, 0.5))
ax.set_xlabel('boosting iterations')
ax.set_ylabel('error rate')
leg = ax.legend(loc='upper right', fancybox=True)
plt.show()

# Evaluate accuracy on test data
print("n_estimators=", clf.n_estimators)
score = clf.score(X_test, y_test)
print("Accuracy (on test set) = ", score)
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
y_true, y_pred = y_test, clf.predict(X_test)
print( classification_report(y_true, y_pred) )
print("\n CONFUSION MATRIX")
print( confusion_matrix(y_true, y_pred) )
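As suggested above, one simple way to check whether a given classifier can serve as a weak learner is to test whether its fit method accepts a sample_weight argument, which AdaBoost needs for re-weighting the examples. Below is a minimal check sketch using SciKit-Learn's has_fit_parameter utility (the list of candidate classifiers is only an illustrative assumption):

In [ ]:
from sklearn.utils.validation import has_fit_parameter
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# A classifier can be used as an AdaBoost weak learner only if its fit()
# accepts a sample_weight argument (SAMME.R additionally requires predict_proba)
for candidate in [tree.DecisionTreeClassifier(max_depth=1), GaussianNB(),
                  SVC(), MLPClassifier()]:
    print(candidate.__class__.__name__, ": supports sample_weight =",
          has_fit_parameter(candidate, "sample_weight"))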
In [ ]: