{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "\n", "# Training a Decision Tree or a Random Forest on a classification problem\n", "\n", "**Author: Pr Fabien MOUTARDE, Robotics Lab, MINES ParisTech, PSL Research University**\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 1. Decision Trees with SciKit-Learn on a very simple dataset\n", "\n", "**We will first work on very simple classic dataset: Iris, which is a classification problem corresponding to determination of iris flower sub-species based on a few geometric characteristics of the flower.**\n", "\n", "**Please FIRST READ the [*Iris DATASET DESCRIPTION*](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#sphx-glr-auto-examples-datasets-plot-iris-dataset-py).**\n", "In this classification problem, there are 3 classes, with a total of 150 examples (each one with 4 input). Please **now execute code cell below to load and view the dataset**.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "import numpy as np\n", "\n", "from matplotlib import pyplot as plt\n", "from matplotlib.colors import ListedColormap\n", "\n", "from sklearn import preprocessing \n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Load Iris classification dataset\n", "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "\n", "# Print all 150 examples\n", "print(\"(Number_of_examples, example_size) = \" , iris.data.shape, \"\\n\")\n", "for i in range(0, 150) :\n", " print('Input = ', iris.data[i], ' , Label = ', iris.target[i] )\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "**Building, training and evaluating a simple Decision Tree classifier**\n", "\n", "The SciKit-learn class for Decision Tree classifiers is sklearn.tree.DecisionTreeClassifier.\n", "\n", "**Please FIRST READ (and understand!) 
the [*DecisionTreeClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) to understand all parameters of the constructor.**\n", "\n", "**You can then begin by running the code block below, in which a default set of parameter values is used.** If the graphical view works, look at the structure of the learnt decision tree.\n", "\n", "**Then, check the influence of the MAIN parameters of the Decision Tree classifier, i.e.:**\n", " - **homogeneity criterion ('gini' or 'entropy')**\n", " - **max_depth**\n", " - **min_samples_split**\n", " \n", "NB: post-training *PRUNING* is unfortunately *NOT* implemented in SciKit-Learn Decision Trees :( Note, however, that recent versions (>= 0.22) provide minimal cost-complexity pruning through the ccp_alpha constructor parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Split the dataset into training and test parts\n", "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)\n", "\n", "# Learn a Decision Tree\n", "from sklearn import tree\n", "clf = tree.DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=5,\n", "                                  min_samples_split=2, min_samples_leaf=1,\n", "                                  min_weight_fraction_leaf=0.0, max_features=None,\n", "                                  random_state=None, max_leaf_nodes=None,\n", "                                  min_impurity_decrease=0.0, class_weight=None)\n", "clf = clf.fit(X_train, y_train)\n", "\n", "# Graphical view of the learnt Decision Tree (requires the pydotplus and graphviz packages)\n", "#import pydotplus\n", "#dot_data = tree.export_graphviz(clf, out_file=None)\n", "#graph = pydotplus.graph_from_dot_data(dot_data)\n", "#graph.write_pdf(\"iris.pdf\")\n", "#from IPython.display import Image\n", "#Image(graph.create_png())\n", "\n", "# Evaluate accuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Accuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2. Decision Trees on a MORE REALISTIC DATASET: HANDWRITTEN DIGITS\n", "\n", "**Please FIRST READ the [*Digits DATASET DESCRIPTION*](http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html#sphx-glr-auto-examples-datasets-plot-digits-last-image-py).**\n", "\n", "In this classification problem, there are 10 classes and a total of 1797 examples (each one being a 64-dimensional vector corresponding to an 8x8 pixmap). Please **now execute the code cell below to load the dataset, visualize a typical example, and train a Decision Tree on it**.\n", "The original code uses a **SUBOPTIMAL set of hyperparameter values. Try to play with them in order to improve accuracy.**\n", "\n", "Finally, **find a somewhat optimized setting of the 3 main hyperparameters for Decision Tree learning by using CROSS-VALIDATION** (see the cross-validation example from the Multi-Layer Perceptron notebook used in an earlier practical session); a minimal sketch of such a search is given right below.\n", "\n", "Look at the final accuracy statistics, and also at the confusion matrix: which digits are most often confused with each other?"
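] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "*A minimal sketch of one possible cross-validated search over the three main Decision Tree hyperparameters, using GridSearchCV with its default stratified 5-fold splits. It is self-contained (it reloads the digits data); the variable names (Xcv, ycv, grid_dt) and the grid values are illustrative choices, not prescribed ones. You may run it after the basic training cell that follows.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Minimal sketch of a cross-validated grid search for a Decision Tree\n", "# (grid values are illustrative choices, not prescribed ones)\n", "from sklearn.datasets import load_digits\n", "from sklearn.model_selection import GridSearchCV, train_test_split\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "digits_cv = load_digits()\n", "Xcv = digits_cv.images.reshape((len(digits_cv.images), -1))\n", "ycv = digits_cv.target\n", "Xcv_train, Xcv_test, ycv_train, ycv_test = train_test_split(Xcv, ycv, test_size=0.5)\n", "\n", "# Grid over the 3 main Decision Tree hyperparameters\n", "param_grid = {'criterion': ['gini', 'entropy'],\n", "              'max_depth': [5, 10, 20, None],\n", "              'min_samples_split': [2, 4, 8]}\n", "\n", "# GridSearchCV performs (stratified) 5-fold cross-validation on the training part\n", "grid_dt = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)\n", "grid_dt.fit(Xcv_train, ycv_train)\n", "print('Best parameters:', grid_dt.best_params_)\n", "print('Best cross-validated accuracy:', grid_dt.best_score_)\n", "print('Accuracy of best tree on held-out test set:', grid_dt.score(Xcv_test, ycv_test))\n"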
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "n_samples = len(digits.images)\n", "print(\"Number_of-examples = \", n_samples)\n", "\n", "import matplotlib.pyplot as plt\n", "print(\"\\n Plot of first example\")\n", "plt.gray() \n", "plt.matshow(digits.images[0]) \n", "plt.show() \n", "\n", "# Flatten the images, to turn data in a (samples, feature) matrix:\n", "data = digits.images.reshape((n_samples, -1))\n", "\n", "# Split dataset into training and test part\n", "X = data\n", "y = digits.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)\n", "\n", "# Create and train a Decision Tree Classifier\n", "clf = tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=5, \n", " min_samples_split=4, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features=None, \n", " random_state=None, max_leaf_nodes=None, \n", " min_impurity_split=1e-07, class_weight=None, presort=False)\n", "clf = clf.fit(X_train, y_train)\n", "\n", "# Evaluate acuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Acuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 3. Building, training and evaluating a Random Forest classifier\n", "\n", "The SciKit-learn class for Random Forest classifiers is Please sklearn.ensemble.RandomForestClassifier.\n", "\n", "**Please FIRST READ (and understand!) the [*RandomForestClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to understand all parameters of the contructor.**\n", "\n", "**Then you can begin by running the code block below, in which default set of parameter values has been used.** As you will see, a RandomForest (even rather small) can easily outperform single Decision Tree. 
\n", "\n", "**Then, check the influence of MAIN parameters for Random Forest classifier, i.e.:**\n", " - **n_estimators (number of trees in forest)**\n", " - **max_depth**\n", " - **max_features (max number of features used in each tree)**\n", "\n", "**Finally, find a somewhat optimized setting of the above set of 3 main parameters, by using CROSS-VALIDATION.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "# Create and train a Random Forest classifier\n", "clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,\n", " min_samples_split=2, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features='auto', \n", " max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=True, \n", " oob_score=False, n_jobs=1, random_state=None, \n", " verbose=0, warm_start=False, class_weight=None)\n", "clf = clf.fit(X_train, y_train)\n", "print(\"n_estimators=\", clf.n_estimators, \" max_depth=\",clf.max_depth,\n", " \"max_features=\", clf.max_features)\n", "\n", "# Evaluate acuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Acuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "deletable": true, "editable": true }, "source": [ "## 3. Building, training and evaluating an AdaBoost classifier\n", "\n", "The SciKit-learn class for adaBoost is sklearn.ensemble.AdaBoostClassifier.\n", "\n", "**Please FIRST READ (and understand!) the [*AdaBoostClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier) to understand all parameters of the contructor.**\n", "\n", "**Then begin by running the code block below, in which a default set of parameter values has been used.** Look at the training curve: you can see that **training error goes down to zero rather quickly, and that test_error continues to diminish with increasing iterations**.\n", "\n", "**Then, check the influence of MAIN parameters for adaBoost classifier, i.e.:**\n", " - ** base_estimator (ie type of Weak Classifier/Learner)** \n", " - **n_estimators (number of boosting iterations, and therefore also number of weak classifiers)**\n", " - algorithm\n", " \n", "In particular, check which other types of classifiers can be used as Weak Classifier with the adaBoost implementation of SciKit-Learn.\n", "\n", "NB: in principle it is possible to use MLP classifiers as weak classifiers, but not with SciKit-learn implementation of MLPClassifier (because weighting of examples is not handled)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": false }, "outputs": [], "source": [ "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "# Create and train an adaBoost classifier using SMALL Decision Trees as weak classifiers\n", "weak_learner = tree.DecisionTreeClassifier(max_depth=6)\n", "clf = AdaBoostClassifier(weak_learner, n_estimators=60, learning_rate=1.0, algorithm='SAMME', \n", " random_state=None)\n", "clf = clf.fit(X_train, y_train)\n", "print(\"Weak_learner:\", clf.base_estimator)\n", "print(\"Weights of weak classifiers: \", clf.estimator_weights_)\n", " \n", "# Plot training curves (error = f(iterations))\n", "n_iter = clf.n_estimators\n", "from sklearn.metrics import zero_one_loss\n", "ada_train_err = np.zeros((clf.n_estimators,))\n", "for i, y_pred in enumerate(clf.staged_predict(X_train)):\n", " ada_train_err[i] = zero_one_loss(y_pred, y_train)\n", "ada_test_err = np.zeros((clf.n_estimators,))\n", "for i, y_pred in enumerate(clf.staged_predict(X_test)):\n", " ada_test_err[i] = zero_one_loss(y_pred, y_test)\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "ax.plot(np.arange(n_iter) + 1, ada_train_err,\n", " label='Training Error',\n", " color='green')\n", "ax.plot(np.arange(n_iter) + 1, ada_test_err,\n", " label='Test Error',\n", " color='orange')\n", "ax.set_ylim((0.0, 0.5))\n", "ax.set_xlabel('boosting iterations')\n", "ax.set_ylabel('error rate')\n", "leg = ax.legend(loc='upper right', fancybox=True)\n", "plt.show()\n", "\n", "# Evaluate acuracy on test data\n", "print(\"n_estimators=\", clf.n_estimators)\n", "score = clf.score(X_test, y_test)\n", "print(\"Acuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 0 }