{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "\n", "# Training a Decision Tree or a Random Forest on a classification problem\n", "\n", "**Author: Pr Fabien MOUTARDE, Robotics Lab, MINES ParisTech, PSL Research University**\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 1. Decision Trees with SciKit-Learn on a very simple dataset\n", "\n", "**We will first work on very simple classic dataset: Iris, which is a classification problem corresponding to determination of iris flower sub-species based on a few geometric characteristics of the flower.**\n", "\n", "**Please FIRST READ the [*Iris DATASET DESCRIPTION*](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#sphx-glr-auto-examples-datasets-plot-iris-dataset-py).**\n", "In this classification problem, there are 3 classes, with a total of 150 examples (each one with 4 input). Please **now execute code cell below to load and view the dataset**.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "import numpy as np\n", "\n", "from matplotlib import pyplot as plt\n", "from matplotlib.colors import ListedColormap\n", "\n", "from sklearn import preprocessing \n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Load Iris classification dataset\n", "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "\n", "# Print all 150 examples\n", "print(\"(Number_of_examples, example_size) = \" , iris.data.shape, \"\\n\")\n", "for i in range(0, 150) :\n", " print('Input = ', iris.data[i], ' , Label = ', iris.target[i] )\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "**Building, training and evaluating a simple Decision Tree classifier**\n", "\n", "The SciKit-learn class for Decision Tree classifiers is sklearn.tree.DecisionTreeClassifier.\n", "\n", "**Please FIRST READ (and understand!) 
the [*DecisionTreeClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) to understand all parameters of the constructor.**\n", "\n", "**You can then begin by running the code block below, in which a default set of parameter values is used.** If the graphical view works, look at the structure of the learnt decision tree.\n", "\n", "**Then, check the influence of the MAIN parameters of the Decision Tree classifier, i.e.:**\n", " - **homogeneity criterion ('gini' or 'entropy')**\n", " - **max_depth**\n", " - **min_samples_split**\n", " \n", "NB: post-training *PRUNING* is unfortunately *NOT* implemented in SciKit-Learn Decision Trees :( Note, however, that recent versions (>= 0.22) provide minimal cost-complexity pruning through the ccp_alpha constructor parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Split the dataset into training and test parts\n", "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)\n", "\n", "# Learn a Decision Tree\n", "from sklearn import tree\n", "clf = tree.DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=5,\n", "                                  min_samples_split=2, min_samples_leaf=1,\n", "                                  min_weight_fraction_leaf=0.0, max_features=None,\n", "                                  random_state=None, max_leaf_nodes=None,\n", "                                  min_impurity_decrease=0.0, class_weight=None)\n", "clf = clf.fit(X_train, y_train)\n", "\n", "# Graphical view of the learnt Decision Tree (requires the pydotplus and graphviz packages)\n", "#import pydotplus\n", "#dot_data = tree.export_graphviz(clf, out_file=None)\n", "#graph = pydotplus.graph_from_dot_data(dot_data)\n", "#graph.write_pdf(\"iris.pdf\")\n", "#from IPython.display import Image\n", "#Image(graph.create_png())\n", "\n", "# Evaluate accuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Accuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2. Decision Trees on a MORE REALISTIC DATASET: HANDWRITTEN DIGITS\n", "\n", "**Please FIRST READ the [*Digits DATASET DESCRIPTION*](http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html#sphx-glr-auto-examples-datasets-plot-digits-last-image-py).**\n", "\n", "In this classification problem, there are 10 classes and a total of 1797 examples (each one being a 64-dimensional vector corresponding to an 8x8 pixmap). Please **now execute the code cell below to load the dataset, visualize a typical example, and train a Decision Tree on it**.\n", "The original code uses a **SUBOPTIMAL set of hyperparameter values. Try to play with them in order to improve accuracy.**\n", "\n", "Finally, **find a somewhat optimized setting of the 3 main hyperparameters for Decision Tree learning by using CROSS-VALIDATION** (see the cross-validation example from the Multi-Layer Perceptron notebook used in an earlier practical session); a minimal sketch of such a search is given right below.\n", "\n", "Look at the final accuracy statistics, and also at the confusion matrix: which digits are most often confused with each other?"
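] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "*A minimal sketch of one possible cross-validated search over the three main Decision Tree hyperparameters, using GridSearchCV with its default stratified 5-fold splits. It is self-contained (it reloads the digits data); the variable names (Xcv, ycv, grid_dt) and the grid values are illustrative choices, not prescribed ones. You may run it after the basic training cell that follows.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Minimal sketch of a cross-validated grid search for a Decision Tree\n", "# (grid values are illustrative choices, not prescribed ones)\n", "from sklearn.datasets import load_digits\n", "from sklearn.model_selection import GridSearchCV, train_test_split\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "digits_cv = load_digits()\n", "Xcv = digits_cv.images.reshape((len(digits_cv.images), -1))\n", "ycv = digits_cv.target\n", "Xcv_train, Xcv_test, ycv_train, ycv_test = train_test_split(Xcv, ycv, test_size=0.5)\n", "\n", "# Grid over the 3 main Decision Tree hyperparameters\n", "param_grid = {'criterion': ['gini', 'entropy'],\n", "              'max_depth': [5, 10, 20, None],\n", "              'min_samples_split': [2, 4, 8]}\n", "\n", "# GridSearchCV performs (stratified) 5-fold cross-validation on the training part\n", "grid_dt = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)\n", "grid_dt.fit(Xcv_train, ycv_train)\n", "print('Best parameters:', grid_dt.best_params_)\n", "print('Best cross-validated accuracy:', grid_dt.best_score_)\n", "print('Accuracy of best tree on held-out test set:', grid_dt.score(Xcv_test, ycv_test))\n"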
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "n_samples = len(digits.images)\n", "print(\"Number_of-examples = \", n_samples)\n", "\n", "import matplotlib.pyplot as plt\n", "print(\"\\n Plot of first example\")\n", "plt.gray() \n", "plt.matshow(digits.images[0]) \n", "plt.show() \n", "\n", "# Flatten the images, to turn data in a (samples, feature) matrix:\n", "data = digits.images.reshape((n_samples, -1))\n", "\n", "# Split dataset into training and test part\n", "X = data\n", "y = digits.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)\n", "\n", "# Create and train a Decision Tree Classifier\n", "clf = tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=5, \n", " min_samples_split=4, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features=None, \n", " random_state=None, max_leaf_nodes=None, \n", " min_impurity_split=1e-07, class_weight=None, presort=False)\n", "clf = clf.fit(X_train, y_train)\n", "\n", "# Evaluate acuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Acuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 3. Building, training and evaluating a Random Forest classifier\n", "\n", "The SciKit-learn class for Random Forest classifiers is Please sklearn.ensemble.RandomForestClassifier.\n", "\n", "**Please FIRST READ (and understand!) the [*RandomForestClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to understand all parameters of the contructor.**\n", "\n", "**Then you can begin by running the code block below, in which default set of parameter values has been used.** As you will see, a RandomForest (even rather small) can easily outperform single Decision Tree. 
\n", "\n", "**Then, check the influence of MAIN parameters for Random Forest classifier, i.e.:**\n", " - **n_estimators (number of trees in forest)**\n", " - **max_depth**\n", " - **max_features (max number of features used in each tree)**\n", "\n", "**Finally, find a somewhat optimized setting of the above set of 3 main parameters, by using CROSS-VALIDATION.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "# Create and train a Random Forest classifier\n", "clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,\n", " min_samples_split=2, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features='auto', \n", " max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=True, \n", " oob_score=False, n_jobs=1, random_state=None, \n", " verbose=0, warm_start=False, class_weight=None)\n", "clf = clf.fit(X_train, y_train)\n", "print(\"n_estimators=\", clf.n_estimators, \" max_depth=\",clf.max_depth,\n", " \"max_features=\", clf.max_features)\n", "\n", "# Evaluate acuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Acuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "deletable": true, "editable": true }, "source": [ "## 3. Building, training and evaluating an AdaBoost classifier\n", "\n", "The SciKit-learn class for adaBoost is sklearn.ensemble.AdaBoostClassifier.\n", "\n", "**Please FIRST READ (and understand!) the [*AdaBoostClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier) to understand all parameters of the contructor.**\n", "\n", "**Then begin by running the code block below, in which a default set of parameter values has been used.** Look at the training curve: you can see that **training error goes down to zero rather quickly, and that test_error continues to diminish with increasing iterations**.\n", "\n", "**Then, check the influence of MAIN parameters for adaBoost classifier, i.e.:**\n", " - ** base_estimator (ie type of Weak Classifier/Learner)** \n", " - **n_estimators (number of boosting iterations, and therefore also number of weak classifiers)**\n", " - algorithm\n", " \n", "In particular, check which other types of classifiers can be used as Weak Classifier with the adaBoost implementation of SciKit-Learn.\n", "\n", "NB: in principle it is possible to use MLP classifiers as weak classifiers, but not with SciKit-learn implementation of MLPClassifier (because weighting of examples is not handled)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": false }, "outputs": [], "source": [ "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "# Create and train an adaBoost classifier using SMALL Decision Trees as weak classifiers\n", "weak_learner = tree.DecisionTreeClassifier(max_depth=6)\n", "clf = AdaBoostClassifier(weak_learner, n_estimators=60, learning_rate=1.0, algorithm='SAMME', \n", " random_state=None)\n", "clf = clf.fit(X_train, y_train)\n", "print(\"Weak_learner:\", clf.base_estimator)\n", "print(\"Weights of weak classifiers: \", clf.estimator_weights_)\n", " \n", "# Plot training curves (error = f(iterations))\n", "n_iter = clf.n_estimators\n", "from sklearn.metrics import zero_one_loss\n", "ada_train_err = np.zeros((clf.n_estimators,))\n", "for i, y_pred in enumerate(clf.staged_predict(X_train)):\n", " ada_train_err[i] = zero_one_loss(y_pred, y_train)\n", "ada_test_err = np.zeros((clf.n_estimators,))\n", "for i, y_pred in enumerate(clf.staged_predict(X_test)):\n", " ada_test_err[i] = zero_one_loss(y_pred, y_test)\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", "ax.plot(np.arange(n_iter) + 1, ada_train_err,\n", " label='Training Error',\n", " color='green')\n", "ax.plot(np.arange(n_iter) + 1, ada_test_err,\n", " label='Test Error',\n", " color='orange')\n", "ax.set_ylim((0.0, 0.5))\n", "ax.set_xlabel('boosting iterations')\n", "ax.set_ylabel('error rate')\n", "leg = ax.legend(loc='upper right', fancybox=True)\n", "plt.show()\n", "\n", "# Evaluate acuracy on test data\n", "print(\"n_estimators=\", clf.n_estimators)\n", "score = clf.score(X_test, y_test)\n", "print(\"Acuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 0 }