{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Training a Decision Tree or a Random Forest on a classification problem, and comparing the latter with AdaBoost\n", "\n", "**Author: Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL Université Paris**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Decision Trees with SciKit-Learn on a very simple dataset\n", "\n", "**We will first work on a very simple classic dataset: Iris, a classification problem that consists in determining the iris flower sub-species from a few geometric characteristics of the flower.**\n", "\n", "**Please FIRST READ the [*Iris DATASET DESCRIPTION*](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#sphx-glr-auto-examples-datasets-plot-iris-dataset-py).**\n", "In this classification problem, there are 3 classes, with a total of 150 examples (each one with 4 inputs). Please **now execute the code cell below to load and view the dataset**.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(Number_of_examples, example_size) = (150, 4) \n", "\n", "Input = [5.1 3.5 1.4 0.2] , Label = 0\n", "Input = [4.9 3. 1.4 0.2] , Label = 0\n", "Input = [4.7 3.2 1.3 0.2] , Label = 0\n", "Input = [4.6 3.1 1.5 0.2] , Label = 0\n", "Input = [5. 3.6 1.4 0.2] , Label = 0\n", "Input = [5.4 3.9 1.7 0.4] , Label = 0\n", "Input = [4.6 3.4 1.4 0.3] , Label = 0\n", "Input = [5. 3.4 1.5 0.2] , Label = 0\n", "Input = [4.4 2.9 1.4 0.2] , Label = 0\n", "Input = [4.9 3.1 1.5 0.1] , Label = 0\n", "Input = [5.4 3.7 1.5 0.2] , Label = 0\n", "Input = [4.8 3.4 1.6 0.2] , Label = 0\n", "Input = [4.8 3. 1.4 0.1] , Label = 0\n", "Input = [4.3 3. 1.1 0.1] , Label = 0\n", "Input = [5.8 4. 
1.2 0.2] , Label = 0\n", "Input = [5.7 4.4 1.5 0.4] , Label = 0\n", "Input = [5.4 3.9 1.3 0.4] , Label = 0\n", "Input = [5.1 3.5 1.4 0.3] , Label = 0\n", "Input = [5.7 3.8 1.7 0.3] , Label = 0\n", "Input = [5.1 3.8 1.5 0.3] , Label = 0\n", "Input = [5.4 3.4 1.7 0.2] , Label = 0\n", "Input = [5.1 3.7 1.5 0.4] , Label = 0\n", "Input = [4.6 3.6 1. 0.2] , Label = 0\n", "Input = [5.1 3.3 1.7 0.5] , Label = 0\n", "Input = [4.8 3.4 1.9 0.2] , Label = 0\n", "Input = [5. 3. 1.6 0.2] , Label = 0\n", "Input = [5. 3.4 1.6 0.4] , Label = 0\n", "Input = [5.2 3.5 1.5 0.2] , Label = 0\n", "Input = [5.2 3.4 1.4 0.2] , Label = 0\n", "Input = [4.7 3.2 1.6 0.2] , Label = 0\n", "Input = [4.8 3.1 1.6 0.2] , Label = 0\n", "Input = [5.4 3.4 1.5 0.4] , Label = 0\n", "Input = [5.2 4.1 1.5 0.1] , Label = 0\n", "Input = [5.5 4.2 1.4 0.2] , Label = 0\n", "Input = [4.9 3.1 1.5 0.2] , Label = 0\n", "Input = [5. 3.2 1.2 0.2] , Label = 0\n", "Input = [5.5 3.5 1.3 0.2] , Label = 0\n", "Input = [4.9 3.6 1.4 0.1] , Label = 0\n", "Input = [4.4 3. 1.3 0.2] , Label = 0\n", "Input = [5.1 3.4 1.5 0.2] , Label = 0\n", "Input = [5. 3.5 1.3 0.3] , Label = 0\n", "Input = [4.5 2.3 1.3 0.3] , Label = 0\n", "Input = [4.4 3.2 1.3 0.2] , Label = 0\n", "Input = [5. 3.5 1.6 0.6] , Label = 0\n", "Input = [5.1 3.8 1.9 0.4] , Label = 0\n", "Input = [4.8 3. 1.4 0.3] , Label = 0\n", "Input = [5.1 3.8 1.6 0.2] , Label = 0\n", "Input = [4.6 3.2 1.4 0.2] , Label = 0\n", "Input = [5.3 3.7 1.5 0.2] , Label = 0\n", "Input = [5. 3.3 1.4 0.2] , Label = 0\n", "Input = [7. 3.2 4.7 1.4] , Label = 1\n", "Input = [6.4 3.2 4.5 1.5] , Label = 1\n", "Input = [6.9 3.1 4.9 1.5] , Label = 1\n", "Input = [5.5 2.3 4. 1.3] , Label = 1\n", "Input = [6.5 2.8 4.6 1.5] , Label = 1\n", "Input = [5.7 2.8 4.5 1.3] , Label = 1\n", "Input = [6.3 3.3 4.7 1.6] , Label = 1\n", "Input = [4.9 2.4 3.3 1. ] , Label = 1\n", "Input = [6.6 2.9 4.6 1.3] , Label = 1\n", "Input = [5.2 2.7 3.9 1.4] , Label = 1\n", "Input = [5. 2. 3.5 1. 
] , Label = 1\n", "Input = [5.9 3. 4.2 1.5] , Label = 1\n", "Input = [6. 2.2 4. 1. ] , Label = 1\n", "Input = [6.1 2.9 4.7 1.4] , Label = 1\n", "Input = [5.6 2.9 3.6 1.3] , Label = 1\n", "Input = [6.7 3.1 4.4 1.4] , Label = 1\n", "Input = [5.6 3. 4.5 1.5] , Label = 1\n", "Input = [5.8 2.7 4.1 1. ] , Label = 1\n", "Input = [6.2 2.2 4.5 1.5] , Label = 1\n", "Input = [5.6 2.5 3.9 1.1] , Label = 1\n", "Input = [5.9 3.2 4.8 1.8] , Label = 1\n", "Input = [6.1 2.8 4. 1.3] , Label = 1\n", "Input = [6.3 2.5 4.9 1.5] , Label = 1\n", "Input = [6.1 2.8 4.7 1.2] , Label = 1\n", "Input = [6.4 2.9 4.3 1.3] , Label = 1\n", "Input = [6.6 3. 4.4 1.4] , Label = 1\n", "Input = [6.8 2.8 4.8 1.4] , Label = 1\n", "Input = [6.7 3. 5. 1.7] , Label = 1\n", "Input = [6. 2.9 4.5 1.5] , Label = 1\n", "Input = [5.7 2.6 3.5 1. ] , Label = 1\n", "Input = [5.5 2.4 3.8 1.1] , Label = 1\n", "Input = [5.5 2.4 3.7 1. ] , Label = 1\n", "Input = [5.8 2.7 3.9 1.2] , Label = 1\n", "Input = [6. 2.7 5.1 1.6] , Label = 1\n", "Input = [5.4 3. 4.5 1.5] , Label = 1\n", "Input = [6. 3.4 4.5 1.6] , Label = 1\n", "Input = [6.7 3.1 4.7 1.5] , Label = 1\n", "Input = [6.3 2.3 4.4 1.3] , Label = 1\n", "Input = [5.6 3. 4.1 1.3] , Label = 1\n", "Input = [5.5 2.5 4. 1.3] , Label = 1\n", "Input = [5.5 2.6 4.4 1.2] , Label = 1\n", "Input = [6.1 3. 4.6 1.4] , Label = 1\n", "Input = [5.8 2.6 4. 1.2] , Label = 1\n", "Input = [5. 2.3 3.3 1. ] , Label = 1\n", "Input = [5.6 2.7 4.2 1.3] , Label = 1\n", "Input = [5.7 3. 4.2 1.2] , Label = 1\n", "Input = [5.7 2.9 4.2 1.3] , Label = 1\n", "Input = [6.2 2.9 4.3 1.3] , Label = 1\n", "Input = [5.1 2.5 3. 1.1] , Label = 1\n", "Input = [5.7 2.8 4.1 1.3] , Label = 1\n", "Input = [6.3 3.3 6. 2.5] , Label = 2\n", "Input = [5.8 2.7 5.1 1.9] , Label = 2\n", "Input = [7.1 3. 5.9 2.1] , Label = 2\n", "Input = [6.3 2.9 5.6 1.8] , Label = 2\n", "Input = [6.5 3. 5.8 2.2] , Label = 2\n", "Input = [7.6 3. 
6.6 2.1] , Label = 2\n", "Input = [4.9 2.5 4.5 1.7] , Label = 2\n", "Input = [7.3 2.9 6.3 1.8] , Label = 2\n", "Input = [6.7 2.5 5.8 1.8] , Label = 2\n", "Input = [7.2 3.6 6.1 2.5] , Label = 2\n", "Input = [6.5 3.2 5.1 2. ] , Label = 2\n", "Input = [6.4 2.7 5.3 1.9] , Label = 2\n", "Input = [6.8 3. 5.5 2.1] , Label = 2\n", "Input = [5.7 2.5 5. 2. ] , Label = 2\n", "Input = [5.8 2.8 5.1 2.4] , Label = 2\n", "Input = [6.4 3.2 5.3 2.3] , Label = 2\n", "Input = [6.5 3. 5.5 1.8] , Label = 2\n", "Input = [7.7 3.8 6.7 2.2] , Label = 2\n", "Input = [7.7 2.6 6.9 2.3] , Label = 2\n", "Input = [6. 2.2 5. 1.5] , Label = 2\n", "Input = [6.9 3.2 5.7 2.3] , Label = 2\n", "Input = [5.6 2.8 4.9 2. ] , Label = 2\n", "Input = [7.7 2.8 6.7 2. ] , Label = 2\n", "Input = [6.3 2.7 4.9 1.8] , Label = 2\n", "Input = [6.7 3.3 5.7 2.1] , Label = 2\n", "Input = [7.2 3.2 6. 1.8] , Label = 2\n", "Input = [6.2 2.8 4.8 1.8] , Label = 2\n", "Input = [6.1 3. 4.9 1.8] , Label = 2\n", "Input = [6.4 2.8 5.6 2.1] , Label = 2\n", "Input = [7.2 3. 5.8 1.6] , Label = 2\n", "Input = [7.4 2.8 6.1 1.9] , Label = 2\n", "Input = [7.9 3.8 6.4 2. ] , Label = 2\n", "Input = [6.4 2.8 5.6 2.2] , Label = 2\n", "Input = [6.3 2.8 5.1 1.5] , Label = 2\n", "Input = [6.1 2.6 5.6 1.4] , Label = 2\n", "Input = [7.7 3. 6.1 2.3] , Label = 2\n", "Input = [6.3 3.4 5.6 2.4] , Label = 2\n", "Input = [6.4 3.1 5.5 1.8] , Label = 2\n", "Input = [6. 3. 4.8 1.8] , Label = 2\n", "Input = [6.9 3.1 5.4 2.1] , Label = 2\n", "Input = [6.7 3.1 5.6 2.4] , Label = 2\n", "Input = [6.9 3.1 5.1 2.3] , Label = 2\n", "Input = [5.8 2.7 5.1 1.9] , Label = 2\n", "Input = [6.8 3.2 5.9 2.3] , Label = 2\n", "Input = [6.7 3.3 5.7 2.5] , Label = 2\n", "Input = [6.7 3. 5.2 2.3] , Label = 2\n", "Input = [6.3 2.5 5. 1.9] , Label = 2\n", "Input = [6.5 3. 5.2 2. ] , Label = 2\n", "Input = [6.2 3.4 5.4 2.3] , Label = 2\n", "Input = [5.9 3. 
5.1 1.8] , Label = 2\n" ] } ], "source": [ "import numpy as np\n", "\n", "from matplotlib import pyplot as plt\n", "from matplotlib.colors import ListedColormap\n", "\n", "from sklearn import preprocessing \n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Load Iris classification dataset\n", "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "\n", "# Print all 150 examples\n", "print(\"(Number_of_examples, example_size) = \" , iris.data.shape, \"\\n\")\n", "for i in range(len(iris.data)):\n", " print('Input = ', iris.data[i], ' , Label = ', iris.target[i] )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Building, training and evaluating a simple Decision Tree classifier**\n", "\n", "The SciKit-learn class for Decision Tree classifiers is sklearn.tree.DecisionTreeClassifier.\n", "\n", "**Please FIRST READ (and understand!) the [*DecisionTreeClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) to understand all parameters of the constructor.**\n", "\n", "**You can then begin by running the code block below, which mostly uses default parameter values.** If the graphical view works, look at the structure of the learnt decision tree.\n", "\n", "**Then, check the influence of the MAIN parameters of the Decision Tree classifier, i.e.:**\n", " - **homogeneity criterion ('gini' or 'entropy')**\n", " - **max_depth**\n", " - **min_samples_split**\n", " \n", "NB: post-training *PRUNING* is available in SciKit-Learn Decision Trees since version 0.22, through minimal cost-complexity pruning (the `ccp_alpha` parameter of the constructor)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\fabien\\anaconda3_2020-07\\envs\\envML2020\\lib\\site-packages\\sklearn\\tree\\_classes.py:310: FutureWarning: The min_impurity_split parameter is deprecated. 
Its default value has changed from 1e-7 to 0 in version 0.23, and it will be removed in 0.25. Use the min_impurity_decrease parameter instead.\n", " FutureWarning)\n", "C:\\Users\\fabien\\anaconda3_2020-07\\envs\\envML2020\\lib\\site-packages\\sklearn\\tree\\_classes.py:327: FutureWarning: The parameter 'presort' is deprecated and has no effect. It will be removed in v0.24. You can suppress this warning by not passing any value to the 'presort' parameter.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "DecisionTreeClassifier(criterion='entropy', max_depth=5,\n", " min_impurity_split=1e-07, presort=False)\n", "Acuracy (on test set) = 0.9111111111111111\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 20\n", " 1 0.85 0.85 0.85 13\n", " 2 0.83 0.83 0.83 12\n", "\n", " accuracy 0.91 45\n", " macro avg 0.89 0.89 0.89 45\n", "weighted avg 0.91 0.91 0.91 45\n", "\n", "\n", " CONFUSION MATRIX\n", "[[20 0 0]\n", " [ 0 11 2]\n", " [ 0 2 10]]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Split dataset into training and test part\n", "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)\n", "\n", "# Learn a Decision Tree\n", "# (the deprecated min_impurity_split and presort parameters are no longer passed;\n", "# min_impurity_decrease is the documented replacement for min_impurity_split)\n", "from sklearn import tree\n", "clf = tree.DecisionTreeClassifier(criterion='entropy', splitter='best', max_depth=5, \n", " min_samples_split=2, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features=None, \n", " random_state=None, max_leaf_nodes=None, \n", " min_impurity_decrease=0.0, class_weight=None)\n", "clf = clf.fit(X_train, y_train)\n", "\n", "# Graphical view of learnt Decision Tree\n", "tree.plot_tree(clf) \n", "\n", "# Evaluate accuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Accuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Decision Trees on a MORE REALISTIC DATASET: HANDWRITTEN DIGITS\n", "\n", "**Please FIRST READ the [*Digits DATASET DESCRIPTION*](http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html#sphx-glr-auto-examples-datasets-plot-digits-last-image-py).**\n", "\n", "In this classification problem, there are 10 classes, with a total of 1797 examples (each one being a 64D vector corresponding to an 8x8 pixmap). Please **now execute the code cell below to load the dataset, visualize a typical example, and train a Decision Tree on it**. \n", "The original code uses a **deliberately SUBOPTIMAL set of learning hyperparameter values, which reaches roughly 66% to 73% test accuracy depending on the random train/test split. 
Try adjusting them to improve accuracy.**\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number_of-examples = 1797\n", "\n", " Plot of first example\n" ] }, { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAPoAAAECCAYAAADXWsr9AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAL40lEQVR4nO3dW4hd9RXH8d+vY7xGSaxWJBHtSAmIUHNBKgFpNYpWsS81RFCotCQPrRha0NiX4ptPYh+KELxU8IajBoq01gQVEVrtTIz1MrFoiJhEHSWRGAsR4+rD2SkxnTp7xv3/z5mzvh845MzMmb3WzOR39t7n7L2XI0IABtu3ZrsBAOURdCABgg4kQNCBBAg6kABBBxLoi6DbvsL2W7bftr2hcK37bE/Yfr1knSPqnWX7Odvjtt+wfXPhesfbftn2q02920vWa2oO2X7F9lOlazX1dtp+zfY226OFay2w/bjt7c3f8KKCtZY0P9Ph237b6ztZeETM6k3SkKR3JA1LOlbSq5LOK1jvYknLJL1e6ec7U9Ky5v7Jkv5V+OezpPnN/XmSXpL0g8I/468lPSzpqUq/052STqtU6wFJv2juHytpQaW6Q5I+kHR2F8vrhzX6hZLejogdEfG5pEcl/aRUsYh4QdLeUsufpN77EbG1uf+ppHFJiwrWi4g40Hw4r7kVOyrK9mJJV0m6p1SN2WL7FPVWDPdKUkR8HhGfVCp/qaR3IuLdLhbWD0FfJOm9Iz7epYJBmE22z5G0VL21bMk6Q7a3SZqQtDkiSta7S9Itkr4sWONoIekZ22O21xasMyzpI0n3N7sm99g+qWC9I62R9EhXC+uHoHuSzw3ccbm250t6QtL6iNhfslZEHIqICyQtlnSh7fNL1LF9taSJiBgrsfyvsTIilkm6UtIvbV9cqM4x6u3m3R0RSyV9Jqnoa0iSZPtYSddIGulqmf0Q9F2Szjri48WS9sxSL0XYnqdeyB+KiCdr1W02M5+XdEWhEislXWN7p3q7XJfYfrBQrf+KiD3NvxOSNqm3+1fCLkm7jtgiely94Jd2paStEfFhVwvsh6D/Q9L3bH+3eSZbI+lPs9xTZ2xbvX288Yi4s0K9020vaO6fIGmVpO0lakXEbRGxOCLOUe/v9mxEXF+i1mG2T7J98uH7ki6XVOQdlIj4QNJ7tpc0n7pU0pslah3lOnW42S71Nk1mVUR8YftXkv6q3iuN90XEG6Xq2X5E0g8lnWZ7l6TfRcS9peqpt9a7QdJrzX6zJP02Iv5cqN6Zkh6wPaTeE/ljEVHlba9KzpC0qff8qWMkPRwRTxesd5Okh5qV0A5JNxasJdsnSrpM0rpOl9u8lA9ggPXDpjuAwgg6kABBBxIg6EACBB1IoK+CXvhwxlmrRT3qzXa9vgq6pJq/zKp/OOpRbzbr9VvQARRQ5IAZ2wN9FM7ChQun/T0HDx7UcccdN6N6ixZN/2S+vXv36tRTT51Rvf37p3/OzYEDBzR//vwZ1du9e/e0vyci1BwdN22HDh2a0ffNFRHxP7+YWT8Edi5atWpV1Xp33HFH1XpbtmypWm/DhuInhH3Fvn37qtbrB2y6AwkQdCABgg4kQNCBBAg6kABBBxIg6EACBB1IoFXQa45MAtC9KYPeXGTwD+pdgvY8SdfZPq90YwC602aNXnVkEoDutQl6mpFJwKBqc1JLq5FJzYnytc/ZBdBCm6C3GpkUERslbZQG/zRVYK5ps+k+0COTgAymXKPXHpkEoHutLjzRzAkrNSsMQGEcGQckQNCBBAg6kABBBxIg6EACBB1IgKADCRB0IAEmtcxA7ckpw8PDVevNZOTUN7F3796q9VavXl213sjISNV6k2GNDiRA0IEECDqQAEEHEiDoQAIEHUiAoAMJEHQgAYIOJEDQgQTajGS6z/aE7d
drNASge23W6H+UdEXhPgAUNGXQI+IFSXXPOgDQKfbRgQQ6O02V2WtA/+os6MxeA/oXm+5AAm3eXntE0t8kLbG9y/bPy7cFoEtthixeV6MRAOWw6Q4kQNCBBAg6kABBBxIg6EACBB1IgKADCRB0IIGBmL22fPnyqvVqz0I799xzq9bbsWNH1XqbN2+uWq/2/xdmrwGogqADCRB0IAGCDiRA0IEECDqQAEEHEiDoQAIEHUiAoAMJtLk45Fm2n7M9bvsN2zfXaAxAd9oc6/6FpN9ExFbbJ0sas705It4s3BuAjrSZvfZ+RGxt7n8qaVzSotKNAejOtPbRbZ8jaamkl4p0A6CI1qep2p4v6QlJ6yNi/yRfZ/Ya0KdaBd32PPVC/lBEPDnZY5i9BvSvNq+6W9K9ksYj4s7yLQHoWpt99JWSbpB0ie1tze3HhfsC0KE2s9delOQKvQAohCPjgAQIOpAAQQcSIOhAAgQdSICgAwkQdCABgg4kMBCz1xYuXFi13tjYWNV6tWeh1Vb795kRa3QgAYIOJEDQgQQIOpAAQQcSIOhAAgQdSICgAwkQdCABgg4k0OYqsMfbftn2q83stdtrNAagO22OdT8o6ZKIONBc3/1F23+JiL8X7g1AR9pcBTYkHWg+nNfcGNAAzCGt9tFtD9neJmlC0uaIYPYaMIe0CnpEHIqICyQtlnSh7fOPfozttbZHbY923COAb2har7pHxCeSnpd0xSRf2xgRKyJiRTetAehKm1fdT7e9oLl/gqRVkrYX7gtAh9q86n6mpAdsD6n3xPBYRDxVti0AXWrzqvs/JS2t0AuAQjgyDkiAoAMJEHQgAYIOJEDQgQQIOpAAQQcSIOhAAsxem4EtW7ZUrTfoav/99u3bV7VeP2CNDiRA0IEECDqQAEEHEiDoQAIEHUiAoAMJEHQgAYIOJEDQgQRaB70Z4vCKbS4MCcwx01mj3yxpvFQjAMppO5JpsaSrJN1Tth0AJbRdo98l6RZJX5ZrBUApbSa1XC1pIiLGpngcs9eAPtVmjb5S0jW2d0p6VNIlth88+kHMXgP615RBj4jbImJxRJwjaY2kZyPi+uKdAegM76MDCUzrUlIR8bx6Y5MBzCGs0YEECDqQAEEHEiDoQAIEHUiAoAMJEHQgAYIOJDAQs9dqz9Javnx51Xq11Z6FVvv3OTIyUrVeP2CNDiRA0IEECDqQAEEHEiDoQAIEHUiAoAMJEHQgAYIOJEDQgQRaHQLbXOr5U0mHJH3BJZ2BuWU6x7r/KCI+LtYJgGLYdAcSaBv0kPSM7THba0s2BKB7bTfdV0bEHtvfkbTZ9vaIeOHIBzRPADwJAH2o1Ro9IvY0/05I2iTpwkkew+w1oE+1maZ6ku2TD9+XdLmk10s3BqA7bTbdz5C0yfbhxz8cEU8X7QpAp6YMekTskPT9Cr0AKIS314AECDqQAEEHEiDoQAIEHUiAoAMJEHQgAYIOJOCI6H6hdvcL/RrDw8M1y2l0dLRqvXXr1lWtd+2111atV/vvt2LFYJ+OERE++nOs0YEECDqQAEEHEiDoQAIEHUiAoAMJEHQgAYIOJEDQgQQIOpBAq6DbXmD7cdvbbY/bvqh0YwC603aAw+8lPR0RP7V9rKQTC/YEoGNTBt32KZIulvQzSYqIzyV9XrYtAF1qs+k+LOkjSffbfsX2Pc0gh6+wvdb2qO26p3YBmFKboB8jaZmkuyNiqaTPJG04+kGMZAL6V5ug75K0KyJeaj5+XL3gA5gjpgx6RHwg6T3bS5pPXSrpzaJdAehU21fdb5L0UPOK+w5JN5ZrCUDXWgU9IrZJYt8bmKM4Mg5IgKADCRB0IAGCDiRA0IEECDqQAEEHEiDoQAIDMXuttrVr11atd+utt1atNzY2VrXe6tWrq9YbdMxeA5Ii6EACBB1IgKADCRB0IAGCDiRA0IEECDqQAEEHEpgy6LaX2N52xG2/7fUVegPQkSmvGRcRb0m6QJJsD0naLWlT2bYAdG
m6m+6XSnonIt4t0QyAMqYb9DWSHinRCIByWge9uab7NZJG/s/Xmb0G9Km2Axwk6UpJWyPiw8m+GBEbJW2UBv80VWCumc6m+3Visx2Yk1oF3faJki6T9GTZdgCU0HYk078lfbtwLwAK4cg4IAGCDiRA0IEECDqQAEEHEiDoQAIEHUiAoAMJEHQggVKz1z6SNJNz1k+T9HHH7fRDLepRr1a9syPi9KM/WSToM2V7NCJWDFot6lFvtuux6Q4kQNCBBPot6BsHtBb1qDer9fpqHx1AGf22RgdQAEEHEiDoQAIEHUiAoAMJ/AchD47vy2xCkAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "DecisionTreeClassifier(max_depth=5, min_impurity_split=1e-07,\n", " min_samples_split=4, presort=False)\n", "Acuracy (on test set) = 0.728587319243604\n", " precision recall f1-score support\n", "\n", " 0 0.99 0.90 0.94 91\n", " 1 0.65 0.57 0.61 89\n", " 2 0.72 0.80 0.76 74\n", " 3 0.77 0.70 0.74 91\n", " 4 0.75 0.77 0.76 96\n", " 5 0.94 0.80 0.86 90\n", " 6 0.97 0.79 0.88 97\n", " 7 0.97 0.77 0.86 92\n", " 8 0.37 0.28 0.32 93\n", " 9 0.45 0.92 0.61 86\n", "\n", " accuracy 0.73 899\n", " macro avg 0.76 0.73 0.73 899\n", "weighted avg 0.76 0.73 0.73 899\n", "\n", "\n", " CONFUSION MATRIX\n", "[[82 0 0 0 4 3 0 0 0 2]\n", " [ 0 51 4 6 6 0 0 0 18 4]\n", " [ 1 3 59 1 2 0 0 0 7 1]\n", " [ 0 2 3 64 0 1 0 2 6 13]\n", " [ 0 6 1 0 74 0 0 0 2 13]\n", " [ 0 3 1 0 5 72 0 0 3 6]\n", " [ 0 1 5 0 7 0 77 0 7 0]\n", " [ 0 4 3 0 1 0 0 71 2 11]\n", " [ 0 6 5 9 0 0 2 0 26 45]\n", " [ 0 2 1 3 0 1 0 0 0 79]]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\fabien\\anaconda3_2020-07\\envs\\envML2020\\lib\\site-packages\\sklearn\\tree\\_classes.py:310: FutureWarning: The min_impurity_split parameter is deprecated. Its default value has changed from 1e-7 to 0 in version 0.23, and it will be removed in 0.25. Use the min_impurity_decrease parameter instead.\n", " FutureWarning)\n", "C:\\Users\\fabien\\anaconda3_2020-07\\envs\\envML2020\\lib\\site-packages\\sklearn\\tree\\_classes.py:327: FutureWarning: The parameter 'presort' is deprecated and has no effect. It will be removed in v0.24. 
You can suppress this warning by not passing any value to the 'presort' parameter.\n", " FutureWarning)\n" ] } ], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "n_samples = len(digits.images)\n", "print(\"Number_of-examples = \", n_samples)\n", "\n", "import matplotlib.pyplot as plt\n", "print(\"\\n Plot of first example\")\n", "plt.gray() \n", "plt.matshow(digits.images[0]) \n", "plt.show() \n", "\n", "# Flatten the images, to turn data into a (samples, features) matrix:\n", "data = digits.images.reshape((n_samples, -1))\n", "\n", "# Split dataset into training and test part\n", "X = data\n", "y = digits.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)\n", "\n", "# Create and train a Decision Tree Classifier\n", "# (the deprecated min_impurity_split and presort parameters are no longer passed)\n", "clf = tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=5, \n", " min_samples_split=4, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features=None, \n", " random_state=None, max_leaf_nodes=None, \n", " min_impurity_decrease=0.0, class_weight=None)\n", "clf = clf.fit(X_train, y_train)\n", "\n", "\n", "# Evaluate accuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Accuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question: According to the confusion matrix, which digits are most often confused with each other?__\n", "\n", "__Answer:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finally, find somewhat optimized values for the set of 3 main hyper-parameters for DecisionTree learning, by using GRID-SEARCH WITH CROSS-VALIDATION** (see 
cross-validation example from the Multi-Layer Perceptron notebook used in an earlier practical session). __Put the code in the cell below:__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question: What is the best TEST accuracy you have managed to reach with your DecisionTree after properly grid-searching its hyper-parameters using cross-validation?__\n", "\n", "__Answer:__\n", "\n", "\n", "To improve the result, the most natural step is to combine SEVERAL decision trees, using the Ensemble model called Random Forest: see below" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Building, training and evaluating a Random Forest classifier\n", "\n", "The SciKit-learn class for Random Forest classifiers is sklearn.ensemble.RandomForestClassifier.\n", "\n", "**Please FIRST READ (and understand!) the [*RandomForestClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to understand all parameters of the constructor.**\n", "\n", "**Then you can begin by running the code block below, which mostly uses default parameter values (with only 10 trees).** As you will see, a RandomForest (even a rather small one) can easily outperform a single Decision Tree. \n", "\n", "**Then, check the influence of the MAIN parameters of the Random Forest classifier, i.e.:**\n", " - **n_estimators (number of trees in the forest)**\n", " - **max_depth**\n", " - **max_features (max number of features considered when searching for the best split at each node)**\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\fabien\\anaconda3_2020-07\\envs\\envML2020\\lib\\site-packages\\sklearn\\tree\\_classes.py:310: FutureWarning: The min_impurity_split parameter is deprecated. Its default value has changed from 1e-7 to 0 in version 0.23, and it will be removed in 0.25. 
Use the min_impurity_decrease parameter instead.\n", " FutureWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "n_estimators= 10 max_depth= None max_features= auto\n", "RandomForestClassifier(min_impurity_split=1e-07, n_estimators=10, n_jobs=1)\n", "Acuracy (on test set) = 0.9399332591768632\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 91\n", " 1 0.89 0.96 0.92 89\n", " 2 0.90 0.99 0.94 74\n", " 3 0.94 0.89 0.92 91\n", " 4 0.97 0.97 0.97 96\n", " 5 0.95 0.98 0.96 90\n", " 6 0.99 0.98 0.98 97\n", " 7 0.96 0.93 0.95 92\n", " 8 0.86 0.88 0.87 93\n", " 9 0.95 0.83 0.88 86\n", "\n", " accuracy 0.94 899\n", " macro avg 0.94 0.94 0.94 899\n", "weighted avg 0.94 0.94 0.94 899\n", "\n", "\n", " CONFUSION MATRIX\n", "[[91 0 0 0 0 0 0 0 0 0]\n", " [ 0 85 2 0 0 0 0 0 2 0]\n", " [ 0 1 73 0 0 0 0 0 0 0]\n", " [ 0 2 3 81 0 0 0 0 3 2]\n", " [ 0 1 0 0 93 0 1 1 0 0]\n", " [ 0 1 0 0 0 88 0 0 1 0]\n", " [ 0 1 0 0 0 0 95 0 1 0]\n", " [ 0 1 1 0 3 1 0 86 0 0]\n", " [ 0 2 2 2 0 1 0 2 82 2]\n", " [ 0 2 0 3 0 3 0 1 6 71]]\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "# Create and train a Random Forest classifier\n", "# (max_features='sqrt' is what the former 'auto' value meant for classifiers;\n", "# the deprecated min_impurity_split is replaced by min_impurity_decrease)\n", "clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None,\n", " min_samples_split=2, min_samples_leaf=1, \n", " min_weight_fraction_leaf=0.0, max_features='sqrt', \n", " max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, \n", " oob_score=False, n_jobs=1, random_state=None, \n", " verbose=0, warm_start=False, class_weight=None)\n", "clf = clf.fit(X_train, y_train)\n", "print(\"n_estimators=\", clf.n_estimators, \" max_depth=\",clf.max_depth,\n", " \"max_features=\", clf.max_features)\n", "\n", "# Evaluate accuracy on test data\n", "print(clf)\n", "score = clf.score(X_test, y_test)\n", "print(\"Accuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, 
clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finally, find somewhat optimized values for the set of 3 main hyper-parameters for RandomForest, by using CROSS-VALIDATION** (see cross-validation example from the Multi-Layer Perceptron notebook used in an earlier practical session). __Put the code in the cell below:__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question: What is the best TEST accuracy you have managed to reach with your RandomForest after properly grid-searching its hyper-parameters using cross-validation?__\n", "\n", "__Answer:__" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## 4. Building, training and evaluating an AdaBoost classifier\n", "\n", "The SciKit-learn class for AdaBoost is sklearn.ensemble.AdaBoostClassifier.\n", "\n", "**Please FIRST READ (and understand!) 
the [*AdaBoostClassifier DOCUMENTATION*](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier) to understand all parameters of the constructor.**\n", "\n", "**Then begin by running the code block below, which uses depth-6 Decision Trees as weak classifiers and 15 boosting iterations.** \n", "\n", "**Then, check the influence of the MAIN parameters of the AdaBoost classifier, i.e.:**\n", " - **base_estimator (i.e. type of Weak Classifier/Learner; this parameter is named estimator from scikit-learn 1.2 onwards)** \n", " - **n_estimators (number of boosting iterations, and therefore also number of weak classifiers)**\n", " - **algorithm**\n", " \n", "**Finally, check which other types of classifiers can be used as Weak Classifier with the AdaBoost implementation of SciKit-Learn.**\n", "NB: in principle it is possible to use MLP classifiers as weak classifiers, but not with the SciKit-learn implementation of MLPClassifier (because weighting of examples is not handled by its implementation)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Weak_learner: DecisionTreeClassifier(max_depth=6)\n", "Weights of weak classifiers: [4.1662423 5.41474306 4.7238877 4.45785944 4.16955192 4.55695778\n", " 5.33768772 5.42409322 5.7337812 4.21523304 4.44342062 4.90185404\n", " 4.98322751 5.54716878 4.83245137]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "n_estimators= 15\n", "Acuracy (on test set) = 0.9210233592880979\n", " precision recall f1-score support\n", "\n", " 0 0.99 0.98 0.98 91\n", " 1 0.86 0.96 0.90 89\n", " 2 0.96 0.97 0.97 74\n", " 3 0.97 0.81 0.89 91\n", " 4 0.97 0.93 0.95 96\n", " 5 0.93 0.94 0.94 90\n", " 6 0.93 0.93 0.93 97\n", " 7 0.93 0.97 0.95 92\n", " 8 0.85 0.85 0.85 93\n", " 9 0.84 0.88 0.86 86\n", "\n", " accuracy 0.92 899\n", " macro avg 0.92 0.92 0.92 899\n", "weighted avg 0.92 0.92 0.92 899\n", "\n", "\n", " CONFUSION MATRIX\n", "[[89 0 0 0 0 0 2 0 0 0]\n", " [ 0 85 0 0 0 0 0 0 0 4]\n", " [ 1 0 72 0 0 0 0 0 1 0]\n", " [ 0 1 0 74 0 3 1 2 6 4]\n", " [ 0 3 0 0 89 0 3 1 0 0]\n", " [ 0 1 0 0 1 85 0 0 2 1]\n", " [ 0 4 1 0 1 0 90 0 1 0]\n", " [ 0 0 0 0 1 0 0 89 0 2]\n", " [ 0 3 2 1 0 2 1 2 79 3]\n", " [ 0 2 0 1 0 1 0 2 4 76]]\n" ] } ], "source": [ "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "# Create and train an AdaBoost classifier using SMALL Decision Trees as weak classifiers\n", "weak_learner = tree.DecisionTreeClassifier(max_depth=6)\n", "clf = AdaBoostClassifier(weak_learner, n_estimators=15, learning_rate=1.0, algorithm='SAMME', \n", " random_state=None)\n", "clf = clf.fit(X_train, y_train)\n", "# Print the weak learner directly (the attribute holding it was renamed across scikit-learn versions)\n", "print(\"Weak_learner:\", weak_learner)\n", "print(\"Weights of weak classifiers: \", clf.estimator_weights_)\n", " \n", "# Plot training curves (error = f(iterations))\n", "n_iter = clf.n_estimators\n", "from sklearn.metrics import zero_one_loss\n", "ada_train_err = np.zeros((clf.n_estimators,))\n", "for i, y_pred in enumerate(clf.staged_predict(X_train)):\n", " ada_train_err[i] = zero_one_loss(y_train, y_pred)\n", "ada_test_err = np.zeros((clf.n_estimators,))\n", "for i, y_pred in enumerate(clf.staged_predict(X_test)):\n", " ada_test_err[i] = zero_one_loss(y_test, y_pred)\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111)\n", 
"ax.plot(np.arange(n_iter) + 1, ada_train_err,\n", " label='Training Error',\n", " color='green')\n", "ax.plot(np.arange(n_iter) + 1, ada_test_err,\n", " label='Test Error',\n", " color='orange')\n", "ax.set_ylim((0.0, 0.5))\n", "ax.set_xlabel('boosting iterations')\n", "ax.set_ylabel('error rate')\n", "leg = ax.legend(loc='upper right', fancybox=True)\n", "plt.show()\n", "\n", "# Evaluate accuracy on test data\n", "print(\"n_estimators=\", clf.n_estimators)\n", "score = clf.score(X_test, y_test)\n", "print(\"Accuracy (on test set) = \", score)\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "y_true, y_pred = y_test, clf.predict(X_test)\n", "print( classification_report(y_true, y_pred) )\n", "print(\"\\n CONFUSION MATRIX\")\n", "print( confusion_matrix(y_true, y_pred) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ Looking at the training curves, you can see that **the training error drops to zero rather quickly, but that the test error keeps decreasing with further iterations even after the training error has reached zero**. 
__Is this normal, and why?__ (check the course material!)\n", "\n", "__Answer:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now, for the case of _DecisionTree_ weak classifiers, find somewhat optimized values of (max_depth, n_estimators) by using CROSS-VALIDATION.** __Put the code below:__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question: What is the best TEST accuracy you have managed to reach with your AdaBoostClassifier after properly grid-searching its hyper-parameters using cross-validation?__\n", "\n", "__Answer:__" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 1 }