{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ce notebook a été préparé par Fabien Moutarde (MINES ParisTech) en modifiant/combinant 2 notebooks de Cambridge : http://online.cambridgecoding.com/notebooks/cca_admin/deep-learning-for-complete-beginners-recognising-handwritten-digits\n",
    "http://online.cambridgecoding.com/notebooks/cca_admin/convolutional-neural-networks-with-keras\n",
    "\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Welcome to a tutorial to get you quickly up to speed with *deep learning*; from first principles, all the way to discussions of some of the intricate details, with the purposes of achieving respectable performance on one established machine learning benchmark: [MNIST](http://yann.lecun.com/exdb/mnist/) (classification of handwritten digits).\n",
    "\n",
    "\n",
    "MNIST                      \n",
    "![](http://perso.mines-paristech.fr/fabien.moutarde/ES_MachineLearning/TP_convNets/mnist.png)  \n",
    "\n",
    "By the end of this part of the tutorial, you should be capable of understanding and producing a simple CNN (with a structure similar to LeNet architecture) in Keras, achieving a respectable level of accuracy on MNIST.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Convolutions\n",
    "\n",
    "It turns out that there is a very efficient way of pulling this off, and it makes advantage of the structure of the information encoded within an image---it is assumed that pixels that are spatially *closer* together will \"cooperate\" on forming a particular feature of interest much more than ones on opposite corners of the image. Also, if a particular (smaller) feature is found to be of great importance when defining an image's label, it will be equally important if this feature was found anywhere within the image, regardless of location.\n",
    "\n",
    "Enter the **convolution** operator. Given a two-dimensional image, $\\bf I$, and a small matrix, $\\bf K$ of size $h \\times w$, (known as a *convolution kernel*), which we assume encodes a way of extracting an interesting image feature, we compute the convolved image, ${\\bf I} * {\\bf K}$, by overlaying the kernel on top of the image in all possible ways, and recording the sum of elementwise products between the image and the kernel:\n",
    "\n",
    "$$({\\bf I} * {\\bf K})_{xy} = \\sum_{i=1}^h \\sum_{j=1}^w {{\\bf K}_{ij} \\cdot {\\bf I}_{x + i - 1, y + j - 1}}$$\n",
    "\n",
    "(in fact, the exact definition would require us to flip the kernel matrix first, but for the purposes of machine learning it is irrelevant whether this is done)\n",
    "\n",
    "The images below show a diagrammatical overview of the above formula and the result of applying convolution (with two separate kernels) over an image, to act as an edge detector:\n",
    "\n",
    "![](http://perso.mines-paristech.fr/fabien.moutarde/ES_MachineLearning/TP_convNets/convolve.png)\n",
    "![](http://perso.mines-paristech.fr/fabien.moutarde/ES_MachineLearning/TP_convNets/lena.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convolutional and pooling layers\n",
    "\n",
    "The convolution operator forms the fundamental basis of the **convolutional** layer of a CNN. The layer is completely specified by a certain number of kernels, $\\bf \\vec{K}$ (along with additive biases, $\\vec{b}$, per each kernel), and it operates by computing the convolution of the output images of a previous layer with each of those kernels, afterwards adding the biases (one per each output image). Finally, an activation function, $\\sigma$, may be applied to all of the pixels of the output images. Typically, the input to a convolutional layer will have $d$ *channels* (e.g. red/green/blue in the input layer), in which case the kernels are extended to have this number of channels as well, making the final formula of a single output image channel of a convolutional layer (for a kernel ${\\bf K}$ and bias $b$) as follows:\n",
    "\n",
    "$$\\mathrm{conv}({\\bf I}, {\\bf K})_{xy} = \\sigma\\left(b + \\sum_{i=1}^h \\sum_{j=1}^w \\sum_{k=1}^d {{\\bf K}_{ijk} \\cdot {\\bf I}_{x + i - 1, y + j - 1, k}}\\right)$$\n",
    "\n",
    "Note that, since all we're doing here is addition and scaling of the input pixels, the kernels may be learned from a given training dataset via *gradient descent*, exactly as the weights of an MLP. In fact, an MLP is perfectly capable of replicating a convolutional layer, but it would require a lot more training time (and data) to learn to approximate that mode of operation.\n",
    "\n",
    "Finally, let's just note that a convolutional operator is in no way restricted to two-dimensionally structured data: in fact, most machine learning frameworks ([Keras included](https://keras.io/layers/convolutional/)) will provide you with out-of-the-box layers for 1D and 3D convolutions as well!\n",
    "\n",
    "It is important to note that, while a convolutional layer significantly decreases the number of *parameters* compared to a fully connected (FC) layer, it introduces more **hyperparameters**---parameters whose values need to be chosen *before* training starts.\n",
    "\n",
    "Namely, the hyperparameters to choose within a single convolutional layer are:\n",
    "- *depth*: how many different kernels (and biases) will be convolved with the output of the previous layer;\n",
    "- *height* and *width* of each kernel;\n",
    "- *stride*: by how much we shift the kernel in each step to compute the next pixel in the result. This specifies the overlap between individual output pixels, and typically it is set to $1$, corresponding to the formula given before. Note that larger strides result in smaller output sizes.\n",
    "- *padding*: note that convolution by any kernel larger than $1\\times 1$ will *decrease* the output image size---it is often desirable to keep sizes the same, in which case the image is sufficiently padded with zeroes at the edges. This is often called *\"same\"* padding, as opposed to *\"valid\"* (no) padding. It is possible to add arbitrary levels of padding, but typically the padding of choice will be either same or valid.\n",
    "\n",
    "As already hinted, convolutions are not typically meant to be the sole operation in a CNN (although there have been promising recent developments on [all-convolutional networks](https://arxiv.org/pdf/1412.6806v3.pdf)); but rather to extract useful features of an image prior to downsampling it sufficiently to be manageable by an MLP.\n",
    "\n",
    "A very popular approach to downsampling is a *pooling* layer, which consumes small and (usually) disjoint chunks of the image (typically $2\\times 2$) and aggregates them into a single value. There are several possible schemes for the aggregation---the most popular being **max-pooling**, where the maximum pixel value within each chunk is taken. A diagrammatical illustration of $2\\times 2$ max-pooling is given below.\n",
    "\n",
    "![](http://perso.mines-paristech.fr/fabien.moutarde/ES_MachineLearning/TP_convNets/pool.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Putting it all together: a common CNN\n",
    "\n",
    "Now that we got all the building blocks, let's see what a typical convolutional neural network might look like!\n",
    "\n",
    "![](http://perso.mines-paristech.fr/fabien.moutarde/ES_MachineLearning/TP_convNets/cnn.png)\n",
    "\n",
    "A typical CNN architecture for a $k$-class image classification can be split into two distinct parts---a chain of repeating $\\mathrm{Conv}\\rightarrow\\mathrm{Pool}$ layers (sometimes with more than one convolutional layer at once), followed by a few fully connected layers (taking each pixel of the computed images as an independent input), culminating in a $k$-way softmax layer, to which a cross-entropy loss is optimised. I did not draw the activation functions here to make the sketch clearer, but do keep in mind that typically after every convolutional or fully connected layer, an activation (e.g. ReLU) will be applied to all of the outputs.\n",
    "\n",
    "Note the effect of a single $\\mathrm{Conv}\\rightarrow\\mathrm{Pool}$ pass through the image: it reduces height and width of the individual channels in favour of their number, i.e. *depth*.\n",
    "\n",
    "The softmax layer and cross-entropy loss are both introduced in more detail [in the previous tutorial](http://online.cambridgecoding.com/notebooks/cca_admin/deep-learning-for-complete-beginners-recognising-handwritten-digits). For summarisation purposes, a softmax layer's purpose is converting any vector of real numbers into a vector of *probabilities* (nonnegative real values that add up to 1). Within this context, the probabilities correspond to the likelihoods that an input image is a member of a particular class. Minimising the cross-entropy loss has the effect of maximising the model's confidence in the *correct* class, without being concerned for the probabilites for other classes---this makes it a more suitable choice for probabilistic tasks compared to, for example, the squared error loss."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Detour: Overfitting, regularisation and dropout\n",
    "\n",
    "This will be the first (and hopefully the only) time when I will divert your attention to a seemingly unrelated topic. It regards a very important pitfall of machine learning---**overfitting** a model to the training data. While this is primarily going to be a major topic of the next tutorial in the series, the negative effects of overfitting will tend to become quite noticeable on the networks like the one we are about to build, and we need to introduce a way to properly protect ourselves against it, before going any further. Luckily, there is a very simple technique we can use.\n",
    "\n",
    "Overfitting corresponds to adapting our model to the training set to such extremes that its generalisation potential (performance on samples outside of the training set) is *severely* limited. In other words, our model might have learned the training set (along with any noise present within it) perfectly, but it has failed to capture the underlying process that generated it. To illustrate, consider a problem of fitting a sine curve, with white additive noise applied to the data points: \n",
    "\n",
    "![](http://perso.mines-paristech.fr/fabien.moutarde/ES_MachineLearning/TP_convNets/plotsin.png)\n",
    "\n",
    "Here we have a training set (denoted by blue circles) derived from the original sine wave, along with some noise. If we fit a degree-3 polynomial to this data, we get a fairly good approximation to the original curve. Someone might argue that a degree-14 polynomial would do better; indeed, given we have 15 points, such a fit would *perfectly* describe the training data. However, in this case, the additional parameters of the model cause catastrophic results: to cope with the inherent noise of the data, anywhere except in the closest vicinity of the training points, our fit is completely off.\n",
    "\n",
    "Deep convolutional neural networks have a large number of parameters, especially in the fully connected layers. Overfitting might often manifest in the following form: if we don't have sufficiently many training examples, a small group of neurons might become responsible for doing most of the processing and other neurons becoming redundant; or in the other extreme, some neurons might actually become detrimental to performance, with several other neurons of their layer ending up doing nothing else but correcting for their errors.\n",
    "\n",
    "To help our models generalise better in these circumstances, we introduce techniques of *regularisation*: rather than reducing the number of parameters, we impose *constraints* on the model parameters during training to keep them from learning the noise in the training data. The particular method I will introduce here is **dropout**---a technique that initially might seem like \"dark magic\", but actually helps to eliminate exactly the failure modes described above. Namely, dropout with parameter $p$ will, within a single training iteration, go through all neurons in a particular layer and, with probability $p$, *completely eliminate them from the network throughout the iteration*. This has the effect of forcing the neural network to cope with *failures*, and not to rely on existence of a particular neuron (or set of neurons)---relying more on a *consensus* of several neurons within a layer. This is a very simple technique that works quite well already for combatting overfitting on its own, without introducing further regularisers. An illustration is given below.\n",
    "\n",
    "![](http://perso.mines-paristech.fr/fabien.moutarde/ES_MachineLearning/TP_convNets/drop.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Applying a deep CNN to MNIST\n",
    "\n",
    "As this post's objective, we will implement a deep convolutional neural network---and apply it on the MNIST digit recognition classification task.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As already mentioned, a CNN will typically have more hyperparameters than an MLP. For the purposes of this tutorial, we will also stick to \"sensible\" hand-picked values for them, but do still keep in mind that later on I will introduce a more proper method for learning them.\n",
    "\n",
    "The hyperparameters are:\n",
    "- The *batch size*, representing the number of training examples being used simultaneously during a single iteration of the gradient descent algorithm;\n",
    "- The number of *epochs*, representing the number of times the training algorithm will iterate over the entire training set before terminating\\*;\n",
    "- The *kernel sizes* in the convolutional layers;\n",
    "- The *pooling size* in the pooling layers;\n",
    "- The *number of kernels* in the convolutional layers;\n",
    "- The *dropout probability* (we will apply dropout after each pooling, and after the fully connected layer);\n",
    "- The *number of neurons* in the fully connected layer of the MLP.\n",
    "\n",
    "\\* **N.B. here I have set the number of epochs to 100, which might be undesirably slow if you do not have a GPU at your disposal (the convolution layers are going to pose a significant performance bottleneck in this case). You might wish to decrease the epoch count and/or numbers of kernels if you are going to be training the network on a CPU.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modelling time! Our network has an architecture similar to LeNet5 of LeCun (see figure below). It will consist of two consecutive groups of one `Convolution2D` followed by  a `MaxPooling2D` layer. After the first pooling layer, the number of kernels is rougly doubled (in line with the previously mentioned principle of sacrificing height and width for more depth). Afterwards, the output of the second pooling layer is flattened to 1D (via the `Flatten` layer), and passed through one or two fully connected (`Dense`) layers. ReLU activations will once again be used for all layers except the output dense layer, which will use a softmax activation (for purposes of probabilistic classification).\n",
    "\n",
    "![](http://perso.mines-paristech.fr/fabien.moutarde/ES_MachineLearning/TP_convNets/lenet5.png)\n",
    "\n",
    "To regularise our model, a `Dropout` layer is applied after each pooling layer, and after the first `Dense` layer. This is another area where Keras shines compared to other frameworks: it has an internal flag that automatically enables or disables dropout, depending on whether the model is currently used for training or testing.\n",
    "\n",
    "The remainder of the model specification is the following:\n",
    "- We use the *cross-entropy* loss function as the objective to optimise (as its derivation is more appropriate for probabilistic tasks);\n",
    "- We use the [*Adam* optimiser for gradient descent](http://sebastianruder.com/optimizing-gradient-descent/);\n",
    "- We report the *accuracy* of the model (as the dataset is balanced across the ten classes)\\*;\n",
    "- We hold out 10% of the data for validation purposes.\n",
    "\n",
    "\\* To get a feeling for why accuracy might be inappropriate for unbalanced datasets, consider an extreme case where 90% of the test data belongs to class $x$ (this could be, for example, the task of diagnosing patients for an extremely rare disease). In this case, a classifier that just outputs $x$ achieves a seemingly impressive accuracy of 90% on the test data, without really doing any learning/generalisation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This model, possibly after some tweaking of its architectural parameters, should be able to break $99\\%$ accuracy on its **test set** with little to no effort.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "Throughout this post we have covered the essentials of convolutional neural networks, introduced the problem of overfitting, and made a very brief dent into how it could be rectified via regularisation (by applying dropout) and successfully implemented a two-layer deep CNN (with LeNet like architecture) in Keras, applying it to MNIST, all in under 50 lines of code. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Just show me the code!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/export/home/manitou/6/moutarde/.local/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "Using Theano backend.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Version KERAS :2.1.5\n",
      "Version Theano :1.0.0\n",
      "(60000, 1, 28, 28)\n",
      "(60000, 10)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py:52: UserWarning: Update your `Conv2D` call to the Keras 2 API: `Conv2D(6, (5, 5), padding=\"same\", activation=\"relu\", data_format=\"channels_first\", input_shape=(1, 28, 28...)`\n",
      "/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py:57: UserWarning: Update your `Conv2D` call to the Keras 2 API: `Conv2D(16, (5, 5), padding=\"same\", activation=\"relu\")`\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<bound method Sequential.summary of <keras.models.Sequential object at 0xc1ffb0c>>\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python2.7/dist-packages/keras/models.py:942: UserWarning: The `nb_epoch` argument in `fit` has been renamed `epochs`.\n",
      "  warnings.warn('The `nb_epoch` argument in `fit` '\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train on 36000 samples, validate on 24000 samples\n",
      "Epoch 1/20\n",
      "36000/36000 [==============================] - 57s 2ms/step - loss: 0.6565 - acc: 0.7849 - val_loss: 0.1825 - val_acc: 0.9454\n",
      "Epoch 2/20\n",
      " 6944/36000 [====>.........................] - ETA: 38s - loss: 0.3577 - acc: 0.8881"
     ]
    },
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-1-322e304bc16a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m     75\u001b[0m model.fit(X_train, Y_train, # Train the model using the training set...\n\u001b[1;32m     76\u001b[0m           \u001b[0mbatch_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mbatch_size\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnb_epoch\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mnum_epochs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 77\u001b[0;31m           verbose=1, validation_split=0.4) # ...holding out 40% of the data for validation\n\u001b[0m\u001b[1;32m     78\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mevaluate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mverbose\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# Evaluate the trained model on the test set!\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/keras/models.pyc\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)\u001b[0m\n\u001b[1;32m    961\u001b[0m                               \u001b[0minitial_epoch\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minitial_epoch\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    962\u001b[0m                               \u001b[0msteps_per_epoch\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msteps_per_epoch\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 963\u001b[0;31m                               validation_steps=validation_steps)\n\u001b[0m\u001b[1;32m    964\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    965\u001b[0m     def evaluate(self, x=None, y=None,\n",
      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/keras/engine/training.pyc\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)\u001b[0m\n\u001b[1;32m   1703\u001b[0m                               \u001b[0minitial_epoch\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minitial_epoch\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1704\u001b[0m                               \u001b[0msteps_per_epoch\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msteps_per_epoch\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1705\u001b[0;31m                               validation_steps=validation_steps)\n\u001b[0m\u001b[1;32m   1706\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1707\u001b[0m     def evaluate(self, x=None, y=None,\n",
      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/keras/engine/training.pyc\u001b[0m in \u001b[0;36m_fit_loop\u001b[0;34m(self, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)\u001b[0m\n\u001b[1;32m   1233\u001b[0m                         \u001b[0mins_batch\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mins_batch\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtoarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1234\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1235\u001b[0;31m                     \u001b[0mouts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mins_batch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1236\u001b[0m                     \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mouts\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1237\u001b[0m                         \u001b[0mouts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mouts\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.pyc\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs)\u001b[0m\n\u001b[1;32m   1225\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1226\u001b[0m         \u001b[0;32massert\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mlist\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtuple\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1227\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfunction\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1228\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1229\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.pyc\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m    901\u001b[0m         \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    902\u001b[0m             \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 903\u001b[0;31m                 \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0moutput_subset\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0mNone\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    904\u001b[0m                 \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0moutput_subset\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0moutput_subset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    905\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
     ]
    }
   ],
   "source": [
    "import os\n",
    "os.environ['KERAS_BACKEND']=\"theano\"\n",
    "import theano\n",
    "import keras\n",
    "print(\"Version KERAS :\" + keras.__version__)\n",
    "print(\"Version Theano :\" + theano.__version__)\n",
    "\n",
    "from keras.datasets import mnist # subroutines for fetching the MNIST dataset\n",
    "from keras.models import Model, Sequential # basic class for specifying and training a neural network\n",
    "from keras.layers import Input, Conv2D, MaxPooling2D, Dense, Dropout, Activation, Flatten\n",
    "from keras.utils import np_utils # utilities for one-hot encoding of ground truth values\n",
    "import numpy as np\n",
    "from numpy import newaxis\n",
    "\n",
    "batch_size = 32 # in each iteration, we consider 32 training examples at once\n",
    "num_epochs = 20 # we iterate 20 times over the entire training set\n",
    "kernel_size = 5 # we will use 5x5 kernels throughout\n",
    "pool_size = 2 # we will use 2x2 pooling throughout\n",
    "conv_depth_1 = 6 # we will initially have 6 kernels in first conv. layer...\n",
    "conv_depth_2 = 16 # ...switching to 16 after the first pooling layer\n",
    "drop_prob_1 = 0.25 # dropout after pooling with probability 0.25\n",
    "drop_prob_2 = 0.5 # dropout in the FC layer with probability 0.5\n",
    "hidden_size = 128 # the FC layer will have 128neurons\n",
    "\n",
    "num_train = 60000 # there are 60000 training examples in MNIST\n",
    "num_test = 10000 # there are 10000 test examples in MNIST\n",
    "\n",
    "height, width, depth = 28, 28, 1 # MNIST images are 28x28 and *greyscale*\n",
    "num_classes = 10 # there are 10 classes (1 per digit)\n",
    "\n",
    "(X_train, y_train), (X_test, y_test) = mnist.load_data() # fetch MNIST data\n",
    "\n",
    "X_train = X_train.astype('float32') \n",
    "X_test = X_test.astype('float32')\n",
    "X_train /= 255 # Normalise data to [0, 1] range\n",
    "X_test /= 255 # Normalise data to [0, 1] range\n",
    "\n",
    "X_train = X_train[:, newaxis, :, :] # Reshape in \"convolutionable\" format (add tensor dim for the depth)\n",
    "X_test = X_test[:, newaxis, :, :] # Reshape in \"convolutionable\" format (add tensor dim for the depth)\n",
    "\n",
    "Y_train = np_utils.to_categorical(y_train, num_classes) # One-hot encode the labels\n",
    "Y_test = np_utils.to_categorical(y_test, num_classes) # One-hot encode the labels\n",
    "\n",
    "print(X_train.shape)\n",
    "print(Y_train.shape)\n",
    "\n",
    "inp = Input(shape=(1,height,width)) # greyscale --> depth==1\n",
    "\n",
    "model = Sequential()\n",
    "# Conv [8] -> Pool (with dropout on the pooling layer)\n",
    "model.add( Conv2D(conv_depth_1, (kernel_size,kernel_size), border_mode='same', activation='relu', \n",
    "                         data_format=\"channels_first\", input_shape=( 1, 28, 28)) )\n",
    "model.add( MaxPooling2D(pool_size=(pool_size, pool_size)) )\n",
    "model.add( Dropout(drop_prob_1) )\n",
    "\n",
    "# Conv [16] -> Pool (with dropout on the pooling layer)\n",
    "model.add( Conv2D(conv_depth_2, (kernel_size,kernel_size), border_mode='same', activation='relu') )\n",
    "model.add( MaxPooling2D(pool_size=(pool_size, pool_size)) )\n",
    "model.add( Dropout(drop_prob_1) )\n",
    "\n",
    "print(model.summary)\n",
    "\n",
    "# Now flatten to 1D, apply FC -> ReLU (with dropout) -> softmax\n",
    "model.add( Flatten() )\n",
    "model.add( Dense(hidden_size, activation='relu') )\n",
    "model.add( Dropout(drop_prob_2) )\n",
    "model.add( Dense(num_classes, activation='softmax') )\n",
    "\n",
    "#model = Model(input=inp, output=out) # To define a model, just specify its input and output layers\n",
    "\n",
    "model.compile(loss='categorical_crossentropy', # using the cross-entropy loss function\n",
    "              optimizer='adam', # using the Adam optimiser\n",
    "              metrics=['accuracy']) # reporting the accuracy\n",
    "\n",
    "model.fit(X_train, Y_train, # Train the model using the training set...\n",
    "          batch_size=batch_size, nb_epoch=num_epochs,\n",
    "          verbose=1, validation_split=0.4) # ...holding out 40% of the data for validation\n",
    "model.evaluate(X_test, Y_test, verbose=1) # Evaluate the trained model on the test set!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}