In the previous post we built a neural network from scratch to classify small images (click here for a quick reminder). This time we first try a naive approach that uses only convolutional layers, and then a richer architecture that takes advantage of Batch Normalization and Dropout for better performance.
Again, we’ll need just TF2, NumPy and Matplotlib.
Import libraries & load data
Similarly to the last post, just import
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.layers import Input, Flatten, Dense, Conv2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import cifar10
and load data (this section is unchanged from previous post, so we do not add any comment):
NUM_CLASSES = 10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

y_train = to_categorical(y_train, NUM_CLASSES)
y_test = to_categorical(y_test, NUM_CLASSES)
Naive ConvNet Architecture
Input images are 32×32 pixels with 3 RGB channels. The first convolution layer applies 10 filters of size 4×4 to the input image. Each “filter” is actually a stack of three 4×4 matrices, one per RGB channel, that sweep over the three channel images at once. There are 10 such filters, producing 10 outputs. Pick a filter (3 matrices) and slide it over the image channels. The stride is 2, so the convolution output is smaller than the input image. For a single-matrix filter, the convolution works as usual. How is the convolution performed with multiple channels? Each channel is convolved with its own matrix, and the per-channel results are then summed to form a single output pixel (see here and/or here).
input_layer = Input(shape=(32,32,3))

conv_layer_1 = Conv2D(filters = 10, kernel_size = (4,4), strides = 2, padding = 'same')(input_layer)
conv_layer_2 = Conv2D(filters = 20, kernel_size = (3,3), strides = 2, padding = 'same')(conv_layer_1)

flatten_layer = Flatten()(conv_layer_2)
output_layer = Dense(units=10, activation='softmax')(flatten_layer)

model = Model(input_layer, output_layer)
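To make the multi-channel step concrete, here is a minimal sketch in plain Python (no TensorFlow needed) that computes a single output pixel for one 4×4×3 filter. The patch and filter values are made up for illustration, and the helper name `channel_response` is ours, not part of any library. Keras would also add a bias term, which we leave out here.

```python
# One output pixel of a multi-channel convolution, computed by hand.
# Values are illustrative assumptions: a 4x4 patch of ones on every channel,
# and a filter whose weights are 1 on R, 2 on G and 3 on B.
SIZE, CHANNELS = 4, 3

patch = [[[1.0] * CHANNELS for _ in range(SIZE)] for _ in range(SIZE)]
kernel = [[[1.0, 2.0, 3.0] for _ in range(SIZE)] for _ in range(SIZE)]


def channel_response(patch, kernel, ch):
    """Multiply one channel of the patch by the matching filter matrix and sum."""
    return sum(
        patch[r][c][ch] * kernel[r][c][ch]
        for r in range(SIZE)
        for c in range(SIZE)
    )


# Each channel is processed with its own 4x4 matrix...
per_channel = [channel_response(patch, kernel, ch) for ch in range(CHANNELS)]
# ...and the three partial results are summed into a single output pixel.
output_pixel = sum(per_channel)

print(per_channel)   # [16.0, 32.0, 48.0]
print(output_pixel)  # 96.0
```

Repeating this for every filter position gives one output map, and repeating it for each of the 10 filters gives the 10 output channels of the first layer.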
With the “same” padding method it is easier to figure out the output size when the filter dimension is odd (we are dealing only with square filters!) because there is a central cell on which one may focus. But how does the padding work when the filter size is even? The following image may help.
If the filter has odd size, to obtain a convolution result of the same size as the input (that is the purpose of “same” padding when the stride is 1) it is sufficient to pad each side of the original picture with as many zeros as there are filter cells between the central cell and the top, right, bottom and left edges of the filter. When the filter dimension is even, instead of a central cell we find a central square: take its top-left cell as the reference and pad the original picture with as many zeros as there are filter cells between that cell and the top, right, bottom and left edges. For example, a 4×4 filter with “same” padding entails 1 row of zeros at the top, 2 columns of zeros at the right, 2 rows of zeros at the bottom and 1 column of zeros at the left of the original image. The following snippet gives a clear answer. It is just an aside (not part of the ConvNet model), so you can safely skip it.
import tensorflow as tf

# Image: 6x6 matrix, each entry is 1.
# Filter: 4x4 matrix, each entry is 1.
input_ = tf.ones((1, 6, 6, 1), dtype=tf.float32)
kernel = tf.ones((4, 4, 1, 1), dtype=tf.float32)
conv = tf.nn.conv2d(input_, kernel, [1, 1, 1, 1], 'SAME')
print(conv[0, :, :, 0])

tf.Tensor(
[[ 9. 12. 12. 12.  9.  6.]
 [12. 16. 16. 16. 12.  8.]
 [12. 16. 16. 16. 12.  8.]
 [12. 16. 16. 16. 12.  8.]
 [ 9. 12. 12. 12.  9.  6.]
 [ 6.  8.  8.  8.  6.  4.]], shape=(6, 6), dtype=float32)
Let $s$ be the filter size and let $p = s/2$. Note that the initial “image”

1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1

is adjusted, if the filter size is even, by adding $p$ columns at the right, $p$ rows at the bottom, $p - 1$ rows at the top and $p - 1$ columns at the left. With $s = 4$ we get $p = 2$:

0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 0 0
0 1 1 1 1 1 1 0 0
0 1 1 1 1 1 1 0 0
0 1 1 1 1 1 1 0 0
0 1 1 1 1 1 1 0 0
0 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
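If you want to double check the padding amounts without TensorFlow, the following plain-Python sketch pads the 6×6 image of ones explicitly and then runs an unpadded convolution with a 4×4 filter of ones. The helper names `same_padding` and `conv2d_same_ones` are ours; the padding formula follows the rule documented for TensorFlow's 'SAME' option (output size is the input size divided by the stride, rounded up, and any odd leftover zero goes to the bottom/right).

```python
import math


def same_padding(n, k, stride=1):
    """Return (output_size, pad_before, pad_after) along one axis."""
    out = math.ceil(n / stride)
    total = max((out - 1) * stride + k - n, 0)
    before = total // 2          # top / left
    after = total - before       # bottom / right gets the extra zero
    return out, before, after


def conv2d_same_ones(n, k):
    """Convolve an n x n image of ones with a k x k filter of ones (stride 1)."""
    out, before, after = same_padding(n, k)
    size = before + n + after
    # Zero-padded image: ones only where the original pixels are.
    padded = [
        [1 if before <= r < before + n and before <= c < before + n else 0
         for c in range(size)]
        for r in range(size)
    ]
    # With an all-ones filter, each output is just the sum over the window.
    return [
        [sum(padded[r + i][c + j] for i in range(k) for j in range(k))
         for c in range(out)]
        for r in range(out)
    ]


print(same_padding(6, 4))  # (6, 1, 2): 1 zero before, 2 zeros after
result = conv2d_same_ones(6, 4)
for row in result:
    print(row)
```

The printed rows match the tensor returned by `tf.nn.conv2d` above, starting with `[9, 12, 12, 12, 9, 6]` and ending with `[6, 8, 8, 8, 6, 4]`.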
The second convolution is performed using twenty 3×3 filters, giving twenty 8×8 convolution results. Then everything is flattened into a single vector of 1280 components (8 × 8 × 20) that is the input of a dense layer with 10 output units, and these outputs are finally turned into a probability distribution by applying softmax.
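The sizes quoted above can be retraced with a few lines of arithmetic. This is a small sketch with helper names of our own (`conv_output_size`, `conv_params`); it assumes 'same' padding, so the output size is the input size divided by the stride and rounded up, and it counts one bias per filter or output unit, as Keras does by default.

```python
import math


def conv_output_size(n, stride):
    """Spatial size after a convolution with 'same' padding."""
    return math.ceil(n / stride)


def conv_params(kernel, in_channels, filters):
    """Weights of all the filters plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size_1 = conv_output_size(32, 2)       # 16x16 after the first layer
size_2 = conv_output_size(size_1, 2)   # 8x8 after the second layer
flat = size_2 * size_2 * 20            # length of the flattened vector

params = {
    "conv_layer_1": conv_params(4, 3, 10),
    "conv_layer_2": conv_params(3, 10, 20),
    "output_layer": flat * 10 + 10,
}
total_params = sum(params.values())

print(size_1, size_2, flat)   # 16 8 1280
print(params, total_params)   # 490, 1820 and 12810 parameters, 15120 in total
```

Calling `model.summary()` on the naive model is an easy way to cross-check these figures.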
Nothing different from our previous post: we choose an optimizer (Adam) for training (adjusting the parameters) and a loss function (categorical_crossentropy) that measures the difference between predictions and labels. Then we specify a metric (accuracy) that calculates how often predictions equal labels, that is, the frequency with which the predicted class matches the true class. Adaptive Moment Estimation (Adam) is a stochastic gradient descent method that computes adaptive learning rates for each parameter from running averages of past gradients and past squared gradients (estimates of the first and second moments of the gradients).
# `learning_rate` is the current argument name; `lr` is a deprecated alias
opt = Adam(learning_rate=0.0002)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
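To see what "adaptive learning rates from first and second moments" means in practice, here is a minimal sketch of one Adam update for a single scalar parameter, written in plain Python. It follows the update rule from the original Adam paper; the default values for `beta1`, `beta2` and `eps` are the ones Keras documents for its Adam optimizer, but the Keras implementation differs in small numerical details, so take this as an illustration rather than a copy of what runs inside `model.fit`. The function name `adam_step` and the toy loss are our own.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.0002, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter `theta` at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy example: minimise f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
theta_after_one, m, v = adam_step(theta, 2 * theta, m, v, t=1)
print(theta_after_one)  # very close to 1.0 - 0.0002

# A few thousand more steps move theta steadily towards the minimum at 0.
theta = theta_after_one
for t in range(2, 3001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)
```

Note that on the first step the update has size close to `lr` whatever the gradient magnitude is: the gradient is divided by the square root of its own squared average, which is what makes the step size adaptive per parameter.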
We call the fit method to train the model. We use a batch size of 32 and 10 epochs (an epoch is one full pass of the network over the whole training dataset). With shuffle = True the training data is reshuffled before each epoch, so the batches fed to the network are drawn randomly and without replacement from the training data.
model.fit(x_train, y_train, batch_size=32, epochs=10, shuffle=True, validation_data = (x_test, y_test))
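The idea of drawing batches randomly and without replacement can be sketched in a few lines of plain Python. This is only an illustration of the concept under our own naming (`epoch_batches`), not the actual Keras data pipeline; the 50,000 figure is the size of the CIFAR-10 training set.

```python
import math
import random


def epoch_batches(num_samples, batch_size, rng):
    """Yield lists of sample indices covering the dataset once, in random order."""
    indices = list(range(num_samples))
    rng.shuffle(indices)  # new random order for this epoch
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
batches = list(epoch_batches(50000, 32, rng))

steps_per_epoch = len(batches)
last_batch_size = len(batches[-1])
print(steps_per_epoch, last_batch_size)  # 1563 16
```

Every image is seen exactly once per epoch, and since 50,000 is not a multiple of 32 the last batch is smaller than the others.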
After training, the model reaches a modest accuracy, about 40% on both training and validation data. This is far from the best achievable result, but it is just a first attempt that can easily be improved using techniques like Dropout and Batch Normalization. Note that the validation data passed to fit is the test set itself, so the evaluation below, run on the same test dataset, reproduces the last validation figure rather than measuring performance on further unseen images.
Model evaluation and results visualization
model.evaluate(x_test, y_test)

10000/10000 [==============================] - 7s 677us/sample - loss: 1.7488 - accuracy: 0.3898
[1.7487890601158143, 0.3898]
Accuracy is about 39%, poor… but a lot better than guessing.
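To put the two reported numbers (loss and accuracy) in perspective, the sketch below computes categorical cross-entropy and accuracy by hand in plain Python. The function names are ours and the three toy predictions are made up for illustration; Keras averages the loss over all samples in a similar way, with some extra numerical safeguards.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def accuracy(labels, preds):
    """Fraction of samples whose most probable class matches the label."""
    hits = sum(
        max(range(len(t)), key=t.__getitem__) == max(range(len(p)), key=p.__getitem__)
        for t, p in zip(labels, preds)
    )
    return hits / len(labels)


# A model that knows nothing spreads probability evenly over the 10 classes.
one_hot = [1.0] + [0.0] * 9
uniform = [0.1] * 10
guess_loss = categorical_crossentropy(one_hot, uniform)
print(guess_loss)  # ln(10), about 2.3026

# Toy 3-class example: two predictions out of three hit the right class.
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
preds = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.3, 0.2]]
toy_accuracy = accuracy(labels, preds)
print(toy_accuracy)
```

Uniform guessing over 10 classes gives a loss of about 2.30 and, on a balanced test set like CIFAR-10's, an accuracy of 10%, so a loss of 1.75 with 39% accuracy is indeed well above chance.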
CLASSES = np.array(['airplane', 'automobile', 'bird', 'cat','deer', 'dog', 'frog', 'horse', 'ship', 'truck'])

preds = model.predict(x_test)
preds_class = CLASSES[np.argmax(preds, axis = -1)]
actual_class = CLASSES[np.argmax(y_test, axis = -1)]

n_to_show = 8
indices = np.random.choice(range(len(x_test)), n_to_show)

fig = plt.figure(figsize=(14, 1))
fig.subplots_adjust(hspace=0.3, wspace=0.3)

for i, idx in enumerate(indices):
    img = x_test[idx]
    ax = fig.add_subplot(1, n_to_show, i+1)
    ax.axis('off')
    ax.text(0.5, -0.4, 'pred = ' + str(preds_class[idx]), fontsize=10, ha='center', transform=ax.transAxes)
    ax.text(0.5, -0.7, 'true = ' + str(actual_class[idx]), fontsize=10, ha='center', transform=ax.transAxes)
    ax.imshow(img)
Batch Normalization (BN) is a technique that normalizes the inputs of a layer by re-centering and re-scaling them. This is done by computing the mean and the standard deviation of each input channel (across the whole batch), then normalizing the inputs with them (check this video) and, finally, applying a scaling and a shifting through two learnable parameters, $\gamma$ and $\beta$. Batch Normalization is quite effective, but the real reasons behind this effectiveness remain unclear. Initially, as proposed by Sergey Ioffe and Christian Szegedy in their 2015 article, the purpose of BN was to mitigate the internal covariate shift. A reason to scale inputs is to get stable training; unfortunately this may hold at the beginning, but as the network trains and the weights move away from their initial values there is no guarantee of stability. So, as training progresses, the distribution of each layer's inputs changes because of the weight updates: this change is the internal covariate shift. However, some years later, a paper showed that the benefits of BN have very little to do with internal covariate shift.
The picture above shows the comparison of distributional stability profiles from VGG networks trained without BatchNorm (Standard), with BatchNorm (Standard + BatchNorm) and with explicit covariate shift added to BatchNorm layers (Standard + “Noisy” BatchNorm). The “noisy” BN has distributional instability induced by adding time-varying, non-zero mean and non-unit variance noise independently to each batch normalized activation.
The following picture shows that, surprisingly, the “noisy” BN model nearly matches the performance of standard BN model, despite complete distributional instability. The internal covariate shift in models using BN is similar or even worse… but they perform better in terms of accuracy.
This leads us to reject the idea that lowering internal covariate shift is what gives a better model. So, how does BN help? BN affects both the variation of the loss (loss landscape figure) and the variation of the gradients of the loss (gradient predictiveness figure): the loss varies at a smaller rate and the magnitudes of the gradients are smaller (see picture below). A smoother loss landscape usually allows larger learning rates, which reduces training time.
Check this video for more.
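Going back to the mechanics of Batch Normalization, the normalize-then-scale-and-shift step can be sketched in plain Python for a single channel. The function name `batch_norm` is ours; the small constant `eps` avoids a division by zero, and the value used here is the default documented for the Keras BatchNormalization layer. This covers training-time behaviour only: at inference time Keras replaces the batch statistics with moving averages collected during training.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one channel's activations across the batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in values]


batch = [2.0, 4.0, 6.0, 8.0]  # made-up activations of one channel over a batch of 4

normalized = batch_norm(batch)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
print(norm_mean, norm_var)  # mean about 0, variance just under 1

# gamma and beta let the network choose a different scale and offset.
shifted = batch_norm(batch, gamma=2.0, beta=1.0)
shifted_mean = sum(shifted) / len(shifted)
print(shifted_mean)  # about 1, i.e. beta
```

With `gamma = 1` and `beta = 0` the output has zero mean and (almost) unit variance; during training the two parameters are learned like any other weight.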
There is an obvious link between regularization and the size/complexity of the network (see picture below). Smaller networks correspond to more rigid and simpler models. To avoid overfitting, it is sometimes useful to exploit a method that reduces complexity and returns a better performing model. Intuitively, fewer neurons (units) in action correspond to simpler models.
Dropout is a useful technique that is easier to apply than to explain formally or mathematically. It works like this: given a feedforward neural network, at training time remove at each layer some neurons, depending on a certain probability. For example, if the probability is 0.5, you flip a coin and decide if a certain neuron should be in or out.
As you may have noticed, at training time the network on the right (picture above) is a simpler network (fewer units), prone to express a simpler model and so possibly less prone to overfitting. The network is trained to produce accurate predictions on unseen data even in unfriendly conditions where some neurons are missing. In addition, because units can go away at random, each neuron may miss one or more important inputs from the previous layer, so it cannot rely on any single input. The neuron has to spread out its weights over its incoming neurons, causing the weights to shrink. This shrinking lowers the squared norm of the weights, hence Dropout acts as a sort of (local) L2-regularization (video).
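The coin-flip description translates almost literally into code. Below is a minimal plain-Python sketch of the so-called inverted dropout, which is the variant the Keras Dropout layer documents: units that survive are scaled up by 1 / (1 - rate), so the expected sum of the activations stays the same and nothing needs to be rescaled at prediction time. The function name `dropout` and the toy inputs are our own.

```python
import random


def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # prediction time: every unit is kept
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)       # fixed seed so the sketch is reproducible
units = [1.0] * 10000        # made-up layer output, all ones

train_out = dropout(units, rate=0.5, training=True, rng=rng)
test_out = dropout(units, rate=0.5, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(dropped_fraction)  # roughly half of the units are switched off
print(train_mean)        # still close to 1 thanks to the rescaling
```

At training time about half of the units are zeroed and the rest are doubled, while at prediction time the layer is simply the identity.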
At test time no units are dropped, so that the full network is used to make predictions. In the following code we use Dropout just before the final layer because Batch Normalization already has a regularizing effect by itself. Replace the preceding naive architecture with the following.
# Extra layers needed on top of the imports at the start of the post
from tensorflow.keras.layers import BatchNormalization, LeakyReLU, Dropout, Activation

input_layer = Input((32,32,3))

x = Conv2D(filters = 32, kernel_size = 3, strides = 1, padding = 'same')(input_layer)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Conv2D(filters = 32, kernel_size = 3, strides = 2, padding = 'same')(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Conv2D(filters = 64, kernel_size = 3, strides = 1, padding = 'same')(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Conv2D(filters = 64, kernel_size = 3, strides = 2, padding = 'same')(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)

x = Flatten()(x)

x = Dense(128)(x)
x = BatchNormalization()(x)
x = LeakyReLU()(x)
x = Dropout(rate = 0.5)(x)

x = Dense(NUM_CLASSES)(x)
output_layer = Activation('softmax')(x)

model = Model(input_layer, output_layer)
You can check that — in just 10 epochs — test accuracy goes up to over 70%.
Feel free to email me for comments, questions, suggestions or if you just want to leave a message.