VGG paper to code with TensorFlow
Extracting the necessary information from the paper line by line and turning it into code
The LeNet-5 architecture, created by Yann LeCun in 1998, was perhaps the most widely known CNN architecture; it used 5x5 convolution filters and average pooling without padding. The next big CNN architecture was AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of one another instead of placing a pooling layer on top of each convolutional layer. The GoogLeNet architecture, developed by Christian Szegedy et al. from Google Research, was another significant CNN. It was much more efficient than previous architectures thanks to subnetworks called inception modules: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).
The next notable CNN was the VGG (Visual Geometry Group) architecture, created by Karen Simonyan and Andrew Zisserman; it is a very simple, classical architecture.
First, skim the paper: look at Table 1, read the abstract and the introduction, and then the conclusion:
We will implement the VGG-19 architecture (19 weight layers). To make it easier to follow, I highlighted the layers with different colors as blocks:
Now that you have the big picture, let's start extracting the details by reading section 2 of the paper:
In section 2, the VGG architecture's hyperparameters are explained:
- input_shape=(224, 224, 3): input size 224x224, with 3 channels for RGB images
- kernel_size=(3,3) for Conv2D layers
- strides=(1,1) for Conv2D layers
- padding='same' for Conv2D layers
- pool_size=(2,2) and strides=(2,2) for MaxPooling2D layers
- units=4096 and activation='relu' for the first two Dense (fully connected) layers
- units=1000 and activation='softmax' for the last Dense layer
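As a quick sanity check on these choices: with padding='same' and strides=(1,1), a 3x3 convolution preserves the spatial size, so only the five 2x2, stride-2 pooling layers shrink the feature maps. A few lines of plain Python confirm the resolution at the end of the convolutional part:

size = 224
for block in range(5):   # five conv blocks, each ending in a 2x2, stride-2 pool
    size //= 2           # 224 -> 112 -> 56 -> 28 -> 14 -> 7
print(size)              # 7: the last feature map before Flatten is 7x7x512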
Next, we will continue to absorb information from section 3:
Section 3 also provides some hyperparameters that are needed to train the VGG architecture:
- SGD(momentum=0.9): stochastic gradient descent with mini-batches, for compiling the model
- L2(l2=0.0005): L2 regularization (weight decay) for the first two Dense layers
- Dropout(0.5) for the first two Dense layers
- learning_rate=0.01 and factor=0.1 for the learning rate scheduler callback and for compiling the model
- RandomNormal(mean=0, stddev=0.001) for the layers' kernel (weight) initializers
- bias_initializer='zeros' for the layers' bias initializers
- GlorotNormal() or GlorotUniform() as alternatives for the layers' kernel (weight) initializers
- some augmentation (cropping, flipping, color shifting) for training
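Most of these training-time settings do not appear in the architecture code below, so here is a minimal sketch of how they could be wired up once the model is built. Only the optimizer settings, learning rate, and factor come from the paper; the loss, metric, and the monitor/patience values are assumptions for illustration:

# Minimal training-setup sketch; assumes the imports and the `model` built below.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',  # assumed loss for the 1000-class softmax
              metrics=['accuracy'])
# The paper decreases the learning rate by a factor of 10 when validation accuracy
# stops improving; ReduceLROnPlateau approximates that schedule.
lr_callback = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy',
                                                   factor=0.1,
                                                   patience=2)  # patience is an assumption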
Well done! Now, based on the above information, we will implement VGG-19 from scratch with the help of the TensorFlow (Keras) Functional API.
Importing the libraries:
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import L2
from tensorflow.keras import Model
Input:
inputs = Input(shape=(224,224,3), name='input')
Since Conv2D and Dense layers use kernel_initializer="glorot_uniform" and bias_initializer="zeros" by default, we don't need to specify them. However, I specified strides=(1,1) in Conv2D for clarity, even though it is the default value. As for MaxPooling2D and Flatten, they don't have any trainable parameters (weights and biases).
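If you preferred the paper's explicit initialization from section 3 over the Keras defaults, each Conv2D or Dense call could take it directly. The stand-alone layer below is only an illustration of the keyword arguments, using the stddev value from the list above; it is not part of the model we build next:

from tensorflow.keras.initializers import RandomNormal
# Hypothetical example layer showing explicit initializers; not used in the model below.
example_conv = Conv2D(filters=64, kernel_size=(3,3), padding='same',
                      activation='relu',
                      kernel_initializer=RandomNormal(mean=0.0, stddev=0.001),
                      bias_initializer='zeros')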
Conv Block 1: two Conv2D layers with 64 filters each, followed by MaxPooling2D
x = Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block1_conv_1')(inputs)
x = Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block1_conv_2')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block1_pool')(x)
Conv Block 2: two Conv2D layers with 128 filters each, followed by MaxPooling2D
x = Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block2_conv_1')(x)
x = Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block2_conv_2')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block2_pool')(x)
Conv Block 3: four Conv2D layers with 256 filters each, followed by MaxPooling2D
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_1')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_2')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_3')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block3_pool')(x)
Conv Block 4: four Conv2D layers with 512 filters each, followed by MaxPooling2D
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_1')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_2')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_3')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block4_pool')(x)
Conv Block 5: four Conv2D layers with 512 filters each, followed by MaxPooling2D
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_1')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_2')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_3')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block5_pool')(x)
After Conv Block 5, we need a Flatten layer: it converts the pooled 7x7x512 feature maps into a single long vector of 25,088 values, which is fed as input to the fully connected layers. We will also apply the L2 and Dropout regularization listed above.
Fully Connected Layers: FC-1 and FC-2
x = Flatten(name='flatten')(x)
x = Dense(4096, activation='relu', kernel_regularizer=L2(l2=0.0005), name='fully_connected_1')(x)
x = Dropout(0.5, name='dropout_1')(x)
x = Dense(4096, activation='relu', kernel_regularizer=L2(l2=0.0005), name='fully_connected_2')(x)
x = Dropout(0.5, name='dropout_2')(x)
Output Layer: FC-3 with softmax
outputs = Dense(units=1000, activation='softmax', name='output')(x)
Create the model and print its summary:
model = Model(inputs=inputs, outputs=outputs, name="VGG-19")
model.summary()
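If everything is wired up correctly, the summary should report roughly 143.7 million trainable parameters for the 1000-class VGG-19, most of them sitting in the first fully connected layer.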
Plot the model:
tf.keras.utils.plot_model(model, 'vgg-19.png', show_shapes=True, dpi=72, show_layer_activations=True)
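Note that plot_model requires the pydot and graphviz packages to be installed, and show_layer_activations needs a recent TensorFlow release.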
📌 To get the complete code visit my GitHub repository.
References:
- Karen Simonyan and Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556v6, 2015.