VGG paper to code with TensorFlow

Extracting the necessary information from the paper line by line and turning it into code

Zukhriddin
5 min read · Dec 31, 2022

The LeNet-5 architecture, created by Yann LeCun in 1998, was perhaps the most widely known CNN architecture; it used 5x5 convolution filters and average pooling without padding. The next big CNN architecture was AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of one another instead of placing a pooling layer after every convolutional layer. GoogLeNet, developed by Christian Szegedy et al. at Google Research, was another significant CNN architecture. It was much more efficient than its predecessors thanks to subnetworks called inception modules: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

The next notable CNN was the VGG (Visual Geometry Group) architecture, created by Karen Simonyan and Andrew Zisserman; it is a very simple, classical architecture.

First, skim the paper: look at Table 1, read the abstract and the introduction, and then the conclusion:

VGG paper

We will implement the VGG-19 architecture (19 weight layers). To make it easier to follow, I have highlighted the layers in different colors as blocks:

Extracted image from Table 1 of the VGG paper

Now that you have the big picture, let's start extracting the details by reading Section 2 of the paper:

Sect. 2 of the VGG paper

In Section 2, the VGG architecture's hyperparameters are explained:

  • input_shape=(224, 224, 3); input size: 224x224, with 3 channels for an RGB image
  • kernel_size=(3,3) for Conv2D layers
  • strides=(1,1) for Conv2D layers
  • padding='same' for Conv2D layers
  • pool_size=(2,2) and strides=(2,2) for MaxPooling2D layers
  • units=4096 and activation='relu' for the first two Dense (fully connected) layers
  • units=1000 and activation='softmax' for the last Dense layer

Next, we will continue to absorb information from Section 3:

Sect. 3 of the VGG paper

Section 3 also provides some hyperparameters needed to build the VGG architecture:

  • SGD(momentum=0.9): stochastic gradient descent with mini-batches, used when compiling the model
  • L2(l2=0.0005): L2 regularization (weight decay) for the first two Dense layers
  • Dropout(0.5) for the first two Dense layers
  • learning_rate=0.01 and factor=0.1 for a learning-rate scheduler callback and for compiling the model
  • RandomNormal(mean=0, stddev=0.001) for the layers' kernel (weight) initializers
  • bias_initializer='zeros' for the layers' bias initializers
  • GlorotNormal() or GlorotUniform() as an alternative for the layers' kernel (weight) initializers
  • some augmentation (cropping, flipping, color shifting) for training (see the sketch after this list)
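The paper's exact augmentation pipeline (scale jittering and a PCA-based RGB color shift) is beyond our scope here, but as a rough sketch, Keras preprocessing layers can approximate the cropping and flipping; RandomContrast below is my stand-in for color shifting, not the paper's method:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Resizing, RandomCrop, RandomFlip, RandomContrast

# Rough stand-in for the paper's augmentation: rescale, then take random
# 224x224 crops and horizontal flips. RandomContrast only approximates
# "color shifting"; the paper uses a PCA-based RGB shift, which Keras
# does not provide out of the box.
augmentation = Sequential([
    Resizing(256, 256),
    RandomCrop(224, 224),
    RandomFlip('horizontal'),
    RandomContrast(0.2),
], name='augmentation')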

Well done! Now, based on the information above, we will implement VGG-19 from scratch using the TensorFlow (Keras) Functional API.

Importing the libraries:

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import L2
from tensorflow.keras import Model

Input:

inputs = Input(shape=(224,224,3), name='input')

Since Conv2D and Dense layers use kernel_initializer='glorot_uniform' and bias_initializer='zeros' by default, we don't need to specify them. However, I specify strides=(1,1) in Conv2D, even though it is the default value, for clarity. As for MaxPooling2D and Flatten, they have no trainable parameters (weights and biases).
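If you instead want the paper's original RandomNormal initialization from the Section 3 list, a minimal sketch of how it could be passed to a layer:

from tensorflow.keras.initializers import RandomNormal

# Paper-style initialization, using the values from the Sect. 3 list above
paper_init = RandomNormal(mean=0.0, stddev=0.001)
# e.g. Conv2D(64, (3,3), kernel_initializer=paper_init, bias_initializer='zeros', ...)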

Conv Block 1: Conv2D layers with 64 filters followed by MaxPooling2D

x = Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block1_conv_1')(inputs)
x = Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block1_conv_2')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block1_pool')(x)

Conv Block 2: Conv2D layers with 128 filters followed by MaxPooling2D

x = Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block2_conv_1')(x)
x = Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block2_conv_2')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block2_pool')(x)

Conv Block 3: Conv2D layers with 256 filters followed by MaxPooling2D

x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_1')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_2')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_3')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block3_pool')(x)

Conv Block 4: Conv2D layers with 512 filters followed by MaxPooling2D

x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_1')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_2')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_3')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block4_pool')(x)

Conv Block 5: Conv2D layers with 512 filters followed by MaxPooling2D

x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_1')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_2')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_3')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block5_pool')(x)
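Incidentally, the five blocks differ only in filter count and depth, so the same graph can be built with a short loop. A sketch of an equivalent construction (use it instead of, not alongside, the explicit blocks above, since duplicate layer names would clash):

# (filters, number of conv layers) per block, read off Table 1 for VGG-19
x = inputs
for block, (filters, n_convs) in enumerate(
        [(64, 2), (128, 2), (256, 4), (512, 4), (512, 4)], start=1):
    for i in range(1, n_convs + 1):
        x = Conv2D(filters=filters, kernel_size=(3,3), strides=(1,1), padding='same',
                   activation='relu', name=f'block{block}_conv_{i}')(x)
    x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name=f'block{block}_pool')(x)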

After Conv Block 5 we add a Flatten layer: flattening converts the stack of pooled 2-D feature maps into a single long vector that can be fed into the fully connected layers. With a 224x224 input, five 2x2 poolings leave a 7x7x512 feature map, so Flatten produces a vector of 25,088 values. We also apply the L2 and Dropout regularization noted above.

Fully Connected Layers: FC-1 and FC-2

x = Flatten(name='flatten')(x)
x = Dense(4096, activation='relu', kernel_regularizer=L2(l2=0.0005), name='fully_connected_1')(x)
x = Dropout(0.5)(x)
x = Dense(4096, activation='relu', kernel_regularizer=L2(l2=0.0005), name='fully_connected_2')(x)
x = Dropout(0.5)(x)

Output Layer: FC-3 with softmax

outputs = Dense(units=1000, activation='softmax', name='output')(x)

Create the model and print its summary:

model = Model(inputs=inputs, outputs=outputs, name="VGG-19")
model.summary()

Plot the model (plot_model requires the pydot and graphviz packages to be installed):

tf.keras.utils.plot_model(model, 'vgg-19.png', show_shapes=True, dpi=72, show_layer_activations=True)
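The architecture is the focus of this article, but the Section 3 hyperparameters translate into compilation and training roughly as follows. This is a minimal sketch: train_ds and val_ds are hypothetical datasets, and patience=3 is my assumption, not a value from the paper:

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau

# SGD with momentum and the initial learning rate from Sect. 3
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Drop the learning rate by a factor of 10 when validation accuracy plateaus
lr_schedule = ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=3)

# train_ds and val_ds are hypothetical tf.data datasets of (image, label) batches:
# model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=[lr_schedule])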

📌 To get the complete code, visit my GitHub repository.

References:

  1. Karen Simonyan and Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556v6, 2015.
