VGG paper to code with TensorFlow

Extracting the necessary information from the paper line by line and turning it into code

Zukhriddin
5 min read · Dec 31, 2022

The LeNet-5 architecture, created by Yann LeCun in 1998, was perhaps the most widely known CNN architecture; it used 5x5 convolution filters and average pooling without padding. The next big CNN architecture was AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of one another instead of placing a pooling layer after every convolutional layer. GoogLeNet, developed by Christian Szegedy et al. at Google Research, was another significant CNN architecture. It was much more efficient than its predecessors thanks to subnetworks called inception modules: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

The next notable CNN was the VGG (Visual Geometry Group) architecture, created by Karen Simonyan and Andrew Zisserman; it is a very simple, classical architecture.

First, skim the paper: look at Table 1, read the abstract and the introduction, and then the conclusion:

VGG paper

We will implement the VGG-19 architecture (19 weight layers). To make it easier to follow, I have highlighted the layers in different colors as blocks:

Extracted image from Table 1 of the VGG paper

Now that you have the big picture, let's start extracting the details by reading Section 2 of the paper:

Sect. 2 of the VGG paper

In Section 2, the VGG architecture's hyperparameters are explained:

  • input_shape=(224, 224, 3); input size: 224x224, with 3 channels for an RGB image
  • kernel_size=(3,3) for Conv2D layers
  • strides=(1,1) for Conv2D layers
  • padding='same' for Conv2D layers
  • pool_size=(2,2) and strides=(2,2) for MaxPooling2D layers
  • units=4096 and activation='relu' for the first two Dense (fully connected) layers
  • units=1000 and activation='softmax' for the last Dense layer

Next, we will continue to absorb information from Section 3:

Sect. 3 of the VGG paper

Section 3 also provides some hyperparameters needed to build the VGG architecture:

  • SGD(momentum=0.9): stochastic gradient descent with mini-batches, used when compiling the model
  • L2(l2=0.0005): L2 regularization (weight decay) for the first two Dense layers
  • Dropout(0.5) for the first two Dense layers
  • learning_rate=0.01 and factor=0.1 for a learning-rate scheduler callback and for compiling the model
  • RandomNormal(mean=0, stddev=0.001) for the layers' kernel (weight) initializers
  • bias_initializer='zeros' for the layers' bias initializers
  • GlorotNormal() or GlorotUniform() as an alternative for the layers' kernel (weight) initializers
  • some augmentation (cropping, flipping, color shifting) for training (see the sketch after this list)
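The paper's exact augmentation pipeline (scale jittering and a PCA-based RGB color shift) is beyond our scope here, but as a rough sketch, Keras preprocessing layers can approximate the cropping and flipping; RandomContrast below is my stand-in for color shifting, not the paper's method:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Resizing, RandomCrop, RandomFlip, RandomContrast

# Rough stand-in for the paper's augmentation: rescale, then take random
# 224x224 crops and horizontal flips. RandomContrast only approximates
# "color shifting"; the paper uses a PCA-based RGB shift, which Keras
# does not provide out of the box.
augmentation = Sequential([
    Resizing(256, 256),
    RandomCrop(224, 224),
    RandomFlip('horizontal'),
    RandomContrast(0.2),
], name='augmentation')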

Well done! Now, based on the information above, we will implement VGG-19 from scratch using the TensorFlow (Keras) Functional API.

Importing the libraries:

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import L2
from tensorflow.keras import Model

Input:

inputs = Input(shape=(224,224,3), name='input')

Since Conv2D and Dense layers use kernel_initializer='glorot_uniform' and bias_initializer='zeros' by default, we don't need to specify them. However, I specify strides=(1,1) in Conv2D, even though it is the default value, for clarity. As for MaxPooling2D and Flatten, they have no trainable parameters (weights and biases).
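If you instead want the paper's original RandomNormal initialization from the Section 3 list, a minimal sketch of how it could be passed to a layer:

from tensorflow.keras.initializers import RandomNormal

# Paper-style initialization, using the values from the Sect. 3 list above
paper_init = RandomNormal(mean=0.0, stddev=0.001)
# e.g. Conv2D(64, (3,3), kernel_initializer=paper_init, bias_initializer='zeros', ...)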

Conv Block 1: Conv2D layers with 64 filters followed by MaxPooling2D

x = Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block1_conv_1')(inputs)
x = Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block1_conv_2')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block1_pool')(x)

Conv Block 2: Conv2D layers with 128 filters followed by MaxPooling2D

x = Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block2_conv_1')(x)
x = Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block2_conv_2')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block2_pool')(x)

Conv Block 3: Conv2D layers with 256 filters followed by MaxPooling2D

x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_1')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_2')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_3')(x)
x = Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block3_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block3_pool')(x)

Conv Block 4: Conv2D layers with 512 filters followed by MaxPooling2D

x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_1')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_2')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_3')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block4_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block4_pool')(x)

Conv Block 5: Conv2D layers with 512 filters followed by MaxPooling2D

x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_1')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_2')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_3')(x)
x = Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', name='block5_conv_4')(x)
x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name='block5_pool')(x)
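Incidentally, the five blocks differ only in filter count and depth, so the same graph can be built with a short loop. A sketch of an equivalent construction (use it instead of, not alongside, the explicit blocks above, since duplicate layer names would clash):

# (filters, number of conv layers) per block, read off Table 1 for VGG-19
x = inputs
for block, (filters, n_convs) in enumerate(
        [(64, 2), (128, 2), (256, 4), (512, 4), (512, 4)], start=1):
    for i in range(1, n_convs + 1):
        x = Conv2D(filters=filters, kernel_size=(3,3), strides=(1,1), padding='same',
                   activation='relu', name=f'block{block}_conv_{i}')(x)
    x = MaxPooling2D(pool_size=(2,2), strides=(2,2), name=f'block{block}_pool')(x)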

After Conv Block 5 we add a Flatten layer: flattening converts the stack of pooled 2-D feature maps into a single long vector that can be fed into the fully connected layers. With a 224x224 input, five 2x2 poolings leave a 7x7x512 feature map, so Flatten produces a vector of 25,088 values. We also apply the L2 and Dropout regularization noted above.

Fully Connected Layers: FC-1 and FC-2

x = Flatten(name='flatten')(x)
x = Dense(4096, activation='relu', kernel_regularizer=L2(l2=0.0005), name='fully_connected_1')(x)
x = Dropout(0.5)(x)
x = Dense(4096, activation='relu', kernel_regularizer=L2(l2=0.0005), name='fully_connected_2')(x)
x = Dropout(0.5)(x)

Output Layer: FC-3 with softmax

outputs = Dense(units=1000, activation='softmax', name='output')(x)

Create the model and print its summary:

model = Model(inputs=inputs, outputs=outputs, name="VGG-19")
model.summary()

Plot the model (plot_model requires the pydot and graphviz packages to be installed):

tf.keras.utils.plot_model(model, 'vgg-19.png', show_shapes=True, dpi=72, show_layer_activations=True)
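The architecture is the focus of this article, but the Section 3 hyperparameters translate into compilation and training roughly as follows. This is a minimal sketch: train_ds and val_ds are hypothetical datasets, and patience=3 is my assumption, not a value from the paper:

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau

# SGD with momentum and the initial learning rate from Sect. 3
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Drop the learning rate by a factor of 10 when validation accuracy plateaus
lr_schedule = ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=3)

# train_ds and val_ds are hypothetical tf.data datasets of (image, label) batches:
# model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=[lr_schedule])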

📌 To get the complete code, visit my GitHub repository.

References:

  1. Karen Simonyan and Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556v6, 2015.
