Deeper convolutional networks have always been considered good at learning finer and smaller features in an input. However, there were two main problems with making convolutional neural networks deeper and deeper. I have written about these in depth in my blog on ResNet; let me summarize them here, and you can find the detailed blog over here.

  • Accuracy Degradation: Beyond a certain point, adding another layer to the network resulted in no gain in accuracy, and adding even more layers caused accuracy to start degrading.

  • Vanishing Gradient: Training a deep neural network involves calculating gradients at every layer during backpropagation. By the chain rule of calculus, the gradient at an early layer is a product of many per-layer terms; since these terms are typically small numbers, the product becomes vanishingly small (often effectively zero) for the initial layers, so the weights and biases of those layers barely get tuned during training (a small numerical sketch of this effect follows below).
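
To make the chain-rule argument concrete, here is a tiny, purely illustrative Python sketch. The per-layer derivative magnitude of 0.25 is an assumed number, not taken from any paper; the point is only that a product of many terms smaller than 1 collapses toward zero as depth grows.

```python
# Illustrative only: gradient reaching the first layer is a product of
# per-layer derivative terms; if each term is < 1, the product vanishes.
local_grad = 0.25          # hypothetical per-layer derivative magnitude
for depth in (5, 20, 50):
    grad_at_first_layer = local_grad ** depth
    print(f"depth={depth:2d} -> gradient magnitude ~ {grad_at_first_layer:.2e}")
# depth= 5 -> gradient magnitude ~ 9.77e-04
# depth=20 -> gradient magnitude ~ 9.09e-13
# depth=50 -> gradient magnitude ~ 7.89e-31
```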

One approach to tackling this problem was to use residual or “skip” connections that let the error signal reach the initial layers of the model by bypassing intermediate layers. This is what the authors of the paper “Deep Residual Learning for Image Recognition” did, introducing us to ResNets.

In this article I want to talk about another workaround, one that also led to quite impressive improvements in accuracy and training time while still increasing the depth of these convolutional neural networks.

Previous research on convolutional neural networks showed that for these networks to be substantially deeper, more accurate, and efficient to train, they needed shorter connections between layers close to the input and those close to the output. A traditional network of depth L has L connections, one between each layer and its subsequent layer.

In the case of DenseNet, however, the authors turned a network that would normally have L connections into one with L(L+1)/2 direct connections. This was possible because, in the network,

  • For each layer, the feature maps of all preceding layers were used as inputs, and
  • The feature maps of each layer were in turn used as inputs by all subsequent layers.

The major advantages this had were:


  • It eliminates the vanishing gradient, as the error is propagated to the initial layers through the dense connections.
  • It strengthens feature propagation, as the feature maps of one layer are reused by all subsequent layers.
  • It improves feature reuse, which substantially decreases the number of trainable parameters.

Introduction

In their paper “Densely Connected Convolutional Networks”, the authors put forward a CNN-based architecture that tackles the problems of vanishing gradients and degradation when training deep convolutional networks by reducing them to a simple connectivity pattern. This not only helps the error propagate to the initial layers but also leads to better utilization of feature maps and a decrease in the number of trainable parameters, which makes the training of very deep neural networks easier.

Image Source: Densely Connected Convolutional Networks Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger

To achieve this, the architecture connects all layers that have matching feature map sizes, which ensures that maximum information flows between the layers.

To preserve the feed-forward nature of the neural network, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers in the network.

More importantly, unlike ResNets, where features are combined through summation before forming the input to the next layer, in DenseNets the feature maps are concatenated along the channel dimension. So the lth layer has l inputs, consisting of the feature maps of all preceding convolutional blocks, while its own feature maps are passed on to all L−l subsequent layers. This results in L(L+1)/2 connections in a network with L layers.
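
To see this difference in tensor terms, here is a minimal PyTorch-style sketch (the framework choice and all shapes are mine, picked just for illustration). Summation leaves the channel count unchanged, while concatenation carries the earlier feature maps forward untouched and grows the channel count.

```python
import torch

x      = torch.randn(1, 64, 32, 32)   # hypothetical input: 64 channels, 32x32
h_of_x = torch.randn(1, 64, 32, 32)   # pretend output of H_l(x), also 64 channels

# ResNet-style combination: element-wise summation, channel count unchanged
resnet_out = h_of_x + x
print(resnet_out.shape)               # torch.Size([1, 64, 32, 32])

# DenseNet-style combination: concatenation along the channel dimension,
# so the original feature maps of x are preserved alongside the new ones
densenet_out = torch.cat([x, h_of_x], dim=1)
print(densenet_out.shape)             # torch.Size([1, 128, 32, 32])
```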

Because of this, there is no need for the network to relearn redundant feature maps, which means it requires fewer parameters than traditional convolutional neural networks.

Advantages of DenseNet

In traditional CNNs, each layer reads the state from its preceding layer and writes to the subsequent layer; this changes the state, but it also passes on information that needs to be preserved. In the case of ResNets, information preservation is made explicit through the identity transformations achieved by skip connections, yet it has been found that many layers contribute very little and can in fact be randomly dropped during training. Moreover, the number of parameters in ResNets is substantially larger because each layer has its own weights and biases.

DenseNet, on the other hand, explicitly differentiates between information that is added to the network and information that is preserved. The layers in DenseNet are comparatively narrow (e.g., 12 filters per layer): each layer adds only a small set of feature maps to the “collective knowledge” of the architecture and keeps the remaining feature maps unchanged, and the final classifier makes its decision based on all the feature maps in the network.

DenseNet also tackles the problem of vanishing gradients quite nicely. Because there is an improved flow of information and gradients throughout the network, it is easier to train: each layer has direct access to the gradients from the loss function and to the original input signal, which amounts to an implicit deep supervision during training. This helps in training deeper networks.

DenseNets also exploit feature reuse, which makes these condensed models easy to train and highly parameter-efficient. Concatenating feature maps also increases the variation in the input of subsequent layers and improves efficiency.

Why DenseNets

Let's look at how DenseNet differs by considering a single image, say x_0, that is passed through a convolutional neural network comprising L layers, each implementing a non-linear transformation H_l(.), where l indexes the layer. H_l(.) could be a composite function of operations such as batch normalization, a rectified linear unit (ReLU), pooling, or convolution. Let the output of the lth layer be x_l.

A traditional convolutional neural network architecture connects the output of the lth layer as input to the (l+1)th layer, giving rise to the following layer-wise transition:

x_l = H_l(x_{l-1})
Source: Densely Connected Convolutional Networks Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger

In the case of ResNets, which add skip connections that bypass the non-linear transformation with an identity function, the layer-wise transition becomes:

x_l = H_l(x_{l-1})+x_{l-1}

As a result of using residual connections, the gradient can flow directly through the identity function from later layers to earlier ones. However, the identity function and the output of H_l are combined by summation, which may impede the flow of information through the network.


We can further improve the information flow across layers by using a different connectivity pattern: direct connections from any layer to all subsequent layers. As a result, the lth layer receives the feature maps of all preceding layers, x_0, x_1, …, x_{l-1}, as input, so the layer transition becomes:

x_l = H_l([x_0,x_1,...x_{l-1}])

Notice that [x_0, x_1, …, x_{l-1}] represents the concatenation of the feature maps produced by layers 0, 1, …, l-1, all of which directly contribute to the output of the lth layer.
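
As a rough sketch of how this transition could be realized in code (a simplified PyTorch-style loop with arbitrary channel sizes, not the authors' reference implementation), each layer consumes the concatenation of everything produced so far and appends its own k new feature maps:

```python
import torch
import torch.nn as nn

def dense_block_forward(x0, layers):
    """Forward pass through a dense block: each layer H_l sees the
    concatenation [x_0, x_1, ..., x_{l-1}] of all earlier feature maps."""
    features = [x0]
    for h_l in layers:
        x_l = h_l(torch.cat(features, dim=1))  # x_l = H_l([x_0, ..., x_{l-1}])
        features.append(x_l)                   # x_l is passed to every later layer
    return torch.cat(features, dim=1)

# Example with assumed sizes: block input with 64 channels, growth rate k = 12, 4 layers
k, in_ch = 12, 64
layers = nn.ModuleList([
    nn.Conv2d(in_ch + i * k, k, kernel_size=3, padding=1) for i in range(4)
])
out = dense_block_forward(torch.randn(1, in_ch, 32, 32), layers)
print(out.shape)  # torch.Size([1, 112, 32, 32]), i.e. 64 + 4*12 channels
```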

These dense connections between layers are what give the architecture its name: Dense Convolutional Network, or DenseNet.

The concatenation performed in the DenseNet layer transition is only possible when the feature maps have the same spatial size, yet downsampling feature maps to change their size is an essential part of convolutional networks.

To handle this, the architecture is divided into multiple densely connected dense blocks; the layers between these blocks are transition layers, which perform convolution and pooling. Each transition layer consists of a batch normalization layer and a 1×1 convolutional layer followed by a 2×2 average pooling layer.
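
A transition layer could be sketched roughly as follows in PyTorch, matching the BN → 1×1 conv → 2×2 average-pool description above (the module layout and parameter names are my own):

```python
import torch.nn as nn

class Transition(nn.Module):
    """Sketch of a DenseNet transition layer: batch norm, a 1x1 convolution,
    then 2x2 average pooling to halve the spatial resolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.bn   = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(self.bn(x)))
```

In the compressed DenseNet-BC variant described in the paper, the 1×1 convolution also reduces the number of channels (the paper halves it); in the plainest form, out_channels can simply equal in_channels.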

Growth Rate

Each function H_l produces k feature maps, so the lth layer has k0 + k*(l-1) input feature maps, where k0 is the number of channels in the input layer. A key difference between DenseNet and other architectures is that its layers are very narrow, and k is a hyperparameter called the growth rate. The authors show that a relatively small growth rate is sufficient to achieve state-of-the-art results on the benchmark datasets. A small growth rate suffices because every layer has access to the feature maps of all preceding layers, i.e., the “collective knowledge” of the network.
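
As a quick arithmetic check of the k0 + k*(l-1) expression (k0 = 24 and k = 12 are just illustrative values, not prescribed by the article):

```python
def input_channels(layer_index, k0=24, k=12):
    """Number of input feature maps seen by layer l inside a dense block,
    assuming the block input has k0 channels and each layer adds k maps."""
    return k0 + k * (layer_index - 1)

for l in (1, 2, 6, 12):
    print(f"layer {l:2d} sees {input_channels(l)} input feature maps")
# layer  1 sees 24, layer  2 sees 36, layer  6 sees 84, layer 12 sees 156
```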

Bottleneck Layers

Although each layer produces only k output feature maps, the number of inputs it receives is typically much larger. A 1×1 convolution is therefore introduced as a bottleneck layer before each 3×3 convolution to reduce the number of input feature maps, which improves computational efficiency.
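
A bottleneck layer (the DenseNet-B composite function) could be sketched as below. The 4·k width of the 1×1 convolution follows the choice reported in the paper; the exact module layout and names here are my own simplification.

```python
import torch
import torch.nn as nn

class BottleneckLayer(nn.Module):
    """Sketch of a DenseNet-B layer: BN-ReLU-1x1 conv (bottleneck to 4k maps)
    followed by BN-ReLU-3x3 conv producing the k new feature maps."""
    def __init__(self, in_channels, k):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * k, kernel_size=1, bias=False),   # bottleneck
            nn.BatchNorm2d(4 * k),
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False),  # k new maps
        )

    def forward(self, x):
        # Only the k new feature maps are returned; the caller concatenates
        # them with x, keeping the earlier feature maps unchanged.
        return self.layer(x)

# Example with assumed sizes
layer = BottleneckLayer(in_channels=112, k=12)
print(layer(torch.randn(1, 112, 32, 32)).shape)  # torch.Size([1, 12, 32, 32])
```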

Summary

Model Compactness: As a result of input concatenation, the feature maps learned by any DenseNet layer can be accessed by all subsequent layers, encouraging feature reuse and leading to a more compact model.

Implicit Deep Supervision: The improvement in accuracy of DenseNets can be attributed to the fact that the dense connections provide additional supervision from the loss function to all layers.

Stochastic vs. deterministic connection: There is an interesting connection between dense convolutional networks and stochastic depth regularization of residual networks. In stochastic depth, layers in residual networks are randomly dropped, which creates direct connections between the surrounding layers. As the pooling layers are never dropped, the network results in a similar connectivity pattern as DenseNet: there is a small probability for any two layers, between the same pooling layers, to be directly connected, if all intermediate layers are randomly dropped.

Finally, I have implemented the DenseNet architecture, with a detailed explanation of the code, in an interactive Jupyter notebook; you can find it here.