Convolutional deep neural networks typically learn low-level features in the early layers and progressively more complex features in the deeper layers. For example, the early-layer kernels may be edge and color-contrast detectors, and as we go further up the hierarchy, the early-layer outputs are combined to detect eyes, mouths and so on. These higher-level detector outputs are then combined to detect faces when the detector activations are in the correct spatial configuration. All of this is implemented by convolutions followed by activation functions. If the same network is also trained to detect bicycles, sunflowers, etc., the same edge detectors will likely work well for all tasks, but the number of features needed in the middle layers explodes. This is also computationally very wasteful: even if the image shows only a bicycle, the convolutions that combine the (undetected) early-layer flower features are still carried out. With gradient-based optimization, the kernels of those feature detectors may also get adjusted towards detecting bicycles, so the network forgets how to detect sunflowers.
So what can be done to A) reduce the computational burden of having all those feature detectors, and B) reduce forgetting, so that feature detectors that have found their niche are not unnecessarily affected by data that does not activate them?
Let’s try to tackle these problems. We write the depth-wise convolution \(\ast\) of an input \(x\) of size \((h_x, w_x, c_x)\), i.e. with \(c_x\) channels, with a kernel \(y\) of size \((h_y, w_y, c_x)\) as
\[x \ast y.\]
to be continued…
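As a concrete illustration of what such a depth-wise convolution computes, here is a minimal NumPy sketch. It assumes the usual deep-learning convention (cross-correlation, i.e. no kernel flipping), "valid" padding and stride 1, none of which are fixed by the text above; each input channel is filtered by its own kernel slice, so the output keeps \(c_x\) channels.

```python
import numpy as np

def depthwise_conv(x, y):
    """Depth-wise convolution of x (h_x, w_x, c_x) with kernel y (h_y, w_y, c_x).

    Assumed conventions: cross-correlation (no kernel flip), 'valid' padding,
    stride 1. Each channel of x is filtered by the matching channel of y,
    so the output has the same number of channels c_x.
    """
    h_x, w_x, c_x = x.shape
    h_y, w_y, c_y = y.shape
    assert c_y == c_x, "the kernel needs one slice per input channel"
    h_o, w_o = h_x - h_y + 1, w_x - w_y + 1
    out = np.zeros((h_o, w_o, c_x))
    for i in range(h_o):
        for j in range(w_o):
            patch = x[i:i + h_y, j:j + w_y, :]           # (h_y, w_y, c_x) window
            out[i, j, :] = (patch * y).sum(axis=(0, 1))  # sum over space, not channels
    return out

# Example: a 5x5 input with 3 channels and a 3x3 kernel give a 3x3x3 output.
x = np.random.randn(5, 5, 3)
y = np.random.randn(3, 3, 3)
print(depthwise_conv(x, y).shape)  # (3, 3, 3)
```

In a framework such as PyTorch, the same operation corresponds to a grouped convolution with the number of groups equal to the number of input channels.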