Abstract
Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark Krizhevsky et al. [18]. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
1. Introduction
- There is little understanding of the internal operation of complex convolutional network models, or of how they achieve their good performance.
- The paper therefore introduces a visualization technique that reveals which input stimuli excite individual feature maps.
- The same technique lets us observe how the features evolve during training and diagnose potential problems with the model.
- In addition, the authors perform a sensitivity analysis by occluding portions of the input image, revealing which parts of the scene are important for classification.
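A minimal sketch of such an occlusion sensitivity analysis in PyTorch (an assumption of these notes, not the paper's original code); the patch size, stride, and gray fill value are illustrative choices, and `model` / `image` stand for any trained classifier and preprocessed input:

```python
import torch

def occlusion_sensitivity(model, image, target_class, patch=32, stride=8, fill=0.5):
    """Slide a gray square over `image` (a (3, H, W) tensor) and record the
    classifier's probability for `target_class` at each position; positions
    where the probability collapses mark regions the model relies on."""
    model.eval()
    _, H, W = image.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                occluded = image.clone()
                y, x = i * stride, j * stride
                occluded[:, y:y + patch, x:x + patch] = fill  # gray occluder
                logits = model(occluded.unsqueeze(0))
                heatmap[i, j] = torch.softmax(logits, dim=1)[0, target_class]
    return heatmap  # low values => important regions
```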
1.1. Related Work
- Visualization: Prior work on visualizing features is largely limited to the first layer, where projections to pixel space are straightforward. In higher layers, the invariances of units are so complex that a simple quadratic approximation cannot capture them. This paper therefore takes a non-parametric approach, showing the patterns that activate a given feature map.
- Simonyan et al. project back from the fully connected layers of the network. → This paper projects back from the convolutional layers.
- Girshick et al. identify patches that strongly activate units at higher layers, but these are just crops of input images. → This paper uses top-down projections that reveal the structures within each patch that stimulate a feature map.
- Feature Generalization: The ability of convnet features to generalize to other datasets is also explored by Donahue et al. and Girshick et al.
2. Approach
- The paper uses standard fully supervised convnet models, which map a 2D input image to a probability vector over C classes through a series of layers.
- Each layer consists of: (1) convolution of the previous layer's output (or, in the case of the 1st layer, the input image) with a set of learned filters; (2) passing the responses through a ReLU non-linearity; (3) [optionally] max pooling over local neighborhoods; and (4) [optionally] a local contrast operation that normalizes the responses across feature maps.
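A minimal sketch of one such layer in PyTorch (our illustration, not the paper's implementation); the channel counts and kernel sizes below are placeholders, not the paper's exact architecture:

```python
import torch.nn as nn

# One layer in the sense of Section 2: learned filters, ReLU,
# optional max pooling, optional cross-channel contrast normalization.
layer = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2),  # (1) filtering
    nn.ReLU(inplace=True),                                               # (2) rectification
    nn.MaxPool2d(kernel_size=3, stride=2),                               # (3) max pooling
    nn.LocalResponseNorm(size=5),                                        # (4) contrast normalization
)
```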
2.1. Visualization with a Deconvnet
![Deconvnet-1.jpg](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/fe2f19be-7a6f-4279-9dd7-2a5df329517a/Deconvnet-1.jpg)
- In the paper, the authors use a Deconvolutional Network (deconvnet) to map feature activities back to the input pixel space.
- A deconvnet uses the components of a convnet (such as pooling and filtering) in reverse.
- A deconvnet is attached to each of the convnet layers.
- Unpooling: The model records the locations of the maxima within each pooling region as switch variables. Using these switches, the deconvnet can place reconstructions back at the correct locations and preserve the structure of the stimulus. (The locations of non-maximal values are not traced.) > We only need to know where the strongest activations occurred, so recording just those locations is enough.
- Rectification: The reconstructed signal is passed through a ReLU non-linearity at each layer, keeping the feature reconstructions positive, just as in the convnet.
- Filtering: The deconvnet applies transposed versions of the learned filters to the rectified maps; in effect, each filter is flipped horizontally and vertically. (These three operations are sketched in code after this list.)
- Together, these operations reconstruct a small piece of the original input image, with structures weighted according to their contribution to the feature activation.
- Since the model involves no generative process, the projections are not samples from the model. The procedure is therefore similar to backpropagating a single strong activation, except that the ReLU is imposed independently at each layer and no contrast normalization is used.
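A minimal sketch of the three reversal operations for a single layer, in PyTorch. It assumes the forward pass recorded the pooling switches via `nn.MaxPool2d(..., return_indices=True)`; the helper name `deconv_layer` is ours, not the paper's:

```python
import torch.nn.functional as F

def deconv_layer(feature_map, switches, conv, pool):
    """Reverse one convnet layer, as in Section 2.1. `switches` are the
    max-location indices recorded by the forward pool; `conv` is the
    layer's nn.Conv2d, whose learned weights are reused in transposed form."""
    # Unpooling: place each activation back at its recorded max location;
    # every non-maximal position stays zero.
    x = F.max_unpool2d(feature_map, switches,
                       kernel_size=pool.kernel_size, stride=pool.stride)
    # Rectification: keep the reconstruction non-negative, as in the convnet.
    x = F.relu(x)
    # Filtering: convolve with the transposed filters, i.e. each learned
    # filter flipped horizontally and vertically.
    x = F.conv_transpose2d(x, conv.weight,
                           stride=conv.stride, padding=conv.padding)
    return x
```

To visualize a single activation, all other activations in that layer's feature maps are set to zero, and the result is passed down through such layers repeatedly until input pixel space is reached.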