Gradient-Based Learning Applied to Document Recognition

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.

This paper argues that gradient-based, automatic learning can yield better pattern recognition systems. Two examples are presented: character recognition and document understanding.
LeNet-5, introduced in this paper, is one of the earliest convolutional neural network (CNN) models. These notes summarize the paper in a form that is easy for me to review. (They may contain grammatical errors.)

Introduction

Main message: better pattern recognition systems can be built by relying more on automatic learning and less on hand-designed heuristics.
Thanks to low-cost machines, large databases, and powerful machine learning techniques, the accuracy of handwriting and speech recognition has increased. Traditionally, recognition is performed with two modules: a hand-designed feature extractor that gathers the relevant information, and a trainable classifier that categorizes the result. The authors instead use gradient-based learning and back propagation for the recognition tasks, relying on raw inputs.

Convolutional Neural Networks for Isolated Character Recognition

Two problems arise when an ordinary fully connected feedforward network is used: a large number of parameters (and hence large memory requirements), and a limited ability to capture local correlations.
- An ordinary fully connected feedforward network has many thousands of trainable parameters, so it needs a larger training set and more memory.
- It ignores the topology of the input, that is, the correlations between neighboring pixels in an image.
Therefore, the convolutional network was devised to handle this variability through local receptive fields (local features), shared weights, and spatial or temporal subsampling.
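- local receptive fields: extract local features. With local receptive fields, elementary features (e.g. edges, corners, end-points) can be extracted. → Features can be extracted even under shifts and distortions, and the result is a feature map. The number of parameters is also reduced.
- shared weights: all units in a feature map share the same weights and bias. The same feature is therefore extracted regardless of its position in the input (robust to shifts). Because the same weights and bias are reused, the parameters add no extra memory, and keeping the parameter count small also helps prevent overfitting.
- spatial or temporal subsampling: the exact position of each feature is not important. (It can even reduce accuracy, because positions differ from image to image.) Subsampling therefore only assesses whether the feature is present. → The loss caused by discarding position information can be compensated by increasing the number of filters so that more diverse features are extracted. (Subsampling here is essentially the same concept as today's mean pooling.)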
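The three ideas above can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names, the 8*8 toy image, and the 3*3 all-ones kernel are assumptions chosen only to keep the example small. One shared kernel is slid over the image (local receptive fields and shared weights), and the resulting feature map is reduced by non-overlapping 2*2 averaging followed by a sigmoid (subsampling).

```python
import math

def conv2d_valid(image, kernel, bias=0.0):
    """Slide ONE shared kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[bias + sum(kernel[a][b] * image[i + a][j + b]
                        for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]

def subsample_2x2(fmap, coeff=1.0, bias=0.0):
    """Non-overlapping 2x2 averaging, scaled by a coefficient, plus bias, then sigmoid."""
    out = []
    for i in range(0, len(fmap) - 1, 2):
        row = []
        for j in range(0, len(fmap[0]) - 1, 2):
            mean = (fmap[i][j] + fmap[i][j + 1]
                    + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            row.append(1.0 / (1.0 + math.exp(-(coeff * mean + bias))))
        out.append(row)
    return out

def single_pixel_image(size, row, col):
    """Hypothetical toy input: all zeros except one bright pixel."""
    img = [[0.0] * size for _ in range(size)]
    img[row][col] = 1.0
    return img

kernel = [[1.0] * 3 for _ in range(3)]  # toy kernel, not a learned one
fmap = conv2d_valid(single_pixel_image(8, 2, 2), kernel)
fmap_shifted = conv2d_valid(single_pixel_image(8, 2, 4), kernel)
pooled = subsample_2x2(fmap)  # 6x6 feature map -> 3x3
```

Because the kernel is shared, moving the bright pixel two columns to the right moves the response in the feature map by the same two columns, while the kernel itself contributes only 9 weights and 1 bias no matter how large the image is.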
Subsequently, LeNet-5 was proposed for the pattern recognition task. Its structure is as follows.
- The model comprises a 32*32 pixel input image and seven layers with trainable weights. The input pixels are normalized to have a mean of zero and a variance of one.
  - The actual images are 28*28, but they are placed at the center of a 32*32 input so that features can be extracted more reliably. (This is similar to padding.)
- The seven layers are arranged in the following sequence: [C1 - S2 - C3 - S4 - C5 - FC - FC]
  1. convolutional layer (C1): 6 kernels of size 5*5 (stride=1)
  2. (nonoverlapping) subsampling (mean pooling) layer (S2): 6 kernels of size 2*2 (stride=2), followed by a sigmoid function
  3. convolutional layer (with noncomplete connection from S2) (C3): 16 kernels of size 5*5 (stride=1)
    - The feature maps of S2 and C3 are not all connected to each other in order 1) to keep the number of connections within reasonable bounds, and 2) to force a break of symmetry in the network (different feature maps receive different inputs, so they extract different, complementary features).
  4. subsampling layer (S4): 16 kernels of size 2*2 (stride=2)
  5. convolutional layer (C5): 120 kernels of size 5*5 (stride=1)
  6. fully connected layer (FC): tanh as activation function. 84 output units (a representation suited to recognizing strings of characters taken from the ASCII set)
  7. output layer (FC): Euclidean RBF units. 10 outputs. → The intent is to map the 84-unit representation to ASCII-style character codes and then pick out the 10 digits.
    - Here, the RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution. Compared with the more common sigmoid output units, it is more appropriate for discriminating similar characters and rejecting noncharacters.
- The loss function is a modified MSE suited to the RBF outputs. It pushes down the penalty of the correct class and pulls up the penalties of the incorrect classes. The gradient of the loss function is computed by back propagation.
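As a rough illustration of the RBF output layer and the loss, the sketch below computes each output as the squared Euclidean distance between the input vector and a class prototype, then adds a log-sum term over all classes to the penalty of the correct class. The function names, the 4-dimensional toy prototypes, and the constant `j=1.0` are assumptions for illustration; in LeNet-5 the input to this layer has 84 components.

```python
import math

def rbf_outputs(x, prototypes):
    """Penalty per class: squared Euclidean distance to that class prototype."""
    return [sum((xj - wj) ** 2 for xj, wj in zip(x, w)) for w in prototypes]

def discriminative_loss(penalties, target, j=1.0):
    """Correct-class penalty plus a log-sum term over all class penalties.

    The first term pushes the correct-class penalty down; the second term
    grows when other penalties are small, so it pulls the penalties of
    incorrect classes up. j is a positive constant (assumed 1.0 here).
    """
    competing = math.exp(-j) + sum(math.exp(-y) for y in penalties)
    return penalties[target] + math.log(competing)

# Toy prototypes with +1/-1 components (hypothetical, 4-dimensional).
prototypes = [
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, -1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0, -1.0],
]
x = [1.0, 1.0, -1.0, -1.0]

penalties = rbf_outputs(x, prototypes)
predicted = penalties.index(min(penalties))  # smallest penalty wins
loss_if_correct = discriminative_loss(penalties, target=0)
loss_if_wrong = discriminative_loss(penalties, target=1)
```

The predicted class is the one with the smallest penalty, and the loss is much lower when the target matches the closest prototype than when it does not.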
LeNet-5 was shown to have low memory requirements and to extract features that are robust to noise.
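To make the low memory requirement concrete, the layer sizes listed above can be turned into feature-map shapes and trainable-parameter counts with a few lines of arithmetic. This is a sketch under stated assumptions: the helper `conv_out` is a hypothetical name, the C3 fan-in pattern (6 maps fed by 3 S2 maps, 9 maps fed by 4, and 1 map fed by all 6) follows the connection table in the paper, each subsampling map is assumed to have one coefficient and one bias, and the RBF output layer is left out of the trainable count.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a convolution or subsampling without padding."""
    return (size - kernel) // stride + 1

# (name, kernel size, stride, number of feature maps)
layers = [("C1", 5, 1, 6), ("S2", 2, 2, 6), ("C3", 5, 1, 16),
          ("S4", 2, 2, 16), ("C5", 5, 1, 120)]

size = 32  # 32*32 input image
shapes = {}
for name, kernel, stride, maps in layers:
    size = conv_out(size, kernel, stride)
    shapes[name] = (maps, size, size)

# Number of S2 maps feeding each of the 16 C3 maps (noncomplete connection).
c3_fan_in = [3] * 6 + [4] * 9 + [6]

params = {
    "C1": 6 * (5 * 5 * 1 + 1),                  # 25 weights + 1 bias per map
    "S2": 6 * 2,                                # coefficient + bias per map
    "C3": sum(n * 5 * 5 + 1 for n in c3_fan_in),
    "S4": 16 * 2,
    "C5": 120 * (16 * 5 * 5 + 1),
    "F6": 84 * (120 + 1),                       # first fully connected layer
}
total_params = sum(params.values())

for name in ["C1", "S2", "C3", "S4", "C5"]:
    print(name, shapes[name], params[name])
print("F6", (84,), params["F6"])
print("total trainable parameters:", total_params)
```

The spatial size shrinks 32 → 28 → 14 → 10 → 5 → 1, so C5 acts on a 5*5 input and produces 120 maps of size 1*1.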