
CS231N - Lecture 6: Training Neural Networks I




Overview (Part I & Part II)





Part I





Activation Functions


  • Sigmoid


    • Near zero, the sigmoid has a linear regime; in that region it behaves like a linear function.
    • 3 problems
      • 1) Saturated neurons "kill" the gradients. 

        • Q1: What happens when x = -10? What does its gradient look like?
        • A1: Zero. The gradient becomes zero.
        • So, after the chain rule, this kills the gradient flow, and a 0 (zero) gradient is passed down to all the downstream nodes.
        • Q2: What happens when x = 0?
        • A2: It's fine in this regime. You get a reasonable gradient here, and backprop works fine.
        • Q3: What happens when x = 10?
        • A3: Again, when x is a large positive number, we're in a region where the sigmoid is flat, so it kills off the gradient.
      • 2) Sigmoid outputs are not zero-centered.

        • If all the inputs x are positive, the local gradient on each weight is always positive. To be more precise, the gradients on W are always going to be either all positive or all negative.
        • The sign of each gradient on W matches the sign of the upstream gradient. So, since the gradients on W are always either all positive or all negative, the weights always move in the same direction.
        • This gives very inefficient gradient updates (see the numeric sketch after this list).

        • The figure on the right shows an example where W is two-dimensional, so we have two axes for W.
        • The quadrants where the components are either all positive or all negative are the only two allowed gradient update directions.
        • Say our hypothetical optimal W is the blue vector, and we start at the top, at the beginning of the red arrows. (Assume the blue line is the optimal W vector!)
        • We can't take a gradient step directly along the blue direction, because it isn't one of the two allowed directions. So we have to take a sequence of gradient updates (each all positive or all negative), which produces the zig-zag path.
        • This is also why, in general, we want zero-mean data.
      • 3) exp() is a bit compute-expensive.
        • This is usually not the main problem, because the convolutions and dot products elsewhere in the network are far more expensive.
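A quick numeric sanity check on problems 1 and 2, as a minimal NumPy sketch of my own (not from the lecture): it evaluates the sigmoid's local gradient at x = -10, 0, 10, and shows that with all-positive inputs every component of the gradient on w shares the sign of the upstream gradient.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Problem 1: saturation kills the gradient.
for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    local_grad = s * (1 - s)  # d(sigmoid)/dx
    print(f"x = {x:+5.0f}  sigmoid = {s:.5f}  local grad = {local_grad:.5f}")
# At x = -10 and x = +10 the local gradient is ~0.00005: effectively killed.

# Problem 2: all-positive inputs force same-sign gradients on w.
x_in = np.array([0.5, 1.2, 0.3])   # all positive (e.g. sigmoid outputs)
upstream = -0.7                    # scalar upstream gradient dL/d(w.x)
grad_w = upstream * x_in           # dL/dw_i = upstream * x_i
print(grad_w)                      # every component has the upstream's sign
```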
  • Tanh
    • Squashes numbers to the range [-1, 1].
    • Zero-centered (nice!), but like the sigmoid it still kills gradients when saturated.


  • ReLU


    • Q1: What happens when x = -10? A1: Zero gradient.
    • Q2: What happens when x = 10? A2: It is in the linear regime.
    • Q3: What happens when x = 0? A3: Undefined at exactly zero, but effectively zero. So basically, ReLU is killing the gradient in half of the regime.

    • A dead ReLU never activates and never updates, in contrast to an active ReLU, for which some of the data activate the unit and some don't. There are a couple of reasons this can happen (the sketch after this item shows the 0/1 gradient mask).
    • First, it can happen with bad initialization. If the weights happen to lie off the data cloud, they specify a dead ReLU: the unit never gets a data input that activates it and never gets good gradient flow coming back, so it never updates and never activates. (With bad initialization, units can be dead from the very start.)
    • Second, and more common: the learning rate is too high. Because you're making huge updates, the weights jump around, and a ReLU unit can get knocked off the data manifold. It was fine at the beginning, or at some point, but then it died.
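A minimal forward/backward sketch of ReLU (my own, not from the lecture): the backward pass multiplies the upstream gradient by a 0/1 mask, so a unit whose pre-activations are all negative (a dead ReLU) receives exactly zero gradient and can never recover.

```python
import numpy as np

def relu_forward(x):
    out = np.maximum(0, x)
    cache = x                   # keep pre-activations for the backward pass
    return out, cache

def relu_backward(dout, cache):
    x = cache
    dx = dout * (x > 0)         # 0/1 mask: gradient flows only where x > 0
    return dx

# A "dead" unit: every pre-activation is negative.
x = np.array([-3.0, -0.5, -7.1])
out, cache = relu_forward(x)
dx = relu_backward(np.ones_like(x), cache)
print(out)  # [0. 0. 0.] -> never activates
print(dx)   # [0. 0. 0.] -> zero gradient: the weights feeding it never update
```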
  • Leaky ReLU

    • In PReLU, the slope in the negative regime is given by this parameter alpha ($\alpha$). We don't hard-code alpha; we treat it as a parameter that we backprop into and learn (sketched below).
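A hedged sketch of the Leaky ReLU forward/backward pass, plus the extra gradient PReLU needs for its learnable $\alpha$ (function names here are my own):

```python
import numpy as np

def leaky_relu_forward(x, alpha=0.01):
    out = np.where(x > 0, x, alpha * x)  # small slope instead of 0 for x <= 0
    return out, (x, alpha)

def leaky_relu_backward(dout, cache):
    x, alpha = cache
    dx = dout * np.where(x > 0, 1.0, alpha)
    # For PReLU, alpha is learnable, so we also compute its gradient:
    dalpha = np.sum(dout * np.where(x > 0, 0.0, x))
    return dx, dalpha
```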
  • ELU

    • Compared with the Leaky ReLU, instead of a slope in the negative regime, ELU actually builds back in a negative saturation regime (formula sketched below).
    • But there are arguments that this saturation gives you some more robustness to noise.
    • ELU is something in between the ReLU and the Leaky ReLU.
    • Like the Leaky ReLU, ELU gives closer-to-zero-mean outputs; but like the ReLU, ELU also has some of this saturating behavior.
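For reference, the ELU as usually defined ($\alpha$ is a hyperparameter, commonly 1):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Linear for x > 0; saturates smoothly toward -alpha for very negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```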
  • Maxout Neuron
    • Computes $\max(w_1^T x + b_1, w_2^T x + b_2)$: it generalizes ReLU and Leaky ReLU, operates in a linear regime, and doesn't saturate or die, but it doubles the number of parameters per neuron.


  • TLDR: In practice:
    • Use ReLU. Be careful with your learning rates.
    • Try out Leaky ReLU / Maxout / ELU.
    • Try out tanh, but don't expect much.
    • Don't use sigmoid.




Data Preprocessing
  • Step 1: Preprocess the data

    • The most common preprocessing is to make the original data zero-mean and then normalize it.
    • Why do we want to do this? Recall the earlier discussion of what happens when all the inputs are positive: in general, any such bias (all positive or all negative) causes the same zig-zag problem. (We normalize so the data isn't biased toward either side.) The reason for normalizing is so that all features are in the same range and contribute equally.
    • In practice, for images, we do the zero-centering but don't normalize the pixel values, because images generally already have relatively comparable scale and distribution across dimensions.
  • TLDR: In practice for images: center only

    • To summarize, for images we typically do zero-mean preprocessing, and we can subtract the entire mean image: from the entire training set, you compute the mean image. (Even if you train in mini-batches, the mean is computed over the whole training set!)
    • The mean image has the same size as each training image, e.g. 32 x 32 x 3. You get this array of numbers and subtract it from each image before passing the image through the network.
    • In practice, for some networks we instead subtract a per-channel mean, rather than keeping an entire mean image for zero-centering (see the sketch below).
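A minimal NumPy sketch of both variants (the shapes and toy data are my own illustration; the lecture slides cite AlexNet for the mean image and VGGNet for the per-channel mean):

```python
import numpy as np

# X: a toy stand-in for the training images, shape (N, 32, 32, 3)
X = np.random.randint(0, 256, size=(1000, 32, 32, 3)).astype(np.float64)

# Variant 1 (e.g. AlexNet): subtract the full mean image, shape (32, 32, 3).
mean_image = X.mean(axis=0)
X_centered = X - mean_image

# Variant 2 (e.g. VGGNet): subtract one mean per channel, shape (3,).
per_channel_mean = X.mean(axis=(0, 1, 2))
X_centered = X - per_channel_mean  # broadcasts over height and width
```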



Weight Initialization
  • Q: What happens when W = 0 init is used?
    • Since the weights are all zero, given an input, every neuron performs the same operation on the input and produces the same output. Then every neuron also gets the same gradient and updates in exactly the same way, so the neurons never differentiate. That's the problem when you initialize everything equally. (e.g., think it through with the chain rule on a sigmoid!)
  • First idea: Small random numbers (Gaussian with zero mean and 1e-2 ($1 \times 10^{-2}$) standard deviation)

    • In this case, we sample from a standard Gaussian, but we scale it so that the standard deviation is actually 0.01.
    • This works fine for small networks, but there are problems with deeper networks. Why?

    • Let's initialize a 10-layer neural network with 500 neurons in each of the 10 layers, use tanh nonlinearities, and initialize the weights with small random numbers. We take random input data and pass it through the entire network. At each layer, we look at the statistics of the activations that come out of that layer (the snippet below reproduces this experiment).
    • We compute the mean and the standard deviation at each layer.
    • Tanh is centered around zero, so the means stay near zero. However, the standard deviation shrinks and quickly collapses to zero.
    • Now let's think about the backward pass. The input values to each layer are very small, because they've all collapsed near zero.
    • In backprop, we have the upstream gradient flowing down. Because X (the input to each layer) is small, the weights get a very small gradient, and the weights barely update.
    • We're also multiplying by W over and over again in the backward pass, so we get the same phenomenon in the backward pass as in the forward pass: the gradients get smaller and smaller.
    • So the upstream gradients collapse to zero as well.

    • If we instead initialize with weights that are too large (e.g. standard deviation 1.0), then, recalling what happens at extreme input values to tanh, almost everything is saturated.
    • When the neurons are saturated, all the gradients will be zero, and the weights don't update.
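A compact reproduction of this experiment (my own paraphrase of the lecture's snippet):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(1000, 500)       # random input data
hidden_sizes = [500] * 10            # 10 layers, 500 units each

for i, fan_out in enumerate(hidden_sizes):
    fan_in = x.shape[1]
    W = 0.01 * np.random.randn(fan_in, fan_out)  # small random init
    x = np.tanh(x.dot(W))
    print(f"layer {i + 1}: mean {x.mean():+.5f}, std {x.std():.5f}")

# The printed std shrinks toward zero layer by layer. Scaling W by 1.0
# instead of 0.01 saturates the activations at -1 and +1 instead.
```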
  • Xavier initialization

    • The idea is basically to make the variance of the output the same as the variance of the input; Xavier does this by scaling each weight matrix by the number of its inputs.
    • If you have a small number of inputs, we divide by a smaller number and get larger weights. Because each weight multiplies only a few inputs, you need larger weights to get the same variance at the output.
    • If we have many inputs, we want smaller weights in order to get the same spread at the output.
    • But when using the ReLU nonlinearity, it breaks.
  • He initialization

    • Because it breaks when using the ReLU nonlinearity, you can divide by an extra factor of 2: you're basically adjusting for the fact that half the neurons get killed (see the sketch below).
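In code, the two rules differ only by that factor of 2 (a sketch; `fan_in` is the number of inputs to the layer):

```python
import numpy as np

fan_in, fan_out = 500, 500

# Xavier initialization: keeps the output variance near the input variance
# (works well with tanh).
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He initialization: divide fan_in by 2 to compensate for ReLU zeroing out
# (on average) half of the activations.
W_he = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
```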



Batch Normalization
  • "You want unit gaussian activations? just make them so."
    • Let's just make them that way and force them to be that way!
    • Consider a batch of activations at some layer. To make each dimension unit Gaussian, apply:

      $\hat{x}^{(k)} = \dfrac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$
    • Within the current batch, we can take the mean and the variance of the batch and normalize by them.
    • This is a vanilla differentiable function.

    • Think of the input as N training examples in the current mini-batch, where each example has dimension D.
    • We compute the mean and variance of each dimension across the current mini-batch and normalize by them. (If you picture the side view of the figure, the red up/down arrows indexed by k correspond to the nodes of the previous layer's output, i.e. the number of dimensions.)
    • This is usually inserted after fully connected or convolutional layers.

    • With convolutional layers, we don't normalize all the channels of the activation map at once; instead, we jointly normalize all the elements that belong to the same channel (one mean and variance per channel).
    • We're doing this batch normalization after every fully connected layer. But do we necessarily want a unit Gaussian input to a tanh layer?
    • What normalization does is force the input to lie in the linear regime of tanh. So we want to be able to adjust how much normalization we apply.

    • We have this additional scaling operation: we scale by some constant gamma ($\gamma$) and shift by another factor beta ($\beta$), i.e. $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$. (This lets the network recover the unnormalized values if it wants: $\gamma$ plays the role of the standard deviation and $\beta$ of the mean.)
  • To summarize

    • Squashing the activations toward a Gaussian distribution doesn't destroy any structure. We're just shifting and scaling the data a little so that the computation behaves well; it's a linear transformation (see the sketch below).
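A minimal training-time forward pass for batch normalization over an (N, D) input (a sketch of my own; the running statistics needed at test time are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D); gamma, beta: (D,) learnable scale and shift.
    mu = x.mean(axis=0)                    # per-dimension mean over the batch
    var = x.var(axis=0)                    # per-dimension variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # roughly unit-Gaussian activations
    return gamma * x_hat + beta            # lets the network undo it if useful

x = np.random.randn(64, 100) * 3.0 + 2.0
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # ~0 and ~1 per dimension
```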



Hyperparameter Optimization
  • Cross-validation strategy
    • Coarse -> fine cross-validation in stages
    • First stage: only a few epochs to get rough idea of what params work
      • With only a few epochs, you can already get a pretty good sense of which hyperparameter values are good and which are not.
    • Second stage: longer running time, finer search
      • You can run this for a longer time, and do a finer search over that region.
  • For example: run coarse search for 5 epochs

    • The regions highlighted in red are the ones we start the fine stage from.
    • One thing to note is that it's usually better to optimize the learning rate in log scale, i.e. sample the exponent uniformly, so you can cover several orders of magnitude (sketched below).
    • But there might still be better ranges outside the ones we searched, so it can help to shift the search range around.
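For example, a hedged sketch of log-scale sampling (the ranges are only illustrative):

```python
import numpy as np

# Sample the exponent uniformly, not the value itself:
lr = 10 ** np.random.uniform(-6, -3)   # learning rate spread over [1e-6, 1e-3]
reg = 10 ** np.random.uniform(-5, 5)   # regularization strength, 10 decades
```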
  • Random Search vs. Grid Search

    • Grid Search
      • We can sample from a fixed grid of values for each hyperparameter.
    • Random Search
      • But in practice, it's actually better to use random search, sampling a random value of each hyperparameter within a range. The reason is that if the function (the model's performance) is more sensitive to changes in one hyperparameter (the bump in the green distribution) than in another, random search gives you many more distinct samples of the important parameter, so you're able to see the shape of that green function (see the sketch below).
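A small sketch contrasting the two with the same budget of 9 trials (the ranges and names are my own):

```python
import numpy as np

# Grid search: 9 trials but only 3 distinct values per hyperparameter.
lrs = [1e-5, 1e-4, 1e-3]
regs = [1e-3, 1e-2, 1e-1]
grid_trials = [(lr, reg) for lr in lrs for reg in regs]

# Random search: 9 trials with 9 distinct values per hyperparameter, so the
# more important dimension gets sampled much more densely.
random_trials = [(10 ** np.random.uniform(-6, -3),   # learning rate
                  10 ** np.random.uniform(-4, 0))    # regularization
                 for _ in range(9)]
```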
  • Hyperparameters to play with:
    • Network architecture
    • Learning rate, its decay schedule, update type
    • Regularization (L2 / Dropout strength)
  • Monitor and visualize the loss curve


  • Bad initialization


  • Monitor and visualize the accuracy:




Summary
  • Activation Functions (use ReLU)
  • Data Preprocessing (images: subtract mean)
  • Weight Initialization (use Xavier init)
  • Batch Normalization (use)
  • Hyperparameter Optimization (random sample hyperparams, in log space when appropriate)



Next time: Training Neural Networks, Part 2

  • Parameter update schemes
  • Learning rate schedules
  • Gradient checking
  • Regularization (Dropout etc.)
  • Evaluation (Ensembles etc.)
  • Transfer Learning / Fine-Tuning