
CS231N - Lecture 7: Training Neural Networks II




Today




Fancier Optimization
  • Optimization

    • We define some loss function. For each value of the network weights, the loss function tells us how well or how badly those weights are doing on our problem.
    • Then we imagine that this loss function gives us some nice landscape over the weights.
  • Three Problems with SGD

1. Nasty zigzagging

    • Q: What if loss changes quickly in one direction and slowly in another? What does gradient descent do?
    • A: Very slow progress along shallow dimension, jitter along steep direction.

    • The loss function has a high (bad) condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large.
    • This problem actually becomes much more common in high dimensions. We're only showing a two-dimensional optimization landscape, but in practice, our neural networks might have millions, tens of millions, hundreds of millions of parameters.
2. Local minima or saddle point

    • Q: What if the loss function has a local minimum or saddle point?
    • A: Zero gradient
      • In the first picture, gradient descent gets stuck.
      • In the second picture, saddle points are much more common in high dimensions. A saddle point is not a local minimum: imagine a point where in one direction we go up and in the other direction we go down. (Simply put, a saddle point is a critical (stationary) point that is neither a local maximum nor a local minimum.) At such a point, the gradient is still zero.
3. Noisy gradient estimates due to minibatch stochasticity

    • We estimate the loss and the gradient using a small mini-batch of examples. 
    • This means we're not actually getting true information about the gradient at every time step. Instead, we're just getting a noisy estimate of the gradient at our current point.
    • We need to think about slightly fancier optimization algorithms.
  • SGD + Momentum
    • The idea is to add a momentum term to SGD.
    • (comment) Adding momentum literally gives the gradient-descent update some 'momentum': independent of the direction given by the current gradient, it remembers how it has been moving in the past and keeps moving some amount in that direction. Simply put, we want to zigzag less.

    • Build up "velocity" as a running mean of gradients. 
    • Rho $\rho$ gives "friction"; typically rho = 0.9 or 0.99. 
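A minimal sketch of the update loop (x, rho, and learning_rate are assumed to be defined; compute_gradient is a hypothetical helper that returns dL/dx):

```python
vx = 0
while True:
    dx = compute_gradient(x)   # hypothetical helper returning dL/dx
    vx = rho * vx + dx         # build up velocity as a running mean of gradients
    x -= learning_rate * vx    # step along the velocity; rho acts as friction
```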

    • Once we have velocity, even at a local minimum where the gradient = 0, we still have velocity, so we can get over the local minimum and continue downward (an escape effect!). (The same holds for saddle points.)

    • The red vector is the direction of the gradient at the current point; the green vector is the direction of the velocity.
    • The actual step (the actual update) is taken along a weighted average of the gradient and the velocity.
    • This helps overcome some noise in our gradient estimate.
  • Nesterov Momentum(Nesterov accelerated gradient, NAG)

    • You start at the red point and step in the direction of the velocity. At that point you evaluate the gradient, then go back to your original point and mix the velocity and the gradient.
    • If your velocity direction was actually a little bit wrong, this lets you rely more on the direction of the current gradient.
    • Nesterov also has some really nice theoretical properties in convex optimization, but it doesn't guarantee performance on non-convex problems.
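A minimal sketch of the look-ahead update (same hypothetical compute_gradient helper as above):

```python
v = 0
while True:
    dx_ahead = compute_gradient(x + rho * v)  # gradient at the look-ahead point
    v = rho * v - learning_rate * dx_ahead    # mix velocity and look-ahead gradient
    x += v
```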

    • (comment) Depending on the source, the signs in the equations are sometimes swapped. It is also originally written as below (the notation may differ slightly).
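One standard way of writing it (exact notation varies by source):

$$v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)$$
$$x_{t+1} = x_t + v_{t+1}$$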

    • (comment) With NAG you can move more effectively than with plain momentum. Plain momentum has the drawback that inertia can carry you well past the point where you should stop, whereas NAG first moves by the momentum and then comes back to decide how to move. So you keep the benefit of momentum's fast movement while being much better at braking at the right moment.
  • AdaGrad

    • With AdaGrad, we use a grad-squared term instead of a velocity term.
    • During training, we keep adding the squared gradients to this grad-squared term, as in the sketch below.
    • (comment) Simply put: "for parameters that haven't changed much so far, use a large step size; for parameters that have already changed a lot, use a small step size!"
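A sketch of the loop (same hypothetical compute_gradient helper as above, with numpy):

```python
import numpy as np

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx                                   # accumulate squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)  # per-parameter step size
```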

    • Q1: What happens with AdaGrad?
    • A1: Along a dimension with small gradients, the accumulated sum of squared gradients stays small, so we divide by a small number and accelerate movement along that dimension. Along a dimension with large gradients, we divide by a large number, so progress slows down.
    • Q2: What happens to the step size over long time?
    • A2: Here there's a problem. With AdaGrad, the steps (in x) actually get smaller and smaller and smaller, because we keep updating this estimate of the squared gradients over time. Even though each parameter x effectively gets its own learning rate at every step t, this estimate (grad_squared in the code above) grows monotonically over the course of training. So as training proceeds, the updates get weaker; if you kept training forever, at some point the update size would reach zero and nothing would be updated at all.
    • This is really good in the convex case: as you approach a minimum, you want to slow down so you actually converge. But in the non-convex case, it's a bit problematic, because as you come towards a saddle point, AdaGrad might get stuck.
    • Here 1e-7 is a smoothing term that avoids division by zero; typically something like 1e-7 or 1e-8 is used.
    • Also, since there is no need to tune the learning rate manually, most implementations use a default learning rate of 0.01.
    • In practice, AdaGrad is not commonly used when training neural networks.
  • RMSProp

    • We're going to let that squared estimate decay.
    • With RMSProp, after we compute the gradient, we keep a more current estimate of the grad squared: we multiply the previous estimate by a decay rate, commonly something like 0.9 or 0.99, before adding in the new squared gradient.
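A sketch of the loop (same assumptions as above, with decay_rate defined):

```python
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # let the squared estimate decay instead of growing monotonically
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```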
  • Adam
    • Momentum + RMSProp

    • Q: What happens at first time step?
    • A: At the very first time step, we've initialized the second moment to zero. Because beta2 is something like 0.9 or 0.99, after one update the second moment is still very close to zero. Then when we divide by the second moment in the update step, we make a very, very large step in x at the beginning.
    • Therefore, Adam adds a bias correction term: after updating the first and second moments, we form unbiased estimates of them by incorporating the current time step t, as in the sketch below.
    • Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
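A sketch of the full loop with bias correction (same assumptions as above, with beta1, beta2, and num_iterations defined):

```python
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # Momentum-like
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp-like
    first_unbias = first_moment / (1 - beta1 ** t)                  # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```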
  • First-Order Optimization

    • This is a kind of first-order Taylor approximation to the function.
    • But this approximation doesn't hold over very large regions, so we can't step too far in that direction.
  • Second-Order Optimization



    • When you generalize this to multiple dimensions, you get something called the Newton step, where you compute this Hessian matrix.
    • If we invert this Hessian matrix, we step directly to the minimum of the quadratic approximation to the function.
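In symbols: the second-order Taylor expansion around the current point $\theta_0$ is

$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^\top H (\theta - \theta_0)$$

and setting its gradient to zero gives the Newton parameter update

$$\theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)$$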

    • Additionally, this has no learning rate, because it always steps directly to the minimum of the approximation at every time step. In practice, though, you might have a learning rate anyway, because the quadratic approximation might not be perfect.
  • Summary of the different optimizer types
  • In practice:
    • Adam is a good default choice in many cases
    • SGD + Momentum with learning rate decay often outperforms Adam by a bit, but requires more tuning



Model Ensembles

  • Intro
    • Q: What can we do to try to reduce this gap between train and test error and make our model perform better on unseen (test) data?
    • A: One really quick and easy thing to try is this idea of model ensembles!
    • 1. Train multiple independent models (e.g., about 10 models, each trained independently)
    • 2. At test time, average their results
    • Enjoy 2% extra performance (less overfitting, better performance)
  • Tips and Tricks

    • At test time, you need to average the predictions of these multiple snapshots, as in the sketch below.
    • You can collect the snapshots during the course of training.
    • The ensemble members don't all need the same hyperparameters: you can ensemble different model sizes, learning rates, regularization schemes, and so on.
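A minimal sketch of test-time averaging, assuming each snapshot exposes a hypothetical predict_proba method that returns class probabilities:

```python
import numpy as np

def ensemble_predict(snapshots, X):
    # average the class probabilities over all snapshots, then take the argmax
    probs = np.mean([model.predict_proba(X) for model in snapshots], axis=0)
    return probs.argmax(axis=1)
```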



Regularization

  • How to improve single-model performance?

  • Add term to loss
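The generic form adds a penalty $R(W)$, weighted by a hyperparameter $\lambda$, to the data loss:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(x_i, W), y_i\big) + \lambda R(W)$$

For L2 regularization, $R(W) = \sum_k \sum_l W_{k,l}^2$; L1 and elastic net are other common choices.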

    • L2 regularization doesn't really make a lot of sense in neural networks.
  • Dropout

    • In each forward pass, randomly set some neurons to zero (i.e., set their activations to 0). The probability of dropping is a hyperparameter; 0.5 is common.
    • Implementation
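A minimal sketch of train-time dropout for one hidden layer, in the spirit of the lecture's numpy pseudocode (W1, b1, W2, b2 are assumed to exist):

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # hidden layer with ReLU
    U1 = np.random.rand(*H1.shape) < p      # binary dropout mask
    H1 *= U1                                # drop!
    return np.dot(W2, H1) + b2              # backward pass / update omitted
```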


    • How can this possibly be a good idea?

      • One interpretation
        • Forces the network to have a redundant representation; prevents co-adaptation of features.

        • If we have dropout, then in making the final decision about catness, the network can't depend too much on any one particular feature. Instead, it needs to distribute its idea of catness across many different features.
        • This might help prevent overfitting somehow.
      • Another interpretation
        • Dropout is training a large ensemble of models (that share parameters). Each binary dropout mask is one model, so even a single model gets an ensemble-like effect.
    • Dropout Summary

      • With dropout, we have this predict function, and at test time we multiply the layer outputs by the dropout probability, as in the sketch below.
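A sketch of test-time scaling (same assumed variables as the training sketch above):

```python
def predict(X):
    # scale activations by p so the test-time expectation matches training
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p
    return np.dot(W2, H1) + b2
```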
    • More common: "Inverted dropout"

      • What you can do is: at test time, you use the entire weight matrix unchanged, and at training time you divide by p instead.
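A sketch of inverted dropout (same assumed variables as above):

```python
def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # divide by p at training time
    H1 *= U1
    return np.dot(W2, H1) + b2

def predict(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)    # test time stays untouched
    return np.dot(W2, H1) + b2
```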
  • Data Augmentation

    • We train on these random transformations of the image rather than the original images.
    • Horizontal Flips

    • Random crops and scales

      • Training: sample random crops / scales
        • ResNet:
        • 1. Pick random L in range [256, 480]
        • 2. Resize training image, short side = L
        • 3. Sample random 224 x 224 patch
      • Testing: Average a fixed set of crops
        • ResNet:
        • 1. Resize image at 5 scales: {224, 256, 384, 480, 640}
        • 2. For each size, use 10 224 x 224 crops: (4 corners + center) $\times$ flips
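A minimal sketch of the training-time crop sampling (image is assumed to be an (H, W, 3) numpy array; resize is a hypothetical helper that rescales so the short side equals L):

```python
import numpy as np

def random_crop(image, resize):
    L = np.random.randint(256, 481)        # 1. pick random L in [256, 480]
    image = resize(image, short_side=L)    # 2. resize so the short side = L
    H, W = image.shape[:2]
    y = np.random.randint(0, H - 223)      # 3. sample a random 224 x 224 patch
    x = np.random.randint(0, W - 223)
    return image[y:y + 224, x:x + 224]
```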
    • Color Jitter
      • Simple: Randomize contrast and brightness

      • More Complex:
      • 1. Apply PCA to all [R, G, B] pixels in training set
      • 2. Sample a "color offset" along principal component directions
      • 3. Add offset to all pixels of a training image
      • But that's a little bit less common.
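A rough sketch of the PCA-based color offset (AlexNet-style; pixels is assumed to be an (N, 3) array of all training-set RGB values, and sigma a hypothetical jitter scale):

```python
import numpy as np

cov = np.cov(pixels - pixels.mean(axis=0), rowvar=False)  # 1. 3x3 RGB covariance
eigvals, eigvecs = np.linalg.eigh(cov)                    #    principal components

def color_jitter(image, sigma=0.1):
    alphas = np.random.normal(0, sigma, size=3)  # 2. sample a random "color offset"
    offset = eigvecs @ (alphas * eigvals)        #    along the PC directions
    return image + offset                        # 3. add it to every pixel
```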
  • A common pattern: add some kind of randomness during training, then average out the randomness (sometimes approximately) at test time
    • Prior examples: Batch Normalization, Dropout, Data Augmentation
    • Other examples:
      • Drop Connect


        • Rather than zeroing out the activations on every forward pass, we instead randomly zero out some of the values of the weight matrix.
      • Fractional Max Pooling

        • We randomize the regions over which we pool (i.e., the max-pooling regions are chosen randomly!).
      • Stochastic Depth

        • At training time (the network on the left), we randomly drop layers from the network.
        • At test time, we use the whole network.



Transfer Learning

  • Transfer learning busts this myth that you need a huge amount of data in order to train a CNN.
  • Transfer Learning with CNNs

    • 2. Small dataset: reinitialize the last weight matrix randomly (e.g., 4096 x 1000 -> 4096 x 10) and freeze the weights of all the previous layers. Then train only the parameters of this last layer and let it converge on your data.
    • 3. Bigger dataset: after you learn the last layer on your data, you can consider actually trying to update the whole network. A general strategy here is to drop the learning rate from its initial value when updating the network. (The network has already been trained on a large amount of data, so for our small dataset the weights only need small adjustments; hence the learning rate should be smaller than the original one.)


  • Transfer learning with CNNs is pervasive.

  • (comment) If you have some task but not a large dataset for it, first pretrain on a similar dataset, then reinitialize part of the model and fine-tune it on your own data, as in the sketch below.
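A rough illustration of steps 2-3 (not from the lecture) using PyTorch and torchvision's pretrained ResNet-18; the 10-class output and the learning rate are placeholders:

```python
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

model = models.resnet18(pretrained=True)  # CNN pretrained on ImageNet

for param in model.parameters():          # freeze all pretrained layers
    param.requires_grad = False

# reinitialize the last layer for a hypothetical 10-class problem
model.fc = nn.Linear(model.fc.in_features, 10)

# train only the new layer; if you later unfreeze and fine-tune the whole
# network, drop the learning rate below the one used for pretraining
optimizer = optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```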




Summary

  • Optimization
    • Momentum, RMSProp, Adam, etc
  • Regularization
    • Dropout, etc
  • Transfer learning
    • Use this for your projects!