

CS231N - Lecture 3: Loss Functions and Optimization





Loss Function 


  • Loss Function
    • Looking at the scores by eye, you can see that some are better or worse than others, but that is a hand-wavy way to judge a classifier.
    • To determine automatically which W is best, we need some way to quantify the goodness or badness of any particular W.
    • The function that takes a W, looks at the scores, and tells us quantitatively how bad that W is, is what we call a loss function.
    • A loss function tells us how good our current classifier is; more specifically, it lets us quantify, for any given value of W, how good or bad it is.
    • Given a dataset of examples $\{(x_i, y_i)\}_{i=1}^N$, where $x_i$ is an image and $y_i$ is an (integer) label
    • The loss over the dataset is the average of the per-example losses (a small code sketch follows the formula):
$$L = \frac{1}{N} \sum_i L_i (f(x_i, W), y_i)$$
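A minimal sketch of this formula, where the per-example loss `L_i` and the score function `f` are assumed callables (placeholders, not defined in the lecture yet):

```python
import numpy as np

def dataset_loss(L_i, f, X, y, W):
    """Average of the per-example losses over the dataset (the formula above).

    L_i(scores, label) and f(x, W) are assumed callables; X, y hold the examples.
    """
    return np.mean([L_i(f(x, W), y_i) for x, y_i in zip(X, y)])
```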

  • Data loss + Regularization

    • Suppose we have a dataset of blue points and we fit some curve to the training data. If the only thing we tell our classifier to do is fit the training data, it may produce a very wiggly curve that perfectly classifies every training point.
    • But we care about the performance on the test data! What we probably would have preferred is for the classifier to predict the straight green line instead.
    • The way we usually solve this is the concept of regularization: we add an additional term to the loss function. The regularization term encourages the model to somehow pick a simpler W (the full objective is written out after this list).

    • There are two ways to do regularization:
      • First, you can constrain your model class so it simply does not contain the more powerful, more complex models.
      • Second, you can add a soft penalty, where the model still has access to the more complex models.
    • In plain words, the soft constraint says: "if you want to use these more complex models, you need to overcome this penalty for using their complexity!"
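Written out, the full objective combines the data loss with the regularization penalty, weighted by a hyperparameter $\lambda$ (the regularization strength):

$$L = \frac{1}{N} \sum_i L_i (f(x_i, W), y_i) + \lambda R(W)$$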

 

    • L2 regularization (a.k.a. weight decay) penalizes the squared Euclidean norm of the weight vector W, $R(W) = \sum_k \sum_l W_{k,l}^2$ (e.g. it prefers $w_2$ in the sketch below)
      • 1) It prefers to spread the influence across all the different values in x,
      • 2) so the scores depend on the entire x vector rather than only on certain elements of it
      • The more evenly the weights are spread across the elements of W, the less complex the model is considered
    • L1 regularization penalizes the L1 norm of the weight vector, $R(W) = \sum_k \sum_l |W_{k,l}|$, and encourages sparsity in the matrix W (e.g. it prefers sparse solutions like $w_1$)
      • It measures model complexity by the number of zeros in the weight vector
      • The more zeros, the less complex
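A small numeric sketch of the lecture's $w_1$ vs. $w_2$ example (the concrete vectors follow the slide: $x = [1,1,1,1]$, $w_1 = [1,0,0,0]$, $w_2 = [0.25,0.25,0.25,0.25]$):

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])          # sparse: all influence on one element
w2 = np.array([0.25, 0.25, 0.25, 0.25])      # spread across all elements of x

# Both give the same score, so the data loss cannot tell them apart.
print(w1 @ x, w2 @ x)                        # 1.0 1.0

# The L2 penalty is lower for the spread-out w2, so L2 prefers w2 ...
print(np.sum(w1**2), np.sum(w2**2))          # 1.0 0.25

# ... while the L1 penalty ties here; in general L1 favors vectors with
# more exact zeros, i.e. sparse solutions like w1.
print(np.sum(np.abs(w1)), np.sum(np.abs(w2)))  # 1.0 1.0
```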
  • Softmax Classifier (Multinomial Logistic Regression)

    • For the Multinomial Logistic Regression loss function, we give the scores some additional meaning.
    • In particular, we use the scores to compute a probability distribution over our classes via the so-called softmax function, $P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$. After we send our scores through this softmax function, we end up with a probability distribution over our classes.

    • What we want is for the distribution coming out of the softmax function to match the target distribution that puts all of its mass on the correct class. In other words, we want the probability of the true class to be high, as close to 1 as possible. Our loss function is then $L_i = -\log{\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}} = -\log{(\text{the probability of the true class})}$.
    • Log is a monotonic function, and it's easier to maximize the log of a probability than to maximize the raw probability.
    • Loss functions measure badness, not goodness, so we put in the minus sign.

    • In conclusion: we exponentiate the scores so that they are all positive, then normalize them so that they sum to 1, and our loss is the minus log ($-\log$) of the normalized true class score. That is the softmax loss, also called multinomial logistic regression (see the sketch below).
    • The probability distribution we want is 1 on the correct class and 0 on the incorrect classes. If we ever reached that, the value inside the log would be 1 (the probability of the true class); log of 1 is 0, and minus log of 1 is still 0, so the minimum possible loss is 0.
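A minimal NumPy sketch of the softmax loss for one example (the score vector follows the lecture's cat/car/frog example; the max subtraction is the usual numerical-stability trick, not something from the slides):

```python
import numpy as np

def softmax_loss(scores, y):
    """Softmax (cross-entropy) loss for a single example.

    scores: 1-D array of unnormalized class scores; y: index of the true class.
    """
    shifted = scores - np.max(scores)                  # stability; doesn't change the result
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # exponentiate, then normalize to sum to 1
    return -np.log(probs[y])                           # -log P(true class)

scores = np.array([3.2, 5.1, -1.7])   # cat, car, frog
print(softmax_loss(scores, y=0))      # ≈ 2.04 (natural log), the loss for "cat"
```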



Optimization
  • Optimization Strategies
    • Strategy #1 : A first very bad idea solution : Random search

      • Random search will just take a bunch of Ws, sampled randomly, and throw them into our loss function and see how well they do. So spoiler alert, this is a really bad algorithm, you probably shouldn't use this.
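A minimal sketch of random search (the toy `loss_fn` stands in for the real dataset loss; the shapes are illustrative only):

```python
import numpy as np

def loss_fn(W):
    # toy stand-in; in practice this would be the full dataset loss L(W)
    return np.sum((W - 0.5) ** 2)

best_loss, best_W = float("inf"), None
for _ in range(1000):
    W = np.random.randn(10, 3073) * 0.0001   # sample a random candidate W
    loss = loss_fn(W)                        # see how well it does
    if loss < best_loss:                     # keep the best W seen so far
        best_loss, best_W = loss, W

print(best_loss)
```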
    • Strategy #2 : Follow the slope

      • Our x may be not a scalar but a whole vector, so we need to generalize the notion of a derivative to the multi-variable setting. That generalization is the gradient: a vector of partial derivatives.
      • If you want to know the slope of the landscape in any direction, it is the dot product of the gradient with the unit vector describing that direction. The gradient is super important because it gives you a linear, first-order approximation to your function at your current point.
      • Numerical gradient (computing the gradient numerically with finite differences)
        • If we have our W, we increment its first element by a small value h, re-compute the loss using our loss function and classifier, and use the change in loss to approximate the first partial derivative. Then we repeat for the second element, the third, and so on.
        • This is actually a terrible idea, because it's super slow. So in practice, you'll never want to compute your gradients numerically this way.
      • Analytic gradient (computing the gradient analytically with Calculus)
        • This is way more efficient than computing the gradient numerically via finite differences.
        • Rather than iterating over all the dimensions of W, we figure out the analytic expression for the gradient, write it down, and go directly from W to dW (the gradient) in one step.
      • In summary
        • Numerical gradient : approximate, slow, easy to write
        • Analytic gradient : exact, fast, error-prone
      • In practice
        • Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check
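A minimal sketch of such a gradient check: compute the gradient numerically with centered finite differences and compare it against the analytic gradient (the toy loss `f` and the shapes are placeholders):

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Approximate the gradient of f at W, one dimension at a time."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=["multi_index"], op_flags=["readwrite"])
    while not it.finished:
        i = it.multi_index
        old = W[i]
        W[i] = old + h; fp = f(W)        # loss with W_i nudged up by h
        W[i] = old - h; fm = f(W)        # loss with W_i nudged down by h
        W[i] = old                       # restore the original value
        grad[i] = (fp - fm) / (2 * h)    # centered difference
        it.iternext()
    return grad

f = lambda W: np.sum(W ** 2)             # toy loss; analytic gradient is 2W
W = np.random.randn(3, 4)
num = numerical_gradient(f, W)
print(np.max(np.abs(num - 2 * W)))       # tiny (~1e-10): the check passes
```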
  • Gradient Descent

    • Gradient descent: first we initialize W as something random, then in a loop we compute our loss and our gradient and update the weights in the opposite of the gradient direction. Because the gradient points in the direction of greatest increase of the function, minus (-) the gradient points in the direction of greatest decrease (see the sketch below).
    • Step size, sometimes called the learning rate, is probably the single most important hyperparameter you need to set when actually training these things in practice.
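A minimal runnable sketch of the vanilla gradient descent loop on a toy loss (the loss and step size are made up for illustration; for $f(W) = \sum W^2$ the analytic gradient is $2W$):

```python
import numpy as np

step_size = 0.1                    # the learning-rate hyperparameter
W = np.random.randn(3, 4)          # initialize W as something random
for _ in range(100):
    weights_grad = 2 * W                   # compute the gradient of the toy loss
    W += -step_size * weights_grad         # step opposite the gradient direction
print(np.abs(W).max())             # approaches 0, the minimum of the toy loss
```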
  • Stochastic Gradient Descent (SGD)

    • In practice, N can be very, very large, so rather than computing the loss and gradient over the entire training set, at every iteration we sample some small set of training examples, called a minibatch.
    • We use this small minibatch to compute an estimate of the full sum, and an estimate of the true gradient.
    • So the loop becomes: sample a random minibatch of data, evaluate your loss and gradient on the minibatch, and make an update on W (see the sketch below).
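A minimal runnable sketch of minibatch SGD on a toy least-squares problem (the data, model, batch size of 256, and step size are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true                     # toy regression targets

W = np.zeros(D)
step_size, batch_size = 0.1, 256
for _ in range(500):
    idx = rng.choice(N, batch_size, replace=False)            # sample a random minibatch
    grad = 2 * X[idx].T @ (X[idx] @ W - y[idx]) / batch_size  # estimate of the true gradient
    W += -step_size * grad                                    # make an update on W
print(np.abs(W - w_true).max())    # W approaches w_true
```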

    • (comment) Since so many similar terms come up (Gradient Descent, Mini-batch Gradient Descent, Stochastic Gradient Descent, Mini-batch Stochastic Gradient Descent), let's sort them out once.


Source : 딥러닝 기본 원리의 이해 (박희원)

Here $x^{(z)}, y^{(z)}$ denotes a single input-output pair from the training set; it is also often written $x^{(i)}, y^{(i)}$. The $bs$ appearing in the second formula, for Mini-batch Gradient Descent, is short for batch size. Now let's briefly define each term.
    • Gradient Descent (Batch Gradient Descent)
      • As in the leftmost figure above, a single step updates the parameters all at once over a batch, where the batch is assumed to be the entire dataset. The drawback is that, because the full dataset can be very large, every parameter update requires computing the complete loss function over the entire dataset.
    • Mini-batch Gradient Descent
      • This addresses the drawback of (batch) Gradient Descent: instead of updating over the full batch, the batch size is set to a mini-batch. Typical mini-batch sizes are around 10 to 256, but this can vary.
      • The middle figure above makes this easy to follow. Accordingly, one input-output pair from the training set becomes $x^{(z : z + bs)}, y^{(z : z + bs)}$; that is, a "pair" is redefined as a slice of batch-size examples.
    • Stochastic Gradient Descent (SGD, Online Gradient Descent)
      • SGD is the case of Mini-batch Gradient Descent where the batch size is 1 (the minimal mini-batch). Because each iteration uses only a single example, SGD is effective given enough iterations, but it is very noisy. As the word "stochastic" suggests, the single example used for each update is chosen at random.
      • So why is SGD better than Batch Gradient Descent (BGD)? First, look at the following figure.

      • BGD uses all m examples of the dataset, averages, and makes one update; SGD, by contrast, differentiates the cost of each individual example and makes m updates. Intuitively, BGD is more accurate about the direction toward the global minimum, while SGD is faster.
      • A characteristic of SGD is that it does not converge to the global minimum but hovers around it: since each update uses only one example, the movement is light but never heads exactly toward the global minimum. However, the smaller the learning rate $\alpha$, the closer it gets to converging. Also, because SGD trains faster than BGD, for large datasets we mostly use SGD rather than BGD.
    • Mini-batch Stochastic Gradient Descent (Mini-batch SGD)
      • This is a compromise between BGD and SGD: the full batch is split "randomly" into mini-batches, and each update uses a mini-batch of more than one example rather than a batch size of 1. As noted above, the mini-batch size can vary. Note that in practice the term SGD is commonly used even when mini-batches are involved (this terminology confused me a lot...). Mini-batch SGD reduces the noise of SGD while being more efficient than the full batch (see the sketch below).
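A compact sketch contrasting the three schedules within one epoch (the `grad` function, data, and learning rate are placeholders, not from the lecture):

```python
import numpy as np

def grad(Xb, yb, W):
    # placeholder per-batch gradient, e.g. for least squares: 2/B * X^T (XW - y)
    return 2 * Xb.T @ (Xb @ W - yb) / len(yb)

def one_epoch(X, y, W, lr, batch_size):
    """batch_size = len(X): BGD (one update per epoch);
    batch_size = 1: SGD (N updates); in between: mini-batch SGD."""
    idx = np.random.permutation(len(X))          # "stochastic": shuffle the examples
    for z in range(0, len(X), batch_size):
        b = idx[z : z + batch_size]              # the x^(z : z+bs), y^(z : z+bs) slice
        W += -lr * grad(X[b], y[b], W)           # one parameter update
    return W
```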



Next time
  • Introduction to neural networks
  • Backpropagation