

CS231N - Lecture 5: Convolutional Neural Networks




Convolutional Neural Networks 



  • Fully Connected Layer

    • Let's say we have a 3D image, 32 x 32 x 3. We stretch all of the pixels out into a 3072-dimensional vector.
    • Then we have the weights. Here, for example, our W is 10 x 3072.
    • We take each of the 10 rows of W and compute a dot product with the 3072-dimensional input, giving 10 outputs.
    • (comment) A fully connected layer operates on a one-dimensional vector and usually sits at the back of a deep neural network.
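The fully connected layer above can be sketched in a few lines of NumPy (the shapes follow the lecture's 32 x 32 x 3 example; the random weights are just placeholders):

```python
import numpy as np

# A minimal sketch of a fully connected layer, assuming the
# 32 x 32 x 3 input image from the lecture example.
x = np.random.randn(32, 32, 3)        # input image
x_flat = x.reshape(-1)                # stretch into a 3072-dim vector

W = np.random.randn(10, 3072) * 0.01  # 10 x 3072 weight matrix
b = np.zeros(10)                      # one bias per output neuron

scores = W @ x_flat + b               # 10 dot products -> 10 outputs
print(scores.shape)                   # (10,)
```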


  • Convolution Layer

    • Instead of stretching this all out into one long vector, we now keep the structure of the image, this three-dimensional input.
    • Our weights are going to be small filters, in this case for example a 5 x 5 x 3 filter; we take this filter, slide it over the image spatially, and compute dot products at every location.
    • The result of taking a dot product between the filter and a small 5 x 5 x 3 chunk of the image (i.e. a 5 x 5 x 3 = 75-dimensional dot product plus a bias) is $w^T x + b$.
    • (comment) The basic idea is that the full image tensor is too large to learn from all at once as a single input. So instead of feeding in the whole image, we break the work up with a filter. In the example above this is done with a 5 x 5 x 3 filter, with a bias added, computed as a dot product. The filter and bias stay the same as they move: they do not change at each sliding step, but the same values are used for every dot product. In the end, what we have to update are the filter and the bias.

    • We're going to do this operation and fill in the corresponding point in our output activation.

    • We want to work with multiple filters. Because each filter is kind of looking for a specific type of template or concept in the input volume. 
    • So we take a second filter, this green filter, which is again 5 x 5 x 3.

    • If we have six filters, each 5 x 5 x 3, then we get six activation maps out in total.
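The sliding dot product described above can be sketched naively in NumPy (stride 1, no padding; one 5 x 5 x 3 filter over a 32 x 32 x 3 input, giving a 28 x 28 activation map):

```python
import numpy as np

# A minimal sketch of one convolutional filter sliding over an
# image, assuming the 32 x 32 x 3 input and 5 x 5 x 3 filter
# from the lecture (stride 1, no padding).
x = np.random.randn(32, 32, 3)   # input volume
w = np.random.randn(5, 5, 3)     # one filter
b = 0.1                          # its bias

H, W_, F = 32, 32, 5
out = np.zeros((H - F + 1, W_ - F + 1))   # 28 x 28 activation map

for i in range(H - F + 1):
    for j in range(W_ - F + 1):
        chunk = x[i:i+F, j:j+F, :]         # 5 x 5 x 3 chunk of the image
        out[i, j] = np.sum(chunk * w) + b  # 75-dim dot product + bias

print(out.shape)  # (28, 28)
```

With six such filters, repeating this loop per filter would stack six of these maps into a 28 x 28 x 6 output.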
  • Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions


    •  We intersperse these with activation functions, for example, a ReLU activation function.

    • We pass the input image through this sequence of layers, starting with a convolutional layer.
    • We usually have a non-linear layer after each convolutional layer.
    • ReLU is something that's very commonly used.
    • Pooling layer basically downsamples the size of our activation maps.
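The ReLU non-linearity mentioned above is just an elementwise max with zero; a minimal sketch:

```python
import numpy as np

# ReLU applied elementwise to a small activation map:
# negative entries become 0, positive entries pass through.
a = np.array([[-1.0, 2.0],
              [ 3.0, -4.0]])
relu = np.maximum(0, a)
print(relu)  # [[0. 2.]
             #  [3. 0.]]
```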
  • A closer look at spatial dimensions:

    • We take the filter, apply it starting from the upper left-hand corner, and compute the dot product. The result goes into the upper left-hand value of our activation map.
    • We continue sliding to get the next values; with a 7 x 7 input and a 3 x 3 filter, we end up with a 5 x 5 output.

    • Now let's apply a stride. Previously we used a stride of one, so let's see what happens with a stride of two.
    • If we use a stride of two, only three positions fit in each direction, so we get a 3 x 3 output.

    • With a stride of three, we slide the filter over by three and it no longer fits nicely within the image.
    • We don't do convolutions like this, because it leads to asymmetric outputs.

    • We computed what the output size is going to be; this works out to a nice formula: output size = (N - F) / stride + 1.
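The formula above can be written as a small helper (the function name is mine, not the lecture's):

```python
# Output size of a convolution over an N x N input with an
# F x F filter: (N - F) / stride + 1, valid only when the
# filter fits cleanly.
def conv_output_size(N, F, stride):
    assert (N - F) % stride == 0, "filter does not fit cleanly"
    return (N - F) // stride + 1

print(conv_output_size(7, 3, 1))  # 5 -> 5 x 5 output
print(conv_output_size(7, 3, 2))  # 3 -> 3 x 3 output
# conv_output_size(7, 3, 3) raises: stride 3 does not fit a 7 x 7 input
```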
  • In practice: Common to zero pad the border

    • With a stride of three, this doesn't really work out. In practice it's common to zero pad the borders to make the sizes work out; this is also related to what happens at the corners.
    • We pad the input image with zeros, and now we're able to place a filter centered at the upper right-hand pixel location of the actual input image.
    • Q1: Does the zero padding add some sort of extraneous features at the corners?
    • A1: We're doing our best to still get some value out of, and process, those corner regions of the image, and zero padding is one way to do this.
    • Q2: Why do we do zero padding?
    • A2: We zero pad in order to maintain the same spatial size as the input. The motivation: if you stack multiple of these layers without zero padding, the outputs shrink very quickly. With a fairly deep network, the size of the activation maps shrinks to something very small, and we also lose some corner information each time.
  • Examples time:

    • (32 + 2*2 (number of pads) - 5)/1 + 1 = 32, so the output is 32 x 32 x 10. (Remember the formula: (N + 2P - F)/stride + 1)

    • {(5 x 5 x 3) weights + 1 bias} x 10 filters = 760 parameters.
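Both example computations can be checked with a few lines of Python (helper names are mine, not the lecture's):

```python
# Checking the lecture's example numbers: ten 5 x 5 x 3 filters
# applied with stride 1 and pad 2 to a 32 x 32 x 3 input.
def conv_output_size(N, F, stride, pad):
    return (N + 2 * pad - F) // stride + 1

def conv_num_params(F, depth, num_filters):
    return (F * F * depth + 1) * num_filters  # +1 bias per filter

print(conv_output_size(32, 5, 1, 2))  # 32 -> output is 32 x 32 x 10
print(conv_num_params(5, 3, 10))      # 760
```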
  • Summary. To summarize, the Conv Layer:

    • We have to choose all of these: the number of filters, the filter size, the stride, and the amount of zero padding.
    • You can then go through the computations from earlier to find out what your output volume is going to be and how many parameters you have in total.

    • There are some common settings of this.
    • Filter size : 3 x 3, 5 x 5
    • Stride : 1, 2 (usually)
    • Padding : whatever preserves the spatial extent (for stride 1, P = (F - 1)/2)
    • Total number of filters K : powers of 2 (32, 64, 128, 512, ...)
  • Pooling Layer
    • makes the representations smaller and more manageable (a smaller representation means fewer parameters in the layers that follow)
    • operates over each activation map independently

    • What the pooling layer does is exactly just downsampling.
    • For example, a 224 x 224 x 64 input is downsampled to a 112 x 112 x 64 output.
    • It's important to note this doesn't do anything in the depth!

    • A common way is max pooling. 
    • In this case our pooling layer also has a filter size, which is the region over which we pool.
    • We slide the filter along the input in exactly the same way as we did for the convolutional layer.
    • But instead of computing dot products, we just take the maximum value of the input volume in that region.
    • Q1: Why is max pooling better than average pooling?
    • A1: The intuition behind why max pooling is commonly used is that each value indicates how much this neuron (this filter) activated at this location. If we're thinking about detection or recognition, we care more about whether some aspect of the image occurred anywhere in this region than about its average response over the region.
    • Q2: Since pooling and stride both have the same effect of downsampling, can you just use stride instead of pooling and so on?
    • A2: In practice, in more recent neural network architectures, people have begun to use strided convolutions to do the downsampling instead of pooling. Pooling can be seen as one kind of strided downsampling.

    • Some common settings for the pooling layer is a filter size of 2 x 2 or 3 x 3, stride 2.
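Max pooling with the common 2 x 2, stride-2 setting can be sketched as (the input values here are illustrative):

```python
import numpy as np

# 2 x 2 max pooling with stride 2 over a single 4 x 4 activation
# map: each output value is the max of one non-overlapping
# 2 x 2 region.
x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)

F, S = 2, 2
out = np.zeros((x.shape[0] // S, x.shape[1] // S))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = x[i*S:i*S+F, j*S:j*S+F].max()

print(out)  # [[6. 8.]
            #  [3. 4.]]
```

On a full activation volume this loop would run independently over each of the depth slices, which is why pooling leaves the depth unchanged.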


Summary

    • A ConvNet is basically stacks of these convolutional and pooling layers, followed by fully connected layers at the end.
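A hypothetical shape walkthrough of such a stack, reusing the output-size formula from earlier (the layer sizes here are illustrative, not from the lecture):

```python
# Tracking spatial size and depth through a small conv/pool/FC
# stack using output size = (N + 2P - F) / stride + 1.
def conv(N, F, stride=1, pad=0):
    return (N + 2 * pad - F) // stride + 1

N, depth = 32, 3                    # 32 x 32 x 3 input
N = conv(N, 5, pad=2); depth = 10   # CONV 5x5, 10 filters -> 32 x 32 x 10
N = conv(N, 2, stride=2)            # POOL 2x2, stride 2   -> 16 x 16 x 10
N = conv(N, 5, pad=2); depth = 20   # CONV 5x5, 20 filters -> 16 x 16 x 20
N = conv(N, 2, stride=2)            # POOL 2x2, stride 2   -> 8 x 8 x 20
fc_inputs = N * N * depth           # flatten for the FC layer
print(fc_inputs)  # 1280
```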