
CS231N - Lecture 2: Image Classification pipeline




Image Classification 

  • A core task in Computer Vision (CV)
  • Image Classification Task
    • Input : image
    • Output : categories or labels

  • The problem : Semantic Gap (human <-> machine)
    • The visual system in your brain is hardwired to do these visual recognition tasks.
    • But the computer represents the image as a gigantic grid of numbers, e.g. 800 x 600 x 3 (3 RGB channels).
  • Challenges
    • Viewpoint variation
      • If we move the camera to the other side, every single pixel in this giant grid of numbers will be completely different.
    • Illumination
      • There can be different lighting conditions going on in the scene.
    • Deformation
      • Objects can deform.
    • Occlusion
      • You might only see part of a cat.
    • Background Clutter
      • The object of interest can actually look quite similar in appearance to the background.
    • Intraclass variation
      • One notion of "cat" actually spans a lot of different visual appearances.

  • Attempts have been made
    • One approach : compute the edges of the image, then try to categorize the different arrangements of corners and edges with hand-written rules.
    • The problem : this doesn't scale; if you want to recognize another object category, you have to start over and build a separate set of rules.
  • Data-Driven Approach
    1. Collect a dataset of images and labels
    2. Use Machine Learning to train a classifier
    3. Evaluate the classifier on new images
  • First classifier : Nearest Neighbor
    • Train
      • Input : images and labels, output : a model
      • Simply memorize all of the data and labels
    • Predict
      • Input : the model and new images, output : predicted labels
      • Predict the label of the most similar training image
  • Example Dataset : CIFAR-10
    • 10 classes
    • 50,000 training images
    • 10,000 testing images
  • Distance Metric to compare images
    • L1 distance (Manhattan distance) : $d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$ (sum over pixels $p$ of the absolute differences)

  • Nearest Neighbor Classifier with Python (a sketch follows this list)
    • Train simply memorizes the training data.
    • The loop in Predict 1) finds the closest training image and 2) predicts the label of that nearest image.
    • Q: With N examples, how fast are training and prediction? A: Train O(1), predict O(N)
    • This is backwards : we want classifiers that are fast at prediction; slow training is OK!
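
A minimal numpy sketch of the classifier just described, assuming each row of X is a flattened image and y holds the labels (variable names are mine, but the slide code has the same shape):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        """X is N x D, one flattened training image per row; y is a length-N label vector."""
        # Training is just memorization: O(1).
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        """X is M x D, one flattened test image per row."""
        num_test = X.shape[0]
        y_pred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # 1) L1 distance from the i-th test image to every training image
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # 2) predict the label of the nearest training image: O(N) per test image
            y_pred[i] = self.ytr[np.argmin(distances)]
        return y_pred
```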
  • Distance Metric

    • L1 (Manhattan) distance
      • The L1 distance takes the sum of the absolute values of the differences between the pixels.
      • The square on the left is actually a "circle" according to the L1 distance : the points at distance 1 from the origin form this square shape around the origin.
        • Article link 1 : Taxicab geometry - Wikipedia, article link 2 : 맨하탄 거리(Manhattan Distance)
        • Video link 1 : Euclidean and Taxicab Distances, video link 2 : TaxiCab Geometry
        • (comment) Both figures connect the points at distance 1 from the origin. The definition of distance changes depending on which Lp you use, and so does the resulting shape. (For a distance to be defined, a space must first be defined; geometrically, "Taxicab geometry" and "Euclidean geometry" are similar but different, i.e. the spaces are different, and so are their definitions of distance.) More concretely, a circle is the set of points at a given 'distance' r from a point, so the shape of a "circle" depends on which distance you adopt. For example, as the graph of $x^2+y^2=r^2$ shows, the "circle" under the L2 distance is the familiar circle, but the "circle" under the L1 distance is $|x|+|y|=r$, which, when drawn, is a diamond-shaped square.
      • If you rotate the coordinate frame, the L1 distance actually changes (see the small check after this list).
      • If the individual elements of the feature vector have some important meaning for your task (e.g. height and weight), then L1 might be a more natural fit.
    • L2 (Euclidean) distance
      • The L2 (Euclidean) distance takes the square root of the sum of the squared differences.
      • The "circle" under the L2 distance is the familiar circle.
      • Changing the coordinate frame does not change the L2 distance.
      • If the feature vector is just a generic vector in some space and you don't know the substantive meaning of its elements, then L2 may be slightly more natural.
    • Q1 : When might the L1 distance be preferable to the L2 distance?
    • A1 : It is mainly problem-dependent; it is hard to say in advance which will work better. But because L1 depends on the coordinate system of your data, using L1 may make a bit more sense when the individual elements actually have some meaning.
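
To make the coordinate-frame point concrete, here is a small numpy check (my own illustration, not from the slides): rotating a point by 45° changes its L1 distance from the origin, but not its L2 distance.

```python
import numpy as np

p = np.array([1.0, 0.0])
theta = np.pi / 4                       # rotate the coordinate frame by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
q = R @ p                               # the same point in rotated coordinates

print(np.abs(p).sum(), np.abs(q).sum())              # L1: 1.0 vs ~1.414 (changed)
print(np.sqrt((p**2).sum()), np.sqrt((q**2).sum()))  # L2: 1.0 vs 1.0   (unchanged)
```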
  • Hyperparameters
    • What is the best value of k to use for k-Nearest Neighbors?
    • What is the best distance to use?
    • These are hyperparameters : choices about the algorithm that we set rather than learn
    • Setting Hyperparameters
      • Idea #1 : choose hyperparameters that work best on the training data. This is a terrible idea, don't do this. (k = 1 always works perfectly on the training data.)
      • Idea #2 : split the data into train and test, and choose hyperparameters that work best on the test data. This seems like maybe a more reasonable strategy but, in fact, it is also a terrible idea and you should never do this. (It gives no idea how the algorithm will perform on new data.)
      • Idea #3 : split the data into train, validation, and test; choose hyperparameters on the validation set. You take the best performing classifier on the validation set and run it "once" on the test set.
      • Idea #4 : Cross-Validation. Split the training data into folds, try each fold as the validation set, and average the results. In the example, 5-Fold Cross Validation (a sketch follows this list).
        • This is used a little bit more commonly for small datasets, not so much in deep learning.
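
A sketch of 5-fold cross-validation for choosing k, assuming Xtr / ytr hold the training images and labels, with labels as small non-negative integers; knn_predict is my own hypothetical k-nearest-neighbor version of the classifier above:

```python
import numpy as np

def knn_predict(X_train, y_train, X, k):
    """Predict labels for rows of X by majority vote among the k nearest (L1) neighbors."""
    y_pred = np.zeros(X.shape[0], dtype=y_train.dtype)
    for i in range(X.shape[0]):
        distances = np.sum(np.abs(X_train - X[i]), axis=1)
        nearest = np.argsort(distances)[:k]                 # indices of the k closest images
        y_pred[i] = np.bincount(y_train[nearest]).argmax()  # majority label among them
    return y_pred

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_folds = np.array_split(Xtr, num_folds)   # Xtr, ytr: the training set (assumed given)
y_folds = np.array_split(ytr, num_folds)

mean_acc = {}
for k in k_choices:
    accs = []
    for i in range(num_folds):
        # fold i is the validation fold; the remaining folds are the training data
        X_train = np.concatenate(X_folds[:i] + X_folds[i + 1:])
        y_train = np.concatenate(y_folds[:i] + y_folds[i + 1:])
        y_pred = knn_predict(X_train, y_train, X_folds[i], k)
        accs.append(np.mean(y_pred == y_folds[i]))
    mean_acc[k] = np.mean(accs)

best_k = max(k_choices, key=mean_acc.get)
# then evaluate on the test set only once, with best_k
```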
  • k-Nearest Neighbor on raw pixels is never used.
    • Very slow at test time
    • Distance metrics on pixels are not informative
    • Curse of dimensionality
      • Densely covering the space requires a number of training examples that is exponential in the dimension of the problem (see the back-of-the-envelope example below).
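
A quick back-of-the-envelope illustration of that exponential growth (my own numbers):

```python
# To cover the unit interval [0, 1] with points spaced 0.1 apart you need 10 points;
# to cover the unit cube [0, 1]^d at the same density you need 10**d points.
for d in (1, 2, 3, 10):
    print(d, 10 ** d)            # 10, 100, 1000, 10000000000
# A CIFAR-10 image lives in d = 32 * 32 * 3 = 3072 dimensions: 10**3072 examples.
```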



Linear Classification 
  • Parametric Approach : Linear Classifier

    • Input data = the cat image on the left = "x"
    • A set of parameters or weights = "W" (also written theta)
    • Some function takes in both the data x and the parameters W, and spits out 10 numbers (class scores).
    • The simplest possible way of combining these two things is just to multiply them. => Linear Classification : $f(x, W) = Wx$
    • Bias term
      • does not interact with the training data directly.
      • instead gives a data-independent preference for some classes over others.
      • For example, if your dataset is unbalanced and has many more cats than dogs, then the bias element corresponding to cat will be higher than the other ones.
  • Example with an image with 4 pixels and 3 classes (cat / dog / ship); a numpy version follows this list

    • A 2x2 input image, 4 pixels total, is stretched out into a column vector.
    • The weight matrix is 4x3 (4 pixels, 3 classes); in the figure from the slides, the weight matrix is shown transposed, as 3x4.
    • A 3-element bias vector gives a data-independent offset to each of the classes.
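
A numpy version of this example; the particular values of x, W, and b follow the slide figure as I remember it, so treat them as illustrative:

```python
import numpy as np

# f(x, W) = Wx + b for the 4-pixel, 3-class example
x = np.array([56.0, 231.0, 24.0, 2.0])        # 2x2 image stretched into 4 pixels
W = np.array([[0.2, -0.5,  0.1,  2.0],        # 3x4: the "transposed" layout from the figure,
              [1.5,  1.3,  2.1,  0.0],        # one row of weights per class
              [0.0,  0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])                # data-independent offset per class

scores = W @ x + b
print(scores)                                 # [cat, dog, ship] = [-96.8, 437.9, 60.75]
```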
  • Interpreting a Linear Classifier

    • One way to interpret a linear classifier is to view each image as a point in a high-dimensional space; each class score is then a linear function over that space, and the classifier carves the space with linear decision boundaries.
  • Hard cases for a linear classifier

    • A linear classifier struggles with multimodal situations (the rightmost case on the slide).
    • Any time you have multimodal data, e.g. one class that can appear in different, separate regions of the space, a linear classifier will struggle (see the small check below).
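
A tiny check of the multimodal case (my own example, not from the lecture): XOR-style data, where one class occupies two opposite quadrants, cannot be split by any single line.

```python
import numpy as np

# Class A occupies two opposite quadrants (two separate "modes"); class B the other two.
A = np.array([[ 1.0,  1.0], [-1.0, -1.0]])
B = np.array([[ 1.0, -1.0], [-1.0,  1.0]])

# Brute-force search for a separating line w.x + b = 0: none exists for XOR-style data.
rng = np.random.default_rng(0)
found = False
for _ in range(100_000):
    w, b = rng.normal(size=2), rng.normal()
    if (A @ w + b > 0).all() and (B @ w + b < 0).all():
        found = True
        break
print("linearly separable:", found)   # prints False: no single line works
```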



Coming up!

  • Loss function (quantifying what it means to have a "good" W)
  • Optimization (start with random W and find a W that minimizes the loss)
  • ConvNets (tweak the functional form of f)