
CS231N - Lecture 4: Backpropagation and Neural Networks




Backpropagation 

  • Computational graphs

    • Once we can express a function as a computational graph, we can use a technique called backpropagation, which recursively applies the chain rule to compute the gradient with respect to every variable in the graph.
  • A simple example
    • The example is $f(x, y, z) = (x + y)z$, with the intermediate variable $q = x + y$. We want to find the gradients of $f$ with respect to x, y, and z.
    • We start at the very end of the computational graph and work our way backwards, computing all the gradients along the way with the chain rule.

    • If we change x by a little bit, q changes by approximately the same amount, so the influence of x on q is 1, because $\frac{\partial q}{\partial x} = 1$.

    • In backprop, we have all of these nodes in our computational graph, but each node is only aware of its immediate surroundings: its local inputs and its output.
    • The inputs are the values flowing into the node, and the output is the value the node directly produces. Here our local inputs are x and y, and the output is z.
    • At this node, we also know the local gradients of the output with respect to each input; a minimal code sketch of this flow is given below.
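To make the simple example concrete, here is a minimal sketch of the forward and backward pass for $f(x, y, z) = (x + y)z$, assuming the input values used in the lecture (x = -2, y = 5, z = -4); the variable names are mine.

```python
# Minimal forward/backward pass for f(x, y, z) = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0        # values from the lecture's example

# Forward pass
q = x + y                        # q = 3
f = q * z                        # f = -12

# Backward pass: walk the graph in reverse, applying the chain rule
df_dq = z                        # local gradient of * with respect to q
df_dz = q                        # local gradient of * with respect to z
df_dx = df_dq * 1.0              # dq/dx = 1, chained with the upstream df/dq
df_dy = df_dq * 1.0              # dq/dy = 1

print(df_dx, df_dy, df_dz)       # -4.0 -4.0 3.0
```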
  • Another example

    • The example here is $f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. Given values for the Ws and Xs, we can make a forward pass and compute the value at every stage of the computation.

    • So again, we start at the very end of the graph, where the gradient of the output with respect to itself is 1.

$$1) \,\,\,\,\, (\frac{-1}{1.37^2})(1.00) = -0.53 \,\,\,\,\,\,\,\,\,\, s.t. \,\,\, f(x) = \frac{1}{x} \,\,\, \rightarrow \,\,\, \frac{df}{dx} = \frac{-1}{x^2}$$

$$2) \,\,\,\,\, (1)(-0.53) = -0.53 \,\,\,\,\,\,\,\,\,\, s.t. \,\,\, f_c (x) = c + x \,\,\, \rightarrow \,\,\, \frac{df}{dx} = 1$$

$$3) \,\,\,\,\, (e^{-1})(-0.53) = -0.20 \,\,\,\,\,\,\,\,\,\, s.t. \,\,\, f(x) = e^x \,\,\, \rightarrow \,\,\, \frac{df}{dx} = e^x$$

$$4) \,\,\,\,\, (-1)(-0.20) = 0.20 \,\,\,\,\,\,\,\,\,\, s.t. \,\,\, f_a (x) = ax \,\,\, \rightarrow \,\,\, \frac{df}{dx} = a$$

$$5) \,\,\,\,\, [\text{local gradient}] \times [\text{upstream gradient}] = (1)(0.20) = 0.20 \,\,\, \text{(both inputs)} \,\,\,\,\,\,\,\,\,\, s.t. \,\,\, f(x, y) = x + y \,\,\, \rightarrow \,\,\, \frac{\partial f}{\partial x} = \frac{\partial f}{\partial y} = 1$$

$$6) \,\,\,\,\, [\text{local gradient}] \times [\text{upstream gradient}] \,\, \Rightarrow \,\, \frac{\partial f}{\partial x_0} = (w_0)(0.20) = 0.40, \,\,\, \frac{\partial f}{\partial w_0} = (x_0)(0.20) = -0.20 \,\,\,\,\,\,\,\,\,\, s.t. \,\,\, f(x, y) = xy \,\,\, \rightarrow \,\,\, \frac{\partial f}{\partial x} = y$$



    • We could take all the computations in our graph that make up this sigmoid and replace them with one big sigmoid node, because we know the local gradient of this gate in closed form: $\frac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$. A sketch of this shortcut in code follows below.
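A rough NumPy sketch of the collapsed sigmoid gate on this example, assuming the lecture's values w = [2, -3, -3] and x = [-1, -2] (the variable names are mine); the results match steps 1) - 6) above up to rounding.

```python
import numpy as np

# f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))); w2 acts as a bias term
w = np.array([2.0, -3.0, -3.0])
x = np.array([-1.0, -2.0])

# Forward pass, treating everything after the weighted sum as one sigmoid gate
s = w[0] * x[0] + w[1] * x[1] + w[2]        # s = 1.0
f = 1.0 / (1.0 + np.exp(-s))                # sigma(s) ~ 0.73

# Backward pass: the sigmoid gate's local gradient is (1 - f) * f ~ 0.20
ds = (1.0 - f) * f                          # upstream gradient into the weighted sum
dw = np.array([x[0] * ds, x[1] * ds, ds])   # ~ [-0.20, -0.39, 0.20]
dx = np.array([w[0] * ds, w[1] * ds])       # ~ [ 0.39, -0.59]
```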

  • Patterns in backward flow

  • Gradients for vectorized code
    • We've done all these examples in the scalar case; now we're going to look at what happens when we have vectors.
    • So now our variables x, y, and z are vectors instead of single numbers.

    • Now our gradients are going to be Jacobian matrices.
    • These are matrices containing the derivative of each element of the output with respect to each element of the input; a small sketch of why we rarely form them explicitly follows below.
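For intuition, take the lecture's example of an elementwise max(0, x) on a 4096-dimensional input: the Jacobian is 4096 x 4096 but diagonal, so in practice the backward pass is just an elementwise multiply rather than an explicit Jacobian product. A minimal sketch (the variable names are mine):

```python
import numpy as np

x = np.random.randn(4096)          # 4096-d input, as in the lecture's example
out = np.maximum(0.0, x)           # elementwise max(0, x)

upstream = np.random.randn(4096)   # gradient of the loss with respect to out

# The Jacobian d(out)/d(x) is 4096 x 4096 but diagonal: each output element
# depends only on its matching input element. So instead of building it, we
# just zero the upstream gradient wherever the corresponding input was <= 0.
dx = upstream * (x > 0)
```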
  • A vectorized example
    • Now we're going to go through a more concrete vectorized example: $f(x, W) = \lVert W \cdot x \rVert^2 = \sum_{i=1}^{n} (W \cdot x)_i^2$.

    • We want the gradient of f with respect to the intermediate variable $q = W \cdot x$, i.e. before the L2 norm: $\frac{\partial f}{\partial q_i} = 2 q_i$.
    • In vector form, this is just $2q$.

    • What's the gradient with respect to W?
    • We chain $\frac{\partial f}{\partial q_k}$ with $\frac{\partial q_k}{\partial W_{i, j}}$: we find the effect of each element of W on each element of q, and sum this over all elements of q.
    • The $\mathbf{1}_{k=i}$ term is an indicator function, equal to 1 if $k = i$ and 0 otherwise, so $\frac{\partial q_k}{\partial W_{i,j}} = \mathbf{1}_{k=i}\, x_j$ and the sum collapses to $\frac{\partial f}{\partial W_{i,j}} = 2 q_i x_j$.
    • Remember, the gradient with respect to a variable should always have the same shape as that variable, so check that your computed gradient matches the variable's shape.

    • Here, if we compute the partial derivatives, we can see that $\frac{\partial q_k}{\partial x_i} = W_{k, i}$.
    • In the same way as for W, we can use the chain rule to get the full expression: $\frac{\partial f}{\partial x_i} = \sum_k 2 q_k W_{k, i}$, i.e. $\nabla_x f = 2 W^{T} q$. A NumPy sketch of this whole example follows below.
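A minimal NumPy sketch of this vectorized example, writing the gradients in the vector forms derived above ($\nabla_q f = 2q$, $\nabla_W f = 2 q x^T$, $\nabla_x f = 2 W^T q$); the sizes are arbitrary.

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(2, 2)    # small example; any sizes work as long as shapes are consistent
x = np.random.randn(2)

# Forward pass: f(x, W) = ||W x||^2
q = W.dot(x)
f = np.sum(q ** 2)

# Backward pass, using the vector forms derived above
dq = 2.0 * q                 # df/dq_i = 2 q_i
dW = np.outer(dq, x)         # df/dW_{i,j} = 2 q_i x_j  -> same shape as W
dx = W.T.dot(dq)             # df/dx_i = sum_k 2 q_k W_{k,i} -> same shape as x
```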
  • Modularized implementation: forward / backward API
    • If we have our entire graph, we can make a forward pass through it by iterating over all the nodes in the graph, i.e. all the gates.
    • Here I use the words gate and node interchangeably; we iterate through all of these gates and call forward on each one.
    • Then, going backwards, we go through all of the gates in reverse (topologically sorted) order and call backward on each of them.


    • Looking at the forward pass, one important point is that we should cache the values computed in the forward pass, because the backward pass ends up reusing them much of the time. A small sketch of such a gate API follows below.
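A sketch of this modularized forward/backward API in Python, roughly in the spirit of the lecture's pseudocode; the class names and the tiny hand-built graph are my own illustration, not a real library.

```python
class AddGate:
    """z = x + y: the add gate just distributes the upstream gradient."""
    def forward(self, x, y):
        return x + y

    def backward(self, dz):
        return dz, dz                     # local gradients are both 1


class MultiplyGate:
    """z = x * y."""
    def forward(self, x, y):
        self.x, self.y = x, y             # cache forward-pass values for backward
        return x * y

    def backward(self, dz):
        return self.y * dz, self.x * dz   # [local gradient] x [upstream gradient]


# Forward pass over the graph f = (x + y) * z, gate by gate...
add, mul = AddGate(), MultiplyGate()
x, y, z = -2.0, 5.0, -4.0
q = add.forward(x, y)
f = mul.forward(q, z)

# ...then backward over the same gates in reverse order.
dq, dz = mul.backward(1.0)                # start from df/df = 1
dx, dy = add.backward(dq)
print(dx, dy, dz)                         # -4.0 -4.0 3.0
```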
  • Summary so far...




Neural Networks 

  • Neural networks: without the brain stuff

    • The top one is a linear transformation (the linear score function $f = Wx$), and the one below it is a two-layer neural network, $f = W_2 \max(0, W_1 x)$.
    • So, more broadly speaking, neural networks are a class of functions built by stacking simple functions hierarchically to form more complex nonlinear functions.

    • We can stack more layers like this to get deeper networks of arbitrary depth; a sketch of the two-layer case follows below.
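A concrete sketch of the two-layer form $f = W_2 \max(0, W_1 x)$ in NumPy; the layer sizes follow the lecture's CIFAR-10 flavored example (3072-d input, 100 hidden units, 10 class scores), and the random initialization is just for illustration.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3072)                # e.g. a flattened 32x32x3 image
W1 = 0.01 * np.random.randn(100, 3072)   # first layer: 3072 -> 100
W2 = 0.01 * np.random.randn(10, 100)     # second layer: 100 -> 10 class scores

h = np.maximum(0.0, W1.dot(x))           # hidden layer with the max(0, .) nonlinearity
scores = W2.dot(h)                       # f = W2 max(0, W1 x)
```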
  • Biological inspirations for Neural Networks

    • Neuron
      • We have impulses that are carried towards each neuron. Many neurons are connected together, and each neuron has dendrites.
      • Dendrites are what receive the impulses coming into the neuron.
      • Then we have a cell body.
      • The cell body integrates all of these signals and passes the result on to the downstream neurons it is connected to, carrying it along the axon.
  • Be very careful with your brain analogies!

  • Activation functions

  • Neural networks: Architectures

  • Example feed-forward computation of a neural network
    • Neuron

    • Forward-pass of a 3-layer neural network
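Reconstructed sketches of the two bullets above, roughly in the style of the lecture's code: a single neuron's firing-rate computation and the forward pass of a 3-layer network. The weight shapes and the class scaffolding are my assumptions; the lecture leaves them unspecified.

```python
import numpy as np

np.random.seed(0)

# --- A single neuron: weighted sum of inputs plus bias, squashed by a sigmoid ---
class Neuron:
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def neuron_tick(self, inputs):
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))   # sigmoid activation
        return firing_rate

# --- Forward pass of a 3-layer (3 -> 4 -> 4 -> 1) neural network ---
f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

x = np.random.randn(3, 1)          # random input vector of three numbers (3x1)
h1 = f(np.dot(W1, x) + b1)         # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)        # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3          # output neuron (1x1)
```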




Summary