Understanding Back Propagation

Backpropagation, short for “backward propagation of errors,” is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network’s weights. It is a generalization of the delta rule for perceptrons to multilayer feedforward neural networks.

The “backwards” part of the name stems from the fact that calculation of the gradient proceeds backwards through the network, with the gradient of the final layer of weights being calculated first and the gradient of the first layer of weights being calculated last. Partial computations of the gradient from one layer are reused in the computation of the gradient for the previous layer. This backwards flow of the error information allows for efficient computation of the gradient at each layer versus the naive approach of calculating the gradient of each layer separately.

Backpropagation’s popularity has experienced a recent resurgence given the widespread adoption of deep neural networks for image recognition and speech recognition. It is considered an efficient algorithm, and modern implementations take advantage of specialized GPUs to further improve performance.

Approach

  • Build a small neural network as defined in the architecture below.
  • Initialize the weights and bias randomly.
  • Fix the input and output.
  • Forward-pass the inputs and calculate the cost.
  • Compute the gradients and errors.
  • Backprop the error and adjust the weights and biases accordingly.

Architecture:

  • Build a feedforward neural network with 2 hidden layers. All the layers will have 3 neurons each.
  • The 1st and 2nd hidden layers will have ReLU and sigmoid respectively as activation functions. The final layer will have softmax.
  • Error is calculated using cross-entropy.

Initializing Network
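As a concrete sketch of the random initialization (the article's actual numbers are not reproduced here, so these values are illustrative assumptions):

```python
import random

random.seed(42)  # for reproducibility; the article's actual random values are not shown

def init_layer(n_in, n_out):
    """Return (weights, biases) for a layer, drawn uniformly from [-1, 1]."""
    W = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [random.uniform(-1, 1) for _ in range(n_out)]
    return W, b

# Three layers of 3 neurons each, matching the architecture above.
W1, b1 = init_layer(3, 3)   # input -> hidden layer 1 (ReLU)
W2, b2 = init_layer(3, 3)   # hidden layer 1 -> hidden layer 2 (sigmoid)
W3, b3 = init_layer(3, 3)   # hidden layer 2 -> output (softmax)
```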


layer-1 Matrix Operation:

layer-1 ReLU Operation:

layer-1 Example:


layer-2 Matrix Operation:

layer-2 Sigmoid Operation:

layer-2 Example:


layer-3 Matrix Operation:

layer-3 Softmax Operation:

layer-3 Example:
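Putting the three matrix operations together, a minimal forward pass looks like the sketch below (pure Python; the weights here are whatever was initialized above, since the article's numeric example is not reproduced):

```python
import math

def matvec(W, x, b):
    """z = W.x + b, the matrix operation for one layer."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def relu(z):
    return [max(0.0, z_i) for z_i in z]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-z_i)) for z_i in z]

def softmax(z):
    m = max(z)                       # subtract the max for numerical stability
    exps = [math.exp(z_i - m) for z_i in z]
    s = sum(exps)
    return [e / s for e in exps]

def forward(x, W1, b1, W2, b2, W3, b3):
    z1 = matvec(W1, x, b1); a1 = relu(z1)       # hidden layer 1
    z2 = matvec(W2, a1, b2); a2 = sigmoid(z2)   # hidden layer 2
    z3 = matvec(W3, a2, b3); p = softmax(z3)    # output layer
    return z1, a1, z2, a2, z3, p
```

The softmax output always sums to 1, which is what lets us read it as class probabilities.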


Analysis:

The actual output should be but we got .
To calculate the error, let's use cross-entropy.
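With a one-hot target y and predicted probabilities p, cross-entropy is E = -Σ y_i log(p_i). A quick sketch (the values below are illustrative, not the article's):

```python
import math

def cross_entropy(y, p):
    """E = -sum(y_i * log(p_i)); with a one-hot y this picks out
    -log(probability assigned to the true class)."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p) if y_i > 0)

# Illustrative values, not the article's numbers:
y = [1.0, 0.0, 0.0]
p = [0.5, 0.25, 0.25]
loss = cross_entropy(y, p)   # -log(0.5) ~ 0.693
```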

Important Derivatives

Sigmoid

ReLU

Softmax operation
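These three derivatives are standard results: sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)), ReLU'(x) = 1 for x > 0 and 0 otherwise, and for softmax the Jacobian is ds_i/dz_j = s_i(delta_ij - s_j). A small sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

def relu_prime(x):
    return 1.0 if x > 0 else 0.0    # 1 for positive inputs, 0 otherwise

def softmax_jacobian(s):
    """Given softmax outputs s, return J with J[i][j] = s_i * (delta_ij - s_j)."""
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]
```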

BackPropagating the error - (Hidden Layer2 - Output Layer) Weights

Let's calculate a few derivatives upfront so they come in handy and we can reuse them whenever necessary.

Here we are using only one example (batch_size=1); if there are more examples, just average over them.

By symmetry, we can calculate the other derivatives as well.

In our example, the values will be


Next, let us calculate the derivative of each output with respect to its input.

By symmetry, we can calculate the other derivatives as well.

In our example, the values will be

For each input to a neuron, let's calculate the derivative with respect to each weight.

Now let us look at the final derivative

Now let us look at the final derivative

Using symmetry, we can write:

Now we will calculate the change in

This will be simply

Using chain rule:

By symmetry

All the above values have already been calculated; we just need to substitute them.

Considering a learning rate (lr) of 0.01, we get our final weight matrix as
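For a softmax output trained with cross-entropy, the chain of derivatives above collapses to the well-known result dE/dz_k = p_k - y_k, so the weight gradient is an outer product with the hidden-layer-2 activations. A sketch of the whole update for this layer (the values are illustrative, not the article's):

```python
lr = 0.01

def outer(delta, a):
    """Gradient matrix: dE/dW[k][j] = delta[k] * a[j]."""
    return [[d * a_j for a_j in a] for d in delta]

# Illustrative values, not the article's numbers:
p  = [0.7, 0.2, 0.1]        # softmax output
y  = [1.0, 0.0, 0.0]        # one-hot target
a2 = [0.6, 0.4, 0.9]        # hidden layer 2 activations

delta3 = [p_k - y_k for p_k, y_k in zip(p, y)]    # dE/dz3 = p - y
dW3 = outer(delta3, a2)                            # dE/dW3

# Gradient-descent update: W3 -= lr * dW3
W3 = [[0.1] * 3 for _ in range(3)]                 # placeholder weights
W3 = [[w - lr * g for w, g in zip(row_w, row_g)]
      for row_w, row_g in zip(W3, dW3)]
```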

Finally, we made it. Let's jump to the next layer.

BackPropagating the error - (Hidden Layer1 - Hidden Layer2) Weights

Let's calculate a few handy derivatives before we actually calculate the error derivatives w.r.t. the weights in this layer.

In our example, this will be

For each input to a neuron, let's calculate the derivative with respect to each weight.

Now let us look at the final derivative

Now let us look at the final derivative

Using symmetry, we can write:

Now we will calculate the change in

and generalize it for all other variables.

Caution: Make sure that you have understood everything we discussed till here.

This will be simply

Using chain rule:

Now we will see each and every equation individually.

Let's look at the matrix

By symmetry

We have already calculated the 2nd and 3rd term in each matrix; we need to check on the 1st term. Looking at the matrix, the first term is common to all the columns, so there are only three distinct values. Let's look into one value.

Let's see what each individual term boils down to.

By symmetry

Again, the first two values were already calculated when dealing with the derivatives of W_{kl}. We just need to calculate the third one, which is the derivative of the input to each output-layer neuron w.r.t. the output of hidden layer 2. It is nothing but the corresponding weight connecting the two layers.

All the values were calculated before; we just need to substitute the corresponding values for our example.

Let's look at the matrix

By symmetry

Considering a learning rate (lr) of 0.01, we get our final weight matrix as
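In code, the chain rule for this layer amounts to propagating delta3 back through W3 and multiplying by the sigmoid derivative, which in terms of the activation a2 is a2(1 - a2); the weight gradient is then an outer product with the hidden-layer-1 activations. A sketch with illustrative values (not the article's):

```python
def matTvec(W, v):
    """W-transpose times v, i.e. backpropagating v through W."""
    n = len(W[0])
    return [sum(W[k][j] * v[k] for k in range(len(W))) for j in range(n)]

# Illustrative values, not the article's numbers:
W3     = [[0.2, 0.4, 0.1], [0.5, 0.3, 0.7], [0.6, 0.9, 0.8]]
delta3 = [-0.3, 0.2, 0.1]   # dE/dz3 from the output layer
a2     = [0.6, 0.4, 0.9]    # sigmoid activations of hidden layer 2
a1     = [0.5, 0.0, 1.2]    # ReLU activations of hidden layer 1

# delta2[j] = (sum_k W3[k][j] * delta3[k]) * a2[j] * (1 - a2[j])
back   = matTvec(W3, delta3)
delta2 = [b * a * (1.0 - a) for b, a in zip(back, a2)]

# dE/dW2[j][i] = delta2[j] * a1[i], followed by the usual W2 -= lr * dW2
dW2 = [[d * a for a in a1] for d in delta2]
```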

Finally, we made it. Let's jump to the next layer.

BackPropagating the error - (Input Layer - Hidden Layer1) Weights

Let's calculate a few handy derivatives before we actually calculate the error derivatives w.r.t. the weights in this layer.

We already know

ReLU

Since the inputs are positive, the ReLU derivative is 1.

For each input to a neuron, let's calculate the derivative with respect to each weight.

Now let us look at the final derivative

Now let us look at the final derivative

Using symmetry, we can write:

Now we will calculate the change in

and generalize it for all other variables.

Caution: Make sure that you have understood everything we discussed till here.

This will be simply

Using chain rule:

Now we will see each and every equation individually.

Let's look at the matrix

By symmetry

We know the 2nd and 3rd derivatives in each cell of the above matrix. Let's look at how to get the derivative of the 1st term in each cell.

We have calculated all the values previously except the last one in each cell, which is a simple derivative of linear terms.

Let's look at the matrix

By symmetry

Considering a learning rate (lr) of 0.01, we get our final weight matrix as
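The same pattern applies one layer further back: propagate delta2 through W2, multiply by the ReLU derivative (1 where z1 > 0, else 0), then take the outer product with the input x. A sketch with illustrative values (not the article's):

```python
# Illustrative values, not the article's numbers:
W2     = [[0.3, 0.8, 0.2], [0.1, 0.4, 0.6], [0.9, 0.5, 0.7]]
delta2 = [0.024, -0.05, 0.03]   # dE/dz2 from hidden layer 2
z1     = [0.5, -0.2, 1.2]       # pre-activations of hidden layer 1
x      = [1.0, 2.0, 3.0]        # network input

# delta1[j] = (sum_k W2[k][j] * delta2[k]) * relu'(z1[j])
back   = [sum(W2[k][j] * delta2[k] for k in range(3)) for j in range(3)]
delta1 = [b * (1.0 if z > 0 else 0.0) for b, z in zip(back, z1)]

# dE/dW1[j][i] = delta1[j] * x[i], followed by the usual W1 -= lr * dW1
dW1 = [[d * x_i for x_i in x] for d in delta1]
```

Note how a negative pre-activation in z1 zeroes out its entire row of gradients; this is also a first hint of how gradients can vanish as they flow backwards.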

The End

Our initial weights

Our final weights

Important Notes:

  • I have completely eliminated bias when differentiating. Do you know why?
  • Backprop of the bias should be straightforward. Try it on your own.
  • I have taken only one example. What will happen if we take a batch of examples?
  • Though I have not mentioned vanishing gradients directly, do you see why they occur?
  • What would happen if all the weights were the same number instead of random?