Backpropagation, short for “backward propagation of errors,” is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network’s weights. It is a generalization of the delta rule for perceptrons to multilayer feedforward neural networks.
The “backwards” part of the name stems from the fact that calculation of the gradient proceeds backwards through the network, with the gradient of the final layer of weights being calculated first and the gradient of the first layer of weights being calculated last. Partial computations of the gradient from one layer are reused in the computation of the gradient for the previous layer. This backwards flow of the error information allows for efficient computation of the gradient at each layer versus the naive approach of calculating the gradient of each layer separately.
Backpropagation’s popularity has experienced a recent resurgence given the widespread adoption of deep neural networks for image recognition and speech recognition. It is considered an efficient algorithm, and modern implementations take advantage of specialized GPUs to further improve performance.
Approach
- Build a small neural network as defined in the architecture below.
- Initialize the weights and bias randomly.
- Fix the input and output.
- Forward pass the inputs and calculate the cost.
- Compute the gradients and errors.
- Backprop the error and adjust the weights and bias accordingly.
Architecture:
- Build a feedforward neural network with 2 hidden layers. All the layers will have 3 neurons each.
- The 1st and 2nd hidden layers will have ReLU and sigmoid as activation functions respectively. The final layer will have softmax.
- The error is calculated using cross-entropy.
Initializing Network
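As a concrete illustration, here is one way such a random initialization could be set up in NumPy (a sketch only; the names W1, b1, etc. are mine, and I assume a 3-dimensional input so that every weight matrix is 3x3, matching the 3 neurons per layer above):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, purely so the sketch is reproducible

# One weight matrix and one bias vector per layer; with an assumed 3-dimensional
# input and 3 neurons in every layer, each weight matrix is 3x3.
W1, b1 = rng.standard_normal((3, 3)), rng.standard_normal((1, 3))  # input    -> hidden 1
W2, b2 = rng.standard_normal((3, 3)), rng.standard_normal((1, 3))  # hidden 1 -> hidden 2
W3, b3 = rng.standard_normal((3, 3)), rng.standard_normal((1, 3))  # hidden 2 -> output
```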
Layer-1 Matrix Operation:
Layer-1 ReLU Operation:
Layer-1 Example:
Layer-2 Matrix Operation:
Layer-2 Sigmoid Operation:
Layer-2 Example:
Layer-3 Matrix Operation:
Layer-3 Softmax Operation:
Layer-3 Example:
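Putting the three matrix operations and their activations together, the forward pass can be sketched in NumPy as follows (hypothetical input and freshly drawn random weights, not the article's actual numbers):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
x = np.array([[0.1, 0.2, 0.7]])                  # hypothetical 1x3 input
W1, W2, W3 = (rng.standard_normal((3, 3)) for _ in range(3))
b1, b2, b3 = (rng.standard_normal((1, 3)) for _ in range(3))

z1 = x @ W1 + b1;  a1 = relu(z1)                 # layer 1: matrix operation + ReLU
z2 = a1 @ W2 + b2; a2 = sigmoid(z2)              # layer 2: matrix operation + sigmoid
z3 = a2 @ W3 + b3; y_hat = softmax(z3)           # layer 3: matrix operation + softmax
print(y_hat)                                     # each row sums to 1
```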
Analysis:
The Actual Output should be but we got .
To calculate the error, let's use cross-entropy.
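For reference, with a one-hot target vector $y$ and the softmax output $\hat{y}$, the cross-entropy error for a single example is

$$
E = -\sum_{i} y_i \log(\hat{y}_i),
$$

which, for a one-hot $y$, reduces to $-\log$ of the probability assigned to the correct class.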
Important Derivatives
Sigmoid
ReLU
Softmax operation
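For reference, the standard forms of these derivatives (with $s$ denoting the softmax output vector and $\delta_{ij}$ the Kronecker delta) are

$$
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr), \qquad
\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}, \qquad
\frac{\partial s_i}{\partial x_j} = s_i\,(\delta_{ij} - s_j).
$$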
Backpropagating the error - (Hidden Layer 2 - Output Layer) Weights
Let's calculate a few derivatives upfront so they are handy and we can reuse them whenever necessary.
Here we are using only one example (batch_size=1); if there are more examples, just average everything.
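As a small illustration of that batched case (hypothetical shapes and random numbers; dz3 stands for the per-example error signal at the output layer):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 4
a2  = rng.random((batch_size, 3))   # hidden-layer-2 activations for a whole batch
dz3 = rng.random((batch_size, 3))   # per-example error signals at the output layer

# The matrix product sums the per-example outer products; dividing by the
# batch size turns that sum into the average gradient.
dW3 = (a2.T @ dz3) / batch_size
```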
By symmetry, we can calculate the other derivatives as well.
In our example, the values will be
Next, let us calculate the derivative of each output with respect to its input.
By symmetry, we can calculate the other derivatives as well.
In our example, the values will be
For each input to a neuron, let's calculate the derivative with respect to each weight.
Now let us look at the final derivative
Now let us look at the final derivative
Similarly, we can write:
Now we will calculate the change in
This will be simply
Using the chain rule:
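Spelled out with explicit indices (my own convention here: $W_{kl}$ connects the $k$-th hidden layer-2 neuron to the $l$-th output neuron, $z_l$ is the input to that output neuron, and $h_k$ is the $k$-th hidden layer-2 output), this chain rule has the shape

$$
\frac{\partial E}{\partial W_{kl}}
= \sum_{m} \frac{\partial E}{\partial \hat{y}_m}\,
           \frac{\partial \hat{y}_m}{\partial z_l}\,
           \frac{\partial z_l}{\partial W_{kl}}
= \left( \sum_{m} \frac{\partial E}{\partial \hat{y}_m}\,
                  \frac{\partial \hat{y}_m}{\partial z_l} \right) h_k,
$$

where the sum over $m$ appears because the softmax couples every output to every $z_l$.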
By symmetry
All of these values have been calculated above; we just need to substitute them.
Considering a learning rate (lr) of 0.01, we get our final weight matrix as
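The same update can be written compactly in NumPy (a sketch with hypothetical numbers and my own variable names; it uses the well-known fact that cross-entropy composed with softmax gives the error signal $\hat{y} - y$ at the output pre-activations):

```python
import numpy as np

# Hypothetical stand-ins for the article's worked numbers.
y     = np.array([[1.0, 0.0, 0.0]])        # one-hot target
y_hat = np.array([[0.5, 0.3, 0.2]])        # softmax output from the forward pass
a2    = np.array([[0.6, 0.7, 0.8]])        # sigmoid activations of hidden layer 2
W3    = np.random.randn(3, 3)              # hidden-2 -> output weights
lr    = 0.01

dz3 = y_hat - y                            # error signal at the output pre-activations
dW3 = a2.T @ dz3                           # one gradient entry per weight in W3
W3 -= lr * dW3                             # gradient-descent update
```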
Finally, we made it. Let's jump to the next layer.
Backpropagating the error - (Hidden Layer 1 - Hidden Layer 2) Weights
Let's calculate a few handy derivatives before we actually calculate the error derivatives w.r.t. the weights in this layer.
In our example, this will be
For each input to a neuron, let's calculate the derivative with respect to each weight.
Now let us look at the final derivative
Now let us look at the final derivative
Similarly, we can write:
Now we will calculate the change in
and generalize it for all other variables.
Caution: make sure you have understood everything we have discussed so far.
This will be simply
Using the chain rule:
Now we will look at each equation individually.
Let's look at the matrix
By symmetry
We have already calculated the 2nd and 3rd terms in each matrix; we only need to work out the 1st term. Looking at the matrix, the first term is common across all the columns, so there are only three distinct values. Let's look into one value.
Let's see what each individual term boils down to.
By symmetry
Again, we already calculated the first two values when dealing with the derivatives of W_{kl}. We just need to calculate the third one, which is the derivative of the input to each output-layer neuron w.r.t. the output of hidden layer 2. It is nothing but the corresponding weight that connects the two layers.
All the values have been calculated before; we just need to plug in the corresponding values for our example.
Let's look at the matrix
By symmetry
Considering a learning rate (lr) of 0.01, we get our final weight matrix as
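As before, here is a compact NumPy sketch of this layer's update (hypothetical numbers, my own variable names; the sigmoid derivative is expressed through its own output, a2 * (1 - a2)):

```python
import numpy as np

# Hypothetical stand-ins for the article's worked numbers.
dz3 = np.array([[-0.5, 0.3, 0.2]])         # error signal at the output layer (y_hat - y)
W3  = np.random.randn(3, 3)                # hidden-2 -> output weights
a2  = np.array([[0.6, 0.7, 0.8]])          # sigmoid activations of hidden layer 2
a1  = np.array([[0.3, 0.6, 0.9]])          # ReLU activations of hidden layer 1
W2  = np.random.randn(3, 3)                # hidden-1 -> hidden-2 weights
lr  = 0.01

da2 = dz3 @ W3.T                           # push the error back through the weights
dz2 = da2 * a2 * (1 - a2)                  # chain through the sigmoid derivative
dW2 = a1.T @ dz2                           # gradient w.r.t. hidden-1 -> hidden-2 weights
W2 -= lr * dW2                             # gradient-descent update
```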
Finally, we made it. Let's jump to the next layer.
Backpropagating the error - (Input Layer - Hidden Layer 1) Weights
Let's calculate a few handy derivatives before we actually calculate the error derivatives w.r.t. the weights in this layer.
We already know
ReLU
Since the inputs are positive
For each input to a neuron, let's calculate the derivative with respect to each weight.
Now let us look at the final derivative
Now let us look at the final derivative
Similarly, we can write:
Now we will calculate the change in
and generalize it for all other variables.
Caution: make sure you have understood everything we have discussed so far.
This will be simply
Using the chain rule:
Now we will look at each equation individually.
Let's look at the matrix
By symmetry
We know the 2nd and 3rd derivatives in each cell of the above matrix. Let's look at how to get the derivative of the 1st term in each cell.
We have previously calculated all the values except the last one in each cell, which is a simple derivative of linear terms.
Let's look at the matrix
By symmetry
Considering a learning rate (lr) of 0.01, we get our final weight matrix as
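And the corresponding NumPy sketch for this last set of weights (hypothetical numbers, my own variable names; the ReLU derivative is 1 wherever the pre-activation is positive):

```python
import numpy as np

# Hypothetical stand-ins for the article's worked numbers.
dz2 = np.array([[0.05, -0.02, 0.01]])      # error signal at hidden layer 2
W2  = np.random.randn(3, 3)                # hidden-1 -> hidden-2 weights
z1  = np.array([[0.2, 0.4, 0.6]])          # pre-activations of hidden layer 1 (all positive)
x   = np.array([[0.1, 0.2, 0.7]])          # the input example
W1  = np.random.randn(3, 3)                # input -> hidden-1 weights
lr  = 0.01

da1 = dz2 @ W2.T                           # push the error back through the weights
dz1 = da1 * (z1 > 0)                       # chain through the ReLU derivative
dW1 = x.T @ dz1                            # gradient w.r.t. input -> hidden-1 weights
W1 -= lr * dW1                             # gradient-descent update
```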
The End
Our initial weights
Our final weights
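For completeness, here is a minimal end-to-end sketch that ties all of the above together in NumPy (my own variable names and random numbers, not the article's actual values; the bias terms are used in the forward pass but kept fixed, matching the notes below):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x  = np.array([[0.1, 0.2, 0.7]])           # hypothetical input
y  = np.array([[1.0, 0.0, 0.0]])           # hypothetical one-hot target
lr = 0.01

W1, W2, W3 = (rng.standard_normal((3, 3)) for _ in range(3))
b1, b2, b3 = (rng.standard_normal((1, 3)) for _ in range(3))

for step in range(100):
    # Forward pass
    z1 = x @ W1 + b1;  a1 = relu(z1)
    z2 = a1 @ W2 + b2; a2 = sigmoid(z2)
    z3 = a2 @ W3 + b3; y_hat = softmax(z3)
    cost = -np.sum(y * np.log(y_hat))      # cross-entropy

    # Backward pass: one error signal per layer, as derived section by section above
    dz3 = y_hat - y
    dz2 = (dz3 @ W3.T) * a2 * (1 - a2)
    dz1 = (dz2 @ W2.T) * (z1 > 0)

    # Gradient-descent updates (biases left untouched, as in the derivation above)
    W3 -= lr * (a2.T @ dz3)
    W2 -= lr * (a1.T @ dz2)
    W1 -= lr * (x.T  @ dz1)

print(cost)                                # the cost should shrink over the iterations
```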
Important Notes:
- I have completely eliminated the bias when differentiating. Do you know why?
- Backprop of the bias should be straightforward. Try it on your own.
- I have taken only one example. What will happen if we take a batch of examples?
- Though I have not directly mentioned vanishing gradients, do you see why they occur?
- What would happen if all the weights were the same number instead of random?