Note: As an Amazon Associate I earn from qualifying purchases. I get commissions for purchases made through links in this post. See a more full disclaimer here
Another Note: This series assumes you know some middle/high school maths (really just algebra) to get the most from the sections.
While the notes below are my thoughts on generalizing gradient descent, I am following a book that goes into much more detail; I'm just presenting a high-level view of what I'm learning. You can get that book, Grokking Deep Learning, at Manning Publications for the ebook version (which is cheaper) or the physical copy on Amazon here https://amzn.to/2YVTrmz
Chapter 5: Generalizing Gradient Descent Notes
If Chapter 4 was looking to introduce you to gradient descent (GD), Chapter 5 is looking to generalize that concept in a few different ways:
* Multiple input nodes with one output node
* Freezing one weight
* One input node with multiple output nodes
* Multiple input and output nodes
Gradient Descent w/ multiple input nodes & one output node
- Since you have multiple input nodes that share one output node, the `delta` that was calculated at the output is shared back with each of the input nodes. Multiplying that `delta` by each input value gives you the appropriate `weight_delta` for each node.
- Remember: The `weight_delta` value is telling you how far your prediction (positive or negative) is from the actual value, scaled by the respective input value. Using math, the equation would look like the following: `weight_delta = that specific input value * the delta calculated`.
- After finding each `weight_delta` value, you would then calculate the new weight with `weight -= alpha * that specific input node's weight_delta`. With these new weights, your prediction becomes the weighted sum `pred = sum of each input node value * its new weight value`. You repeat this process over x iterations.
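The steps above can be sketched in plain Python. The input values, starting weights, target, and `alpha` below are toy numbers of my own, not taken from the book:

```python
# Gradient descent with three input nodes and one output node.
# inputs/weights/true/alpha are made-up toy values for illustration.
inputs = [8.5, 0.65, 1.2]
weights = [0.1, 0.2, -0.1]
true = 1.0
alpha = 0.01

for iteration in range(3):
    # prediction is the weighted sum (dot product) of inputs and weights
    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - true
    # each input gets its own weight_delta: input value * the shared delta
    weight_deltas = [i * delta for i in inputs]
    # update every weight in the direction that shrinks the error
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```

After a few iterations the prediction moves noticeably closer to `true`, which is the whole point of repeating the update.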
Freezing One Weight
- Freezing one weight basically allows you to see which of the input nodes has the biggest influence on your prediction value. Another way of saying it is how Trask puts it: "a (here, just one input node in your neural network) may be a powerful input with lots of predictive power, but if the network accidentally figures out how to predict accurately on the training data without it, then it will never learn to incorporate a into its prediction".
- If you're wondering how you would freeze one weight, you would make that weight's `weight_delta` value `0` on every iteration. Since the update is `weight -= alpha * weight_delta`, and multiplying anything by `0` gives you `0`, that weight's value never changes, essentially "freezing" the weight in its current state.
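A minimal sketch of freezing, using toy numbers of my own (the book freezes the weight it calls a; here I freeze index 0 to mirror that idea):

```python
# Freezing the first weight by zeroing out its weight_delta each iteration.
# All numbers are made-up toy values.
inputs = [1.0, 0.5, 0.25]
weights = [0.5, 0.3, 0.2]
true = 0.9
alpha = 0.1
frozen = 0  # index of the weight we freeze

for iteration in range(5):
    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - true
    weight_deltas = [i * delta for i in inputs]
    weight_deltas[frozen] = 0  # the frozen weight receives no update
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```

The frozen weight stays at its starting value while the other weights keep moving, so the network learns to predict without it.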
Gradient Descent with one input node & multiple output nodes
- This time gradient descent is the reverse of the first subtopic. You have one input node having an influence on three different output nodes. Since three output nodes share one input node, each output node gets its own `delta` value telling you how far that output's prediction is from its true value.
- Equations to keep in mind (one per output node): `pred = one input node value * that output's weight` and `delta = pred - true`.
- Because of the three output nodes, `weight_delta` is going to be a list: `weight_deltas = one input node value * each output node's delta`.
- Finally, you would repeat the first subtopic's approach to calculating each new `weight` value and test out the new predictions.
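Those bullets can be sketched as follows; the single input value, the three weights, the true values, and `alpha` are my own toy numbers, not the book's:

```python
# Gradient descent with one input node and three output nodes.
# All values below are made-up for illustration.
input_value = 0.65
weights = [0.3, 0.2, 0.9]
trues = [0.1, 1.0, 0.1]
alpha = 0.1

for iteration in range(3):
    # each output node's prediction uses the same single input
    preds = [input_value * w for w in weights]
    # one delta per output node
    deltas = [p - t for p, t in zip(preds, trues)]
    # the single input value scales every delta into a weight_delta
    weight_deltas = [input_value * d for d in deltas]
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```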
- The difference between the first subtopic and this subtopic is just which side of the neural network (input vs output) has one or more nodes. Then you do the necessary multiplication
Gradient Descent with multiple input and output nodes
- This last subtopic is when you have multiple input and output nodes. If you understood the first and third subtopics (in these notes and the book), then this shouldn’t be as hard to fathom.
- For each output node, you're going to find the prediction (the weighted sum of that output's row of weight values and the input values) and then that output's `delta` value.
- After you find the `delta` values, you have to calculate each row's `weight_delta` values: one per input/output pair, `weight_deltas[i][j] = delta of output i * input value j`.
- Finally, you calculate the new weight value for each column in each row and assign those as the new weight values to use in the prediction. See code snippet below
```python
# this code snippet assumes that you have calculated your weight_deltas
# this nested for loop is basically assigning new weights to each column
# in a row (i -> each row in a matrix, j -> each column in a row)
# You go through all the columns (j) in row (i) and then you move to the
# next row and start at column 0
for i in range(len(weights)):
    for j in range(len(weights[i])):
        weights[i][j] -= alpha * weight_deltas[i][j]
```
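For the part that snippet assumes (computing `weight_deltas`), here's a fuller sketch; the specific matrices and values are my own toy choices, not taken verbatim from the book:

```python
# Gradient descent with three input nodes and three output nodes.
# Inputs, trues, and the starting weight matrix are toy values.
inputs = [8.5, 0.65, 1.2]
trues = [0.1, 1.0, 0.1]
alpha = 0.01
weights = [[0.1, 0.1, -0.3],  # row i holds the weights feeding output node i
           [0.1, 0.2, 0.0],
           [0.0, 1.3, 0.1]]

for iteration in range(3):
    # each output's prediction is the weighted sum of all the inputs
    preds = [sum(i * w for i, w in zip(inputs, row)) for row in weights]
    deltas = [p - t for p, t in zip(preds, trues)]
    # weight_deltas[i][j] = delta of output i * input value j (an outer product)
    weight_deltas = [[inp * d for inp in inputs] for d in deltas]
    # the nested update loop from the snippet above
    for i in range(len(weights)):
        for j in range(len(weights[i])):
            weights[i][j] -= alpha * weight_deltas[i][j]
```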
A few GIFs on Gradient Descent
So between this post's notebook and the previous post's notebook, there is a lot of talk about gradient descent in Deep Learning. However, I want to show a visual representation of what is actually going on with the math. I'm not well versed in matplotlib (as of yet), so I believe the GIFs below do a good job of showing/plotting what is going on mathematically.
In both GIFs below you see that (whether it's the dots or the black line) both are trying to get to the lowest point in the parabola. Trask says, "What you're really trying to do with the neural network is find the lowest point on this big error plane (the parabolas below), where the lowest point refers to the lowest error". This "lowest error" means you have reached a point in your iterations where your `pred = input * weights` is actually very close to the values you want to see, your `true` values in this case.
As always, the Jupyter notebook, chap5_generalizingGradientDescent | Kaggle, is provided for you to follow along with.
As always, until next time ✌🏾
“Learning Multiple Weights at a Time: Generalizing Gradient Descent.” Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019, p. 263.
“Learning Multiple Weights at a Time: Generalizing Gradient Descent.” Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019, p. 267.
Ng, Andrew. “Linear Regression with One Variable | Gradient Descent - [Andrew Ng].” YouTube, 22 June 2020. https://www.youtube.com/watch?v=F6GSRDoB-Cg&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=8
Tejani, Alykhan. “A Brief Introduction To Gradient Descent.” alykhantejani.github.io, 22 June 2020. https://alykhantejani.github.io/a-brief-introduction-to-gradient-descent/