Note: As an Amazon Associate I earn from qualifying purchases, and I get commissions for purchases made through links in this post. See the full disclaimer here
Another Note: This series assumes you know some middle/high school maths (really just algebra) to get the most from the sections.

## Companion Resource

While the notes below are my own thoughts on generalizing gradient descent, I am following a book that goes into much more detail; I only try to present a high-level view of what I'm learning. You can get that book, Grokking Deep Learning, from Manning Publications as an ebook (the cheaper option) or from Amazon as a physical copy here: https://amzn.to/2YVTrmz

## Chapter 5: Generalizing Gradient Descent Notes

If Chapter 4 was looking to introduce you to gradient descent (GD), Chapter 5 is looking to generalize that concept in a few different ways:
* Multiple input nodes with one output node
* Freezing One Weight
* One input node with multiple output nodes
* Multiple input and output nodes

### Gradient Descent w/ multiple input nodes & one output node

• Since you have multiple input nodes that share one output node, the single `delta` that was calculated (`delta = pred - true`) is shared by all of the input nodes. Scaling that shared `delta` by each node's input value gives you the appropriate `weight_delta` for each node.
• Remember: the `weight_delta` value tells you how far your prediction is from the actual value (positive or negative), in relation to the respective input value. As an equation: `weight_delta = that specific input value * the delta calculated`.
• After finding each `weight_delta` value, you calculate the new `weight` value with `weight -= alpha * that specific input node's weight_delta`. With the new weights, your prediction is the weighted sum over all the input nodes: `pred = sum of (input node value * new weight value)`. You repeat this process over x iterations.
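To make the three bullets above concrete, here is a minimal sketch in plain Python. The function names (`weighted_sum`, `train_multi_input`) and all of the numbers are my own illustrative assumptions, not code taken from the book.

```python
def weighted_sum(inputs, weights):
    # Prediction for one output node: sum of input * weight pairs
    return sum(i * w for i, w in zip(inputs, weights))


def train_multi_input(inputs, weights, true, alpha, iterations):
    # Hypothetical helper: gradient descent for many inputs, one output
    weights = list(weights)  # copy so the caller's list is untouched
    for _ in range(iterations):
        pred = weighted_sum(inputs, weights)
        delta = pred - true  # one shared delta for the single output
        # Each weight_delta is the shared delta scaled by that node's input
        weight_deltas = [inp * delta for inp in inputs]
        weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
    return weights


# Illustrative values only
inputs = [8.5, 0.65, 1.2]
weights = [0.1, 0.2, -0.1]
true = 1.0
alpha = 0.01

trained = train_multi_input(inputs, weights, true, alpha, 50)
final_pred = weighted_sum(inputs, trained)
print(final_pred)
```

With these made-up numbers the prediction should settle very close to `true` well within the 50 iterations.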

### Freezing One Weight

• Freezing one weight basically allows you to see which of the input nodes has the biggest influence on your prediction value. As Trask puts it, "a [one of the input nodes in your neural network] may be a powerful input with lots of predictive power, but if the network accidentally figures out how to predict accurately on the training data without it, then it will never learn to incorporate a into its prediction" [1].
• If you're wondering how you would freeze one weight, you just set that weight's `weight_delta` to `0` on every iteration. Anything multiplied by `0` is `0`, so the update `alpha * weight_delta` is always `0` and the weight keeps the same value, as if you were "freezing" it in a certain state.
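Here is a small sketch of freezing, assuming it is done by zeroing the frozen weight's `weight_delta` before each update. The helper name `train_with_frozen_weight`, the `frozen_index` parameter, and the numbers are hypothetical and only for illustration.

```python
def weighted_sum(inputs, weights):
    # Prediction for one output node: sum of input * weight pairs
    return sum(i * w for i, w in zip(inputs, weights))


def train_with_frozen_weight(inputs, weights, true, alpha, iterations, frozen_index):
    # Hypothetical helper: same loop as before, but one weight never changes
    weights = list(weights)
    for _ in range(iterations):
        pred = weighted_sum(inputs, weights)
        delta = pred - true
        weight_deltas = [inp * delta for inp in inputs]
        weight_deltas[frozen_index] = 0  # freeze: this weight gets no update
        weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
    return weights


# Illustrative values only
inputs = [8.5, 0.65, 1.2]
weights = [0.1, 0.2, -0.1]
true = 1.0
alpha = 0.3

frozen_trained = train_with_frozen_weight(inputs, weights, true, alpha, 50, frozen_index=0)
frozen_pred = weighted_sum(inputs, frozen_trained)
print(frozen_trained, frozen_pred)
```

In this sketch the first weight stays at its starting value while the other two weights still learn enough to bring the prediction close to `true`, which is the situation Trask's quote warns about.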

### Gradient Descent with one input node & multiple output nodes

• This time gradient descent is the reverse of the first subtopic: one input node influences three different output nodes. Since the three output nodes share one input node, each output node gets its own `delta` value, which tells you how far that output's prediction is from its true value.
• Equations to keep in mind (one of each per output node):
* `pred = one input node value * that output's weight`
* `delta = pred - true`
• Because of the three output nodes, `weight_delta` is going to be a list: `weight_deltas = one input node value * list of deltas from each output node`
• Finally, you repeat the first subtopic's approach to calculating the new `weight` values to test out the new predictions.
• The difference between the first subtopic and this subtopic is just which side of the neural network (input vs. output) has more than one node. Then you do the necessary multiplication.
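The bullets above can be sketched like this. Again, the function name `train_multi_output` and the values are assumptions I made up for illustration, so treat it as a rough sketch and not the book's code.

```python
def train_multi_output(input_value, weights, trues, alpha, iterations):
    # Hypothetical helper: one input node feeding several output nodes
    weights = list(weights)
    for _ in range(iterations):
        # One prediction and one delta per output node
        preds = [input_value * w for w in weights]
        deltas = [p - t for p, t in zip(preds, trues)]
        # The same input value scales every output's delta
        weight_deltas = [input_value * d for d in deltas]
        weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
    return weights


# Illustrative values only
input_value = 0.65
weights = [0.3, 0.2, 0.9]
trues = [0.1, 1.0, 0.1]
alpha = 0.5

out_weights = train_multi_output(input_value, weights, trues, alpha, 200)
out_preds = [input_value * w for w in out_weights]
print(out_preds)
```

Each output node learns independently here: its weight only ever sees its own `delta`, multiplied by the one shared input.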

### Gradient Descent with multiple input and output nodes

• This last subtopic is when you have multiple input and output nodes. If you understood the first and third subtopics (in these notes and the book), then this shouldn't be as hard to fathom.
• Each row of weight values, combined with the input values, gives one prediction, so you find a `delta` value for each row (that is, for each output node).
• After you find the `delta` values, you calculate each row's `weight_delta` values by multiplying that row's `delta` by each input value.
• Finally, you calculate the new weight values for each column in the row and assign those as the new weight values to use in the prediction. See the code snippet below.
``````
# This snippet assumes that you have already calculated your weight_deltas.
# The nested for loop assigns a new weight to each column in a row
# (i -> each row in the matrix, j -> each column in that row).
# You go through all the columns (j) in row (i), then move to the next row and start again at column 0.

for i in range(len(weights)):
    for j in range(len(weights[0])):
        weights[i][j] -= alpha * weight_deltas[i][j]

``````
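For context, here is a fuller sketch that also computes the `weight_deltas` the snippet above relies on. I am assuming that each row of `weights` belongs to one output node and each column to one input node; the helper names (`predict`, `train_multi_in_out`) and the numbers are hypothetical.

```python
def predict(inputs, weights):
    # One weighted sum per row of the weight matrix (row = output node)
    return [sum(inp * w for inp, w in zip(inputs, row)) for row in weights]


def train_multi_in_out(inputs, weights, trues, alpha, iterations):
    # Hypothetical helper: many inputs, many outputs
    weights = [list(row) for row in weights]  # copy the matrix
    for _ in range(iterations):
        preds = predict(inputs, weights)
        deltas = [p - t for p, t in zip(preds, trues)]
        # weight_deltas[i][j] = delta of output i * value of input j
        weight_deltas = [[inp * d for inp in inputs] for d in deltas]
        for i in range(len(weights)):
            for j in range(len(weights[0])):
                weights[i][j] -= alpha * weight_deltas[i][j]
    return weights


# Illustrative values only
inputs = [8.5, 0.65, 1.2]
weights = [[0.1, 0.1, -0.3],
           [0.1, 0.2, 0.0],
           [0.0, 1.3, 0.1]]
trues = [0.1, 1.0, 0.1]
alpha = 0.01

new_weights = train_multi_in_out(inputs, weights, trues, alpha, 50)
new_preds = predict(inputs, new_weights)
print(new_preds)
```

This is really just the first subtopic repeated once per output row, which is why the nested loop is all that changes.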

### A few GIFs on Gradient Descent

Between this post's notebook and the previous post's notebook, there is a lot of talk about gradient descent in deep learning. However, I want to show a visual representation of what is actually going on with the math. I'm not proficient in matplotlib (as of yet), so I believe the GIFs below do a good job of showing/plotting what is going on mathematically.

In both GIFs below, you see that the moving marker (whether it's the dots or the black line) is trying to get to the lowest point in the parabola. Trask says, "What you're really trying to do with the neural network is find the lowest point on this big error plane [the parabolas below], where the lowest point refers to the lowest `error`" [2]. This "lowest error" means you have reached a point in your iterations where your `pred = input * weights` is very close to the value you want to see, your `true` in this case [2].
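If you would like numbers to go with the pictures, here is a tiny sketch with a single weight. It assumes a squared error, `(pred - true) ** 2`, which is what gives the parabola shape; the values are made up for illustration.

```python
# Illustrative values only
input_value = 2.0
true = 0.8
alpha = 0.1


def error(weight):
    # Squared error is a parabola in `weight`, lowest where pred == true
    return (input_value * weight - true) ** 2


weight = 0.0
errors = []
for _ in range(20):
    errors.append(error(weight))
    delta = input_value * weight - true
    weight -= alpha * input_value * delta  # step downhill on the parabola

print(weight, errors[0], errors[-1])
```

The recorded errors shrink on every step, and `weight` ends up near `true / input_value`, the bottom of the parabola.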

### Jupyter Notebook

As always, the Jupyter notebook (chap5_generalizingGradientDescent on Kaggle) is provided for you to follow along with.

As always, until next time ✌🏾

## References

[1] "Learning multiple weights at a time: Generalizing Gradient Descent" Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019, p. 263.

[2] "Learning multiple weights at a time: Generalizing Gradient Descent" Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019, p. 267.

[3] Ng, Andrew. "Linear Regression with One Variable | Gradient Descent - [Andrew Ng]" YouTube, 22 June 2020. https://www.youtube.com/watch?v=F6GSRDoB-Cg&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=8

[4] Tejani, Alykhan. "A Brief Introduction To Gradient Descent" alykhantejani.github.io, 22 June 2020. https://alykhantejani.github.io/a-brief-introduction-to-gradient-descent/