
Deep Learning From Scratch, Part 3: Generalizing Gradient Descent

Note: As an Amazon Associate I earn from qualifying purchases. I get commissions for purchases made through links in this post. See a fuller disclaimer here
Another Note: This series assumes you know some middle/high school maths (really just algebra) to get the most from the sections.

Companion Resource

While the notes below are my thoughts on generalizing gradient descent, I am following a book that goes into much more detail; I'm just trying to present a high-level view of what I'm learning. You can get that book, Grokking Deep Learning, here at Manning Publications for the ebook version (which is cheaper), or grab the physical copy on Amazon here https://amzn.to/2YVTrmz

Chapter 5: Generalizing Gradient Descent Notes

If Chapter 4 was looking to introduce you to gradient descent (GD), Chapter 5 is looking to generalize that concept in a few different ways:
* Multiple input nodes with one output node
* Freezing one weight
* One input node with multiple output nodes
* Multiple input and output nodes

Gradient Descent w/ multiple input nodes & one output node

  • Since you have multiple input nodes sharing one output node, the single delta calculated at the output needs to be passed back to each of the input nodes. Scaling that delta by each input value gives you the appropriate weight_delta for each node.
  • Remember: The weight_delta value is telling you how far your prediction is (positive or negative) from the actual value, in relation to the respective input value. Using math, the equation would look like the following: weight_delta = that specific input value * the delta calculated.
  • After finding each weight_delta value, you would then calculate the new weight value with weight -= alpha * that specific input node's weight_delta. With these new weights, your prediction becomes the weighted sum pred = sum of (input node value * new weight value) across all the input nodes. You repeat this process over x iterations.
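The bullets above can be sketched in a few lines of Python. The input values, weights, true value, and alpha here are made up for illustration; they are not the book's exact example:

```python
inputs = [8.5, 0.65, 1.2]     # three input nodes (made-up values)
weights = [0.1, 0.2, -0.1]    # one weight per input node
true = 1.0                    # the value we want to predict
alpha = 0.01

for iteration in range(3):
    # the single prediction is the weighted sum of all the inputs
    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - true
    # the shared delta, scaled by each input, gives each weight_delta
    weight_deltas = [i * delta for i in inputs]
    # update every weight with its own weight_delta
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```

Notice that every weight shares the same delta; only the input value it gets multiplied by differs from node to node.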

Freezing One Weight

  • Freezing one weight basically lets you see whether the network can learn to predict accurately without one of its inputs, and how the other weights compensate. Another way of saying it is how Trask puts it: "a [an input node in your neural network] may be a powerful input with lots of predictive power, but if the network accidentally figures out how to predict accurately on the training data without it, then it will never learn to incorporate a into its prediction" [1].
  • If you're wondering how you would freeze one weight, you would set that weight's weight_delta to 0 on every iteration. Since the update is weight -= alpha * weight_delta, a weight_delta of 0 means the weight never changes. Essentially, the weight value will always stay the same, as if you were "freezing" the weight in its current state
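Here's a minimal sketch of freezing the first weight, again with made-up numbers. The only change from ordinary gradient descent is zeroing out that one weight_delta before the update:

```python
inputs = [8.5, 0.65, 1.2]     # made-up input values
weights = [0.1, 0.2, -0.1]
true, alpha = 1.0, 0.01

for iteration in range(3):
    pred = sum(i * w for i, w in zip(inputs, weights))
    delta = pred - true
    weight_deltas = [i * delta for i in inputs]
    weight_deltas[0] = 0  # freeze: the first weight never receives an update
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]

# weights[0] is untouched; the other weights absorb all of the learning
```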

Gradient Descent with one input node & multiple output nodes

  • This time gradient descent runs in the reverse of the first subtopic: you have one input node influencing three different output nodes. Since the three output nodes share one input node, each output node gets its own delta value telling you how far that output's prediction is from its true value.
  • Equations to keep in mind:
    * pred = one input node value * initial weight
    * delta = pred - true
  • Because of the three output nodes, weight_delta is going to be a list: each entry is weight_delta = the one input node's value * that output node's delta
  • Finally, you would repeat the first subtopic's approach to calculating the new weight values to test out the new predictions
  • The difference between the first subtopic and this subtopic is just which side of the neural network (input vs. output) has more than one node. Then you do the necessary multiplication
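The equations above can be sketched like so (the input value, weights, and true values are made up for illustration):

```python
input_value = 0.65
weights = [0.3, 0.2, 0.9]     # one weight per output node
trues = [0.1, 1.0, 0.1]       # one true value per output node
alpha = 0.1

for iteration in range(3):
    # one prediction per output node: pred = input * weight
    preds = [input_value * w for w in weights]
    # one delta per output node: delta = pred - true
    deltas = [p - t for p, t in zip(preds, trues)]
    # one weight_delta per output: the input value * that output's delta
    weight_deltas = [input_value * d for d in deltas]
    weights = [w - alpha * wd for w, wd in zip(weights, weight_deltas)]
```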

Gradient Descent with multiple input and output nodes

  • This last subtopic is when you have multiple input and output nodes. If you understood the first and third subtopics (in these notes and the book), then this shouldn't be as hard to fathom.
  • For each output node, you find a delta value by comparing that output's prediction (the weighted sum of the input values against that output's row of weights) to its true value.
  • After you find the delta values, you calculate each row's weight_delta values: that output node's delta multiplied by each input value.
  • Finally, you calculate the new weight values for each column in the row and assign those as the new weight values to use in the prediction. See code snippet below

	# this code snippet assumes you have already calculated your weight_deltas
	# the nested for loop assigns a new weight to each column in a row
	# (i -> each row in the matrix, j -> each column in that row)
	# you go through all the columns (j) in row (i), then move to the next row and start back at column 0

	for i in range(len(weights)):
		for j in range(len(weights[0])):
			weights[i][j] -= alpha * weight_deltas[i][j]
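Putting this whole subtopic together, here's a self-contained sketch of multi-input, multi-output gradient descent. The inputs, weights, and true values are toy numbers (not the book's dataset), but it ends in the same nested-loop weight update:

```python
inputs = [8.5, 0.65, 1.2]          # made-up input values
weights = [[0.1, 0.1, -0.3],       # row 0: weights into output node 0
           [0.1, 0.2, 0.0],        # row 1: weights into output node 1
           [0.0, 1.3, 0.1]]        # row 2: weights into output node 2
trues = [0.1, 1.0, 0.1]            # one true value per output node
alpha = 0.01

for iteration in range(3):
    # each output's prediction is a weighted sum of all the inputs
    preds = [sum(i * w for i, w in zip(inputs, row)) for row in weights]
    # one delta per output node
    deltas = [p - t for p, t in zip(preds, trues)]
    # weight_deltas[i][j] = input j's value * output i's delta
    weight_deltas = [[inp * d for inp in inputs] for d in deltas]
    # the nested-loop update from the snippet above
    for i in range(len(weights)):
        for j in range(len(weights[0])):
            weights[i][j] -= alpha * weight_deltas[i][j]
```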

A few GIFs on Gradient Descent

So in between this post's notebook and the previous post's notebook, there has been a lot of talk about gradient descent in deep learning. However, I want to show a visual representation of what is actually going on with the math. I'm not well versed in matplotlib (as of yet), so I believe the GIFs below do a good job of showing/plotting what is going on mathematically.

With both GIFs below, you see that (whether it's the dots or the black line) both are trying to get to the lowest point of the parabola. Trask says, "What you're really trying to do with the neural network is find the lowest point on this big error plane (the parabolas below), where the lowest point refers to the lowest error" [2]. Reaching this "lowest error" means you have hit a point in your iterations where your pred = input * weights is actually very close to the values you want to see, or your true in this case [2].
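To make the "walking down the parabola" idea concrete, here's a tiny single-weight sketch (with made-up numbers) that tracks error = delta ** 2 shrinking on every iteration:

```python
input_value, weight, true, alpha = 2.0, 0.5, 0.8, 0.1

errors = []
for iteration in range(5):
    pred = input_value * weight
    delta = pred - true
    errors.append(delta ** 2)          # one point on the error parabola
    weight -= alpha * (input_value * delta)

# each error is smaller than the last -- descending toward the bottom of the parabola
```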

Fig 1. Gradient Descent GIF (https://gfycat.com/angryinconsequentialdiplodocus) taken from Andrew Ng's Machine Learning Course [3]
Fig. 2 Gradient Descent GIF (https://giphy.com/gifs/gradient-O9rcZVmRcEGqI) taken from Alykhan Tejani's blog [4]

Jupyter Notebook

As always, the Jupyter notebook, chap5_generalizingGradientDescent | Kaggle, is provided for you to follow along with.

As always, until next time ✌🏾


References

[1]  “Learning multiple weights at a time: Generalizing Gradient Descent” Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019, p. 263.

[2]  “Learning multiple weights at a time: Generalizing Gradient Descent” Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019, p. 267.

[3]  Ng, Andrew. "Linear Regression with One Variable | Gradient Descent - [Andrew Ng]" Youtube, 22 June 2020. https://www.youtube.com/watch?v=F6GSRDoB-Cg&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=8

[4]  Tejani, Alykhan. "A Brief Introduction To Gradient Descent" alykhantejani.github.io, 22 June 2020. https://alykhantejani.github.io/a-brief-introduction-to-gradient-descent/
