
Deep Learning From Scratch, Part 2: Learning Gradient Descent

Note: As an Amazon Associate I earn from qualifying purchases. I get commissions for purchases made through links in this post. See a fuller disclaimer here
Another Note: This series assumes you know some middle/high school maths (really just algebra) to get the most from the sections.

A few thoughts about our current time

This historical moment in the world feels like a turning point that Black people nationally (here in the US) and internationally have been waiting for. This moment was ignited by the recent deaths of George Floyd, Ahmaud Arbery, and Breonna Taylor. However, the wound of institutional, structural, and systemic racism has been “open” since 400+ years ago when Black people were stolen from their land and brought to the US. Thankfully, the healing is starting as people across the globe are recognizing and addressing systems that have long benefited from implicit or explicit racism. This has given my heart both joy during these times and hope for what’s to come. If you are looking for places to give or understand, check out Home - Black Lives Matter for more information. #BlackLivesMatter ✊🏾

If you would like to follow along with this series, the book that I’m using is the one below:
You can get the ebook version at Manning Publications  (which is the cheaper option) or Amazon if you like having hard copy books.

Without further ado, let’s jump into part 2 of our series 😃

Chapter 4: Gradient Descent Notes

Note: Chapter 4 of the book looks primarily at one input and one weight, not multiple inputs and multiple weights like part 1.


  • One of the biggest takeaways from this chapter is the idea that measuring the error (how far off your calculated prediction is from the actual one) is extremely important in neural network training. Ultimately you want error = 0, or as close to 0 as possible, since this means the calculated and actual predictions are the same (or nearly the same) value.
  • In a very simple example, let’s say you went to your favorite fast food joint. You order what you usually get (a cheeseburger 🍔 and fries 🍟). The total comes to $6.10 including tax, and you hand the cashier a $10 bill. You receive $0.90 back in change. At this point, you should be making this face


  • You were shortchanged 3 WHOLE DOLLARS. Now you’re ready to get your money back, and with good reason! Similarly, in neural networks, the weights (the cashier in this example) will most likely not produce an accurate prediction the first time around. Calculating the error helps your neural network know, “Oh, I didn’t give you the correct amount of change for your meal. Did I give you too little (negative, -) or too much (positive, +)?”
  • One way of helping your network figure out on its own, “I gave you too little or too much back in change”, is to adjust the weights manually (this is called Hot and Cold Learning). Adjusting the weights manually may sound fine for one or two weights, but it is inefficient in the long run. To use another example, Hot and Cold Learning is kinda like playing darts with a blindfold on, one hand tied behind your back, feet tied, and drunk 😂. Thankfully, with each chance you’re getting closer and closer to the bullseye by peeking under the blindfold before making the next throw. This “peeking under the blindfold” is done in code with our step_amount, which is added to or subtracted from weight. Sadly, it’s going to take many chances (in the example below, 1101 iterations) before you are close to the bullseye 😭. The code from the Jupyter notebook gives the idea with comments.

A better way: Calculate the Direction and Amount

  • A better approach than Hot and Cold Learning is to scale the error by input_val, instead of using a fixed step_amount. The equation from the book tells us that: direction_and_amount = (prediction - actual_prediction) * input_val [1].
  • (prediction - actual_prediction) 👉🏾 is your pure error, how much you’re off by, either negative or positive
  • The new addition, * input_val 👉🏾 handles scaling, negative reversing (see more about this on pg. 191), and stopping (if input_val is 0) of our pure error. This makes direction_and_amount sensitive to the size of your input [2]. With Hot and Cold Learning above, adding/subtracting step_amount from weight is barely sensitive at all: it is the same static value every iteration.
  • Thus, weight = weight - direction_and_amount changes the weight every iteration 👉🏾 which reduces the error toward 0 faster (since the update is sensitive to input_val) 👉🏾 which ultimately gets your calculated prediction close, if not equal, to your actual prediction.
  • If we’re playing darts again, now you’re not drunk, both feet are untied, and your hand is untied from your back. Plus, your blindfold is now also see-through 👌🏾.


  • You’re much more likely to hit the bullseye in fewer iterations/“chances”. The code from the Jupyter notebook gives the idea with comments.
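The scaled update can be sketched with the same illustrative numbers as before (weight = 0.5, input_val = 0.5, goal_pred = 0.8); again, this is a sketch under those assumptions, not the notebook’s exact code:

```python
# Scale the pure error by the input to get both the direction
# and the amount of the weight update in one multiply.
weight = 0.5
input_val = 0.5
goal_pred = 0.8

for iteration in range(20):
    prediction = input_val * weight
    error = (prediction - goal_pred) ** 2

    # pure error * input_val: handles scaling, negative reversing,
    # and stopping (input_val == 0) all at once
    direction_and_amount = (prediction - goal_pred) * input_val
    weight = weight - direction_and_amount
```

Notice the loop runs only 20 times, versus the 1101 throws Hot and Cold Learning needed, and the weight still lands essentially on 1.6.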

A better way: Calculate the Direction and Amount (with pictures)

  • If the above code with comments and explanation wasn’t clear, below are two pictures which I think will solidify the concepts previously stated 😃

Overcorrecting Neural Networks

  • Understanding the code and diagrams above, you realize that if you’re scaling your slope by * input_val, the size of input_val is very important. A big input_val can make the new slope “overshoot” the target, overcorrecting the value of the new prediction prediction = input * weight, which is not what you want! The best way to counter that “overshooting” is to use a value called alpha. This alpha value is between 0 and 1 and scales your slope/derivative down to where it’s manageable and unlikely to overshoot (weight = weight - (alpha * derivative)). The code from the Jupyter notebook gives the idea with comments.
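A minimal sketch of the alpha fix, with a deliberately large input_val to show the overshooting problem. All the values here (input_val = 2.0, alpha = 0.1) are illustrative assumptions, not the notebook’s exact numbers:

```python
# With a big input, the raw update overshoots the goal, because the
# update is scaled by input_val twice (once in the prediction, once
# in the derivative). Alpha shrinks the update back down.
weight = 0.5
input_val = 2.0   # big input: without alpha, the weight would oscillate
goal_pred = 0.8
alpha = 0.1       # between 0 and 1; in practice found by trial and error

for iteration in range(20):
    prediction = input_val * weight
    derivative = (prediction - goal_pred) * input_val
    weight = weight - (alpha * derivative)
```

With input_val = 2.0, the un-scaled update weight = weight - derivative would flip the weight back and forth ever further from the goal; multiplying by alpha = 0.1 lets it settle smoothly onto 0.4 (where 2.0 * 0.4 = 0.8).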

Jupyter Notebook

As always, the Jupyter notebook is provided here chap4_learning_gradientDescent | Kaggle for you to follow along with.

As always, until next time ✌🏾


[1] “Introduction to neural learning: gradient descent.” Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019, p. 190.

[2] “Introduction to neural learning: gradient descent.” Grokking Deep Learning, by Andrew W. Trask, Manning Publications, 2019, p. 191.
