
Gradient Descent

Videos:

  • Gradient Descent

What is Gradient Descent?

Simply put, gradient descent is an algorithm for minimizing the loss function of a model during training.

  • J, the cost function, is convex. That refers to the shape of J plotted over the model's parameters: it's a bowl shape, not a squiggly shape with many dips. That means that no matter where we start, we eventually end up in the same area.
  • As mentioned, regardless of the starting parameter values, gradient descent eventually lands at the same lowest value of the overall cost function. We call that lowest point the global optimum (sketched below).
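
To make the convexity point concrete, here is a quick toy sketch in Python (my own example, not from the course). The convex cost J(w) = (w - 3)² stands in for a real cost function; every starting point rolls down to the same global optimum.

    # Toy convex cost: a single bowl-shaped minimum at w = 3.
    def J(w):
        return (w - 3) ** 2

    def dJ_dw(w):
        return 2 * (w - 3)

    def descend(w_start, alpha=0.1, steps=200):
        w = w_start
        for _ in range(steps):
            w = w - alpha * dJ_dw(w)  # step opposite the gradient
        return w

    # Different starting points all end up at the same global optimum.
    for w0 in (-10.0, 0.0, 25.0):
        print(f"start={w0:6.1f} -> w={descend(w0):.4f}")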

Here’s Andrew Ng’s illustration of gradient descent, using mathematical notation:

repeat until convergence {
    w := w - 𝛼(dJ(w,b)/dw)
    b := b - 𝛼(dJ(w,b)/db)
}
  • 𝛼 is the learning rate
  • w and b are the model's parameters (the weights and the bias)
  • Gradient descent works by iteratively updating w and b in the opposite direction of the gradient of J, until the cost eventually hits the “global optimum”, the lowest it will go before it would start to go back up. This is what we know as “convergence”. A code sketch of this loop follows below.
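
Here is a hedged sketch of that update loop in Python, assuming the simple linear-regression setting (f(x) = w·x + b with a mean-squared-error cost). The toy data and learning rate are my own choices, not from the notes.

    import numpy as np

    # Toy data generated from y = 2x + 1.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.0, 5.0, 7.0, 9.0])

    w, b = 0.0, 0.0   # initial parameters
    alpha = 0.01      # learning rate

    for _ in range(10_000):
        preds = w * x + b
        # Partial derivatives of J(w, b) = (1/2m) * sum((f(x) - y)^2)
        dJ_dw = np.mean((preds - y) * x)
        dJ_db = np.mean(preds - y)
        # Simultaneous update: step opposite the gradient
        w = w - alpha * dJ_dw
        b = b - alpha * dJ_db

    print(w, b)  # converges toward w ≈ 2, b ≈ 1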