Gradient Descent#

Orientation#

Where is Gradient Descent used in the Data Science Lifecycle?

06 PREDICTIVE MODELING:

  • select a ML algorithm

  • train the ML model

  • evaluate the performance

  • make predictions

Recap of Optimization of Linear Regression with OLS#



  • Actual relationship (unknown): y = f(x)

  • Model: \(\hat{y}=b_{0}+b_{1}\cdot x\)

Residuals: \(e_{i}=y_{i}-\hat{y}_{i}\)

OLS:
Minimize the sum of squared residuals

\(\sum_{i=1}^{n}e_{i}^{2}\)

results in Normal Equation:

\(b=\left(X^{T}X\right)^{-1}X^{T}y\)

→ closed form solution

Using OLS (normal equation), the best parameters were found at: 
            b₀ = 0.237 
            b₁ = 0.049
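As a sketch, the normal equation in NumPy (the data here is made up and does not reproduce b₀ = 0.237, b₁ = 0.049):

```python
import numpy as np

# Hypothetical example data (not the dataset behind the slide's b0, b1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x), x])

# Normal equation: b = (X^T X)^{-1} X^T y
b = np.linalg.inv(X.T @ X) @ (X.T @ y)
b0, b1 = b  # for this data: b0 ~ 0.05, b1 ~ 1.99
```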

Definition#

Gradient Descent is an iterative optimization algorithm to find the minimum of a function.

→ e.g. the minimum of a cost function

Note: Unlike the normal equation, Gradient Descent has no closed-form solution.

Find the parameters of a model by gradually minimizing your cost#

Go downhill in the direction of the steepest slope.

→ endpoint is dependent on the initial starting point

Gradient Descent in a nutshell#

  • Have some function \(J(b)\)

  • Want min \(J(b)\):

    • start with some random parameter \(b\)

    • keep changing \(b\) to reduce \(J(b)\) until we hopefully get to a minimum

When to use Gradient Descent?#

  • GD is a simple optimization procedure that can be used for many machine learning algorithms (e.g. linear regression, logistic regression, …)

  • used for “Backpropagation” in Neural Nets (variations of GD)

  • can be used for online learning (update as soon as new data comes in)

  • gives faster results for problems with many features (computationally simpler)

First Step: Define Model#


Example:
least squares with one feature


\[\hat{y}=b_0+b_1\cdot x\]

Second Step: Define Cost Function#

Mean Squared Error with a little bit of cosmetics: the extra 2 in the denominator cancels against the exponent when taking derivatives

\[ J(b_{0},b_{1})\;=\;\frac{1}{2n}\sum_{i=1}^{n}{(\hat{y}_{i}-y_{i})}^{2}\,\]


\[min(J(b_{0},b_{1}))\]
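The cost function can be written directly in NumPy (a sketch; function and variable names are my own):

```python
import numpy as np

def cost(b0, b1, x, y):
    """Half mean squared error: J(b0, b1) = 1/(2n) * sum((y_hat - y)^2)."""
    y_hat = b0 + b1 * x
    return np.sum((y_hat - y) ** 2) / (2 * len(x))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# A perfect fit (b0=0, b1=2) gives zero cost
print(cost(0.0, 2.0, x, y))  # 0.0
```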

Third Step: Initialize Gradient Descent#

Deliberately set random starting values for \(b\)

../_images/e00b0c18b44366673a792a2493157fc7bb57ccdba70490070555540db0a95c92.png

Fourth Step: Derivatives with respect to the parameters#



\[ J(b_{0},b_{1})\,=\,\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}\,=\,\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i} - b_{0}-b_{1}x_{i}\right)^{2}\]

The chain rule gives us:

\[\begin{split}\begin{align} \frac{\partial J}{\partial b_0} = \frac{\partial}{\partial b_0} \left( \frac{1}{2n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \right)= \frac{1}{2n} \sum_{i=1}^{n} 2(y_i - b_0 - b_1 x_i)(-1)= -\frac{1}{n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) \\\\ \frac{\partial J}{\partial b_1} = \frac{\partial}{\partial b_1} \left( \frac{1}{2n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \right) = \frac{1}{2n} \sum_{i=1}^{n} 2(y_i - b_0 - b_1 x_i)(-x_i) = -\frac{1}{n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\cdot x_i \end{align}\end{split}\]
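These two partial derivatives translate one-to-one into code (a sketch; names are my own):

```python
import numpy as np

def gradients(b0, b1, x, y):
    """Partial derivatives of J(b0, b1) with respect to b0 and b1."""
    n = len(x)
    residuals = y - (b0 + b1 * x)           # y_i - b0 - b1*x_i
    grad_b0 = -np.sum(residuals) / n        # dJ/db0
    grad_b1 = -np.sum(residuals * x) / n    # dJ/db1
    return grad_b0, grad_b1

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
g0, g1 = gradients(0.0, 2.0, x, y)  # at the perfect fit both gradients are 0
```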

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to your parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

\[\text{slope of }b_1\text{ at start point }=-6.8\]
../_images/302bfe990e03e8f50942082eab0089ff134c1e30bef7e3ed8eb8a55c42c0dac7.png

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to your parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

\[\begin{split}\begin{align} \alpha&=0.15 \\ slope&=-6.8 \end{align}\end{split}\]
../_images/302bfe990e03e8f50942082eab0089ff134c1e30bef7e3ed8eb8a55c42c0dac7.png

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

\[\begin{split}\begin{align} step&=\alpha \cdot slope \\ b_{1}^{new}&=b_{1}^{old} - step \\ b_{1}^{new}&=0.6-(0.15*(-6.8)) \\ b_{1}^{new}&=1.62 \\ \end{align}\end{split}\]
../_images/eac75f79d70facba7d5936f671241c52bd0dcd948549ffe1d7bcda2adec282ba.png
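The numbers from the slide can be checked directly (α and the slope are taken from the slide; the slope value itself comes from the example data, which is not shown):

```python
alpha = 0.15       # learning rate from the slide
slope = -6.8       # dJ/db1 at the starting point b1 = 0.6
b1_old = 0.6

step = alpha * slope        # -1.02
b1_new = b1_old - step      # 0.6 - (-1.02)
print(round(b1_new, 2))     # 1.62
```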

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to your parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

Repeat till there is no further improvement

\[\begin{split}\begin{align} step&=\alpha \cdot slope \\ b_{1}^{new}&=b_{1}^{old} - step \end{align}\end{split}\]
../_images/467ae471d5e087645acfd51b995e2ebf90c914e101921f98ecfbb752b5897b30.png

Gradient Descent summed up#


  1. Define model

  2. Define the cost function

  3. Deliberately set some starting values

  4. Start descent:

    • Take derivatives with respect to parameters

    • Set your learning rate (step-size)

    • Adjust your parameters (step)

  5. Repeat 4. till there is no further improvement
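The five steps above can be sketched as a short NumPy loop (a minimal sketch with made-up data; for simplicity it stops after a fixed number of iterations instead of checking for "no further improvement"):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iter=1000):
    b0, b1 = 0.0, 0.0                      # step 3: starting values
    for _ in range(n_iter):                # step 5: repeat
        residuals = y - (b0 + b1 * x)
        grad_b0 = -np.mean(residuals)      # step 4: derivatives
        grad_b1 = -np.mean(residuals * x)
        b0 -= alpha * grad_b0              # step 4: adjust parameters
        b1 -= alpha * grad_b1
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                          # true parameters: b0=1, b1=2
b0, b1 = gradient_descent(x, y)
```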

Learning Rate#

It is important to find a good value for the learning rate!



../_images/c3bc765c3f2010057f5bb36343e80105539872c7d726aafb3f23428b6ff8746e.png

Learning Rate#

It is important to find a good value for the learning rate!

Learning Rate too small → will take long to find optimum

Learning Rate too large → not able to find optimum

../_images/a7c954ea62c50f347b071e416dd6812454c86b3e6ddf409dd432c318ddb75d7e.png
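The effect can be seen on the toy cost J(b) = b², whose gradient is 2b (the α values are chosen purely for illustration):

```python
# Run gradient descent on J(b) = b^2 starting at b = 1.0
def descend(alpha, b=1.0, n_iter=50):
    for _ in range(n_iter):
        b = b - alpha * 2 * b   # update with gradient dJ/db = 2b
    return b

print(descend(0.001))  # too small: barely moved after 50 steps
print(descend(0.4))    # good: essentially at the minimum b = 0
print(descend(1.1))    # too large: overshoots and diverges
```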

Two main challenges of Gradient Descent#

The MSE cost function for linear regression is convex, so it has exactly one global minimum. This is not always the case. Possible problems are:

  • local minima

  • plateaus

Convex: If you pick two random points on the curve, the line segment connecting them never lies below the curve.

../_images/cb73b57e6259a5b3174510b57844884efd3a8d1d993d42d03150975e15637c3a.png

Unscaled Data#

Without feature scaling, Gradient Descent needs more time to find the minimum

../_images/def0a37a0091cb5105c5fd963cfc71bf3b89f66b97d045761aa8445b25ec353e.png
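A common remedy is to standardize each feature before running Gradient Descent (a sketch; the data is made up):

```python
import numpy as np

# Hypothetical feature on a much larger scale than the intercept column
x = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

# Standardize to zero mean and unit variance
x_scaled = (x - x.mean()) / x.std()

# After scaling, the gradients for b0 and b1 have comparable magnitudes,
# so a single learning rate works well for both parameters.
```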

Variants of Gradient Descent#

Which route to take downhill?

Batch gradient descent#

All instances are used to calculate the gradient (step).

The algorithm is faster than the Normal Equation when there are many features, but still slow on large training sets.

Stochastic gradient descent#

Only one random instance is used to calculate the gradient (step).

Much faster as there is little data to process at each update.

Can be used for very large data sets → online training possible

Cost decreases only on average and bounces back and forth. → the noise can help escape local minima

The parameters get close to the optimum but never settle exactly on it.
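A minimal SGD sketch for the same one-feature model (data, learning rate, and iteration count are made up; a fixed seed keeps the run reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(x, y, alpha=0.05, n_iter=5000):
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        i = rng.integers(len(x))            # one random instance per step
        residual = y[i] - (b0 + b1 * x[i])
        # b -= alpha * grad, with grad_b0 = -residual, grad_b1 = -residual*x_i
        b0 += alpha * residual
        b1 += alpha * residual * x[i]
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                           # true parameters: b0=1, b1=2
b0, b1 = sgd(x, y)
```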

Mini-Batch gradient descent#

Trains on small random subsets (mini-batches) instead of all instances or a single one.

Performs better than SGD as it exploits hardware optimization of matrix operations

Drawback: it is harder to escape local minima than with SGD
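A mini-batch version of the same loop (a sketch; batch size and the other values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def minibatch_gd(x, y, alpha=0.05, batch_size=2, n_iter=3000):
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Draw a small random subset of the instances
        idx = rng.choice(len(x), size=batch_size, replace=False)
        xb, yb = x[idx], y[idx]
        residuals = yb - (b0 + b1 * xb)
        b0 += alpha * np.mean(residuals)        # averaged over the batch
        b1 += alpha * np.mean(residuals * xb)
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                               # true parameters: b0=1, b1=2
b0, b1 = minibatch_gd(x, y)
```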

Gradient Descent Comparison#

../_images/35a447c15b18665ff18e5dfd41762de67449ebac7fd95a551aa406e1608b2f62.png
../_images/5d642e9adb9af14ba7be8b1006cbabf3c6ddf8bbf9c22fad966f350d0b024acb.png
