Gradient Descent#

Orientation#

Where is Gradient Descent used in the Data Science Lifecycle?

06 PREDICTIVE MODELING:

  • select an ML algorithm

  • train the ML model

  • evaluate the performance

  • make predictions

Recap of Optimization of Linear Regression with OLS#



  • Actual relationship (unknown): y = f(x)

  • Model: \(\hat{y}=b_{0}+b_{1}\cdot x\)

Residuals: \(e_{i}=y_{i}-\hat{y}_{i}\)

OLS:
Minimize the sum of squared residuals

\(\sum_{i=1}^{n}e_{i}^{2}\)

results in Normal Equation:

\(b=\left(X^{T}X\right)^{-1}X^{T}y\)

→ closed-form solution

Using OLS (normal equation), the best parameters were found at: 
            b₀ = 0.237 
            b₁ = 0.049
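The normal equation can be sketched in a few lines of NumPy. The data set behind the fitted values above is not shown here, so the example uses made-up data that lies exactly on that fitted line:

```python
import numpy as np

def ols_normal_equation(x, y):
    # Design matrix: a column of ones (for b0) next to the feature x
    X = np.column_stack([np.ones_like(x), x])
    # b = (X^T X)^{-1} X^T y -- solve() is numerically safer than inv()
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative, made-up data lying exactly on the fitted line
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.237 + 0.049 * x
b0, b1 = ols_normal_equation(x, y)
```

One matrix solve recovers both parameters in a single step, which is exactly what "closed-form" means here.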

Definition#

Gradient Descent is an iterative optimization algorithm to find the minimum of a function.

→ e.g. of a cost function

Note: Unlike the normal equation, Gradient Descent is not a closed-form solution; it approaches the minimum iteratively.

Find the parameters of a model by gradually minimizing your cost#

Go downhill in the direction of the steepest slope.

→ endpoint is dependent on the initial starting point

Gradient Descent in a nutshell#

  • Have some function \(J(b)\)

  • Want min \(J(b)\):

    • start with some random parameter \(b\)

    • keep changing \(b\) to reduce \(J(b)\) until we hopefully get to a minimum

When to use Gradient Descent?#

  • GD is a simple optimization procedure that can be used for many machine learning algorithms (e.g. linear regression, logistic regression, …)

  • used for “Backpropagation” in Neural Nets (variations of GD)

  • can be used for online learning (update as soon as new data comes in)

  • gives faster results for problems with many features (computationally simpler)

First Step: Define Model#


Example:
least squares with one feature


\[\hat{y}=b_0+b_1\cdot x\]

Second Step: Define Cost Function#

Mean Squared Error with a little bit of cosmetics (the 2 in the denominator cancels when taking derivatives)

\[ J(b_{0},b_{1})\;=\;\frac{1}{2n}\sum_{i=1}^{n}{(\hat{y}_{i}-y_{i})}^{2}\,\]


\[min(J(b_{0},b_{1}))\]
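As a minimal sketch, this cost function can be written directly from the formula:

```python
import numpy as np

def cost(b0, b1, x, y):
    # J(b0, b1) = 1/(2n) * sum((y_hat - y)^2)
    y_hat = b0 + b1 * x
    return np.sum((y_hat - y) ** 2) / (2 * len(x))

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])        # lies exactly on y = 2x
perfect = cost(0.0, 2.0, x, y)  # 0.0: this line fits the data exactly
```

The cost is zero only for a perfect fit; any other choice of \(b_0, b_1\) gives a positive value.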

Third Step: Initialize Gradient Descent#

Deliberately set random starting values for \(b\)


Fourth Step: Derivatives with respect to the parameters#



\[ J(b_{0},b_{1})\,=\,\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}\,=\,\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i} - b_{0}-b_{1}x_{i}\right)^{2}\]

The chain rule gives us:

\[\begin{split}\begin{align} \frac{\partial J}{\partial b_0} = \frac{\partial}{\partial b_0} \left( \frac{1}{2n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \right)= \frac{1}{2n} \sum_{i=1}^{n} 2(y_i - b_0 - b_1 x_i)(-1)= -\frac{1}{n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) \\\\ \frac{\partial J}{\partial b_1} = \frac{\partial}{\partial b_1} \left( \frac{1}{2n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \right) = \frac{1}{2n} \sum_{i=1}^{n} 2(y_i - b_0 - b_1 x_i)(-x_i) = -\frac{1}{n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\cdot x_i \end{align}\end{split}\]
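The two partial derivatives translate directly into code; a sketch using NumPy:

```python
import numpy as np

def gradients(b0, b1, x, y):
    # The residuals r_i = y_i - b0 - b1*x_i appear in both derivatives
    r = y - (b0 + b1 * x)
    dJ_db0 = -np.mean(r)      # dJ/db0 = -(1/n) * sum(r_i)
    dJ_db1 = -np.mean(r * x)  # dJ/db1 = -(1/n) * sum(r_i * x_i)
    return dJ_db0, dJ_db1
```

At the minimum both derivatives are zero, which is why the updates stop changing the parameters there.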

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to your parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

  • Repeat until there is no further improvement

Example: at the starting point \(b_{1}=0.6\), the slope of the cost function with respect to \(b_{1}\) is \(-6.8\). With a learning rate of \(\alpha=0.15\), one step updates \(b_{1}\) as follows:

\[\begin{split}\begin{align} step&=\alpha \cdot slope \\ b_{1}^{new}&=b_{1}^{old} - step \\ b_{1}^{new}&=0.6-(0.15\cdot(-6.8)) \\ b_{1}^{new}&=1.62 \end{align}\end{split}\]
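The single update step can be verified in a few lines (the numbers are the ones from this example):

```python
alpha = 0.15            # learning rate
b1_old = 0.6            # current value of the parameter
slope = -6.8            # derivative of the cost at b1_old

step = alpha * slope    # a negative slope gives a negative step ...
b1_new = b1_old - step  # ... so subtracting it moves b1 uphill in value, downhill in cost
```

Note the sign logic: the slope is negative, so the parameter increases, moving toward the minimum.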

Gradient Descent summed up#


  1. Define model

  2. Define the cost function

  3. Deliberately set some starting values

  4. Start descent:

    • Take derivatives with respect to parameters

    • Set your learning rate (step-size)

    • Adjust your parameters (step)

  5. Repeat step 4 until there is no further improvement
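Putting the five steps together, a minimal batch Gradient Descent for the one-feature model might look like this (made-up example data; a fixed iteration count stands in for the "no further improvement" check, and alpha is chosen for illustration):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iter=2000):
    b0, b1 = 0.0, 0.0                 # 3. set starting values
    for _ in range(n_iter):           # 5. repeat ...
        r = y - (b0 + b1 * x)         # residuals of the current model
        b0 -= alpha * (-np.mean(r))   # 4. step along the negative gradient
        b1 -= alpha * (-np.mean(r * x))
    return b0, b1

# Made-up data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
b0, b1 = gradient_descent(x, y)
```

With a noise-free line the iterates approach the true parameters \(b_0=1, b_1=2\) arbitrarily closely.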

Learning Rate#

It is important to find a good value for the learning rate!




Learning Rate too small → will take long to find optimum

Learning Rate too large → may overshoot the minimum and even diverge, never finding the optimum

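Both effects can be demonstrated on the simple convex function \(J(b)=b^{2}\) with derivative \(2b\); this toy example is not from the slides, just an illustration:

```python
def descend(alpha, b=1.0, n_iter=50):
    # Gradient descent on J(b) = b**2, whose derivative is 2*b
    for _ in range(n_iter):
        b -= alpha * 2 * b
    return b

small = descend(alpha=0.1)   # slowly shrinks toward the minimum at 0
large = descend(alpha=1.1)   # overshoots: |b| grows with every step
```

With \(\alpha=0.1\) each step multiplies \(b\) by 0.8 (slow convergence); with \(\alpha=1.1\) each step multiplies it by \(-1.2\), so the iterates bounce across the minimum with growing amplitude.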

Two main challenges of Gradient Descent#

The MSE cost function for linear regression is a convex function: it has just one global minimum. This is not always the case. Typical problems are:

  • local minima

  • plateaus

Convex: if you pick any two points on the curve, the straight line connecting them will not cross the curve.


Unscaled Data#

Without feature scaling, Gradient Descent needs more time to find the minimum, because the cost surface is stretched along the unscaled feature and the gradient does not point straight at the minimum.

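A common fix is standardization (subtract the mean, divide by the standard deviation), so every feature ends up on a comparable scale; a sketch:

```python
import numpy as np

x = np.array([100.0, 200.0, 300.0, 400.0])   # feature on a large scale
x_scaled = (x - x.mean()) / x.std()          # mean 0, standard deviation 1
```

On scaled features the cost contours become more circular, so the same learning rate works well in every direction.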

Variants of Gradient Descent#

Which route to take downhill?

Batch gradient descent#

All instances are used to calculate the gradient (step).

The algorithm is faster than the Normal Equation when there are many features, but still slow on large training sets.

Stochastic gradient descent#

Only one random instance is used to calculate the gradient (step).

Much faster as there is little data to process at each update.

Can be used for very large data sets → online training possible

Cost decreases only on average and bounces back and forth → this randomness can help escape local minima

Optimum will be close but never reached.
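A sketch of SGD for the one-feature model (made-up noise-free data; seed, learning rate, and epoch count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(x, y, alpha=0.05, n_epochs=1000):
    b0, b1 = 0.0, 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):  # one random instance per update
            r = y[i] - (b0 + b1 * x[i])
            b0 -= alpha * (-r)
            b1 -= alpha * (-r * x[i])
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x   # noise-free line, so SGD can settle almost exactly
b0, b1 = sgd(x, y)
```

With noisy real data the parameters would keep jittering around the optimum instead of settling, which is the "close but never reached" behavior described above.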

Mini-Batch gradient descent#

Trains on small random subsets (mini-batches) instead of the full data set or single instances.

Performs better than SGD because it exploits hardware-optimized matrix operations.

Drawback: compared to SGD, it is harder to escape local minima.
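Mini-Batch GD only changes which instances enter each update; a sketch (batch size and the other settings are illustrative, data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(x, y, alpha=0.05, batch_size=2, n_epochs=1000):
    b0, b1 = 0.0, 0.0
    for _ in range(n_epochs):
        idx = rng.permutation(len(x))            # shuffle once per epoch
        for start in range(0, len(x), batch_size):
            batch = idx[start:start + batch_size]  # small random subset
            r = y[batch] - (b0 + b1 * x[batch])
            b0 -= alpha * (-np.mean(r))
            b1 -= alpha * (-np.mean(r * x[batch]))
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
b0, b1 = minibatch_gd(x, y)
```

The per-update gradient averages over the mini-batch, so it is less noisy than SGD but still cheaper than a full batch pass.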

Gradient Descent Comparison#

In parameter space, Batch GD takes a smooth, direct path to the minimum; SGD takes an erratic path and keeps bouncing around near the optimum; Mini-Batch GD lies in between.

References#

Gradient Descent Step by Step