Gradient Descent#

Orientation#

Where is Gradient Descent used in the Data Science Lifecycle?

06 PREDICTIVE MODELING:

  • select an ML algorithm

  • train the ML model

  • evaluate the performance

  • make predictions

Recap of Optimization of Linear Regression with OLS#



  • Actual relationship (unknown): y = f(x)

  • Model: \(\hat{y}=b_{0}+b_{1}\cdot x\)

Residuals: \(e_{i}=y_{i}-\hat{y}_{i}\)

OLS:
Minimize the sum of squared residuals

\(\sum_{i=1}^{n}e_{i}^{2}\)

results in Normal Equation:

\(b=\left(X^{T}X\right)^{-1}X^{T}y\)

→ closed-form solution

Using OLS (normal equation), the best parameters were found at: 
            b₀ = 0.237 
            b₁ = 0.049
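The normal equation can be sketched in a few lines of NumPy. The data set behind the fitted values above is not shown here, so the example uses made-up data that lies exactly on that fitted line:

```python
import numpy as np

def ols_normal_equation(x, y):
    # Design matrix: a column of ones (for b0) next to the feature x
    X = np.column_stack([np.ones_like(x), x])
    # b = (X^T X)^{-1} X^T y -- solve() is numerically safer than inv()
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative, made-up data lying exactly on the fitted line
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.237 + 0.049 * x
b0, b1 = ols_normal_equation(x, y)
```

One matrix solve recovers both parameters in a single step, which is exactly what "closed-form" means here.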

Definition#

Gradient Descent is an iterative optimization algorithm to find the minimum of a function.

→ e.g. of a cost function

Note: Unlike the normal equation, Gradient Descent is not a closed-form solution; it approaches the minimum iteratively.

Find the parameters of a model by gradually minimizing your cost#

Go downhill in the direction of the steepest slope.

→ endpoint is dependent on the initial starting point

Gradient Descent in a nutshell#

  • Have some function \(J(b)\)

  • Want min \(J(b)\):

    • start with some random parameter \(b\)

    • keep changing \(b\) to reduce \(J(b)\) until we hopefully get to a minimum

When to use Gradient Descent?#

  • GD is a simple optimization procedure that can be used for many machine learning algorithms (e.g. linear regression, logistic regression, …)

  • used for “Backpropagation” in Neural Nets (variations of GD)

  • can be used for online learning (update as soon as new data comes in)

  • gives faster results for problems with many features (computationally simpler)

First Step: Define Model#


Example:
least squares with one feature


\[\hat{y}=b_0+b_1\cdot x\]

Second Step: Define Cost Function#

Mean Squared Error with a little bit of cosmetics (the 2 in the denominator cancels when taking derivatives)

\[ J(b_{0},b_{1})\;=\;\frac{1}{2n}\sum_{i=1}^{n}{(\hat{y}_{i}-y_{i})}^{2}\,\]


\[min(J(b_{0},b_{1}))\]
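As a minimal sketch, this cost function can be written directly from the formula:

```python
import numpy as np

def cost(b0, b1, x, y):
    # J(b0, b1) = 1/(2n) * sum((y_hat - y)^2)
    y_hat = b0 + b1 * x
    return np.sum((y_hat - y) ** 2) / (2 * len(x))

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])        # lies exactly on y = 2x
perfect = cost(0.0, 2.0, x, y)  # 0.0: this line fits the data exactly
```

The cost is zero only for a perfect fit; any other choice of \(b_0, b_1\) gives a positive value.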

Third Step: Initialize Gradient Descent#

Deliberately set random starting values for \(b\)


Fourth Step: Derivatives with respect to the parameters#



\[ J(b_{0},b_{1})\,=\,\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}\,=\,\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i} - b_{0}-b_{1}x_{i}\right)^{2}\]

The chain rule gives us:

\[\begin{split}\begin{align} \frac{\partial J}{\partial b_0} = \frac{\partial}{\partial b_0} \left( \frac{1}{2n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \right)= \frac{1}{2n} \sum_{i=1}^{n} 2(y_i - b_0 - b_1 x_i)(-1)= -\frac{1}{n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) \\\\ \frac{\partial J}{\partial b_1} = \frac{\partial}{\partial b_1} \left( \frac{1}{2n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \right) = \frac{1}{2n} \sum_{i=1}^{n} 2(y_i - b_0 - b_1 x_i)(-x_i) = -\frac{1}{n} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\cdot x_i \end{align}\end{split}\]
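The two partial derivatives translate directly into code; a sketch using NumPy:

```python
import numpy as np

def gradients(b0, b1, x, y):
    # The residuals r_i = y_i - b0 - b1*x_i appear in both derivatives
    r = y - (b0 + b1 * x)
    dJ_db0 = -np.mean(r)      # dJ/db0 = -(1/n) * sum(r_i)
    dJ_db1 = -np.mean(r * x)  # dJ/db1 = -(1/n) * sum(r_i * x_i)
    return dJ_db0, dJ_db1
```

At the minimum both derivatives are zero, which is why the updates stop changing the parameters there.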

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to your parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

  • Repeat until there is no further improvement

Example: at the starting point \(b_{1}=0.6\), the slope of the cost function with respect to \(b_{1}\) is \(-6.8\). With a learning rate of \(\alpha=0.15\), one step updates \(b_{1}\) as follows:

\[\begin{split}\begin{align} step&=\alpha \cdot slope \\ b_{1}^{new}&=b_{1}^{old} - step \\ b_{1}^{new}&=0.6-(0.15\cdot(-6.8)) \\ b_{1}^{new}&=1.62 \end{align}\end{split}\]
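The single update step can be verified in a few lines (the numbers are the ones from this example):

```python
alpha = 0.15            # learning rate
b1_old = 0.6            # current value of the parameter
slope = -6.8            # derivative of the cost at b1_old

step = alpha * slope    # a negative slope gives a negative step ...
b1_new = b1_old - step  # ... so subtracting it moves b1 uphill in value, downhill in cost
```

Note the sign logic: the slope is negative, so the parameter increases, moving toward the minimum.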

Gradient Descent summed up#


  1. Define model

  2. Define the cost function

  3. Deliberately set some starting values

  4. Start descent:

    • Take derivatives with respect to parameters

    • Set your learning rate (step-size)

    • Adjust your parameters (step)

  5. Repeat step 4 until there is no further improvement
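Putting the five steps together, a minimal batch Gradient Descent for the one-feature model might look like this (made-up example data; a fixed iteration count stands in for the "no further improvement" check, and alpha is chosen for illustration):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iter=2000):
    b0, b1 = 0.0, 0.0                 # 3. set starting values
    for _ in range(n_iter):           # 5. repeat ...
        r = y - (b0 + b1 * x)         # residuals of the current model
        b0 -= alpha * (-np.mean(r))   # 4. step along the negative gradient
        b1 -= alpha * (-np.mean(r * x))
    return b0, b1

# Made-up data lying exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
b0, b1 = gradient_descent(x, y)
```

With a noise-free line the iterates approach the true parameters \(b_0=1, b_1=2\) arbitrarily closely.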

Learning Rate#

It is important to find a good value for the learning rate!




Learning Rate too small → will take long to find optimum

Learning Rate too large → may overshoot the minimum and even diverge, never finding the optimum

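Both effects can be demonstrated on the simple convex function \(J(b)=b^{2}\) with derivative \(2b\); this toy example is not from the slides, just an illustration:

```python
def descend(alpha, b=1.0, n_iter=50):
    # Gradient descent on J(b) = b**2, whose derivative is 2*b
    for _ in range(n_iter):
        b -= alpha * 2 * b
    return b

small = descend(alpha=0.1)   # slowly shrinks toward the minimum at 0
large = descend(alpha=1.1)   # overshoots: |b| grows with every step
```

With \(\alpha=0.1\) each step multiplies \(b\) by 0.8 (slow convergence); with \(\alpha=1.1\) each step multiplies it by \(-1.2\), so the iterates bounce across the minimum with growing amplitude.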

Two main challenges of Gradient Descent#

The MSE cost function for linear regression is a convex function: it has just one global minimum. This is not always the case. Typical problems are:

  • local minima

  • plateaus

Convex: if you pick any two points on the curve, the straight line connecting them will not cross the curve.


Unscaled Data#

Without feature scaling, Gradient Descent needs more time to find the minimum, because the cost surface is stretched along the unscaled feature and the gradient does not point straight at the minimum.

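A common fix is standardization (subtract the mean, divide by the standard deviation), so every feature ends up on a comparable scale; a sketch:

```python
import numpy as np

x = np.array([100.0, 200.0, 300.0, 400.0])   # feature on a large scale
x_scaled = (x - x.mean()) / x.std()          # mean 0, standard deviation 1
```

On scaled features the cost contours become more circular, so the same learning rate works well in every direction.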

Variants of Gradient Descent#

Which route to take downhill?

Batch gradient descent#

All instances are used to calculate the gradient (step).

The algorithm is faster than the Normal Equation when there are many features, but still slow on large training sets.

Stochastic gradient descent#

Only one random instance is used to calculate the gradient (step).

Much faster as there is little data to process at each update.

Can be used for very large data sets → online training possible

Cost decreases only on average and bounces back and forth → this randomness can help escape local minima

Optimum will be close but never reached.
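A sketch of SGD for the one-feature model (made-up noise-free data; seed, learning rate, and epoch count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(x, y, alpha=0.05, n_epochs=1000):
    b0, b1 = 0.0, 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):  # one random instance per update
            r = y[i] - (b0 + b1 * x[i])
            b0 -= alpha * (-r)
            b1 -= alpha * (-r * x[i])
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x   # noise-free line, so SGD can settle almost exactly
b0, b1 = sgd(x, y)
```

With noisy real data the parameters would keep jittering around the optimum instead of settling, which is the "close but never reached" behavior described above.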

Mini-Batch gradient descent#

Trains on small random subsets (mini-batches) instead of the full data set or single instances.

Performs better than SGD because it exploits hardware-optimized matrix operations.

Drawback: compared to SGD, it is harder to escape local minima.
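Mini-Batch GD only changes which instances enter each update; a sketch (batch size and the other settings are illustrative, data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(x, y, alpha=0.05, batch_size=2, n_epochs=1000):
    b0, b1 = 0.0, 0.0
    for _ in range(n_epochs):
        idx = rng.permutation(len(x))            # shuffle once per epoch
        for start in range(0, len(x), batch_size):
            batch = idx[start:start + batch_size]  # small random subset
            r = y[batch] - (b0 + b1 * x[batch])
            b0 -= alpha * (-np.mean(r))
            b1 -= alpha * (-np.mean(r * x[batch]))
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
b0, b1 = minibatch_gd(x, y)
```

The per-update gradient averages over the mini-batch, so it is less noisy than SGD but still cheaper than a full batch pass.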

Gradient Descent Comparison#

In parameter space, Batch GD takes a smooth, direct path to the minimum; SGD takes an erratic path and keeps bouncing around near the optimum; Mini-Batch GD lies in between.

References#

Gradient Descent Step by Step