Linear Regression#

Motivation#

Goals of Linear regression#

  • I own a house in King County!

  • It has 3 bedrooms, 2 bathrooms, a nice 10,000 sqft lot and is only 10 km away from Bill Gates's mansion!

  • If only I had a way of estimating what it is worth…

If only I had a way of estimating what it is worth…#

  • Use training data to …

  • … find a similar house …

  • … and use its value for estimation.

  • Use training data to …

  • … find a general rule that …

  • … can be used for estimation.

I should train a Regression Model!#

        216,645 $ base price
    
    +    20,033 $ for each bedroom
    
    +   234,314 $ for each bathroom
    
    +         1 $ for each sqft of lot
    
    -    14,745 $ for each km of distance from Bill Gates's mansion
    
    =       xyz $ estimated house price
    
Note:
The term regression (e.g. regression analysis) usually refers to linear regression.
(Don't confuse it with logistic regression.)
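
Plugging the house from the motivation (3 bedrooms, 2 bathrooms, a 10,000 sqft lot, 10 km from the mansion) into the coefficients listed above, a minimal Python sketch of the calculation could look like this (the coefficient values are the illustrative ones shown above):

```python
# Illustrative coefficients from the table above
base_price = 216_645

price = (base_price
         + 20_033 * 3        # 3 bedrooms
         + 234_314 * 2       # 2 bathrooms
         + 1 * 10_000        # 10,000 sqft lot
         - 14_745 * 10)      # 10 km from the mansion

print(f"Estimated house price: {price:,} $")   # 607,922 $
```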

    Building a model#

Descriptive statistics: using LR for explanation (profiling)

→ Why is my house worth xyz?

→ How can I increase the price?

Inferential statistics: using LR to make predictions

→ How much is my house worth?

Note: There are two branches of statistics:

• descriptive (→ EDA) and

• inferential (→ ML)

    Linear Equation#


    Q: What is the equation of the line?#

    ../_images/c15c7e96eac3d6c886eedf33a9d6064249b480bbe2a1932145a1d7cc43dd8be7.png

    Linear Equation#

    ../_images/e6a922b2999939a2120d6951ea93e4825ab9dc1846ffa7cd474f173cc5b68776.png
Key terms:
• Intercept (\(b_0\), the value of y when x = 0)

• Slope (\(b_1\), the weight of x)

    Linear Regression#


    • Is the variable X associated with a variable y, and if so,

    • what is the relationship and can we use it to predict y?

    Note:
    • Correlation — measures the strength of the relationship → a number

    • Regression — quantifies the nature of the relationship → an equation

    What about more than 2 points?#

    Let’s look at an example#

    Two correlated variables:

    • week of bootcamp, \(x\)

    • coffee consumption, \(y\)

    • \(r=0.9\)

    \( y=b_{0}+b_{1}\cdot x+e\)

    → Find \( b_0 \) and \( b_1 \) !

    ../_images/874fe1bea9849dca1e35224a26a057c97398a4d373f3c1ed610abe8b2423a7c6.png

    Trying out some lines. Which one is better?#

    • Grey: \( \hat{y} = 0.35 + 0.5 \cdot x \)

    • Blue: \( \hat{y} = 1.65 + 0.3 \cdot x \)

    Note:
    ^ - the “hat” notation means the value is estimated (the lines) as opposed to a known value (the dots).
    The estimate has uncertainty (!) whereas the true value is fixed.
    ../_images/d5a8242b71d3158a755a2bb82fa18e23cf0e47816b83747c4efb292f66b079df.png

    How do we know which line is better?#

    Residuals#

    \(e_i = y_i - \hat{y}_i\)

    which means:

    \(y_i = b_0 + b_1 \cdot x_i + e_i\)

    ../_images/64449a006c695ae586ef0e6a65f19dc21b9fb262b10363fa82e33ab210a34882.png
    ../_images/627152f6c1ef2b4bd62d3ea445d48b60cba07b53e10ac8fca1620e1be73e7690.png

    Least squares criterion#

    By comparing the sum of squared residuals (SSR) we can find out which one is better:

\[ J(b_{0},b_{1}) = \sum_{i=1}^{n} e_{i}^{2} = \sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2} = \sum_{i=1}^{n}(y_{i} - (b_{0} + b_{1}x_{i}))^{2}\]
    \[n \text{ is the number of observations}\]
    ../_images/3e6d4fad61dc1450143abd4b860cf4ba4ca011012954f258e24742c7517c7607.png
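
As a sketch of how this comparison could be done in Python, using made-up (x, y) points as a stand-in for the coffee data shown in the plots:

```python
import numpy as np

# Made-up stand-in data (the real coffee-consumption points live in the plots above)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.9, 1.4, 1.7, 2.4, 2.6, 3.2, 3.3, 4.1])

def ssr(b0, b1, x, y):
    """Sum of squared residuals J(b0, b1) for the line y_hat = b0 + b1 * x."""
    y_hat = b0 + b1 * x
    return np.sum((y - y_hat) ** 2)

# The two candidate lines from the earlier slide
print("grey:", ssr(0.35, 0.5, x, y))
print("blue:", ssr(1.65, 0.3, x, y))
# The line with the smaller SSR fits these points better.
```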

    Trying out several fitted lines#

    By comparing the sum of squared residuals (SSR) we can find out which one is better:

\[ J(b_{0},b_{1}) = \sum_{i=1}^{n} e_{i}^{2} = \sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2} = \sum_{i=1}^{n}(y_{i} - (b_{0} + b_{1}x_{i}))^{2}\]
    ../_images/b55cd79ac93a00d1bbb6c4413fc355e165744e097b048783a2902c8b216b2c7e.png

    BUT THERE CAN BE AN INFINITE NUMBER OF LINES!#

    So how do we do this?#

    • Obviously doing it manually is not really scalable

• We minimize the OLS cost function \(J(b_0, b_1)\) with respect to \(b_0\) and \(b_1\)!

    • OLS - Ordinary Least Squares

\[ J(b_{0},b_{1}) = \sum_{i=1}^{n}\left(y_{i} - b_{0} - b_{1}x_{i}\right)^{2}\]

    Ordinary least squares regression#

\[ \min_{b_0,\,b_1} J(b_{0},b_{1}) = \sum_{i=1}^{n}(y_{i} - b_{0} - b_{1}x_{i})^{2}\]
\[\begin{split}\begin{align} \frac{\partial J}{\partial b_{0}} &= -2\,\Sigma(y_{i}-b_{0}-b_{1}x_{i}) = 0 \\ \frac{\partial J}{\partial b_{1}} &= -2\,\Sigma x_i(y_{i}-b_{0}-b_{1}x_{i}) = 0 \end{align}\end{split}\]

    we divide the first equation by 2n:

\[\begin{split}\begin{align} -(\bar{y} - b_{0} - b_{1}\bar{x}) &= 0 \\ b_{0} &= \bar{y} - b_{1}\bar{x} \end{align}\end{split}\]

    … more math leads to:

\[ b_1 = \frac{\Sigma(y_i - \bar{y})(x_i - \bar{x})}{\Sigma(x_{i} - \bar{x})^{2}} \]
Note: the symbol \(\partial\) denotes a (first-order) partial derivative.
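
A minimal sketch of the closed-form solution in Python, on made-up data (np.polyfit serves only as a cross-check):

```python
import numpy as np

# Made-up example data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.9, 1.4, 1.7, 2.4, 2.6, 3.2, 3.3, 4.1])

# Closed-form OLS estimates from the formulas above
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)

# Cross-check: np.polyfit returns [slope, intercept] for deg=1
print(np.polyfit(x, y, deg=1))
```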

    Fun facts about residuals#

    \[\begin{split}\begin{align} y_i &= b_0 + b_1 \cdot x_i + e_i \\ e_i &= y_i -b_0 -b_1 \cdot x_i \\ b_0 &= \bar{y} - b_1 \cdot \bar{x} \end{align}\end{split}\]
    Which leads to the following conclusions:
    \[\begin{split}\begin{align} \Sigma e_i &= 0 \\ \Sigma(x_i - \bar{x}) \cdot e_i &= 0 \end{align}\end{split}\]
    Note:
• The second equation means the error/residual is uncorrelated with the explanatory variable

• Feel free to verify this for your own models (see the sketch below)
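
A quick numerical check of both properties, again on made-up data (the coefficients are computed with the closed-form formulas from above):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.9, 1.4, 1.7, 2.4, 2.6, 3.2, 3.3, 4.1])

# OLS fit via the closed-form formulas
b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)                              # residuals
print(np.isclose(e.sum(), 0))                      # sum of residuals is 0
print(np.isclose(((x - x.mean()) * e).sum(), 0))   # residuals uncorrelated with x
```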

    Model performance#

    Mean \(\bar{y}=\frac{1}{n} \sum\limits_{i=1}^{n}y_{i}\)#

    Variance \(\sigma^2 = \frac{1}{n-1}\sum_i{(y_i-\bar{y})^2}\)#

Sums of squares (variance analysis)#


SST = total sum of squares: \(\Sigma(y_i - \bar{y})^2\)
SSE = explained sum of squares: \(\Sigma(\hat{y}_i - \bar{y})^2\)
SSR = residual (remaining) sum of squares: \(\Sigma(y_i - \hat{y}_i)^2\)

SST = SSE + SSR

Note: Being scale dependent means these quantities are expressed in the units of y (cents, km, metres, dollars, depending on your problem), so you always need to keep the scale of y in mind to put them into perspective.

\(R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\), with \(0 \leq R^2 \leq 1\)

Minimizing the least squares criterion ~ maximizing \(R^2\)

For simple linear regression, \(R^2 = r^2\), the square of the Pearson correlation coefficient.

Beware the naming: residual and regression start with the same letter, as do explained and error, so the meanings of SSR and SSE differ between sources.
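
A small sketch on made-up data showing the decomposition SST = SSE + SSR and that \(R^2 = r^2\) holds for simple linear regression:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0.9, 1.4, 1.7, 2.4, 2.6, 3.2, 3.3, 4.1])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
sse = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
ssr = np.sum((y - y_hat) ** 2)          # residual ("remaining") sum of squares

print(np.isclose(sst, sse + ssr))       # SST = SSE + SSR
r2 = 1 - ssr / sst
print(r2, np.corrcoef(x, y)[0, 1] ** 2) # R² equals r² for simple LR
```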

    Root mean squared error#

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}} \]
    Note:
The objective when fitting the linear regression line is to minimize the sum of squared residuals (which, divided by n, is the MSE); this also minimizes the RMSE.
    RMSE is also used to compare performance of other ML models.
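
A minimal RMSE helper as a sketch (the function name rmse is just an illustrative choice; scikit-learn offers comparable metrics):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))   # ≈ 0.612
```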

    Key Terms#

    Key terms: Machine learning#

    Variables:#

    • Target (dependent variable, prediction, response, y)

    • Feature (independent variable, explanatory/predictive variable, attribute, X)

    • Observation (row, instance, example, data point)

    Model:#

    • Fitted values (predicted values) - denoted with the hat notation ŷ

    • Residuals (errors, e) - difference between reality and model

    • Least squares (method for fitting a regression)

    • Coefficients, weights (here: slope, intercept)

But I want to use more than one feature!#

    Multiple LR#

    Multiple regression#

    So far we’ve dealt with only one feature. Every single observation follows this expression: \(y = b_0 + b_1x + e\).
    Most of the time you will have many features.
    A more general expression for all observations at once and any number of features translates to: \( y=X b+e \)

    where y, b, e are vectors, and X is a matrix.

\[ y = \begin{bmatrix} y_{1}\\ \vdots \\ y_{n} \end{bmatrix} \hspace{1cm} X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1m}\\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{bmatrix} \hspace{1cm} b = \begin{bmatrix} b_0\\ b_1\\ \vdots \\ b_m \end{bmatrix} \hspace{1cm} e = \begin{bmatrix} e_{1}\\ \vdots \\ e_{n} \end{bmatrix} \]

    n= number of observations, m= number of features

    \(x_{obs,feature} = x_{row,col} = x_{n,m}\)

    \(y\) and \(X\) are known: They are the real data of all the observations.
    \(b\) and \(e\) are not (yet) known.

Note:
The term multiple refers to having several independent variables. A model which predicts several dependent variables at once is called a multivariate linear regression model.
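
A sketch of how \(y\) and \(X\) could be assembled with NumPy (the numbers are made up); note the extra column of ones that corresponds to the intercept \(b_0\):

```python
import numpy as np

# Made-up feature matrix: n = 4 observations, m = 2 features, plus the target y
features = np.array([[3.0, 2.0],
                     [4.0, 1.0],
                     [2.0, 2.5],
                     [5.0, 3.0]])
y = np.array([400.0, 350.0, 330.0, 560.0])

n, m = features.shape
X = np.column_stack([np.ones(n), features])  # prepend the column of ones for b0
print(X.shape)   # (n, m + 1) -> (4, 3)
print(y.shape)   # (n,)       -> (4,)
```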

    Normal Equation#

The optimal values for \(b\) (\(b_0\), \(b_1\), \(...\), \(b_m\)) are usually calculated numerically using the gradient descent method (the topic of another lecture). However, they can also be determined analytically with the so-called normal equation:

\[b = (X^TX)^{-1}X^Ty\]
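
A minimal sketch of the normal equation with NumPy, reusing the made-up design matrix from the previous sketch; solving the linear system (or using a least-squares solver) is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Made-up design matrix (with intercept column) and target from the previous sketch
X = np.column_stack([np.ones(4),
                     np.array([[3.0, 2.0], [4.0, 1.0], [2.0, 2.5], [5.0, 3.0]])])
y = np.array([400.0, 350.0, 330.0, 560.0])

# Normal equation b = (X^T X)^{-1} X^T y, solved as a linear system
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)                                   # [b0, b1, b2]

# Equivalent, and more robust for ill-conditioned X: a least-squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b, b_lstsq))
```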

    Predictions#

Now that \(b\) is known (determined with whichever method), we can use it to make predictions:

\[ \hat{y} = b_0 + b_1x_1 + ... + b_mx_m \]

    Or in matrix notation:

    \[\hat{y} = Xb\]

    \(X\) needs to be in the same format as the one used above but can have a different number of rows (e.g. only 1).
    The error term \(e\) is unfortunately not known but has been minimized.
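
A sketch of a prediction for one new observation, assuming some already-fitted coefficient vector \(b\) (the values below are made up):

```python
import numpy as np

b = np.array([50.0, 60.0, 70.0])      # [b0, b1, b2], e.g. from the normal equation

# One new observation with m = 2 features, laid out exactly like a row of X
x_new = np.array([[1.0, 3.5, 2.0]])   # leading 1 for the intercept
y_hat = x_new @ b
print(y_hat)                          # predicted target for the new row
```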

    Multiple regression — evaluation metrics (special considerations)#

    Root mean squared error (RMSE)#

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}} \]

Most widely used metric to compare models (also beyond multiple regression). It is a measure of the goodness of a model.
Disadvantage: using many (unnecessary) features leads to a (supposedly) good fit.

    Adjusted R squared (\(adj. R^2\)):#

\[ adj.\;R^{2} = 1 - \left(1 - R^{2}\right)\frac{n-1}{n-p-1}\]

with n: sample size; p: number of explanatory variables of the model

Modified version of \(R^2\) that takes into account how many independent variables are used in the model.
Method: it penalizes the use of many features.

Note: RMSE is also used to compare performance against models trained with other ML techniques.
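
A small helper for the formula above (the name adjusted_r2 is just illustrative); it shows how the same \(R^2\) is penalized more as the number of explanatory variables \(p\) grows:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R²: penalizes explanatory variables that add little."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same R² looks less impressive once many features are involved
print(adjusted_r2(0.90, n=50, p=2))    # ≈ 0.896
print(adjusted_r2(0.90, n=50, p=20))   # ≈ 0.831
```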

    Overview of LR terms#

• r: Pearson correlation coefficient — in the interval [-1, 1]

    • R²: Coefficient of determination — How much of the variance in the dependent variable is explained by the model

    • MSE: mean squared error — SSR divided by sample size

    • RMSE: root mean squared error — Metric to judge the goodness of a model; root of MSE

    • SST, SSE, SSR: Sum of squares: total, explained, remaining

    • \(\sigma^2\): Variance of a variable — Measure of dispersion from the mean of a set

    Note:
    Many of these terms are similar in meaning:
* Some are taken with respect to a set's mean, others with respect to a model's predictions
* Some are divided by the sample size (or by the number of degrees of freedom), others are not
* Some take the square root, others do not
    * In any case: SSR ↓ means RMSE ↓ means R² ↑ and vice versa

    References#

    There are also a lot of detailed explanations in the exercise repos.

    Practical Statistics for Data Science - Peter Bruce & Andrew Bruce

    Econometric Methods with Applications in Business and Economics - Christiaan Heij, Paul de Boer, Philip Hans Franses, Teun Kloek, Herman K. van Dijk

    Written explanation of LR

    Difference between \(R^2\) and the adjusted version

Always welcome: Explanation of LR by StatQuest

    Some more math for those who want to know how to calculate \(b_1\)#

    \(-2 \Sigma x_i(y_i - b_0 - b_1x_i) = 0 \)

    we know that \(b_0 = \bar{y} - b_1 \bar{x}\)

    \(-2 \Sigma x_i(y_i - \bar{y} + b_1 \bar{x} - b_1x_i) = 0 \)

    \(\Sigma(x_iy_i - x_i \bar{y} + b_1(x_i \bar{x} - x_ix_i)) = 0\)

    \(\Sigma(x_iy_i - 2x_i \bar{y} + \bar{x}\bar{y} + b_1(-\bar{x}\bar{x} + 2x_i \bar{x} - x_ix_i)) = 0 \hspace{1cm}|\hspace{1cm} \Sigma x_i = \Sigma \bar{x} \)

    \(\Sigma(x_iy_i - x_i \bar{y} - \bar{x}y_i + \bar{x}\bar{y} + b_1(-\bar{x}\bar{x} + 2x_i \bar{x} - x_ix_i)) = 0 \hspace{1cm}|\hspace{1cm} \Sigma x_i \bar{y}= \Sigma \bar{x}y_i = n \bar{x}\bar{y} \)

    \(\Sigma(y_i - \bar{y})(x_i - \bar{x}) - b_1 \Sigma(x_i - \bar{x})^2 = 0 \)

    \(b_1 = \frac{\Sigma(y_i - \bar{y})(x_i - \bar{x})}{\Sigma(x_i - \bar{x})^2} \)