Logistic Regression#

Recap!#

Confusion Matrix#

Q1: Which cells show the correct predictions

Q2: Now where do we put… TN TP FN FP?

    Predicted
    Negatives Positives
Actual Negatives
Positives

Confusion Matrix#

Q1: Which cells show the correct predictions

Q2: Now where do we put… TN TP FN FP?

    Predicted
    Negatives Positives
Actual Negatives TN FP
Positives FN TP

Classification metrics#

Q3: Which classification metrics do you remember?

Metric

Value

Accuracy

\({\frac{130}{200}}=0.65\)

Precision

\({\frac{70}{110}}=0.64\)

Recall

\({\frac{70}{100}}=0.7\)

F1-Score

\(2\cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}\)

ROC AUC

                                                                

Classification metrics#

Q3: Which classification metrics do you remember?

Metric

Value

Accuracy

\({\frac{130}{200}}=0.65\)

Precision

\({\frac{70}{110}}=0.64\)

Recall

\({\frac{70}{100}}=0.7\)

F1-Score

\(2\cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}\)

ROC AUC

                                                                

Q4: What are the values?

    Predicted
    Negatives Positives
Actual Negatives 60 40
Positives 30 70

Accuracy - Rate of corrrect predictions - (TP+TN) /(TP+TN+FP+FN) = # correct predictions / # all predictions
Precision - How precise does a model predict positives - TP / (TP+FP) = # correct positives / # all positives
Recall/Sensitivity/TPRate - How many % of the patients will be recalled / How sensitive is the model - TP / (TP+FN) = # correct positives / # actual positive
F1-Score - Harmonic mean between Precision and Recall - 2x (P*R) / (P+R)
ROC AUC - area under ROC curve; parameter: threshold

Classification metrics#

Q3: Which classification metrics do you remember?

Metric

Value

Accuracy

\({\frac{130}{200}}=0.65\)

Precision

\({\frac{70}{110}}=0.64\)

Recall

\({\frac{70}{100}}=0.7\)

F1-Score

\(2\cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}\)

ROC AUC

                                                                

Q4: What are the values?

    Predicted
    Negatives Positives
Actual Negatives 60 40
Positives 30 70

ROC Curve#

Q5: What is the threshold?

Q6: When would you change it?

Improving the recall, TP / (TP+FN) = missing no patient, by altering the threshold goes along with:
a worsening (increase) in false alarms, called false positive rate, because there will be more FP.

Logistic Regression#

Classification#

Some examples for classification

  • Email: spam vs not spam

  • Tumor classification: malignant vs benign

  • Bee image: healthy vs not healthy

  • Sentiment analysis: happy vs sad

  • Text analysis: toxic vs not toxic

  • Face recognition: iphone owner or not

  • Animal detection: cat vs dog vs monkey vs… (multi class classification)

In (binary)-classification we have two possible classes:

\[y\in\{0,1\}\]
  • Class 0: the negative class (not spam, malignant, not cat)

  • Class 1: the positive class (spam, benign, cat)

Why classification?#

../_images/6f0e20bb0b0bda31a0e881e1534a3227f9fb1c326833f8e1a4626de52944223f.png

Why do we need another (classification) algorithm if we have linear regression?

Why classification?#

\[h_{b}(x)=b^{T}x\]
\[\text{(or in 1d without matrix notation: }h_b(x) = b_0 + b_1\cdot x)\]
../_images/6bc70c00c989f9e1acccafb8a908aae8ad2d82da805c4f7892ddaa682605c376.png

Why classification?#

if \(h_{b}(x)\geq0.5\) then predict “y = 1” (→ sad)

if \(h_{b}(x)<0.5\) then predict “y = 0” (→ not sad)

\[h_{b}(x)=b^{T}x\]
../_images/5fd1b18f1712aef65510f516c96912a6c92ac14d9fcbfaa60c218e7328c3c82a.png

Why classification?#

../_images/bb00b147a980a5935d7b17e5a184706742fc129a7fd59ddf810a6c9189402d56.png

Why Classification?#

  • applying linear regression to a classification problem is usually not a great idea

  • first time we got lucky and we got a hypothesis that worked well for the particular example

../_images/59e295ba17887683bae3e405db306a9b83ad4ed339df2c862d5287e246230f92.png

Another problem with classification with linear regression#

  • y can be 1 or 0 (or any other labels) \(y\in\{0,1\}\)

  • but the hypothesis \(h_{b}(x)\) can be larger than 1 or smaller than 0 even when all the training data was with 0 and 1

\[\begin{split}\begin{align} h_{b}(x)&=b^{T}x\\[8pt] h_{b}(x)&\in\mathbb{R} \end{align}\end{split}\]

Logistic Regression Model#

Classification hypothesis:

\[0\leq h_{b}(x) \leq 1\]

Linear regression hypothesis:

\[h_{b}(x)=b^{T}x\]

Logistic regression hypothesis:

\[h_{b}(x)=\sigma(b^{T}x)\]

Sigmoid function aka. logistic function:

\[\sigma(t)=\frac{1}{1+e^{-t}}\]

The Logistic function#

Lets discuss some properties of the sigmoid function:

  • Sigmoid function makes values fit into (0,1)

  • The sigmoid asymptotes to 0 when t goes to minus infinity

  • The sigmoid asymptotes to 1 when t goes to infinity

  • All its values lie within (0,1)

\[\begin{split}\begin{align} \sigma(t)&=\frac{1}{1+e^{-t}}\\[4pt] \operatorname*{lim}_{t\rightarrow-\infty}\sigma(t)&=0\\[4pt] \operatorname*{lim}_{t\rightarrow\infty}\sigma(t)&=1 \\[4pt] \sigma(t)&\in\{0,1\}\\ \end{align}\end{split}\]
../_images/ebb3c43a1a85fc0f5197cc1d2e2aa9cf4f512bb13ba01eeb7d9d905da04d9a8a.png

From logistic function to logistic regression#

  • Sigmoid function makes values fit into (0,1)

  • we then need to pick the parameters b

    • t is equivalent to \(b^Tx\)

\[\begin{split}\begin{align} \sigma(t)&=\frac{1}{1+e^{-t}}\\[8pt] h_{b}(x)&=\frac{1}{1+e^{-b^Tx}} \end{align}\end{split}\]
../_images/ebb3c43a1a85fc0f5197cc1d2e2aa9cf4f512bb13ba01eeb7d9d905da04d9a8a.png

Explore the sigmoid function#

\[\sigma(x)=\frac{1}{1+e^{-(b_0 + b_1 \cdot x)}}\]

Please take this code and evaluate the infuence of \(b_0\) and \(b_1\).

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

# create the sigmoid function
def sigmoid_function(t):
    '''Calculates p for given t'''
    p = 1/(1+np.exp(-t))
    return p

# create the logit function
def logit_function(x, b0, b1):
    '''Caluculates the border'''
    t = b0 + b1*x
    return t

b0 = 0
b1 = 0.2

x = np.arange(-50, 50, 1)
t = logit_function(x, b0, b1)
p = sigmoid_function(t)

plt.plot(x,p, color=c_dark)
plt.vlines(x=-b0/b1,ymin=0, ymax=1,color=c_blue,linestyle='--', linewidth=1.2)
plt.hlines(y=0.5,xmin=-50, xmax=-b0/b1,color=c_light, linestyle='--', linewidth=1.2)
plt.show()

Explore the sigmoid function#

Result

  • \(p(t(x))\) (the result of the sigmoid function) or in short \(p(x)\) is mainly only 0 or 1 (about)

  • \(b_0\) shifts the sigmoid function to the left and right

  • Larger \(b_1\) sharpens the sigmoid function.

  • Proper fractions of \(b_1\) smoothens the sigmoid function.

  • Negative \(b_1\) flips the sigmoid function.

Assuming that \(b_0\) and \(b_1\) are known, we can input any \(x\) (a feature of our machine learning problem), and our sigmoid function tells us whether the observation belongs to category 0 or 1.

The only requirement is to customize the model parameters \(b_0\) and \(b_1\) to our needs / input data so that it can be used to discrimitate/binarize/classify any numerical feature \(x\).

Interpretation of the Hypothesis#

\(h_b(x)\) is the probability that y = 1 on input x

Example:

\[\begin{split} x = \begin{bmatrix} x_0 \\ x_1\end{bmatrix} = \begin{bmatrix} 1 \\ \text{#negWords}\end{bmatrix} \end{split}\]

\(h_{b}(x)=0.8\)

The sentence has a 80% probability of being sad. The probability of y=1 given the feature vector x is 80%

Notation:

\(h_{b}(x)=p(y=1\mid x;b)\)

Conditional probability#

The syntax \(p(y=1\mid x;b)\) denotes a so called conditional probability.

  • The \(p\) is called probability function or just probability.

  • \(y = 1\) is the event that we are interested in.

  • The \( | \) is pronounced “given”, which refers to the following parameters …

  • \(x; b\) which describe the conditions under which we are considering the probability.

In this example we want to know the probability that y is 1 (that y belongs to class 1) assuming a certain number of negative words \((x)\) and underlying the given weights for \(b\).

What is:

\(p(y=0\mid x;b)\)

Interpretation of the Hypothesis#

This is a classification problem, so y has to be 0 or 1


\[\begin{split}\begin{align} &\underbrace{p(y=1\mid x;b)}_{\color{red}{\text{sad}}}+ \underbrace{p(y=0\mid x;b)}_{\color{green}{\text{not sad (happy)}}} = 1\\[12pt] h_{b}(x)=&\underbrace{p(y=1\mid x;b)}_{\color{red}{\text{sad}}} = 1-\underbrace{p(y=0\mid x;b)}_{\color{green}{\text{not sad (happy)}}} \end{align}\end{split}\]

The Decision Boundary#

Defining the decision boundary / threshold

  • we could do 0.5 as the threshold (default in sklearn implementation)

  • If the probability is bigger than or equal to 0.5 it is classified as 1, else as 0.

\[\begin{split} \hat{y}= \begin{cases} 1 & \text{if } \hat{p} \ge 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \ \end{cases} \end{split}\]

For what t is the sigmoid greater than 0.5?

../_images/ebb3c43a1a85fc0f5197cc1d2e2aa9cf4f512bb13ba01eeb7d9d905da04d9a8a.png

The Decision Boundary#

Defining the decision boundary / threshold

  • we could do 0.5 as the threshold (default in sklearn implementation)

  • If the probability is bigger than or equal to 0.5 it is classified as 1, else as 0.

\[\begin{split} \begin{array}{c}{{\hat{p}\:=\:\sigma\bigl(b^{T}\,x\bigr)\:\ge\:0.5}}\\ {{b^{T}\:x\geq0}}\end{array}\end{split}\]

When do we predict y = 0?

../_images/ebb3c43a1a85fc0f5197cc1d2e2aa9cf4f512bb13ba01eeb7d9d905da04d9a8a.png

Classification#

Now classification will be based on probabilities:

  • 0.5 is the threshold

  • If the probability is bigger than or equal to 0.5 it is classified as 1, else as 0.

\[\begin{split} \hat{y}= \begin{cases} 1 & \text{if } \hat{p} \ge 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \ \end{cases} \end{split}\]

Why 0.5? Good question!

../_images/ebb3c43a1a85fc0f5197cc1d2e2aa9cf4f512bb13ba01eeb7d9d905da04d9a8a.png

The Logistic Function#

Remember: The threshold corresponds to your business problem!

e.g. decrease # missed (potential) patients by decreasing the threshold (drawback: many healthy get paniced)

Hands on. Example time!#

Decision Boundary Example - Let’s use Logistic regression!#

  • We have some texts to classify as sad (class 1) or happy (class 0)

  • Our features are:

    • x1: number of negative words (e.g. “Today was a dull, exhausting and tiring one. 😞”, x1 = 3)

    • x2: number of sad smileys (e.g. “Today was a dull, exhausting and tiring one. 😞”, x2 = 1)

../_images/fbd118e20346f4fe86480f73f1a9bd9f3f7334fb7f4312f8a419e541a554cf5c.png

This figure does not show the response this time but two features!

Decision Boundary Example#

\(\hat{p} = \sigma(b_0+ b_1 x_1 +b_2 x_2) = \sigma(b^T x)\)

\(x_0=1\)

\(b^{T} x=\begin{bmatrix}b_0 & b_1 & b_2 \end{bmatrix}\cdot\begin{bmatrix}x_0\\x_1\\x_2 \end{bmatrix}=\,b_{0}\cdot x_{0}\,+\,b_{1}\cdot x_{1}\,+\,b_{2}\cdot\,x_{2}\)

../_images/fbd118e20346f4fe86480f73f1a9bd9f3f7334fb7f4312f8a419e541a554cf5c.png

\(x\) is one observation. The index refers to different features:
\(x_0\) belongs to the bias and is 1 (so that \(b_0\) is the intercept).
\(x_1\) and \(x_2\) are the # of neg. words and smileys, resp.

Decision Boundary Example#

\(\hat{p} = \sigma(b_0+ b_1 x_1 +b_2 x_2) = \sigma(b^T x)\)

\(b=\left[\begin{array}{c}{{-3\ }}\\ {{1}}\\ {{1}}\end{array}\right]\)

../_images/fbd118e20346f4fe86480f73f1a9bd9f3f7334fb7f4312f8a419e541a554cf5c.png

Decision Boundary Example#

\(\hat{p}= \sigma(b_0+ b_1 x_1 +b_2 x_2) = \sigma(b^T x)\)

\(\hat{p}= \sigma(-3+ x_1 +x_2) = \sigma(b^T x)\)

\(b=\left[\begin{array}{c}{{-3\ }}\\ {{1}}\\ {{1}}\end{array}\right]\)

../_images/fbd118e20346f4fe86480f73f1a9bd9f3f7334fb7f4312f8a419e541a554cf5c.png

Decision Boundary Example#

\(\hat{p} = \sigma(-3+ x_1 +x_2) = \sigma(b^T x) = \frac{1}{1+e^{b_0+b_1\cdot x_1 + b_2 \cdot x_2}}\)

We predict class 1 if

\(\hat{p}\geq0.5 \text{ that means } b^T x\geq0\) (the exponent of \(e\))

../_images/fbd118e20346f4fe86480f73f1a9bd9f3f7334fb7f4312f8a419e541a554cf5c.png

Decision Boundary Example#

\(\hat{p} = \sigma(-3+ x_1 +x_2) = \sigma(b^T x)\)

We predict “y = 1” when:

\(\hat{p}\geq0.5 \text{ when } b^T x\geq0\)

\[\begin{split}\begin{align} -3+x_1+x_2&\geq0 \\ x_1+x_2&\geq3\end{align}\end{split}\]
../_images/fbd118e20346f4fe86480f73f1a9bd9f3f7334fb7f4312f8a419e541a554cf5c.png

Decision Boundary Example#

\(\hat{p} = \sigma(-3+ x_1 +x_2) = \sigma(b^T x)\)

We predict “y = 1” when:

\({x_1+x_2\geq3}\)

Decision Boundary Example#

\(\hat{p} = \sigma(-3+ x_1 +x_2) = \sigma(b^T x)\)

We predict “y = 1” when:

\({x_1+x_2\geq3}\)

We predict “y = 0” when:

\({x_1+x_2<3}\)

Decision Boundary Example#

We predict “y = 1” when:

\({x_1+x_2\geq3}\)

We predict “y = 0” when:

\({x_1+x_2<3}\)

Decision Boundary:

\[x_1+x_2=3\]
\[x_2=3-x_1\]
../_images/b59559d7c4bdc9cafe2dc743128c6017082b5e431f799f026a7911679e7287b2.png

Decision Boundary#

The decision boundary is a property of the hypothesis and not of the data.

We fit the model to find the parameters b and then we can get the decision boundary based on b.

Spoiler alert: if you use higher order polynomial features you get a non-linear decision boundary.

../_images/07aab617cea1e5cec9dbc51d954e3365e32c23463b7cd97cc9ec73688f9ab431.png

The Cost Function#

Training set: \(\{(x^1, y^1), (x^2, y^2), ... (x^n, y^n)\}\)

\[\begin{split} x\,=\,\left[{\begin{array}{c}{x_{0}}\\ {x_{1}}\\ {\vdots}\\ {x_{m}}\end{array}}\right],x_{0}\,=\,1,\,y\,\in\,\{0,1\}\end{split}\]

How do we chose the right parameters b?

Fitting the parameter b to the data.

\[ \hat{p}=\frac{1}{1+e^{-b^{T}x}}\]

The top numbers are no exponents but refer to the observation number here.

The Cost Function#

For logistic regression the log loss function LLF (also referred to as binary cross entropy) is used:
$\(LLF = -\frac{1}{n}\sum_{i=0}^n\Bigg[ \Bigg(y_i \ln(\hat{p}(x_i))\Bigg) + \Bigg((1-y_i) \ln(1-\hat{p}(x_i))\Bigg)\Bigg]\quad\)$

with \(n\) being the number of observations,
\(y_i\) being the real observations (0 or 1)
and \(\hat{p}(x_i)\) being the predictions, between 0 and 1

\(y_i\) (real value)

\(\hat{p}(x_i)\) (prediction)

val in []

0

\( \sim 0\)

\( \sim 0\)

0

\(\neq 0\)

large

1

\( \sim 1\)

\( \sim 0\)

1

\(\neq 1\)

large

→ The further the prediction is from the real value, the larger the LLF
vice versa: The smaller the loss, the better the prediction.

The Cost Function#

Why don’t we use the cost function we know from linear regression?

\[ \mathrm{min}\ J(b)\ =\ \sum(y_{i}\ - \hat{y})^{2}\]
\[ \text{or the estimated probabilities:}\]
\[\mathrm{min}\ J(b)={\textstyle{\frac{1}{2n}}}\sum(y-\hat{p})^{2}\]

This would lead to a non-convex function which is hard to optimize. (Remember: Gradient Descent)

The Cost Function#

The linear regression cost function (remember OLS).

\[ J(b)={\textstyle{\frac{1}{2n}}}\sum(y\,-\,\hat{p})^{2}\]

Logistic regression hypothesis:

\[ \hat{p}=\frac{1}{1+e^{-b^{T}x}}\]

This would though lead to a non-convex function which is hard to optimize. The usual way to optimize is to find the global minimum.

The Cost Function#

Logistic regression hypothesis:

\[ \hat{p}=\frac{1}{1+e^{-b^{T}x}}\]
\[\begin{split} Loss(\hat{p},y)=\begin{cases} -\ln{(\hat{p})} &\text{if } y=1 \\ -\ln{(1-\hat{p})} &\text{if } y=0 \end{cases} \end{split}\]
../_images/86bec1186a9004bee6959503a1238b09bda6f0d0735f4f1361e1746ec963836d.png

Loss based on logs of probabilities#

For instances belonging to class 1 → Loss is the negative log of \(\hat{p}(y=1)\)

→ Probability is close to zero? Loss is high!

→ Probability is close to one? Loss is low!
\[\begin{split} Loss(\hat{p},y)=\begin{cases} -\ln{(\hat{p})} &\text{if } y=1 \\ -\ln{(1-\hat{p})} &\text{if } y=0 \end{cases} \end{split}\]

For (correct) probabilities of 0 / 1, the loss would be zero.

../_images/d8fdeaae0805d81d73634df383975dd4d21bd0c8571902b1578bdac4c454dfaf.png

Loss based on logs of probabilities#

For instances belonging to class 0 → Loss is the negative log of \(1-\hat{p}(y=1)\)

→ Probability is close to zero? Loss is low!

→ Probability is close to one? Loss is high!
\[\begin{split} Loss(\hat{p},y)=\begin{cases} -\ln{(\hat{p})} &\text{if } y=1 \\ -\ln{(1-\hat{p})} &\text{if } y=0 \end{cases} \end{split}\]

For (correct) probabilities of 0 / 1, the loss would be zero.

../_images/28f4f0492dabc9b744d8bd0c3cf7163ce678b26e7e98ad22cc67c464d7b5d1dc.png

Now, just add it all up#

\[\begin{split} Loss(\hat{p},y)=\begin{cases} -\ln{(\hat{p})} &\text{if } y=1 \\ -\ln{(1-\hat{p})} &\text{if } y=0 \end{cases} \end{split}\]


Because y is always 0 or 1 we can write this:
\[ Loss(\hat{p},y)=y\cdot\big(-\ln{(\hat{p})}\big)+(1-y)\cdot\big(-\ln{(1-\hat{p})}\big) \]
\[ Loss(\hat{p},y)= \underbrace{y\cdot\big(-\ln{(\hat{p})}\big)}_{\color{grey}{\text{“disabled” for }}y=0}+\underbrace{(1-y)\cdot\big(-\ln{(1-\hat{p})}\big)}_{\color{grey}{\text{“disabled” for }}y=1} \]

Now, just add it all up#

\[\begin{split} Loss(\hat{p},y)=\begin{cases} -\ln{(\hat{p})} &\text{if } y=1 \\ -\ln{(1-\hat{p})} &\text{if } y=0 \end{cases} \end{split}\]


Because y is always 0 or 1 we can write this:
\[\begin{split}\begin{align} Loss(\hat{p},y) &= y\cdot\big(-\ln{(\hat{p})}\big)+(1-y)\cdot\big(-\ln{(1-\hat{p})}\big) \\[18pt] Loss(\hat{p},y) &= -\big[y\ln{(\hat{p})} + (1-y)\ln{(1-\hat{p})}\big] \end{align}\end{split}\]


This loss function (binary cross-entropy) can be directly obtained from MLE (maximum likelihood estimation)

From loss to cost#

Now, let’s average the Loss function over all observations:

\[\begin{split}\begin{align} Loss(\hat{p},y) &= -\big[y\ln{(\hat{p})} + (1-y)\ln{(1-\hat{p})}\big] \\[16pt] J(b) &= -\frac{1}{n}\sum_{i=1}^n[y_i\ln\left(\hat{p}_i\right)\ +\ (1-y_i)\ln\left(1-\hat{p}_i\right)\big] \end{align}\end{split}\]

Now we can fit our parameters b to our data:

  • Find b that minimizes the cost function J(b).

  • To minimize the cost function we use gradient descent!

Minimizing J - find \(b\)#


  • There is no closed form solution as the Normal Equation (\(b = (X^TX)^{-1}X^Ty\)) for linear regression.

  • So instead we get the partial derivatives and start the gradient descent!

\[\frac{\partial}{\partial b_{j}}J(b)=\frac{1}{n}\sum_{i=1}^{n}\Bigl(\sigma\Bigl(b^{T}x_{i}\Bigr)-y_{i}\Bigr)x_{i,j}\]
</div>
<div class="images_30">               

Conclusion#

Training the model ~ finding b that minimizes J ~ finding the shape of the decision boundary.

Changing the classification threshold from 0.5 (default) ~ changing the position of the decision boundary (but not the shape).

Multi class problems#

How can more than two different classes be predicted?

This is often done using OvR (one versus rest).

If there are 3 classes, 3 classification runs were performed:

  • class 0 vs. classes 1 \(\cup\) 2

  • class 1 vs. classes 0 \(\cup\) 2

  • class 2 vs. classes 0 \(\cup\) 1

The class with the highest probability is used for final prediction.

Resources#