Logistic Regression

Logistic Regression#

Recap!#

Confusion Matrix#

Q1: Which cells show the correct predictions

Q2: Now where do we put… TN TP FN FP?

		Predicted
		Negatives	Positives
Actual	Negatives
Actual	Positives

Confusion Matrix#

Q1: Which cells show the correct predictions

Q2: Now where do we put… TN TP FN FP?

		Predicted
		Negatives	Positives
Actual	Negatives	TN	FP
Actual	Positives	FN	TP

Classification metrics#

Q3: Which classification metrics do you remember?

Metric

Value

Accuracy

${\frac{130}{200}}=0.65$

Precision

${\frac{70}{110}}=0.64$

Recall

${\frac{70}{100}}=0.7$

F1-Score

$2\cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}$

ROC AUC

Classification metrics#

Q3: Which classification metrics do you remember?

Metric	Value
Accuracy	${\frac{130}{200}}=0.65$
Precision	${\frac{70}{110}}=0.64$
Recall	${\frac{70}{100}}=0.7$
F1-Score	$2\cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}$
ROC AUC

Q4: What are the values?

		Predicted
		Negatives	Positives
Actual	Negatives	60	40
Actual	Positives	30	70

Accuracy - Rate of corrrect predictions - (TP+TN) /(TP+TN+FP+FN) = # correct predictions / # all predictions
Precision - How precise does a model predict positives - TP / (TP+FP) = # correct positives / # all positives
Recall/Sensitivity/TPRate - How many % of the patients will be recalled / How sensitive is the model - TP / (TP+FN) = # correct positives / # actual positive
F1-Score - Harmonic mean between Precision and Recall - 2x (P*R) / (P+R)
ROC AUC - area under ROC curve; parameter: threshold

Classification metrics#

Q3: Which classification metrics do you remember?

Metric	Value
Accuracy	${\frac{130}{200}}=0.65$
Precision	${\frac{70}{110}}=0.64$
Recall	${\frac{70}{100}}=0.7$
F1-Score	$2\cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}=\frac{2}{3}$
ROC AUC

Q4: What are the values?

		Predicted
		Negatives	Positives
Actual	Negatives	60	40
Actual	Positives	30	70

ROC Curve#

Q5: What is the threshold?

Q6: When would you change it?

Improving the recall, TP / (TP+FN) = missing no patient, by altering the threshold goes along with:
a worsening (increase) in false alarms, called false positive rate, because there will be more FP.

Logistic Regression#

Classification#

Some examples for classification

Email: spam vs not spam
Tumor classification: malignant vs benign
Bee image: healthy vs not healthy
Sentiment analysis: happy vs sad
Text analysis: toxic vs not toxic
Face recognition: iphone owner or not
Animal detection: cat vs dog vs monkey vs… (multi class classification)

In (binary)-classification we have two possible classes:

\[y\in\{0,1\}\]

Class 0: the negative class (not spam, malignant, not cat)
Class 1: the positive class (spam, benign, cat)

Why classification?#

../_images/e05b8d48536d1e52ddf14af91fcad5e1e52592a62bb83b62f5b1816164bde255.png

Why do we need another (classification) algorithm if we have linear regression?

Why classification?#

\[h_{b}(x)=b^{T}x\]

\[\text{(or in 1d without matrix notation: }h_b(x) = b_0 + b_1\cdot x)\]

../_images/301217e8b340e44ca839350667eed2ccb87e6da98945bb03e06b0be1d3eb8049.png

Why classification?#

if $h_{b}(x)\geq0.5$ then predict “y = 1” (→ sad)

if $h_{b}(x)<0.5$ then predict “y = 0” (→ not sad)

\[h_{b}(x)=b^{T}x\]

../_images/f3b1f91eb8dd9d5c983f055389467a0f4bdc0ad09303fa4dabe68bdaa2222c21.png

Why classification?#

../_images/d8facb668233c21658f8d1761cbc7fc58418e31babac02f9db93d364a3bfb25f.png

Why Classification?#

applying linear regression to a classification problem is usually not a great idea
first time we got lucky and we got a hypothesis that worked well for the particular example

../_images/bb693b9f714bdc4a8431474bc61bf8856678116cd585ff798ad5b04bb3b3eab1.png

Another problem with classification with linear regression#

y can be 1 or 0 (or any other labels) $y\in\{0,1\}$
but the hypothesis $h_{b}(x)$ can be larger than 1 or smaller than 0 even when all the training data was with 0 and 1

\[\begin{split}\begin{align} h_{b}(x)&=b^{T}x\\[8pt] h_{b}(x)&\in\mathbb{R} \end{align}\end{split}\]

Logistic Regression Model#

Classification hypothesis:

\[0\leq h_{b}(x) \leq 1\]

Linear regression hypothesis:

\[h_{b}(x)=b^{T}x\]

Logistic regression hypothesis:

\[h_{b}(x)=\sigma(b^{T}x)\]

Sigmoid function aka. logistic function:

\[\sigma(t)=\frac{1}{1+e^{-t}}\]

The Logistic function#

Lets discuss some properties of the sigmoid function:

Sigmoid function makes values fit into (0,1)
The sigmoid asymptotes to 0 when t goes to minus infinity
The sigmoid asymptotes to 1 when t goes to infinity
All its values lie within (0,1)

\[\begin{split}\begin{align} \sigma(t)&=\frac{1}{1+e^{-t}}\\[4pt] \operatorname*{lim}_{t\rightarrow-\infty}\sigma(t)&=0\\[4pt] \operatorname*{lim}_{t\rightarrow\infty}\sigma(t)&=1 \\[4pt] \sigma(t)&\in\{0,1\}\\ \end{align}\end{split}\]

../_images/c19d8778fb076c9a7bb809ebdbf75f137420cf37083c34552365a86e282f6cf8.png

From logistic function to logistic regression#

Sigmoid function makes values fit into (0,1)
we then need to pick the parameters b
- t is equivalent to $b^Tx$

\[\begin{split}\begin{align} \sigma(t)&=\frac{1}{1+e^{-t}}\\[8pt] h_{b}(x)&=\frac{1}{1+e^{-b^Tx}} \end{align}\end{split}\]

Explore the sigmoid function#

\[\sigma(x)=\frac{1}{1+e^{-(b_0 + b_1 \cdot x)}}\]

Please take this code and evaluate the infuence of $b_0$ and $b_1$.

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

# create the sigmoid function
def sigmoid_function(t):
    '''Calculates p for given t'''
    p = 1/(1+np.exp(-t))
    return p

# create the logit function
def logit_function(x, b0, b1):
    '''Caluculates the border'''
    t = b0 + b1*x
    return t

b0 = 0
b1 = 0.2

x = np.arange(-50, 50, 1)
t = logit_function(x, b0, b1)
p = sigmoid_function(t)

plt.plot(x,p, color=c_dark)
plt.vlines(x=-b0/b1,ymin=0, ymax=1,color=c_blue,linestyle='--', linewidth=1.2)
plt.hlines(y=0.5,xmin=-50, xmax=-b0/b1,color=c_light, linestyle='--', linewidth=1.2)
plt.show()

Explore the sigmoid function#

Result

$p(t(x))$ (the result of the sigmoid function) or in short $p(x)$ is mainly only 0 or 1 (about)
$b_0$ shifts the sigmoid function to the left and right
Larger $b_1$ sharpens the sigmoid function.
Proper fractions of $b_1$ smoothens the sigmoid function.
Negative $b_1$ flips the sigmoid function.

Assuming that $b_0$ and $b_1$ are known, we can input any $x$ (a feature of our machine learning problem), and our sigmoid function tells us whether the observation belongs to category 0 or 1.

The only requirement is to customize the model parameters $b_0$ and $b_1$ to our needs / input data so that it can be used to discrimitate/binarize/classify any numerical feature $x$.

Interpretation of the Hypothesis#

$h_b(x)$ is the probability that y = 1 on input x

Example:

\[\begin{split} x = \begin{bmatrix} x_0 \\ x_1\end{bmatrix} = \begin{bmatrix} 1 \\ \text{#negWords}\end{bmatrix} \end{split}\]

$h_{b}(x)=0.8$

The sentence has a 80% probability of being sad. The probability of y=1 given the feature vector x is 80%

Notation:

$h_{b}(x)=p(y=1\mid x;b)$

Conditional probability#

The syntax $p(y=1\mid x;b)$ denotes a so called conditional probability.

The $p$ is called probability function or just probability.
$y = 1$ is the event that we are interested in.
The $ | $ is pronounced “given”, which refers to the following parameters …
… $x; b$ which describe the conditions under which we are considering the probability.

In this example we want to know the probability that y is 1 (that y belongs to class 1) assuming a certain number of negative words $(x)$ and underlying the given weights for $b$.

What is:

$p(y=0\mid x;b)$

Interpretation of the Hypothesis#

This is a classification problem, so y has to be 0 or 1

\[\begin{split}\begin{align} &\underbrace{p(y=1\mid x;b)}_{\color{red}{\text{sad}}}+ \underbrace{p(y=0\mid x;b)}_{\color{green}{\text{not sad (happy)}}} = 1\\[12pt] h_{b}(x)=&\underbrace{p(y=1\mid x;b)}_{\color{red}{\text{sad}}} = 1-\underbrace{p(y=0\mid x;b)}_{\color{green}{\text{not sad (happy)}}} \end{align}\end{split}\]

The Decision Boundary#

Defining the decision boundary / threshold

we could do 0.5 as the threshold (default in sklearn implementation)
If the probability is bigger than or equal to 0.5 it is classified as 1, else as 0.

\[\begin{split} \hat{y}= \begin{cases} 1 & \text{if } \hat{p} \ge 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \ \end{cases} \end{split}\]

For what t is the sigmoid greater than 0.5?

The Decision Boundary#

Defining the decision boundary / threshold

we could do 0.5 as the threshold (default in sklearn implementation)
If the probability is bigger than or equal to 0.5 it is classified as 1, else as 0.

\[\begin{split} \begin{array}{c}{{\hat{p}\:=\:\sigma\bigl(b^{T}\,x\bigr)\:\ge\:0.5}}\\ {{b^{T}\:x\geq0}}\end{array}\end{split}\]

When do we predict y = 0?

Classification#

Now classification will be based on probabilities:

0.5 is the threshold
If the probability is bigger than or equal to 0.5 it is classified as 1, else as 0.

\[\begin{split} \hat{y}= \begin{cases} 1 & \text{if } \hat{p} \ge 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \ \end{cases} \end{split}\]

Why 0.5? Good question!

The Logistic Function#

Remember: The threshold corresponds to your business problem!

e.g. decrease # missed (potential) patients by decreasing the threshold (drawback: many healthy get paniced)

Hands on. Example time!#

Decision Boundary Example - Let’s use Logistic regression!#

We have some texts to classify as sad (class 1) or happy (class 0)
Our features are:
- x1: number of negative words (e.g. “Today was a dull, exhausting and tiring one. 😞”, x1 = 3)
- x2: number of sad smileys (e.g. “Today was a dull, exhausting and tiring one. 😞”, x2 = 1)

../_images/d0785afe098ed0eb75f4381610c70d896cc7e96362d68e1ce8662e330f1224fa.png

This figure does not show the response this time but two features!

Decision Boundary Example#

$\hat{p} = \sigma(b_0+ b_1 x_1 +b_2 x_2) = \sigma(b^T x)$

$x_0=1$

$b^{T} x=\begin{bmatrix}b_0 & b_1 & b_2 \end{bmatrix}\cdot\begin{bmatrix}x_0\\x_1\\x_2 \end{bmatrix}=\,b_{0}\cdot x_{0}\,+\,b_{1}\cdot x_{1}\,+\,b_{2}\cdot\,x_{2}$

$x$ is one observation. The index refers to different features:
$x_0$ belongs to the bias and is 1 (so that $b_0$ is the intercept).
$x_1$ and $x_2$ are the # of neg. words and smileys, resp.

Decision Boundary Example#

$\hat{p} = \sigma(b_0+ b_1 x_1 +b_2 x_2) = \sigma(b^T x)$

$b=\left[\begin{array}{c}{{-3\ }}\\ {{1}}\\ {{1}}\end{array}\right]$

Decision Boundary Example#

$\hat{p}= \sigma(b_0+ b_1 x_1 +b_2 x_2) = \sigma(b^T x)$

$\hat{p}= \sigma(-3+ x_1 +x_2) = \sigma(b^T x)$

$b=\left[\begin{array}{c}{{-3\ }}\\ {{1}}\\ {{1}}\end{array}\right]$

Decision Boundary Example#

$\hat{p} = \sigma(-3+ x_1 +x_2) = \sigma(b^T x) = \frac{1}{1+e^{b_0+b_1\cdot x_1 + b_2 \cdot x_2}}$

We predict class 1 if

$\hat{p}\geq0.5 \text{ that means } b^T x\geq0$ (the exponent of $e$)

Decision Boundary Example#

$\hat{p} = \sigma(-3+ x_1 +x_2) = \sigma(b^T x)$

We predict “y = 1” when:

$\hat{p}\geq0.5 \text{ when } b^T x\geq0$

\[\begin{split}\begin{align} -3+x_1+x_2&\geq0 \\ x_1+x_2&\geq3\end{align}\end{split}\]

Decision Boundary Example#

$\hat{p} = \sigma(-3+ x_1 +x_2) = \sigma(b^T x)$

We predict “y = 1” when:

${x_1+x_2\geq3}$

Decision Boundary Example#

$\hat{p} = \sigma(-3+ x_1 +x_2) = \sigma(b^T x)$

We predict “y = 1” when:

${x_1+x_2\geq3}$

We predict “y = 0” when:

${x_1+x_2<3}$

Decision Boundary Example#

We predict “y = 1” when:

${x_1+x_2\geq3}$

We predict “y = 0” when:

${x_1+x_2<3}$

Decision Boundary:

\[x_1+x_2=3\]

\[x_2=3-x_1\]

../_images/e58ad69e81f0e8a047b136cbc52cb65aeb068466c711a09a7a203ac3af0f4163.png

Decision Boundary#

The decision boundary is a property of the hypothesis and not of the data.

We fit the model to find the parameters b and then we can get the decision boundary based on b.

Spoiler alert: if you use higher order polynomial features you get a non-linear decision boundary.

../_images/100689cd27fb5db041b0ce9abb1f28bf5285c694a3153c72e9485e6021faae8b.png

The Cost Function#

Training set: $\{(x^1, y^1), (x^2, y^2), ... (x^n, y^n)\}$

\[\begin{split} x\,=\,\left[{\begin{array}{c}{x_{0}}\\ {x_{1}}\\ {\vdots}\\ {x_{m}}\end{array}}\right],x_{0}\,=\,1,\,y\,\in\,\{0,1\}\end{split}\]

How do we chose the right parameters b?

Fitting the parameter b to the data.

\[ \hat{p}=\frac{1}{1+e^{-b^{T}x}}\]

The top numbers are no exponents but refer to the observation number here.

The Cost Function#

For logistic regression the log loss function LLF (also referred to as binary cross entropy) is used:
$$LLF = -\frac{1}{n}\sum_{i=0}^n\Bigg[ \Bigg(y_i \ln(\hat{p}(x_i))\Bigg) + \Bigg((1-y_i) \ln(1-\hat{p}(x_i))\Bigg)\Bigg]\quad$$

with $n$ being the number of observations,
$y_i$ being the real observations (0 or 1)
and $\hat{p}(x_i)$ being the predictions, between 0 and 1

$y_i$ (real value)	$\hat{p}(x_i)$ (prediction)	val in []
0	$ \sim 0$	$ \sim 0$
0	$\neq 0$	large
1	$ \sim 1$	$ \sim 0$
1	$\neq 1$	large

→ The further the prediction is from the real value, the larger the LLF
vice versa: The smaller the loss, the better the prediction.

The Cost Function#

Why don’t we use the cost function we know from linear regression?

\[ \mathrm{min}\ J(b)\ =\ \sum(y_{i}\ - \hat{y})^{2}\]

\[ \text{or the estimated probabilities:}\]

\[\mathrm{min}\ J(b)={\textstyle{\frac{1}{2n}}}\sum(y-\hat{p})^{2}\]

This would lead to a non-convex function which is hard to optimize. (Remember: Gradient Descent)

The Cost Function#

The linear regression cost function (remember OLS).

\[ J(b)={\textstyle{\frac{1}{2n}}}\sum(y\,-\,\hat{p})^{2}\]

Logistic regression hypothesis:

\[ \hat{p}=\frac{1}{1+e^{-b^{T}x}}\]

This would though lead to a non-convex function which is hard to optimize. The usual way to optimize is to find the global minimum.