Evaluation Metrics#

Orientation#

Where are we now?

06 PREDICTIVE MODELING:

  • select an ML algorithm

  • train the ML model

  • evaluate the performance

  • make predictions

Evaluate model performance#

  • Regression Metrics (R-squared, RMSE…)

  • Classification Metrics (accuracy, precision, recall…)

  • Custom Metrics
    → e.g. based on the worst case scenarios of your product

Note: If you need to present results to stakeholders, you need a simple metric! MSE, precision, recall, etc. are often too complex to explain.

Short recap on regression metrics#

What Metrics do we have to evaluate model performance?#

R2 (R-squared)

\[R^{2}=1-\frac{RSS}{TSS}=1-\frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}=1-\frac{\sum_{i=1}^{n}e_{i}^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}\]

MSE (Mean Square Error)

\[MSE={\frac{1}{n}}\sum_{i=1}^{n}\left(y_{i}-{\hat{y}}_{i}\right)^{2}\]

RMSE (Root Mean Square Error)

\[RMSE = \sqrt{MSE}\]

MAPE (Mean Absolute Percentage Error)

\[{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\frac{|y_{i}-\hat{y}_{i}|}{\mathrm{max}(\epsilon,\left|y_{i}\right|)}\]
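
A minimal sketch of computing these metrics with scikit-learn (the `y_true` and `y_pred` arrays below are made-up example values, not data from these notes):

```python
import numpy as np
from sklearn.metrics import (
    r2_score,
    mean_squared_error,
    mean_absolute_percentage_error,
)

# Made-up example: true target values and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

r2 = r2_score(y_true, y_pred)                           # R-squared
mse = mean_squared_error(y_true, y_pred)                # Mean Square Error
rmse = np.sqrt(mse)                                     # Root Mean Square Error
mape = mean_absolute_percentage_error(y_true, y_pred)   # Mean Absolute Percentage Error

print(f"R2:   {r2:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAPE: {mape:.3f}")
```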

Let’s talk about classification#


Confusion Matrix#

Counts how often the model predicted correctly and how often it got confused.

  • False Positive: false alarm / type I error

  • False Negative: missed detection / type II error

                    Predicted
                    Negatives   Positives
Actual  Negatives   TN          FP
        Positives   FN          TP
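
A minimal sketch, assuming scikit-learn, of how the four counts can be read off a confusion matrix (the label vectors are made-up examples):

```python
from sklearn.metrics import confusion_matrix

# Made-up example labels (0 = negative, 1 = positive)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```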

Accuracy#

How often the model has been right.

\[\frac{Correct}{All}=\frac{TP+TN}{TP+FP+TN+FN}\]
                    Predicted
                    Negatives   Positives
Actual  Negatives   TN          FP
        Positives   FN          TP

Drawbacks#

  • When one class is very rare, accuracy can lead to false conclusions

  • Here, Accuracy is 94 %

  • But 5 out of 6 positives have been predicted incorrectly

\[\frac{Correct}{All}=\frac{1+93}{1+5+93+1}=94\,\%\]

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1
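
The 94 % figure can be reproduced with scikit-learn's `accuracy_score`; the sketch below rebuilds label vectors from the counts in the matrix above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Rebuild label vectors matching the matrix: 93 TN, 1 FP, 5 FN, 1 TP
y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 93 + [1] * 1 + [0] * 5 + [1] * 1)

print(accuracy_score(y_true, y_pred))  # 0.94 -> 94 %, despite 5 of 6 positives missed
```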

Accuracy might not be good enough#


Precision and Recall#

  • Focus on the positive class

  • The number of True Negatives is not taken into account

  • Useful when trying to detect a rare event, where the number of negatives is very large

What proportion of actual positives was predicted correctly?#

  • The TPR is also called sensitivity or recall.

  • Here, the True Positive Rate is ⅙ (~16.67%).

\[Recall=\frac{TP}{TP+FN}\]

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1
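
A small sketch reproducing this recall with scikit-learn, using label vectors rebuilt from the matrix above (93 TN, 1 FP, 5 FN, 1 TP):

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 93 + [1] * 1 + [0] * 5 + [1] * 1)

print(recall_score(y_true, y_pred))  # 1 / 6 ≈ 0.1667
```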

Tweaking the model#

  • Every model has a threshold that discerns positive from negative predictions.

  • Typically, instances get predicted positive if the predicted probability is \(\geq\) 0.5.



Before tweaking

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1

Tweaking the model#

  • The lower the threshold the more instances get predicted positive.

  • This will automatically raise the True Positive Rate (TPR) / Recall (see the sketch below the table).



Before tweaking

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1
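
A minimal sketch of tweaking the threshold with a scikit-learn classifier; the toy data set, model, and the lowered threshold of 0.1 are all made-up placeholders, not values from these notes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Made-up imbalanced toy data (about 95 % negatives)
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]        # probability of the positive class

default_pred = (proba >= 0.5).astype(int)   # default threshold
tweaked_pred = (proba >= 0.1).astype(int)   # lowered threshold -> more positives

print("Recall @ 0.5:", recall_score(y, default_pred))
print("Recall @ 0.1:", recall_score(y, tweaked_pred))
```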

Now let’s tweak the model#

  • After tweaking, the True Positive Rate (TPR) is at 100%.

  • But are we entirely happy?



Before tweaking

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1

After tweaking

                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6

What proportion of positive predictions are actually correct?#

  • Precision is 6/86 (~6.98 %), even lower than the recall before tweaking!

  • But whether that is too low or acceptable depends on the business case.

  • For detecting cancer it might be acceptable to the stakeholders.
    → Still, the cost of screening millions of people might be very high.



\[Precision=\frac{TP}{TP+FP}\]

                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6
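
A small sketch reproducing this precision with scikit-learn, using label vectors rebuilt from the after-tweaking counts (14 TN, 80 FP, 0 FN, 6 TP):

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 14 + [1] * 80 + [1] * 6)

print(precision_score(y_true, y_pred))  # 6 / 86 ≈ 0.0698
```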

Summary#

\(N_+\): the number of positives
\(N_-\): the number of negatives
\(n\): the number of observations

                        Predicted
                        Negatives / 0              Positives / 1
Actual  Negatives / 0   TN                         FP                         \(N_- = FP + TN\)
        Positives / 1   FN                         TP                         \(N_+ = TP + FN\)
                        \(\hat{N}_- = FN + TN\)    \(\hat{N}_+ = TP + FP\)    \(n = TP + FP + FN + TN\)

Precision-Recall Curve#

  • Plots Precision vs. Recall depending on the threshold.

  • If threshold is high:
    → Precision is close to 1.
    → Recall will be very low.

Precision-Recall Curve#

  • If threshold is effectively zero:
    → Predicting all instances as positives.
    → Recall will be 1.
    → Precision is equal to the share of positives.

  • Goal: Get a threshold the stakeholder agrees on.

  • A starting point might be an estimate of the economic benefits and costs.
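
A minimal sketch of computing and plotting a precision-recall curve with scikit-learn and matplotlib; the toy data set and model are made-up placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Made-up toy data and model
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Precision and recall for every candidate threshold
precision, recall, thresholds = precision_recall_curve(y, proba)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
```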

F1-Score#

  • Harmonic mean of precision and recall

  • Here, the F1-Score is ≈ 13.08 % (with precision ≈ 7 % and recall = 100 %)

\(F_1 = 2\cdot\frac{Precision\;\cdot\;Recall}{Precision\;+\;Recall}\)

\(F_1 = 2\cdot\frac{7\cdot100}{7+100}=13.08\%\)


                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6
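
The same value can be checked with scikit-learn's `f1_score`; the sketch below rebuilds the label vectors from the after-tweaking matrix:

```python
import numpy as np
from sklearn.metrics import f1_score

# 14 TN, 80 FP, 0 FN, 6 TP
y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 14 + [1] * 80 + [1] * 6)

print(f1_score(y_true, y_pred))  # ≈ 0.13
```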

F1-Score#

  • The harmonic mean punishes low rates.

\(F_1 = 2\cdot\frac{Precision\;\cdot\;Recall}{Precision\;+\;Recall}\)


Precision   Recall   F1-Score
5 %         50 %     9 %
90 %        90 %     90 %
30 %        60 %     40 %

Let’s take negatives into the equation#

  • The number of correct negative predictions is sometimes just as important

  • Spam vs. ham is just one example (email spam detection)

What proportion of actual negatives was predicted as positives?#

  • False Positive Rate

  • Here the FPR is: \(\frac{80}{80+14} = 85.11\%\)



\[FPR=\frac{FP}{TN+FP}\]
                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6

What proportion of actual negatives was predicted correctly?#

  • The True Negative Rate is also called specificity

  • Here the TNR is: \(\frac{14}{80+14} = 14.89\%\)

  • FPR = 1 - specificity



\[TNR=\frac{TN}{TN+FP}\]

                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6
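
A small sketch deriving FPR and TNR from the confusion-matrix counts of the after-tweaking example (scikit-learn is assumed only for `confusion_matrix`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 14 TN, 80 FP, 0 FN, 6 TP
y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 14 + [1] * 80 + [1] * 6)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (tn + fp)   # False Positive Rate = 1 - specificity
tnr = tn / (tn + fp)   # True Negative Rate (specificity)

print(f"FPR: {fpr:.4f}")  # ≈ 0.8511
print(f"TNR: {tnr:.4f}")  # ≈ 0.1489
```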

Receiver Operating Characteristic Curve (ROC Curve)#

TPR vs. FPR plotted for different thresholds

  • The 45° line is equivalent to flipping a coin

  • If all positives are correctly predicted and no negative is incorrectly predicted, the ROC curve hugs the top-left corner

  • Aim: a ROC curve as close as possible to the point (0, 1)

ROC and the Area Under the Curve (ROC AUC)#

A metric to compare different classifiers (see the sketch below)

  • Random classifier:
    → ROC AUC is 0.5
    → ROC curve is on the 45° line

  • Perfect classifier:
    → ROC AUC is 1
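
A minimal sketch of computing the ROC curve and ROC AUC with scikit-learn and plotting them against the 45° line; the toy data set and model are made-up placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up toy data and model
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)

plt.plot(fpr, tpr, label=f"classifier (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```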

Explanation of ROC curve#

(animated figure: cutoff.gif, not reproduced here)

Imbalanced classes#


Going beyond aggregated metrics#

All the performance metrics we’ve seen today are aggregated metrics.

They help determine whether a model has learned well from a dataset or needs improvement.

Next step:
Examine the results and errors to understand why and how the model is failing or succeeding.

Why: validation and iteration

Performance metrics can be deceptive: on highly imbalanced datasets, a classifier can reach very high accuracy without any predictive power.

Validate your model → inspect how it is performing#

There are a lot of ways to do this. You want to contrast the data (target and/or features) with the predictions.

Regression:
Look at the residuals, for example by doing EDA on them and inspecting the outliers.

Classification:
One can start with a confusion matrix, breaking the results down by true class and prediction.
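
A minimal sketch of both ideas, assuming scikit-learn, pandas, and matplotlib; the data sets and models are made-up placeholders, not the ones discussed here:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay

# Regression: EDA on residuals (toy data, placeholder model)
X_reg, y_reg = make_regression(n_samples=300, noise=10, random_state=0)
residuals = y_reg - LinearRegression().fit(X_reg, y_reg).predict(X_reg)
print(pd.Series(residuals).describe())   # summary stats, spot outliers
plt.hist(residuals, bins=30)
plt.title("Residual distribution")
plt.show()

# Classification: break the results down with a confusion matrix
X_clf, y_clf = make_classification(n_samples=300, weights=[0.9], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
ConfusionMatrixDisplay.from_estimator(clf, X_clf, y_clf)
plt.show()
```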

Resources#

roc-auc-precision-and-recall-visually-explained

Building Machine Learning Powered Applications - Emmanuel Ameisen