Evaluation Metrics#

Orientation#

Where are we now?

06 PREDICTIVE MODELING:

  • select an ML algorithm

  • train the ML model

  • evaluate the performance

  • make predictions

Evaluate model performance#

  • Regression Metrics (R-squared, RMSE…)

  • Classification Metrics (accuracy, precision, recall…)

  • Custom Metrics
    → e.g. based on the worst case scenarios of your product

Note: If you need to present results to stakeholders, you need a simple metric! MSE, precision, recall, etc. are often too complex to explain.

Short recap on regression metrics#

What Metrics do we have to evaluate model performance?#

R2 (R-squared)

\[R^{2}=1-\frac{RSS}{TSS}=1-\frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}=1-\frac{\sum_{i=1}^{n}e_{i}^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}\]

MSE (Mean Square Error)

\[MSE={\frac{1}{n}}\sum_{i=1}^{n}\left(y_{i}-{\hat{y}}_{i}\right)^{2}\]

RMSE (Root Mean Square Error)

\[RMSE = \sqrt{MSE}\]

MAPE (Mean Absolute Percentage Error)

\[{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\frac{|y_{i}-\hat{y}_{i}|}{\mathrm{max}(\epsilon,\left|y_{i}\right|)}\]
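
A minimal sketch of computing these metrics with scikit-learn (the `y_true` and `y_pred` arrays below are made-up example values, not data from these notes):

```python
import numpy as np
from sklearn.metrics import (
    r2_score,
    mean_squared_error,
    mean_absolute_percentage_error,
)

# Made-up example: true target values and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

r2 = r2_score(y_true, y_pred)                           # R-squared
mse = mean_squared_error(y_true, y_pred)                # Mean Square Error
rmse = np.sqrt(mse)                                     # Root Mean Square Error
mape = mean_absolute_percentage_error(y_true, y_pred)   # Mean Absolute Percentage Error

print(f"R2:   {r2:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAPE: {mape:.3f}")
```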

Let’s talk about classification#


Confusion Matrix#

Counts how often the model predicted correctly and how often it got confused.

  • False Positive: false alarm / type I error

  • False Negative: missed detection / type II error

                    Predicted
                    Negatives   Positives
Actual  Negatives   TN          FP
        Positives   FN          TP
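
A minimal sketch, assuming scikit-learn, of how the four counts can be read off a confusion matrix (the label vectors are made-up examples):

```python
from sklearn.metrics import confusion_matrix

# Made-up example labels (0 = negative, 1 = positive)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```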

Accuracy#

How often the model has been right.

\[\frac{Correct}{All}=\frac{TP+TN}{TP+FP+TN+FN}\]
                    Predicted
                    Negatives   Positives
Actual  Negatives   TN          FP
        Positives   FN          TP

Drawbacks#

  • When one class is very rare, accuracy can lead to false conclusions

  • Here, Accuracy is 94 %

  • But 5 out of 6 positives have been predicted incorrectly

\[\frac{Correct}{All}=\frac{1+93}{1+5+93+1}=94\,\%\]

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1
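
The 94 % figure can be reproduced with scikit-learn's `accuracy_score`; the sketch below rebuilds label vectors from the counts in the matrix above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Rebuild label vectors matching the matrix: 93 TN, 1 FP, 5 FN, 1 TP
y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 93 + [1] * 1 + [0] * 5 + [1] * 1)

print(accuracy_score(y_true, y_pred))  # 0.94 -> 94 %, despite 5 of 6 positives missed
```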

Accuracy might not be good enough#


Precision and Recall#

  • Focus on the positive class

  • The number of True Negatives is not taken into account

  • Useful when trying to detect a rare event, where the number of negatives is very large

What proportion of actual positives was predicted correctly?#

  • The TPR is also called sensitivity or recall.

  • Here, the True Positive Rate is ⅙ (~16.67%).

\[Recall=\frac{TP}{TP+FN}\]

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1
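
A small sketch reproducing this recall with scikit-learn, using label vectors rebuilt from the matrix above (93 TN, 1 FP, 5 FN, 1 TP):

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 93 + [1] * 1 + [0] * 5 + [1] * 1)

print(recall_score(y_true, y_pred))  # 1 / 6 ≈ 0.1667
```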

Tweaking the model#

  • Every model has a threshold that discerns positive from negative predictions.

  • Typically, instances get predicted positive if the predicted probability is \(\geq\) 0.5.



Before tweaking

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1

Tweaking the model#

  • The lower the threshold the more instances get predicted positive.

  • This will automatically raise the True Positive Rate (TPR) / Recall (see the sketch below the table).



Before tweaking

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1
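
A minimal sketch of tweaking the threshold with a scikit-learn classifier; the toy data set, model, and the lowered threshold of 0.1 are all made-up placeholders, not values from these notes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Made-up imbalanced toy data (about 95 % negatives)
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]        # probability of the positive class

default_pred = (proba >= 0.5).astype(int)   # default threshold
tweaked_pred = (proba >= 0.1).astype(int)   # lowered threshold -> more positives

print("Recall @ 0.5:", recall_score(y, default_pred))
print("Recall @ 0.1:", recall_score(y, tweaked_pred))
```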

Now let’s tweak the model#

  • After tweaking, the True Positive Rate (TPR) is at 100%.

  • But are we entirely happy?



Before tweaking

                    Predicted
                    Negatives   Positives
Actual  Negatives   93          1
        Positives   5           1

After tweaking

                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6

What proportion of positive predictions are actually correct?#

  • Precision is 6/86 (~6.98 %), even lower than the recall before tweaking!

  • But whether that is too low or acceptable depends on the business case.

  • For detecting cancer it might be acceptable to the stakeholders.
    → Still, the cost of screening millions of people might be very high.



\[Precision=\frac{TP}{TP+FP}\]

                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6
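
A small sketch reproducing this precision with scikit-learn, using label vectors rebuilt from the after-tweaking counts (14 TN, 80 FP, 0 FN, 6 TP):

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 14 + [1] * 80 + [1] * 6)

print(precision_score(y_true, y_pred))  # 6 / 86 ≈ 0.0698
```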

Summary#

\(N_+\): the number of positives
\(N_-\): the number of negatives
\(n\): the number of observations

                        Predicted
                        Negatives / 0              Positives / 1
Actual  Negatives / 0   TN                         FP                         \(N_- = FP + TN\)
        Positives / 1   FN                         TP                         \(N_+ = TP + FN\)
                        \(\hat{N}_- = FN + TN\)    \(\hat{N}_+ = TP + FP\)    \(n = TP + FP + FN + TN\)

Precision-Recall Curve#

  • Plots Precision vs. Recall depending on the threshold.

  • If threshold is high:
    → Precision is close to 1.
    → Recall will be very low.

Precision-Recall Curve#

  • If threshold is effectively zero:
    → Predicting all instances as positives.
    → Recall will be 1.
    → Precision is equal to the share of positives.

  • Goal: Get a threshold the stakeholder agrees on.

  • A starting point might be an estimate of the economic benefits and costs.
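
A minimal sketch of computing and plotting a precision-recall curve with scikit-learn and matplotlib; the toy data set and model are made-up placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Made-up toy data and model
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Precision and recall for every candidate threshold
precision, recall, thresholds = precision_recall_curve(y, proba)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
```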

F1-Score#

  • Harmonic mean of precision and recall

  • Here, the F1-Score is ≈ 13.08 % (with precision ≈ 7 % and recall = 100 %)

\(F_1 = 2\cdot\frac{Precision\;\cdot\;Recall}{Precision\;+\;Recall}\)

\(F_1 = 2\cdot\frac{7\cdot100}{7+100}=13.08\%\)


                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6
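
The same value can be checked with scikit-learn's `f1_score`; the sketch below rebuilds the label vectors from the after-tweaking matrix:

```python
import numpy as np
from sklearn.metrics import f1_score

# 14 TN, 80 FP, 0 FN, 6 TP
y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 14 + [1] * 80 + [1] * 6)

print(f1_score(y_true, y_pred))  # ≈ 0.13
```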

F1-Score#

  • The harmonic mean punishes low rates.

\(F_1 = 2\cdot\frac{Precision\;\cdot\;Recall}{Precision\;+\;Recall}\)


Precision   Recall   F1-Score
5 %         50 %     9 %
90 %        90 %     90 %
30 %        60 %     40 %

Let’s take negatives into the equation#

  • The number of correct negative predictions is sometimes just as important

  • Spam vs. ham is just one example (email spam detection)

What proportion of actual negatives was predicted as positives?#

  • False Positive Rate

  • Here the FPR is: \(\frac{80}{80+14} = 85.11\%\)



\[FPR=\frac{FP}{TN+FP}\]
                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6

What proportion of actual negatives was predicted correctly?#

  • The True Negative Rate is also called specificity

  • Here the TNR is: \(\frac{14}{80+14} = 14.89\%\)

  • FPR = 1 - specificity



\[TNR=\frac{TN}{TN+FP}\]

                    Predicted
                    Negatives   Positives
Actual  Negatives   14          80
        Positives   0           6
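
A small sketch deriving FPR and TNR from the confusion-matrix counts of the after-tweaking example (scikit-learn is assumed only for `confusion_matrix`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 14 TN, 80 FP, 0 FN, 6 TP
y_true = np.array([0] * 94 + [1] * 6)
y_pred = np.array([0] * 14 + [1] * 80 + [1] * 6)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (tn + fp)   # False Positive Rate = 1 - specificity
tnr = tn / (tn + fp)   # True Negative Rate (specificity)

print(f"FPR: {fpr:.4f}")  # ≈ 0.8511
print(f"TNR: {tnr:.4f}")  # ≈ 0.1489
```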

Receiver Operating Characteristic Curve (ROC Curve)#

TPR vs. FPR plotted for different thresholds

  • The 45° line is equivalent to flipping a coin

  • If all positives are correctly predicted and no negative is incorrectly predicted, the ROC curve hugs the top-left corner

  • Aim: a ROC curve as close as possible to the point (0, 1)

ROC and the Area Under the Curve (ROC AUC)#

A metric to compare different classifiers (see the sketch below)

  • Random classifier:
    → ROC AUC is 0.5
    → ROC curve is on the 45° line

  • Perfect classifier:
    → ROC AUC is 1
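
A minimal sketch of computing the ROC curve and ROC AUC with scikit-learn and plotting them against the 45° line; the toy data set and model are made-up placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up toy data and model
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)

plt.plot(fpr, tpr, label=f"classifier (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```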

Explanation of ROC curve#

(animated figure: cutoff.gif, not reproduced here)

Imbalanced classes#


Going beyond aggregated metrics#

All the performance metrics we’ve seen today are aggregated metrics.

They help determine whether a model has learned well from a dataset or needs improvement.

Next step:
Examine the results and errors to understand why and how the model is failing or succeeding.

Why: validation and iteration

Performance metrics can be deceptive: on highly imbalanced datasets, a classifier can reach very high accuracy without any predictive power.

Validate your model → inspect how it is performing#

There are a lot of ways to do this. You want to contrast the data (target and/or features) with the predictions.

Regression:
Look at the residuals, for example by doing EDA on them and inspecting the outliers.

Classification:
One can start with a confusion matrix, breaking the results down by true class and prediction.
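
A minimal sketch of both ideas, assuming scikit-learn, pandas, and matplotlib; the data sets and models are made-up placeholders, not the ones discussed here:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay

# Regression: EDA on residuals (toy data, placeholder model)
X_reg, y_reg = make_regression(n_samples=300, noise=10, random_state=0)
residuals = y_reg - LinearRegression().fit(X_reg, y_reg).predict(X_reg)
print(pd.Series(residuals).describe())   # summary stats, spot outliers
plt.hist(residuals, bins=30)
plt.title("Residual distribution")
plt.show()

# Classification: break the results down with a confusion matrix
X_clf, y_clf = make_classification(n_samples=300, weights=[0.9], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
ConfusionMatrixDisplay.from_estimator(clf, X_clf, y_clf)
plt.show()
```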

Resources#

roc-auc-precision-and-recall-visually-explained

Building Machine Learning Powered Applications - Emmanuel Ameisen