Designing Data Products#

Why would we build a model?#

  • Exploratory analysis - understand what happened in the past

  • Predictive analysis - predict what will happen

  • Predict what, for whom and for what purpose?

Note: You do not always need an ML model.

Product = Customer x Business x Technology#

  • Usability

  • Business viability

  • Feasibility

Value = product of the three (if any one of them is zero, the value is zero too).

Measuring success#

The first model you build should be the simplest model that could address the product needs.

Business performance: usually measured by one KPI (key performance indicator)

Model performance: an offline metric that captures how well the model will fit the business need

Note: The business metric is independent of the model metric: it is a measure of the product's success.

Business performance vs. model performance#

Examples of measuring business performance#

Business metrics:

  • Click-through rate (CTR) - for recommenders

  • Usage - for a model that generates HTML from hand-drawn diagrams

  • Adoption by the finance team - for an internal revenue-forecasting model

Note: The business metrics for your model might be impossible to measure before the model goes live!

Examples of measuring model performance#

Regression:

  • RMSE, RMSLE

  • MAPE (mean absolute percentage error) - expresses the error as a ratio of the true value

Classification:

  • Accuracy

  • Precision

  • Recall

Custom metric: based on the worst-case scenarios of your product.

Note: If you need to present to stakeholders, you need a simple metric!
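
As a minimal sketch of how these metrics are computed (assuming scikit-learn; the arrays are made-up illustrations, not real data):

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_percentage_error,
    mean_squared_error,
    mean_squared_log_error,
    precision_score,
    recall_score,
)

# Hypothetical regression targets and predictions
y_true = np.array([100.0, 250.0, 80.0, 300.0])
y_pred = np.array([110.0, 240.0, 100.0, 280.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))  # penalizes ratios, not absolute gaps
mape = mean_absolute_percentage_error(y_true, y_pred)     # error as a ratio of the true value

# Hypothetical binary classification labels and predictions
c_true = [1, 0, 1, 1, 0, 0, 1, 0]
c_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(c_true, c_pred)    # share of correct predictions
prec = precision_score(c_true, c_pred)  # of predicted positives, how many were right
rec = recall_score(c_true, c_pred)      # of actual positives, how many we caught

print(f"RMSE={rmse:.2f}  RMSLE={rmsle:.3f}  MAPE={mape:.1%}")
print(f"accuracy={acc:.2f}  precision={prec:.2f}  recall={rec:.2f}")
```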

Relationship between business performance & model performance#

Thinking of the business value of your model and the cost of being wrong can help you choose the right model metric.

Always start from the value!

Error Analysis#

Remember the Summary vs details?#

Going beyond aggregated metrics#

  • Most model performance metrics we’ve seen are aggregated metrics

  • They help determine whether a model has learned well from a dataset or needs improvement

  • Next step: examine results and errors to understand why and how the model is failing or succeeding

Why: validation and iteration

Note: Performance metrics can be deceptive: on highly imbalanced datasets, a classifier can reach very high accuracy without any predictive power.
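
A toy illustration of that note (the labels are synthetic): a classifier that always predicts the majority class reaches ~99% accuracy while catching zero positives.

```python
import numpy as np

# Made-up, highly imbalanced labels: ~1% positives (think fraud detection)
rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" with zero predictive power: always predict the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()  # fraction of actual positives we caught

print(f"accuracy={accuracy:.3f}  recall={recall:.3f}")  # ~0.99 accuracy, 0.0 recall
```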

Types of supervised learning#


Validate your model - inspect how it is performing#

There are lots of ways to do this. You want to contrast the data (target and/or features) with the predictions.

Regression: look at the residuals, e.g. do EDA on the residuals and inspect the outliers

Classification: start with a confusion matrix, breaking the results down by true class and prediction

Confusion Matrix for classification#

Counts how often the model predicted correctly and how often it got confused.

  • False Positive: false alarm / type I error

  • False Negative: missed detection / type II error

                      Predicted Negatives   Predicted Positives
  Actual Negatives    TN                    FP
  Actual Positives    FN                    TP

What do the misclassified examples have in common?
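
One way to build this table and pull out the misclassified examples for inspection; a sketch assuming scikit-learn, with a bundled dataset standing in for your own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Illustrative dataset; swap in your own features and target
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels, ravel() unpacks the matrix in TN, FP, FN, TP order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")

# Pull out the misclassified rows: what do they have in common?
mistakes = X_test[y_pred != y_test]
print(mistakes.describe())
```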

Residual analysis for regression#

  • This is like EDA again, but on the residuals (observed - predicted)

  • Plot residuals and/or standardized residuals vs. predicted values

  • We want the residuals to show no pattern, to be symmetrically distributed, and to be centered around zero

  • If not… then there is room for improvement in the model.

What if my residuals look like this? Check out this walkthrough
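
A minimal residuals-vs-predicted plot, assuming matplotlib and scikit-learn (the bundled diabetes dataset stands in for your own data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative dataset; use your own features and target
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred  # observed minus predicted

# We want a symmetric, patternless cloud centered around zero
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.title("Residuals vs. predicted")
plt.show()
```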


Resources#

https://svpg.com/what-is-a-product/
RMSE vs. RMSLE: https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a
Building Machine Learning Powered Applications - Emmanuel Ameisen
https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
https://www.scikit-yb.org/en/latest/api/regressor/residuals.html

Example of EDA with error analysis
https://www.kaggle.com/elitcohen/forest-cover-type-eda-modeling-error-analysis#Error-Analysis
https://www.kaggle.com/pestipeti/error-analysis
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

ML Project Topics#

Kickstarter Project Success#

Analyse and model the success factors of Kickstarter campaigns. Give new projects an idea of what is needed for successful funding, and potentially even predict campaign success upfront.

  • 221,811 rows of data on campaigns

  • (medium)


Tanzania Tourism Prediction#

Can you use tourism survey data and ML to predict how much money a tourist will spend when visiting Tanzania?

  • Survey data from 6,476 participants

  • (easy/medium)

Zindi-Tanzania-Tourism

Fraud Detection Challenge in Electricity and Gas Consumption#

  • Based on clients' billing history, detect clients involved in fraudulent activities

  • (medium/advanced)

Fraud Detection Challenge

Urban Air Pollution Challenge#

Predict air quality levels and empower communities to plan and protect their health

  • Weather data and daily observations from the Sentinel-5P satellite, tracking various pollutants in the atmosphere

  • (medium/advanced -> domain knowledge helpful)

Air Pollution Challenge

Flight Delay Prediction Challenge#

Predict airline delays for the Tunisian aviation company Tunisair

  • Data on flight delays. Can be combined with airport locations

  • (medium)

Flight Delay Prediction Challenge

Financial Inclusion in East Africa#

Can you predict who in East Africa is most likely to have a bank account?

  • Survey data on financial inclusion of ~33,600 participants

  • (easy/medium)

Financial Inclusion in East Africa