Exploratory Data Analysis

Exploratory Data Analysis#

Table of Contents#

The EDA Process put into context
Types of EDA
- uni-/bi-/multi-variate
- numerical vs. graphical
- continuous vs. categorical data
Measures of central tendency / dispersion
Distribution functions / histograms / modality
Correlations
Special considerations for categorical / discrete variables
Summary / References

The EDA Process#

Research / Business Questions#

Research or business questions arise when a researcher wonders how some part of reality works, as reflected in the data. They are phrased as questions.

Examples:

Which factors can increase ice cream sales?
Can we predict skin cancer using photographs of the melanoma?
How does the material composition of a bridge affect its durability?

Hypothesis Generation#

Hypotheses are assumptions or educated guesses we make about the data, using our domain knowledge.
You can phrase a hypothesis as “if ..., then ...” or “the more ..., the ...”.
A hypothesis is formed as a measurable (operationalisable) statement you can validate by looking at data.
A research question can have multiple hypotheses attached to it.

Examples:

If the sun is shining then ice cream sales increase.
The larger the melanoma, the greater the risk that it is malignant.
The cheaper the material composition of a bridge, the shorter its durability.

Hypothesis Generation != Hypothesis testing#

Generating a hypothesis is only an educated guess. The hypothesis can still turn out to be true or false after EDA and hypothesis testing (is the observed result due to chance or not?).

Hypothesis testing

Where does EDA belong in the bigger picture?#

Black cats and domain experts#

The hardest thing of all is to find a black cat in a dark room, especially if there is no cat.

→ Talk to domain experts or become one.

What is EDA and why do we do it?#

What is EDA?#

→ Detective work

EDA is the process of summarizing and visualising important characteristics of the data in order to gain insights.

It’s an approach/process not a set of techniques.

Tools:

Any method of looking at data that does not include formal statistical modelling and inference
Visualisation

Note:

Confirm the expected or show the unexpected!

Goals and Benefits#

Understand each variable
Get insight into relationships between the variables
Draw valid conclusions
Checking assumptions
Aid in decision making and planning
Help in causal analysis

→ To build intuition about the data and gain insights

→ To generate and corroborate or reject hypotheses

Types of EDA#

Univariate or multivariate
Graphical or non-graphical
Dealing with both categorical and numerical data

Estimates#

We rarely have the money or resources to measure everything
So we work with samples, which we hope are representative of the whole population
The more data we have the more confident we are in the estimates
Ideally: the results drawn from our experiments are reproducible

Note:

In data science most metric names omit the word “estimate”. Nevertheless, almost all metrics we use are estimates.

Univariate vs. Multivariate Analysis#

Univariate Analysis:

Analysis of a single variable (often called feature)
Characterises data by focusing on distribution, central tendency, dispersion, etc.
Represents information numerically and visually

Multivariate Analysis:

Simultaneous analysis of multiple variables
Examines how changes in one variable are associated with changes in others
Characterises dependence by a numerical coefficient

→ Description and understanding of individual variables

→ Understanding of the relationship and interaction among multiple variables

Data Types#

Univariate EDA#

Describing Central Tendency#

Mean#

sum of data points divided by number of data points
often used as default
sensitive to extreme values

What is the mean in this example?

id	y
1	2
2	6
3	6
4	8
5	10

\[\bar{y}=\frac{1}{n} \sum\limits_{i=1}^{n}y_{i}\]
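As a quick check of the formula, the following minimal Python sketch computes the mean of the example values and shows how a single extreme value shifts it. The helper name `mean` and the modified data set `y_extreme` are illustrative assumptions, not part of the original example.

```python
def mean(values):
    """Arithmetic mean: sum of the data points divided by their count."""
    return sum(values) / len(values)


y = [2, 6, 6, 8, 10]            # column y of the example table
y_extreme = [2, 6, 6, 8, 100]   # hypothetical: last value replaced by an extreme one

mean_y = mean(y)                # 32 / 5
mean_extreme = mean(y_extreme)  # 122 / 5

print(mean_y, mean_extreme)
```

One changed value moves the mean from 6.4 to 24.4, which is what "sensitive to extreme values" means in practice.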

Median#

value in the middle of a data series ordered by size
more robust against extreme values but computationally more expensive since values need to be sorted

What is the median in this example?

id	y
1	2
2	6
3	6
4	8
5	10

\[\begin{split} y_{i} \leq y_{i+1}\;\text{for}\;i=1,...,n-1 \\ y_{\frac{n+1}{2}}\;\text{if}\;n\;\text{is odd} \\ \frac{1}{2}\big(y_{\frac{n}{2}}+y_{\frac{n}{2}+1}\big)\;\text{if}\;n\;\text{is even} \end{split}\]
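The median rule can be sketched in a few lines of Python: sort the values, then take the middle one for odd n or the average of the two middle ones for even n. The function name and the extra test lists are assumptions made for illustration.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)          # sorting is the expensive step
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # y_{(n+1)/2} in 1-based notation
    # y_{n/2} and y_{n/2 + 1} in 1-based notation
    return (ordered[mid - 1] + ordered[mid]) / 2


median_y = median([2, 6, 6, 8, 10])           # odd n, the example table
median_even = median([2, 6, 8, 10])           # hypothetical even-sized sample
median_extreme = median([2, 6, 6, 8, 100])    # hypothetical extreme value

print(median_y, median_even, median_extreme)
```

Replacing 10 by 100 leaves the median unchanged, which illustrates its robustness against extreme values.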

Mode#

most frequent values
not necessarily unique
mostly used for categorical data

What is the mode in this example?

id	y
1	2
2	6
3	6
4	8
5	10
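Because the mode need not be unique, a helper that returns all most frequent values is convenient. The sketch below uses `statistics.multimode` from the Python standard library; the list of colours is a made-up categorical example.

```python
from statistics import multimode

y = [2, 6, 6, 8, 10]                                  # column y of the example table
colours = ["red", "blue", "red", "blue", "green"]     # hypothetical categorical data

modes_y = multimode(y)              # a single most frequent value
modes_colours = multimode(colours)  # two values tie for most frequent

print(modes_y, modes_colours)
```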

Describing the Spread#

Range#

difference between largest and smallest value
measures the spread of the data
sensitive to outliers

Which dataset has the larger range?

Quantiles#

quantiles split sorted data into parts with an equal number of observations
- quartiles: split data into 4 parts
- deciles: split data into 10 parts
- percentiles: split data into 100 parts

Note: The positions of the percentiles are not equidistant (they depend on the distribution)

Interquartile Range (IQR)#

width of interval that contains the middle 50% of the data
interval between the 25th and 75th percentile
interval between 1st and 3rd quartile
robust to outliers
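A minimal sketch of quartiles and the IQR, using the monthly sales values from the ice cream example in this document. It relies on `statistics.quantiles` with `method="inclusive"`, which is one of several interpolation conventions; choosing it here is an assumption, and other conventions give slightly different quartiles.

```python
from statistics import quantiles

# Monthly ice cream sales, taken from the ice cream example table
sales = [73, 57, 81, 94, 110, 124, 134, 139, 124, 103, 81, 80]

# n=4 returns the three cut points that split the data into quartiles
q1, q2, q3 = quantiles(sales, n=4, method="inclusive")

iqr = q3 - q1                            # width of the middle 50% of the data
value_range = max(sales) - min(sales)    # full spread, for comparison

print(q1, q2, q3, iqr, value_range)
```

The gap between the first quartile and the median differs from the gap between the median and the third quartile, so the quartile positions are not equidistant.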

Outliers#

No generally recognized formal definition for outlier
Values outside of the areas of a distribution that would commonly occur

Note:

Whether an outlier is good or bad depends on the data problem. For example, in anomaly detection the outliers are exactly what you want to keep.

Box Plots#

Variance & Standard Deviation#

Variance

average squared deviation of the values from the mean (dividing by \(n-1\) for a sample): \( s^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i-\bar{x})^2}\)

Standard deviation

square root of the variance: \(s = \sqrt{s^2}\)
typical distance between a data point and the mean
has the same unit as the original data

Neither is robust to outliers.
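The sample variance and standard deviation can be computed by hand and compared with the standard library. This is a sketch under the assumption that the data are a sample (hence the division by n - 1); the function name `sample_variance` is illustrative.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


y = [2, 6, 6, 8, 10]       # example values used earlier in this section

var_y = sample_variance(y)
sd_y = math.sqrt(var_y)    # same unit as the original data

# statistics.variance and statistics.stdev use the same n - 1 convention
print(var_y, statistics.variance(y))
print(sd_y, statistics.stdev(y))
```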

degrees of freedom

Describing the Distribution#

Skewness & Kurtosis#

Skewness

degree of asymmetry of the distribution of the data

Kurtosis

degree of pointiness and tail heaviness relative to a normal distribution (flat vs. pointy)
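Both measures can be expressed through central moments. The sketch below uses the simple moment-based definitions (third and fourth standardised moments, with 3 subtracted so that a normal distribution has excess kurtosis 0). Statistical libraries often apply additional bias corrections, so their results may differ slightly; the function names and data sets are assumptions.

```python
def central_moment(values, k):
    """k-th central moment: mean of (x - mean)^k."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Third standardised moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardised moment minus 3 (the value of a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric sample
right_skewed = [1, 1, 1, 2, 10]    # hypothetical sample with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```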

Data Distributions#

Uniform
all outcomes are equally likely, e.g. the outcome of a die roll
Bernoulli
two possible outcomes, e.g. a single coin toss
Binomial
number of successes in a fixed number of independent Bernoulli trials, e.g. tossing two coins 100 times: the probability of getting two heads a certain number of times; for many trials its shape resembles a “discrete version” of the normal distribution
Normal
many continuous real-valued variables in nature approximately follow this distribution
Poisson
the number of events occurring at random points in time or space within a fixed interval
Exponential
the waiting time between such events
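To make two of the discrete distributions concrete, the sketch below evaluates their probability mass functions with the standard library only. It assumes fair coins, so the probability of two heads in one toss of two coins is 0.25; the function names are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Tossing two fair coins 100 times: how often do both show heads?
p_two_heads = 0.5 * 0.5
pmf = [binomial_pmf(k, 100, p_two_heads) for k in range(101)]
most_likely_count = max(range(101), key=lambda k: pmf[k])

print(most_likely_count, pmf[most_likely_count])
print(poisson_pmf(0, 2.0))   # chance of no event when 2 are expected
```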

Visualising Data Distributions with Histograms#

Box Plots#

What do we see here?

Binwidth#

Binwidth matters!
Same data with bin width = 5, 2, 1

How to choose number of bins#

Rule of thumb: Take the square root of the sample size as the number of bins for a first guess (trial and error).

Consider:

bins too wide → we lose detail, in the extreme ending up with a single bin
bins too narrow → we see too much noise, in the extreme ending up with one bin per observation
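The square-root rule of thumb can be sketched as follows, using the monthly sales values from the ice cream example as data. Rounding the square root up and using equal-width bins are assumptions of this sketch; the result is only a first guess to be refined by trial and error.

```python
import math


def sqrt_rule_bins(n_observations):
    """First guess for the number of bins: square root of the sample size, rounded up."""
    return max(1, math.ceil(math.sqrt(n_observations)))


def histogram_counts(values, n_bins):
    """Count observations in n_bins equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    width = span / n_bins if span > 0 else 1.0
    counts = [0] * n_bins
    for v in values:
        # the maximum value would fall just outside the last bin, so clamp it
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


sales = [73, 57, 81, 94, 110, 124, 134, 139, 124, 103, 81, 80]
n_bins = sqrt_rule_bins(len(sales))
counts = histogram_counts(sales, n_bins)

print(n_bins, counts)
```

Calling `histogram_counts(sales, 1)` or `histogram_counts(sales, 100)` shows the two extremes: a single bin, or mostly empty bins with at most a few observations each.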

Comparing Data Distributions with Histograms#

Histograms:

Good for looking at residuals (variance)
Works best for comparing max 3-4 groups
You can use so-called kernel density estimates (KDE) to plot it continuously

Modality#

Plot a histogram and look at the number of peaks in the distribution

Data Summaries#

Central tendency	Spread	Modality	Shape
mean	range	unimodal	skewness
median	interquartile range	bimodal	kurtosis
mode	variance	multimodal
quantiles	standard deviation	uniform

Multivariate EDA#

Numerical Data#

Scatter Plot#

Used to visualise relationship between two numeric variables
Also called correlation plots
Can encode multiple dimensions by color and size

It visually answers the question:

→ “How are these variables related?”

→ “When variable X grows, what happens to variable Y? With which intensity?”

Pearson correlation coefficient / Pearson’s r#

Measures the linear relationship between two variables
Ranges between -1 and 1

→ close to 1: strong positive linear relationship

→ around 0: no linear relationship

→ close to -1: strong negative linear relationship

guess the correlation

The ice cream example#

Month	Average Temp	Sales
January	4	73
February	4	57
March	7	81
April	8	94
May	12	110
June	15	124
July	16	134
August	17	139
September	14	124
October	11	103
November	7	81
December	5	80

The ice cream example#

r = 0.983
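The reported coefficient can be reproduced from the table with a short, dependency-free sketch. The function name `pearson_r` is an assumption; the data are the temperature and sales columns of the ice cream example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# January to December, from the ice cream example table
temperature = [4, 4, 7, 8, 12, 15, 16, 17, 14, 11, 7, 5]
sales = [73, 57, 81, 94, 110, 124, 134, 139, 124, 103, 81, 80]

r = pearson_r(temperature, sales)
print(round(r, 3))
```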

Note:

A correlation analysis may establish a linear relationship, but it does not allow us to predict the value of one variable given another. Regression analysis allows us to do this and more.

Spearman rank correlation coefficient - Spearman’s ρ#

Measure of rank correlation: it is based on the ranks of the values rather than on the raw data
Represents the strength of a monotonic relationship

Monotonic function:

→ increasing: as X increases Y never decreases

→ decreasing: as X increases Y never increases
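One way to compute Spearman's ρ is to replace each value by its rank and then apply the Pearson formula to the ranks. The sketch below does exactly that, giving tied values their average rank; the function names and the cubic example data are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1     # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]    # monotonically increasing, but not linear

print(pearson_r(x, y), spearman_rho(x, y))
```

For this strictly increasing but non-linear relationship, Spearman's ρ is 1 while Pearson's r stays below 1.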

Consideration about correlation#

If two variables are independent, their correlation is 0, but a correlation of 0 does not imply that two variables are independent!
The correlation coefficients cannot replace visual examination of data.
The presence of correlation is not enough to infer causation!
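The first point can be checked with a tiny constructed example: y is completely determined by x, yet the covariance (and therefore the correlation) is zero because the relationship is symmetric rather than linear. The data are hypothetical.

```python
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]     # y depends on x exactly: y = x^2

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sample covariance; the correlation is this value divided by both standard deviations
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

print(covariance)
```

A scatter plot of these points would reveal the parabola immediately, which is why the coefficient cannot replace visual examination.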

Bivariate and Multivariate Analysis#

Looking at all possible combinations of features:

for 9 features bivariate would mean 36 combinations: \(\sum_{i=1}^{8} i\)
How do we reduce the exploration space or focus on interesting combinations?
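The count of 36 pairs can be verified by enumerating them. The feature names below are placeholders invented for this sketch.

```python
import math
from itertools import combinations

features = [f"feature_{i}" for i in range(1, 10)]   # 9 hypothetical feature names

pairs = list(combinations(features, 2))             # every unordered pair once

print(len(pairs))          # number of bivariate combinations
print(math.comb(9, 2))     # same count as a binomial coefficient
print(sum(range(1, 9)))    # same count as the sum 1 + 2 + ... + 8
```

The number of pairs grows quadratically with the number of features, which is why some way of prioritising combinations is needed.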

Correlation Matrix

Special consideration of discrete / categorical data#

mode
frequency tables: number of times a value occurs
expected values: weighted mean when categories can be associated with numerical value

Frequency tables#

Tabulation of the frequencies
Show the range of values and frequency of occurrence
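A frequency table can be built with `collections.Counter` from the standard library. The list of flavours is made-up data for illustration.

```python
from collections import Counter

# Hypothetical categorical observations
flavours = ["vanilla", "chocolate", "vanilla", "strawberry", "vanilla", "chocolate"]

counts = Counter(flavours)                                    # absolute frequencies
relative = {k: c / len(flavours) for k, c in counts.items()}  # relative frequencies

for flavour, count in counts.most_common():                   # most frequent first
    print(f"{flavour}\t{count}\t{relative[flavour]:.2f}")
```

The first row of the sorted table is the mode of the variable.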

Expected values#

weighted mean

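Example: offers for different course plans. For financial purposes, we can sum up the possible outcomes in a single “expected value”, a form of weighted mean in which the weights are probabilities. In the calculation below, the values 300, 50 and 0 are weighted by the probabilities 0.05, 0.15 and 0.80.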

\(EV = 0.05*300 + 0.15*50 + 0.80*0 = 22.5\)
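The same calculation as a small Python sketch, with the probabilities and values taken from the formula above. The variable names are assumptions.

```python
import math

# (probability, value) pairs from the expected value formula
outcomes = [(0.05, 300), (0.15, 50), (0.80, 0)]

# The weights must form a probability distribution
total_probability = sum(p for p, _ in outcomes)
assert math.isclose(total_probability, 1.0)

expected_value = sum(p * value for p, value in outcomes)
print(expected_value)
```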

Cross-tabulation#

Summary / Outlook#

EDA is like a detective’s investigation to

understand the data
identify patterns

Why do we want to know our data? Because we want to find out

how to answer our research / business question
whether the data is suitable / sufficient
how to answer the research questions with the existing data
how to phrase/refine our hypotheses

Summaries vs. Details…#
