Distributions

Distributions#

Normal vs Paranormal#

Distributions Functions

MemoStats Game#

Let’s find paired cards in the MemoStats Game !!!

you uncover a card
explain what you see
if you find the paired card, you continue to uncover cards
otherwise, a new player takes your turn

Types of Data#

Quantitative: numerical data that can be measured (e.g., height, weight, temperature)
Qualitative: categorical data that describes characteristics or qualities (e.g., colors, names, types)

Categorical Data#

Nominal: categories without a specific order (e.g., colors, types of fruit)
Ordinal: categories with a specific order (e.g., rankings, satisfaction levels)
Distribution: often shown as a bar chart or pie chart

Numerical Data#

Discrete: countable data (e.g., number of students, number of cars, roll of a die)
Continuous: measurable data (e.g., height, weight, temperature)
Distribution: often shown as a barchart, histogram, probability density functions (PDFs) or line charts

Why Data Types Matter?#

Analysis: Different data types require different statistical methods and probabilty distributions.
Visualization: Choosing the right chart or graph depends on the data type.

Intro to (Probability) Distributions#

Distribution: Shows how values in a dataset are spread out. It tells us how often each value occurs.
Probability Distribution: A mathematical function that describes the likelihood of different outcomes.
Types of Distributions:
- Discrete: Used for countable outcomes (e.g., number of students in a class).
- Continuous: Used for measurable outcomes (e.g., height, weight, temperature).

Probability Mass Function (PMF)#

It gives the probability of a specific outcome for discrete random variables.
It is defined as: $$ P(X = x) = p(x) $$

where $p(x)$ is the PMF of the random variable $X$.
Example: If you roll a die 🎲, the PMF gives the probability of rolling each number (1 to 6).
PMF is often represented as a bar chart.

Probability Density Function (PDF)#

It describes the density of the probabilty for continuous random variables.
Probaility is represented as an area under the curve over a range of values.
It is defined as: $$ P(a < X < b) = \int_{a}^{b} f(x) \, dx $$

where $f(x)$ is the PDF of the random variable $X$.
Example: What is the probability that a randomly chosen person’s height is between 160 cm and 180 cm?

Cumulative Distribution Function (CDF)#

It gives the cumulative probability that a random variable (either discrete or continuous) $X <= \textbf{certain value x} $
It is defined as: $$ F(x) = P(X \leq x) $$

where $F(x)$ is the CDF of the random variable $X$.
Example: What’s the probability that time between arrivals of emails is less than 5 minutes?

Distribution Examples#

Binomial Distribution 🪙#

It describes the number of successes in a fixed number of indipendent trials/experiments, where each trial has two possible outcomes (success or failure).
It is defined as: $$ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} $$

where:
- $\binom{n}{k}$ is the binomial coefficient
- $n$ is the number of trials
- $k$ is the number of successes
- $p$ is the probability of success in each trial (0.5 for a fair coin)
Example: Flipping a coin 10 times and counting how many times it lands on heads (success).

Poisson Distribution 📧#

It describes the number of events that occur within a fixed interval of time or space, given that events happen independently and at a constant average rate.
It is defined as: $$ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} $$

where:
- $\lambda$ is the average rate of events in the interval
- $k$ is the number of events
- $!$ is the factorial function
Example: The number of emails received in an hour, where the average rate is 5 emails per hour.

Uniform Distribution 🎲#

It describes a situation where all outcomes are equally likely within a range.
It is defined as: $$ P(X = x) = \frac{1}{b - a} $$

where:
- $a$ is the minimum value
- $b$ is the maximum value
- $x$ is any value between $a$ and $b$
Example: Rolling a fair die, where each face (1 to 6) has an equal probability of 1/6.

Normal Distribution 📈#

It is a bell-shaped curve where most of data points cluster around the mean, with fewer points as you move away from the mean.
It is defined as: $$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$

where:
- $\mu$ is the mean (average)
- $\sigma$ is the standard deviation (spread of the data)
Example: Heights of people in a population, IQ scores, or errors.
Applications: foundation for many statistical methods, including hypothesis testing and confidence intervals.

Why Distributions matter?#

Help us understand the underlying patterns in data.
Help us in identifying outliers and anomalies.
Guide us in selecting appropriate statistical methods for analysis, modeling

Confidence Intervals#

What is a Confidence Interval?#

Confidence Interval (CI) is a range of values constructed from a sample data so that we can be reasonably sure (usually 95% confidence level) that the true population parameter lies somewhere in that range.

Example: Height#

Let’s say you’re estimating the average height of adult women in a city based on a sample, and you compute:

CI = [162 cm, 168 cm] with 95% confidence

We are 95% confident that the true average height of all adult women in that city is between 162 cm and 168 cm.

✅ What it does mean:

If we repeated this sampling process many times, and built a 95% CI from each sample:

About 95% of those intervals would contain the true population parameter.

The interval reflects the precision of our estimate — wider means more uncertainty; narrower means more precise.

❌ What it does NOT mean:

It does not mean there’s a 95% chance the true value is inside this specific interval — the true value is fixed; the interval is random.

It’s about long-run performance of the method, not a probability about this one interval.

How can we define this range of values?#

A CI is tipically defined as: $$ \text{CI} = \text{Point Estimate} \pm \text{Margin of Error} $$

Point Estimate: sample statistic (e.g., sample mean, sample proportion)
Margin of Error (MoE): quantifies the uncertainty in the estimate, how much we expect the sample statistic to be off from the true population parameter.

Methods to Calculate Margin of Error (MoE)#

1. Z-Score Method for the Mean

MoE = z* × (σ / √n)

Use when:

Population standard deviation (σ) is known.
Sample size (n) is large (n ≥ 30).

Critical value z*:

From the standard normal distribution for the desired confidence level (e.g., 1.96 for 95% confidence).

2. T-Score Method for the Mean

MoE = t* × (s / √n)

Use when:

Population standard deviation (σ) is unknown.
Sample size (n) is small (n < 30).

Use sample standard deviation (s)

Critical value t*:

From the t-distribution for the desired confidence level and degrees of freedom (df = n - 1).

Z vs T Critical Values Comparison Table

Confidence Level	Z-Score (z*)	T-Score (df = 15)
90%	1.645	1.753
95%	1.960	2.131
99%	2.576	2.947

Bootstrapping for calculating CI:#

One way to define this range of values is the so called Bootstrapping Method.