Hypothesis Testing And A/B Testing#

Intro#

Is a Waterfall View Really Worth More?#

Common Belief:
"Houses with a waterfall view are more expensive than houses without one."
[Figure: a house with a waterfall view next to a house without one]
Which one feels more expensive?

The Problem#

Just seeing a difference in price isn’t enough.

We need to ask:

  • Is the difference real?

  • Or could it just be due to random chance?

Hypothesis#

What is a Hypothesis?#

A hypothesis is an assumption about an expected association.
Our goal is to either reject or retain this hypothesis.
We can test the hypothesis with data.
The analysis of the data is done with a hypothesis test.

Example:
Houses with a waterfall view are more expensive than houses without.
👍 👎
We collect a sample of 1,000 houses in King County.
We pick an appropriate hypothesis test.

Research process

  1. Topic

  2. Research Question

  3. Hypotheses

  4. Collect Data

  5. Hypothesis Test

  6. Conclusion

How Do We Formulate a Hypothesis?#

🔍 Research Question

  • Asks whether there's an effect.
    e.g., "Are homes near a waterfall more expensive?"
  • Broader and more exploratory in scope.
  • May not be directly testable as-is; might need to be broken into hypotheses.

💡 Hypothesis

  • Provides a specific, testable prediction.
    e.g., "Homes near a waterfall have a higher average price per square meter."
  • Narrow and focused on what you expect to find.
  • Designed to be testable through statistical methods.

What is the Null and Alternative Hypothesis?#

💡 Hypothesis

Ask yourself:

  • What will change?

  • How will it change?

  • What will cause the change?

🔀 Two possible Hypotheses

Null Hypothesis (H₀):

There is no difference or effect between two or more groups with respect to a characteristic.

Any observed difference is due to chance.

Example:
The price of homes with and without a waterfall view does not differ.

Alternative Hypothesis (H₁):

There is a difference or effect between two or more groups.

Example:
The price of homes with and without a waterfall view does differ.

In a hypothesis test, only H₀ is tested directly. The goal is to decide whether the data give enough evidence to reject it.
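
The idea behind H₀ (any observed difference is due to chance) can be made concrete with a permutation test. This is one illustrative choice, not a method prescribed by these notes: if H₀ is true, the labels "with view" and "without view" are interchangeable, so we shuffle them many times and see how often chance alone produces a difference as large as the observed one. The prices below are made up for illustration, and the function name is hypothetical.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (mean of A minus mean of B) and an
    estimated p-value.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Under H0 the labels carry no information, so reshuffle them.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical sale prices in thousands of dollars (not real King County data).
with_view = [720, 810, 650, 905, 780, 840, 760, 695]
without_view = [540, 610, 580, 700, 495, 630, 560, 605]

observed_diff, p_value = permutation_test(with_view, without_view)
print(f"Observed difference in means: {observed_diff:.1f}")
print(f"Permutation p-value: {p_value:.4f}")
```

A small p-value here means that shuffled labels rarely produce a gap as large as the observed one, which is evidence against H₀.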

Types of Hypotheses#

🔄 Non-Directional (Two-Tailed) Hypothesis

States that there is a difference, but does not specify the direction of the difference.

Example (Difference): "There is a difference in average prices between homes with and without waterfall views."

Example (Correlation): "There is a relationship between house size and sale price."

This type of hypothesis is often used when the researcher does not have a specific expectation about the direction of the difference.

➡️ Directional (One-Tailed) Hypothesis

States that there is a difference and specifies the direction of the difference.

Example (Difference): "Homes with waterfall views have higher average prices than those without."

Example (Correlation): "Larger houses are positively correlated with higher sale prices."

This type of hypothesis is used when the researcher has a specific expectation about the direction of the difference.

Hypothesis Testing#

Purpose and Limitations of Hypothesis Testing#

  • ✅ A hypothesis test is used to test an assumption about a population using information from a sample.
  • 🎯 The goal is to determine whether there is enough evidence to reject the null hypothesis or retain it.
  • ⚠️ A test never rejects (or retains) the null hypothesis with absolute certainty.
  • 🔄 There is always some probability of rejecting the null hypothesis even though it is actually true (a false positive).
  • 📏 The largest false-positive probability we are willing to accept is called the level of significance α (commonly 5%).
  • 📊 If the p-value is less than or equal to α, we reject the null hypothesis in favor of the alternative hypothesis.
[Figure: When to use hypothesis testing]
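
The role of α as a false-positive rate can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: both groups are drawn from the same normal population (so H₀ is true by construction), the population standard deviation is treated as known, and a two-sided z-test is used. The function name and all numbers are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(alpha=0.05, n_experiments=4_000, n=30, seed=1):
    """Fraction of simulated experiments that reject H0 although H0 is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    sigma = 15.0  # assumed known population standard deviation
    standard_error = sigma * (2 / n) ** 0.5
    rejections = 0
    for _ in range(n_experiments):
        # Both samples come from the same population, so H0 holds.
        a = [rng.gauss(100, sigma) for _ in range(n)]
        b = [rng.gauss(100, sigma) for _ in range(n)]
        z = (sum(a) / n - sum(b) / n) / standard_error
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value <= alpha:
            rejections += 1
    return rejections / n_experiments


rate = false_positive_rate()
print(f"Share of true null hypotheses rejected: {rate:.3f}")
```

The printed share should land close to the chosen α, which is exactly what "level of significance" means.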

One-Tailed (Directional) Hypothesis Test#

Purpose:
Test for a difference in one direction only.

How it works:

  • If your sample result falls in the critical region, you reject the null hypothesis in favor of the alternative hypothesis.
  • If it does not, you don't reject the null hypothesis (there is not enough evidence to support the alternative hypothesis).

Example:
A real estate company claims homes near a waterfall are more expensive.

Hypotheses:

  • Null Hypothesis (H₀): μ₁ ≤ μ₂ (True mean price near waterfall is less than or equal to that of homes not near)
  • Alternative Hypothesis (H₁): μ₁ > μ₂ (True mean price near waterfall is greater than that of homes not near)

[Figure: One-tailed test]

| Sample | Sample Mean (Near Water) | Sample Mean (Not Near Water) | Difference |
|---|---|---|---|
| 1 | 98 | 106 | -8 |
| 2 | 101 | 104 | -3 |
| 3 | 100 | 100 | 0 |
| 4 | 103 | 101 | +2 |
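
As a rough sketch of the one-tailed decision, the code below turns each difference from the table into an upper-tail p-value with a z-test. The notes do not give a standard error, so the value of 1.0 used here is purely an assumption, as is the choice of a z-test.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_tailed_p_value(diff, standard_error):
    """Upper-tail p-value for H1: mu_1 > mu_2, using a z-test."""
    z = diff / standard_error
    return 1 - STD_NORMAL.cdf(z)


alpha = 0.05
# The critical region starts at this z-value (upper tail only).
critical_z = STD_NORMAL.inv_cdf(1 - alpha)

assumed_se = 1.0  # hypothetical; not stated in the notes
for diff in [-8, -3, 0, 2]:  # differences from the table above
    p = one_tailed_p_value(diff, assumed_se)
    decision = "reject H0" if p <= alpha else "do not reject H0"
    print(f"difference {diff:+d}: p = {p:.4f} -> {decision}")
```

Only a sufficiently large positive difference falls in the critical region. Large negative differences give p-values near 1, because they point away from H₁.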

Two-Tailed (Non-Directional) Hypothesis Test#

Purpose:
Test for a difference in both directions.

How it works:

  • If your sample result falls in either critical region, you reject the null hypothesis in favor of the alternative hypothesis.
  • If it does not, you don't reject the null hypothesis (there is not enough evidence to support the alternative hypothesis).

Example:
A real estate company claims that prices of homes near a waterfall differ from prices of homes not near one, without saying in which direction.

Hypotheses:

  • Null Hypothesis (H₀): μ₁ = μ₂ (True mean price near waterfall is equal to that of homes not near)
  • Alternative Hypothesis (H₁): μ₁ ≠ μ₂ (True mean price near waterfall is different from that of homes not near)

[Figure: Two-tailed test]

| Sample | Sample Mean (Near Water) | Sample Mean (Not Near Water) | Difference |
|---|---|---|---|
| 1 | 98 | 102 | -4 |
| 2 | 101 | 99 | +2 |
| 3 | 100 | 100 | 0 |
| 4 | 103 | 97 | +6 |
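
The two-tailed version counts extreme results in both directions. The sketch below applies a two-sided z-test to the differences from the table. The standard error of 2.0 is an assumption for illustration only, since the notes do not state one.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_tailed_p_value(diff, standard_error):
    """Two-sided p-value for H1: mu_1 != mu_2, using a z-test."""
    z = diff / standard_error
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


alpha = 0.05
# alpha is split across both tails, so each critical region holds alpha / 2.
critical_z = STD_NORMAL.inv_cdf(1 - alpha / 2)

assumed_se = 2.0  # hypothetical; not stated in the notes
for diff in [-4, 2, 0, 6]:  # differences from the table above
    p = two_tailed_p_value(diff, assumed_se)
    decision = "reject H0" if p <= alpha else "do not reject H0"
    print(f"difference {diff:+d}: p = {p:.4f} -> {decision}")
```

Under this assumed standard error, both a large negative and a large positive difference lead to rejection, which is the defining feature of a two-tailed test.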

Understanding the p-value#

Purpose:
Used to decide whether to reject or retain the null hypothesis.

Definition:
The p-value is the probability (assuming the null hypothesis is true) of obtaining a sample result at least as extreme as the one observed.

Decision rule:

  • If p-value ≤ α (e.g., 0.05), reject H₀.
  • If p-value > α, do not reject H₀.

Notes:

  • Smaller p-values give stronger evidence against H₀.
  • The p-value is not the probability that H₀ is true.

[Figure: Sampling distribution with critical region and test statistic]

| Sample | Test Statistic | p-value | Decision (α = 0.05) |
|---|---|---|---|
| 1 | 2.10 | 0.041 | Reject H₀ |
| 2 | 1.50 | 0.141 | Do not reject H₀ |
| 3 | 2.85 | 0.006 | Reject H₀ |
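
To see the definition "at least as extreme as the one observed" at work, here is a small exact calculation. It is a sign-test style example chosen for illustration, not one of the tests used in these notes: suppose (hypothetically) that in 20 pairs of comparable homes the home with the view was the more expensive one 15 times. Under H₀ each home in a pair is equally likely to be the pricier one, so the count follows a Binomial(20, 0.5) distribution.

```python
from math import comb


def binomial_upper_tail(k_observed, n, p_null=0.5):
    """P(X >= k_observed) for X ~ Binomial(n, p_null)."""
    return sum(
        comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(k_observed, n + 1)
    )


# Hypothetical data, made up for this sketch.
n_pairs = 20
view_home_pricier = 15

alpha = 0.05
# One-sided p-value: results with 15, 16, ..., 20 pricier view homes
# are all "at least as extreme" as the observed 15.
p_value = binomial_upper_tail(view_home_pricier, n_pairs)
print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Do not reject H0")
```

The p-value is the summed probability of every outcome that is as extreme as, or more extreme than, the observed one, computed as if H₀ were true.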

Common Hypothesis Tests#

| Test | Purpose | When to Use | Type |
|---|---|---|---|
| One-Sample t-Test | Tests if the sample mean differs from a known value. | Compare average house price in one neighborhood to a known county average. | Parametric |
| Two-Sample t-Test (Independent) | Tests if the means of two independent groups differ. | Compare average prices of houses with waterfall views vs. without. | Parametric |
| Paired t-Test | Tests if the means of two related measurements differ. | Compare house prices before and after renovations on the same properties. | Parametric |
| Wilcoxon Signed-Rank Test | Non-parametric alternative to the paired t-test. | Compare house prices before and after renovation when data are not normally distributed. | Non-Parametric |
| Chi-Square Test | Tests for association between categorical variables. | Check if having a pool is associated with being in a specific neighborhood. | Non-Parametric |
| ANOVA (Analysis of Variance) | Tests if means differ across three or more groups. | Compare average prices across urban, suburban, and rural areas. | Parametric |
| Kruskal-Wallis Test | Non-parametric alternative to ANOVA. | Compare prices across urban, suburban, and rural areas when data are not normally distributed. | Non-Parametric |
| Correlation Test (Pearson) | Tests linear relationship between two continuous variables. | Check if house size is related to house price. | Parametric |
| Correlation Test (Spearman) | Tests monotonic relationship between two variables. | Check if house rank in size is related to rank in price. | Non-Parametric |
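
As one worked instance from the table, the sketch below runs a chi-square test of independence on a 2x2 table (pool vs. neighborhood). The counts are invented for illustration, the function name is hypothetical, and no continuity correction is applied. For a 2x2 table the test has 1 degree of freedom, so the p-value can be obtained from the standard normal distribution.

```python
from statistics import NormalDist


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns the test statistic and its p-value (1 degree of freedom,
    no continuity correction).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, a chi-square variable is the square of a
    # standard normal variable, so the tail follows from the normal CDF.
    p_value = 2 * (1 - NormalDist().cdf(statistic**0.5))
    return statistic, p_value


# Hypothetical counts: rows = neighborhood A / B, columns = pool / no pool.
counts = [[30, 70], [15, 85]]
statistic, p_value = chi_square_2x2(counts)
print(f"chi-square = {statistic:.3f}, p-value = {p_value:.4f}")
```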

A/B Testing#

🆎 Introduction to A/B Testing#

What is A/B Testing?

A/B Testing is an experiment where you compare two versions (A and B) of something to see which performs better.

Why We Use It

  • To make data-driven decisions instead of relying on guesswork.
  • To improve performance and user experience.
  • To validate design, content, or product changes before full rollout.

How A/B Testing Works#

How it Works

  • Randomly split your audience into two groups.
  • Show group A one version and group B another version.
  • Measure the outcome (click rate, purchase rate, etc.).
  • Use a statistical test (often a two-sample proportion test or t-test) to see if the difference is significant.

Example

💻 A company tests two landing page designs:

  • Version A: Without Chatbot
  • Version B: With Chatbot

Measure which version leads to more sign-ups.
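
The last step (testing whether the difference in sign-up rates is significant) could look roughly like the two-proportion z-test below. The visitor and sign-up counts are hypothetical, and the pooled-proportion form of the test is one common choice rather than something these notes prescribe.

```python
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for H0: the conversion rates of A and B are equal."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under H0 both groups share one rate, estimated from the pooled data.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: 1,000 visitors per version.
signups_without_chatbot = 120  # Version A
signups_with_chatbot = 160     # Version B

z, p_value = two_proportion_z_test(
    signups_without_chatbot, 1000, signups_with_chatbot, 1000
)
alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Do not reject H0")
```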

[Figure: A/B Testing Diagram]

Key Points

  • H₀: No difference between A and B.
  • H₁: One version performs better.
  • Significance level (α): the threshold for the p-value, chosen before running the test (commonly 5%).

Why Random Assignment#

It ensures that the groups are comparable and any differences in outcomes are likely due to the change you made — not to pre-existing differences.

Why It Matters

  • Removes Selection Bias – Ensures both groups start out similar.
  • Balances Confounding Variables – Distributes known and unknown factors evenly.

Without random assignment, differences may be due to who’s in the groups — not the change you made.
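
A minimal sketch of random assignment, with a quick balance check on a pre-existing trait. The user IDs and the "returning visitor" flag are hypothetical stand-ins for whatever the real audience and confounders would be.

```python
import random


def assign_groups(user_ids, seed=42):
    """Randomly split users into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical audience: 10,000 users, 30% of whom are returning visitors.
user_ids = list(range(10_000))
is_returning = {uid: uid % 10 < 3 for uid in user_ids}

group_a, group_b = assign_groups(user_ids)

# Balance check: the share of returning visitors should be similar in
# both groups even though assignment never looked at this trait.
share_a = sum(is_returning[uid] for uid in group_a) / len(group_a)
share_b = sum(is_returning[uid] for uid in group_b) / len(group_b)
print(f"Returning visitors in A: {share_a:.3f}")
print(f"Returning visitors in B: {share_b:.3f}")
```

Because assignment ignores every user trait, traits that were never measured are balanced in the same way on average.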

Interpreting A/B Test Results#

Key Outputs from an A/B Test

  • p-value – Probability of observing a difference at least as large as the one measured if there were truly no difference between A and B.
  • Confidence Interval (CI) – Range of plausible values for the true difference.
  • Effect Size – Magnitude of the difference (practical significance).

Example Results

| Metric | Value |
|---|---|
| Conversion Rate (Version A) | 8.0% |
| Conversion Rate (Version B) | 9.5% |
| p-value | 0.032 |
| 95% CI for the difference (B − A) | [0.2%, 2.8%] |
| Effect Size (B − A) | +1.5 percentage points |

How to Interpret

  • p-value = 0.032 < 0.05 → Reject H₀.
  • CI entirely positive → B likely better than A.
  • Effect Size → Shows practical relevance.

Here, Version B is statistically and practically better than Version A.
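
The three outputs can be computed together, as in the sketch below. The notes do not state the sample sizes behind the example, so 3,400 visitors per version is an assumption chosen to give the same conversion rates. The resulting p-value and interval are therefore close to, but not guaranteed to match, the figures in the example. The function name is hypothetical, and the interval is a simple normal-approximation (Wald) interval.

```python
from statistics import NormalDist


def summarize_ab_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Effect size, confidence interval, and p-value for rate_B - rate_A."""
    std_normal = NormalDist()
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    effect = rate_b - rate_a

    # Confidence interval: unpooled standard error (normal approximation).
    se_unpooled = (rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b) ** 0.5
    z_crit = std_normal.inv_cdf(0.5 + confidence / 2)
    ci = (effect - z_crit * se_unpooled, effect + z_crit * se_unpooled)

    # p-value: pooled standard error, because H0 assumes a single shared rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    p_value = 2 * (1 - std_normal.cdf(abs(effect) / se_pooled))
    return effect, ci, p_value


# Assumed sample sizes: 3,400 visitors per version (not given in the notes).
# 272 / 3400 = 8.0% and 323 / 3400 = 9.5%.
effect, ci, p_value = summarize_ab_test(272, 3400, 323, 3400)
print(f"Effect size: {effect * 100:+.1f} percentage points")
print(f"95% CI: [{ci[0] * 100:.1f}%, {ci[1] * 100:.1f}%]")
print(f"p-value: {p_value:.3f}")
```

Reading the three numbers together guards against acting on a result that is statistically significant but too small to matter in practice.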

Key Takeaway

Statistical significance (p-value) is not enough — always check the effect size and the confidence interval.

Conclusion - Hypothesis Testing & A/B Testing#

Key Takeaways

  • Both Hypothesis Testing and A/B Testing help make data-driven decisions rather than relying on guesswork.
  • Every test starts with a clear null hypothesis (H₀) and alternative hypothesis (H₁).
  • The p-value tells us how likely it is to observe the data (or more extreme) if H₀ were true — but it’s not absolute proof.
  • Statistical significance doesn’t always mean practical significance — always check effect sizes and context.

Resources#