Hypothesis Testing: Steps & Examples

Published by Aadam Mae on August 14th, 2021. Revised on June 17, 2026.

Hypothesis testing is a formal statistical procedure for deciding whether the evidence in a sample is strong enough to support a claim about a wider population. You state two competing hypotheses — a null hypothesis (H₀) of “no effect” and an alternative hypothesis (H₁) of “some effect” — then use sample data to calculate a test statistic and a p-value, and finally decide whether to reject H₀ at a pre-set significance level (usually 5%).

Use hypothesis testing whenever you want to draw an inference about a population parameter — a mean, proportion, difference or relationship — from a sample rather than describing the sample alone. It underpins almost every quantitative dissertation that compares groups, tests a treatment or examines an association.

What is hypothesis testing?

Hypothesis testing is the engine of inferential statistics: it lets you move from “what I observed in my sample” to “what is probably true in the population.” Rather than asking “Is there a difference?” directly, the method asks a sharper question: if there were genuinely no effect in the population, how likely is it that I would see a sample result at least as extreme as mine? If that probability (the p-value) is very small, the “no-effect” explanation becomes implausible and you reject it.

The logic is deliberately conservative. Like a court that presumes innocence, hypothesis testing presumes the null hypothesis is true until the data contradict it “beyond reasonable doubt.” You never prove the alternative; you only gather enough evidence to reject the null, or fail to.

Null (H₀) vs alternative (H₁) hypotheses

Every test rests on two mutually exclusive statements about a population parameter:

Null hypothesis (H₀) — the default position of “no effect,” “no difference” or “no relationship.” For example, H₀: the mean exam score is the same for two teaching methods (μ_A = μ_B).
Alternative hypothesis (H₁ or H_a) — the research claim you hope to support, stating that an effect, difference or relationship exists (μ_A ≠ μ_B).

The two hypotheses must be complementary and cover all possibilities. Crucially, you state both of them before collecting data, and you always test the null: the alternative is supported only indirectly, by the null being rejected. Getting the direction of these statements right depends on understanding your types of variables and which one is the outcome.

One-tailed vs two-tailed tests

The alternative hypothesis can be directional or non-directional, and this determines whether your test is one-tailed or two-tailed:

Two-tailed test — H₁ simply says the parameter is different (μ_A ≠ μ_B). The rejection region is split across both tails of the distribution, as shown in the figure below. This is the default and the safer choice for most dissertations.
One-tailed test — H₁ predicts a specific direction (μ_A > μ_B). The whole rejection region sits in one tail, giving more power to detect an effect in that direction — but you must justify the direction theoretically in advance, and you forfeit the ability to detect an effect the other way.

A common and serious mistake is switching to a one-tailed test after seeing the data because it makes a borderline result “significant.” Choose the tail before you look.

Two-tailed test at α = .05: the rejection regions (orange) sit in both tails, each holding 2.5% of the distribution. A test statistic beyond ±1.96 falls in a rejection region, so H₀ is rejected.
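The effect of the tail choice on the rejection region can be sketched numerically. The snippet below is a minimal illustration that assumes a standard normal (z) reference distribution, as in the ±1.96 figure above; the function name `critical_z` is hypothetical, not taken from any library.

```python
from statistics import NormalDist

def critical_z(alpha: float, tails: int = 2) -> float:
    """Return the positive critical z value for a one- or two-tailed test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha across both tails; a one-tailed test
    # places all of alpha in a single tail.
    tail_area = alpha / tails
    return NormalDist().inv_cdf(1 - tail_area)

two_tailed = critical_z(0.05, tails=2)
one_tailed = critical_z(0.05, tails=1)
print(f"two-tailed: ±{two_tailed:.3f}, one-tailed: {one_tailed:.3f}")
```

The one-tailed cut-off is lower than the two-tailed one, which is where the extra power in the predicted direction comes from, and also why switching tails after seeing the data inflates false positives.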

The steps of hypothesis testing

Whatever test you ultimately run, the procedure follows the same seven steps. Work through them in order — deciding the test and the significance level before seeing the result is what keeps the process honest.

State H₀ and H₁. Write the null and alternative as precise statements about a population parameter (mean, proportion, difference or correlation).
Set the significance level (α). Decide your tolerance for a false positive — conventionally α = .05, sometimes .01 for high-stakes work. Choose one- or two-tailed here too.
Choose the appropriate test. Match the test to your data type and design (see the selection table below), and check its assumptions (e.g. normality, independence, equal variances).
Compute the test statistic. Calculate the value (t, F, χ², z, r) that summarises how far your sample sits from what H₀ predicts.
Find the p-value (or compare the statistic with the critical value). The p-value is the probability of a result at least as extreme as yours if H₀ were true.
Decide. If p ≤ α (or the statistic exceeds the critical value), reject H₀; otherwise fail to reject H₀. You never “accept” the null.
Interpret in context. Translate the decision back into your research question, report the effect size and confidence interval, and discuss practical — not just statistical — significance.
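As a rough sketch of how steps 2 to 6 fit together, the snippet below runs a one-sample z-test. All the numbers are hypothetical and chosen only for illustration, and the population standard deviation is assumed to be known, which is rarely true in practice (a t-test is the usual choice when it is not).

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs, chosen only for illustration.
mu_0 = 100.0         # population mean stated by H0 (step 1)
sigma = 15.0         # population standard deviation, assumed known
sample_mean = 104.0  # observed sample mean
n = 36               # sample size
alpha = 0.05         # step 2: significance level, two-tailed

# Step 4: test statistic, i.e. how many standard errors the
# sample mean lies from the value H0 predicts.
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Step 5: two-tailed p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 6: decision rule. Note the wording: never "accept H0".
decision = "reject H0" if p_value <= alpha else "fail to reject H0"

print(f"z = {z:.2f}, p = {p_value:.3f}, decision: {decision}")
```

Step 7 is not something code can do for you: the decision still has to be translated back into the research question and reported with an effect size and confidence interval.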

Significance level and p-value: what they really mean

The significance level (α) is the threshold you set in advance: the maximum probability you are willing to accept of rejecting a true null hypothesis. At α = .05 you accept a 1-in-20 risk of a false positive whenever H₀ is actually true.

The p-value is computed from your data: it is the probability of obtaining a test statistic at least as extreme as the one observed, assuming H₀ is true. A small p-value means your data would be surprising under the null, so the null looks doubtful.

Two cautions worth memorising. First, the p-value is not the probability that H₀ is true, nor the probability your finding occurred by chance. Second, statistical significance is not the same as importance — with a huge sample, a trivial effect can be “significant.” Always pair the p-value with an effect size.
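The second caution is easy to demonstrate. The sketch below assumes two equal-sized groups, a known common standard deviation of 1 and a z-test, and uses a hypothetical effect of 0.02 standard deviations; the function name `two_sample_z_p` is illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_p(d: float, n_per_group: int) -> float:
    """Two-tailed p-value for a standardised mean difference d,
    assuming two equal-sized groups with a known common SD of 1."""
    standard_error = sqrt(2 / n_per_group)
    z = d / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_effect = 0.02  # hypothetical: 2% of a standard deviation
p_small_n = two_sample_z_p(tiny_effect, 100)
p_huge_n = two_sample_z_p(tiny_effect, 1_000_000)

print(f"n = 100 per group:       p = {p_small_n:.3f}")
print(f"n = 1,000,000 per group: p = {p_huge_n:.3g}")
```

The effect is identical in both runs; only the sample size changes. That is why a p-value on its own says nothing about how large or important an effect is.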

“The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.” (Source: Fisher, 1935, The Design of Experiments)

Type I vs Type II errors and statistical power

Because you are deciding under uncertainty, two kinds of error are possible. The table below maps the decision against the (unknown) truth:

Decision	H₀ is actually true	H₀ is actually false
Reject H₀	Type I error (false positive) — probability = α	Correct decision — probability = 1 − β (power)
Fail to reject H₀	Correct decision — probability = 1 − α	Type II error (false negative) — probability = β

A Type I error (false positive) means rejecting a true null — claiming an effect that is not there — and its probability is α. A Type II error (false negative) means failing to reject a false null — missing a real effect — with probability β.
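The claim that the Type I error rate equals α can be checked by simulation. The sketch below draws many samples from a population in which H₀ is true by construction and counts how often a two-tailed z-test rejects it. The sample size, number of repetitions and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def false_positive_rate(alpha=0.05, n=30, reps=5000, seed=1):
    """Share of simulated studies that reject a true null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        # H0 (mean = 0) is true by construction: data come from N(0, 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps

rate = false_positive_rate()
print(f"observed false-positive rate: {rate:.3f}")
```

The observed rate should land close to .05, with some simulation noise. Every one of those rejections is a Type I error, because no effect exists in the simulated population.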

Statistical power is 1 − β: the probability of correctly detecting a real effect. Power rises with a larger sample size, a bigger true effect, a higher α, and lower measurement noise. Researchers usually aim for power of at least 0.80, which is why a power analysis to fix sample size belongs in your methodology. Lowering α reduces Type I errors but, all else equal, increases Type II errors — the two trade off against each other.
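To see how sample size, effect size and α drive power, here is a minimal sketch based on the normal approximation for a two-tailed comparison of two equal-sized groups. It is an approximation to the power of a t-test, not an exact calculation, and the function name `approx_power` is hypothetical; d is the standardised effect size (difference in means divided by the common SD).

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed, two-sample test (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic if the true effect is d.
    shift = d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region under the alternative.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (30, 64, 100):
    print(f"d = 0.5, n = {n:>3} per group: power ≈ {approx_power(0.5, n):.2f}")

# Lowering alpha reduces power, all else equal.
print(f"alpha = .01, n = 64: power ≈ {approx_power(0.5, 64, alpha=0.01):.2f}")
```

Under these assumptions a medium effect (d = 0.5) needs roughly 64 participants per group to reach the conventional 0.80 target, and tightening α to .01 with the same sample pushes power back down.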

Choosing the right test by data type

The single most common dissertation error is running the wrong test. The choice is driven by your research question, the measurement level of your variables and your design (independent vs related groups). Use this table as a quick guide:

Research question	Outcome (dependent) variable	Predictor / grouping variable	Test to use
Compare a mean against a known value, or two group means	Continuous (interval/ratio)	One sample, or one categorical variable with 2 groups	t-test (one-sample, independent, or paired)
Compare means across three or more groups	Continuous	One categorical variable with 3+ groups	ANOVA (one-way; factorial for 2+ factors)
Test whether two categorical variables are associated	Categorical (counts/frequencies)	Categorical	Chi-square test of independence
Test the strength of a linear relationship between two continuous variables	Continuous	Continuous	Pearson correlation (Spearman if ordinal/non-normal)

t-test — compares one or two means. Use a one-sample t-test against a known value, an independent-samples t-test for two separate groups, and a paired t-test for repeated measures on the same people.
ANOVA — extends the t-test to three or more groups while controlling the overall Type I error; follow a significant result with post-hoc comparisons.
Chi-square — tests whether two categorical variables are associated, using observed versus expected frequencies in a contingency table.
Correlation — Pearson’s r quantifies the strength and direction of a linear relationship between two continuous variables; see our guide to correlational research for design considerations and the all-important caveat that correlation is not causation.
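Of the four tests, the chi-square test is the one whose "observed versus expected" logic is easiest to show in a few lines. The sketch below handles only a 2×2 table, applies no continuity correction, and uses hypothetical counts; the function name `chi_square_2x2` is illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts.
    Returns (statistic, p_value). No continuity correction is applied."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed_count - expected) ** 2 / expected
    # With df = 1 a chi-square variable is a squared standard normal,
    # so the upper-tail p-value can be read from the normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(sqrt(stat)))
    return stat, p_value

# Hypothetical counts: rows = teaching method, columns = pass / fail.
observed = [[30, 20], [20, 30]]
stat, p_value = chi_square_2x2(observed)
print(f"chi-square = {stat:.2f}, df = 1, p = {p_value:.3f}")
```

The normal-distribution shortcut only works for one degree of freedom. Larger tables need the chi-square distribution itself, which statistical packages provide.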

A fully worked hypothesis test

To see the steps in action, here is a complete independent two-sample t-test with the arithmetic shown. This is the kind of comparison you would run in an experimental study with a treatment and a control group.

Worked example — independent two-sample t-test: A psychology student tests whether a new revision technique improves exam scores. Thirty students use the new technique (Group A) and thirty use the standard one (Group B). Group A scores: mean = 78, SD = 8, n = 30. Group B scores: mean = 74, SD = 9, n = 30.

Step 1 — State the hypotheses. H₀: μ_A = μ_B (the techniques give equal mean scores). H₁: μ_A ≠ μ_B (the means differ) — a two-tailed test.

Step 2 — Set the significance level. α = .05.

Step 3 — Choose the test. Two independent groups, a continuous outcome and roughly equal variances → independent-samples t-test, df = n_A + n_B − 2 = 58.

Step 4 — Compute the test statistic. First pool the variances:
s_p² = [(n_A−1)s_A² + (n_B−1)s_B²] ÷ (n_A+n_B−2)
s_p² = [(29)(64) + (29)(81)] ÷ 58 = (1856 + 2349) ÷ 58 = 4205 ÷ 58 = 72.5
so s_p = √72.5 = 8.515.
Standard error of the difference: SE = s_p × √(1/n_A + 1/n_B) = 8.515 × √(1/30 + 1/30) = 8.515 × √0.0667 = 2.198.
Test statistic: t = (x̄_A − x̄_B) ÷ SE = (78 − 74) ÷ 2.198 = 4 ÷ 2.198 = 1.819.

Step 5 — Find the critical value / p-value. For a two-tailed test at α = .05 with df = 58, the critical value is t_crit ≈ ±2.00. The computed t = 1.819 corresponds to a two-tailed p ≈ .074.

Step 6 — Decide. Because |t| = 1.819 < 2.00 (equivalently p ≈ .074 > .05), we fail to reject H₀.

Step 7 — Interpret. The 4-point advantage for the new technique is not statistically significant at the 5% level: the data do not provide enough evidence that the techniques differ. Note how close this is — a larger sample (more power) might well detect a real 4-point effect, which is exactly why power and sample size matter.
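The arithmetic in this worked example can be reproduced from the summary statistics alone. The sketch below uses only the Python standard library, so the two-tailed p-value is obtained by numerically integrating the t density with Simpson's rule; in practice a statistics package would do this step. The helper names `t_pdf` and `two_tailed_p` are hypothetical.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return 1 - 2 * central_area

# Summary statistics from the worked example.
mean_a, sd_a, n_a = 78, 8, 30
mean_b, sd_b, n_b = 74, 9, 30

df = n_a + n_b - 2
pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df
se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = (mean_a - mean_b) / se
p_value = two_tailed_p(t_stat, df)

print(f"pooled variance = {pooled_var:.1f}, SE = {se:.3f}")
print(f"t({df}) = {t_stat:.3f}, two-tailed p = {p_value:.3f}")
```

The script should agree with the hand calculation to rounding: a pooled variance of 72.5, a t statistic of about 1.819 and a p-value a little above .07, so the decision to fail to reject H₀ is unchanged.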

Strengths and limitations

Hypothesis testing gives quantitative research a transparent, replicable rule for decisions and a shared language (α, p, power) that examiners and journals understand. Its limitations are equally real: the α = .05 threshold is a convention, not a law of nature; p-values are routinely misinterpreted; and over-reliance on “significance” can crowd out effect sizes, confidence intervals and replication. Treat a test as one piece of evidence, reported alongside the magnitude and precision of the effect. Used well, it remains the backbone of confirmatory quantitative research across psychology, medicine, education and the social sciences, and a clear command of it is exactly what dissertation examiners look for in a results chapter.

Common mistakes to avoid

Choosing one- vs two-tailed, or the α level, after seeing the data (“p-hacking”).
Saying you “accept” or “prove” the null — you only fail to reject it.
Interpreting p as the probability that the null is true.
Running multiple tests without correcting for the inflated Type I error.
Ignoring assumptions (normality, independence, equal variances) before applying a parametric test.
Confusing statistical significance with practical importance — always report an effect size.
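The multiple-testing mistake in the list above is worth quantifying. The sketch below assumes the tests are independent and that every null hypothesis is true, and it uses the Bonferroni adjustment as one common correction (others exist, and the choice is yours to justify). Function names are hypothetical.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m

m = 10  # hypothetical number of tests in one study
fwer_uncorrected = familywise_error_rate(0.05, m)
per_test_alpha = bonferroni_alpha(0.05, m)
fwer_corrected = familywise_error_rate(per_test_alpha, m)

print(f"uncorrected: {fwer_uncorrected:.1%} chance of at least one false positive")
print(f"Bonferroni per-test alpha: {per_test_alpha}")
print(f"corrected: {fwer_corrected:.1%}")
```

With ten uncorrected tests the chance of at least one spurious "significant" result is about 40%, not 5%. The price of the correction is lower power for each individual test.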

How to do hypothesis testing well

Strong quantitative chapters share a few habits: state hypotheses and the significance level before data collection; justify the chosen test against its assumptions; run a power analysis to set an adequate sample size; and report the test statistic, exact p-value, effect size and a confidence interval together. Doing this turns a bare “p < .05” into a defensible, transparent result your examiner can trust. If you are unsure which test fits your design or how to report it, our statistical analysis service can help.

Frequently Asked Questions

What is hypothesis testing in simple terms?

Hypothesis testing is a statistical method for deciding whether a claim about a population is supported by sample data. You state a null hypothesis (no effect) and an alternative (an effect), calculate a test statistic and p-value from your sample, and reject the null if the evidence is strong enough at your chosen significance level.

What is the difference between the null and alternative hypothesis?

The null hypothesis (H0) states there is no effect, difference or relationship in the population — it is the default you assume true. The alternative hypothesis (H1) states that an effect does exist. You always test the null; if the data let you reject it, the alternative is supported indirectly. You never directly prove the alternative.

What does the p-value actually tell you?

The p-value is the probability of getting a result at least as extreme as the one you observed, assuming the null hypothesis is true. A small p-value (conventionally at or below 0.05) means your data would be unlikely under the null, so you reject it. The p-value is not the probability that the null is true, nor the chance your result was a fluke.

What is the difference between a Type I and Type II error?

A Type I error is a false positive — rejecting a true null hypothesis and claiming an effect that is not there; its probability is the significance level (alpha). A Type II error is a false negative — failing to reject a false null and missing a real effect; its probability is beta. Statistical power (1 minus beta) is the chance of correctly detecting a real effect.

When should I use a one-tailed versus a two-tailed test?

Use a two-tailed test when you only predict that the parameter differs (the default and safer choice). Use a one-tailed test only when theory justifies a specific direction in advance, as it gives more power in that direction but cannot detect an effect the other way. Never switch to one-tailed after seeing the data just to reach significance.

How do I choose which statistical test to use?

Match the test to your variables and design. Compare one or two means with a t-test; compare three or more group means with ANOVA; test the association between two categorical variables with chi-square; and measure a linear relationship between two continuous variables with Pearson correlation. Always check the test’s assumptions, such as normality and independence, first.
