Hypothesis Testing: Steps & Examples - ResearchProspect

Published on August 14th, 2021. Revised on June 17, 2026.

Hypothesis testing is a formal statistical procedure for deciding whether the evidence in a sample is strong enough to support a claim about a wider population. You state two competing hypotheses — a null hypothesis (H0) of “no effect” and an alternative hypothesis (H1) of “some effect” — then use sample data to calculate a test statistic and a p-value, and finally decide whether to reject H0 at a pre-set significance level (usually 5%).

Use hypothesis testing whenever you want to draw an inference about a population parameter — a mean, proportion, difference or relationship — from a sample rather than describing the sample alone. It underpins almost every quantitative dissertation that compares groups, tests a treatment or examines an association.

What is hypothesis testing?

Hypothesis testing is the engine of inferential statistics: it lets you move from “what I observed in my sample” to “what is probably true in the population.” Rather than asking “Is there a difference?” directly, the method asks a sharper question: if there were genuinely no effect in the population, how likely is it that I would see a sample result at least as extreme as mine? If that probability (the p-value) is very small, the “no-effect” explanation becomes implausible and you reject it.

The logic is deliberately conservative. Like a court that presumes innocence, hypothesis testing presumes the null hypothesis is true until the data prove otherwise “beyond reasonable doubt.” You never prove the alternative; you only gather enough evidence to reject the null, or fail to.

Null (H0) vs alternative (H1) hypotheses

Every test rests on two mutually exclusive statements about a population parameter:

  • Null hypothesis (H0) — the default position of “no effect,” “no difference” or “no relationship.” For example, H0: the mean exam score is the same for two teaching methods (μA = μB).
  • Alternative hypothesis (H1 or Ha) — the research claim you hope to support, stating that an effect, difference or relationship exists (μA ≠ μB).

The two hypotheses must be complementary and cover all possibilities. Crucially, you state both of them before collecting data, and you always test the null: the alternative is supported only indirectly, by the null being rejected. Getting the direction of these statements right depends on understanding your types of variables and which one is the outcome.

One-tailed vs two-tailed tests

The alternative hypothesis can be directional or non-directional, and this determines whether your test is one-tailed or two-tailed:

  • Two-tailed test — H1 simply says the parameter is different (μA ≠ μB). The rejection region is split across both tails of the distribution, as shown in the figure below. This is the default and the safer choice for most dissertations.
  • One-tailed test — H1 predicts a specific direction (μA > μB). The whole rejection region sits in one tail, giving more power to detect an effect in that direction — but you must justify the direction theoretically in advance, and you forfeit the ability to detect an effect the other way.

A common and serious mistake is switching to a one-tailed test after seeing the data because it makes a borderline result “significant.” Choose the tail before you look.

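[Figure: distribution of the test statistic under H0, centred on μ. The central 95% is the "fail to reject H0" region; the two "reject H0" regions lie beyond −zcrit = −1.96 and +zcrit = +1.96, each with area α/2 = .025.]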
Two-tailed test at α = .05: the rejection regions (orange) sit in both tails, each holding 2.5% of the distribution. A test statistic beyond ±1.96 falls in a rejection region, so H0 is rejected.
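
The ±1.96 cut-off in the figure, and the way a one-tailed test moves the whole rejection region into a single tail, can be checked numerically. The following is a minimal sketch that assumes the test statistic follows a standard normal distribution; the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(alpha=0.05, two_tailed=True):
    """Critical z value for a test based on the standard normal distribution."""
    std_normal = NormalDist()  # mean 0, SD 1
    # A two-tailed test splits alpha across both tails;
    # a one-tailed test puts all of alpha in one tail.
    tail_area = alpha / 2 if two_tailed else alpha
    return std_normal.inv_cdf(1 - tail_area)

print(f"two-tailed: reject H0 if |z| > {z_critical(0.05, True):.3f}")
print(f"one-tailed (upper): reject H0 if z > {z_critical(0.05, False):.3f}")
```

The one-tailed cut-off is smaller than the two-tailed one, which is where the extra power in the predicted direction comes from.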

The steps of hypothesis testing

Whatever test you ultimately run, the procedure follows the same seven steps. Work through them in order — deciding the test and the significance level before seeing the result is what keeps the process honest.

  1. State H0 and H1. Write the null and alternative as precise statements about a population parameter (mean, proportion, difference or correlation).
  2. Set the significance level (α). Decide your tolerance for a false positive — conventionally α = .05, sometimes .01 for high-stakes work. Choose one- or two-tailed here too.
  3. Choose the appropriate test. Match the test to your data type and design (see the selection table below), and check its assumptions (e.g. normality, independence, equal variances).
  4. Compute the test statistic. Calculate the value (t, F, χ², z, r) that summarises how far your sample sits from what H0 predicts.
  5. Find the p-value (or compare the statistic with the critical value). The p-value is the probability of a result at least as extreme as yours if H0 were true.
  6. Decide. If p ≤ α (or the statistic exceeds the critical value), reject H0; otherwise fail to reject H0. You never “accept” the null.
  7. Interpret in context. Translate the decision back into your research question, report the effect size and confidence interval, and discuss practical — not just statistical — significance.
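
Steps 4 to 6 above can be sketched in a few lines. This example uses a one-sample z-test, chosen only because it needs nothing beyond the standard library; it assumes the population standard deviation is known, and the function name and the numbers passed to it are hypothetical illustration values, not data from a real study.

```python
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test, assuming the population SD (sigma) is known."""
    # Step 4: test statistic = distance from the H0 value in standard-error units.
    se = sigma / n ** 0.5
    z = (sample_mean - mu0) / se
    # Step 5: two-tailed p-value, computed as if H0 were true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 6: decision rule fixed in advance by alpha.
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    return z, p, decision

# Hypothetical inputs: H0 says mu = 100; the sample of 100 has mean 103; sigma = 15.
z, p, decision = one_sample_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}: {decision}")
```

Steps 1 to 3 (hypotheses, α and choice of test) are fixed before the function is called, and step 7 (interpretation) still has to be done by the researcher.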

Significance level and p-value: what they really mean

The significance level (α) is the threshold you set in advance — the maximum probability you are willing to accept of rejecting a true null hypothesis. At α = .05 you accept a 1-in-20 risk of a false positive.

The p-value is computed from your data: it is the probability of obtaining a test statistic at least as extreme as the one observed, assuming H0 is true. A small p-value means your data would be surprising under the null, so the null looks doubtful.

Two cautions worth memorising. First, the p-value is not the probability that H0 is true, nor the probability your finding occurred by chance. Second, statistical significance is not the same as importance — with a huge sample, a trivial effect can be “significant.” Always pair the p-value with an effect size.
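
The second caution can be illustrated with a small calculation: holding a tiny effect fixed and increasing only the sample size drives the p-value down. This is a sketch under stated assumptions (a z-test with known standard deviation; the half-point shift and SD of 15 are hypothetical values).

```python
from statistics import NormalDist

def two_tailed_p_for_mean_shift(shift, sigma, n):
    """Two-tailed p-value for an observed mean shift, assuming sigma is known."""
    se = sigma / n ** 0.5          # standard error shrinks as n grows
    z = shift / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same trivial effect (0.5 points on a scale with SD 15) at growing sample sizes.
for n in (100, 1_000, 10_000, 100_000):
    p = two_tailed_p_for_mean_shift(shift=0.5, sigma=15.0, n=n)
    print(f"n = {n:>7,}: p = {p:.4f}")
```

The effect is identical in every row; only the sample size changes, which is why the p-value alone says nothing about how large or important an effect is.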

“The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.” (Source: Fisher, 1935, The Design of Experiments)

Type I vs Type II errors and statistical power

Because you are deciding under uncertainty, two kinds of error are possible. The table below maps the decision against the (unknown) truth:

Decision | H0 is actually true | H0 is actually false
Reject H0 | Type I error (false positive), probability = α | Correct decision, probability = 1 − β (power)
Fail to reject H0 | Correct decision, probability = 1 − α | Type II error (false negative), probability = β

A Type I error (false positive) means rejecting a true null — claiming an effect that is not there — and its probability is α. A Type II error (false negative) means failing to reject a false null — missing a real effect — with probability β.
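
The claim that the Type I error rate equals α can be checked by simulation. The sketch below repeatedly draws samples from a population in which H0 really is true and counts how often a two-tailed z-test rejects it anyway. The simulation settings (5,000 trials, n = 30, a fixed seed) and the function name are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(n_trials=5000, n=30, alpha=0.05, seed=1):
    """Share of simulated experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_trials):
        # Sample from a population where H0 holds: mean 0, known SD 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1  # a false positive
    return rejections / n_trials

rate = false_positive_rate()
print(f"observed Type I error rate: {rate:.3f} (alpha = 0.05)")
```

The observed rate should land close to α, with some simulation noise.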

Statistical power is 1 − β: the probability of correctly detecting a real effect. Power rises with a larger sample size, a bigger true effect, a higher α, and lower measurement noise. Researchers usually aim for power of at least 0.80, which is why a power analysis to fix sample size belongs in your methodology. Lowering α reduces Type I errors but, all else equal, increases Type II errors — the two trade off against each other.
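
The effect of sample size and α on power can be sketched with a normal approximation. This is a rough planning calculation, not an exact t-based power analysis: it assumes two equal-sized groups, a two-tailed test, and a standardised effect size d (the difference in means divided by the common SD). The function name `approx_power` is hypothetical.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-group comparison (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true standardised effect is d.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region when the effect is real.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (30, 64, 100):
    print(f"d = 0.5, n = {n} per group: power = {approx_power(0.5, n):.2f}")
print(f"alpha = .01 instead of .05 (n = 64): power = {approx_power(0.5, 64, alpha=0.01):.2f}")
```

Under this approximation a medium effect (d = 0.5) needs roughly 64 participants per group to reach power of about 0.80, and tightening α from .05 to .01 at the same sample size lowers power.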

Choosing the right test by data type

The single most common dissertation error is running the wrong test. The choice is driven by your research question, the measurement level of your variables and your design (independent vs related groups). Use this table as a quick guide:

Research question | Outcome (dependent) variable | Predictor / grouping variable | Test to use
Compare a mean against a known value, or two group means | Continuous (interval/ratio) | One sample, or one categorical variable with 2 groups | t-test (one-sample, independent, or paired)
Compare means across three or more groups | Continuous | One categorical variable with 3+ groups | ANOVA (one-way; factorial for 2+ factors)
Test whether two categorical variables are associated | Categorical (counts/frequencies) | Categorical | Chi-square test of independence
Test the strength of a linear relationship between two continuous variables | Continuous | Continuous | Pearson correlation (Spearman if ordinal/non-normal)

  • t-test — compares one or two means. Use a one-sample t-test against a known value, an independent-samples t-test for two separate groups, and a paired t-test for repeated measures on the same people.
  • ANOVA — extends the t-test to three or more groups while controlling the overall Type I error; follow a significant result with post-hoc comparisons.
  • Chi-square — tests whether two categorical variables are associated, using observed versus expected frequencies in a contingency table.
  • Correlation — Pearson’s r quantifies the strength and direction of a linear relationship between two continuous variables; see our guide to correlational research for design considerations and the all-important caveat that correlation is not causation.
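
The selection table above can also be written as a small lookup function. This is only a sketch of the decision logic: the function name, argument names and category labels are hypothetical, and it does not check assumptions such as normality or equal variances.

```python
def suggest_test(outcome, predictor, n_groups=None, paired=False):
    """Map a design to the test named in the selection table (illustrative only).

    outcome:   "continuous" or "categorical"
    predictor: "none" (one sample), "categorical" or "continuous"
    """
    if outcome == "continuous" and predictor == "none":
        return "one-sample t-test"
    if outcome == "continuous" and predictor == "categorical":
        if n_groups is None or n_groups < 2:
            raise ValueError("a grouping variable needs at least 2 groups")
        if n_groups == 2:
            return "paired t-test" if paired else "independent-samples t-test"
        return "one-way ANOVA"
    if outcome == "categorical" and predictor == "categorical":
        return "chi-square test of independence"
    if outcome == "continuous" and predictor == "continuous":
        return "Pearson correlation (Spearman if ordinal/non-normal)"
    raise ValueError("design not covered by the selection table")

print(suggest_test("continuous", "categorical", n_groups=2))
print(suggest_test("continuous", "categorical", n_groups=4))
print(suggest_test("categorical", "categorical"))
```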

A fully worked hypothesis test

To see the steps in action, here is a complete independent two-sample t-test with the arithmetic shown. This is the kind of comparison you would run in an experimental study with a treatment and a control group.

Worked example — independent two-sample t-test: A psychology student tests whether a new revision technique improves exam scores. Thirty students use the new technique (Group A) and thirty use the standard one (Group B). Group A scores: mean = 78, SD = 8, n = 30. Group B scores: mean = 74, SD = 9, n = 30.

Step 1 — State the hypotheses. H0: μA = μB (the techniques give equal mean scores). H1: μA ≠ μB (the means differ) — a two-tailed test.

Step 2 — Set the significance level. α = .05.

Step 3 — Choose the test. Two independent groups, a continuous outcome and roughly equal variances → independent-samples t-test, df = nA + nB − 2 = 58.

Step 4 — Compute the test statistic. First pool the variances:
$$s_p^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}$$
$$s_p^2 = \frac{(29)(64) + (29)(81)}{58} = \frac{1856 + 2349}{58} = \frac{4205}{58} = 72.5$$
so $s_p = \sqrt{72.5} \approx 8.515$.
Standard error of the difference: $SE = s_p \sqrt{1/n_A + 1/n_B} = 8.515 \times \sqrt{1/30 + 1/30} = 8.515 \times \sqrt{0.0667} \approx 2.198$.
Test statistic: $t = (\bar{x}_A - \bar{x}_B) / SE = (78 - 74) / 2.198 = 4 / 2.198 \approx 1.819$.

Step 5 — Find the critical value / p-value. For a two-tailed test at α = .05 with df = 58, the critical value is tcrit ≈ ±2.00. The computed t = 1.819 corresponds to a two-tailed p ≈ .074.

Step 6 — Decide. Because |t| = 1.819 < 2.00 (equivalently p ≈ .074 > .05), we fail to reject H0.

Step 7 — Interpret. The 4-point advantage for the new technique is not statistically significant at the 5% level: the data do not provide enough evidence that the techniques differ. Note how close this is — a larger sample (more power) might well detect a real 4-point effect, which is exactly why power and sample size matter.
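
The arithmetic in this worked example can be reproduced from the summary statistics alone. The sketch below uses only the standard library: it computes the pooled t statistic exactly as in Step 4, then approximates the two-tailed p-value by numerically integrating the Student t density with Simpson's rule. The function names are hypothetical, and the integration is an illustrative substitute for a statistics package.

```python
import math

def pooled_t_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Independent two-sample t statistic (equal variances) from summary statistics."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

# Group A: mean 78, SD 8, n 30. Group B: mean 74, SD 9, n 30.
t, df = pooled_t_test(78, 8, 30, 74, 9, 30)
p = two_tailed_p(t, df)
print(f"t({df}) = {t:.3f}, two-tailed p = {p:.3f}")
```

In practice a statistics package would do this in one call; for example, SciPy provides `scipy.stats.ttest_ind_from_stats` for a t-test from summary statistics.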

Strengths and limitations

Hypothesis testing gives quantitative research a transparent, replicable rule for decisions and a shared language (α, p, power) that examiners and journals understand. Its limitations are equally real: the α = .05 threshold is a convention, not a law of nature; p-values are routinely misinterpreted; and over-reliance on “significance” can crowd out effect sizes, confidence intervals and replication. Treat a test as one piece of evidence, reported alongside the magnitude and precision of the effect. Used well, it remains the backbone of confirmatory quantitative research across psychology, medicine, education and the social sciences, and a clear command of it is exactly what dissertation examiners look for in a results chapter.

Common mistakes to avoid

  • Choosing one- vs two-tailed, or the α level, after seeing the data (“p-hacking”).
  • Saying you “accept” or “prove” the null — you only fail to reject it.
  • Interpreting p as the probability that the null is true.
  • Running multiple tests without correcting for the inflated Type I error.
  • Ignoring assumptions (normality, independence, equal variances) before applying a parametric test.
  • Confusing statistical significance with practical importance — always report an effect size.
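
The inflation mentioned in the fourth point above is easy to quantify. The sketch below assumes the tests are independent and shows the Bonferroni adjustment as one simple, conservative correction; the article does not prescribe a specific method, so treat the choice of Bonferroni and the function names as assumptions.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the overall error rate at or below alpha."""
    return alpha / n_tests

for m in (1, 5, 10, 20):
    print(f"{m:>2} tests at alpha = .05: "
          f"familywise error = {familywise_error_rate(0.05, m):.3f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha(0.05, m):.4f}")
```

With ten uncorrected tests the chance of at least one false positive is roughly 40%, far above the nominal 5%.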

How to do hypothesis testing well

Strong quantitative chapters share a few habits: state hypotheses and the significance level before data collection; justify the chosen test against its assumptions; run a power analysis to set an adequate sample size; and report the test statistic, exact p-value, effect size and a confidence interval together. Doing this turns a bare “p < .05” into a defensible, transparent result your examiner can trust. If you are unsure which test fits your design or how to report it, our statistical analysis service can help.
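
To show what reporting an effect size and a confidence interval looks like, the sketch below reuses the summary statistics from the worked example (means 78 and 74, SDs 8 and 9, n = 30 per group). It uses Cohen's d as the effect size, which is one common choice rather than something the article specifies, and the rounded critical value of 2.00 for df = 58 quoted in the worked example. The function name is hypothetical.

```python
import math

def cohens_d_and_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, t_crit):
    """Cohen's d and a confidence interval for the mean difference (pooled variance)."""
    df = n_a + n_b - 2
    pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df)
    diff = mean_a - mean_b
    se = pooled_sd * math.sqrt(1 / n_a + 1 / n_b)
    d = diff / pooled_sd                     # standardised effect size
    ci = (diff - t_crit * se, diff + t_crit * se)
    return d, ci

# t_crit = 2.00 is the approximate two-tailed 5% critical value for df = 58.
d, (low, high) = cohens_d_and_ci(78, 8, 30, 74, 9, 30, t_crit=2.00)
print(f"Cohen's d = {d:.2f}, 95% CI for the difference = ({low:.2f}, {high:.2f})")
```

The interval includes zero, which matches the decision not to reject H0, while its width shows how imprecisely a 4-point difference is estimated with 30 participants per group.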


Frequently Asked Questions

What is hypothesis testing in simple terms?

Hypothesis testing is a statistical method for deciding whether a claim about a population is supported by sample data. You state a null hypothesis (no effect) and an alternative (an effect), calculate a test statistic and p-value from your sample, and reject the null if the evidence is strong enough at your chosen significance level.

What is the difference between the null and alternative hypotheses?

The null hypothesis (H0) states there is no effect, difference or relationship in the population; it is the default you assume true. The alternative hypothesis (H1) states that an effect does exist. You always test the null; if the data let you reject it, the alternative is supported indirectly. You never directly prove the alternative.

What does the p-value mean?

The p-value is the probability of getting a result at least as extreme as the one you observed, assuming the null hypothesis is true. A small p-value (typically 0.05 or below) means your data would be unlikely under the null, so you reject it. The p-value is not the probability that the null is true, nor the chance your result was a fluke.

What are Type I and Type II errors?

A Type I error is a false positive: rejecting a true null hypothesis and claiming an effect that is not there. Its probability is the significance level (alpha). A Type II error is a false negative: failing to reject a false null and missing a real effect. Its probability is beta. Statistical power (1 minus beta) is the chance of correctly detecting a real effect.

When should I use a one-tailed or a two-tailed test?

Use a two-tailed test when you only predict that the parameter differs (the default and safer choice). Use a one-tailed test only when theory justifies a specific direction in advance, as it gives more power in that direction but cannot detect an effect the other way. Never switch to one-tailed after seeing the data just to reach significance.

How do I choose the right statistical test?

Match the test to your variables and design. Compare one or two means with a t-test; compare three or more group means with ANOVA; test the association between two categorical variables with chi-square; and measure a linear relationship between two continuous variables with Pearson correlation. Always check the test’s assumptions, such as normality and independence, first.

About Aadam Mae

Aadam Mae is an academic researcher and author with a PhD in NLP (Natural Language Processing) at ResearchProspect. Mae's work delves into the intricacies of language and technology, delivering profound insights in concise prose. Pioneering the future of communication through scholarship.
