Hypothesis testing is a formal statistical procedure for deciding whether the evidence in a sample is strong enough to support a claim about a wider population. You state two competing hypotheses — a null hypothesis (H0) of “no effect” and an alternative hypothesis (H1) of “some effect” — then use sample data to calculate a test statistic and a p-value, and finally decide whether to reject H0 at a pre-set significance level (usually 5%).
Use hypothesis testing whenever you want to draw an inference about a population parameter — a mean, proportion, difference or relationship — from a sample rather than describing the sample alone. It underpins almost every quantitative dissertation that compares groups, tests a treatment or examines an association.
What is hypothesis testing?
Hypothesis testing is the engine of inferential statistics: it lets you move from “what I observed in my sample” to “what is probably true in the population.” Rather than asking “Is there a difference?” directly, the method asks a sharper question: if there were genuinely no effect in the population, how likely is it that I would see a sample result at least as extreme as mine? If that probability (the p-value) is very small, the “no-effect” explanation becomes implausible and you reject it.
The logic is deliberately conservative. Like a court that presumes innocence, hypothesis testing presumes the null hypothesis is true until the data provide evidence against it “beyond reasonable doubt.” You never prove the alternative; you only gather enough evidence to reject the null, or fail to.
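The question “how likely is a result at least as extreme as mine if there is no effect?” can be answered exactly in simple cases. The sketch below is an illustration only: it assumes a hypothetical preference study in which 15 of 20 participants chose option A, with H0 stating that each option is equally likely. The function name and the data are invented for the example.

```python
from math import comb

def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Two-sided exact p-value for H0: success probability = 0.5.

    The null distribution is symmetric around trials / 2, so "at least as
    extreme" means at least as far from trials / 2 as the observed count,
    in either tail.
    """
    distance = abs(successes - trials / 2)
    extreme_outcomes = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= distance
    )
    # Under H0 every sequence of choices is equally likely (2 ** trials of them).
    return extreme_outcomes / 2 ** trials

# Hypothetical data: 15 of 20 participants preferred option A.
p_value = binomial_two_sided_p(15, 20)
print(f"p = {p_value:.4f}")  # p = 0.0414
```

A split of 15 to 5 or more lopsided would arise in about 4% of samples if preferences were truly equal, so at α = .05 the “no-effect” explanation would be rejected.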
Null (H0) vs alternative (H1) hypotheses
Every test rests on two mutually exclusive statements about a population parameter:
- Null hypothesis (H0) — the default position of “no effect,” “no difference” or “no relationship.” For example, H0: the mean exam score is the same for two teaching methods (μA = μB).
- Alternative hypothesis (H1 or Ha) — the research claim you hope to support, stating that an effect, difference or relationship exists (μA ≠ μB).
The two hypotheses must be complementary and cover all possibilities. Crucially, you specify both before collecting data, and you always test the null; the alternative is supported only indirectly, by the null being rejected. Getting the direction of these statements right depends on understanding your types of variables and which one is the outcome.
One-tailed vs two-tailed tests
The alternative hypothesis can be directional or non-directional, and this determines whether your test is one-tailed or two-tailed:
- Two-tailed test — H1 simply says the parameter is different (μA ≠ μB). The rejection region is split across both tails of the distribution, as shown in the figure below. This is the default and the safer choice for most dissertations.
- One-tailed test — H1 predicts a specific direction (μA > μB). The whole rejection region sits in one tail, giving more power to detect an effect in that direction — but you must justify the direction theoretically in advance, and you forfeit the ability to detect an effect the other way.
A common and serious mistake is switching to a one-tailed test after seeing the data because it makes a borderline result “significant.” Choose the tail before you look.
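To make the temptation concrete, the sketch below computes the p-value for one test statistic under each possible alternative. It assumes a z statistic (normal reference distribution) for simplicity, and the value 1.80 is a hypothetical borderline result, not taken from any dataset in this guide.

```python
from statistics import NormalDist

def z_test_p_values(z: float) -> dict:
    """Return p-values for a z statistic under the three possible alternatives."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    upper_tail = 1 - std_normal.cdf(z)
    lower_tail = std_normal.cdf(z)
    return {
        "two_tailed": 2 * min(upper_tail, lower_tail),  # H1: parameter differs
        "greater": upper_tail,                          # H1: parameter is larger
        "less": lower_tail,                             # H1: parameter is smaller
    }

# Hypothetical statistic chosen so the two-tailed result is borderline.
p = z_test_p_values(1.80)
for alternative, value in p.items():
    print(f"{alternative:>10}: p = {value:.3f}")
```

With z = 1.80 the two-tailed p is about .072 and the one-tailed p in the predicted direction is about .036, exactly half. That halving is why the choice of tail has to be fixed before the data are seen.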
The steps of hypothesis testing
Whatever test you ultimately run, the procedure follows the same seven steps. Work through them in order — deciding the test and the significance level before seeing the result is what keeps the process honest.
- State H0 and H1. Write the null and alternative as precise statements about a population parameter (mean, proportion, difference or correlation).
- Set the significance level (α). Decide your tolerance for a false positive — conventionally α = .05, sometimes .01 for high-stakes work. Choose one- or two-tailed here too.
- Choose the appropriate test. Match the test to your data type and design (see the selection table below), and check its assumptions (e.g. normality, independence, equal variances).
- Compute the test statistic. Calculate the value (t, F, χ², z, r) that summarises how far your sample sits from what H0 predicts.
- Find the p-value (or compare the statistic with the critical value). The p-value is the probability of a result at least as extreme as yours if H0 were true.
- Decide. If p ≤ α (or, equivalently, the test statistic falls beyond the critical value, in absolute terms for a two-tailed test), reject H0; otherwise fail to reject H0. You never “accept” the null.
- Interpret in context. Translate the decision back into your research question, report the effect size and confidence interval, and discuss practical — not just statistical — significance.
Significance level and p-value: what they really mean
The significance level (α) is the threshold you set in advance: the maximum probability you are willing to accept of rejecting a true null hypothesis. At α = .05 you accept a 1-in-20 risk of a false positive whenever the null hypothesis is in fact true.
The p-value is computed from your data: it is the probability of obtaining a test statistic at least as extreme as the one observed, assuming H0 is true. A small p-value means your data would be surprising under the null, so the null looks doubtful.
Two cautions worth memorising. First, the p-value is not the probability that H0 is true, nor the probability your finding occurred by chance. Second, statistical significance is not the same as importance — with a huge sample, a trivial effect can be “significant.” Always pair the p-value with an effect size.
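The second caution can be shown with a few lines of arithmetic. The sketch below assumes a hypothetical 0.2-point difference on a scale with a standard deviation of 10, two equal-sized groups, and a known common standard deviation so that a z-test applies. All figures are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean_diff: float, sd: float, n_per_group: int) -> tuple:
    """Return (z, two-tailed p, Cohen's d) for two equal-sized groups.

    Assumes a common, known standard deviation so the normal distribution
    can be used; reasonable only as a large-sample approximation.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    cohens_d = mean_diff / sd  # standardised effect size, does not depend on n
    return z, p_value, cohens_d

# Hypothetical: a 0.2-point difference on a scale with SD = 10.
for n in (50, 5_000, 50_000):
    z, p, d = two_sample_z(mean_diff=0.2, sd=10, n_per_group=n)
    print(f"n = {n:>6}: z = {z:.2f}, p = {p:.4f}, d = {d:.2f}")
```

The effect size stays at d = 0.02 in every row, yet the p-value drops below .05 once the groups are large enough. The p-value reflects sample size as well as the size of the effect, which is why it should not be read as a measure of importance.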
“The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.” (Source: Fisher, 1935, The Design of Experiments)
Type I vs Type II errors and statistical power
Because you are deciding under uncertainty, two kinds of error are possible. The table below maps the decision against the (unknown) truth:
| Decision | H0 is actually true | H0 is actually false |
|---|---|---|
| Reject H0 | Type I error (false positive) — probability = α | Correct decision — probability = 1 − β (power) |
| Fail to reject H0 | Correct decision — probability = 1 − α | Type II error (false negative) — probability = β |
A Type I error (false positive) means rejecting a true null — claiming an effect that is not there — and its probability is α. A Type II error (false negative) means failing to reject a false null — missing a real effect — with probability β.
Statistical power is 1 − β: the probability of correctly detecting a real effect. Power rises with a larger sample size, a bigger true effect, a higher α, and lower measurement noise. Researchers usually aim for power of at least 0.80, which is why a power analysis to fix sample size belongs in your methodology. Lowering α reduces Type I errors but, all else equal, increases Type II errors — the two trade off against each other.
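A rough power calculation can be sketched with the normal approximation. The code below assumes a two-tailed comparison of two equal-sized independent groups with a common standard deviation. The function names and the example inputs (a 4-point true difference, SD of 8.5) are illustrative assumptions, and dedicated power software will give slightly different answers for small samples because it uses the t distribution.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def approx_power(effect: float, sd: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed, two-sample test (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect / (sd * sqrt(2 / n_per_group))  # expected z when H1 is true
    # Probability that the statistic lands in either rejection region under H1.
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)

def n_for_power(effect: float, sd: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Per-group n from the usual closed-form approximation, rounded up."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) * sd / effect) ** 2)

# Illustrative inputs: true difference of 4 points, SD of 8.5.
for n in (30, 60, 90):
    print(f"n = {n} per group: power = {approx_power(4, 8.5, n):.2f}")
print("n per group for 80% power:", n_for_power(4, 8.5))
```

Changing `alpha` from 0.05 to 0.01 in `approx_power` lowers the result for the same n, which is the trade-off between Type I and Type II errors described above.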
Choosing the right test by data type
One of the most common dissertation errors is running the wrong test. The choice is driven by your research question, the measurement level of your variables and your design (independent vs related groups). Use this table as a quick guide:
| Research question | Outcome (dependent) variable | Predictor / grouping variable | Test to use |
|---|---|---|---|
| Compare a mean against a known value, or two group means | Continuous (interval/ratio) | One sample, or one categorical variable with 2 groups | t-test (one-sample, independent, or paired) |
| Compare means across three or more groups | Continuous | One categorical variable with 3+ groups | ANOVA (one-way; factorial for 2+ factors) |
| Test whether two categorical variables are associated | Categorical (counts/frequencies) | Categorical | Chi-square test of independence |
| Test the strength of a linear relationship between two continuous variables | Continuous | Continuous | Pearson correlation (Spearman if ordinal/non-normal) |
- t-test — compares one or two means. Use a one-sample t-test against a known value, an independent-samples t-test for two separate groups, and a paired t-test for repeated measures on the same people.
- ANOVA — extends the t-test to three or more groups while controlling the overall Type I error; follow a significant result with post-hoc comparisons.
- Chi-square — tests whether two categorical variables are associated, using observed versus expected frequencies in a contingency table.
- Correlation — Pearson’s r quantifies the strength and direction of a linear relationship between two continuous variables; see our guide to correlational research for design considerations and the all-important caveat that correlation is not causation.
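As one example from the list above, the chi-square comparison of observed and expected frequencies can be worked through directly for a 2×2 table. The sketch below is a minimal version without a continuity correction. The counts are hypothetical, and the closed-form p-value shown applies only when there is one degree of freedom.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    n = a + b + c + d

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under H0
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom the statistic is a squared standard normal,
    # so its upper-tail probability has this closed form.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = teaching method, columns = pass / fail.
observed = [[30, 10], [20, 20]]
stat, p = chi_square_2x2(observed)
print(f"chi-square(1) = {stat:.2f}, p = {p:.3f}")
```

Each expected count is the row total times the column total divided by the grand total, which is what independence between the two variables would predict. Larger tables need the general chi-square distribution, which a statistics package provides.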
A fully worked hypothesis test
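To see the steps in action, here is a complete independent two-sample t-test with the arithmetic shown. This is the kind of comparison you would run in an experimental study with a treatment and a control group. The arithmetic below uses these summary statistics for the scores: group A (the new technique) has nA = 30, x̄A = 78 and sA² = 64; group B has nB = 30, x̄B = 74 and sB² = 81.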
Step 1 — State the hypotheses. H0: μA = μB (the techniques give equal mean scores). H1: μA ≠ μB (the means differ) — a two-tailed test.
Step 2 — Set the significance level. α = .05.
Step 3 — Choose the test. Two independent groups, a continuous outcome and roughly equal variances → independent-samples t-test, df = nA + nB − 2 = 58.
Step 4 — Compute the test statistic. First pool the variances:
sp² = [(nA − 1)sA² + (nB − 1)sB²] ÷ (nA + nB − 2)
sp² = [(29)(64) + (29)(81)] ÷ 58 = (1856 + 2349) ÷ 58 = 4205 ÷ 58 = 72.5
so sp = √72.5 = 8.515.
Standard error of the difference: SE = sp × √(1/nA + 1/nB) = 8.515 × √(1/30 + 1/30) = 8.515 × √0.0667 = 2.198.
Test statistic: t = (x̄A − x̄B) ÷ SE = (78 − 74) ÷ 2.198 = 4 ÷ 2.198 = 1.819.
Step 5 — Find the critical value / p-value. For a two-tailed test at α = .05 with df = 58, the critical value is tcrit ≈ ±2.00. The computed t = 1.819 corresponds to a two-tailed p ≈ .074.
Step 6 — Decide. Because |t| = 1.819 < 2.00 (equivalently p ≈ .074 > .05), we fail to reject H0.
Step 7 — Interpret. The 4-point advantage for the new technique is not statistically significant at the 5% level: the data do not provide enough evidence that the techniques differ. Note how close this is — a larger sample (more power) might well detect a real 4-point effect, which is exactly why power and sample size matter.
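The hand calculation can be checked in a few lines of Python. The sketch below uses only the standard library, which has no t distribution, so the p-value is obtained by numerically integrating the t density with Simpson's rule. That choice and the helper names are assumptions made for this illustration; in practice a statistics package (for example SciPy's `ttest_ind_from_stats`) would normally supply the p-value.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def t_two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p-value via Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3   # P(0 < T < |t|)
    return 1 - 2 * central_area    # area left over in both tails

def pooled_t_test(mean_a, mean_b, var_a, var_b, n_a, n_b):
    """Independent-samples t-test with pooled variance, from summary statistics."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    se = sqrt(pooled_var) * sqrt(1 / n_a + 1 / n_b)
    t = (mean_a - mean_b) / se
    return t, df, t_two_tailed_p(t, df)

# Summary statistics from the worked example above.
t, df, p = pooled_t_test(78, 74, 64, 81, 30, 30)
print(f"t({df}) = {t:.3f}, p = {p:.3f}")
```

The printed values should agree with Steps 4 and 5 (t ≈ 1.819 with df = 58, two-tailed p ≈ .074), up to rounding.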
Strengths and limitations
Hypothesis testing gives quantitative research a transparent, replicable rule for decisions and a shared language (α, p, power) that examiners and journals understand. Its limitations are equally real: the α = .05 threshold is a convention, not a law of nature; p-values are routinely misinterpreted; and over-reliance on “significance” can crowd out effect sizes, confidence intervals and replication. Treat a test as one piece of evidence, reported alongside the magnitude and precision of the effect. Used well, it remains the backbone of confirmatory quantitative research across psychology, medicine, education and the social sciences, and a clear command of it is exactly what dissertation examiners look for in a results chapter.
Common mistakes to avoid
- Choosing one- vs two-tailed, or the α level, after seeing the data (“p-hacking”).
- Saying you “accept” or “prove” the null — you only fail to reject it.
- Interpreting p as the probability that the null is true.
- Running multiple tests without correcting for the inflated Type I error.
- Ignoring assumptions (normality, independence, equal variances) before applying a parametric test.
- Confusing statistical significance with practical importance — always report an effect size.
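The inflation caused by multiple testing is easy to quantify. The sketch below assumes the tests are independent and that every null hypothesis is true, and it shows the Bonferroni correction as one simple remedy. Other corrections exist and may suit a given design better.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

alpha = 0.05
for m in (1, 5, 10, 20):
    print(
        f"{m:>2} tests: uncorrected familywise rate = "
        f"{familywise_error_rate(alpha, m):.3f}, "
        f"Bonferroni per-test alpha = {bonferroni_alpha(alpha, m):.4f}"
    )
```

With ten uncorrected tests at α = .05 the chance of at least one false positive is already about 40%, whereas testing each at .005 keeps the overall rate just under 5%.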
How to do hypothesis testing well
Strong quantitative chapters share a few habits: state hypotheses and the significance level before data collection; justify the chosen test against its assumptions; run a power analysis to set an adequate sample size; and report the test statistic, exact p-value, effect size and a confidence interval together. Doing this turns a bare “p < .05” into a defensible, transparent result your examiner can trust. If you are unsure which test fits your design or how to report it, our statistical analysis service can help.
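As a sketch of what reporting these together might involve, the code below computes Cohen's d and a 95% confidence interval for the mean difference from the worked example earlier in this guide. The function name is hypothetical, and the critical value is passed in as the approximate tcrit ≈ 2.00 quoted in that example rather than computed.

```python
from math import sqrt

def summarise_difference(mean_a, mean_b, pooled_sd, n_a, n_b, t_crit):
    """Return Cohen's d and a confidence interval for the mean difference.

    t_crit is the two-tailed critical value for the chosen confidence level
    and degrees of freedom; it is supplied by the caller, not computed here.
    """
    diff = mean_a - mean_b
    se = pooled_sd * sqrt(1 / n_a + 1 / n_b)
    cohens_d = diff / pooled_sd
    ci = (diff - t_crit * se, diff + t_crit * se)
    return cohens_d, ci

# Values from the worked example: means 78 and 74, pooled SD 8.515, n = 30 each.
d, (low, high) = summarise_difference(78, 74, 8.515, 30, 30, t_crit=2.00)
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# d = 0.47, 95% CI [-0.40, 8.40]
```

The interval includes zero, which matches the decision not to reject H0, but it also extends to more than 8 points. Reporting it shows the reader how imprecise the estimate is, which a bare p-value does not.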