Validity and reliability in research are the two criteria by which we judge the quality of any measurement. Reliability is consistency: whether a method, test or instrument produces the same results under the same conditions on repeated use. Validity is accuracy: whether the instrument actually measures what it claims to measure. In short, reliability asks “is the measurement repeatable?” while validity asks “is the measurement true?” A strong dissertation needs both, because a finding that is inconsistent or inaccurate cannot support a defensible conclusion.
You assess reliability and validity whenever you design a questionnaire, code observations, run an experiment or adopt an existing scale. This guide defines both concepts, walks through every major type with worked examples, explains the crucial relationship between them, and shows you how to improve each one in practice.
What reliability and validity actually mean
Although the terms are often spoken in the same breath, reliability and validity describe two genuinely different properties of a measurement. Getting the distinction right is the single most important step, because almost every downstream decision — which scale to adopt, how to pilot it, how to defend your results to an examiner — depends on it.
Reliability concerns the consistency and repeatability of a measure. If you weighed yourself three times in a minute and the scale read 70 kg, then 74 kg, then 68 kg, the scale would be unreliable: the underlying quantity has not changed, yet the readings have. A reliable instrument produces stable, reproducible scores when the thing being measured is itself stable.
Validity concerns accuracy — whether the instrument measures the construct it is supposed to measure, and whether the inferences you draw from the scores are sound. A bathroom scale that consistently reads 5 kg too high is perfectly reliable but invalid: it gives the same wrong answer every time. Validity is therefore about truth, not merely about repeatability.
“Reliability refers to the consistency of a measure of a concept… Validity is concerned with the integrity of the conclusions that are generated from a piece of research.” (Source: Bryman, Social Research Methods, 2016)
The dartboard analogy
The clearest way to internalise the difference is the classic dartboard analogy, where the bullseye is the true value you are trying to measure and each dart is one measurement.
- Reliable but not valid: all your darts land tightly together, but in the bottom-left corner — consistent (reliable) yet systematically off-target (invalid).
- Valid but not reliable: your darts scatter widely, though they average out around the bullseye — on average accurate, but so inconsistent that any single throw is untrustworthy. In practice this is a fragile, undesirable state.
- Neither: darts scatter widely and miss the centre — inconsistent and inaccurate.
- Both reliable and valid: all darts cluster tightly on the bullseye — the goal of good measurement.
The analogy also previews the key relationship we return to later: tight clustering (reliability) is a precondition for hitting the bullseye consistently, but it does not by itself guarantee you are aiming at the right spot (validity).
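The bathroom-scale examples from earlier can be put into numbers: the spread of repeated readings is a rough index of reliability, and their average error against a known true value is a rough index of validity. The sketch below is illustrative only. The true weight of 70 kg and the readings of the biased scale are assumptions made for the example, and `summarise` is a hypothetical helper.

```python
from statistics import mean, pstdev

TRUE_WEIGHT_KG = 70.0  # assumed true value, for illustration only


def summarise(readings, true_value=TRUE_WEIGHT_KG):
    """Return (spread, bias).

    spread: population standard deviation of the readings (consistency).
    bias:   mean reading minus the true value (accuracy).
    """
    return pstdev(readings), mean(readings) - true_value


# Erratic scale: readings vary although the weight has not changed.
erratic = [70.0, 74.0, 68.0]
# Biased scale: identical readings, each 5 kg too high (hypothetical data).
biased = [75.0, 75.0, 75.0]

erratic_spread, erratic_bias = summarise(erratic)
biased_spread, biased_bias = summarise(biased)

print(f"erratic: spread={erratic_spread:.2f} kg, bias={erratic_bias:+.2f} kg")
print(f"biased:  spread={biased_spread:.2f} kg, bias={biased_bias:+.2f} kg")
```

The biased scale has zero spread but a 5 kg bias (reliable, not valid), while the erratic scale has a small average bias but a spread of roughly 2.5 kg, so no single reading can be trusted.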
Types of reliability
Reliability is not a single number; it is assessed in several complementary ways, depending on the source of inconsistency you are worried about — time, raters, items, or test versions. The four types below are the ones an examiner expects to see addressed.
1. Test–retest reliability
What it checks: consistency of a measure over time. How to assess it: administer the same test to the same people on two occasions separated by a suitable interval, then correlate the two sets of scores (typically a Pearson or intraclass correlation). A high positive correlation (commonly r of about 0.7 to 0.8 or higher) indicates good temporal stability. The interval matters: too short and respondents simply remember their answers; too long and the underlying trait may genuinely have changed.
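A minimal sketch of the test–retest calculation, using only the Python standard library. The scores are invented for illustration, and `pearson_r` is a hypothetical helper written out in full so the arithmetic is visible; a statistics package would normally be used instead.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical total scores for the same six respondents at two time points.
time_1 = [16, 9, 18, 11, 6, 14]
time_2 = [15, 10, 17, 12, 7, 13]

r = pearson_r(time_1, time_2)
print(f"test-retest r = {r:.2f}")
```

With these made-up scores the correlation is very high because each respondent's second score is within one point of the first. Real data separated by weeks or months would usually show more drift.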
2. Inter-rater (inter-observer) reliability
What it checks: consistency between different observers or coders rating the same material. How to assess it: have two or more raters independently score the same cases and quantify their agreement — Cohen’s kappa for categorical judgements, or an intraclass correlation coefficient for continuous ratings. This is essential wherever human judgement enters the data, such as coding open-ended responses or observing behaviour.
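For categorical codes from two raters, Cohen's kappa compares the observed agreement with the agreement expected by chance from each rater's category proportions. The sketch below assumes exactly two raters and one code per case. The ratings are invented, and `cohens_kappa` is a hypothetical helper, not a library function.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category per case."""
    n = len(rater_a)
    # Observed agreement: share of cases where both raters chose the same code.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over every category either rater used.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    # Note: undefined (division by zero) if both raters use a single category.
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes from two raters for ten open-ended responses.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neu", "neg", "pos", "neu", "pos"]
rater_b = ["pos", "pos", "neg", "pos", "pos", "neu", "neg", "neg", "neu", "pos"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 cases (80%), but because 38% agreement would be expected by chance, kappa is about 0.68, noticeably lower than the raw agreement rate.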
3. Internal consistency (Cronbach’s alpha)
What it checks: whether the multiple items on a scale that are meant to measure the same construct actually hang together. How to assess it: the most widely reported index is Cronbach’s alpha, which rises with both the average inter-item correlation and the number of items on the scale. A common rule of thumb is that alpha ≥ 0.7 indicates acceptable internal consistency, with 0.8–0.9 considered good; values much above 0.95 can signal redundant, near-duplicate items. Split-half reliability is a related approach that correlates one half of the items against the other.
Worked example: computing Cronbach’s alpha
Suppose five respondents each answer a four-item scale (items I1–I4, scored 1–5). The right-hand column is each respondent’s total score across the four items.
| Respondent | I1 | I2 | I3 | I4 | Total |
|---|---|---|---|---|---|
| R1 | 4 | 5 | 4 | 3 | 16 |
| R2 | 2 | 2 | 3 | 2 | 9 |
| R3 | 5 | 4 | 5 | 4 | 18 |
| R4 | 3 | 3 | 2 | 3 | 11 |
| R5 | 1 | 2 | 2 | 1 | 6 |
Step 1 — the formula. With k items:
α = ( k / (k − 1) ) × ( 1 − ( Σσ²item / σ²total ) )
Step 2 — item variances (population variance of each column):
- I1 (4,2,5,3,1): mean 3.0, σ² = 2.00
- I2 (5,2,4,3,2): mean 3.2, σ² = 1.36
- I3 (4,3,5,2,2): mean 3.2, σ² = 1.36
- I4 (3,2,4,3,1): mean 2.6, σ² = 1.04
Σσ²item = 2.00 + 1.36 + 1.36 + 1.04 = 5.76
Step 3 — total-score variance (column of totals 16, 9, 18, 11, 6): mean = 12.0, σ²total = 19.60
Step 4 — plug in (k = 4):
α = (4/3) × (1 − 5.76/19.60) = 1.333 × (1 − 0.294) = 1.333 × 0.706 ≈ 0.94
Interpretation. At α ≈ 0.94 the four items are highly internally consistent: comfortably above the ≥ 0.70 acceptability threshold and beyond the 0.8–0.9 “good” band. Because the value sits close to 0.95, in a real study it would also prompt a check for redundant, near-duplicate items, since values much above 0.95 can indicate the scale is repeating itself rather than measuring the construct more richly.
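The hand calculation can be checked with a few lines of Python. This is a sketch using the standard library's population variance, to match Step 2 and Step 3. `cronbach_alpha` is a hypothetical helper name, and the scores are those from the table in this worked example.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha from a list of respondents, each a list of item scores."""
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # transpose: one tuple per item
    item_var_sum = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Scores for R1 to R5 on items I1 to I4, from the worked example.
scores = [
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.2f}")  # 0.94, matching the hand calculation
```

Using sample variance (dividing by n − 1) for both the item variances and the total variance would give the same alpha, because the correction factor cancels in the ratio.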
4. Parallel-forms (alternate-forms) reliability
What it checks: consistency between two equivalent versions of a test built from the same content domain. How to assess it: administer both forms to the same group and correlate the scores. This is valuable when you need to test people twice without practice or memory effects — for example, a pre-test and post-test that must not be identical.
Types of validity
Validity is even more multifaceted than reliability. It is useful to separate the validity of the measure (face, content, construct, criterion) from the validity of the study design (internal and external validity). Both bear on whether your conclusions are defensible.
Face validity
The most superficial check: does the measure look, on the surface, as though it assesses the intended construct? It is judged subjectively. Face validity is the weakest form on its own, but a complete lack of it can deter respondents and reviewers.
Content validity
Whether the measure covers the full domain of the construct, with no important facet missing and no irrelevant content included. It is typically established by having subject-matter experts review the items against a definition of the construct. Strong content validity guards against an instrument that taps only a slice of what it claims to measure.
Construct validity
The most theoretically central form: whether the instrument truly measures the abstract construct it claims to (e.g. “anxiety,” “motivation,” “service quality”). Construct validity is usually evidenced through patterns of correlation:
- Convergent validity: the measure correlates strongly with other measures of the same or related constructs.
- Discriminant validity: the measure correlates weakly with measures of unrelated constructs — confirming it is not just measuring something else under a new name.
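A sketch of how the two correlation patterns might be checked. All three variables are invented for illustration: `new_scale` stands for the instrument under test, `established_scale` for an existing measure of the same construct, and `shoe_size` for a theoretically unrelated variable. `pearson_r` is a hypothetical helper built on the standard library.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the SDs."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = mean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    return covariance / (pstdev(x) * pstdev(y))


# Hypothetical scores for eight respondents.
new_scale = [12, 18, 9, 22, 15, 7, 20, 11]
established_scale = [14, 19, 10, 24, 13, 8, 21, 12]
shoe_size = [42, 38, 41, 40, 37, 39, 43, 40]

convergent_r = pearson_r(new_scale, established_scale)
discriminant_r = pearson_r(new_scale, shoe_size)

print(f"convergent r   = {convergent_r:.2f}")    # expected to be high
print(f"discriminant r = {discriminant_r:.2f}")  # expected to be near zero
```

With this made-up data the new scale tracks the established one closely and is almost uncorrelated with shoe size, which is the pattern that supports construct validity.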
Criterion validity
Whether scores relate to a relevant external criterion — an outcome the construct should predict or correspond to. It has two forms:
- Concurrent validity: the measure correlates with a criterion assessed at the same time.
- Predictive validity: the measure forecasts a criterion measured in the future.
Internal vs external validity (design validity)
Beyond the instrument, the design of a study has its own validity, a distinction that matters most for experimental research.
- Internal validity: the extent to which the study can establish a causal link between the independent and dependent variables, free from confounding. Random assignment and control groups protect it.
- External validity: the extent to which findings generalise beyond the specific sample, setting and time of the study.
There is often a trade-off: tightly controlled lab studies maximise internal validity but can sacrifice external validity, while naturalistic field studies do the reverse. Choosing your variables and controls carefully is how you manage this balance.
Summary table: types of reliability and validity
| Type | What it checks | How to assess it |
|---|---|---|
| Test–retest reliability | Consistency over time | Correlate scores from two administrations (r) |
| Inter-rater reliability | Agreement between observers | Cohen’s kappa or intraclass correlation |
| Internal consistency | Items measuring one construct cohere | Cronbach’s alpha (≥ 0.7 acceptable) |
| Parallel-forms reliability | Equivalence of two test versions | Correlate scores across both forms |
| Face validity | Surface plausibility of the measure | Subjective “looks right” judgement |
| Content validity | Full coverage of the construct domain | Expert review against a definition |
| Construct validity | Measures the intended abstract construct | Convergent & discriminant correlations |
| Criterion validity | Relates to an external outcome | Concurrent or predictive correlation |
| Internal validity | Causal inference is sound | Control groups, random assignment |
| External validity | Findings generalise | Representative sampling, replication |
The relationship between reliability and validity
The relationship is asymmetric, and understanding it will save you from a common error in the discussion chapter. A measure can be reliable without being valid, but it cannot be valid without first being reliable.
Return to the scale that consistently reads 5 kg too high: it is reliable (same answer every time) yet invalid (the wrong answer). Reliability alone is therefore no guarantee of accuracy. Conversely, if an instrument gives wildly different readings each time it is used, it cannot be accurately measuring a stable construct, so high validity is impossible without reasonable reliability. In dartboard terms, reliability is tight grouping and validity is hitting the bullseye. You can group tightly in the wrong place, but you cannot reliably hit the bullseye while your throws scatter everywhere.
The practical takeaway: reliability is necessary but not sufficient for validity. Establish reliability first, then build the case for validity on top of it.
How to improve reliability and validity in practice
Both properties are designed in, not bolted on afterwards. The following steps, applied during instrument development and data collection, do most of the work.
- Pilot the instrument. A pilot study on a small, comparable sample surfaces ambiguous wording, floor/ceiling effects and weak items before they contaminate your main dataset — improving both reliability and content validity.
- Operationalise clearly. Define each construct precisely and translate it into observable, unambiguous items. Vague definitions are the root cause of poor construct validity.
- Standardise procedures. Administer the measure under the same conditions for everyone — identical instructions, timing and environment — to remove avoidable sources of inconsistency and protect test–retest reliability.
- Train your raters and use a clear coding scheme. Where human judgement is involved, detailed coding rules plus rater training and a pre-coding agreement check raise inter-rater reliability.
- Use established, validated scales where possible. Adopting an instrument with published reliability and validity evidence is far safer than inventing one from scratch — and lets you benchmark your own alpha against prior studies.
- Triangulate. Combining methods, data sources or measures lets findings cross-check one another, strengthening validity — a strategy that bridges quantitative and qualitative work.
- Increase scale length sensibly. Adding well-written items that tap the same construct generally raises internal consistency — but stop short of redundant, near-duplicate items.
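The effect of scale length on reliability is often projected with the Spearman–Brown prophecy formula, which is not part of the guidance above and is offered here only as an illustration. It assumes the added items are comparable in quality to the existing ones. The starting reliability of 0.60 is an invented figure, and `spearman_brown` is a hypothetical helper.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale is lengthened by `length_factor`.

    A length_factor of 2.0 means twice as many comparable items.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


current = 0.60  # hypothetical reliability of a short scale

for factor in (1.0, 1.5, 2.0, 3.0):
    projected = spearman_brown(current, factor)
    print(f"x{factor}: projected reliability = {projected:.2f}")
```

Doubling the scale lifts the projected reliability from 0.60 to 0.75, and tripling it to about 0.82, so each further extension adds less. That diminishing return is one reason to stop before padding the scale with near-duplicates.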
The credibility of any inferential statistics you run rests entirely on this foundation: garbage in, garbage out. Sound measurement is the precondition for trustworthy hypothesis tests and effect estimates.
Reliability and validity in qualitative research
The classical reliability/validity framework is built around quantitative measurement, where numbers and correlations apply naturally. Qualitative researchers, who do not assume a single fixed reality to “measure,” often prefer the parallel framework of trustworthiness proposed by Lincoln and Guba (1985), with four criteria:
- Credibility — the qualitative analogue of internal validity (are the findings believable?), supported by techniques such as member checking and triangulation.
- Transferability — the analogue of external validity (do findings transfer to other contexts?), supported by rich, “thick” description.
- Dependability — the analogue of reliability (is the process consistent and auditable?), supported by a clear audit trail.
- Confirmability — the analogue of objectivity (are findings shaped by the data, not the researcher’s bias?), supported by reflexivity.
If your project is qualitative, frame quality in these terms rather than reporting a Cronbach’s alpha that does not apply. If it is quantitative, the reliability and validity types above are what your examiner will expect.
Common mistakes to avoid
- Treating reliability as proof of validity. A consistent measure can be consistently wrong.
- Reporting only Cronbach’s alpha. Alpha is one facet of reliability and says nothing about validity; report validity evidence too.
- Inventing a scale and skipping the pilot. Untested instruments routinely contain ambiguous or double-barrelled items.
- Ignoring inter-rater reliability in observational or coding studies. Single-coder data with no agreement check is hard to defend.
- Confusing internal and external validity. Strong causal control in a lab does not guarantee real-world generalisability.
Conclusion
Reliability and validity are the twin pillars of credible measurement: reliability secures consistency, validity secures accuracy, and you need both before any result can be trusted. Establish reliability first — because a measure cannot be valid until it is consistent — then assemble the validity evidence on top of it. Pilot your instrument, operationalise constructs clearly, standardise procedures, train your raters and triangulate, and you will have a measurement strategy that withstands examiner scrutiny and underpins genuinely defensible findings.