Reliability and Validity in Research Explained

Published on November 10th, 2025. Revised on June 17, 2026.

Validity and reliability in research are the two criteria by which we judge the quality of any measurement. Reliability is consistency: whether a method, test or instrument produces the same results under the same conditions on repeated use. Validity is accuracy: whether the instrument actually measures what it claims to measure. In short, reliability asks “is the measurement repeatable?” while validity asks “is the measurement true?” A strong dissertation needs both, because a finding that is inconsistent or inaccurate cannot support a defensible conclusion.

You assess reliability and validity whenever you design a questionnaire, code observations, run an experiment or adopt an existing scale. This guide defines both concepts, walks through every major type with worked examples, explains the crucial relationship between them, and shows you how to improve each one in practice.

What reliability and validity actually mean

Although the terms are often spoken in the same breath, reliability and validity describe two genuinely different properties of a measurement. Getting the distinction right is the single most important step, because almost every downstream decision — which scale to adopt, how to pilot it, how to defend your results to an examiner — depends on it.

Reliability concerns the consistency and repeatability of a measure. If you weighed yourself three times in a minute and the scale read 70 kg, then 74 kg, then 68 kg, the scale would be unreliable: the underlying quantity has not changed, yet the readings have. A reliable instrument produces stable, reproducible scores when the thing being measured is itself stable.

Validity concerns accuracy — whether the instrument measures the construct it is supposed to measure, and whether the inferences you draw from the scores are sound. A bathroom scale that consistently reads 5 kg too high is perfectly reliable but invalid: it gives the same wrong answer every time. Validity is therefore about truth, not merely about repeatability.
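The bathroom-scale contrast can be simulated in a few lines. The sketch below is illustrative only: the true weight of 70 kg, the noise levels and the helper name simulate_readings are assumptions chosen for the demonstration, not values from any real instrument. It separates systematic error (the mean offset, a validity problem) from random error (the spread, a reliability problem).

```python
import random
import statistics

TRUE_WEIGHT_KG = 70.0  # hypothetical true value, chosen for illustration


def simulate_readings(bias_kg, noise_sd_kg, n=1000, seed=0):
    """Return n simulated readings: true weight + fixed bias + random noise."""
    rng = random.Random(seed)
    return [TRUE_WEIGHT_KG + bias_kg + rng.gauss(0.0, noise_sd_kg) for _ in range(n)]


# Scale A: consistent but 5 kg too high (reliable, not valid).
scale_a = simulate_readings(bias_kg=5.0, noise_sd_kg=0.1)
# Scale B: right on average but erratic (not reliable).
scale_b = simulate_readings(bias_kg=0.0, noise_sd_kg=3.0)

for name, readings in [("A", scale_a), ("B", scale_b)]:
    mean_error = statistics.mean(readings) - TRUE_WEIGHT_KG  # systematic error
    spread = statistics.stdev(readings)                      # random error
    print(f"Scale {name}: mean error {mean_error:+.2f} kg, spread {spread:.2f} kg")
```

Scale A should show a mean error close to +5 kg with almost no spread, while Scale B should show a mean error near zero with a spread of roughly 3 kg. Averaging many readings rescues Scale B, but any single reading from it remains untrustworthy.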

“Reliability refers to the consistency of a measure of a concept… Validity is concerned with the integrity of the conclusions that are generated from a piece of research.” (Source: Bryman, Social Research Methods, 2016)

The dartboard analogy

The clearest way to internalise the difference is the classic dartboard analogy, where the bullseye is the true value you are trying to measure and each dart is one measurement.

[Figure: Reliability × Validity, the dartboard analogy. Four targets: not reliable, not valid (scattered and off-target); reliable, not valid (tight cluster, off-centre); not reliable, somewhat valid (spread around the centre); reliable and valid (tight cluster on bullseye). Each dot is one measurement; the bullseye is the true value.]
The four reliability–validity combinations. Tight grouping shows reliability (consistency); landing on the bullseye shows validity (accuracy). Only the bottom-right target is both.
  • Reliable but not valid: all your darts land tightly together, but in the bottom-left corner — consistent (reliable) yet systematically off-target (invalid).
  • Valid but not reliable: your darts scatter widely, though they average out around the bullseye — on average accurate, but so inconsistent that any single throw is untrustworthy. In practice this is a fragile, undesirable state.
  • Neither: darts scatter widely and miss the centre — inconsistent and inaccurate.
  • Both reliable and valid: all darts cluster tightly on the bullseye — the goal of good measurement.

The analogy also previews the key relationship we return to later: tight clustering (reliability) is a precondition for hitting the bullseye consistently, but it does not by itself guarantee you are aiming at the right spot (validity).

Types of reliability

Reliability is not a single number; it is assessed in several complementary ways, depending on the source of inconsistency you are worried about — time, raters, items, or test versions. The four types below are the ones an examiner expects to see addressed.

1. Test–retest reliability

What it checks: consistency of a measure over time. How to assess it: administer the same test to the same people on two occasions separated by a suitable interval, then correlate the two sets of scores (typically a Pearson or intraclass correlation). A high positive correlation (commonly r ≥ 0.7–0.8) indicates good temporal stability. The interval matters: too short and respondents simply remember their answers; too long and the underlying trait may genuinely have changed.

Example: A psychology student adapts a 20-item trait-anxiety questionnaire and gives it to 60 undergraduates, then again two weeks later. Scores correlate at r = 0.84, suggesting the questionnaire measures a stable trait reliably rather than fleeting mood on the day.
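A test–retest coefficient of this kind can be computed directly from the two sets of scores. The following is a minimal sketch, assuming a plain Pearson correlation; the helper name pearson_r and the six pairs of scores are made up for illustration and are not the data behind the r = 0.84 in the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    # Note: raises ZeroDivisionError if either list has no variation.
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical totals for six respondents at time 1 and two weeks later.
time_1 = [42, 35, 51, 28, 46, 39]
time_2 = [44, 33, 49, 30, 47, 36]

r = pearson_r(time_1, time_2)
print(f"test-retest r = {r:.2f}")
```

Pearson's r rewards consistent rank ordering and ignores a uniform shift in scores between occasions, which is one reason an intraclass correlation is sometimes preferred when absolute agreement matters.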

2. Inter-rater (inter-observer) reliability

What it checks: consistency between different observers or coders rating the same material. How to assess it: have two or more raters independently score the same cases and quantify their agreement — Cohen’s kappa for categorical judgements, or an intraclass correlation coefficient for continuous ratings. This is essential wherever human judgement enters the data, such as coding open-ended responses or observing behaviour.

Example: Two education researchers independently code 40 classroom video clips for “on-task behaviour.” Cohen’s kappa is 0.78, indicating substantial agreement; had it been low, they would refine the coding scheme and retrain before coding the full sample.
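Cohen's kappa compares the agreement you observed with the agreement two raters would reach by chance given how often each uses every category. Here is a small sketch of that calculation; the function name cohens_kappa and the ten coded clips are hypothetical, invented only to show the arithmetic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical codes to the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same, non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: share of cases where the two codes match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Note: undefined (division by zero) if both raters use one identical category throughout.
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for ten clips ("on" = on-task, "off" = off-task).
coder_1 = ["on", "on", "off", "on", "off", "on", "on", "off", "on", "off"]
coder_2 = ["on", "on", "off", "on", "on", "on", "on", "off", "on", "off"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up set the coders agree on nine of ten clips (observed agreement 0.90), chance agreement is 0.54, and kappa works out to about 0.78. Raw percentage agreement would have looked more flattering, which is why kappa is the safer figure to report.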

3. Internal consistency (Cronbach’s alpha)

What it checks: whether the multiple items on a scale that are meant to measure the same construct actually hang together. How to assess it: the most widely reported index is Cronbach’s alpha, which rises with both the number of items and the average correlation between them. A common rule of thumb is that alpha ≥ 0.7 indicates acceptable internal consistency, with 0.8–0.9 considered good; values much above 0.95 can signal redundant, near-duplicate items. Split-half reliability is a related approach that correlates one half of the items against the other.

Example: A business student builds an eight-item “job satisfaction” scale. Cronbach’s alpha comes out at 0.86, so the items reliably tap a single underlying construct. An item-total analysis shows one weak item; removing it nudges alpha to 0.88.

Worked example: computing Cronbach’s alpha

Suppose five respondents each answer a four-item scale (items I1–I4, scored 1–5). The right-hand column is each respondent’s total score across the four items.

Respondent  I1  I2  I3  I4  Total
R1           4   5   4   3     16
R2           2   2   3   2      9
R3           5   4   5   4     18
R4           3   3   2   3     11
R5           1   2   2   1      6

Step 1 — the formula. With k items:

α = ( k / (k − 1) ) × ( 1 − ( Σσ²item / σ²total ) )

Step 2 — item variances (population variance of each column):

  • I1 (4,2,5,3,1): mean 3.0, σ² = 2.00
  • I2 (5,2,4,3,2): mean 3.2, σ² = 1.36
  • I3 (4,3,5,2,2): mean 3.2, σ² = 1.36
  • I4 (3,2,4,3,1): mean 2.6, σ² = 1.04

Σσ²item = 2.00 + 1.36 + 1.36 + 1.04 = 5.76

Step 3 — total-score variance (column of totals 16, 9, 18, 11, 6): mean = 12.0, σ²total = 19.60

Step 4 — plug in (k = 4):

α = (4/3) × (1 − 5.76/19.60) = 1.333 × (1 − 0.294) = 1.333 × 0.706 ≈ 0.94

Interpretation. At α ≈ 0.94 the four items are highly internally consistent: comfortably above the ≥ 0.70 acceptability threshold and beyond the 0.8–0.9 “good” band. In a real study an alpha this high would also prompt a check for redundant, near-duplicate items, since values much above 0.95 can indicate the scale is repeating itself rather than measuring the construct more richly.
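The four steps can be checked with a short script. This is a minimal sketch using only Python's standard library; the function name cronbach_alpha is an assumption for illustration, and the scores are the five respondents from the worked example.

```python
import statistics


def cronbach_alpha(rows):
    """Cronbach's alpha from a list of respondents, each a list of item scores."""
    k = len(rows[0])          # number of items
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    item_variances = [statistics.pvariance(col) for col in items]
    total_variance = statistics.pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Scores from the worked example (rows R1-R5, columns I1-I4).
scores = [
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.2f}")  # 0.94, matching Step 4
```

The sketch uses population variances to mirror Step 2. Switching to sample variances (dividing by n − 1) would give the same alpha, because the correction cancels in the ratio of item variances to total variance.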

4. Parallel-forms (alternate-forms) reliability

What it checks: consistency between two equivalent versions of a test built from the same content domain. How to assess it: administer both forms to the same group and correlate the scores. This is valuable when you need to test people twice without practice or memory effects — for example, a pre-test and post-test that must not be identical.

Example: A health-education team writes two versions of a 30-item diabetes-knowledge quiz from the same blueprint. The forms correlate at r = 0.81, so either can be used at follow-up without the results being distorted by participants remembering specific questions.

Types of validity

Validity is even more multifaceted than reliability. It is useful to separate the validity of the measure (face, content, construct, criterion) from the validity of the study design (internal and external validity). Both bear on whether your conclusions are defensible.

Face validity

The most superficial check: does the measure look, on the surface, as though it assesses the intended construct? It is judged subjectively. Face validity is the weakest form on its own, but a complete lack of it can deter respondents and reviewers.

Example: A questionnaire intended to measure exam stress includes items like “I feel tense before assessments.” At a glance this plainly looks relevant to stress — it has face validity.

Content validity

Whether the measure covers the full domain of the construct, with no important facet missing and no irrelevant content included. It is typically established by having subject-matter experts review the items against a definition of the construct. Strong content validity guards against an instrument that taps only a slice of what it claims to measure.

Example: A “student stress” scale that asks only about exams has weak content validity, because it ignores financial, social and workload stressors. Three lecturers review the blueprint and recommend adding items on those facets.

Construct validity

The most theoretically central form: whether the instrument truly measures the abstract construct it claims to (e.g. “anxiety,” “motivation,” “service quality”). Construct validity is usually evidenced through patterns of correlation:

  • Convergent validity: the measure correlates strongly with other measures of the same or related constructs.
  • Discriminant validity: the measure correlates weakly with measures of unrelated constructs, confirming it is not just measuring something else under a new name.

Example: A new stress scale correlates highly (r = 0.72) with an established anxiety inventory (convergent) but only weakly (r = 0.15) with a measure of mathematical ability (discriminant), which is evidence that it captures stress specifically.
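The convergent and discriminant pattern is easiest to see side by side. The sketch below generates simulated scores only: the 500 respondents, the noise levels and the variable names are all assumptions for illustration, not results from any real study, and they are not meant to reproduce the r = 0.72 and r = 0.15 above.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Simulated data only: 500 hypothetical respondents.
rng = random.Random(1)
stress = [rng.gauss(0, 1) for _ in range(500)]       # underlying construct
new_scale = [s + rng.gauss(0, 0.6) for s in stress]  # new measure of stress
anxiety = [s + rng.gauss(0, 0.6) for s in stress]    # established, related measure
maths = [rng.gauss(0, 1) for _ in range(500)]        # unrelated ability

convergent = pearson_r(new_scale, anxiety)
discriminant = pearson_r(new_scale, maths)
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```

What matters is the contrast: a clearly higher correlation with the related measure than with the unrelated one. A new scale that correlated equally with both would give you no grounds to claim it measures stress in particular.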

Criterion validity

Whether scores relate to a relevant external criterion — an outcome the construct should predict or correspond to. It has two forms:

  • Concurrent validity: the measure correlates with a criterion assessed at the same time.
  • Predictive validity: the measure forecasts a criterion measured in the future.

Example: If high scores on a university-admissions aptitude test predict first-year grades obtained months later, the test shows predictive validity. If those scores correlate with a teacher’s current rating of ability, that is concurrent validity.

Internal vs external validity (design validity)

Beyond the instrument, the design of a study has its own validity, a distinction that matters most for experimental research.

  • Internal validity: the extent to which the study can establish a causal link between the independent and dependent variables, free from confounding. Random assignment and control groups protect it.
  • External validity: the extent to which findings generalise beyond the specific sample, setting and time of the study.

There is often a trade-off: tightly controlled lab studies maximise internal validity but can sacrifice external validity, while naturalistic field studies do the reverse. Choosing your variables and controls carefully is how you manage this balance.
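Random assignment, the main safeguard for internal validity mentioned above, is simple to implement reproducibly. This is one possible sketch, assuming two equal-sized groups; the function name randomly_assign, the group labels and the seed are hypothetical choices.

```python
import random


def randomly_assign(participant_ids, groups=("treatment", "control"), seed=None):
    """Shuffle participants, then deal them round-robin into near-equal groups."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    assignment = {g: [] for g in groups}
    for i, pid in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(pid)
    return assignment


# Twenty hypothetical participants, numbered 1-20.
allocation = randomly_assign(range(1, 21), seed=42)
print({name: sorted(ids) for name, ids in allocation.items()})
```

Recording the seed lets you document the allocation in your methodology chapter and reproduce it if an examiner asks how participants were assigned.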

Summary table: types of reliability and validity

Type                        What it checks                            How to assess it
Test–retest reliability     Consistency over time                     Correlate scores from two administrations (r)
Inter-rater reliability     Agreement between observers               Cohen’s kappa or intraclass correlation
Internal consistency        Items measuring one construct cohere      Cronbach’s alpha (≥ 0.7 acceptable)
Parallel-forms reliability  Equivalence of two test versions          Correlate scores across both forms
Face validity               Surface plausibility of the measure       Subjective “looks right” judgement
Content validity            Full coverage of the construct domain     Expert review against a definition
Construct validity          Measures the intended abstract construct  Convergent and discriminant correlations
Criterion validity          Relates to an external outcome            Concurrent or predictive correlation
Internal validity           Causal inference is sound                 Control groups, random assignment
External validity           Findings generalise                       Representative sampling, replication

The relationship between reliability and validity

The relationship is asymmetric, and understanding it will save you from a common error in the discussion chapter. A measure can be reliable without being valid, but it cannot be valid without first being reliable.

Return to the scale that consistently reads 5 kg too high: it is reliable (same answer every time) yet invalid (the wrong answer). Reliability alone is therefore no guarantee of accuracy. Conversely, if an instrument gives wildly different readings each time it is used, it cannot be accurately measuring a stable construct, so high validity is impossible without reasonable reliability. In dartboard terms: reliability is tight grouping; validity is hitting the bullseye. You can group tightly in the wrong place, but you cannot reliably hit the bullseye while your throws scatter everywhere.

The practical takeaway: reliability is necessary but not sufficient for validity. Establish reliability first, then build the case for validity on top of it.
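Classical test theory, which this guide does not otherwise cover, puts a number on "necessary but not sufficient": the correlation between a measure and a criterion cannot exceed the square root of the product of their reliabilities. The sketch below illustrates that ceiling under that assumption; the function name max_validity and the example reliabilities are hypothetical.

```python
import math


def max_validity(reliability_measure, reliability_criterion=1.0):
    """Upper bound on the measure-criterion correlation in classical test theory."""
    for r in (reliability_measure, reliability_criterion):
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(reliability_measure * reliability_criterion)


# Ceiling on validity for a range of reliabilities (perfectly reliable criterion assumed).
for rel in (0.9, 0.7, 0.5, 0.3):
    print(f"reliability {rel:.1f}: validity coefficient at most {max_validity(rel):.2f}")
```

The bound is only a ceiling. High reliability permits a high validity coefficient but does nothing to produce one, which is the asymmetry described above.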

How to improve reliability and validity in practice

Both properties are designed in, not bolted on afterwards. The following steps, applied during instrument development and data collection, do most of the work.

  1. Pilot the instrument. A pilot study on a small, comparable sample surfaces ambiguous wording, floor/ceiling effects and weak items before they contaminate your main dataset — improving both reliability and content validity.
  2. Operationalise clearly. Define each construct precisely and translate it into observable, unambiguous items. Vague definitions are the root cause of poor construct validity.
  3. Standardise procedures. Administer the measure under the same conditions for everyone — identical instructions, timing and environment — to remove avoidable sources of inconsistency and protect test–retest reliability.
  4. Train your raters and use a clear coding scheme. Where human judgement is involved, detailed coding rules plus rater training and a pre-coding agreement check raise inter-rater reliability.
  5. Use established, validated scales where possible. Adopting an instrument with published reliability and validity evidence is far safer than inventing one from scratch — and lets you benchmark your own alpha against prior studies.
  6. Triangulate. Combining methods, data sources or measures lets findings cross-check one another, strengthening validity — a strategy that bridges quantitative and qualitative work.
  7. Increase scale length sensibly. Adding well-written items that tap the same construct generally raises internal consistency — but stop short of redundant, near-duplicate items.
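For step 7, the Spearman–Brown prediction formula (a standard psychometric result, not one discussed in this guide) estimates how much reliability you could gain by lengthening a scale with items of comparable quality. A minimal sketch, with a hypothetical starting reliability of 0.70 and an illustrative function name:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a scale is lengthened by length_factor."""
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


current = 0.70  # hypothetical reliability of an existing scale
for factor in (1.0, 1.5, 2.0, 3.0):
    predicted = spearman_brown(current, factor)
    print(f"{factor:.1f}x items: predicted reliability {predicted:.2f}")
```

The gains shrink as the scale grows: doubling the hypothetical scale lifts the prediction from 0.70 to about 0.82, while tripling it adds only a little more. The formula assumes the new items are as good as the existing ones, so it is best read as an optimistic estimate.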

The credibility of any inferential statistics you run rests entirely on this foundation: if the measurements are unreliable or invalid, so are the results (garbage in, garbage out). Sound measurement is the precondition for trustworthy hypothesis tests and effect estimates.

Reliability and validity in qualitative research

The classical reliability/validity framework is built around quantitative measurement, where numbers and correlations apply naturally. Qualitative researchers, who do not assume a single fixed reality to “measure,” often prefer the parallel framework of trustworthiness proposed by Lincoln and Guba (1985), with four criteria:

  • Credibility — the qualitative analogue of internal validity (are the findings believable?), supported by techniques such as member checking and triangulation.
  • Transferability — the analogue of external validity (do findings transfer to other contexts?), supported by rich, “thick” description.
  • Dependability — the analogue of reliability (is the process consistent and auditable?), supported by a clear audit trail.
  • Confirmability — the analogue of objectivity (are findings shaped by the data, not the researcher’s bias?), supported by reflexivity.

If your project is qualitative, frame quality in these terms rather than reporting a Cronbach’s alpha that does not apply. If it is quantitative, the reliability and validity types above are what your examiner will expect.

Common mistakes to avoid

  • Treating reliability as proof of validity. A consistent measure can be consistently wrong.
  • Reporting only Cronbach’s alpha. Alpha is one facet of reliability and says nothing about validity; report validity evidence too.
  • Inventing a scale and skipping the pilot. Untested instruments routinely contain ambiguous or double-barrelled items.
  • Ignoring inter-rater reliability in observational or coding studies. Single-coder data with no agreement check is hard to defend.
  • Confusing internal and external validity. Strong causal control in a lab does not guarantee real-world generalisability.

Conclusion

Reliability and validity are the twin pillars of credible measurement: reliability secures consistency, validity secures accuracy, and you need both before any result can be trusted. Establish reliability first — because a measure cannot be valid until it is consistent — then assemble the validity evidence on top of it. Pilot your instrument, operationalise constructs clearly, standardise procedures, train your raters and triangulate, and you will have a measurement strategy that withstands examiner scrutiny and underpins genuinely defensible findings.

Frequently Asked Questions

What is the difference between reliability and validity in research?

Reliability is the consistency or repeatability of a measure — whether it gives the same result under the same conditions. Validity is the accuracy of a measure — whether it actually measures what it is intended to measure. Reliability asks “is it repeatable?”; validity asks “is it true?”

Can a measure be reliable but not valid?

Yes. A bathroom scale that always reads 5 kg too high is reliable (it gives the same answer every time) but invalid (the answer is wrong). Reliability is necessary for validity but does not guarantee it.

Can a measure be valid but not reliable?

Not really. If an instrument produces wildly different scores on each use, it cannot be accurately measuring a stable construct. Reasonable reliability is a precondition for validity, so a valid measure must first be reliable.

What is a good Cronbach’s alpha value?

As a rule of thumb, Cronbach’s alpha of 0.7 or above is considered acceptable internal consistency, 0.8–0.9 is good, and values much above 0.95 may indicate redundant, near-duplicate items rather than a better scale.

What is the difference between internal and external validity?

Internal validity is whether a study can establish a genuine causal link between variables, free from confounding factors. External validity is whether the findings generalise beyond the specific sample, setting and time of the study. There is often a trade-off between the two.

How are reliability and validity assessed in qualitative research?

Many qualitative researchers use Lincoln and Guba’s (1985) trustworthiness criteria instead: credibility (akin to internal validity), transferability (external validity), dependability (reliability) and confirmability (objectivity), supported by techniques such as triangulation, thick description and audit trails.

About Alaxendra Bets

Bets earned her degree in English Literature in 2014. Since then, she has been a dedicated editor and writer at ResearchProspect, passionate about assisting students in their learning journey.
