Regression Analysis: A Beginner's Guide & Examples

Published on September 1st, 2021. Revised on June 16, 2026.

Regression analysis is a statistical method for estimating the relationship between a dependent (outcome) variable and one or more independent (predictor) variables. In plain terms, it fits a line or curve through your data so you can describe how the outcome changes as the predictors change, and then use that relationship to make predictions. It is one of the most widely used techniques in research, economics, medicine and business analytics.

If you have ever wanted to answer a question like “how much do exam scores rise for each extra hour of study?” or “what sales should we expect next month given our advertising spend?”, regression is the tool that turns scattered data points into a clear, quantified answer. This beginner’s guide explains what regression analysis is, how the regression line works, the main types, the assumptions you must check, and a fully worked example you can follow step by step.

“Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables.” — Corporate Finance Institute

What Is Regression Analysis?

Imagine you are the CEO of a company trying to predict next month’s profit. Many factors might affect that number: the number of sales, advertising spend, the hours each employee works, or seasonal demand. Each of these factors is a variable, and regression analysis is the process of working out, mathematically, which of these variables genuinely influence the outcome and by how much.

Every regression model has two kinds of variable:

  • Dependent variable (Y) — the outcome you want to explain or predict (e.g. monthly profit). It is also called the response or outcome variable.
  • Independent variable(s) (X) — the predictor(s) you believe drive the outcome (e.g. advertising spend). These are also called explanatory or predictor variables.

Regression then estimates an equation that links them. With a single predictor, that equation describes a straight line; with two predictors it describes a flat plane, and with more it describes the higher-dimensional equivalent (a hyperplane). The result lets you do two things: understand relationships (which predictors matter, in which direction, and how strongly) and make predictions (estimate Y for new values of X).

What Regression Analysis Is Used For

  • Prediction and forecasting — estimating future sales, prices, demand or risk from historical data.
  • Explaining relationships — quantifying how strongly each factor is associated with the outcome.
  • Identifying the most important drivers — comparing predictors to see which carry real weight.
  • Testing hypotheses — checking whether an effect is statistically significant rather than down to chance.

A crucial caution: regression measures association, not causation. Finding that two variables move together does not prove that one causes the other — that conclusion needs careful study design and subject knowledge.

The Regression Line (Line of Best Fit)

For simple linear regression — one predictor and one outcome — the relationship is summarised by a straight line called the line of best fit. Its equation is:

Y = a + bX

where:

  • Y is the predicted value of the dependent variable;
  • X is the value of the independent variable;
  • a is the intercept — the predicted value of Y when X = 0;
  • b is the slope (regression coefficient) — how much Y changes for each one-unit increase in X.

The line is found using ordinary least squares (OLS), which chooses the values of a and b that bring the line as close as possible to the data points overall. “Closeness” is measured by the vertical gaps between each observed point and the line; these gaps are called residuals. OLS picks the line that minimises the sum of the squared residuals, which is where the name “least squares” comes from.
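To make the least-squares idea concrete, the short Python sketch below computes the sum of squared residuals for a few candidate lines. It borrows the five-week advertising data and the fitted line from the worked example later in this guide; the function name `sum_squared_residuals` and the two alternative lines are illustrative choices, not part of any library.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical gaps between observed ys and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Five-week data from the worked example in this guide (thousands of pounds).
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

# The OLS line from the worked example, plus two arbitrary alternatives
# that were picked only for comparison.
candidates = {
    "OLS line (a=2.2, b=0.6)": (2.2, 0.6),
    "steeper line (a=1.0, b=1.0)": (1.0, 1.0),
    "flat line at the mean (a=4.0, b=0.0)": (4.0, 0.0),
}

for label, (a, b) in candidates.items():
    ssr = sum_squared_residuals(a, b, advertising, sales)
    print(f"{label}: sum of squared residuals = {ssr:.2f}")
```

Of the three candidates, the OLS line gives the smallest total. That is the property OLS is built to deliver: no other straight line has a smaller sum of squared residuals on the same data.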

[Figure: Simple linear regression fits a line of best fit through the data. Horizontal axis: x (predictor).]

The slope tells you the direction of the relationship: a positive slope means Y rises as X rises, while a negative slope means Y falls as X rises. The closer the points cluster around the line, the stronger and more reliable the relationship.

How Good Is the Fit? R-Squared

Once you have a line, you need to know how well it explains the data. The most common measure is R² (the coefficient of determination), which ranges from 0 to 1. It represents the proportion of the variation in Y that the model explains.

  • R² = 0.90 means 90% of the variation in the outcome is explained by the predictor(s) — a strong fit.
  • R² near 0 means the predictors explain almost none of the variation.

R² should always be read alongside whether each coefficient is statistically significant (commonly judged with a p-value below 0.05) and whether the model’s assumptions hold.
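As a rough illustration of how R² is obtained, the sketch below compares the variation left over after fitting (the squared residuals) with the total variation of Y around its mean. It uses the data and fitted line from the worked example in this guide; the helper name `r_squared` is an assumption made for this example.

```python
def r_squared(xs, ys, a, b):
    """Proportion of the variation in ys explained by the line a + b*x."""
    mean_y = sum(ys) / len(ys)
    # Total variation of the outcome around its own mean.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # Variation the line fails to explain (sum of squared residuals).
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total


# Data and fitted line from the worked example (thousands of pounds).
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r2 = r_squared(advertising, sales, a=2.2, b=0.6)
print(f"R-squared = {r2:.2f}")  # prints "R-squared = 0.60"
```

On this small dataset the line explains about 60% of the variation in weekly sales. With only five observations, that figure should be treated as illustrative rather than as firm evidence of a strong fit.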

Bayesian linear regression is a variant of linear regression that uses Bayes’ theorem to estimate the coefficients as probability distributions rather than single fixed values, which can give more stable estimates when data are limited.
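The sketch below shows one simple form this idea can take. It rests on assumptions that do not come from this guide: a zero-mean Gaussian prior with the same variance on the intercept and the slope, and Gaussian noise whose variance is treated as known. Under those assumptions the posterior for the coefficients has a closed form; the function name `bayesian_line_posterior` and the numeric settings are hypothetical.

```python
def bayesian_line_posterior(xs, ys, noise_var, prior_var):
    """Posterior mean and covariance for (intercept, slope).

    Assumes a zero-mean Gaussian prior with variance prior_var on each
    coefficient and Gaussian noise with known variance noise_var.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    lam = noise_var / prior_var  # strength of the pull towards zero
    # Entries of the symmetric 2x2 matrix (X'X + lam * I),
    # where X has a column of ones and a column of xs.
    m00, m01, m11 = n + lam, sx, sxx + lam
    det = m00 * m11 - m01 * m01

    # Inverse of that 2x2 matrix.
    i00, i01, i11 = m11 / det, -m01 / det, m00 / det

    mean = (i00 * sy + i01 * sxy, i01 * sy + i11 * sxy)
    cov = [[noise_var * i00, noise_var * i01],
           [noise_var * i01, noise_var * i11]]
    return mean, cov


# Data from the worked example in this guide (thousands of pounds).
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

# noise_var = 1.0 is an assumed value; in practice it would be estimated.
weak_mean, weak_cov = bayesian_line_posterior(
    advertising, sales, noise_var=1.0, prior_var=1e6)
strong_mean, strong_cov = bayesian_line_posterior(
    advertising, sales, noise_var=1.0, prior_var=0.01)

print(f"weak prior:   intercept = {weak_mean[0]:.2f}, slope = {weak_mean[1]:.2f}")
print(f"strong prior: intercept = {strong_mean[0]:.2f}, slope = {strong_mean[1]:.2f}")
```

With a very weak prior the posterior mean is essentially the ordinary least-squares line, while a strong prior pulls both coefficients towards zero. The covariance matrix describes how uncertain the coefficients remain, which is the "distribution rather than single value" idea in miniature.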

Assumptions of Linear Regression

Linear regression gives trustworthy results only when certain assumptions hold. Checking them is part of doing regression properly:

  1. Linearity — the relationship between X and Y is genuinely a straight line. A scatter plot is the quickest check.
  2. Independence — the observations (and their residuals) are independent of one another.
  3. Homoscedasticity — the spread of the residuals is roughly constant across all values of X (no funnel shape).
  4. Normality of residuals — the residuals are approximately normally distributed. This mainly matters for confidence intervals and significance tests.
  5. No (or low) multicollinearity — in multiple regression, predictors should not be too strongly correlated with each other.

If these assumptions are badly violated, the coefficients and p-values can be misleading, so consider transforming variables or choosing a different model.
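Two of these checks can be sketched in a few lines of Python. The first lists the residuals against X, which is the text equivalent of a residual plot for judging linearity and constant spread. The second computes the correlation between two predictors as a first screen for multicollinearity. The data and fitted line come from the worked example in this guide, but the second predictor (`social_media_spend`) and the helper names are invented purely for illustration.

```python
from math import sqrt


def residuals(xs, ys, a, b):
    """Observed minus predicted values for the line a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    var_u = sum((p - mean_u) ** 2 for p in u)
    var_v = sum((q - mean_v) ** 2 for q in v)
    return cov / sqrt(var_u * var_v)


# Data and fitted line from the worked example (thousands of pounds).
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

# Look for a curve (non-linearity) or a widening spread (funnel shape).
errors = residuals(advertising, sales, a=2.2, b=0.6)
for x, e in zip(advertising, errors):
    print(f"X = {x}: residual = {e:+.1f}")

# Hypothetical second predictor, made up to demonstrate the check.
social_media_spend = [2, 3, 7, 8, 11]
r = pearson_r(advertising, social_media_spend)
print(f"correlation between predictors = {r:.2f}")
```

A correlation close to 1 or -1 suggests the two predictors carry overlapping information, so their individual coefficients would be hard to separate. With so few observations neither check is conclusive; on real data, plots and formal diagnostics are the safer route.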

Worked Example: Predicting Sales From Advertising

Let’s build a simple linear regression by hand to see exactly how the line of best fit is calculated.

Example: A shop records advertising spend (X, in £1,000s) and weekly sales (Y, in £1,000s) for five weeks:

Week    Advertising X (£1,000s)    Sales Y (£1,000s)
1       1                          2
2       2                          4
3       3                          5
4       4                          4
5       5                          5

Step 1 — Find the means. Mean of X = (1+2+3+4+5)/5 = 3. Mean of Y = (2+4+5+4+5)/5 = 4.

Step 2 — Find the slope b using b = Σ(X−X̄)(Y−Ȳ) / Σ(X−X̄)².

Deviations and products:

  • Week 1: (1−3)(2−4) = (−2)(−2) = 4; (X−X̄)² = 4
  • Week 2: (2−3)(4−4) = (−1)(0) = 0; (X−X̄)² = 1
  • Week 3: (3−3)(5−4) = (0)(1) = 0; (X−X̄)² = 0
  • Week 4: (4−3)(4−4) = (1)(0) = 0; (X−X̄)² = 1
  • Week 5: (5−3)(5−4) = (2)(1) = 2; (X−X̄)² = 4

Σ(X−X̄)(Y−Ȳ) = 4 + 0 + 0 + 0 + 2 = 6. Σ(X−X̄)² = 4 + 1 + 0 + 1 + 4 = 10.

So b = 6 / 10 = 0.6.

Step 3 — Find the intercept a using a = Ȳ − b·X̄ = 4 − (0.6 × 3) = 4 − 1.8 = 2.2.

Step 4 — Write the line: Y = 2.2 + 0.6X.

Step 5 — Predict. If the shop spends £6,000 on advertising (X = 6): Y = 2.2 + 0.6 × 6 = 2.2 + 3.6 = 5.8, i.e. about £5,800 in weekly sales.

Interpretation: the slope of 0.6 means each extra £1,000 of advertising is associated with roughly £600 more in sales.
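The same steps can be reproduced in a few lines of Python, which is a handy way to check the hand calculation. This is a minimal sketch using only built-in functions; the variable names are illustrative.

```python
# Data from the table above, both in thousands of pounds.
advertising = [1, 2, 3, 4, 5]   # X
sales = [2, 4, 5, 4, 5]         # Y

# Step 1: means of X and Y.
mean_x = sum(advertising) / len(advertising)
mean_y = sum(sales) / len(sales)

# Step 2: slope = sum of cross-products / sum of squared X deviations.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(advertising, sales))
s_xx = sum((x - mean_x) ** 2 for x in advertising)
slope = s_xy / s_xx

# Step 3: intercept = mean of Y minus slope times mean of X.
intercept = mean_y - slope * mean_x

# Step 5: predicted sales when advertising spend is X = 6.
predicted = intercept + slope * 6

print(f"slope b = {slope:.1f}, intercept a = {intercept:.1f}")
print(f"predicted sales at X = 6: {predicted:.1f}")
```

The script prints a slope of 0.6, an intercept of 2.2 and a prediction of 5.8, matching the figures worked out by hand.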

This is exactly what statistical software does for you automatically — it just handles far larger datasets and reports extra diagnostics such as R², p-values and confidence intervals. If you are unsure which model or test suits your data, our guide on which statistical test you should use is a helpful next step.

Frequently Asked Questions

What is regression analysis in simple terms?

Regression analysis is a statistical method that fits a line or curve through your data to describe how an outcome (the dependent variable) changes as one or more predictors (independent variables) change. It lets you both understand relationships and make predictions.

How do you calculate the regression line?

For the line Y = a + bX, first find the slope b = Σ(X−X̄)(Y−Ȳ) / Σ(X−X̄)², then the intercept a = Ȳ − b·X̄, where X̄ and Ȳ are the means. The worked example above shows every step, giving Y = 2.2 + 0.6X.

What is the difference between linear and logistic regression?

Linear regression predicts a continuous numeric outcome using a straight line. Logistic regression predicts a categorical outcome (such as yes/no) by estimating a probability with an S-shaped sigmoid curve, so it is used for classification.
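The sigmoid curve can be sketched in a few lines of Python. The coefficients below are hypothetical, chosen only to show how a linear score is squeezed into a probability between 0 and 1; fitting them to real data is a separate step that this sketch does not cover.

```python
from math import exp


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1 / (1 + exp(-z))


def predict_probability(x, a, b):
    """Logistic regression prediction: sigmoid of the linear score a + b*x."""
    return sigmoid(a + b * x)


# Hypothetical coefficients, not estimated from any data in this guide.
a, b = -3.0, 1.0
for x in [0, 3, 6]:
    print(f"x = {x}: probability = {predict_probability(x, a, b):.2f}")
```

The predicted probability rises from near 0 to near 1 as x increases and passes through 0.5 where the linear score a + bX equals zero. A yes/no classification is then typically made by comparing the probability with a chosen cut-off such as 0.5.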

What does R-squared tell you?

R-squared (the coefficient of determination) is the proportion of variation in the dependent variable explained by the model, on a scale of 0 to 1. An R² of 0.85 means the model explains 85% of the variation; values closer to 1 indicate a better fit.

Does regression prove causation?

No. Regression measures association between variables, not cause and effect. Two variables can be strongly related because of a third factor or coincidence, so causal claims require careful study design and subject-matter knowledge.

Which type of regression should I use?

Start with simple linear regression (one predictor, one continuous outcome), then move to multiple linear regression when you have several predictors. Use logistic regression when your outcome is categorical.

About Owen Ingram

Ingram is a dissertation specialist. He has a master's degree in data sciences. His research work aims to compare the various types of research methods used among academicians and researchers.
