What Simple Linear Regression Means
Simple linear regression is a statistical method that models the relationship between two quantitative variables by fitting a straight line through the data. The line is described by the equation y = b0 + b1x, where b0 is the intercept and b1 is the slope. It lets you predict the value of one variable from another and quantify how strongly the two move together.
The variable you use to make the prediction, x, is called the independent (or predictor) variable. The variable you are trying to predict, y, is the dependent (or response) variable. “Simple” means there is exactly one predictor; when you have two or more predictors you move to multiple linear regression. If you are new to the topic, our beginner’s guide to regression analysis gives helpful background.
“In simple linear regression, we have a single predictor X, and we model the mean of the response Y as a linear function of X.” — James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning (Springer)
Everyday Examples
- Predicting a person’s weight (y) from their height (x) — taller people tend to weigh more.
- Predicting monthly sales (y) from advertising spend (x).
- Predicting exam score (y) from hours revised (x).
In each case there is one input we control or observe (x) and one outcome we want to estimate (y). Both variables must be measured on a continuous, numeric scale — see our guide to the levels of measurement in statistics to confirm your data qualify.
Correlation vs Regression
People often confuse the two, but they answer different questions. Correlation measures the strength and direction of the linear association between x and y as a single number, the correlation coefficient r, which ranges from −1 to +1. Regression goes further: it produces an actual equation (the fitted line) that lets you predict y from x. In simple linear regression the relationship between the two is tidy — the coefficient of determination equals the square of the correlation coefficient, written R² = r².
| Aspect | Correlation | Simple Linear Regression |
|---|---|---|
| Question answered | How strongly are x and y related? | How does y change as x changes, and what is the predicted y? |
| Output | A single number, r (−1 to +1) | An equation, y = b0 + b1x |
| Symmetry | Symmetric — corr(x,y) = corr(y,x) | Not symmetric — x predicts y, not the reverse |
| Used for prediction? | No | Yes |
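To make the contrast in the table concrete, here is a minimal Python sketch on a small illustrative dataset. The helper names (`cross_dev`, `pearson_r`, `slope`) are made up for this example and do not come from any library. It shows that r is the same whichever variable comes first, that the regression slope changes when x and y swap roles, and that r² agrees with R² computed from the residuals.

```python
from math import sqrt

# Small illustrative dataset (an assumption for this sketch).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def mean(values):
    return sum(values) / len(values)


def cross_dev(a, b):
    """Sum of products of deviations from the means."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))


def pearson_r(a, b):
    """Pearson correlation coefficient."""
    return cross_dev(a, b) / sqrt(cross_dev(a, a) * cross_dev(b, b))


def slope(pred, resp):
    """Least-squares slope for predicting resp from pred."""
    return cross_dev(pred, resp) / cross_dev(pred, pred)


r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)      # same value: correlation is symmetric
b_y_on_x = slope(x, y)      # slope when x predicts y
b_x_on_y = slope(y, x)      # a different slope when y predicts x

# R² from the residuals of the y-on-x line: 1 - SSE / SST
b0 = mean(y) - b_y_on_x * mean(x)
sse = sum((yi - (b0 + b_y_on_x * xi)) ** 2 for xi, yi in zip(x, y))
sst = cross_dev(y, y)
r_squared = 1 - sse / sst

print(f"r(x,y) = {r_xy:.4f}, r(y,x) = {r_yx:.4f}")
print(f"slope y~x = {b_y_on_x:.4f}, slope x~y = {b_x_on_y:.4f}")
print(f"r^2 = {r_xy ** 2:.4f}, R^2 = {r_squared:.4f}")
```

The two slopes differ because regression treats one variable as the outcome and measures errors only in that direction, whereas correlation treats both variables alike.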
The Simple Linear Regression Formula Explained
The population model for simple linear regression is written as:
y = β0 + β1x + ε
When we estimate the line from a sample of data, we use the fitted equation:
ŷ = b0 + b1x
Where:
- ŷ (“y-hat”) = the predicted value of the dependent variable.
- x = a given value of the independent variable.
- b0 = the intercept — the predicted value of y when x = 0; the point where the line crosses the y-axis.
- b1 = the slope (regression coefficient) — the amount y is predicted to change for each one-unit increase in x.
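- ε = the error term: the gap between an observed y and the true population line. It captures everything the model does not and appears only in the population model. Its sample counterpart is the residual, e = y − ŷ, the gap between an observed value and the value the fitted line predicts.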
Note on notation: textbooks vary. You will often see the same model written as y = a + bx, where a is the intercept (our b0) and b is the slope (our b1). The maths is identical — only the letters change.
The chart above shows the idea visually: each dot is an observation, and the straight line is the fitted regression line that sits as close as possible to all the points at once.
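The population model and the fitted equation are easy to blur together, so the following Python sketch simulates data from a known line and then estimates that line from the sample. The values of `BETA0`, `BETA1` and `SIGMA` are hypothetical, chosen only for illustration, and the errors are assumed to be normally distributed. Because the true line is known here, the normally unobservable errors can be printed next to the residuals from the fitted line.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "true" population parameters (assumptions for illustration).
BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0

# Generate a sample from y = beta0 + beta1 * x + error, with normal errors.
xs = [i / 20 for i in range(200)]               # x from 0 to 9.95
errors = [random.gauss(0, SIGMA) for _ in xs]
ys = [BETA0 + BETA1 * x + e for x, e in zip(xs, errors)]

# Estimate b0 and b1 from the sample by least squares.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Errors are measured against the true line, residuals against the fitted line.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(f"true line:   y = {BETA0} + {BETA1}x")
print(f"fitted line: y-hat = {b0:.3f} + {b1:.3f}x")
print(f"first error = {errors[0]:.3f}, first residual = {residuals[0]:.3f}")
```

The estimates b0 and b1 should land close to, but not exactly on, the true values, which is the sense in which a sample line only approximates the population line.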
Least Squares: How the Line Is Fitted
There are infinitely many straight lines you could draw through a cloud of points, so we need a rule for choosing the “best” one. Simple linear regression uses the method of least squares: it picks the slope and intercept that minimise the sum of the squared residuals — the squared vertical distances between each observed point and the line.
The least-squares formulas give clean, closed-form answers. The slope is:
b1 = Σ(xi − x̄)(yi − ȳ) ÷ Σ(xi − x̄)²
And the intercept is:
b0 = ȳ − b1x̄
Where x̄ and ȳ are the means of x and y. Two useful facts fall straight out of these formulas:
- The slope numerator, Σ(xi − x̄)(yi − ȳ), is the sum of cross-products, often written Sxy. The denominator, Σ(xi − x̄)², is Sxx. So b1 = Sxy ÷ Sxx.
- Because b0 = ȳ − b1x̄, the fitted line always passes through the point (x̄, ȳ) — the centre of the data.
Squaring the residuals (rather than just adding them) does two things: it stops positive and negative errors cancelling out, and it penalises large misses more heavily than small ones. This basic, unweighted form of the method is known as ordinary least squares (OLS).
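The least-squares rule can be checked numerically. The sketch below, written under the assumption of a small made-up dataset, implements the closed-form slope and intercept formulas in a hypothetical helper `fit_line`, then nudges the fitted line in several directions to confirm that each nudge increases the sum of squared residuals. It also confirms that the fitted line passes through the point of means.

```python
def fit_line(x, y):
    """Return (b0, b1) from the closed-form least-squares formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Illustrative data (an assumption for this sketch).
x = [1.0, 2.0, 4.0, 5.0, 7.0, 8.0]
y = [1.5, 3.1, 4.2, 6.0, 6.9, 9.1]

b0, b1 = fit_line(x, y)
best = sse(x, y, b0, b1)

# Any nudged line should have a larger sum of squared residuals.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05), (0.1, -0.05)]
for d0, d1 in nudges:
    nudged = sse(x, y, b0 + d0, b1 + d1)
    print(f"intercept {d0:+.2f}, slope {d1:+.2f}: SSE {nudged:.4f} vs best {best:.4f}")

# The fitted line passes through the point of means.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
passes_through_means = abs((b0 + b1 * x_bar) - y_bar) < 1e-9
print("passes through the means:", passes_through_means)
```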
Worked Example: Advertising Spend vs Sales
Let’s fit a regression line by hand using a small dataset. Suppose a shop records its weekly advertising spend (x, in hundreds of pounds) and resulting sales (y, also in hundreds of pounds) over five weeks.
Data: x = (1, 2, 3, 4, 5), y = (2, 4, 5, 4, 5).
Step 1 — Find the means.
x̄ = (1+2+3+4+5) ÷ 5 = 3
ȳ = (2+4+5+4+5) ÷ 5 = 20 ÷ 5 = 4
Step 2 — Compute deviations and products.
| x | y | x−x̄ | y−ȳ | (x−x̄)(y−ȳ) | (x−x̄)² |
|---|---|---|---|---|---|
| 1 | 2 | −2 | −2 | 4 | 4 |
| 2 | 4 | −1 | 0 | 0 | 1 |
| 3 | 5 | 0 | 1 | 0 | 0 |
| 4 | 4 | 1 | 0 | 0 | 1 |
| 5 | 5 | 2 | 1 | 2 | 4 |
| Totals | | | | Sxy = 6 | Sxx = 10 |
Step 3 — Calculate the slope.
b1 = Sxy ÷ Sxx = 6 ÷ 10 = 0.6
Step 4 — Calculate the intercept.
b0 = ȳ − b1x̄ = 4 − (0.6 × 3) = 4 − 1.8 = 2.2
Step 5 — Write the fitted line.
ŷ = 2.2 + 0.6x
Step 6 — Predict. For an advertising spend of x = 6 (£600), predicted sales are ŷ = 2.2 + 0.6×6 = 5.8, i.e. £580.
Note that x = 6 lies just outside the observed range of 1 to 5, so this prediction is an extrapolation and should be treated with some caution.
That is the whole procedure. Spreadsheet tools such as Excel (the SLOPE and INTERCEPT functions, or LINEST) and statistics packages like SPSS, R and Python’s scikit-learn perform exactly these calculations for you, but doing it by hand once makes the output far easier to trust.
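For readers who want to check the hand calculation, the six steps can be sketched in plain Python using only the five data points given above. The variable names and the `predict` helper are illustrative choices, not part of any package.

```python
# Weekly advertising spend (x) and sales (y), both in hundreds of pounds.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Step 1: means
x_bar = sum(x) / n
y_bar = sum(y) / n

# Step 2: sums of cross-products and squared deviations
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Steps 3 and 4: slope and intercept
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar


# Steps 5 and 6: the fitted line, used to predict at a new x
def predict(x_new):
    return b0 + b1 * x_new


print(f"x-bar = {x_bar}, y-bar = {y_bar}")
print(f"Sxy = {sxy}, Sxx = {sxx}")
print(f"y-hat = {b0:.1f} + {b1:.1f}x")
print(f"prediction at x = 6: {predict(6):.1f}")
```

The printed values match the table: means of 3 and 4, Sxy = 6, Sxx = 10, the line ŷ = 2.2 + 0.6x, and a prediction of 5.8. On Python 3.10 or later, `statistics.linear_regression(x, y)` from the standard library should return the same slope and intercept.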
Interpreting the Slope, Intercept and R²
Fitting the line is only half the job — you have to explain what it means.
- Slope (b1): in our example, 0.6 means that for every one-unit (£100) rise in advertising spend, sales are predicted to rise by 0.6 units (£60). A positive slope means y rises with x; a negative slope means y falls as x rises.
- Intercept (b0): 2.2 is the predicted value of y when x = 0. Interpret it with care — if x = 0 lies far outside your observed data, the intercept may have no real-world meaning.
- R² (coefficient of determination): the proportion of the variance in y that the model explains, ranging from 0 to 1. An R² of 0.71 means 71% of the variation in y is explained by x; the remaining 29% is due to other factors and random error.
To judge whether the slope is statistically meaningful rather than a fluke of sampling, report three quantities for it: the estimated coefficient, its standard error, and the p-value from a t-test of the null hypothesis that the true slope is zero. A small p-value (conventionally below 0.05) is evidence against a zero slope, that is, evidence that y really does change with x in the population. See our guide to statistical significance for the full picture.
A typical write-up reads like this: “A significant relationship was found (p < 0.001) between monthly pay and well-being (R² = 0.71), with a 0.71-unit increase in reported well-being for every £1,000 increase in monthly pay.”
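As a rough illustration of where these reported quantities come from, the sketch below computes R², the standard error of the slope and the t statistic for the five-week advertising data, using the standard simple-regression formulas (residual variance = SSE ÷ (n − 2), and SE(b1) = √(residual variance ÷ Sxx)). Variable names are illustrative. With only five observations the numbers are for demonstration, not for drawing conclusions.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual and total sums of squares
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)

# Proportion of the variance in y explained by the line
r_squared = 1 - sse / sst

# Standard error of the slope; residual variance uses n - 2 degrees of freedom
residual_variance = sse / (n - 2)
se_b1 = sqrt(residual_variance / sxx)

# t statistic for the null hypothesis that the true slope is zero
t_stat = b1 / se_b1

print(f"R^2 = {r_squared:.2f}")
print(f"slope = {b1:.2f}, standard error = {se_b1:.3f}, t = {t_stat:.2f}")
```

Turning the t statistic into a p-value requires the t distribution with n − 2 degrees of freedom, which Python's standard library does not provide; a statistics package would normally supply it.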
Assumptions of Simple Linear Regression
For the results of an OLS regression to be valid and trustworthy, four key assumptions should hold. A handy mnemonic is LINE:
- Linearity: the relationship between x and the mean of y is genuinely a straight line. Check this with a scatter plot before you start.
- Independence: the observations (and their residuals) are independent of one another — one data point does not influence the next. This matters especially with time-series data.
- Normality: the residuals (errors) are approximately normally distributed. Note that it is the residuals that should be normal, not the raw x or y values themselves — a common misconception.
- Equal variance (homoscedasticity): the spread of the residuals is roughly constant across all values of x. If the scatter fans out, the model is heteroscedastic and standard errors become unreliable.
If these break down, remedies include transforming a variable, adding predictors via multiple linear regression, or choosing a different model. It is also worth remembering the golden rule: a strong regression relationship shows association, not necessarily causation. Before drawing conclusions, make sure your study’s reliability and validity stand up.
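Most of these checks start from the residuals. The sketch below, again using the five-week advertising data as an assumed example, lists each residual against its fitted value (the numbers behind a residual plot) and computes the Durbin-Watson statistic, a common first check on independence in which values near 2 suggest little first-order autocorrelation. With so few points no diagnostic is reliable, so treat this as a pattern to adapt rather than a finished procedure.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Fit the line by least squares.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Linearity and equal variance: scan residuals against fitted values,
# looking for curvature or a spread that widens or narrows.
for f, e in zip(fitted, residuals):
    print(f"fitted {f:.1f}  residual {e:+.1f}")

# With an intercept in the model, least-squares residuals sum to zero
# (up to floating-point rounding).
residual_sum = sum(residuals)

# Independence: Durbin-Watson statistic on the residuals in data order.
dw = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n)) / sum(
    e ** 2 for e in residuals
)
print(f"Durbin-Watson = {dw:.2f}")
```

For normality, the usual next step is a histogram or Q-Q plot of `residuals`, which needs a plotting library and is left out here.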