To run a linear regression in R, fit a model with the built-in lm() function — for example model <- lm(y ~ x, data = mydata) — then read the results with summary(model). The summary gives you the intercept and slope coefficients, their p-values, and the R² value that tells you how much of the variation in the outcome the model explains. This guide walks through every step: fitting the model, interpreting the summary() output, plotting the fit, checking assumptions and making predictions.
Linear regression in R
What Is Regression Analysis?
In statistics, regression analysis is used to study the relationship between an independent and a dependent variable. The method ‘regresses’ the value of y, the dependent variable, on x, the independent variable. In plain terms, regression measures how y changes as x changes, and lets you predict y for values of x you have not directly observed.
Knowing which of your variables is the predictor and which is the response is the first decision you make, because it determines the order in which they go into the lm() formula.
What Is Linear Regression?
In linear regression the relationship between x and y is modelled as a straight line: as x increases, y tends to increase or decrease at a constant rate. A single-predictor (simple) linear regression is described by the equation:
y = b₀ + b₁x + ε — The simple linear regression model
Where:
- y is the dependent or ‘response’ variable;
- x is the independent or ‘predictor’ variable;
- b₀ is the intercept (the value of y when x = 0);
- b₁ is the slope (the change in y for a one-unit increase in x);
- ε is the random error term.
With one predictor the model is a simple linear regression; with two or more predictors it is a multiple linear regression, which uses the same lm() function with extra terms on the right of the formula.
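To make the equation concrete, here is a minimal sketch that simulates data from y = b₀ + b₁x + ε and then recovers the two coefficients with lm(). The "true" values (intercept 2, slope 0.5), the sample size and the noise level are assumptions chosen purely for illustration.

```r
# Simulate data from y = b0 + b1*x + error, using assumed parameter values
set.seed(1)                        # make the random draws reproducible
n  <- 200                          # hypothetical sample size
b0 <- 2                            # hypothetical true intercept
b1 <- 0.5                          # hypothetical true slope
x  <- runif(n, min = 0, max = 10)  # predictor values
y  <- b0 + b1 * x + rnorm(n, mean = 0, sd = 1)  # response with random error

sim_fit <- lm(y ~ x)               # estimate b0 and b1 from the simulated data
coef(sim_fit)                      # estimates should land close to 2 and 0.5
```

Because of the random error term, the estimates will not equal the true values exactly; they get closer as the sample grows or the noise shrinks.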
What Does ‘R’ Mean in Linear Regression?
There are two senses of ‘R’ that often get confused, so it is worth separating them clearly:
- R (lower-case r in many texts) is the correlation coefficient. In simple regression, r is the Pearson correlation between x and y: a value between −1 and +1 measuring the strength and direction of their linear relationship. With several predictors, the multiple R is the correlation between the observed and fitted values of y, which lies between 0 and 1.
- R² (the coefficient of determination) is the square of that value. It ranges from 0 to 1 and gives the proportion of the variation in y that the model explains. An R² of 0.85 means the model accounts for 85% of the variability in the response.
R² is reported automatically by summary() in R, so you rarely calculate it by hand. To find it manually, compute R² = 1 − (SSresidual / SStotal), where SSresidual is the sum of squared residuals and SStotal is the total sum of squares of y about its mean. For a single predictor, the correlation is obtained in R with cor(x, y), and squaring it gives R².
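The three routes to R² described above can be checked against each other. The sketch below uses cars, a small dataset that ships with base R (speed and stopping distance); it is an assumption chosen for convenience and is not the data used elsewhere in this guide.

```r
# Fit a simple regression on the built-in cars data (dist ~ speed)
fit <- lm(dist ~ speed, data = cars)

# Route 1: by hand, from the sums of squares
ss_res    <- sum(residuals(fit)^2)                 # sum of squared residuals
ss_tot    <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
r2_manual <- 1 - ss_res / ss_tot

# Route 2: the value reported by summary()
r2_summary <- summary(fit)$r.squared

# Route 3: the squared Pearson correlation (valid for a single predictor)
r2_cor <- cor(cars$speed, cars$dist)^2

c(manual = r2_manual, summary = r2_summary, cor_squared = r2_cor)
```

All three numbers should agree up to floating-point rounding. The third route only applies when there is exactly one predictor.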
The Two Functions You Need
Two base-R functions do most of the work. You do not need to install any package for them.
1. lm() — fits the model. The basic syntax is:
lm(formula, data)
- formula: the relationship, written as response ~ predictor (read the ~ as “is modelled by”);
- data: the data frame that contains those variables.
2. predict() — generates predictions. Once the model is fitted you can predict y for new values of x:
predict(object, newdata)
- object: the model created with lm();
- newdata: a data frame containing the new predictor values (the column name must match the predictor used in the model).
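As a rough illustration of how the two functions fit together, the sketch below again uses the built-in cars dataset (an assumption for convenience). It shows that predict() with no newdata returns the fitted values for the original rows, and that newdata needs a column named exactly like the predictor in the formula.

```r
# Fit the model: stopping distance is modelled by speed
fit <- lm(dist ~ speed, data = cars)

# Without newdata, predict() returns one fitted value per original row
fitted_vals <- predict(fit)
length(fitted_vals)

# With newdata, the column must be called "speed", matching the formula
new_speeds <- data.frame(speed = c(10, 20))
new_preds  <- predict(fit, newdata = new_speeds)
new_preds
```

If the column in newdata had a different name, predict() would not be able to find the predictor it needs.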
Reading the summary() Output
Calling summary(model) is where you actually interpret the regression. The annotated table below maps each part of a typical output to what it tells you.
| summary() element | What it tells you |
|---|---|
| Residuals (Min, 1Q, Median, 3Q, Max) | The spread of prediction errors. A median near 0 and roughly symmetric quartiles suggest the errors are balanced. |
| (Intercept) Estimate | b₀ — the predicted value of y when every predictor equals 0. |
| Slope Estimate (the predictor’s row) | b₁ — the change in y for each one-unit increase in x. |
| Std. Error | The uncertainty in each coefficient estimate. Smaller is more precise. |
| t value | Estimate divided by its standard error; how many standard errors the coefficient sits from 0. |
| Pr(>\|t\|) | The p-value for each coefficient. A value below 0.05 is conventionally taken to mean the predictor is statistically significant. |
| Signif. codes (***, **, *) | A quick visual key to the p-values: *** = p < 0.001, ** = p < 0.01, * = p < 0.05. |
| Multiple R-squared | Proportion of variation in y explained by the model. |
| Adjusted R-squared | R² penalised for the number of predictors — the fairer measure when comparing models. |
| F-statistic & p-value | Tests whether the model as a whole explains significantly more than no predictors at all. |
The worked example later in this guide uses base R to model a person’s weight (the dependent variable) from their height (the independent variable).
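The elements in the table can also be pulled out of the summary object directly, which is handy when you want to reuse the numbers rather than read them off the console. This is a sketch using the height and weight values from this guide's worked example; the variable names s, coef_table and f_p are just illustrative.

```r
# Height (cm) and weight (kg) values from the worked example in this guide
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
mydata <- data.frame(height, weight)

model <- lm(weight ~ height, data = mydata)
s     <- summary(model)

coef_table <- coef(s)        # matrix: Estimate, Std. Error, t value, Pr(>|t|)
r2         <- s$r.squared    # Multiple R-squared
adj_r2     <- s$adj.r.squared
f          <- s$fstatistic   # named vector: value, numdf, dendf

# Overall F-test p-value (summary() prints it but does not store it)
f_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

coef_table
c(r2 = r2, adj_r2 = adj_r2, f_p = f_p)
```

Each t value in coef_table is simply the estimate divided by its standard error, which you can confirm from the first two columns.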
Linear Regression in R Step by Step
Here is the full workflow, from a fresh RStudio session to an interpreted result.
Step 1 — Install R and RStudio. Download base R from CRAN and RStudio Desktop. In RStudio, click File > New File > R Script so you can save and re-run your code.
Step 2 — Load helper packages (optional). Base R can fit and plot a regression with no add-ons, but tidyverse (data manipulation and ggplot2 visualisation) and ggpubr (publication-ready plots) make life easier. Install once with install.packages(), then load them each session:
install.packages(c("tidyverse", "ggpubr")) # run once
library(tidyverse)
library(ggpubr)
theme_set(theme_pubr())
Step 3 — Load and inspect your data. Import a CSV and check it before modelling:
mydata <- read.csv("mydata.csv")
head(mydata) # first six rows
summary(mydata) # min, quartiles, median, mean and max of each numeric column
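Beyond head() and summary(), it can help to check column types and missing values before fitting. The sketch below writes a tiny, made-up table to a temporary file so that it runs without a real mydata.csv; the file, its contents and the names csv_path and mydata_check are all hypothetical.

```r
# Hypothetical stand-in for mydata.csv, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(height = c(151, 174, NA, 186),
                     weight = c(63, 81, 56, 91)),
          csv_path, row.names = FALSE)

mydata_check <- read.csv(csv_path)

str(mydata_check)                            # column names and types
na_counts <- colSums(is.na(mydata_check))    # missing values per column
na_counts

# Rows with any missing value removed; with R's default settings,
# lm() drops such rows itself when fitting
complete_rows <- na.omit(mydata_check)
nrow(complete_rows)
```

Knowing how many rows are incomplete tells you how many observations the model will actually use.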
Step 4 — Fit the model with lm(). Put the response on the left of the ~ and the predictor(s) on the right:
# Simple linear regression (one predictor)
model <- lm(weight ~ height, data = mydata)
# Multiple linear regression (several predictors)
model2 <- lm(weight ~ height + age, data = mydata)
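Since the example data in this guide has no age column, here is a hedged sketch of the same two formula shapes on mtcars, another dataset that ships with base R. The choice of mpg, wt and hp is only for illustration, and anova() is one common way (not the only one) to compare the nested fits.

```r
# One predictor: fuel economy modelled by car weight
simple_fit <- lm(mpg ~ wt, data = mtcars)

# Two predictors: add horsepower on the right-hand side with +
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(simple_fit)   # intercept and one slope
coef(multi_fit)    # intercept and one slope per predictor

# F-test comparing the nested models: does adding hp improve the fit?
anova(simple_fit, multi_fit)
```

In the multiple regression, each slope is the expected change in the response for a one-unit increase in that predictor with the other predictors held fixed.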
Step 5 — Read the results. Inspect the coefficients, R² and p-values:
summary(model)
confint(model) # 95% confidence intervals for the coefficients
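To see where the confint() numbers come from, the interval for each coefficient can be rebuilt as estimate ± t-critical × standard error. This is a sketch using the height and weight values from this guide's worked example; manual_ci and t_crit are illustrative names.

```r
# Height (cm) and weight (kg) values from the worked example in this guide
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
mydata <- data.frame(height, weight)
model  <- lm(weight ~ height, data = mydata)

est <- coef(summary(model))[, "Estimate"]
se  <- coef(summary(model))[, "Std. Error"]

# Critical t value for a 95% interval on the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(model))

manual_ci <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)

manual_ci        # rebuilt by hand
confint(model)   # should show the same limits
```

An interval for the slope that excludes 0 corresponds to a p-value below 0.05 for that coefficient.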
Step 6 — Plot the data and the fitted line. In base R:
plot(weight ~ height, data = mydata,
     main = "Weight vs Height")
abline(model, col = "red") # adds the regression line
Or, with ggplot2 for a cleaner figure:
ggplot(mydata, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
Step 7 — Predict new values. Estimate weight for new heights:
newdata <- data.frame(height = c(170, 182))
predict(model, newdata)
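A point prediction says nothing about its uncertainty. predict() for lm models accepts an interval argument for this; the sketch below, using the height and weight values from this guide's worked example, compares a confidence interval for the mean weight with a prediction interval for an individual.

```r
# Height (cm) and weight (kg) values from the worked example in this guide
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
mydata <- data.frame(height, weight)
model  <- lm(weight ~ height, data = mydata)

newdata <- data.frame(height = c(170, 182))

# Interval for the average weight at each height
conf_int <- predict(model, newdata, interval = "confidence", level = 0.95)

# Wider interval for a single new person's weight at each height
pred_int <- predict(model, newdata, interval = "prediction", level = 0.95)

conf_int   # columns: fit, lwr, upr
pred_int
```

The prediction interval is always the wider of the two, because it includes the scatter of individuals around the line as well as the uncertainty in the line itself.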
Step 8 — Report in words. Translate the slope into plain language, e.g. “For every 1 cm increase in height, predicted weight rose by 0.67 kg (p < 0.001), and the model explained 95% of the variation in weight (R² = 0.95).”
Assumptions and Diagnostic Plots
A linear regression is only trustworthy if its assumptions hold. There are four to check:
- Linearity — the relationship between predictor and response is genuinely a straight line.
- Independence — the residuals (errors) are independent of one another.
- Homoscedasticity — the residuals have roughly constant variance across all fitted values (no funnel shape).
- Normality — the residuals are approximately normally distributed.
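R produces four diagnostic charts with a single command: calling plot() on a model. Together they cover linearity, normality and homoscedasticity, and they flag influential points. Independence is not tested by these plots; it depends mainly on how the data were collected (for example, no repeated measurements on the same subject).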
par(mfrow = c(2, 2)) # show all four plots together
plot(model)
| Diagnostic plot | Assumption checked | What you want to see |
|---|---|---|
| Residuals vs Fitted | Linearity | Points scattered randomly around 0, no curve. |
| Normal Q-Q | Normality of residuals | Points falling close to the diagonal line. |
| Scale-Location | Homoscedasticity | A flat, horizontal trend with even spread. |
| Residuals vs Leverage | Influential outliers | No points beyond Cook’s distance contours. |
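The plots can be backed up with a couple of numeric checks from base R. The sketch below uses the height and weight values from this guide's worked example; the 4/n cut-off for Cook's distance is a common rule of thumb and an assumption here, not a strict threshold.

```r
# Height (cm) and weight (kg) values from the worked example in this guide
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
mydata <- data.frame(height, weight)
model  <- lm(weight ~ height, data = mydata)

res <- residuals(model)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
sw$p.value

# Cook's distance: one value per observation, larger means more influential
cd <- cooks.distance(model)
flagged <- which(cd > 4 / length(cd))   # rule-of-thumb cut-off (assumption)
flagged
```

With only ten observations these checks have little power, so treat them as a supplement to the plots rather than a verdict.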
Worked Example: Predicting Weight from Height
# 1. Enter the data
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
mydata <- data.frame(height, weight)
# 2. Fit the model
model <- lm(weight ~ height, data = mydata)
# 3. Read the output
summary(model)
The abridged output looks like this:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
height        0.67461    0.05191  12.997 1.16e-06 ***
Multiple R-squared:  0.9548, Adjusted R-squared:  0.9491
F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06
How to read it: the fitted line is weight = -38.46 + 0.6746 × height. The slope of 0.6746 (p < 0.001, shown by ***) means each extra centimetre of height is associated with about 0.67 kg more weight. The Multiple R-squared of 0.9548 means height explains roughly 95% of the variation in weight in this sample, and the overall F-test p-value (about 1.2e-06) shows the model fits significantly better than one with no predictors.
To predict the weight of someone 170 cm tall:
predict(model, data.frame(height = 170))
# -38.4551 + 0.6746 * 170 ≈ 76.2 kg
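For a single predictor, the coefficients that lm() reports can be reproduced from the textbook closed-form expressions: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below checks this on the same height and weight values; b0 and b1 are illustrative names.

```r
# Same data as the worked example above
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Closed-form least-squares estimates for simple linear regression
b1 <- cov(height, weight) / var(height)      # slope, about 0.6746
b0 <- mean(weight) - b1 * mean(height)       # intercept, about -38.46

c(intercept = b0, slope = b1)

# Compare with the coefficients estimated by lm()
coef(lm(weight ~ height))

# Prediction at 170 cm from the hand-computed line
b0 + b1 * 170
```

The hand-computed values match lm() up to rounding, which confirms that the function is fitting the ordinary least-squares line.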
Plotting the points with the fitted line, for example with plot() and abline() as in Step 6, shows the positive linear trend at a glance.