"> Linear Regression in R: A Step-by-Step Guide

Published on September 20th, 2021, Revised on June 16, 2026

To run a linear regression in R, fit a model with the built-in lm() function — for example model <- lm(y ~ x, data = mydata) — then read the results with summary(model). The summary gives you the intercept and slope coefficients, their p-values, and the R² value that tells you how much of the variation in the outcome the model explains. This guide walks through every step: fitting the model, interpreting the summary() output, plotting the fit, checking assumptions and making predictions.

Linear regression in R

Fit: lm(y ~ x)
Inspect coefficients & R² with summary()
Check assumptions with plot(model)
Predict with predict()

What Is Regression Analysis?

In statistics, regression analysis is used to study the relationship between an independent and a dependent variable. The method ‘regresses’ the value of y, the dependent variable, on x, the independent variable. In plain terms, regression measures how y changes as x changes, and lets you predict y for values of x you have not directly observed.

Knowing which of your variables is the predictor and which is the response is the first decision you make, because it determines the order in which they go into the lm() formula.

What Is Linear Regression?

In linear regression the relationship between x and y is modelled as a straight line: as x increases, y tends to increase or decrease at a constant rate. A single-predictor (simple) linear regression is described by the equation:

y = b₀ + b₁x + ε — The simple linear regression model

Where:

  • y is the dependent or ‘response’ variable;
  • x is the independent or ‘predictor’ variable;
  • b₀ is the intercept (the value of y when x = 0);
  • b₁ is the slope (the change in y for a one-unit increase in x);
  • ε is the random error term.

When you have one predictor it is simple linear regression; when you add two or more predictors it becomes multiple linear regression, which uses the same lm() function with extra terms on the right of the formula.
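
To make the equation concrete, here is a minimal sketch that simulates data from y = b₀ + b₁x + ε and then recovers the coefficients with lm(). The "true" intercept (5), slope (2), sample size and noise level are made-up values chosen purely for illustration.

```r
# Simulate data from y = b0 + b1 * x + error, then recover b0 and b1 with lm().
# All "true" values below are illustrative assumptions, not real data.
set.seed(42)                        # make the random draws reproducible
n   <- 200
b0  <- 5                            # assumed true intercept
b1  <- 2                            # assumed true slope
x   <- runif(n, min = 0, max = 10)  # predictor
eps <- rnorm(n, mean = 0, sd = 1)   # random error term
y   <- b0 + b1 * x + eps            # response built from the model equation

sim <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = sim)
coef(fit)                           # estimates should land close to 5 and 2
```

Because of the error term, the estimates will not equal 5 and 2 exactly; they get closer as the sample grows or the noise shrinks.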

What Does ‘R’ Mean in Linear Regression?

There are two senses of ‘R’ that often get confused, so it is worth separating them clearly:

  • R (lower-case r in many texts) is the correlation coefficient: a value between −1 and +1 measuring the strength and direction of the linear relationship between x and y.
  • R² (the coefficient of determination) is the square of that value. It ranges from 0 to 1 and gives the proportion of the variation in y that the model explains. An R² of 0.85 means the model accounts for 85% of the variability in the response.

R² is reported automatically by summary() in R, so you rarely calculate it by hand. To find it manually you would compute R² = 1 − (SSresidual / SStotal), where SSresidual is the sum of squared residuals and SStotal is the total sum of squares. In simple linear regression, R itself is the correlation between x and y, obtained in R with cor(x, y); squaring it gives R².
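
The manual formula can be checked against what R reports. The sketch below reuses the ten height and weight values from the worked example in this guide; the variable names ss_residual, ss_total and r2_manual are just illustrative.

```r
# Height (cm) and weight (kg) values taken from this guide's worked example
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

fit <- lm(weight ~ height)

ss_residual <- sum(residuals(fit)^2)            # variation the model leaves unexplained
ss_total    <- sum((weight - mean(weight))^2)   # total variation around the mean
r2_manual   <- 1 - ss_residual / ss_total

r2_manual                  # R-squared computed by hand
summary(fit)$r.squared     # the value summary() reports
cor(height, weight)^2      # squared correlation (equal only with one predictor)
```

All three expressions should give the same number for a single-predictor model; with several predictors the squared correlation shortcut no longer applies.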

The Two Functions You Need

Two base-R functions do most of the work. You do not need to install any package for them.

1. lm() — fits the model. The basic syntax is:

lm(formula, data)
  • formula — the relationship, written as response ~ predictor (read the ~ as “is modelled by”);
  • data — the data frame that contains those variables.

2. predict() — generates predictions. Once the model is fitted you can predict y for new values of x:

predict(object, newdata)
  • object — the model created with lm();
  • newdata — a data frame containing the new predictor values (the column name must match the predictor used in the model).
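
As a small sketch of how the two functions fit together, the example below fits a model on R's built-in women dataset (height in inches, weight in pounds, so the units differ from the rest of this guide) and confirms that predict() is simply applying intercept + slope × x. The new heights of 62 and 68 inches are arbitrary illustrative values.

```r
# 'women' ships with base R: 15 rows with columns height (in) and weight (lbs)
model <- lm(weight ~ height, data = women)

# newdata must use the same column name as the predictor in the formula
newdata <- data.frame(height = c(62, 68))
pred <- predict(model, newdata)

# The same numbers by hand: intercept + slope * new height
b <- coef(model)
by_hand <- b[["(Intercept)"]] + b[["height"]] * newdata$height

pred
by_hand
```

If the column in newdata were named anything other than height, predict() would not be able to match it to the model's predictor.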

Reading the summary() Output

Calling summary(model) is where you actually interpret the regression. The annotated table below maps each part of a typical output to what it tells you.

summary() element | What it tells you
Residuals (Min, 1Q, Median, 3Q, Max) | The spread of prediction errors. A median near 0 and roughly symmetric quartiles suggest the errors are balanced.
(Intercept) Estimate | b₀: the predicted value of y when every predictor equals 0.
Slope Estimate (the predictor’s row) | b₁: the change in y for each one-unit increase in x.
Std. Error | The uncertainty in each coefficient estimate. Smaller is more precise.
t value | Estimate divided by its standard error; how many standard errors the coefficient sits from 0.
Pr(>|t|) | The p-value for each coefficient. Below 0.05 indicates the predictor is statistically significant.
Signif. codes (***, **, *) | A quick visual key to the p-values: *** = p < 0.001, ** = p < 0.01, * = p < 0.05.
Multiple R-squared | Proportion of variation in y explained by the model.
Adjusted R-squared | R² penalised for the number of predictors; the fairer measure when comparing models.
F-statistic & p-value | Tests whether the model as a whole explains significantly more than no predictors at all.
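
Each element in the table can also be pulled out of the summary object as a number, which is handy when you need to report or reuse a value. This is a sketch using the ten height and weight values from this guide's worked example; the object names s and coef_table are illustrative.

```r
# Height (cm) and weight (kg) values taken from this guide's worked example
mydata <- data.frame(
  height = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131),
  weight = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
)
model <- lm(weight ~ height, data = mydata)
s <- summary(model)

coef_table <- coef(s)                  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table["height", "Estimate"]       # slope
coef_table["height", "Std. Error"]     # its standard error
coef_table["height", "Pr(>|t|)"]       # its p-value

s$r.squared                            # Multiple R-squared
s$adj.r.squared                        # Adjusted R-squared
s$fstatistic                           # F value with its two degrees of freedom
s$sigma                                # residual standard error
```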

The worked example later in this guide uses base R to model a person’s weight (the dependent variable) from their height (the independent variable).

Linear Regression in R Step by Step

Here is the full workflow, from a fresh RStudio session to an interpreted result.

Step 1 — Install R and RStudio. Download base R from CRAN and RStudio Desktop. In RStudio, click File > New File > R Script so you can save and re-run your code.

Step 2 — Load helper packages (optional). Base R can fit and plot a regression with no add-ons, but tidyverse (data manipulation and ggplot2 visualisation) and ggpubr (publication-ready plots) make life easier. Install once with install.packages(), then load them each session:

install.packages(c("tidyverse", "ggpubr"))   # run once
library(tidyverse)
library(ggpubr)
theme_set(theme_pubr())

Step 3 — Load and inspect your data. Import a CSV and check it before modelling:

mydata <- read.csv("mydata.csv")
head(mydata)      # first six rows
summary(mydata)   # min, quartiles, median, mean and max of each numeric column

Step 4 — Fit the model with lm(). Put the response on the left of the ~ and the predictor(s) on the right:

# Simple linear regression (one predictor)
model <- lm(weight ~ height, data = mydata)

# Multiple linear regression (several predictors)
model2 <- lm(weight ~ height + age, data = mydata)
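
Since mydata.csv is a placeholder, here is a runnable sketch of the same simple-versus-multiple comparison using mtcars, a dataset that ships with R. The choice of mpg as the response and wt and hp as predictors is an assumption made only for illustration.

```r
# mtcars ships with R: mpg = miles per gallon, wt = weight (1000 lbs), hp = horsepower
simple   <- lm(mpg ~ wt, data = mtcars)        # one predictor
multiple <- lm(mpg ~ wt + hp, data = mtcars)   # two predictors

coef(multiple)                     # intercept plus one slope per predictor
summary(simple)$adj.r.squared      # adjusted R-squared with one predictor
summary(multiple)$adj.r.squared    # adjusted R-squared with two predictors
anova(simple, multiple)            # F-test: does adding hp improve the fit?
```

Comparing adjusted R-squared (rather than plain R-squared) is the safer habit here, because plain R-squared never decreases when a predictor is added.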

Step 5 — Read the results. Inspect the coefficients, R² and p-values:

summary(model)
confint(model)    # 95% confidence intervals for the coefficients

Step 6 — Plot the data and the fitted line. In base R:

plot(weight ~ height, data = mydata,
     main = "Weight vs Height")
abline(model, col = "red")   # adds the regression line

Or, with ggplot2 for a cleaner figure:

ggplot(mydata, aes(x = height, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

Step 7 — Predict new values. Estimate weight for new heights:

newdata <- data.frame(height = c(170, 182))
predict(model, newdata)
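
A point prediction says nothing about its uncertainty. As a sketch, predict() for lm models accepts an interval argument that adds lower and upper bounds; the data below are the ten height and weight values from this guide's worked example, and conf_int and pred_int are illustrative names.

```r
# Height (cm) and weight (kg) values taken from this guide's worked example
mydata <- data.frame(
  height = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131),
  weight = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
)
model   <- lm(weight ~ height, data = mydata)
newdata <- data.frame(height = c(170, 182))

# Uncertainty about the average weight at each height
conf_int <- predict(model, newdata, interval = "confidence", level = 0.95)

# Uncertainty about one new person's weight at each height (wider than the above)
pred_int <- predict(model, newdata, interval = "prediction", level = 0.95)

conf_int   # columns: fit, lwr, upr
pred_int
```

Use the confidence interval when reporting the expected value for a group, and the prediction interval when forecasting a single new observation.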

Step 8 — Report in words. Translate the slope into plain language, e.g. “For every 1 cm increase in height, predicted weight rose by 0.67 kg (p < 0.001), and the model explained 86% of the variation in weight (R² = 0.86).”

Assumptions and Diagnostic Plots

A linear regression is only trustworthy if its assumptions hold. There are four to check:

  1. Linearity — the relationship between predictor and response is genuinely a straight line.
  2. Independence — the residuals (errors) are independent of one another.
  3. Homoscedasticity — the residuals have roughly constant variance across all fitted values (no funnel shape).
  4. Normality — the residuals are approximately normally distributed.

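R lets you check most of these visually with a single command. Calling plot() on a model produces four diagnostic charts, which cover linearity, normality, homoscedasticity and influential points; independence is judged mainly from how the data were collected: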

par(mfrow = c(2, 2))   # show all four plots together
plot(model)

Diagnostic plot | Assumption checked | What you want to see
Residuals vs Fitted | Linearity | Points scattered randomly around 0, no curve.
Normal Q-Q | Normality of residuals | Points falling close to the diagonal line.
Scale-Location | Homoscedasticity | A flat, horizontal trend with even spread.
Residuals vs Leverage | Influential outliers | No points beyond Cook’s distance contours.
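
The plots can be backed up with a couple of numeric checks from base R. This is a sketch only: it uses the built-in cars dataset (stopping distance against speed) rather than the height and weight data, and the 4/n cut-off for Cook's distance is a common rule of thumb, not a strict threshold.

```r
# 'cars' ships with R: 50 rows with columns speed and dist
model <- lm(dist ~ speed, data = cars)

res <- residuals(model)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals depart from normality)
sw <- shapiro.test(res)
sw$p.value

# Influence: Cook's distance for every observation
cd <- cooks.distance(model)
flagged <- which(cd > 4 / nrow(cars))   # rule-of-thumb flag, worth a closer look
flagged
```

Treat these numbers as prompts to look back at the plots, not as pass or fail verdicts on the model.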

Worked Example: Predicting Weight from Height

Example: Suppose we record the height (cm) and weight (kg) of ten people and want to predict weight from height in R.

# 1. Enter the data
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
mydata <- data.frame(height, weight)

# 2. Fit the model
model <- lm(weight ~ height, data = mydata)

# 3. Read the output
summary(model)

The abridged output looks like this:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.4551     8.0490   -4.78  0.00139 **
height        0.6746     0.0519   13.00 1.16e-06 ***

Multiple R-squared:  0.9548,  Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06

How to read it: the fitted line is weight = -38.46 + 0.6746 × height. The slope of 0.6746 (p < 0.001, shown by ***) means each extra centimetre of height is associated with about 0.67 kg more weight. The Multiple R-squared of 0.9548 means height explains roughly 95% of the variation in weight in this sample, and the overall F-test p-value shows the model is statistically significant.

To predict the weight of someone 170 cm tall:

predict(model, data.frame(height = 170))
# -38.4551 + 0.6746 * 170 ≈ 76.2 kg

Plotting the points with the fitted line shows the positive linear trend at a glance:

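[Figure: scatter plot of the data with x (the predictor) on the horizontal axis and the fitted regression line drawn through the points. R’s lm() fits the line that minimises the sum of squared residuals.]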
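
For a single predictor, the line that minimises the squared residuals has a closed-form solution: the slope is Sxy / Sxx and the intercept is mean(y) − slope × mean(x). The sketch below applies those formulas to the worked example's data and compares the result with lm(); the names sxy and sxx are illustrative.

```r
# Height (cm) and weight (kg) values from the worked example above
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Sums of cross-products and squares around the means
sxy <- sum((height - mean(height)) * (weight - mean(weight)))
sxx <- sum((height - mean(height))^2)

slope     <- sxy / sxx                              # least-squares slope
intercept <- mean(weight) - slope * mean(height)    # line passes through the means

c(intercept = intercept, slope = slope)   # hand-computed coefficients
coef(lm(weight ~ height))                 # coefficients from lm() for comparison
```

Both lines should print the same intercept (about -38.46) and slope (about 0.6746).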

Frequently Asked Questions

How do you do a linear regression in R?

Fit the model with the lm() function, putting the response variable on the left of the ~ and the predictor on the right, for example model <- lm(weight ~ height, data = mydata). Then call summary(model) to view the coefficients, R² and p-values, and use plot(model) to check the assumptions.

How do you run a multiple linear regression in R?

Use the same lm() function and add each extra predictor with a plus sign, e.g. lm(weight ~ height + age, data = mydata). Everything else, including summary(), predict() and the diagnostic plots, works identically. See our guide to multiple linear regression for interpretation tips.

How do you find R² in R?

R reports it automatically: run summary(model) and read the “Multiple R-squared” line, or extract it directly with summary(model)$r.squared. R² equals 1 − (residual sum of squares ÷ total sum of squares) and gives the proportion of variation in the response that the model explains.

How do you calculate the R value in R?

The R value is the correlation coefficient. Compute it with cor(x, y), which returns a number between −1 and +1. Squaring that value gives R², the same figure shown as “Multiple R-squared” in a simple lm() summary.

How do you interpret the p-values in the regression output?

Each coefficient’s p-value (the Pr(>|t|) column) tests whether that predictor’s true effect is zero. A value below 0.05 suggests the predictor is statistically significant. The asterisks (*, **, ***) are a shorthand for the significance level.

Do you need to install a package to run linear regression in R?

No. The lm(), summary(), predict() and plot() functions are all part of base R. Packages such as ggplot2 or ggpubr are only needed if you want more polished graphics.

About Aadam Mae

Aadam Mae is an academic researcher and author with a PhD in NLP (Natural Language Processing) at ResearchProspect. Mae's work delves into the intricacies of language and technology, delivering profound insights in concise prose. Pioneering the future of communication through scholarship.
