linear regression, in statistics, a process for determining a line that best represents the general trend of a data set.

The simplest form of linear regression involves two variables: y, the dependent variable, and x, the independent variable. The equation developed is of the form y = mx + b, where m is the slope of the regression line (the regression coefficient) and b is the y-intercept, where the line crosses the y-axis. The equation for the regression line can be found using the least squares method, where m = (nΣxy − ΣxΣy)/(nΣx² − (Σx)²) and b = (Σy − mΣx)/n. The symbol Σ indicates a summation of all values, and n is the number of data points.
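A direct translation of these formulas into Python might look like the following sketch; the sample data points are invented purely for illustration.

    # Least squares estimates of slope m and intercept b, computed
    # directly from the summation formulas above. Data are made up.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]   # independent variable
    y = [2.1, 3.9, 6.2, 8.0, 9.8]   # dependent variable
    n = len(x)

    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    # m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # b = (Σy − mΣx) / n
    b = (sum_y - m * sum_x) / n

    print(f"regression line: y = {m:.3f}x + {b:.3f}")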

When a linear correlation exists in the data, a regression line can be found that represents the line of best fit. The resulting equation can then be used to predict values not collected in the original data set.
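Continuing the sketch above, prediction is simply evaluating the fitted equation at a new x value (the value x = 6 here is hypothetical):

    x_new = 6.0
    y_pred = m * x_new + b   # y = mx + b evaluated at the new point
    print(f"predicted y at x = {x_new}: {y_pred:.2f}")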

It is often useful to graph the collected data to see whether a correlation is likely before finding the equation of the regression line. If the data points are scattered and show no sign of a relationship, any equation found using linear regression will most likely not yield useful information. Pearson’s correlation coefficient can be calculated to assist with this assessment: a coefficient close to +1 (a strong positive correlation) or close to −1 (a strong negative correlation) indicates that it makes sense to use linear regression.
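One common computational form of Pearson’s correlation coefficient, sketched here in Python (the helper name pearson_r is our own, not a library function):

    import math

    def pearson_r(x, y):
        """Pearson's correlation coefficient for paired samples x and y."""
        n = len(x)
        sum_x, sum_y = sum(x), sum(y)
        sum_xy = sum(xi * yi for xi, yi in zip(x, y))
        sum_x2 = sum(xi ** 2 for xi in x)
        sum_y2 = sum(yi ** 2 for yi in y)
        numerator = n * sum_xy - sum_x * sum_y
        denominator = math.sqrt((n * sum_x2 - sum_x ** 2)
                                * (n * sum_y2 - sum_y ** 2))
        return numerator / denominator

    r = pearson_r([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8])
    print(f"r = {r:.4f}")   # close to +1: strong positive linear correlation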

As an illustration of regression analysis and the least squares method, suppose a university medical centre is investigating the relationship between stress and blood pressure. Assume that both a stress test score and a blood pressure reading have been recorded for a sample of 20 patients. The data are shown graphically in the figure, called a scatter diagram. Values of the independent variable, stress test score, are given on the horizontal axis, and values of the dependent variable, blood pressure, are shown on the vertical axis. The line passing through the data points is the graph of the estimated regression equation: y = 0.49x + 42.3. The parameter estimates, m = 0.49 and b = 42.3, were obtained using the least squares method.

A primary use of the estimated regression equation is to predict the value of the dependent variable when a value for the independent variable is given. For instance, given a patient with a stress test score of 60, the predicted blood pressure is 0.49(60) + 42.3 = 71.7. The values predicted by the estimated regression equation are the points on the line in the figure, and the actual blood pressure readings are represented by the points scattered about it. The difference between the observed value of y and the value of y predicted by the estimated regression equation is called a residual. The least squares method chooses the parameter estimates so that the sum of the squared residuals is minimized.
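The residual calculation can be made concrete with a short sketch. The estimated equation y = 0.49x + 42.3 comes from the article; the four stress scores and blood pressure readings below are invented stand-ins, not the actual 20-patient sample.

    m, b = 0.49, 42.3                      # parameter estimates from the article

    stress = [45, 60, 75, 90]              # hypothetical stress test scores
    observed = [65.0, 73.5, 78.0, 87.2]    # hypothetical blood pressure readings

    predicted = [m * s + b for s in stress]
    residuals = [obs - pred for obs, pred in zip(observed, predicted)]
    ssr = sum(r ** 2 for r in residuals)   # the quantity least squares minimizes

    for s, obs, pred, res in zip(stress, observed, predicted, residuals):
        print(f"score {s}: observed {obs}, predicted {pred:.1f}, residual {res:+.1f}")
    print(f"sum of squared residuals: {ssr:.2f}")

Note that the predicted value for a score of 60 is 0.49(60) + 42.3 = 71.7, matching the example above; least squares chooses the m and b that make the printed sum of squared residuals as small as possible for the given data.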

Ken Stewart

causation

Also known as: causality, cause and effect

causation, Relation that holds between two temporally simultaneous or successive events when the first event (the cause) brings about the other (the effect). According to David Hume, when we say of two types of object or event that “X causes Y” (e.g., fire causes smoke), we mean that (i) Xs are “constantly conjoined” with Ys, (ii) Ys follow Xs and not vice versa, and (iii) there is a “necessary connection” between Xs and Ys such that whenever an X occurs, a Y must follow. Unlike the ideas of contiguity and succession, however, the idea of necessary connection is subjective, in the sense that it derives from the act of contemplating objects or events that we have experienced as being constantly conjoined and succeeding one another in a certain order, rather than from any observable properties in the objects or events themselves. This idea is the basis of the classic problem of induction, which Hume formulated. Hume’s definition of causation is an example of a “regularity” analysis. Other types of analysis include counterfactual analysis, manipulation analysis, and probabilistic analysis.
