statistical significance, in statistics, the determination that a result or an observation from a set of data reflects intrinsic qualities of the data rather than random variation in the sample. An observation is statistically significant if its probability of occurring is sufficiently small given the truth of a null hypothesis. Since its conception in the 18th century, statistical significance has become the gold standard for establishing the validity of a result. Statistical significance does not imply the size, importance, or practicality of an outcome; it simply indicates that the outcome’s difference from a baseline is not due to chance.
Statistical significance implies that an observed result is not due to sampling error. Variability must be considered when conducting an experiment, a poll, or another sampling technique, as findings can come from unrepresentative samples. A statistical analysis called hypothesis testing is used to determine whether an observation is statistically significant. The process begins by setting the result against a null hypothesis, which is typically a claim of no difference or relationship between sets of data. The appropriate statistical test is then performed to produce a probability called the p-value, which indicates how likely a result at least as extreme as the one observed would be if the null hypothesis were true. The lower this probability, the less likely the observation is due to chance. The p-value is then compared with a predetermined value called α (alpha), or the significance level. If the p-value is lower than this threshold, the null hypothesis is rejected, the observed difference is concluded to be not due to chance, and the result is considered statistically significant.
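The comparison at the heart of this procedure can be sketched in a few lines of code. The Python snippet below is purely illustrative: the helper name decide, its default α = 0.05, and the example p-values are invented for this sketch and would, in practice, come from an actual statistical test.

```python
# Illustrative sketch of the decision rule: compare a test's p-value to a
# predetermined significance level alpha. The p-values below are hypothetical.
def decide(p_value: float, alpha: float = 0.05) -> str:
    if p_value < alpha:
        return "reject the null hypothesis (statistically significant)"
    return "fail to reject the null hypothesis (not statistically significant)"

print(decide(0.001))  # p-value below alpha: statistically significant
print(decide(0.20))   # p-value above alpha: not statistically significant
```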
As an illustration of this process, imagine that the mean weight of a sample of apples from Orchard A is 140 grams (5 ounces), while the mean weight of a sample from Orchard B is 170 grams (6 ounces). A hypothesis test establishes whether this 30-gram (1-ounce) difference is statistically significant—that is, whether the apples from Orchard B are truly heavier, on average, than those from Orchard A or whether the difference is due to unrepresentative samples. The null hypothesis would posit that there is no difference in mean weight: “H0: mean weight of Orchard A apples = mean weight of Orchard B apples.” After the threshold of significance is set (α = 0.05), the appropriate statistical test would produce a test statistic and a corresponding p-value—say, p = 0.001. Because this p-value is less than α = 0.05, the null hypothesis would be rejected, leading to the conclusion that a true, statistically significant difference exists between the mean weights of Orchard A’s and Orchard B’s apples. However, if the p-value were greater than α = 0.05, the difference could plausibly be due to chance, and it would not be considered statistically significant.
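A two-sample t-test is one common way to carry out a comparison like this one. The sketch below uses SciPy's ttest_ind on simulated apple weights; the sample sizes, the 15-gram standard deviation, and the random seed are assumptions made for illustration and are not specified in the example above.

```python
# A hypothetical version of the orchard example: simulate apple weights and
# test H0 (mean weight of Orchard A apples = mean weight of Orchard B apples).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
orchard_a = rng.normal(loc=140, scale=15, size=50)  # assumed mean ~140 g
orchard_b = rng.normal(loc=170, scale=15, size=50)  # assumed mean ~170 g

statistic, p_value = stats.ttest_ind(orchard_a, orchard_b)

alpha = 0.05
print(f"p-value = {p_value:.3g}")
if p_value < alpha:
    print("Reject H0: the difference in mean weights is statistically significant.")
else:
    print("Fail to reject H0: the difference could plausibly be due to chance.")
```

With group means 30 grams apart and relatively little spread in the simulated samples, the resulting p-value falls far below α = 0.05, mirroring the p = 0.001 used in the example.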
The concept of statistical significance originated in the 18th century with Scottish physician and mathematician John Arbuthnot. After examining 82 consecutive years of London baptisms indicating that more boys had been born than girls, he calculated the odds of an 82-year streak of more boys being born each year as 1 in 2⁸², or about 1 in 4.8 septillion, a probability so small that he concluded the pattern could not have been due to chance and hence was due to Divine Providence. His conclusion was the first to use an occurrence’s small probability as evidence of validity, and other mathematicians soon followed suit. In 1781 French mathematician Pierre-Simon Laplace treated a chance probability of 1 in 100 as small enough to conclude that a real difference existed in birth sex ratios between Paris and Naples. Siméon-Denis Poisson, another French mathematician, equated a chance of 1 in 213 with certainty in an 1837 examination of trial outcomes. As the field of statistics advanced, so did the understanding of statistical significance. In 1925 British statistician and geneticist R.A. Fisher characterized p = 0.05 as a convenient limit for judging whether a deviation should be considered significant. Although he acknowledged that the level should change depending on the need for stricter thresholds, α = 0.05 became the standard for statistical significance in academic research.
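Arbuthnot’s figure follows from treating each year as a fair coin flip: under the null hypothesis that boys and girls are equally likely to predominate in any given year, the probability that boys outnumber girls in all 82 years is (1/2)⁸². The short calculation below, included purely as an illustration, reproduces that arithmetic.

```python
# Probability of 82 consecutive years favoring boys if each year were a
# fair 50/50 chance: (1/2) ** 82.
from fractions import Fraction

p = Fraction(1, 2) ** 82
print(p)         # 1/4835703278458516698824704
print(float(p))  # about 2.1e-25, i.e., roughly 1 in 4.8 septillion
```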
A growing number of researchers have voiced concerns over the misinterpretation of, and overreliance on, statistical significance. Often, analysis ends once an observation has been deemed statistically significant, and the observation is treated as evidence of an effect. This tendency is especially problematic given that statistical significance is not equivalent to clinical significance, a measure of effect size and practical importance. In an experiment, a statistically significant result simply indicates that a difference exists between two groups. This difference might be very small, and, without further testing, its practical impact is unknown. Furthermore, a lack of further analysis can lead to incorrect conclusions about validity. Fisher himself noted that a finding should be designated as experimentally established only if the study, when repeated, rarely fails to provide significant results. Although small, α = 0.05 is a nonzero probability, so observations that are in fact due to chance will sometimes be incorrectly categorized as statistically significant.
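This last point, that a 0.05 threshold still lets some chance findings through, can be illustrated with a simulation. In the hypothetical sketch below, both samples are always drawn from the same population, so the null hypothesis is true and every “significant” result is a false positive; the population parameters, sample sizes, and number of trials are arbitrary choices made for illustration.

```python
# Simulate many experiments in which the null hypothesis is true and count
# how often a test is (wrongly) declared statistically significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_trials, false_positives = 0.05, 10_000, 0

for _ in range(n_trials):
    a = rng.normal(loc=150, scale=15, size=30)  # both samples share one population
    b = rng.normal(loc=150, scale=15, size=30)
    _stat, p_value = stats.ttest_ind(a, b)
    if p_value < alpha:
        false_positives += 1

print(f"False positive rate: {false_positives / n_trials:.3f}")  # close to 0.05
```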