Statistical Inference (and What is Wrong With Classical Statistics)


This page concerns statistical inference as described by the most prominent and mainstream school of thought, which is variously described as ‘classical statistics’, ‘conventional statistics’, ‘frequentist statistics’, ‘orthodox statistics’ or ‘sampling theory’. Oddly, statistical inference itself, the drawing of conclusions from data, is never actually defined within the paradigm.

The practice of statistical inference as described here includes estimation, both point estimation and interval estimation (using confidence intervals), and significance tests (testing a null hypothesis and calculating p-values).

The important point is that all of these methods involve pretending that our sample came from an imaginary experiment that involved considering all possible samples of the same size from the population.


The first formal significance test (Arbuthnott, 1710) correctly demonstrated that the excess of male births is statistically significant, but erroneously concluded that this was due to Divine Providence (intelligent design, rather than chance). Modern hypothesis testing is an anonymous hybrid of the tests proposed by Ronald Fisher (1922, 1925) on the one hand, and Jerzy Neyman and Egon Pearson (1933) on the other. Since Berkson (1938), people have questioned the use of hypothesis testing in the sciences. For a historical account of significance testing, see Huberty (1993).

The frequentist interpretation of probability is very limited

A frequentist subscribes to the long run relative frequency interpretation of probability: the probability of an outcome is defined as the limiting relative frequency with which that outcome appears in a long series of similar trials. Dice, coins and shuffled playing cards can be used to generate random variables; therefore, they have a frequency distribution, and thus the frequency definition of probability can be used. Unfortunately, the frequency interpretation can only be used in cases such as these. The Bayesian interpretation of probability can be used in any situation.
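The long run idea can be made concrete with a small simulation (a minimal sketch; the die, the seed and the trial counts are all illustrative assumptions, not anything from the text above):

```python
import random

random.seed(0)

def relative_frequency_of_six(n_rolls: int) -> float:
    """Roll a fair six-sided die n_rolls times and return the relative
    frequency of sixes. Under the frequentist interpretation, the
    probability of a six is the limit of this value as n_rolls grows."""
    sixes = sum(1 for _ in range(n_rolls) if random.randint(1, 6) == 6)
    return sixes / n_rolls

for n in (100, 10_000, 1_000_000):
    # The estimates drift toward 1/6 as n grows; no finite run pins the
    # probability down exactly, which is precisely the limitation at issue.
    print(n, relative_frequency_of_six(n))
```

Note that the definition only makes sense for repeatable chance set-ups like this one; for a one-off event there is no sequence of trials to take a limit over.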

The nature of the null hypothesis test

Why should we choose between just two hypotheses, and why can't we put a probability on a hypothesis? A typical null hypothesis, that two population means are equal, is daft: they will almost never be exactly equal. What does it mean to accept or reject a hypothesis? If a significance level is used to decide whether a null hypothesis is true or not, note that the level, such as 0.05, is totally arbitrary (the level effectively acts as a prior, but classical statisticians fail to appreciate this).
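The daftness of a point null of exact equality can be demonstrated numerically. The sketch below (all values assumed for illustration: Gaussian populations with unit variance and a practically irrelevant mean difference of 0.01) shows that with enough data the test will declare even a trivial difference ‘significant’ at the arbitrary 0.05 level:

```python
import math
import random

random.seed(1)

def two_sample_z_pvalue(xs, ys):
    """Two-sided p-value for H0: equal means, assuming known unit variances."""
    nx, ny = len(xs), len(ys)
    z = (sum(xs) / nx - sum(ys) / ny) / math.sqrt(1 / nx + 1 / ny)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

def sample(mean, n):
    # Two populations whose means differ by a negligible 0.01.
    return [random.gauss(mean, 1.0) for _ in range(n)]

for n in (100, 1_000_000):
    p = two_sample_z_pvalue(sample(0.0, n), sample(0.01, n))
    # With huge n, even this trivial difference yields p < 0.05.
    print(n, p)
```

Since real population means are almost never exactly equal, a large enough sample virtually guarantees ‘significance’, which says nothing about whether the difference matters.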

Prior information is ignored

Almost all prior information is ignored and no opportunity is given to incorporate what we already know.

Assumptions are swept under the carpet

The subjective elements of classical statistics, such as the choice of null hypothesis, determining the outcome space, the appropriate significance level and the dependence of significance tests on the stopping rule, are all swept under the carpet. Bayesian methods put them where we can see them - in the prior.

p-values are irrelevant (which leads to incoherence) and misleading

With little loss of generality, let us consider a simple problem of inference. Assume that we have a large population with known mean and one sample. All of this makes up our evidence, E. Our hypothesis, H, is that the sample came from a different population (one with a different mean).

The frequentist theory of probability is only capable of dealing with random variables which generate a frequency distribution ‘in the long run’. We have one fixed population and one fixed sample. There is nothing random about this problem and the experiment is conducted only once, so there is no ‘long run’. So, versed in frequentist probability, what is our hapless orthodox statistician to do?

We pretend that the experiment was not conducted once, but an infinite number of times (that is, we consider all possible samples of the same size). Incredibly, all samples are considered equal, that is, our actual sample is not given any privileges over any other (imaginary) sample. We assume that each sample mean includes an ‘error’, which is independently and normally distributed about zero. Optimistically, we now claim that our sample was ‘random’. Voila! The sample mean now becomes our random variable, which we call our ‘statistic’. We can now apply the frequentist interpretation of probability.

We are now able to determine the (frequentist) probability of a (randomly chosen) sample mean having a value at least as extreme as our original sample mean. Note that we are implicitly assuming that the null hypothesis is true, that is, that the expected value of each sample mean equals the known population mean. This probability is our p-value which, incredibly, is assumed to apply to the original problem.
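The imaginary infinitely-repeated experiment can be simulated directly. The sketch below (the population mean, standard deviation, sample size and observed sample mean are all assumed numbers for illustration) draws many hypothetical samples from the known population and counts how often an imaginary sample mean is at least as extreme as the one we actually observed:

```python
import random

random.seed(2)

# Assumed 'known' population and the single fixed sample we actually have.
POP_MEAN, POP_SD, N = 100.0, 15.0, 25
OBSERVED_MEAN = 106.0

def simulated_p_value(trials: int = 100_000) -> float:
    """Approximate the two-sided p-value by simulating the imaginary
    repetitions: draw size-N samples from the known population and count
    the fraction whose mean is at least as far from POP_MEAN as ours."""
    extreme = 0
    for _ in range(trials):
        m = sum(random.gauss(POP_MEAN, POP_SD) for _ in range(N)) / N
        if abs(m - POP_MEAN) >= abs(OBSERVED_MEAN - POP_MEAN):
            extreme += 1
    return extreme / trials

# The fraction of purely imaginary samples more extreme than our real one.
print(simulated_p_value())
```

Every sample mean in the count except our own is imaginary, yet it is this fraction, not anything about our actual sample's provenance, that gets reported as evidence.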

A method similar to that outlined above is common to all Fisher-Neyman-Pearson inference. The p-value also suffers from being an incoherent measure of support, in the sense that we can reject a hypothesis that is a superset of a second hypothesis without rejecting the second. p-values are not just irrelevant; they are dangerous, because they are often misunderstood to be probabilities about the hypothesis, given the data (which would be far more intuitive). As the prominent Bayesian Harold Jeffreys observed, ‘What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred’ (Jeffreys, 1961).

In summary:

Important Publications