# A commentary on 'Tests of Significance Considered as Evidence' (1942) by Joseph Berkson

This paper is old, and it shows! He starts of with the following:

There was a time when we did not talk about tests of significance; we simply did them. We tested whether certain quantities we significant in the light of their standard errors, without inquiring as to just what was involved in the procedure, or attempting to generalize it.

Sounds like the golden age of statistics! But the twilight of that age had long passed, for when he wrote this paper, statistics “consists almost entirely of tests of significance”. Everything happened so fast at the beginning of the 20th century!

This paper is an unusually pleasant read. His prose is clear and his style is crisp. So don’t worry about fluffy prose in the style of Ludwig von Mises or Friedrich von Hayek.

So, what’s it about? It’s about tests of significance, and he doesn’t like ’em. For on page two he says, about the meaning of significance testing:

[…] If the experience at hand would occur only very infrequently in a given hypothesis, the hypothesis is considered disproved. The argument has an apparent plausibility and for many years I adhered to it.

“Whereas I was blind,”

However, set against experience with actual problems, reflection has led me to the conclusion that it is erroneous, and that a reevaluation will lead to clearer comprehension in the application of tests of significance and also serve as a corrective of some of its misuses.

… " now I see."

## Human albinos

So what’s his beef with significance testing?

In the first place, the argument seems to be basically illogical. Con sider it in symbolic form. It says “If A is true, B will happen some times; therefore if B has been found to happen, A can be considered proved.” There is no logical warrant for considering an event known to occur in a given hypothesis, even if infrequently, as disproving the hypothesis.

This is true. And I think it’s a good point. Though it is well-appreciated now, at least among statisticians. He has an example of this:

Suppose I said, “Albinos are very rare in human populations, only one in fifty thousand. Therefore, if you have taken a random sample of 100 from a population and found in it an albino, the population is not human.”

This is a good one. First, it doesn’t have the abstract form of a typical statistics example. Second, it involves albinos! (Although he should’ve talked about peacocks and albino peacocks instead.) Third, it shows how bad it is to neglect alternatives. If the family of possible populations are, say, the family of all animal species, I suppose the rate of albinism would not be comparatively high among humans. And it’s silly to reject a hypothesis if none of the alternatives are better able to account for it.

Of course, Neyman and Pearson placed much emphasis on alternatives, as does modern testing theory. But Fisher wanted to avoid explicit alternatives, from what I’ve read. But at least in some cases, such as *t*-test, the alternative is implicitly given. Here’s a quote of Barnard, in a comment to Johnstone’s Tests of significance in theory and practice, p. 500:

With Student’s test, for example, the null hypothesis is, that the observations are independent and normally distributed with zero mean. The alternatives are any possibilities which would tend to make the sample mean large in relation to the sample standard deviation. The distribution could be non-normal, with non-zero mean; or the observations could fail to be independent. Indefiniteness of the precise alternatives is part of the logical set-up. Neyman and Pearson showed that if the alternatives are restricted to independence and normality, then the t-test is uniquely optimal, in their sense-a result closely related to Fisher’s earlier demonstration that, in the normal independent family, the pair (x,s,) is minimally sufficient for the parameters.

Barnard, in the same paper, also claims that Fisher wasn’t against alternatives in general, just parameterized alternatives. To that I can only say “meh”. Statistics is about making your methods formal and precise. This isn’t just important because we want good methods, but it is *honest*, and allows people to criticise your arguments. When your argument is a diffuse “alternatives must be considered”, with no mention of how to do that, criticism becomes too hard. But it fits into my impression of Fisher as a (brilliant, of course) man who prefered to bolster his claims by appeal to personal authority. I mean, why do we need a *method* to find good tests / *p*-values when we have Fisher to guide us?

*Drosophila*

On p. 328 he takes an example directly from Fisher. It’s about testing linearity in a one-variable regression model. I’ve never seen such a test before, so I’m a little uncomfortable commenting it. But the gist of the matter is that Fisher rejected linearity for a *Drosophila* data set, Berkson plots it and shows that it is linear (It is! Look at the plot.), and claims that the test rejected the null-hypothesis *due to heteroskedasticity, not non-linearity*. This is another good example! When you calculate a *p*-value under the any assumption it can, without any further guarantees, reject the nullmodel due to any kind of discrepancy. For instance, when \(H_0\) is the product of \(n\) standard normals, rejecting when the absolute value of \(\sqrt{n}\overline{X}\) is less \(0.0627\) is a valid test of the normal distribution with \(\alpha = 0.05\), but with an *unusual* alternative. If you pretend you test that the mean is unequal to \(0\) when using this test, you are likely to fail — you’re finding something else, something more like a test of the standard deviation being less than \(1\). Likewise, when you test for non-linearity, you want to test for that only, not something else. This sounds like a hard problem though. I don’t know if anyone have attempted to solve it.

So what’s his proposal? Take this quote on p. 329:

According to what is advocated here, we cannot lay down any pat axiomatic rules such as “A very small P disproves the hypothesis tested,” or “Equally, a very high P disproves the hypothesis,” for it is not primarily the infrequency of the P which gives the finding its meaning. Each test will have to be examined and the circumstances in which it is applied will have to be examined, to find out, as best we can, whether any particular regions of P will occur relatively frequently in the case of an alternative to the tested hypothesis. There are situations in which a very large P will be frequent in an alternative, and in these circumstances, but only in these circumstances, a very high P can be said to disfavor the null hypothesis.

Let’s take a model \(P\) and a *p*-value \(p\). If \(p\) is small enough to warrant rejection, recalculate it under the alternative model to see if it has a reasonable high probability there. Huh! I like this proposal!

He has a good example here as well, which is related to my comment about the unusual test for the normal above. The *p*-value corresponding to this test is \(\Phi_{[0,\infty)}(\sqrt{n}|\bar{X}|)\). You can reasonable think of this *p*-value as the *dual* of the usual *p*-value \(1 - \Phi_{[0,\infty)}(\sqrt{n}|\bar{X}|)\). The interesting thing about these two *p*-values is that if one is high, the other is low. Say I observe \(1 - \Phi_{[0,\infty)}(\sqrt{n}|\bar{X}|) = 0.9997\) and think “this distribution is really, *really* normal!”, you will come and say “Uh, no, it’s not normal at all. In fact, my *p*-value is 0.0003, it’s really, *really* non-normal!”. It clearly possible to do this with any *p*-value testing a non-composite hypothesis.

Back to his example: He wants to test whether some observations are Poisson. As a test statistic, he takes \(n\frac{s^2}{\overline{x}}\), which is distributed as \(\chi^2(n)\) when the Poisson model is the true model. What does this statistic measure? Dispersion! Usually we would test against *overdispersion*, so that a large \(\chi^2(n)\)-value gives reason to reject the hypothesis. But the dual *p*-value tests for *underdispersion*, which can obviously happen. And we get the same situation as above: Both \(1-p\) and \(p\) are *p*-values, and we need some understanding of what the alternative is in order to differenatiate between them.

But what happens if you don’t know this, if you listen to someone who says: “Use this test to test for Poissonity!” You do it, expecting it to find any deviation from Poissonity, especially your deviation of underdispersion. But it won’t ever find that! Anyway, the *p*-value is \(0.0001\), you reject the Poisson and go with your underdispersed Poisson instead. Maybe this is a bad example, but the point still stands. A way to avoid such problems it is to calculate the same *p*-value under your proposed alternative. If it is too low (or too high?), reject the alternative as well. But an even better method is follow Neyman and explicitly state your alternative from the start, then develop a test tailored to your specific problem.

He goes on to talk about what you can conclude when the *p*-value in the midrange, say \([0.3,0.7]\). It’s common practice among statistics teacher to say that you never can accept a nullhypothesis, only reject it. Berkson thinks it’s okay to accept it though, and points out that his contemporaries did this all the time! He mentions Student as one of these. Student did a goodness-of-fit test on his purpotedly Poisson data, got a moderate *p*-value, and concluded that Poisson is okay. People do this all the time, especially in economy, where testing for normality and unit roots is the order of the day. Is it okay to do this? I guess it is, provided you have a good reason to assume Poisson or unit roots at the outset.

But might lose something when you interpret middle-*p*s as supporting the nullhypothesis:

Here we have disclosed one fundamental weakness in the position of those who contend that small samples can be effectively utilized in statistical investigations if the calculations of the P’s are correctly made. If it were a fact that conclusions are drawn only when the P is very small and the null hypothesis disproved, then so far as concerns the main considerations here developed, there would be a certain validity to this view, for small P’s are more or less independent, in the weight of the evidence they afford, of the numbers in the sample. But if actually it is the fact that conclusions will be drawn from P’s which are not small, then only very considerable numbers in the sample are reliable.

He means that a middle *p*-value could reasonably be created by the true alternative when \(n\) is small, but not when it is large. Curiously, he doesn’t mention power. Perhaps he wasn’t aware of Neyman-Pearson. At least he didn’t cite them. I would like to add that you should adjust interpretation of what constitutes a small *p*-value together with \(n\). Let it be proportional to \(n^{-\frac{1}{2}}\) This will make your interpretation more approximately Bayesian.

# Verdict

Most of this paper is an informal defense of an informal Neyman-Pearson-ish way of looking at statistics, without any reference to the Neyman-Pearson concepts. For instance, he accepts nullhypotheses, but only in the presence of sufficient power, he wants us to think about alternatives, and he rejects *p*-values as a measure of evidence.

This paper is well written, has good points, and is a light reading. It’s bedtime reading! I recommend it.