On “significance levels”

R (I think it was R) introduced a practice in which multiple asterisk characters are used to indicate different significance levels for tests. [Correction: Bill Idsardi points out some prior art that probably predates the R convention. I have no idea what S or S-Plus did, nor what R was like before 2006 or so. But certainly R has helped popularize it.] For instance, in R statistical summaries, * denotes a p-value such that .01 < p < .05, ** denotes a p-value such that .001 < p < .01, and *** denotes a p-value < .001. This type of reporting can increasingly be found in papers as well, but there are good reasons not to copy R’s bad behavior.
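
Here is a minimal R sketch (with made-up toy data; nothing here comes from a real analysis) of where these stars show up:

    # Toy data, invented purely for illustration: y depends weakly on x.
    set.seed(1)
    d <- data.frame(x = rnorm(100))
    d$y <- 0.3 * d$x + rnorm(100)
    # The printed coefficient table decorates each p-value with zero or more
    # asterisks and ends with a legend along the lines of:
    #   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    summary(lm(y ~ x, data = d))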

In null hypothesis testing, the mere size of the p-value itself has no meaning. All that matters is whether p is greater than or less than the α-level. Depending on space, we may report the exact value of p for a test (often rounded to two digits, with “< .01” used as an abbreviation for very small values, since you don’t want to round a small p down to zero), but we need not. And it simply does not matter at all how small p is once it’s below the α-level. There is no notion of “more significant” or “less significant”.
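
In code, that convention might look something like the following (report_p is just a throwaway helper name of my own):

    # Round p to two digits, but never down to zero: anything below .01
    # is simply reported as "< .01".
    report_p <- function(p) ifelse(p < .01, "< .01", sprintf("%.2f", p))
    report_p(c(.43, .032, .0004))   # e.g. "0.43" "0.03" "< .01"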

R also uses the period character ‘.’ to indicate a p-value between .05 and .1. Of course, I have never read a single study using an α-level greater than .05 (I suppose this would simply make the probability of a Type I error too high), so I’m not sure what the point is.

My suggestion here is simple. If you want, use ‘*’ to indicate a significant (p < α) result, and then in the caption write something like “*: < .05” (assuming that your α-level is .05). Do not use additional asterisks.

5 thoughts on “On “significance levels””

  1. The multiple asterisk convention certainly predates R.
    https://www.jstor.org/stable/3598292 : “… in 1991 the American Sociological Review instituted an editorial policy that forbids the reporting of significance above the 5 percent level and requires one, two or three asterisks to denote significance at the .05, .01 and .001 levels, respectively.”
    And Figure 1 in that article shows at least one example of the “3 star” convention from the 1950s (presumably calculated by hand).
    I’m not defending this practice, but hardly anyone strictly adheres to NHST methods, and asterisks provide a quick and dirty visualization of what amounts to effect size, somewhat like Tukey’s methods in Exploratory Data Analysis.

    1. Thanks for the historical correction. R is of a similar vintage, though, and it has certainly popularized the notation.

      I disagree that the information conveyed by the asterisks is really closely connected to effect size. Consider two 1,000-element samples I just drew: one with unit variance and centered on 0, another with the same unit variance but centered on .17. We obtain a Welch’s t-test p < .0001 (which ASR and R presumably consider “large”), but a Cohen’s d < .2 (which is considered “small”). I’m not a huge fan of the overly verbose outputs you get from statistics packages, but if they’re going to provide a bunch of extraneous information, why not add Cohen’s d? It’s rare for me to run a Welch’s t-test and not compute it too.
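
      Something along these lines re-creates samples like the ones above (pooled_sd is just my own variable name, and the exact numbers will of course vary from draw to draw):

          # Two 1,000-element samples with unit variance, centered on 0 and .17.
          a <- rnorm(1000, mean = 0, sd = 1)
          b <- rnorm(1000, mean = .17, sd = 1)
          # Welch's t-test (t.test's default) typically yields a p-value on the
          # order of 1e-4 for samples this size...
          t.test(a, b)$p.value
          # ...while Cohen's d (mean difference over the pooled SD) lands near
          # the true .17, i.e. "small" by the usual rule of thumb.
          pooled_sd <- sqrt((var(a) + var(b)) / 2)
          abs(mean(a) - mean(b)) / pooled_sd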

      1. I mean that the number of asterisks conveys a crude visualization of relative effect sizes compared across the variables of interest within an analysis. Like forward selection in stepwise multiple regression.

  2. Hi Kyle,
    I fully agree with your point of view. I still struggle with students using this, as they see it way too often in published articles. Unfortunately, co-authors are using it as well.
    Do you have one or two publications that I could refer to on this issue of not using more than one asterisk after correlations (r is an effect size anyway, so there is no need for more than one asterisk to indicate effect size)?
    Best, Stephan

    1. I don’t know, but I would have to imagine any standard statistics textbook would distinguish between non-gradeable significance/non-significance in a null hypothesis test and gradeable effect size. You can also tell them that I review for various publications (including some outside of core linguistics) and have more than once recommended that authors fix this very issue (usually by just using one asterisk plus a separate, domain-appropriate effect size statistic).
