Word frequency norms are usually computed by counting word frequencies in some large, relatively diverse, and hopefully representative corpus. However, a raw frequency is only interpretable relative to the size of that corpus.
Converting raw frequencies to probabilities (i.e., via maximum likelihood estimation) removes this corpus dependence, but the resulting probabilities are not terribly interpretable either. One slight improvement has been to express them as words per million (wpm), usually rounding to the nearest power of ten. It is reasonably obvious to me that “100 wpm” is an improvement on “.0001” or the equivalent “1e-4”.
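The conversion is simple enough to sketch in a couple of lines; the function name below is my own, not from any particular library:

```python
def words_per_million(count: float, corpus_size: float) -> float:
    """Convert a raw corpus count to a words-per-million rate.

    This is just the maximum-likelihood probability (count / corpus_size)
    rescaled by one million.
    """
    return 1_000_000 * count / corpus_size
```

For instance, a word occurring 100 times in a one-million-word corpus (probability .0001) comes out at 100 wpm.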
Van Heuven et al. (2014) propose a variation on words-per-million metrics which they call the Zipf scale. While van Heuven et al. do not give a complete formula, their examples indicate that the scale is equivalent to $\log_{10}(\mathrm{wpm}) + 3$, and can be computed from raw frequencies as $\log_{10}(c) - \log_{10}(N) + 9$ when the raw frequency $c > 0$, and $0$ otherwise, where $N$ is the corpus size. This definition differs slightly from van Heuven et al.’s formula in that we do not use “add 1” smoothing, which causes issues with fractional frequencies, and we give a function which is defined at zero frequency: the Zipf scale value of an unattested word, naturally, is zero. Here is a tiny Python module for computing it, and here is the associated unit test.
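A minimal sketch of such a function, following the formula above (this is my own illustration, not the linked module):

```python
import math


def zipf(count: float, corpus_size: float) -> float:
    """Zipf-scale frequency, without add-1 smoothing.

    Computed as log10(count) - log10(corpus_size) + 9 when count > 0;
    unattested words (count == 0) are assigned a Zipf value of 0.
    """
    if count <= 0:
        return 0.0
    return math.log10(count) - math.log10(corpus_size) + 9.0
```

So a word occurring 100,000 times in a one-billion-word corpus (100 wpm) gets a Zipf value of 5, and a zero-frequency word gets 0 rather than a math domain error.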
We use this definition in the new webapp-based CityLex, out now, as one of several ways to express frequency norms.
References
Van Heuven, W. J. B., Mandera, P., Keuleers, E., and Brysbaert, M. 2014. SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology 67: 1176-1190.