A tutorial on contingency tables

Many results in science and medicine can be compactly represented as a table containing the co-occurrence frequencies of two or more discrete random variables. This data structure is called the contingency table (a name suggested by Karl Pearson in 1904). This tutorial will cover descriptive and inferential statistics that can be used on the simplest form of contingency table, in which both the outcomes are binomial (and thus the table is 2×2).

Let’s begin by taking a look at a real-world example: graduate school admissions to a single department at UC Berkeley in 1973. (This is part of a famous data set which may be of interest to the reader.) Our dependent variable is “admitted” or “rejected”, and we’ll use applicant gender as an independent variable.

          Admitted   Rejected
Male         120        205
Female       202        391

I can scarcely look at this table without seeing the inevitable question: are admissions in this department gender biased?

Odds ratios

37% (= 120 / 325) of the male applicants were admitted, and 34% (= 202 / 593) of the female applicants were. Is that difference of 3 percentage points a meaningful one? It is tempting to focus on “3%”, but the researcher should absolutely avoid this temptation. The magnitude of the difference between admission rates in the two groups (defined by the independent variable) is very sensitive to the base rate, in this case the overall admission rate. Intuitively, if 2% of males were admitted and only 1% of females, we would definitely consider the possibility that admissions are gender-biased: we would estimate that males are twice as likely to be admitted as females. But we would be much less likely to say there is an admissions bias if those percentages were 98% and 99%. Yet, in both scenarios the admission rates differ by exactly one percentage point.

A better way to quantify the effect of gender on admissions, one that is insensitive to the overall admission rate, is the odds ratio. This name is practically the definition if you are familiar with the notion of odds. The odds of some event occurring with probability P is simply P / (1 – P). In our example above, the odds of admission for a male applicant is 0.5854, and is 0.5166 for a female applicant. The ratio of these two is approximately 1.133 (= 0.5854 / 0.5166). As this ratio is greater than one, we say that maleness was associated with admission. This is not enough to establish bias: it simply means that males were somewhat more likely to be admitted than females.
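As a quick check on the arithmetic, here is a minimal Python sketch that recomputes the odds and the odds ratio from the table. The function name `odds` is just an illustrative choice; it relies on the fact that P / (1 – P) simplifies to the count of successes divided by the count of failures.

```python
def odds(successes, failures):
    """Odds of success: P / (1 - P), which simplifies to successes / failures."""
    return successes / failures

# Counts from the admissions table above.
male_admitted, male_rejected = 120, 205
female_admitted, female_rejected = 202, 391

male_odds = odds(male_admitted, male_rejected)
female_odds = odds(female_admitted, female_rejected)
odds_ratio = male_odds / female_odds

print(f"male odds:   {male_odds:.4f}")
print(f"female odds: {female_odds:.4f}")
print(f"odds ratio:  {odds_ratio:.4f}")
```

Working from the raw counts rather than from rounded percentages avoids compounding rounding error in the ratio.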

Tests for association

We can now return to the original question: is this table likely to have arisen if there is in fact no gender bias in admissions? Pearson’s chi-squared test estimates the probability of observing a table at least as extreme as this one under the null hypothesis that there is no association between the two variables, x and y. We reject the null hypothesis when this probability is sufficiently small (often at P < .05). For this table, χ2 = 0.6332, and the probability of the data under the null hypothesis (no association between gender and admission rate) is P(χ2) = 0.4262. So, we’d probably say the observed difference in admission rates was not sufficient to establish that females were less likely to be admitted than males in this department.
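The statistic quoted above can be reproduced with a few lines of Python, sketched here using only the standard library. One assumption worth flagging: the value 0.6332 corresponds to the test with Yates’ continuity correction applied (the default for 2×2 tables in R’s chisq.test); without the correction the statistic comes out somewhat larger. The p-value uses the fact that a χ2 variable with one degree of freedom is the square of a standard normal. The function name `chi2_2x2` is hypothetical.

```python
import math

def chi2_2x2(a, b, c, d, yates=True):
    """Pearson's chi-squared test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction, never letting the difference go negative.
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p = chi2_2x2(120, 205, 202, 391)
print(f"chi-squared = {stat:.4f}, p = {p:.4f}")
```

Passing `yates=False` gives the uncorrected statistic, which is a handy way to see how much the correction matters for a table of this size.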

The chi-squared test for contingency tables depends on an approximation which is asymptotically valid, but inadequate for small samples; a popular (albeit arbitrary) rule of thumb is that a sample is “small” if any of the four cells has fewer than 5 observations. The best solution is to use an alternative, Fisher’s exact test; as the name suggests, it provides an exact p-value. Rather than working with the χ2 statistic, the Fisher test directly tests the null hypothesis that the true (population) odds ratio is equal to 1.
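For the curious, Fisher’s test is simple enough to sketch in pure Python. The version below assumes one common two-sided convention (the one used by R’s fisher.test, as far as I know): hold the row and column totals fixed, so that the top-left cell is hypergeometric under the null, then sum the probabilities of all tables no more probable than the observed one. The function name is hypothetical, and this is an illustration rather than a replacement for a statistics library.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k.
        return comb(row1, k) * comb(row2, col1 - k) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))

# A small table in which every cell is below 5.
print(f"{fisher_exact_2x2(3, 1, 1, 3):.4f}")
# The admissions table from earlier.
print(f"{fisher_exact_2x2(120, 205, 202, 391):.4f}")
```

Python’s arbitrary-precision integers keep the binomial coefficients exact, so the only rounding happens in the final division.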

Accuracy, precision, and recall

In the above example, we attempted to measure the association of two random variables which represented different constructs (i.e., admission status and gender). Contingency tables can also be used to look at random variables which are in some sense imperfect measures of the same underlying construct. In a machine learning context, one variable might represent the predictions of a binary classifier, and the other represents the labels taken from the “oracle”, the trusted data source the classifier is intended to approximate. Such tables are sometimes known as confusion matrices. Convention holds that one of the two outcomes should be labeled (arbitrarily if necessary) as “hit” and the other as “miss” (often the “hit” is the one which requires further attention, whereas the “miss” can be ignored), and that the following labels be assigned to the four cells of the confusion matrix:

Prediction / Oracle   Hit                   Miss
Hit                   True positive (TP)    False positive (FP)
Miss                  False negative (FN)   True negative (TN)

The oracle labels are on the columns, corresponding to the dependent variable (admission status) in the Berkeley example, and the prediction labels are on the rows, corresponding to gender.

When both variables of the 2×2 table measure the same construct, we start with the assumption that the two random variables are associated, and instead measure agreement. The simplest measure of agreement is accuracy, which is the probability that an observation will be correctly classified.

accuracy = (TP + TN) / (TP + FP + FN + TN)

Accuracy is not always the most informative measure, for the same reason that differences in probabilities were not informative above: accuracy neglects the base rate (or prevalence). Consider, for example, the plight of the much-maligned Transportation Security Administration (TSA). Very, very few airline passengers attempt to commit a terrorist attack during their flight. Since 9/11, there have only been two documented attempted terrorist attacks by passengers on commercial airlines, one by Richard Reid (the “shoe bomber”) and one by Umar Farouk Abdulmutallab (the “underwear bomber”), and in both cases, these attempts were thwarted by some combination of an attentive citizenry and an incompetent attacker, not by the TSA’s security theater. There are approximately 650,000,000 passenger-flights per year on US flights, so according to my back-of-envelope calculation, there have been around 7 billion passenger-flights since 9/11. In the meantime, the TSA could have achieved sky-high accuracy simply by making no arrests at all. (I have, of course, ignored the possibility that security theater serves as a deterrent to terrorist activity.) A corollary is that the TSA faces what is called the false positive paradox: a false positive (say, detaining a law-abiding citizen) is much more likely than a true positive (catching a terrorist). The TSA isn’t alone: a famous paper found that few physicians used the base rate (“prevalence”) when estimating the likelihood that a patient has a particular disease, given that they had a positive result on a test.
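To put a number on “sky-high”, here is a tiny Python sketch that plugs the back-of-envelope figures above into the accuracy formula for a screener that never flags anyone. The counts are the rough estimates from the text, not real data, and the variable names are illustrative.

```python
# Rough figures from the estimate above: about 7 billion passenger-flights
# and two attempted attacks.
passenger_flights = 7_000_000_000
attackers = 2

# A "classifier" that never flags anyone: every prediction is "miss".
tp, fp = 0, 0
fn, tn = attackers, passenger_flights - attackers

accuracy = (tp + tn) / (tp + fp + fn + tn)
fraction_caught = tp / (tp + fn)

print(f"accuracy        = {accuracy:.10f}")
print(f"fraction caught = {fraction_caught:.1f}")
```

The accuracy is indistinguishable from perfect even though not a single attacker is caught, which is exactly why accuracy alone is a poor summary when the base rate is this low.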

To better account for the role of base rate, we can break accuracy down into its constituent parts. The best known of these measures is precision (or positive predictive value), which is defined as the probability that a predicted hit is correct.

precision = TP / (TP + FP)

Precision isn’t completely free of the base rate problem, however; it fails to penalize false negatives. For this, we turn to recall (or sensitivity, or true positive rate), which is the probability that a true hit is correctly discovered.

recall = TP / (TP + FN)

It is difficult to improve precision without sacrificing recall, or vice versa. Consider, for example, an information retrieval (IR) application, which takes natural language queries as input and attempts to return all documents relevant for the query. Internally, the IR system ranks all documents for relevance to the query, then returns the top n. A document which is relevant and returned by the system is a true positive, a document which is irrelevant but returned by the system is a false positive, and a document which is relevant but not returned by the system is a false negative (we’ll ignore true negatives for the time being). With this system, we can achieve perfect recall by returning all documents, no matter what the query is, though the precision will be very poor. It is often helpful for n, the number of documents retrieved, to vary as a function of query; in a sports news database, for instance, there are simply more documents about the New York Yankees than about the congenitally mediocre St. Louis Blues. We can maximize precision by reducing the average n for queries, but this will also reduce recall, since there will be more false negatives.

To quantify the tradeoff between precision and recall, it is conventional to use the harmonic mean of precision and recall.

F1 = (2 · precision · recall) / (precision + recall)

This measure is also known as the F-score (or F-measure), though it is properly called F1, since an F-score need not weigh precision and recall equally. In many applications, however, the real-world costs of false positives and false negatives are not equivalent. In the context of screening for serious illness, a false positive would simply lead to further testing, whereas a false negative could be fatal; consequently, recall is more important than precision. On the other hand, when the resources necessary to derive value from true positives are limited (such as in fraud detection), false negatives are considered more acceptable than false positives, and so precision is ranked above recall.

Another thing to note about F1: the harmonic mean of two positive numbers is always closer to the smaller of the two. So, if you want to maximize F1, the best place to start is to increase whichever of the two terms (precision and recall) is smaller.

To make this all a bit more concrete, consider the following 2×2 table.

Prediction / Oracle   Hit   Miss
Hit                    10      2
Miss                    5     20

We can see that false negatives are somewhat more common than false positives, so we could have predicted that precision (0.8333) would be somewhat greater than recall (0.6667), and that F1 would be somewhat closer to the latter (0.7407).
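These figures are easy to verify. The following Python sketch computes all four summary statistics from the cell counts; the function name `confusion_metrics` is hypothetical, and for brevity it assumes none of the denominators is zero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Cell counts from the example table above.
metrics = confusion_metrics(tp=10, fp=2, fn=5, tn=20)
for name, value in metrics.items():
    print(f"{name:9s} {value:.4f}")
```

A production version would want to decide what to return when there are no predicted hits or no true hits, since precision or recall is then undefined.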

This of course does not exhaust the space of possible summary statistics of a 2×2 confusion matrix: see Wikipedia for more.

Response bias

It is sometimes useful to directly quantify predictor bias, which can be thought of as a signed measure representing the degree to which a prediction system’s base rate differs from the true base rate. A positive bias indicates that the system predicts “hit” more often than it should were it hewing to the true base rate, and a negative bias indicates that “hit” is guessed less often than the true base rate would suggest. One conventional measure of bias is B″d, which is a function of recall (also known as the hit rate, HR) and the false alarm rate (FAR, also known as the false positive rate). The latter is defined as follows.

FAR = FP / (TN + FP)

B″d has a rather unwieldy formula; it is

B″d = [(recall · (1 – recall)) – (FAR · (1 – FAR))] / [(recall · (1 – recall)) + (FAR · (1 – FAR))]

when HR ≥ FAR and

B″d = [(FAR · (1 – FAR)) – (recall · (1 – recall))] / [(recall · (1 – recall)) + (FAR · (1 – FAR))]

otherwise (i.e., when FAR > HR).

You may also be familiar with β, a parametric measure of bias, but there does not seem to be anything to recommend it over B″d, which makes fewer assumptions (see Donaldson 1992 and citations therein).

Cohen’s Κ

Cohen’s Κ (“kappa”) is a statistical measure of interannotator agreement which works on 2×2 tables. Unlike other measures we have reviewed so far, it is adjusted for the percentage of agreement that would occur by chance. Κ is computed from two terms. The first, P(a), is the observed probability of agreement, which is the same formula as accuracy. The second, P(e), is the probability of agreement due to chance. Let Px and Py be the probability of a “yes” or “hit” answer from annotator x and y, respectively. Then, P(e) is

Px Py + (1 – Px) (1 – Py)

and Κ is then given by

[P(a) – P(e)] / [1 – P(e)] .

For the previous 2×2 table, Κ = .5947; but, what does this mean? Κ is usually interpreted with reference to conventional (but entirely arbitrary) guidelines. One of the best known of these is due to Landis and Koch (1977), who propose 0–0.20 as “slight”, 0.21–0.40 as “fair”, 0.41–0.60 as “moderate”, 0.61–0.80 as “substantial”, and 0.81–1 as “almost perfect” agreement. Κ has a known statistical distribution, so it is also possible to test the null hypothesis that the observed agreement is entirely due to chance. This test is rarely performed or reported, however, as the null hypothesis is exceptionally unlikely to be true in real-world annotation scenarios.
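The computation of Κ for that table can be sketched in Python as follows. The function name is hypothetical, and the prediction and oracle are simply treated as annotators x and y.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of two annotators' hit/miss decisions."""
    n = tp + fp + fn + tn
    p_a = (tp + tn) / n                        # observed agreement, P(a)
    p_x = (tp + fp) / n                        # annotator x says "hit"
    p_y = (tp + fn) / n                        # annotator y says "hit"
    p_e = p_x * p_y + (1 - p_x) * (1 - p_y)    # chance agreement, P(e)
    return (p_a - p_e) / (1 - p_e)

kappa = cohens_kappa(tp=10, fp=2, fn=5, tn=20)
print(f"kappa = {kappa:.4f}")
```

Note that P(a) here is about 0.81, so roughly a quarter of the raw agreement disappears once chance agreement is discounted.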

(h/t: Steven Bedrick.)


A. Agresti. 2002. Categorical data analysis. Hoboken, NJ: Wiley.
W. Donaldson. 1992. Measuring recognition memory. Journal of Experimental Psychology: General 121(3): 275-277.
J.R. Landis & G.G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33(1): 159-174.

Fieldwork is hard.

Luo is a language of the Nilotic family spoken by about one million people in Nyanza Province in Kenya in east central Africa. Mr. Ben Blount, then a student at the University of California in Berkeley, went to Kenya in 1967 to make a study of the development of language in eight children encompassing the age range from 12 to 35 months. He intended to make his central procedure the collection on a regular schedule of large samples of spontaneous speech at home, usually with the mother as interpreter. In American and European families, at least of the middle class, it is usually possible to obtain a couple of hundred utterances in as little as a half an hour, at least it is so, once any shyness has passed. Among the Luo, things proved more difficult. In 54 visits of a half an hour or longer Mr. Blount was only able to obtain a total from all the children of 191 multi-word utterances. The problem was primarily one of Luo etiquette, which requires that small children be silent when adults come to visit, and the small children Mr. Blount visited could not throw off their etiquette even though their parents entreated them to speak for the visiting “European,” as Mr. Blount was called.

(Excerpt from A first language: The early stages by Roger Brown, p. 73. There’s a happy ending: Mr. Blount became Dr. Blount in 1969.)

(ing): now with 100% more enregisterment!

In his new novel Bleeding Edge, Thomas Pynchon employs a curious bit of eye dialect for Vyrna McElmo, one of the denizens of his bizarro pre-9/11 NYC:

All day down there. I’m still, like, vibrateen? He’s a bundle of energy, that guy.

Oh? Torn? You’ll think it’s just hippyeen around, but I’m not that cool with a whole shitload of money crashing into our life right now?

What’s going on with vibrateen and hippyeen? I can’t be sure what Pynchon has in mind here (who can?). But I speculate the ever-observant author is transcribing a very subtle bit of dialectal variation which has managed to escape the notice of most linguists. But first, a bit of background.

In English, words ending in <ng>, like sing or bang, are not usually pronounced with final [g] as the orthography might lead you to believe. Rather, they end with a single nasal consonant, either dorsal [ŋ] or coronal [n]. This subtle point of English pronunciation is not something most speakers are consciously aware of. But [n ~ ŋ] variation is sometimes commented on in popular discourse, albeit in a phonetically imprecise fashion: the coronal [n] variant is stigmatized as “g-dropping” (once again, despite the fact that neither variant actually contains a [g]). Everyone uses both variants to some degree. But the “dropped” [n] variant can be fraught: Peggy Noonan says it’s inauthentic, Samuel L. Jackson says it’s a sign of mediocrity, and merely transcribing it (as in “good mornin’”) might even get you accused of racism.

Pynchon presumably intends his -eens to be pronounced [in] on analogy with keen and seen. As it happens, [in] is a rarely-discussed variant of <ing> found in the speech of many younger Midwesterners and West Coast types, including yours truly. [1] Vyrna, of course, is a recent transplant from Silicon Valley and her dialogue contains other California features, including intensifiers awesome and totally and discourse particle like. And, I presume that Pynchon is attempting to transcribe high rising terminals, AKA uptalk—another feature associated with the West Coast—when he puts question marks on her declarative sentences (as in the passages above).

Only a tiny fraction of everyday linguistic variation is ever subject to social evaluation, and even less comes to be associated with groups of speakers, attitudes, or regions. As far as I know, this is the first time this variant has received any sort of popular discussion. -een may be on its way to becoming a California dialect marker (to use William Labov’s term [2]), though in reality it has a much wider geographic range.


[1] This does not exhaust the space of (ing) variants, of course. One of the two ancestors of modern (ing) is the Old English deverbal nominalization suffix -ing [iŋg]. In Principles of the English Language (1756), James Elphinston writes that [ŋg] had not fully coalesced, and that the [iŋg] variant was found in careful speech or “upon solemn occasions”. Today this variant is a stereotype of Scouse, and with [ɪŋk], occurs in some contact-induced lects.
[2] It is customary to also refer to Michael Silverstein for his notion of indexical order. Unfortunately, I still do not understand what Silverstein’s impenetrable prose adds to the discussion, but feel free to comment if you think you can explain it to me.

Gigaword English preprocessing

I recently took a little time out to coerce a recent version of the LDC’s Gigaword English corpus into a format that could be used for training conventional n-gram models. This turned out to be harder than I expected.


Gigaword English (v. 5) ships with 7 directories of gzipped SGML data, one directory for each of the news sources. The first step is, obviously enough, to decompress these files, which can be done with gunzip.


The resulting files are, alas, not XML files, which for all their verbosity can be parsed in numerous elegant ways. In particular, the decompressed Gigaword files do not contain a root node: each story is inside of <DOC> tags at the top level of the hierarchy. While this might be addressed by simply adding in a top-level tag, the files also contain a few “entities” (e.g., &amp;) which ideally should be replaced by their actual referent. Simply inserting the Gigaword Document Type Definition, or DTD, at the start of each SGML file was sufficient to convert the Gigaword files to valid SGML.

I also struggled to find software for SGML-to-XML conversion; is this not something other people regularly want to do? I ultimately used an ancient library called OpenSP (open-sp in Homebrew), in particular the command osx. This conversion throws a small number of errors due to the unexpected presence of UTF-8 characters, but these can be ignored (with the flag -E 0).

XML to text

Each Gigaword file contains a series of <DOC> tags, each representing a single news story. These tags have four possible type attributes; the most common one, story, is the only one which consistently contains coherent full sentences and paragraphs. Immediately underneath <DOC> in this hierarchy are two tags: <HEADLINE> and <TEXT>. While it would be fun to use the former for a study of Headlinese, <TEXT>—the tag surrounding the document body—is generally more useful. Finally, good old-fashioned <p> (paragraph) tags are the only children of <TEXT>. I serialized “story” paragraphs using the lxml library in Python. This library supports the elegant XPath query language. To select paragraphs of “story” documents, I used the XPath query /GWENG/DOC[@type="story"]/TEXT, stripped whitespace, and then encoded the text as UTF-8.
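To illustrate the extraction step without any third-party dependencies, here is a rough sketch using Python’s built-in xml.etree.ElementTree, which supports a limited subset of XPath, in place of lxml. The miniature document is made up for the example, and the sketch assumes that the root element is GWENG (as in the query above) and that paragraph tags are spelled `<P>`; adjust the path if your converted files differ.

```python
import xml.etree.ElementTree as ET

# A made-up miniature document in the shape described above.
SAMPLE = """<GWENG>
<DOC id="EXAMPLE_0001" type="story">
<HEADLINE>Example headline</HEADLINE>
<TEXT>
<P>
First paragraph, broken
across two lines.
</P>
<P>Second paragraph.</P>
</TEXT>
</DOC>
<DOC id="EXAMPLE_0002" type="other">
<TEXT><P>Not a story, so skipped.</P></TEXT>
</DOC>
</GWENG>"""

def story_paragraphs(root):
    """Yield whitespace-normalized paragraphs from "story" documents only."""
    for p in root.findall("./DOC[@type='story']/TEXT/P"):
        # Join all text inside the paragraph and collapse line breaks.
        text = " ".join("".join(p.itertext()).split())
        if text:
            yield text

root = ET.fromstring(SAMPLE)
for paragraph in story_paragraphs(root):
    print(paragraph)
```

For files the size of Gigaword, an incremental parser would be kinder to memory than loading each file whole, but the selection logic stays the same.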

Text to sentences

The resulting units are paragraphs (with occasional uninformative line breaks), not sentences. Python’s NLTK module provides an interface to the Punkt sentence tokenizer. However, thanks to this Stack Overflow post, I became aware of its limitations. Here’s a difficult example from Moby Dick, with sentence boundaries (my judgements) indicated by the pipe character (|):

“A clam for supper? | a cold clam; is THAT what you mean, Mrs. Hussey?” | says I, “but that’s a rather cold and clammy reception in the winter time, ain’t it, Mrs. Hussey?”

But, the default sentence tokenizer insists on sentence breaks immediately after both occurrences of “Mrs.”. To remedy this, I replaced the space after titles like “Mrs.” (the full list of such abbreviations was adapted from GPoSTTL) with an underscore so as to “bleed” the sentence tokenizer, then replaced the underscore with a space after tokenization was complete. That is, the sentence tokenizer sees word tokens like “Mrs._Hussey”; since sentence boundaries must line up with word token boundaries, there is no chance a space will be inserted here. With this hack, the sentence tokenizer does that snippet of Moby Dick just right.
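The underscore trick can be sketched independently of any particular tokenizer. In the Python below, the title list is a short hypothetical stand-in for the fuller GPoSTTL-derived list, and `naive_split` is a deliberately crude substitute for Punkt so that the example runs without NLTK; all of the names are illustrative.

```python
import re

# Hypothetical short list; the full list was adapted from GPoSTTL.
TITLES = ("Mr.", "Mrs.", "Ms.", "Dr.", "Prof.")
_ALTERNATION = "|".join(re.escape(t) for t in TITLES)
PROTECT_RE = re.compile(r"\b(" + _ALTERNATION + r") ")
RESTORE_RE = re.compile(r"\b(" + _ALTERNATION + r")_")

def protect_titles(text):
    """Glue each title to the following word: "Mrs. Hussey" -> "Mrs._Hussey"."""
    return PROTECT_RE.sub(r"\1_", text)

def restore_titles(text):
    """Undo protect_titles, touching only underscores that follow a title."""
    return RESTORE_RE.sub(r"\1 ", text)

def naive_split(text):
    """Stand-in tokenizer: break after ., ?, or ! followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", text)

def split_sentences(text, tokenizer=naive_split):
    """Protect titles, tokenize, then restore the original spacing."""
    return [restore_titles(s) for s in tokenizer(protect_titles(text))]

text = "Is that what you mean, Mrs. Hussey? That is a cold reception, Mrs. Hussey."
print(naive_split(text))      # breaks wrongly after each "Mrs."
print(split_sentences(text))  # two sentences, titles intact
```

Because the tokenizer is passed in as an argument, a real sentence tokenizer such as nltk.tokenize.sent_tokenize could be substituted for `naive_split` without changing the rest.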

Sentences to tokens

For the last step, I used NLTK’s word tokenizer (nltk.tokenize.word_tokenize), which is similar to the (in)famous Treebank tokenizer, and then case-folded the resulting tokens.


In all, 170 million sentences, 5 billion word tokens, and 22 billion characters, all of which fits into 7.5 GB (compressed). Best of luck to anyone looking to do the same!

Defining libfixes

A recent late-night discussion with two fellow philologists revealed some interesting problems in defining libfixes. Arnold Zwicky coined this term to describe affix-like formatives such as -(a)thon (from marathon; e.g., saleathon) or -(o)holic (from alcoholic, e.g., chocoholic) that appear to have been extracted (“liberated”) from another word. These are then affixed to free stems, and the resulting form often conveys a sense of either jocularity or pejoration. The extraction of libfixes is a special case of what historical linguists call “recutting”, and like recutting in general, the ontogenesis of libfixation is largely mysterious.

As the evening’s discussion showed, it is not trivial to distinguish libfixation from similar derivational processes. What follows are a few examples of interesting derivational processes which in my opinion should not be identified with libfixation.

Blending is not libfixation

One superficially similar process is “blending”, in which new forms are derived by combining identifiable subparts of two simplex words. The resulting forms are sometimes called “portmanteaux” (sg. “portmanteau”), a term of art with its own interesting history. Two canonical blends are br-unch and sm-og, derived from the unholy union of breakfast and lunch, and smoke and fog, respectively. These two are particularly memorable (yet unobtrusive) thanks to a clever indexical trick: both word and referent are mongrel-like in their own ways. What exactly distinguishes blending from libfixation? I see two features which distinguish the two word-formation processes.

The first is productivity: libfixation has some degree of productivity whereas blending does not. There are over a dozen novel -omicses and dozens of -gates. In contrast, in no other derivative can one find the “pieces” (I am using the term pretheoretically) of smog, namely sm- and -og. There is therefore no reason to posit that either sm- or -og has been reconceptualized as an affix.

The second feature which distinguishes blending and libfixation deals with the way the pieces are spelled out. Libfixes are affixes and do not normally modify the freestanding base they attach to. In blends, one form overwrites the other (and vice versa). Were -og a newly liberated suffix, we would expect *smoke-og. This criterion also suggests that mansplain, poptimism, and snowquester are not in fact instances of libfixation; in each case, material from the “base” (I also use this term pretheoretically) is deleted.

Zwicky himself has noted the existence of a blend-libfix cline, and the tendency of blends to become libfixes. He suggests the following natural history:

A portmanteau word (useful or playful or both) invites other portmanteaus sharing an element (usually the second), and then these drift from the phonology and semantics of the original to such an extent that the shared element takes on a life of its own — is “liberated” as an affix.


Clipping is not libfixation

“Clipping” (or “truncation”) is a process which reduces a word to one of its parts. Sometimes truncated forms are themselves used for compound formation. For instance, burger is derived from Hamburger ‘resident of Hamburg’ (the semantic connection is a mystery). According to the Online Etymology Dictionary, forms like cheese-burger appear in the historical record at about the same time as burger itself. There is one way that clipping is distinct from libfixation, however. Clippings are free forms (i.e., prosodic words), whereas libfixes need not be. In particular, whereas some libfixes have homophonous free forms (e.g., -gate, -core), these are semantically distinct: whereas one can claim to love burgers, one cannot reasonably claim that the current administration has fallen prey to many gates.

The curious case of -giving

To conclude, consider a new set of words in -giving, including Friendsgiving, Fauxgiving, and Spanksgiving. These are not blends according to the criteria above, and while giving is a free form, the bound form has different semantics (something like ‘holiday gathering’). But is -giving a libfix? I’d say that it depends on whether Thanksgiving, etymologically a noun-gerund compound, is synchronically analyzed as such. If so, -giving has not so much been extracted as reanalyzed as a noun-forming suffix, a curious development but not an event of affix liberation.

h/t: Stacy Dickerman, John Kelly

LOESS hyperparameters without tears

LOESS is a classic non-parametric regression technique. One potential issue that arises is that LOESS fits depend on several hyperparameters (i.e., parameters set by the experimenter a priori). In this post I’ll take a quick look at how to set these.

At each point in a LOESS curve, the y-value is derived from a local, low-degree polynomial weighted regression. The first hyperparameter refers to the degree of the local fits. Most users set degree to 2 (i.e., use local quadratic curves), and with good reason. At degree 0, you’re just computing a local (weighted) average, and degree 1 gives only local straight lines. Degrees higher than 2 (e.g., cubic) tend to not have much of an effect.

The other hyperparameter is “span”, which controls the degree of smoothing. A value of 0 uses no context and a value of 1 uses the entire sample (so it will be similar to fitting a single quadratic function to the data). The choice of this value has a major effect on the quality of the fit obtained:


For the randomly generated data here, large values of the span parameter (“bad”) produce a LOESS which fails to follow the larger trend, whereas small values (“ugly”) primarily model noise. For this reason alone, the experimenter should probably not be permitted to select the span hyperparameter herself.

Fortunately, there are several objectives used to determine an “optimal” setting for the span parameter. Hurvich et al. (1998) propose a particularly privileged objective, namely minimizing AICc, a bias-corrected version of the Akaike information criterion. This has been used to generate the “good” curve above. I did this in R, with code adapted from a post to R-help.

There is also an R package fANCOVA which apparently includes a function loess.as which automatically determines the span parameter, presumably similar to how I’ve done it here. I haven’t tried it.
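For readers who want to see the mechanics, here is a rough Python sketch of span selection by AICc. It is an illustration under stated assumptions, not the R code mentioned above: it uses local linear (degree 1) fits with tricube weights rather than R’s loess, it assumes distinct x values, and it takes the criterion to be log(σ̂²) + 1 + 2(tr(H) + 1) / (n – tr(H) – 2), which is my understanding of Hurvich et al. (1998). The data are simulated, and the names `local_linear` and `aicc` are hypothetical.

```python
import math
import random

def tricube(u):
    """Tricube kernel, the usual LOESS weight function."""
    return (1 - abs(u) ** 3) ** 3 if abs(u) < 1 else 0.0

def local_linear(xs, ys, x0, span):
    """Weighted straight-line fit around x0, which must be one of xs.

    Returns (fitted value at x0, leverage of x0's own observation).
    """
    n = len(xs)
    k = min(n, max(4, math.ceil(span * n)))    # neighbours in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # bandwidth: k-th nearest
    w = [tricube((x - x0) / h) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    det = s0 * s2 - s1 ** 2
    # The intercept of the local line is a linear combination of the ys.
    fitted = sum(wi * (s2 - s1 * (x - x0)) / det * y
                 for wi, x, y in zip(w, xs, ys))
    leverage = s2 / det   # x0's own weight is tricube(0) = 1
    return fitted, leverage

def aicc(xs, ys, span):
    """Bias-corrected AIC for a given span (formula assumed as above)."""
    n = len(xs)
    fits = [local_linear(xs, ys, x, span) for x in xs]
    rss = sum((y - f) ** 2 for y, (f, _) in zip(ys, fits))
    trace = sum(lev for _, lev in fits)   # trace of the smoother matrix
    return math.log(rss / n) + 1 + 2 * (trace + 1) / (n - trace - 2)

# Simulated data: a sine curve plus Gaussian noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]

spans = [s / 20 for s in range(2, 21)]   # 0.10, 0.15, ..., 1.00
best = min(spans, key=lambda s: aicc(xs, ys, s))
print(f"span minimizing AICc: {best:.2f}")
```

A simple grid search is used here because the criterion is cheap to evaluate; a one-dimensional optimizer over the span would serve equally well.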

PS to those inclined to care: the origin of the memetic, snarky, academic “X without tears” is, to my knowledge, J. Eric S. Thompson’s 1972 book Maya Hieroglyphs Without Tears. While I have every reason to believe Thompson was poking fun at his detractors, it’s interesting to note that he turned out to be fabulously wrong about the nature of the hieroglyphs.