UNIX AV club

[This post was written as a supplement to CS506/606: Research Programming at OHSU.]

SoX and FFmpeg are fast, powerful command-line tools for manipulating audio and video data, respectively. In this short tutorial, I’ll show how to use these tools for two very common tasks: 1) resampling and 2) (de)multiplexing. Both tools are available from your favorite package manager  (like Homebrew or apt-get).

SoX and friends

SoX is a suite of programs for manipulating audio files. Commands are of the form:

sox [flag ...] infile1 [...] outfile [effect effect-options] ...

That is, the command sox, zero or more global flags, one or more input files, one output file, and then a list of “effects” to apply. Unlike most UNIX command-line programs, though, SoX actually cares about file extensions. If the input file is in.wav it better be a WAV file; if the output file is out.flac it will be encoded in the FLAC (“free lossless audio codec”) format.

The simplest invocation of sox converts audio files to new formats. For instance, the following would use audio.wav to create a new FLAC file audio.flac with the same same bit depth and sample rate.

sox audio.wav audio.flac

Concatenating audio files is only slightly more complicated. The following would concatenate 01_Intro.wav and 02_Untitled.wav together into a new file concatenated.wav.

sox 01_Intro.wav 02_Untitled.wav concatenated.wav

Resampling with SoX

But SoX really shines for resampling audio. For this, use the rate effect. The following would downsample the CD-quality (44.1 kHz) audio in CD.wav to the standard sample rate used on telephones (8 kHz) and store the result in telephone.wav

sox CD.wav telephone.wav rate 8k

There are two additional effects you may want to invoke when resampling. First, you may want to “dither” the audio. As man sox explains:

Dithering is a technique used to maximize the dynamic range of audio stored at a particular bit-depth. Any distortion introduced by quantization is decorrelated by adding a small amount of white noise to the signal. In most cases, SoX can determine whether the selected processing requires dither and will add it during output formatting if appropriate.

The following would resample to the telephone rate with dithering (if necessary).

sox CD.wav telephone.wav rate 8k dither -s

Finally, when resampling audio, you may want to invoke the gain effect to avoid clipping. This can be done using the -G (“Gain”) global option.

sox -G CD.wav telephone.wav rate 8k dither -s

(De)multiplexing with SoX

The SoX remix effect is useful for manipulating multichannel audio. The following would remix a multi-channel audio file stereo.wav down to mono.

sox stereo.wav mono.wav remix -

We also can split a stereo file into two mono files.

sox stereo.wav left.wav remix 1
sox stereo.wav right.wav remix 2

Finally, we can merge two mono files together to create one stereo file using the -M (“merge”) global option; this file should be identical to stereo.wav.

sox -M left.wav right.wav stereo2.wav

Other SoX goodies

There are three other useful utilities in SoX: soxi prints information extracted from audio file headers, play uses the SoX libraries to play audio files, and rec records new audio files using a microphone.


The FFmpeg suite is to video files what SoX is to audio. Commands are of the form:

ffmpeg [flag ...] [-i infile1 ...] [-effect ...] [outfile]

The -acodec and -vcodec effects can be used to extract the audio and video streams from a video file, respectively; this process sometimes known as demuxing (short for “de-multiplexing”).

ffmpeg -i both.mp4 -acodec copy -vn audio.ac3
ffmpeg -i both.mp4 -vcodec copy -an video.h264

We can also mux (“multiplex”) them back together.

ffmpeg -i video.h264 -i audio.ac3 -vcodec copy -acodec copy both2.mp4

Hopefully that’ll get you started. Both programs have excellent manual pages; read them!

How "uh" and "um" differ

If you’ve been following the recent discussions on Language Log, then you know that there is a great deal of inter-speaker variation in the use of the fillers uh and um, despite their superficial similarity. In this post, I’ll discuss some published results, summarize some of the Language Log findings (with the obvious caveat that none of it has been subject to any sort of peer review) and explain what I think it all means for our understanding of the contrast between uh and um.

The function of uh and um

The vast majority of work on disfluencies (which include fillers like uh and um as well as repetitions, revisions, and false starts) assumes that uh and um are functionally equivalent, substitutable forms. But Clark and Fox Tree (2002) argue that they are subtly different. They claim that uh serves as signal minor delays and um signals major delays. The evidence for this is straightforward:

  • Um is more often followed by a pause than uh.
  • Pauses after ums tend to be longer than those occurring after uhs (though Mark has failed to replicate this in a much larger corpus, and I am inclined to defer to him).
  • Um is more common than uh in utterance-initial position, the point at which speech planning demands are presumably at their greatest. [1]

From these results, though, it is not obvious that uh and um are qualitatively different. This has not prevented people (myself included) from making this jump. For example, Mark speculated a bit about this for the Atlantic: “People tend to use UM when they’re trying to decide what to say, and UH when they’re trying to decide how to say it.” This is plausible, but the evidence for differential functions of uh and um is lacking.

Intraspeaker differences in uh and um

Gender effects

The first—and probably most robust—finding, is that female speakers have a higher average um/uh ratio than males. This pattern was found in several corpora of American English available from the LDC (1 2 3). It also reported in a recent paper by Acton (2011), who looks two American English corpora. A higher um/uh ratio in females was also found in two corpora of British English. The first looks at data from the HCRC map task and the second at the conversational portion of the British National Corpus (BNC). The latter was earlier the subject of a study by Rayson et al. (1997), who found that that er (the British equivalent of uh) [2] was the one of the words most strongly associated with male (rather than female) speakers; the only word more “masculine” than uh was the expletive fucking.

Social class effects

The second finding is that um/uh ratio is correlated with social class: higher status speakers have a higher um/uh ratio. Once again, this was first reported by Rayson et al., who found that erm is more common in speakers with high-status occupations. Mark found a similar pattern in American English using educational attainment—rather than occupation—as a measure of social class.

Age effects

The third finding is that younger speakers have a higher um/uh ratio than older speakers. This was first reported by Rayson et al. (once again, studying the conversational portion of the BNC), who found that that er is much more common in speakers over the age of 35. Similar patterns are reported by Acton, and several Language Log correspondents (1 2 3 4).

Geographic effects

Finally, Jack Grieve looked at um/uh ratio geographically, and found that um was more common in the Midlands and the central southwest. I see two issues with this result, however. First, I don’t observe any geographic patterns in the raw data (ibid., in the comments section of that post); to my eye, the geographic patterns only emerge after aggressive smoothing; this may just be another case of Smoothers Gone Wild. Secondly, the data was taken from geocoded Twitter posts, not speech. As commenter “BK” asks: “do we have any reason to believe that writing ‘UM’ vs ‘UH’ in a tweet is at all correlated with the use of ‘UM’ vs ‘UH’ in speech?” Regrettably, I suspect the answer is no, but there still is probably something to be gleaned from tweeters’ stylistic use of these fillers.

Uh and um in children with autism

Our recent work on filler use in children with autism spectrum disorders (ASD) might provide us another way to get at the functional differentiation between uh and um. We [3] used a semi-structured corpus of diagnostic interviews of children ages 4-8, and find that children with ASD produce a much lower umuh ratio than typically developing children matched for age and intelligence. Children with specific language impairment—a neurodevelopmental disorder characterized by language delays or deficits in the absence of other developmental or sensory impairments—have an umuh ratio much closer to the typical children; this tells us it’s not about language impairment (something which is relatively common—but not specific to—children with ASD). We also find that umuh ratio is correlated with the Communication Total Score of the Social Communication Questionnaire, a parent-reported measure of communication ability. At the very least, individuals who use more um are perceived to have better communication abilities by their parents. At best, use of um itself contributes to these perceptions.

How uh and um differ

To the sociolinguistic eye, the effects of gender, class, and age just described tell us a lot about uh and um. Given that women have a higher um/uh ratio than men, we expect that um is either the more prestigious variant, or the incoming variant, or both. This is what Labov calls the gender paradox: women consistently lead men in the use of prestige variants, and lead men in the adoption of innovative variants. Further evidence that um is the prestige variant comes from social class: higher status individuals have a higher um/uh ratio. Younger speakers have a higher um/uh ratio, suggesting that um is also the incoming variant. This is not the only possible interpretation, however; it may be that the variants are subject to age grading—meaning that speakers change their use of uh and um as they age—which does not entail that there is any change in progress. Given a change in apparent time—meaning that younger and older speakers use the variants at different rates—the only way to tell whether there is change in progress is to look at data collected at multiple time points. While the evidence is limited, it looks like both age grading and change in progress are occurring—they are not mutually exclusive, after all.

Unfortunately, some evidence from style shifting problematizes this view of um as a prestige variant. O’Connell and Kowal (2005) look at uh and um by analyzing the speech of professional TV and radio personalities interviewing Hillary Clinton. If um is the more prestigious variant, then we would expect a higher um/uh ratio in this formal context compared to the more casual styles recorded in other corpora. But in fact these experienced public speakers have a particularly low um/uh ratio. Hillary Clinton produced 640 uhs and 160 ums, for an um/uh ratio of 0.250; in contrast, Mark found that on average, female speakers in the Fisher corpora favored um more than 2-to-1.

So why is Hillary Clinton hating on um? Can an incoming variable be associated with women and the upper classes yet still avoided in formal contexts? Or are we simply wrong to think of uh and um as variants of a single variable? Is it possible that, given our limited understanding of the functional differences between uh and um, we have failed to account for associations between discourse demands and social groups (or speech styles)? Perhaps Clinton just needs uh more than we could ever know.


[1] This finding is so robust, it even holds in Dutch, which has very similar fillers to those of English (Swerts 1998).
[2] Note that, at least according to the Oxford English Dictionary, British er and erm are just orthographic variants of uh and um, respectively. That’s not to say that they’re pronounced identically, just that they are functionally equivalent.
[3] Early studies geared at speech researchers were conducted by Peter Heeman and Rebecca Lunsford. Other coauthors include Lindsay Olson, Alison Presmanes Hill, and Jan van Santen.


E.K. Acton. 2011. On gender differences in the distribution of um and uhPenn Working Papers in Linguistics 17(2): 1-9.
H.H. Clark & J.E. Fox Tree. 2002. Using uh and um in spontaneous speaking. Cognition 84(1): 73-111.
D.C. O’Connell and S. Kowal. 2005. Uh and um revisited: Are they interjections for signaling delay? Journal of Psycholinguistic Research 34(6): 555-576.
P. Rayson, G. Leech, and M. Hodges. 1997. Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics 2(1): 133-152.
M. Swerts. 1998. Filled pauses as markers of discourse structure. Journal of Pragmatics 30(4): 485-496.

A tutorial on contingency tables

Many results in science and medicine can be compactly represented as a table containing the co-occurrence frequencies of two or more discrete random variables. This data structure is called the contingency table (a name suggested by Karl Pearson in 1904). This tutorial will cover descriptive and inferential statistics that can be used on the simplest form of contingency table, in which both the outcomes are binomial (and thus the table is 2×2).

Let’s begin by taking a look at a real-world example: graduate school admissions a single department at UC Berkeley in 1973. (This is part of a famous real-world example which may be of interest to the reader.) Our dependent variable is “admitted” or “rejected”, and we’ll use applicant gender as an independent variable.

Admitted Rejected
Male 120 205
Female 202 391

I can scarcely look at this table without seeing the inevitable question: are admissions in this department gender biased?

Odds ratios

37% (= 120 / 325) of the male applicants who applied were admitted, and 34% (= 202 / 593) of the female applicants were. Is that 3% difference a meaningful one? It is tempting to focus on “3%”, but the researcher should absolutely avoid this temptation. The magnitude of the difference between admission rates in the two groups (defined by the independent variable) is very sensitive to the base rate, in this case the overall admission rate. Intuitively, if 2% of males were admitted and only 1% of females, we would definitely consider the possibility that admissions are gender-biased: we would estimate that males are twice as likely to be admitted as females But we would be much less likely to say there is an admissions bias if those percentages were 98% and 99%. Yet, in both scenarios the admission rates differ by exactly 1%.

A better way to quantify the effect of gender on admissions—a method that is insensitive to the overall admission rate—is the odds ratio. This name is practically the definition if you are familiar with the notion odds. The odds of some event occurring with probability P is simply P / (1 – P). In our example above, the odds of admission for a male applicant is 0.5854, and is 0.5166 for a female applicant. The ratio of these two is 1.1334 (= 0.5854 / 0.5166). As this ratio is greater than one, we say that maleness was associated with admission. This is not enough to establish bias: it simply means that males were somewhat more likely to be admitted than females.

Tests for association

We can now return to the original question: is this table likely to have arisen if there is in fact no gender bias in admissions? Pearson’s chi-squared test estimates the probability of the observed contingency table under the null hypothesis that there is no association between and y; see here for a worked example. We reject the null hypothesis that there is no association when this probability is sufficiently small (often at P < .05). For this table, χ2 = 0.6332, and the probability of the data under the null hypothesis (no association between gender and admission rate) is P(χ2) = 0.4262. So, we’d probably say the observed difference in admission rates was not sufficient to establish that females were less likely to be admitted than males in this department.

The chi-squared test for contingency tables depends on an approximation which is asymptotically valid, but inadequate for small samples; a popular (albeit arbitrary) rule of thumb is that a sample is “small” if any of the four cells has less than 5 observations. The best solution is to use an alternative, Fisher’s exact test; as the name suggests, it provides an exact p-value. Rather than working with the χ2 statistic, the null hypothesis for the Fisher test is that the true (population) odds ratio is equal to 1.

Accuracy, precision, and recall

In the above example, we attempted to measure the association of two random variables which represented different constructs (i.e., admission status and gender). Contingency tables can also be used to look at random variables which are in some sense imperfect measures of the same underlying construct. In a machine learning context, one variable might represent the predictions of a binary classifier, and the other represents the labels taken from the “oracle”, the trusted data source the classifier is intended to approximate. Such tables are sometimes known as confusion matrices. Convention holds that one of the two outcomes should be labeled (arbitrarily if necessary) as “hit” and the other as “miss”—often the “hit” is the one which requires further attention whereas the “miss” can be ignored—and the the following labels be assigned to the four cells of the confusion matrix:

Prediction / Oracle Hit Miss
Hit True positive (TP) False positive (FP)
Miss False negative (FN) True negative (TN)

The oracle labels are on the row, corresponding to the dependent variable—admission status—in the Berkeley example, and the prediction labels are on the column, corresponding to gender.

When both variables of the 2×2 table measure the same construct, we start with the assumption that the two random variables are associated, and instead measure agreement. The simplest measure of agreement is accuracy, which is the probability that an observation will be correctly classified.

accuracy = (TP + TN) / (TP + FP + FN + TN)

Accuracy is not always the most informative measure, for the same reason that differences in probabilities were not informative above: accuracy neglects the base rate (or prevalence). Consider, for example, the plight of the much-maligned Transportation Safety Administration (TSA). Very, very few airline passengers attempt to commit a terrorist attack during their flight. Since 9/11, there have only been two documented attempted terrorist attacks by passengers on commercial airlines, one by Richard Reid (the “shoe bomber”) and one by Umar Farouk Abdulmutallab (the “underwear bomber”), and in both cases, these attempts were thwarted by some combination of the attentive citizenry and incompetent attacker, not by the TSA’s security theater. There are approximately 650,000,000 passenger-flights per year on US flights, so according to my back-of-envelope calculation, there have been around 7 billion passenger-flights since 9/11. In the meantime, the TSA could have achieved sky-high accuracy simply by making no arrests at all. (I have, of course, ignored the possibility that security theater serves as a deterrent to terrorist activity.) A corollary is that the TSA faces what is called the false positive paradox: a false positive (say, detaining a law-abiding citizen) is much more likely than a true positive (catching a terrorist). The TSA isn’t alone: a famous paper found that few physicians used the base rate (“prevalence”) when estimating the likelihood that a patient has a particular disease, given that they had a positive result on a test.

To better account for the role of base rate, we can break accuracy down into its constituent parts. The best known of these measures is precision (or positive predictive value), which is defined as the probability that a predicted hit is correct.

precision = TP / (TP + FP)

Precision isn’t completely free of the base rate problem, however; it fails to penalize false negatives. For this, we turn to recall (or sensitivity, or true positive rate), which is the probability that a true hit is correctly discovered.

recall = TP / (TP + FN)

It is difficult to improve precision without sacrificing recall, or vis versa. Consider, for example, an information retrieval (IR) application, which takes natural language queries as input and attempts to return all documents relevant for the query. Internally, the IR system ranks all documents for relevance to the query, then returns the top n. A document which is relevant and returned by the system is a true positive, a document which is irrelevant but returned by the system is a false positive, and a document which is relevant but not returned by the system is a false negative (we’ll ignore true negatives for the time being). With this system, we can achieve perfect recall by returning all documents, no matter what the query is, though the precision will be very poor. It is often helpful for n, the number of documents retrieved, to vary as a function of query; in a sports news database, for instance, there are simply more documents about the New York Yankees than about the congenitally mediocre St. Louis Blues. We can maximize precision by reducing the average for queries, but this will also reduce recall, since there will be more false negatives.

To quantify the tradeoff between precision and recall, it is conventional to use the harmonic mean of precision and recall.

F1 = (2 · precision · recall) / (precision + recall)

This measure is known also known as the F-score (or F-measure), though it is properly called F1, since an F-score need not weigh precision and recall equally. In many applications, however the real-world costs of false positives and false negatives are not equivalent. In the context of screening for serious illness, a false positive would simply lead to further testing, whereas a false negative could be fatal; consequently, recall is more important then precision. On the other hand, when the resources necessary to derive value from true positives are limited (such as in fraud detection), false negatives are considered more acceptable than false positives, and so precision is ranked above recall.

Another thing to note about F1: the harmonic mean of two positive numbers is always closer to the smaller of the two. So, if you want to maximize F1, the best place to start is to increase whichever of the two terms (precision and recall) is smaller.

To make this all a bit more concrete, consider the following 2×2 table.

Prediction / Oracle Hit Miss
Hit 10 2
Miss 5 20

We can see that false negatives are somewhat more common than false positives, so we could have predicted that precision (0.8333) would be somewhat greater than recall (0.6667), and that F1 would be somewhat closer to the latter (0.7407).

This of course does not exhaust the space of possible summary statistics of a 2×2 confusion matrix: see Wikipedia for more.

Response bias

It is sometimes useful to directly quantify predictor bias, which can be thought of as a signed measure representing the degree to which prediction system’s base rate differs from the true base rate. A positive bias indicates that the system predicts “hit” more often than it should were it hewing to the true base rate, and a negative bias indicates that “hit” is guessed less often than the true base rate would suggest. One conventional measure of bias is Bd”, which is a function of recall and false positive rate (FAR), defined as follows.

FAR = FP / (TN + FP)

Bd has a rather unwieldy formula; it is

Bd = [(recall · (1 – recall)) – (FAR · (1 – FAR))] / [(recall · (1 – recall)) + (FAR · (1 – FAR))]

when HR ≥ FAR and

Bd = [(FAR · (1 – FAR)) – (recall · (1 – recall))] / [(recall · (1 – recall)) + (FAR · (1 – FAR))]

otherwise (i.e., when FAR > HR).

You may also be familiar with β, a parametric measure of bias, but there does not seem to be anything to recommend it over Bd”, which makes fewer assumptions (see Donaldson 1992 and citations therein).

Cohen’s Κ

Cohen’s Κ (“kappa”) is a statistical measure of interannotator agreement which works on 2×2 tables. Unlike other measures we have reviewed so far, it is adjusted for the percentage of agreement that would occur by chance. Κ is computed from two terms. The first, P(a), is the observed probability of agreement, which is the same formula as accuracy. The second, P(e), is the probability of agreement due to chance. Let Pand Py be the probability of a “yes” or “hit” answer from annotator x and y, respectively. Then, P(e) is

Px Py + (1 – Px) (1 – Py)

and Κ is then given by

[P(a) – P(e)] / [1 – P(e)] .

For the previous 2×2 table, Κ = .5947; but, what does this mean? K is usually interpreted with reference to conventional—but entirely arbitrary—guidelines. One of the best known of these is due to Landis and Koch (1977), who propose 0–0.20 as “slight”, 0.21–0.40 as “fair agreement”, 0.41–0.60 as “moderate”, 0.61–0.80 as “substantial”, and 0.81–1 as “almost perfect” agreement. Κ has a known statistical distribution, so it is also possible to test the null hypothesis that the observed agreement is entirely due to chance. This test is rarely performed or reported, however, as the null hypothesis is exceptionally unlikely to be true in real-world annotation scenarios.

(h/t: Steven Bedrick.)


A. Agresti. 2002. Categorical data analysis. Hoboken, NJ: Wiley.
W. Donaldson. 1992. Measuring recognition memory. Journal of Experimental Psychology: General 121(3): 275-277.
J.R. Landis & G.G. Koch. 1977.The measurement of observer agreement for categorical data. Biometrics 33(1): 159-174.

Fieldwork is hard.

Luo is a language of the Nilotic family spoken by about one million people in Nyanza Province in Kenya in east central Africa. Mr. Ben Blount, then a student at the University of California in Berkeley, went to Kenya in 1967 to make a study of the development of language in eight children encompassing the age range from 12 to 35 months. He intended to make his central procedure the collection on a regular schedule of large samples of spontaneous speech at home, usually with the mother as interpreter. In American and European families, at least of the middle class, it is usually possible to obtain a couple of hundred utterances in as little as a half an hour, at least it is so, once any shyness has passed. Among the Luo, things proved more difficult. In 54 visits of a half an hour or longer Mr. Blount was only able to obtain a total from all the children of 191 multi-word utterances. The problem was primarily one of Luo etiquette, which requires that small children be silent when adults come to visit, and the small children Mr. Blount visited could not throw off their etiquette even though their parents entreated them to speak for the visiting “European,” as Mr. Blount was called.

(Excerpt from A first language: The early stages by Roger Brown, p. 73. There’s a happy ending: Mr. Blount became Dr. Blount in 1969.)

(ing): now with 100% more enregisterment!

In his new novel Bleeding Edge, Thomas Pynchon employs a curious bit of eye dialect for Vyrna McElmo, one of the denizen of his bizarro pre-9/11 NYC:

All day down there. I’m still, like, vibrateen? He’s a bundle of energy, that guy.

Oh? Torn? You’ll think it’s just hippyeen around, but I’m not that cool with a whole shitload of money crashing into our life right now?

What’s going on with vibrateen and hippyeen? I can’t be sure what Pynchon has in mind here—who can? But I speculate the ever-observant author is transcribing a very subtle bit of dialectical variation which has managed to escape the notice of most linguists. But first, a bit of background.

In English, words ending in <ng>, like sing or bang, are not usually pronounced with final [g] as the orthography might lead you to believe. Rather, they end with a single nasal consonant, either dorsal [ŋ] or coronal [n]. This subtle point of English pronunciation is not something most speakers are consciously aware of. But [n ~ ŋ] variation is sometimes commented on in popular discourse, albeit in a phonetically imprecise fashion: the coronal [n] variant is stigmatized as “g-dropping” (once again, despite the fact that neither variant actually contains a [g]). Everyone uses both variants to some degree. But the “dropped” [n] variant can be fraught: Peggy Noonan says it’s inauthentic, Samuel L. Jackson says it’s a sign of mediocrity, and merely transcribing it (as in “good mornin’“) might even get you accused of racism.

Pynchon presumably intends his -eens to be pronounced [in] on analogy with keen and seen. As it happens, [in] is a rarely-discussed variant of <ing> found in the speech of many younger Midwesterners and West Coast types, including yours truly. [1] Vyrna, of course, is a recent transplant from Silicon Valley and her dialogue contains other California features, including intensifiers awesome and totally and discourse particle like. And, I presume that Pynchon is attempting to transcribe high rising terminals, AKA uptalk—another feature associated with the West Coast—when he puts question marks on her declarative sentences (as in the passages above).

Only a tiny fraction of everyday linguistic variation is ever subject to social evaluation, and even less comes to be associated with groups of speakers, attitudes, or regions. As far as I know, this is the first time this variant has received any sort of popular discussion. -een may be on its way to becoming a California dialect marker (to use William Labov’s term [2]), though in reality it has a much wider geographic range.


[1] This does not exhaust the space of (ing) variant, of course. One of the two ancestors of modern (ing) is the Old English deverbal nominalization suffix -ing [iŋg]. In Principles of the English Language (1756), James Elphinston writes that [ŋg] had not fully coalesced, and that the [iŋg] variant was found in careful speech or “upon solemn occasions”. Today this variant is a stereotype of Scouse, and with [ɪŋk], occurs in some contact-induced lects.
[2] It is customary to also refer to Michael Silverstein for his notion of indexical order. Unfortunately, I still do not understand what Silverstein’s impenetrable prose adds to the discussion, but feel free to comment if you think you can explain it to me.

Gigaword English preprocessing

I recently took a little time out to coerce a recent version of the LDC’s Gigaword English corpus into a format that could be used for training conventional n-gram models. This turned out to be harder than I expected.


Gigaword English (v. 5) ships with 7 directories of gzipped SGML data, one directory for each of the news sources. The first step is, obviously enough, to decompress these files, which can be done with gunzip.


The resulting files are, alas, not XML files, which for all their verbosity can be parsed in numerous elegant ways. In particular, the decompressed Gigaword files do not contain a root node: each story is inside of <DOC> tags at the top level of the hierarchy. While this might be addressed by simply adding in a top-level tag, the files also contain a few  “entities” (e.g., &amp;) which ideally should be replaced by their actual referent. Simply inserting the Gigaword Document Type Definition, or DTD, at the start of each SGML file was sufficient to convert the Gigaword files to valid SGML.

I also struggled to find software for SGML-to-XML conversion; is this not something other people regularly want to do? I ultimately used an ancient library called OpenSP (open-sp in Homebrew), in particular the command osx. This conversion throws a small number of errors due to unexpected presence of UTF-8 characters, but these can be ignored (with the flag -E 0).

XML to text

Each Gigaword file contains a series of <DOC> tags, each representing a single news story. These tags have four possible type attributes; the most common one, story, is the only one which consistently contains coherent full sentences and paragraphs. Immediately underneath <DOC> in this hierarchy are two tags: <HEADLINE> and <TEXT>. While it would be fun to use the former for a study of Headlinese, <TEXT>—the tag surrounding the document body—is generally more useful. Finally, good old-fashioned <p> (paragraph) tags are the only children of <TEXT>. I serialized “story” paragraphs using the lxml library in Python. This library supports the elegant XPath query language. To select paragraphs of “story” documents, I used the XPath query /GWENG/DOC[@type="story"]/TEXT, stripped whitespace, and then encoded the text as UTF-8.

Text to sentences

The resulting units are paragraphs (with occasional uninformative line breaks), not sentences. Python’s NLTK module provides an interface to the Punkt sentence tokenizer. However, thanks to this Stack Overflow post, I became aware of its limitations. Here’s a difficult example from Moby Dick, with sentence boundaries (my judgements) indicated by the pipe character (|):

A clam for supper? | a cold clam; is THAT what you mean, Mrs. Hussey?” | says I, “but that’s a rather cold and clammy reception in the winter time, ain’t it, Mrs. Hussey?”

But, the default sentence tokenizer insists on sentence breaks immediately after both occurrences of “Mrs.”. To remedy this, I replaced the space after titles like “Mrs.”
(the full list of such abbreviations was adapted from GPoSTTL) with an underscore so as to “bleed” the sentence tokenizer, then replaced the underscore with a space after tokenization was complete. That is, the sentence tokenizer sees word tokens like “Mrs._Hussey”; since sentence boundaries must line up with word token boundaries, there is no chance a space will be inserted here. With this hack, the sentence tokenizer does that snippet of Moby Dick just right.

Sentences to tokens

For the last step, I used NLTK’s word tokenizer (nltk.tokenize.word_tokenize), which is similar to the (in)famous Treebank tokenizer, and then case-folded the resulting tokens.


In all, 170 million sentences, 5 billion word tokens, and 22 billion characters, all of which fits into 7.5 GB (compressed). Best of luck to anyone looking to do the same!