(ing): now with 100% more enregisterment!

In his new novel Bleeding Edge, Thomas Pynchon employs a curious bit of eye dialect for Vyrna McElmo, one of the denizens of his bizarro pre-9/11 NYC:

All day down there. I’m still, like, vibrateen? He’s a bundle of energy, that guy.

Oh? Torn? You’ll think it’s just hippyeen around, but I’m not that cool with a whole shitload of money crashing into our life right now?

What’s going on with vibrateen and hippyeen? I can’t be sure what Pynchon has in mind here—who can? But I speculate that the ever-observant author is transcribing a very subtle bit of dialectal variation which has managed to escape the notice of most linguists. First, though, a bit of background.

In English, words ending in <ng>, like sing or bang, are not usually pronounced with final [g] as the orthography might lead you to believe. Rather, they end with a single nasal consonant, either dorsal [ŋ] or coronal [n]. This subtle point of English pronunciation is not something most speakers are consciously aware of. But [n ~ ŋ] variation is sometimes commented on in popular discourse, albeit in a phonetically imprecise fashion: the coronal [n] variant is stigmatized as “g-dropping” (once again, despite the fact that neither variant actually contains a [g]). Everyone uses both variants to some degree. But the “dropped” [n] variant can be fraught: Peggy Noonan says it’s inauthentic, Samuel L. Jackson says it’s a sign of mediocrity, and merely transcribing it (as in “good mornin’”) might even get you accused of racism.

Pynchon presumably intends his -eens to be pronounced [in] on analogy with keen and seen. As it happens, [in] is a rarely-discussed variant of <ing> found in the speech of many younger Midwesterners and West Coast types, including yours truly. [1] Vyrna, of course, is a recent transplant from Silicon Valley and her dialogue contains other California features, including intensifiers awesome and totally and discourse particle like. And, I presume that Pynchon is attempting to transcribe high rising terminals, AKA uptalk—another feature associated with the West Coast—when he puts question marks on her declarative sentences (as in the passages above).

Only a tiny fraction of everyday linguistic variation is ever subject to social evaluation, and even less comes to be associated with groups of speakers, attitudes, or regions. As far as I know, this is the first time this variant has received any sort of popular discussion. -een may be on its way to becoming a California dialect marker (to use William Labov’s term [2]), though in reality it has a much wider geographic range.

Endnotes

[1] This does not exhaust the space of (ing) variants, of course. One of the two ancestors of modern (ing) is the Old English deverbal nominalization suffix -ing [iŋg]. In Principles of the English Language (1756), James Elphinston writes that [ŋg] had not yet fully coalesced into [ŋ], and that the [iŋg] variant was found in careful speech or “upon solemn occasions”. Today this variant is a stereotype of Scouse, and, along with [ɪŋk], it occurs in some contact-induced lects.
[2] It is customary to also refer to Michael Silverstein for his notion of indexical order. Unfortunately, I still do not understand what Silverstein’s impenetrable prose adds to the discussion, but feel free to comment if you think you can explain it to me.

Gigaword English preprocessing

I recently took a little time out to coerce a recent version of the LDC’s Gigaword English corpus into a format that could be used for training conventional n-gram models. This turned out to be harder than I expected.

Decompression

Gigaword English (v. 5) ships with 7 directories of gzipped SGML data, one directory for each of the news sources. The first step is, obviously enough, to decompress these files, which can be done with gunzip.
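If you prefer to keep the whole pipeline in Python, a few lines do the same job. This is just a sketch; the directory layout shown is hypothetical.

import glob
import gzip
import shutil

# Decompress every .gz file under the data directories, writing each result
# alongside the original, minus the .gz suffix.
for path in glob.glob("gigaword_eng_5/data/*/*.gz"):
    with gzip.open(path, "rb") as source, open(path[:-3], "wb") as sink:
        shutil.copyfileobj(source, sink)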

SGML to XML

The resulting files are, alas, not XML files, which for all their verbosity can be parsed in numerous elegant ways. In particular, the decompressed Gigaword files do not contain a root node: each story is inside of <DOC> tags at the top level of the hierarchy. While this might be addressed by simply adding in a top-level tag, the files also contain a few “entities” (e.g., &amp;) which ideally should be replaced by their actual referents. Simply inserting the Gigaword Document Type Definition, or DTD, at the start of each SGML file was sufficient to convert the Gigaword files to valid SGML.
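That step is mechanical enough to script. Here is a minimal sketch in Python, assuming a local copy of the DTD; all filenames are placeholders.

# Prepend the Gigaword DTD to a decompressed file so SGML tools will accept it.
with open("gigaword.dtd") as dtd, open("afp_eng_199405") as source:
    document = dtd.read() + source.read()
with open("afp_eng_199405.sgml", "w") as sink:
    sink.write(document)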

I also struggled to find software for SGML-to-XML conversion; is this not something other people regularly want to do? I ultimately used an ancient library called OpenSP (open-sp in Homebrew), in particular the command osx. This conversion throws a small number of errors due to the unexpected presence of UTF-8 characters, but these can be ignored (with the flag -E 0).
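For completeness, the conversion can also be driven from Python. This sketch assumes that osx writes the converted XML to standard output, as the other OpenSP tools do; the filenames are again placeholders.

import subprocess

# Run OpenSP's osx on one SGML file; -E 0 (see above) keeps it from giving up
# when it hits the stray UTF-8 characters.
with open("afp_eng_199405.xml", "w") as sink:
    subprocess.call(["osx", "-E", "0", "afp_eng_199405.sgml"], stdout=sink)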

XML to text

Each Gigaword file contains a series of <DOC> tags, each representing a single news story. These tags have four possible values for their type attribute; the most common one, story, is the only one which consistently contains coherent full sentences and paragraphs. Immediately underneath <DOC> in this hierarchy are two tags: <HEADLINE> and <TEXT>. While it would be fun to use the former for a study of Headlinese, <TEXT>—the tag surrounding the document body—is generally more useful. Finally, good old-fashioned <p> (paragraph) tags are the only children of <TEXT>. I serialized “story” paragraphs using the lxml library in Python, which supports the elegant XPath query language. To select paragraphs of “story” documents, I used the XPath query /GWENG/DOC[@type="story"]/TEXT, stripped whitespace, and then encoded the text as UTF-8.
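Concretely, the extraction step looks something like the following sketch (filenames are placeholders, and error handling is omitted):

from lxml import etree

# Select the <TEXT> element of each "story" document, then write out its <p>
# children, one paragraph per line, as UTF-8.
tree = etree.parse("afp_eng_199405.xml")
with open("afp_eng_199405.txt", "w", encoding="utf-8") as sink:
    for text in tree.xpath('/GWENG/DOC[@type="story"]/TEXT'):
        for par in text.findall("p"):
            if par.text:
                # Collapse internal line breaks and padding whitespace.
                print(" ".join(par.text.split()), file=sink)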

Text to sentences

The resulting units are paragraphs (with occasional uninformative line breaks), not sentences. Python’s NLTK module provides an interface to the Punkt sentence tokenizer. However, thanks to this Stack Overflow post, I became aware of its limitations. Here’s a difficult example from Moby Dick, with sentence boundaries (my judgements) indicated by the pipe character (|):

A clam for supper? | a cold clam; is THAT what you mean, Mrs. Hussey?” | says I, “but that’s a rather cold and clammy reception in the winter time, ain’t it, Mrs. Hussey?”

But, the default sentence tokenizer insists on sentence breaks immediately after both occurrences of “Mrs.”. To remedy this, I replaced the space after titles like “Mrs.” (the full list of such abbreviations was adapted from GPoSTTL) with an underscore so as to “bleed” the sentence tokenizer, then replaced the underscore with a space after tokenization was complete. That is, the sentence tokenizer sees word tokens like “Mrs._Hussey”; since sentence boundaries must line up with word token boundaries, there is no chance a space will be inserted here. With this hack, the sentence tokenizer does that snippet of Moby Dick just right.
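Here is a sketch of that hack. The abbreviation list is truncated to a few titles for illustration (the real one came from GPoSTTL), and the tokenizer requires NLTK's punkt data to be installed.

import nltk  # requires the punkt model: nltk.download("punkt")

# A few sample titles; the full list was adapted from GPoSTTL.
ABBREVIATIONS = ("Mr.", "Mrs.", "Ms.", "Dr.", "Prof.")

def split_sentences(paragraph):
    # "Bleed" the tokenizer: glue each title to the word that follows it...
    for abbr in ABBREVIATIONS:
        paragraph = paragraph.replace(abbr + " ", abbr + "_")
    # ...run Punkt...
    sentences = nltk.sent_tokenize(paragraph)
    # ...then restore the protected spaces.
    return [s.replace("_", " ") for s in sentences]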

Sentences to tokens

For the last step, I used NLTK’s word tokenizer (nltk.tokenize.word_tokenize), which is similar to the (in)famous Treebank tokenizer, and then case-folded the resulting tokens.
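For concreteness, here is a minimal version of that step (the function name is mine):

from nltk.tokenize import word_tokenize

def tokenize(sentence):
    # Treebank-style word tokenization followed by case-folding.
    return [token.casefold() for token in word_tokenize(sentence)]

# tokenize("A clam for supper?") == ['a', 'clam', 'for', 'supper', '?']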

Summary

In all, 170 million sentences, 5 billion word tokens, and 22 billion characters, all of which fits into 7.5 GB (compressed). Best of luck to anyone looking to do the same!

Defining libfixes

A recent late-night discussion with two fellow philologists revealed some interesting problems in defining libfixes. Arnold Zwicky coined this term to describe affix-like formatives such as -(a)thon (from marathon; e.g., saleathon) or -(o)holic (from alcoholic, e.g., chocoholic) that appear to have been extracted (“liberated”) from another word. These are then affixed to free stems, and the resulting form often conveys a sense of either jocularity or pejoration. The extraction of libfixes is a special case of what historical linguists call “recutting”, and like recutting in general, the ontogenesis of libfixation is largely mysterious.

As the evening’s discussion showed, it is not trivial to distinguish libfixation from similar derivational processes. What follows are a few examples of interesting derivational processes which in my opinion should not be identified with libfixation.

Blending is not libfixation

One superficially similar process is “blending”, in which new forms are derived by combining identifiable subparts of two simplex words. The resulting forms are sometimes called “portmanteaux” (sg. “portmanteau”), a term of art with its own interesting history. Two canonical blends are br-unch and sm-og, derived from the unholy union of breakfast and lunch, and smoke and fog, respectively. These two are particularly memorable—yet unobtrusive—thanks to a clever indexical trick: both word and referent are mongrel-like in their own ways. What exactly distinguishes blending from libfixation? I see two features which distinguish the two word-formation processes.

The first is productivity: libfixation has some degree of productivity whereas blending does not. In no other derivative can one find the “pieces” (I am using the term pretheoretically) of smog, namely sm- and -og. In contrast, there are over a dozen novel -omicses and dozens of -gates. There is therefore no reason to posit that either sm- or -og has been reconceptualized as an affix.

The second feature which distinguishes blending and libfixation deals with the way the pieces are spelled out. Libfixes are affixes and do not normally modify the freestanding base they attach to. In blends, one form overwrites part of the other (and vice versa). Were -og a newly liberated suffix, we would expect *smoke-og. This criterion also suggests that mansplain, poptimism, and snowquester are not in fact instances of libfixation; in each case, material from the “base” (I also use this term pretheoretically) is deleted.

Zwicky himself has noted the existence of a blend-libfix cline, and the tendency of blends to become libfixes. He suggests the following natural history:

A portmanteau word (useful or playful or both) invites other portmanteaus sharing an element (usually the second), and then these drift from the phonology and semantics of the original to such an extent that the shared element takes on a life of its own — is “liberated” as an affix.

 

Clipping is not libfixation

“Clipping” (or “truncation”) is a process which reduces a word to one of its parts. Sometimes truncated forms are themselves used for compound formation. For instance, burger is derived from Hamburger ‘resident of Hamburg’ (the semantic connection is a mystery). According to the Online Etymology Dictionary, forms like cheese-burger appear in the historical record at about the same time as burger itself. There is one way that clipping is distinct from libfixation, however. Clippings are free forms (i.e., prosodic words), whereas libfixes need not be. In particular, whereas some libfixes have homophonous free forms (e.g., -gate, -core), these are semantically distinct: whereas one can claim to love burgers, one cannot reasonably claim that the current administration has fallen prey to many gates.

The curious case of -giving

To conclude, consider a new set of words in -giving, including Friendsgiving, Fauxgiving, and Spanksgiving. These are not blends according to the criteria above, and while giving is a free form, the bound form has different semantics (something like ‘holiday gathering’). But is -giving a libfix? I’d say that it depends on whether Thanksgiving, etymologically a noun-gerund compound, is synchronically analyzed as such. If so, -giving has not so much been extracted as reanalyzed as a noun-forming suffix, a curious development but not an event of affix liberation.

h/t: Stacy Dickerman, John Kelly

LOESS hyperparameters without tears

LOESS is a classic non-parametric regression technique. One potential issue that arises is that LOESS fits depend on several hyperparameters (i.e., parameters set by the experimenter a priori). In this post I’ll take a quick look at how to set these.

At each point in a LOESS curve, the y-value is derived from a weighted regression using a low-degree polynomial fit to the nearby data. The first hyperparameter is the degree of these local fits. Most users set the degree to 2 (i.e., use local quadratic curves), and with good reason: at degree 0 you are just computing a local weighted average, degree 1 gives you only local straight lines, and degrees higher than 2 (e.g., cubic) tend not to have much of an effect.

The other hyperparameter is the span, which controls the amount of smoothing: roughly, the fraction of the data used for each local fit. A value near 0 uses almost no context, and a value of 1 uses the entire sample (so the result will be similar to fitting a single quadratic function to the data). The choice of this value has a major effect on the quality of the fit obtained:

[Figure: LOESS fits to the same randomly generated data with three settings of the span hyperparameter, labeled “good”, “bad”, and “ugly”.]

For the randomly generated data here, large values of the span parameter (“bad”) produce a LOESS which fails to follow the larger trend, whereas small values (“ugly”) primarily model noise. For this reason alone, the experimenter should probably not be permitted to select the span hyperparameter herself.

Fortunately, there are several objective functions that can be used to determine an “optimal” setting for the span parameter. Hurvich et al. (1998) propose a particularly privileged objective: minimizing AICc, a small-sample correction of the Akaike Information Criterion. This has been used to generate the “good” curve above. Here’s how I did it (adapted from this post to R-help).
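The idea, following Hurvich et al., is to compute AICc from a fit's residual variance and the trace of its smoother ("hat") matrix, and then minimize that objective over the span. A minimal sketch in R (the function names are mine, not part of any package):

aicc.loess <- function(fit) {
    # AICc for a loess fit, after Hurvich, Simonoff & Tsai (1998).
    stopifnot(inherits(fit, "loess"))
    n <- fit$n
    trace <- fit$trace.hat
    sigma2 <- sum(resid(fit) ^ 2) / n
    log(sigma2) + 1 + 2 * (trace + 1) / (n - trace - 2)
}

autoloess <- function(fit, span=c(.1, .9)) {
    # Refit a loess model using the span that minimizes AICc.
    f <- function(s) aicc.loess(update(fit, span=s))
    best.span <- optimize(f, span)$minimum
    update(fit, span=best.span)
}

# e.g.: autoloess(loess(y ~ x, degree=2))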

There is also an R package, fANCOVA, which apparently includes a function loess.as that automatically determines the span parameter, presumably in a fashion similar to what I’ve done here. I haven’t tried it.

PS to those inclined to care: the origin of the memetic, snarky, academic “X without tears” title is, to my knowledge, J. Eric S. Thompson’s 1972 book Maya Hieroglyphs Without Tears. While I have every reason to believe Thompson was poking fun at his detractors, it’s interesting to note that he turned out to be fabulously wrong about the nature of the hieroglyphs.

On the Providence word gap intervention

A recent piece in the Boston Globe quoted my take on a grant to Providence, RI for a “word-gap” intervention. In this quote I expressed some skepticism about the grant’s goals, but the part of the email where I explained why I felt that way was omitted. Readers of the piece might have gotten the impression that I have a less, uhm, nuanced take on the Providence grant than I do. So, here is a summary of my full email to Ben from which the quote was taken.

An ambitious proposal

First off, the Providence/LENA team should be congratulated on this successful grant application: I’m glad the money went to them and not to something more “Bloombergian” (like, say, an experimental proposal to ban free-pizza-with-beer deals to combat bulging hipster waistlines). They also deserve respect for getting such an ambitious proposal approved: the cash involved is an order of magnitude larger than the average applied linguistics grant. And, perhaps most of all, I have a great deal of respect for any linguist who can convince a group of non-experts not only that their work is important, but that it is worth the opportunity cost. I also note that if materials from the Providence study are made publicly available (and they should be, in a suitably de-identified format, for the sake of the progress of the human race), my own research stands to benefit from this grant.

But the proposal is ambitious in another sense as well: the success of this intervention depends on a long chain of inferences, and if any one of them is wrong, the intervention is unlikely to succeed. Here are the major assumptions, as I see them, under which the intervention is being funded.

Assumption I: There exists a “word gap” in lower-income children

I was initially skeptical of this claim because it is so similar to a discredited assumption of 20th-century educational theorists: that differences in school and standardized-test performance were the result of the “linguistically impoverished” environment in which lower-class (and especially minority) speakers grew up.

This strikes me as quite silly: no one who has even a tenuous acquaintance with African-American communities could fail to note the importance of verbal skills in those communities. Every African-American stereotype I can think of has one thing in common: an emphasis on verbal abilities. Here’s what Bill Labov, founder of sociolinguistics, had to say in his 1972 book, Language in the Inner City:

Black children from the ghetto area are said to receive little verbal stimulation, to hear very little well-formed language, and as a result are impoverished in their means of verbal expression…Unfortunately, these notions are based upon the work of educational psychologists who know very little about language and even less about black children. The concept of verbal deprivation has no basis in social reality. In fact, black children in the urban ghettos receive a great deal of verbal stimulation…and participate fully in a highly verbal culture. (p. 201)

I suspect that Labov may have dismissed the possibility of input deficits prematurely, just as I did. After all, it is an empirical hypothesis, and while Betty Hart and Todd Risley’s original study on differences in lexical input involved a small and perhaps atypical sample, the correlation between socioeconomic status and lexical input has since been replicated many times. So, there may be something to the “impoverishment theory” after all.

Assumption II: LENA can really estimate input frequency

Can we really count words using current speech technology? In a recent Language Log post, Mark Liberman speculated that counting words might be beyond the state of the art. While I have been unable to find much information on the researchers behind the grant or behind LENA, I don’t see any reason to doubt that the LENA Foundation has in fact built a useful state-of-the-art speech system that allows them to estimate input frequencies with great precision. One thing that gives me hope is that a technical report by LENA researchers provides estimates of average input frequency in English which are quite close to an estimate computed by developmentalist Dan Swingley (in a peer-reviewed journal) using entirely different methods.

Assumption III: The “word gap” can be solved by intervention

For children who are identified as “at risk”, the Providence intervention offers the following:

Families participating in Providence Talks would receive these data during a monthly coaching visit along with targeted coaching and information on existing community resources like read-aloud programs at neighborhood libraries or special events at local children’s museums.

Will this have a long-term effect? I simply don’t know of any work looking into this (though please comment if you’re aware of something relevant), so this too is a strong assumption.

Given that there is now money in the budget for coaching, why are LENA devices necessary? Would it be better if any concerned parent could get coaching?

And, finally, do the caretakers of the most at-risk children really have time to give to this intervention? I believe the most obvious explanation of the correlation between verbal input and socioeconomic status is that caretakers on the lower end of the socioeconomic scale have less time to give to their children’s education: this follows from the observation that child care quality is a strong predictor of cognitive abilities. If this is the case, then simply offering counseling will do little to eliminate the word gap, since the families most at risk are the least able to take advantage of the intervention.

Assumption IV: The “word gap” has serious life consequences

Lexical input is clearly important for language development: it is, in some sense, the sole factor determining whether a typically developing child acquires English or Yawelmani. And, we know the devastating consequences of impoverished lexical input.

But here we are at risk of falling for an all-too-common fallacy: assuming that the factors which produce clinical deficits also explain variance within the subclinical population. While massively impoverished language input gives rise to clinical language deficits, it does not follow that differences in language skills among typically developing children can be eliminated by leveling the language-input playing field.

Word knowledge (as measured by verbal IQ, for instance) is correlated with many other measures of language attainment, but are increases in language skills enough to help an at-risk child to escape the ghetto (so to speak)?

This is the most ambitious assumption of the Providence intervention. Because there is such a strong correlation between lexical input and social class, it is very difficult to control for class while manipulating lexical input (and doing so would presumably be wildly unethical), so we know very little about this. I hope that the Providence study will shed some light on the question.

So what’s wrong with more words?

This is exactly what my mom wanted to know when I sent her a link to the Globe piece. She wanted to emphasize that I only got the highest-quality word-frequency distributions all throughout my critical period! I support, tentatively, the Providence initiative and wish them the best of luck; if these assumptions all turn out to be true, the organizers and scientists behind the grant will be real heroes to me.

But, that leads me to the only negative effect this intervention could have: if closing the word gap does little to influence long-term educational outcomes, it will have made concerned parents unduly anxious about the environment they provide for their children. And that just ain’t right.

(Disclaimer: I work for OHSU, where I’m supported by grants, but these are my professional and personal opinions, not those of my employer or funding agencies. That should be obvious, but you never know.)

Making high-quality graphics in R

There are a lot of different ways to make an R graph for TeX; this is my workflow.

In R

I use cairo_pdf to write a graph to disk. This command takes arguments for image size and for font size and face. If you’re on a Mac, you will need to install X11.

Image size

I always specify graph size by hand, in inches. For manuscripts and handouts, I usually set the width to be the printable width. If you’re using 1″ margins, that’s 6.5″. Then, I adjust height until a pleasing form emerges.

Fonts

I match the font face of the manuscript (whatever I’m using) and graph labels by passing the font name as the argument to family. This matters most if you’re writing a handout, and matters less if you’re sending it to, say, Oxford University Press, who will redo your text anyways. I found out the hard way that the family keyword argument is absent in older versions of R, so you may need to upgrade. By default, the image font is 12pt; this is generally fine, but it can be adjusted with the pointsize argument.

Graphing

This is a no-brainer: use ggplot2.

All together now

library(ggplot2)  # provides qplot

# Width and height are in inches; family sets the font face (see above).
cairo_pdf('mygraph.pdf', width=6.5, height=4, family='Times New Roman')
qplot(X, Y, data=dat)  # X, Y, and dat stand in for your own data
dev.off()

In TeX

Add \usepackage{graphicx} to your preamble, if it’s not already there. In the body, \includegraphics{mygraph.pdf}.

TeX tips for linguists

I’ve been using TeX to write linguistics papers for nearly a decade now. I still think it’s the best option. Since TeX is a big, complex ecosystem and not at all designed with linguists in mind, I thought it might be helpful to describe the tools and workflow I use to produce my papers, handouts, and abstracts.

Michael Becker‘s notes are recommended as well.

Software

I use xelatex (i.e., XeTeX with the LaTeX format). It has two advantages over the traditional pdflatex and related tools. First, you can use system fonts via fontspec and mathspec. If you are using Computer Modern or packages like txfonts or times, it’s time to join the modern world.

Second, it expects UTF-8. If you are using tipa or multi-character sequences to enter non-ASCII characters, then you probably have ugly transcriptions. (Don’t want to name names…)

Fonts

Linguists generally demand the following types of characters:

  • Alphabetic small caps
  • The complete IPA, especially IPA [g] (which is not English “g”)
  • “European” extensions to ASCII: enye (año), diaeresis (coöperation, über), acute (résumé), grave (à), macron (māl), circumflex (être), haček (očudit), ogonek (Pająk), eth (fracoð), thorn (þæt), eszett (Straße), cedilla (açai), dotted g (ealneġ), and so on, for Roman-like writing systems

The only font I’ve found that has all this is Linux Libertine. It has nothing to do with Linux, per se. In general, it’s pretty handsome, especially when printed small (though the Q is ridiculously large). If you can do without small caps (and arguably, linguists use them too much), then a recent version of Times New Roman (IPA characters were added recently) also fits the bill. Unfortunately, if you’re on Linux and using the “MS Core Fonts”, the version of Times New Roman you get doesn’t have the IPA characters.

This is real important: do not allow your mathematical characters to be in Computer Modern if your paper is not in Computer Modern. It sticks out like a sore thumb. What you do is put something like this in the preamble:

\usepackage{mathspec}
\setmainfont[Mapping=tex-text]{Times New Roman}
\setmathfont(Digits,Greek,Latin){Times New Roman}

Examples

The gb4e package seems to be the best one for syntax-style examples, with morph-by-morph glossing and the like. I myself deal mostly in phonology, so I use the tabular environment wrapped with an example environment of my own creation called simplex.sty and packaged in my own LingTeX (which is a work-in-progress).

Bibliographies

I use natbib. When I have a choice, I usually reach for pwpl.bst, the bibliography style that we use for the Penn Working Papers in Linguistics. It’s loosely based on the Linguistic Inquiry style.

Compiling

I use make for compiling. This will be installed on most Linux computers. On Macintoshes, you can get it as part of the Developer Tools package, or with Xcode.

I type make to compile, make bib to refresh the bibliography, and make clean to remove all the temporary files. Here’s what a standard Makefile for paper.tex would look like for me.

 # commands
 PDFTEX=xelatex -halt-on-error
 BIBTEX=bibtex

 # files
 NAME=paper

 all: $(NAME).pdf

 $(NAME).pdf: $(NAME).tex $(NAME).bib *.pdf
      $(PDFTEX) $(NAME).tex
      $(BIBTEX) $(NAME)
      $(PDFTEX) -interaction=batchmode -no-pdf $(NAME).tex
      $(PDFTEX) -interaction=batchmode $(NAME).tex

 bib:
      $(BIBTEX) $(NAME)

 clean:
      latexmk -c

There are a couple of interesting things here. -halt-on-error kills a compile the second it goes bad: why wouldn’t you want to fix the problem right when it’s detected, since it won’t produce a full PDF anyways? Both -interaction=batchmode and -no-pdf shave off a considerable amount of compile time, but they aren’t practical when debugging or when producing a final PDF, respectively. I use latexmk -c, which reads the log files and removes temporary files but preserves the target PDF. For some reason, though, it doesn’t remove .bbl files.

Draft mode

Up until you’re done, start your file like so:

\documentclass[draft,12pt]{article}

This will do two things. First, “overfull hboxes” will be marked with a black line, so you can rewrite and fix them. Second, images won’t be rendered into the PDF, which saves time. It’s very easy to tell who does this and who doesn’t.

On slides and posters

There are several TeX-based tools for making slides and posters. Beamer seems to be very popular for slides, but I find the default settings very ugly (gradients!) and cluttered (navigation buttons on every slide!). I use a very minimal Keynote style instead (Helvetica Neue, black on white). I’m also becoming a bigger fan of handouts: unlike making slides or posters, the lengthy process of making a handout gets me that much closer to publication.