On being scooped

Some of my colleagues have over the years expressed concern that their ongoing projects are in danger of being “scooped”, and that as a result, they need to work rapidly to disseminate the projects in question. This concern is particularly prominent among the fast-moving (and unusually cargo-cultish) natural language processing community, though I have occasionally heard similar concerns in the company of theoretical linguists. Assuming this is not merely hysteria caused by material conditions like casualization and pandemic-related isolation, there is a simple solution: work on something else, something you yourself deem to be less obvious. If you’re in danger of being scooped, it suggests that you’re taking obvious next steps (you’re engaging in what Kuhn calls normal science) and that you lack a competitive advantage (such as rare-on-the-ground expertise, special knowledge, proprietary or unreleased data, etc.) that would help you in particular advance the state of knowledge. If you find yourself in this predicament, you should consider allowing somebody else to carry the football across the goal-line. Or don’t, but then you might just get scooped after all.

How to write linguistic examples

There is a standard, well-designed way in which linguists write examples, and failure to use it in a paper about language is a strong shibboleth suggesting unfamiliarity with linguistics as a field. In brief, it is as follows:

When an example (affix, word, phrase, or sentence) appears in the body (i.e., the middle of a sentence):
- if written in Roman, it should be italicized.
- if written in a non-Roman but alphabetic script like Cyrillic, italicization is optional. (Cyrillic italics are, like the Russian cursive hand they’re based on, famously hard for Western amateurs like myself to read.)
- if written in a non-alphabetic script, it can just be written as is, though you’re welcome to experiment.
- Examples should never be underlined, bolded, or placed in single or double quotes, regardless of the script used.
When an example is set off from the body (i.e., as a numbered example or in a table), it need not be italicized.
Any non-English example should be immediately followed with a gloss.
- A gloss should always be single-quoted.
- Don’t intersperse words like “meaning”, as in “…kitab meaning ‘book’…”, just write “…kitab ‘book’…”
If using morph-by-morph or word-by-word glossing, follow the Leipzig glossing conventions.

How to write numbers

A lot of students, and increasingly many given how young the field of NLP is, don’t know how to write numbers in papers. Here are a few basic principles (some of these are loosely based on the APA guidelines):

Use the same number of decimals every time and don’t omit trailing zeros after the decimal. Thus “.50” or “.5000” and not “.5”.
Round to a small number of decimals: 2, 4, or 6 are all standard choices.
Omit leading zeros before the decimal if possible values of whatever quantity are always within [0, 1]; thus you might say you got “.9823” accuracy.
(For LaTeX users) put the minus sign in math mode, too, or it’ll appear as a hyphen (ASCII char 45), which is quite a bit shorter and just looks wrong.
Use commas to separate the hundreds and thousands place (etc.) in large integers, and try not to use too many large exact integers; rounding is fine once they get large.
Expressions like “3k”, “1.3m” and “2b” are too informal; just write “3,000”, “1.3 million”, and “2 billion”.
Many evaluation metrics can either be written as (pseudo-)probabilities or percentages. Pick one or the other format and stick with it.
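For those who generate their results tables programmatically, the principles above can be sketched in a few lines of Python. This is only an illustration: the function names (format_proportion, format_count, format_latex_number) and the default number of decimals are my own assumptions, not part of any style guide or library.

```python
def format_proportion(value: float, decimals: int = 4) -> str:
    """Formats a quantity bounded in [0, 1] with a fixed number of decimals.

    Trailing zeros are kept and the leading zero is dropped, so 0.5
    becomes ".5000" rather than "0.5". (Hypothetical helper.)
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"expected a value in [0, 1], got {value}")
    text = f"{value:.{decimals}f}"
    # A value that rounds to 1 (e.g., "1.0000") has no leading zero to drop.
    return text[1:] if text.startswith("0.") else text


def format_count(value: int) -> str:
    """Formats a large integer with comma separators, e.g., "3,000"."""
    return f"{value:,}"


def format_latex_number(value: float, decimals: int = 2) -> str:
    """Wraps a number in LaTeX math mode so that a negative sign is
    typeset as a true minus sign rather than a hyphen."""
    return f"${value:.{decimals}f}$"


if __name__ == "__main__":
    print(format_proportion(0.9823))
    print(format_count(3000))
    print(format_latex_number(-1.5))
```

Using one such helper for every cell in a table is a cheap way to guarantee that the number of decimals stays the same throughout.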

A few other points about tables with numbers (looking at you, LaTeX users):

Right-align numbers in tables.
Don’t put two numbers (like mean and standard deviation or a range) in a single cell; the alignment will be all wrong. Just use more cells and tweak the intercolumnar spacing.
Don’t make the text of your tables smaller than the body text, which makes the table hard to read. Just redesign the table instead.

Moneyball Linguistics

[This is just a fun thought experiment. Please don’t get mad.]

The other day I had an intrusive thought: the phrase moneyball linguistics. Of course, as soon as I had a moment to myself, I had to sit down and think what this might denote. At first I imagined building out a linguistics program on a small budget like Billy Beane and the Oakland A’s. But it seems to me that linguistics departments aren’t really much like baseball teams: they’re only vaguely competitive (occasionally for graduate students or junior faculty), there’s no imperative to balance the roster, there’s no DL (or is it just sabbatical?), and so on. So the metaphor sort of breaks down. But the ideas of Beane and co. do seem to have some relevance to talking about individual linguists and labs. I don’t have OBP or slugging percentage for linguists, and I wouldn’t dare to propose anything so crude, but I think we can talk about linguists and their research as a sort of “cost center” and identify two major types of “costs” for the working linguist:

cash (money, dough, moolah, chedda, cheese, skrilla, C.R.E.A.M., green), and
carbon (…dioxide emissions).

I think it is a perfectly fine scientific approximation (not unlike competence vs. performance) to treat the linguistic universe as having a fixed amount of cash and carbon, so that we could use this thinking to build out a roster-department and come in just under the pay cap. While state research budgets do fluctuate—and while our imaginings of a better world should also include more science funding—it is hard to imagine near-term political change in the West would substantially increase it. And similarly, while there is roughly 10¹² kg of carbon in the earth’s crust, climate scientists agree that the vast majority of it really ought to stay there. Finally, I should note that maybe we shouldn’t treat these as independent factors, given that there is a non-trivial amount of linguistics funding via petrodollars. But anyways, without further ado, let’s talk about some types of researchers and how they score on the cash-and-carbon rubric.

Armchair research: The armchairist is clearly both low-cash (if you don’t count the sports coats) and low-carbon (if you don’t count the pipe smoke).
Field work: “The field” could be anywhere, even the reasonably affordable, accessible, and often charming Queens, but the archetypical fieldworker is flying in, first on a jet and then maybe reaching their destination via helicopter or seaplane. Once you’re there though, life in the field is often reasonably affordable, so this scores as low-cash, high-carbon.
Experimental psycholinguistics: Experimental psycholinguists have reasonably high capital/startup costs (in the form of eyetracking devices, for instance) and steady marginal costs for running subjects: the subjects themselves may come from the Psych 101 pool but somebody’s gotta be paid to consent them and run them through the task. We’ll call this medium-cash, low-carbon.
Neurolinguistics: The neurolinguistic imaging technique du jour, magnetoencephalography (or MEG), requires superconducting coils cooled to a chilly 4.2 K (roughly −452 °F); this in turn is accomplished with liquid helium. Not only is the cooling system expensive and power-hungry, the helium is mostly wasted (i.e., vented to the atmosphere). Helium is itself the second-most common element out there, but we are quite literally running out of the stuff here on Earth. So, MEG, at least, is high-cash, high-carbon.
Computational linguistics: There was a time not so long ago when I would have said that computational linguists were a bunch of hacky-sackers filling up legal pads with Greek letters (the weirder the better) and typing some kind of line noise they call “Haskell” into ten-year-old Thinkpads. But nowadays, deep learning is the order of the day, and the substantial carbon impact from these methods is well-documented, or at least well-estimated (e.g., Strubell et al. 2019). Now, it probably should be noted that a lot of the worst offenders (BigCos and the Quebecois) locate their data centers near sources of plentiful hydroelectric power, but not all of us live within the efficient transmission zones for hydropower. And of course, graphics processing units are expensive too. So most computational linguistics is, increasingly, high-cash, high-carbon.

On a more serious note, just so you know, unless you run an MEG lab or are working on something called “GPT-G6”, chances are your biggest carbon contributions are the meat you eat, the cars you drive, and the short-haul jet flights you take, not other externalities of your research.

References

Strubell, E., Ganesh, A., and McCallum, A. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645-3650.

Don’t take money from the John Templeton Foundation

Don’t take money from the John Templeton Foundation. They backed the murderous Chicago School economists, the genocidal architects of the war on Iraq, and are among the largest contributors to the climate change denial movement. That’s all.

Linguistics has its own Sokal affair

The Sokal affair was a minor incident in which physics professor Alan Sokal published a “hoax” (his term) paper in the cultural studies journal Social Text. Sokal’s intent was to demonstrate that reviewers and editors would approve of an article of utter nonsense so long as it obeyed certain preconceived notions, in this case that everything is a social construct. (It is, but that’s a story for another blog.)

The affair has been “read” many ways but it is generally understood to illustrate poor editorial standards at top humanities journals and/or the bankruptcy of the entire cultural studies enterprise. However, I don’t think we have any reason to suspect that either of these critiques is limited to cultural studies and adjacent fields.

I submit that the Pirahã recursion affair has many of the makings of a linguistic Sokal affair. But if anything, the outlook for linguistics is quite a bit worse than the Sokal story. By all accounts, Sokal’s hoax article was a minor scholarly event, and does not seem to have received much attention before it was revealed to be a hoax. In contrast, when Everett’s article first appeared in Current Anthropology in 2005, it received an enormous amount of attention from both scholars and the press, and ultimately led to multiple books, including a sympathetic portrait of Everett and his work by none other than the late Tom Wolfe (bang! krrp!). Finally, nearly all of what Everett has written on the subject is manifest nonsense.

I believe many scholars in linguistics and adjacent fields found Everett’s claim compelling, and while I think linguists should have seen through the logical leaps and magical thinking in the Current Anthropology piece, it wasn’t until a few years later, after the exchange with Nevins et al. in Language, that the empirical issues (to put it mildly) with Everett’s claims came to light. But the key element which gave Everett’s work such influence is that, like Sokal intended his hoax to do, it played to the biases (anti-generativist, and particularly, anti-Noam Chomsky) of a wide swath of academics (and to a lesser degree, fans of US empire, like Tom Wolfe). In that regard, it scarcely matters whether Everett himself believes or believed what he wrote: we have all been hoaxed.

Does GPT-3 have free speech rights?

I am pleased to be part of the Dec 9 @coling2020 panel "Should GPT3 Have the Right to Free Speech?" with Robert Dale, @emilymbender, and @pascalefung. I expect it to give me lots to think about. The panelists will make only short remarks before open discussion. Current thoughts:
— Christopher Potts (@ChrisGPotts) December 7, 2020

I have some discomfort with this framing. It strikes me as unnecessarily frivolous about some serious questions. Here is an imagined dialogue.

Should GPT-3 have the right to free speech?

No. Software does not have rights nor should it. Living things are the only agents in moral-ethical calculations. Free speech as it currently is construed should also be recognized as a civic myth of the United States, one not universally recognized. Furthermore it should be recognized that all rights, including the right to self-expression, can impinge upon the rights and dignity of others.

What if a court recognized a free-speech right for GPT-3?

Then that court would be illegitimate. However, it is very easy to imagine this happening in the States given that the US “civic myth” is commonly used to provide extraordinary legal protections to corporate entities.

What if that allowed it to spread disinformation?

Then the operator would be morally responsible for all consequences of that dissemination.

They’re going to tell you…

…at some very near point in the future, that there’s something inherently white supremacist about teaching and studying generative linguistics. They will never tell you how generative linguistics enforces white supremacy, but they will tell you that it represents a hegemonic power in the science of language (it does not, it is clearly just one way of knowing, spottily represented outside the Anglophone west) and that it competes for time and mindshare with other forms of linguistic knowledge (an unexamined austerity mindset). This rhetorical trick (the same one used to slander the socialist left across the democratic West 2016-present) would simply not work on the generative community were they a militant, organized, self-assured vanguard rather than a casualized, disorganized, insecure community, one seriously committed to diversity in race and sexual orientation but largely uninterested in matters of class and power. And then, once you’ve accepted their framing, they’re going to sell you a radically empiricist psycho-computational mode of inquiry that is deeply incurious about language diversity, that cares not a whit for the agency of speakers, and trains students to serve the interests of the most powerful men in the world.

Words and what we should do about them

Every January linguists and dialectologists gather for the annual meeting of the Linguistic Society of America and its sister societies. And, since 1990, attendees crowd into a conference room to vote for the American Dialect Society’s Word Of The Year (or WOTY for short). The guidelines for nominating and selecting the WOTY are deliberately underdetermined. There are no rules about what’s a word (and, increasingly, picks are not even a word under any recognizable definition thereof), what makes a word “of the year” (should it be a new coinage? should its use be vigorous or merely on the rise? should it be stereotyped or notorious? should it reflect the cultural zeitgeist?) or even whether the journalists in the room are eligible to vote.

By my count, there are two major categories of WOTY winners over the last three decades: commentary on US and/or world political events, or technological jargon; I count 14 in the former category (1990’s bushlips, 1991’s mother of all, 2000’s chad, 2001’s 9-11, 2002’s WMD, 2004’s red state/blue state, 2005’s truthiness, 2007’s subprime and 2008’s bailout, 2011’s occupy, 2014’s #blacklivesmatter, 2016’s dumpster fire, 2017’s fake news, 2018’s tender-age shelter) and 9 in the latter (1993’s information superhighway, 1994’s cyber, 1995’s web, 1997’s millennium bug, 1998’s e-, 1999’s Y2K, 2009’s tweet, 2010’s app, 2012’s hashtag). But, as Allan Metcalf, former executive of the American Dialect Society, writes in his 2004 book Predicting New Words: The Secrets Of Their Success, terms which comment on a situation, rather than fill some denotational gap, rarely have much of a future. And looking back, some of these picks not only fail to recapitulate the spirit of the era but many (bushlips, newt, morph, plutoed) barely denote at all. Of those still recognizable, it is shocking how many refer to avoidable human tragedies: a presidential election decided by a panel of judges, two bloody US incursions into Iraq and the hundreds of thousands of civilian casualties that resulted, the subprime mortgage crisis and the unprecedented loss of black wealth that resulted, and unchecked violence by police and immigration officers against people of color and asylum-seekers.

Probably the clearest example of this is the 2018 WOTY, tender-age shelter. This ghoulish euphemism was not, in my memory, a prominent 2018 moment, so for the record, it refers to a Trump-era policy of separating asylum-seeking immigrants from their children. Thus, “they’re not child prisons, they’re…”. Ben Zimmer, who organizes the WOTY voting, opined that this was a case of bureaucratic language backfiring, but I disagree: there was no meaningful blowback. The policy remains in place, and the people who engineered the policy remain firmly in power for the foreseeable future, just as do the architects of and propagandists for the Iraqi invasions (one of whom happens to be a prominent linguist!), the subprime mortgage crisis, and so on. Tender-age shelter is of course by no means the first WOTY that attempts to call out right-wing double-talk, but as satire it fails. There’s no premise (it is not even in the common ground that the US linguistics community, or the professional societies who represent them, fervently desires an end to the aggressive detention and deportation of undocumented immigrants, which after all has been bipartisan policy for decades, and will likely remain so until at least 2024), and without this there is no irony to be found. Finally, it bespeaks a preoccupation with speech acts rather than dire material realities.

This is not the only dimension on which the WOTY community has failed to self-criticize. A large number of WOTY nominees (though few outright winners) of the last few years have clear origins in the African-American community (e.g., 2017 nominees wypipo, caucasity, and 🐐, 2018 nominees yeet and weird flex but OK, 2019 nominees Karen and woke). Presumably these terms become notable to the larger linguistics community via social media. It is certainly possible for the WOTY community to celebrate language of people of color, but it is also possible to read this as exotification. The voting audience, of course, is upper-middle-class and mostly-white, and here these “words”, some quite well-established in the communities in which they originate, compete for novelty and notoriety against tech jargon and of-the-moment political satire. As scholars of color have noted, this could easily reinforce standard ideologies that view African-American English as a debased form of mainstream English rather than a rich, rule-governed system in its own right. In other words, the very means by which we as linguists engage in public-facing research risk reproducing linguistic discrimination:

How might linguistic research itself, in its questions, methods, assumptions, and norms of dissemination, reproduce or work against racism? (“LSA Statement on Race”, Hudley & Mallinson 2019)

I conclude that the ADS should issue stringent guidance about what makes expressions “words”, and what makes them “of the year”. In particular, these guidelines should orient voters towards linguistic novelty, something the community is well-situated to assess.

Elizabeth Warren and the morality of the professional class

I am surprised by the outpouring of grief engendered by Senator Elizabeth Warren’s exit from the presidential primary among my professional friends and colleagues. I dare not tell them how they ought to feel, but the spectacle of grief makes me wonder whether my friends are selling themselves short: virtually all of them have lived, in my opinion, far more virtuous lives than the senator from Massachusetts.

First off, none of them have spent most of their professional lives as right-wing activists, as did Warren, a proud Republican until the late ’90s. As recently as 1991, Warren gave a keynote at a meeting of the Federalist Society, the shadowy anti-choice legal organization that gave us Justice Brett Kavanaugh and so many other young ultra-conservative judicial appointees.

Secondly, Warren spent decades lying about her Cherokee heritage, presumably for nothing more than professional gain. This is a stunningly racist personal behavior, one that greatly reinforces white supremacy by equating the almost-unimaginable struggles of indigenous peoples with plagiarized recipes and “high cheekbones”. Were any of my friends or colleagues caught lying so blatantly on a job application, they would likely be subject to immediate termination. It is shocking that Warren has not faced greater professional repercussions for this lapse in judgment.

Warren’s more recent history of regulatory tinkering around the most predatory elements of US capitalism, while important, is hardly an appropriate penance for these two monumental personal-professional sins.