Libfix report for February 2018

While -splain (and -splainer, -splaining) clearly have potential, they hadn’t, as far as I could tell, gotten much beyond mansplain and, occasionally, womansplain. But I changed my mind once I saw a podcast episode entitled “Orbsplainer”, about, well, the orb, you remember the orb, right? How could you forget the Orb? The Orb forbids it! Anyways, looks like a libfix to me.
Constantine Lignos draws my attention to -tainment, a term which refers to media (particularly video and video games) which entertains while also doing something else. The locus classicus is the ’90s term edutainment, which looks much more like a blend than a libfix, as do infotainment, politainment, and psychotainment. But pornotainment suggests this is on its way to affix liberation.

How, why, and when to flatten your conditionals

You may be tempted to write code that looks a little like this:

for item in items:
    if not condition_1(item):
        if not condition_2(item, False):
            if not condition_3(item, 3, 3):
                if not condition_4(item):
                    do_work(item)

But please, don’t. Flatten your conditionals instead.

How to flatten your conditionals

There is a relatively straightforward alternative to the above. Instead of nesting, we use continue statements to short-circuit the cascade. This looks a little bit like this:

for item in items:
    if condition_1(item):
        continue
    if condition_2(item, False):
        continue
    if condition_3(item, 3, 3):
        continue
    if condition_4(item):
        continue
    do_work(item)

That’s pretty much all there is to it.

Perhaps you’re not inside of a loop, but rather inside of a “nullable” function or method (i.e., one which may reasonably return None); that’s okay, replace continue with return. Perhaps there’s more than one item you’re possibly shipping off to do_work on; that’s okay, wrap the conditionals in a function and use return to short-circuit evaluation. Perhaps you want to terminate the entire loop, not just this iteration thereof; that’s okay, replace continue with break.
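Here is a minimal sketch of the return and break variants. The helpers is_negative, is_odd, and the body of do_work are hypothetical stand-ins invented for illustration; they are not the condition_* functions from the loop above.

```python
# Hypothetical stand-ins for the condition_* and do_work functions.
def is_negative(x):
    return x < 0


def is_odd(x):
    return x % 2 == 1


def do_work(x):
    return x * 10


def process(x):
    """Flattened guards in a "nullable" function: returns None on rejection."""
    if is_negative(x):
        return None
    if is_odd(x):
        return None
    return do_work(x)


def process_until_negative(xs):
    """Flattened guards in a loop, using both break and continue."""
    results = []
    for x in xs:
        if is_negative(x):
            break  # Terminates the entire loop.
        if is_odd(x):
            continue  # Skips only this iteration.
        results.append(do_work(x))
    return results
```

In both functions every guard sits at the same indentation level, so adding or removing a condition never changes the indentation of the code around it.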

Why to flatten your conditionals

The flattened loop is much easier to read. There is only one level of indentation (or bracketing) to track. The fact that each of the conditional expressions is at the same indentation (bracketing) level makes it clear that we’re just dealing with a cascade of conditionals, all of which are handled the same. Realistically, one can only visually parse 3-4 levels of indentation; the Linux kernel coding style, for example, uses an 8-character indent and warns that code needing more than 3 levels of indentation should be fixed. Flattening your conditionals means you don’t have to deal with that very often.

When to flatten your conditionals

It’s perhaps preferable to write your early code with nested conditional statements. It may turn out that you need to do some work in one of the medial else: clauses (which we’ve elided here), which can make flattening the conditionals hard. But once you’re writing comments about the conditionals in the cascade, and preparing to share your code with others, it’s time to do away with more than a few layers of indentation.

What to do about the academic brain drain

The academy-to-industry brain drain is very real. What can we do about it?

Before I begin, let me confess my biases. I work in the research division of a large tech company (and I do not represent their views). Before that, I worked on grant-funded research in the academy. I work on speech and language technologies, and I’ll largely confine my comments to that area.

[Content warnings: organized labor, name-calling.]

Salary

Fact of the matter is, industry salaries are determined by a relatively-efficient labor market. Academy salaries are compressed, with a relatively firm ceiling for all but a handful of “rock star” faculty. The vast majority of technical faculty are paid substantially less than they’d make if they just took the very next industry offer that came around. It’s even worse for research professors who depend on grant-based “salary support” in a time of unprecedented “austerity”—they can find themselves functionally unemployed any time a pack of incurious morons seem to end up in the White House (as seems to happen every eight years or so).

The solution here is political. Fund the damn NIH and NSF. Double—no, triple—their funding. Pay for it by taxing corporations and the rich, or, better yet, divert some money from the Giant Death Machines fund. Make grant support contractual, so PIs with a five-year grant are guaranteed five years of salary support and a chance to realize their vision. Insist on transparency and consistency in “indirect costs” (i.e., overhead) for grants to drain the bureaucratic swamp (more on that below). Resist the casualization of labor at universities, and do so at every level. Unionize every employee at every American university. Aggressively lobby Democratic presidential candidates to appoint National Labor Relations Board members who will continue to recognize graduate students’ right to unionize.

Administration & bureaucracy

Industry has bureaucratic hurdles, of course, but they’re in no way comparable to the profound dysfunction taken for granted in the academic bureaucracy. If you or anyone you love has ever written a scientific grant, you know what I mean; if not, find a colleague who has and politely ask them to tell you their story. At the same time American universities are cutting their labor costs through casualization, they are massively increasing their administrative costs. You will not be surprised to find that this does not produce better scientific outcomes, or make it easier to submit a grant. This is a case of what Noam Chomsky has described as the “neoliberal confidence trick”. It goes a little something like this:

1. Appoint/anoint all-powerful administrators/bureaucrats, selecting for maximal incompetence.
2. Permit them to fail.
3. Either GOTO #1, or use this to justify cutting investment in whatever was being administered in the first place.

I do not see any way out of this situation except class consciousness and labor organizing. Academic researchers must start seeing the administration as potentially hostile to their interests, and refuse to identify with (or, quelle horreur, to join) the managerial classes.

Computing power & data

The big companies have more computers than universities. But in my area, speech and language technology, nearly everything worth doing can still be done with a commodity cluster (like you’d find in the average American CS department) or a powerful desktop with a big GPU. And of those, the majority can still be done on a cheap laptop. (Unless, of course, you’re one of those deep learning eliminationist true believers, in which case, reconsider.) Quite a bit of great speech & language research—in particular, work on machine translation—has come from collaborations between the Giant Death Machines funding agencies (like DARPA) and academics, with the former usually footing the bill for computing and data (usually bought from the Linguistic Data Consortium (LDC), itself essentially a collaboration between the military-industrial complex and the Ivy League). In speech recognition, there are hundreds of hours of transcribed speech in the public domain, and hundreds more can be obtained with an LDC contract paid for by your funders. In natural language processing, it is by now almost gauche for published research to make use of proprietary data, possibly excepting the venerable Penn Treebank.

I feel the data-and-computing issue is largely a myth. I do not know where it got started, though maybe it’s this bizarre press-release-masquerading-as-an-article (and note that’s actually about leaving one megacorp for another).

Talent & culture

Movements between academy & industry have historically been cyclic. World War II and the military-industrial-consumer boom that followed siphoned off a lot of academic talent. In speech & language technologies, the Bell breakup and the resulting fragmentation of Bell Labs pushed talent back to the academy in the 1980s and 1990s; the balance began to shift back to Silicon Valley about a decade ago.

There’s something to be said for “game knows game”—i.e., the talented want to work with the talented. And there’s a more general factor—large industrial organizations engage in careful “cultural design” to keep talent happy in ways that go beyond compensation and fringe benefits. (For instance, see Fergus Henderson’s description of engineering practices at Google.) But I think it’s important to understand this as a symptom of the problem, a lagging indicator, and as part of an unpredictable cycle, not as something to optimize for.

Closing thoughts

I’m a firm believer in “you do you”. But I do have one bit of specific advice for scientists in academia: don’t pay so much damn attention to Silicon Valley. Now, if you’re training students—and you’re doing it with the full knowledge that few of them will ever be able to work in the academy, as you should—you should educate yourself and your students to prepare for this reality. Set up a little industrial advisory board, coordinate interview training, talk with hiring managers, adopt industrial engineering practices. But do not let Silicon Valley dictate your research program. Do not let Silicon Valley tell you how many GPUs you need, or that you need GPUs at all. Do not believe the hype. Remember always that what works for a few-dozen crypto-feudo-fascisto-libertario-utopio-futurist billionaires from California may not work for you. Please, let the academy once again be a refuge from neoliberalism, capitalism, imperialism, and war. America has never needed you more than it does right now.

If you enjoyed this, you might enjoy my paper, with Richard Sproat, on an important NLP task that neural nets are really bad at.

Disfluency in children with ASD and SLI

Our new article on disfluency in children with autism spectrum disorders (ASD) or specific language impairment (SLI) is now out in PLOS ONE. (The team consisted of Heather MacFarlane—who also did most of the annotation and much of the writing—myself, and Rosemary Ingham, Alison Presmanes Hill, Katina Papadakis, Géza Kiss, and Jan van Santen.)

There is a long-standing clinical impression that children with ASD are in some ways more disfluent than typically developing children, something likely related to their general difficulties with the set of abilities known as pragmatic language. We found that the few prior attempts to quantify this impression were difficult to interpret, and in some cases, put forth contradictory findings. One limitation that we observed in the prior work (other than poor controls and small samples, which one more or less expects in this area) is the lack of a well-thought-out schema for talking about different kinds of disfluency. Specialists in disfluency have largely operated “under the hypothesis that different types of disfluency manifest from different types of processing breakdowns”, so it is valuable to have a taxonomy of the types of disfluency so as to know what to count. Thus one of our goals in the paper is to adapt—to simplify, really—the schema used by Elizabeth Shriberg (in her 1995 UC Berkeley dissertation) and show that semi-skilled transcribers can achieve high rates of interannotator agreement using our schema. (We also show that much of the annotation can be automated, if one so chooses, and provide code for that.) Of course, we are even more interested in what we can learn about pragmatic language in children with ASD from our efforts at quantifying disfluency.

In a sample of 110 children with ASD, SLI, or typical development, we found two robust results. First, we found that children with ASD produced a higher ratio of content mazes (repetitions, revisions, and false starts) to fillers (e.g., uh, um) compared to their typically developing peers. Secondly, we found that children with ASD produced lower ratios of cued mazes—that is, content mazes that contain a filler—than their typically developing peers. We also found a suggestive result in a follow-up exploratory analysis: the use of cued mazes is positively correlated with chronological age in typically developing children (but not in children with ASD or SLI), which at least hints at a maturational account.

If you have anything to add, please feel free to leave post-publication comments at the PLOS ONE website.

Classifying paraphasias with NLP

I’m excited about our new article in the American Journal of Speech-Language Pathology (with Gerasimos Fergadiotis and Steven Bedrick) on automatic classification of paraphasias using basic natural language processing techniques.

Paraphasias are speech errors associated with aphasia. Roughly speaking, these errors may be phonologically similar to the target (dog for the target LOG) or dissimilar. They may also be semantically similar to the target (dog for the target CAT), or both phonologically and semantically similar (rat for the target CAT). They may be neologisms (tat for the target CAT). Finally, some paraphasias may be real words but neither phonologically nor semantically similar. The relative frequencies of these types of errors differ between people with aphasia. These can be measured in a confrontation naming task and, with complex and time-consuming manual error classification, used to create individualized profiles for treatment.

In the paper, we take archival data from a confrontation naming task and attempt to automate the classification of paraphasias. To quantify phonological similarity, we automate a series of baroque rules. To quantify semantic similarity, we use a computational model of semantic similarity (namely cosine similarity with word2vec embeddings). And, to identify neologisms, we use frequency in the SUBTLEX-US corpus. The results suggest that test scoring can in fact be automated with performance close to that of human annotators. With advances in speech recognition, it may soon be possible to develop a fully-automated computer-adaptive confrontation naming task!
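To give a rough flavor of the semantic-similarity and neologism checks, here is a toy sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors stand in for word2vec embeddings, the small frequency table stands in for SUBTLEX-US, and the 0.5 threshold is arbitrary. It is not the pipeline from the paper, and it omits the phonological rules entirely.

```python
import math

# Made-up 3-dimensional vectors standing in for word2vec embeddings.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "log": [0.0, 0.1, 0.9],
}

# Made-up counts standing in for SUBTLEX-US frequencies.
FREQUENCIES = {"cat": 100, "dog": 120, "log": 30}

# Arbitrary cutoff chosen for this toy example only.
SEMANTIC_THRESHOLD = 0.5


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_neologism(response):
    """Treats a response that is absent from the frequency table as a neologism."""
    return FREQUENCIES.get(response, 0) == 0


def is_semantically_similar(response, target):
    """Compares the two words' vectors against the toy threshold."""
    similarity = cosine_similarity(EMBEDDINGS[response], EMBEDDINGS[target])
    return similarity >= SEMANTIC_THRESHOLD
```

With these toy values, dog counts as semantically similar to the target CAT, log does not, and tat is flagged as a neologism because it has no frequency entry.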

Evaluating machine translation quality with BLEU

I wrote this quite a while ago, but here’s my handout on BLEU, a metric used to evaluate machine translation systems. Everything here is still just as applicable in the era of neural machine translation.

New Pynini tutorial

Pynini is my weighted finite-state transducer/grammar compilation library for Python, and O’Reilly Media recently published a short introductory tutorial on Pynini, cowritten with my colleague Richard Sproat.

Using the P2FA/FAVE-align SCOTUS acoustic models in Prosodylab-Aligner

Chris Landreth writes in with a tip on how to use the SCOTUS Corpus acoustic model (the one used in P2FA and FAVE-align) from within Prosodylab-Aligner. This is as simple as downloading the data, modifying the YAML configuration file, and placing the model data in the right place. Here is the 16k model.

To use it, simply download into your working directory and then execute something like the following:

python3 -m aligner -r eng-SCOTUS-16k.zip -a yrdata -d eng.dict

Please let me know if you have any problems with that.

Libfix report for May 2016

Two bits of creative morphology I’ve been seeing around the city:

Lime-a-rita: This trademark (of Anheuser-Busch InBev) isn’t just a redundant way to refer to a margarita (which has a lime base—a non-lime “margarita” is a barbarism), but rather a “light American lager” blended with additional lime-y-ness. I have to imagine this coinage, albeit rather corporate, was helped along by the existence of the truncation ‘rita, occasionally used in casual conversation by their most committed devotees.
-otto: I first became aware of this through pastotto, the suggested name for a dish of pasta (perhaps penne), fried in olive oil and butter and then cooked in stock, like risotto; according to popularizer Mark Bittman, this is an old trick. Now, that one looks a bit blend-y, given that the ris- part of risotto is really a reference to arborio rice, and that the final -a in the base pasta appears to be lost in the combination. But not so much for barleyotto, which satisfies even the most stringent criteria for libfix-hood.

Latent semantic analysis lecture

Here is an IPython notebook from a recent lecture I gave on Latent Semantic Analysis (LSA) in my natural language processing class (CS 562/662).