The academy-to-industry brain drain is very real. What can we do about it?

Before I begin, let me confess my biases. I work in the research division of a large tech company (and I do not represent their views). Before that, I worked on grant-funded research in the academy. I work on speech and language technologies, and I’ll largely confine my comments to that area.

[Content warnings: organized labor, name-calling.]

# Salary

Fact of the matter is, industry salaries are determined by a relatively-efficient labor market. Academy salaries are compressed, with a relatively firm ceiling for all but a handful of “rock star” faculty. The vast majority of technical faculty are paid substantially less than they’d make if they just took the very next industry offer that came around. It’s even worse for research professors who depend on grant-based “salary support” in a time of unprecedented “austerity”—they can find themselves functionally unemployed any time a pack of incurious morons seem to end up in the White House (as seems to happen every eight years or so).

The solution here is political. Fund the damn NIH and NSF. Double, no, triple their funding. Pay for it by taxing corporations and the rich, or, better yet, divert some money from the Giant Death Machines fund. Make grant support contractual, so PIs with a five-year grant are guaranteed five years of salary support and a chance to realize their vision. Insist on transparency and consistency in “indirect costs” (i.e., overhead) for grants to drain the bureaucratic swamp (more on that below). Resist the casualization of labor at universities, and do so at every level. Unionize every employee at every American university. Aggressively lobby Democratic presidential candidates to agree to appoint National Labor Relations Board members who will continue to recognize graduate students’ right to unionize.

Industry has bureaucratic hurdles, of course, but they’re in no way comparable to the profound dysfunction taken for granted in the academic bureaucracy. If you or anyone you love has ever written a scientific grant, you know what I mean; if not, find a colleague who has and politely ask them to tell you their story. At the same time American universities are cutting their labor costs through casualization, they are massively increasing their administrative costs. You will not be surprised to find that this does not produce better scientific outcomes, or make it easier to submit a grant. This is a case of what Noam Chomsky has described as the “neoliberal confidence trick”. It goes a little something like this:

1. Appoint/anoint all-powerful administrators/bureaucrats, selecting for maximal incompetence.
2. Permit them to fail.
3. Either GOTO #1, or use this to justify cutting investment in whatever was being administered in the first place.

I do not see any way out of this situation except class consciousness and labor organizing. Academic researchers must start seeing the administration as potentially hostile to their interests, and refuse to identify with (or, quelle horreur, to join) the managerial classes.

# Computing power & data

The big companies have more computers than universities. But in my area, speech and language technology, nearly everything worth doing can still be done with a commodity cluster (like you’d find in the average American CS department) or a powerful desktop with a big GPU. And of those, the majority can still be done on a cheap laptop. (Unless, of course, you’re one of those deep learning eliminationist true believers, in which case, reconsider.) Quite a bit of great speech & language research, in particular work on machine translation, has come from collaborations between the Giant Death Machines funding agencies (like DARPA) and academics, with the former usually footing the bill for computing and data (usually bought from the Linguistic Data Consortium (LDC), itself essentially a collaboration between the military-industrial complex and the Ivy League). In speech recognition, there are hundreds of hours of transcribed speech in the public domain, and hundreds more can be obtained with an LDC contract paid for by your funders. In natural language processing, it is by now almost gauche for published research to make use of proprietary data, possibly excepting the venerable Penn Treebank.

I feel the data-and-computing issue is largely a myth. I do not know where it got started, though maybe it’s this bizarre press-release-masquerading-as-an-article (and note that’s actually about leaving one megacorp for another).

# Talent & culture

Movements between academy & industry have historically been cyclic. World War II and the military-industrial-consumer boom that followed siphoned off a lot of academic talent. In speech & language technologies, the Bell breakup and the resulting fragmentation of Bell Labs pushed talent back to the academy in the 1980s and 1990s; the balance began to shift back to Silicon Valley about a decade ago.

There’s something to be said for “game knows game”—i.e., the talented want to work with the talented. And there’s a more general factor—large industrial organizations engage in careful “cultural design” to keep talent happy in ways that go beyond compensation and fringe benefits. (For instance, see Fergus Henderson’s description of engineering practices at Google.) But I think it’s important to understand this as a symptom of the problem, a lagging indicator, and as part of an unpredictable cycle, not as something to optimize for.

# Closing thoughts

I’m a firm believer in “you do you”. But I do have one bit of specific advice for scientists in academia: don’t pay so much damn attention to Silicon Valley. Now, if you’re training students—and you’re doing it with the full knowledge that few of them will ever be able to work in the academy, as you should—you should educate yourself and your students to prepare for this reality. Set up a little industrial advisory board, coordinate interview training, talk with hiring managers, adopt industrial engineering practices. But, do not let Silicon Valley dictate your research program. Do not let Silicon Valley tell you how many GPUs you need, or that you need GPUs at all. Do not believe the hype. Remember always that what works for a few-dozen crypto-feudo-fascisto-libertario-utopio-futurist billionaires from California may not work for you. Please, let the academy once again be a refuge from neoliberalism, capitalism, imperialism, and war. America has never needed you more than we do right now.

If you enjoyed this, you might enjoy my paper, with Richard Sproat, on an important NLP task that neural nets are really bad at.

## Latent semantic analysis lecture

Here is an IPython notebook from a recent lecture I gave on Latent Semantic Analysis (LSA) in my natural language processing class (CS 562/662).

# Understanding text encoding in Python 2 and Python 3

Computers were rather late to the word processing game. The founding mothers and fathers of computing were primarily interested in numbers. This is fortunate: after all, computers only know about numbers. But as Brian Kunde explains in his brief history of word processing, word processing existed long before digital computing, and text processing has always been something of an afterthought.

Humans think of text as consisting of an ordered sequence of “characters” (an ill-defined Justice-Stewart-type concept which I won’t attempt to clarify here). To manipulate text in digital computers, we have to have a mapping between the character set (a finite list of the characters the system recognizes) and numbers. Encoding is the process of converting characters to numbers, and decoding is (naturally) the process of converting numbers to characters. Before we get to Python, a bit of history.

# ASCII and Unicode

There are only a few character sets that have any relevance to life in 2014. The first is ASCII (American Standard Code for Information Interchange), which was first published in 1963. This character set consists of 128 characters intended for use by an English audience. Of these, 95 are printable, meaning that they correspond to lay-human notions about characters. On a US keyboard, these are (approximately) the alphanumeric and punctuation characters that can be typed with a single keystroke, or with a single keystroke while holding down the Shift key, plus the space. The remaining 33 are non-printable “control characters”, a group which includes tab, the two newline characters (which you get when you type return), and a few apocrypha. For instance, the first character in the ASCII table is the “null byte”. This is indicated by a '\0' in C and other languages, but there’s no standard way to render it. Many control characters were designed for earlier, more innocent times; for instance, character #7 ('\a') tells the receiving device to ring a cute little bell (these were apparently attached to teletype terminals); today your computer might make a beep, or the terminal window might flicker once, but either way, nothing is printed.
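These counts are easy to check. The following is a small sketch in Python 3; it leans on str.isprintable, which follows Unicode's notion of printability rather than the ASCII standard itself, though the two agree on the first 128 code points.

```python
# Split the 128 ASCII code points into printable and control characters.
printable = [chr(i) for i in range(128) if chr(i).isprintable()]
control = [chr(i) for i in range(128) if not chr(i).isprintable()]

print(len(printable), len(control))

# The null byte and the bell are the characters numbered 0 and 7.
print(ord("\0"), ord("\a"))
```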

Of course, this is completely inadequate for anything but English (not to mention those users of superfluous diaeresis…e.g., the editors of the New Yorker, Motörhead). However, each ASCII character takes up only 7 bits, leaving room for another 128 characters (since a byte has an integer value between 0 and 255, inclusive), and so engineers could exploit the remaining 128 characters to write the characters from different alphabets, alphasyllabaries, or syllabaries. Of these ASCII-based character sets, the best-known are ISO/IEC 8859-1, also known as Latin-1, and Windows-1252, also known as CP-1252. Unfortunately, this created more problems than it solved. That last bit just didn’t leave enough space for the many languages which need a larger character set (Japanese kanji being an obvious example). And even when there are technically enough code points left over, engineers working in different languages didn’t see eye-to-eye about what to do with them. As a result, the state of affairs made it impossible to, for example, write in French (ISO/IEC 8859-1) about Ukrainian (ISO/IEC 8859-5, at least before the 1990 orthography reform).
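To make the conflict concrete, here is a minimal Python 3 sketch showing one and the same byte decoded under two of these character sets. The codec names are the aliases Python's standard library uses for ISO/IEC 8859-1 and 8859-5; the byte value is just an illustrative choice.

```python
raw = b"\xe9"  # a single byte with the eighth bit set

as_latin1 = raw.decode("iso-8859-1")    # LATIN SMALL LETTER E WITH ACUTE (é)
as_cyrillic = raw.decode("iso-8859-5")  # CYRILLIC SMALL LETTER SHCHA (щ)

# ascii() shows the code points without depending on the terminal's encoding.
print(ascii(as_latin1), ascii(as_cyrillic))

# Bytes in the 7-bit ASCII range mean the same thing in both character sets.
same_in_both = b"abc".decode("iso-8859-1") == b"abc".decode("iso-8859-5")
print(same_in_both)
```

Nothing in the byte itself says which reading was intended, which is why a single document could not mix the two.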

Clearly, fighting over scraps isn’t going to cut it in the global village. Enter the Unicode standard and its Universal Character Set (UCS), first published in 1991. Unicode is the platonic ideal of a character encoding, abstracting away from the need to efficiently convert all characters to numbers. Each character is represented by a single code with various metadata (e.g., A is an “Uppercase Letter” from the “Latin” script). ASCII and its extensions map onto a small subset of this code.

Fortunately, not all encodings are merely shadows on the walls of a cave. The One True Encoding is UTF-8, which implements the entire UCS using an 8-bit code. There are other encodings, of course, but this one is ours, and I am not alone in feeling strongly that UTF-8 is the chosen encoding. At the risk of getting too far afield, here are two arguments for why you and everyone you know should just use UTF-8. First off, it hardly matters which UCS-compatible encoding we all use (the differences between them are largely arbitrary), but what does matter is that we all choose the same one. There is no general procedure for “sniffing” out the encoding of a file, and there’s nothing preventing you from coming up with a file that’s a French cookbook in one encoding, and a top-secret message in another. This is good for steganographers, but bad for the rest of us, since so many text files lack encoding metadata. When it comes to encodings, there’s no question that UTF-8 is the most popular Unicode encoding scheme worldwide, and is on its way to becoming the de-facto standard. Secondly, ASCII is valid UTF-8, because UTF-8 and ASCII encode the ASCII characters in exactly the same way. What this means, practically speaking, is you can achieve nearly complete coverage of the world’s languages simply by assuming that all the inputs to your software are UTF-8. This is a big, big win for us all.
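The second argument can be checked directly. This is a short Python 3 sketch with made-up sample strings: it confirms that pure-ASCII text yields identical bytes under both encodings, and shows that UTF-8 spends more bytes only on characters outside the ASCII range.

```python
text = "plain ASCII text"

# ASCII text produces exactly the same bytes whether encoded as ASCII or UTF-8.
ascii_is_utf8 = text.encode("ascii") == text.encode("utf-8")
print(ascii_is_utf8)

# UTF-8 is a variable-width encoding: one to four bytes per character.
samples = ["a", "\u00f1", "\u20ac", "\U0001f600"]  # a, ñ, €, and an emoji
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)
```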

# Decode early, encode late

A general rule of thumb for developers is “decode early” (convert inputs to their Unicode representation), “encode late” (convert back to bytestrings). The reason for this is that in nearly any programming language, Unicode strings behave the way our monkey brains expect them to, but bytestrings do not. To see why, try iterating over a non-ASCII bytestring in Python 2 (more on the syntax later).

>>> for byte in b"año":
...     print(byte)
...
a
?
?
o

There are two surprising things here: iterating over the bytestring returned more bytes than there are “characters” (goodbye, indexing), and furthermore the 2nd “character” failed to render properly. This is what happens when you let computers dictate the semantics to our monkey brains, rather than the other way around. Here’s what happens when we try the same with a Unicode string:

>>> for char in u"año":
...     print(char)
...
a
ñ
o
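The same contrast can be reproduced in Python 3, whose string model is described in the next section. This is a minimal sketch; there, iterating over a bytes object yields integers rather than one-byte strings, which makes the mismatch between bytes and characters even more visible.

```python
word = "a\u00f1o"              # "año", written with an escape to pin down the code point
encoded = word.encode("utf-8")

# Three characters, but four bytes: ñ takes two bytes in UTF-8.
print(len(word), len(encoded))

# Iterating over bytes gives integers, one per byte.
byte_values = list(encoded)
print(byte_values)

# Decoding recovers the three characters our monkey brains expect.
chars = list(encoded.decode("utf-8"))
print([ascii(ch) for ch in chars])
```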

# The Python 2 & 3 string models

Before you put this all into practice, it is important to note that Python 2 and Python 3 use very different string models. The familiar Python 2 str class is a bytestring. To convert it to a Unicode string, use the str.decode instance method, which returns a copy of the string as an instance of the unicode class. Similarly, you can make a str copy of a unicode instance with unicode.encode. Both of these functions take a single argument: a string (either kind!) representing the encoding.

Python 2 provides specific syntax for Unicode string literals (which you saw above): a lower-case u prefix before the initial quotation mark (as in u"año").

When it comes to Unicode-awareness, Python 3 has totally flipped the script; in my opinion, it’s for the best. Instances of str are now Unicode strings (the u"" syntax still works, but is vacuous). The (reduced) functionality of the old-style strings is now just available for instances of the class bytes. As you might expect, you can create a bytes instance by using the encode method of a new-style str. Python 3 decodes bytestrings as soon as they are created, and (re)encodes Unicode strings only at the interfaces; in other words, it gets the “early/late” stuff right by default. Your APIs probably won’t need to change much, because Python 3 treats UTF-8 (and thus ASCII) as the default encoding, and this assumption is valid more often than not.

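If, for some reason, you want a bytestring literal, Python has syntax for that, too: prefix the quotation marks delimiting the string with a lower-case b (as in b"año"; see above also). Note that Python 3 only permits ASCII characters inside a bytes literal, so there the non-ASCII bytes must be written as escapes (as in b"a\xc3\xb1o").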
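Putting the rule of thumb together with the Python 3 string model, a “decode early, encode late” pipeline might look like the sketch below. The shout function and the in-memory io.BytesIO buffer are hypothetical stand-ins for real processing and for a real file or socket.

```python
import io


def shout(text: str) -> str:
    """Hypothetical processing step; it only ever sees Unicode strings."""
    return text.upper()


# Stand-in for bytes arriving from a file or socket.
raw_input = io.BytesIO("ma\u00f1ana\n".encode("utf-8"))

text = raw_input.read().decode("utf-8")  # decode early, at the input boundary
result = shout(text)                     # all real work happens on str
raw_output = result.encode("utf-8")      # encode late, at the output boundary

print(ascii(result), raw_output)
```

In practice, opening a file in text mode with an explicit encoding argument does the decoding and encoding at the boundary for you.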

# tl;dr

Strings are ordered sequences of characters. But computers only know about numbers, so they are encoded as byte arrays; there are many ways to do this, but UTF-8 is the One True Encoding. To get the strings to have the semantics you expect as a human, decode a string to Unicode as early as possible, and encode it as bytes as late as possible. You have to do this explicitly in Python 2; it happens automatically in Python 3.

For more of the historical angle, see Joel Spolsky’s epic essay The absolute minimum every software developer absolutely, positively must know about Unicode and character sets (no excuses!).

# Simpler sentence boundary detection

Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank:

Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.

This sentence contains 4 periods, but only the last denotes a sentence boundary. It’s obvious that the first one in U.S. is unambiguously part of an acronym, not a sentence boundary, and the same is true of expressions like $12.53. But the periods at the end of Inc. and U.S. could easily have been on the left edge of a sentence boundary; it just turns out they’re not. Humans can use local context to determine that neither of these are likely to be sentence boundaries; for example, the verb expect selects two arguments (an object its U.S. sales and the infinitival clause to remain steady…), neither of which would be satisfied if U.S. was sentence-final. Similarly, not all question marks or exclamation points are sentence-final (stricto sensu):

He says the big questions–“Do you really need this much money to put up these investments? Have you told investors what is happening in your sector? What about your track record?”–aren’t asked of companies coming to market.

Much of the available data for natural language processing experiments, including the enormous Gigaword corpus, does not include annotations for sentence boundaries. In Gigaword, for example, paragraphs and articles are annotated, but paragraphs may contain internal sentence boundaries, which are not indicated in any way. In natural language processing (NLP), the task of recovering these boundaries is known as sentence boundary detection (SBD). [1] SBD is one of the earliest steps in many NLP pipelines, and since errors at this step are very likely to propagate, it is particularly important to just Get It Right.

An important component of this problem is the detection of abbreviations and acronyms, since a period ending an abbreviation is generally not a sentence boundary. But some abbreviations and acronyms do sometimes occur in sentence-final position (for instance, in the Wall St. Journal portion of the Penn Treebank, there are 99 sentence-final instances of U.S.). In this context, English writers generally omit one period, a sort of orthographic haplology.

NLTK provides an implementation of Punkt (Kiss & Strunk 2006), an unsupervised sentence boundary detection system; perhaps because it is easily available, it has been widely used. Unfortunately, Punkt is simply not very accurate compared to other systems currently available. Promising early work by Riley (1989) suggested a different way: a supervised classifier (in Riley’s case, a decision tree). Gillick (2009) achieved the best published numbers on the “standard split” for this task using another classifier, namely a support vector machine (SVM) with a linear kernel; Gillick’s features are derived from the words to the left and right of a period. Gillick’s code has been made available under the name Splitta.

I recently attempted to construct my own SBD system, loosely inspired by Splitta, but expanding the system to handle ellipses (...), question marks, exclamation points, and other sentence-final punctuation marks. Since Gillick found no benefits from tweaking the hyperparameters of the SVM, I used a hyperparameter-free classifier, the averaged perceptron (Freund & Schapire 1999). After performing a stepwise feature ablation, I settled on a relatively small set of features, extracted as follows. Candidate boundaries are identified using the following nasty regular expression:

/(\S+)\s*((\.+)|([!?]))(['")}\]]*)(\s+)\s*(\S+)/

The first group matches the left token L, and the last group matches the right token R. If the L or R tokens match a regular expression for American English numbers (including prices, decimals, negatives, etc.), they are merged into a special token *NUMBER* (per Kiss & Strunk 2006); a similar approach is used to convert various types of quotation marks into *QUOTE*. The following features were then extracted:

• the identity of the punctuation mark
• identity of L and R (Reynar & Ratnaparkhi 1997, etc.)
• the joint identity of both L and R (Gillick 2009)
• does L contain a vowel? (Mikheev 2002)
• does L contain a period? (Grefenstette 1999)
• length of L (Riley 1989)
• case of L and R (Riley 1989)
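The candidate search and the feature list above can be sketched in Python 3 roughly as follows. This is an illustration and not the actual DetectorMorse code: the NUMBER pattern is a simplified stand-in for the American English number expression, *QUOTE* merging is omitted, and the feature names, the case_of categories, and the choice to compute surface features on the raw tokens are all assumptions.

```python
import re

# The candidate expression from above, with its escapes restored.
CANDIDATE = re.compile(r"""(\S+)\s*((\.+)|([!?]))(['")}\]]*)(\s+)\s*(\S+)""")
# Hypothetical, simplified stand-in for the number expression.
NUMBER = re.compile(r"[-+]?\$?\d[\d,]*(?:\.\d+)?")


def candidates(text):
    """Yields candidate matches, resuming at R so it can be the next L."""
    pos = 0
    while True:
        match = CANDIDATE.search(text, pos)
        if match is None:
            return
        yield match
        pos = match.start(7)


def normalize(token):
    return "*NUMBER*" if NUMBER.fullmatch(token) else token


def case_of(token):
    if token.isupper():
        return "upper"
    if token.islower():
        return "lower"
    if token.istitle():
        return "title"
    return "other"


def extract_features(match):
    left, punct, right = match.group(1), match.group(2), match.group(7)
    return {
        "punct": punct,
        "L": normalize(left),
        "R": normalize(right),
        "L+R": normalize(left) + " " + normalize(right),
        "L_has_vowel": bool(re.search(r"[aeiou]", left, re.IGNORECASE)),
        "L_has_period": "." in left,
        "L_length": len(left),
        "L_case": case_of(left),
        "R_case": case_of(right),
    }


sentence = ("Rolls-Royce Motor Cars Inc. said it expects its U.S. sales "
            "to remain steady at about 1,200 cars in 1990.")
features = [extract_features(m) for m in candidates(sentence)]
for row in features:
    print(row["L"], row["punct"], row["R"], row["L_has_period"])
```

Note that the expression requires whitespace after the punctuation mark, so neither the period inside a price nor the final period of the input (with nothing after it) is proposed as a candidate.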

This 8-feature system performed exceptionally well on the “standard split”, with an accuracy of .9955, an F-score of .9971, and just 46 errors in all. This is very comparable with the results I obtained with a fork of Splitta extended to handle ellipses, question marks, etc.; this forked system produced 55 errors.
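For readers who have not met the classifier used here, the following is a minimal Python 3 sketch of a binary averaged perceptron over sparse string features. It is not the DetectorMorse implementation: the class, its method names, the naive averaging strategy, and the four toy training examples are all invented for illustration.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Binary averaged perceptron over sets of string-valued features."""

    def __init__(self):
        self.weights = defaultdict(float)  # current weights
        self.totals = defaultdict(float)   # sum of the weights over all steps
        self.steps = 0

    def score(self, features, weights=None):
        weights = self.weights if weights is None else weights
        return sum(weights.get(f, 0.0) for f in features)

    def update(self, features, label):
        """One online step: fix the weights on a mistake, then accumulate."""
        predicted = self.score(features) > 0
        if predicted != label:
            delta = 1.0 if label else -1.0
            for f in features:
                self.weights[f] += delta
        # Naive averaging: add the whole weight vector to the running sum.
        for f, w in self.weights.items():
            self.totals[f] += w
        self.steps += 1

    def averaged(self):
        return {f: total / self.steps for f, total in self.totals.items()}


# Toy data (invented): True means the candidate is a sentence boundary.
data = [
    ({"L=Inc", "R_case=lower"}, False),
    ({"L=1990", "R_case=title"}, True),
    ({"L=U.S", "R_case=lower"}, False),
    ({"L=cars", "R_case=title"}, True),
]

model = AveragedPerceptron()
for _ in range(5):  # a few passes over the toy data
    for feats, label in data:
        model.update(feats, label)

final_weights = model.averaged()
predictions = [model.score(feats, final_weights) > 0 for feats, _ in data]
print(predictions)
```

Averaging the weights over every step, rather than keeping only the final ones, is what makes the model less sensitive to the order of the last few training examples, and it introduces no hyperparameters beyond the number of passes.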

I have made my system freely available as a Python 3 module (and command-line tool) under the name DetectorMorse. Both code and dependencies are pure Python, so it can be run using pypy3, if you’re in a hurry.

# Endnotes

[1] Or, sometimes, sentence boundary disambiguation, sentence segmentation, sentence splitting, sentence tokenization, etc.

# References

Y. Freund & R.E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning 37(3): 277-296.
D. Gillick. 2009. Sentence boundary detection and the problem with the U.S. In Proc. NAACL-HLT, pages 241-244.
G. Grefenstette. 1999. Tokenization. In H. van Halteren (ed.), Syntactic wordclass tagging, pages 117-133. Dordrecht: Kluwer.
T. Kiss & J. Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4): 485-525.
A. Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28(3): 289-318.
J.C. Reynar & A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proc. 5th Conference on Applied Natural Language Processing, pages 16-19.
M.D. Riley. 1989. Some applications of tree-based modelling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop, pages 339-352.