Using a fixed training-development-test split in sklearn

The scikit-learn machine learning library has good support for various forms of model selection and hyperparameter tuning. For setting regularization hyperparameters, there are model-specific cross-validation tools, and there are also tools for both grid (i.e., exhaustive) hyperparameter tuning with sklearn.model_selection.GridSearchCV and random hyperparameter tuning (in the sense of Bergstra & Bengio 2012) with sklearn.model_selection.RandomizedSearchCV. While you could probably implement these yourself, the sklearn developers have enabled just about every feature you could want, including multiprocessing support.

One apparent limitation of these classes is that, as their names suggest, they are designed for use in a cross-validation setting. In speech & language technology, however, standard practice is to use a fixed partition of the data into training, development (i.e., validation), and test (i.e., evaluation) sets, and to select hyperparameters which maximize performance on the development set. This is in part an artifact of the limited computing resources of the Penn Treebank era, and I’ve long suspected it has serious repercussions for model evaluation. But tuning and evaluating with a standard split is faster than cross-validation and can make exact replication much easier. And there are also some concerns about whether cross-validation is the best way to set hyperparameters anyways. So what can we do?

The GridSearchCV and RandomizedSearchCV classes take an optional cv keyword argument, which can be, among other things, an object implementing the cross-validation iterator interface. At first I thought I would create an object which allowed me to use a fixed development set for hyperparameter tuning, but then I realized that I could do this with one of the existing iterator classes, namely one called sklearn.model_selection.PredefinedSplit. The constructor for this class takes a single argument test_fold, an array of integers with one entry per sample in the data passed to the fitting method. As the documentation explains “…when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.” That we can do. Suppose that we have training data x_train and y_train and development data x_dev and y_dev laid out as NumPy arrays. We then create a training-and-development set like so:

x = numpy.concatenate([x_train, x_dev])
y = numpy.concatenate([y_train, y_dev])

Then, we create the iterator object:

test_fold = numpy.concatenate([
    # The training data: -1 means "never used for evaluation".
    numpy.full(x_train.shape[0], -1, dtype=numpy.int8),
    # The development data: fold 0, the only evaluation fold.
    numpy.zeros(x_dev.shape[0], dtype=numpy.int8)
])
cv = sklearn.model_selection.PredefinedSplit(test_fold)

Finally, we provide cv as a keyword argument to the grid or random search constructor, and then train. For instance, similar to this example we might do something like:

base = sklearn.ensemble.RandomForestClassifier()
grid = {"bootstrap": [True, False], 
        "max_features": [1, 3, 5, 7, 9, 10]}
model = sklearn.model_selection.GridSearchCV(base, grid, cv=cv)
model.fit(x, y)

Now just add n_jobs=-1 to the constructor for model to spread the work across all your logical cores.
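Putting the pieces above together, here is a minimal end-to-end sketch. The synthetic data from make_classification, the 150/50 split, and the smaller grid are assumptions made purely so the snippet runs on its own; substitute your own arrays.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical toy data standing in for a real training/development split.
x_all, y_all = make_classification(n_samples=200, n_features=10, random_state=0)
x_train, y_train = x_all[:150], y_all[:150]
x_dev, y_dev = x_all[150:], y_all[150:]

# Stack training and development data into one array.
x = numpy.concatenate([x_train, x_dev])
y = numpy.concatenate([y_train, y_dev])

# -1 marks samples that are only ever trained on; 0 marks the single
# evaluation fold (the development set).
test_fold = numpy.concatenate([
    numpy.full(x_train.shape[0], -1, dtype=numpy.int8),
    numpy.zeros(x_dev.shape[0], dtype=numpy.int8),
])
cv = PredefinedSplit(test_fold)

# Inspect the one and only split the iterator produces.
splits = list(cv.split())
train_idx, dev_idx = splits[0]

base = RandomForestClassifier(n_estimators=20, random_state=0)
grid = {"bootstrap": [True, False], "max_features": [1, 3, 5]}
model = GridSearchCV(base, grid, cv=cv)
model.fit(x, y)
print(model.best_params_)
```

One thing to keep in mind: with the default refit=True, GridSearchCV retrains the best configuration on everything passed to fit, which here means training plus development data. Pass refit=False if you would rather fit the final model yourself on the training data alone.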

References

Bergstra, J., and Bengio, Y. 2012. Random search for hyperparameter optimization. Journal of Machine Learning Research 13: 281-305.

arXiv vs. LingBuzz

In the natural language processing community, there has been a bit of a kerfuffle about the ACL preprint policy, which essentially prevents you from submitting a manuscript to preprint aggregation websites like arXiv when the m.s. is also under review for a conference. I personally think this is a good policy: double blind review is really important for fairness. This led me to reflect a bit on the outsized role that arXiv plays in natural language processing research. It is interesting to contrast arXiv with LingBuzz, a preprint aggregator for formal linguistics research.1 arXiv is visually ugly and cluttered, expensive (it somehow takes over $800,000 of the Simons Foundation’s money to run it every year), and submissions are subject to detailed, strict, carefully enforced editorial guidelines. In contrast, LingBuzz has a minimalistic text interface, is run and operated by a single professor (Michal Starke at the University of Tromsø), and the editorial guidelines are simple (they fit on a single page) and laxly enforced (mostly after the fact). Despite the laissez-faire attitude at LingBuzz, it has seen some rather contentious debates involving the usual trollish suspects (Postal, Everett, Behme, etc.) but it managed to keep things under control. But what I really love about LingBuzz is that unlike arXiv, no linguist is under the impression that it is any sort of substitute for peer review, or that authors need to know about (and cite) late-breaking work only available on LingBuzz. I think NLP researchers should take a hint from this and stop pretending arXiv is a reasonable alternative to peer review.

Endnotes

1. There are a few other such repositories. The Rutgers Optimality Archive (ROA) was once a popular repository for pre-prints of Optimality Theory work, but its contents are re-syndicated on LingBuzz and Optimality Theory is largely dead anyways. There is also the Semantics Archive.

Text encoding issues in Universal Dependencies

Do you know why the following comparison (in Python 3.7) fails?

>>> s1 = "ड़"
>>> s2 = "ड़"
>>> s1 == s2
False

I’ll give you a hint:

>>> len(s1)
1
>>> len(s2)
2

Despite the two strings rendering identically, they are encoded differently. The string s1 is a single-codepoint sequence, whereas s2 contains two codepoints. Thus string comparison fails, whether it’s done at the level of bytes or of Unicode codepoints.

Some NLP researchers are aware of issues arising from faulty string encoding. Eckhart de Castilho (2016), for example, describes a tool which automatically identifies misencoded pre-trained data, whereas Wu & Yarowsky (2018) report issues using an existing tool for transliteration on certain languages because of encoding issues. However, I suspect that far fewer NLP researchers are familiar with the aforementioned problem, which is specific to Unicode normalization. To put it simply, Unicode defines four normalization forms (and associated conversion algorithms) for strings, and the key distinction is between “composed” and “decomposed” forms of characters (using that term in a pretheoretic sense). The string s1 is composed into a single Unicode codepoint; s2 is decomposed into two.
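To see the effect of normalization on the two strings above, here is a small sketch using Python's standard unicodedata module. The strings are written with explicit escapes so the example does not depend on how your editor or browser encodes the characters; the variable names are just for illustration.

```python
import unicodedata

s1 = "\u095c"        # Single composed codepoint.
s2 = "\u0921\u093c"  # Base consonant followed by the combining nuqta.

# Print the official character names for each codepoint.
for ch in s1 + s2:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The raw strings differ...
print(s1 == s2)

# ...but they compare equal once both are put into the same normalization form.
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)
    print(form, results[form])
```

The point is only that both sides of a comparison must go through the same form; which of the four you pick matters much less.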

Unfortunately, three columns of the Hindi Dependency Treebank (hi_hdtb, commit 54c4c0f; Bhat et al. 2017, Palmer et al. 2009) have a chaotic mix of composed and decomposed representations. It seems most if not all of these have to do with the encoding of the six nuqta (‘dot’) consonants, which are usually found in borrowings from Arabic or Persian (via Urdu, presumably). In Devanagari these consonants are written by adding a dot to a phonetically similar native consonant; for instance ड [ɖə] plus the nuqta produces ड़ [ɽə]. As is usually the case in Unicode, there is more than one way to do it: you can either encode ड़ with a composed character (U+095C DEVANAGARI LETTER DDDHA) or with the native Devanagari character (U+0921 DEVANAGARI LETTER DDA) plus a combining character (U+093C DEVANAGARI SIGN NUKTA). In practical terms, this means that strings containing different encodings of <ṛa> (as it is sometimes transliterated) will be treated as totally separate during training and evaluation, except on the off chance that all associated tools perform Unicode normalization ahead of time.

This does have negative consequences for NLP. Consider the UDPipe system (Straka & Straková 2017) at the CoNLL 2017 shared task on dependency parsing (Zeman et al. 2017), for which the primary metric is labeled attachment score (LAS). I first attempted to replicate the UDPipe results for the Hindi Dependency Treebank. Using UDPipe 1.2.0, word2vec (commit 20c129a), the hyperparameters given in the authors’ supplementary materials, and the official evaluation script, I obtain LAS = 87.09 on the “gold tokenization” subtask. However I can improve this simply by converting the training, development, and test data to a consistent normalization like so:

for FILE in *.conllu; do
    TMPFILE="$(mktemp)"
    uconv -x nfkc "${FILE}" > "${TMPFILE}"
    mv "${TMPFILE}" "${FILE}"
done

and then retraining. Here I have chosen to apply the NFKC (“compatibility composed”) normalization form. While Zeman et al. do not discuss the encoding of the labeled Universal Dependencies data, they do mention that they apply NFKC normalization to the additional raw data. But it doesn’t really matter in this case which you choose so long as you are consistent. After retraining, I obtain LAS = 87.38, or .29 points for free. I also ran a “mismatch” experiment, where the training and testing data have different normalization forms; naturally, this causes a slight degradation to LAS = 86.98.
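If you don't have ICU's uconv installed, roughly the same preprocessing can be sketched in Python with the standard library. The function names normalize_file and normalize_directory are my own, and Python's unicodedata may be built against a different Unicode version than your ICU, so treat this as an approximation of the shell loop above rather than a byte-for-byte equivalent.

```python
import pathlib
import unicodedata


def normalize_file(path, form="NFKC"):
    """Rewrites one UTF-8 file in place using the given normalization form."""
    path = pathlib.Path(path)
    # newline="" preserves the file's original line endings.
    with open(path, encoding="utf-8", newline="") as source:
        text = source.read()
    with open(path, "w", encoding="utf-8", newline="") as sink:
        sink.write(unicodedata.normalize(form, text))


def normalize_directory(directory=".", form="NFKC"):
    """Normalizes every .conllu file directly inside the directory."""
    for path in sorted(pathlib.Path(directory).glob("*.conllu")):
        normalize_file(path, form)


if __name__ == "__main__":
    normalize_directory(".")
```

Unlike the shell version, this overwrites each file directly instead of going through a temporary file, so run it on a copy of the treebank if you want to keep the originals.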

Straka & Straková (2017) report a separate set of experiments in which they have attempted to rebalance the training-development-test splits. Just to be sure, I repeated the above experiments using their original rebalancing script. With the baseline—mixed normalization—data, I can replicate their result exactly: LAS = 87.30. With a consistent NFKC normalization of training, development and test data, I get LAS = 87.50. And with a normalization mismatch between training and test data, I get LAS = 87.07, a slight degradation. And the improvements are more or less for free.

While I have not yet done a systematic audit, I found three other UD treebanks that have encoding issues. The ar_padt treebank has a non-canonical ordering of combining characters in the lemma column (the shaddah, which indicates geminates, should come before the fathah and not the other way around), but this is unlikely to have any major effect on model performance because it uses this non-canonical ordering consistently. The ko_kaist and ur_udtb treebanks also have minor inconsistencies.

Unfortunately my corporate overlord doesn’t permit me to file a pull request here because the Hindi data is released under a CC BY-NC-SA license. But if you’re not so constrained, feel free to do so, and ping this thread once you have! And pay attention in the future.

References

Bhat, R. A., Bhatt, R., Farudi, A., Klassen, P., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D. M., Vaidya, A., Vishnu, S. R., and Xia, F. 2017. The Hindi/Urdu Treebank Project. In Ide, N., and Pustejovsky, J. (eds.), The Handbook of Linguistic Annotation, pages 659-698. Springer.
Eckhart de Castilho, R. 2016. Automatic analysis of flaws in pre-trained NLP models. In 3rd International Workshop on Worldwide Language Service Infrastructure and 2nd Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies, pages 19-27.
Palmer, M., Bhatt, R., Narasimhan, B., Rambow, O., Sharma, D. M., and Xia, F. 2009. Hindi syntax: Annotation dependency, lexical predicate-argument structure, and phrase structure. In ICON, pages 14-17.
Straka, M., and Straková, J. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88-99.
Wu, W. and Yarowsky, D. 2018. A comparative study of extremely low-resource transliteration of the world’s languages. In LREC, pages 938-943.
Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., … and Li, J. 2017. CoNLL Shared Task: Multilingual parsing from raw text to Universal Dependencies. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1-19.

Lessons learned from my time at Google

  • C++ 11 is a powerful, elegant language and the right choice for performant general-purpose code. Bash is an excellent lingua franca for chaining a long series of commands. Python is best for everything else.
  • Data should be passed around in schematic form, with a compact serialization over the wire and a human-readable format at rest. Protocol buffers (and the lesser-known text format) are an ideal cross-language solution.
  • Grammar development is more important than model building.
  • Model building is easier than deployment.
  • Whiteboards are useful.
  • I can only do certain sorts of work without an office (yes, that thing with a door).

A minimalist project design for NLP

Let’s say you want to build a new tagger, a new named entity recognizer, a new dependency parser, or whatever. Or perhaps you just want to see how your coreference resolution engine performs on your new database of anime reviews. So how should you structure your project? Here’s my minimalist solution.

There are two principles that guide my design. The first one is modularity. Some of these components will get run many times, some won’t. If you’re doing model comparison—and you should be doing model comparison—some components will get swapped out with someone else’s code. This sort of thing is a major lift unless you opt for modularity. The second principle is filesystem state. The filesystem is your friend. If your embedding table eats up all your RAM and you have to restart, the filesystem will be in roughly the same state as when you left. The filesystem allows you to organize things into directories and subdirectories, and give the pieces informative names; I like to record information about datasets and hyperparameter values in my file and directory names. So without further ado, here are the recommended scripts or applications to create when you’re starting off on a new project.

  1. split takes the full dataset and a random seed (which you should store for later) as input. The script reads the data in, randomly shuffles the data, and then splits it into an 80% training set, a 10% development set, and a 10% test (i.e., evaluation) set, which it then outputs. If you’re comparing to prior work that used a “standard split” you may want to have a separate script that generates that too, but I strongly recommend using randomly generated splits.
  2. train takes the training set as input and outputs a model file or directory. If you’re automating hyperparameter tuning you will also want to provide the development set as input; if not you will probably want to either add a bunch of flags to control the hyperparameters or allow the user to pass some kind of model configuration file (I like YAML for this).
  3. apply takes as input the model file(s) produced in (2) and the test set, and applies the model to the data, outputting a new hypothesized test data set (i.e., the model’s predictions). One open question is whether this ought to take only unlabeled data or should overwrite the existing labels: it depends.
  4. evaluate takes as input the gold test set and the hypothesized test data set generated in (3) and outputs the evaluation results (as text or in some structured data format—sometimes YAML is a good choice, other times TSV files will do). I recommend you test this with a small amount of data first.

That’s all there is to it. When you begin doing model comparison you may find yourself swapping out (2-3) for somebody else’s code, but make sure to still stick to the same evaluation script.
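As an illustration of step (1), here is a minimal sketch of the core of a split script. The function name, the fractions as keyword arguments, and the choice to give any rounding remainder to the test set are all my own assumptions; reading and writing your particular data format is left out.

```python
import random


def split(examples, seed, train_frac=0.8, dev_frac=0.1):
    """Shuffles examples with a fixed seed and returns (train, dev, test)."""
    examples = list(examples)
    # A dedicated Random instance keeps the shuffle reproducible and
    # independent of any other use of the global random state.
    random.Random(seed).shuffle(examples)
    n_train = int(len(examples) * train_frac)
    n_dev = int(len(examples) * dev_frac)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    # Whatever is left over (including any rounding remainder) is the test set.
    test = examples[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split(range(100), seed=1234)
    print(len(train), len(dev), len(test))
```

Recording the seed in the output file or directory names, in keeping with the filesystem-state principle above, makes it easy to regenerate exactly the same split later.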

Another pseudo-decipherment of the Voynich manuscript

The Voynich manuscript consists of 240 pages of text and fanciful illustrations written in an unknown script. It is first mentioned in the 16th century, then largely disappears from the record for several centuries, only to resurface for sale in 1903. An independent carbon dating assigns an early 15th century date to the vellum, but some scholars speculate it may have been inked or re-inked at a later date. Other scholars believe it to be an elaborate hoax or forgery. A recent paper in Transactions of the Association for Computational Linguistics (TACL) by Bradley Hauer & Grzegorz Kondrak (henceforth H&K) entitled Decoding Anagrammed Texts Written in an Unknown Language was touted to have enabled a decipherment of the Voynich. Have H&K succeeded where others have failed? Unfortunately, having reviewed the paper carefully, I can say with some certainty that they have not.

H&K propose two techniques towards decipherment. First, they describe methods to determine the underlying language of a plaintext using only the ciphertext, assuming a simple bijective substitution cipher. Their preferred method does not depend on the linear order of strings within the ciphertext, and thus works equally well when the ciphertext characters have been permuted within words (assuming that word boundaries are somehow clearly delimited in the ciphertext), a point which will become important shortly. Then, they describe methods for cryptanalysis when the encipherment consists of a bijective substitution cipher under certain degenerate conditions, such as where the ciphertext lacks vowels, or where the ciphertext characters have been randomly permuted within words.

That much is fine (though I have some quibbles with the details, as you’ll see). My major issue with H&K is that they don’t provide any evidence that the Voynich is so encoded, they simply assume it. And, despite the press hype, their preferred method fails to produce anything remotely readable.

I don’t have much to say about their method for identifying source language; it is a relatively novel task—they only cite one prior work—and their method and evaluation both appear to be sound. I appreciate that their evaluation includes a brute force-like method of simply attempting to decipher the text as a given language, and as a topline, an “oracle” scenario in which the decipherment is known and the problem reduces to standard language ID. But I was struck by the following claim about their decipherment method (p. 79):

“We conclude that our greedy-swap algorithm strikes the right balance between accuracy and speed required for the task of cipher language identification.”

It’s hard for me to imagine in what sense “cipher language identification” might be considered something which needs to be fast (rather than merely feasible). I think, in contrast, we would be just fine with using supercomputers for this task if it worked.[1]

So what does their preferred method say about the plaintext language of the Voynich? It assigns, by far, the highest probability to Hebrew.[2,3] Naturally, the oracle scenario is inapplicable here; whereas most archaeological decipherments have worked from a small set of candidate languages for the plaintext, there is nothing like a consensus regarding the language of the Voynich.

H&K then consider methods for decipherment itself. This problem is essentially a type of unsupervised machine learning in which the objective is to identify a mapping from ciphertext to plaintext (a key) such that for the ciphertext, we maximize the probability of the plaintext with respect to some language model. Kevin Knight and colleagues have, in the last two decades, proposed three distinct applications for this scenario:

  • Unsupervised translation: Knight & Graehl (1998) use this scenario to learn low-resource transliteration models, and some subsequent work has applied this to other low-resource, small-vocabulary tasks, but as of yet such methods don’t scale well to machine translation in general.
  • Steampunk cryptanalysis: Knight et al. (2006) use this scenario for unknown-plaintext cryptanalysis of bijective substitution ciphers, and subsequent work has also applied this to homophonic and running key ciphers. But the aforementioned ciphers have been known to be vulnerable to pencil-and-paper attacks for a century or more, and it’s not clear that these methods are effective attacks against any cryptosystem in widespread use today.
  • Archaeological decipherment: Snyder et al. (2010) attempt to simulate the automatic decipherment of Ugaritic, a Semitic language written in a cuneiform script in the 14th through the 12th century BCE; these were manually deciphered in 1929-1931 by exploiting the language’s strong similarity to biblical Hebrew. Knight et al. (2012) show that an undeciphered 18th century manuscript is in fact a description of a Masonic ritual written in German and encoded using a homophonic cipher. However, others have argued that computational methods for archaeological decipherment are still quite limited. For instance, Sproat (2010a,b, 2014) draws attention to the unsolved problem of determining whether a symbol system represents language in the first place, and to the long history of pseudo-decipherment.

Regardless of the application, it should be obvious that decipherment is a computationally challenging problem. Formally, given a bijective cipher over an alphabet K, the keyspace has size |K|!, since each candidate key is a permutation of K. Three classes of methods are found in the literature:

  • Integer linear programming (ILP; e.g., Ravi & Knight 2008)
  • A linear relaxation of the ILP to expectation maximization or related methods (e.g., Knight et al. 2006)
  • Search-based techniques using a beam or tree (e.g., Hauer et al. 2014)
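To make the keyspace claim above concrete, here is a small sketch of a bijective substitution cipher over the 26 lowercase ASCII letters, along with the size of its keyspace. The helper names (random_key, encipher, invert) and the example plaintext are purely illustrative and are not taken from any of the cited papers.

```python
import math
import random
import string


def random_key(alphabet, seed):
    """Returns a random bijection (a permutation) over the alphabet."""
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(alphabet, shuffled))


def encipher(text, key):
    """Applies the key symbol by symbol; unknown symbols pass through."""
    return "".join(key.get(ch, ch) for ch in text)


def invert(key):
    """Inverts a bijective key so that it can be used to decipher."""
    return {cipher: plain for plain, cipher in key.items()}


alphabet = string.ascii_lowercase
key = random_key(alphabet, seed=0)
ciphertext = encipher("attack at dawn", key)
print(ciphertext)
print(encipher(ciphertext, invert(key)))

# Each candidate key is a permutation of the alphabet, so there are |K|!
# of them; for 26 letters that is roughly 4 x 10^26.
keyspace = math.factorial(len(alphabet))
print(keyspace)
```

Exhaustively scoring every key against a language model is clearly out of the question at this scale, which is why all three families of methods listed above rely on optimization or heuristic search instead.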

H&K’s preferred method is a case of the last of these; they refer to this prior work as “state-of-the-art”, but skimming Hauer et al. suggests they are state-of-the-art on decrypting snippets of text randomly sampled from the English Wikipedia article “History”. This doesn’t strike me as an acceptable benchmark, even if there’s some precedent in the literature, and what’s worse is that they use snippets as short as two characters, which are well below the well-known theoretical bound.

H&K propose two adaptations of the Hauer et al. model. First, they consider a variant which can handle ciphertext in which characters have been permuted (“anagrammed”) within words (assuming that word boundaries are clearly delimited in the ciphertext and are the same as word boundaries in the plaintext). H&K mention that this has been suggested in prior Voynichology—though this might well be pure speculation, since we can’t read the manuscript—but do not themselves argue that the Voynich is anagrammed. Random permutation of letters within words strikes me as a poor cryptographic strategy due to the non-determinism it introduces. Rof nnastcie anc uyo adre shti eetnencs? That’s hard to read, in my opinion, though not completely impossible with enough context.[4] While I can’t really put myself into the mind of the creators of the Voynich manuscript, it seems that a wide degree of hermeneutic freedom is undesirable in most written genres, even texts of, say, an occult nature: you don’t want to accidentally turn yourself into a newt! Secondly, H&K adapt their model so that it can restore vowels omitted in the plaintext.[5] They refer to the resulting ciphertext with vowels omitted as an “abjad”, using a rare term of art for consonantal writing systems, i.e., those in which vowels are omitted. Phoenician, the ancestor of the Greek & Latin alphabets, did not originally write vowels at all, but they are inconsistently present in later texts and both Hebrew & Arabic write certain vowels. In Standard Arabic, for example, all long vowels are written explicitly, and Hebrew during the Renaissance era was normally written with the Tiberian diacriticization (or niqqud) developed several centuries earlier. H&K seem to be assuming a total omission of vowels which would be both anachronistic and typologically rare, and had H&K mentioned either of those facts in a brief disclaimer admitting to their slight abuse of terminology, I wouldn’t think they were misled, or misleading the reader, about what an abjad (normally) is.

It seems to me that H&K have, at this point, taken a method-free leap of faith towards the hypothesis that the Voynich is vowel-less Hebrew, anagrammed and encoded with a bijective substitution cipher. Perhaps I’d be willing to forgive it if these assumptions allowed them to produce some readable plaintext. Here’s what they have to say about that (p. 84):

“None of the decipherments appear to be syntactically correct or semantically consistent. […] The first line of the VMS [Voynich manuscript]…is deciphered into Hebrew as ועשה לה הכה איש אליו לביחו ו עלי אנשי המצות. According to a native speaker of the language, this is not quite a coherent sentence. However, after making a couple of spelling corrections, Google Translate is able to convert it into passable English: ‘She made recommendations to the priest, man of the house and me and people.'”

So the authors, neither of whom apparently are native speakers of Hebrew, post-edited the output of their system until the MT decoder produced this sentence. As others have noted, this is not an acceptable method—modern MT systems are extremely good at producing locally coherent text from degenerate input.

H&K suggest two possible interpretations of their results: “the results presented in this section could be interpreted either as tantalizing clues for Hebrew as the source language of the VMS, or simply as artifacts of the combinatoric power of anagramming and language models.” (p. 84f.) So they are not really claiming, at least in this article, a decipherment—that’s an addition of the subsequent, irresponsible press coverage, for which I can’t really blame H&K—but I can’t imagine calling this “tantalizing”. I don’t see any reason to think H&K have any confidence in their decipherment, either: they don’t provide more than a single plaintext sentence, and don’t provide a key. Had I been asked to review this paper, I would have requested that the portion of the paper dealing with language identification employ corpora of non-linguistic symbol systems (such as those in Sproat 2014), and I would have insisted that the portion of the paper dealing with the decipherment of the Voynich be essentially scrapped. The Voynich angle is a red herring: there is nothing here. Had they just removed it, this would have been a perfectly good TACL paper!

In 2010, my colleague Richard Sproat wrote a brief article for the journal Computational Linguistics (Sproat 2010b) which reviewed a recent paper by Rao et al. (2009), published in the journal Science. Rao et al. claim to provide statistical evidence that the Indus Valley seals are a writing system. Now there are quite a few reasons to suspect the seals are not writing under any common-sense definition thereof. More importantly, though, Rao et al.’s method fails to discriminate between linguistic and non-linguistic symbol systems (see, e.g., Sproat 2014). Sproat implies that had the Science editors simply retained computational linguists as referees, they would have been made aware of the manifest flaws of Rao et al.’s paper and would thus have rejected it. With respect to my colleague, he has been shown wrong on both counts. First, when these journals retain computational linguist referees, they simply ignore negative reviews of technically-flawed, linguistically-oriented work when it has sufficient “woo factor”. Secondly, woo factor trumps lack of method even in one of the top journals for computational linguistics and natural language processing, one which I review for and publish in. Some recent research suggests that fanciful university press releases are a key contributor to scientific hype. As far as I can tell, that is what happened here: the “tantalizing clues” in a flawed journal article were wildly exaggerated by the University of Alberta press office, and major publications took the press release at its hyperbolic word.

PS: If you’re interested in more wild speculation about the Voynich manuscript, may I suggest you check out @voynich_bot on Twitter?

Acknowledgements

Thanks to Brian Roark & Richard Sproat for feedback on this.

Endnotes

[1] The hacks at the Daily Mail are rather confused here; Carmel isn’t a supercomputer—it’s a free software package for doing expectation maximization over finite-state transducers—and at worst you might want to run these kinds of experiments using a top-of-the-line microcomputer, possibly with a powerful graphics card (e.g., Berg-Kirkpatrick & Klein 2013).

[2] An alternative method prefers Mazatec, which H&K correctly reject as chronologically implausible; a few other top possibilities are Mozarabic, Italian, and Ladino, which H&K consider “plausible”. Mozarabic is an extinct Romance language that was spoken (but only rarely written) by Christians living in Moorish Spain; it is unclear whether H&K are using the Arabic or the Roman orthography (neither was really standard). Ladino was spoken in the same region and time period but by the Sephardic population; it was written using Hebrew characters. As far as I know, both languages would have declined rapidly after the conclusion of the Reconquista, which imposes a terminus ante quem of roughly 1492, if either is the plaintext language of the Voynich.

[3] For reasons unclear to me they only use 43 pages of the manuscript in their Voynich experiments. This seems like a major flaw to me. Had I been asked to review this paper, I would have requested a justification.

[4] To wit, in the CMU dictionary, 17% of six-character words are an anagram of at least one other word, and there are no less than fifteen anagrams of the sequence AEIMNR.
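For the curious, here is a sketch of how one might compute a statistic like this. The eight-word list is a toy example of my own, not the CMU dictionary, so the fraction it prints says nothing about the 17% figure; to reproduce that, you would feed in the dictionary's actual headwords.

```python
from collections import defaultdict


def anagram_fraction(words, length):
    """Fraction of words of the given length that anagram to another word."""
    classes = defaultdict(set)
    for word in words:
        word = word.lower()
        if len(word) == length:
            # Words with the same sorted letters are anagrams of each other.
            classes["".join(sorted(word))].add(word)
    total = sum(len(members) for members in classes.values())
    ambiguous = sum(
        len(members) for members in classes.values() if len(members) > 1
    )
    return ambiguous / total if total else 0.0


# A toy word list, purely for illustration.
toy_words = [
    "marine", "remain", "airmen", "listen",
    "silent", "planet", "orange", "guitar",
]
fraction = anagram_fraction(toy_words, 6)
print(fraction)
```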

[5] H&K claim that one can’t use the linear relaxation method to restore vowels. I don’t see why, though. If the hypothesis space is expressed as a single-state weighted finite-state transducer, and the plaintext vowels are simply mapped to epsilon, then everything proceeds as normal. In fact I am running such an experiment with a ciphertext consisting of an “abjad” (no-vowel) rendering of the Gettysburg Address. I use a variant of the Knight et al. (2006) approach with Baum-Welch training and forward-backward decoding rather than their Viterbi approximations (software here). Because the resulting lattice is cyclic, the shortest-distance computation during the E-step is more complex than normal, but it does basically work. This is to be expected: you prbbly hv lttl trbl rdng txt tht lks lk ths. Experimental results forthcoming.

References

Berg-Kirkpatrick, Taylor; Klein, Dan. 2013. Decipherment with a million random restarts. In EMNLP, pages 874-878.
Hauer, Bradley; Hayward, Ryan; Kondrak, Grzegorz. 2014. Solving substitution ciphers with combined language models. In COLING, pages 2314-2325.
Hauer, Bradley; Kondrak, Grzegorz. 2016. Decoding anagrammed texts written in an unknown language. Transactions of the Association For Computational Linguistics 4: 75-86.
Knight, Kevin; Graehl, Jonathan. 1998. Machine transliteration. Computational Linguistics 24(4): 599-612.
Knight, Kevin; Nair, Anish; Rathod, Nishi; Yamada, Kenji. 2006. Unsupervised analysis for decipherment problems. In COLING, pages 499-506.
Knight, Kevin; Megyesi, Beáta; Schaefer, Christiane. 2012. The secrets of the Copiale cipher. Journal for Research into Freemasonry 2(2): 314-324.
Ravi, Sujith; Knight, Kevin. 2008. Attacking decipherment problems optimally with low-order n-gram models. In EMNLP, pages 812-819.
Rao, Rajesh; Yadav, Nisha; Vahia, Mayank; Joglekar, Hrishikesh; Adhikari, R.; Mahadevan, Iravatham. 2009. Entropic evidence for linguistic structure in the Indus script. Science 324(5931): 1165.
Snyder, Ben; Barzilay, Regina; Knight, Kevin. 2010. A statistical model for lost language decipherment. In ACL, pages 1048-1057.
Sproat, Richard. 2010a. Language, Technology, and Society. Oxford: Oxford University Press.
Sproat, Richard. 2010b. Ancient symbols, computational linguistics, and the reviewing practices of the general science journals. Computational Linguistics 36(3): 585-594.
Sproat, Richard. 2014. A statistical comparison of written language and nonlinguistic symbol systems. Language 90(2): 457-481.

What to do about the academic brain drain

The academy-to-industry brain drain is very real. What can we do about it?

Before I begin, let me confess my biases. I work in the research division of a large tech company (and I do not represent their views). Before that, I worked on grant-funded research in the academy. I work on speech and language technologies, and I’ll largely confine my comments to that area.

[Content warnings: organized labor, name-calling.]

Salary

Fact of the matter is, industry salaries are determined by a relatively efficient labor market. Academy salaries are compressed, with a relatively firm ceiling for all but a handful of “rock star” faculty. The vast majority of technical faculty are paid substantially less than they’d make if they just took the very next industry offer that came around. It’s even worse for research professors who depend on grant-based “salary support” in a time of unprecedented “austerity”: they can find themselves functionally unemployed any time a pack of incurious morons ends up in the White House (as seems to happen every eight years or so).

The solution here is political. Fund the damn NIH and NSF. Double (no, triple) their funding. Pay for it by taxing corporations and the rich, or, better yet, by diverting some money from the Giant Death Machines fund. Make grant support contractual, so PIs with a five-year grant are guaranteed five years of salary support and a chance to realize their vision. Insist on transparency and consistency in “indirect costs” (i.e., overhead) for grants to drain the bureaucratic swamp (more on that below). Resist the casualization of labor at universities, and do so at every level. Unionize every employee at every American university. Aggressively lobby Democratic presidential candidates to commit to appointing National Labor Relations Board members who will continue to recognize graduate students’ right to unionize.

Administration & bureaucracy

Industry has bureaucratic hurdles, of course, but they’re in no way comparable to the profound dysfunction taken for granted in the academic bureaucracy. If you or anyone you love has ever written a scientific grant, you know what I mean; if not, find a colleague who has and politely ask them to tell you their story. At the same time that American universities are cutting their labor costs through casualization, they are massively increasing their administrative costs. You will not be surprised to find that this does not produce better scientific outcomes, or make it easier to submit a grant. This is a case of what Noam Chomsky has described as the “neoliberal confidence trick”. It goes a little something like this:

  1. Appoint/anoint all-powerful administrators/bureaucrats, selecting for maximal incompetence.
  2. Permit them to fail.
  3. Either GOTO #1, or use this to justify cutting investment in whatever was being administered in the first place.

I do not see any way out of this situation except class consciousness and labor organizing. Academic researchers must start seeing the administration as potentially hostile to their interests, and refuse to identify with (or, quelle horreur, to join) the managerial classes.

Computing power & data

The big companies have more computers than universities. But in my area, speech and language technology, nearly everything worth doing can still be done with a commodity cluster (like you’d find in the average American CS department) or a powerful desktop with a big GPU. And of those, the majority can still be done on a cheap laptop. (Unless, of course, you’re one of those deep learning eliminationist true believers, in which case, reconsider.) Quite a bit of great speech & language research, in particular work on machine translation, has come from collaborations between the Giant Death Machines funding agencies (like DARPA) and academics, with the former usually footing the bill for computing and data (usually bought from the Linguistic Data Consortium (LDC), itself essentially a collaboration between the military-industrial complex and the Ivy League). In speech recognition, there are hundreds of hours of transcribed speech in the public domain, and hundreds more can be obtained with an LDC contract paid for by your funders. In natural language processing, it is by now almost gauche for published research to make use of proprietary data, possibly excepting the venerable Penn Treebank.

I feel the data-and-computing issue is largely a myth. I do not know where it got started, though maybe it’s this bizarre press-release-masquerading-as-an-article (and note that’s actually about leaving one megacorp for another).

Talent & culture

Movements between academy & industry have historically been cyclic. World War II and the military-industrial-consumer boom that followed siphoned off a lot of academic talent. In speech & language technologies, the Bell breakup and the resulting fragmentation of Bell Labs pushed talent back to the academy in the 1980s and 1990s; the balance began to shift back to Silicon Valley about a decade ago.

There’s something to be said for “game knows game”—i.e., the talented want to work with the talented. And there’s a more general factor—large industrial organizations engage in careful “cultural design” to keep talent happy in ways that go beyond compensation and fringe benefits. (For instance, see Fergus Henderson’s description of engineering practices at Google.) But I think it’s important to understand this as a symptom of the problem, a lagging indicator, and as part of an unpredictable cycle, not as something to optimize for.

Closing thoughts

I’m a firm believer in “you do you”. But I do have one bit of specific advice for scientists in academia: don’t pay so much damn attention to Silicon Valley. Now, if you’re training students (and you’re doing it with the full knowledge that few of them will ever be able to work in the academy, as you should), you should educate yourself and your students to prepare for this reality. Set up a little industrial advisory board, coordinate interview training, talk with hiring managers, adopt industrial engineering practices. But do not let Silicon Valley dictate your research program. Do not let Silicon Valley tell you how many GPUs you need, or that you need GPUs at all. Do not believe the hype. Remember always that what works for a few dozen crypto-feudo-fascisto-libertario-utopio-futurist billionaires from California may not work for you. Please, let the academy once again be a refuge from neoliberalism, capitalism, imperialism, and war. America has never needed you more than it does right now.

If you enjoyed this, you might enjoy my paper, with Richard Sproat, on an important NLP task that neural nets are really bad at.