Latin glides and the case of “belua”

Latin texts leave the distinction between high monophthongs [i, u, ī, ū] and glides [j, w] unspecified. This has led some to suggest that the glides are allophones of the monophthongs. For instance, Steriade (1984) implies that the syllabicity of [+high, +vocalic] segments in Latin is largely predictable. Steriade points out two contexts where high vocoids are (almost) always glides: initially before a vowel (# __ V) and intervocalically (V __ V). In these two contexts, the only complications I am aware of arise from competition between generalizations. For instance, in ūua [uː.wa] ‘grape’ and ūuidus [uː.wi.dus] ‘damp’, intervocalic glide formation appears to bleed word-initial glide formation. (Or it could be the case that ū is ineligible for glide formation by virtue of its length.) And the behavior of two adjacent high vocoids flanked by vowels is somewhat idiosyncratic: compare naevus [naj.wus] ‘birthmark’ and saeuiō [saj.wi.oː] ‘I am furious’, where (by hypothesis) /ViuV/ surfaces as [j.w], to dēuius [deː.wi.us] ‘devious’ and pauiō [pa.wi.oː] ‘I beat’, where (by hypothesis) /VuiV/ surfaces as [.wi] but never as *[w.j]. And so on.

However, Cser (2012) claims that the syllabicity of high vocoids is not at all predictable after a consonant and before a vowel, i.e., in the context C __ V. Here we usually observe [w] when the preceding consonant is a coda [j, l, r], as in the aforementioned naevus or silua [sil.wa] ‘forest’. Cser contrasts this latter form with belua ‘wild beast’, which is trisyllabic rather than bisyllabic. However, it is not clear that this is a good near-minimal pair. The word was clearly not pronounced as [be.lu.a], because the first syllable scans heavy. In the following hexameter verse, the word comprises the fifth foot, a dactyl:

et centumgeminus Briareus, ac belua Lernae (Verg., Aen. 6.287)

Lewis & Short and the Oxford Latin Dictionary both give this word as bēlua [beː.lu.a]. However, it seems much more likely that the word is in fact bellua [bel.lu.a], as it was sometimes written. (Note also that tautomorphemic geminate ll is robustly attested in Latin.) In this case we would expect glide formation to be blocked because the [lw] complex onset is totally unattested, just as Cser predicts from general principles of sonority sequencing. Thus the above verse is:

[et.ken|tũː.ge.mi|nus.bri.a|re.u.sak|bel.lu.a|ler.naj]

As Cser notes, many of the remaining near-minimal pairs occur at morphological boundaries (and thus look, to someone with my theoretical commitments, like evidence for the phonological cycle) or involve the complex onsets qu [kw] and su [sw], which might be treated as contour segments underlyingly. But much work will be needed to show that these apparent exceptions follow from the grammar of Latin.

References

Cser, András. 2012. The role of sonority in the phonology of Latin. In Parker, Steve (ed.), The sonority controversy, pages 39-64. Berlin: Mouton de Gruyter.
Steriade, Donca. 1984. Glides and vowels in Romanian. In Proceedings of the Berkeley Linguistics Society, pages 47-64.

Exceptions to reduplication in Kinande

Mutaka & Hyman’s (1990) study of reduplication in Kinande, a Bantu language spoken in “Eastern Zaire” (now the Democratic Republic of the Congo), is the sort of phonology study one doesn’t see much of anymore. The authors begin by noting the recent interest in reduplication phenomena, but observe that most of the major work has completely ignored Bantu, an enormous language family in which nearly every language has one or more types of reduplication. Mutaka & Hyman (MH) proceed to describe Kinande reduplication in detail, with only occasional reference to other languages.

Nouns that undergo reduplication have the semantics of roughly ‘the real X’. Most Kinande verbs also undergo reduplication, with the semantics of roughly ‘to hurriedly X’ or ‘to repetitively X’. Verbal reduplication is somewhat more interesting because certain other verbal suffixes (or “extensions”, as they’re sometimes called in Bantu) may also be found in the reduplicant, which is argued to be a roughly bisyllabic prefix. For instance, the passive suffix, argued to be underlyingly /u/ but surfacing as [w], is copied over in reduplication. Thus for the verb hum ‘beat’, the passive e-ri-hum-w-a ‘to be beaten’ reduplicates as erihumwahumwa. However, larger vowel-consonant verbal suffixes are not copied; the applied (-ir-) passive infinitive e-ri-hum-ir-w-a ‘to be beaten for’ has the reduplicated form erihumahumirwa, and for the verb tum ‘send’, the applied passive reciprocal (-an-) infinitive e-rí-tum-ir-an-w-a ‘to be sent to each other’ has the reduplicated form erítumatumiranwa (MH, 56).

What’s even more interesting to me is the behavior of verb stems with what MH call ‘unproductive’ extensions (all of which appear to be vowel-consonant). MH report that for only a small minority of these verb stems is there any plausible etymological relationship to a verb without the extension. One example is luh-uk-a ‘take a rest’, which is plausibly related to luh-a ‘be tired’ (MH, 73e), but there is no *bát-a paired with bát-uk-a ‘move’ (MH, 74d). Verb stems bearing unproductive extensions may show one of three behaviors with respect to reduplication. For some such stems, reduplication is forbidden: eríbugula ‘to find’. For others, reduplication occurs but the ‘unproductive’ extension is stranded (the same behavior as the ‘productive’ extensions): e-rí-banguk-a ‘to jump about’ reduplicates as eríbangabanguka. Finally, some such stems (roughly half) unexpectedly build a trisyllabic (rather than bisyllabic) reduplicant consisting of the verb root and the unproductive extension: e-ri-hurut-a ‘to snore’ reduplicates as erihurutahuruta (MH, 75). This distribution poses a fascinating puzzle. How is the failure of reduplication encoded in the first case? What licenses the trisyllabic reduplicant in the last?

References

Mutaka, Ngessimo and Hyman, Larry M. 1990. Syllables and morpheme integrity in Kinande reduplication. Phonology 7: 73-119.

Libfix report for June 2019

You may be familiar with fatberg, a mass of non-biodegradable solids and fats found in sewers, which suggests -berg has been innovated (presumably via iceberg). And now London is also haunted by a concreteberg.

Late great tech unicorn Theranos made use of a proprietary blood-collection device they called the nanotainer (via container), and I recently found out about vacutainer and a security software package called Cryptainer. So -tainer has been liberated.

The other day in Queens I saw a sign for a Mathnasium, presumably extracted from gymnasium, and the Corpus of Contemporary American English also has a token of jamnasium (a space for jam seshes), suggesting a nascent -nasium.

In a recent, widely-derided ad campaign, Applebee’s coined sizzletonin on analogy with the neurotransmitter serotonin and the hormone melatonin, but as far as I know that’s the end of the line for -tonin.

The revolution will be at Starbucks

One of the biggest shocks about life in The Zone (11/9/2016-present) is how often Starbucks makes the news. Just a couple days after the election, a group of patriots, organizing around the hashtag #TrumpCup, decided to show solidarity with their big wet boy by subverting the sacred ordering ritual to trick baristas into shouting “Iced Frappuccino for Trump”. Then there was the everyday-in-America story of a few young men thrown out of a Philadelphia Starbucks—by the police—for the mere act of being black in public. And now gloriously quixotic former CEO Howard Schultz is considering a third-party run for president. How has Trumpism turned America’s top coffee chain into a battleground?

I think I know. Starbucks is a looking glass, and when we gaze into it, we see what we want to. Allow me to explain.

We yet again live in a time in which the public commons is contracting. It is not so much being “enclosed” (as it was in Georgian England) as neglected, thanks to the inexorable logic of austerity (as it happens, the key plank of Schultz’s platform). Even public libraries—a radical, and incredibly impactful, experiment in architecture and government—are at risk; President Trump has sought to eliminate federal spending on libraries, and they are under threat in communities both small and large. Faced with a disappearing public commons, we turn increasingly to private simulacra of the park, the library, the school or university, and for some, a busy Starbucks will have to do.

Starbucks has another thing going for it. The product is really not bad, and of surprisingly uniform quality. While coffee snobs turn their noses up at the burnt-tasting drip coffee, the espresso drinks are quite good if not always great. The production of a large menu of high-quality, complex, labor-intensive goods, daily, at 14,000 locations across the US, is an incredible feat of logistics. US social welfare programs, increasingly administered by a patchwork of hostile state governments, do not come off well in comparison to the fungible, always-available Starbucks latte. It is easy to see why. Starbucks is embedded in an all-encompassing matrix of market capitalism, but internally, it is a command economy, one in which no store can be left behind. It is hard even to imagine living in an America where, say, welfare or health care services are provided to citizens with the same efficiency as a Starbucks manager requisitioning a case of oat milk.

At least that’s what I see when I look at Starbucks. But, as #TrumpCup shows, others see something different: the masses of Americans not moved—if not outright repelled—by the mixture of petty grievances and white identity politics that animates President Trump’s base. The libs (as we’ll call them) are a diverse group, better defined by exemplars—sometimes, right-wing media caricatures—than prototypes, and one key lib exemplar is the Starbucks barista. The barista is probably young, and possibly urban. Perhaps they have a college education and have taken the job for the health care benefits the state does not provide. Maybe they even share former CEO Schultz’s tepid opposition to President Trump.

If this wasn’t enough to forever code the barista as the Other, there is also a whole new language, not quite English, to learn. A small coffee is unexpectedly “tall”; a large is a “venti”; a “macchiato” is something else entirely. Mastering this language gives the customer the power to summon strange and fantastic beasts: the “blonde espresso”, or if the stars are properly aligned, the “spiced sweet cream nariño 70 cold brew”.

And, perhaps most importantly, the barista is a captive audience. The barista has a manager, and yes, you really can ask to speak to them. For the #TrumpCup Republican, this is a potent brew, a hierarchy in which they stand above the Other, the perfect victim for a bit of everyday cruelty and meaningless self-gratification.

It was probably inevitable that one of the most ubiquitous corporations in American life was going to ultimately come to index something, and where I see the state’s abdication of responsibilities inherent in the social contract, others just see a snot-nosed, underemployed 25-year-old who would rather not be working this job forever. In conclusion, Starbucks is a land of contrasts, and will remain so until we resolve the contradictions inherent in American society.

Using a fixed training-development-test split in sklearn

The scikit-learn machine learning library has good support for various forms of model selection and hyperparameter tuning. For setting regularization hyperparameters, there are model-specific cross-validation tools, and there are also general-purpose tools for grid (i.e., exhaustive) hyperparameter tuning with sklearn.model_selection.GridSearchCV and for random hyperparameter tuning (in the sense of Bergstra & Bengio 2012) with sklearn.model_selection.RandomizedSearchCV. While you could probably implement these yourself, the sklearn developers have enabled just about every feature you could want, including multiprocessing support.

One apparent limitation of these classes is that, as their names suggest, they are designed for use in a cross-validation setting. In speech & language technology, however, standard practice is to use a fixed partition of the data into training, development (i.e., validation), and test (i.e., evaluation) sets, and to select hyperparameters which maximize performance on the development set. This is in part an artifact of the limited computing resources of the Penn Treebank era, and I’ve long suspected it has serious repercussions for model evaluation. But tuning and evaluating with a standard split is faster than cross-validation and can make exact replication much easier. And there are also some concerns about whether cross-validation is the best way to set hyperparameters anyways. So what can we do?

The GridSearchCV and RandomizedSearchCV classes take an optional cv keyword argument, which can be, among other things, an object implementing the cross-validation iterator interface. At first I thought I would have to create an object which allowed me to use a fixed development set for hyperparameter tuning, but then I realized that I could do this with one of the existing iterator classes, namely sklearn.model_selection.PredefinedSplit. The constructor for this class takes a single argument test_fold, an array of integers of the same size as the data passed to the fitting method. As the documentation explains, “…when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.” That we can do. Suppose that we have training data x_train and y_train and development data x_dev and y_dev, laid out as NumPy arrays. We then create a combined training-and-development set like so:

import numpy

x = numpy.concatenate([x_train, x_dev])
y = numpy.concatenate([y_train, y_dev])

Then, we create the iterator object:

test_fold = numpy.concatenate([
    # The training data: -1 keeps these samples out of the validation fold.
    numpy.full(x_train.shape[0], -1, dtype=numpy.int8),
    # The development data: all assigned to fold 0.
    numpy.zeros(x_dev.shape[0], dtype=numpy.int8)
])
cv = sklearn.model_selection.PredefinedSplit(test_fold)

Finally, we provide cv as a keyword argument to the grid or random search constructor, and then train. For instance, similar to this example, we might do something like:

import sklearn.ensemble
import sklearn.model_selection

base = sklearn.ensemble.RandomForestClassifier()
grid = {"bootstrap": [True, False],
        "max_features": [1, 3, 5, 7, 9, 10]}
model = sklearn.model_selection.GridSearchCV(base, grid, cv=cv)
model.fit(x, y)

Now just add n_jobs=-1 to the constructor for model to spread the work across all your logical cores.
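
Once fitting completes, model acts like an ordinary fitted estimator: with the default refit=True, it is refit on the combined training-and-development data using the best hyperparameters found. As a minimal sketch, we might then inspect the winning hyperparameters and evaluate on a held-out test set; the arrays x_test and y_test here are hypothetical and not part of the example above:

# The best hyperparameter combination found during the search.
print(model.best_params_)
# Accuracy on a held-out test set; x_test and y_test are hypothetical.
print(model.score(x_test, y_test))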

References

Bergstra, J., and Bengio, Y. 2012. Random search for hyperparameter optimization. Journal of Machine Learning Research 13: 281-305.

arXiv vs. LingBuzz

In the natural language processing community, there has been a bit of a kerfuffle about the ACL preprint policy, which essentially prevents you from submitting a manuscript to preprint aggregation websites like arXiv when the m.s. is also under review for a conference. I personally think this is a good policy: double-blind review is really important for fairness. This led me to reflect a bit on the outsized role that arXiv plays in natural language processing research. It is interesting to contrast arXiv with LingBuzz, a preprint aggregator for formal linguistics research.1 arXiv is visually ugly and cluttered, expensive (it somehow takes over $800,000 of the Simons Foundation’s money to run it every year), and submissions are subject to detailed, strict, carefully enforced editorial guidelines. In contrast, LingBuzz has a minimalistic text interface, is run and operated by a single professor (Michal Starke at the University of Tromsø), and its editorial guidelines are simple (they fit on a single page) and laxly enforced (mostly after the fact). Despite the laissez-faire attitude at LingBuzz, it has seen some rather contentious debates involving the usual trollish suspects (Postal, Everett, Behme, etc.), but it has managed to keep things under control. But what I really love about LingBuzz is that, unlike arXiv, no linguist is under the impression that it is any sort of substitute for peer review, or that authors need to know about (and cite) late-breaking work only available on LingBuzz. I think NLP researchers should take a hint from this and stop pretending arXiv is a reasonable alternative to peer review.

Endnotes

1. There are a few other such repositories. The Rutgers Optimality Archive (ROA) was once a popular repository for preprints of Optimality Theory work, but its contents are re-syndicated on LingBuzz and Optimality Theory is largely dead anyways. There is also the Semantics Archive.

Text encoding issues in Universal Dependencies

Do you know why the following comparison (in Python 3.7) fails?

>>> s1 = "ड़"
>>> s2 = "ड़"
>>> s1 == s2
False

I’ll give you a hint:

>>> len(s1)
1
>>> len(s2)
2

Despite the two strings rendering identically, they are encoded differently. The string s1 is a single-codepoint sequence, whereas s2 contains two codepoints. Thus string comparison fails, whether it’s done at the level of bytes or of Unicode codepoints.
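
We can inspect the individual codepoints using the standard library’s unicodedata module; here is what that looks like for these two strings:

>>> import unicodedata
>>> [unicodedata.name(c) for c in s1]
['DEVANAGARI LETTER DDDHA']
>>> [unicodedata.name(c) for c in s2]
['DEVANAGARI LETTER DDA', 'DEVANAGARI SIGN NUKTA']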

Some NLP researchers are aware of issues arising from faulty string encoding. Eckhart de Castilho (2016), for example, describes a tool which automatically identifies encoding flaws in pre-trained NLP models, and Wu & Yarowsky (2018) report problems using an existing transliteration tool on certain languages because of encoding issues. However, I suspect that far fewer NLP researchers are familiar with the aforementioned problem, which is specific to Unicode normalization. To put it simply, Unicode defines four normalization forms (and associated conversion algorithms) for strings, and the key distinction is between “composed” and “decomposed” forms of characters (using that term in a pretheoretic sense). The string s1 is composed into a single Unicode codepoint; s2 is decomposed into two.
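
Applying a consistent normalization form to both strings, whichever form one chooses, makes the comparison succeed. A minimal sketch, continuing the session above (as it happens, the precomposed Devanagari nuqta letters are Unicode “composition exclusions”, so both strings end up decomposed even under NFC; all that matters here is that the results match):

>>> unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)
True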

Unfortunately, three columns of the Hindi Dependency Treebank (hi_hdtb, commit 54c4c0f; Bhat et al. 2017, Palmer et al. 2009) have a chaotic mix of composed and decomposed representations. It seems most if not all of these have to do with the encoding of the six nuqta (‘dot’) consonants, which are usually found in borrowings from Arabic or Persian (via Urdu, presumably). In Devanagari these consonants are written by adding a dot to a phonetically similar native consonant; for instance ड [ɖə] plus the nuqta produces ड़ [ɽə]. As is usually the case in Unicode, there is more than one way to do it: you can either encode ड़ with a composed character (U+095C DEVANAGARI LETTER DDDHA) or with the native Devanagari character (U+0921 DEVANAGARI LETTER DDA) plus a combining character (U+093C DEVANAGARI SIGN NUKTA). In practical terms, this means that strings containing different encodings of <ṛa> (as it is sometimes transliterated) will be treated as totally separate during training and evaluation, except on the off chance that all associated tools perform Unicode normalization ahead of time.

This does have negative consequences for NLP. Consider the UDPipe system (Straka & Straková 2017) at the CoNLL 2017 shared task on dependency parsing (Zeman et al. 2017), for which the primary metric is labeled attachment score (LAS). I first attempted to replicate the UDPipe results for the Hindi Dependency Treebank. Using UDPipe 1.2.0, word2vec (commit 20c129a), the hyperparameters given in the authors’ supplementary materials, and the official evaluation script, I obtain LAS = 87.09 on the “gold tokenization” subtask. However, I can improve on this simply by converting the training, development, and test data to a consistent normalization form like so:

# Convert each CoNLL-U file to NFKC normalization, in place.
for FILE in *.conllu; do
    TMPFILE="$(mktemp)"
    uconv -x nfkc "${FILE}" > "${TMPFILE}"
    mv "${TMPFILE}" "${FILE}"
done

and then retraining. Here I have chosen to apply the NFKC (“compatibility composed”) normalization form. While Zeman et al. do not discuss the encoding of the labeled Universal Dependencies data, they do mention that they apply NFKC normalization to the additional raw data. But it doesn’t really matter in this case which form you choose so long as you are consistent. After retraining, I obtain LAS = 87.38, or .29 points for free. I also ran a “mismatch” experiment, where the training and testing data have different normalization forms; naturally, this causes a slight degradation to LAS = 86.98.

Straka & Straková (2017) report a separate set of experiments in which they have attempted to rebalance the training-development-test splits. Just to be sure, I repeated the above experiments using their original rebalancing script. With the baseline—mixed normalization—data, I can replicate their result exactly: LAS = 87.30. With a consistent NFKC normalization of training, development and test data, I get LAS = 87.50. And with a normalization mismatch between training and test data, I get LAS = 87.07, a slight degradation. And the improvements are more or less for free.

While I have not yet done a systematic audit, I found three other UD treebanks that have encoding issues. The ar_padt treebank has a non-canonical ordering of combining characters in the lemma column (the shaddah, which indicates geminates, should come before the fathah and not the other way around), but this is unlikely to have any major effect on model performance because it uses this non-canonical ordering consistently. The ko_kaist and ur_udtb treebanks also have minor inconsistencies.
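
For those who want to audit their own data, a check for a consistent normalization form takes only a few lines of Python. This is just a sketch: it assumes Python 3.8+ (which added unicodedata.is_normalized), and the file name below is hypothetical:

import unicodedata

def file_is_normalized(path, form="NFKC"):
    # True iff every line of the file is already in the given normalization form.
    with open(path, "r") as source:
        return all(unicodedata.is_normalized(form, line) for line in source)

print(file_is_normalized("hi_hdtb-ud-train.conllu"))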

Unfortunately my corporate overlord doesn’t permit me to file a pull request here, because the Hindi data is released under a CC BY-NC-SA license. But if you’re not so constrained, feel free to do so, and ping this thread once you have! And pay attention to normalization in the future.

References

Bhat, R. A., Bhatt, R., Farudi, A., Klassen, P., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D. M., Vaidya, A., Vishnu, S. R., and Xia, F. 2017. The Hindi/Urdu Treebank Project. In Ide, N., and Pustejovsky, J. (eds.), The Handbook of Linguistic Annotation, pages 659-698. Springer.
Eckhart de Castilho, R. 2016. Automatic analysis of flaws in pre-trained NLP models. In 3rd International Workshop on Worldwide Language Service Infrastructure and 2nd Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies, pages 19-27.
Palmer, M., Bhatt, R., Narasimhan, B., Rambow, O., Sharma, D. M., and Xia, F. 2009. Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure. In ICON, pages 14-17.
Straka, M., and Straková, J. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88-99.
Wu, W. and Yarowsky, D. 2018. A comparative study of extremely low-resource transliteration of the world’s languages. In LREC, pages 938-943.
Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., … and Li, J. 2017. CoNLL 2017 Shared Task: Multilingual parsing from raw text to Universal Dependencies. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1-19.

Lessons learned from my time at Google

  • C++11 is a powerful, elegant language and the right choice for performant general-purpose code. Bash is an excellent lingua franca for chaining a long series of commands. Python is best for everything else.
  • Data should be passed around in schematic form, with a compact serialization over the wire and a human-readable format at rest. Protocol buffers (and the lesser-known text format) are an ideal cross-language solution.
  • Grammar development is more important than model building.
  • Model building is easier than deployment.
  • Whiteboards are useful.
  • I can only do certain sorts of work without an office (yes, that thing with a door).

A minimalist project design for NLP

Let’s say you want to build a new tagger, a new named entity recognizer, a new dependency parser, or whatever. Or perhaps you just want to see how your coreference resolution engine performs on your new database of anime reviews. So how should you structure your project? Here’s my minimalist solution.

There are two principles that guide my design. The first one is modularity. Some of these components will get run many times, some won’t. If you’re doing model comparison—and you should be doing model comparison—some components will get swapped out with someone else’s code. This sort of thing is a major lift unless you opt for modularity. The second principle is filesystem state. The filesystem is your friend. If your embedding table eats up all your RAM and you have to restart, the filesystem will be in roughly the same state as when you left. The filesystem allows you to organize things into directories and subdirectories, and give the pieces informative names; I like to record information about datasets and hyperparameter values in my file and directory names. So without further ado, here are the recommended scripts or applications to create when you’re starting off on a new project.

  1. split takes the full dataset and a random seed (which you should store for later) as input. The script reads the data in, randomly shuffles it, and then splits it into an 80% training set, a 10% development set, and a 10% test (i.e., evaluation) set, which it then outputs. If you’re comparing to prior work that used a “standard split” you may want to have a separate script that generates that too, but I strongly recommend using randomly generated splits. (A minimal sketch of such a script appears after this list.)
  2. train takes the training set as input and outputs a model file or directory. If you’re automating hyperparameter tuning you will also want to provide the development set as input; if not you will probably want to either add a bunch of flags to control the hyperparameters or allow the user to pass some kind of model configuration file (I like YAML for this).
  3. apply takes as input the model file(s) produced in (2) and the test set, and applies the model to the data, outputting a new hypothesized test data set (i.e., the model’s predictions). One open question is whether this ought to take only unlabeled data or should overwrite the existing labels: it depends.
  4. evaluate takes as input the gold test set and the hypothesized test data set generated in (3) and outputs the evaluation results (as text or in some structured data format—sometimes YAML is a good choice, other times TSV files will do). I recommend you test this with a small amount of data first.
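
To make this concrete, here is a minimal sketch of the split script from (1), assuming the dataset is a flat text file with one example per line; the flag names are just placeholders:

import argparse
import random


def main(args):
    random.seed(args.seed)
    with open(args.input, "r") as source:
        lines = source.readlines()
    random.shuffle(lines)
    # 80% training, 10% development, 10% test.
    train_end = int(len(lines) * 0.8)
    dev_end = int(len(lines) * 0.9)
    with open(args.train, "w") as sink:
        sink.writelines(lines[:train_end])
    with open(args.dev, "w") as sink:
        sink.writelines(lines[train_end:dev_end])
    with open(args.test, "w") as sink:
        sink.writelines(lines[dev_end:])


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, required=True)
    parser.add_argument("--input", required=True)
    parser.add_argument("--train", required=True)
    parser.add_argument("--dev", required=True)
    parser.add_argument("--test", required=True)
    main(parser.parse_args())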

That’s all there is to it. When you begin doing model comparison you may find yourself swapping out (2-3) for somebody else’s code, but make sure to still stick to the same evaluation script.