It’s time to retire “agglutinative”

A common trope in computational linguistics papers is the use of the technical term agglutinative as a synonym for rich inflectional morphology. This is not really what the term means. Properly, a language has agglutinative morphology if it has affixes, each of which has a single syntacto-semantic function. (To measure this properly, you probably need a richer, and more syntactically-oriented, theory of morphology than is au courant among the kind of linguistic typologist who would find it interesting to measure this over a wide variety of languages in the first place, but that’s another issue.) Thus Russian, for instance, has rich inflectional morphology, but it is not at all agglutinative, because it is quite happy for the suffix -ov to mark both the genitive and the plural, whereas the genitive plural in Hungarian is marked by two separate affixes.

I propose that we take agglutinative away from NLP researchers until they learn at least a little morphology. If you want to use the term, you need to state why agglutination, rather than the mere fact of lexemes having a large number of inflectional variants, is the thing you want to highlight. While I don’t think WALS is very good—certainly it’s overused in NLP—it nicely distinguishes between fusion (#20), exponence (#21), and synthesis (#22). This ought to allow one to distinguish between agglutination and synthesis with a carefully-drawn sample, should one wish to.

A prediction

You didn’t build that. – Barack Obama, July 13, 2012

Connectionism originates in psychology, but the “old connectionists” are mostly gone, having largely failed to pass on their ideology to their trainees, and there really aren’t many “young connectionists” to speak of. But, I predict that in the next few years we’ll see a bunch of psychologists of language—the ones who define themselves by their opposition to internalism, innateness, and generativism—become some of the biggest cheerleaders for large language models (LLMs). In fact, psychologists have not made substantial contributions to neural network modeling in many years. Virtually all the work on improving neural networks over the last few decades has been done by computer scientists who cared not a whit whether they had anything to do with human brains or cognitive plausibility.1 (Sometimes they’ll put things like “…inspired by the human brain…” in the press releases, but we all know that’s just fluff.) At this point, psychology as a discipline has no more claim to neural networks than the Irish do to Gaul, and in the rather unlikely case that LLMs do end up furnishing deep truths about cognition, psychology as a discipline will have failed us by not following up on a promising lead. I think it will be particularly revealing if psychologists who previously worshipped at the Church of Bayes suddenly lose all interest in mathematical rigor and find themselves praying to the great Black Box. I want to say it now: if this happens—and I am starting to see signs that it will—those people will be cynics, haters, and trolls, and you shouldn’t pay them any mind.

Endnotes

  1. I am also critical of machine learning pedagogy, and it is therefore interesting to see that those same computer scientists pushing things forward don’t seem to care much for machine learning as an academic discipline either.

Filtering text at scale

[This post describes work in collaboration with Emily Charde.]

It is now commonplace for NLP applications to consume massive amounts of web text of unknown provenance. Applications which stand to benefit from this firehose of data, but do not need all of it, may require closer attention to data quality, in the form of high-precision methods for filtering out redundancy and junk.

Gorman et al. (2021) follow standard practices for obtaining a “clean” subsample of web data: they filter sentence strings based on the presence of capitalization and sentential punctuation, length, and predictability as measured by a character language model. In an ongoing project on defectivity, we sought to do something similar at a much larger scale. This project was undertaken in collaboration with Emily Charde, a graduate of our master’s program who worked as an RA on the project.

Our data for this project is drawn from CC-100, a recreation of the earlier CC-Net corpus (Wenzek et al. 2020). CC-100 consists of strings from 2018 Common Crawl snapshots, already filtered somewhat and grouped by language using language ID tools. At rest, the CC-100 data is stored in enormous LZMA-compressed files, one per language/locale/script. The largest, English (naturally), occupies 82 GB despite this aggressive compression scheme.

We proceed as follows.

We first shard the data for each language into roughly 1 GB chunks, preserving the LZMA compression.
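This sharding step can be sketched as follows. This is an illustrative reconstruction, not the project’s actual script: the shard-naming scheme is invented, and shard size is measured in uncompressed bytes for simplicity.

```python
import lzma

def shard_lzma(path: str, shard_bytes: int = 2 ** 30) -> list:
    """Splits one LZMA-compressed text file into several smaller LZMA shards.

    Shards are split at line boundaries, so no sentence is cut in half;
    `shard_bytes` counts uncompressed bytes, a rough proxy for shard size.
    """
    shards = []
    out = None
    written = 0
    with lzma.open(path, "rt", encoding="utf-8") as source:
        for line in source:
            if out is None or written >= shard_bytes:
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: data.txt.xz -> data.txt.xz.0000.xz.
                shard_path = f"{path}.{len(shards):04d}.xz"
                out = lzma.open(shard_path, "wt", encoding="utf-8")
                shards.append(shard_path)
                written = 0
            out.write(line)
            written += len(line.encode("utf-8"))
    if out is not None:
        out.close()
    return shards
```

Because both ends of the pipe stay compressed, the peak disk footprint never exceeds the compressed size by much.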

We then perform sentence and word tokenization in parallel using mudpipe.py, a Python wrapper around the C++ command-line tool UDPipe 1. The wrapper automatically decompresses the LZMA files, invokes UDPipe, and recompresses the output CoNLL-U-formatted data, preserving disk space. Since this work is mostly IO-bound, mudpipe.py runs in parallel across the various shards (the “m” in “mudpipe” stands for “multiprocessing”). This script was originally developed by Yulia Spektor, another graduate student, for her master’s thesis (Spektor 2021). Applying mudpipe.py to English, Greek, and Russian (our three target languages) took a few weeks of compute time on a single desktop that would otherwise have sat idle. The resulting shards of compressed CoNLL-U sentences are somewhat larger, roughly 2 GB each, presumably because of the additional markup.

We now turn to filtering in earnest. Whereas Gorman et al. were working with dozens of millions of sentences of English, the CC-100 language samples contain many billions of sentences, so filtering based on percentiles, like those used by Gorman et al., must be performed out-of-core. We thus chose SQLite as our data store for this project, anticipating that SQL would be a natural way to express the filters.

Filtering was ultimately performed by a single Python script using the sqlite3 standard library. This script runs through the tokenized shards produced by mudpipe.py and ultimately produces a single LZMA-compressed, CoNLL-U format file for each language. Working incrementally, each shard is decompressed and the CoNLL-U data is parsed line by line. Once a sentence is obtained, we apply ordinary re regular expression filters. These expressions require each sentence to start with an uppercase letter of the appropriate script, to continue with more letters, spaces, or punctuation of the appropriate script, and to end with sentential punctuation (e.g., /[.!?]/). For instance, a Russian or Greek sentence containing Latin characters is discarded. If quotation marks are present, they are required to “balance”. Sentences that fail one or more of these constraints are simply removed from further consideration. Additional data is extracted from the sentences that remain:

  • length in characters
  • length in tokens
  • bits per character (BPC) entropy according to an OpenGrm-NGram (Roark et al. 2012) 6-gram character language model
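The primary (regular-expression) filters described above might look something like the following sketch. The character classes and the quote-balancing check are simplified guesses at the constraints, not the project’s exact expressions:

```python
import re

# Illustrative shape filters. Each sentence must begin with an uppercase
# letter of the appropriate script, continue with letters, digits, spaces,
# or punctuation of that script, and end in sentential punctuation.
FILTERS = {
    "en": re.compile(r"""[A-Z][A-Za-z0-9 ,;:'"()-]*[.!?]['"]?"""),
    "ru": re.compile(r"[А-ЯЁ][А-ЯЁа-яё0-9 ,;:«»()-]*[.!?]»?"),
}

# Paired quotation marks, if present, must balance.
QUOTE_PAIRS = [("«", "»"), ("“", "”")]

def keep(sentence: str, lang: str) -> bool:
    """Returns True if the sentence passes all shape constraints."""
    if not FILTERS[lang].fullmatch(sentence):
        return False
    return all(
        sentence.count(opener) == sentence.count(closer)
        for opener, closer in QUOTE_PAIRS
    )
```

Note that a Russian sentence with Latin characters fails simply because Latin letters are absent from the Cyrillic character class; no separate check is needed.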

The sentence and these three statistics are then stored in the SQLite database; we also use gzip compression, with the shortest possible compression window and no headers, to save temporary disk space. Accumulating this portion of the table takes quite some time, but it can be performed in parallel across shards or languages. We perform batches of one million updates at a time. We experimented—well, Emily did, I watched—with various database PRAGMAs to improve performance, but none of them was clearly performance-positive.
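A minimal sketch of the batched-insert step, assuming a hypothetical sentences table with one column per statistic. Here raw DEFLATE via zlib (minimal window, no header or checksum) stands in for the headerless short-window gzip scheme described above:

```python
import sqlite3
import zlib

def squeeze(text: str) -> bytes:
    """Compresses with raw DEFLATE: smallest window (wbits=-9), no header."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -9)
    return compressor.compress(text.encode("utf-8")) + compressor.flush()

def unsqueeze(blob: bytes) -> str:
    """Inverts squeeze()."""
    return zlib.decompress(blob, -9).decode("utf-8")

def insert_batch(connection: sqlite3.Connection, rows) -> None:
    """Inserts (sentence, char_len, word_len, bpc) rows in one transaction."""
    with connection:  # One commit per batch, not one per row.
        connection.executemany(
            "INSERT INTO sentences VALUES (?, ?, ?, ?)",
            ((squeeze(text), c, w, b) for text, c, w, b in rows),
        )
```

Committing once per batch, rather than once per row, is the main lever here; per-row commits would force a sync to disk on every sentence.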

Our next step is to actually filter the data. In an inner subquery, we compute quartiles for character length, token length, and BPC. Then, in an outer subquery, we return the row IDs of every sentence which is in Q2 or Q3—the middle two quartiles—for all three measures. That is, if a sentence has median BPC but is in the 80th percentile for character length, we remove it. This is highly conservative, but we have more than enough data, and we anticipate that character length and token length, at least, are highly correlated in any language. In the outermost query, we SELECT the sentences whose row IDs are returned by the subquery. This query is a work of art.

SELECT tokenlist FROM table WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
        NTILE(4) OVER (ORDER BY char_len) AS char_q,
        NTILE(4) OVER (ORDER BY word_len) AS word_q,
        NTILE(4) OVER (ORDER BY bpc) AS bpc_q
        FROM table
    )
    WHERE (char_q BETWEEN 2 AND 3)
    AND (word_q BETWEEN 2 AND 3)
    AND (bpc_q BETWEEN 2 AND 3)
);
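The query can be tried out at toy scale with Python’s sqlite3 module (window functions such as NTILE require SQLite 3.25 or later). The table name and data below are fabricated for illustration:

```python
import sqlite3

# Eight fake rows whose three measures increase together, so the rows in
# the middle two quartiles of every ordering should survive the filter.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences (tokenlist TEXT, char_len INT, word_len INT, bpc REAL)"
)
for i in range(8):
    conn.execute(
        "INSERT INTO sentences VALUES (?, ?, ?, ?)",
        (f"sentence {i}", 10 * (i + 1), i + 1, 0.5 * (i + 1)),
    )
kept = [
    row[0]
    for row in conn.execute("""
        SELECT tokenlist FROM sentences WHERE rowid IN (
            SELECT rowid FROM (
                SELECT rowid,
                NTILE(4) OVER (ORDER BY char_len) AS char_q,
                NTILE(4) OVER (ORDER BY word_len) AS word_q,
                NTILE(4) OVER (ORDER BY bpc) AS bpc_q
                FROM sentences
            )
            WHERE (char_q BETWEEN 2 AND 3)
            AND (word_q BETWEEN 2 AND 3)
            AND (bpc_q BETWEEN 2 AND 3)
        )
    """)
]
```

With eight rows, NTILE(4) assigns two rows per quartile, so the four middle rows are retained and the two extremes on each end are dropped.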

We then reserialize and recompress the remaining sentences into a new LZMA-compressed file. Here are some logging statements that give a sense of the scale (this is from Russian):

WARNING 2023-01-06 20:39:41,896: 1,576,171,212 input sentences processed
WARNING 2023-01-06 20:39:41,896: 362 sentences missing text
WARNING 2023-01-06 20:39:41,896: 539,046,034 sentences incomplete
WARNING 2023-01-06 20:39:41,896: 772,566 sentences fail LM composition
WARNING 2023-01-06 21:16:35,406: 1,036,352,250 sentences after primary filtration
WARNING 2023-01-08 09:14:13,110: 232,404,041 sentences after secondary filtration
INFO 2023-01-08 09:14:13,117: Writing to ../conllu/cc100/ru.cc100.filtered.conllu.xz...
INFO 2023-01-09 03:22:08,252: Dropping ru_cc100 table
INFO 2023-01-09 10:42:07,085: Filtering complete

To summarize: there were about 1.6b input sentences after mudpipe.py; of these, 362 (inexplicably, but it happens) had no text at all. Roughly half a billion are “incomplete”, meaning they failed the regular expression constraints. A bit less than one million “fail LM composition”; this usually indicates they contain odd, language-inappropriate characters never seen in the (held-out) materials used to train the character LMs. This leaves us with just over one billion sentences for “secondary filtration”. Of these, 232m fall in the middle two quartiles for all the length and entropy measures and are retained. As you can see, secondary filtration took an otherwise-idle desktop about 36 hours, with reserialization and recompression taking about 18 hours, and DB cleanup (not strictly necessary, but sort of like “be kind, rewind”) adding another 7 hours at the end. Not bad, though this could certainly be made to run much faster (possibly with a different database engine designed for parallel writes).

In practice, we find that this produces data that is highly diverse but extremely clean. Should even more data ever be desired, one could easily imagine relaxing the quartile constraints a bit.

[Late-breaking addition: I should probably explain why we want median-entropy text. If one sorts the sentences of a large corpus by bits per character, one will see that the lowest-entropy sentences tend to be boilerplate and the highest-entropy sentences tend to be rubbish. So the middle is “just right” here.]
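A toy illustration of this point, using unigram character frequencies rather than the 6-gram OpenGrm model used in the project:

```python
import math
from collections import Counter

def bpc(sentence: str, model: dict) -> float:
    """Mean negative log2 probability per character: a crude stand-in for
    the 6-gram character-model entropy used in the project."""
    return -sum(math.log2(model.get(char, 1e-6)) for char in sentence) / len(sentence)

# A toy corpus: duplicated boilerplate, one ordinary sentence, and gibberish.
corpus = [
    "Click here to read more.",
    "Click here to read more.",
    "The committee adjourned without a vote.",
    "Xq7#zzv kkj@@ pqw!!",
]
counts = Counter(char for sentence in corpus for char in sentence)
total = sum(counts.values())
model = {char: n / total for char, n in counts.items()}
ranked = sorted(corpus, key=lambda sentence: bpc(sentence, model))
# Boilerplate sorts to the bottom, gibberish to the top; the ordinary
# sentence lands in the middle.
```

Even with this crude model, the repeated boilerplate string gets the lowest bits per character, the gibberish gets the highest, and the ordinary sentence sits in between.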

Acknowledgments

Support for this project was provided by a PSC-CUNY award, jointly funded by the Professional Staff Congress and the City University of New York.

References

Gorman, K., Kirov, C., Roark, B., and Sproat, R. 2021. Structured abbreviation expansion in context. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 995-1005.
Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., and Grave, E. 2020. CCNet: extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003-4012.
Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., and Tai, T. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61-66.
Spektor, Y. 2021. Detection and morphological analysis of novel Russian loanwords. Master’s thesis, Graduate Center, City University of New York.

Large LMs and disinformation

I have never understood the idea that large LMs are uniquely positioned to enable the propagation of disinformation. Let us stipulate, for the sake of argument, that large LMs can generate high-quality disinformation and that its artificial quality (i.e., not being generated by human writers) cannot be reliably detected either by human readers or by computational means. At the same time, I know of no reason to suppose that large LMs can generate better (less detectable, more plausible) disinformation than human writers can. It is then hard to see what advantage there is to using large LMs for disinformation generation beyond a possible economic benefit realized by firing PR writers and replacing them with “prompt engineers”. Ignoring the dubious economics—copywriters are cheap, engineers are expensive—there is a presupposition that disinformation needs to scale, i.e., be generated in bulk, but I see no reason to suppose this either. Disinformation, it seems to me, comes to us either in the form of “big lies” from sources deemed reputable by journalists and lay audiences (think WMDs) or, increasingly, from the crowds (think QAnon).

Character-based speech technology

Right now everyone seems to be moving to character-based speech recognizers and synthesizers. A character-based speech recognizer is an ASR system in which there is no explicit representation of phones, just Unicode codepoints on the output side. Similarly, a character-based synthesizer is a TTS engine without an explicit mapping onto pronunciations, just orthographic inputs. It is generally assumed that the model ought to learn this sort of thing implicitly (and only as needed).

I genuinely don’t understand why this is supposed to be better. Phonemic transcription really does carry more information than orthography in the vast majority of languages, and making it an explicit target is going to do a better job of guiding the model than hoping the system self-organizes. Neural nets trained for language tasks often develop an implicit representation of some linguistically well-defined feature, but they generally do better when that feature is made explicit.

My understanding is that end-to-end systems have potential advantages over pipeline systems when information and uncertainty from earlier steps can be carried through to help later steps. But that doesn’t seem applicable here. Building explicit mappings between words and pronunciations is not all that hard, and the information used to resolve ambiguity is not particularly local. Cherry-picked examples aside, it is not at all clear that these models can handle locally conditioned pronunciation variants (the article a pronounced uh or ay), homographs (the two pronunciations of bass in English), or highly deficient writing systems (think Perso-Arabic) better than the ordinary pipeline approach. One has to suspect that the long tail of these character-based systems is littered with nonsense.

RoboCop

I like a lot of different types of films, but my favorites are the subtextually rich, nuance-light action/science fiction films of the late 1970s, 1980s, and early 1990s, made by directors like Cameron, Carpenter, Cronenberg, McTiernan, Scott, and Verhoeven. Perhaps the most prescient of all of these is RoboCop (1987). The film’s feel is set by over-the-top comic sex and violence and silly diegetic TV clips. In less deft hands, it could easily have become the sort of campy farce best described (or perhaps, denigrated) as a “cult classic”. (This usually means a film is just bad.) But Verhoeven wields sex and violence like a master wields a paintbrush. (I take this to be a sort of self-critique of his childhood aesthetic appreciation of the violence he saw as a boy growing up in Nazi-occupied Holland, not far from the V-2 launch sites.) The film is thematically rich, so much so that one can easily forgive Verhoeven’s apparent decision to leave out (in what is probably the most “dated” element of the film) any overt criticism of policing as an institution. It is ruthlessly critical of what we’d now call neoliberalism and corporatism, and it has much to say about the nature of the self. The theme that strikes me as most prescient is the very modern realization on which the film hinges: that, to a striking degree, what we call “AI” is fundamentally just “other people”, alienated and dehumanized by contractual labor relations. Verhoeven could somehow see this coming decades before anything that could reasonably be called AI.

ACL Workshop on Computation and Written Language

The first ACL Workshop on Computation and Written Language (CAWL) will be held in conjunction with ACL 2023 in Toronto, Canada, on July 13th or 14th 2023 (TBD). It will feature invited talks by Mark Aronoff (Stony Brook University) and Amalia Gnanadesikan (University of Maryland, College Park). We welcome submissions of scientific papers to be presented at the conference and archived in the ACL Anthology. Information on submission and format will be posted at https://cawl.wellformedness.com shortly.

Foundation models

It is widely admitted that the use of language in terms like formal language and language model tends to mislead neophytes, since it suggests the common-sense notion (roughly, e-language) rather than the narrow technical sense referring to a set of strings. Scholars at Stanford have been trying to push foundation model as an alternative to what were previously called large language models. But I don’t really like the implication—which I take to be quite salient—that such models ought to serve as the foundation for NLP, AI, whatever. I use large language models in my research, but not that often, and I don’t think they have to be part of every practitioner’s toolkit. I can’t help thinking that Stanford is trying to “make fetch happen”.

Is NLP stuck?

I can’t help but feel that NLP is once again stuck.

From about 2011 to 2019, I can identify a huge step forward just about every year. But the last thing that truly excited me is BERT, which came out in 2018 and was published in 2019. For those not in the know, the idea of BERT is to pre-train a gigantic language model, with either monolingual or multilingual data. The major pre-training task is masked language model prediction: we pretend some small percentage (usually 15%) of the words in a sentence are obscured by noise and try to predict what they were. Ancillary tasks, like predicting whether two sentences are adjacent or not (or, if they were, in what order), are also used, but appear to be non-essential. Pre-training (done a single time, at some expense, at BigCo HQ) produces a contextual encoder, a model which can embed words and sentences in ways that are useful for many downstream tasks. One can then take this encoder and fine-tune it to some other downstream task (an instance of transfer learning). It turns out that the combination of task-general pre-training on free-to-cheap ordinary text data and a small amount of task-specific fine-tuning on labeled data results in substantial performance gains over what came before. The BERT creators gave away both the software and the pre-trained parameters (which would be expensive for an individual or a small academic lab to reproduce on their own), and an entire ecosystem for sharing pre-trained model parameters has emerged. I see this toolkit-development ecosystem as a sign of successful science.
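The masking objective itself is easy to illustrate. The sketch below masks whole words; actual BERT masks WordPiece subwords and occasionally substitutes random or unchanged tokens instead of [MASK], refinements omitted here:

```python
import random

def mask_tokens(tokens: list, rate: float = 0.15, seed: int = 0):
    """Replaces roughly `rate` of tokens with [MASK]; returns (inputs, targets).

    Targets are None at unmasked positions, which contribute nothing to
    the loss; the model is trained to recover the masked originals.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < rate:
            inputs.append("[MASK]")
            targets.append(token)
        else:
            inputs.append(token)
            targets.append(None)
    return inputs, targets
```

The pre-training data is thus “free”: any ordinary text can be turned into (input, target) pairs without human annotation.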

From my limited perspective, very little has happened since then that is not just more BERTology—that is, exploiting BERT and similar models. The only alternative on the horizon, in the last four years now, is pre-trained large language models without the encoder component, of which the best known are the GPT family (now up to GPT-3). These models do one thing well: they take a text prompt and produce more text that seemingly responds to the prompt. However, whereas BERT and family are free to reuse, GPT-3’s parameters and software are both closed-source and can only be accessed at scale by paying a licensing fee to Microsoft. That is itself a substantial regression compared to BERT. More importantly, though, the GPT family are far less expressive tools than BERT, since they don’t really support fine-tuning. (More precisely, I don’t see any difficult technical barriers to fine-tuning GPT-style models; it’s just not supported.) Thus they can really only be used for one thing: zero-shot text generation tasks, in which the task is “explained” to the model in the input prompt, and the output is also textual. Were it possible to simply write out, in plain English, what you want, and then get the output in a sensible text format, this would of course be revolutionary, but that’s not the case. Rather, GPT has spawned a cottage industry of prompt engineering: a prompt engineer, roughly, is someone who specializes in crafting prompts. It is of course impressive that this can be done at all, but just because an orangutan can be taught to make an adequate omelette doesn’t mean I am going to pay one to make breakfast. I simply don’t see how any of this represents an improvement over the BERT ecosystem, which at least is easy to use, free, and open-source. And as you might expect, GPT’s zero-shot approach is quite often much worse than what one would obtain with the light supervision of the BERT-style fine-tuning approach.

The next toolkit 2: electric boogaloo

I just got back from the excellent Workshop on Model Theoretic Representations in Phonology. While I am not exactly a member of the “Delaware school”, i.e., the model-theoretic phonology (MTP) crowd, I am a big fan. In my talk, I contrasted the model-theoretic approach to an approach I called the black box approach, using neural networks and program synthesis solvers as examples of the latter. I likened the two styles to neat vs. scruffy, better is better vs. worse is better, rationalists vs. empiricists, and cowboys vs. aliens.

One lesson I drew from this comparison is the need for MTPists to develop high-quality software—the next toolkit 2. I didn’t say much during my talk about what I imagine this to be like, so I thought I’d leave my thoughts here. Several people—Alëna Aksënova, Hossep Dolatian, and Dakotah Lambert, for example—have developed interesting MTP-oriented libraries. While I do not want to give short shrift to their work, I think there are two useful models for the next toolkit: (my own) Pynini and PyTorch. Here is what I see as their key features:

  1. They are ordinary Python on the front-end. Of course, both have a C++ back-end, and PyTorch has a rarely used C++ API, but that’s purely a matter of performance; both have been slowly moving Python code into the C++ layer over the course of their development. The fact of the matter is that in 2022, just about anyone who can code at all can do so in Python.
  2. While both are devilishly complex, their design follows the principle of least surprise; there is only a bit of what Pythonistas call exuberant syntax (Pynini’s use of the @ operator, PyTorch’s use of _ to denote in-place methods).
  3. They have extensive documentation (both in-module and up to book length).
  4. They have extensive test suites.
  5. They are properly packaged and can be installed from PyPI (i.e., via pip) or Conda-Forge (via conda).
  6. They have corporate backing.

I understand that many in the MTP community are naturally—constitutionally, even—drawn to functional languages and literate programming. I think this should not be the initial focus. It should be ease of use, and for that it is hard to beat ordinary Python in 2022. Jupyter/Colab support is a great idea, though, and might satisfy the literate programming itch too.