Vibe check: EACL 2024

I was honored to be able to attend EACL 2024 in Malta last month. The following is a brief, opinionated “vibe check” on NLP based on my experiences there. I had never been to an EACL, but it appealed to me because I’ve always respected the European speech & language processing community’s greater interest in multilingualism compared to what I’m familiar with in the US. And, because when or why else would I get to see Malta? The scale of EACL is a little more manageable than what I’m used to, and I was able to take in nearly every session and keynote. Beyond that, there wasn’t much difference. Here are some trends I noticed.

We’re doing prompt engineering, but we’re not happy about it

It’s hard to get a research paper out of prompt engineering. There really isn’t much to report, except the prompts used and the evaluation results. And, there doesn’t seem to be the slightest theory about how one ought to design a prompt, suggesting that the engineering part of the term is doing a lot of work. So, while I did see some papers (to be fair, mostly student posters) about prompt engineering, the interesting ones actually compared prompting against a custom-built solution.

There’s plenty of headroom for older technologies

I was struck by one of the demonstration papers, which was using fine-tuned BERT for the actual user-facing behaviors, but an SVM or some other type of simple linear model trained on the same data to provide “explanability”. I was also struck by the many papers I saw in which fine-tuned BERT or some other kind of custom-built solution outperformed prompting.

Architectural engineering is dead for now

I really enjoy learning about new “architectures”, i.e., ways to frame speech and language processing problems as a neural network. Unfortunately, I didn’t learn about any new ones this year. I honestly think the way forward, in the long term, will be to identify and eliminate the less-principled parts of our modeling strategies, and replace them with “neat”, perhaps even proof-theoretic, solutions, but I’m sad to say this is not a robust area.

Massive multilingualism needs new application areas

In the first half of Hinrich Schütze’s keynote, he discussed a massively multilingual study covering 1,500 languages in all. That itself is quite impressive. However, I was less impressed with the tasks targeted. One was an LM-based task (predicting the next word, or perhaps a masked word), evaluated with “pseudo-perplexity”. I’m not sure what pseudo-perplexity is but real perplexity isn’t good for much. The other task was predicting, for each verse from the Bible, the appropriate topic code; these topics are things like “recommendation”, “sin”, “grace”, or “violence”. Doing some kind of semantic prediction, at the verse/sentence level, at such scale might be interesting, but this particular instantiation seems to me to be of no use to anyone, and as I understand it, the labels were projected from those given by English annotators, which makes the task less interesting. Let me be clear, I am not calling out Prof. Schütze, for whom I have great respect—and the second half of his talk was very impressive—but I challenge researchers working at massively multilingual scale to think of tasks really worth doing!

We’ve always been at war with Eurasia

I saw at least two pro-Ukraine papers, both focused on the media environment (e.g., propaganda detection). I also saw a paper about media laws in Taiwan that raised some ethical concerns for me. It seems this may be one of those countries where truth is not a defense against charges of libel, and the application was helping the police enforce that illiberal policy. However, I am not at all knowledgeable about the political situation there and found their task explanation somewhat hard to follow, presumably because of my Taiwanese political illiteracy.

My papers

Adam Wiemerslage presented a paper coauthored with me and Katharina von der Wense in which we propose model-agnostic metrics for measuring hyperparameter sensitivity, the first of their kind. We then use these metrics to show that, at least for the character-scale transduction problems we study (e.g., grapheme-to-phoneme conversion and morphological generation), LSTMs really are less hyperparameter-sensitive than transformers, not to mention more accurate when properly tuned. (Our tuned LSTMs turn in SOTA performance on most of the languages and tasks.) I thought this was a very neat paper, but it didn’t get much burn from the audience either.

I presented a paper coauthored with Cyril Allauzen describing a new algorithm for shortest-string decoding that makes fewer assumptions. Indeed, it allows one for the first time to efficiently decode traditional weighted finite automata trained with expectation maximization (EM). This was exciting to me because this is a problem that has bedeviled me for over 15 years now when I first noticed the conceptual gap. <whine>The experience getting this to press was a great frustration to me, however. It was first desk-rejected at a conference on grammatical inference (i.e., people who study things like formal language learning) on the grounds that it was too applied. On the other hand, the editors at TACL desk-rejected a draft of the paper on the grounds that no one does EM anymore, and didn’t respond when I pointed out that there were in fact two papers in the ACL 2023 main session about EM. So we submitted it ARR. The first round of reviews were not much more encouraging. It was clear that these reviewers did not understand the important distinction between the shortest path and shortest string, even though the paper was almost completely self-contained, and were perhaps annoyed at being asked to read mathematics (even if it’s all basic algebra). One reviewer even dared to asked why one would bother, as we do, to prove that our algorithm is correct! To the area chair’s credit, they found better reviewers for the second round, and to those reviewers’ credits, they helped us improve the quality of the paper. However, the first question I got in the talk was basically a heckler asking why I’d bother to submit this kind of work to an ACL venue. Seriously though, where else should I have submitted it? It’s sound work.</whine>

“Segmented languages”

In a recent paper (Gorman & Sproat 2023), we complain about conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like “right-to-left language”, “syllabic language” or “ideographic” language found in the literature. Thus we were surprised to find the following:

Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate (CER), instead of WER… (Gemini Team 2024:18)

Since the most salient feature of the writing systems used to write Mandarin, Japanese, Korean, and Thai is the absence of segmentation information (e.g., whitespace used to indicate word boundaries), presumably the authors mean to say that the data they are using has already been pre-segmented (by some unspecified means). But this is not a property of these languages, but rather of the available data.

[h/t: Richard Sproat]

References

Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint 2312.11805. URL: https://arxiv.org/abs/2312.11805.

Gorman, K. and Sproat, R.. 2023. Myths about writing systems in speech & language technology. In Proceedings of the Workshop on Computation and Written Language, pages 1-5.

Was rongorongo an independent invention of writing?

My colleague Richard Sproat and I recently went through it with “general science” journal editors regarding an overhyped- and undercooked paper about the origins of writing on Easter Island. Language Log has the scoop.

Growing consensus

Any time I read a paper that begins, roughly, “there is a growing consensus that P“, there is not in fact, as far as I can tell, a growing consensus in support of P.

Optionality as acquirendum

A lot of work deals with the question of acquiring “optional” or “variable” grammatical rules, and my impression is that different communities are mostly talking at cross-purposes. I discern at least three ways linguists conceive of optionality as something which the child must acquire.

Some linguists assume—I think without much evidence—that optionality is mere “free variation”, so that the learner simply needs to infer which rules bear a binary [optional] feature. This is an old idea, going back to at least Dell (1981); Rasin et al. (2021:35) explicitly state the problem in this form.
Variationist sociolinguists focus on the differential rates at which grammatical rules apply. They generally recognize the acquirenda as essentially conditional probability distributions which give the probability of rule application in a given grammatical context. Bill Labov is a clear avatar of this strain of thinking (e.g., Labov 1989). David Adger and colleagues have attempted to situate this within modern syntactic frameworks (e.g., Adger 2006).
Some linguists believe that optionality is not statable within a single grammar, and must reflect the competing grammars. The major proponent of this approach is Anthony Kroch (e.g., Kroch 1989). While this conception might license some degree of “nihilism” about optionality, it also has led to some interesting work which hypothesizes interesting substantive constraints on grammar-internal constraints on variation as in the work of Laurel MacKenzie and colleagues (e.g., MacKenzie 2019). This work is also very good at ridding the (2) of some of its unfortunate “externalist” thinking.

I have to reject (1) as overly simplicistic. I find (2) and (3) both compelling in some way but a lot of work remains to synthesize or adjudicate between them.

References

Adger, D. 2006. Combinatorial variability. Journal of Linguistics 42(3): 503-530.
Dell, F. 1981. On the learnability of optional phonological rules. Linguistic Inquiry 12(1): 31-37.
Kroch, A. 1989. Reflexes of grammar in patterns of language change. Language Variation & Change 1(1): 199-244.
Labov, W. 1989. The child as linguistic historian. Language Variation & Change 1(1): 85-97.
MacKenzie, L. 2019. Perturbing the community grammar: Individual differences and community-level constraints on sociolinguistic variation. Glossa 4(1): 28.
Rasin, E., Berger, I., Lan, R., Shefi, I., and Katzir, R. 2021. Approaching explanatory adequacy in phonology using Minimum Description Length. Journal of Language Modelling 9(1): 17-66.

Kill yr darlings…

…or at least make them more rigorous.

In the field of computational phonology, there were three mid-pandemic articles that presented elaborate computational “theories of everything” in phonology: Ellis et al. (2022), Rasin et al. (2021), and Yang & Piantadosi (2022).¹ I am quite critical of all three offerings. All three provide computational models evaluated for their ability to acquire phonological patterns—with varying amounts overheated rhetoric about what this means for generative grammar—and in each case, there is a utter lack of rigor. None of the papers prove, or even conjecture, anything hopeful or promising about the computational complexity of the proposed models, how long they take to converge (or if they do), or whether there is any bound on the kinds of mistakes the models might make once they converge. What they do instead is demonstrate that the models produce satisfactory results on toy problem sets. One might speculate that these three papers are the result of lockdown-era hyperfocus on thorny passion projects. But I think it’s unfortunate that the authors (and doubly so the reviewers and editors) considered these projects complete before providing formal characterization of the proposed models’ substantive properties.² By stating this critique here, I hopefully commit myself to align actions with my values in my future work, and I challenge the aforementioned authors to study these properties.

Endnotes

To be fair, Yang and Piantadosi claims to be a theory of not just phonology…
I am permitted to state that I reviewed one of these papers—my review was “signed” and made public, along with the paper—and my review was politely negative. However, it was clear to me that the editor and other reviewers had a very high opinion of this work and there was no reason for me to fight the inevitable.

References

Ellis, K., Albright, A., Solar-Lezama, A., Tenenbaum, J. B., and O’Donnell, T. J. 2022. Synthesizing theories of human language with bayesian program induction. Nature Communications 2022:1–13.
Rasin, E., Berger, I., Lan, N., Shefi, I. and Katzir, R. 2021. Approaching explanatory adequacy in phonology using Minimum Description Length. Journal of Language Modelling 9:17–66.
Yang, Y. and Piantadosi, S. T. 2022. One model for the learning of language. Proceedings of the National Academy of Sciences 119:e2021865119.

Yet more on the Pirahã debate

I just read a draft of Geoff Pullum’s paper on the Pirahã controversy, presented at a workshop of the recent LSA meeting.

It’s not a particularly interesting paper to me, since it has nothing to say about the conflicting data claims at the center of the controversy. No one has ever given an explanation of how one might integrate the evidence for clausal embedding in Everett 1986 (etc.) with the writings of Everett from 2005 onward. These two Everetts are in mortal conflict. Everett (1986), for example gives examples of embedded clauses, Everett (2005) denies that the language has clausal embedding, and Everett (2009), faced with the contradiction, has decided to gloss this same example (Nevins et al. 2009, ex. 13, reproduced from Everett 1986, ex. 232) as two sentences, with no argument provided for why earlier Everett was wrong. While one ought not to reason from one’s own limited imagination, it’s hard for me to fathom anything other than incompetence in 1987 or dishonesty 2005-present. Either way, it suggests that additional attention is probably needed on other specific claims about this language, such as the presence of rare phonetic elements (Everett 1988a) and the presence of ternary metrical feet (Everett 1988b); and on these topics there is far less room for creative hermeneutics.

If people have been nasty to Everett—and this seems to be the real complaint from Pullum—it’s because the whole affair stinks to high heaven; it’s a shame Pullum can’t smell the bullshit.

References

Everett, D. L. 1986. Pirahã. In Handbook of Amazonian Languages, vol. 1, D. C. Derbyshire and G. K. Pullum (ed.), pages 200-326. Mouton de Gruyter.
Everett, D. L. 1988a. Phonetic rarities in Pirahã. Journal of the International Phonetic Association 12: 94-96.
Everett, D. L. 1988b. On metrical constituent structure in Pirahã. Natural Language & Linguistic Theory 6: 207-246.
Everett, D. L. 2005. Cultural constraints on grammar and cognition in Pirahã: another look at the design features of human language. Current Anthropology 46: 621-646.
Everett, D. L. 2009. Pirahã culture and grammar: a response to some criticisms. Language 85: 405-442.
Nevins, A., Pesetsky, D., and Rodrigues, C. 2009. Pirahã exceptionality: a reassessment. Language 85: 355-404.

Streaming decompression for the Reddit dumps

I was recently working with the Reddit comments and submission dumps from PushShift (RIP).¹ These are compressed in Zstandard .zstformat. Unfortunately, Python’s extensive standard library doesn’t have native support for this format, and the some of the files are quite large,² so a streaming API is necessary.

After trying various third-party libraries, I finally found one that worked with a minimum of fuss: pyzstd, available from PyPI or Conda. This appears to be using ~~Facebook~~Meta’s reference C implementation as the backend, but more importantly, it provides a stream API like the familiar gzip.open, bz2.open, and lzma.open for .gz, .bz2 and .xz files, respectively. There’s one nit: PushShift’s Reddit dumps were compressed with an uncommonly large window size (2 << 31), and one has to inform the decompression backend. Without this, I was getting the following error:

_zstd.ZstdError: Unable to decompress zstd data: Frame requires too much memory for decoding.

All I have to do to fix this is to pass the relevant parameter:

PARAMS = {pyzstd.DParameter.windowLogMax: 31}

with pystd.open(yourpath, "rt", level_or_options=PARAMS) as source:
    for line in source:
        ...

Then, each line is a JSON message with the post (either a comment or submission) and all the metadata.

Endnotes

Psst, don’t tell anybody, but… while these are no longer being updated they are available through December 2023 here. We have found them useful!
Unfortunately, they’re grouped first by comments vs. submissions, and then by month. I would have preferred the files to be grouped by subreddit instead.

Alt-lingfluencers

It’s really none of my business whether or not a linguist decides to leave the field. Several people I consider friends have, and while I miss seeing them at conferences, none of them were close collaborators. Reasonable people can disagree about just how noble it is to be a professor (I think it is, or can be, but it’s not a major part of my self-worth), and I certainly understand why one might prefer a job in the private sector. At the same time, I think linguists wildly overestimate how easy it is to get rewarding, lucrative work in the private sector, and also overestimate how difficult that work can be on a day-to-day basis. (Private sector work, like virtually everything else in the West, has gotten substantially worse—more socially alienating, more morally compromising—in the last ten years.)

In this context, I am particularly troubled by the rise of a small class of “alt-ac” ex-linguist influencers. I realize there is a market for advice on how to transition careers, and there are certainly honest people working in this space. (For instance, my department periodically invites graduates from our program to talk about their private sector jobs.) But what the worst of the alt-lingfluencers do in actuality is farm for engagement and prosecute grievances from their time in the field. If they were truly happy with their career transitions, they simply wouldn’t care enough—let alone have the time—to post about their obsessions for hours every day. These alt-lingfluencers were bathed in privilege when they were working linguists, so to see them harangue against the field is a bit like listening to a lottery winner telling you not to play. These are deeply unhappy people, and unless you know them well enough to check in on their well-being from time to time, you should pay them no mind. You’d be doing them a favor, in the end. Narcissism is a disease: get well soon.

Chomsky & Katz (1974) on language diversity

Chomsky and others, Stich asserts do not really study a broad range of languages in attempting to construct theories about universal grammatical structure and language acquisition, but merely speculate on the basis of “a single language, or at best a few closely related languages” (814). Stich’s assertion is both false and irrelevant. Transformational grammarians have investigated languages drawn from a wide range of unrelated language families. But this is beside the point, since even if Stich were right in saying that all but a few closely related languages have been neglected by transformational grammarians, this would imply only that they ought to get busy studying less closely related languages, not that there is some problem in relating grammar construction to the study of linguistic universals. (Chomsky & Katz 1974:361)

References

Chomsky, N. and Katz, G. 1974. What the linguist is talking about. Journal of Philosophy 71(2): 347-367.