I was honored to be able to attend EACL 2024 in Malta last month. The following is a brief, opinionated “vibe check” on NLP based on my experiences there. I had never been to an EACL, but it appealed to me because I’ve always respected the European speech & language processing community’s greater interest in multilingualism compared to what I’m familiar with in the US. And, because when or why else would I get to see Malta? The scale of EACL is a little more manageable than what I’m used to, and I was able to take in nearly every session and keynote. Beyond that, there wasn’t much difference. Here are some trends I noticed.

We’re doing prompt engineering, but we’re not happy about it

It’s hard to get a research paper out of prompt engineering. There really isn’t much to report, except the prompts used and the evaluation results. And, there doesn’t seem to be the slightest theory about how one ought to design a prompt, suggesting that the engineering part of the term is doing a lot of work. So, while I did see some papers (to be fair, mostly student posters) about prompt engineering, the interesting ones actually compared prompting against a custom-built solution.

There’s plenty of headroom for older technologies

I was struck by one of the demonstration papers, which was using fine-tuned BERT for the actual user-facing behaviors, but an SVM or some other type of simple linear model trained on the same data to provide “explanability”. I was also struck by the many papers I saw in which fine-tuned BERT or some other kind of custom-built solution outperformed prompting.

Architectural engineering is dead for now

I really enjoy learning about new “architectures”, i.e., ways to frame speech and language processing problems as a neural network. Unfortunately, I didn’t learn about any new ones this year. I honestly think the way forward, in the long term, will be to identify and eliminate the less-principled parts of our modeling strategies, and replace them with “neat”, perhaps even proof-theoretic, solutions, but I’m sad to say this is not a robust area.

Massive multilingualism needs new application areas

In the first half of Hinrich Schütze’s keynote, he discussed a massively multilingual study covering 1,500 languages in all. That itself is quite impressive. However, I was less impressed with the tasks targeted. One was an LM-based task (predicting the next word, or perhaps a masked word), evaluated with “pseudo-perplexity”. I’m not sure what pseudo-perplexity is but real perplexity isn’t good for much. The other task was predicting, for each verse from the Bible, the appropriate topic code; these topics are things like “recommendation”, “sin”, “grace”, or “violence”. Doing some kind of semantic prediction, at the verse/sentence level, at such scale might be interesting, but this particular instantiation seems to me to be of no use to anyone, and as I understand it, the labels were projected from those given by English annotators, which makes the task less interesting. Let me be clear, I am not calling out Prof. Schütze, for whom I have great respect—and the second half of his talk was very impressive—but I challenge researchers working at massively multilingual scale to think of tasks really worth doing!

We’ve always been at war with Eurasia

I saw at least two pro-Ukraine papers, both focused on the media environment (e.g., propaganda detection). I also saw a paper about media laws in Taiwan that raised some ethical concerns for me. It seems this may be one of those countries where truth is not a defense against charges of libel, and the application was helping the police enforce that illiberal policy. However, I am not at all knowledgeable about the political situation there and found their task explanation somewhat hard to follow, presumably because of my Taiwanese political illiteracy.

My papers

Adam Wiemerslage presented a paper coauthored with me and Katharina von der Wense in which we propose model-agnostic metrics for measuring hyperparameter sensitivity, the first of their kind. We then use these metrics to show that, at least for the character-scale transduction problems we study (e.g., grapheme-to-phoneme conversion and morphological generation), LSTMs really are less hyperparameter-sensitive than transformers, not to mention more accurate when properly tuned. (Our tuned LSTMs turn in SOTA performance on most of the languages and tasks.) I thought this was a very neat paper, but it didn’t get much burn from the audience either.

I presented a paper coauthored with Cyril Allauzen describing a new algorithm for shortest-string decoding that makes fewer assumptions. Indeed, it allows one for the first time to efficiently decode traditional weighted finite automata trained with expectation maximization (EM). This was exciting to me because this is a problem that has bedeviled me for over 15 years now when I first noticed the conceptual gap. <whine>The experience getting this to press was a great frustration to me, however. It was first desk-rejected at a conference on grammatical inference (i.e., people who study things like formal language learning) on the grounds that it was too applied. On the other hand, the editors at TACL desk-rejected a draft of the paper on the grounds that no one does EM anymore, and didn’t respond when I pointed out that there were in fact two papers in the ACL 2023 main session about EM. So we submitted it to ARR. The first round of reviews were not much more encouraging. It was clear that these reviewers did not understand the important distinction between the shortest path and shortest string, even though the paper was almost completely self-contained, and were perhaps annoyed at being asked to read mathematics (even if it’s all basic algebra). One reviewer even dared to asked why one would bother, as we do, to prove that our algorithm is correct! To the area chair’s credit, they found better reviewers for the second round, and to those reviewers’ credits, they helped us improve the quality of the paper. However, the first question I got in the talk was basically a heckler asking why I’d bother to submit this kind of work to an ACL venue. Seriously though, where else should I have submitted it? It’s sound work.</whine>

In a recent paper (Gorman & Sproat 2023), we complain about conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like “right-to-left language”, “syllabic language” or “ideographic” language found in the literature. Thus we were surprised to find the following:

Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate (CER), instead of WER… (Gemini Team 2024:18)

Since the most salient feature of the writing systems used to write Mandarin, Japanese, Korean, and Thai is the absence of segmentation information (e.g., whitespace used to indicate word boundaries), presumably the authors mean to say that the data they are using has already been pre-segmented (by some unspecified means). But this is not a property of these languages, but rather of the available data.

[h/t: Richard Sproat]

References

Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint 2312.11805. URL: https://arxiv.org/abs/2312.11805.

Gorman, K. and Sproat, R.. 2023. Myths about writing systems in speech & language technology. In Proceedings of the Workshop on Computation and Written Language, pages 1-5.

Month: April 2024

A minor syntactic innovation in English: “is crazy”

Vibe check: EACL 2024