Vibe check: EACL 2024

I was honored to be able to attend EACL 2024 in Malta last month. The following is a brief, opinionated “vibe check” on NLP based on my experiences there. I had never been to an EACL, but it appealed to me because I’ve always respected the European speech & language processing community’s greater interest in multilingualism compared to what I’m familiar with in the US. And, because when or why else would I get to see Malta? The scale of EACL is a little more manageable than what I’m used to, and I was able to take in nearly every session and keynote. Beyond that, there wasn’t much difference. Here are some trends I noticed.

We’re doing prompt engineering, but we’re not happy about it

It’s hard to get a research paper out of prompt engineering. There really isn’t much to report, except the prompts used and the evaluation results. And, there doesn’t seem to be the slightest theory about how one ought to design a prompt, suggesting that the engineering part of the term is doing a lot of work. So, while I did see some papers (to be fair, mostly student posters) about prompt engineering, the interesting ones actually compared prompting against a custom-built solution.

There’s plenty of headroom for older technologies

I was struck by one of the demonstration papers, which was using fine-tuned BERT for the actual user-facing behaviors, but an SVM or some other type of simple linear model trained on the same data to provide “explainability”. I was also struck by the many papers I saw in which fine-tuned BERT or some other kind of custom-built solution outperformed prompting.

Architectural engineering is dead for now

I really enjoy learning about new “architectures”, i.e., ways to frame speech and language processing problems as a neural network. Unfortunately, I didn’t learn about any new ones this year. I honestly think the way forward, in the long term, will be to identify and eliminate the less-principled parts of our modeling strategies, and replace them with “neat”, perhaps even proof-theoretic, solutions, but I’m sad to say this is not a robust area.

Massive multilingualism needs new application areas

In the first half of Hinrich Schütze’s keynote, he discussed a massively multilingual study covering 1,500 languages in all. That itself is quite impressive. However, I was less impressed with the tasks targeted. One was an LM-based task (predicting the next word, or perhaps a masked word), evaluated with “pseudo-perplexity”. I’m not sure what pseudo-perplexity is but real perplexity isn’t good for much. The other task was predicting, for each verse from the Bible, the appropriate topic code; these topics are things like “recommendation”, “sin”, “grace”, or “violence”. Doing some kind of semantic prediction, at the verse/sentence level, at such scale might be interesting, but this particular instantiation seems to me to be of no use to anyone, and as I understand it, the labels were projected from those given by English annotators, which makes the task less interesting. Let me be clear, I am not calling out Prof. Schütze, for whom I have great respect—and the second half of his talk was very impressive—but I challenge researchers working at massively multilingual scale to think of tasks really worth doing!

We’ve always been at war with Eurasia

I saw at least two pro-Ukraine papers, both focused on the media environment (e.g., propaganda detection). I also saw a paper about media laws in Taiwan that raised some ethical concerns for me. It seems this may be one of those countries where truth is not a defense against charges of libel, and the application was helping the police enforce that illiberal policy. However, I am not at all knowledgeable about the political situation there and found their task explanation somewhat hard to follow, presumably because of my Taiwanese political illiteracy.

My papers

Adam Wiemerslage presented a paper coauthored with me and Katharina von der Wense in which we propose model-agnostic metrics for measuring hyperparameter sensitivity, the first of their kind. We then use these metrics to show that, at least for the character-scale transduction problems we study (e.g., grapheme-to-phoneme conversion and morphological generation), LSTMs really are less hyperparameter-sensitive than transformers, not to mention more accurate when properly tuned. (Our tuned LSTMs turn in SOTA performance on most of the languages and tasks.) I thought this was a very neat paper, but it didn’t get much burn from the audience either.

I presented a paper coauthored with Cyril Allauzen describing a new algorithm for shortest-string decoding that makes fewer assumptions. Indeed, it allows one for the first time to efficiently decode traditional weighted finite automata trained with expectation maximization (EM). This was exciting to me because this problem has bedeviled me ever since I first noticed the conceptual gap over 15 years ago. <whine>The experience getting this to press was a great frustration to me, however. It was first desk-rejected at a conference on grammatical inference (i.e., people who study things like formal language learning) on the grounds that it was too applied. On the other hand, the editors at TACL desk-rejected a draft of the paper on the grounds that no one does EM anymore, and didn’t respond when I pointed out that there were in fact two papers in the ACL 2023 main session about EM. So we submitted it to ARR. The first round of reviews was not much more encouraging. It was clear that these reviewers did not understand the important distinction between the shortest path and shortest string, even though the paper was almost completely self-contained, and were perhaps annoyed at being asked to read mathematics (even if it’s all basic algebra). One reviewer even dared to ask why one would bother, as we do, to prove that our algorithm is correct! To the area chair’s credit, they found better reviewers for the second round, and to those reviewers’ credit, they helped us improve the quality of the paper. However, the first question I got in the talk was basically a heckler asking why I’d bother to submit this kind of work to an ACL venue. Seriously though, where else should I have submitted it? It’s sound work.</whine>

“Segmented languages”

In a recent paper (Gorman & Sproat 2023), we complain about conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like “right-to-left language”, “syllabic language” or “ideographic language” found in the literature. Thus we were surprised to find the following:

Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate (CER), instead of WER… (Gemini Team 2024:18)

Since the most salient feature of the writing systems used to write Mandarin, Japanese, Korean, and Thai is the absence of segmentation information (e.g., whitespace used to indicate word boundaries), presumably the authors mean to say that the data they are using has already been pre-segmented (by some unspecified means). But this is not a property of these languages, but rather of the available data.

[h/t: Richard Sproat]

References

Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint 2312.11805. URL: https://arxiv.org/abs/2312.11805.

Gorman, K. and Sproat, R. 2023. Myths about writing systems in speech & language technology. In Proceedings of the Workshop on Computation and Written Language, pages 1-5.

Streaming decompression for the Reddit dumps

I was recently working with the Reddit comments and submission dumps from PushShift (RIP).¹ These are compressed in Zstandard .zst format. Unfortunately, Python’s extensive standard library doesn’t have native support for this format, and some of the files are quite large,² so a streaming API is necessary.

After trying various third-party libraries, I finally found one that worked with a minimum of fuss: pyzstd, available from PyPI or Conda. This appears to be using ~~Facebook~~Meta’s reference C implementation as the backend, but more importantly, it provides a stream API like the familiar gzip.open, bz2.open, and lzma.open for .gz, .bz2 and .xz files, respectively. There’s one nit: PushShift’s Reddit dumps were compressed with an uncommonly large window size (1 << 31, i.e., 2^31 bytes), and one has to inform the decompression backend. Without this, I was getting the following error:

_zstd.ZstdError: Unable to decompress zstd data: Frame requires too much memory for decoding.

All I have to do to fix this is to pass the relevant parameter:

import pyzstd

# A window log of 31 permits windows of up to 2^31 bytes.
PARAMS = {pyzstd.DParameter.windowLogMax: 31}

with pyzstd.open(yourpath, "rt", level_or_option=PARAMS) as source:
    for line in source:
        ...

Then, each line is a JSON message with the post (either a comment or submission) and all the metadata.
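Here is a minimal sketch of what one might do with those lines once they are streaming. It works on any iterable of strings, so the `source` handle from above can be passed in directly. The helper names `iter_posts` and `count_by_subreddit` are my own inventions for illustration, and the "subreddit" field name is an assumption about the dump schema that you should check against your own files.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_posts(lines: Iterable[str]) -> Iterator[Dict]:
    """Yields one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_subreddit(lines: Iterable[str]) -> Counter:
    """Counts posts per subreddit; "subreddit" is an assumed field name."""
    counts = Counter()
    for post in iter_posts(lines):
        counts[post.get("subreddit", "<missing>")] += 1
    return counts


# Hypothetical stand-in for the decompressed stream.
sample = [
    '{"subreddit": "linguistics", "body": "hello"}',
    '{"subreddit": "linguistics", "body": "world"}',
    '{"subreddit": "python", "body": "spam"}',
]
print(count_by_subreddit(sample))
```

Because both functions consume the lines lazily, only one post is held in memory at a time, which is the whole point of streaming these files.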

Endnotes

¹ Psst, don’t tell anybody, but… while these are no longer being updated they are available through December 2023 here. We have found them useful!
² Unfortunately, they’re grouped first by comments vs. submissions, and then by month. I would have preferred the files to be grouped by subreddit instead.

Lottery winners

It is commonplace to compare the act of securing a permanent faculty position in linguistics to winning the lottery. I think this is mostly unfair. There are fewer jobs than interested applicants, but the demand is higher, and the supply lower, than students these days suppose. And my junior faculty colleagues mostly got to where they are by years of dedicated, focused work. Because there are a lot of pitfalls on the path to the tenure track, their egos are often a lot smaller than one might suppose.

I wonder if the lottery ticket metaphor might be better applied to graduate trainees in linguistics finding work in the tech sector. I have held both types of positions, and I think I had to work harder to get into tech than to get back into the academy. Some of the “alt-ac influencers” in our field—the ones who ended up in tech, at least—had all the privileges in the world, including some reasonably prestigious teaching positions, before they made the jump. Being able to stay and work in the US—where the vast majority of this kind of work is—requires a sort of luck too, particularly when you reject the idea that “being American” is some kind of default. And finally demand for linguist labor in the tech sector varies enormously from quarter to quarter, meaning that some people are going to get lucky and others won’t.

The Unicoder

I have long encouraged students to turn software demos (which work on their laptop, in their terminal, and maybe nowhere else) into simple web apps. Years ago I built a demo of what this might look like, using Python’s Flask library. The entire app is under 200 lines of Python (and jinja2 template), plus a bit of static HTML and CSS.

It turns out this little demonstration is actually quite useful for my research. For any given string, it gives you the full decomposition of it into Unicode codepoints, with optional Unicode normalization, whitespace stripping, and case-folding. This is very useful for debugging.
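For the curious, the core of such a tool can be sketched in a few lines using the standard library’s unicodedata module. This is a rough approximation, not the actual Unicoder source: the function name `decompose` and its parameters are hypothetical, and the order of operations (strip, then case-fold, then normalize) is just one reasonable choice.

```python
import unicodedata
from typing import List, Optional, Tuple


def decompose(
    text: str,
    normalization: Optional[str] = None,
    strip: bool = False,
    casefold: bool = False,
) -> List[Tuple[str, str]]:
    """Returns (codepoint, name) pairs for each character in the string.

    normalization, if given, is one of "NFC", "NFD", "NFKC", or "NFKD".
    """
    if strip:
        text = text.strip()
    if casefold:
        text = text.casefold()
    if normalization:
        # Normalizing last keeps the output in the requested form even
        # when case-folding changes the characters.
        text = unicodedata.normalize(normalization, text)
    return [
        (f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>"))
        for char in text
    ]


# A precomposed e-acute splits into a base letter and a combining accent.
for codepoint, name in decompose("\u00e9", normalization="NFD"):
    print(codepoint, name)
```

The fallback string passed to unicodedata.name is there because some characters (e.g., control characters) have no name, and the call would otherwise raise ValueError.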

The Unicoder, as it is called, is hosted on the free tier of Glitch. [Edit: it is now on Render.] (It used to also be on Heroku, but Salesforce is actively pushing people off that very useful platform.) Because of that, it takes about 10 seconds to “start up” (i.e., I assume the workers are put into some kind of hibernation mode) if it hasn’t been used in the last half hour or so. But, it’s very, very useful.

Robot autopsies

I don’t really understand the exuberance for studying whether neural networks know syntax. I have a lot to say about this issue (I’ll return to it later), but for today I’d like to briefly discuss this passage from a recent(ish) paper by Baroni (2022). The author expresses great surprise that few formal linguists have cited a particular paper (Linzen et al. 2016) about the ability of neural networks to learn long-distance agreement phenomena. (To be fair, Baroni is not a coauthor of said paper.) He then continues:

While it is possible that deep nets are relying on a completely different approach to language processing than the one encoded in human linguistic competence, theoretical linguists should investigate what are the building blocks making these systems so effective: if not for other reasons, at least in order to explain why a model that is supposedly encoding completely different priors than those programmed into the human brain should be so good at handling tasks, such as translating from a language into another, that should presuppose sophisticated linguistic knowledge. (Baroni 2022: 11).

I think this passage is a useful jumping-off point for my own view. I want to be clear: I am not “picking on” Baroni, who is probably far more senior than, and certainly better known than, me anyways; this is just a particularly clearly written claim, and I just happen to disagree.

Baroni says it is “possible that deep nets are relying on a completely different approach to language processing…” than humans; I’d say it’s basically certain that they are. We simply have no reason to think they might be using similar mechanisms since humans and neural networks don’t contain any of the same ingredients. Any similarities will naturally be analogies, not homologies.

Without a strong reason to think neural models and humans share some kind of cognitive homologies, there is no reason for theoretical linguists to investigate them; as artifacts of human culture they are no more in the domain of study for theoretical linguists than zebra finches, carburetors, or the perihelion of Mercury.

It is not even clear how one ought to poke into the neural black box. Complex networks are mostly resistant to the kind of proof-theoretic techniques that mathematical linguists (witness the Delaware school or even just work by, say, Tesar) actually rely on, and most of the results are both negative and of minimal applicability: for instance, we know that there always exists a network with a single hidden layer large enough to encode, with arbitrary precision, any function a multi-layer network encodes, but we have no way to figure out how big is big enough for a given function.

Probing and other interpretative approaches exist, but have not yet proved themselves, and it is not clear that theoretical linguists have the relevant skills to push things forward anyways. Quality assurance and adversarial data generation are not exactly high-status jobs; how can Baroni demand that Cinque or Rizzi (to choose two of Baroni’s well-known countrymen) put down their chalk and start doing free or poorly-paid QA for Microsoft?

Why should theoretical linguists of all people be charged with doing robot autopsies when the creators of the very same robots are alive and well? Either it’s easy and they’re refusing to do the work, or—and I suspect this is the case—it’s actually far beyond our current capabilities and that’s why little progress is being made.

I for one am glad that, for the time being, most linguists still have a little more self-respect.

References

Baroni, M. 2022. On the proper role of linguistically oriented deep net analysis in linguistic theorising. In S. Lappin (ed). Algebraic Structures in Natural Language, pages 1-16. Taylor & Francis.
Linzen, T., Dupoux, E., and Goldberg, Y. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4: 521-535.

ACL Rolling Reviews don’t roll anymore

Recently I wanted to submit a paper to the ACL’s rolling reviews system. The idea of this system is that instead of people rushing to make somewhat arbitrary conference deadlines—most everything is published at conferences rather than books or journals in NLP—one can instead submit to an ever-running pool of reviewers and get quick comments. Furthermore, the preprints are available online and one can see the comments. After you the author feel that you’ve received a satisfactory review you can then “submit”, with the push of a button, your already-reviewed paper to a conference and the organizers and area chairs put together a program from these papers. This seems like a good idea thus far, even if the very strong COI policy means that none of the papers I get assigned to review are interesting to me but rather in adjacent (and boring) areas.

I was recently surprised to find (it’s not documented anywhere; I had to write tech support and wait to hear back) that there are now blackout periods of several weeks where one cannot submit. I have no idea why this is true. Granted, they reduced the frequency of the cycles to six a year (or one every two months), but I don’t understand why I can’t, on July 1st, submit to the August 15th cycle. This makes no sense to me and seems to defeat the most important part of this initiative: the idea that you can submit work when it’s done rather than when certain stars align.

Myths about writing systems

In collaboration with Richard Sproat, I just published a short position paper on “myths about writing systems” in NLP to appear in the proceedings for CAWL, the ACL Workshop on Computation and Written Language. I think it will be most of all useful to reviewers and editors who need a resource to combat nonsense like “Persian is a right-to-left language” and want to suggest a correction. Take a look here.

Debugging CUDA indexing errors

Perhaps you’ve seen pages of the following scary error:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [99,0,0], thread: [115,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

It turns out there is a relatively simple way to figure out what the indexing issue is. The internet suggests prepending

CUDA_LAUNCH_BLOCKING=1

to your command, but this doesn’t seem to help much. There is a simpler solution: run whatever you’re doing on CPU. It’ll give you much nicer errors.
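If even a CPU run is inconvenient, one can also check the indices by hand before they reach the device. The assertion in the error above requires that each index satisfy index >= -sizes[i] and index < sizes[i], and the sketch below applies that same test in plain Python. The helper name `out_of_bounds` is hypothetical, and it assumes you can get at the indices as ordinary integers (e.g., by converting a tensor to a list) and that you know the size of the dimension being indexed.

```python
from typing import Iterable, List, Tuple


def out_of_bounds(indices: Iterable[int], size: int) -> List[Tuple[int, int]]:
    """Returns (position, index) pairs violating -size <= index < size."""
    return [
        (position, index)
        for position, index in enumerate(indices)
        if not -size <= index < size
    ]


# A table with 10 rows accepts indices from -10 through 9.
bad = out_of_bounds([0, 9, 10, -10, -11], size=10)
print(bad)  # [(2, 10), (4, -11)]
```

In my experience the culprit is usually a vocabulary that is one symbol smaller than the data assumes, so the offending index tends to equal the size exactly.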

Online poisoning

One of my working theories for why natural language processing feels unusually contentious at present is, yes, social media. The outspoken researchers speak, more or less constantly, to a large social media audience, and use this forum as the primary way to form and disseminate opinions. For instance, there is a very strong correlation between being an “ACL thought leader”, if not an officer, and tweeting often and aggressively. People of my age understand the addictive and corrosive nature of presenting oneself for online kudos (and jeers), but some people of the older generations lack the appropriate internet literacy to use these tools in moderation, and some people of the younger generations lack the maturity to do the same. Such people have online poisoning. Side-effects include outing oneself as the subject of a subtweet and complaining to a student’s advisor. If you have any of these symptoms, please log off immediately and touch grass.