The Zipf scale

Word frequency norms are usually computed by counting word frequencies in some large, relatively diverse, and hopefully representative corpus. However, a raw frequency is only interpretable relative to the size of that corpus.

Converting raw frequencies to probabilities (i.e., via maximum likelihood estimation) removes this corpus dependence, but the resulting probabilities are themselves not terribly interpretable either. One slight improvement is to express them as words per million (wpm, usually rounded to the nearest power of ten). It is reasonably obvious to me that “100 wpm” is an improvement on “.0001” or the equivalent “1e-4”.

Van Heuven et al. (2014) propose a variation on words-per-million metrics which they call the Zipf scale. While van Heuven et al. do not give a complete formula, their examples indicate that the scale is equivalent to $\log_{10}(\mathrm{wpm}) + 3$, and can be computed from raw frequencies as $\log_{10}(c) - \log_{10}(N) + 9$ when the raw frequency $c > 0$, and $0$ otherwise, where $N$ is the corpus size. This definition differs slightly from van Heuven et al.’s in that we do not use “add one” smoothing, which causes issues with fractional frequencies, and we give a function which is defined at 0: the Zipf scale of a zero-frequency word, naturally, is zero. Here is a tiny Python module for computing it, and here is the associated unit test.
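
As a sketch, the definition above can be implemented in a few lines (the function name here is mine, not the linked module’s):

```python
import math


def zipf_scale(count: float, size: float) -> float:
    """Zipf-scale frequency: log10 of frequency per billion tokens.

    Zero-frequency words are assigned a Zipf scale of zero, per the
    definition above; no smoothing is applied.
    """
    if count <= 0:
        return 0.0
    return math.log10(count) - math.log10(size) + 9


# A word occurring 100 times in a million-token corpus (100 wpm):
# zipf_scale(100, 10**6) == 2 - 6 + 9 == 5.0.
```

Note that fractional counts (as arise with some frequency norms) pose no problem here, since we never add one to the count.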

We use this definition in the new webapp-based CityLex, out now, as just one of several ways to express frequency norms.

References

Van Heuven, W. J. B., Mandera, P., Keuleers, E., and Brysbaert, M. 2014. SUBTLEX-UK: a new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology 67: 1176-1190.

ACL meetings

While I continue to work in computational linguistics, broadly construed, I am feeling less and less motivated to actually attend the core ACL events, which I’ll denote as *CL (the international ACL and EMNLP meetings, and the regional meetings: EACL in Europe, NAACL in North America, IJCNLP in East Asia, and so on).

This is not a “why I left…” post, nor do I have much constructive criticism, but it is instructive to contrast these meetings with the kinds of conferences I attend when wearing my formal linguist hat. As a formal linguist, I overwhelmingly attend conferences in the “Acela corridor”, the portion of the Mid-Atlantic and New England well served by trains, and pay registration fees of $100 or even less. In contrast, I cannot even remember the last time any *CL meeting was held in this (rich, populous, and well-educated) region of the country, and I expect to spend my entire travel budget for the year attending even a domestic *CL, thanks to skyrocketing registration fees and hotel prices. (It doesn’t help that these conferences tend to run long, so you need many nights in the hotel.) I don’t know how much the ACL can do about costs, but *CL conference locations tend towards the random or exotic rather than dense areas with a lot of research output.

There are two other big issues with *CL meetings that make me less likely to attend them. First, other than the handful of senior faculty in ACL leadership and the same few invited speakers (it’s the same people over and over…), there are hardly any faculty at *CL conferences anymore. I don’t see many of my mid-career peers, and I don’t see senior people either. Something needs to be done to encourage these people to attend. Second, the programs of *CL conferences, even the main sessions, are overwhelmingly inclined towards what are essentially system demonstrations using known technologies, rather than new ideas, critiques, unsolved problems, or system comparisons. To put it another way: I’m glad BERT or GPT-4o worked for your problem, but that kind of talk or poster doesn’t exactly make for scintillating scientific debate.

I continue to work in these areas, but I suspect I am going to opt for online presentation (or just go straight to the journals) more in the future, and perhaps send students in my stead.

Vibe coding in 2025

I have seen a decent amount of software developed via vibe coding, i.e., coding with heavy AI assistance, but I have not seen anything that passes as professionally developed software. One big tell is that the AI assistant tends to add copious boundary-condition checks that would never occur to a human because those conditions are unlikely to arise in practice. Another is bizarre style, which imposes a heavy cognitive cost on the reader (whose time is qualitatively more valuable than the computer’s).

For people who basically can’t code anywhere near a professional level, I see why this is valuable, but I think these people should be honest with themselves and admit they can’t really code, and wouldn’t know good code if it hit them in the face.

For people who can (and who can already avail themselves of various autocompletion tools, whether AI-powered or knowledge-based), I don’t see the value proposition. It creates hard-to-review code, and code review is a more difficult, more important, more cognitively taxing, and rarer skill than development itself; so this technology could, in the worst case, actually drive up the already high cost of software development.

When LLMing goes wrong

[The following is a guest post from Daniel Yakubov.]

You’ve probably noticed that industries have been jumping to adopt some vague notion of “AI” or peacocking about their AI-powered something-or-other. Unsurprisingly, the scrambled nature of this adoption leads to a slew of issues. This post outlines a fact obvious to technical crowds but not to business folks: even though LLMs are a shiny new toy, LLM-centric systems still require careful consideration.

Hallucination is possibly the most common issue in LLM systems. It is the tendency of an LLM to prioritize responding over responding accurately; that is, making stuff up. By considering some of the common approaches to mitigating hallucination, we can understand what problems these techniques themselves introduce.

A quick approach that many prompt engineers I know treat as the be-all and end-all of generative AI is chain-of-thought prompting (CoT; Wei et al. 2023). This simple approach just tells the LLM to break down its reasoning “step by step” before outputting a response. But CoT is a bandage: it does not actually inject new knowledge into the LLM.

This is where the retrieval-augmented generation (RAG) craze began. RAG refers to a family of approaches that add relevant context to a prompt via search (Lewis et al. 2020). RAG pipelines come with their own errors that need to be understood, including noise in the source documents, misconfigurations of the search encoder’s context window, and underspecific LLM replies (Barnett et al. 2024). Specificity is particularly frustrating: imagine you ask a chatbot “Where is Paris?” and it replies “According to my research, Paris is on Earth.”

Even combined, RAG and CoT still cannot handle complicated user queries accurately (or, well, math). To address this, the ReAct agent framework (Yao et al. 2023) is commonly used. ReAct, in a nutshell, gives the LLM access to a series of tools and the ability to “requery” itself depending on the answer it gave to the user query. A central part of ReAct is the LLM choosing which tool to use. This is a classification task, and LLMs are observed to suffer from an inherent label bias (Reif and Schwartz 2024), yet another issue to control for.
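
The prompt-assembly side of these techniques can be sketched in a few lines. Everything here is illustrative, not any particular library’s API: the toy keyword “retriever” stands in for a real search encoder, and the templates are assumptions.

```python
# Toy document store standing in for a real search index.
DOCUMENTS = {
    "paris": "Paris is the capital of France.",
    "tokyo": "Tokyo is the capital of Japan.",
}


def retrieve(query: str) -> list[str]:
    """Toy keyword retriever standing in for a real search encoder."""
    return [text for key, text in DOCUMENTS.items() if key in query.lower()]


def build_prompt(query: str) -> str:
    """Assembles a RAG prompt, ending with a chain-of-thought instruction."""
    context = "\n".join(retrieve(query))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Let's think step by step."  # The CoT "bandage".
    )


prompt = build_prompt("Where is Paris?")
```

Note that all the failure points mentioned above live outside this snippet: in the quality of the retrieved documents, the encoder configuration, and the LLM’s eventual reply.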

This could go on much longer, but I think the point is clear. Hopefully this gives a more academic crowd some insight into when LLMing goes wrong.

References

Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., and Abdelrazek, M. 2024. Seven failure points when engineering a retrieval augmented generation system.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., …, Kiela, D. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks.
Reif, Y., and Schwartz, R. 2024. Beyond performance: quantifying and mitigating label bias in LLMs.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., …, Zhou, D. 2023. Chain-of-thought prompting elicits reasoning in large language models.
Yao, S., Zhao, J., Yu, D., Shafran, I., Narasimhan, K., and Cao, Y. 2023. ReAct: synergizing reasoning and acting in language models.

Hiring season

It’s hiring season, and your dean has approved your linguistics department for a new tenure line. Naturally, you’re looking to hire an exciting young “hyphenate” type who can, among other things, strengthen your computational linguistics offerings, help students transition into industry roles, and perhaps even incorporate generative AI into more mundane parts of your curriculum (sigh). There are two problems I see with this. First, most people applying for these positions don’t actually have relevant industry experience, so while they can certainly teach your students to code, they don’t know much about industry practices. Second, an awful lot of them would probably prefer to be full-time software engineers, all things considered, and are going to take leave—if not quit outright—if the opportunity ever becomes available. (“Many such cases.”) The only way to avoid this scenario, as I see it, is to hire people who have already been software engineers and don’t want to be anymore; fortunately, there are several of us.

Hugging Face needs better curation

Hugging Face is, among other things, a platform for obtaining pre-trained neural network models. We use their tokenizers and transformers Python libraries in a number of projects. While these have a bit more abstraction than I like, and are arguably over-featured, they are fundamentally quite good and make it really easy to, e.g., add a pre-trained encoder. I also appreciate that the tokenizers are mostly compiled code (they’re Rust extensions, apparently), which in practice means that tokenization is IO-bound rather than CPU-bound.

My use case mostly involves loading Hugging Face transformers and their tokenizers and using their encoding layers for fine-tuning. To load a model in transformers, one calls transformers.AutoModel.from_pretrained with the name of the model on the Hugging Face hub as a string argument. If the model exists but you don’t already have a local copy, Hugging Face automatically downloads it for you (stashing the assets in a hidden directory). One can do something similar with transformers.AutoTokenizer, or one can request the tokenizer from the model instance.

Now you might think this would make it easy to, say, write a command-line tool where the user can specify any Hugging Face model, but unfortunately, you’d be wrong. First off, a lot of models, including so-called token-free ones, lack a tokenizer. Why doesn’t ByT5, for instance, provide as its tokenizer a trivial Rust (or even Python) function that returns bytes? In practice, one cannot support arbitrary Hugging Face models because one cannot count on them having a tokenizer. I see no alternative but to keep a list of supported models that lack their own tokenizer, and such a list is necessarily incomplete because the model hub continues to grow.

A similar problem arises with how the models’ parameters are named. Most models are trained with dropout and support a dropout parameter, but the name of this parameter is inconsistent from model to model. In UDTube, for instance, dropout is a global parameter: it is applied to each hidden layer of the encoder (which requires us to access the guts of the Hugging Face model), and then again to the contextual subword embeddings just before they’re pooled into word embeddings. Most of the models we’ve looked at call the encoder’s dropout probability hidden_dropout_prob, but others call it dropout or dropout_rate. Because of this, we have to maintain a module which keeps track of what the hidden-layer dropout probability parameter is called.
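
The kind of bookkeeping module this requires looks roughly like the following. The parameter names are real Hugging Face config fields, but this table is an illustrative stand-in, not UDTube’s actual (more complete) module:

```python
# Maps a Hugging Face model type to the name of its config field for
# the encoder's hidden-layer dropout probability. Illustrative only.
DROPOUT_PARAMETER = {
    "bert": "hidden_dropout_prob",
    "roberta": "hidden_dropout_prob",
    "distilbert": "dropout",
    "t5": "dropout_rate",
}


def dropout_kwargs(model_type: str, probability: float) -> dict:
    """Builds the keyword argument that sets dropout for a model type."""
    try:
        return {DROPOUT_PARAMETER[model_type]: probability}
    except KeyError as error:
        raise ValueError(f"Unsupported model type: {model_type}") from error
```

One would then splat the result into the model’s config; any model type absent from the table simply cannot be supported, which is exactly the curation gap complained about above.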

I think this is basically a failure of curation. Hugging Face community managers should be out there fixing these gaps and inconsistencies, or at the very least publishing standards for them. The company is valued at $4.5 billion; I would argue this is at least as important as its efforts with model cards and the like.

Python ellipses considered harmful

Python has a conventional object-oriented design, but it was grafted onto the language slowly, something which shows from time to time. Arguably, you see this in the convention that instance methods take self as their first argument and class methods take cls as theirs. Another place you see it is in how Python does abstract classes. One can use the built-in abc module, proposed in PEP 3119, to declare a class abstract, but in practice most Pythonistas make a class abstract simply by declaring unimplemented instance methods. There are two conventional ways to do this: with ellipses, or by raising an exception, as illustrated below.

class AbstractCandyFactory:
    # Elliptical style.
    def make_candy(self, batch_size: int): ...


class AbstractCandyFactory:
    # Exception style.
    def make_candy(self, batch_size: int):
        raise NotImplementedError

The latter is a bit more verbose, but there is actually a very good reason to prefer it to the former, elliptical version. With the exception version, if one forgets to implement make_candy—say, in a concrete subclass like SnickersFactory(AbstractCandyFactory)—an informative exception is raised as soon as make_candy is called on a SnickersFactory instance. In the elliptical version, however, the inherited method is called instead, and it does nothing, because it has no body. This will likely cause errors down the road, and they will not be nearly as easy to track down, because nothing directly links the failure to the missing override. For this reason alone, I consider ellipses used to declare abstract instance methods harmful.
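
The failure mode is easy to demonstrate, using the hypothetical SnickersFactory from above:

```python
class AbstractCandyFactory:
    def make_candy(self, batch_size: int): ...


class SnickersFactory(AbstractCandyFactory):
    pass  # Oops: forgot to override make_candy.


factory = SnickersFactory()
# No exception is raised; the inherited no-op silently returns None,
# and the bug surfaces (if at all) far from its cause.
result = factory.make_candy(100)
assert result is None
```

Had the abstract method raised NotImplementedError instead, the same call would fail immediately with a traceback pointing at the offending method.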

Announcing UDTube

In collaboration with CUNY master’s program graduate Daniel Yakubov, we have recently open-sourced UDTube, our neural morphological analyzer. UDTube performs what is sometimes called morphological analysis in context: it provides morphological analyses—coarse POS tagging, more-detailed morphosyntactic tagging, and lemmatization—to whole sentences using nearby words as context.

The UDTube model, developed in Yakubov 2024, is quite simple: it uses a pre-trained Hugging Face encoder to compute subword embeddings. We take the last few layers of the encoder and mean-pool them, then mean-pool the subword embeddings of any word that corresponds to multiple subwords. The resulting encoding of the input is then fed to separate classifier heads for the different tasks (POS tagging, etc.). During training we fine-tune the pre-trained encoder in addition to fitting the classifier heads, and we make it possible to set separate optimizers, learning rates, and schedulers for the encoder and classifier modules.
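
The two mean-pooling steps can be illustrated with a toy example. This uses plain Python lists in place of real tensors, and the shapes (two layers, three subwords, embedding dimension 2) are made up for illustration:

```python
def mean(vectors):
    """Element-wise mean of a sequence of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


# Embeddings (dimension 2) for three subwords from the last two layers.
layer1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer2 = [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
# Step 1: mean-pool across layers, per subword.
subwords = [mean(pair) for pair in zip(layer1, layer2)]
# Step 2: word 0 spans subword 0; word 1 spans subwords 1 and 2.
word_spans = [[0], [1, 2]]
words = [mean([subwords[i] for i in span]) for span in word_spans]
# words == [[2.0, 3.0], [5.0, 6.0]]
```

The word-level vectors (here, `words`) are what the classifier heads consume.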

UDTube is built atop PyTorch and Lightning, and its command-line interface is made much simpler by LightningCLI, a module which handles most of the interface work. One can configure the entire thing using YAML configuration files. CUDA GPUs and Apple silicon Macs (M1, etc.) can be used to accelerate training and inference, and should work out of the box. We also provide scripts for hyperparameter tuning with Weights & Biases. We believe that this model, with appropriate tuning, is probably state-of-the-art for morphological analysis in context.

UDTube is available under an Apache 2.0 license on GitHub and on PyPI.

References

Yakubov, D. 2024. How do we learn what we cannot say? Master’s thesis, CUNY Graduate Center.

Learned tokenization

Conventional (i.e., non-neural, pre-BERT) NLP stacks tend to use rule-based systems for tokenizing sentences into words. One good example is spaCy, which provides rule-based tokenizers for the languages it supports. I am somewhat baffled that this is considered a good idea for languages other than English, since it seems to me that most languages need machine learning even for this task, to properly handle phenomena like clitics. If you like the spaCy interface—I admit it’s very convenient—and work in Python, you may want to try the spacy-udpipe library, which exposes the UDPipe 1.5 models for Universal Dependencies 2.5; these use learned tokenizers (and taggers, morphological analyzers, and dependency parsers, if you care) trained on high-quality Universal Dependencies data.

Automatic batch sizing

Yoyodyne is my lab’s sequence-to-sequence library, intended to be a replacement for Fairseq, which is (essentially) abandonware. One matter of urgency for me in building Yoyodyne was to enable automatic hyperparameter tuning. This was accomplished by logging results to Weights & Biases (W&B). We can perform a random or Bayesian hyperparameter sweep using a “grid” specified via a YAML file, monitor progress on the W&B website, or even hit the API to grab the best hyperparameters. One issue that kept coming up, however, is that it is easy to hit out-of-memory (OOM) errors during this process. Here’s what we did about it:

OOMs are not purely due to model size: the model, batch, and gradients all need to fit into the same VRAM. PyTorch Lightning, which is a key part of the Yoyodyne backend, provides a function for automatically determining the maximum batch size that will not trigger an OOM. Basically, it works by starting with a low batch size (by default, 2), randomly drawing three batches of that size, and then attempting training (but in fact caching parameters so that no real training occurs). If this does not trigger an OOM, it doubles the batch size, and so on.1,2 You can enable this approach in Yoyodyne using the flag --find_batch_size max. You’d want to use this if you believe that a giant batch size is fine and you just want to fully saturate your GPU.
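
The doubling search can be sketched as follows. The `fits` callback is a stand-in for “can we run a few batches of this size without an OOM?”, which in Lightning is an actual (cached, throwaway) training attempt rather than a simple predicate:

```python
def find_max_batch_size(fits, start: int = 2) -> int:
    """Doubles the batch size until the next doubling no longer fits."""
    if not fits(start):
        raise RuntimeError("even the starting batch size does not fit")
    size = start
    while fits(size * 2):
        size *= 2
    return size


# With a toy memory budget that admits batches of up to 100 elements,
# the search settles on 64, the largest power of two that fits.
assert find_max_batch_size(lambda size: size <= 100) == 64
```

As endnote 1 says, this power-of-two search is imprecise; a binary search between the last two probes would refine it, at a cost rarely worth paying for ragged data.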

A slightly more sophisticated version of this, useful when you actually want to tune the batch size, is enabled with the flag --find_batch_size opt. This again begins by doubling the size of randomly drawn batches, but here it halts once the doubling exceeds the value of the --batch_size flag. If the maximum batch size is at least as large as the requested size, the requested size is used as is; thus this acts as a soft check against OOMs. If, however, the maximum batch size is smaller than --batch_size, it instead solves for a new batch size: the largest batch size which is no larger than the max and which is a divisor of --batch_size. It then enables multiple rounds of gradient accumulation per update,3 thus losslessly simulating the desired batch size while using as much VRAM as possible. I can assure you this is a killer feature for neural network tuning.

Endnotes

  1. This is a little imprecise, and one can refine it by doing a binary search, but in practice it’s not worth the effort when working with ragged data.
  2. Whatever batch size was requested with the --batch_size flag is ignored.
  3. More formally, given a desired batch size $b$ and a maximum batch size $n'$, it finds the largest integer $n \le n'$ (and correspondingly, the smallest integer $a$) such that $an = b$. This is computed via brute force; my implementation of an elegant solution based on the prime factorization was a bit slower.
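
The brute-force computation in endnote 3 can be sketched like this (the function name is mine; Yoyodyne’s actual implementation may differ in detail):

```python
def solve_batch_size(desired: int, max_size: int) -> tuple[int, int]:
    """Finds the largest divisor of `desired` no larger than `max_size`.

    Returns (a, n): `n` is the per-step batch size and `a` is the number
    of gradient accumulation steps, so that a * n == desired.
    """
    for n in range(min(desired, max_size), 0, -1):
        if desired % n == 0:
            return desired // n, n
    raise ValueError("unreachable: 1 always divides the desired size")


# E.g., a desired batch size of 32 with only 12 fitting in VRAM yields
# 4 accumulation steps of 8 elements each: 4 * 8 == 32.
assert solve_batch_size(32, 12) == (4, 8)
```

Since $n = 1$ always divides $b$, the loop is guaranteed to terminate with a valid (if worst-case slow) schedule.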