Guest & Martin on neural networks as cognitive models

In our “big questions” class, we read a few papers about whether large artificial neural network language models are good (or even candidate) cognitive models. As part of my background reading I also reviewed this recent paper by Guest & Martin (2023). The crux of the paper is an argument based on simple propositional logic, and because it was hard for me to follow, I thought I’d try to review it here.

G&M first identify a common, mostly implicit argument for studying artificial neural networks as cognitive models, one which takes the form of modus ponens. I will take the liberty of generalizing it considerably here.

  • $P \rightarrow Q$: if neural networks (i.e., their outputs) are correlated with behavioral or neuroimaging data, they are plausible cognitive models (“do what people do”).
  • $P$: neural networks are correlated with such data.
  • $\vdash Q$: therefore they are plausible cognitive models.

G&M give several examples where this argument has been applied, and it is the exact motivation that linguists engaged in “LLMology” tend to give during the question period. The problem, as G&M note, is that the correctness of this inference depends crucially on whether $P \rightarrow Q$, and there is no shortage of arguments against that proposition. The most obvious one, of course, is the possibility of multiple realizability. They use the example of two clocks that are behaviorally quite similar, but one is actually based on springs and cogs whereas the other has a quartz movement powered by a battery. Clearly, neural networks and human brains could both realize the same sorts of behaviors/mappings without being internally the same.

G&M continue that if the above inference is valid, it should be possible to apply modus tollens to it as well. This has the following general form.

  • $P \rightarrow Q$: (as above).
  • $\neg Q$: neural networks are not plausible cognitive models.
  • $\vdash \neg P$: therefore neural networks (i.e., their outputs) are not correlated with behavioral or neuroimaging data.

G&M give several examples where such an argument could easily be applied: so-called hallucinations, cases where neural networks continue to underperform humans when provided with reasonable amounts of data, as well as cases where neural networks can be shown to exhibit superhuman performance! As they conclude: “Even though $Q$ can, and often does, fail to be true, we, as a field, do not formulate its relationship to $P$ in terms of MT [modus tollens]” (G&M: 217). Rather, they argue, what people actually do is show that:

  • $Q \rightarrow P$: if neural networks are plausible cognitive models (“do what people do”), then their outputs are correlated with behavioral or neuroimaging data.

Using this, together with the observation $P$, to assert $Q$ is of course the fallacy of affirming the consequent, and is clearly invalid. What G&M ultimately seem to conclude is that little can be logically concluded from cognitive modeling with artificial neural networks, even if these models remain “useful” in many domains.
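The contrast among these three argument forms can be checked mechanically by enumerating truth assignments. A minimal sketch in Python (my own illustration, not anything from G&M):

```python
from itertools import product


def valid(premises, conclusion):
    """A propositional argument form is valid iff every truth assignment
    that makes all premises true also makes the conclusion true."""
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )


def implies(a, b):
    return (not a) or b


# Modus ponens: P -> Q, P, therefore Q.
modus_ponens = valid([implies, lambda p, q: p], lambda p, q: q)
# Modus tollens: P -> Q, not Q, therefore not P.
modus_tollens = valid([implies, lambda p, q: not q], lambda p, q: not p)
# Affirming the consequent: Q -> P, P, therefore Q.
affirming = valid([lambda p, q: implies(q, p), lambda p, q: p], lambda p, q: q)

print(modus_ponens, modus_tollens, affirming)  # True True False
```

The counterexample the last check finds is exactly the multiple-realizability scenario: $P$ true (the correlations hold) but $Q$ false (the model is not a plausible cognitive model).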

References

Guest, O., and Martin, A. E. 2023. On logical inference over brains, behaviour, and artificial neural networks. Computational Brain & Behavior 6:213-227.

Cajal on “diseases of the will”

Charles Reiss (h/t) recently recommended to me a short book by Santiago Ramón y Cajal (1852-1934), an important Spanish neuroscientist and physician. Cajal first published the evocatively titled Reglas y Consejos sobre Investigación Científica: Los tónicos de la voluntad in 1897, and it has been revised and translated various times since then. By far the most entertaining portion for me is chapter 5, entitled “Diseases of the Will”. Despite the name, what Cajal actually presents is a taxonomy of scientists who contribute little to scientific inquiry: “contemplators”, “bibliophiles and polyglots”, “megalomaniacs”, “instrument addicts”, “misfits”, and “theorists”. I include a PDF of this brief chapter, translated into English, here for interested readers, in the belief that it is in the public domain.

“Indo-European” is not a meaningful typological descriptor

A trope I see in a lot of student writing (and computational linguistics writing at all levels) is a critique of prior work as being only on “Indo-European languages”, and sometimes a promise that current or future work will target “non-Indo-European languages”.

To me, this is drivel. The Indo-European language family is quite diverse: for the vast majority of things I’m interested in, either, say, Italian or Russian is sufficiently different from English to make a relevant comparison. And there are a huge number of “non-Indo-European” languages that are typologically similar to at least some Indo-European languages on at least some dimensions; e.g., the Finno-Ugric, “Aquitanian” (i.e., Basque), and (narrowly defined) “Altaic” families (Mongolic, Tungusic, and Turkic) have quite a bit in common typologically with IE, as do, say, Japanese and Korean. Genetic relatedness just isn’t that typologically informative in very dense, very “old” families like IE.

If you want to talk typology, you should focus on typological aspects actually relevant to your study rather than genetic relatedness. If you’re studying phonology, the presence of vowel harmony in the family may be relevant (but note that Estonian, despite being Finnic, does not have productive harmony); if you’re interested in morphology, then notions like agglutination may be relevant (though not necessarily). Gross word order descriptors (like “VSO”) are likely to be relevant for syntax, and so on.

In some cases, one of the relevant typological aspects is not language typology, but rather writing systems typology. There, genetic relatedness isn’t very informative either, because the vast majority of writing systems used today (and virtually all of them outside East Asia) are ultimately descended from Egyptian hieroglyphs. And we shouldn’t confuse writing system and language.

Linguists ought to know better.

How many Optimality Theory grammars are there?

How many Optimality Theory (OT) grammars are there? I will assume first off that Con, the constraint set, is fixed and finite. I make this assumption because it is made by nearly all work on the mathematical properties of OT. Given the many unsolved problems in OT learning, joint learning of constraints and rankings seems to add unnecessary additional complexity.1 So we suppose that there is a finite set of constraints, and let $n$ be the cardinality of that set. If we put aside for now the contents of the lexicon (i.e., the set of URs), one can ask how many possible constraint rankings there are as a function of $n$.

Prince & Smolensky (1993), in their “founding document”, seem to argue that constraint rankings are total. Consider the following from a footnote:

With a grammar defined as a total ranking of the constraint set, the underlying hypothesis is that there is some total ranking which works; there could be (and typically will be) several, because a total ranking will often impose noncrucial domination relations (noncrucial in the sense that either order will work). It is entirely conceivable that the grammar should recognize nonranking of pairs of constraints, but this opens up the possibility of crucial nonranking (neither can dominate the other; both rankings are allowed), for which we have not yet found evidence. Given present understanding, we accept the hypothesis that there is a total order of domination on the constraint set; that is, that all nonrankings are noncrucial.

So it seems like they treat strict ranking as a working hypothesis. This is also reflected in the name of their method factorial typology, because there are $n!$ strict rankings of $n$ constraints. Roughly, they propose that one generate all possible rankings of a series of possible constraints, and compare their extension to typological information.2 In practice, as they acknowledge, OT grammars—by which I mean theories of i-languages presented in the OT literature—contain many cases where two constraints $c, d$ need not be ranked with respect to each other. For example, the first chapter of Kager’s (1999) widely used OT textbook includes a “factorial typology” (p. 36, his ex. 53) in which some constraints are not strictly ranked, and this is followed by a number of tableaux in which he uses a dashed vertical line to indicate non-ranking.3

Prince & Smolensky also acknowledge the existence of cases where there is no evidence with which to rank $c$ versus $d$. This situation is important because any learning algorithm trying to enforce a strict ranking of constraints would have to resort to a coin flip or some other arbitrary mechanism to finalize the ranking. Such algorithms are possible to imagine in principle, but neither this final step nor the strict-ranking postulate it serves is independently motivated.

Prince & Smolensky finally admit the possibility of what they call crucial non-ranking, cases where leaving $c$ and $d$ mutually unranked has the right extension, but $c \lt d$ and $d \lt c$ both have the wrong extension.4 Such cases would constitute the strongest evidence for viewing OT grammars as weakly ranked, particularly if such grammars are linguistically interesting.

I will adopt the hypothesis of weak ranking; it seems unavoidable if only as a matter of acquisition. If one does, the set of possible rankings is actually much larger than $n!$. For example, for the empty set and singleton sets, there is but one weak ranking. For sets of size 2, there are 3: $\{a \lt b;\ b \lt a;\ a, b\}$. And, at the risk of being pedantic, for sets of size 3 there are 13:

  1. $a \lt b \lt c$
  2. $a \lt b, c$
  3. $a \lt c \lt b$
  4. $a, b\lt c$
  5. $a, b, c$
  6. $a, c \lt b$
  7. $b \lt a \lt c$
  8. $b \lt a, c$
  9. $b \lt c \lt a$
  10. $b, c \lt a$
  11. $c \lt a \lt b$
  12. $c \lt a, b$
  13. $c \lt b \lt a$

Already one can see that this is growing faster than $n!$ since $3! = 6$.
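A weak ranking is just an ordered partition of the constraint set into strata of mutually unranked constraints, so the 13 rankings above can also be generated by brute force. A quick sketch (my own illustration, not from any OT source):

```python
from itertools import combinations


def weak_rankings(items):
    """Yield each weak ranking of `items` as a tuple of strata
    (frozensets), highest-ranked stratum first."""
    items = tuple(items)
    if not items:
        yield ()
        return
    # Choose the top stratum, then rank the remaining items recursively.
    for k in range(1, len(items) + 1):
        for top in combinations(items, k):
            rest = tuple(x for x in items if x not in top)
            for tail in weak_rankings(rest):
                yield (frozenset(top),) + tail


print(len(list(weak_rankings("abc"))))  # 13
```

For example, ranking 4 above, $a, b \lt c$, comes out as the tuple `(frozenset({'a', 'b'}), frozenset({'c'}))`.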

It is not all that hard to figure out what the cardinality function is here. Indeed, I think Charles Yang (h/t) showed this to me many years ago, and though I lost my notes and had to re-derive it, I’m reasonably sure the solution he came up with is the same as the one I present here. This sequence is known as A000670 and has the formula

$$a(n) = \sum_{k = 0}^n k! S(n, k)$$

where $S$ is the Stirling number of the second kind. Expanding that further we obtain:

$$a(n) = \sum_{k = 0}^n k! \sum_{i = 0}^k \frac{(-1)^{k - i}i^n}{(k - i)!i!}.$$

This is a lot of grammars by any account. With just 10 constraints, we are already over a hundred million, and with 20 it’s between 2 and 3 sextillion. Yet this doesn’t look all that dramatically different from strict ranking when plotted on a log scale.
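The closed form above is easy to check numerically. Here is a quick sketch implementing the sum, with the inner sum being just $k!\,S(n,k)$ rewritten in terms of binomial coefficients:

```python
from math import comb


def count_weak_rankings(n):
    """Number of weak rankings of n constraints (OEIS A000670),
    computed as sum_k k! S(n, k), where the term k! S(n, k) expands to
    sum_i (-1)^(k - i) C(k, i) i^n."""
    return sum(
        (-1) ** (k - i) * comb(k, i) * i**n
        for k in range(n + 1)
        for i in range(k + 1)
    )


print([count_weak_rankings(n) for n in range(5)])  # [1, 1, 3, 13, 75]
```

This reproduces the hand counts above (1 ranking for sizes 0 and 1, 3 for size 2, 13 for size 3) and makes it easy to see how quickly the sequence outpaces $n!$.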

Clearly this is essentially infinite in the terminology of Gallistel and King (2009), and this ability to generate such combinatoric possibilities from a small inventory of primitive objects and operations has long been recognized as a desirable element of cognitive theories. Of course, many of these weak rankings will be extensionally equivalent, and still others, I hypothesize, will be extensionally non-equivalent overall but extensionally equivalent with respect to some lexicon. None of this is unique to OT: it’s just good cognitive science.

Endnotes

  1. Note that the constraint induction method of Hayes & Wilson (2008), for example, only considers a finite set of candidate constraints, so some finite constraint set does exist for their approach. The same is probably true of approaches with constraint conjunction, so long as there are some reasonable bounds.
  2. Exactly how the analyst is supposed to use this comparison is a little unclear to me, but presumably one can eyeball it to determine if the typological fit is satisfactory or not.
  3. As far as I can tell, he never explains this notation.
  4. There is a tradition of using $\ll$ for OT constraint ranking but I’ll put it aside because $\lt$ is perfectly adequate and is the operator used in order theory.

References

Gallistel, C. Randy and King, A. P. 2009. Memory and the Computational Brain: Why Cognitive Science Will Transform Neuroscience. Wiley-Blackwell.
Hayes, B. and Wilson, C. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39: 397-440.
Prince, A., and Smolensky, P. 1993. Optimality Theory: constraint interaction in generative grammar. Rutgers Center for Cognitive Science Technical Report TR-2.

Neural fossils

Neural network cognitive modeling had a brief, precocious golden era between 1986 (the year the Parallel Distributed Processing books came out) and maybe about 1997 (at which point the limitations of those models were widely known…though I’m a little fuzzier about when this realization settled in). During that period, I think it’s fair to say, a lot of people got hired into faculty positions, in psychology and linguistics in particular, simply because they knew a bit about this exciting new approach. Some of those people went on to do other interesting things once the shine had worn off, but a lot of them didn’t, and some of them are even still around, haunting the halls of R1s. I think something similar will happen to the new crop of LLMologists in the academy: some have the skills to pivot should we reach peak LLM (if we haven’t already), but many don’t.

Email discipline

There is a Discourse on what we might call email discipline. Here are a few related takes.

There are those who simply don’t respond to email at all. These people are demons and you should pay them no mind. 

Relatedly, there are those who “perform” some kind of message about their non-email responding. Maybe they have a long FAQ on their personal website about how exactly they do or do not want to be emailed. I am not sure I actually believe these people get qualitatively more email than I do. Maybe they get twice as much as me, but I don’t think anybody’s reading that FAQ, buddy. Be serious.

There are those who believe it is a violation to email people off-hours, or on weekends or holidays, or whatever. I don’t agree: it’s an asynchronous communication mechanism, so that’s sort of the whole point. I can have personal rules about when I read email, and these depend in no way on your rules (or lack thereof) about when you send it. Expecting people to know and abide by your Email Reading Rules FAQ is just as silly.

I have an executive function deficit, diagnosed as a child (you know the one), and if you’re lucky, they teach you strategies to cope. I think non-impaired people should just model one of the best: email can’t be allowed to linger. If it’s unimportant, you need to archive it. If it’s important, you need to respond to it. You should not have a mass of unopened, unarchived emails at any point in your life. It’s really that easy.

When LLMing goes wrong

[The following is a guest post from Daniel Yakubov.]

You’ve probably noticed that industries have been jumping to adopt some vague notion of “AI”, or peacocking about their AI-powered something-or-other. Unsurprisingly, the scrambled nature of this adoption leads to a slew of issues. This post outlines a fact obvious to technical crowds but not to business folks: even though LLMs are a shiny new toy, LLM-centric systems still require careful consideration.

Hallucination is possibly the most common issue in LLM systems: the tendency for an LLM to prioritize responding over responding accurately, i.e., making stuff up. By considering some of the common approaches to fixing this, we can understand what problems these techniques themselves introduce.

A quick approach that many prompt engineers I know treat as the end-all be-all of generative AI is chain-of-thought prompting (CoT; Wei et al. 2023). This simple approach just tells the LLM to break down its reasoning “step-by-step” before outputting a response. But CoT is a bandage: it does not actually inject new knowledge into an LLM. This is where the retrieval-augmented generation (RAG) craze began. RAG represents a family of approaches that add relevant context to a prompt via search (Lewis et al. 2020). RAG pipelines come with their own errors that need to be understood, including noise in the source documents, misconfigurations in the context window of the search encoder, and the specificity of the LLM reply (Barnett et al. 2024). Specificity is particularly frustrating. Imagine you ask a chatbot “Where is Paris?” and it replies “According to my research, Paris is on Earth.” Even combined, RAG and CoT still cannot deal with complicated user queries accurately (or, well, math). To address that, the ReAct agent framework (Yao et al. 2023) is commonly used. ReAct, in a nutshell, gives the LLM access to a series of tools and the ability to “requery” itself depending on the answer it gave to the user query. A central part of ReAct is the LLM choosing which tool to use. This is a classification task, and LLMs are observed to suffer from an inherent label bias (Reif and Schwartz 2024), yet another issue to control for.
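To make the RAG recipe concrete, here is a toy sketch of prompt assembly. The word-overlap `retrieve` is a stand-in for a real search encoder, and every function and name here is my own invention rather than anything from the cited papers:

```python
import re


def tokens(text):
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, k=2):
    # Toy lexical retriever: rank documents by word overlap with the query.
    # A real RAG system would use a trained sparse or dense encoder instead.
    return sorted(
        corpus, key=lambda doc: len(tokens(doc) & tokens(query)), reverse=True
    )[:k]


def build_prompt(query, corpus):
    # RAG prompt assembly: retrieved context, a CoT instruction, then the query.
    context = "\n".join(retrieve(query, corpus))
    return (
        f"Context:\n{context}\n\n"
        "Think step by step and answer using only the context above.\n"
        f"Question: {query}"
    )
```

Each of the failure points mentioned above lives somewhere in this little pipeline: noise in the `corpus` documents, a badly chosen `k` or retriever, and a generator that is still free to answer at an unhelpful level of specificity.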

This can go for much longer, but I feel the point should be clear. Hopefully this gives a more academic crowd some insight into when LLMing goes wrong.

References

Barnett, S., Kurniawan, S., Thudumu, S. Brannelly, Z., and Abdelrazek, M. 2024. Seven failure points when engineering a retrieval augmented generation system.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., …, Kiela, D. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks.
Reif, Y., and Schwartz, R. 2024. Beyond performance: quantifying and mitigating label bias in LLMs.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., …, Zhou, D. 2023. Chain-of-thought prompting elicits reasoning in large language models.
Yao, S., Zhao, J., Yu, D., Shafran, I., Narasimhan, K., and Cao, Y. 2023. ReAct: synergizing reasoning and acting in language models.


Snacks at talks

The following is how to put out a classy spread for your next talk; ignoring beverages and extras, everything listed should ring up at around $50.

  • The most important snack is cheese. Yes, some people are vegan or lactose-intolerant, but cheese is one of the most universally beloved snacks worldwide. Most cheeses keep for a while with refrigeration, and some even keep at room temperature. Cheese is, as a dear friend says, one of the few products whose quality scales more or less linearly with its price, and I would recommend at least two mid-grade cheeses. I usually buy one soft one (Camembert, Brie, and Stilton are good choices) and one semi-hard one (Emmental or an aged Cheddar, for example). The cheese should be laid out on a cutting board with some kind of metal knife for each, and should not be pre-cut (that’s a little tacky). Cheeses should be paired with a box of Carr’s Water Crackers or similar. Estimated price: $15-20.
  • Fresh finger vegetables are also universally liked. The easiest options are finger carrots and pre-cut celery sticks. If you can find pre-cut multi-color bell peppers or broccoli, those are good options too. You can pair this with some kind of creamy dip (it’s easy to make ranch or onion dip using a pint of sour cream and a dip packet, but you need a spoon or spatula to stir it up) but you certainly don’t have to. Estimated price: $10-20.
  • Fruit is a great option. The simplest thing to do is to just buy berries, but this is not foolproof: blueberries are a little small for eating by hand; raspberries lack structural integrity; and where I live, strawberries are only in season in mid-summer, and are expensive and low-quality otherwise. In Mid-Atlantic cities, there are often street vendors who sell containers of freshly cut fruit (usually slices of pineapple, mango, and banana, and perhaps some berries), and if this is available it is a good idea too. Estimated price: $10-15.

This, plus some water, is basically all you need to put out. Here are some ways to potentially extend it.

  • Chips are a good option. I think ordinary salty potato chips are probably the best choice simply because they’re usually eaten by themselves. In contrast, if you put out tortilla chips, you need to pair them with some kind of salsa or dip, and you need to buy a brand with sufficient “structural integrity” to actually pick up the dip.
  • Nuts are good too, obviously; maybe pick out a medley.
  • Soda water is really popular and cheap. I recommend 12oz cans. It should always be served chilled.
  • A few bottles or cans of beer may go over well. With rare exceptions, beer should be served chilled.
  • A bottle of wine may be appropriate. Chill it if it’s a varietal that needs to be chilled.

If the talk is before noon, coffee (and possibly hot water and tea bags) is more or less expected. There is something of a taboo in the States against consuming or serving alcohol before 4pm or so, and you may or may not want your event to have a happy hour atmosphere even if it’s in the evening.

And here are a few things I cannot recommend:

  • In my milieu it is uncommon for people to drink actual soda.
  • I wouldn’t recommend cured meats or charcuterie for a talk. The majority of people won’t touch the stuff these days, and it’s pretty expensive.
  • I love hummus, but mass-produced hummus is almost universally terrible. Make it at home (it’s easy if you have a food processor) or forget about it.
  • Store-bought guacamole tastes even worse, and it has a very short shelf life.

Introducing speakers

The following are my (admittedly normative) notes on how to introduce a linguistics speaker.

  • The genre most similar to the introduction of a speaker is the congratulatory toast. An introduction should be brief, and the lengthy written introduction should be scorned. The speaker is already making an imposition on the audience’s time, and for the host to usurp more of this time than necessary is a further imposition on both speaker and audience.
  • The introduction is not an opportunity for the introducer to demonstrate erudition, but it can be an opportunity to show wit.
  • The introduction should be in the introducer’s voice. For this reason, a biography paragraph provided by the speaker should not be read as part of the introduction.
  • The introduction should be extemporaneous. The introducer can prepare brief notes, but they should fit on a notecard or their hand, and the notes should never be “read”.
  • Polite humor, brief personal anecdotes (e.g., when the introducer first met the speaker or became aware of their work), and heart-felt superlatives or compliments (one of the nicest introductions I ever received stated that I was “in the business of keeping people honest”) are to be encouraged.
  • The introduction should state the speaker’s current affiliation and title, if any, but need not list their full occupational or educational history unless it is judged relevant.
  • Introducers may feel an urge to read the title of the talk when concluding their introduction, but should resist this urge. There is no real need—the audience already has seen the talk title in the program or other announcements, and they can read the slide—and the speaker normally feels the need to read it out loud regardless.
  • The introduction should conclude with the speaker’s name. In one common style, which I consider elegant, the introducer is careful not to say the speaker’s full name until this conclusion, and uses epithets like “our next speaker” or “our honored guest” earlier in the introduction.