Guest & Martin on neural networks as cognitive models

In our “big questions” class, we read a few papers about whether large artificial neural network language models are good (or even candidate) cognitive models. As part of my background reading I also reviewed this recent paper by Guest & Martin (2023). The crux of the paper is an argument based on simple propositional logic, and because it was hard for me to follow, I thought I’d try to review it here.

G&M first identify a common, mostly implicit argument for studying artificial neural networks as cognitive models, which takes the form of modus ponens. I will take the liberty of generalizing it considerably here.

  • $P \rightarrow Q$: if neural networks (i.e., their outputs) are correlated with behavioral or neuroimaging data, they are plausible cognitive models (“do what people do”).
  • $P$: neural networks are correlated with such data.
  • $\vdash Q$: therefore they are plausible cognitive models.

G&M give several examples where this argument has been applied, and this is the exact motivation that linguists engaged in “LLMology” tend to give during the question period. The problem, as G&M note, is that the correctness of this inference depends crucially on whether $P \rightarrow Q$, and there is no shortage of arguments against that proposition. The most obvious one, of course, is the possibility of multiple realizability. They use the example of two clocks that are behaviorally quite similar, but one is actually based on springs and cogs whereas the other has a quartz movement powered by a battery. Clearly, neural networks and human brains could both realize the same sorts of behaviors/mappings without being internally the same.

G&M continue that if the above inference is valid, it should be possible to apply modus tollens to it as well. This has the following general form.

  • $P \rightarrow Q$: (as above).
  • $\neg Q$: neural networks are not plausible cognitive models.
  • $\vdash \neg P$: therefore neural networks (i.e., their outputs) are not correlated with behavioral or neuroimaging data.

G&M give several examples where such an argument could easily be applied: so-called hallucinations, cases where neural networks continue to underperform humans when provided with reasonable amounts of data, as well as cases where neural networks can be shown to exhibit superhuman performance! As they conclude: “Even though $Q$ can, and often does, fail to be true, we, as a field, do not formulate its relationship to $P$ in terms of MT [modus tollens]” (G&M: 217). Rather, they argue, what people actually do is show that:

  • $Q \rightarrow P$: if neural networks are plausible cognitive models (“do what people do”), then they are correlated with behavioral or neuroimaging data.

Using this together with $P$ to assert $Q$ is, of course, the fallacy of affirming the consequent, and is clearly invalid. What G&M ultimately seem to conclude is that little can be logically concluded from cognitive modeling with artificial neural networks, even if these models remain “useful” in many domains.
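For concreteness, the three argument forms at issue can be checked mechanically with a tiny truth-table sketch (in Python; the helper names are my own, not G&M’s):

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    # Material implication: p -> q is false only when p is true and q is false.
    return (not p) or q

def valid(premises, conclusion) -> bool:
    # An argument form is valid iff the conclusion holds under every
    # truth assignment that satisfies all of the premises.
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )

# Modus ponens: P -> Q, P, therefore Q.
modus_ponens = valid([lambda p, q: implies(p, q), lambda p, q: p],
                     lambda p, q: q)
# Modus tollens: P -> Q, not Q, therefore not P.
modus_tollens = valid([lambda p, q: implies(p, q), lambda p, q: not q],
                      lambda p, q: not p)
# Affirming the consequent: Q -> P, P, therefore Q.
affirming_consequent = valid([lambda p, q: implies(q, p), lambda p, q: p],
                             lambda p, q: q)
```

The first two come out valid and the third does not (the counterexample is $P$ true, $Q$ false), which is exactly G&M’s point.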

References

Guest, O., and Martin, A. E. 2023. On logical inference over brains, behaviour, and artificial neural networks. Computational Brain & Behavior 6: 213-227.

Cajal on “diseases of the will”

Charles Reiss (h/t) recently recommended to me a short book by Santiago Ramón y Cajal (1852-1934), the important Spanish neuroscientist and physician. Cajal first published the evocatively titled Reglas y Consejos sobre Investigación Científica: Los tónicos de la voluntad in 1897, and it has been revised and translated various times since. By far the most entertaining portion for me is chapter 5, entitled “Diseases of the Will”. Despite the name, what Cajal actually presents is a taxonomy of scientists who contribute little to scientific inquiry: “contemplators”, “bibliophiles and polyglots”, “megalomaniacs”, “instrument addicts”, “misfits”, and “theorists”. I include a PDF of this brief chapter, translated into English, here for interested readers, in the belief that it is in the public domain.

“Indo-European” is not a meaningful typological descriptor

A trope I see in a lot of student writing (and computational linguistics writing at all levels) is a critique of prior work as being only on “Indo-European languages”, and sometimes a promise that current or future work will target “non-Indo-European languages”.

To me, this is drivel. The Indo-European language family is quite diverse: for the vast majority of things I’m interested in, either, say, Italian or Russian is sufficiently different from English to make a relevant comparison. And there are a huge number of “non-Indo-European” languages that are typologically similar to at least some Indo-European languages on at least some dimensions; e.g., the Finno-Ugric, “Aquitanian” (i.e., Basque), and (narrowly defined) “Altaic” families (Mongolic, Tungusic, and Turkic) have quite a bit in common typologically with IE, as do, say, Japanese and Korean. Genetic relatedness just isn’t that typologically informative in very dense, very “old” families like IE.

If you want to talk typology, you should focus on typological aspects actually relevant to your study rather than genetic relatedness. If you’re studying phonology, the presence of vowel harmony in the family may be relevant (but note that Estonian, despite being Finnic, does not have productive harmony); if you’re interested in morphology, then notions like agglutination may be relevant (though not necessarily). Gross word order descriptors (like “VSO”) are likely to be relevant for syntax, and so on.

In some cases, one of the relevant typological aspects is not language typology, but rather writing systems typology. There, genetic relatedness isn’t very informative either, because the vast majority of writing systems used today (and virtually all of them outside East Asia) are ultimately descended from Egyptian hieroglyphs. And we shouldn’t confuse writing system and language.

Linguists ought to know better.

How many Optimality Theory grammars are there?

How many Optimality Theory (OT) grammars are there? I will assume first off that Con, the constraint set, is fixed and finite. I make this assumption because it is one assumed by nearly all work on the mathematical properties of OT. Given the many unsolved problems in OT learning, joint learning of constraints and rankings seems to add unnecessary additional complexity.1 So we suppose that there is a finite set of constraints, and let $n$ be the cardinality of that set. If we put aside for now the contents of the lexicon (i.e., the set of URs), one can ask how many possible constraint rankings there are as a function of $n$.

Prince & Smolensky (1993), in their “founding document”, seem to argue that constraint rankings are total. Consider the following from a footnote:

With a grammar defined as a total ranking of the constraint set, the underlying hypothesis is that there is some total ranking which works; there could be (and typically will be) several, because a total ranking will often impose noncrucial domination relations (noncrucial in the sense that either order will work). It is entirely conceivable that the grammar should recognize nonranking of pairs of constraints, but this opens up the possibility of crucial nonranking (neither can dominate the other; both rankings are allowed), for which we have not yet found evidence. Given present understanding, we accept the hypothesis that there is a total order of domination on the constraint set; that is, that all nonrankings are noncrucial.

So it seems like they treat strict ranking as a working hypothesis. This is also reflected in the name of their method factorial typology, because there are $n!$ strict rankings of $n$ constraints. Roughly, they propose that one generate all possible rankings of a series of possible constraints, and compare their extension to typological information.2 In practice, as they acknowledge, OT grammars—by which I mean theories of i-languages presented in the OT literature—contain many cases where two constraints $c, d$ need not be ranked with respect to each other. For example, the first chapter of Kager’s (1999) widely used OT textbook includes a “factorial typology” (p. 36, his ex. 53) in which some constraints are not strictly ranked, and this is followed by a number of tableaux in which he uses a dashed vertical line to indicate non-ranking.3

Prince & Smolensky also acknowledge the existence of cases where there is no evidence with which to rank $c$ versus $d$. The situation is important because any learning algorithm trying to enforce a strict ranking of constraints would have to resort to a coin flip or some other unprivileged mechanism to finalize the ranking. Such algorithms are possible to imagine in principle, but neither the final step nor the postulate is motivated in the first place.

Prince & Smolensky finally admit the possibility of what they call crucial non-ranking, cases where leaving $c$ and $d$ mutually unranked has the right extension, but $c < d$ and $d < c$ both have the wrong extension.4 Such would constitute the strongest evidence for viewing OT grammars as weakly ranked, particularly if such grammars are linguistically interesting.

I will adopt the hypothesis of weak ranking; it seems unavoidable if only as a matter of acquisition. If one does, the set of possible rankings is actually much larger than $n!$. For example, for the empty set and singleton sets, there is but one weak ranking. For sets of size 2, there are 3: $\{a < b;\ b < a;\ a, b\}$. And, at the risk of being pedantic, for sets of size 3 there are 13:

  1. $a < b < c$
  2. $a < b, c$
  3. $a < c < b$
  4. $a, b < c$
  5. $a, b, c$
  6. $a, c < b$
  7. $b < a < c$
  8. $b < a, c$
  9. $b < c < a$
  10. $b, c < a$
  11. $c < a < b$
  12. $c < a, b$
  13. $c < b < a$

Already one can see that this is growing faster than $n!$ since $3! = 6$.
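One way to convince oneself of these counts is to enumerate weak rankings directly as ordered partitions: pick a nonempty top-ranked block of mutually unranked constraints, then weakly rank the remainder. A minimal sketch (the function name is my own):

```python
from itertools import combinations

def weak_rankings(constraints):
    """Yield every weak ranking of `constraints` as a tuple of blocks,
    each block being a set of mutually unranked constraints, ordered
    from highest-ranked block to lowest."""
    items = tuple(constraints)
    if not items:
        yield ()
        return
    # Choose the top-ranked block, then recurse on the remainder.
    for k in range(1, len(items) + 1):
        for top in combinations(items, k):
            rest = tuple(c for c in items if c not in top)
            for tail in weak_rankings(rest):
                yield (frozenset(top),) + tail
```

Running `len(list(weak_rankings("abc")))` indeed yields 13, matching the list above.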

It is not all that hard to figure out what the cardinality function is here. Indeed, I think Charles Yang (h/t) showed this to me many years ago, and though I lost my notes and had to re-derive it, I’m reasonably sure the solution he came up with is the same as the one I derive here. This sequence is known as A000670 and has the formula

$$a(n) = \sum_{k = 0}^n k!\, S(n, k)$$

where $S$ is the Stirling number of the second kind. Expanding that further we obtain:

$$a(n) = \sum_{k = 0}^n k! \sum_{i = 0}^k \frac{(-1)^{k - i}i^n}{(k - i)!\,i!} .$$

This is a lot of grammars by any account. With just 10 constraints we are already over a hundred million, with 11 in the billions, and with 20 it’s between 2 and 3 sextillion. Yet this doesn’t look all that dramatically different from strict ranking when plotted on a log scale.
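The closed form above is easy to compute directly; a sketch, expanding the Stirling number as in the second formula (the function name is my own):

```python
from math import comb

def num_weak_rankings(n: int) -> int:
    # a(n) = sum over k of k! * S(n, k). Expanding S(n, k) and folding
    # the k! into the binomial coefficient C(k, i) = k! / ((k - i)! i!)
    # gives: sum over k <= n, i <= k of (-1)^(k - i) * C(k, i) * i^n.
    return sum(
        (-1) ** (k - i) * comb(k, i) * i ** n
        for k in range(n + 1)
        for i in range(k + 1)
    )
```

This reproduces the small cases worked out above (1, 1, 3, 13, …).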

Clearly this is essentially infinite in the terminology of Gallistel and King (2009), and the ability to generate such combinatoric possibilities from a small inventory of primitive objects and operations has long been recognized as a desirable element of cognitive theories. Of course, many of these weak rankings will be extensionally equivalent, and others, I hypothesize, will be extensionally non-equivalent in general but extensionally equivalent with respect to some lexicon. None of this is unique to OT: it’s just good cognitive science.

Endnotes

  1. Note that the constraint induction procedure of Hayes & Wilson (2008), for example, only considers a finite set of candidate constraints, so a finite Con does exist for their approach. The same is probably true of approaches with constraint conjunction, so long as there are some reasonable bounds.
  2. Exactly how the analyst is supposed to use this comparison is a little unclear to me, but presumably one can eyeball it to determine if the typological fit is satisfactory or not.
  3. As far as I can tell, he never explains this notation.
  4. There is a tradition of using $\ll$ for OT constraint ranking, but I’ll put it aside because $<$ is perfectly adequate and is the operator used in order theory.

References

Gallistel, C. R. and King, A. P. 2009. Memory and the Computational Brain: Why Cognitive Science Will Transform Neuroscience. Wiley-Blackwell.
Hayes, B. and Wilson, C. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39: 397-440.
Kager, R. 1999. Optimality Theory. Cambridge University Press.
Prince, A., and Smolensky, P. 1993. Optimality Theory: constraint interaction in generative grammar. Rutgers Center for Cognitive Science Technical Report TR-2.

Neural fossils

Neural network cognitive modeling had a brief, precocious golden era between 1986 (the year the Parallel Distributed Processing books came out) and maybe about 1997 (at which point the limitations of those models were widely known…though I’m a little fuzzier about when this realization settled in). During that period, I think it’s fair to say, a lot of people got hired into faculty positions, in psychology and linguistics in particular, simply because they knew a bit about this exciting new approach. Some of those people went on to do other interesting things once the shine had worn off, but a lot of them didn’t, and some of them are even still around, haunting the halls of R1s. I think something similar will happen to the new crop of LLMologists in the academy: some have the skills to pivot should we reach peak LLM (if we haven’t already), but many don’t.

When LLMing goes wrong

[The following is a guest post from Daniel Yakubov.]

You’ve probably noticed that industries have been jumping to adopt some vague notion of “AI”, or peacocking about their AI-powered something-or-other. Unsurprisingly, the scrambled nature of this adoption leads to a slew of issues. This post outlines a fact obvious to technical crowds but not to business folks: even though LLMs are a shiny new toy, LLM-centric systems still require careful consideration.

Hallucination is possibly the most common issue in LLM systems: the tendency for an LLM to prioritize responding over responding accurately, a.k.a. making stuff up. By considering some of the common approaches to fixing this, we can understand what problems these techniques introduce.

A quick approach that many prompt engineers I know think is the end-all be-all of Generative AI is Chain-of-Thought (CoT; Wei et al. 2023). This simple approach just tells the LLM to break down its reasoning “step-by-step” before outputting a response. But CoT is a bandage: it does not actually inject new knowledge into an LLM, and this gap is where the Retrieval-Augmented Generation (RAG) craze began. RAG represents a family of approaches that add relevant context to a prompt via search (Lewis et al. 2020). RAG pipelines come with their own errors that need to be understood, including noise in the source documents, misconfigurations in the context window of the search encoder, and the specificity of the LLM reply (Barnett et al. 2024). Specificity is particularly frustrating. Imagine you ask a chatbot “Where is Paris?” and it replies “According to my research, Paris is on Earth.” At this stage, RAG and CoT combined still cannot deal with complicated user queries accurately (or, well, math). To address that, the ReAct agent framework (Yao et al. 2023) is commonly used. ReAct, in a nutshell, gives the LLM access to a series of tools and the ability to “requery” itself depending on the answer it gave to the user query. A central part of ReAct is the LLM being able to choose which tool to use. This is a classification task, and LLMs are observed to suffer from an inherent label bias (Reif and Schwartz 2024), another issue to control for.
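To make the RAG idea concrete, here is a deliberately naive sketch of the retrieve-then-prompt step. The token-overlap scoring and all names here are my own simplifications; real pipelines use trained dense retrievers or BM25 rather than anything this crude:

```python
import re

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Rank documents by naive token overlap with the query.
    q_tokens = set(re.findall(r"\w+", query.lower()))
    scored = sorted(
        documents,
        key=lambda d: len(q_tokens & set(re.findall(r"\w+", d.lower()))),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    # Stuff the retrieved context into the prompt sent to the LLM.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )
```

Each of the failure points Barnett et al. describe lives somewhere in this loop: noisy documents enter at retrieval, and an over- or under-specific reply is the LLM’s doing with whatever context it was handed.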

This could go on for much longer, but I feel the point should be clear. Hopefully this gives a more academic crowd some insight into when LLMing goes wrong.

References

Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., and Abdelrazek, M. 2024. Seven failure points when engineering a retrieval augmented generation system.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., …, Kiela, D. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks.
Reif, Y., and Schwartz, R. 2024. Beyond performance: quantifying and mitigating label bias in LLMs.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., …, Zhou, D. 2023. Chain-of-thought prompting elicits reasoning in large language models.
Yao, S., Zhao, J., Yu, D., Shafran, I., Narasimhan, K., and Cao, Y. 2023. ReAct: synergizing reasoning and acting in language models.


Introducing speakers

The following are my (admittedly normative) notes on how to introduce a linguistics speaker.

  • The genre most similar to the introduction of a speaker is the congratulatory toast. An introduction should be brief, and the lengthy written introduction should be scorned. The speaker is already making an imposition on the audience’s time, and for the host to usurp more of this time than necessary is a further imposition on both speaker and audience.
  • The introduction is not an opportunity for the introducer to demonstrate erudition, but it can be an opportunity to show wit.
  • The introduction should be in the introducer’s voice. For this reason, a biography paragraph provided by the speaker should not be read as part of the introduction.
  • The introduction should be extemporaneous. The introducer can prepare brief notes, but they should fit on a notecard or their hand, and the notes should never be “read”.
  • Polite humor, brief personal anecdotes (e.g., when the introducer first met the speaker or became aware of their work), and heart-felt superlatives or compliments (one of the nicest introductions I ever received stated that I was “in the business of keeping people honest”) are to be encouraged.
  • The introduction should state the speaker’s current affiliation and title, if any, but need not list their full occupational or educational history unless it is judged relevant.
  • Introducers may feel an urge to read the title of the talk when concluding their introduction, but should resist this urge. There is no real need—the audience already has seen the talk title in the program or other announcements, and they can read the slide—and the speaker normally feels the need to read it out loud regardless.
  • The introduction should conclude with the speaker’s name. In one common style, which I consider elegant, the introducer is careful not to say the speaker’s full name until this conclusion, and uses epithets like “our next speaker” or “our honored guest” earlier in the introduction.

Libfix report: -gler and what we can learn from it

The libfix -gler is an interesting case that seems to illustrate Zwicky’s hypothesis that blends lead to libfixes. Patient zero is clearly Googler, which is corporate’s preferred term for Google employees; it is widely used in-group as well. This is an ordinary example of the relatively productive -er suffix that creates agent nominals (backbencher, J6er), with a connotation that the agent does something habitually (e.g., pickleballer) or as an occupation (cartographer). That a Googler is someone who works at Google, and not just a habitual user of the search engine, is slightly notable but not shocking.

The next best-established forms look more like blends based on Googler. First and most saliently, there is noogler ‘new Google employee’, where the word-onset has been replaced to create a blend with new. Members of an official affiliational group (/listserv) for Jewish Googlers call themselves Jewglers, with a similar single-phoneme substitution. Since both new and Jew share the /Cuː-/ initial of Google(r)—the only adaptation is changing the place and manner of the initial C—this still looks like blending rather than recutting. Xoogler ‘former employee of Google’ is presumably pronounced ex-oogler (I’ve never heard it said out loud myself); the term is commonly used by entrepreneurs in their fund-raising (cf. xoogler.co); this could be a blend or just regarded as a one-off truncation of ex-Googler.

Cats and cows (and ball pythons) are not permitted at Google’s offices, but I have seen mewgler and moogler for cat- and cow-fancier Google employees. However, Googlers are permitted to bring their (well-behaved, vaccinated) dogs to work, and employees who do so call themselves dooglers. (Then again, my colleague LeeAnn says the dogs themselves are the dooglers!) My intuition is that this is pronounced [duː.glɚ] and not *[dɒ.glɚ], and thus this is less blend-like than any of the aforementioned examples, because dog and the base Googler have different vowels (albeit both back vowels). The same is true of Zoogler for Google employees based out of the Zurich office, since the first vowel in the US English pronunciation of Zurich is [ʊ], and we even get a seemingly more dissimilar-to-base front-gliding diphthong in gaygler, the term for members of the company-internal LGBT affiliation group (/listserv).

In my analysis, doogler, Zoogler, and gaygler strongly suggest that we have gone from a blend with Googler as its base to an incipient liberated affix -gler denoting agents associated with Google.

The biggest puzzle about English libfixes, for me, one not answered in any of the prior work, is why the recutting occurs where it does. It is perhaps not surprising that the rather-homophonous -er has not been given yet another sense, but why is it -gler and not -oogler (which would give us the not-preposterous, but unattested, *gayoogler) or -ler (*gayler)? While Zwicky’s cline hypothesis does not answer this, here is one possible way to operationalize it: recutting is blend reanalysis, in which source morphemes like new and Jew are parsed maximally in noogler and Jewgler, with the remainder giving us the new affix -gler.

I know of many other libfixes consistent with this analysis. For example, from the blend fursona (fur + persona) we have -sona (e.g., catsona, puppysona); from glitterati (glitter + literati) we have -rati (e.g., Twitterati, technorati); from funtastic (fun + fantastic) we have -tastic (e.g., chavtastic, shagtastic); and from telethon (telephone + marathon) we have -(a)thon (e.g., saleathon, mathathon).

This is of course not the full story. Most other libfixes in my corpus seem to arise at a preexisting morpheme boundary, with one or the other piece reinterpreted as a productive affix with new lexical semantics based on the full form. Some examples include cran- from cranberry (e.g., crantini), -gate from Watergate (e.g., Troopergate), -mare from nightmare (e.g., editmare), and -berg from iceberg (e.g., fatberg). In many cases, the morpheme boundaries are abstract ones mirroring the segmentation of Latin or Greek complex word borrowings, as in -(i)verse from the Latin-based universe (e.g., Buffyverse) or -(o)nomics from the Greek-based economics (e.g., the brand name Chemonomics). My corpus also includes Franken- from the German compound name Franken-stein (e.g., Frankenfood) and -nik from the Russian derivative s-put-nik (e.g., peacenik). In the above cases, the segmentation is more or less the same as in the donor language.

In other cases, though, the segmentation is different than that in the donor language, as in -ohol(ic), ultimately from Arabic, which retains a little less of the base than one might expect (here al- is the Arabic definite prefix), and -copter from neo-Greek and -nado from Spanish, which both retain a little more. While I don’t think it’s that controversial to posit that cranberry, economics, or universe are represented as complex nouns (our understanding of the morphophonology of English seems to depend on this conclusion, and virtually all behavioral research on word processing supports this), it is perhaps not shocking that ordinary English speakers are unfamiliar with the morphology of the etyma of alcohol, helicopter, and tornado in the ultimate donor languages. These more ad-hoc recuttings don’t necessarily line up with syllable boundaries—though hypothetical *-cohol or *-ler would have—so I suspect there is just some inherent stochasticity to how this final type of recutting proceeds.

Tutorial on Substance-Free Logical Phonology

A few days ago, we (myself with the help of graduate student Rim Dabbous and professor Charles Reiss) gave a detailed tutorial on Substance-Free Logical Phonology (LP) at the LSA meeting in Philadelphia. I was pleasantly surprised by how many people showed up (I printed about 30 handouts, and we ran out) and by how engaged they were. If the question period is any indication, our colleagues understood the theory quite well. The handout is now on LingBuzz, and we have a lot more material coming for you soon.

I’ll try to summarize how LP fits into theories of exceptionality in a separate post soon. 

Postscript

I also took the liberty of creating a simple archive for our Logical Phonology papers, the Logical Phonology Archive (LOA for short, pronounced [lwa] like French loi ‘law’ or Kreyòl loa ‘vodou spirit’). Technically, this site is sort of interesting: other than some static text, the table containing papers is an HTML iframe, dynamically generated from a Google Sheet using a template populated by server-side JavaScript on the Google Apps Script platform; hat tip to my colleague Rivka Levitan for making me aware of this very handy possibility.