Guest & Martin on neural networks as cognitive models

In our “big questions” class, we read a few papers about whether large artificial neural network language models are good (or even candidate) cognitive models. As part of my background reading I also reviewed this recent paper by Guest & Martin (2023). The crux of the paper is an argument based on simple propositional logic, and because it was hard for me to follow, I thought I’d try to review it here.

G&M first identify a common, mostly implicit argument for studying artificial neural networks as cognitive models, one which takes the form of modus ponens. I will take the liberty of generalizing it considerably here.

  • $P \rightarrow Q$: if neural networks (i.e., their outputs) are correlated with behavioral or neuroimaging data, they are plausible cognitive models (“do what people do”).
  • $P$: neural networks are correlated with such data.
  • $\vdash Q$: therefore they are plausible cognitive models.

G&M give several examples where this argument has been applied, and it is the exact motivation that linguists engaged in “LLMology” tend to give during the question period. The problem, as G&M note, is that the correctness of this inference depends crucially on whether $P \rightarrow Q$, and there is no shortage of arguments against that proposition. The most obvious one, of course, is the possibility of multiple realizability. They use the example of two clocks that are behaviorally quite similar, but one is actually based on springs and cogs whereas the other has a quartz movement powered by a battery. Clearly, neural networks and human brains could both realize the same sorts of behaviors/mappings without being internally the same.

G&M continue that if the above inference is valid, it should be possible to apply modus tollens to it as well. This has the following general form.

  • $P \rightarrow Q$: (as above).
  • $\neg Q$: neural networks are not plausible cognitive models.
  • $\vdash \neg P$: therefore neural networks (i.e., their outputs) are not correlated with behavioral or neuroimaging data.

G&M give several examples where such an argument could easily be applied: so-called hallucinations, cases where neural networks continue to underperform humans when provided with reasonable amounts of data, as well as cases where neural networks can be shown to exhibit superhuman performance! As they conclude: “Even though $Q$ can, and often does, fail to be true, we, as a field, do not formulate its relationship to $P$ in terms of MT [modus tollens]” (G&M: 217). Rather, they argue, what people actually do is show that:

  • $Q \rightarrow P$: if neural networks are plausible cognitive models (“do what people do”), then their outputs are correlated with behavioral or neuroimaging data.

Using this, together with the observation $P$, to assert $Q$ is of course the fallacy of affirming the consequent, and is clearly invalid. What G&M ultimately seem to conclude is that little can be logically concluded from cognitive modeling with artificial neural networks, even if these models remain “useful” in many domains.
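The contrast among these three argument forms can be checked mechanically by enumerating truth assignments. A minimal sketch in Python (my own illustration, not anything from G&M):

```python
from itertools import product


def valid(premises, conclusion):
    """A propositional argument form is valid iff every truth assignment
    that makes all premises true also makes the conclusion true."""
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )


def implies(a, b):
    return (not a) or b


# Modus ponens: P -> Q, P, therefore Q.
modus_ponens = valid([implies, lambda p, q: p], lambda p, q: q)
# Modus tollens: P -> Q, not Q, therefore not P.
modus_tollens = valid([implies, lambda p, q: not q], lambda p, q: not p)
# Affirming the consequent: Q -> P, P, therefore Q.
affirming = valid([lambda p, q: implies(q, p), lambda p, q: p], lambda p, q: q)

print(modus_ponens, modus_tollens, affirming)  # True True False
```

The counterexample the last check finds is exactly the multiple-realizability scenario: $P$ true (the correlations hold) but $Q$ false (the model is not a plausible cognitive model).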

References

Guest, O., and Martin, A. E. 2023. On logical inference over brains, behaviour, and artificial neural networks. Computational Brain & Behavior 6:213-227.

Cajal on “diseases of the will”

Charles Reiss (h/t) recently recommended to me a short book by Santiago Ramón y Cajal (1852-1934), an important Spanish neuroscientist and physician. Cajal first published the evocatively titled Reglas y Consejos sobre Investigación Científica: Los tónicos de la voluntad in 1897, and it has been revised and translated various times since then. By far the most entertaining portion for me is chapter 5, entitled “Diseases of the Will”. Despite the name, what Cajal actually presents is a taxonomy of scientists who contribute little to scientific inquiry: “contemplators”, “bibliophiles and polyglots”, “megalomaniacs”, “instrument addicts”, “misfits”, and “theorists”. I include a PDF of this brief chapter, translated into English, here for interested readers, in the belief that it is in the public domain.

“Indo-European” is not a meaningful typological descriptor

A trope I see in a lot of student writing (and computational linguistics writing at all levels) is a critique of prior work as being only on “Indo-European languages”, and sometimes a promise that current or future work will target “non-Indo-European languages”.

To me, this is drivel. The Indo-European language family is quite diverse: for the vast majority of things I’m interested in, either, say, Italian or Russian is sufficiently different from English to make a relevant comparison. And there are a huge number of “non-Indo-European” languages that are typologically similar to at least some Indo-European languages on at least some dimensions; e.g., the Finno-Ugric, “Aquitanian” (i.e., Basque), and (narrowly defined) “Altaic” families (Mongolic, Tungusic, and Turkic) have quite a bit in common typologically with IE, as do, say, Japanese and Korean. Genetic relatedness just isn’t that typologically informative in very dense, very “old” families like IE.

If you want to talk typology, you should focus on typological aspects actually relevant to your study rather than genetic relatedness. If you’re studying phonology, the presence of vowel harmony in the family may be relevant (but note that Estonian, despite being Finnic, does not have productive harmony); if you’re interested in morphology, then notions like agglutination may be relevant (though not necessarily). Gross word order descriptors (like “VSO”) are likely to be relevant for syntax, and so on.

In some cases, one of the relevant typological aspects is not language typology, but rather writing systems typology. There, genetic relatedness isn’t very informative either, because the vast majority of writing systems used today (and virtually all of them outside East Asia) are ultimately descended from Egyptian hieroglyphs. And we shouldn’t confuse writing system and language.

Linguists ought to know better.

How many Optimality Theory grammars are there?

How many Optimality Theory (OT) grammars are there? I will assume first off that Con, the constraint set, is fixed and finite. I make this assumption because it is made by nearly all work on the mathematical properties of OT. Given the many unsolved problems in OT learning, joint learning of constraints and rankings seems to add unnecessary additional complexity.1 So we suppose that there is a finite set of constraints, and let $n$ be the cardinality of that set. If we put aside for now the contents of the lexicon (i.e., the set of URs), one can ask how many possible constraint rankings there are as a function of $n$.

Prince & Smolensky (1993), in their “founding document”, seem to argue that constraint rankings are total. Consider the following from a footnote:

With a grammar defined as a total ranking of the constraint set, the underlying hypothesis is that there is some total ranking which works; there could be (and typically will be) several, because a total ranking will often impose noncrucial domination relations (noncrucial in the sense that either order will work). It is entirely conceivable that the grammar should recognize nonranking of pairs of constraints, but this opens up the possibility of crucial nonranking (neither can dominate the other; both rankings are allowed), for which we have not yet found evidence. Given present understanding, we accept the hypothesis that there is a total order of domination on the constraint set; that is, that all nonrankings are noncrucial.

So it seems like they treat strict ranking as a working hypothesis. This is also reflected in the name of their method factorial typology, because there are $n!$ strict rankings of $n$ constraints. Roughly, they propose that one generate all possible rankings of a series of possible constraints, and compare their extension to typological information.2 In practice, as they acknowledge, OT grammars—by which I mean theories of i-languages presented in the OT literature—contain many cases where two constraints $c, d$ need not be ranked with respect to each other. For example, the first chapter of Kager’s (1999) widely used OT textbook includes a “factorial typology” (p. 36, his ex. 53) in which some constraints are not strictly ranked, and this is followed by a number of tableaux in which he uses a dashed vertical line to indicate non-ranking.3

Prince & Smolensky also acknowledge the existence of cases where there is no evidence with which to rank $c$ versus $d$. This situation is important because any learning algorithm trying to enforce a strict ranking of constraints would have to resort to a coin flip or some other arbitrary mechanism to finalize the ranking. Such algorithms are possible to imagine in principle, but neither this final step nor the strict-ranking postulate it serves is independently motivated.

Prince & Smolensky finally admit the possibility of what they call crucial non-ranking, cases where leaving $c$ and $d$ mutually unranked has the right extension, but $c \lt d$ and $d \lt c$ both have the wrong extension.4 Such cases would constitute the strongest evidence for viewing OT grammars as weakly ranked, particularly if such grammars are linguistically interesting.

I will adopt the hypothesis of weak ranking; it seems unavoidable if only as a matter of acquisition. If one does, the set of possible rankings is actually much larger than $n!$. For example, for the empty set and singleton sets, there is but one weak ranking. For sets of size 2, there are 3: $\{a \lt b;\ b \lt a;\ a, b\}$. And, at the risk of being pedantic, for sets of size 3 there are 13:

  1. $a \lt b \lt c$
  2. $a \lt b, c$
  3. $a \lt c \lt b$
  4. $a, b\lt c$
  5. $a, b, c$
  6. $a, c \lt b$
  7. $b \lt a \lt c$
  8. $b \lt a, c$
  9. $b \lt c \lt a$
  10. $b, c \lt a$
  11. $c \lt a \lt b$
  12. $c \lt a, b$
  13. $c \lt b \lt a$

Already one can see that this is growing faster than $n!$ since $3! = 6$.
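A weak ranking is just an ordered partition of the constraint set into strata of mutually unranked constraints, so the 13 rankings above can also be generated by brute force. A quick sketch (my own illustration, not from any OT source):

```python
from itertools import combinations


def weak_rankings(items):
    """Yield each weak ranking of `items` as a tuple of strata
    (frozensets), highest-ranked stratum first."""
    items = tuple(items)
    if not items:
        yield ()
        return
    # Choose the top stratum, then rank the remaining items recursively.
    for k in range(1, len(items) + 1):
        for top in combinations(items, k):
            rest = tuple(x for x in items if x not in top)
            for tail in weak_rankings(rest):
                yield (frozenset(top),) + tail


print(len(list(weak_rankings("abc"))))  # 13
```

For example, ranking 4 above, $a, b \lt c$, comes out as the tuple `(frozenset({'a', 'b'}), frozenset({'c'}))`.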

It is not all that hard to figure out what the cardinality function is here. Indeed, I think Charles Yang (h/t) showed this to me many years ago, and though I lost my notes and had to re-derive it, I’m reasonably sure the solution he came up with is the same as the one I present here. This sequence is known as A000670 and has the formula

$$a(n) = \sum_{k = 0}^n k! S(n, k)$$

where $S$ is the Stirling number of the second kind. Expanding that further we obtain:

$$a(n) = \sum_{k = 0}^n k! \sum_{i = 0}^k \frac{(-1)^{k - i}i^n}{(k - i)!i!}.$$

This is a lot of grammars by any account. With just 10 constraints, we are already over a hundred million, and with 20 it’s between 2 and 3 sextillion. Yet this doesn’t look all that dramatically different from strict ranking when plotted on a log scale.
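The closed form above is easy to check numerically. Here is a quick sketch implementing the sum, with the inner sum being just $k!\,S(n,k)$ rewritten in terms of binomial coefficients:

```python
from math import comb


def count_weak_rankings(n):
    """Number of weak rankings of n constraints (OEIS A000670),
    computed as sum_k k! S(n, k), where the term k! S(n, k) expands to
    sum_i (-1)^(k - i) C(k, i) i^n."""
    return sum(
        (-1) ** (k - i) * comb(k, i) * i**n
        for k in range(n + 1)
        for i in range(k + 1)
    )


print([count_weak_rankings(n) for n in range(5)])  # [1, 1, 3, 13, 75]
```

This reproduces the hand counts above (1 ranking for sizes 0 and 1, 3 for size 2, 13 for size 3) and makes it easy to see how quickly the sequence outpaces $n!$.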

Clearly this is essentially infinite in the terminology of Gallistel and King (2009), and this ability to generate such combinatoric possibilities from a small inventory of primitive objects and operations has long been recognized as a desirable element of cognitive theories. Of course, many of these weak rankings will be extensionally equivalent, and still others, I hypothesize, will be extensionally non-equivalent overall but extensionally equivalent with respect to some lexicon. None of this is unique to OT: it’s just good cognitive science.

Endnotes

  1. Note that the constraint induction method of Hayes & Wilson (2008), for example, only considers a finite set of candidate constraints, so some finite constraint set does exist for their approach. The same is probably true of approaches with constraint conjunction, so long as there are some reasonable bounds.
  2. Exactly how the analyst is supposed to use this comparison is a little unclear to me, but presumably one can eyeball it to determine if the typological fit is satisfactory or not.
  3. As far as I can tell, he never explains this notation.
  4. There is a tradition of using $\ll$ for OT constraint ranking but I’ll put it aside because $\lt$ is perfectly adequate and is the operator used in order theory.

References

Gallistel, C. Randy and King, A. P. 2009. Memory and the Computational Brain: Why Cognitive Science Will Transform Neuroscience. Wiley-Blackwell.
Hayes, B. and Wilson, C. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39: 397-440.
Prince, A., and Smolensky, P. 1993. Optimality Theory: constraint interaction in generative grammar. Rutgers Center for Cognitive Science Technical Report TR-2.

Neural fossils

Neural network cognitive modeling had a brief, precocious golden era between 1986 (the year the Parallel Distributed Processing books came out) and maybe about 1997 (at which point the limitations of those models were widely known…though I’m a little fuzzier about when this realization settled in). During that period, I think it’s fair to say, a lot of people got hired into faculty positions, in psychology and linguistics in particular, simply because they knew a bit about this exciting new approach. Some of those people went on to do other interesting things once the shine had worn off, but a lot of them didn’t, and some of them are even still around, haunting the halls of R1s. I think something similar will happen to the new crop of LLMologists in the academy: some have the skills to pivot should we reach peak LLM (if we haven’t already), but many don’t.

Email discipline

There is a Discourse on what we might call email discipline. Here are a few related takes.

There are those who simply don’t respond to email at all. These people are demons and you should pay them no mind. 

Relatedly, there are those who “perform” some kind of message about their non-email responding. Maybe they have a long FAQ on their personal website about how exactly they do or do not want to be emailed. I am not sure I actually believe these people get qualitatively more email than I do. Maybe they get twice as much as me, but I don’t think anybody’s reading that FAQ, buddy. Be serious.

There are those who believe it is a violation to email people off-hours, or on weekends or holidays, or whatever. I don’t agree: it’s an asynchronous communication mechanism, so that’s sort of the whole point. I can have personal rules about when I read email, and these depend in no way on your rules (or lack thereof) about when you send it. Expecting people to know and abide by your Email Reading Rules FAQ is just as silly.

I have an executive function deficit, diagnosed as a child (you know the one), and if you’re lucky, they teach you strategies to cope. I think non-impaired people should just model one of the best: email can’t be allowed to linger. If it’s unimportant, you need to archive it. If it’s important, you need to respond to it. You should not have a mass of unopened, unarchived emails at any point in your life. It’s really that easy.

When LLMing goes wrong

[The following is a guest post from Daniel Yakubov.]

You’ve probably noticed that industries have been jumping to adopt some vague notion of “AI”, or peacocking about their AI-powered something-or-other. Unsurprisingly, the scrambled nature of this adoption leads to a slew of issues. This post outlines a fact obvious to technical crowds but not to business folks: even though LLMs are a shiny new toy, LLM-centric systems still require careful consideration.

Hallucination is possibly the most common issue in LLM systems: the tendency for an LLM to prioritize responding over responding accurately, i.e., making stuff up. By considering some of the common approaches to fixing this, we can understand what problems these techniques themselves introduce.

A quick approach that many prompt engineers I know treat as the end-all be-all of generative AI is chain-of-thought prompting (CoT; Wei et al. 2023). This simple approach just tells the LLM to break down its reasoning “step-by-step” before outputting a response. But CoT is a bandage: it does not actually inject new knowledge into an LLM. This is where the retrieval-augmented generation (RAG) craze began. RAG represents a family of approaches that add relevant context to a prompt via search (Lewis et al. 2020). RAG pipelines come with their own errors that need to be understood, including noise in the source documents, misconfigurations in the context window of the search encoder, and the specificity of the LLM reply (Barnett et al. 2024). Specificity is particularly frustrating. Imagine you ask a chatbot “Where is Paris?” and it replies “According to my research, Paris is on Earth.” Even combined, RAG and CoT still cannot deal with complicated user queries accurately (or, well, math). To address that, the ReAct agent framework (Yao et al. 2023) is commonly used. ReAct, in a nutshell, gives the LLM access to a series of tools and the ability to “requery” itself depending on the answer it gave to the user query. A central part of ReAct is the LLM choosing which tool to use. This is a classification task, and LLMs are observed to suffer from an inherent label bias (Reif and Schwartz 2024), yet another issue to control for.
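To make the RAG recipe concrete, here is a toy sketch of prompt assembly. The word-overlap `retrieve` is a stand-in for a real search encoder, and every function and name here is my own invention rather than anything from the cited papers:

```python
import re


def tokens(text):
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, k=2):
    # Toy lexical retriever: rank documents by word overlap with the query.
    # A real RAG system would use a trained sparse or dense encoder instead.
    return sorted(
        corpus, key=lambda doc: len(tokens(doc) & tokens(query)), reverse=True
    )[:k]


def build_prompt(query, corpus):
    # RAG prompt assembly: retrieved context, a CoT instruction, then the query.
    context = "\n".join(retrieve(query, corpus))
    return (
        f"Context:\n{context}\n\n"
        "Think step by step and answer using only the context above.\n"
        f"Question: {query}"
    )
```

Each of the failure points mentioned above lives somewhere in this little pipeline: noise in the `corpus` documents, a badly chosen `k` or retriever, and a generator that is still free to answer at an unhelpful level of specificity.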

This can go for much longer, but I feel the point should be clear. Hopefully this gives a more academic crowd some insight into when LLMing goes wrong.

References

Barnett, S., Kurniawan, S., Thudumu, S. Brannelly, Z., and Abdelrazek, M. 2024. Seven failure points when engineering a retrieval augmented generation system.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., …, Kiela, D. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks.
Reif, Y., and Schwartz, R. 2024. Beyond performance: quantifying and mitigating label bias in LLMs.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., …, Zhou, D. 2023. Chain-of-thought prompting elicits reasoning in large language models.
Yao, S., Zhao, J., Yu, D., Shafran, I., Narasimhan, K., and Cao, Y. 2023. ReAct: synergizing reasoning and acting in language models.


Snacks at talks

The following is how to put out a classy spread for your next talk; ignoring beverages and extras, everything listed should ring up at around $50.

  • The most important snack is cheese. Yes, some people are vegan or lactose-intolerant, but cheese is one of the most universally beloved snacks worldwide. Most cheeses keep for a while with refrigeration, and some even keep at room temperature. Cheese is, as a dear friend says, one of the few products whose quality scales more or less linearly with its price, and I would recommend at least two mid-grade cheeses. I usually buy one soft one (Camembert, Brie, and Stilton are good choices) and one semi-hard one (Emmental or an aged Cheddar, for example). The cheese should be laid out on a cutting board with some kind of metal knife for each, and should not be pre-cut (that’s a little tacky). Cheeses should be paired with a box of Carr’s Water Crackers or similar. Estimated price: $15-20.
  • Fresh finger vegetables are also universally liked. The easiest options are finger carrots and pre-cut celery sticks. If you can find pre-cut multi-color bell peppers or broccoli, those are good options too. You can pair this with some kind of creamy dip (it’s easy to make ranch or onion dip using a pint of sour cream and a dip packet, but you need a spoon or spatula to stir it up) but you certainly don’t have to. Estimated price: $10-20.
  • Fruit is a great option. The simplest thing to do is to just buy berries, but this is not foolproof: blueberries are a little small for eating by hand; raspberries lack structural integrity; and where I live, strawberries are only in season in mid-summer, and are expensive and low-quality otherwise. In Mid-Atlantic cities, there are often street vendors who sell containers of freshly cut fruit (usually slices of pineapple, mango, and banana, and perhaps some berries), and if this is available it is a good idea too. Estimated price: $10-15.

This, plus some water, is basically all you need to put out. Here are some ways to potentially extend it.

  • Chips are a good option. I think ordinary salty potato chips are probably the best choice simply because they’re usually eaten by themselves. In contrast, if you put out tortilla chips, you need to pair them with some kind of salsa or dip, and you need to buy a brand with sufficient “structural integrity” to actually pick up the dip.
  • Nuts are good too, obviously; maybe pick out a medley.
  • Soda water is really popular and cheap. I recommend 12oz cans. It should always be served chilled.
  • A few bottles or cans of beer may go over well. With rare exceptions, beer should be served chilled.
  • A bottle of wine may be appropriate. Chill it if it’s a varietal that needs to be chilled.

If the talk is before noon, coffee (and possibly hot water and tea bags) is more or less expected. There is something of a taboo in the States against consuming or serving alcohol before 4pm or so, and you may or may not want your event to have a happy hour atmosphere even if it’s in the evening.

And here are a few things I cannot recommend:

  • In my milieu it is uncommon for people to drink actual soda.
  • I wouldn’t recommend cured meats or charcuterie for a talk. The majority of people won’t touch the stuff these days, and it’s pretty expensive.
  • I love hummus, but mass-produced hummus is almost universally terrible. Make it at home (it’s easy if you have a food processor) or forget about it.
  • Store-bought guacamole tastes even worse, and it has a very short shelf life.

Introducing speakers

The following are my (admittedly normative) notes on how to introduce a linguistics speaker.

  • The genre most similar to the introduction of a speaker is the congratulatory toast. An introduction should be brief, and the lengthy written introduction should be scorned. The speaker is already making an imposition on the audience’s time, and for the host to usurp more of this time than necessary is a further imposition on both speaker and audience.
  • The introduction is not an opportunity for the introducer to demonstrate erudition, but it can be an opportunity to show wit.
  • The introduction should be in the introducer’s voice. For this reason, a biography paragraph provided by the speaker should not be read as part of the introduction.
  • The introduction should be extemporaneous. The introducer can prepare brief notes, but they should fit on a notecard or their hand, and the notes should never be “read”.
  • Polite humor, brief personal anecdotes (e.g., when the introducer first met the speaker or became aware of their work), and heart-felt superlatives or compliments (one of the nicest introductions I ever received stated that I was “in the business of keeping people honest”) are to be encouraged.
  • The introduction should state the speaker’s current affiliation and title, if any, but need not list their full occupational or educational history unless it is judged relevant.
  • Introducers may feel an urge to read the title of the talk when concluding their introduction, but should resist this urge. There is no real need—the audience already has seen the talk title in the program or other announcements, and they can read the slide—and the speaker normally feels the need to read it out loud regardless.
  • The introduction should conclude with the speaker’s name. In one common style, which I consider elegant, the introducer is careful not to say the speaker’s full name until this conclusion, and uses epithets like “our next speaker” or “our honored guest” earlier in the introduction.