Robot autopsies

I don’t really understand the exuberance for studying whether neural networks know syntax. I have a lot to say about this issue—I’ll return to it later—but for today I’d like to briefly discuss this passage from a recent(ish) paper by Baroni (2022). The author expresses great surprise that few formal linguists have cited a particular paper (Linzen et al. 2016) about the ability of neural networks to learn long-distance agreement phenomena. (To be fair, Baroni is not a coauthor of said paper.) He then continues:

While it is possible that deep nets are relying on a completely different approach to language processing than the one encoded in human linguistic competence, theoretical linguists should investigate what are the building blocks making these systems so effective: if not for other reasons, at least in order to explain why a model that is supposedly encoding completely different priors than those programmed into the human brain should be so good at handling tasks, such as translating from a language into another, that should presuppose sophisticated linguistic knowledge. (Baroni 2022: 11).

I think this passage is a useful stepping-off point for my own views. I want to be clear: I am not “picking on” Baroni, who is far more senior than I am and certainly better known; his is just a particularly clearly written statement of a position with which I happen to disagree.

Baroni says it is “possible that deep nets are relying on a completely different approach to language processing…” than the human one; I’d say it is basically certain that they are. We simply have no reason to think they might be using similar mechanisms, since humans and neural networks do not share any of the same ingredients. Any similarities will naturally be analogies, not homologies.

Without a strong reason to think neural models and humans share some kind of cognitive homologies, there is no reason for theoretical linguists to investigate them; as artifacts of human culture they are no more in the domain of study for theoretical linguists than zebra finches, carburetors, or the perihelion of Mercury. 

It is not even clear how one ought to poke into the neural black box. Complex networks are mostly resistant to the kind of proof-theoretic techniques that mathematical linguists (witness the Delaware school, or even just work by, say, Tesar) actually rely on, and most of the results we do have are both negative and of minimal applicability: for instance, we know that there always exists a network with a single hidden layer large enough to approximate, with arbitrary precision, any function a multi-layer network encodes, but we have no way to figure out how big is big enough for a given function.
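
For concreteness, the result I have in mind is the classical universal approximation theorem; the statement below is the standard one for sigmoidal activations (following Cybenko 1989), and note that it is purely existential:

\[
\forall f \in C([0,1]^n),\ \forall \varepsilon > 0,\ \exists N,\ \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n :\quad
\sup_{x \in [0,1]^n} \Bigl|\, f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^{\top} x + b_i) \Bigr| < \varepsilon.
\]

Nothing here bounds N as a function of f or ε, which is exactly why we cannot say how big is big enough.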

Probing and other interpretative approaches exist, but have not yet proved themselves, and it is not clear that theoretical linguists have the relevant skills to push things forward anyway. Quality assurance and adversarial data generation are not exactly high-status work; how can Baroni demand that Cinque or Rizzi (to choose two of Baroni’s well-known countrymen) put down their chalk and start doing free or poorly paid QA for Microsoft?

Why should theoretical linguists of all people be charged with doing robot autopsies when the creators of the very same robots are alive and well? Either it’s easy and they’re refusing to do the work, or—and I suspect this is the case—it’s actually far beyond our current capabilities and that’s why little progress is being made.

I for one am glad that, for the time being, most linguists still have a little more self-respect. 

References

Baroni, M. 2022. On the proper role of linguistically oriented deep net analysis in linguistic theorising. In S. Lappin (ed). Algebraic Structures in Natural Language, pages 1-16. Taylor & Francis.
Linzen, T., Dupoux, E., and Goldberg, Y. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4: 521-535.

A prediction

You didn’t build that. – Barack Obama, July 13, 2012

Connectionism originates in psychology, but the “old connectionists” are mostly gone, having largely failed to pass on their ideology to their trainees, and there really aren’t many “young connectionists” to speak of. Even so, I predict that in the next few years we’ll see a bunch of psychologists of language—the ones who define themselves by their opposition to internalism, innateness, and generativism—become some of the biggest cheerleaders for large language models (LLMs). This will be so despite the fact that psychologists have not made substantial contributions to neural network modeling in many years: virtually all the work on improving neural networks over the last few decades has been done by computer scientists who care not a whit whether their models have anything to do with human brains or cognitive plausibility.1 (Sometimes they’ll put things like “…inspired by the human brain…” in the press releases, but we all know that’s just fluff.) At this point, psychology as a discipline has no more claim to neural networks than the Irish do to Gaul, and in the rather unlikely case that LLMs do end up furnishing deep truths about cognition, psychology as a discipline will have failed us by not following up on a promising lead. I think it will be particularly revealing if psychologists who previously worshipped at the Church of Bayes suddenly lose all interest in mathematical rigor and find themselves praying to the great Black Box. I want to say it now: if this happens—and I am starting to see signs that it will—those people will be cynics, haters, and trolls, and you shouldn’t pay them any mind.

Endnotes

  1. I am also critical of machine learning pedagogy, and it is therefore interesting to see that those same computer scientists pushing things forward don’t seem to care much for machine learning as an academic discipline either.

On the past tense debate; Part 3: the overestimation of overirregularization

One final, and still unresolved, issue in the past tense debate is the role of so-called overirregularization errors.

It is well-known that children acquiring English tend to overregularize irregular verbs; that is, they apply the regular -d suffix to verbs which in adult English form irregular pasts, producing, e.g., *thinked for thought. Maratsos (2000) estimates that children acquiring English very frequently overregularize irregular verbs; for instance, Abe, recorded roughly 45 minutes a week from ages 2;5 to 5;2, overregularizes rare irregular verbs as much as 58% of the time, and even the most frequent irregular verbs are overregularized 18% of the time. Abe appears to have been exceptional in that he had a very large receptive vocabulary for his age (as measured by the Peabody Picture Vocabulary Test), giving him more opportunities (and perhaps more grammatical motivation) for overregularization,1 but Maratsos estimates that less-precocious children have lower but overall similar rates of overregularization.

In contrast, it is generally agreed that overirregularization, the application of irregular patterns (e.g., in English, of ablaut, shortening, etc.) to verbs that do not take them, is quite a bit rarer. The only serious attempt to count overirregularizations is by Xu & Pinker (1995; henceforth XP). They estimate that children produce such errors no more than 0.2% of the time, which would make overirregularizations roughly two orders of magnitude rarer than overregularizations. This is a substantial difference. If anything, I think that XP overestimate overirregularizations. For instance, XP count brang as an overirregularization, even though this form exists quite robustly in adult English (though it is somewhat stigmatized). Furthermore, XP count *slep for slept as an overirregularization, though this is probably just ordinary (t, d)-deletion, a variable rule attested already in early childhood (Payne 1980). But by any account, overirregularization is extremely rare. The same is found in nonce word elicitation experiments such as those conducted by Berko (1958): both children and adults are loath to generate irregular past tenses for nonce verbs.2

This is a problem for most existing computational models. Nearly all of them—Albright & Hayes’ (2003) rule-based model (see their §4.5.3), O’Donnell’s (2015) rules-plus-storage system, and all analogical models and neural networks I am aware of—not only overregularize, as children do, but also overirregularize at rates far exceeding anything observed in children. I submit that any computational model which produces substantial overirregularization is simply on the wrong track.
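
To make this evaluation criterion concrete, here is a minimal sketch of how one might classify past tense productions, whether a child's or a model's. The data format and the regularize helper are hypothetical illustrations, not XP's actual procedure; a real count would also have to handle (t, d)-deletion and attested adult variants like brang, per the caveats above.

    # A minimal sketch of past-tense error classification; the triple format
    # (lemma, produced form, adult irregular past or None) is hypothetical.
    from typing import Optional

    def regularize(lemma: str) -> str:
        """Naive regular past; a real model would implement the
        [-t] ~ [-d] ~ [-@d] allomorphy rather than orthographic -ed."""
        return lemma + "ed"

    def classify(lemma: str, produced: str, adult: Optional[str]) -> str:
        """Classifies one production; `adult` is the irregular past,
        or None if the verb is regular in adult English."""
        regular = regularize(lemma)
        if adult is None:
            return "correct" if produced == regular else "overirregularization"
        if produced == adult:
            return "correct"
        if produced == regular:
            return "overregularization"   # e.g., *thinked for thought
        return "overirregularization"     # e.g., brang for brought

    sample = [("think", "thinked", "thought"),
              ("bring", "brang", "brought"),
              ("sleep", "slept", "slept")]
    labels = [classify(*row) for row in sample]
    print(labels.count("overirregularization") / len(labels))  # 0.333...

A model that is on the right track should drive that final rate toward zero while still producing overregularizations at child-like rates.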

Endnotes

  1. It is amusing to note that Abe is now, apparently, a trial lawyer and partner at a white-shoe law firm.
  2. As I mentioned in a previous post, this is somewhat obscured by ratings tasks, but that’s further evidence we should disregard such tasks.

References

Albright, A. and Hayes, B. 2003. Rules vs. analogy in English past tenses: a computational/experimental study. Cognition 90(2): 119-161.
Berko, J. 1958. The child’s learning of English morphology. Word 14: 150-177.
Maratsos, M. 2000. More overregularizations after all: new data and discussion on Marcus, Pinker, Ullman, Hollander, Rosen & Xu. Journal of Child Language 27: 183-212.
O’Donnell, T. 2015. Productivity and Reuse in Language: a Theory of Linguistic Computation and Storage. MIT Press.
Payne, A. 1980. Factors controlling the acquisition of the Philadelphia dialect by out-of-state children. In W. Labov (ed.), Locating Language in Time and Space, pages 143-178. Academic Press.
Xu, F. and Pinker, S. 1995. Weird past tense forms. Journal of Child Language 22(3): 531-556.

On the past tense debate; Part 2: dual-route models are (still) incomplete

Dual-route models remain for the most part incompletely specified. Because crucial details are missing from their specification, they have generally not been implemented as computational cognitive models, and there is accordingly far less empirical rigor in dual-route thinking. To put it starkly, dual-route proponents have conducted expensive, elaborate brain imaging studies to validate their model, but have not proposed a model detailed enough to implement on a $400 laptop.

The dual-route description of the English past tense can be stated as follows:

  1. Use associative memory to find a past tense form.
  2. If this lookup fails, or times out, append /-d/ and apply phonology.

Note that this ordering is critical: one cannot simply ask whether a verb is regular, since by hypothesis some or all regular verbs are not stored as such. And, as we know (Berko 1958), novel and nonce verbs are almost exclusively inflected with /-d/, consistent with this ordering.1 This model equates—rightly, I think—the notion of regularity with the elsewhere condition. The problem is the fuzziness in how one might reach step (2). We do not have any notion of what it might mean for associative memory lookup to fail. Neural nets, for instance, certainly do not fail to produce an output, though they will happily produce junk in certain cases. Nor do we have much of a notion of what it would mean for lookup to time out.
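
To see where the underspecification bites, consider a deliberately naive sketch of the procedure. The confidence scores and THRESHOLD below are my own hypothetical stand-ins for the unspecified failure condition, not anything dual-route proponents have actually committed to.

    # A deliberately naive sketch of the dual-route procedure; the lexicon,
    # confidence scores, and threshold are all hypothetical placeholders.
    IRREGULARS = {"think": ("thought", 0.9), "sleep": ("slept", 0.8)}
    THRESHOLD = 0.5  # What should this be? The proposal does not say.

    def apply_phonology(form: str) -> str:
        """Stub for the regular [-t] ~ [-d] ~ [-@d] allomorphy."""
        return form

    def past_tense(verb: str) -> str:
        # Route 1: associative memory lookup.
        form, confidence = IRREGULARS.get(verb, (None, 0.0))
        if form is not None and confidence >= THRESHOLD:
            return form
        # Route 2: the elsewhere case; append /-d/ and apply phonology.
        return apply_phonology(verb + "d")

    print(past_tense("think"), past_tense("wug"))  # thought wugd

The sketch makes the problem plain: a neural associative memory never declines to answer or returns a confidence of zero, so the condition that sends us to route (2) has no obvious interpretation.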

I am aware of two serious attempts to spell out this crucial detail. The first is Baayen et al.’s (1997) visual word recognition study of Dutch plurals. They imagine that (1) and (2) are competing activation “routes” and that recognition occurs when either route reaches an activation threshold, as if both routes run in parallel. To actually fit their data, however, their model immediately spawns epicycles in the form of poorly justified hyperparameters (see their fn. 2), and as far as I know, no one has ever bothered to reuse or reimplement their model.2 The second is O’Donnell’s (2015) book, which proposes a cost-benefit analysis for storage vs. computation. However, this complex and clever model is not described in enough detail for a clean-room implementation, and no software has been provided. What dual-route proponents owe us, in my opinion, is a toolkit. Without serious investment in formal computational description and reusable, reimplementable, empirically validated models, it is hard to take the dual-route proposal seriously.
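
For concreteness, the “race” between routes can be sketched as two parallel accumulators, as below. This is loosely in the spirit of Baayen et al. (1997), not a reimplementation; the rates and the threshold are exactly the sort of free parameters that must be fit to data, which is the epicycle problem just mentioned.

    # A toy parallel "race" between storage and rule routes. The rates and
    # threshold are hypothetical free parameters, not fitted values.
    def race(storage_rate: float, rule_rate: float, threshold: float = 1.0):
        """Each route accumulates activation linearly; recognition occurs
        when the faster route crosses the threshold."""
        storage_t = threshold / storage_rate if storage_rate > 0 else float("inf")
        rule_t = threshold / rule_rate if rule_rate > 0 else float("inf")
        return ("storage", storage_t) if storage_t <= rule_t else ("rule", rule_t)

    # High-frequency forms win via storage; low-frequency ones via rule.
    print(race(storage_rate=2.0, rule_rate=0.8))  # ('storage', 0.5)
    print(race(storage_rate=0.1, rule_rate=0.8))  # ('rule', 1.25)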

Endnotes

  1. There’s a lot of work that obfuscates this point. An impression one might get from Albright & Hayes (2003) is that adult nonce word studies produce quite a bit of irregularity, but this is true only in their rating task and hardly at all in their “volunteering” (production) task, and a hybrid task finds much higher ratings for nonce irregulars. Schütze (2005) argues—convincingly, in my opinion—that this is because speakers use a different task model in rating tasks, one that is mostly irrelevant to what Albright & Hayes are studying.
  2. One might be tempted to fault Baayen et al. for using visual stimulus presentation (in a language with one of the more complex and opaque writing systems), or for using recognition as a proxy for production. While these are probably reasonable critiques today, visual word recognition was still the gold standard in 1997.

References

Albright, A. and Hayes, B. 2003. Rules vs. analogy in English past tenses: a computational/experimental study. Cognition 90(2): 119-161.
Baayen, R. H., Dijkstra, T., and Schreuder, R. 1997. Singulars and plurals in Dutch: evidence for a parallel dual-route model. Journal of Memory and Language 37(1): 94-117.
Berko, J. 1958. The child’s learning of English morphology. Word 14: 150-177.
O’Donnell, T. 2015. Productivity and Reuse in Language: a Theory of Linguistic Computation and Storage. MIT Press.
Schütze, C. 2005. Thinking about what we are asking speakers to do. In S. Kepser and M. Reis (eds.), Linguistic Evidence: Empirical, Theoretical, and Computational Perspectives, pages 457-485. Mouton de Gruyter.

On the past tense debate; Part 1: the RAWD approach

I have not had time to blog in a while, and I really don’t have much time now either. But here is a quick note (one of several, I anticipate) about the past tense debate.

It is common to talk as if connectionist approaches and dual-route models are the two opposing approaches to morphological irregularity, when in fact there are three approaches. Linguists since at least Bloch (1947)1 have claimed that regular, irregular, and semiregular patterns are all rule-governed and ontologically alike. Of course, the irregular and semiregular rules may require some degree of lexical conditioning, but phonologists have rightly never seen this as some kind of defect or scandal. Chomsky & Halle (1968), Halle (1977), Rubach (1984), and Halle & Mohanan (1985) all spend quite a bit of space developing these rules, using formalisms that should be accessible to any modern-day student of phonology. These “rules all the way down” (henceforth RAWD) approaches are empirically adequate and have been implemented computationally with great success: some prominent instances include Yip & Sussman 1997, Albright & Hayes 2003,2 and Payne 2022. It is malpractice to ignore these approaches.
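
To give a flavor of what RAWD looks like when implemented, here is a toy sketch of an ordered rule cascade, with lexically conditioned irregular rules feeding a regular default. The rules and lexical classes are my own illustrative inventions, not the analyses of any of the works just cited.

    # A toy "rules all the way down" cascade: lexically conditioned rules
    # ordered before the regular default. Illustrative only.
    ABLAUT = {"sing": "sang", "ring": "rang"}      # lexically listed i -> a ablaut
    SHORTENING = {"sleep": "slep", "keep": "kep"}  # shortening class; cf. Myers 1987

    def past(verb: str) -> str:
        if verb in ABLAUT:              # rule 1: lexically conditioned ablaut
            return ABLAUT[verb]
        if verb in SHORTENING:          # rule 2: shorten the vowel, suffix -t
            return SHORTENING[verb] + "t"
        return verb + "ed"              # elsewhere rule: the regular default

    print(past("sing"), past("sleep"), past("blick"))  # sang slept blicked

Note that on this view the regular rule is simply the last, least conditioned rule in the cascade; there is no ontological divide between it and the irregular rules above it.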

One might think that RAWD has more in common with dual-route approaches than with connectionist thinking, but as Mark Liberman noted many years ago, that is not obviously the case. Mark Seidenberg, for instance, one of the most prominent Old Connectionists, has argued that there is a tendency for regulars and irregulars to share certain structural similarities. To take one example, semi-regular slept does not look so different from regular stepped, and the many zero past tense forms (e.g., hit, bid) end in the same phones—[t, d]—used to mark the regular past. While I am not sure this is a meaningful generalization, it clearly is something that both connectionist and RAWD models can encode.3 This is in contradistinction to dual-route models, which have no choice but to treat these observations as coincidences. Thus, as Mark notes, connectionists and RAWD proponents find themselves allied against dual-route models.

(Mark’s post, which I recommend, continues to draw a parallel between dual-routism and bi-uniqueness which will amuse anyone interested in the history of phonology.)

Endnotes

  1. This is not exactly obscure work: Bloch taught at two Ivies and was later the president of the LSA. 
  2. To be fair, Albright & Hayes’s model does a rather poor job recapitulating its training data, though, as they argue, it generalizes to nonce words in a way consistent with human behavior.
  3. For instance, one might propose that slept is exceptionally subject to a vowel shortening rule of the sort proposed by Myers (1987) but otherwise regular.

References

Albright, A. and Hayes, B. 2003. Rules vs. analogy in English past tenses: a computational/experimental study. Cognition 90(2): 119-161.
Bloch, B. 1947. English verb inflection. Language 23(4): 399-418.
Chomsky, N., and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Halle, M. 1977. Tenseness, vowel shift and the phonology of back vowels in Modern English. Linguistic Inquiry 8(4): 611-625.
Halle, M., and Mohanan, K. P. 1985. Segmental phonology of Modern English. Linguistic Inquiry 16(1): 57-116.
Myers, S. 1987. Vowel shortening in English. Natural Language & Linguistic Theory 5(4): 485-518.
Payne, S. R. 2022. When collisions are a good thing: the acquisition of morphological marking. Bachelor’s thesis, University of Pennsylvania. 
Pinker, S. 1999. Words and Rules: the Ingredients of Language. Basic Books.
Rubach, J. 1984. Segmental rules of English and cyclic phonology. Language 60(1): 21-54.
Yip, K., and Sussman, G. J. 1997. Sparse representations for fast, one-shot learning. In Proceedings of the 14th National Conference on Artificial Intelligence and 9th Conference on Innovative Applications of Artificial Intelligence, pages 521-527.