Optionality as acquirendum

A lot of work deals with the question of acquiring “optional” or “variable” grammatical rules, and my impression is that different communities are mostly talking at cross-purposes. I discern at least three ways linguists conceive of optionality as something which the child must acquire.

  1. Some linguists assume—I think without much evidence—that optionality is mere “free variation”, so that the learner simply needs to infer which rules bear a binary [optional] feature. This is an old idea, going back to at least Dell (1981); Rasin et al. (2021:35) explicitly state the problem in this form.
  2. Variationist sociolinguists focus on the differential rates at which grammatical rules apply. They generally take the acquirenda to be, in essence, conditional probability distributions giving the probability of rule application in a given grammatical context (see the sketch below). Bill Labov is a clear avatar of this strain of thinking (e.g., Labov 1989). David Adger and colleagues have attempted to situate this within modern syntactic frameworks (e.g., Adger 2006).
  3. Some linguists believe that optionality is not statable within a single grammar and must instead reflect competition between multiple grammars. The major proponent of this approach is Anthony Kroch (e.g., Kroch 1989). While this conception might license some degree of “nihilism” about optionality, it has also led to interesting work hypothesizing substantive, grammar-internal constraints on variation, as in the work of Laurel MacKenzie and colleagues (e.g., MacKenzie 2019). This work is also very good at ridding (2) of some of its unfortunate “externalist” thinking.

I have to reject (1) as overly simplistic. I find (2) and (3) both compelling in some ways, but a lot of work remains to synthesize or adjudicate between them.
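To make conception (2) concrete, here is a minimal sketch in Python of what the variationist acquirendum might look like. Everything here is invented for illustration: the context labels, the observations, and the estimator, which is just relative frequency.

    from collections import Counter

    # Hypothetical observations of a variable rule: pairs of a grammatical
    # context and whether the rule applied on that occasion. The context
    # labels ("_#C" = pre-consonantal, "_#V" = pre-vocalic) are invented.
    observations = [
        ("_#C", True), ("_#C", True), ("_#C", False),
        ("_#V", True), ("_#V", False), ("_#V", False),
    ]

    applications = Counter()
    totals = Counter()
    for context, applied in observations:
        totals[context] += 1
        if applied:
            applications[context] += 1

    # Maximum-likelihood estimate of P(rule applies | context).
    for context in sorted(totals):
        print(f"P(apply | {context}) = {applications[context] / totals[context]:.2f}")

On this conception, the learner's task is not to flip an [optional] switch but to estimate a distribution like this one for each variable rule.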

References

Adger, D. 2006. Combinatorial variability. Journal of Linguistics 42(3): 503-530.
Dell, F. 1981. On the learnability of optional phonological rules. Linguistic Inquiry 12(1): 31-37.
Kroch, A. 1989. Reflexes of grammar in patterns of language change. Language Variation & Change 1(3): 199-244.
Labov, W. 1989. The child as linguistic historian. Language Variation & Change 1(1): 85-97.
MacKenzie, L. 2019. Perturbing the community grammar: Individual differences and community-level constraints on sociolinguistic variation. Glossa 4(1): 28.
Rasin, E., Berger, I., Lan, R., Shefi, I., and Katzir, R. 2021. Approaching explanatory adequacy in phonology using Minimum Description Length. Journal of Language Modelling 9(1): 17-66.

Chomsky & Katz (1974) on language diversity

Chomsky and others, Stich asserts, do not really study a broad range of languages in attempting to construct theories about universal grammatical structure and language acquisition, but merely speculate on the basis of “a single language, or at best a few closely related languages” (814). Stich’s assertion is both false and irrelevant. Transformational grammarians have investigated languages drawn from a wide range of unrelated language families. But this is beside the point, since even if Stich were right in saying that all but a few closely related languages have been neglected by transformational grammarians, this would imply only that they ought to get busy studying less closely related languages, not that there is some problem in relating grammar construction to the study of linguistic universals. (Chomsky & Katz 1974:361)

References

Chomsky, N. and Katz, J. 1974. What the linguist is talking about. Journal of Philosophy 71(12): 347-367.

Entrenched facts

Berko’s (1958) wug-test is a standard part of the phonologist’s toolkit. If you’re not sure whether a pattern is productive, why not ask whether speakers extend it to nonce words? It makes sense; it has good face validity. However, I increasingly see linguists who think that the results of wug-tests actually trump contradictory evidence from traditional phonological analysis applied to real words. I respectfully disagree.

Consider for example a proposal by Sanders (2003, 2006). He demonstrates that an alternation in Polish (somewhat imprecisely called o-raising) is not applied to nonce words. From this he concludes that o-raising is handled via stem suppletion. He asks, and answers, the very question you may have on your mind. (Note that his H here is the OT constraint hierarchy; you may want to read it as grammar.)

Is phonology obsolete?! No! We still need a phonological H to explain how nonce forms conform to phonotactics. We still need a phonological H to explain sound change. And we may still need H to do more with morphology than simply allow extant (memorized) morphemes to trump nonce forms. (Sanders 2006:10)1

I read a sort of nihilism into this quotation. However, I submit that the fact that some 50 million people speak Polish—and “raise” and “lower” their ó’s with a high degree of consistency across contexts, lexemes, and so on—is a more entrenched fact than the results of a small nonce-word elicitation task. I am not saying that Sanders’ results are wrong, or even misleading, just that his theory has elevated the importance of these results to the point where it has almost nothing to say about the very interesting fact that the genitive singular of lód [lut] ‘ice’ is lodu [lɔdu] and not *[ludu], and that tens of millions of people agree.

Endnotes

  1. Sanders’ 2006 manuscript is a handout, but apparently it’s a summary of his 2003 dissertation (Sanders 2003), stripped of some phonetic-interface details not germane to the question at hand. I just mention this so that it doesn’t look like I’m picking on a rando. Those familiar with my work will probably guess that I disagree with just about everything in this quotation, but kudos to Sanders for saying something interesting enough to disagree with.

References

Berko, J. 1958. The child’s learning of English morphology. Word 14: 150-177.
Sanders, N. 2003. Opacity and sound change in the Polish lexicon. Doctoral dissertation, University of California, Santa Cruz.
Sanders, N. 2006. Strong lexicon optimization. Ms., Williams College and University of Massachusetts, Amherst.

The different functions of probability in probabilistic grammar

I have long been critical of naïve interpretations of probabilistic grammar. To me, it seems like the major motivation for this approach derives from a naïve linking hypothesis mapping acceptability judgments onto grammaticality, as seen in Likert scale-style acceptability tasks. (See chapter 2 of my dissertation for a concrete argument against this.) In this approach, the probabilities are measures of wellformedness.

It occurs to me that there are a number of ontologically distinct interpretations of grammatical probabilities of the sort produced by “maxent”, i.e., logistic regression models.

For instance, at M100 this weekend, I heard Bruce Hayes talk about another use of maximum entropy models: scansion. In poetic meters, there is variation in, say, whether the caesura is masculine (after a stressed syllable) or feminine (after an unstressed syllable), and the probabilities reflect that.1 However, I don’t think it makes sense to equate this with grammaticality, since we are talking about variation in highly self-conscious linguistic artifacts here and there is no reason to think one style of caesura is more grammatical than the other.2

And of course there is a third interpretation, in which the probabilities are production probabilities, representing actual variation in production, within a speaker or across multiple speakers.

It is not obvious to me that these facts all ought to be modeled the same way, yet the maxent community seems comfortable assuming a single cognitive model to cover all three scenarios. To state the obvious, it makes no sense for a cognitive model to account for interspeaker variation, because there is no such thing as “interspeaker cognition”; there are just individual mental grammars.
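For concreteness, here is a minimal sketch of how a maxent grammar assigns probabilities to competing output candidates; the constraint names, violation counts, and weights are all invented. Note that the arithmetic is identical under all three interpretations: nothing in the model itself tells you whether the resulting number is a measure of wellformedness, a stylistic preference, or a production probability.

    import math

    # A toy maxent grammar: two output candidates, each with violation
    # counts for two invented constraints, and hypothetical weights.
    weights = {"*COMPLEX": 2.0, "MAX": 1.0}
    candidates = {
        "cand_a": {"*COMPLEX": 1, "MAX": 0},
        "cand_b": {"*COMPLEX": 0, "MAX": 1},
    }

    # Harmony is the negative weighted sum of violations; candidate
    # probabilities come from exponentiating and normalizing (a softmax).
    harmonies = {
        cand: -sum(weights[c] * v for c, v in violations.items())
        for cand, violations in candidates.items()
    }
    z = sum(math.exp(h) for h in harmonies.values())
    probabilities = {cand: math.exp(h) / z for cand, h in harmonies.items()}
    print(probabilities)  # {'cand_a': 0.27..., 'cand_b': 0.73...}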

Endnotes

  1. This is a fabricated example because Hayes and colleagues mostly study English meter—something I know nothing about—whereas I’m interested in Latin poetry. I imagine English poetry has caesurae too but I’ve given it no thought yet.
  2. I am not trying to say that we can’t study grammar with poetry. Separately, I note (as I think Paul Kiparsky did at the talk) that this model also assumes that the input text the poet is trying to fit to the meter has no role to play in constraining what happens.

A note on pure allophony

I have previously discussed the notion of pure allophony, contrasting it with the facts of alternations. What follows is a lightly edited section from my recent NAPhC 12 talk, which in part hinges on this notion.

While Halle (1959) famously dispenses with the structuralist distinction between phonemics and morphophonemics, some later generativists reject pure allophony outright. Let the phonemic inventory of some grammar G be P and the set of surface phones generated by G from P be S. If some phoneme p ∈ P always corresponds—in some sense to be made precise—to some phone s ∈ S, and if s ∉ P, then s is a pure allophone of p. For example, if /s/ is a phoneme and [ʃ] is not, but all [ʃ]s correspond to /s/s, then [ʃ] is a pure allophone of /s/. According to some descriptions, this is the case for Korean, as [ʃ] is a (pure) allophone of /s/ when followed by [i].
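The definition can be made concrete with a small sketch; the toy inventory and the correspondence table below are schematic stand-ins for the Korean facts.

    # Schematic phonemic inventory and a table mapping each surface phone
    # to the set of phonemes it can correspond to; the data are invented.
    phonemes = {"s", "i", "a"}
    corresponds_to = {
        "ʃ": {"s"},  # every [ʃ] corresponds to /s/
        "s": {"s"},
        "i": {"i"},
        "a": {"a"},
    }

    def pure_allophones(phonemes, corresponds_to):
        """Yields (phone, phoneme) pairs where the phone is a pure allophone."""
        for phone, sources in corresponds_to.items():
            # A pure allophone lies outside the phonemic inventory and
            # always corresponds to a single phoneme.
            if phone not in phonemes and len(sources) == 1:
                yield phone, next(iter(sources))

    print(list(pure_allophones(phonemes, corresponds_to)))  # [('ʃ', 's')]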

One might argue that alternations are more entrenched facts than pure allophony, simply because it is always possible to construct a grammar free of pure allophony. For instance, if one wants to do away with pure allophony, one can derive the Korean word [ʃi] ‘poem’ from /ʃi/ rather than from /si/. One early attempt to rule out pure allophony—and thus to motivate the choice of /ʃi/ over /si/ for this problem—is the alternation condition (Kiparsky 1968). As Kenstowicz & Kisseberth (1979:215) state it, this condition holds that “the UR of a morpheme may not contain a phoneme /x/ that is always realized phonetically as identical to the realization of some other phoneme /y/.” [Note here that /x, y/ are to be interpreted as variables rather than as the voiceless velar fricative or the front high round vowel.–KBG] Another recent version of this idea—often attributed to Dell (1973) or Stampe (1973)—is the notion of lexicon optimization (Prince & Smolensky 1993:192).

A correspondent to this list wonders why, in a grammar G such that G(a) = G(b) for potential input elements /a, b/, a nonalternating observed element [a] is not (sometimes, always, freely) lexically /b/. The correct answer is surely “why bother?”—i.e. to set up /b/ for [a] when /a/ will do […] The basic idea reappears as “lexicon optimization” in recent discussions. (Alan Prince, electronic discussion; cited in Hale & Reiss 2008:246)

Should grammars with pure allophony be permitted? The question is not, as is sometimes supposed, a purely philosophical one (see Hale & Reiss 2008:16-22): both linguists and infants acquiring language require a satisfactory answer. In my opinion, the burden of proof lies with those who would deny pure allophony. They must explain how the language acquisition device (LAD) either directly induces grammars that satisfy the alternation condition, or optimizes all pure allophony out of them after the fact. “Why bother” could go either way: why posit either complication to the LAD when pure allophony will do? The linguist faces a similar problem to the infant. To wit, I began this project assuming Latin glide formation was purely allophonic, and only later uncovered—subtle and rare—evidence for vowel-glide alternations. Thus in this study, I make no apology for—and draw no further attention to—the fact that some data are purely allophonic. This important question will have to be settled by other means.

References

Dell, F. 1973. Les règles et les sons. Hermann.
Hale, M. and Reiss, C. 2008. The Phonological Enterprise. Oxford University Press.
Halle, M. 1959. The Sound Pattern of Russian. Mouton.
Kenstowicz, M. and Kisseberth, C. 1979. Generative Phonology: Description and Theory. Academic Press.
Kiparsky, P. 1968. How Abstract is Phonology? Indiana University Linguistics Club.
Prince, A. and Smolensky, P. 1993. Optimality Theory: Constraint interaction in generative grammar. Technical Report TR-2, Rutgers University Center For Cognitive Science and Technical Report CU-CS-533-91, University of Colorado, Boulder Department of Computer Science.
Stampe, D. 1973. A Dissertation on Natural Phonology. Garland.

Linguistics and prosociality

It is commonly said that linguistics as a discipline has enormous prosocial potential. What I actually suspect is that this potential is smaller than some linguists imagine. Linguistics is of course essential to the deep question of “what is human nature”, but we are up against our own epistemic bounds in answering that question, and the social impact of answering it is not at all clear to me. Linguistics is also essential to the design of speech and language processing technologies (despite what you may have heard: don’t believe the hype), and while I find these technologies exciting, it remains to be seen whether they will be as societally transformative as investors think. And language documentation is transformative to some of society’s most marginalized. But I am generally skeptical of linguistics’ and linguists’ ability to combat societal biases more generally. While I don’t think any member of society should be considered well-educated until they’ve thought about the logical problems of language acquisition, considered the idea of language as something that exists in the mind rather than just in the ether, or confronted standard language ideologies, I have to question whether the broader discipline has been very effective at getting these messages out.

A prediction

You didn’t build that. – Barack Obama, July 13, 2012

Connectionism originates in psychology, but the “old connectionists” are mostly gone, having largely failed to pass on their ideology to their trainees, and there really aren’t many “young connectionists” to speak of. But I predict that in the next few years we’ll see a bunch of psychologists of language—the ones who define themselves by their opposition to internalism, innateness, and generativism—become some of the biggest cheerleaders for large language models (LLMs). In fact, psychologists have not made substantial contributions to neural network modeling in many years. Virtually all the work on improving neural networks over the last few decades has been done by computer scientists who cared not a whit whether they had anything to do with human brains or cognitive plausibility.1 (Sometimes they’ll put things like “…inspired by the human brain…” in the press releases, but we all know that’s just fluff.) At this point, psychology as a discipline has no more claim to neural networks than the Irish do to Gaul, and in the rather unlikely case that LLMs do end up furnishing deep truths about cognition, psychology as a discipline will have failed us by not following up on a promising lead. I think it will be particularly revealing if psychologists who previously worshipped at the Church of Bayes suddenly lose all interest in mathematical rigor and find themselves praying to the great Black Box. I want to say it now: if this happens—and I am starting to see signs that it will—those people will be cynics, haters, and trolls, and you shouldn’t pay them any mind.

Endnotes

  1. I am also critical of machine learning pedagogy, and it is therefore interesting to see that those same computer scientists pushing things forward don’t seem to care much for machine learning as an academic discipline either.

More than one rule

[Leaving this as a note to myself to circle back.]

I’m just going to say it: some “rules” are probably two or three rules, because the idea that rules are defined by natural classes (and thus free of disjunctions) is more entrenched than our intuitions about whether a process in some language is really one rule or not, and we should be Galilean about this. Here are some phonological “rules” that are probably two or three different rules.

  • Indo-Iranian, Balto-Slavic, and Albanian “ruki” (environment: preceding {w, j, k, r}): it is not clear to me whether any of these languages actually need this as a synchronic rule at all.
  • Breton voiced stop lenition (change: /b/ to [v], /d/ to [z], /g/ to [x]): the devoicing of /g/ must be a separate rule. Hat tip: Richard Sproat. I believe there’s a parallel set of processes in German.
  • Lamba palatalization (change: /k/ to [tʃ], /s/ to [ʃ]): two rules, possibly with a Duke-of-York thing. Hat tip: Charles Reiss.
  • Mid-Atlantic (e.g., Philadelphia) English ae-tensing (environment: following tautosyllabic, same-stem {m, n, f, θ, s, ʃ}): let’s assume this is allophony; then the anterior nasal and voiceless fricative cases should be separate rules. It is possible that the incipient restructuring of this as having a simple [+nasal] context provides evidence for the multi-rule analysis.
  • Latin glide formation (environment: complex). Front and back glides are formed from high short monophthongs in different but partially overlapping contexts.

Feature maximization and phonotactics

[This is a quick writing exercise for in-progress work with Charles Reiss. Sorry if it doesn’t make sense out of context.]

An anonymous reviewer asks:

I wonder how the author(s) would reconcile this learning model with the evidence that both children and adults seem to aggressively generalize phonotactic restrictions from limited data (e.g. just [p]) to larger, unobserved natural classes (e.g. [p f b v]). See e.g. the discussion in Linzen & Gallagher (2017). If those results are credible, they seem much more consistent with learning minimal feature specifications for natural classes than learning maximal ones.

First, note that Linzen & Gallagher’s study is a study of phonotactic learning, whereas our proposal concerns induction of phonological rules. We have been, independently but complementarily, quite critical of the naïve assumptions inherent in prior work on this topic (e.g., Gorman 2013, ch. 2; Reiss 2017, §6); we have both argued that knowledge of phonotactic generalizations may require much less grammatical knowledge than is generally believed.

Second, we note that Linzen & Gallagher’s subjects are (presumably; they were recruited on Mechanical Turk and paid $0.65 USD for their efforts) adults briefly exposed to an artificial language. While we recognize that adult “artificial language learning” studies are common practice in psycholinguistics, it is not clear what such studies contribute to our understanding of phonotactic acquisition (whatever the phonotactic acquirenda turn out to be) by children robustly exposed to realistic languages in situ.

Third, the reviewer is incorrect; the result reported by Linzen & Gallagher (henceforth L&G) is not consistent with minimal generalization. Let us grant—for the sake of argument—that our proposal about rule induction in children is relevant to their work on rapid phonotactic learning in adults. One hypothesis they entertain is that their participants will construct “minimal classes”:

For example, when acquiring the phonotactics of English, learners may first learn that both [b] and [g] are valid onsets for English syllables before they can generalize to other voiced stops (e.g., [d]). This generalization will be restricted to the minimal class that contained the attested onsets (i.e., voiced stops), at least until a voiceless stop onset is encountered.

If by a “minimal class” L&G are referring to a natural class which is consistent with the data and has an extension with the fewest members, then presumably they would endorse our proposal of feature maximization, since the class that satisfies this definition is the most fully specified empirically adequate class. However, it is an open question whether or not such a class would actually contain [d]. For instance, if one assumes that major place features are bivalent, then the intersection of the features associated with [b, g] will contain the specification [−coronal], which rules out [d].

Interestingly, the matter is similarly unclear if we interpret “minimal class” intensionally, in terms of the number of features rather than the number of phonemes the class picks out. The (featurewise-)minimal specification for a single phone (as in the reviewer’s example) is the empty set, which would (it is generally assumed) pick out any segment. We would then expect any generalization which held of [p], as in the reviewer’s example, to generalize not just to other labial obstruents (as the reviewer suggests) but to any segment at all. Minimal feature specification cannot yield a generalization from [p] to any proper subset of segments, contra the anonymous reviewer and L&G. An adequate minimal specification which picks out [p] will pick out just [p]. L&G suggest that maximum entropy models of phonotactic knowledge may have this property, but do not provide a demonstration of this for any particular implementation of these models.
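To make the intersection argument above concrete, here is a small sketch using a toy bivalent feature system; the feature values are invented but suffice to reproduce the [b, g] case.

    # Toy bivalent feature specifications; the values are hypothetical.
    features = {
        "b": {"voice": "+", "continuant": "-", "labial": "+", "coronal": "-", "dorsal": "-"},
        "d": {"voice": "+", "continuant": "-", "labial": "-", "coronal": "+", "dorsal": "-"},
        "g": {"voice": "+", "continuant": "-", "labial": "-", "coronal": "-", "dorsal": "+"},
    }

    def shared_specification(segments):
        """Intersects specifications, keeping the feature/value pairs
        shared by every segment in the set."""
        specs = [features[s] for s in segments]
        return {f: v for f, v in specs[0].items()
                if all(spec.get(f) == v for spec in specs[1:])}

    # The intersection of [b, g] includes [-coronal], which excludes [d].
    spec = shared_specification(["b", "g"])
    matching = [s for s, fs in features.items()
                if all(fs.get(f) == v for f, v in spec.items())]
    print(spec)      # {'voice': '+', 'continuant': '-', 'coronal': '-'}
    print(matching)  # ['b', 'g']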

We thank the anonymous reviewer for drawing our attention to this study and the opportunity their comment has given us to clarify the scope of our proposal and to draw attention to a defect in L&G’s argumentation.

References

Gorman, K. 2013. Generative phonotactics. Doctoral dissertation, University of Pennsylvania.
Linzen, T., and Gallagher, G. 2017. Rapid generalization in phonotactic learning. Laboratory Phonology: Journal of the Association for Laboratory Phonology 8(1): 1-32.
Reiss, C. 2017. Substance free phonology. In S.J. Hannahs and A. Bosch (ed.), The Routledge Handbook of Phonological Theory, pages 425-452. Routledge.

Phonological nihilism

One might argue that phonology is in something of a crisis period. Phonology seems to be going through the early stages of grief for what I see as the failure of teleological, substance-rich, constraint-based, parallel-evaluation approaches to make headway, but the next paradigm is yet to become clear to us. I personally think that logical, substance-free, serialist approaches ought to represent our next i-phonology paradigm, with “evolutionary”-historical thinking providing the e-language context, but I may be wrong and an altogether different paradigm may be waiting in the wings. The thing that troubles me is that phonologists from these still-dominant constraint-based traditions seem to have less and less faith in the tenets of their theories, and in the worst case this expresses itself as a sort of nihilism. I discern two forms of this nihilism. The first is the phonologist who thinks we’re doing “word sudoku”, playing games of minimal description that produce generalizations without a shred of cognitive support. The second is the phonologist who thinks that everything is memorized, so that the actual domain of phonological generalization is just Psych 101 subject-pool nonce-word experiments. My pitch to both types of nihilists is the same: if you truly believe this, you ought to spend more time at the beach and less in the classroom, and save some space in the discourse for those of us who believe in something.