The role of phonotactics in language change

How does phonotactic knowledge influence the path taken by language change? As is often the case, the null hypothesis seems to be simply that it doesn’t. Perhaps speakers have projected a phonotactic constraint C into the grammar of Old English, but that doesn’t necessarily mean that Middle English will conform to C, or even that Middle English won’t freely borrow words that flagrantly violate C.

One case comes from the history of English. As is well known, modern English /ʃ/ descends from Old English sk; modern instances of word-initial sk are mostly borrowed from Dutch (e.g., skipper) or Norse (e.g., ski); sky was borrowed from an Old Norse word meaning ‘cloud’ (which tells you a lot about the weather in the Danelaw). Furthermore, Old English forbade super-heavy rimes consisting of a long vowel followed by a consonant cluster. Because the one major source of /ʃ/ was sk, and because a word-final long vowel followed by sk was therefore unheard of, V̄ʃ# was rare in Middle English, and word-final sequences of tense vowels followed by [ʃ] are still rare in Modern English (Iverson & Salmons 2005). Of course there are exceptions, but according to Iverson & Salmons, they tend to:

  • be markedly foreign (e.g., cartouche),
  • be proper names (e.g., LaRouche),
  • or convey an “affective, onomatopoeic quality” (e.g., sheesh, woosh).

It is reasonably clear, however, that all of these were added during the Middle or Modern English period. Clearly, this constraint, which is still statistically robust (Gorman 2014a:85), did not prevent speakers from borrowing and coining exceptions to it. At the same time, it is hard to rule out any historical effect of the constraint: perhaps there would be more Modern English V̄ʃ# words otherwise.

Another case of interest comes from Latin. As is well known, Old Latin went through a near-exceptionless “Neogrammarian” sound change, a “primary split” or “conditioned merger” of intervocalic s with r. (The terminus ante quem, i.e., the latest possible date, for the actuation of this change is the 4th c. BCE.) This change had the effect of temporarily eliminating all traces of intervocalic s in late Old Latin (Gorman 2014b). From this fact, one might posit that speakers of this era of Latin projected a *VsV constraint, and that this constraint would prevent subsequent sound changes from reintroducing intervocalic s. But this is clearly not the case: in the 1st c. BCE, degemination of ss after diphthongs and long monophthongs reintroduced intervocalic s (e.g., caussa > classical causa ‘cause’). It is also clear that loanwords with intervocalic s were freely borrowed, and with the exception of the very early Greek borrowing tūs-tūris ‘incense’, none of them were adapted in any way to conform to a putative *VsV constraint:

(1) Greek loanwords: ambrosia ‘id.’, *asōtus ‘libertine’ (acc.sg. asōtum), basis ‘pedestal’, basilica ‘public hall’, casia ‘cinnamon’ (cf. cassia), cerasus ‘cherry’, gausapa ‘woolen cloth’, lasanum ‘cooking utensil’, nausea ‘id.’, pausa ‘pause’, philosophus ‘philosopher’, poēsis ‘poetry’, sarīsa ‘lance’, seselis ‘seseli’
(2) Celtic loanwords: gaesī ‘javelins’, omāsum ‘tripe’
(3) Germanic loanwords: glaesum ‘amber’, bisōntes ‘wild oxen’

References

Gorman, K. 2014a. A program for phonotactic theory. In Proceedings of the 47th Annual Meeting of the Chicago Linguistic Society, pages 79-93.
Gorman, K. 2014b. Exceptions to rhotacism. In Proceedings of the 48th Annual Meeting of the Chicago Linguistic Society, pages 279-293.
Iverson, G. K. and Salmons, J. C. 2005. Filling the gap: English tense vowel plus final /š/. Journal of English Linguistics 33: 1-15.

Allophones and pure allophones

I assume you know what an allophone is. But what this blog post supposes […beat…] is that you could be more careful about how you talk about them.

Let us suppose the following:

  • the phonemic inventory of some grammar G contains t and d,
  • it does not contain s or z,
  • yet instances of s or z are found on the surface.

Thus we might say that /t, d/ are phonemes and [s, z] are allophones (perhaps of /t, d/: maybe in G, derived coronal stop clusters undergo assibilation).

Let us suppose that you’re writing the introduction to a phonological analysis of G, and in Table 1—it’s usually Table 1—you list the phonemes you posit, sorted by place and manner. Perhaps you will place s and z in italics or brackets, and the caption will indicate that this marking refers to segments which are allophones.

I find this imprecise. It suggests that all instances of surface t or d are phonemic (or perhaps more precisely, and more vacuously, are faithful allophones),1 which need not be the case. Perhaps G has a rule of perseveratory obstruent cluster voice assimilation, so that one can derive surface [pt] from /…p-d…/, surface [gd] from /…g-t…/, and so on. The confusion here seems to be that we are implicitly treating the sets of allophones and phonemes as disjoint, when the former is in fact a superset of the latter. What we actually mean when we say that [s, z] are allophones is rather that they are pure allophones: allophones which are not also phonemes.
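To make this concrete, here is a toy sketch in Python; the rules and underlying forms are invented for illustration and are not meant as an analysis of any actual language (or of G beyond what is stated above).

```python
import re

# Toy sketch: ordered rewrite rules map underlying forms to surface forms.
# Both the segments and the rules are hypothetical; "-" marks a morpheme boundary.
RULES = [
    (r"([ptk])-d", r"\1t"),  # perseveratory voice assimilation: /…p-d…/ -> […pt…]
    (r"t-t", "ts"),          # assibilation of a derived coronal stop cluster
]

def derive(underlying: str) -> str:
    """Apply each rule once, in order, then erase the morpheme boundary."""
    surface = underlying
    for pattern, replacement in RULES:
        surface = re.sub(pattern, replacement, surface)
    return surface.replace("-", "")

print(derive("ta-d"))   # 'tad':  this [t] is the faithful realization of /t/
print(derive("lap-d"))  # 'lapt': this [t] is an allophone of /d/
print(derive("lat-t"))  # 'lats': this [s] is a pure allophone (no /s/ phoneme)
```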

Another possible way to clarify the hypothetical Table 1 is to simply state what phonemes s and z are allophones of, exactly. For instance, if they are purely derived by assibilation, we might write that “the stridents s, z are (pure) allophones of the associated coronal stops /t, d/, respectively”. However, since this might be beside the point, and because there’s no principled upper bound on how many phonemic sources a given (pure or otherwise) allophone might have, I think it should suffice to say that s and z are pure allophones and leave it at that.2

This imprecision, I suspect, is a hangover from structuralist phonemics, which viewed allophony as separate from (and arguably more privileged or entrenched than) alternations (then called morphophonemics). Of course, this assumption does not appear to have any compelling justification, and as Halle (1959) shows, it leads to substantial duplication (in the sense of Kisseberth 1970) between rules of allophony and rules of neutralization.3 Most linguists since Halle seem to have found the structuralist stipulation, and the duplication it gives rise to, aesthetically displeasing; I concur.

Endnotes

  1. I leave open the question of whether surface representations ever contain phonemes: perhaps vacuous rules “faithfully” convert them to allophones.
  2. One could (and perhaps should) go further into feature logic, and as such, regard both phonemes and pure allophones as mere bundles of features linked to a single timing slot. However, this makes things harder to talk about.
  3. I do not assume that “neutralization” is a grammatical primitive. It is easily defined (see Bale & Reiss 2018, ch. 20) but I see no reason to suppose that grammars distinguish neutralizing processes from other processes.

References

Bale, A. and Reiss, C. 2018. Phonology: A Formal Introduction. MIT Press.
Halle, M. 1959. The Sound Pattern of Russian. Mouton.
Kisseberth, C. W. 1970. On the functional unity of phonological rules. Linguistic Inquiry 1(3): 291-306.

The alternation phonotactic hypothesis

The hypothesis

In a recent handout, I discuss the following hypothesis, implicit in my dissertation (Gorman 2013):

(1) Alternation Phonotactic Hypothesis: Let A, B, C, and D be (possibly-null) string sets. Then, if a grammar G contains a surface-true rule of alternation A → B / C __ D, nonce words containing the subsequence CAD are ill-formed for speakers of G.

Before I continue, note that this definition is “phenomenological” in the sense that it refers to two notions—alternations and surface-true-ness—which are not generally considered to be encoded directly in the grammar. Regarding the notion of alternations, it is not difficult to formalize whether or not a rule is alternating.

(2) Let a rule be defined by possibly-null string sets A, B, C, and D as in (1). If any elements of B are phonemes, then the rule is a rule of alternation.

(3) [ditto] If no elements of B are phonemes, then the rule is a rule of (pure) allophony.

But from the argument against bi-uniqueness in Sound Pattern of Russian (Halle 1959), it follows that we should reject a grammar-internal distinction between rules of alternation and allophony, and subsequent theory provides no way to encode this distinction in the grammar. Similarly, it is not hard to define what it means for a rule to be surface-true.

(4) [ditto] If no instances of CAD are generated by the grammar G, then the rule is surface-true.

But there does not seem to be much reason for that notion to be encoded in the grammar, and the theory does not provide any way to encode it.1 Note further that I am also deliberately stating in (1) that a constraint against CAD has been “projected” from the alternation, rather than treating such constraints as autonomous entities of the theory, as is done in Optimality Theory (OT) and friends. Finally, I have phrased this in terms of grammaticality (“are ill-formed”) rather than acceptability.
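As a toy illustration of how (1)-(4) fit together, here is a minimal sketch in Python. The Rule class, the word-final devoicing rule, the phoneme inventory, and the handful of surface forms are placeholders of my own, not a fragment of any actual grammar; segments are single characters for simplicity.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    # A -> B / C __ D, each field a string of single-character segments.
    A: str
    B: str
    C: str
    D: str

def is_alternation(rule: Rule, phonemes: set[str]) -> bool:
    """(2)/(3): a rule is an alternation iff some element of B is a phoneme."""
    return any(segment in phonemes for segment in rule.B)

def is_surface_true(rule: Rule, surface_forms: list[str]) -> bool:
    """(4): a rule is surface-true iff the grammar generates no instances of CAD."""
    cad = rule.C + rule.A + rule.D
    return not any(cad in form for form in surface_forms)

def aph_ill_formed(nonce: str, rule: Rule, phonemes: set[str],
                   surface_forms: list[str]) -> bool:
    """(1): a nonce word containing CAD is ill-formed if the rule is a
    surface-true rule of alternation."""
    if is_alternation(rule, phonemes) and is_surface_true(rule, surface_forms):
        return rule.C + rule.A + rule.D in nonce
    return False

# Toy example: word-final devoicing d -> t / __ # (word boundary written "#").
devoicing = Rule(A="d", B="t", C="", D="#")
phonemes = {"p", "t", "k", "b", "d", "g", "a", "i", "u"}
surface_forms = ["tat#", "pik#", "duba#"]  # no surface "d#", so (4) holds

print(aph_ill_formed("blad#", devoicing, phonemes, surface_forms))  # True
print(aph_ill_formed("blat#", devoicing, phonemes, surface_forms))  # False
```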

Why might the Alternation Phonotactic Hypothesis (henceforth, APH) be true? First, I take it as obvious that alternations are more entrenched facts about grammars than pure allophony. For instance, in English, stop aspiration could be governed by a rule of allophony, but it is also plausible that English speakers simply represent aspirated stops as such in their lexical entries, since there are no aspiration alternations. This point was made separately by Dell (1973) and Stampe (1973), and motivates the notion of lexicon optimization in OT. Rules of alternation (or something like them), in contrast, are actually necessary to obtain the proper surface forms. An English speaker who does not have a rule of obstruent voice assimilation will simply not produce the right allomorphs of various affixes, whereas the same speaker need not encode a process of nasalization—which in English is clearly allophonic (see, e.g., Kager 1999: 31f.)—to obtain the correct outputs. Given that alternations are entrenched in the relevant sense, it is not impossible to imagine that speakers might “project” constraints out of alternation generalizations in the manner described above. Such constraints could be used during online processing, assuming a strong isomorphism between the grammatical representations used in production and perception.2 Secondly, since not all alternations are surface-true, it seems reasonable to limit this process of projection to those which are: were one to project non-surface-true constraints in this fashion, the speaker would find themselves in the awkward position of treating actual words as ill-formed.3,4

The APH is usefully contrasted with the following:

(5) Lexicostatistic Phonotactic Hypothesis: Let A, C, and D be (possibly-null) string sets. Then, if CAD is statistically underrepresented (in a sense to be determined) in the lexicon L of a grammar G, nonce words containing the subsequence CAD are ill-formed for speakers of G.

According to the LSPH (as we’ll call it), phonotactic knowledge is projected not from alternations but from statistical analysis of the lexicon. The LSPH is at least implicit in the robust cottage industry which uses statistical and/or computational modeling of the lexicon to infer the existence of phonotactic generalizations; it is notable that virtually none of this work discusses anything like the APH. Finally, one should note that the APH and the LSPH do not exhaust the set of possibilities. For instance, Berent et al. (2007) and Daland et al. (2011) test for effects of the Sonority Sequencing Principle, a putative linguistic universal, on wordlikeness judgments. And some have denied the very existence of phonotactic constraints.
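For concreteness, here is one toy way to operationalize “statistically underrepresented” in (5): compare the observed count of a two-segment sequence in a lexicon with the count expected if segments combined at chance. The miniature lexicon below and the choice of an observed/expected statistic are placeholders of my own, not a claim about how the LSPH literature actually models the lexicon.

```python
from collections import Counter

def observed_expected(lexicon: list[str], seq: str) -> float:
    """O/E ratio for a two-segment sequence, with the expectation computed
    from unigram segment frequencies (a chance baseline)."""
    assert len(seq) == 2
    unigrams = Counter(seg for word in lexicon for seg in word)
    bigrams = Counter(word[i:i + 2] for word in lexicon for i in range(len(word) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    expected = (unigrams[seq[0]] / n_uni) * (unigrams[seq[1]] / n_uni) * n_bi
    return bigrams[seq] / expected if expected else float("inf")

toy_lexicon = ["tata", "kipa", "puka", "tika", "paki"]
print(observed_expected(toy_lexicon, "ta"))  # ~2.96: overrepresented
print(observed_expected(toy_lexicon, "ap"))  # 0.0: unattested, hence underrepresented
# Under the LSPH, nonce words containing a sufficiently underrepresented
# sequence (by some threshold) would be judged ill-formed.
```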

Gorman (2013) reviews some prior results arguing in favor of the APH, which I describe below.

Consider the putative English phonotactic constraint *V̄ʃ#, a constraint against word-final sequences of tense vowels followed by [ʃ] proposed by Iverson & Salmons (2005). Exceptions to this generalization tend to be markedly foreign (e.g., cartouche), to be proper names (e.g., LaRouche), or to convey an “affective, onomatopoeic quality” (e.g., sheesh, woosh). As Gorman (2013:43f.) notes, this constraint is statistically robust, but Hayes & White (2013) report that it has no measurable effect on English speakers’ wordlikeness judgments. In contrast, three English alternation rules (nasal place assimilation, obstruent voice assimilation, and degemination) have a substantial impact on wordlikeness judgments (Gorman 2013, ch. 4).

A second, more elaborate example comes from Turkish. Lees (1966a,b) proposes three phonotactic constraints in this language: backness harmony, roundness harmony, and labial attraction. All three of these constraints have exceptions, but Gorman (2013:57-60) shows that they are statistically robust generalizations. Thus, under the LSPH, speakers ought to be sensitive to all three.

Endnotes

  1. I note that the CONTROL module proposed by Orgun & Sprouse (1999) might be a mechanism by which this information could be encoded.
  2. Some evidence that phonotactic knowledge is deployed in production comes from the study of Finnish and Turkish, both of which have robust vowel harmony. Suomi et al. (1997) and Vroomen et al. (1998) find that disharmony seemingly acts as a cue for word boundaries in Finnish, and Kabak et al. (2010) find something similar for Turkish, but not in French, which lacks harmony.
  3. Durvasula & Kahng (2019) find that speakers do not necessarily judge a nonce word to be ill-formed just because it fails to follow certain subtle allophonic generalizations, which suggests that the distinction between allophony and alternation may be important here.
  4. I note that it has sometimes been proposed that actual words of G may in fact be gradiently marked or otherwise degraded with respect to the grammar G if they violate phonotactic constraints projected from G (e.g., Coetzee 2008). However, the null hypothesis, it seems to me, is that all actual words are also possible words, and so it does not make sense to speak of actual words as marked or ill-formed, gradiently or otherwise.

References

Berent, I., Steriade, D., Lennertz, T., and Vaknin, V. 2007. What we know about what we have never heard: evidence from perceptual illusions. Cognition 104: 591-630.
Coetzee, A. W. 2008. Grammaticality and ungrammaticality in phonology. Language 84(2): 218-257. [I critique this briefly in Gorman 2013, p. 4f.]
Daland, R., Hayes, B., White, J., Garellek, M., Davis, A., and Norrmann, I. 2011. Explaining sonority projection effects. Phonology 28: 197-234.
Dell, F. 1973. Les règles et les sons. Hermann.
Durvasula, K. and Kahng, J. 2019. Phonological acceptability is not isomorphic with phonological grammaticality of stimulus. Talk presented at the Annual Meeting on Phonology.
Gorman, K. 2013. Generative phonotactics. Doctoral dissertation, University of Pennsylvania.
Halle, M. 1959. The Sound Pattern of Russian. Mouton.
Hayes, B. and White, J. 2013. Phonological naturalness and phonotactic learning. Linguistic Inquiry 44: 45-75.
Iverson, G. K. and Salmons, J. C. 2005. Filling the gap: English tense vowel plus final /š/. Journal of English Linguistics 33: 1-15.
Kabak, B., Maniwa, K., and Kazanina, N. 2010. Listeners use vowel harmony and word-final stress to spot nonsense words: a study of Turkish and French. Laboratory Phonology 1: 207-224.
Kager, R. 1999. Optimality Theory. Cambridge University Press.
Lees, R. B. 1966a. On the interpretation of a Turkish vowel alternation. Anthropological Linguistics 8: 32-39.
Lees, R. B. 1966b. Turkish harmony and the description of assimilation. Türk Dili Araştırmaları Yıllığı Belleten 1966: 279-297.
Orgun, C. O. and Sprouse, R. 1999. From MPARSE to CONTROL: deriving ungrammaticality. Phonology 16: 191-224.
Stampe, D. 1973. A Dissertation on Natural Phonology. Garland. [I don’t have this in front of me, but if I remember correctly, Stampe argues that non-surface-true phonological rules are essentially second-class citizens.]
Suomi, K., McQueen, J. M., and Cutler, A. 1997. Vowel harmony and speech segmentation in Finnish. Journal of Memory and Language 36: 422-444.
Vroomen, J., Tuomainen, J., and de Gelder, B. 1998. The roles of word stress and vowel harmony in speech segmentation. Journal of Memory and Language 38: 133-149.

Logistic regression as the bare minimum. Or, Against naïve Bayes

When I teach introductory machine learning, I begin with (categorical) naïve Bayes classifiers. These are arguably the simplest possible supervised machine learning models, and they can be explained quickly to anyone who understands probability and the method of maximum likelihood estimation. I then pivot and introduce logistic regression and its various forms. Ng and Jordan (2002) provide a nice discussion of how the two relate, and I encourage students to read their study.

Logistic regression is a more powerful technique than naïve Bayes. First, it is “easier” in some sense (Breiman 2001) to estimate the conditional distribution, as one does in logistic regression, than to model the joint distribution, as one does in naïve Bayes. Secondly, logistic regression can be learned using standard (online) stochastic gradient descent methods. Finally, it naturally supports the conventional regularization strategies needed to avoid overfitting. For these reasons, in 2022, I consider regularized logistic regression the bare minimum supervised learning method: the least sophisticated method that is possibly good enough. The pedagogical problem I then face is convincing students not to use naïve Bayes, given that it is obsolete (it is virtually always inferior to regularized logistic regression) and that tools like scikit-learn (Pedregosa et al. 2011) make it almost trivial to swap one machine learning method for the other.
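To make the swap concrete, here is a minimal sketch using scikit-learn on a made-up four-document text task; the documents, labels, and hyperparameters are placeholders, and in practice the regularization strength C would be tuned by cross-validation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["great movie", "terrible plot", "loved it", "awful acting"]
train_labels = [1, 0, 1, 0]

# Naïve Bayes baseline (MultinomialNB stands in for the categorical variant
# on this bag-of-words representation).
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(train_docs, train_labels)

# Regularized logistic regression: the same pipeline with one estimator swapped.
# C is the inverse regularization strength.
lr = make_pipeline(CountVectorizer(), LogisticRegression(C=1.0, max_iter=1000))
lr.fit(train_docs, train_labels)

print(nb.predict(["loved the plot"]))
print(lr.predict(["loved the plot"]))
```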

References

Breiman, Leo. 2001. Statistical modeling: the two cultures. Statistical Science 16:199-231.
Ng, Andrew Y., and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Proceedings of NeurIPS, pages 841-848.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, …, and Édouard Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12:2825-2830.

On “alternative” grammar formalisms

A common suggestion to graduate students in linguistics (computational or otherwise) is to study “alternative” grammar formalisms [not my term-KBG]. The implication is that the student is only familiar with formal grammars inspired by the supposedly-hegemonic generativist tradition—though it is not clear whether we’re talking about the GB-lite of the Penn Treebank, the minimalist grammars (MGs) of Ed Stabler, or perhaps something else—and that the set of “alternatives” includes lexical-functional grammars (LFGs), tree-adjoining grammars (TAGs), combinatory categorial grammars (CCGs), head-driven phrase structure grammars (HPSGs), and the various forms of construction grammar. I would never say that students should study less rather than more, but I am not convinced this diversity of formalisms is key to training well-rounded students. TAGs and CCGs are known to be strongly equivalent (Schiffer & Maletti 2021), and the major unification-based grammar systems (which include CCGs and HPSGs, and formalized versions of construction grammar too) are equivalent to MGs. I speculate that we should perhaps be emphasizing similarities rather than differences, insofar as those differences are not reflected in relative generative capacity.

Another way to assess the relative utility of alternative formalisms is to look at their actual use in wide-coverage computational grammars, since, as Chomsky (1981:6) says, it is possible to put systems to the test “only to the extent that we have grammatical descriptions that are reasonably compelling in some domain…”. Put another way, grammar frameworks both hegemonic and alternative can be assessed for coverage (which can be extensive, in some languages and domains) or general utility, rather than for the often-spicy rhetoric of their proponents.

Finally, it is at least possible that some alternative frameworks are simply losers of a multi-agent coordination game, and that at least some consolidation is desirable.

References

Chomsky, N. 1981. Lectures on Government and Binding. Foris.
Schiffer, L. K. and Maletti, A. 2021. Strong equivalence of TAG and CCG. Transactions of the Association for Computational Linguistics 9: 707-720.

Academic reviewing in NLP

It is obvious to me that NLP researchers are, on average, submitting manuscripts far earlier and more often than they ought to. The average manuscript I review is typo-laden, full of figures and tables far too small to actually read or intruding on the margins, with an unusable bibliography that the authors have clearly never inspected. Sometimes I receive manuscripts whose actual titles are transparently ungrammatical.

There are several reasons this is bad, but most of all it is a waste of reviewer time, since the reviewers have to point out (in triplicate or worse) minor issues that would have been flagged by proof-readers, advisors, or colleagues, were they involved before submission. Then, once these issues are corrected, the reviewers are again asked to read the paper and confirm they have been addressed. This is work the authors could have done, but which instead is pushed onto committees of unpaid volunteers.

The second issue is that the reviewer pool lacks relevant experience. I am regularly tasked with “meta-reviewing”, or critically summarizing the reviews. This is necessary in part because many, perhaps a majority, of the reviewers simply do not know how to review an academic paper, having not received instruction on this topic from their advisors or mentors, and their comments need to be recast in language that can be quickly understood by conference program committees.

[Moving from general to specific.]

I have recently been asked to review an uncommonly large collection of papers on the topic of prompt engineering. Several years ago, it became apparent that neural network language models, trained on enormous amounts of text data, could often provide locally coherent (though rarely globally coherent) responses to prompts or queries. The parade example of this type of model is GPT-2. For instance, if the prompt was:

Malfoy hadn’t noticed anything.

the model might continue:

“In that case,” said Harry, after thinking it over, “I suggest you return to the library.”

I assume this is because there’s fan fiction in the training corpus, but I don’t really know. Now, it goes without saying that at no point will Facebook, say, launch a product in which a gigantic neural network is allowed to regurgitate Harry Potter fan fiction (!) at its users. However, researchers persist, for some reason (perhaps novelty), in trying to “engineer” clever prompts that produce subjectively “good” responses, rather than attempting to understand how any of this works. (It is not an overstatement to say that we have little idea why neural networks, and the methods we use to train them in particular, work at all.) What am I to do when asked to meta-review papers like this? I try to remain collegial, but I’m not sure this kind of work ought to exist at all. I consider GPT-2 a billionaire’s plaything, and a rather wasteful one at that, and it is hard for me to see how this line of work might make the world a better place.
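For concreteness, here is a minimal sketch of how one might elicit a continuation like the one above, assuming the Hugging Face transformers library and the publicly released GPT-2 checkpoint; the continuation is sampled, so the output will differ from run to run (and from the example quoted earlier).

```python
from transformers import pipeline, set_seed

set_seed(0)  # fix the random seed so the sample is reproducible
generator = pipeline("text-generation", model="gpt2")

prompt = "Malfoy hadn't noticed anything."
result = generator(prompt, max_new_tokens=40, do_sample=True)
print(result[0]["generated_text"])
```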

Is linguistics “unusually vituperative”?

The picture of linguistics one can get from books like The Linguistics Wars (Harris 1993) and press coverage of l’affaire du Pirahã suggests it is a quite nasty sort of field, full of hate and invective. Is linguistics really, as an engineer colleague would have it, “unusually vituperative”?

In my opinion it is not, for I object to the modifier unusually. Indeed, while such stories rarely make the nightly news, the sciences have never been without a hefty dose of vituperation. For instance, anthropologist Napoleon Chagnon was accused, slanderously and at book length, of causing a measles epidemic among indigenous peoples of the Amazon. Entomologist E. O. Wilson had a pitcher of water poured on his head at a lecture because, according to a lone audience member, his research on ants implied support for eugenics. And even gentleman Darwin was not above keeping an ill-tempered bulldog.

References

Harris, R. A. 1993. The Linguistics Wars: Chomsky, Lakoff, and the Battle over Deep Structure. Oxford University Press. [I don’t recommend this book: Harris, instead of explaining the issues at stake, focuses on “horse race” coverage, quoting extensively from interviews with America’s grumpiest octogenarians.]

The 24th century Universal Translator is unsupervised and requires minimal resources

The Star Trek: Deep Space Nine episode “Sanctuary” pretty clearly establishes that by the 24th century, the Star Trek universe’s Universal Translator works in an unsupervised fashion and requires only a (what we in the real 21st century would consider) minimal monolingual corpus and a few hours of processing to translate Skrreean, a language new to Starfleet and friends. Free paper idea: how does the Universal Translator’s capabilities (in the 22nd through the 24th century, from Enterprise to the original series to the 24th century shows) map onto known terms of art in machine translation in our universe?