Major projects at the Computational Linguistics lab

[The following is geared towards our incoming students. I’m just using the blog as an easy publishing mechanism.]

The following are some major projects ongoing in the GC Computational Linguistics Lab.

Many phonologists believe that phonotactic knowledge is independent of knowledge of phonological alternations. In my dissertation I evaluated computational models of autonomous phonotactic knowledge as predictions of speakers’ judgments of wordlikeness, and I found that these fail to consistently outperform simple baselines. In part, these models fail because they predict gradience that is poorly correlated with human judgments. However, these conclusions were tentative because of the poor quality of the available data, which was collected with little attention paid to experimental design or choice of stimuli. With funding from the National Science Foundation, and in collaboration with professors Karthik Durvasula at Michigan State University and Jimin Kahng at the University of Mississippi, we are building an open-source “megastudy” of human wordlikeness judgments and performing computational modeling of the resulting data.

Speech recognizers and synthesizers are, essentially, engines for synthesizing or recognizing sequences of phonemes. Therefore, it is necessary to transform text into phoneme sequences. Such transformations are challenging insofar as they require linguistic expertise (and language-specific knowledge) and are not always amenable to generic machine learning techniques. We are engaged in several projects involving these mappings. The lab maintains WikiPron (Lee et al. 2020), software and databases for building multilingual pronunciation dictionaries, and has organized two SIGMORPHON shared tasks on multilingual grapheme-to-phoneme conversion (Gorman et al. 2020, Ashby et al. 2021). And with funding from the CUNY Professional Staff Congress, PhD student Amal Aissaoui is engaged in building diacritization engines for Arabic and Latin, engines which supply missing pronunciation information for these scripts.

Morphological generation systems use machine learning to predict the inflected forms of words. In 2019 I led a team of researchers in an error analysis of the top two systems in the CoNLL-SIGMORPHON 2017 shared task on morphological generation (Gorman et al. 2019). We found that the top models struggled with inflectional patterns which are sensitive to lexeme-inherent morphosyntactic features like gender, animacy, and aspect, which are not provided in the task data. For instance, the top models often inflect Russian perfective verbs as if they were imperfective, or Polish inanimate nouns as if they were animate. Finally, we found that models struggle with abstract morphophonological patterns which cannot be inferred from the citation form alone. For instance, the top models struggle to predict whether or not a Spanish verb will undergo diphthongization under stress (e.g., negar-niego ‘to deny-I deny’ vs. pegar-pego ‘to stick-I stick’). In collaboration with professor Katharina Kann and PhD student Adam Weimerslage at the University of Colorado, Boulder, we are developing an open-source “challenge set” for morphological generation, a set that targets complex inflectional patterns in a diverse sample of 10-20 languages. This challenge set will act as a benchmark for neural network models of inflection, and will allow us to further study inherent features and abstract morphophonological patterns. In designing this challenge set we have targeted a wide variety of morphological processes, including reduplication and templatic formation in addition to affixation and stem change. MA students Kristysha Chan, Mariana Graterol, and M. Elizabeth Garza, and PhD student Selin Alkan have all contributed to the development of this challenge set thus far.

Inflectional defectivity is the poorly-understood dark twin of productivity. With funding from the CUNY Professional Staff Congress, Emily Charde (MA 2020) is engaged in a computational study of defectivity in Greek nouns and Russian verbs.

Quiet quitting is work-to-rule but worse

This week’s hot media trend is quiet quitting, and if you’re even remotely familiar with the US labor movement, you’ll recognize this as a version of organized labor’s work to rule actions, in which workers do the absolute minimum amount of work required by the contract. The difference is that a quiet quitter slacks off alone, whereas work to rule actions are applied across organized groups of employees under similar work conditions. The Wall St. Journal is willing to tell you about the former behavior, which is youth-coded and unlikely to result in improved conditions, but is not in a hurry to tell you about traditional forms of collective labor action.

On who is allowed to graduate

There is a convention I’ve seen at several institutions whereby a student (usually a PhD student) who already has a job or post-doc lined up is permitted to defend a dissertation that is less complete than would otherwise be accepted were they not up against a deadline. One suspects this sort of thing is applied in a rather biased fashion, but let’s suppose it was not. I cannot see any justification for it. It produces poor science, it is bad for departmental morale and esprit de corps, and it doesn’t prepare the student for future success in an environment where their advisor can no longer put a finger on the scale.

Now it is true that advisors or committee members, for whatever reason, occasionally try to squeeze a student for one more experiment that is more of a nice-to-have than essential to the argument being made in the thesis, but it is not clear why accepting a sub-par dissertation should be a remedy for it, and why such a remedy should only be available if you have a new job starting in two weeks.

Defectivity in Kinande

[This is part of a series of defectivity case studies.]

I have already written a bit about reduplication in Kinande; it too is an example of inflectional defectivity, and here I’ll focus on that fact.

In this language, most verbs participate in a form of reduplication with the semantics of roughly ‘to hurriedly V’ or ‘to repetitively V’. Mutaka & Hyman (1990; henceforth MH) argue that the reduplicant is a bisyllabic prefix. For instance, the reduplicated form of e-ri-gend-a ‘to leave’ is e-ri-gend-a-gend-a ‘to leave hurriedly’, where the first gend-a is the reduplicant. (In MH’s terms, e- is the “augment”, -ri- the “prefix”, and -a is the “final vowel” morpheme.)

Certain verbal suffixes, known to Bantuists as extensions, may also be found in the reduplicant when the reduplicant would otherwise be less than bisyllabic. For instance, the passive suffix, underlyingly /-u-/, surfaces as [w] and is copied by reduplication. Thus for the verb root hum ‘beat’ the passive e-ri-hum-w-a reduplicates as e-ri-hum-w-a-hum-w-a. More interesting are the “unproductive” (MH’s term) extensions.1 Verbs bearing these extensions rarely have a compositional semantic relationship with their unextended form (if an unextended verb stem exists at all). For instance, whereas luh-uk-a ‘take a rest’ may be semantically related to luh-a ‘be tired’, there is no unextended *bát-a to go with bát-uk-a ‘move’.

Interesting things happen when we try to reduplicate unproductively extended monosyllabic verb roots. For some such verbs, the extension is not reduplicated; e.g., e-rí-bang-uk-a ‘to jump about’ has a reduplicated form e-rí-bang-a-bang-uk-a. This is the same behavior found for “productive” extensions. For others, the extension is reduplicated, producing a trisyllabic (rather than the normal bisyllabic) reduplicant; e.g., e-ri-hur-ut-a ‘to snore’ has a reduplicated form e-ri-hur-ut-a-hur-ut-a. Finally, there are some stems, all monosyllabic verb roots with unproductive extensions, which do not undergo reduplication; e.g., e-rí-bug-ul-a ‘to find’ does not reduplicate, and neither *e-rí-bug-a-bug-ul-a nor *e-rí-bug-ul-a-bug-ul-a exists.

While one could imagine there are certain semantic restrictions on reduplication, as in Chaha, MH make no mention of such restrictions in Kinande. If possible, we should rule this out as an explanation for the aforementioned defectivity.

Endnotes

  1. I will segment these with hyphens though it may make sense to regard some unproductive extensions as part of morphologically simplex stems.

References

Mutaka, N. and Hyman, L. M. 1990. Syllables and morpheme integrity in Kinande reduplication. Phonology 7: 73-119.

re.compile is otiose

Unlike its cousins Perl and Ruby, Python has no literal syntax for regular expressions. Whereas one can express the sheep language /baa+/ with a simple forward-slashed literal in Perl and Ruby, in Python one has to compile them using the function re.compile, which produces objects of type re.Pattern. Such objects have various methods for string matching.

import re

sheep = re.compile(r"baa+")
assert sheep.match("baaaaaaaa")

Except, one doesn’t actually have to compile regular expressions at all, as the documentation explains:

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

What this means is that in the vast majority of cases, re.compile is otiose (i.e., unnecessary). One can just define expression strings, and pass them to the equivalent module-level functions rather than using the methods of re.Pattern objects.

sheep = r"baa+"
assert re.match(sheep, "baaaaaaaa")

This, I would argue, is slightly easier to read, and, thanks to the cache, not meaningfully slower. It also makes typing a bit more convenient since str is easier to type than re.Pattern.

Now, I am sure there is some usage pattern which would favor explicit re.compile, but I have not encountered one in code worth profiling.
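To make the comparison concrete, here is a minimal sketch of the two styles side by side. The function names find_sheep and find_sheep_compiled are hypothetical, invented for this illustration; only re.fullmatch, re.compile, and re.Pattern come from the standard library. It assumes Python 3.8 or later, where re.Pattern is available as a name for annotations.

```python
import re
from typing import List

SHEEP = r"baa+"  # A plain str; no explicit re.compile call.


def find_sheep(pattern: str, texts: List[str]) -> List[str]:
    """Returns the texts fully matched by a pattern string.

    The module-level re.fullmatch compiles the pattern on first use and
    reuses the cached compiled version on later calls.
    """
    return [text for text in texts if re.fullmatch(pattern, text)]


def find_sheep_compiled(pattern: re.Pattern, texts: List[str]) -> List[str]:
    """Same as above, but the caller must pass a compiled pattern."""
    return [text for text in texts if pattern.fullmatch(text)]


candidates = ["baa", "ba", "baaaa", "moo"]
# Both styles select the same strings.
print(find_sheep(SHEEP, candidates))
print(find_sheep_compiled(re.compile(SHEEP), candidates))
```

Note that the first signature takes an ordinary str, so callers never have to touch re.compile or re.Pattern at all.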

Defectivity in Polish

[This is part of a series of defectivity case studies.]

Gorman & Yang (2019), following up on a tip from Margaret Borowczyk (p.c.), discuss inflectional gaps in Polish declension. In this language, masculine genitive singular (gen.sg.) forms are marked either with -a or -u. The two gen.sg. suffixes have a similar type frequency, and neither appears to be more default-like than the other. For instance, both allomorphs are used with loanwords. Because of this, it is generally agreed that the gen.sg. allomorphy is purely arbitrary and must be learned by rote, a process that continues into adulthood (e.g., Dąbrowska 2001, 2005).

Kottum (1981: 182) reports his informants have no gen.sg. for masculine-gender toponyms like Dublin ‘id.’ (e.g., *Dublina/*Dublinu), Göteborg ‘Gothenburg’ and Tarnobrzeg ‘id.’, and Gorman & Yang (2019: 184) report their informants do not have a gen.sg. for words like drut ‘wire’ (e.g., *druta/*drutu, though the latter is prescribed), rower ‘bicycle’, balon ‘balloon’, karabin ‘rifle’, autobus ‘bus’, and lotos ‘lotus flower’.

References

Dąbrowska, E. 2001. Learning a morphological system without a default: The Polish genitive. Journal of Child Language 28: 545-574.
Dąbrowska, E. 2005. Productivity and beyond: mastering the Polish genitive inflection. Journal of Child Language 32: 191-205.
Gorman, K. and Yang, C. 2019. When nobody wins. In F. Rainer, F. Gardani, H. C. Luschützky and W. U. Dressler (eds.), Competition in Inflection and Word Formation, pages 169-193. Springer.
Kottum, S. S. 1981. The genitive singular form of masculine nouns in Polish. Scando-Slavica 27: 179-186.

Defectivity in Chaha

[This is part of a series of defectivity case studies.]

Rose (2000) describes a circumscribed form of defectivity in Chaha, a Semitic language spoken in Ethiopia. Throughout Ethio-Semitic, many verbs have a frequentative formed using a quadriliteral verbal template. Since few verb roots are quadriconsonantal (most are triconsonantal, some are biconsonantal), a sort of reduplication and/or spreading is used to fill in the template. In Tigrinya, for instance (p. 318), the frequentative template is of the form CɘCaCɘC. Thus, the frequentative of the triconsonantal verb root √/grf/ ‘collect’ is [gɘrarɘf], with the root /r/ repeated, and for a biconsonantal verb root like √/ħt/ ‘ask’, the frequentative is [ħatatɘt], with three root /t/s.

Rose contrasts this state of affairs with Chaha. In this language, the frequentative template CɨCɘCɘC cannot be satisfied by a biconsonantal root like √/tʼm/ ‘bend’ or √/Rd/ ‘burn’, and all such verbs lack a frequentative.1 The expected *[tʼɨmɘmɘm] and *[nɨdɘdɘd] are ill-formed, as are all other alternatives. Furthermore, no frequentatives of any sort can be formed with quadriconsonantal roots.

Rose notes that while there are often semantic reasons for a verb to lack a frequentative (e.g., stative and resultative verbs are generally not compatible with it), this does not seem applicable here.
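The template-filling logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rose’s analysis: the function name frequentative, the allow_biconsonantal flag, and the two vowel tuples are hypothetical names invented here. The sketch fills consonant slots and inserts the template vowels literally, so for √/ħt/ it produces the template vowel in the first syllable rather than the [a] of the reported [ħatatɘt]; it makes no attempt to model that difference. A return value of None means only that the sketch produces no form.

```python
from typing import Optional, Sequence, Tuple

# Vowels of the C V C V C V C templates given in the text.
TIGRINYA_VOWELS: Tuple[str, str, str] = ("ɘ", "a", "ɘ")  # CɘCaCɘC
CHAHA_VOWELS: Tuple[str, str, str] = ("ɨ", "ɘ", "ɘ")  # CɨCɘCɘC


def frequentative(
    root: Sequence[str],
    vowels: Sequence[str],
    allow_biconsonantal: bool,
) -> Optional[str]:
    """Fills a four-consonant template; returns None if no form is produced."""
    if len(root) == 3:
        # Repeat the medial root consonant: C1 C2 C2 C3.
        slots = [root[0], root[1], root[1], root[2]]
    elif len(root) == 2 and allow_biconsonantal:
        # Spread the second root consonant: C1 C2 C2 C2.
        slots = [root[0], root[1], root[1], root[1]]
    else:
        # Biconsonantal roots when disallowed (Chaha), and any root size
        # this sketch does not model.
        return None
    pieces = []
    for i, consonant in enumerate(slots):
        pieces.append(consonant)
        if i < len(vowels):
            pieces.append(vowels[i])
    return "".join(pieces)


# Tigrinya: triconsonantal root, medial /r/ repeated.
print(frequentative(("g", "r", "f"), TIGRINYA_VOWELS, True))  # gɘrarɘf
# Chaha: biconsonantal root, so the sketch returns None (a gap).
print(frequentative(("tʼ", "m"), CHAHA_VOWELS, False))  # None
```

The only difference between the two languages in this sketch is the allow_biconsonantal flag, which is one way of stating that the Chaha gap is not predicted by the template alone.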

Endnotes

  1. As Rose explains: “R represents a coronal sonorant which may be realized as [n] or [r] depending on context…” (p. 317).

References

Rose, S. 2000. Multiple correspondence in reduplication. In Proceedings of the 23rd Annual Meeting of the Berkeley Linguistics Society, pages 315-326.

Defectivity in English

[This is part of a small but growing series of defectivity case studies.]

English lexical verbs can have up to 5 distinct forms, and I am aware of just a few English verbs which are defective. (The following are all my personal judgments.)

  1. I can use begone as an imperative, though it has the form of a past participle (cf. gone and forgone). Is BEGO even a verb lexeme anymore?
  2. Fodor (1972), following Lakoff (1970 [1965]), notes that BEWARE has a limited distribution and never bears explicit inflection. For me, it can occur only as a positive imperative (e.g., beware the dog!), with or without emphatic do. I agree with Fodor that it is also bad under negation, but perhaps for unrelated reasons: e.g., *don’t beware… 
  3. FORGO lacks a simple past: forgo, forgoes, and forgoing are fine, as is the past participle forgone, but *forwent is bad as the preterite/simple past, and *forgoed is perhaps a bit worse.
  4. METHINK can only be used in the 3sg. present active indicative form methinks, and doesn’t allow for an explicit subject.
  5. STRIDE lacks a past participle (e.g., Hill 1976: 668, Pinker 1999: 136f., Pullum and Wilson 1977: 770): *stridden is bad. The simple past strode cannot be reused here, and I cannot use the regular *strided (under the relevant sense).

References

Fodor, J. D. 1972. Beware. Linguistic Inquiry 3: 528-534.
Hill, A. A. 1976. [Obituary:] Albert Henry Marckwardt. Language 52: 667-681.
Lakoff, G. 1970. Irregularity in Syntax. Holt, Rinehart and Winston.
Pinker, S. 1999. Words and Rules: The Ingredients of Language. Basic Books.
Pullum, G. K. and Wilson, D. 1977. Autonomous syntax and the analysis of auxiliaries. Language 53: 741-788.