Libfix report for December 2019

A while ago I acquired a dictionary of English blends (Thurner 1993), and today I went through it looking for candidate libfixes I hadn’t yet recorded. Here are a few I found. From burlesque, we have lesque, used to form both boylesque and girlesque. The kumquat gives rise to quat, used in two (literal) hybrid fruits: citrangequat and limequat. From melancholy comes choly, used to build solemncholy ‘a solemn or serious mood’ and the unglossable lemoncholy. From safari there is fari, used to build seafari, surfari, and even snowfari. Documentary has given rise to mentary, as in mockumentary and rockumentary.

An interesting case is that of stache. While stache is a common clipping of mustache, it also serves as an affix, as in the liquid-based beerstache and milkstache and the pejorative fuckstache and fuzzstache.

I also found a number of libfix-like elements that can plausibly be analyzed as affixes rather than as cases of “liberation”. Some examples are eteer (blacketeer, stocketeer), legger (booklegger, meatlegger), and logue (duologue, pianologue, travelogue). I do not think these are properly classified as libfixes (they are a bit like -giving), but I could be wrong.

References

D. Thurner (1993). The Portmanteau Dictionary: Blend Words in the English Language, Including Trademarks and Brand Names. McFarland & Co.

A theory of error analysis

Manual error analyses can help to identify the strengths and weaknesses of computational systems, ultimately suggesting future improvements and guiding development. However, they are often treated as an afterthought or neglected altogether. In three recent papers, my collaborators and I have been slowly developing what might be called a theory of error analysis. The systems evaluated include:

  • number normalization (Gorman & Sproat 2016); e.g., mapping 97000 onto quatre vingt dix sept mille,
  • inflection generation (Gorman et al. 2019); e.g., mapping pairs of citation form and inflectional specification like (aufbauen, V;IND;PRS;2) onto inflected forms like baust auf, and
  • grapheme-to-phoneme conversion (Lee et al. under review); e.g., mapping orthographic forms like almohadilla onto phonemic or phonetic forms like /almoaˈdiʎa/ and [almoaˈðiʎa].
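To make the shared shape of these tasks concrete, here is a minimal Python sketch using the input-output pairs just listed; the Example container and its field names are my own illustrative choices, not anything taken from the systems or papers cited.

# A minimal sketch of the shared shape of the three tasks: each system maps an
# input representation onto an output linguistic representation. The examples
# are those given above; the container and field names are illustrative only.

from typing import NamedTuple


class Example(NamedTuple):
    task: str
    source: str  # Input representation.
    target: str  # Gold linguistic representation.


EXAMPLES = [
    Example("number normalization", "97000", "quatre vingt dix sept mille"),
    Example("inflection generation", "(aufbauen, V;IND;PRS;2)", "baust auf"),
    Example("grapheme-to-phoneme conversion", "almohadilla", "/almoaˈdiʎa/"),
]

for example in EXAMPLES:
    print(f"{example.task}: {example.source!r} -> {example.target!r}")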

While these are rather different types of problems, the systems all have one thing in common: they generate linguistic representations. I discern three major classes of error such systems might make.

  • Target errors are only apparent errors; they arise when the gold data, the data to be predicted, is linguistically incorrect. This is particularly likely to arise with crowd-sourced data, though such errors are also present in professionally annotated resources.
  • Linguistic errors are caused by misapplication of independently attested linguistic behaviors to the wrong input representations.
    • In the case of number normalization, these include using the wrong agreement affixes in Russian numbers; e.g., nom.sg. *семьдесят миллион for gen.pl. семьдесят миллионов ‘seventy million’ (Gorman & Sproat 2016:516).
    • In inflection generation, these are what Gorman et al. 2019 call allomorphy errors; for instance, overapplying ablaut to the Dutch weak verb printen ‘to print’ to produce a preterite *pront instead of printte (Gorman et al. 2019:144).
    • In grapheme-to-phoneme conversion, these include failures to apply allophonic rules; e.g., in Korean, 익명 ‘anonymity’ is incorrectly transcribed as [ikmjʌ̹ŋ] instead of [iŋmjʌ̹ŋ], reflecting a failure to apply a rule of obstruent nasalization not indicated in the highly abstract hangul orthography (Lee et al. under review).
  • Silly errors are those errors which cannot be analyzed as either target errors or linguistic errors. These have long been noted as a feature of neural network models (see, e.g., Pinker & Prince 1988 and Sproat 1992:216f. on *membled) and persist even in modern neural architectures.

I propose that this tripartite distinction is a natural starting point when building an error taxonomy for many other language technology tasks, namely those that can be understood as generating linguistic sequences.
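One way to put this starting point to work is as a lightweight annotation schema for manual error analysis. The following is a rough Python sketch; the class names, fields, and the UniMorph-style tag in the example are illustrative assumptions of mine, not drawn from the papers above.

# A sketch of the tripartite taxonomy as a simple annotation schema one might
# use when manually classifying a system's errors. All names here are
# illustrative, not taken from the cited papers.

import enum
from dataclasses import dataclass


class ErrorClass(enum.Enum):
    TARGET = "target"          # The gold data itself is linguistically incorrect.
    LINGUISTIC = "linguistic"  # An attested pattern misapplied to the wrong input.
    SILLY = "silly"            # Neither of the above.


@dataclass
class AnalyzedError:
    source: str      # Input representation.
    gold: str        # Target representation (possibly itself incorrect).
    hypothesis: str  # The system's prediction.
    label: ErrorClass
    note: str = ""


# For instance, the Dutch allomorphy error discussed above (the inflectional
# tag is an illustrative UniMorph-style guess):
error = AnalyzedError(
    source="(printen, V;IND;PST;SG)",
    gold="printte",
    hypothesis="pront",
    label=ErrorClass.LINGUISTIC,
    note="ablaut overapplied to a weak verb",
)
print(error.label.value, error.hypothesis, "for", error.gold)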

References

K. Gorman, A. D. McCarthy, R. Cotterell, E. Vylomova, M. Silfverberg, and M. Markowska (2019). Weird inflects but OK: making sense of morphological generation errors. In CoNLL, 140-151.
K. Gorman and R. Sproat (2016). Minimally supervised number normalization. Transactions of the Association for Computational Linguistics 4: 507-519.
J. L. Lee, L. F. E. Ashby, M. E. Garza, Y. Lee-Sikka, S. Miller, A. Wong, A. D. McCarthy, and K. Gorman (under review). Massively multilingual pronunciation mining with WikiPron.
S. Pinker and A. Prince (1988). On language and connectionism: analysis of a parallel distributed processing model of language acquisition. Cognition 28(1–2):73–193.
R. Sproat (1992). Morphology and computation. Cambridge: MIT Press.