Why language resources should be dynamic

Virtually all the digital linguistic resources used in speech and language technology are static in the sense that they are:

  1. One-time: they are generated once and never updated.
  2. Read-only: they provide no mechanisms for corrections, feature requests, etc.
  3. Closed-source: code and raw data used to generate the data are not released.

However, there are some benefits to designing linguistic resources dynamically, allowing them to be repeatedly regenerated and iteratively improved with the help of the research community. I’ll illustrate this with WikiPron (Lee et al. 2020), our database-cum-library for multilingual pronunciation data.

The data

Pronunciation dictionaries are an important resource for speech technologies like automatic speech recognition and text-to-speech synthesis. Several teams have considered the possibility of mining pronunciation data from the internet, particularly from the free online dictionary Wiktionary, which by now contains millions of crowd-sourced pronunciations transcribed using the International Phonetic Alphabet. However, none of these prior efforts released any code, nor were their scrapes run repeatedly, so at best they represent a single (2016 or 2011) slice of the data.

The tool

WikiPron is, first and foremost, a Python command-line tool for scraping pronunciation data from Wiktionary. Stable versions can be installed from PyPI using tools like pip. Once the tool is installed, users specify a language, optionally a dialect, and various optional flags, and pronunciation data is printed to STDOUT as a two-column TSV file. Since this requires an internet connection and may take a while, the system is even able to resume where it left off in case of connection hiccups. The code is carefully documented, tested, type-checked, reflowed, and linted using the CircleCI continuous integration system.
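
To give a sense of how little is needed to consume this output, here is a minimal sketch in Python that reads two-column TSV text into a pronunciation dictionary. It assumes one word and one pronunciation per line, separated by a tab. The function name read_pron_tsv and the sample rows are my own illustrations, not part of WikiPron or taken from a real scrape.

```python
import io
from collections import defaultdict


def read_pron_tsv(handle):
    """Maps each word to the list of pronunciations found for it.

    Assumes every well-formed line is `word<TAB>pronunciation`; lines that
    do not have exactly two columns are skipped.
    """
    lexicon = defaultdict(list)
    for line in handle:
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 2:
            continue
        word, pron = columns
        lexicon[word].append(pron)
    return dict(lexicon)


# Illustrative rows only (hypothetical, not real scraped data).
SAMPLE = "cat\tk æ t\ndog\td ɒ ɡ\ndog\td ɔ ɡ\n"

lexicon = read_pron_tsv(io.StringIO(SAMPLE))
for word, prons in sorted(lexicon.items()):
    print(word, prons)
```

A word may have more than one pronunciation, which is why each word maps to a list rather than to a single string.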

The infrastructure

We also release, at least annually, a multilingual pronunciation dictionary created using WikiPron. This increases replicability, permits users to see the format and scale of the data WikiPron makes available, and finally allows casual users to bypass the command-line tool altogether. To do this, we provide the data/ directory, which contains the data along with code that automates “the big scrape”, the process by which we regenerate the multilingual pronunciation dictionary. It includes

  • the data for 335 (at time of writing) languages, dialects, scripts, etc.,
  • code for discovering languages supported by Wiktionary,
  • code for (re)scraping all languages,
  • code for (re)generating data summaries (both computer-readable TSV files and human-readable READMEs rendered by GitHub), and
  • integration tests that confirm the data summaries match the checked-in data,

as well as code and data used for various quality assurance processes. 
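
The idea behind the last two items can be sketched in a few lines of Python. This is not the code in the data/ directory. It is a toy version under my own assumptions: a summary is just a TSV of data set names and entry counts, and the function names (count_entries, summarize, summary_to_tsv, check_summary) and data set names are hypothetical.

```python
def count_entries(tsv_text):
    """Counts the non-blank lines (entries) in one TSV data set."""
    return sum(1 for line in tsv_text.splitlines() if line.strip())


def summarize(datasets):
    """Returns (name, entry count) pairs, sorted by data set name."""
    return sorted((name, count_entries(text)) for name, text in datasets.items())


def summary_to_tsv(summary):
    """Renders a summary as computer-readable TSV text."""
    return "".join(f"{name}\t{count}\n" for name, count in summary)


def check_summary(datasets, checked_in_summary):
    """Integration-style check: does a fresh summary match the stored one?"""
    return summary_to_tsv(summarize(datasets)) == checked_in_summary


# Hypothetical data sets and a hypothetical checked-in summary.
datasets = {"xx_demo": "a\tb\nc\td\n", "yy_demo": "e\tf\n"}
stored = "xx_demo\t2\nyy_demo\t1\n"
print(check_summary(datasets, stored))
```

The point of such a check is that a summary can never silently drift out of sync with the data it describes: if someone rescrapes without regenerating the summary, the comparison fails.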

Dynamic language resources

In what sense is WikiPron a dynamic language resource? 

  1. It is many-time: it can be run as many times as one wants. Even the static data sets produced by “the big scrape” are updated more than once a year.
  2. It is read-write: one can improve WikiPron data by correcting Wiktionary, and we provide instructions for contributors wishing to send pull requests to the tool.
  3. It is open-source: all code is licensed under the Apache 2.0 license; the data bears a Creative Commons Attribution-ShareAlike 3.0 Unported License inherited from Wiktionary.

Acknowledgements

Most of the “dynamic” features in WikiPron were implemented by CUNY Graduate Center PhD student Lucas Ashby and my colleague Jackson Lee; I have at best served as an advisor and reviewer.

References

Lee, J. L., Ashby, L. F. E., Garza, M. E., Lee-Sikka, Y., Miller, S., Wong, A., McCarthy, A. D., and Gorman, K. 2020. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228.

Thought experiment #1

A non-trivial portion of what we know about the languages we speak includes information about lexically-arbitrary behaviors, behaviors that are specific to certain roots and/or segments and absent in other superficially-similar roots and/or segments. One of the earliest examples is the failure of English words like obesity to undergo Chomsky & Halle’s (1968: 181) rule of trisyllabic shortening: compare serene-serenity to obese-obesity (Halle 1973: 4f.). Such phenomena are very common in the world’s languages. Some of the well-known examples include Romance mid-vowel metaphony and the Slavic fleeting vowels, which delete in certain phonological contexts.1

Linguists have long claimed (e.g., Harris 1969) that one cannot predict whether a Spanish e or o in the final syllable of a verb stem will or will not undergo diphthongization (to ie or ue, respectively) when stress falls on the stem rather than the desinence. For instance, negar ‘to deny’ diphthongizes (niego ‘I deny’, *nego) whereas the superficially similar pegar ‘to stick to s.t.’ does not (pego ‘I stick to s.t.’, *piego). There is no reason to suspect that the preceding segment (n vs. p) has anything to do with it; the Spanish speaker simply needs to memorize which mid vowels diphthongize.2 The same is arguably true of the Polish fleeting vowels known as yers, which delete in, among other contexts, the genitive singular (gen.sg.) of masculine nouns. Thus sen ‘dream’ has a gen.sg. snu, with deletion of the internal e, whereas the superficially similar basen ‘pool’ has a gen.sg. basenu, retaining the internal e (Rubach 2016: 421). Once again, the Polish speaker needs to memorize whether or not each e deletes.

So as to not presuppose a particular analysis, I will refer to segments with these unpredictable alternations (diphthongization in Spanish, deletion in Polish) as magical. Exactly how this magic ought to be encoded is unclear.3 One early approach was to exploit the feature system so that magical segments were underlyingly distinct from non-magical ones. These “exploits” might include mapping magical segments onto gaps in the surface segmental inventory, underspecification, or simply introducing new features. Nowadays, phonologists are more likely to use prosodic prespecification. For instance, Rubach (1986) proposes that the Polish yers are prosodically defective compared to non-alternating e.4 Others have claimed that magic resides in the morph, not the segment.
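
To make the bookkeeping concrete, here is a toy sketch in Python of the crudest possible encoding: magical vowels are simply given their own underlying symbols. The uppercase E and O notation, the function names, and the string-rewriting "grammar" are all assumptions of this sketch. It is not Harris's or Rubach's analysis, and it takes no stand on whether features, prosody, or morphs are the right locus of the magic.

```python
# Toy notation (an assumption of this sketch): uppercase E and O stand for
# magical mid vowels; lowercase e and o are ordinary, non-alternating vowels.
SPANISH_STRESSED = {"E": "ie", "O": "ue"}
SPANISH_UNSTRESSED = {"E": "e", "O": "o"}


def spell_out(stem, table):
    """Rewrites each magical segment using the table; others pass through."""
    return "".join(table.get(segment, segment) for segment in stem)


def spanish_infinitive(stem):
    # Stress falls on the desinence, so magical vowels surface as plain mid vowels.
    return spell_out(stem, SPANISH_UNSTRESSED) + "ar"


def spanish_first_singular(stem):
    # Stress falls on the stem, so magical vowels diphthongize.
    return spell_out(stem, SPANISH_STRESSED) + "o"


def polish_nom_sg(stem):
    # The magical vowel surfaces as e in the nominative singular.
    return stem.replace("E", "e")


def polish_gen_sg(stem):
    # The magical vowel deletes before the genitive singular ending -u.
    return stem.replace("E", "") + "u"


for stem in ("nEg", "peg"):
    print(spanish_infinitive(stem), spanish_first_singular(stem))
for stem in ("sEn", "basen"):
    print(polish_nom_sg(stem), polish_gen_sg(stem))
```

The two Spanish stems are spelled identically in the infinitive, as are the final syllables of the two Polish nouns in the nominative. The different outcomes are possible only because the stored forms differ.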

Regardless of how the magic is encoded, it is a deductive necessity that it be encoded somehow. Clearly something is representationally different in negar and pegar, and sen and basen. Any account which discounts this will be descriptively inadequate. To make this a bit clearer, consider the following thought experiment:

We are contacted by a benign, intelligent alien race, carbon-based lifeforms from the Rigel system with feliform physical morphology and a fondness for catnip. Our scientists observe that they exhibit a strange behavior: when they imbibe fountain soda, their normally-green eyes turn yellow, and when they imbibe soda from a can, their eyes turn red. Scientists have not yet been able to determine the mechanisms underlying these behaviors.

What might we reason about the aliens’ seemingly magical soda sense? If we adopt a sort of vulgar uniformitarianism (one which rejects outlandish explanations like time travel or mind-reading), then the only possible explanation remaining to us is that there really is something chemically distinct between the two classes of soda, and the Rigelian sensory system is sensitive to this difference.

Really, this deduction isn’t so different from the one made by linguists like Harris and Rubach: both observe different behaviors and posit distinct entities to explain them. Of course, there is something ontologically different between the two types of soda and the two types of Polish e. The former is a purely chemical difference; the latter arises because the human language faculty turns primary linguistic data, through the epistemic process we call first language acquisition, into one type of meat (brain tissue), and that type of meat makes another type of meat (the articulatory apparatus) behave in a way that, all else held equal, will recapitulate the primary linguistic data. But both of these deductions are equally valid.

Endnotes

  1. Broadly-similar phenomena previously studied include fleeting vowels in Finnish, Hungarian, Turkish, and Yine, ternary voice contrasts in Turkish, possessive formation in Huichol, and passive formation in Māori.
  2. For simplicity I put aside the arguments by Pater (2009) and Gouskova (2012) that morphs, not segments, are magical. While I am not yet convinced by their arguments, everything I have to say here is broadly consistent with their proposal.
  3. This is yet another feature of language that is difficult to falsify. But as Ollie Sayeed once quipped, the language faculty did not evolve to satisfy a vulgar Popperian falsificationism.
  4. Specifically, Rubach assumes that the non-alternating e’s have a prespecified mora, whereas the alternating e’s do not.

References

Chomsky, N. and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Gouskova, M. 2012. Unexceptional segments. Natural Language & Linguistic Theory 30: 79-133.
Halle, M. 1973. Prolegomena to a theory of word formation. Linguistic Inquiry 4: 3-16.
Harris, J. 1969. Spanish Phonology. MIT Press.
Pater, J. 2009. Morpheme-specific phonology: constraint indexation and inconsistency resolution. In S. Parker (ed.), Phonological Argumentation: Essays on Evidence and Motivation, pages 123-154. Equinox.
Rubach, J. 1986. Abstract vowels in three-dimensional phonology: the yers. The Linguistic Review 5: 247-280.
Rubach, J. 2016. Polish yers: Representation and analysis. Journal of Linguistics 52: 421-466.