Character-based speech technology

Right now everyone seems to be moving to character-based speech recognizers and synthesizers. A character-based speech recognizer is an ASR system in which there is no explicit representation of phones, just Unicode codepoints on the output side. Similarly, a character-based synthesizer is a TTS engine without an explicit mapping onto pronunciations, just orthographic inputs. It is generally assumed that the model ought to learn this sort of thing implicitly (and only as needed).

I genuinely don’t understand why this is supposed to be better. Phonemic transcription really does carry more information than orthography, in the vast majority of languages, and making it an explicit target is going to do a better job of guiding the model than hoping the system automatically self-organizes. Neural nets trained for language tasks often develop an implicit representation of some linguistically well-defined feature, but they tend to do better when that feature is made explicit.

My understanding is that end-to-end systems have potential advantages over pipeline systems when information and uncertainty from previous steps can be carried through to help later steps in the pipeline. But that doesn’t seem applicable here. Building these explicit mappings from words to pronunciations and vice versa is not all that hard, and the information used to resolve ambiguity is not particularly local. Cherry-picked examples aside, it is not at all clear that these models can handle locally conditioned pronunciation variants (the article a pronounced uh or aye), homographs (the two pronunciations of bass in English), or highly deficient writing systems (think Perso-Arabic) better than the ordinary pipeline approach. One has to suspect that the long tail of these character-based systems is littered with nonsense.

Major projects at the Computational Linguistics Lab

[The following is geared towards our incoming students. I’m just using the blog as an easy publishing mechanism.]

The following are some major projects ongoing in the GC Computational Linguistics Lab.

Many phonologists believe that phonotactic knowledge is independent of knowledge of phonological alternations. In my dissertation I evaluated computational models of autonomous phonotactic knowledge as predictors of speakers’ judgments of wordlikeness, and I found that these fail to consistently outperform simple baselines. In part, these models fail because they predict gradience that is poorly correlated with human judgments. However, these conclusions were tentative because of the poor quality of the available data, which was collected with little attention paid to experimental design or choice of stimuli. With funding from the National Science Foundation, and in collaboration with professors Karthik Durvasula at Michigan State University and Jimin Kahng at the University of Mississippi, we are building an open-source “megastudy” of human wordlikeness judgments and performing computational modeling of the resulting data.
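For concreteness, here is a minimal sketch of the sort of simple baseline at issue: a smoothed phoneme bigram model that assigns a wordlikeness score to nonce words. The toy lexicon and transcriptions below are made up for illustration; this is not the lab’s actual modeling code.

```python
import math
from collections import Counter

# Toy training lexicon of phoneme sequences (illustrative only).
lexicon = [["k", "æ", "t"], ["b", "æ", "t"], ["b", "æ", "d"], ["s", "t", "ɪ", "k"]]

def train_bigrams(words):
    unigrams, bigrams = Counter(), Counter()
    for word in words:
        padded = ["#"] + word + ["#"]  # word boundaries
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return unigrams, bigrams

def score(word, unigrams, bigrams, alpha=0.1):
    # Add-alpha smoothed log-probability of the phoneme sequence;
    # higher (less negative) scores = more wordlike.
    padded = ["#"] + word + ["#"]
    vocab = len(unigrams) + 1
    logp = 0.0
    for prev, curr in zip(padded[:-1], padded[1:]):
        logp += math.log((bigrams[(prev, curr)] + alpha) / (unigrams[prev] + alpha * vocab))
    return logp

unigrams, bigrams = train_bigrams(lexicon)
print(score(["b", "æ", "k"], unigrams, bigrams))       # relatively wordlike
print(score(["b", "n", "æ", "k"], unigrams, bigrams))  # much less wordlike
```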

Speech recognizers and synthesizers are, essentially, engines for recognizing or synthesizing sequences of phonemes. Therefore, it is necessary to transform text into phoneme sequences. Such transformations are challenging insofar as they require linguistic expertise—and language-specific knowledge—and are not always amenable to generic machine learning techniques. We are engaged in several projects involving these mappings. The lab maintains WikiPron (Lee et al. 2020), software and databases for building multilingual pronunciation dictionaries, and has organized two SIGMORPHON shared tasks on multilingual grapheme-to-phoneme conversion (Gorman et al. 2020, Ashby et al. 2021). And with funding from the CUNY Professional Staff Congress, PhD student Amal Aissaoui is engaged in building diacritization engines for Arabic and Latin, engines which supply missing pronunciation information for these scripts.
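As a concrete (and heavily simplified) illustration of the word-to-pronunciation mapping, the sketch below loads and queries a WikiPron-style pronunciation dictionary, assuming the two-column word/pronunciation TSV layout; the file name is hypothetical.

```python
import csv

def load_pron_dict(path):
    """Loads a two-column TSV (word<TAB>space-separated phones) into a dict."""
    prons = {}
    with open(path, encoding="utf-8") as source:
        for word, pron in csv.reader(source, delimiter="\t"):
            # A word may have more than one listed pronunciation.
            prons.setdefault(word, []).append(pron.split())
    return prons

# Hypothetical file name for illustration.
prons = load_pron_dict("spa_latn_broad.tsv")
print(prons.get("gato"))  # e.g., [['ɡ', 'a', 't', 'o']]
```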

Morphological generation systems use machine learning to predict the inflected forms of words. In 2019 I led a team of researchers in an error analysis of the top two systems in the CoNLL-SIGMORPHON 2017 shared task on morphological generation (Gorman et al. 2019). We found that the top models struggled with inflectional patterns which are sensitive to lexeme-inherent morphosyntactic features like gender, animacy, and aspect, which are not provided in the task data. For instance, the top models often inflect Russian perfective verbs as if they were imperfective, or Polish inanimate nouns as if they were animate. We also found that models struggle with abstract morphophonological patterns which cannot be inferred from the citation form alone. For instance, the top models struggle to predict whether or not a Spanish verb will undergo diphthongization under stress (e.g., negar ‘to deny’ ~ niego ‘I deny’ vs. pegar ‘to stick’ ~ pego ‘I stick’). In collaboration with professor Katharina Kann and PhD student Adam Wiemerslage at the University of Colorado, Boulder, we are developing an open-source “challenge set” for morphological generation, one that targets complex inflectional patterns in a diverse sample of 10-20 languages. This challenge set will act as a benchmark for neural network models of inflection, and will allow us to further study inherent features and abstract morphophonological patterns. In designing these challenge sets we have targeted a wide variety of morphological processes, including reduplication and templatic formation in addition to affixation and stem change. MA students Kristysha Chan, Mariana Graterol, and M. Elizabeth Garza, and PhD student Selin Alkan have all contributed to the development of this challenge set thus far.

Inflectional defectivity is the poorly understood dark twin of productivity. With funding from the CUNY Professional Staff Congress, Emily Charde (MA 2020) is engaged in a computational study of defectivity in Greek nouns and Russian verbs.

“…phonology is logically (and causally) prior to phonetics.”

Two important consequences follow from this. First, that phonology is logically (and causally) prior to phonetics as here defined. Second, phonology is also epistemologically prior to phonetics. Judgments about phonetic events are invariably made in terms of perceptual phonology. (Hammarberg 1976:356)

In this post I’d like to briefly review a view of the relationship between phonetics and phonology as related by Hammarberg (1976) and Appelbaum (1996), the former being primarily concerned with production and the latter with perception.

Phonetics, being concerned with the material and physical, has tended to align itself with the physical sciences (and physics in particular), and with the empiricist tradition in science.1,2 In contrast, much of what has been called the cognitive revolution in the cognitive sciences, and in linguistics in particular, is explicitly anti-empiricist. As Hammarberg and Appelbaum argue, the empiricist biases of phonetics make it ill-suited to explain fundamental facts about speech.

It is generally understood that spoken language is not produced as a discrete sequence but rather as a series of overlapping gestures and acoustic signatures. Anyone who has looked closely at the acoustics of speech will already recognize that it is impossible to say exactly where, in a word like cat, the [æ]-ness ends and the [t]-ness begins. In a word like soon, the fricative portion shows signs of rounding not found in words like scene. From an acoustic record alone, one cannot determine empirically how many segments are present. And, one cannot produce natural-sounding synthesized speech via simple concatenation of segments. It is not just that the [æ, t, s] and other segments are coarticulated with nearby segments, however: it is also the case that there are simply no invariant acoustic-phonetic properties that uniquely characterize [t]. A [t] spoken by a child, by a man with a mouth full of chili, by a woman missing her front teeth, and so on may have radically different acoustic properties, yet we as scientists understand them to be in some sense identical phenomena.

This is a basic principle of scientific discovery: one must assume that “the vast multitude of phenomena he encounters may be accounted for in terms of the interactions of a fairly small number of basic entities, standard elementary individuals. His task thus becomes one of identifying the basic entities and describing the interactions in virtue of which the encountered phenomena are generated. From this emerge our…notions of the identity and nonidentity of phenomena.” (Hammarberg, p. 354) The linguistic notion of segment is perhaps the most important of these basic entities. It is an entity recognized by those early lay-linguists, the Iron Age scribes who gave us the alphabet, and it is one of the most venerable notions in the history of modern linguistics. Yet, segments do not have a physical reality of their own; they do not exist in the physical world, but only in the human mind. They are “internally generated, the creature of some kind of perceptual-cognitive process.”

It is generally uncontroversial to speak of the output of the phonological component as the input to the phonetic component. From this it follows that phonology is cognitively and epistemically prior to phonetics. Coarticulation, for instance, results from the process which maps segments—which, remember, exist only in the mind of speakers—onto articulatory and acoustic events. But one cannot talk about coarticulation without segments, since it is the spreading of articulatory-acoustic properties between segments that defines coarticulation. One must know that /s/ exists, and that it has inherent properties not normally associated with—or compatible with—lip rounding, to even observe the anticipatory lip rounding in words like soon.

The existence of coarticulation is often understood teleologically, in the sense that it is taken to be in part mechanical, automatic, inertial. This too is a mistake, according to Hammarberg: apparent teleological explanations of human behavior should be recast, as is the tradition in Western philosophy, as the result of intentional, causal behavior. The existence of anticipatory coarticulation shows us that the influence the /u/ in soon has on the realization of the preceding /s/ must have occurred some time before instructions to the articulators were generated, and the level at which this influence occurs should therefore be identified with the mental rather than the physiological. Hammarberg goes on to argue that coarticulatory processes are akin to ordinary allophony and should reside within the scope of phonological theory. This argument is strengthened insofar as coarticulation has a language-specific character, as is sometimes claimed.

Appelbaum, while not citing Hammarberg’s original paper, extends this critique to the theory of speech perception. It is an assumption of the so-called motor theory that there are invariant properties which identify “phonetic gestures”. Since the motor theorists do not present any evidence that such invariants so much as exist, these gestures must instead be abstract mental entities which have all the properties of—and which Appelbaum identifies with—what we are calling segments, or perhaps lower-level entities like phonological features. Under this approach, then, there is no content to the motor theory of speech perception beyond the obvious point that phonetic experience, somehow, turns into purely mental representations. Again, the empiricist biases of phonetics have led us astray.

The above discussion may influence the way we think about the role of phonetics in linguistics education. Phonetics is generally viewed as its own autonomous subdiscipline, and modern acoustic and articulatory analysis is certainly complex enough to justify serious graduate instruction, but this view of the phonetics-phonology relationship would seem to suggest that phonetic tools exist primarily as a way of gathering phonological evidence rather than as the instruments of an autonomous discipline. I am not sure I am ready to conclude that, but it certainly is provocative!

Endnotes

  1. Empiricism refers to a theory of epistemology and should not be confused with the empirical method in science (the use of sense-based observation). Many prominent thinkers reject empiricism in favor of rationalism, but support the use of empirical methods. No one is seriously arguing against the use of the senses.
  2. This will be shown to be yet another example of physics envy as the source of sloppy thinking in linguistics.

References

Appelbaum, I. 1996. The lack of invariance problem and the goal of speech perception. In Proceedings of the Fourth International Conference on Spoken Language Processing, pages 1541-1544.
Hammarberg, R. 1976. The metaphysics of coarticulation. Journal of Phonetics 4: 353-363.

On “from scratch”

For a variety of historical and sociocultural reasons, nearly all natural language processing (NLP) research involves processing of text, i.e., written documents (Gorman & Sproat 2022). Furthermore, most speech processing research uses written text either as input or output.

A great deal of speech and language processing treats words (however they are understood) as atomic, indivisible units rather than the “intricately structured objects linguists have long recognized them to be” (Gorman in press). But there has been a recent trend to instead work with individual Unicode codepoints, or even the individual bytes of a Unicode string encoded in UTF-8. When such systems are part of an “end-to-end” neural network, they are sometimes said to be “from scratch”; see, e.g., Gillick et al. 2016 and Li et al. 2019, who both use this exact phrase to describe their contributions. There is an implication that such systems, by bypassing the fraught notion of word, have somehow eliminated the need for linguistic insight altogether.

The expression “from scratch” makes an analogy to baking: it is as if we are making angel food cake by sifting flour, superfine sugar, and cream of tartar, rather than using the “just add water and egg whites” mixes from Betty Crocker. But this analogy understates just how much linguistic knowledge can be baked in (or perhaps “sifted in”) to writing systems. Writing systems are essentially a type of linguistic analysis (Sproat 2010), and like any language technology, they necessarily reify the analysis that underlies them.1 The linguistic analysis underlying a writing system may be quite naïve but may also encode sophisticated phonemic and/or morphemic insights. Thus written text, whether expressed as Unicode codepoints or UTF-8 bytes, may have quite a bit of linguistic knowledge sifted and folded in.

A familiar example of this kind of knowledge comes from English (Gorman in press). In this language, changes in vowel quality triggered by the addition of “level 1” suffixes like -ity are generally not indicated in written form. Thus sane [seɪn] and sanity [sæ.nɪ.ti], for instance, are spelled more similarly than they are pronounced (Chomsky and Halle 1968: 44f.), meaning that this vowel change need not be modeled when working with written text.

Endnotes

  1. The Sumerian and Egyptian scribes were thus history’s first linguists, and history’s first language technologists.

References

Chomsky, N., and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. 2016. Multilingual language processing from bytes. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1296-1306.
Gorman, K. In press. Computational morphology. In Aronoff, M. and Fudeman, K., What is Morphology? 3rd edition. Blackwell.
Gorman, K., and Sproat, R. 2022. The persistent conflation of writing and language. Paper presented at Grapholinguistics in the 21st Century.
Li, B., Zhang, Y., Sainath, T., Wu, Y., and Chan, W. 2019. Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5621-5625.
Sproat, R. 2010. Language, Technology, and Society. Oxford University Press.

A* shortest string decoding for non-idempotent semirings

I recently completed some work, in collaboration with Google’s Cyril Allauzen, on a new algorithm for computing the shortest string through a weighted finite-state automaton. For so-called path semirings, the shortest string is given by the shortest path, but up until now, there was no general-purpose algorithm for computing the shortest string over non-idempotent semirings (like the log or probability semiring). Such an algorithm would make it much easier to decode with interpolated language models or elaborate channel models in a noisy-channel formalism. In this preprint, we propose such an algorithm using A* search and lazy (“on-the-fly”) determinization, and prove that it is correct. The algorithm in question is implemented in my OpenGrm-BaumWelch library by the baumwelchdecode command-line tool.
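To see why this matters, here is a toy illustration (with made-up numbers) of the gap between shortest-path and shortest-string decoding in a non-idempotent semiring; this is not the A*-based algorithm itself, just the phenomenon it addresses.

```python
from collections import defaultdict

# In the probability semiring, the best *string* need not lie on the best
# *path*, because probability mass is summed over all paths sharing a string.
# Hypothetical automaton with three accepting paths:
paths = [("a", 0.4), ("b", 0.3), ("b", 0.3)]

# Shortest-path decoding picks the single most probable path: "a" (0.4).
best_path = max(paths, key=lambda p: p[1])

# Shortest-string decoding sums over paths per string: "b" (0.3 + 0.3 = 0.6).
totals = defaultdict(float)
for string, prob in paths:
    totals[string] += prob
best_string = max(totals, key=totals.get)

print(best_path)    # ('a', 0.4)
print(best_string)  # b
```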

WFST talk

I have posted a lightly-revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed description yet of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms: the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.
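For readers who haven’t used it, here is a minimal sketch (assuming a standard Pynini install) of the optimization routine discussed in the slides; optimize applies operations like epsilon-removal, determinization, and minimization where the semiring permits.

```python
import pynini

# Build a small acceptor over a handful of strings.
fsa = pynini.union("cat", "cats", "cart", "card")
print(fsa.num_states())  # number of states before optimization
fsa.optimize()
print(fsa.num_states())  # typically fewer states afterwards
```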

“I understood the assignment”

We do a lot of things downstream with the machine learning tools we build, but it is not always the case that a model can reasonably be said to have “understood the assignment”, in the sense that the classifier was trained to do exactly what we are making it do.

Take, for example, Yuan and Liberman (2011), who study the realization of word-final ing in American English. This varies between a dorsal variant [ɪŋ] and a coronal variant [ɪn].1 They refer to this phenomenon using the layman’s term g-dropping; I will use the notation (ing) to refer to all variants. They train Gaussian mixture models on this distinction, then enrich their pronunciation dictionary so that each word can be pronounced with or without g-dropping; it is as if the two variants are homographs. Then, they perform a conventional forced alignment; as a side effect, this determines which of the “homographs” was most likely used. This does seem to work, and is certainly very clever, but it strikes me as a mild abuse of the forced alignment technique, since the model was not so much trained to distinguish between the two variants as to produce a global joint model over audio and phoneme sequences.
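To make the trick concrete, here is a hypothetical sketch of the “pronunciation homograph” move: every (ing) word gets both a dorsal and a coronal dictionary entry, and the aligner implicitly chooses between them. The ARPABET-style toy dictionary below is my own invention, not Yuan and Liberman’s code.

```python
# Toy dictionary mapping words to phone lists.
base = {
    "walking": ["W", "AO1", "K", "IH0", "NG"],
    "running": ["R", "AH1", "N", "IH0", "NG"],
}

def with_g_dropping_variants(lexicon):
    """Adds a coronal [n] variant for every entry ending in a dorsal nasal."""
    expanded = {}
    for word, phones in lexicon.items():
        expanded[word] = [phones]
        if phones[-1] == "NG":
            expanded[word].append(phones[:-1] + ["N"])
    return expanded

# Each word now has two "homographic" pronunciations for the aligner to pick from.
print(with_g_dropping_variants(base))
```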

What would an approach to the g-dropping problem that better understood the assignment look like? One possibility would be to run ordinary forced alignment, with an ordinary dictionary, and then extract all instances of (ing). The alignment would, naturally, give us reasonably precise time boundaries for the relevant segments. These could then be submitted to a discriminative classifier (perhaps an LSTM) trained to distinguish the various forms of (ing). In this design, one can accurately say that the two components, aligner and classifier, understand the assignment. I expect that this would work quite a bit better than what Yuan and Liberman did, though that’s just conjecture at present.
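For concreteness, here is a rough sketch of the extraction step in that design; the alignment tuples and the classifier stub are hypothetical stand-ins (a real system would read TextGrids and use a trained acoustic classifier), not a working implementation.

```python
# Hypothetical phone-level alignment: (phone, start_sec, end_sec, word).
alignment = [
    ("W", 0.10, 0.18, "walking"),
    ("AO1", 0.18, 0.30, "walking"),
    ("K", 0.30, 0.36, "walking"),
    ("IH0", 0.36, 0.44, "walking"),
    ("NG", 0.44, 0.52, "walking"),
]

def extract_ing_intervals(alignment):
    """Returns (word, start, end) spans covering the final vowel + nasal of (ing) words."""
    spans = []
    for i, (phone, start, end, word) in enumerate(alignment):
        # Crude heuristic for illustration; a real system would check lemmas.
        if word.endswith("ing") and phone in ("N", "NG") and i > 0:
            prev_start = alignment[i - 1][1]  # include the preceding vowel
            spans.append((word, prev_start, end))
    return spans

def classify_variant(audio, start, end):
    # Placeholder: a trained discriminative classifier (e.g., an LSTM over
    # spectral features) would go here.
    raise NotImplementedError

print(extract_ing_intervals(alignment))  # [('walking', 0.36, 0.52)]
```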

Some recent work by my student Angie Waller (published as Waller and Gorman 2020) involved an ensemble of two classifiers, one of which more clearly understood the assignment than the other. The task here was to detect reviews of professors which are objectifying, in the sense that they make off-topic, usually positive, comments about the professors’ appearance. One classifier makes document-level classifications, and cannot be said to really understand the assignment. The other classifier attempts to detect “chunks” of objectifying text; if any such chunks are found, one can label the entire document as objectifying. While neither technique is particularly accurate (at the document level), the errors they make are largely uncorrelated, so an ensemble of the two obtains reasonably high precision, allowing us to track trends in hundreds of thousands of professor reviews over the last decade.
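For illustration only, here is one hypothetical way such an ensemble could be wired up for precision: flag a review only when the document-level classifier and the chunk-level detector agree. The stubs below are stand-ins I made up, not the classifiers from the paper.

```python
def document_classifier(review):
    # Stand-in for a document-level model (e.g., bag-of-words features).
    return "gorgeous" in review.lower()

def chunk_detector(review):
    # Stand-in for a span-level detector; returns the objectifying chunks found.
    return [tok for tok in review.split() if tok.lower() in {"gorgeous", "hot"}]

def is_objectifying(review):
    # Requiring agreement trades recall for precision, which helps when
    # tracking trends over very large numbers of reviews.
    return document_classifier(review) and bool(chunk_detector(review))

print(is_objectifying("Gorgeous lecturer, learned nothing."))  # True
print(is_objectifying("Great lectures, tough exams."))         # False
```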

Endnotes

  1. This doesn’t exhaust the logical possibilities of variation; for instance, for some speakers (including yours truly), there is a variant with a tense vowel followed by the coronal nasal.

References

Waller, A. and Gorman, K. 2020. Detecting objectifying language in online professor reviews. In Proceedings of the Sixth Workshop on Noisy User-Generated Text, pages 171-180.
Yuan, J. and Liberman, M. 2011. Automatic detection of “g-dropping” in American English using forced alignment. In IEEE Workshop on Automatic Speech Recognition & Understanding, pages 490-493.

Results of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion

The results of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion are now in, and are summarized in our task paper. A couple bullet points:

  • Unsurprisingly, the best systems all used some form of ensembling.
  • Many of the best teams performed self-training and/or data augmentation experiments, but most of these experiments were performance-negative except in simulated low-resource conditions. Maybe we’ll do a low-resource challenge in a future year.
  • LSTMs and transformers are roughly neck-and-neck; one strong submission used a variant of hard monotonic attention.
  • Many of the best teams used some kind of pre-processing romanization strategy for Korean, the language with the worst baseline accuracy. We speculate why this helps in the task paper.
  • There were some concerns about data quality for three languages (Bulgarian, Georgian, and Lithuanian). We know how to fix them and will do so this summer, if time allows. We may also “re-issue” the challenge data with these fixes.

What to do about the academic brain drain

The academy-to-industry brain drain is very real. What can we do about it?

Before I begin, let me confess my biases. I work in the research division of a large tech company (and I do not represent their views). Before that, I worked on grant-funded research in the academy. I work on speech and language technologies, and I’ll largely confine my comments to that area.

[Content warnings: organized labor, name-calling.]

Salary

Fact of the matter is, industry salaries are determined by a relatively efficient labor market. Academy salaries are compressed, with a relatively firm ceiling for all but a handful of “rock star” faculty. The vast majority of technical faculty are paid substantially less than they’d make if they just took the very next industry offer that came around. It’s even worse for research professors who depend on grant-based “salary support” in a time of unprecedented “austerity”—they can find themselves functionally unemployed any time a pack of incurious morons ends up in the White House (as seems to happen every eight years or so).

The solution here is political. Fund the damn NIH and NSF. Double—no, triple—their funding. Pay for it by taxing corporations and the rich, or, better yet, divert some money from the Giant Death Machines fund. Make grant support contractual, so PIs with a five-year grant are guaranteed five years of salary support and a chance to realize their vision. Insist on transparency and consistency in “indirect costs” (i.e., overhead) for grants to drain the bureaucratic swamp (more on that below). Resist the casualization of labor at universities, and do so at every level. Unionize every employee at every American university. Aggressively lobby Democratic presidential candidates to agree to appoint a National Labor Relations Board that will continue to recognize graduate students’ right to unionize.

Administration & bureaucracy

Industry has bureaucratic hurdles, of course, but they’re in no way comparable to the profound dysfunction taken for granted in the academic bureaucracy. If you or anyone you love has ever written a scientific grant, you know what I mean; if not, find a colleague who has and politely ask them to tell you their story. At the same time American universities are cutting their labor costs through casualization, they are massively increasing their administrative costs. You will not be surprised to find that this does not produce better scientific outcomes, or make it easier to submit a grant. This is a case of what Noam Chomsky has described as the “neoliberal confidence trick”. It goes a little something like this:

  1. Appoint/anoint all-powerful administrators/bureaucrats, selecting for maximal incompetence.
  2. Permit them to fail.
  3. Either GOTO #1, or use this to justify cutting investment in whatever was being administered in the first place.

I do not see any way out of this situation except class consciousness and labor organizing. Academic researchers must start seeing the administration as potentially hostile to their interests, and refuse to identify with (or, quelle horreur, to join) the managerial classes.

Computing power & data

The big companies have more computers than universities. But in my area, speech and language technology, nearly everything worth doing can still be done with a commodity cluster (like you’d find in the average American CS department) or a powerful desktop with a big GPU. And of those, the majority can still be done on a cheap laptop. (Unless, of course, you’re one of those deep learning eliminationist true believers, in which case, reconsider.) Quite a bit of great speech & language research—in particular, work on machine translation—has come from collaborations between the Giant Death Machines funding agencies (like DARPA) and academics, with the former usually footing the bill for computing and data (usually bought from the Linguistic Data Consortium (LDC), itself essentially a collaboration between the military-industrial complex and the Ivy League). In speech recognition, there are hundreds of hours of transcribed speech in the public domain, and hundreds more can be obtained with an LDC contract paid for by your funders. In natural language processing, it is by now almost gauche for published research to make use of proprietary data, possibly excepting the venerable Penn Treebank.

I feel the data-and-computing issue is largely a myth. I do not know where it got started, though maybe it’s this bizarre press-release-masquerading-as-an-article (and note that’s actually about leaving one megacorp for another).

Talent & culture

Movements between academy & industry have historically been cyclic. World War II and the military-industrial-consumer boom that followed siphoned off a lot of academic talent. In speech & language technologies, the Bell breakup and the resulting fragmentation of Bell Labs pushed talent back to the academy in the 1980s and 1990s; the balance began to shift back to Silicon Valley about a decade ago.

There’s something to be said for “game knows game”—i.e., the talented want to work with the talented. And there’s a more general factor—large industrial organizations engage in careful “cultural design” to keep talent happy in ways that go beyond compensation and fringe benefits. (For instance, see Fergus Henderson’s description of engineering practices at Google.) But I think it’s important to understand this as a symptom of the problem, a lagging indicator, and as part of an unpredictable cycle, not as something to optimize for.

Closing thoughts

I’m a firm believer in “you do you”. But I do have one bit of specific advice for scientists in academia: don’t pay so much damn attention to Silicon Valley. Now, if you’re training students—and you’re doing it with the full knowledge that few of them will ever be able to work in the academy, as you should—you should educate yourself and your students to prepare for this reality. Set up a little industrial advisory board, coordinate interview training, talk with hiring managers, adopt industrial engineering practices. But, do not let Silicon Valley dictate your research program. Do not let Silicon Valley tell you how many GPUs you need, or that you need GPUs at all. Do not believe the hype. Remember always that what works for a few-dozen crypto-feudo-fascisto-libertario-utopio-futurist billionaires from California may not work for you. Please, let the academy once again be a refuge from neoliberalism, capitalism, imperialism, and war. America has never needed you more than we do right now.

If you enjoyed this, you might enjoy my paper, with Richard Sproat, on an important NLP task that neural nets are really bad at.