An unlikely series of confusions and mishaps in my early career resulted in my brief involvement with the forced alignment-industrial complex. I’m grateful to people like the excellent Michael Wagner who supported my work on this topic, but I’m glad to see other people who have a deeper interest in acoustic phonetics methodology (like the also-excellent Michael McAuliffe) take over the enterprise. Phonetics has not been a major research interest for me for some time now.

I published a single two-page paper (Gorman et al. 2011) on forced alignment in 2011, and somehow, it’s my most cited work. Perhaps for that reason, I receive a lot of requests to review work involving forced aligners. Two frustrating “tricks” are extremely common in this literature.

The first involves manipulating the phonetic dictionary as a way to auto-code (socio)phonetic variables; I’ll refer to this method as dictionary hacking. For instance, Yuan & Liberman (2011) studied American English ‘g-dropping’ using this method. For each word ending in ing [ɪŋ] they add a competing pronunciation variant [ɪn]. As a result, the final phonemic alignment contains information about whether the overall model took each ing rendition to be [ɪŋ] or [ɪn]. This sort of works (neat!), but I don’t think it’s a particularly good way to auto-code. First, good HMM acoustic models represent phones (or diphones, or triphones) using mixtures of multivariate Gaussians (GMMs), and such models are capable of representing phonetically disparate renditions as instances of the same mixture; they don’t really reflect linguists’ intuitions about allophony. Second, and specific to the Yuan & Liberman approach, they ignore a third possibility for this variable: [in], a tense high front vowel with an apical nasal. In my “style a” (i.e., low attention paid to my speech), this variant alternates with lax [ɪn]; I rarely produce [ɪŋ]. Dealing with different variants is hard using the dictionary-hacking method.

There is of course a simple solution here. You use the forced aligner as is to find reasonably good timestamps of the relevant intervals, you extract acoustic features from those intervals, and you feed them into a discriminative supervised machine learning system trained on a small amount of labeled data. (In some cases, relevant corpora already have sufficiently detailed phonetic transcriptions so no additional labeling is necessary.)
Done right, this will produce strictly better (more human-annotator-like) results than dictionary hacking: discriminative models optimized specifically for the coding task at hand, provided with appropriate acoustic features, will be more accurate than the forced aligner’s generative HMM-GMM system optimized for an objective only distantly related to the question at hand.
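
To make the last step of that recipe concrete, here is a minimal sketch in Python, using only the standard library. Everything in it is an assumption made for illustration: the function names are hypothetical, the two feature dimensions stand in for whatever z-scored acoustic measurements you extract from each aligned interval, and the six hand-labeled tokens are toy numbers, not real data. It uses plain logistic regression fit by stochastic gradient descent, which is one discriminative model among many, not the method of any particular paper.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, epochs=500, rate=0.1, seed=0):
    """Fits binary logistic regression by stochastic gradient descent.

    features: list of equal-length tuples of floats, one per token.
    labels: list of 0/1 codes supplied by a human annotator.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = bias + sum(w * x for w, x in zip(weights, features[i]))
            err = sigmoid(z) - labels[i]  # gradient of the log loss w.r.t. z
            weights = [w - rate * err * x for w, x in zip(weights, features[i])]
            bias -= rate * err
    return weights, bias


def predict(weights, bias, feature_vector):
    """Returns the more probable code (0 or 1) for one token."""
    z = bias + sum(w * x for w, x in zip(weights, feature_vector))
    return 1 if z > 0 else 0


# Toy, made-up training set: two hypothetical z-scored measurements per token.
# Code 0 stands for [ɪn] and code 1 for [ɪŋ]; the numbers are invented.
train_x = [(-1.2, 0.8), (-0.9, 1.1), (-1.0, 0.5),
           (1.1, -0.7), (0.8, -1.0), (1.3, -0.4)]
train_y = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic(train_x, train_y)

# Code two new (equally made-up) tokens.
new_tokens = [(1.0, -0.8), (-1.1, 0.9)]
codes = [predict(weights, bias, token) for token in new_tokens]
```

In practice you would swap the toy tuples for features measured on the aligner's intervals, hold out some labeled tokens to estimate accuracy, and probably reach for an off-the-shelf library instead of hand-rolled gradient descent. Handling a third variant like [in] only requires a multi-class model, with no changes to the dictionary.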

The second trick involves using (mono-, di-, or tri-)phone GMMs from different dialects or languages to auto-code. I’ll refer to this as phone hacking. For example, if one has a Montreal French acoustic model and an American English acoustic model, one can use the forced aligner to determine whether a rendition of Scottish English r is more like the Montreal French or American English r. Milne (2011, 2014), for example, describes some early work of this type. Once again, this sort of works (jeepers!) but it has all the same sorts of problems, problems which could be fixed by once again using the forced aligner for approximate timing information (it’s reasonably good at that), extracting phonetic features from the relevant intervals, and then feeding them into discriminative models optimized to code whatever variants of r you’re interested in. There’s no excuse, really, for using Montreal French acoustic models on your Scottish data.
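
The timing step is the easy part. As a rough sketch, assume the aligner's phone tier has already been read into (label, start, end) triples in seconds. The `Interval` type, the `target_windows` function, the ARPABET-style `R` label, and the toy timestamps below are all hypothetical, and no particular aligner's file format or API is implied. The function picks out the intervals for the phone of interest and pads them slightly, since aligner boundaries are only approximate.

```python
from typing import List, NamedTuple, Set, Tuple


class Interval(NamedTuple):
    """One aligned phone: its label and start/end times in seconds."""
    label: str
    start: float
    end: float


def target_windows(
    phones: List[Interval], targets: Set[str], pad: float = 0.0
) -> List[Tuple[str, float, float]]:
    """Returns padded (label, start, end) windows for the target phones.

    Windows are clipped so they never extend past the first phone's start
    or the last phone's end.
    """
    if not phones:
        return []
    lo = phones[0].start
    hi = phones[-1].end
    windows = []
    for phone in phones:
        if phone.label in targets:
            start = max(lo, phone.start - pad)
            end = min(hi, phone.end + pad)
            windows.append((phone.label, start, end))
    return windows


# Toy, made-up alignment of a single word with two r tokens.
toy_alignment = [
    Interval("R", 0.10, 0.18),
    Interval("EH1", 0.18, 0.31),
    Interval("R", 0.31, 0.40),
]

# Pad each r interval by 20 ms on either side before measuring anything.
r_windows = target_windows(toy_alignment, {"R"}, pad=0.02)
```

Each resulting window is where you would take your acoustic measurements, and those measurements, paired with a small set of hand-coded tokens, are what the discriminative model is trained on.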

In my opinion, dictionary hacking and phone hacking are unnecessarily lazy, sloppy solutions to coding problems that aren’t really all that hard in the first place, and I tell the editors as much when asked to review papers using these techniques. The discriminative approach is not only relatively easy for a computationally sophisticated phonetician, but was almost as easy a full two decades ago. Since I don’t really work in this area anymore, I don’t know if there’s a library for discriminative auto-coding as well-designed or well-documented as the Montreal Forced Aligner, but if not, something like this is greatly needed.

References

Gorman, K., Howell, J. and Wagner, M. 2011. Prosodylab-Aligner: A tool for forced alignment of laboratory speech. Journal of the Canadian Acoustical Association 39(3): 192-193.
Milne, P. 2011. The effects of syllable position on allophonic variation in Québec French /ʀ/: A corpus analysis using a modified version of the Penn Phonetics Lab Forced Aligner. Paper presented at NWAV 40.
Milne, P. 2014. The variable pronunciations of word-final consonant clusters in a force aligned corpus of spoken French. Doctoral dissertation, University of Ottawa.
Yuan, J., and Liberman, M. 2011. Automatic detection of “g-dropping” in American English using forced alignment. In 2011 IEEE workshop on automatic speech recognition & understanding, pages 490-493.
