re.compile is otiose

Unlike its cousins Perl and Ruby, Python has no literal syntax for regular expressions. Whereas one can express the sheep language /baa+/ with a simple forward-slashed literal in Perl and Ruby, in Python one has to compile the expression using the function re.compile, which produces an object of type re.Pattern. Such objects have various methods for string matching.

sheep = re.compile(r"baa+")
assert sheep.match("baaaaaaaa")

Except, one doesn’t actually have to compile regular expressions at all, as the documentation explains:

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

What this means is that in the vast majority of cases, re.compile is otiose (i.e., unnecessary). One can just define expression strings, and pass them to the equivalent module-level functions rather than using the methods of re.Pattern objects.

sheep = r"baa+"
assert re.match(sheep, "baaaaaaaa")

This, I would argue, is slightly easier to read, and certainly no slower. It also makes typing a bit more convenient since str is easier to type than re.Pattern.
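
For instance, a function which takes a pattern can simply be annotated with str (a minimal sketch; the function and its names are hypothetical, not from any library):

import re

def count_matches(pattern: str, lines: list[str]) -> int:
    # Counts how many of the lines match the given pattern string.
    return sum(1 for line in lines if re.match(pattern, line))

assert count_matches(r"baa+", ["baa", "baaaa", "moo"]) == 2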

Now, I am sure there is some usage pattern which would favor explicit re.compile, but I have not encountered one in code worth profiling.

Linguistics’ contribution to speech & language processing

How does linguistics contribute to speech & language processing? While there exist some “linguist eliminationists”, who wish to process speech audio or text “from scratch” without intermediate linguistic representations, it is generally recognized that linguistic representations are the end goal of many processing “tasks”. Of course some tasks involve poorly-defined, or ill-posed, end-state representations—the detection of hate speech and named entities, neither of which is particularly well-defined, linguistically or otherwise, comes to mind—but these are driven by the apparent business value to be extracted rather than by any serious goal of understanding speech or text.

The standard example for this kind of argument is syntax. It might be the case that syntactic representations are not as useful for textual understanding as was anticipated, and useful features for downstream machine learning can apparently be induced using far simpler approaches, like the masked language modeling task used for pre-training in many neural models. But it’s not as if a terrorist cell of rogue linguists locked NLP researchers in their offices until they developed the field of natural language parsing. NLP researchers decided, of their own volition, to spend the last thirty years building models which could recover natural language syntax, and ultimately got pretty good at it, probably to the point where, I suspect, the remaining unresolved ambiguities mostly hinge on world knowledge that is rarely if ever made explicit.

Let us consider another example, less widely discussed: the phoneme. The phoneme was discovered in the late 19th century by Baudouin de Courtenay and Kruszewski. It has been around a very long time. In the century and a half since it emerged from the Polish academy, Poland itself has been a congress, a kingdom, a military dictatorship, and a republic (three times), and annexed by the Russian empire, the German Reich, and the Soviet Union. The phoneme is probably here to stay. The phoneme is, by any reasonable account, one of the most successful scientific abstractions in the history of science.

It is no surprise, then, that the phoneme plays a major role in speech technologies. Not only did the first speech recognizers and synthesizers make explicit use of phonemic representations (as well as notions like allophones), so did the next five decades’ worth of recognizers and synthesizers. Conventional recognizers and synthesizers require large pronunciation lexicons mapping between orthographic and phonemic form, and as they get closer to speech, convert these “context-independent” representations of phonemic sequences into “context-dependent” representations which can account for allophony and local coarticulation, exactly as any linguist would expect. It is only in the last few years that it has even become possible to build a reasonably effective recognizer or synthesizer which doesn’t have an explicit phonemic level of representation. Such models instead use clever tricks and enormous amounts of data to induce implicit phonemic representations. We have every reason to suspect these implicit representations are quite similar to the explicit ones linguists would posit. For one, these implicit representations are keyed to orthographic characters, and as I wrote a month ago, “the linguistic analysis underlying a writing system may be quite naïve but may also encode sophisticated phonemic and/or morphemic insights.” If anything, that’s too weak: in most writing systems I’m aware of, the writing system is either a precise phonemic analysis (possibly omitting a few details of low functional load, or using digraphs to get around limitations of the alphabet of choice) or a precise morphophonemic analysis (ditto). For Sapir (1925 et seq.) this was key evidence for the existence of phonemes! So whether or not implicit “phonemes” are better than explicit ones, speech technologists have converged on the same rational, mentalistic notions discovered by Polish linguists a century and a half ago.

So it is surprising to me that even those schooled in the art of speech processing view the contribution of linguistics to the field in a somewhat negative light. For instance, Paul Taylor, the founder of the TTS firm Phonetic Arts, published a Cambridge University Press textbook on TTS methods in 2009, and while it’s by now quite out of date, there’s no more-recent work of comparable breadth. Taylor spends the first five hundred (!) pages or so talking about linguistic phenomena like phonemes, allophones, prosodic phrases, and pitch accents—at the time, the state of the art in synthesis made use of explicit phonological representations—so it is genuinely a shock to me that Taylor chose to close the book with a chapter (Taylor 2009: ch. 18) about the irrelevance of linguistics. Here are a few choice quotes, with my commentary.

It is widely acknowledged that researchers in the field of speech technology and linguistics do not in general work together. (p. 533)

It may be “acknowledged”, but I don’t think it has ever been true. The number of linguists and linguistically-trained engineers working on FAANG speech products every day is huge. (Modern corporate “AI” is to a great degree just other people, mostly contractors in the Global South.) Taylor continues:

The first stated reason for this gap is the “aeroplanes don’t flap their wings” argument. The implication of this statement is that, even if we had a complete knowledge of how human language worked, it would not help us greatly because we are trying to develop these processes in machines, which have a fundamentally different architecture. (p. 533)

I do not expect that linguistics will provide deep insights about how to build TTS systems, but it clearly identified the relevant representational units for building such systems many decades ahead of time, just as mechanics provided the basis for mechanical engineering. This was true of Kempelen’s speaking machine (which predates phonemic theory, and so had to discover something like it) and Dudley’s Voder, as well as speech synthesizers in the digital age. So I guess I kind of think that speech synthesizers do flap their wings: parametric, unit selection, hybrid, and neural synthesizers are all big fat phoneme-realization machines. As is standard practice in the physical sciences, the simple elementary particles of phonological theory—phonemes, and perhaps features—were discovered quite early on, but the study of their ontology has taken up the intervening decades. And unlike the physical sciences, we cognitive scientists must some day also understand their epistemology (what Chomsky calls “Plato’s problem”) and ultimately, their evolutionary history (“Darwin’s problem”) too. Taylor, as an engineer, need not worry himself about these further studies, but I think he is being wildly uncharitable about the nature of what he’s studying, or about the business value of having a well-defined hypothesis space of representations for his team to engineer within.

Taylor’s argument wouldn’t be complete without a caricature of the generative enterprise:

The most-famous camp of all is the Chomskian [sic] camp, started of course by Noam Chomsky, which advocates a very particular approach. Here data are not used in any explicit sense, quantitative experiments are not performed and little stress is put on explicit description of the theories advocated. (p. 534)

This is nonsense. Linguistic examples are data, in some cases better data than results from corpora or behavioral studies, as the work of Sprouse and colleagues has shown. No era of generativism was actively hostile to behavioral results; as early as the ’60s, generativist-aligned psycholinguists were experimentally testing the derivational theory of complexity and studying morphological decomposition in the lab. And I simply have never found that generativist theorizing lacks for formal explicitness; in phonology, for instance, the major alternatives to generativist thinking are exemplar theory—which isn’t even explicit enough to be wrong—and a sort of neo-connectionism—which ought not to work at all given extensive proof-theoretic studies of formal learnability and the formal properties of stochastic gradient descent and backpropagation. Taylor goes on to suggest that the “curse of dimensionality” and issues of generalizability prevent the application of linguistic theory. Once again, though, the things we’re trying to represent are linguistic notions: machine learning using “features” or “phonemes”, explicit or implicit, is still linguistics.

Taylor concludes with some future predictions about how he hopes TTS research will evolve. His first is that textual analysis techniques from NLP will become increasingly important. Here the future has been kind to him: they are, but as the work of Sproat and colleagues has shown, we remain quite dependent on linguistic expertise—of a rather different and less abstract sort than the notion of the phoneme—to develop these systems.

References

Sapir, E. 1925. Sound patterns in language. Language 1:37-51.
Taylor, P. 2009. Text-to-Speech Synthesis. Cambridge University Press.

“Python” is a proper name

In just the last few days I’ve seen a half dozen instances of the phrase python package or python script in published academic work. It’s disappointing to me that this got by the reviewers, action editors, and copy editors, since Python is obviously a proper name and should be in titlecase. (The fact that the interpreter command is python is irrelevant.)

Markdown isn’t good enough to replace LaTeX

I am generally sympathetic with calls to replace LaTeX with something else. LaTeX has terrible defaults, Unicode and font support is a constant problem, the syntax is deliberately obfuscatory, and actual generation is painfully slow (probably because the whole thing is a big pasta factory of interpreted code instead of a single static library).

But at the same time, I don’t think Markdown is really good enough to replace LaTeX. Of course one can use Pandoc to generate LaTeX from Markdown notes, and its output is often a decent thing to copy and paste into your LaTeX document. But Markdown just doesn’t solve any of the issues I mention, except making the syntax a tad more WYSIWYG than it would be otherwise. And Markdown is quite a bit worse at one thing: the extended syntax for tables is very hard to key in and still much less expressive than LaTeX’s actually pretty rational tabular environment.

Python hasn’t changed much

Since successfully sticking the landing for the migration from Python 2 (circa 3.6 or so), Python has been on a tear with a large number of small releases. These releases have cleaned up some warts in the “batteries included” modules and made huge improvements to the performance of the parser and run-time. A few minor language features have also been added: for instance, f-strings (which I like a lot) and the so-called walrus operator, which is mostly used for regular expression matching.
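
To illustrate those two features (a minimal sketch; the pattern and variable names are my own), the walrus operator lets one bind and test a match object in a single expression, and an f-string formats the result:

import re

line = "baaaaaaaa"
# Bind the match object and test it in one expression (Python 3.8+).
if (sheep := re.match(r"baa+", line)):
    # f-strings interpolate expressions directly into the string (3.6+).
    print(f"{line!r} matched: {sheep.group()!r}")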

When Python improvements (and they are improvements, IMO) are discussed on sites like Hacker News, there is a lot of fear and trepidation. I am not sure why. These are rather minor changes, and they will take years to diffuse through the Python community. Overall, very little has changed.

Noam on neural networks

I just crashed a Zoom conference in which Noam Chomsky was the discussant. (What I have to say will be heavily paraphrased: I wasn’t taking notes.) One back-and-forth stuck with me. Someone asked Noam what people interested in language and cognition ought to study, other than linguistics itself. He mentioned various biological systems, and said, however, that they probably shouldn’t bother to study neural networks, since these have very little in common with intelligent biological systems (despite their branding as “neural” and “brain-inspired”). He stated that he is grateful for Zoom closed captions (he has some hearing loss), but that one should not conflate that with language understanding. He said, similarly, that he’s grateful for snow plows, but one shouldn’t confuse such a useful technology with theories of the physical world.

For myself, I think they’re not uninteresting devices, and that linguists are uniquely situated to evaluate them—adversarially, I hope—as models of language. I also think they can be viewed as powerful black boxes for studying the limits of domain-general pattern learning. Sometimes we want to ask whether certain linguistic information is actually present in the input, and some of my work (e.g., Gorman et al. 2019) looks at that in some detail. But I do share some intuition that they are not likely to greatly expand our understanding of human language overall.

References

Gorman, K., McCarthy, A. D., Cotterell, R., Vylomova, E., Silfverberg, M., and Markowska, M. 2019. Weird inflects but OK: making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 140-151.

Lambda lifting in Python

Python really should have a way to lambda-lift a value e to a no-argument callable which returns e. Let us suppose that our e is denoted by the variable alpha. One can approximate such a lifting by declaring alpha_fnc = lambda: alpha. Python lambdas are slow compared to true currying functionality, like that provided by functools.partial and the functions of the operator library, but this basically works. The problem, however, is that lambda expressions in Python, unlike in, say, C++11, provide no way to capture the current value of a variable in the enclosing scope: free variables are looked up when the lambda is called, not when it is defined, so lambdas which refer to outer variables are context-dependent. The following interactive session illustrates the problem.

In [1]: alpha_fnc = lambda: alpha

In [2]: alpha_fnc()
------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [2], in <module>()
----> 1 alpha_fnc()

Input In [1], in <lambda>()
----> 1 alpha_fnc = lambda: alpha

NameError: name 'alpha' is not defined

In [3]: alpha = .5

In [4]: alpha_fnc()
Out[4]: 0.5

In [5]: alpha = .4

In [6]: alpha_fnc()
Out[6]: 0.4
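
For what it’s worth, the closest approximation to capture-by-value I know of is the (somewhat ugly) default-argument idiom sketched below; this is just an illustration, not a real fix for the missing feature. Default values are evaluated once, at definition time, so later reassignment of alpha no longer leaks in.

# A sketch of the default-argument idiom: the default value is evaluated
# when the lambda is defined, so the resulting callable is no longer
# context-dependent.
alpha = 0.5
alpha_fnc = lambda alpha=alpha: alpha

alpha = 0.4
assert alpha_fnc() == 0.5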

A* shortest string decoding for non-idempotent semirings

I recently completed some work, in collaboration with Google’s Cyril Allauzen, on a new algorithm for computing the shortest string through a weighted finite-state automaton. For so-called path semirings, the shortest string is given by the shortest path, but up until now, there was no general-purpose algorithm for computing the shortest string over non-idempotent semirings (like the log or probability semiring). Such an algorithm would make it much easier to decode with interpolated language models or elaborate channel models in a noisy-channel formalism. In this preprint, we propose such an algorithm using A* search and lazy (“on-the-fly”) determinization, and prove that it is correct. The algorithm in question is implemented in my OpenGrm-BaumWelch library by the baumwelchdecode command-line tool.

Please don’t send .docx or .xlsx files

.docx and .xlsx can only be read on a small subset of devices and only after purchasing a license. It is frankly a bit rude to expect everyone to have such licenses in 2022 given the proliferation of superior, and free, alternatives. If the document is static, read-only content, convert it to a PDF. If it’s something you want me to edit or comment on, or which will be changing with time, send me the document via Microsoft 365 or the equivalent Google offerings. Or a Git repo. Sorry to be grumpy but everyone should know this by now. If you’re still emailing these around, please stop.

WFST talk

I have posted a lightly-revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed description yet of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms: the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.
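
For readers who haven’t used Pynini, here is roughly what invoking that optimization looks like (a minimal sketch of my own; the example automaton is not from the talk):

import pynini

# An acceptor for the sheep language /baa+/, built from string acceptors by
# concatenation and closure.
sheep = pynini.accep("ba") + pynini.closure(pynini.accep("a"), 1)
# Optimize in place: roughly, epsilon-removal, determinization, and
# minimization, where appropriate.
sheep.optimize()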