Pynini 2020: State of the Sandwich

I have been meaning to describe some of the work I have been doing on Pynini, our weighted finite-state grammar development platform. For one, while I have been the primary contributor through the history of the project (Richard Sproat wrote the excellent path iteration library), we are now also getting many contributions from Lawrence Wolf-Sonkin (rewrite of the symbol table wrapper, type hints) and lots of usability and bug reports from the Google linguists.

We are currently on Pynini release 2.1.1. Here are some new features/improvements from the last few releases:

  • 2.0.9: Adds an efficient multi-argument union.
  • 2.0.9: Pynini (and the rest of OpenGrm) are available on Conda via Conda-Forge. This means that for most users, there is no longer any need to compile Pynini by hand; instead Pynini is compiled (for a variety of platforms) in the cloud, using a continuous integration framework.
  • 2.1.0: Rewrites the string compiler so that symbol tables are no longer attached to compiled FSTs, eliminating the need for expensive symbol table merging and relabeling options.
  • 2.1.0: Rewrites the FST and symbol table class hierarchies to better reflect the organization of lower-level APIs.
  • 2.1.1: Adds PEP 484/PEP 561-compatible type stubs.

We also have removed or renamed quite a few features:

  • stringify is renamed string.
  • text is renamed print (cf. the command-line tool fstprint).
  • The defaults struct is removed, though it may be reintroduced as a context manager at some point.
  • The * infix operator, previously used for composition, is removed; use @ instead.
  • transducer’s arguments input_token_type and output_token_type are merged as token_type.

Finally, we have broken Python 2.7 compatibility as of 2.1.0; pywrapfst, the lower-level API, still has some degree of Python 2.7 compatibility, but this is probably the last release to maintain that property.

Idealizations gone wild

Generative grammar and information theory are products of the US post-war defense science funding boom, and it is no surprise that the former attempted to incorporate insights from the latter. Many early ideas in generative phonology—segment structure and morpheme structure rules and constraints (Stanley 1967), the notion of the evaluation metric (Aspects, §6), early debates on opacities, conspiracies, and the alternation condition—are clearly influenced by information theory. It is interesting to note that as early as 1975, Morris Halle regarded his substantial efforts in this area to have been a failure.

In the 1950’s I spent considerable time and energy on attempts to apply concepts of information theory to phonology. In retrospect, these efforts appear to me to have come to naught. For instance, my elaborate computations of the information content in bits of the different phonemes of Russian (Cherry, Halle & Jakobson 1953) have been, as far as I know, of absolutely no use to anyone working on problems in linguistics. And today the same negative conclusion appears to me to be warranted about all my other efforts to make use of information theory in linguistics. (Halle 1975: 532)

Thus, the mania for information theory in early generative grammar was exactly the sort of bandwagon effect that Claude Shannon, the inventor of information theory, warned about decades earlier.

In the first place, workers in other fields should realize that the basic results of the subject are aimed at a very specific direction, a direction that is not necessarily relevant to such fields as psychology, economics, and other social sciences. (Shannon 1956)

Today, however, information theory is not exactly in disrepute in linguistics. First off, perplexity, a metric derived from information theory, is used as an intrinsic metric in certain natural language processing tasks, particularly language modeling.[1] Secondly, there have been attempts to revive information theory notions as an explanatory factor in the study of phonology (e.g., Goldsmith & Riggle 2012) and human morphological processing (e.g., Moscoso del Prado Martín et al. 2004). And recently, Mollica & Piantadosi (2019; henceforth M&P) dare to use information theory to measure the size of the grammar of English.

M&P’s program is fundamentally one of idealization. Now, I don’t have any problem per se with idealization. Idealization is an important part of the epistemic process in science, one without which there can be no scientific observation at all. Critics of idealizations (and of idealization itself) are usually concerned with the things an idealization abstracts away from; for instance, critics of Chomsky’s famous “ideal speaker-listener” (Aspects, p. 3f) note correctly that it ignores bilingual interference, working memory limitations, and random errors. But idealizations are not merely the infinitude of variables they choose to ignore (and when the object of study is an enormously complex polysensory, multifactorial system like the human capacity for language, one is simply not going to be able to study the entire system all at once); they are just as much defined by the factors they foreground, the affordances they create, and the constraints they impose on scientific inquiry.

In this case, an information theoretic characterization of grammars constrains us to conceive of our knowledge of language in terms of probability distributions. This is a step I am often uncomfortable with. It is, for example, certainly possible to conceive of speakers’ lexical knowledge as a sort of probability distribution over lexical items, but I am not sure that P(word) has much grammatical work to do except act as a model of the readily apparent observation that more frequent words can be recalled and recognized more rapidly than rare words. To be sure, studies like the aforementioned one by Moscoso del Prado Martín et al. attempt to connect information theoretic characterizations of the lexicon to behavioral results, but these studies are correlational and provide little in the way of mechanistic-causal explanation.

However, for the sake of argument, let us assume that the probabilistic characterization of grammatical knowledge is coherent. Why then should it be undertaken? M&P claim that the measurements it allows (grammar sizes, measured in bits) bear on a familiar debate. As they frame it:

…is the amount of information about language that is learned substantial (empiricism) or minimal (nativism)?

I don’t accept the terms of this debate. While I consider myself a nativist, I have formed no opinions about how many bits it takes to represent the grammar of English, which is by all accounts a rather complex object. The tradeoff between what is to be learned and what is innate is something that has been given extensive consideration in the nativist literature. Nativists recognize that the less there is to be learned, the more that has to have evolved in the rather short amount of time (in evolutionary terms) since we humans split off from our language-lacking primate cousins. But this tradeoff is strictly qualitative; were it possible to satisfactorily measure both evolutionary plausibility and grammar size, they would still be incommensurate quantities.

M&P proceed by computing the number of bits for various linguistic subsystems. They compute the information associated with phonemes (really, the acoustic cues to various features), the phonemic representation of wordforms, lexical semantics (mappings from words to meanings, here represented as a vector space as is the fashion), word frequency, and finally syntax. For each of these they provide lower bounds and upper bounds, though the upper bounds are in some cases constructed by adding an ad-hoc factor-of-two error to the lower bound. Finally, they sum these quantities, giving an estimate of roughly 1.5 megabytes. This M&P consider to be substantial. It is not at all clear why they feel this is the case, or how small a grammar would have to be to be “minimal”.

There is a lot to complain about in the details of M&P’s operationalizations. First, I am not certain that the systems they have identified are well-defined modules that would be recognizable to working linguists; for instance their phonemes module has next to nothing to do with my conception of phonological grammar. Secondly, it seems to me that by summing the bits needed to characterize each module, they are assuming a sort of “feed-forward”, non-interactive relationship between these components, and it is not clear that this is correct; for example, there are well-understood lexico-semantic constraints on verbs’ argument structure.

While I do not wish to go too far afield, it may be useful to consider in more detail their operationalization of syntax. For this module, they use a corpus of textbook example sentences, then compute the number of possible unlabeled binary branching trees that would cover each example. (For a sentence of n words, this quantity is the (n − 1)th Catalan number.) To turn this into a probability, they assume that one correct parse has been sampled from a uniform distribution over all possible binary trees for the given sentence. First, this assumption of uniformity is completely unmotivated. Secondly, since they assume there’s exactly one possible bracketing, and do not provide labels to non-terminals, they have no way of representing the ambiguity of sentences like Call John an ambulance. (Thanks to Brooke Larson for suggesting this example.) Anyone familiar with syntax will have no problem finding gaping faults with this operationalization.[2]
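To make the arithmetic concrete, here is a minimal sketch of the tree-counting step as I have described it. It is not M&P's code, and the function names are my own, chosen for illustration. It relies on the standard result that the number of unlabeled binary-branching trees over n leaves is the (n − 1)th Catalan number, and on the uniformity assumption under which selecting one tree out of T costs log2(T) bits.

```python
from math import comb, log2


def catalan(k: int) -> int:
    """Return the k-th Catalan number, C_k = binom(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)


def binary_bracketings(n_words: int) -> int:
    """Count the unlabeled binary-branching trees over n_words leaves."""
    if n_words < 1:
        raise ValueError("a sentence needs at least one word")
    return catalan(n_words - 1)


def bits_for_parse(n_words: int) -> float:
    """Bits needed to select one tree, assuming a uniform distribution
    over all binary bracketings (the assumption criticized above)."""
    return log2(binary_bracketings(n_words))


sentence = "Call John an ambulance".split()
print(binary_bracketings(len(sentence)))        # 5
print(round(bits_for_parse(len(sentence)), 2))  # 2.32
```

Under these assumptions the four-word example has five possible bracketings, so its parse is worth about 2.32 bits. Nothing in the count distinguishes the two readings of the sentence, since the trees carry no labels.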

M&P justify all this hastiness by comparing their work to the informal estimation approach known as a Fermi problem (they call them “Fermi calculations”). In the original framing, the quantity being estimated is the product of many terms, so assuming errors in estimation of each term are independent, the final estimate’s error is expected to grow logarithmically as the number of terms increases (roughly, this is because the logarithm of a product is equal to the sum of the logarithms of its terms). But in M&P’s case, the quantity being estimated is a sum, so the error will grow much faster, i.e., linearly as a function of the number of terms. Perhaps, as one reviewer writes, “you have to start somewhere”. But do we? If something is not worth doing well—and I would submit that measuring grammars, in all their richness, by comparing them to the storage capacity of obsolete magnetic storage media is one such thing—it seems to me to be not worth doing at all.

Footnotes

  1. Though not without criticism; in speech recognition, probably the most important application of language modeling, it is well-known that decreases in perplexity don’t necessarily give rise to decreases in word error rate.
  2. Why do M&P choose such a degenerate version of syntax? Because syntactic theory is “experimentally under-determined”, so they want to be “independent as possible from the specific syntactic formalism.”

References

Cherry, E. C., Halle, M., and Jakobson, R. 1953. Towards the logical description of languages in their phonemic aspect. Language 29(1): 34-46.
Chomsky, N. 1965. Aspects of the theory of syntax. Cambridge: MIT Press.
Goldsmith, J. and Riggle, J. 2012. Information theoretic approaches to phonology: the case of Finnish vowel harmony. Natural Language & Linguistic Theory 30(3): 859-896.
Halle, M. 1975. Confessio grammatici. Language 51(3): 525-535.
Mollica, F. and Piantadosi, S. T. 2019. Humans store about 1.5 megabytes of information during language acquisition. Royal Society Open Science 6: 181393.
Moscoso del Prado Martín, F., Kostić, A., and Baayen, R. H. 2004. Putting the bits together: an information theoretical perspective on morphological processing. Cognition 94(1): 1-18.
Shannon, C. E. 1956. The bandwagon. IRE Transactions on Information Theory 2(1): 3.
Stanley, R. 1967. Redundancy rules in phonology. Language 43(2): 393-436.

Elizabeth Warren and the morality of the professional class

I am surprised by the outpouring of grief engendered by Senator Elizabeth Warren’s exit from the presidential primary among my professional friends and colleagues. I dare not tell them how they ought to feel, but the spectacle of grief makes me wonder whether my friends are selling themselves short: virtually all of them have lived, in my opinion, far more virtuous lives than the senator from Massachusetts.

First off, none of them have spent most of their professional lives as right-wing activists, as did Warren, a proud Republican until the late ’90s. As recently as 1991, Warren gave a keynote at a meeting of the Federalist Society, the shadowy anti-choice legal organization that gave us Justice Brett Kavanaugh and so many other young ultra-conservative judicial appointees.

Secondly, Warren spent decades lying about her Cherokee heritage, presumably for nothing more than professional gain. This is a stunningly racist personal behavior, one that greatly reinforces white supremacy by equating the almost-unimaginable struggles of indigenous peoples with plagiarized recipes and “high cheekbones”. Were any of my friends or colleagues caught lying so blatantly on a job application, they would likely be subject to immediate termination. It is shocking that Warren has not faced greater professional repercussions for this lapse in judgment.

Warren’s more recent history of regulatory tinkering around the most predatory elements of US capitalism, while important, is hardly an appropriate penance for these two monumental personal-professional sins.