When rule directionality does and does not matter

At the Graduate Center we recently hosted an excellent lecture by Jane Chandlee of Haverford College. Those familiar with her work may know that she has, for some time now, been studying two classes of string-to-string functions called the input strictly local (ISL) and output strictly local (OSL) functions. These generalize the familiar notion of the strictly local (SL) languages proposed by McNaughton and Papert (1971) many years ago. For definitions of ISL and OSL functions, see Chandlee et al. 2014 and Chandlee 2014. Chandlee and colleagues have been arguing that virtually all phonological processes are ISL, OSL, or both (note that the intersection of the two classes is non-empty).

In her talk, Chandlee attempted to formalize the notions of iterativity and non-iterativity in phonology with reference to ISL and OSL functions. One interesting side effect of this work is that one can, quite easily, determine what makes a phonological process direction-invariant or direction-specific. In FSTP (Gorman & Sproat 2021:§5.1.1) we describe three notions of rule directionality from the literature (ones which are quite a bit less general than Chandlee’s notions), but conclude: “Note, however, that directionality of application has no discernable effect for perhaps the majority of rules, and can often be ignored.” (op. cit., 53) We didn’t bother to determine when this is the case, but Chandlee shows that the set of rules which are invariant to direction of application (in our sense) is exactly ISL ∩ OSL; that is, these rules describe string-to-string functions (or maps, to use her term) which can be encoded either as ISL or as OSL functions.
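To make the direction-specific versus direction-invariant contrast concrete, here is a minimal Python sketch. It is only a toy: the function apply_rule and its rule format (a single-character target rewritten after a single-character left context) are hypothetical illustrations, not Chandlee's formal definitions of ISL and OSL nor the definitions given in FSTP. The rule is applied position by position, in one direction or the other, and each context check looks at the partially rewritten string.

```python
def apply_rule(s, target, repl, left, direction):
    """Rewrites target as repl when immediately preceded by left.

    direction is "ltr" (left-to-right) or "rtl" (right-to-left) and
    controls the order in which positions are visited. The context is
    checked against the string as rewritten so far.
    """
    chars = list(s)
    positions = range(len(chars))
    if direction == "rtl":
        positions = reversed(positions)
    for i in positions:
        if chars[i] == target and i > 0 and chars[i - 1] == left:
            chars[i] = repl
    return "".join(chars)


# a -> b / b __ : the output of one application feeds the context of the
# next, so the direction of application matters.
spreading_ltr = apply_rule("baaa", "a", "b", "b", "ltr")
spreading_rtl = apply_rule("baaa", "a", "b", "b", "rtl")

# a -> b / c __ : the rule never creates or destroys its own context, so
# both directions give the same result.
local_ltr = apply_rule("caca", "a", "b", "c", "ltr")
local_rtl = apply_rule("caca", "a", "b", "c", "rtl")

print(spreading_ltr, spreading_rtl)  # bbbb bbaa
print(local_ltr, local_rtl)          # cbcb cbcb
```

In the first rule, the rewritten segment is itself a possible context, so left-to-right application spreads across the whole string while right-to-left application changes only one segment. In the second rule the context segment is never the output of the rule, so the direction makes no difference.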

As Richard Sproat (p.c.) points out to me, there are weaker notions of direction-invariance we may care about in the context of grammar engineering. For instance, it might be the case that some rule is, strictly speaking, direction-specific, but the language of input strings is not expected to contain any examples where the direction would make a difference. I suspect this, too, is quite common.

References

Chandlee, J. 2014. Strictly local phonological processes. Doctoral dissertation, University of Delaware.
Chandlee, J., Eyraud, R., and Heinz, J. 2014. Learning strictly local subsequential functions. Transactions of the Association for Computational Linguistics 2: 491-503.
Gorman, K., and Sproat, R. 2021. Finite-State Text Processing. Morgan & Claypool.
McNaughton, R., and Papert, S. A. 1971. Counter-Free Automata. MIT Press.

A* shortest string decoding for non-idempotent semirings

I recently completed some work, in collaboration with Google’s Cyril Allauzen, on a new algorithm for computing the shortest string through a weighted finite-state automaton. For so-called path semirings, the shortest string is given by the shortest path, but up until now, there was no general-purpose algorithm for computing the shortest string over non-idempotent semirings (like the log or probability semiring). Such an algorithm would make it much easier to decode with interpolated language models or elaborate channel models in a noisy-channel formalism. In this preprint, we propose such an algorithm using A* search and lazy (“on-the-fly”) determinization, and prove that it is correct. The algorithm in question is implemented in my OpenGrm-BaumWelch library by the baumwelchdecode command-line tool.
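To see why the shortest path and the shortest string can come apart in a non-idempotent semiring, consider the following toy Python sketch over the probability semiring. It is emphatically not the A* algorithm from the preprint: it simply enumerates a hypothetical, hand-written list of accepting paths, which is only feasible for tiny acyclic automata. The path list and the two function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy automaton, given as its accepting paths:
# (string labeling the path, probability of the path).
# Two distinct paths are labeled with the same string "b".
paths = [("a", 0.4), ("b", 0.3), ("b", 0.3)]


def best_path_string(paths):
    """Returns the string labeling the single most probable path."""
    return max(paths, key=lambda pair: pair[1])[0]


def best_string(paths):
    """Returns the most probable string, summing over all of its paths.

    In the probability semiring, the weight of a string is the sum of
    the weights of every path labeled with that string.
    """
    totals = defaultdict(float)
    for string, prob in paths:
        totals[string] += prob
    return max(totals, key=totals.get)


print(best_path_string(paths))  # a
print(best_string(paths))       # b
```

The best single path is labeled "a", but "b" is the better string once the weights of its two paths are summed. In an idempotent path semiring such as the tropical semiring, summing amounts to taking the minimum, so the two answers coincide.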

Please don’t send .docx or .xlsx files

.docx and .xlsx can only be read on a small subset of devices and only after purchasing a license. It is frankly a bit rude to expect everyone to have such licenses in 2022 given the proliferation of superior, and free, alternatives. If the document is static, read-only content, convert it to a PDF. If it’s something you want me to edit or comment on, or which will be changing with time, send me the document via Microsoft 365 or the equivalent Google offerings. Or a Git repo. Sorry to be grumpy but everyone should know this by now. If you’re still emailing these around, please stop.

WFST talk

I have posted a lightly revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed description yet of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms: the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.

Dutch names in LaTeX

One thing I recently figured out is a sensible way to handle Dutch names (i.e., those that begin with den, van, or similar particles). Traditionally, these particles are part of the cited name in author-date citations (e.g., den Dikken 2003, van Oostendorp 2009) but are ignored when alphabetizing (thus, van Oostendorp is alphabetized between Orgun & Sprouse and Otheguy, not between Vago and Vaux). This is not something handled automatically by tools like LaTeX and BibTeX, but it is relatively easy to annotate name particles like this so that they do the right thing.

First, place, at the top of your BibTeX file, the following:

@preamble{{\providecommand{\noopsort}[1]{}}}

Then, in the individual BibTeX entries, wrap the author field with this command like so:

 author = {{\noopsort{Dikken}{den Dikken}}, Marcel},

This preserves the correct in-text author-date citations, but also gives the intended alphabetization in the bibliography.

Note of course that not all people with van (etc.) names in the Anglosphere treat the van as if it were a particle to be ignored; a few deliberately alphabetize their last name as if it begins with v.