“Segmented languages”

In a recent paper (Gorman & Sproat 2023), we complain about conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like “right-to-left language”, “syllabic language” or “ideographic” language found in the literature. Thus we were surprised to find the following:

Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate (CER), instead of WER… (Gemini Team 2024:18)

Since the most salient feature of the writing systems used to write Mandarin, Japanese, Korean, and Thai is the absence of segmentation information (e.g., whitespace used to indicate word boundaries), presumably the authors mean to say that the data they are using has already been pre-segmented (by some unspecified means). But this is not a property of these languages, but rather of the available data.

[h/t: Richard Sproat]

References

Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint 2312.11805. URL: https://arxiv.org/abs/2312.11805.

Gorman, K. and Sproat, R.. 2023. Myths about writing systems in speech & language technology. In Proceedings of the Workshop on Computation and Written Language, pages 1-5.

Self-taught C++

I have recently fielded a few requests from students about self-directed learning of C++. I thought I’d combine my notes here. So, compared to Python for instance, C++ is a very large language both in terms of syntactic richness and the large standard library. Secondly, it has been popular for at least two decades longer than Python, so there is a lot of really dated material out there that doesn’t incorporate the huge positive changes to the language made in C++11.

I recommend two books. First and most importantly is the 4th edition of (C++ creator) Bjarne Stroustrup’s The C++ Programming Language. This is a gigantic hardback textbook that basically covers everything you need to know through C++11. It does not cover C++14, C++17, C++20, or C++23, but those are all pretty minor changes by comparison, and you’ll catch on. Stroustrup is actually a pretty good technical writer, too. (If a 5th edition ever comes out, get that one instead.) The other one I recommend is the Scott Myers’ Effective Modern C++, a smaller book which focuses on the newer C++11 and C++14 features. Myers’ book is structured like a series of essays about when and how to incorporate these new features.

There are two other things I recommend that aspiring C++ users use. The first is a good style guide. C++ just isn’t very opinionated, but good code is. I definitely recommend the widely-used Google C++ style guide, but I’m sure there are other good ones out there. The second is Godbolt, an incredible website that combines the functionality of a pastebin with an in-browser compiler.

Another quote from Ludlow

Indeed, when we look at other sciences, in nearly every case, the best theory is arguably not the one that reduces the number of components from four to three, but rather the theory that allows for the simplest calculations and greatest ease of use. This flies in the face of the standard stories we are told about the history of science. […] This way of viewing simplicity requires a shift in our thinking. It requires that we see simplicity criteria as having not so much to do with the natural properties of the world, as they have to do with the limits of us as investigators, and with the kinds of theories that simplify the arduous task of scientific theorizing for us. This is not to say that we cannot be scientific realists; we may very well suppose that our scientific theories approximate the actual structure of reality. It is to say, however, that barring some argument that “reality” is simple, or eschews machinery, etc., we cannot suppose that there is a genuine notion of simplicity apart from the notion of “simple for us to use.” […] Even if, for metaphysical reasons, we supposed that reality must be fundamentally simple, every science (with the possible exception of physics) is so far from closing the book on its domain it would be silly to think that simplicity (in the absolute sense) must govern our theories on the way to completion. Whitehead (1955, 163) underlined just such a point.

Nature appears as a complex system whose factors are dimly discerned by us. But, as I ask you, Is not this the very truth? Should we not distrust the jaunty assurance with which every age prides itself that it at last has hit upon the ultimate concepts in which all that happens can be formulated. The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, Seek simplicity and distrust it.

(Ludlow 2011:158-160)

References

Ludlow, P. 2011. The Philosophy of Generative Grammar. Oxford University Press.
Whitehead, W. N. 1955. The Concept of Nature. Cambridge University Press.

The Unicoder

I have long encouraged students to turn software demos (which work on their laptop, in their terminal, and maybe nowhere else) into simple web apps. Years ago I built a demo of what this might look like, using Python’s Flask library. The entire app is under 200 lines of Python (and jinja2 template), plus a bit of static HTML and CSS.

It turns out this little demonstration is actually quite useful for my research. For any given string, it gives you the full decomposition of it into Unicode codepoints, with optional Unicode normalization, whitespace stripping, and case-folding. This is very useful for debugging.

The Unicoder, as it is called, is hosted on the free tier of Glitch. [Edit: it is now on Render.] (It used to also be on Heroku, but Salesforce is actively pushing people off that very useful platform.) Because of that, it takes about 10 seconds to “start up” (i.e., I assume the workers are put into some kind of hibernation mode) if it hasn’t been used in the last half hour or so. But, it’s very, very useful.

Citation practices

In a previous post I talked about an exception to the general rule that you should expand acronyms: sometimes what the acronym expands to is a clear joke made up after the fact. This is an instance of a more general principle: you should provide, via citations, information the reader needs to know or stands to benefit from. To that point, nobody has ever really cared about the mere fact that you “used R (R Core Team 2021)”. It’s usually not relevant. R is one of hundreds of Turing-complete programming environments, and most of the things it can do can be done in any other language. Your work almost surely can be replicated in other environments. It might be interesting to mention this if a major point of your paper is that wrote, say, a new open-source software package for R: there the reader needs to know what platform this library targets. But otherwise it’s just cruft.

Debugging CUDA indexing errors

Perhaps you’ve seen pages of the following scary error:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [99,0,0], thread: [115,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

It turns out there is a relatively simple way to figure out what the indexing issue is. The internet suggests prepending

CUDA_LAUNCH_BLOCKING=1

to your command, but this doesn’t seem to help much either. There is a simpler solution: run whatever you’re doing on CPU. It’ll give you much nicer errors.

It’s time to retire “agglutinative”

A common trope in computational linguistics papers is the use of the technical term agglutinative as a synonym for rich inflectional morphology. This is not really what that term means. Properly, a language has agglutinative morphology in the case that it has affixes, each of which has a single syntacto-semantic function. (To really measure this properly, you probably need a richer, and more syntactically-oriented, theory of morphology than is au courant among the kind of linguistic typologist who would think it interesting to measure this over a wide variety of languages in the first place, but that’s another issue.) Thus Russian, for instance, has rich inflectional morphology, but it is not at all agglutinative, because it is quite happy for the suffix -ov to mark both the genitive and the plural, whereas the genitive plural in Hungarian is marked by two affixes.

I propose that we take agglutinative away from NLP researchers until they learn even a little bit about morphology. If you want to use the term, you need to state why agglutination, rather than the mere matter of lexemes having a large number of inflectional variants, is the thing you want to highlight. While I don’t think WALS is very good —certainly it’s over-used in NLP—it nicely distinguishes between isolation (#20), exponence (#21), and synthesis (#22). This ought to allow one to distinguish between agglutination and synthesis with a carefully-drawn sample, should one wish to.

Is NLP stuck?

I can’t help but feel that NLP is once again stuck.

From about 2011 to 2019, I can identify a huge step forward just about every year. But the last thing that truly excited me is BERT, which came out in 2018 and was published in 2019. For those not in the know, the idea of BERT is to pre-train a gigantic language model, with either monolingual or multilingual data. The major pre-training task is masked language model prediction: we pretend some small percentage (usualyl 15%) of the words in a sentence are obscured by noise and try to predict what they were. Ancillary tasks like predicting whether two sentences are adjacent or not (or if they were, what was their order) are also used, but appear to be non-essential. Pre-training (done a single time, at some expense, at BigCo HQ), produces a contextual encoder, a model which can embed words and sentences in ways that are useful for many downstream tasks. But then one can also take this encoder and fine-tune it to some other downstream task (an instance of transfer learning). It turns out that the combination of task-general pre-training using free-to-cheap ordinary text data and a small amount of task-specific fine-tuning using labeled data results in substantial performance gains over what came before. The BERT creators gave away both software and the pre-trained parameters (which would be expensive for an individual or a small academic lab to reproduce on their own), and an entire ecosystem of sharing pre-trained model parameters has emerged. I see this toolkit-development ecosysytem as a sign of successful science.

From my limited perspective, very little has happened since then that is not just more BERTology—that is, exploiting BERT and similar models. The only alternative on the horizon, in the last 4 years now, are pre-trained large language models without the encoder component, of which the best known are the GPT family (now up to GPT-3). These models do one thing well: they take a text prompt and produce more text that seeminly responds to the prompt. However, whereas BERT and family are free to reuse, GPT-3’s parameters and software are both closed source and can only be accessed at scale by paying a licensing fee to Microsoft. That itself is a substantial regression compared to BERT. More importantly, though, the GPT family are far less expressive tools than BERT, since they don’t really support fine-tuning. (More precisely, I don’t see any difficult technical barriers to fine-tuning GPT-style models; it’s just not supported.) Thus they can be only really used for one thing: zero-shot text generation tasks, in which the task is “explained” to the model in the input prompt, and the output is also textual. Were it possible to simply write out, in plain English, what you want, and then get the output in a sensible text format, this of course would be revolutionary, but that’s not the case. Rather, GPT has spawned a cottage industry of prompt engineering. A prompt engineer, roughly, is someone who specializes in crafting prompts. It is of course impressive that this can be done at all, but just because an orangutan can be taught to make an adequate omelette doesn’t mean I am going to pay one to make breakfast. I simply don’t see how any of this represents an improvement over the BERT ecosystem, which at least has an easy-to-use free and open-source ecosystem. And as you might expect, GPT’s zero-shot approach is quite often much worse than what one would obtain using the light supervision of the BERT-style fine-tuning approach.

The next toolkit 2: electric boogaloo

I just got back from the excellent Workshop on Model Theoretic Representations in Phonology. While I am not exactly a member of the “Delaware school”, i.e., the model-theoretic phonology (MTP) crowd, I am a big fan. In my talk, I contrasted the model-theoretic approach to an approach I called the black box approach, using neural networks and program synthesis solvers as examples of the latter. I likened the two styles to neat vs. scruffy, better is better vs. worse is better, rationalists vs. empiricists, and cowboys vs. aliens.

One lesson I drew from this comparison is the need for MTPists to develop high-quality software—the next toolkit 2. I didn’t say much during my talk about what I imagine this to be like, so I thought I’d leave my thoughts here. Several people—Alëna Aksënova, Hossep Dolatian, and Dakotah Lambert, for example—have developed interesting MTP-oriented libraries. While I do not want to give short schrift to their work, I think there are two useful models for the the next next toolkit: (my own) Pynini and PyTorch. Here is what I see as the key features:

They are ordinary Python on the front-end. Of course, both have a C++ back-end, and PyTorch has a rarely used C++ API, but that’s purely a matter of performance; both have been slowly moving Python code into the C++ layer over the course of their development.The fact of the matter is that in 2022, just about anyone who can code at all can do so in Python.
While both are devilishly complex, their design follows the principle of least suprise; there is only a bit of what Pythonistas call exhuberant syntax (Pynini’s use of the @ operator, PyTorch’s use of _ to denote in-place methods).
They have extensive documentation (both in-module and up to book length).
They have extensive test suites.
They are properly packaged and can be installed via PyPi (i.e., via pip) or Conda-Forge (via conda).
They have corporate backing.

I understand that many in the MTP community are naturally—constitutionally, even—drawn to functional languages and literate programming. I think this should not be the initial focus. It should be ease of use, and for that it is hard to beat ordinary Python in 2022. Jupyter/Colab support is a great idea, though, and might satisfy the literate programming itch too.

Wall clock time

Computers and humans have radically different ways to reckon time. While a computer can tell you how long sometime took it, the computer is constantly switching between tasks, so this number has to be converted to wall clock time, or rather how much time elapsed in the real world while it was working on the job.

I guess I’m a dualist, because I think there’s something special about sentience. I think of humans (and possibly others creatures) are essentially divine but finite beings, whereas to me computers are mere objects. We divine beings can spend some of our finite time on earth to make a program run faster, but at some point it makes more sense to simply wait, and do something else while the program is running. It is hard to draw an equivalence between the opportunity cost for a divine being vs. an object. Learning when to just wait is one of the most important skills a developer can acquire.