Kill yr darlings…

…or at least make them more rigorous.

In computational phonology, three mid-pandemic articles presented elaborate computational “theories of everything”: Ellis et al. (2022), Rasin et al. (2021), and Yang & Piantadosi (2022).1 I am quite critical of all three offerings. All three provide computational models evaluated for their ability to acquire phonological patterns—with varying amounts of overheated rhetoric about what this means for generative grammar—and in each case, there is an utter lack of rigor. None of the papers prove, or even conjecture, anything hopeful or promising about the computational complexity of the proposed models, how long they take to converge (or whether they converge at all), or whether there is any bound on the kinds of mistakes the models might make once they converge. What they do instead is demonstrate that the models produce satisfactory results on toy problem sets. One might speculate that these three papers are the result of lockdown-era hyperfocus on thorny passion projects. But I think it’s unfortunate that the authors (and doubly so the reviewers and editors) considered these projects complete before providing a formal characterization of the proposed models’ substantive properties.2 By stating this critique here, I hope to commit myself to acting on these values in my future work, and I challenge the aforementioned authors to study these properties.

Endnotes

  1. To be fair, Yang and Piantadosi claim to offer a theory of not just phonology…
  2. I am permitted to state that I reviewed one of these papers—my review was “signed” and made public, along with the paper—and it was politely negative. However, it was clear to me that the editor and other reviewers had a very high opinion of this work, and there was no reason for me to fight the inevitable.

References

Ellis, K., Albright, A., Solar-Lezama, A., Tenenbaum, J. B., and O’Donnell, T. J. 2022. Synthesizing theories of human language with Bayesian program induction. Nature Communications 2022:1–13.
Rasin, E., Berger, I., Lan, N., Shefi, I., and Katzir, R. 2021. Approaching explanatory adequacy in phonology using Minimum Description Length. Journal of Language Modelling 9:17–66.
Yang, Y. and Piantadosi, S. T. 2022. One model for the learning of language. Proceedings of the National Academy of Sciences 119:e2021865119.

Yet more on the Pirahã debate

I just read a draft of Geoff Pullum’s paper on the Pirahã controversy, presented at a workshop at the recent LSA meeting.

It’s not a particularly interesting paper to me, since it has nothing to say about the conflicting data claims at the center of the controversy. No one has ever explained how one might reconcile the evidence for clausal embedding in Everett 1986 (etc.) with the writings of Everett from 2005 onward. These two Everetts are in mortal conflict. Everett (1986), for example, gives examples of embedded clauses; Everett (2005) denies that the language has clausal embedding; and Everett (2009), faced with the contradiction, has decided to gloss one of those very examples (Nevins et al. 2009, ex. 13, reproduced from Everett 1986, ex. 232) as two sentences, with no argument provided for why the earlier Everett was wrong. While one ought not to reason from one’s own limited imagination, it’s hard for me to fathom anything other than incompetence in 1986 or dishonesty from 2005 to the present. Either way, it suggests that additional scrutiny is probably needed for other specific claims about this language, such as the presence of rare phonetic elements (Everett 1988a) and of ternary metrical feet (Everett 1988b); and on these topics there is far less room for creative hermeneutics.

If people have been nasty to Everett—and this seems to be the real complaint from Pullum—it’s because the whole thing stinks to high heaven; it’s a shame Pullum can’t smell the bullshit.

References

Everett, D. L. 1986. Pirahã. In Handbook of Amazonian Languages, vol. 1, D. C. Derbyshire and G. K. Pullum (eds.), pages 200–326. Mouton de Gruyter.
Everett, D. L. 1988a. Phonetic rarities in Pirahã. Journal of the International Phonetic Association 12:94–96.
Everett, D. L. 1988b. On metrical constituent structure in Pirahã. Natural Language & Linguistic Theory 6:207–246.
Everett, D. L. 2005. Cultural constraints on grammar and cognition in Pirahã: another look at the design features of human language. Current Anthropology 46:621–646.
Everett, D. L. 2009. Pirahã culture and grammar: a response to some criticisms. Language 85:405–442.
Nevins, A., Pesetsky, D., and Rodrigues, C. 2009. Pirahã exceptionality: a reassessment. Language 85:355–404.

Streaming decompression for the Reddit dumps

I was recently working with the Reddit comment and submission dumps from PushShift (RIP).1 These are compressed in the Zstandard .zst format. Unfortunately, Python’s extensive standard library doesn’t have native support for this format, and some of the files are quite large,2 so a streaming API is necessary.

After trying various third-party libraries, I finally found one that worked with a minimum of fuss: pyzstd, available from PyPI or Conda. This appears to use Meta’s reference C implementation as the backend, but more importantly, it provides a stream API like the familiar gzip.open, bz2.open, and lzma.open for .gz, .bz2, and .xz files, respectively. There’s one nit: PushShift’s Reddit dumps were compressed with an uncommonly large window size (2^31 bytes), and one has to inform the decompression backend of this. Without that, I was getting the following error:

_zstd.ZstdError: Unable to decompress zstd data: Frame requires too much memory for decoding.

All I had to do to fix this was to pass the relevant parameter:

import pyzstd

# Accept frames with window sizes up to 2**31 bytes.
PARAMS = {pyzstd.DParameter.windowLogMax: 31}

with pyzstd.open(yourpath, "rt", level_or_option=PARAMS) as source:
    for line in source:
        ...

Then, each line is a JSON message with the post (either a comment or submission) and all the metadata.
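For instance, here is a minimal sketch of how one might pull out the comments from a single subreddit; the file name and target subreddit below are hypothetical placeholders, though the subreddit and body fields are standard in the comment dumps:

import json

import pyzstd

# Accept frames with window sizes up to 2**31 bytes, as above.
PARAMS = {pyzstd.DParameter.windowLogMax: 31}

# Hypothetical path to one of the monthly comment dumps.
comments_path = "RC_2023-12.zst"

with pyzstd.open(comments_path, "rt", level_or_option=PARAMS) as source:
    for line in source:
        post = json.loads(line)
        # Comment dumps store the subreddit name and the comment text
        # under "subreddit" and "body", respectively.
        if post["subreddit"] == "linguistics":
            print(post["body"])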

Endnotes

  1. Psst, don’t tell anybody, but… while these are no longer being updated, they are available through December 2023 here. We have found them useful!
  2. Unfortunately, they’re grouped first by comments vs. submissions, and then by month. I would have preferred the files to be grouped by subreddit instead.

Alt-lingfluencers

It’s really none of my business whether or not a linguist decides to leave the field. Several people I consider friends have, and while I miss seeing them at conferences, none of them were close collaborators. Reasonable people can disagree about just how noble it is to be a professor (I think it is, or can be, but it’s not a major part of my self-worth), and I certainly understand why one might prefer a job in the private sector. At the same time, I think linguists wildly overestimate how easy it is to get rewarding, lucrative work in the private sector, and underestimate how difficult that work can be on a day-to-day basis. (Private sector work, like virtually everything else in the West, has gotten substantially worse—more socially alienating, more morally compromising—in the last ten years.)

In this context, I am particularly troubled by the rise of a small class of “alt-ac” ex-linguist influencers. I realize there is a market for advice on how to transition careers, and there are certainly honest people working in this space. (For instance, my department periodically invites graduates from our program to talk about their private sector jobs.) But what the worst of the alt-lingfluencers actually do is farm for engagement and prosecute grievances from their time in the field. If they were truly happy with their career transitions, they simply wouldn’t care enough—let alone have the time—to post about their obsessions for hours every day. These alt-lingfluencers were bathed in privilege when they were working linguists, so to see them rail against the field is a bit like listening to a lottery winner telling you not to play. These are deeply unhappy people, and unless you know them well enough to check in on their well-being from time to time, you should pay them no mind. You’d be doing them a favor, in the end. Narcissism is a disease: get well soon.