Another quote from Ludlow

Indeed, when we look at other sciences, in nearly every case, the best theory is arguably not the one that reduces the number of components from four to three, but rather the theory that allows for the simplest calculations and greatest ease of use. This flies in the face of the standard stories we are told about the history of science. […] This way of viewing simplicity requires a shift in our thinking. It requires that we see simplicity criteria as having not so much to do with the natural properties of the world, as they have to do with the limits of us as investigators, and with the kinds of theories that simplify the arduous task of scientific theorizing for us. This is not to say that we cannot be scientific realists; we may very well suppose that our scientific theories approximate the actual structure of reality. It is to say, however, that barring some argument that “reality” is simple, or eschews machinery, etc., we cannot suppose that there is a genuine notion of simplicity apart from the notion of “simple for us to use.” […] Even if, for metaphysical reasons, we supposed that reality must be fundamentally simple, every science (with the possible exception of physics) is so far from closing the book on its domain it would be silly to think that simplicity (in the absolute sense) must govern our theories on the way to completion. Whitehead (1955, 163) underlined just such a point.

Nature appears as a complex system whose factors are dimly discerned by us. But, as I ask you, Is not this the very truth? Should we not distrust the jaunty assurance with which every age prides itself that it at last has hit upon the ultimate concepts in which all that happens can be formulated. The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, Seek simplicity and distrust it.

(Ludlow 2011:158-160)

References

Ludlow, P. 2011. The Philosophy of Generative Grammar. Oxford University Press.
Whitehead, A. N. 1955. The Concept of Nature. Cambridge University Press.

A page from Ludlow (2011)

Much writing in linguistic theory appears to be driven by a certain common wisdom, which is that the simplest theory either is the most aesthetically elegant or has the fewest components, or that it is the theory that eschews extra conceptual resources. This common wisdom is reflected in a 1972 paper by Paul Postal entitled “The Best Theory,” which appeals to simplicity criteria for support of a particular linguistic proposal. A lot of linguists would wholeheartedly endorse Postal’s remark (pp. 137–138) that, “[w]ith everything held constant, one must always pick as the preferable theory that proposal which is most restricted conceptually and most constrained in the theoretical machinery it offers.”

This claim may seem pretty intuitive, but it stands in need of clarification, and once clarified, the claim is much less intuitive, if not obviously false. As an alternative, I will propose that genuine simplicity criteria should not involve appeal to theoretical machinery, but rather a notion of simplicity in the sense of “simplicity of use”. That is, simplicity is not a genuine property of the object of investigation (whether construed as the human language faculty or something else), but is rather a property that is entirely relative to the investigator, and turns on the kinds of elements that the investigator finds perspicuous and “user friendly.”

Let’s begin by considering Postal’s thesis that the simplest (and other things being equal the best) theory is the one that utilizes less theoretical machinery. It may seem natural to talk about “theoretical machinery,” but what exactly is theoretical machinery? Consider the following questions that arise in cross-theoretical evaluation of linguistic theories of the sort discussed in Chapter 1. Is a level of linguistic representation part of the machinery? How about a transformation? A constraint on movement? A principle of binding theory? A feature? How about an algorithm that maps from level to level, or that allows us to dispense with levels of representation altogether? These questions are not trivial, nor are they easy to answer. Worse, there may be no theory neutral way of answering them.

The problem is that ‘machinery’ can be defined any way we choose. The machinery might include levels of representation, but then again it might not (one might hold that the machinery delivers the level of representation, but that the level of representation itself is not part of the machinery). Alternatively, one might argue that levels of representation are part of the machinery (as they are supported by data structures of some sort), but that the mapping algorithms which generate the levels of representation are not (as they never have concrete realization). Likewise one might argue that constraints on movement are part of the machinery (since they constrain other portions of the machinery), or one might argue that they are not (since they never have concrete realizations).

Even if we could agree on what counts as part of the machinery, we immediately encounter the question of how one measures whether one element or another represents more machinery. Within a particular well-defined theory it makes perfect sense to offer objective criteria for measuring the simplicity of the theoretical machinery, but measurement across theories is quite another matter. (Ludlow 2011: 153)

References

Ludlow, P. 2011. The Philosophy of Generative Grammar. Oxford University Press.
Postal, P. 1972. The best theory. In S. Peters (ed.), Goals of Linguistic Theory, pages 131-179. Prentice-Hall.

Lottery winners

It is commonplace to compare the act of securing a permanent faculty position in linguistics to winning the lottery. I think this is mostly unfair. There are fewer jobs than interested applicants, but the demand is higher, and the supply lower, than students these days suppose. And my junior faculty colleagues mostly got to where they are by years of dedicated, focused work. Because there are a lot of pitfalls on the path to the tenure track, their egos are often a lot smaller than one might suppose.

I wonder if the lottery ticket metaphor might be better applied to graduate trainees in linguistics finding work in the tech sector. I have held both types of positions, and I think I had to work harder to get into tech than to get back into the academy. Some of the “alt-ac influencers” in our field (the ones who ended up in tech, at least) had all the privileges in the world, including some reasonably prestigious teaching positions, before they made the jump. Being able to stay and work in the US, where the vast majority of this kind of work is, requires a sort of luck too, particularly when you reject the idea that “being American” is some kind of default. And finally, demand for linguist labor in the tech sector varies enormously from quarter to quarter, meaning that some people are going to get lucky and others won’t.

Entrenched facts

Berko’s (1958) wug-test is a standard part of the phonologist’s toolkit. If you’re not sure if a pattern is productive, why not ask whether speakers extend it to nonce words? It makes sense; it has good face validity. However, I increasingly see linguists who think that the results of wug-tests actually trump contradictory evidence coming from traditional phonological analysis applied to real words. I respectfully disagree.

Consider for example a proposal by Sanders (2003, 2006). He demonstrates that an alternation in Polish (somewhat imprecisely called o-raising) is not applied to nonce words. From this he concludes that o-raising is handled via stem suppletion. He asks, and answers, the very question you may have on your mind. (Note that his H here is the OT constraint hierarchy; you may want to read it as grammar.)

Is phonology obsolete?! No! We still need a phonological H to explain how nonce forms conform to phonotactics. We still need a phonological H to explain sound change. And we may still need H to do more with morphology than simply allow extant (memorized) morphemes to trump nonce forms. (Sanders 2006:10)1

I read a sort of nihilism into this quotation. However, I submit that the fact that 50 million people just speak Polish (and “raise” and “lower” their ó’s with a high degree of consistency across contexts, lexemes, and so on) is a more entrenched fact than the results of a small nonce word elicitation task. I am not saying that Sanders’ results are wrong, or even misleading, just that his theory has escalated the importance of these results to the point where it has almost nothing to say about the very interesting fact that the genitive singular of lód [lut] ‘ice’ is lodu [lɔdu] and not *[ludu], and that tens of millions of people agree.

Endnotes

  1. Sanders’ 2006 manuscript is a handout but apparently it’s a summary of his 2003 dissertation (Sanders 2003), stripped of some phonetic-interface details not germane to the question at hand. I just mention this so that it doesn’t look like I’m picking on a rando. Those familiar with my work will probably guess that I disagree with just about everything in this quotation, but kudos to Sanders for saying something interesting enough to disagree with.

References

Berko, J. 1958. The child’s learning of English morphology. Word 14: 150-177.
Sanders, N. 2003. Opacity and sound change in the Polish lexicon. Doctoral dissertation, University of California, Santa Cruz.
Sanders, N. 2006. Strong lexicon optimization. Ms., Williams College and University of Massachusetts, Amherst.

The Unicoder

I have long encouraged students to turn software demos (which work on their laptop, in their terminal, and maybe nowhere else) into simple web apps. Years ago I built a demo of what this might look like, using Python’s Flask library. The entire app is under 200 lines of Python (and jinja2 template), plus a bit of static HTML and CSS.

It turns out this little demonstration is actually quite useful for my research. For any given string, it gives you its full decomposition into Unicode codepoints, with optional Unicode normalization, whitespace stripping, and case-folding. This is very useful for debugging.
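The core of such a tool fits in a few lines of standard-library Python. What follows is a minimal sketch and not the Unicoder’s actual source: the function name decompose and its parameters are assumptions made for illustration, and the order in which the options are applied is just one reasonable choice.

```python
import unicodedata


def decompose(text, normalization=None, strip=False, casefold=False):
    """Returns one (character, codepoint, name) triple per codepoint.

    `decompose` and its parameters are hypothetical names for this sketch.
    `normalization`, if given, should be "NFC", "NFD", "NFKC", or "NFKD".
    """
    if strip:
        text = text.strip()
    if casefold:
        text = text.casefold()
    # Normalization is applied last, since case-folding can denormalize.
    if normalization:
        text = unicodedata.normalize(normalization, text)
    return [
        (char, f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>"))
        for char in text
    ]
```

For instance, calling decompose on a precomposed é (U+00E9) with normalization="NFD" yields two entries, a plain e followed by U+0301 COMBINING ACUTE ACCENT, which is exactly the kind of difference that is invisible in a terminal.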

The Unicoder, as it is called, is hosted on the free tier of Glitch. [Edit: it is now on Render.] (It used to also be on Heroku, but Salesforce is actively pushing people off that very useful platform.) Because of that, it takes about 10 seconds to “start up” (i.e., I assume the workers are put into some kind of hibernation mode) if it hasn’t been used in the last half hour or so. But, it’s very, very useful.

Citation practices

In a previous post I talked about an exception to the general rule that you should expand acronyms: sometimes what the acronym expands to is a clear joke made up after the fact. This is an instance of a more general principle: you should provide, via citations, information the reader needs to know or stands to benefit from. To that point, nobody has ever really cared about the mere fact that you “used R (R Core Team 2021)”. It’s usually not relevant. R is one of hundreds of Turing-complete programming environments, and most of the things it can do can be done in any other language. Your work almost surely can be replicated in other environments. It might be interesting to mention this if a major point of your paper is that you wrote, say, a new open-source software package for R: there the reader needs to know what platform this library targets. But otherwise it’s just cruft.

Robot autopsies

I don’t really understand the exuberance for studying whether neural networks know syntax. I have a lot to say about this issue (I’ll return to it later), but for today I’d like to briefly discuss this passage from a recent(ish) paper by Baroni (2022). The author expresses great surprise that few formal linguists have cited a particular paper (Linzen et al. 2016) about the ability of neural networks to learn long-distance agreement phenomena. (To be fair, Baroni is not a coauthor of said paper.) He then continues:

While it is possible that deep nets are relying on a completely different approach to language processing than the one encoded in human linguistic competence, theoretical linguists should investigate what are the building blocks making these systems so effective: if not for other reasons, at least in order to explain why a model that is supposedly encoding completely different priors than those programmed into the human brain should be so good at handling tasks, such as translating from a language into another, that should presuppose sophisticated linguistic knowledge. (Baroni 2022: 11).

I think this passage is a useful jumping-off point for my own view. I want to be clear: I am not “picking on” Baroni, who is probably far senior to, and certainly better known than, me; this is just a particularly clearly written claim, and I just happen to disagree.

Baroni says it is “possible that deep nets are relying on a completely different approach to language processing…” than humans; I’d say it’s basically certain that they are. We simply have no reason to think they might be using similar mechanisms since humans and neural networks don’t contain any of the same ingredients. Any similarities will naturally be analogies, not homologies.

Without a strong reason to think neural models and humans share some kind of cognitive homologies, there is no reason for theoretical linguists to investigate them; as artifacts of human culture they are no more in the domain of study for theoretical linguists than zebra finches, carburetors, or the perihelion of Mercury. 

It is not even clear how one ought to poke into the neural black box. Complex networks are mostly resistant to the kind of proof-theoretic techniques that mathematical linguists (witness the Delaware school or even just work by, say, Tesar) actually rely on, and most of the results are both negative and of minimal applicability: for instance, we know that there always exists a network with a single hidden layer large enough to encode, with arbitrary precision, any function a multi-layer network encodes, but we have no way to figure out how big is big enough for a given function.

Probing and other interpretative approaches exist, but have not yet proved themselves, and it is not clear that theoretical linguists have the relevant skills to push things forward anyways. Quality assurance, including adversarial data generation, is not exactly a high-status job; how can Baroni demand that Cinque or Rizzi (to choose two of Baroni’s well-known countrymen) put down their chalk and start doing free or poorly-paid QA for Microsoft?

Why should theoretical linguists of all people be charged with doing robot autopsies when the creators of the very same robots are alive and well? Either it’s easy and they’re refusing to do the work, or—and I suspect this is the case—it’s actually far beyond our current capabilities and that’s why little progress is being made.

I for one am glad that, for the time being, most linguists still have a little more self-respect. 

References

Baroni, M. 2022. On the proper role of linguistically oriented deep net analysis in linguistic theorising. In S. Lappin (ed.), Algebraic Structures in Natural Language, pages 1-16. Taylor & Francis.
Linzen, T., Dupoux, E., and Goldberg, Y. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4: 521-535.

Defectivity in English: more observations

[This is part of a series of defectivity case studies.]

In an earlier post I listed some defective verbs in my idiolect. After talking with our PhD student Aidan Malanoski, I have a couple additional generalizations to note.

  1. Aidan is fine with infinitival BEWARE (e.g., Caesar was told to beware the Ides of March). I am not sure about this myself. 
  2. Aidan points out that SCRAM, SHOO, and GO AWAY (we might call them, along with BEWARE, “imperative-dominant verbs”) have a similarly restricted distribution. Roughly, our judgments are:
  • Imperatives ok: Scram! Shoo! Go away!
  • Infinitives ok: Roaches started to scram when I turned the lights on. She shouted for the pigeons to shoo. The waiters couldn’t wait for them to go away.
  • Gerunds marginal: Scramming would be a good idea right about now. Just going away might be the best thing.
  • Other -ing participles degraded: (past continuous) Roaches started scramming when I turned the lights on. (small clause) I saw the police scramming.
  • Simple pasts degraded: Roaches scrammed when I turned the lights on. (compositional reading only) He went away.

They point out that the same -ing surface forms may differ in acceptability. I also note that for me shooed [s.o.] away is fine as a transitive.

Why binarity is probably right

Consider the following passage, about phonological features:

I have not seen any convincing justification for the doctrine that all features must be underlyingly binary rather than ternary, quaternary, etc. The proponents of the doctrine often realize it needs defending, but the calibre of the defense is not unfairly represented by the subordinate clause devoted to the subject in SPE (297): ‘for the natural way of indicating whether or not an item belongs to a particular category is by means of binary features.’ The restriction to two underlying specifications creates problems and solves none. (Sommerstein 1977: 109)

Similarly, I had a recent conversation with someone who insisted certain English multi-object constructions in syntax are better handled by assuming the possibility of ternary branching.

I disagree with Sommerstein, though: a logical defense of the assumption of binarity (both for the specification of phonological feature polarity and for the arity of syntactic trees) is so obvious that it fits on a single page. Roughly: 1) less than two is not enough, and 2) two is enough.

Less than two is not enough. This much should be obvious: theories in which features only have one value, or syntactic constituents cannot dominate more than one element, have no expressive power whatsoever.1,2

Two is enough. Every time we might desire to use a ternary feature polarity, or a ternary branching non-terminal, there exists a weakly equivalent specification which uses binary polarity or binary branching, respectively, and more features or non-terminals. It is then up to the analyst to determine whether or not they are happy with the natural classes and/or constituents obtained, but this possibility is always available to the analyst. One opposed to this strategy has a duty to say why the hypothesized features or non-terminals are wrong.
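Both halves of this recoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-way height contrast is recoded with the familiar [±high, ±low] pair, and the fresh non-terminal labels produced by binarize (e.g., VP|1) are hypothetical names that an analyst would still need to motivate as constituents.

```python
# Three vowel heights recoded with two binary features, [±high] and [±low].
HEIGHT = {
    "high": {"high": True, "low": False},
    "mid": {"high": False, "low": False},
    "low": {"high": False, "low": True},
}


def natural_class(feature, value):
    """Returns the heights picked out by one binary feature specification."""
    return sorted(h for h, spec in HEIGHT.items() if spec[feature] == value)


def binarize(lhs, rhs):
    """Splits one n-ary rule into binary rules using fresh non-terminals.

    The fresh labels (lhs|1, lhs|2, ...) are hypothetical names.
    """
    rules = []
    current = lhs
    for i, symbol in enumerate(rhs[:-2], start=1):
        fresh = f"{lhs}|{i}"
        rules.append((current, (symbol, fresh)))
        current = fresh
    # The last two daughters (or fewer) attach to the innermost node.
    rules.append((current, tuple(rhs[-2:])))
    return rules
```

Here natural_class("high", False) groups mid and low vowels together, and binarize("VP", ["V", "NP", "NP"]) replaces the ternary rule with VP → V VP|1 and VP|1 → NP NP. Whether those groupings are the right ones is precisely the empirical question left to the analyst.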

Endnotes

  1. It is important to note in this regard that privative approaches to feature theory (as developed by Trubetzkoy and disciples) are themselves special cases of the binary hypothesis which happen to treat absence as non-referable. For instance, if we treat the set of nasals as a natural class (specified [Nasal]) but deny the existence of the (admittedly rather diverse) natural class [−Nasal], and if we further insist rules be defined in terms of natural classes and deny the possibility of disjunctive specification, we are still working in a binary setting; we have just added an additional stipulation that negated features cannot be referred to by rules.
  2. I put aside the issue of cumulativity of stress—a common critique in the early days—since nobody believes this is done by feature in 2023.

References

Sommerstein, A. 1977. Modern Phonology. Edward Arnold.

The different functions of probability in probabilistic grammar

I have long been critical of naïve interpretations of probabilistic grammar. To me, it seems like the major motivation for this approach derives from a naïve (I’d say overly naïve) linking hypothesis mapping between acceptability judgments and grammaticality, as seen in Likert scale-style acceptability tasks. (See chapter 2 of my dissertation for a concrete argument against this.) In this first interpretation, the probabilities are measures of wellformedness.

It occurs to me that there are a number of ontologically distinct interpretations of grammatical probabilities of the sort produced by “maxent”, i.e., logistic regression models.

For instance, at M100 this weekend, I heard Bruce Hayes talk about another use of maximum entropy models: scansion. In poetic meters, there is variation in, say, whether the caesura is masculine (after a stressed syllable) or feminine (after an unstressed syllable), and the probabilities reflect that.1 However, I don’t think it makes sense to equate this with grammaticality, since we are talking about variation in highly self-conscious linguistic artifacts here and there is no reason to think one style of caesura is more grammatical than the other.2

And of course there is a third interpretation, in which the probabilities are production probabilities, representing actual variation in production, within a speaker or across multiple speakers.

It is not obvious to me that these facts all ought to be modeled the same way, yet the maxent community seems comfortable assuming a single cognitive model to cover all three scenarios. To state the obvious, it makes no sense for a cognitive model to account for interspeaker variation, because there is no such thing as “interspeaker cognition”; there are just individual mental grammars.
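For concreteness, here is a minimal sketch of the kind of computation at issue, assuming the usual formulation in which a candidate’s probability is proportional to the exponential of its negated, weighted sum of constraint violations. The function name, the candidates, the violation counts, and the weights are all invented for illustration. Nothing in the arithmetic says whether the resulting numbers are wellformedness scores, stylistic preferences, or production rates.

```python
import math


def maxent_probabilities(violations, weights):
    """Maps each candidate to its probability under a maxent grammar.

    `violations` maps a candidate to its violation counts, one per
    constraint; `weights` gives one non-negative weight per constraint.
    """
    harmonies = {
        candidate: -sum(w * v for w, v in zip(weights, counts))
        for candidate, counts in violations.items()
    }
    # Normalizing constant summing over all candidates.
    z = sum(math.exp(h) for h in harmonies.values())
    return {candidate: math.exp(h) / z for candidate, h in harmonies.items()}


# Hypothetical example: one constraint penalizing feminine caesurae.
caesura = maxent_probabilities(
    {"masculine": [0], "feminine": [1]},
    weights=[math.log(3)],
)
```

With that single made-up weight of log 3, the masculine caesura gets probability 0.75 and the feminine one 0.25; the model itself is silent on which of the three interpretations those figures are supposed to bear.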

Endnotes

  1. This is a fabricated example because Hayes and colleagues mostly study English meter—something I know nothing about—whereas I’m interested in Latin poetry. I imagine English poetry has caesurae too but I’ve given it no thought yet.
  2. I am not trying to say that we can’t study grammar with poetry. Separately, I note, as did, I think, Paul Kiparsky at the talk, that this model also assumes that the input text the poet is trying to fit to the meter has no role to play in constraining what happens.