Optimizing three-way composition for decipherment problems

Knight et al. (2006) introduce a class of problems they call decipherment. In this scenario, we observe a “ciphertext” C , which we wish to decode. We imagine that there exists a corpus of “plaintext” P, and which to recover the encipherment model G that transduces from P to C. All three components can be represented as (weighted) finite-state transducers: P is a language model over plaintexts, C is a list of strings, and G is an initially-uniform transducer from P to C. We can then estimate the parameters (i.e.. arc weights) of G by holding P and C constant and applying the expectation maximization algorithm (Dempster et al. 1977).

Both training and prediction require us to repeatedly compute the “cascade”, the three-way composition P ○ G ○ C. First off, two-way composition is associative, so for all ab, c : (ab) ○ c = a ○ (b ○ c). However, given any n-way composition, some associations may be radically more efficient than others. Even were the time complexity of each possible composition known, it is still not trivial to compute the optimal association. Fortunately, in this case we are dealing with three-way composition, for which there are only two possible associations; we simply need to compare the two.1

Composition performance depends on the sorting properties of the relevant machines. In the simplest case, the inner loop of (two-way) composition consists of a complex game of “go fish” between a state in the left-hand side automaton and a state in the right-hand side automaton. One state enumerates over its input (respectively, output) labels and queries the other state’s output (respectively input) labels. In the case that the state in the automaton being queried has its arcs sorted according to the label values, a sublinear binary search is used; otherwise, linear-time search is required. Optimal performance obtains when the left-hand side of composition is sorted by output labels and the right-hand side is sorted by input labels.2 Naturally, we also want to perform arc-sorting offline if possible.

Finally, OpenFst, the finite-state library we use, implements composition as an on-the-fly operation: states in the composed FST are lazily computed and stored in an LRU cache.3 Assiduous use of the cache can make it feasible to compute very large compositions when it is not necessary to visit all state of the composed machine. Today I focus on associativity and assume optimal label sorting; caching will have to wait for another day.

Our cascade consists of three weighted finite-state machines:

  • P is a language model expressed as a weighted label-sorted finite-state acceptor. The model is order 6, with Witten-Bell smoothing (Bell et al. 1990) backoffs encoded using φ (i.e., “failure”) transitions, and has been shrunk to 1 million n-grams using relative entropy pruning (Stolcke 1998).
  • G is a uniform channel model encoded as a finite-state transducer. Because it is a non-deterministic transducer, it can be input-label-sorted or output-label sorted, but not both.
  • C is an unweighted label-sorted string finite-state acceptor encoding a long plaintext.

There are two possible associativities, which we illustrate using the OpenFst Python bindings.4 In the first, we use a left-associative composition. Offline, before composition, we input label-sort G:

In [5]: G.arcsort("ilabel")

Then, we perform both compositions, sorting the intermediate object by output label:

In [6]: %timeit -n 10 
...          partial = compose(P, G, connect=False).arcsort("olabel"); 
...          cascade = compose(partial, C, connect=False)
10 loops, best of 3: 41.6 s per loop

In our second design, we use the parallel right-associative construction. Offline, we output label-sort G:

In [7]: G.arcsort("olabel")

Then, we perform both compositions, sorting the intermediate object by input label:

In [8]: %timeit -n 10 
...          partial = compose(G, C, connect=False).arcsort("ilabel"); 
...          cascade = compose(P, partial, connect=False)
3 loops, best of 3: 38.5 s per loop

So we see a small advantage for the right-associative composition, which we take advantage of in OpenGrm-BaumWelch, freely available from the OpenGrm website.

Endnotes

  1. There exist FST algorithms for n-ary composition (Allauzen & Mohri 2009), but in practice one can achieve similar effects using composition filters (Allauzen et al. 2010) instead.
  2. Note that acceptors which are input label-sorted are implicitly output label-sorted and vice versa, and string FSTs are input and output label-sorted by definition.
  3. In the case where one needs the entire composition at once, we can simply disable caching; in OpenFst, the result is also connected (i.e., trimmed) by default, but we disable that since we need to track the original state IDs.
  4. The timeit module is used to estimate execution times irrespective of caching.

References

Allauzen, C., and Mohri, M.. 2009. N-way composition of weighted finite-state transducers. International Journal of Foundations of Computer Science 20(4): 613-627.
Allauzen, C., Riley, M. and Schalkwyk, J. 2010. Filters for efficient composition of weighted finite-state transducers. In Implementation and Application of Automata: 15th International Conference, CIAA 2010, pages 28-38. Manitoba.
Bell, T.C., Clearly, J. G., and Witten, I.H. 1990. Text Compression. Englewood Cliffs, NJ: Prentice Hall.
Dempster, A. P., Laird, N., M, and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1): 1-38.
Knight, K., Nair, A., Rathod, N, Yamada, K. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499-506. Sydney.
Stolcke, A. 1998. Entropy-based pruning of backoff language models. In Proceedings of the DARPA Broadcast News And Understanding Workshop, pages 270–274. Lansdowne, Virginia.

Pynini 2020: State of the Sandwich

I have been meaning to describe some of the work I have been doing on Pynini, our weighted finite-state grammar development platform. For one, while I have been the primary contributor through the history of the project (Richard Sproat wrote the excellent path iteration library), we are now also getting many contributions from Lawrence Wolf-Sonkin (rewrite of the symbol table wrapper, type hints) and lots of usability and bug reports from the Google linguists.

We are currently on Pynini release 2.1.1. Here are some new features/improvements from the last few releases:

  • 2.0.9: Adds an efficient multi-argument union.
  • 2.0.9: Pynini (and the rest of OpenGrm) are available on Conda via Conda-Forge. This means that for most users, there is no longer any need to compile Pynini by hand; instead Pynini is compiled (for a variety of platforms) in the cloud, using a continuous integration framework.
  • 2.1.0: Rewrites the string compiler so that symbol tables are no longer attached to compiled FSTs, eliminating the need for expensive symbol table merging and relabeling options.
  • 2.1.0: Rewrites the FST and symbol table class hierarchies to better reflect the organization of lower-level APIs.
  • 2.1.1: Adds PEP 484/PEP 561-compatible type stubs.

We also have removed or renamed quite a few features:

  • stringify is renamed string.
  • text is renamed print (cf. the command-line tool fstprint).
  • The defaults struct is removed, though it may be reintroduced as a context manager at some point.
  • The * infix operator, previously used for composition is removed; use @ instead.
  • transducer‘s arguments input_token_type and output_token_type are merged as token_type.

Finally, we have broken Python 2.7 compatibility as of 2.1.0; pywrapfst, the lower-level API, still has some degree of Python 2.7 compatibility, but this is probably the last release to maintain that property.

Using a fixed training-development-test split in sklearn

The scikit-learn machine learning library has good support for various forms of model selection and hyperparameter tuning. For setting regularization hyperparameters, there are model-specific cross-validation tools, and there are also tools for both grid (e.g., exhaustive) hyperparameter tuning with the sklearn.model_selection.GridSearchCV and random hyperparameter tuning (in the sense of Bergstra & Bengio 2012) with sklearn.model_selection.RandomizedSearchCV, respectively. While you could probably could implement these yourself, the sklearn developers have enabled just about every feature you could want, including multiprocessing support.

One apparent limitation of these classes is that, as their names suggest, they are designed for use in a cross-validation setting. In the speech & language technology, however, standard practice is to use a fixed partition of the data into training, development (i.e., validation), and test (i.e., evaluation) sets, and to select hyperparameters which maximize performance on the development set. This is in part an artifact of limited computing resources of the Penn Treebank era and I’ve long suspected it has serious repercussions for model evaluation. But tuning and evaluating with a standard split is faster than cross-validation and can make exact replication much easier. And, there are also some concerns about whether cross-validation is the best way to set hyperparameters anyways. So what can we do?

The GridSearchCV and RandomSearchCV classes take an optional cv keyword argument, which can be, among other things, an object implementing the cross-validation iterator interface. At first I thought I would create an object which allowed me to use a fixed development set for hyperparameter tuning, but then I realized that I could do this with one of the existing iterator classes, namely one called sklearn.model_selection.PredefinedSplit. The constructor for this class takes a single argument test_fold, an array of integers of the same size as the data passed to the fitting method.  As the documentation explains “…when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.” That we can do. Suppose that we have training data x_train and y_train and development data x_dev and y_dev laid out as NumPy arrays. We then create a training-and-development set like so:

x = numpy.concatenate([x_train, x_dev])
y = numpy.concatenate([y_train, y_dev])

Then, we create the iterator object:

test_fold = numpy.concatenate([
    # The training data.
    numpy.full(-1, x_train.shape[1], dtype=numpy.int8),
    # The development data.
    numpy.zeros(x_dev.shape[1], dtype=numpy.int8)
])
cv = sklearn.model_selection.PredefinedSplit(test_fold)

Finally, we provide cv as a keyword argument to the grid or random search constructor, and then train. For instance, similar to this example we might do something like:

base = sklearn.ensemble.RandomForestClassifier()
grid = {"bootstrap": [True, False], 
        "max_features": [1, 3, 5, 7, 9, 10]}
model = sklearn.model_select.GridSearchCV(base, grid, cv=cv)
model.fit(x, y)

Now just add n_jobs=-1 to the constructor for model and to spread the work across all your logical cores.

References

Bergstra, J., and Bengio, Y. 2012. Random search for hyperparameter optimization. Journal of Machine Learning Research 13: 281-305.

Text encoding issues in Universal Dependencies

Do you know why the following comparison (in Python 3.7) fails?

>>> s1 = "ड़"
>>> s2 = "ड़"
>>> s1 == s2
False

I’ll give you a hint:

>>> len(s1)
1
>>> len(s2)
2

Despite the two strings rendering identically, they are encoded differently. The string s1 is a single-codepoint sequence, whereas s2 contains two codepoints. Thus string comparison fails, whether it’s done at the level of bytes or of Unicode codepoints.

Some NLP researchers are aware of issues arising from faulty string encoding. Eckhart de Castilho (2016), for example, describes a tool which automatically identifies misencoded pre-trained data, whereas Wu & Yarowsky (2018) report issues using an existing tool for transliteration on certain languages because of encoding issues. However, I suspect that far fewer NLP researchers are familiar with the aforementioned problem, which is specific to Unicode normalization. To put it simply, Unicode defines four normalization forms (and associated conversion algorithms) for strings, and the key distinction is between “composed” and “decomposed” forms of characters (using that term in a pretheoretic sense). The string s1 is composed into a single Unicode codepoint; s2 is decomposed into two.

Unfortunately, three columns of the Hindi Dependency Treebank (hi_hdtb, commit 54c4c0f; Bhat et al. 2017, Palmer et al. 2009) have a chaotic mix of composed and decomposed representations. It seems most if not all of these have to do with the encoding of the six nuqta (‘dot’) consonants, which are usually found in borrowings from Arabic or Persian (via Urdu, presumably). In Devangari these consonants are written by adding a dot to a phonetically similar native consonant; for instance ड [ɖə] plus the nuqta produces ड़ [ɽə]. As is usually the case in Unicode, there is more than one way to do it: you can either encode ड़ with a composed character (U+095C DEVANAGARI LETTER DDDHA) or with the native Devangari character (U+O921 DEVANAGARI LETTER DDA) plus a combining character (U+093C DEVANAGARI SIGN NUKTA). In practical terms, this means that strings containing diferent encodings of <ṛa> (as it is sometimes transliterated) will be treated as totally separate during training and evaluation, except on the off chance that all associated tools perform Unicode normalization ahead of time.

This does have negative consequences for NLP. Consider the UDPipe system (Straka & Straková 2017) at the CoNLL 2017 shared task on dependency parsing (Zeman et al. 2017), for which the primary metric is labeled attachment score (LAS). I first attempted to replicate the UDPipe results for the Hindi Dependency Treebank. Using UDPipe 1.2.0, word2vec (commit 20c129a), the hyperparameters given in the authors’ supplementary materials, and the official evaluation script, I obtain LAS = 87.09 on the “gold tokenization” subtask. However I can improve this simply by converting the training, development, and test data to a consistent normalization like so:

for FILE in *.conllu; do
    TMPFILE="$(mktemp)"
    uconv -x nfkc "${FILE}" > "${TMPFILE}"
    mv "${TMPFILE}" "${FILE}"
done

and then retraining. Here I have chosen to apply the NFKC (“compatibility composed”) normalization form. While Zeman et al. do not discuss the encoding of the labeled Universal Dependencies data, they do mention that they apply NFKC normalization to the addditional raw data. But it doesn’t really matter in this case which you choose so long as you are consistent. After retraining, I obtain LAS = 87.38, or .29 points for free. I also ran an “mismatch” experiment, where the training and testing data have different normalization forms; naturally, this causes a slight degradation to LAS = 86.98.

Straka & Straková (2017) report a separate set of experiments in which they have attempted to rebalance the training-development-test splits. Just to be sure, I repeated the above experiments using their original rebalancing script. With the baseline—mixed normalization—data, I can replicate their result exactly: LAS = 87.30. With a consistent NFKC normalization of training, development and test data, I get LAS = 87.50. And with a normalization mismatch between training and test data, I get LAS = 87.07, a slight degradation. And the improvements are more or less for free.

While I have not yet done a consistent audit, I found three other UD treebanks that have encoding issues. The ar_padt treebank has a non-canonical ordering of combining characters in the lemma column (the shaddah, which indicates geminates, should come before the fathah and not the other way around), but this is unlikely to have any major effect on model performance because it uses this non-canonical ordering consistently. The ko_kaist and ur_udtb treebanks also have minor inconsistencies.

Unfortunately my corporate overlord doesn’t permit me to file a pull request here because of the Hindi data is released under a CC BY-NC-SA license. But if you’re not so constrained, feel free to do so, and ping this thread once you have! And pay attention in the future.

References

Bhat, R. A., Bhatt, R., Farudi, A., Klassen, P., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D. M., Vaidya, A., Vishnu, S. R., and Xia, F. 2017. The Hindi/Urdu Treebank Project. In Ide., N., and Pustejovsky, J. (ed.), The Handbook of Linguistic Annotation, pages 659-698. Springer.
Eckhart de Castilho, R. 2016. Automatic analysis of flaws in pre-trained NLP models. In 3rd International Workshop on Worldwide Language Service Infrastructure and 2nd Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies, pages 19-27.
Palmer, M., Bhatt, R., Narasimhan, B., Rambow, O., Sharma, D. M., and Xia, F. 2009. Hindi syntax: Annotation dependency, lexical predicate-argument structure, and phrase structure. In ICON, pages 14-17.
Straka, M., and Straková, J. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88-99.
Wu, W. and Yarowsky, D. 2018. A comparative study of extremely low-resource transliteration of the world’s languages. In LREC, pages 938-943.
Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., … and Li, J. 2017. CoNLL Shared Task: Multilingual parsing from raw text to Universal Dependencies. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1-19.

Lessons learned from my time at Google

  • C++ 11 is a powerful, elegant language and the right choice for performant general-purpose code. Bash is an excellent lingua franca for chaining a long series of commands. Python is best for everything else.
  • Data should be passed around in schematic form, with a compact serializations over the wire and a human-readable format at rest. Protocol buffers (and the lesser-known text format) are an ideal cross-language solution.
  • Grammar development is more important than model building.
  • Model building is easier than deployment.
  • Whiteboards are useful.
  • I can only do certain sorts of work without an office (yes, that thing with a door).

How, why, and when to flatten your conditionals

You may be tempted to write code that looks a little like this:

for item in items:
     if not condition_1(item):
         if not condition_2(item, False):
             if not condition_3(item, 3, 3):
                 if not condition_4(item):
                     do_work(item)

But please, don’t. Flatten your conditionals instead.

How to flatten your conditionals

There is a relatively straightforward alternative to the above. Instead, we use continue expressions to short-circuit the cascade. This looks a little bit like this:

for item in items:
     if condition_1(item):
         continue
     if condition_2(item, False):
         continue
     if condition_3(item, 3, 3):
         continue
     if condition_4(item):
         continue
     do_work(item)

That’s pretty much all there’s to it.

Perhaps you’re not inside of a loop, but rather inside of a “nullable” function or method (i.e., one which may reasonably return None);  that’s okay, replace continue with return. Perhaps there’s more than one item you’re possibly shipping off to do_work on; that’s okay, wrap the conditionals in a function and use return to short-circuit evaluation. Perhaps you want to terminal the entire loop, not just this iteration thereof; that’s okay, replace continue with break.

Why to flatten your conditionals

The flattened loop is much easier to read. There is no indentation (or bracketing) to track. The fact that each of the conditional expressions is at the same indentation (bracketing) level makes it clear that we’re just dealing with a cascade of conditionals, all of which are handled the same. Realistically, one can only visually parse 3-4 levels of indentation; the Linux kernel, for example, uses an 8-character indent and forbids more than 3 levels of indentation. Flattening your conditionals means you don’t have to deal with that very often.

When to flatten your conditionals

It’s perhaps preferable to write your early code with nested conditional statements. It may turn out that you need to do some work in one of the medial else: clauses (which we’ve elided here), which can make flattening the conditionals hard. But once you’re writing comments about the conditionals in the cascade, and preparing to share your code with others, it’s time to do away with more than a few layers of indentation.

Understanding text encoding in Python 2 and Python 3

Computers were rather late to the word processing game. The founding mothers and fathers of computing were primarily interested in numbers. This is fortunate: after all, computers only know about numbers. But as Brian Kunde explains in his brief history of word processing, word processing existed long before digital computing, and the text processing has always been something of an afterthought.

Humans think of text as consisting of ordered sequence of “characters” (an ill-defined Justice-Stewart-type concept which I won’t attempt to clarify here). To manipulate text in digital computers, we have to have a mapping between the character set (a finite list of the characters the system recognizes) and numbers. Encoding is the process of converting characters to numbers, and decoding is (naturally) the process of converting numbers to characters. Before we get to Python, a bit of history.

ASCII and Unicode

There are only a few character sets that have any relevance to life in 2014. The first is ASCII (American Standard Code for Information Interchange), which was first published in 1963. This character set consists of 128 characters intended for use by an English audience. Of these 95 are printable, meaning that they correspond to lay-human notions about characters. On a US keyboard, these are (approximately) the alphanumeric and punctuation characters that can be typed with a single keystroke, or with a single keystroke while holding down the Shift key, space, tab, the two newline characters (which you get when you type return), and a few apocrypha. The remaining 33 are non-printable “control characters”. For instance, the first character in the ASCII table is the “null byte”. This is indicated by a '' in C and other languages, but there’s no standard way to render it. Many control characters were designed for earlier, more innocent times; for instance, character #7 'a' tells the receiving device to ring a cute little bell (which were apparently attached to teletype terminals); today your computer might make a beep, or the terminal window might flicker once, but either way, nothing is printed.

Of course, this is completely inadequate for anything but English (not to mention those users of superfluous diaresis…e.g., the editors of the New Yorker, Motörhead). However, each ASCII character takes up only 7 bits, leaving room for another 128 characters (since a byte has an integer value between 0-255, inclusive), and so engineers could exploited the remaining 128 characters to write the characters from different alphabets, alphasyllabaries, or syllabaries. Of these ASCII-based character sets, the best-known are ISO/IEC 8859-1, also known as Latin-1, and Windows-1252, also known as CP-1252. Unfortunately, this created more problems than it solved. That last bit just didn’t leave enough space for the many languages which need a larger character set (Japanese kanji being an obvious example). And even when there are technically enough code points left over, engineers working in different languages didn’t see eye-to-eye about what to do with them. As a result, the state of affairs made it impossible to, for example, write in French (ISO/IEC 8859-1) about Ukrainian (ISO/IEC 8859-5, at least before the 1990 orthography reform).

Clearly, fighting over scraps isn’t going to cut it in the global village. Enter the Unicode standard and its Universal Character Set (UCS), first published in 1991. Unicode is the platonic ideal of an character encoding, abstracting away from the need to efficiently convert all characters to numbers. Each character is represented by a single code with various metadata (e.g., A is an “Uppercase Letter” from the “Latin” script). ASCII and its extensions map onto a small subset of this code.

Fortunately, not all encodings are merely shadows on the walls of a cave. The One True Encoding is UTF-8, which implements the entire UCS using an 8-bit code. There are other encodings, of course, but this one is ours, and I am not alone in feeling strongly that UTF-8 is the chosen encoding. At the risk of getting too far afield, here are two arguments for why you and everyone you know should just use UTF-8. First off, it is hardly matters much which UCS-compatible encoding we all use (the differences between them are largely arbitrary), but what does matter is that we all choose the same one. There is no general procedure for “sniffing” out the encoding of a file, and  there’s nothing preventing you from coming up with a file that’s a French cookbook in one encoding, and a top-secret message in another. This is good for steganographers, but bad for the rest of us, since so many text files lack encoding metadata. When it comes to encodings, there’s no question that UTF-8 is the most popular Unicode encoding scheme worldwide, and is on its way to becoming the de-facto standard. Secondly, ASCII is valid UTF-8, because UTF-8 and ASCII encode the ASCII characters in exactly the same way. What this means, practically speaking, is you can achieve nearly complete coverage of the world’s languages simply by assuming that all the inputs to your software are UTF-8. This is a big, big win for us all.

Decode early, encode late

A general rule of thumb for developers is “decode early” (convert inputs to their Unicode representation), “encode late” (convert back to bytestrings). The reason for this is that in nearly any programming language, Unicode strings behave the way our monkey brains expect them to, but bytestrings do not. To see why, try iterating over non-ASCII bytestring in Python (more on the syntax later).

>>> for byte in b"año":
...     print(byte)
...
a
?
?
o

There are two surprising things here: iterating over the bytestring returned more bytes then there are “characters” (goodbye, indexing), and furthermore the 2nd “character” failed to render properly. This is what happens when you let computers dictate the semantics to our monkey brains, rather than the other way around. Here’s what happens when we try the same with a Unicode string:

>>> for byte in u"año":
...     print(byte)
...
a
ñ
o

The Python 2 & 3 string models

Before you put this all into practice, it is important to note that Python 2 and Python 3 use very different string models. The familiar Python 2 str class is a bytestring. To convert it to a Unicode string, use the str.decode instance method, which returns a copy of the string as an instance of the unicode class. Similarly, you can make a str copy of a unicode instance with unicode.encode. Both of these functions take a single argument: a string (either kind!) representing the encoding.

Python 2 provides specific syntax for Unicode string literals (which you saw above): the a lower-case u prefix before the initial quotation mark (as in u"año").

When it comes to Unicode-awareness, Python 3 has totally flipped the script; in my opinion, it’s for the best. Instances of str are now Unicode strings (the u"" syntax still works, but is vacuous). The (reduced) functionality of the old-style strings is now just available for instances of the class bytes. As you might expect, you can create a bytes instance by using the encode method of a new-style str. Python 3 decodes bytestrings as soon as they are created, and (re)encodes Unicode strings only at the interfaces; in other words, it gets the “early/late” stuff right by default. Your APIs probably won’t need to change much, because Python 3 treats UTF-8 (and thus ASCII) as the default encoding, and this assumption is valid more often than not.

If for some reason, you want a bytestring literal, Python has syntax for that, too: prefix the quotation marks delimiting the string with a lower-case b (as in b"año"; see above also).

tl;dr

Strings are ordered sequences of characters. But computers only know about numbers, so they are encoded as byte arrays; there are many ways to do this, but UTF-8 is the One True Encoding. To get the strings to have the semantics you expect as a human, decode a string to Unicode as early as possible, and encode it as bytes as late as possible. You have to do this explicitly in Python 2; it happens automatically in Python 3.

Further reading

For more of the historical angle, see Joel Spolsky’s The absolute minimum every software developer absolutely, positively must know About Unicode and character sets (no excuses!).

Simpler sentence boundary detection

Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank:

Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.

This sentence contains 4 periods, but only the last denotes a sentence boundary. It’s obvious that the first one in U.S. is unambiguously part of an acronym, not a sentence boundary, and the same is true of expressions like $12.53. But the periods at the end of Inc. and U.S. could easily have been on the left edge of a sentence boundary; it just turns out they’re not. Humans can use local context to determine that neither of these are likely to be sentence boundaries; for example, the verb expect selects two arguments (an object its U.S. sales and the infinitival clause to remain steady…), neither of which would be satisfied if U.S. was sentence-final. Similarly, not all question marks or exclamation points are sentence-final (strictu sensu):

He says the big questions–“Do you really need this much money to put up these investments? Have you told investors what is happening in your sector? What about your track record?–“aren’t asked of companies coming to market.

Much of the available data for natural language processing experiment—including the enormous Gigaword corpus—does not include annotations for sentence boundaries providence annotations for sentence boundaries. In Gigaword, for example, paragraphs and articles are annotated, but paragraphs may contain internal sentence boundaries, which are not indicated in any way. In natural language processing (NLP), this task is known as sentence boundary detection (SBD). [1] SBD is one of the earliest steps in many natural language processing (NLP) pipelines, and since errors at this step are very likely to propagate, it is particularly important to just Get It Right.

An important component of this problem is the detection of abbreviations and acronyms, since a period ending an abbreviation is generally not a sentence boundary. But some abbreviations and acronyms do sometimes occur in sentence-final position (for instance, in the Wall St. Journal portion of the Penn Treebank, there are 99 sentence-final instances of U.S.). In this context, English writers generally omit one period, a sort of orthographic haplology.

NLTK provides an implementation of Punkt (Kiss & Strunk 2006), an unsupervised sentence boundary detection system; perhaps because it is easily available, it has been widely used. Unfortunately, Punkt is simply not very accurate compared to other systems currently available. Promising early work by Riley (1989) suggested a different way: a supervised classifier (in Riley’s case, a decision tree). Gillick (2009) achieved the best published numbers on the “standard split” for this task using another classifier, namely a support vector machine (SVM) with a linear kernel; Gillick’s features are derived from the words to the left and right of a period. Gillick’s code has make available under the name Splitta.

I recently attempted to construct my own SBD system, loosely inspired by Splitta, but expanding the system to handle ellipses (), question marks, exclamation points, or sentence-final punctuation marks. Since Gillick found no benefits from tweaking the hyperparameters of the SVM, I used a hyperparameter-free classifier, the averaged perceptron (Freund & Schapire 1999). After performing a stepwise feature ablation, I settled on a relatively small set of features, extracted as follows. Candidate boundaries are identified using a nasty regular expression. If the left or right contextual tokens match a regular expression for American English numbers (including prices, decimals, negatives, etc.), they are merged into a special token *NUMBER* (per Kiss & Strunk 2006); a similar approach is used to convert various types of quotation marks into *QUOTE*. The following features were then extracted:

  • the identity of the punctuation mark
  • identity of L and R (Reynar & Ratnaparkhi 1997, etc.)
  • the joint identity of both L and R (Gillick 2009)
  • does L contain a vowel? (Mikheev 2002)
  • does L contain a period? (Grefenstette 1999)
  • length of L (Riley 1989)
  • case of L and R (Riley 1989)

This 8-feature system performed exceptionally well on the “standard split”, with an accuracy of .9955, an F-score of .9971, and just 46 errors in all. This is very comparable with the results I obtained with a fork of Splitta extended to handle ellipses, question marks, etc.; this forked system produced 55 errors.

I have made my system freely available as a Python 3 module (and command-line tool) under the name DetectorMorse. Both code and dependencies are pure Python, so it can be run using pypy3, if you’re in a hurry.

Endnotes

[1] Or, sometimes, sentence boundary disambiguationsentence segmentationsentence splitting, sentence tokenization, etc.

References

Y. Freund & R.E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning 37(3): 277-296.
D. Gillick. 2009. Sentence boundary detection and the problem with the U.S. In Proc. NAACL-HLT, pages 241-244.
G. Grefenstette. 1999. Tokenization. In H. van Halteren (ed.), Syntactic wordclass tagging, pages 117-133. Dordrecht: Kluwer.
T. Kiss & J. Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4): 485-525.
A. Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28(3): 289-318.
J.C. Reynar & A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proc. 5th Conference on Applied Natural Language Processing, pages 16-19.
M.D. Riley. 1989. Some applications of tree-based modelling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop, pages 339-352.