Simpler sentence boundary detection

Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank:

Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.

This sentence contains 4 periods, but only the last denotes a sentence boundary. It’s obvious that the first one in U.S. is unambiguously part of an acronym, not a sentence boundary, and the same is true of expressions like $12.53. But the periods at the end of Inc. and U.S. could easily have been on the left edge of a sentence boundary; it just turns out they’re not. Humans can use local context to determine that neither of these are likely to be sentence boundaries; for example, the verb expect selects two arguments (an object its U.S. sales and the infinitival clause to remain steady…), neither of which would be satisfied if U.S. was sentence-final. Similarly, not all question marks or exclamation points are sentence-final (strictu sensu):

He says the big questions–“Do you really need this much money to put up these investments? Have you told investors what is happening in your sector? What about your track record?–“aren’t asked of companies coming to market.

Much of the available data for natural language processing experiment—including the enormous Gigaword corpus—does not include annotations for sentence boundaries providence annotations for sentence boundaries. In Gigaword, for example, paragraphs and articles are annotated, but paragraphs may contain internal sentence boundaries, which are not indicated in any way. In natural language processing (NLP), this task is known as sentence boundary detection (SBD). [1] SBD is one of the earliest steps in many natural language processing (NLP) pipelines, and since errors at this step are very likely to propagate, it is particularly important to just Get It Right.

An important component of this problem is the detection of abbreviations and acronyms, since a period ending an abbreviation is generally not a sentence boundary. But some abbreviations and acronyms do sometimes occur in sentence-final position (for instance, in the Wall St. Journal portion of the Penn Treebank, there are 99 sentence-final instances of U.S.). In this context, English writers generally omit one period, a sort of orthographic haplology.

NLTK provides an implementation of Punkt (Kiss & Strunk 2006), an unsupervised sentence boundary detection system; perhaps because it is easily available, it has been widely used. Unfortunately, Punkt is simply not very accurate compared to other systems currently available. Promising early work by Riley (1989) suggested a different way: a supervised classifier (in Riley’s case, a decision tree). Gillick (2009) achieved the best published numbers on the “standard split” for this task using another classifier, namely a support vector machine (SVM) with a linear kernel; Gillick’s features are derived from the words to the left and right of a period. Gillick’s code has make available under the name Splitta.

I recently attempted to construct my own SBD system, loosely inspired by Splitta, but expanding the system to handle ellipses (), question marks, exclamation points, or sentence-final punctuation marks. Since Gillick found no benefits from tweaking the hyperparameters of the SVM, I used a hyperparameter-free classifier, the averaged perceptron (Freund & Schapire 1999). After performing a stepwise feature ablation, I settled on a relatively small set of features, extracted as follows. Candidate boundaries are identified using the following nasty regular expression:

/(S+)s*((.+)|([!?]))(['`")}]]*)(s+)s*(S+)/

The first group matches the left token L, and the last group matches the right token R. If the L or R tokens match a regular expression for American English numbers (including prices, decimals, negatives, etc.), they are merged into a special token *NUMBER* (per Kiss & Strunk 2006); a similar approach is used to convert various types of quotation marks into *QUOTE*. The following features were then extracted:

  • the identity of the punctuation mark
  • identity of L and R (Reynar & Ratnaparkhi 1997, etc.)
  • the joint identity of both L and R (Gillick 2009)
  • does L contain a vowel? (Mikheev 2002)
  • does L contain a period? (Grefenstette 1999)
  • length of L (Riley 1989)
  • case of L and R (Riley 1989)

This 8-feature system performed exceptionally well on the “standard split”, with an accuracy of .9955, an F-score of .9971, and just 46 errors in all. This is very comparable with the results I obtained with a fork of Splitta extended to handle ellipses, question marks, etc.; this forked system produced 55 errors.

I have made my system freely available as a Python 3 module (and command-line tool) under the name DetectorMorse. Both code and dependencies are pure Python, so it can be run using pypy3, if you’re in a hurry.

Endnotes

[1] Or, sometimes, sentence boundary disambiguationsentence segmentationsentence splitting, sentence tokenization, etc.

References

Y. Freund & R.E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning 37(3): 277-296.
D. Gillick. 2009. Sentence boundary detection and the problem with the U.S. In Proc. NAACL-HLT, pages 241-244.
G. Grefenstette. 1999. Tokenization. In H. van Halteren (ed.), Syntactic wordclass tagging, pages 117-133. Dordrecht: Kluwer.
T. Kiss & J. Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4): 485-525.
A. Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28(3): 289-318.
J.C. Reynar & A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proc. 5th Conference on Applied Natural Language Processing, pages 16-19.
M.D. Riley. 1989. Some applications of tree-based modelling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop, pages 339-352.

Leave a Reply

Your email address will not be published. Required fields are marked *