Simpler sentence boundary detection

Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank:

Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.

This sentence contains 4 periods, but only the last denotes a sentence boundary. It’s obvious that the first one in U.S. is unambiguously part of an acronym, not a sentence boundary, and the same is true of expressions like $12.53. But the periods at the end of Inc. and U.S. could easily have been on the left edge of a sentence boundary; it just turns out they’re not. Humans can use local context to determine that neither of these are likely to be sentence boundaries; for example, the verb expect selects two arguments (an object its U.S. sales and the infinitival clause to remain steady…), neither of which would be satisfied if U.S. was sentence-final. Similarly, not all question marks or exclamation points are sentence-final (strictu sensu):

He says the big questions–“Do you really need this much money to put up these investments? Have you told investors what is happening in your sector? What about your track record?–“aren’t asked of companies coming to market.

Much of the available data for natural language processing experiment—including the enormous Gigaword corpus—does not include annotations for sentence boundaries providence annotations for sentence boundaries. In Gigaword, for example, paragraphs and articles are annotated, but paragraphs may contain internal sentence boundaries, which are not indicated in any way. In natural language processing (NLP), this task is known as sentence boundary detection (SBD). [1] SBD is one of the earliest steps in many natural language processing (NLP) pipelines, and since errors at this step are very likely to propagate, it is particularly important to just Get It Right.

An important component of this problem is the detection of abbreviations and acronyms, since a period ending an abbreviation is generally not a sentence boundary. But some abbreviations and acronyms do sometimes occur in sentence-final position (for instance, in the Wall St. Journal portion of the Penn Treebank, there are 99 sentence-final instances of U.S.). In this context, English writers generally omit one period, a sort of orthographic haplology.

NLTK provides an implementation of Punkt (Kiss & Strunk 2006), an unsupervised sentence boundary detection system; perhaps because it is easily available, it has been widely used. Unfortunately, Punkt is simply not very accurate compared to other systems currently available. Promising early work by Riley (1989) suggested a different way: a supervised classifier (in Riley’s case, a decision tree). Gillick (2009) achieved the best published numbers on the “standard split” for this task using another classifier, namely a support vector machine (SVM) with a linear kernel; Gillick’s features are derived from the words to the left and right of a period. Gillick’s code has make available under the name Splitta.

I recently attempted to construct my own SBD system, loosely inspired by Splitta, but expanding the system to handle ellipses (), question marks, exclamation points, or sentence-final punctuation marks. Since Gillick found no benefits from tweaking the hyperparameters of the SVM, I used a hyperparameter-free classifier, the averaged perceptron (Freund & Schapire 1999). After performing a stepwise feature ablation, I settled on a relatively small set of features, extracted as follows. Candidate boundaries are identified using a nasty regular expression. If the left or right contextual tokens match a regular expression for American English numbers (including prices, decimals, negatives, etc.), they are merged into a special token *NUMBER* (per Kiss & Strunk 2006); a similar approach is used to convert various types of quotation marks into *QUOTE*. The following features were then extracted:

  • the identity of the punctuation mark
  • identity of L and R (Reynar & Ratnaparkhi 1997, etc.)
  • the joint identity of both L and R (Gillick 2009)
  • does L contain a vowel? (Mikheev 2002)
  • does L contain a period? (Grefenstette 1999)
  • length of L (Riley 1989)
  • case of L and R (Riley 1989)

This 8-feature system performed exceptionally well on the “standard split”, with an accuracy of .9955, an F-score of .9971, and just 46 errors in all. This is very comparable with the results I obtained with a fork of Splitta extended to handle ellipses, question marks, etc.; this forked system produced 55 errors.

I have made my system freely available as a Python 3 module (and command-line tool) under the name DetectorMorse. Both code and dependencies are pure Python, so it can be run using pypy3, if you’re in a hurry.

Endnotes

[1] Or, sometimes, sentence boundary disambiguationsentence segmentationsentence splitting, sentence tokenization, etc.

References

Y. Freund & R.E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning 37(3): 277-296.
D. Gillick. 2009. Sentence boundary detection and the problem with the U.S. In Proc. NAACL-HLT, pages 241-244.
G. Grefenstette. 1999. Tokenization. In H. van Halteren (ed.), Syntactic wordclass tagging, pages 117-133. Dordrecht: Kluwer.
T. Kiss & J. Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4): 485-525.
A. Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28(3): 289-318.
J.C. Reynar & A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proc. 5th Conference on Applied Natural Language Processing, pages 16-19.
M.D. Riley. 1989. Some applications of tree-based modelling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop, pages 339-352.

11 thoughts on “Simpler sentence boundary detection”

  1. 1. What dataset / corpus was used for training and testing ?
    2. Given this is imbalanced class problem, wouldn’t weighted F1 be a better metric ?

  2. Re: #1: the same as in the Gillick paper; it’s drawn from the WSJ portion of Penn Treebank.

    Re: #2: F1 would make sense. I’m not sure why you’d want to use a weighted variant of F-score unless you have some strong prior that precision is more important than recall or vice versa, and I don’t. That said, I’m almost positive you’ll get the same system ranking either way.

    1. Thanks Kyle.

      I want to clarify that by weighted F1, I do not imply that beta value needs to be modified. It is rather a weighted average of F1 of each of the classes. [In scikit-learn this can be obtained by sklearn.metrics.f1_score(average=’weighted’)]

      I have a query regarding the nasty regex. Why do have (s+)s*(S+) in the right context ? why not just (s*)(S+) ?

  3. Oh, I understand now. While that’s one reasonable evaluation setting, in absence of further evidence I generally prefer to just optimize for accuracy.

    I think if we want to think in information-retrieval-type terms we ought to think about F-scores in terms of “sentences correctly retrieved”, where a sentence would be incorrectly retrieved in any of the following cases:

    * the left boundary is not correctly predicted
    * the right boundary is not correctly predicted
    * a boundary is incorrectly predicted in the middle of the sentence

    This would give a more construct-valid notion for evaluation.

    As for the regex there, it got corrupted in the conversion to a new blog (some of the escaped characters were deleted) several years ago (as you can see, this post is 5 years old now). Refer to the library itself for the regexes (in the actual library they’re broken up into a few pieces).

    1. Install the package (pip install DetectorMorse) then run python -m detectormorse --help. If you don’t want to use it from the command line, it’s documented in-module.

  4. Can this tool deal with the condition that there is no space after period in the text?
    For example: “Today is a sunny day.I want to go swimming in the afternoon.”
    In this sentence, there is no space betwwen”day” and “I”.

    1. It could be made to, though by default that doesn’t match the regular expression it uses to find candidate boundaries. You’d just have to change the regex constants in detector.py.

      1. Thank you for your instructions. I tried to modify the regex constants in the detector.py but failed. could you teach me how to alter them in order to meet my unreasonable requirements? ( If it is too much trouble , forget it)
        Thank you for your help and patient reply again.

Leave a Reply

Your email address will not be published. Required fields are marked *