Simpler sentence boundary detection

Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank:

Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.

This sentence contains 4 periods, but only the last denotes a sentence boundary. It’s obvious that the first one in U.S. is unambiguously part of an acronym, not a sentence boundary, and the same is true of expressions like $12.53. But the periods at the end of Inc. and U.S. could easily have been on the left edge of a sentence boundary; it just turns out they’re not. Humans can use local context to determine that neither of these are likely to be sentence boundaries; for example, the verb expect selects two arguments (an object its U.S. sales and the infinitival clause to remain steady…), neither of which would be satisfied if U.S. was sentence-final. Similarly, not all question marks or exclamation points are sentence-final (strictu sensu):

He says the big questions–“Do you really need this much money to put up these investments? Have you told investors what is happening in your sector? What about your track record?–“aren’t asked of companies coming to market.

Much of the available data for natural language processing experiment—including the enormous Gigaword corpus—does not include annotations for sentence boundaries providence annotations for sentence boundaries. In Gigaword, for example, paragraphs and articles are annotated, but paragraphs may contain internal sentence boundaries, which are not indicated in any way. In natural language processing (NLP), this task is known as sentence boundary detection (SBD). [1] SBD is one of the earliest steps in many natural language processing (NLP) pipelines, and since errors at this step are very likely to propagate, it is particularly important to just Get It Right.

An important component of this problem is the detection of abbreviations and acronyms, since a period ending an abbreviation is generally not a sentence boundary. But some abbreviations and acronyms do sometimes occur in sentence-final position (for instance, in the Wall St. Journal portion of the Penn Treebank, there are 99 sentence-final instances of U.S.). In this context, English writers generally omit one period, a sort of orthographic haplology.

NLTK provides an implementation of Punkt (Kiss & Strunk 2006), an unsupervised sentence boundary detection system; perhaps because it is easily available, it has been widely used. Unfortunately, Punkt is simply not very accurate compared to other systems currently available. Promising early work by Riley (1989) suggested a different way: a supervised classifier (in Riley’s case, a decision tree). Gillick (2009) achieved the best published numbers on the “standard split” for this task using another classifier, namely a support vector machine (SVM) with a linear kernel; Gillick’s features are derived from the words to the left and right of a period. Gillick’s code has make available under the name Splitta.

I recently attempted to construct my own SBD system, loosely inspired by Splitta, but expanding the system to handle ellipses (), question marks, exclamation points, or sentence-final punctuation marks. Since Gillick found no benefits from tweaking the hyperparameters of the SVM, I used a hyperparameter-free classifier, the averaged perceptron (Freund & Schapire 1999). After performing a stepwise feature ablation, I settled on a relatively small set of features, extracted as follows. Candidate boundaries are identified using a nasty regular expression. If the left or right contextual tokens match a regular expression for American English numbers (including prices, decimals, negatives, etc.), they are merged into a special token *NUMBER* (per Kiss & Strunk 2006); a similar approach is used to convert various types of quotation marks into *QUOTE*. The following features were then extracted:

  • the identity of the punctuation mark
  • identity of L and R (Reynar & Ratnaparkhi 1997, etc.)
  • the joint identity of both L and R (Gillick 2009)
  • does L contain a vowel? (Mikheev 2002)
  • does L contain a period? (Grefenstette 1999)
  • length of L (Riley 1989)
  • case of L and R (Riley 1989)

This 8-feature system performed exceptionally well on the “standard split”, with an accuracy of .9955, an F-score of .9971, and just 46 errors in all. This is very comparable with the results I obtained with a fork of Splitta extended to handle ellipses, question marks, etc.; this forked system produced 55 errors.

I have made my system freely available as a Python 3 module (and command-line tool) under the name DetectorMorse. Both code and dependencies are pure Python, so it can be run using pypy3, if you’re in a hurry.

Endnotes

[1] Or, sometimes, sentence boundary disambiguationsentence segmentationsentence splitting, sentence tokenization, etc.

References

Y. Freund & R.E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning 37(3): 277-296.
D. Gillick. 2009. Sentence boundary detection and the problem with the U.S. In Proc. NAACL-HLT, pages 241-244.
G. Grefenstette. 1999. Tokenization. In H. van Halteren (ed.), Syntactic wordclass tagging, pages 117-133. Dordrecht: Kluwer.
T. Kiss & J. Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4): 485-525.
A. Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28(3): 289-318.
J.C. Reynar & A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proc. 5th Conference on Applied Natural Language Processing, pages 16-19.
M.D. Riley. 1989. Some applications of tree-based modelling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop, pages 339-352.

UNIX AV club

[This post was written as a supplement to CS506/606: Research Programming at OHSU.]

SoX and FFmpeg are fast, powerful command-line tools for manipulating audio and video data, respectively. In this short tutorial, I’ll show how to use these tools for two very common tasks: 1) resampling and 2) (de)multiplexing. Both tools are available from your favorite package manager  (like Homebrew or apt-get).

SoX and friends

SoX is a suite of programs for manipulating audio files. Commands are of the form:

sox [flag ...] infile1 [...] outfile [effect effect-options] ...

That is, the command sox, zero or more global flags, one or more input files, one output file, and then a list of “effects” to apply. Unlike most UNIX command-line programs, though, SoX actually cares about file extensions. If the input file is in.wav it better be a WAV file; if the output file is out.flac it will be encoded in the FLAC (“free lossless audio codec”) format.

The simplest invocation of sox converts audio files to new formats. For instance, the following would use audio.wav to create a new FLAC file audio.flac with the same same bit depth and sample rate.

sox audio.wav audio.flac

Concatenating audio files is only slightly more complicated. The following would concatenate 01_Intro.wav and 02_Untitled.wav together into a new file concatenated.wav.

sox 01_Intro.wav 02_Untitled.wav concatenated.wav

Resampling with SoX

But SoX really shines for resampling audio. For this, use the rate effect. The following would downsample the CD-quality (44.1 kHz) audio in CD.wav to the standard sample rate used on telephones (8 kHz) and store the result in telephone.wav

sox CD.wav telephone.wav rate 8k

There are two additional effects you may want to invoke when resampling. First, you may want to “dither” the audio. As man sox explains:

Dithering is a technique used to maximize the dynamic range of audio stored at a particular bit-depth. Any distortion introduced by quantization is decorrelated by adding a small amount of white noise to the signal. In most cases, SoX can determine whether the selected processing requires dither and will add it during output formatting if appropriate.

The following would resample to the telephone rate with dithering (if necessary).

sox CD.wav telephone.wav rate 8k dither -s

Finally, when resampling audio, you may want to invoke the gain effect to avoid clipping. This can be done using the -G (“Gain”) global option.

sox -G CD.wav telephone.wav rate 8k dither -s

(De)multiplexing with SoX

The SoX remix effect is useful for manipulating multichannel audio. The following would remix a multi-channel audio file stereo.wav down to mono.

sox stereo.wav mono.wav remix -

We also can split a stereo file into two mono files.

sox stereo.wav left.wav remix 1
sox stereo.wav right.wav remix 2

Finally, we can merge two mono files together to create one stereo file using the -M (“merge”) global option; this file should be identical to stereo.wav.

sox -M left.wav right.wav stereo2.wav

Other SoX goodies

There are three other useful utilities in SoX: soxi prints information extracted from audio file headers, play uses the SoX libraries to play audio files, and rec records new audio files using a microphone.

FFmpeg

The FFmpeg suite is to video files what SoX is to audio. Commands are of the form:

ffmpeg [flag ...] [-i infile1 ...] [-effect ...] [outfile]

The -acodec and -vcodec effects can be used to extract the audio and video streams from a video file, respectively; this process sometimes known as demuxing (short for “de-multiplexing”).

ffmpeg -i both.mp4 -acodec copy -vn audio.ac3
...
ffmpeg -i both.mp4 -vcodec copy -an video.h264
...

We can also mux (“multiplex”) them back together.

ffmpeg -i video.h264 -i audio.ac3 -vcodec copy -acodec copy both2.mp4

Hopefully that’ll get you started. Both programs have excellent manual pages; read them!

Gigaword English preprocessing

I recently took a little time out to coerce a recent version of the LDC’s Gigaword English corpus into a format that could be used for training conventional n-gram models. This turned out to be harder than I expected.

Decompression

Gigaword English (v. 5) ships with 7 directories of gzipped SGML data, one directory for each of the news sources. The first step is, obviously enough, to decompress these files, which can be done with gunzip.

SGML to XML

The resulting files are, alas, not XML files, which for all their verbosity can be parsed in numerous elegant ways. In particular, the decompressed Gigaword files do not contain a root node: each story is inside of <DOC> tags at the top level of the hierarchy. While this might be addressed by simply adding in a top-level tag, the files also contain a few  “entities” (e.g., &amp;) which ideally should be replaced by their actual referent. Simply inserting the Gigaword Document Type Definition, or DTD, at the start of each SGML file was sufficient to convert the Gigaword files to valid SGML.

I also struggled to find software for SGML-to-XML conversion; is this not something other people regularly want to do? I ultimately used an ancient library called OpenSP (open-sp in Homebrew), in particular the command osx. This conversion throws a small number of errors due to unexpected presence of UTF-8 characters, but these can be ignored (with the flag -E 0).

XML to text

Each Gigaword file contains a series of <DOC> tags, each representing a single news story. These tags have four possible type attributes; the most common one, story, is the only one which consistently contains coherent full sentences and paragraphs. Immediately underneath <DOC> in this hierarchy are two tags: <HEADLINE> and <TEXT>. While it would be fun to use the former for a study of Headlinese, <TEXT>—the tag surrounding the document body—is generally more useful. Finally, good old-fashioned <p> (paragraph) tags are the only children of <TEXT>. I serialized “story” paragraphs using the lxml library in Python. This library supports the elegant XPath query language. To select paragraphs of “story” documents, I used the XPath query /GWENG/DOC[@type="story"]/TEXT, stripped whitespace, and then encoded the text as UTF-8.

Text to sentences

The resulting units are paragraphs (with occasional uninformative line breaks), not sentences. Python’s NLTK module provides an interface to the Punkt sentence tokenizer. However, thanks to this Stack Overflow post, I became aware of its limitations. Here’s a difficult example from Moby Dick, with sentence boundaries (my judgements) indicated by the pipe character (|):

A clam for supper? | a cold clam; is THAT what you mean, Mrs. Hussey?” | says I, “but that’s a rather cold and clammy reception in the winter time, ain’t it, Mrs. Hussey?”

But, the default sentence tokenizer insists on sentence breaks immediately after both occurrences of “Mrs.”. To remedy this, I replaced the space after titles like “Mrs.”
(the full list of such abbreviations was adapted from GPoSTTL) with an underscore so as to “bleed” the sentence tokenizer, then replaced the underscore with a space after tokenization was complete. That is, the sentence tokenizer sees word tokens like “Mrs._Hussey”; since sentence boundaries must line up with word token boundaries, there is no chance a space will be inserted here. With this hack, the sentence tokenizer does that snippet of Moby Dick just right.

Sentences to tokens

For the last step, I used NLTK’s word tokenizer (nltk.tokenize.word_tokenize), which is similar to the (in)famous Treebank tokenizer, and then case-folded the resulting tokens.

Summary

In all, 170 million sentences, 5 billion word tokens, and 22 billion characters, all of which fits into 7.5 GB (compressed). Best of luck to anyone looking to do the same!