<h1>Gigaword English preprocessing</h1>
<p><em>2013-12-21</em></p>
<p>I recently took a little time out to coerce a recent version of the <a title="Gigaword English" href="http://catalog.ldc.upenn.edu/LDC2011T07">LDC's Gigaword English corpus</a> into a format that could be used for training conventional <em>n</em>-gram models. This turned out to be harder than I expected.</p>
<h2>Decompression</h2>
<p>Gigaword English (v. 5) ships with seven directories of gzipped SGML data, one directory for each news source. The first step, obviously enough, is to decompress these files, which can be done with <code>gunzip</code>.</p>
<h2>SGML to XML</h2>
<p>The resulting files are, alas, not XML, which for all its verbosity can be parsed in numerous elegant ways. In particular, the decompressed Gigaword files lack a root node: each story sits inside <code>&lt;DOC&gt;</code> tags at the top level of the hierarchy. While this could be addressed by simply adding a top-level tag, the files also contain a few "entities" (e.g., <code>&amp;amp;</code>) which ideally should be replaced by their actual referents. Simply inserting the Gigaword <a title="Document Type Definition" href="http://en.wikipedia.org/wiki/Document_type_definition">Document Type Definition</a> (DTD) at the start of each SGML file was sufficient to make the files valid SGML.</p>
<p>I also struggled to find software for SGML-to-XML conversion; is this not something other people regularly want to do?
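<p>Prepending the DTD to each decompressed file is easy to script. A minimal sketch; the <code>GWENG</code> root element comes from the XPath query used later, but the DOCTYPE line and DTD filename here are assumptions, not the corpus's actual declaration:</p>

```python
# Make each decompressed Gigaword SGML file valid by prepending a DOCTYPE
# that points at the corpus DTD. The filename "gigaword_e.dtd" is a
# placeholder; use whichever DTD ships with the corpus.
def prepend_dtd(sgml_text, dtd_path="gigaword_e.dtd"):
    doctype = '<!DOCTYPE GWENG SYSTEM "{}">\n'.format(dtd_path)
    return doctype + sgml_text
```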
I ultimately used an ancient library called OpenSP (<code>open-sp</code> in Homebrew), in particular its <code>osx</code> command. The conversion throws a small number of errors due to the unexpected presence of UTF-8 characters, but these can be ignored (with the flag <code>-E 0</code>).</p>
<h2>XML to text</h2>
<p>Each Gigaword file contains a series of <code>&lt;DOC&gt;</code> tags, each representing a single news story. These tags have four possible <code>type</code> attributes; the most common, <code>story</code>, is the only one that consistently contains coherent full sentences and paragraphs. Immediately beneath <code>&lt;DOC&gt;</code> in the hierarchy are two tags: <code>&lt;HEADLINE&gt;</code> and <code>&lt;TEXT&gt;</code>. While it would be fun to use the former for a study of <a title="Headlinese" href="http://en.wikipedia.org/wiki/Headlinese">Headlinese</a>, <code>&lt;TEXT&gt;</code>, the tag surrounding the document body, is generally more useful. Finally, good old-fashioned <code>&lt;p&gt;</code> (paragraph) tags are the only children of <code>&lt;TEXT&gt;</code>. I serialized the "story" paragraphs using the <code>lxml</code> library in Python, which supports the elegant <a title="XPath" href="http://en.wikipedia.org/wiki/Xpath">XPath</a> query language. To select paragraphs of "story" documents, I used the XPath query <code>/GWENG/DOC[@type="story"]/TEXT</code>, stripped whitespace, and encoded the text as UTF-8.</p>
<h2>Text to sentences</h2>
<p>The resulting units are paragraphs (with occasional uninformative line breaks), not sentences. Python's <a title="NLTK" href="http://nltk.org">NLTK</a> module provides an interface to the Punkt sentence tokenizer.
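<p>The XPath extraction described above can be sketched as follows. This uses the standard library's <code>ElementTree</code> rather than <code>lxml</code> (its limited XPath support suffices here), and the uppercase <code>P</code> tag is an assumption about the converted markup:</p>

```python
import xml.etree.ElementTree as ET

def story_paragraphs(xml_text):
    # Select paragraphs of "story" documents, mirroring the XPath query
    # /GWENG/DOC[@type="story"]/TEXT, then strip whitespace (collapsing
    # stray internal line breaks in the process).
    root = ET.fromstring(xml_text)
    for text_node in root.findall("./DOC[@type='story']/TEXT"):
        for p in text_node.findall("P"):
            yield " ".join((p.text or "").split())
```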
However, thanks to <a title="Stack Overflow post on sentence tokenization in Python" href="http://stackoverflow.com/questions/14095971">this Stack Overflow post</a>, I became aware of its limitations. Here's a difficult example from <i>Moby Dick</i>, with sentence boundaries (my judgments) indicated by the pipe character (<code>|</code>):</p>
<blockquote><p>A clam for supper? | a cold clam; is THAT what you mean, Mrs. Hussey?" | says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"</p></blockquote>
<p>The default sentence tokenizer insists on sentence breaks immediately after both occurrences of "Mrs." To remedy this, I replaced the space after titles like "Mrs." (the full list of such abbreviations was adapted from <a title="GPoSTTL" href="http://gposttl.sourceforge.net">GPoSTTL</a>) with an underscore so as to "bleed" the sentence tokenizer, then replaced the underscore with a space after tokenization was complete.
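<p>The bleed-and-restore trick can be sketched as follows, with a naive regex splitter standing in for Punkt and a deliberately short abbreviation list:</p>

```python
import re

# A few of the titles adapted from GPoSTTL (the full list is much longer).
ABBREVIATIONS = ["Mr.", "Mrs.", "Ms.", "Dr."]

def protect(text):
    # "Bleed" the tokenizer: "Mrs. Hussey" becomes "Mrs._Hussey", so no
    # sentence boundary can be placed after the title.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr + " ", abbr + "_")
    return text

def restore(text):
    # Undo the protection after tokenization is complete.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr + "_", abbr + " ")
    return text

def naive_sentences(text):
    # Stand-in for NLTK's Punkt model: split after ., !, or ? followed by
    # whitespace. The real pipeline uses nltk's sentence tokenizer instead.
    return re.split(r"(?<=[.!?])\s+", text)

sentences = [restore(s) for s in
             naive_sentences(protect("Is that what you mean, Mrs. Hussey? "
                                     "It is, Mrs. Hussey."))]
# -> ["Is that what you mean, Mrs. Hussey?", "It is, Mrs. Hussey."]
```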
That is, the sentence tokenizer sees word tokens like "Mrs._Hussey"; since sentence boundaries must line up with word-token boundaries, there is no chance a space will be inserted there. With this hack, the sentence tokenizer gets that snippet of <em>Moby Dick</em> exactly right.</p>
<h2>Sentences to tokens</h2>
<p>For the last step, I used NLTK's word tokenizer (<code>nltk.tokenize.word_tokenize</code>), which is similar to the (in)famous <a title="Treebank tokenizer" href="http://www.cis.upenn.edu/~treebank/tokenization.html">Treebank tokenizer</a>, and then <a title="Case folding" href="http://en.wikipedia.org/wiki/Letter_case#Case_folding">case-folded</a> the resulting tokens.</p>
<h2>Summary</h2>
<p>In all: 170 million sentences, 5 billion word tokens, and 22 billion characters, all of which fit into 7.5 GB (compressed). Best of luck to anyone looking to do the same!</p>