{"id":689,"date":"2019-03-13T18:56:24","date_gmt":"2019-03-13T18:56:24","guid":{"rendered":"http:\/\/www.wellformedness.com\/blog\/?p=689"},"modified":"2022-07-07T22:32:26","modified_gmt":"2022-07-07T22:32:26","slug":"text-encoding-issues-in-universal-dependencies","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/text-encoding-issues-in-universal-dependencies\/","title":{"rendered":"Text encoding issues in Universal Dependencies"},"content":{"rendered":"<p>Do you know why the following comparison (in Python 3.7) fails?<\/p>\n<pre>&gt;&gt;&gt; s1 = \"\u095c\"\r\n&gt;&gt;&gt; s2 = \"\u0921\u093c\"\r\n&gt;&gt;&gt; s1 == s2\r\nFalse<\/pre>\n<p>I&#8217;ll give you a hint:<\/p>\n<pre>&gt;&gt;&gt; len(s1)\r\n1\r\n&gt;&gt;&gt; len(s2)\r\n2<\/pre>\n<p>Despite the two strings rendering identically, they are encoded differently. The string <tt>s1<\/tt> is a single-codepoint sequence, whereas <tt>s2<\/tt> contains two codepoints. Thus string comparison fails, whether it&#8217;s done at the level of bytes or of Unicode codepoints.<\/p>\n<p>Some NLP researchers are aware of issues arising from faulty string encoding. Eckhart de Castilho (2016), for example, describes a tool which automatically identifies misencoded pre-trained data, whereas Wu &amp; Yarowsky (2018) report issues using an existing tool for transliteration on certain languages because of encoding issues. However, I suspect that far fewer NLP researchers are familiar with the aforementioned problem, which is specific to <em>Unicode normalization<\/em>. To put it simply, Unicode defines four normalization forms (and associated conversion algorithms) for strings, and the key distinction is between &#8220;composed&#8221; and &#8220;decomposed&#8221; forms of characters (using that term in a pretheoretic sense). 
The string <tt>s1<\/tt> is composed into a single Unicode codepoint; <tt>s2<\/tt> is decomposed into two.<\/p>\n<p>Unfortunately, three columns of the <a href=\"https:\/\/github.com\/UniversalDependencies\/UD_Hindi-HDTB\/tree\/master\">Hindi Dependency Treebank<\/a> (<tt>hi_hdtb<\/tt>, commit <tt>54c4c0f<\/tt>; Bhat et al. 2017, Palmer et al. 2009) have a chaotic mix of composed and decomposed representations. It seems that most, if not all, of these involve the encoding of the six <a href=\"https:\/\/en.wikipedia.org\/wiki\/Nuqta\">nuqta<\/a> (&#8216;dot&#8217;) consonants, which are usually found in borrowings from Arabic or Persian (presumably via Urdu). In Devanagari these consonants are written by adding a dot to a phonetically similar native consonant; for instance, \u0921 [\u0256\u0259] plus the nuqta produces \u095c [\u027d\u0259]. As is usually the case in Unicode, there is more than one way to do it: you can either encode \u095c as a single composed character (<tt>U+095C DEVANAGARI LETTER DDDHA<\/tt>) or as the native Devanagari character (<tt>U+0921 DEVANAGARI LETTER DDA<\/tt>) plus a combining character (<tt>U+093C DEVANAGARI SIGN NUKTA<\/tt>). In practical terms, this means that strings containing different encodings of &lt;\u1e5ba&gt; (as it is sometimes transliterated) will be treated as totally separate during training and evaluation, except on the off chance that all associated tools perform Unicode normalization ahead of time.<\/p>\n<p>This does have negative consequences for NLP. Consider the <a href=\"http:\/\/ufal.mff.cuni.cz\/udpipe\">UDPipe<\/a> system (Straka &amp; Strakov\u00e1 2017) at the CoNLL 2017 shared task on dependency parsing (Zeman et al. 2017), for which the primary metric is <a href=\"https:\/\/linguistics.stackexchange.com\/questions\/6863\/how-is-the-f1-score-computed-when-assessing-dependency-parsing\">labeled attachment score<\/a> (LAS). I first attempted to replicate the UDPipe results for the Hindi Dependency Treebank. 
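Returning to the opening example, the fix is available in Python itself: the standard library's unicodedata module implements all four normalization forms. A minimal sketch, using nothing beyond the standard library:

```python
import unicodedata

s1 = "\u095c"        # U+095C DEVANAGARI LETTER DDDHA (composed)
s2 = "\u0921\u093c"  # U+0921 DEVANAGARI LETTER DDA + U+093C NUKTA (decomposed)

# The raw strings render identically but compare unequal.
assert s1 != s2
assert (len(s1), len(s2)) == (1, 2)

# Converting both sides to any one normalization form restores equality.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)
```

(As it happens, U+095C is a composition exclusion, so even the "composed" forms NFC and NFKC yield the two-codepoint sequence here; what matters for training and evaluation is only that both sides agree.)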
Using <a href=\"https:\/\/github.com\/ufal\/udpipe\/releases\/tag\/v1.2.0\">UDPipe 1.2.0<\/a>, <a href=\"https:\/\/github.com\/tmikolov\/word2vec\/commit\/20c129af10659f7c50e86e3be406df663beff438\">word2vec<\/a> (commit <tt>20c129a<\/tt>), the hyperparameters given in the authors&#8217; <a href=\"http:\/\/hdl.handle.net\/11234\/1-2859\">supplementary materials<\/a>,\u00a0and the <a href=\"https:\/\/github.com\/ufal\/conll2017\/blob\/master\/evaluation_script\/conll17_ud_eval.py\">official evaluation script<\/a>, I obtain LAS = 87.09 on the &#8220;gold tokenization&#8221; subtask. However, I can improve this simply by converting the training, development, and test data to a consistent normalization form like so:<\/p>\n<pre>for FILE in *.conllu; do\r\n    TMPFILE=\"$(mktemp)\"\r\n    uconv -x nfkc \"${FILE}\" &gt; \"${TMPFILE}\"\r\n    mv \"${TMPFILE}\" \"${FILE}\"\r\ndone<\/pre>\n<p>and then retraining. Here I have chosen to apply the NFKC (&#8220;compatibility composed&#8221;) normalization form. While Zeman et al. do not discuss the encoding of the labeled Universal Dependencies data, they do mention that they apply NFKC normalization to the <a href=\"http:\/\/hdl.handle.net\/11234\/1-1989\">additional raw data<\/a>. But it doesn&#8217;t really matter in this case which form you choose, so long as you are consistent. After retraining, I obtain LAS = 87.38, or <strong>0.29 points for free<\/strong>. I also ran a &#8220;mismatch&#8221; experiment, in which the training and testing data have different normalization forms; naturally, this causes a slight degradation, to LAS = 86.98.<\/p>\n<p>Straka &amp; Strakov\u00e1 (2017) report a separate set of experiments in which they have attempted to rebalance the training-development-test splits. Just to be sure, I repeated the above experiments using their <a href=\"https:\/\/lindat.mff.cuni.cz\/repository\/xmlui\/handle\/11234\/1-2364\">original rebalancing script<\/a>. 
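Before retraining, it is also easy to measure how widespread the mixing is. Below is a sketch of a per-token audit for CoNLL-U files; the function name is mine, and it assumes only the standard tab-separated CoNLL-U layout, in which FORM is the second column and comment lines begin with <tt>#<\/tt>:

```python
import unicodedata

def unnormalized_forms(path, form="NFKC"):
    """Yield (line number, token) pairs whose FORM column changes under
    the given normalization form, i.e. tokens not encoded in that form."""
    with open(path, encoding="utf-8") as source:
        for lineno, line in enumerate(source, 1):
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comments
            token = line.split("\t")[1]  # FORM is the second column
            if unicodedata.normalize(form, token) != token:
                yield lineno, token
```

Any file for which this yields some, but not all, token lines has a mixed encoding; piping it through <tt>uconv<\/tt> as above rewrites it consistently.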
With the baseline data (mixed normalization), I can replicate <a href=\"http:\/\/ufal.mff.cuni.cz\/udpipe\/models#universal_dependencies_23_models_performance\">their result exactly<\/a>: LAS = 87.30. With a consistent NFKC normalization of training, development, and test data, I get LAS = 87.50. And with a normalization mismatch between training and test data, I get LAS = 87.07, a slight degradation. <strong>Once again, the improvement is more or less free.<\/strong><\/p>\n<p>While I have not yet done a systematic audit, I found three other UD treebanks that have encoding issues. The <tt>ar_padt<\/tt> treebank has a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Unicode_equivalence#Canonical_ordering\">non-canonical ordering of combining characters<\/a> in the lemma column (the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Shadda\">shaddah<\/a>, which indicates geminates, should come before the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Arabic_diacritics#Fat%E1%B8%A5ah\">fathah<\/a>, not the other way around), but this is unlikely to have any major effect on model performance because it uses this non-canonical ordering consistently. The <tt>ko_kaist<\/tt> and <tt>ur_udtb<\/tt> treebanks also have minor inconsistencies.<\/p>\n<p>Unfortunately, my corporate overlord doesn&#8217;t permit me to file a pull request <a href=\"https:\/\/github.com\/UniversalDependencies\/UD_Hindi-HDTB\/tree\/master\">here<\/a> because the Hindi data is released under a CC BY-NC-SA license. But if you&#8217;re not so constrained, feel free to do so, and ping this thread once you have! And pay attention in the future.<\/p>\n<h1>References<\/h1>\n<p>Bhat, R. A., Bhatt, R., Farudi, A., Klassen, P., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D. M., Vaidya, A., Vishnu, S. R., and Xia, F. 2017. The Hindi\/Urdu Treebank Project.\u00a0In Ide, N., and Pustejovsky, J. (eds.),\u00a0<em>The Handbook of Linguistic Annotation<\/em>, pages 659-698. Springer.<br \/>\nEckhart de Castilho, R. 
2016. Automatic analysis of flaws in pre-trained NLP models.\u00a0In <em>3rd International Workshop on Worldwide Language Service Infrastructure and 2nd Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies<\/em>, pages 19-27.<br \/>\nPalmer, M., Bhatt, R., Narasimhan, B., Rambow, O., Sharma, D. M., and Xia, F. 2009. Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure. In <em>ICON<\/em>, pages 14-17.<br \/>\nStraka, M., and Strakov\u00e1, J. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In <em>CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies<\/em>, pages 88-99.<br \/>\nWu, W., and Yarowsky, D. 2018. A comparative study of extremely low-resource transliteration of the world&#8217;s languages. In <em>LREC<\/em>, pages 938-943.<br \/>\nZeman, D., Popel, M., Straka, M.,\u00a0Haji\u010d, J., Nivre, J., Ginter, F., &#8230; and Li, J. 2017. CoNLL 2017 Shared Task: Multilingual parsing from raw text to Universal Dependencies.\u00a0In <em>CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies<\/em>, pages 1-19.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Do you know why the following comparison (in Python 3.7) fails? &gt;&gt;&gt; s1 = &#8220;\u095c&#8221; &gt;&gt;&gt; s2 = &#8220;\u0921\u093c&#8221; &gt;&gt;&gt; s1 == s2 False I&#8217;ll give you a hint: &gt;&gt;&gt; len(s1) 1 &gt;&gt;&gt; len(s2) 2 Despite the two strings rendering identically, they are encoded differently. 
The string s1 is a single-codepoint sequence, whereas s2 contains &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/text-encoding-issues-in-universal-dependencies\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Text encoding issues in Universal Dependencies&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[3,4,5,8],"tags":[],"class_list":["post-689","post","type-post","status-publish","format-standard","hentry","category-dev","category-language","category-nlp","category-python"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/689","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=689"}],"version-history":[{"count":18,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/689\/revisions"}],"predecessor-version":[{"id":1398,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/689\/revisions\/1398"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=689"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=689"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=689"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}