{"id":298,"date":"2014-11-06T23:15:05","date_gmt":"2014-11-06T23:15:05","guid":{"rendered":"http:\/\/sonny.cslu.ohsu.edu\/~gormanky\/blog\/?p=298"},"modified":"2018-09-06T21:54:11","modified_gmt":"2018-09-06T21:54:11","slug":"understanding-text-encoding-in-python-2-and-python-3","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/understanding-text-encoding-in-python-2-and-python-3\/","title":{"rendered":"Understanding text encoding in Python 2 and Python 3"},"content":{"rendered":"<p>Computers were rather late to the word processing game. The founding mothers and fathers of computing were primarily interested in numbers. This is\u00a0fortunate: after all,\u00a0<em>computers only know about numbers<\/em>. But as Brian Kunde explains in <a title=\"Brief history of word processing\" href=\"http:\/\/web.stanford.edu\/~bkunde\/fb-press\/articles\/wdprhist.html\">his brief history of word processing<\/a>, word processing existed long before digital computing, and text processing has\u00a0always been something of an afterthought.<\/p>\n<p>Humans think of text as consisting of an ordered sequence of &#8220;characters&#8221; (an\u00a0ill-defined Justice-Stewart-type concept which I won&#8217;t attempt to clarify\u00a0here). To manipulate text in digital computers, we have to have a mapping between the character set (a finite list of the characters the system recognizes) and numbers. <em>Encoding <\/em>is the process of converting characters to numbers, and\u00a0<em>decoding\u00a0<\/em>is (naturally) the process of converting numbers to characters. Before we get to Python, a bit of history.<\/p>\n<h1>ASCII and Unicode<\/h1>\n<p>There are only a few\u00a0character sets that have any relevance to life in 2014. The first is <a title=\"ASCII\" href=\"http:\/\/en.wikipedia.org\/wiki\/ASCII\">ASCII<\/a> (American Standard Code for Information Interchange), which was first published\u00a0in 1963. 
This character set consists of 128 characters intended for use by an English audience. Of these, 95 are\u00a0<em>printable<\/em>, meaning that they correspond to lay-human notions about\u00a0characters. On a US keyboard, these are (approximately) the alphanumeric and punctuation characters that can be typed with a single keystroke, or with a single keystroke while holding down the Shift key, plus space, tab, the two newline characters (which you get when you type return), and a few apocrypha. The remaining 33 are\u00a0non-printable &#8220;control characters&#8221;. For instance, the first character in the ASCII table is the &#8220;null byte&#8221;, written\u00a0<code>'\\0'<\/code> in C and other languages, but there&#8217;s no\u00a0standard way to render it. Many control characters were\u00a0designed for earlier, more innocent times; for instance, character #7 <code>'\\a'<\/code> tells the receiving device to ring a cute little bell (which was apparently attached to teletype terminals); today your computer might make a beep, or the terminal window might flicker once, but either way, nothing is printed.<\/p>\n<p>Of course, this is completely inadequate for anything but English (not to mention those users of\u00a0superfluous\u00a0<a title=\"diaresis\" href=\"http:\/\/en.wikipedia.org\/wiki\/Diaeresis_(diacritic)\">diaereses<\/a>&#8230;e.g., the editors of the New Yorker, Mot\u00f6rhead). However, each ASCII\u00a0character takes up only 7 bits, leaving room for another 128 characters (since\u00a0a byte can hold an integer value between\u00a00 and 255, inclusive), and so engineers\u00a0exploited the remaining 128 values to write the characters from different\u00a0alphabets, alphasyllabaries, or\u00a0syllabaries. 
Of these ASCII-based character sets, the best-known are <a title=\"ISO\/IEC 8859-1\" href=\"http:\/\/en.wikipedia.org\/wiki\/ISO\/IEC_8859-1\">ISO\/IEC 8859-1<\/a>, also known as Latin-1, and\u00a0<a title=\"Windows-1252\" href=\"http:\/\/en.wikipedia.org\/wiki\/Windows-1252\">Windows-1252<\/a>, also known as CP-1252. Unfortunately, this created more problems than it solved. That\u00a0last bit just didn&#8217;t leave enough space for the many languages which need a larger character set (Japanese <i>kanji<\/i> being an obvious example). And\u00a0even where there were technically enough code points left over, engineers working in different languages didn&#8217;t see eye-to-eye about what to do with them. This state of affairs made it impossible to, for example,\u00a0write\u00a0<em>in<\/em> French\u00a0(ISO\/IEC 8859-1) <em>about<\/em>\u00a0Ukrainian (<a title=\"ISO\/IEC 8859-5\" href=\"http:\/\/en.wikipedia.org\/wiki\/ISO\/IEC_8859-5\">ISO\/IEC 8859-5<\/a>, at least before the <a title=\"1990 Ukrainian spelling reform\" href=\"http:\/\/en.wikipedia.org\/wiki\/Ukrainian_alphabet#Unified_orthography\">1990 orthography\u00a0reform<\/a>).<\/p>\n<p>Clearly, fighting over scraps\u00a0isn&#8217;t going to cut it in the global village. Enter the\u00a0<a title=\"Unicode\" href=\"http:\/\/en.wikipedia.org\/wiki\/Unicode\">Unicode<\/a>\u00a0standard and its\u00a0<a title=\"UCS\" href=\"http:\/\/en.wikipedia.org\/wiki\/Universal_Character_Set\">Universal Character Set<\/a>\u00a0(UCS), first published in 1991. Unicode is\u00a0the\u00a0platonic ideal of a character encoding,\u00a0abstracting away from the need to <em>efficiently<\/em> convert all characters to numbers. Each character is represented by a single code point with various metadata (e.g., <code>A<\/code> is an &#8220;Uppercase Letter&#8221; from the &#8220;Latin&#8221; script). 
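<\/p>\n<p>To make this concrete: in Python, the standard-library <code>unicodedata<\/code> module exposes some of this per-character metadata (a quick sketch; expected output in the comments):<\/p>\n<pre><code>import unicodedata\r\n\r\n# Every UCS character has a canonical name and a category.\r\nprint(unicodedata.name(u\"A\"))      # LATIN CAPITAL LETTER A\r\nprint(unicodedata.category(u\"A\"))  # Lu, i.e., Letter, uppercase\r\nprint(unicodedata.name(u\"\u00f1\"))      # LATIN SMALL LETTER N WITH TILDE<\/code><\/pre>\n<p>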
ASCII and its extensions map onto a small subset of this code.<\/p>\n<p>Fortunately,\u00a0<em>not\u00a0all\u00a0encodings<\/em>\u00a0are merely <a title=\"Allegory of the Cave\" href=\"http:\/\/en.wikipedia.org\/wiki\/Allegory_of_the_Cave\">shadows on the walls of a cave<\/a>. The One True Encoding is\u00a0<a title=\"UTF-8\" href=\"http:\/\/en.wikipedia.org\/wiki\/UTF-8\">UTF-8<\/a>, which implements the entire UCS\u00a0using a variable-width 8-bit code. There are other encodings, of course, but this one is ours, and I am not alone in feeling strongly that UTF-8 is the chosen encoding. At the risk of getting too far afield, here are two arguments for why\u00a0you and everyone you know should just use UTF-8. First off, it hardly matters which UCS-compatible encoding we all use (the differences between them are largely arbitrary), but what\u00a0<em>does<\/em> matter is that <a title=\"14 standards\" href=\"http:\/\/xkcd.com\/927\/\">we all choose the same one<\/a>. There is no\u00a0general procedure for &#8220;sniffing&#8221; out the encoding of a file, and\u00a0there&#8217;s nothing preventing you from coming up with a file that&#8217;s a French cookbook in one encoding, and a top-secret message in another. This is good for\u00a0<a title=\"steganography\" href=\"http:\/\/en.wikipedia.org\/wiki\/Steganography\">steganographers<\/a>, but bad for the rest of us, since so many\u00a0text files\u00a0lack encoding\u00a0metadata.\u00a0When it comes to encodings, there&#8217;s no question that UTF-8 is the most popular\u00a0Unicode encoding scheme worldwide, and it is on its way to becoming the de-facto standard. Secondly,\u00a0<em>ASCII is valid UTF-8<\/em>, because UTF-8 and ASCII encode the ASCII characters in exactly the same way. What this means, practically speaking, is that you can achieve nearly complete\u00a0coverage of the world&#8217;s languages simply\u00a0by assuming\u00a0that all the inputs to your software are UTF-8. 
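<\/p>\n<p>You can verify this last claim directly: for pure-ASCII text, the ASCII and UTF-8 encodings produce byte-for-byte identical output (a quick sketch):<\/p>\n<pre><code># ASCII text encodes identically under ASCII and UTF-8.\r\nassert u\"plain text\".encode(\"ascii\") == u\"plain text\".encode(\"utf-8\")\r\n\r\n# Non-ASCII characters are where UTF-8 goes beyond ASCII.\r\nassert u\"a\u00f1o\".encode(\"utf-8\") == b\"a\\xc3\\xb1o\"<\/code><\/pre>\n<p>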
This is a big, big win for us all.<\/p>\n<h1>Decode\u00a0early, encode late<\/h1>\n<p>A general rule of\u00a0thumb for developers is &#8220;decode early&#8221; (convert inputs to their Unicode representation), &#8220;encode late&#8221; (convert back to bytestrings). The reason for this is that in nearly any programming language, Unicode strings behave the way our monkey brains expect them to, but bytestrings do not. To see why, try iterating over a non-ASCII bytestring in Python (more on the syntax later).<\/p>\n<pre><code>&gt;&gt;&gt; for byte in b\"a\u00f1o\":\r\n...     print(byte)\r\n...\r\na\r\n?\r\n?\r\no<\/code><\/pre>\n<p>There are two surprising things here: iterating over the bytestring returned more bytes than there are &#8220;characters&#8221; (goodbye, indexing), and furthermore the second\u00a0&#8220;character&#8221; failed to render properly. This is what happens when you let computers dictate the semantics to our monkey brains, rather than the other way around. Here&#8217;s\u00a0what happens when we try the same with a Unicode string:<\/p>\n<pre><code>&gt;&gt;&gt; for char in u\"a\u00f1o\":\r\n...     print(char)\r\n...\r\na\r\n\u00f1\r\no<\/code><\/pre>\n<h1>The Python 2 &amp; 3 string models<\/h1>\n<p>Before you put this all into practice,\u00a0it is important to note that Python 2 and Python 3 use very different string models. The familiar Python 2 <code>str<\/code> class is a bytestring. To convert it to a Unicode string, use the\u00a0<code>str.decode<\/code> instance method, which returns a copy of the string as an instance of the <code>unicode<\/code>\u00a0class.\u00a0Similarly, you can make a\u00a0<code>str<\/code> copy of a <code>unicode<\/code> instance with <code>unicode.encode<\/code>. Both of these methods take a single argument: a string (either kind!) 
representing the encoding.<\/p>\n<p>Python 2 provides specific syntax for Unicode string literals (which you saw above): a lower-case <code>u<\/code>\u00a0prefix before the initial quotation mark (as in <code>u\"a\u00f1o\"<\/code>).<\/p>\n<p>When it comes to Unicode-awareness, Python 3 has totally flipped the script; in my opinion, it&#8217;s for the best. Instances of <code>str<\/code> are now Unicode strings (the <code>u\"\"<\/code> syntax still works, but is vacuous). The (reduced) functionality of the old-style strings is now available for instances of the class <code>bytes<\/code>. As you might expect, you can create a <code>bytes<\/code> instance by using the <code>encode<\/code> method of a new-style <code>str<\/code>. Python 3 decodes bytestrings as soon as they are created, and (re)encodes Unicode strings only at the interfaces; in other words, it gets the &#8220;early\/late&#8221; stuff right by default. Your APIs probably won&#8217;t need to change much, because Python 3\u00a0treats\u00a0UTF-8 (and thus ASCII) as the default encoding, and this assumption is valid\u00a0more often than not.<\/p>\n<p>If, for some reason, you want a bytestring literal, Python has syntax for that, too: prefix the quotation marks delimiting the string with a lower-case <code>b<\/code>\u00a0(as in <code>b\"a\u00f1o\"<\/code>; see above also).<\/p>\n<h1>tl;dr<\/h1>\n<p>Strings are ordered sequences of characters. But computers only know about numbers, so strings are encoded as\u00a0byte arrays; there are many ways to do this, but UTF-8 is the One True Encoding. To get strings to have the semantics you expect as a human, decode a string to\u00a0Unicode as early as possible, and encode it as bytes as late as possible. 
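<\/p>\n<p>The whole round trip fits in a few lines (a sketch in Python 3 syntax; the <code>u<\/code> and <code>b<\/code> prefixes make it run under Python 2 as well):<\/p>\n<pre><code>raw = b\"a\\xc3\\xb1o\"         # bytes from the outside world\r\ntext = raw.decode(\"utf-8\")  # decode early: now a Unicode string\r\nassert text == u\"a\u00f1o\"\r\nassert len(text) == 3       # indexing and length behave as expected\r\nassert text.encode(\"utf-8\") == raw  # encode late, at the output<\/code><\/pre>\n<p>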
You have to do this explicitly in Python 2; it happens automatically in Python 3.<\/p>\n<h1>Further reading<\/h1>\n<p>For more of the historical angle, see Joel Spolsky&#8217;s\u00a0<a title=\"joel on unicode\" href=\"https:\/\/www.joelonsoftware.com\/2003\/10\/08\/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses\/\">The absolute minimum every software developer absolutely, positively must know About Unicode and character sets (no excuses!)<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Computers were rather late to the word processing game. The founding mothers and fathers of computing were primarily interested in numbers. This is\u00a0fortunate: after all,\u00a0computers only know about numbers. But as Brian Kunde explains in his brief history of word processing, word processing existed long before digital computing, and the text processing has\u00a0always been something &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/understanding-text-encoding-in-python-2-and-python-3\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Understanding text encoding in Python 2 and Python 
3&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[3,4,5,8],"tags":[],"class_list":["post-298","post","type-post","status-publish","format-standard","hentry","category-dev","category-language","category-nlp","category-python"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/298","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=298"}],"version-history":[{"count":1,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/298\/revisions"}],"predecessor-version":[{"id":591,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/298\/revisions\/591"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=298"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=298"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=298"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}