{"id":962,"date":"2020-10-28T03:04:58","date_gmt":"2020-10-28T03:04:58","guid":{"rendered":"http:\/\/www.wellformedness.com\/blog\/?p=962"},"modified":"2020-10-28T14:15:03","modified_gmt":"2020-10-28T14:15:03","slug":"translating-lost-languages-machine-learning","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/translating-lost-languages-machine-learning\/","title":{"rendered":"Translating lost languages using machine learning?"},"content":{"rendered":"<p><strong>[The following is a guest post from my colleague <a href=\"https:\/\/rws.xoba.com\/\">Richard Sproat<\/a>. This should go without saying, but: this post does not represent the opinions of\u00a0<em>anyone&#8217;s<\/em> employer.]<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">In 2009 a <\/span><a href=\"https:\/\/science.sciencemag.org\/content\/324\/5931\/1165\/tab-figures-data\"><span style=\"font-weight: 400;\">paper<\/span><\/a><span style=\"font-weight: 400;\"> appeared in <\/span><i><span style=\"font-weight: 400;\">Science <\/span><\/i><span style=\"font-weight: 400;\">by Rajesh Rao and colleagues that claimed to show using \u201centropic evidence\u201d that the thus far undeciphered Indus Valley symbol system was true writing not, as colleagues and I had <\/span><a href=\"https:\/\/crossasia-journals.ub.uni-heidelberg.de\/index.php\/ejvs\/article\/view\/620\"><span style=\"font-weight: 400;\">argued<\/span><\/a><span style=\"font-weight: 400;\">, a non-linguistic symbol system. 
Some other papers from Rao and colleagues followed, and there was also a <\/span><a href=\"https:\/\/royalsocietypublishing.org\/doi\/full\/10.1098\/rspa.2010.0041\"><span style=\"font-weight: 400;\">paper<\/span><\/a><span style=\"font-weight: 400;\"> in the <\/span><i><span style=\"font-weight: 400;\">Proceedings of the Royal Society <\/span><\/i><span style=\"font-weight: 400;\">by Rob Lee and colleagues that used a different \u201centropic\u201d method to argue that symbols carved on stones by the Picts of Iron Age Scotland also represented language.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">I, and others, were deeply skeptical (see e.g. <\/span><a href=\"https:\/\/languagelog.ldc.upenn.edu\/nll\/?p=2227\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">) that such methods could distinguish between true writing and symbol systems that, while having structure, encoded some sort of non-linguistic information. This skepticism was fed in part by our observation that completely random, meaningless \u201csymbol systems\u201d could be shown to fall into the \u201clinguistic\u201d bin according to those measures. What, if anything, were such methods telling us about the difference between natural language and other systems that convey meaning? My skepticism led to a sequence of presentations and papers, culminating in this <\/span><a href=\"https:\/\/www.linguisticsociety.org\/sites\/default\/files\/archived-documents\/Sproat_Lg_90_2.pdf\"><span style=\"font-weight: 400;\">paper<\/span><\/a><span style=\"font-weight: 400;\"> in <\/span><i><span style=\"font-weight: 400;\">Language<\/span><\/i><span style=\"font-weight: 400;\">, where I tried a variety of statistical methods, including those of the Rao and Lee teams, in an attempt to distinguish between samples of systems that were known to be true writing and systems known to be non-linguistic. 
None of these methods really worked, and I concluded that simple extrinsic measures based on the distribution of symbols, without knowing <\/span><i><span style=\"font-weight: 400;\">what <\/span><\/i><span style=\"font-weight: 400;\">the symbols denote, were unlikely to be of much use.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The upshot of this attempt at debunking Rao\u2019s and Lee\u2019s widely publicized work was that I convinced people who were already convinced and failed to convince those who were not. As icing on the cake, I was accused by Rao and Lee and colleagues of totally misrepresenting their work, which I most certainly had not done: indeed, I was careful to consider all possible interpretations of their arguments, the problem being that their own interpretations of what they had done seemed to be rather fluid, changing as the criticisms changed; on the latter point see <\/span><a href=\"https:\/\/www.linguisticsociety.org\/sites\/default\/files\/14e_91.4Sproat.pdf\"><span style=\"font-weight: 400;\">my reply<\/span><\/a><span style=\"font-weight: 400;\">, also in <\/span><i><span style=\"font-weight: 400;\">Language<\/span><\/i><span style=\"font-weight: 400;\">. 
This experience led me to pretty much give up the debunking business entirely, since people usually end up believing what they want to believe, and it is rare for people to admit they were wrong.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Still, there are times when one feels inclined to try to set the record straight, and one such instance is this <\/span><a href=\"https:\/\/news.mit.edu\/2020\/translating-lost-languages-using-machine-learning-1021\"><span style=\"font-weight: 400;\">recent announcement<\/span><\/a><span style=\"font-weight: 400;\"> from MIT about work from Regina Barzilay and colleagues that purports to provide a machine-learning based system that \u201caims to help linguists decipher languages that have been lost to history.\u201d The <\/span><a href=\"https:\/\/arxiv.org\/abs\/2010.11054\"><span style=\"font-weight: 400;\">paper<\/span><\/a><span style=\"font-weight: 400;\"> this press release is based on (to appear in the <\/span><em><a href=\"https:\/\/transacl.org\/index.php\/tacl\"><span style=\"font-weight: 400;\">Transactions of the Association for Computational Linguistics<\/span><\/a><\/em><span style=\"font-weight: 400;\">) is of course more reserved than what the MIT public relations people produced, but is still misleading in a number of ways.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Before I get into that though, let me state at the outset that as with the work by Rao et al. and Lee et al. that I had critiqued previously, the issue here is not that Barzilay and colleagues do not have results, but rather what one concludes from their results. 
And to be fair, this new work is a couple of orders of magnitude more sophisticated than what Rao and his colleagues did.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In brief summary, Barzilay et al.\u2019s approach is to take a text in an unknown ancient script, which may not be <\/span><i><span style=\"font-weight: 400;\">segmented<\/span><\/i><span style=\"font-weight: 400;\"> into words, along with phonetic transcriptions of a known language. In general, the phonetic values of the unknown script are, well, not known, so candidate mappings are generated. (The authors also consider cases where some of the values are known, or can be guessed at, e.g. because the glyphs look like glyphs in known scripts.) The weights on the various mappings are learnable parameters, and the learning is also guided by phonological constraints such as assumed regularity of sound changes and rough preservation of the size of the phonemic inventory as languages change. (Of course, phoneme inventories can change a lot in size and details over a long history: Modern English has quite a different inventory from Proto-Indo-European. Still, since one\u2019s best hope of a decipherment is to find languages that are reasonably closely related to the target, the authors\u2019 assumption here may not be unreasonable.) The objective function for the learning aims to cover as much of the unknown text as possible while optimizing the quality of the extracted cognates. 
Their training is summarized in the following pseudocode from page 6 of their paper:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-970 size-full\" src=\"https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0.png\" alt=\"\" width=\"2296\" height=\"632\" srcset=\"https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0.png 2296w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-300x83.png 300w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1024x282.png 1024w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-768x211.png 768w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1536x423.png 1536w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-2048x564.png 2048w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-500x138.png 500w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">One can then compare the results of the algorithm when run with the unknown text, and a <\/span><i><span style=\"font-weight: 400;\">set<\/span><\/i><span style=\"font-weight: 400;\"> of known languages, to see which of the known languages is the best model. 
The work is thus in many ways similar to earlier <\/span><a href=\"https:\/\/www.isi.edu\/natural-language\/mt\/decipher06.pdf\"><span style=\"font-weight: 400;\">work<\/span><\/a><span style=\"font-weight: 400;\"> by Kevin Knight and colleagues, which the present paper also cites.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In the experiments, the authors used three ancient scripts: <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Ugaritic\"><span style=\"font-weight: 400;\">Ugaritic<\/span><\/a><span style=\"font-weight: 400;\"> (12th century BCE), a close relative of Hebrew; Gothic, a 4th-century-CE East Germanic language that is also the earliest preserved Germanic tongue; and Iberian, a heretofore undeciphered script \u2014 or, more accurately, a collection of scripts \u2014 of the late pre-Common Era from the Iberian peninsula. (It is worth noting that Iberian was very likely a mixed <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Iberian_scripts\"><span style=\"font-weight: 400;\">alphabetic-syllabic<\/span><\/a><span style=\"font-weight: 400;\"> script, not a purely alphabetic one, which means that one is giving oneself a bit of a leg up if one bases one\u2019s work on a transliteration of those texts into a purely alphabetic form.) The known comparison languages were Proto-Germanic, Old Norse, Old English, Latin, Spanish, Hungarian, Turkish, Basque, Arabic and Hebrew. (I note in passing that Latin and Spanish seem to be assigned by the authors to different language families!)<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For Ugaritic, Hebrew came out as dramatically closer than other languages, and for Gothic, Proto-Germanic. For Iberian, no language was a dramatically better match, though Basque did seem to be somewhat closer. As they argue (p. 9):<\/span><\/p>\n<blockquote><p><span style=\"font-weight: 400;\">The picture is quite different for Iberian. No language seems to have a pronounced advantage over others. 
This seems to accord with the current scholarly understanding that Iberian is a language isolate, with no established kinship with others.<\/span><\/p><\/blockquote>\n<p><span style=\"font-weight: 400;\">\u201cScholarly understanding\u201d may be an overstatement, since the most one can say at this point is that there is <\/span><i><span style=\"font-weight: 400;\">scholarly disagreement <\/span><\/i><span style=\"font-weight: 400;\">on the relationships between the Iberian language(s) and known languages.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But, in any case, one problem is that they only perform this experiment for three ancient scripts, for two of which they are able to find clear relationships, and for the third not so clearly, so it is not obvious what, if anything, one can conclude from this. A sample of three is hardly overwhelming in its statistical significance. Furthermore, in at least one case there is a serious danger of circularity: the closest match they find for Gothic is with Proto-Germanic, which shows a much better match than the other Germanic languages, Old Norse and Old English. But that is hardly surprising: Proto-Germanic reconstructions are heavily informed by Gothic, the earliest recorded example of a Germanic language. Indeed, if Gothic were truly an unknown language, and assuming that we had no access to a reconstructed protolanguage that depends in part on Gothic for its reconstruction, then we would be left with the two known Germanic languages in their set, Old English and Old Norse. This of course would be a more reasonable model in any case for the situation a real decipherer would encounter. But then the situation for Gothic becomes much less clear. Below is their Figure 4, which plots various settings of their coverage threshold hyperparameter <\/span><i><span style=\"font-weight: 400;\">r<\/span><\/i><sub><i><span style=\"font-weight: 400;\">cov<\/span><\/i><\/sub> <span style=\"font-weight: 400;\">against the obtained coverage. 
The more separated a language\u2019s curve is above the rest, the better the method is able to distinguish the closest-matched language from everything else. With this in mind, Hebrew is clearly a lot closer to Ugaritic than anything else. Iberian, as we noted, does not have a language that is obviously closest, though Basque is a contender. For Gothic, Proto-Germanic (PG) is a clear winner, but if one removed it, the closest two would be Old English (OE) and Old Norse (ON). Not bad, of course, but just eyeballing the plots, the situation is no longer as dramatic, and not clearly more dramatic than the situation for Iberian.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-971 size-full\" src=\"https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1.png\" alt=\"\" width=\"2166\" height=\"1782\" srcset=\"https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1.png 2166w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1-300x247.png 300w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1-1024x842.png 1024w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1-768x632.png 768w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1-1536x1264.png 1536w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1-2048x1685.png 2048w, https:\/\/www.wellformedness.com\/blog\/wp-content\/uploads\/2020\/10\/pasted-image-0-1-365x300.png 365w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">And as for Iberian, again, they note (p. 9) that \u201cBasque somewhat stands out from the rest, which might be attributed to its similar phonological system with Iberian\u201d. But what are they comparing against? 
Modern Basque is certainly different from its form 2000+ years ago, and indeed if one buys into <\/span><a href=\"https:\/\/julietteblevins.ws.gc.cuny.edu\/proto-basque\/\"><span style=\"font-weight: 400;\">recent work by Juliette Blevins<\/span><\/a><span style=\"font-weight: 400;\">, then Ancient Basque was phonologically quite a bit different from the modern language. Which in turn leaves one wondering what these results are telling us.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The abstract of the paper opens with the statement that:<\/span><\/p>\n<blockquote><p><span style=\"font-weight: 400;\">Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined.<\/span><\/p><\/blockquote>\n<p><span style=\"font-weight: 400;\">Of course, this is all perfectly true, but it rather understates the case when it comes to the real challenges faced in most cases of decipherment.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To wit:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Not only is the \u201cclosest \u2026 language\u201d not usually known, but there may not even<\/span><i><span style=\"font-weight: 400;\"> be<\/span><\/i><span style=\"font-weight: 400;\"> a closest language. This appears to be the situation for Linear A, where, even though there is a substantial amount of Linear A text, and the syllabary is very similar in appearance and was almost certainly the precursor to the deciphered Linear B, decipherment has remained elusive for 100 years, in large measure because we simply do not know anything about the Eteocretan language. It is also the situation for Etruscan. 
The authors of course claim their results support this conclusion for Iberian, and thereby imply that their method can help one decide whether there really is a closest language, and thus presumably whether one would be wasting one\u2019s time pursuing a given relationship. But as we have suggested above, the results seem equivocal on this point.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Even when it turns out that the text is in a language related to a known language, the way in which the script encodes that language may make the correspondences far less transparent than in the known systems chosen for this paper. Gothic and Ugaritic are both segmental writing systems that presumably had a fairly straightforward grapheme-to-phoneme relation. And while Ugaritic is a \u201cdefective\u201d writing system in that it fails to represent, e.g., most vowels, it is no different from Hebrew or Arabic in that regard. This makes it a great deal easier to find correspondences than, say, Linear B. Linear B was a syllabary, and it was a<\/span><i><span style=\"font-weight: 400;\"> lousy <\/span><\/i><span style=\"font-weight: 400;\">way to write Greek. It failed to make important phonemic distinctions that Greek had, so that whereas Greek had a three-way voiced-voiceless-voiceless aspirate distinction in stops, Linear B for the most part could only represent place, not manner of articulation. Nor, for the most part, could it directly represent consonant clusters, so that either these had to be broken up into CV units (e.g. <\/span><i><span style=\"font-weight: 400;\">knossos <\/span><\/i><span style=\"font-weight: 400;\">as <\/span><i><span style=\"font-weight: 400;\">ko-no-so<\/span><\/i><span style=\"font-weight: 400;\">) or some of the consonants ended up being unrepresented (e.g. 
<\/span><i><span style=\"font-weight: 400;\">sperma <\/span><\/i><span style=\"font-weight: 400;\">as <\/span><i><span style=\"font-weight: 400;\">pe-ma<\/span><\/i><span style=\"font-weight: 400;\">).\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">And all of this assumes the script was purely phonographic. Many ancient scripts, and <\/span><i><span style=\"font-weight: 400;\">all<\/span><\/i><span style=\"font-weight: 400;\"> of the original independently invented scripts, included at least some amount of purely <\/span><i><span style=\"font-weight: 400;\">logographic<\/span><\/i><span style=\"font-weight: 400;\"> (or, if you prefer,<\/span><i><span style=\"font-weight: 400;\"> morphographic<\/span><\/i><span style=\"font-weight: 400;\">) and even <\/span><i><span style=\"font-weight: 400;\">semasiographic <\/span><\/i><span style=\"font-weight: 400;\">symbology, so that an ancient text was a mix of glyphs, some of which would relate to the sound, and others of which would relate to a particular morpheme or its meaning. And when sound was encoded, it was often encoded quite unsystematically, certainly much less systematically than in Gothic or Ugaritic.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Then there is the issue of the amount of text available, which may amount to mere hundreds of tokens, or fewer. And of course there are issues familiar in decipherment, such as knowing when two glyphs in a pair of inscriptions that look similar to each other are indeed the same glyph, or not. Or, as in the case of Mayan, where very different-looking glyphs are actually calligraphic variants of the same glyph (see e.g. <\/span><a href=\"http:\/\/www.famsi.org\/research\/pitts\/MayaGlyphsBook2.pdf\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\"> in the section on \u201chead glyphs\u201d). 
The point here is that one often cannot be sure whether two glyphs in a corpus are instances of the same glyph, or not, until one has a better understanding of the whole system.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Of course, all of these issues might be addressed using computational methods as we gradually whittle away at the bigger problem. But it is important to stress that methods such as the one presented in this paper are really a very small piece in the overall task of decipherment.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We do need to say one more thing here about Linear B, since the authors of this paper claim that one of their previously reported systems (<\/span><a href=\"https:\/\/www.aclweb.org\/anthology\/P19-1303\/\"><span style=\"font-weight: 400;\">Luo, Cao and Barzilay, 2019<\/span><\/a><span style=\"font-weight: 400;\">) \u201ccan successfully decipher lost languages like \u2026 Linear B\u201d. But if you look at what was done in that paper, they took a lexicon of Linear B words and aligned it successfully to a nicely cleaned-up lexicon of known Greek names, noting, somewhat obliquely, that location names were important in the successful decipherment of Linear B. That is true, of course, but then again it wasn\u2019t particularly the largely non-Greek Cretan place names that led to the realization that Linear B was Greek. One must remember that Michael Ventris, no doubt under the influence of Arthur Evans, was initially of the opinion that Linear B could not be Greek. 
It was only when the language that he was uncovering started to look more and more familiar, and clearly Greek words like <\/span><i><span style=\"font-weight: 400;\">ko-wo (korwos) &#8216;<\/span><\/i><span style=\"font-weight: 400;\">boy&#8217; and <\/span><i><span style=\"font-weight: 400;\">i-qo <\/span><\/i><span style=\"font-weight: 400;\">(<\/span><i><span style=\"font-weight: 400;\">iqqos) &#8216;<\/span><\/i><span style=\"font-weight: 400;\">horse&#8217; started to appear that the conclusion became inescapable. To simulate some of the steps that Ventris went through, one could imagine using something like the Luo et al. approach as follows. First guess that there might be proper names mentioned in the corpus, then use their algorithm to derive a set of possible phonetic values for the Linear B symbols, some of which would probably be close to being correct. Then use those along with something along the lines of what is presented in the newest paper to attempt to find the closest language from a set of candidates including Greek, and thereby hope one can extend the coverage. That would be an interesting program to pursue, but there is much that would need to be done to make it actually work, especially if we intend an <\/span><i><span style=\"font-weight: 400;\">honest <\/span><\/i><span style=\"font-weight: 400;\">experiment where we make as few assumptions as possible about what we know about the language encoded by the system. And, of course more generally this approach would fail entirely if the language were not related to any known language. In that case one would end up with a set of things that one could probably read, such as place names, and not much else \u2014 a situation not too dissimilar from that of Linear A. All of which is to say that what Luo et al. 
presented is interesting, but hardly counts as a \u201cdecipherment\u201d of Linear B.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Of course, Champollion is often credited with being the decipherer of Egyptian, whereas a more accurate characterization would be to say that he provided the crucial key to a process that unfolded over the ensuing century. (In contrast, Linear B was to a large extent deciphered within Ventris\u2019 rather short lifetime \u2014 but then again Linear B is a much less complicated writing system than Egyptian.) If one were being charitable, then, one might compare Luo et al.\u2019s results to those of Champollion, but then it is worth remembering that getting from that initial stage to a full decipherment of the system can still be a daunting task.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In summary, I think there are contributions in this work, and there would be no problem if it were presented as a method that provides a piece of what one would need in one\u2019s toolkit if one wanted to (semi-)automate the process of decipherment. (In fact, computational methods have thus far played only a very <\/span><i><span style=\"font-weight: 400;\">minor<\/span><\/i><span style=\"font-weight: 400;\"> role in real decipherment work, but one can hold out hope that they could be used more.) But everything apparently has to be hyped these days well beyond what the work actually does.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Needless to say, the press loves this sort of stuff, but are scientists mainly in the business of feeding exciting tidbits to the press? Apparently they often are: the paper of mine referenced in the introduction, which appeared in <\/span><i><span style=\"font-weight: 400;\">Language<\/span><\/i><span style=\"font-weight: 400;\">, was initially submitted to <\/span><i><span style=\"font-weight: 400;\">Science <\/span><\/i><span style=\"font-weight: 400;\">as a reply to the paper by Rao and colleagues. 
This reply was rejected before it even made it out of the editorial office. The reason was pretty transparent: Rao and colleagues\u2019 original paper purported to be a sexy \u201cAI\u201d-based approach that supposedly told us something interesting about an ancient civilization. My paper was a more mundane contribution showing that none of the proposed methods worked. Which one sells more copies?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In any event, with respect to the paper currently under discussion, hopefully my attempt here will have served at least to put things a bit more in perspective.<\/span><\/p>\n<p><i><span style=\"font-weight: 400;\">Acknowledgements:\u00a0<\/span><\/i><span style=\"font-weight: 400;\">I thank Kyle Gorman and Alexander Gutkin for comments on earlier versions.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>[The following is a guest post from my colleague Richard Sproat. This should go without saying, but: this post does not represent the opinions of\u00a0anyone&#8217;s employer.] 
In 2009 a paper appeared in Science by Rajesh Rao and colleagues that claimed to show using \u201centropic evidence\u201d that the thus far undeciphered Indus Valley symbol system was &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/translating-lost-languages-machine-learning\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Translating lost languages using machine learning?&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[4,5],"tags":[],"class_list":["post-962","post","type-post","status-publish","format-standard","hentry","category-language","category-nlp"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/962","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=962"}],"version-history":[{"count":5,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/962\/revisions"}],"predecessor-version":[{"id":972,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/962\/revisions\/972"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=962"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=962"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=962"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templat
ed":true}]}}