{"id":1951,"date":"2024-04-12T13:08:14","date_gmt":"2024-04-12T17:08:14","guid":{"rendered":"https:\/\/www.wellformedness.com\/blog\/?p=1951"},"modified":"2024-04-12T13:08:14","modified_gmt":"2024-04-12T17:08:14","slug":"segmented-languages","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/segmented-languages\/","title":{"rendered":"&#8220;Segmented languages&#8221;"},"content":{"rendered":"<p>In a recent paper (Gorman &amp; Sproat 2023), we complain about the conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like &#8220;right-to-left language&#8221;, &#8220;syllabic language&#8221; or &#8220;ideographic language&#8221; found in the literature. Thus we were surprised to find the following:<\/p>\n<blockquote><p>Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate\u00a0(CER), instead of WER&#8230;\u00a0<em>(Gemini Team 2024:18)<\/em><\/p><\/blockquote>\n<p>Since the most salient feature of the writing systems used to write Mandarin, Japanese, Korean, and Thai is the\u00a0<em>absence<\/em>\u00a0of segmentation information (e.g., whitespace used to indicate word boundaries), presumably the authors mean to say that the data they are using has been pre-segmented (by some unspecified means). This, however, is a property of the available data, not of these languages.<\/p>\n<p>[h\/t: Richard Sproat]<\/p>\n<div class=\"gs\">\n<div class=\"\">\n<div id=\":cn\" class=\"ii gt\">\n<div id=\":cm\" class=\"a3s aiL \">\n<div dir=\"ltr\">\n<div>\n<h1>References<\/h1>\n<p>Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint 2312.11805. URL: <a href=\"https:\/\/arxiv.org\/abs\/2312.11805\">https:\/\/arxiv.org\/abs\/2312.11805<\/a>.<\/p>\n<\/div>\n<div>\n<p>Gorman, K. and Sproat, R. 
2023.\u00a0<a href=\"https:\/\/aclanthology.org\/2023.cawl-1.1\/\">Myths about writing systems in speech &amp; language technology<\/a>. In\u00a0<i>Proceedings of the Workshop on Computation and Written Language<\/i>, pages 1-5.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In a recent paper (Gorman &amp; Sproat 2023), we complain about the conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like &#8220;right-to-left language&#8221;, &#8220;syllabic language&#8221; or &#8220;ideographic language&#8221; found in the literature. Thus we were surprised to find the following: Four segmented languages (Mandarin, Japanese, Korean &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/segmented-languages\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;&#8220;Segmented languages&#8221;&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[3,4,5,7],"tags":[],"class_list":["post-1951","post","type-post","status-publish","format-standard","hentry","category-dev","category-language","category-nlp","category-presentation-of-self"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/1951","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=1951"}],"version-history":[{"count":2,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json
\/wp\/v2\/posts\/1951\/revisions"}],"predecessor-version":[{"id":1956,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/1951\/revisions\/1956"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=1951"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=1951"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=1951"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}