{"id":2157,"date":"2024-09-26T11:59:33","date_gmt":"2024-09-26T15:59:33","guid":{"rendered":"https:\/\/www.wellformedness.com\/blog\/?p=2157"},"modified":"2024-09-26T11:59:33","modified_gmt":"2024-09-26T15:59:33","slug":"learned-tokenization","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/learned-tokenization\/","title":{"rendered":"Learned tokenization"},"content":{"rendered":"<p>Conventional (i.e., non-neural, pre-BERT) NLP stacks tend to use rule-based systems for tokenizing sentences into words. One good example is Spacy, which provides rule-based tokenizers for the languages it supports. I am sort of baffled this is considered a good idea for languages other than English, since it seems to me that <em>most<\/em> languages need machine learning for even this task to properly handle phenomena like clitics. If you like the Spacy interface\u2014I admit it&#8217;s very convenient\u2014and work in Python, you may want to try the<code><a href=\"https:\/\/github.com\/TakeLab\/spacy-udpipe\">spacy-udpipe<\/a><\/code> library, which exposes the <a href=\"https:\/\/lindat.mff.cuni.cz\/repository\/xmlui\/handle\/11234\/1-3131\">UDPipe 1.5 models for Universal Dependencies 2.5<\/a>; these in turn use learned tokenizers (and taggers, morphological analyzers, and dependency parsers, if you care) trained on high-quality Universal Dependencies data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Conventional (i.e., non-neural, pre-BERT) NLP stacks tend to use rule-based systems for tokenizing sentences into words. One good example is Spacy, which provides rule-based tokenizers for the languages it supports. I am sort of baffled this is considered a good idea for languages other than English, since it seems to me that most languages need &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/learned-tokenization\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Learned tokenization&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"aside","meta":{"_crdt_document":"","footnotes":""},"categories":[3,8],"tags":[],"class_list":["post-2157","post","type-post","status-publish","format-aside","hentry","category-dev","category-python","post_format-post-format-aside"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/2157","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=2157"}],"version-history":[{"count":1,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/2157\/revisions"}],"predecessor-version":[{"id":2158,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/2157\/revisions\/2158"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=2157"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=2157"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=2157"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}