{"id":2224,"date":"2024-11-28T11:30:46","date_gmt":"2024-11-28T16:30:46","guid":{"rendered":"https:\/\/www.wellformedness.com\/blog\/?p=2224"},"modified":"2024-12-12T14:07:42","modified_gmt":"2024-12-12T19:07:42","slug":"hugging-face-needs-better-curation","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/hugging-face-needs-better-curation\/","title":{"rendered":"Hugging Face needs better curation"},"content":{"rendered":"<p><a href=\"https:\/\/huggingface.co\/\">Hugging Face<\/a> is, among other things, a platform for obtaining pre-trained neural network models. We use their <a href=\"https:\/\/pypi.org\/project\/tokenizers\/\"><code>tokenizers<\/code><\/a> and <a href=\"https:\/\/pypi.org\/project\/transformers\/\"><code>transformers<\/code><\/a> Python libraries in a number of projects. While these have a bit more abstraction than I like, and are arguably over-featured, they are fundamentally quite good and make it really easy to, e.g., add a pre-trained encoder. I also appreciate that the tokenizers are mostly compiled code (they&#8217;re Rust extensions, apparently), which in practice means that tokenization is IO-bound rather than CPU-bound.<\/p>\n<p>My use case mostly involves loading Hugging Face transformers and their tokenizers and using their encoding layers for fine-tuning. To load a model in <code>transformers<\/code>, one calls the function <code>transformers.AutoModel.from_pretrained<\/code>, providing the name of the model on Hugging Face as a string argument. If the model exists but you don&#8217;t already have a local copy, Hugging Face will automatically download it for you (and stash the assets in some hidden directory). 
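<\/p>\n<p>A minimal sketch of that workflow (the model name here is just an example; any model ID on the Hub is passed the same way):<\/p>\n

```python
from transformers import AutoModel, AutoTokenizer

# 'bert-base-cased' is just an example model ID; the first call downloads
# the weights and caches them locally, later calls reuse the cache.
model = AutoModel.from_pretrained('bert-base-cased')
# The matching tokenizer is requested by the same name.
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
batch = tokenizer(['a sentence to encode'], return_tensors='pt')
```

\n<p>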
One can do something similar with <code>transformers.AutoTokenizer<\/code>, or one can request the tokenizer from the model instance.<\/p>\n<p>Now you might think that this would make it easy to, say, write a command-line tool where the user can specify any Hugging Face model, but unfortunately, you&#8217;d be wrong. First off, a lot of models, including so-called <em>token-free<\/em> ones, lack a tokenizer. Why doesn&#8217;t <a href=\"https:\/\/huggingface.co\/docs\/transformers\/v4.28.1\/model_doc\/byt5\">ByT5<\/a>, for instance, provide as its tokenizer a trivial Rust (or even Python) function that returns bytes? In practice, one cannot support arbitrary Hugging Face models because one cannot count on them having a tokenizer. I see no alternative but to keep a list of supported models that lack their own tokenizer, and such a list is necessarily incomplete because the model hub continues to grow.<\/p>\n<p>A similar problem comes with how the parameters of these models are named. Most models are trained with dropout and support a dropout parameter, but the name of this parameter is inconsistent from model to model. In <a href=\"https:\/\/github.com\/CUNY-CL\/udtube\">UDTube<\/a>, for instance, dropout is a global parameter: it is applied to each hidden layer of the encoder (which requires us to access the guts of the Hugging Face model), and <a href=\"https:\/\/github.com\/CUNY-CL\/udtube\/blob\/46b14ddff18081f9d4ff29f5df8a09c4ed1ae484\/udtube\/modules.py#L149\">then again to the contextual subword embeddings<\/a> just before they&#8217;re pooled into word embeddings. 
Most of the models we&#8217;ve looked at call the dropout probability of the encoder <code>hidden_dropout_prob<\/code>, but others call it <code>dropout<\/code> or <code>dropout_rate<\/code>. Because of this, we have to maintain <a href=\"https:\/\/github.com\/CUNY-CL\/udtube\/blob\/master\/udtube\/encoders.py\">a module which keeps track of what the hidden-layer dropout probability parameter is called<\/a>.<\/p>\n<p>I think this is basically a failure of curation. Hugging Face community managers should be out there fixing these gaps and inconsistencies, or at the very least publishing standards for them. They&#8217;re valued at $4.5 billion. I would argue this is at least as important as their efforts with model cards and the like.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hugging Face is, among other things, a platform for obtaining pre-trained neural network models. We use their tokenizers and\u00a0 transformers Python libraries in a number of projects. While these have a bit more abstraction than I like, and are arguably over-featured, they are fundamentally quite good and make it really easy to, e.g., add a &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/hugging-face-needs-better-curation\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Hugging Face needs better 
curation&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[3,8],"tags":[],"class_list":["post-2224","post","type-post","status-publish","format-standard","hentry","category-dev","category-python"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/2224","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=2224"}],"version-history":[{"count":3,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/2224\/revisions"}],"predecessor-version":[{"id":2239,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/2224\/revisions\/2239"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=2224"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=2224"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=2224"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}