{"id":675,"date":"2019-03-08T23:15:53","date_gmt":"2019-03-08T23:15:53","guid":{"rendered":"http:\/\/www.wellformedness.com\/blog\/?p=675"},"modified":"2020-10-16T15:52:03","modified_gmt":"2020-10-16T15:52:03","slug":"a-minimalist-project-design-for-nlp","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/a-minimalist-project-design-for-nlp\/","title":{"rendered":"A minimalist project design for NLP"},"content":{"rendered":"<p>Let&#8217;s say you want to build a new tagger, a new named entity recognizer, a new dependency parser, or whatever. Or perhaps you just want to see how your coreference resolution engine performs on your new database of anime reviews. So how should you structure your project? Here&#8217;s my minimalist solution.<\/p>\n<p>There are two principles that guide my design. The first one is <em>modularity. <\/em>Some of these components will get run many times, some won&#8217;t. If you&#8217;re doing model comparison\u2014and you should be doing model comparison\u2014some components will get swapped out with someone else&#8217;s code. This sort of thing is a major lift unless you opt for modularity. The second principle is\u00a0<em>filesystem state<\/em>. The filesystem is your friend. If your embedding table eats up all your RAM and you have to restart, the filesystem will be in roughly the same state as when you left. The filesystem allows you to organize things into directories and subdirectories, and give the pieces informative names; I like to record information about datasets and hyperparameter values in my file and directory names. So without further ado, here are the recommended scripts or applications to create when you&#8217;re starting off on a new project.<\/p>\n<ol>\n<li><code>split<\/code> takes the <span style=\"text-decoration: underline;\">full dataset<\/span> and a <span style=\"text-decoration: underline;\">random seed<\/span> (which you should store for later) as input. 
The script reads the data in, randomly shuffles it, and then splits it into an 80% training set, a <span style=\"text-decoration: underline;\">10% development set<\/span>, and a <span style=\"text-decoration: underline;\">10% test set<\/span> (i.e., evaluation set), which it then outputs. If you&#8217;re comparing to prior work that used a &#8220;standard split&#8221; you may want to have a separate script that generates that too, but I strongly recommend using randomly generated splits.<\/li>\n<li><code>train<\/code> takes the <span style=\"text-decoration: underline;\">training set<\/span> as input and outputs <span style=\"text-decoration: underline;\">a model file or directory<\/span>. If you&#8217;re automating hyperparameter tuning you will also want to provide the <span style=\"text-decoration: underline;\">development set<\/span> as input; if not, you will probably want to either <span style=\"text-decoration: underline;\">add a bunch of flags to control the hyperparameters<\/span> or <span style=\"text-decoration: underline;\">allow the user to pass some kind of model configuration file<\/span> (I like <a href=\"https:\/\/en.wikipedia.org\/wiki\/YAML\">YAML<\/a> for this).<\/li>\n<li><code>apply<\/code> takes as input <span style=\"text-decoration: underline;\">the model file(s)<\/span> produced in (2) and the <span style=\"text-decoration: underline;\">test set<\/span>, and applies the model to the data, outputting a new <span style=\"text-decoration: underline;\">hypothesized test data set<\/span> (i.e., the model&#8217;s predictions). 
One open question is whether <code>apply<\/code> should accept only unlabeled data or should overwrite any existing labels: it depends.<\/li>\n<li><code>evaluate<\/code> takes as input <span style=\"text-decoration: underline;\">the gold test set<\/span> and the <span style=\"text-decoration: underline;\">hypothesized test data set<\/span> generated in (3) and outputs <span style=\"text-decoration: underline;\">the evaluation results<\/span> (as text or in some structured data format\u2014sometimes YAML is a good choice, other times TSV files will do). I recommend you test this with a small amount of data first.<\/li>\n<\/ol>\n<p>That&#8217;s all there is to it. When you begin doing model comparison you may find yourself swapping out steps (2) and (3) for somebody else&#8217;s code, but make sure to stick to the same evaluation script.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s say you want to build a new tagger, a new named entity recognizer, a new dependency parser, or whatever. Or perhaps you just want to see how your coreference resolution engine performs on your new database of anime reviews. So how should you structure your project? Here&#8217;s my minimalist solution. 
There are two principles &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/a-minimalist-project-design-for-nlp\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;A minimalist project design for NLP&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,4,5,7,10],"tags":[],"class_list":["post-675","post","type-post","status-publish","format-standard","hentry","category-dev","category-language","category-nlp","category-presentation-of-self","category-stats"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/675","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=675"}],"version-history":[{"count":3,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/675\/revisions"}],"predecessor-version":[{"id":960,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/675\/revisions\/960"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=675"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=675"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=675"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}