<h1>Using a fixed training-development-test split in sklearn</h1>
<p><em>2019-03-31</em></p>
<p>The <a href="https://scikit-learn.org/stable/">scikit-learn</a> machine learning library has good support for various forms of model selection and hyperparameter tuning. For setting regularization hyperparameters, there are model-specific cross-validation tools, and there are also tools for grid (i.e., exhaustive) hyperparameter tuning with <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html"><code>sklearn.model_selection.GridSearchCV</code></a> and for random hyperparameter tuning (in the sense of Bergstra &amp; Bengio 2012) with <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html"><code>sklearn.model_selection.RandomizedSearchCV</code></a>. While you could probably implement these yourself, the sklearn developers have enabled just about every feature you could want, including multiprocessing support.</p>
<p>One apparent limitation of these classes is that, as their names suggest, they are designed for use in a <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">cross-validation</a> setting. In speech &amp; language technology, however, standard practice is to use a fixed partition of the data into training, development (i.e., validation), and test (i.e., evaluation) sets, and to select hyperparameters which maximize performance on the development set.
This is in part an artifact of the limited computing resources of the Penn Treebank era, and I&#8217;ve long suspected it has serious repercussions for model evaluation. But tuning and evaluating with a standard split is faster than cross-validation and can make exact replication much easier. And there are also some concerns about whether cross-validation is the best way to set hyperparameters anyway. So what can we do?</p>
<p>The <code>GridSearchCV</code> and <code>RandomizedSearchCV</code> classes take an optional <code>cv</code> keyword argument, which can be, among other things, an object implementing the <a href="https://scikit-learn.org/stable/glossary.html#term-cross-validation-splitter"><em>cross-validation iterator</em> interface</a>. At first I thought I would create an object which allowed me to use a fixed development set for hyperparameter tuning, but then I realized that I could do this with one of the existing iterator classes, namely <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit"><code>sklearn.model_selection.PredefinedSplit</code></a>. The constructor for this class takes a single argument <code>test_fold</code>, an array of integers of the same size as the data passed to the fitting method. As the <a href="https://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets">documentation</a> explains, &#8220;&#8230;when using a validation set, set the <code>test_fold</code> to 0 for all samples that are part of the validation set, and to -1 for all other samples.&#8221; That we can do. Suppose that we have training data <code>x_train</code> and <code>y_train</code> and development data <code>x_dev</code> and <code>y_dev</code>, laid out as <a href="http://www.numpy.org/">NumPy</a> arrays.
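<p>As a quick sanity check on those semantics, here is a toy illustration (not from the original recipe) of what <code>PredefinedSplit</code> does with such a <code>test_fold</code> array: it yields exactly one split, with the -1 samples as the training indices and the 0 samples as the held-out indices.</p>

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 = sample always stays in the training split;
#  0 = sample belongs to the single held-out (development) fold.
test_fold = np.array([-1, -1, -1, 0, 0])
cv = PredefinedSplit(test_fold)

print(cv.get_n_splits())  # 1
train_idx, dev_idx = next(cv.split())
print(train_idx.tolist(), dev_idx.tolist())  # [0, 1, 2] [3, 4]
```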
We then create a combined training-and-development set like so:</p>
<div id="predefined-fold-splits-validation-sets" class="section">
<pre>import numpy

x = numpy.concatenate([x_train, x_dev])
y = numpy.concatenate([y_train, y_dev])
</pre>
<p>Then, we create the iterator object. Note that the fold labels run along the sample axis (axis 0): -1 for the training samples and 0 for the development samples:</p>
<pre>import sklearn.model_selection

test_fold = numpy.concatenate([
    # The training data.
    numpy.full(x_train.shape[0], -1, dtype=numpy.int8),
    # The development data.
    numpy.zeros(x_dev.shape[0], dtype=numpy.int8)
])
cv = sklearn.model_selection.PredefinedSplit(test_fold)</pre>
</div>
<p>Finally, we provide <code>cv</code> as a keyword argument to the grid or random search constructor, and then train. For instance, similar to <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py">this example</a>, we might do something like:</p>
<pre>import sklearn.ensemble

base = sklearn.ensemble.RandomForestClassifier()
grid = {"bootstrap": [True, False],
        "max_features": [1, 3, 5, 7, 9, 10]}
model = sklearn.model_selection.GridSearchCV(base, grid, cv=cv)
model.fit(x, y)
</pre>
<p>Now just add <code>n_jobs=-1</code> to the constructor for <code>model</code> to spread the work across all your logical cores.</p>
<h1>References</h1>
<p>Bergstra, J., and Bengio, Y. 2012. Random search for hyperparameter optimization. <em>Journal of Machine Learning Research</em> 13: 281-305.</p>
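<p>For completeness, the whole recipe can be assembled into one self-contained, runnable sketch. The synthetic <code>x_train</code>/<code>x_dev</code> arrays and the smaller hyperparameter grid here are placeholders for illustration only:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

rng = np.random.default_rng(0)

# Synthetic stand-ins for a real train/dev split: 2-D feature arrays
# with samples along axis 0.
x_train = rng.normal(size=(80, 5))
y_train = rng.integers(0, 2, size=80)
x_dev = rng.normal(size=(20, 5))
y_dev = rng.integers(0, 2, size=20)

x = np.concatenate([x_train, x_dev])
y = np.concatenate([y_train, y_dev])

# -1 = always in the training split; 0 = the single development fold.
test_fold = np.concatenate([
    np.full(x_train.shape[0], -1, dtype=np.int8),
    np.zeros(x_dev.shape[0], dtype=np.int8),
])
cv = PredefinedSplit(test_fold)

base = RandomForestClassifier(n_estimators=10, random_state=0)
grid = {"bootstrap": [True, False], "max_features": [1, 3, 5]}
model = GridSearchCV(base, grid, cv=cv)
model.fit(x, y)
print(model.best_params_)
```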