Surprises for the new NLP developer

There are a few things that surprise students when they first begin to develop natural language processing applications.

  • Some things just take a while. A script that, say, preprocesses millions of sentences isn’t necessarily wrong just because it takes half an hour.
  • You really do have to avoid wasting memory. If you’re processing a big file line-by-line,
    • you really can’t afford to read it all in at once, and
    • you should write out data as soon as you can.
  • The OS and program already know how to buffer IO; don’t fight it.
  • Whereas so much software works with data in formats that are not human-readable (e.g., wire formats, binary data) or are human-hostile (XML), if you’re processing text files, you can just open them up and read them to see whether they’re roughly what you expected.
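
As a rough illustration of the memory and buffering points above, here is a minimal Python sketch of line-by-line processing. The names preprocess and stream_preprocess are hypothetical, and the lowercase-and-collapse-whitespace step is only a stand-in for whatever preprocessing you actually need. In-memory streams stand in for real files so that the snippet runs on its own.

```python
import io


def preprocess(line: str) -> str:
    """Hypothetical per-sentence step: lowercase and collapse whitespace."""
    return " ".join(line.lower().split())


def stream_preprocess(src, dst) -> int:
    """Read src one line at a time and write each result to dst right away.

    Only the current line is held in memory, so the input can be far
    larger than RAM. Returns the number of lines processed.
    """
    count = 0
    for line in src:  # text file objects yield one line per iteration
        dst.write(preprocess(line) + "\n")
        count += 1
    return count


# Demo with in-memory streams standing in for an input and an output file.
src = io.StringIO("Hello   World\nThe  CAT sat\n")
dst = io.StringIO()
n = stream_preprocess(src, dst)
```

With real files, the same function could be called inside a with block that opens the input for reading and the output for writing. There is no manual flushing or hand-rolled chunking here: the file objects and the OS already buffer reads and writes.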

Should you assign it to a variable?

Suppose you’re going to compute some value and then pass it to another function. Which snippet is better (using lhs as shorthand for a variable identifier and rhs() as shorthand for the right-hand side expression)?

lhs = rhs()
do_work(lhs)

or

do_work(rhs())

My short answer is: it depends. Here is my informal set of heuristics:

    • Is a type-cast involved (e.g., in a statically typed language like C)? If so, assign it to a variable.
    • Would the variable name be a meaningful one that would provide non-obvious information about the nature of the computation? If so, assign it to a variable. For instance, if a long right-hand side expression computes perplexity, ppl = ... or perplex = ... is about as useful as an inline comment.
    • Is the computation used again in the same scope? If so, assign it to a variable.
    • Is the right-hand side just a very complicated expression? If so, consider assigning it to a variable, and try to give it an informative name.
    • Otherwise, do not assign it to a variable.
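
To make the heuristics concrete, here is a small Python sketch showing one case on each side. The report function and the toy log-probabilities are made up for illustration. The perplexity formula used, the exponential of the negative mean log-probability, is the standard one, but treat the whole thing as an example under those assumptions rather than a recipe.

```python
import math


def report(value: float) -> str:
    """Hypothetical consumer of the computed value."""
    return f"{value:.2f}"


# Toy per-token natural-log probabilities (made-up numbers).
log_probs = [math.log(0.25), math.log(0.5), math.log(0.125)]

# Worth a variable: the name tells the reader what the expression computes,
# much as an inline comment would.
ppl = math.exp(-sum(log_probs) / len(log_probs))
named = report(ppl)

# Not worth a variable: the expression is short, self-explanatory,
# and used only once.
words = "the cat sat".split()
inline = report(len(words))
```

In the first case the name ppl carries information that the expression alone does not. In the second, a name such as num_words would only restate what len(words) already says.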