Surprises for the new NLP developer

There are a couple things that surprise students when they first begin to develop natural language processing applications.

  • Some things just take a while. A script that, say, preprocesses millions of sentences isn’t necessarily wrong because it takes a half hour.
  • You really do have to avoid wasting memory. If you’re processing a big file line-by-line,
    • you really can’t afford to read it all in at once, and
    • you should write out data as soon as you can.
  • The OS and program already know how to buffer IO; don’t fight it.
  • Whereas so much software works with data in human non-readable (e.g., wire formats, binary data) or human-hostile (XML) formats, if you’re processing text files, you can just open the files up and read them to see if they’re roughly what you expected.

Leave a Reply

Your email address will not be published. Required fields are marked *