Software installation

Install OpenFst and OpenGrm-NGram:

$ conda install -c conda-forge openfst ngram

Data

Download text data from the Wall St. Journal portion of the Penn Treebank corpus at the following URL:

https://www.wellformedness.com/courses/LING83800/Data/wsj.tar.gz
$ curl -O https://www.wellformedness.com/courses/LING83800/data/wsj.tar.gz

Then decompress it:

$ tar -xzf wsj.tar.gz

This creates a directory with three files; we'll use wsj_train.txt, which has one sentence per line.

Character model

$ head -1 wsj_train.txt  # For demonstration purposes only.
Efforts by the Hong Kong Futures Exchange to introduce a new interest-rate futures contract continue to hit snags, despite the support the proposed instrument enjoys in the colony's financial community.
$ farcompilestrings \
      --fst_type=compact \
      --token_type=byte \
      wsj_train.txt \
      wsj_train.far
$ farinfo wsj_train.far  # For demonstration.
far type                                          sttable
arc type                                          standard
fst type                                          compact_string
# of FSTs                                         34827
total # of states                                 4420561
total # of arcs                                   4385734
total # of final states                           34827
$ ngramcount --require_symbols=false --order=6 wsj_train.far wsj_train.cnt
$ ngrammake --method=witten_bell wsj_train.cnt wsj_train.lm
$ fstinfo wsj_train.lm  # For demonstration: it's an ordinary FST.
fst type                                          vector
arc type                                          standard
input symbol table                                none
output symbol table                               none
# of states                                       402801
# of arcs                                         1361523
initial state                                     1
# of final states                                 8653
# of input/output epsilons                        402800
# of input epsilons                               402800
# of output epsilons                              402800
input label multiplicity                          1
output label multiplicity                         1
# of accessible states                            402801
# of coaccessible states                          402801
# of connected states                             402801
# of connected components                         1
# of strongly conn components                     6551
input matcher                                     y
output matcher                                    y
input lookahead                                   n
output lookahead                                  n
expanded                                          y
mutable                                           y
error                                             n
acceptor                                          y
input deterministic                               y
output deterministic                              y
input/output epsilons                             y
input epsilons                                    y
output epsilons                                   y
input label sorted                                y
output label sorted                               y
weighted                                          y
cyclic                                            y
cyclic at initial state                           n
top sorted                                        n
accessible                                        y
coaccessible                                      y
string                                            n
weighted cycles                                   y
$ ngramshrink \
      --method=relative_entropy \
      --target_number_of_ngrams=100000 \
      wsj_train.lm \
      wsj_train.shrunk.lm
$ ngraminfo wsj_train.shrunk.lm   # For demonstration.
# of states                                       33971
# of ngram arcs                                   99815
# of backoff arcs                                 33970
initial state                                     1
unigram state                                     0
# of final states                                 185
ngram order                                       6
# of 1-grams                                      87
# of 2-grams                                      2356
# of 3-grams                                      13808
# of 4-grams                                      32411
# of 5-grams                                      35891
# of 6-grams                                      15447
well-formed                                       y
normalized                                        y
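
With --token_type=byte, farcompilestrings reads each line as a sequence of bytes and labels each arc with one byte's integer value. The Python sketch below illustrates that mapping only; the helper name byte_labels is hypothetical and not part of OpenFst. It also shows why the farinfo totals above differ by exactly the number of FSTs: a line of n bytes compiles to a linear FST with n arcs and n + 1 states.

```python
def byte_labels(line):
    """Returns the integer arc labels for one line under a byte token type."""
    return list(line.encode("utf-8"))


labels = byte_labels("Efforts")
print(labels)  # [69, 102, 102, 111, 114, 116, 115]

# A string FST has one arc per label and one more state than arcs.
num_arcs = len(labels)
num_states = num_arcs + 1

# Totals copied from the farinfo output for the character model.
total_states = 4420561
total_arcs = 4385734
num_fsts = 34827
print(total_states - total_arcs == num_fsts)  # True
```

Non-ASCII characters occupy more than one byte in UTF-8, so under this scheme they contribute more than one arc each.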

Token model

For a token model, the data must first be tokenized, which can be done with a simple script (invoked below as word_tokenize.py). One may also wish to case-fold the data, though this is not done here. Whereas the character model uses each byte's integer value as its arc label, the token model requires a symbol table mapping tokens to integer labels, which ngramsymbols builds automatically.
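
The tokenization script itself is not reproduced in this text. As a rough stand-in, the Python sketch below splits off punctuation and common English clitics in a way that reproduces the sample line shown in this section; the regular expressions, the function names, and the optional casefold argument are all assumptions made for illustration, not the original script. The second function pictures what a token-to-integer symbol table looks like, reserving label 0 for <epsilon> as OpenFst conventionally does. The real ngramsymbols also handles details such as an out-of-vocabulary symbol, which this sketch omits.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for word_tokenize.py; not the original script."""

import re
import sys

# Clitics split off Penn Treebank-style: colony's -> colony 's.
_CLITICS = re.compile(r"(n't|'(?:s|re|ve|ll|d|m))\b", re.IGNORECASE)
# Punctuation that always becomes its own token.
_PUNCT = re.compile(r"([,;:!?()\"])")
# A period only at the very end of the line, so "St." mid-line is untouched.
_FINAL_PERIOD = re.compile(r"\.\s*$")


def tokenize(line, casefold=False):
    """Splits one sentence into a list of tokens."""
    if casefold:
        line = line.casefold()
    line = _CLITICS.sub(r" \1", line)
    line = _PUNCT.sub(r" \1 ", line)
    line = _FINAL_PERIOD.sub(" .", line)
    return line.split()


def build_symbol_table(tokenized_lines):
    """Maps each token to an integer label in order of first appearance."""
    table = {"<epsilon>": 0}  # Label 0 is reserved for epsilon.
    for tokens in tokenized_lines:
        for token in tokens:
            table.setdefault(token, len(table))
    return table


# Usage: ./word_tokenize.py wsj_train.txt > wsj_train.tok
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as source:
        for line in source:
            print(" ".join(tokenize(line)))
```

Calling tokenize(line, casefold=True) would lowercase the text before splitting, which shrinks the vocabulary at the cost of merging forms such as "Futures" and "futures".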

$ ./word_tokenize.py wsj_train.txt > wsj_train.tok
$ head -1 wsj_train.tok  # For demonstration purposes only.
Efforts by the Hong Kong Futures Exchange to introduce a new interest-rate futures contract continue to hit snags , despite the support the proposed instrument enjoys in the colony 's financial community .
$ ngramsymbols wsj_train.tok wsj_train.sym
$ farcompilestrings \
      --fst_type=compact \
      --symbols=wsj_train.sym \
      --keep_symbols \
      wsj_train.tok \
      wsj_train.far
$ farinfo wsj_train.far  # For demonstration.
far type                                          sttable
arc type                                          standard
fst type                                          compact_string
# of FSTs                                         34827
total # of states                                 873020
total # of arcs                                   838193
total # of final states                           34827
$ ngramcount \
      --order=3 \
      wsj_train.far \
      wsj_train.cnt
$ ngrammake --method=kneser_ney wsj_train.cnt wsj_train.lm
$ fstinfo wsj_train.lm  # For demonstration: it's an ordinary FST.
fst type                                          vector
arc type                                          standard
input symbol table                                wsj_train.sym
output symbol table                               wsj_train.sym
# of states                                       351486
# of arcs                                         1292422
initial state                                     1
# of final states                                 8292
# of input/output epsilons                        351485
# of input epsilons                               351485
# of output epsilons                              351485
input label multiplicity                          1
output label multiplicity                         1
# of accessible states                            351486
# of coaccessible states                          351486
# of connected states                             351486
# of connected components                         1
# of strongly conn components                     4100
input matcher                                     y
output matcher                                    y
input lookahead                                   n
output lookahead                                  n
expanded                                          y
mutable                                           y
error                                             n
acceptor                                          y
input deterministic                               y
output deterministic                              y
input/output epsilons                             y
input epsilons                                    y
output epsilons                                   y
input label sorted                                y
output label sorted                               y
weighted                                          y
cyclic                                            y
cyclic at initial state                           n
top sorted                                        n
accessible                                        y
coaccessible                                      y
string                                            n
weighted cycles                                   y
$ ngramshrink \
      --method=relative_entropy \
      --target_number_of_ngrams=100000 \
      wsj_train.lm \
      wsj_train.shrunk.lm
$ ngraminfo wsj_train.shrunk.lm  # For demonstration. 
# of states                                       21175
# of ngram arcs                                   99934
# of backoff arcs                                 21174
initial state                                     1
unigram state                                     0
# of final states                                 66
ngram order                                       3
# of 1-grams                                      40925
# of 2-grams                                      49061
# of 3-grams                                      10014
well-formed                                       y
normalized                                        y
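
The two ngraminfo summaries for the shrunk models can be cross-checked with a little arithmetic. In both, the per-order n-gram counts sum to exactly the --target_number_of_ngrams value of 100000, which also equals the number of n-gram arcs plus the number of final states, and the number of backoff arcs is one less than the number of states. The Python sketch below only restates numbers printed above; the dictionary layout and the check function are illustrative and not part of OpenGrm.

```python
TARGET = 100000  # Value passed to --target_number_of_ngrams.

# Figures copied from the two ngraminfo outputs for the shrunk models.
models = {
    "character": {
        "states": 33971,
        "ngram_arcs": 99815,
        "backoff_arcs": 33970,
        "final_states": 185,
        "ngrams_by_order": [87, 2356, 13808, 32411, 35891, 15447],
    },
    "token": {
        "states": 21175,
        "ngram_arcs": 99934,
        "backoff_arcs": 21174,
        "final_states": 66,
        "ngrams_by_order": [40925, 49061, 10014],
    },
}


def check(info, target=TARGET):
    """Returns True if the summary figures are mutually consistent."""
    return (
        sum(info["ngrams_by_order"]) == target
        and info["ngram_arcs"] + info["final_states"] == target
        and info["backoff_arcs"] == info["states"] - 1
    )


for name, info in models.items():
    print(name, check(info))
```

Both models pass all three checks, which suggests that shrinking hit the requested size exactly in each case.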