Streaming decompression for the Reddit dumps

I was recently working with the Reddit comment and submission dumps from PushShift (RIP).¹ These are compressed in Zstandard .zst format. Unfortunately, Python’s otherwise extensive standard library has no native support for this format, and some of the files are quite large,² so a streaming API is necessary.

After trying various third-party libraries, I finally found one that worked with a minimum of fuss: pyzstd, available from PyPI or Conda. It appears to use Meta’s reference C implementation as the backend, but more importantly, it provides a streaming API like the familiar gzip.open, bz2.open, and lzma.open for .gz, .bz2, and .xz files, respectively. There’s one nit: PushShift’s Reddit dumps were compressed with an uncommonly large window (a window log of 31, i.e., a 2 GiB window), and one has to inform the decompression backend of this. Without it, I was getting the following error:

_zstd.ZstdError: Unable to decompress zstd data: Frame requires too much memory for decoding.

All I had to do to fix this was pass the relevant parameter:

import pyzstd

# Accept frames with a window log up to 31 (the dumps' oversized window).
PARAMS = {pyzstd.DParameter.windowLogMax: 31}

with pyzstd.open(yourpath, "rt", level_or_options=PARAMS) as source:
    for line in source:
        ...
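
Note that opening in "rt" rather than the default binary mode adds a text layer over the decompressed stream, so iteration yields decoded str lines rather than bytes.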

Each line is then a JSON object with the post (either a comment or a submission) and all its metadata, as in the sketch below.
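
For instance, here is a minimal sketch that tallies comments per subreddit. The file name and the subreddit field are assumptions based on PushShift’s usual naming and schema, not something pyzstd provides:

import json
from collections import Counter

import pyzstd

PARAMS = {pyzstd.DParameter.windowLogMax: 31}
counts = Counter()

# "RC_2023-12.zst" is a hypothetical monthly comments file; each line is
# assumed to carry a "subreddit" key, per the usual PushShift schema.
with pyzstd.open("RC_2023-12.zst", "rt", level_or_options=PARAMS) as source:
    for line in source:
        post = json.loads(line)
        counts[post["subreddit"]] += 1

print(counts.most_common(10))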

Endnotes

  1. Psst, don’t tell anybody, but… while these are no longer being updated, they are available through December 2023 here. We have found them useful!
  2. Unfortunately, they’re grouped first by comments vs. submissions, and then by month. I would have preferred the files to be grouped by subreddit instead.
