{"id":1910,"date":"2024-02-12T10:12:54","date_gmt":"2024-02-12T15:12:54","guid":{"rendered":"https:\/\/www.wellformedness.com\/blog\/?p=1910"},"modified":"2024-02-12T10:12:54","modified_gmt":"2024-02-12T15:12:54","slug":"streaming-decompression-reddit-dumps","status":"publish","type":"post","link":"https:\/\/www.wellformedness.com\/blog\/streaming-decompression-reddit-dumps\/","title":{"rendered":"Streaming decompression for the Reddit dumps"},"content":{"rendered":"<p>I was recently working with the Reddit comments and submission dumps from PushShift (RIP).<sup>1<\/sup> These are compressed in <a href=\"https:\/\/facebook.github.io\/zstd\/\">Zstandard<\/a> <code>.zst<\/code>format. Unfortunately, Python&#8217;s extensive standard library doesn&#8217;t have native support for this format, and the some of the files are quite large,<sup>2<\/sup> so a streaming API is necessary.<\/p>\n<p>After trying various third-party libraries, I finally found one that worked with a minimum of fuss: <tt><a href=\"https:\/\/github.com\/Superskyyy\/pyzstd\">pyzstd<\/a><\/tt>, available from PyPI or Conda. This appears to be using <del>Facebook<\/del>Meta&#8217;s reference C implementation as the backend, but more importantly, it provides a stream API like the familiar <code>gzip.open<\/code>, <code>bz2.open<\/code>, and <code>lzma.open<\/code> for <code>.gz<\/code>, <code>.bz2<\/code> and <code>.xz<\/code> files, respectively. There&#8217;s one nit: PushShift&#8217;s Reddit dumps were compressed with an uncommonly large window size (2 &lt;&lt; 31), and one has to inform the decompression backend. 
Without this, I was getting the following error:<\/p>\n<pre>_zstd.ZstdError: Unable to decompress zstd data: Frame requires too much memory for decoding.<\/pre>\n<p>All I had to do to fix this was to pass the relevant parameter:<\/p>\n<pre>import pyzstd\r\n\r\nPARAMS = {pyzstd.DParameter.windowLogMax: 31}\r\n\r\nwith pyzstd.open(yourpath, \"rt\", level_or_options=PARAMS) as source:\r\n    for line in source:\r\n        ...<\/pre>\n<p>Then, each <code>line<\/code> is a JSON message with the post (either a comment or submission) and all the metadata.<\/p>\n<h1>Endnotes<\/h1>\n<ol>\n<li>Psst, don&#8217;t tell anybody, but&#8230; while these are no longer being updated, they are available through December 2023 <a href=\"https:\/\/academictorrents.com\/details\/9c263fc85366c1ef8f5bb9da0203f4c8c8db75f4\/tech\">here<\/a>. We have found them <a href=\"https:\/\/academicworks.cuny.edu\/gc_etds\/3398\/\">useful<\/a>!<\/li>\n<li>Unfortunately, they&#8217;re grouped first by comments vs. submissions, and then by month. I would have preferred the files to be grouped by subreddit instead.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>I was recently working with the Reddit comments and submission dumps from PushShift (RIP).1 These are compressed in Zstandard .zst format. Unfortunately, Python&#8217;s extensive standard library doesn&#8217;t have native support for this format, and some of the files are quite large,2 so a streaming API is necessary. 
After trying various third-party libraries, I finally found &hellip; <a href=\"https:\/\/www.wellformedness.com\/blog\/streaming-decompression-reddit-dumps\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Streaming decompression for the Reddit dumps&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[4,5],"tags":[],"class_list":["post-1910","post","type-post","status-publish","format-standard","hentry","category-language","category-nlp"],"_links":{"self":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/1910","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/comments?post=1910"}],"version-history":[{"count":1,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/1910\/revisions"}],"predecessor-version":[{"id":1911,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/posts\/1910\/revisions\/1911"}],"wp:attachment":[{"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/media?parent=1910"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/categories?post=1910"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wellformedness.com\/blog\/wp-json\/wp\/v2\/tags?post=1910"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}