reddit_mining

Data samples and data-transformation code for Digital Humanities sentiment analysis and affective prosody research.

I made a big mistake by lowercasing the URLs in the reddit_links dataset. I recommend you use https://the-eye.eu/redarcs/ instead.

List of Subreddits

There are over two million subreddits, but I've curated a list of roughly the top 60,000.

Downloads

The most interesting files are likely going to be top_link_subreddits.csv and top_text_subreddits.csv.

The files starting with long_* and nsfw_* contain the same data -- they are just sorted differently. Check insights.md for more details.

I thought I knew most subreddits, but there were a few popular ones that I discovered while writing this:

  • /r/lastimages
  • /r/invasivespecies
  • /r/MomForAMinute
  • /r/CrazyDictatorIdeas
  • /r/drydockporn
  • /r/ancientpics
  • /r/coaxedintoasnafu
  • /r/actualconspiracies
  • /r/3FrameMovies
  • /r/thisisntwhoweare
  • /r/CorporateMisconduct
  • /r/NuclearRevenge
  • /r/redditserials
  • /r/HobbyDrama

How was this made?

The data aggregates uploaded here were created by converting the pushshift RS*.zst dumps into SQLite format using the pushshift subcommand of the xklb Python package:

# fish shell -- download the monthly submission dumps (several hundred GB of zst files)
wget -e robots=off -r -k -A zst -P psaw https://files.pushshift.io/reddit/submissions/

pip install xklb

# build one conversion command per dump and run four at a time
for f in psaw/files.pushshift.io/reddit/submissions/*
    echo "unzstd --memory=2048MB --stdout $f | library pushshift psaw/(basename $f).db"
end | parallel -j4

# combine the per-month databases into a single SQLite file
library merge submissions.db psaw/RS*.db
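
If you rebuild submissions.db yourself, you can sanity-check the merged database before querying it (this assumes the sqlite3 command-line shell is installed):

# list the tables and columns that the conversion produced
sqlite3 submissions.db ".tables"
sqlite3 submissions.db ".schema"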

Each step takes several days (and several terabytes of free space), but the end result is a 600 GB SQLite file. You can save some disk space by downloading the Parquet files below instead.

I split up submissions.db into two Parquet files via sqlite2parquet.
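
Roughly, the conversion looks something like this; the exact arguments here are an assumption, so check sqlite2parquet --help for the real interface:

# hypothetical invocation -- verify the actual arguments with sqlite2parquet --help
sqlite2parquet submissions.db parquet-out/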

Query the Parquet files using octosql. Depending on the query, octosql is usually faster than SQLite, and Parquet compresses very well. You can download the Parquet files here:

  1. reddit_links.parquet [87.7G]
  2. reddit_posts.parquet [~134G]
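
Example octosql queries against reddit_links.parquet (the grouped query assumes a subreddit column exists; verify against the actual schema):

# total rows -- makes no assumptions about the columns
octosql "SELECT COUNT(*) FROM ./reddit_links.parquet"

# per-subreddit counts -- assumes a 'subreddit' column
octosql "SELECT subreddit, COUNT(*) AS n FROM ./reddit_links.parquet GROUP BY subreddit ORDER BY n DESC LIMIT 20"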

Additionally, for simple analysis you can get by with downloading the sub-100MB pre-aggregated files in this repo. For the sake of speed, and to keep the experimental variables clearly defined, I have split the aggregations into two types of files based on the type of post:

  1. 'link' for traditional reddit link posts.
  2. 'text' for self posts (aka selftext), which were introduced in 2008.
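
Both kinds of aggregated files are plain CSV, so you can preview or query them directly; octosql reads CSV as well:

# peek at the header and first few rows
head -n 5 top_link_subreddits.csv

# schema-agnostic row count
octosql "SELECT COUNT(*) FROM ./top_link_subreddits.csv"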

Misc

user_stats_link.csv

user_stats_link.csv.zstd was 150MB (over GitHub's 100MB file-size limit), so I split it up into three files like this:

# split the CSV into numbered chunks of at most 250 MB of whole lines
split -d -C 250MB user_stats_link.csv user_stats_link_
# compress each chunk
zstd -19 user_stats_link_*

You can combine them back into a single file like this:

zstdcat user_stats_link_*.zstd > user_stats_link.csv
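
If you only need a quick look at the data, you can also stream the compressed chunks without writing the full CSV to disk:

# preview the first few rows straight from the compressed chunks
zstdcat user_stats_link_*.zstd | head -n 5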
