Wednesday, May 10, 2017

Looking At Reddit Comment Data, part 1

Last month I came across this article, Dissecting Trump’s Most Rabid Online Following. Despite the obvious political leaning of the author, it's still an interesting article. The discussion of calculating similarity between subreddits sounded vaguely familiar and reminded me of calculating movie and user similarities for the Netflix Prize. So I decided to take a look at the raw data behind the article (and try out Google Bigtable + BigQuery at the same time). Until then, I didn't know there was a large, publicly accessible dataset of reddit comments.

The original article used data from Jan 2015 to Dec 2016. At the time I looked at the dataset, it had data up to Feb 2017, so I used data from Jan 2015 to Feb 2017, which comes to almost 1.7 billion comments. After a quick look, I decided to ignore the comments from two special "user" accounts:
  • deleted users, with 7% of all comments
  • AutoModerator, a "moderation bot", with 0.7% of all comments
There are many other bots, but fortunately they seemed very small relative to AutoModerator. I tallied all users whose names start with "auto" or end with "bot" and who have at least 10000 comments; together they account for only 0.34% of all comments. So due to time constraints, I didn't go on a bot-hunting spree. After removing the two special "users" from the analysis, the dataset had:
  • 313080 subreddits
  • 14718265 users (who posted at least 1 comment)
  • 1566721864 comments
Note that the "users" here are those who posted at least 1 comment from Jan 2015 to Feb 2017; users who never posted a comment in that time frame are not in the dataset.
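
The bot tally above (names starting with "auto" or ending with "bot", with at least 10000 comments) can be sketched in a few lines of Python. This is only an illustration, not the query I actually ran: the function name bot_like_share is made up, the input is assumed to be one author name per comment, and matching names case-insensitively is an assumption.

```python
from collections import Counter


def bot_like_share(comment_authors, min_comments=10000):
    """Return the fraction of all comments written by bot-like accounts.

    An account is treated as bot-like when its name starts with "auto" or
    ends with "bot" (compared case-insensitively, which is an assumption)
    and it has at least `min_comments` comments.
    """
    counts = Counter(comment_authors)  # author name -> number of comments
    total = sum(counts.values())
    if total == 0:
        return 0.0
    bot_comments = sum(
        n
        for name, n in counts.items()
        if n >= min_comments
        and (name.lower().startswith("auto") or name.lower().endswith("bot"))
    )
    return bot_comments / total
```

A name-based rule like this is crude: it misses bots with other names and can catch humans whose names happen to match, which is part of why the 0.34% figure is only a rough estimate.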

Like most social web sites, a small number of users is responsible for most of the content:
  • top 0.01% of users posted 3% of all comments
  • top 0.1% of users posted 12% of all comments
  • top 1% of users posted 41% of all comments
  • top 10% of users posted 85% of all comments
Since the users who never posted a comment are not in the dataset, the percentages in terms of all reddit users are even more skewed.
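
For reference, a "top X% of users posted Y% of comments" figure like the ones above can be computed from per-user comment counts roughly as follows. This is a minimal sketch rather than the exact computation behind the numbers: top_share is a hypothetical helper, and rounding the group size up to at least one user is an assumption.

```python
import math


def top_share(comment_counts, top_fraction):
    """Share of all comments posted by the most active `top_fraction` of users.

    `comment_counts` holds one comment count per user; `top_fraction` is a
    value such as 0.01 for the top 1%.
    """
    counts = sorted(comment_counts, reverse=True)  # most active users first
    total = sum(counts)
    if not counts or total == 0:
        return 0.0
    # Round the group size up so the "top" group is never empty (assumption).
    k = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:k]) / total
```

For example, with the toy counts [50, 20, 10, 5, 5, 4, 3, 1, 1, 1], top_share(counts, 0.1) looks at the single most active user, who posted 50 of the 100 comments.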



Similarly, the number of users (commenters) per subreddit is also very skewed:
  • 8 subreddits have 1000000 or more users
  • 149 subreddits have 100000 or more users
  • 1723 subreddits have 10000 or more users
  • 8588 subreddits have 1000 or more users
  • 304492 subreddits have fewer than 1000 users
Again, remember this is a user comment dataset, and "users" includes only those who posted at least 1 comment from Jan 2015 to Feb 2017.
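
The subreddit size counts above boil down to counting distinct commenters per subreddit and then counting how many subreddits reach each threshold. Here is a small sketch of that idea, assuming the comments are available as (subreddit, author) pairs; both function names are invented for illustration.

```python
from collections import defaultdict


def commenters_per_subreddit(comments):
    """Map each subreddit to its number of distinct comment authors.

    `comments` is an iterable of (subreddit, author) pairs, one per comment.
    """
    authors = defaultdict(set)
    for subreddit, author in comments:
        authors[subreddit].add(author)
    return {sub: len(names) for sub, names in authors.items()}


def count_at_least(user_counts, thresholds):
    """For each threshold, count the subreddits with at least that many users."""
    return {
        t: sum(1 for n in user_counts.values() if n >= t)
        for t in thresholds
    }
```

Because the thresholds are cumulative, each count includes the ones before it. As a sanity check on the list above, the 8588 subreddits with 1000 or more users plus the 304492 with fewer add up to the 313080 subreddits in the dataset.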