Friday, July 21, 2017

Looking at Reddit Comment Data, Part 2 (Trump, Bernie, Hillary, & Harry Potter)

(Note: This post was written using reddit comment data from Jan 2015 up to June 2017.)

To see the most popular subreddits in terms of comments, I calculated the following values for all subreddits:
  • comments: number of comments in the subreddit
  • c_rank: subreddit rank by number of comments
  • users: number of commenters, or users who posted at least 1 comment
  • u_rank: subreddit rank by number of commenters
  • avg cmts: average number of comments per commenter
Again, AutoModerator and '[deleted]' users were not counted.
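
The five values above can be sketched in a few lines of Python. This is only an illustration, not the query actually used for the post: it assumes the comments are available as (subreddit, author) pairs, and the function name `subreddit_stats` and the key `avg_cmts` are made up for the example.

```python
from collections import defaultdict

# Accounts excluded from the analysis, as described in the post.
EXCLUDED_AUTHORS = {"AutoModerator", "[deleted]"}

def subreddit_stats(comments):
    """comments: iterable of (subreddit, author) pairs, one pair per comment.

    Hypothetical helper; the input layout is an assumption.
    """
    counts = defaultdict(int)    # subreddit -> number of comments
    authors = defaultdict(set)   # subreddit -> distinct commenters
    for subreddit, author in comments:
        if author in EXCLUDED_AUTHORS:
            continue
        counts[subreddit] += 1
        authors[subreddit].add(author)

    stats = {
        s: {
            "comments": counts[s],
            "users": len(authors[s]),
            "avg_cmts": counts[s] / len(authors[s]),
        }
        for s in counts
    }

    # Rank 1 is the largest value. Ties keep first-seen order (stable sort),
    # which is an arbitrary choice for this sketch.
    for key, rank_key in (("comments", "c_rank"), ("users", "u_rank")):
        ordered = sorted(stats, key=lambda s: stats[s][key], reverse=True)
        for rank, s in enumerate(ordered, start=1):
            stats[s][rank_key] = rank
    return stats
```

On the full dataset this kind of aggregation would more likely be done in the query engine itself; the sketch just makes the definitions concrete.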

The following table shows the subreddits that are in the top 20 by either the number of comments or the number of users (who posted at least 1 comment).

subreddit comments c_rank users u_rank avg cmts

No real surprise here: AskReddit, politics, gaming, sports, news, and Donald Trump.

The Donald, Bernie Sanders, Hillary Clinton, and Harry Potter

From the data, we can see the popularity, or at least the celebrity, of Donald Trump. /r/The_Donald is the number one subreddit focused on an individual, by either the number of comments or the number of users. The next individual-focused subreddit is for Bernie Sanders, and Hillary Clinton is way behind. I also noticed the Harry Potter subreddit while looking at the data.

subreddit comments c_rank users u_rank avg cmts

The fact that /r/The_Donald is the top subreddit focused on an individual is also confirmed by redditlist, which ranks subreddits by recent activity and subscribers.

Now, would Harry Potter win a presidential race against Donald Trump, Bernie Sanders, or Hillary Clinton? It's tough to predict. Potter is a great wizard, but from all indications, Trump is also a great wizard in his own right.

We know Potter has great name recognition and is popular with young voters, but he has yet to be attacked politically. All we have are some pro-Potter propaganda books by JK Rowling.

Wednesday, May 10, 2017

Looking at Reddit Comment Data, Part 1

Last month I came across this article, Dissecting Trump’s Most Rabid Online Following. Despite the obvious political leaning of the author, it's still an interesting article. The discussion on calculating similarity between subreddits sounded vaguely familiar and reminded me of calculating movie and user similarities for the Netflix Prize. So I decided to take a look at the raw data behind the article (and try out Google Bigtable + BigQuery at the same time). Previously I didn't know there was a large, publicly accessible dataset of reddit comments.

The original article used data from Jan 2015 to Dec 2016. At the time I looked at the dataset, it had data up to Feb 2017, so I used data from Jan 2015 to Feb 2017, which has almost 1.7 billion comments. After a quick look, I decided to ignore the comments from two special "user" accounts:
  • deleted users, with 7% of all comments
  • AutoModerator, a "moderation bot", with 0.7% of all comments
There are many other bots, but fortunately they seemed very small relative to AutoModerator. I tallied all users whose names start with "auto" or end with "bot" and who have at least 10000 comments; together they account for only 0.34% of all comments. So, due to time constraints, I didn't go on a bot-hunting spree. After removing the two special "users" from the analysis, the dataset had:
  • 313080 subreddits
  • 14718265 users (who posted at least 1 comment)
Note that the "users" here are those who posted at least 1 comment from Jan 2015 to Feb 2017; users who never posted a comment in that time frame are not in the dataset.
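
The bot tally described above (names starting with "auto" or ending with "bot", with at least 10000 comments) could look roughly like the following. This is a sketch under assumptions: the per-user comment counts are taken to be in a plain dict, the name match is made case-insensitive (the post doesn't say either way), and `bot_like_share` is a made-up name.

```python
def bot_like_share(comment_counts, min_comments=10000):
    """Fraction of all comments posted by heavy, bot-looking accounts.

    comment_counts: dict mapping user name -> number of comments.
    Which accounts are included in the dict (and so in the total) is up
    to the caller. Case-insensitive matching is an assumption.
    """
    total = sum(comment_counts.values())
    if total == 0:
        return 0.0
    bot_total = sum(
        n
        for name, n in comment_counts.items()
        if n >= min_comments
        and (name.lower().startswith("auto") or name.lower().endswith("bot"))
    )
    return bot_total / total
```

A name-based filter like this is crude: it will miss bots with ordinary names and may catch a few humans, which is in keeping with the quick estimate it is meant to be.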

Like most social web sites, a small number of users are responsible for most of the content:
  • top 0.01% of users posted 3% of all comments
  • top 0.1% of users posted 12% of all comments
  • top 1% of users posted 41% of all comments
  • top 10% of users posted 85% of all comments
Since the users who never posted a comment are not in the dataset, the percentages in terms of all reddit users are even more skewed.
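
For reference, a "top X% of users posted Y% of comments" figure can be computed along these lines. It is a minimal sketch assuming a list of per-user comment counts; the name `top_share` is made up, and rounding the number of top users up to a whole user is a choice made for the example, not something stated in the post.

```python
import math

def top_share(comment_counts, fraction):
    """Share of all comments posted by the top `fraction` of users.

    comment_counts: list of per-user comment counts (hypothetical input).
    fraction: e.g. 0.01 for the top 1% of users.
    """
    counts = sorted(comment_counts, reverse=True)
    if not counts:
        return 0.0
    # Always include at least one user; round partial users up (assumption).
    k = max(1, math.ceil(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)
```

For example, with six users who posted 50, 20, 10, 10, 5 and 5 comments, the top half of the users (three of them) account for 80 of the 100 comments.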

Similarly, the number of users/commenters in all subreddits is also very skewed:
  • 8 subreddits have 1000000 or more users
  • 149 subreddits have 100000 or more users
  • 1723 subreddits have 10000 or more users
  • 8588 subreddits have 1000 or more users
  • 304492 subreddits have fewer than 1000 users
Again, remember this is a user comment dataset, and "users" includes only those who posted at least 1 comment from Jan 2015 to Feb 2017.
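
The threshold counts above are cumulative ("or more"), so the last two bullets together cover every subreddit: 8588 + 304492 = 313080, the total given earlier. A small sketch of the counting, assuming a dict of commenter counts per subreddit (the function name is made up):

```python
def subreddits_at_or_above(user_counts, thresholds=(1_000_000, 100_000, 10_000, 1_000)):
    """Count subreddits with at least each threshold number of commenters.

    user_counts: dict mapping subreddit -> number of commenters
    (hypothetical input layout).
    """
    return {
        t: sum(1 for n in user_counts.values() if n >= t)
        for t in thresholds
    }
```

The number of subreddits below the smallest threshold is then just the total number of subreddits minus the count for that threshold.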