The original article used data from Jan 2015 to Dec 2016. When I looked at the dataset, it extended to Feb 2017, so I used Jan 2015 to Feb 2017, a span with almost 1.7 billion comments. After a quick look, I decided to ignore the comments from two special "user" accounts:
- deleted users, with 7% of all comments
- AutoModerator, a "moderation bot", with 0.7% of all comments
There are many other bots, but fortunately they seem very small relative to AutoModerator. I tallied all users whose names start with "auto" or end with "bot" and who have at least 10000 comments; together they account for only 0.34% of all comments. So, due to time constraints, I didn't go on a bot-hunting spree. After removing the two special "users" from the analysis, the dataset had:
- 313080 subreddits
- 14718265 users (who posted at least 1 comment)
- 1566721864 comments
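The bot tally described above can be sketched roughly as follows. This is an illustrative Python version under stated assumptions, not the actual query used: the function names `looks_like_bot` and `suspected_bot_share` and the toy counts are hypothetical, the name match is assumed to be case-insensitive, and AutoModerator is assumed to be left out of the tally because it is counted separately.

```python
from collections import Counter


def looks_like_bot(name: str) -> bool:
    """Name heuristic from the text: starts with "auto" or ends with "bot"."""
    lowered = name.lower()  # assumption: the match ignores case
    return lowered.startswith("auto") or lowered.endswith("bot")


def suspected_bot_share(comment_counts, min_comments=10_000,
                        exclude=("AutoModerator",)):
    """Fraction of all comments written by suspected bots.

    comment_counts maps a user name to that user's number of comments.
    Accounts listed in `exclude` are skipped (assumption: AutoModerator
    is tallied separately), but they still count toward the total.
    """
    total = sum(comment_counts.values())
    if total == 0:
        return 0.0
    bot_total = sum(
        n for name, n in comment_counts.items()
        if name not in exclude and n >= min_comments and looks_like_bot(name)
    )
    return bot_total / total


# Made-up toy counts, not figures from the dataset.
toy_counts = Counter({
    "alice": 500,
    "autotldr": 20_000,   # matches the heuristic and the comment threshold
    "bob": 79_000,        # very active, but the name does not match
    "tinybot": 500,       # name matches, but too few comments
})
share = suspected_bot_share(toy_counts)
print(f"suspected bots: {share:.2%} of all comments")
```

A name heuristic like this misses bots with ordinary-looking names and can flag humans, so it only gives a rough sense of scale.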
As on most social websites, a small number of users is responsible for most of the content:
- top 0.01% of users posted 3% of all comments
- top 0.1% of users posted 12% of all comments
- top 1% of users posted 41% of all comments
- top 10% of users posted 85% of all comments
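For readers who want to reproduce this kind of breakdown, here is a minimal sketch of how a "top X% of users posted Y% of comments" figure can be computed from per-user comment counts. The function name `top_share`, the toy numbers, and the choice to round the cutoff up to a whole user are assumptions made for illustration.

```python
import math


def top_share(comment_counts, top_fraction):
    """Share of all comments posted by the most active `top_fraction` of users.

    comment_counts is a list with one comment count per user.
    """
    counts = sorted(comment_counts, reverse=True)
    total = sum(counts)
    if not counts or total == 0:
        return 0.0
    # Assumption: round the cutoff up so at least one user is always included.
    k = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:k]) / total


# Made-up toy data: 10 users with 100 comments in total.
toy = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
for fraction in (0.1, 0.5):
    print(f"top {fraction:.0%} of users posted {top_share(toy, fraction):.0%}")
```

On the real dataset the same calculation would run over the roughly 14.7 million per-user counts, most likely inside the query engine rather than in memory.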
Similarly, the number of users (commenters) per subreddit is also very skewed:
- 8 subreddits have 1000000 or more users
- 149 subreddits have 100000 or more users
- 1723 subreddits have 10000 or more users
- 8588 subreddits have 1000 or more users
- 304492 subreddits have fewer than 1000 users
Again, remember this is a user comment dataset, and "users" includes only those who posted at least 1 comment from Jan 2015 to Feb 2017.
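The subreddit size buckets above are cumulative (the subreddits with 1000 or more users include those with 10000 or more, and 8588 + 304492 gives the 313080 total). A minimal sketch of that count, assuming the comments are available as (subreddit, author) pairs, could look like this. The function names and toy data are hypothetical.

```python
from collections import defaultdict


def commenters_per_subreddit(comments):
    """Count distinct comment authors in each subreddit.

    comments is an iterable of (subreddit, author) pairs, one per comment.
    """
    authors = defaultdict(set)
    for subreddit, author in comments:
        authors[subreddit].add(author)
    return {sub: len(names) for sub, names in authors.items()}


def count_at_or_above(sizes, thresholds):
    """For each threshold, count subreddits with at least that many users."""
    return {t: sum(1 for s in sizes.values() if s >= t) for t in thresholds}


# Made-up toy comments, not data from the dataset.
toy_comments = [
    ("pics", "alice"), ("pics", "bob"), ("pics", "alice"), ("pics", "carol"),
    ("aww", "alice"), ("aww", "dave"),
    ("tiny", "erin"),
]
sizes = commenters_per_subreddit(toy_comments)
buckets = count_at_or_above(sizes, thresholds=(3, 2))
below = len(sizes) - buckets[2]  # subreddits under the smallest threshold
print(sizes, buckets, below)
```

Because authors are collected in a set, a user who comments many times in one subreddit is still counted once there.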