Friday, July 21, 2017

Looking at Reddit Comment Data, Part 2 (Trump, Bernie, Hillary, & Harry Potter)

(Note: This post was written using reddit comment data from Jan 2015 up to June 2017.)

To see the most popular subreddits in terms of comments, I calculated the following values for all subreddits:
  • comments: number of comments in the subreddit
  • c_rank: subreddit rank by number of comments
  • users: number of commenters, or users who posted at least 1 comment
  • u_rank: subreddit rank by number of commenters
  • avg cmts: average number of comments per commenter
Again, AutoModerator and '[deleted]' users were not counted.
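As a rough illustration of how these five values could be computed, here is a minimal Python sketch. It assumes the comments are available as (author, subreddit) pairs; the function name `subreddit_stats`, the tie-breaking rule, and the sample rows are all made up for the example and are not taken from the actual analysis, which ran against the full dataset.

```python
from collections import Counter, defaultdict

# Pseudo-users excluded from the counts, as described in the post.
EXCLUDED_AUTHORS = {"AutoModerator", "[deleted]"}


def subreddit_stats(comments):
    """Compute per-subreddit stats from an iterable of (author, subreddit) pairs.

    Returns a dict mapping subreddit -> dict with the keys
    comments, c_rank, users, u_rank, avg_cmts.
    """
    comment_counts = Counter()
    commenters = defaultdict(set)
    for author, subreddit in comments:
        if author in EXCLUDED_AUTHORS:
            continue
        comment_counts[subreddit] += 1
        commenters[subreddit].add(author)

    user_counts = {sub: len(users) for sub, users in commenters.items()}

    def ranks(counts):
        # Rank 1 = largest count; ties are broken by name (an arbitrary choice).
        ordered = sorted(counts, key=lambda sub: (-counts[sub], sub))
        return {sub: position for position, sub in enumerate(ordered, start=1)}

    c_rank = ranks(comment_counts)
    u_rank = ranks(user_counts)
    return {
        sub: {
            "comments": comment_counts[sub],
            "c_rank": c_rank[sub],
            "users": user_counts[sub],
            "u_rank": u_rank[sub],
            "avg_cmts": comment_counts[sub] / user_counts[sub],
        }
        for sub in comment_counts
    }


# Made-up sample rows, only to exercise the function.
sample = [
    ("alice", "AskReddit"), ("alice", "AskReddit"), ("bob", "AskReddit"),
    ("carol", "politics"), ("dave", "politics"), ("erin", "politics"),
    ("AutoModerator", "politics"), ("[deleted]", "AskReddit"),
]
stats = subreddit_stats(sample)
print(stats["AskReddit"])
```

Keeping a set of commenters per subreddit is fine for a toy example; on the real dataset the same aggregation would more plausibly be done inside the query engine.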

The following table shows the subreddits that are in the top 20 by either the number of comments or the number of users (who posted at least 1 comment). You can click on column headers to sort the table by column.

subreddit comments c_rank users u_rank avg cmts

No real surprise here. Ask reddit, politics, gaming, sports, news, and Donald Trump.

The Donald, Bernie Sanders, Hillary Clinton, and Harry Potter

From the data, we can see the popularity, or at least the celebrity, of Donald Trump. /r/The_Donald is the number one subreddit focused on an individual, by either the number of comments or the number of users. The next individual-focused subreddit is for Bernie Sanders, and Hillary Clinton is way behind. I also noticed the Harry Potter subreddit while looking at the data.

subreddit comments c_rank users u_rank avg cmts

The fact that /r/The_Donald is the top subreddit focused on an individual is also confirmed by redditlist, which ranks subreddits by recent activity and subscribers.

Now, would Harry Potter win a presidential race against Donald Trump, Bernie Sanders, or Hillary Clinton? It's tough to predict. Potter is a great wizard, but from all indications, Trump is also a great wizard in his own right.

We know Potter has great name recognition and is popular with young voters, but he has yet to be attacked politically. All we have are some pro-Potter propaganda books by JK Rowling.

Wednesday, May 10, 2017

Looking At Reddit Comment Data, part 1

Last month I came across this article, Dissecting Trump’s Most Rabid Online Following. Despite the obvious political leaning of the author, it's still an interesting article. The discussion on calculating similarity between subreddits sounded vaguely familiar, and reminded me of calculating movie and user similarities for the Netflix Prize. So I decided to take a look at the raw data behind the article (and try out Google Bigtable + BigQuery at the same time). Previously I didn't know there was a large, publicly accessible dataset of reddit comments.

The original article used data from Jan 2015 to Dec 2016. At the time I looked at the dataset, it had data up to Feb 2017, so I used data from Jan 2015 to Feb 2017, which has almost 1.7 billion comments. After a quick look, I decided to ignore the comments from two special "user" accounts:
  • deleted users, with 7% of all comments
  • AutoModerator, a "moderation bot", with 0.7% of all comments
There are many other bots, but fortunately they seemed very small relative to AutoModerator. I tallied all users whose names start with "auto" or end with "bot", and who have at least 10000 comments, and together they account for only 0.34% of all comments. So due to time constraints, I didn't go on a bot-hunting spree. After removing the two special "users" from the analysis, the dataset had:
  • 313080 subreddits
  • 14718265 users (who posted at least 1 comment)
Notice the "users" here are those who posted at least 1 comment from Jan 2015 to Feb 2017, and users who never posted a comment in that time frame are not in the dataset.
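The bot tally described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is a mapping from author name to comment count, the name match is case-insensitive (the post does not say either way), and the account names and counts in the toy data are invented.

```python
# The two special accounts that the post excludes from the analysis.
SPECIAL_AUTHORS = {"[deleted]", "AutoModerator"}


def looks_like_bot(name):
    """Name heuristic from the post: starts with "auto" or ends with "bot"."""
    lowered = name.lower()  # case-insensitive matching is an assumption
    return lowered.startswith("auto") or lowered.endswith("bot")


def bot_comment_share(comments_per_author, min_comments=10000):
    """Share of all comments posted by bot-like accounts.

    The two special accounts are skipped, and only accounts with at least
    `min_comments` comments are tallied.
    """
    total = sum(comments_per_author.values())
    bot_total = sum(
        count
        for author, count in comments_per_author.items()
        if author not in SPECIAL_AUTHORS
        and count >= min_comments
        and looks_like_bot(author)
    )
    return bot_total / total if total else 0.0


# Invented counts that sum to 100, with a tiny threshold so the toy data
# exercises every branch of the filter.
toy_counts = {
    "[deleted]": 70,
    "AutoModerator": 10,
    "auto_example": 5,
    "example_bot": 3,
    "small_bot": 1,
    "human_user": 11,
}
share = bot_comment_share(toy_counts, min_comments=2)
print(f"bot-like share of comments: {share:.2%}")
```

A name-based heuristic like this will miss bots with ordinary-looking names and may flag humans whose names happen to match, which is consistent with treating the result as a rough upper-level check rather than a real bot census.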

As on most social web sites, a small number of users is responsible for most of the content:
  • top 0.01% of users posted 3% of all comments
  • top 0.1% of users posted 12% of all comments
  • top 1% of users posted 41% of all comments
  • top 10% of users posted 85% of all comments
Since the users who never posted a comment are not in the dataset, the percentages in terms of all reddit users are even more skewed.
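Concentration figures like the ones above can be computed by sorting users by comment count and summing the top slice. The sketch below assumes a plain list of per-user comment counts; the function name `top_share`, the rounding rule, and the toy numbers are illustrative choices, not the method actually used for the percentages in this post.

```python
import math


def top_share(comment_counts, fraction):
    """Fraction of all comments posted by the top `fraction` of users."""
    if not comment_counts:
        raise ValueError("need at least one user")
    ordered = sorted(comment_counts, reverse=True)
    # Round up so that even a tiny fraction includes at least one user.
    top_n = max(1, math.ceil(len(ordered) * fraction))
    return sum(ordered[:top_n]) / sum(ordered)


# Toy data: 10 users and 100 comments, with one very heavy commenter.
counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
for fraction in (0.1, 0.5):
    print(f"top {fraction:.0%} of users: {top_share(counts, fraction):.0%} of comments")
```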

Similarly, the number of users/commenters in all subreddits is also very skewed:
  • 8 subreddits have 1000000 or more users
  • 149 subreddits have 100000 or more users
  • 1723 subreddits have 10000 or more users
  • 8588 subreddits have 1000 or more users
  • 304492 subreddits have fewer than 1000 users
Again, remember this is a user comment dataset, and "users" includes only those who posted at least 1 comment from Jan 2015 to Feb 2017.
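The threshold counts above are cumulative ("N or more users"), so each bucket includes the ones before it. A minimal sketch of that tally, assuming a list with one commenter count per subreddit (the helper name and the toy values are made up):

```python
from bisect import bisect_left


def subreddits_at_or_above(user_counts, thresholds):
    """For each threshold, count the subreddits with at least that many users."""
    ordered = sorted(user_counts)
    # bisect_left finds the first position whose value is >= the threshold,
    # so everything from that position onward meets the threshold.
    return {t: len(ordered) - bisect_left(ordered, t) for t in thresholds}


# Toy per-subreddit commenter counts.
toy_user_counts = [5, 999, 1000, 1500, 10000, 250000, 1000000]
buckets = subreddits_at_or_above(toy_user_counts, [1000, 10000, 100000, 1000000])
below_1000 = len(toy_user_counts) - buckets[1000]
print(buckets, below_1000)

# The same subtraction ties the post's own figures together:
# 313080 subreddits in total, 8588 of them with 1000 or more users.
assert 313080 - 8588 == 304492
```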

Sunday, May 12, 2013

Bitten By A Dog

I was jogging in a local park minding my own business, when this dog just came up to me and bit me in the leg. The dog owner was very apologetic: "I'm so sorry, she usually doesn't do this, but she's in heat now." So I guess I was sexually assaulted by a bitch.

The wound was minor, but I was worried about rabies and went to see my doctor, who gave me a shot for tetanus, and assured me rabies is not a concern in domestic dogs today. No excuse for foaming at the mouth yet.

Wednesday, September 28, 2011

Heritage Health Prize Round 1 Results

The Heritage Health Prize round 1 milestone results were released. I came in 3rd in the final private rankings, even though I was only 5th on the public leaderboard. So I guess I didn't overfit too badly.

I wonder by how much I missed winning the lottery this time.

Thursday, September 24, 2009

The Netflix Prize Announcement

I went to the Netflix Prize announcement in New York on Sept 21. It was great to finally meet some of my team members, as well as our arch-nemesis BellKor's Pragmatic Chaos. I also did some sightseeing in NYC so it was a pretty good trip.

The prize announcement was pretty well covered by the media. Reporters from the New York Times, Forbes, AP, Business Week, the Wall Street Journal, etc. were there. Just do a search and you'll find plenty of press coverage of the event.

From talking to my team members, it's not surprising that we all work in IT-related fields, but I'm still impressed by the breadth of experience in the Ensemble team, from health care to the entertainment industry to unmanned weapon systems. That's right, from saving your life to killing you and making a movie about it, we got you covered.

BellKor's Pragmatic Chaos revealed one surprising stroke of luck that explained some of the behind-the-scenes drama we saw in the last 48 hours of the contest.

PS: I accidentally got quoted in this Business Week article.

Saturday, July 4, 2009

We Are The Borg

We are the borg. Your algorithm will be assimilated. Resistance is irrelevant.

We're also known as: xlvector, OfADifferentKind, & Newman (me).

Update: to clean up the leaderboard, we have voluntarily removed "We are the Borg" and other sub-teams of "Vandelay Industries !". The pre-Vandelay Industries team "Newman, George, and Peterman !" is gone too.

Wednesday, July 1, 2009

The End Is Near

The end of the Netflix Prize is near: BPC has reached 10%.

Now, how do I apply what I've learned while participating in this contest? Maybe I should get into trying to predict the stock market or commodity prices, or something. People are already doing that on the stock market; it's called algorithmic trading.

Of course, recommender systems and data mining can be applied to all kinds of products and services, not just movies. For example, truck dealers can try to find the people who are most likely to buy trucks and send them truck advertisements; online matchmaking services can match people based on user data and the history of matches.

Of course, for most products and services there are obvious "indicators" that we all know. For example, males in the construction industry are probably far more likely to buy trucks than people in general. But still, there might be some subtle indicators and trends that can only be discovered by a computer algorithm, which might improve the accuracy of recommendations by 10%, as measured by RMSE. And as BellKor's research shows, even a few percent of RMSE improvement can translate into a huge increase in the quality of the recommendations you get. You'll actually like the products recommended to you, or the people.
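For readers unfamiliar with the metric: RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings, and a "10% improvement" means the new RMSE is 10% lower than a baseline's. The Python sketch below shows the arithmetic only; the ratings and predictions are invented and have nothing to do with the Netflix Prize data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, new_rmse):
    """Fractional reduction in RMSE relative to the baseline (0.10 means 10%)."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Invented ratings on a 1-5 scale.
actual = [4, 3, 5, 2]
baseline_predictions = [3, 3, 3, 3]   # always predict the same rating
better_predictions = [4, 3, 4, 3]     # a hypothetical improved model

baseline = rmse(baseline_predictions, actual)
better = rmse(better_predictions, actual)
print(f"baseline RMSE {baseline:.4f}, new RMSE {better:.4f}, "
      f"improvement {relative_improvement(baseline, better):.1%}")
```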

How Can It Work For Matchmaking?

Well, now that I think about it, it probably won't work for matchmaking services. Why? Because for matchmaking services, it will take a long time to truly know the quality of the recommendations.

Sure, you can find out about the bad ones pretty soon: people go on one date and can't stand each other. But how do you know which recommendations are really good, and which ones only look good now but will end up in a messy divorce 10 years later? I think for matchmaking services, you'll have to wait for a generation to tell which recommendations are truly good. But by that time, society and people in general will have changed a lot, so whatever worked 20 years ago probably won't work so well today.

So for matchmaking services, the value of recommender systems is probably in filtering out potentially incompatible dates.