Wednesday, May 10, 2017

Looking At Reddit Comment Data, part 1

Last month I came across this article, Dissecting Trump’s Most Rabid Online Following. Despite obvious political leaning of the author, it's still an interesting article. The discussion on calculating similarity between subreddits all sounded vaguely familiar, and reminds me of calculating movie and user similarities for the Netflix prize. So I decided to take a look at the raw data of the article (and try out Goggle bigtable + bigquery at the same time). Previously I didn't know there's a large, publicly accessible dataset of reddit comments.

The original article used data from Jan 2015 to Dec 2016. At the time I looked at the dataset, it had data up to Feb 2017, so I used data from Jan 2015 to Feb 2017, which has almost 1.7 billion comments. After a quick look, I decided to ignore the comments from two special "users" accounts:
  • deleted users, with 7% of all comments
  • AutoModerator, a "moderation bot", with 0.7% of all comments
There're many other bots, but fortunately they seemed very small relative to AutoModerator. I tallied all users whose name starts with "auto" or ends with "bot", and who have at least 10000 comments, and together they have only 0.34% of all comments. So due to time constraints, I didn't go on a bot-hunting spree. After removing the two special "users" from analysis, the dataset had:
  • 313080 subreddits
  • 14718265 users (who posted at least 1 comment)
  • 1566721864 comments
Notice the "users" here are those who posted at least 1 comment from Jan 2015 to Feb 2017, and users who never posted a comment in that time frame are not in the dataset.

Like most social web sites, a small numbers of users are responsible for most content:
  • top 0.01% of users posted 3% of all comments
  • top 0.1% of users posted 12% of all comments
  • top 1% of users posted 41% of all comments
  • top 10% of users posted 85% of all comments
Since the users who never posted a comment are not in the dataset, the percentages in terms of all reddit users are even more skewed.



Similarly, the number of users/commenters in all subreddits is also very skewed:
  • 8 subreddits have 1000000 or more users
  • 149 subreddits have 100000 or more users
  • 1723 subreddits have 10000 or more users
  • 8588 subreddits have 1000 or more users
  • 304492 subreddits have less than 1000 users
Again, remember this is a user comment dataset, and "users" includes only those who posted at least 1 comment from Jan 2015 to Feb 2017.

Sunday, May 12, 2013

Bitten By A Dog

I was jogging in a local park minding my own business, when this dog just came up to me and bit me in the leg. The dog owner was very apologetic: "I'm so sorry, she usually doesn't do this, but she's in heat now." So I guess I was sexually assaulted by a bitch.

The wound was minor, but I was worried about rabies and went to see my doctor, who gave me a shot for tetanus, and assured me rabies is not a concern in domestic dogs today. No excuse for foaming at the mouth yet.

Wednesday, September 28, 2011

Heritage Health Prize Round 1 Results

The Heritage Health Prize round 1 milestone results were released. I came in 3rd on the final private rankings, even though I was only the 5th on public leaderboard. So I guess I didn't overfit too badly.

I wonder by how much I missed winning the lottery this time.

Thursday, September 24, 2009

The Netflix Prize Announcement

I went to the Netflix Prize announcement in New York on Sept 21. It was great to finally meet some of my team members, as well as our arch-nemesis BellKor's Pragmatic Chaos. I also did some sightseeing in NYC so it was a pretty good trip.

The prize announcement was pretty well-covered by the media. Reporters from New York Times, Forbes, AP, Business Week, Wall Street Journal, etc were there. Just do a search and you'll find plenty of press coverage of the event.

From talking to my team members, it's not surprising that we all work in IT-related fields, but I'm still impressed by the breadth of experience in the Ensemble team, from health care to entertainment industry to unmanned weapon systems. That's right, from saving your life to killing you and making a movie about it, we got you covered.

BellKor's Pragmatic Chaos revealed one surprising chance-of-luck happening that explained some of the behind-the-scene drama we saw in the last 48 hours of the contest.

PS: I accidentally got quoted in this Business Week article.

Saturday, July 4, 2009

We Are The Borg

We are the borg. Your algorithm will be assimilated. Resistance is irrelevant.

We're also known as: xlvector, OfADifferentKind, & Newman(me).

Update: to clean up the leaderboard, we have voluntarily removed "We are the Borg" and other sub-teams of "Vandelay Industries !". The pre-Vandelay Industries team "Newman, George, and Peterman !" is gone too.

Wednesday, July 1, 2009

The End Is Near

The end Netflix Prize is near: BPC has reached 10%.

Now, how do I apply what I've learning during the participation of this contest ? Maybe I should get into trying to predict the stock market or commodity prices, or something. People are already doing that on the stock market, it's called algorithmic trading.

Of course recommender systems and data-mining can be applied to all kinds of products and services, not just movies. For example truck dealers can try to find people who are most likely to buy trucks and send them truck advertisement; online matchmaking services can match people based on user data and history of matches.

Of course, for most products and service there are obvious "indicators" that we all know. For example, males in construction industry are probably far more likely to buy trucks than people in general. But still, there might be some subtle indicators and trends that can only be discovered by a computer algorithm, which might improve the accuracy of recommendations by 10%, as measured by RMSE. And as BellKor's research shows, even a few percentages of RMSE improvement can translate into huge increase in the quality of the recommendations you get. You'll actually like the products recommended to you, or the people.

How Can It Work For Matchmaking ?

Well, now I think about it, it probably won't work for matchmaking services. Why ? Because for matchmaking services, it will take a long time to truly know the quality of the recommendations.

Sure, you can find out about bad ones pretty soon: people go on one date and can't stand each other. But how do you know which recommendations are really good, and which ones only look good now but will end up in messy divorce 10 year later ? I think for matchmaking services, you'll have to wait for a generation to tell which recommendations are truly good. But by that time, society and people in general would have chance a lot, so whatever worked 20 years ago probably won't work so well today.

So for matchmaking services, the value of recommender systems is probably to filter out potential incompatible dates.

Thursday, June 11, 2009

Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours)

KNN is one of the most popular CF algorithms. Central to KNN is how to define the neighborhood relationship, or the distance between two objects. For this purpose, Pearson Correlation is a pretty good measure and is used frequently (see my previous post for my basic KNN approach). In the Netflix dataset, there are 17770 movies, so we need to calculate 17770*17770, or about 316 million Pearson correlations. Not a trivial task.

In this post, I'll describe the optimizations I used for the calculation of Pearson Correlation, which cut my running time from 2.5 hours to less than 2 minutes. It won't help your RMSE directly, but it may help indirectly by allowing you to explore the KNN parameter space faster. And although I used Pearson Correlation, the methods described in this post can be applied to many other neighborhood measures too.

Since this is about optimization, system spec is important. The numbers reported here were generated on a PC with Intel Core2 Quad CPU (4 cores) @ 2.4G Hz, 2 GB of RAM, running Window XP. Most code was written in C++, built by Visual Studio 2008. (TODO: get memory spec)




The Matrix: Halved

Before discussing any optimization, I'll discuss an optimization that's NOT taken. Look at Figure 1. It shows the correlation matrix of 5 movies. AB is the correlation of movie A and B, AC is the correlation of movie A and C, and so on.

Now, if you assume movie correlations are symmetrical, then AB=BA, AC=CA, BC=CB, and so on, so only less than half of the matrix needs to be calculated. Now let’s say I calculated the blue-shaded part of the matrix, and saved the data to a big file. To make predictions, I need to read the data back:

For movie A, I need to read AB to AE, and that’s 1 file-seek and 1 file-read operation.

For movie B, I need to read AB, and BC to BE, that’s 2 file-seek and 2 file-read operations.

For movie C, I need to read AC, BC, and CD and CE, that’s 3 file-seek and 3 file-read operations.

And so on…

In the Netflix data set, there are M=17770 movies, so by the time I get to the last movie, I’d need 17769 file seek and read operations per movie. All that file seeking will kill my performance and my hard disk.

The simple solution is to have enough memory to hold half the matrix, so I can load all the data from disk in one shot. But for each matrix entry, I need not only the correlation value, but also two average values, and a common viewer count, making it 16 bytes per entry. So for M=17770 movies, I need about 2.35 GB, but I needed this thing to run on my laptop, which had only 1.25 G when I started coding this. Even after I compacted it down to 5 bytes per entry, memory situation was still tight.

(Actually, I think you only need enough memory to hold one quarter of the matrix to eliminate the huge number of file seeks, but the in-memory indexing scheme gets pretty complicated. There might be clever ways to do this, but I'm not sure yet.)

And all that applies only if you assume movie correlations are symmetrical, and there're ways to calculate correlations where they're not symmetrical. So out of concerns for memory usage and simplicity, I decided to ignore this half-matrix optimization.

Now let's get back to the calculation of Pearson's r.



The Brute-Force Approach

To calculate Pearson's r of two movies X and Y, we need to find viewers who rated both X and Y, and calculate some intermediate values:

  struct PearsonIntermediate {
  //from all viewers who rated both movie X and Y.
    float x; //sum of ratings for movie X
    float y; //sum of ratings for movie Y
    float xy; //sum of product of ratings for movies X and Y
    float xx; //sum of square of ratings for movie X
    float yy; //sum of square of ratings for movie Y
    unsigned int cnt; //number of viewers who rated both movies
  };
and you can easily calculate Pearson's r from these intermediate values.

For Netflix data, float can be replaced by "unsigned int" when processing raw, integer ratings, which can only be 1, 2, 3, 4, or 5.

As mentioned on my previous blog, I have the rating data sorted by movie index followed by viewer index, so finding common viewers of two movies is simply finding common values in two sorted arrays. Very easy, and very slow. The inner loop of this approach looks this:


  while (not at the end of pViewerIndex0 and pViewerIndex1) {
    if ( *pViewerIndex0 < *pViewerIndex1) {         
        pViewerIndex0++;
    } else if ( *pViewerIndex0 > *pViewerIndex1) {
        pViewerIndex1++;
    } else {
        //a viewer who rated both movies
        //update PearsonIntermediate
    }
  }
The inner loop is conditional branch-heavy, and even worse, given that the viewer indices are randomly spaced, they are very unpredictable branches, probably completely defeating the branch prediction logic of the CPU.

I think the Big O Notation of this approach is roughly O(M*M*T/M)=O(M*T), where M is the total number of movies, and T is the total number of ratings. Basically, for each movie, you have to compare against all other movies (M*M), and the average length of comparisions is the average number of ratings per movie (T/M).

My implementation of this approach took 4.5 hours to run on my little laptop, and 2.5 hours on the faster desktop. And I suffered thru this 4.5 hours a few times, due to the inevitable bugs in the initial implementation.

You can definitely optimize the code somewhat, but once I got KNN working correctly, I moved onto other algorithms.



A Break-thru

So my KNN algorithm just sat there, for months. During this time, I worked on other CF algorithms, which is probably why one day, I was working on something totally unrelated to KNN, and I suddenly had one of those "Oh my God I'm an idiot !" moment.

I realized there's a much bette way of calculating correlations. You see, for each movie X, instead of comparing it to every other movie directly, I can go thru all the viewers V who rated movie X; and for each viewer V, go thru all the movies Y rated by that viewer; and for each movie Y, update the PearsonIntermediate of movie pair X and Y. At the end, I'll have PearsonIntermediate of movie X and all the movies in the dataset (including X itself).

In pseudo-code, to calculate Pearson correlations of movie X and all movies in the dataset, you declare:


  PearsonIntermediate array[MOVIE_CNT]; //MOVIE_CNT is 17770

  initialize array[] to zeros
  for each viewer v who rated movie X {
    for each movie Y rated by viewer v {
       update values of array[Y]
    }
  }
  go thru array[] to calculate all correlations for movie X
As a sanity check, you can verify the correlation from array[X] is 1 (or very close to 1, due to floating point inaccuracy), which is the correlation of movie X and itself.

The Big O Notation for this approach is O(M*(T/M)*(T/V))=O(T*T/V). The brute-force approach is O(M*T). Since T/V<<M, I should get a significant speed-up. Furthermore, the inner loop does not have the unpredictable branchings of the brute-force approach.

This is just the motivation I needed to implement KNN on residues of other algorithms. If the original algorithm is slow with raw ratings (1 to 5 integer), it will be even slower on residues (floating point). After I was done with residues, I retrofitted this approach to my raw-rating KNN, and the improvement was huge: calculation time dropped to about about 15 minutes, a 10x speed-up !

That's pretty good, but wait, there is more ...



Parallelization

At this point I implemented multi-threading to take advantage of multiple CPU cores. Movie correlation calculation is trivially parallelizable. The result is about 3.4x speed up on 4 cores, bringing the running time down to 265 seconds.

But wait, there is more ...



Simplifying the Inner Loop

I thought since raw ratings can be only 1, 2, 3, 4, or 5, I should be able to speed things up significantly by replacing the PearsonIntermediate structure with:
    unsigned int32 vCount[5][5];
where vCount[m][n] is the number of viewers who gave movie X a score of m+1 and movie Y a score of n+1. This simplifies the inner-loop to just incrementing an integer !

I quickly implemented this "clever" idea, however, the result was disappointing: the speed-up was a few percent in some quick tests. Hmm, what could be wrong ?

After some head-scratching, I guessed the problem was the size of memory used. Each PearsonIntermediate structure takes only 24 bytes, but "unsigned int32 vCount[5][5]" takes 100 bytes. So as the code does the calculations, more memory had to be swapped in and out of cache and this seriously harmed the performance.

Fortunately we can use the nature of the data itself to work around this problem. Of the 17770 movies in the data set, 33% have less than 256 ratings, 65% have between 256 and 65535 ratings, and only 2% have more than 65536 ratings. Of course, if a movie has less than 256 ratings, there is no way the it has 256 or more common viewers with another movie, so the common viewer counts of this movie can be stored using "unsigned char" instead of "unsigned int", or 1 byte instead of 4 bytes. For movies that have between 256 and 65535 ratings, we can used "unsigned short" (2 bytes).

This means I need 3 versions of the correlation calculation function, one each for unsigned int/short/char (or int32/int16/int8). Here C++'s template function feature comes to the rescue (in spite of the awkward syntax). I wrote one function that uses a type T to store the common viewer counts:

  T vCount[5][MOVIE_CNT][5]; //initialize to zero

  for each viewer V who rated movie X {
    m=the rating viewer V gave to movie X 
    for each movie Y rated by viewer V {
      n=the rating viewer V gave to movie Y
      vCount[m][Y][n]++;
    }
  }
and I had 3 "instantiations" (those jargons are as bad as the awkward syntax) of this function where T is unsigned int/short/char. Notice the vCount array is arranged as [5][MOVIE_CNT][5] instead of the more "straight-forward" form of [MOVIE_CNT][5][5]. My tests showed [5][MOVIE_CNT][5] gave the best performance, no doubt because it minimized cache misses.

With this change, the running time improved to 156 seconds.

But wait, there is more ...



Complicating the Inner Loop

The previous chanage simplified the inner loop, but most gains seem to come from reducing the demand on memory subsytem and cache miss. Can we do better ?

Consider a movie with 3000 ratings, the type of its vCount[][][] array is "unsigned short", or 2 bytes per element. But it's unlikely that the final value of any array element would be 256 or bigger, given 3000/25=120. Even for movies with 6400 or more ratings (6400/25=256), because ratings are not evenly distributed, most vCount[][][] elements are still less than 255.

So for most movies, one byte is enough for most vCount elements. But how do we use the 1-byte (unsigned char) version of vCount, and still be able to handle the occasional elements that go above 255 ? Well, use another array to count the overflows:

  unsigned char vCount[5][MOVIE_CNT][5]; //initialize to zero
  unsigned short overflows[5][MOVIE_CNT][5]; //initialize to zero

  for each viewer V who rated movie X {
    m=the rating viewer V gave to movie X 
    for each movie Y rated by viewer V {
      n=the rating viewer V gave to movie Y
      vCount[m][Y][n]++;
      if (0==vCount[m][Y][n]) overflows[m][Y][n]++;
    }
  }
The code above takes advantage of the fact that in 2's complement system, when an integer value overflows, it wraps around to zero.

At first I was very skeptical of this change, because it adds a branching to the inner-most loop. How could it ever improve performance with that ? However that branching is highly predictable, and everything I read says with modern CPU's branch prediction logic, the cost of predictable branches is almost nothing.

So with great skepticism, I tried the code above. To my relief it did run faster, bringing the running time down to 143 seconds.

But wait, there is more ....



Reordering the Movies

For any viewer V, the IDs of movies V has rated are pretty much randomly distributed. Imaging the array of movies, with red dots indicating the movies rated by viewer V:

and as we go thru the movies rated by viewer V, most red dots incur cache misses.

However, if we sort the movie IDs in increasing order of rating count of movies, then the distribution of red dots changes:

statistically, we have the same or even more cache misses at the beginning, but as we go into higher movie indices, the chances that the viewer rated closely spaced movies increases, thus reducing the cache misses as vCount[][][] is updated.

Notice the "sorting" of the movie IDs by movie rating count is not an O(N*log(N)) quicksort for 100 million ratings, but just an O(N) remapping.

This change reduced the running time to 116 seconds, less than 2 minutes.



Summary

Well, that's it. I've picked all the low-hanging fruits, at least those that I could see.

Here's the list of performance gains from my initial implementation:

*Algorithm change: 10x
*Parallelization: 3.4x (4 cores vs 1 core)
*Low-level tweaks: 2.3x

In total about 78x performance gain, without writing a single line of assembly language.

Of the 3 items above, the biggest performance gain vs ease of implementation is parallization, because KNN calculation can be parallized very easily. I believe more cores will help the performance more, but with diminishing returns.



The Memory Bandwidth Bottleneck

Memory bandwidth seems to be the performance bottleneck of KNN algorithms.

When I was coding the parallization step, I installed some software to monitor MSRs, or model-specific registers. MSRs are very useful for performance monitoring, but unfortunately they're very sparsely documented. The "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide" has some details, but nowhere as detailed as the individual CPU instructions.

One particular MSR I've looked at is DCU_MISS_OUTSTANDING. This webpage has the following to say about it:

DCU_MISS_OUTSTANDING gives a weighted measure of the cycles spent waiting for a cache line to be brought in. Each cycle spent waiting is weighted by the number of outstanding cache misses ... and has some caveats, but can be considered a rough estimate of the time spent waiting for a cache line to be brought in.

Anyway, when I first implemented multithreading, the code took 265 seconds to run, and I recorded about 400 billion CPU cycles spent on DCU_MISS_OUTSTANDING events. The final version took 116 seconds to run and I recorded about 100 billion CPU cycles. The difference is 149 seconds and 300 billion cycles. 300 billion cycles divided by 2.4G Hz (CPU clock speed) gives 125 seconds. So it looks like a lot of gains came from reducing the time spent waiting for data to be loaded into the cache.

I won't pretend I understand DCU_MISS_OUTSTANDING or many other MSRs completely, I just hope intel provide better documentation.