Thursday, June 11, 2009

Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours)

KNN is one of the most popular CF algorithms. Central to KNN is how to define the neighborhood relationship, or the distance between two objects. For this purpose, Pearson Correlation is a pretty good measure and is used frequently (see my previous post for my basic KNN approach). In the Netflix dataset, there are 17770 movies, so we need to calculate 17770*17770, or about 316 million Pearson correlations. Not a trivial task.

In this post, I'll describe the optimizations I used for the calculation of Pearson Correlation, which cut my running time from 2.5 hours to less than 2 minutes. It won't help your RMSE directly, but it may help indirectly by allowing you to explore the KNN parameter space faster. And although I used Pearson Correlation, the methods described in this post can be applied to many other neighborhood measures too.

Since this is about optimization, system specs matter. The numbers reported here were generated on a PC with an Intel Core2 Quad CPU (4 cores) @ 2.4 GHz and 2 GB of RAM, running Windows XP. Most code was written in C++, built with Visual Studio 2008.




The Matrix: Halved

Before discussing the optimizations I did make, I'll discuss one that's NOT taken. Look at Figure 1. It shows the correlation matrix of 5 movies: AB is the correlation of movies A and B, AC is the correlation of movies A and C, and so on.

If you assume movie correlations are symmetrical, then AB=BA, AC=CA, BC=CB, and so on, so less than half of the matrix needs to be calculated. Now let's say I calculated the blue-shaded part of the matrix and saved the data to a big file. To make predictions, I need to read the data back:

For movie A, I need to read AB to AE, and that’s 1 file-seek and 1 file-read operation.

For movie B, I need to read AB, and BC to BE, that’s 2 file-seek and 2 file-read operations.

For movie C, I need to read AC, BC, and CD and CE, that’s 3 file-seek and 3 file-read operations.

And so on…

In the Netflix data set, there are M=17770 movies, so by the time I get to the last movie, I'd need 17769 file-seek-and-read operations for that one movie alone. All that file seeking would kill my performance and my hard disk.

The simple solution is to have enough memory to hold half the matrix, so I can load all the data from disk in one shot. But for each matrix entry, I need not only the correlation value but also two average values and a common-viewer count, making it 16 bytes per entry. So for M=17770 movies, I need about 2.35 GB (17770²/2 ≈ 158 million entries at 16 bytes each). But I needed this thing to run on my laptop, which had only 1.25 GB when I started coding this. Even after I compacted it down to 5 bytes per entry, the memory situation was still tight.

(Actually, I think you only need enough memory to hold one quarter of the matrix to eliminate the huge number of file seeks, but the in-memory indexing scheme gets pretty complicated. There might be clever ways to do this, but I'm not sure yet.)

And all that applies only if you assume movie correlations are symmetrical, and there are ways to calculate correlations where they're not. So out of concern for memory usage and simplicity, I decided to skip this half-matrix optimization.

Now let's get back to the calculation of Pearson's r.



The Brute-Force Approach

To calculate Pearson's r of two movies X and Y, we need to find viewers who rated both X and Y, and calculate some intermediate values:

  struct PearsonIntermediate {
    //accumulated over all viewers who rated both movie X and Y:
    float x;  //sum of ratings for movie X
    float y;  //sum of ratings for movie Y
    float xy; //sum of products of ratings for movies X and Y
    float xx; //sum of squared ratings for movie X
    float yy; //sum of squared ratings for movie Y
    unsigned int cnt; //number of viewers who rated both movies
  };
and you can easily calculate Pearson's r from these intermediate values.

For Netflix data, float can be replaced by "unsigned int" when processing raw, integer ratings, which can only be 1, 2, 3, 4, or 5.
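For reference, here's a minimal sketch of that final step, using the standard computational formula for Pearson's r (the function name is mine, not from my actual code):

  #include <cmath>

  float pearsonR(const PearsonIntermediate& p) {
    float n = (float)p.cnt;
    float num = n * p.xy - p.x * p.y;
    float den = std::sqrt((n * p.xx - p.x * p.x) * (n * p.yy - p.y * p.y));
    return (den > 0.0f) ? num / den : 0.0f; //degenerate pairs get r=0
  }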

As mentioned in my previous post, I have the rating data sorted by movie index and then by viewer index, so finding the common viewers of two movies is simply finding common values in two sorted arrays. Very easy, and very slow. The inner loop of this approach looks like this:


  while (pViewerIndex0 and pViewerIndex1 are not at the end) {
    if (*pViewerIndex0 < *pViewerIndex1) {
      pViewerIndex0++;
    } else if (*pViewerIndex0 > *pViewerIndex1) {
      pViewerIndex1++;
    } else {
      //a viewer who rated both movies:
      //update PearsonIntermediate, then advance both pointers
      pViewerIndex0++;
      pViewerIndex1++;
    }
  }
The inner loop is heavy on conditional branches, and even worse, since the viewer indices are essentially randomly spaced, the branches are very unpredictable, probably completely defeating the CPU's branch prediction logic.

I think the big-O complexity of this approach is roughly O(M*M*(T/M)) = O(M*T), where M is the total number of movies and T is the total number of ratings. Basically, each movie has to be compared against every other movie (M*M pairs), and the average length of each comparison is the average number of ratings per movie (T/M).

My implementation of this approach took 4.5 hours to run on my little laptop, and 2.5 hours on the faster desktop. And I suffered thru that 4.5-hour run a few times, thanks to the inevitable bugs in the initial implementation.

You can definitely optimize this code somewhat, but once I got KNN working correctly, I moved on to other algorithms.



A Break-thru

So my KNN algorithm just sat there, for months. During this time, I worked on other CF algorithms, which is probably why, one day, while working on something totally unrelated to KNN, I suddenly had one of those "Oh my God I'm an idiot !" moments.

I realized there's a much better way of calculating correlations. You see, for each movie X, instead of comparing it to every other movie directly, I can go thru all the viewers V who rated movie X; for each viewer V, go thru all the movies Y rated by that viewer; and for each movie Y, update the PearsonIntermediate of the movie pair X and Y. At the end, I'll have the PearsonIntermediate of movie X paired with every movie in the dataset (including X itself).

In pseudo-code, to calculate Pearson correlations of movie X and all movies in the dataset, you declare:


  PearsonIntermediate array[MOVIE_CNT]; //MOVIE_CNT is 17770

  initialize array[] to zeros
  for each viewer v who rated movie X {
    for each movie Y rated by viewer v {
       update values of array[Y]
    }
  }
  go thru array[] to calculate all correlations for movie X
As a sanity check, you can verify the correlation from array[X] is 1 (or very close to 1, due to floating point inaccuracy), which is the correlation of movie X and itself.
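In concrete C++ terms, the loop might look like the sketch below. The data layout is assumed for illustration, not lifted from my actual code: a flat array of ratings sorted by movie, with per-movie offsets (movieStart, movieViewers, movieRatings), plus a mirrored copy sorted by viewer (viewerStart, viewerMovies, viewerRatings):

  #include <cstring>

  //movie m's raters are movieViewers[movieStart[m] .. movieStart[m+1]),
  //with their ratings in movieRatings[]; the viewer-sorted mirror is analogous.
  void accumulateMovie(int X, PearsonIntermediate* acc /*MOVIE_CNT entries*/) {
    memset(acc, 0, sizeof(PearsonIntermediate) * MOVIE_CNT); //all-zero bytes == 0.0f
    for (int i = movieStart[X]; i < movieStart[X + 1]; i++) {
      int v = movieViewers[i];
      float rx = movieRatings[i]; //viewer v's rating of movie X
      for (int j = viewerStart[v]; j < viewerStart[v + 1]; j++) {
        int Y = viewerMovies[j];
        float ry = viewerRatings[j]; //viewer v's rating of movie Y
        PearsonIntermediate& p = acc[Y];
        p.x += rx; p.y += ry;
        p.xy += rx * ry;
        p.xx += rx * rx; p.yy += ry * ry;
        p.cnt++;
      }
    }
  }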

The big-O complexity of this approach is O(M*(T/M)*(T/V)) = O(T*T/V), where V is the total number of viewers. The brute-force approach is O(M*T). Since T/V << M (the average viewer rated only a couple hundred of the 17770 movies), I should get a significant speed-up. Furthermore, the inner loop doesn't have the unpredictable branches of the brute-force approach.

This was just the motivation I needed to implement KNN on the residues of other algorithms. If the brute-force approach was slow on raw ratings (integers 1 to 5), it would be even slower on residues (floating point). After I was done with residues, I retrofitted this approach to my raw-rating KNN, and the improvement was huge: calculation time dropped to about 15 minutes, a 10x speed-up !

That's pretty good, but wait, there is more ...



Parallelization

At this point I implemented multi-threading to take advantage of the multiple CPU cores. Movie correlation calculation is trivially parallelizable: each movie's correlations can be computed independently. The result was about a 3.4x speed-up on 4 cores, bringing the running time down to 265 seconds.
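To show how simple the partitioning is, here's a sketch using OpenMP (chosen for brevity; not necessarily how I actually did it, and each thread needs its own accumulation buffer):

  //Each movie's row of correlations is computed independently,
  //so a parallel-for over movies is all it takes.
  #pragma omp parallel for schedule(dynamic)
  for (int X = 0; X < MOVIE_CNT; X++) {
    computeAndSaveCorrelations(X); //hypothetical per-movie worker
  }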

But wait, there is more ...



Simplifying the Inner Loop

I thought, since raw ratings can only be 1, 2, 3, 4, or 5, I should be able to speed things up significantly by replacing the PearsonIntermediate structure with:
    unsigned int vCount[5][5]; //32-bit counts
where vCount[m][n] is the number of viewers who gave movie X a score of m+1 and movie Y a score of n+1. This simplifies the inner loop to just incrementing an integer !

I quickly implemented this "clever" idea; however, the result was disappointing: the speed-up was only a few percent in some quick tests. Hmm, what could be wrong ?

After some head-scratching, I guessed the problem was the amount of memory touched. Each PearsonIntermediate structure takes only 24 bytes, but "unsigned int vCount[5][5]" takes 100 bytes. So as the code did its calculations, more memory had to be moved in and out of the cache, and this seriously harmed performance.

Fortunately, we can use the nature of the data itself to work around this problem. Of the 17770 movies in the data set, 33% have fewer than 256 ratings, 65% have between 256 and 65535 ratings, and only 2% have more than 65535 ratings. Of course, if a movie has fewer than 256 ratings, there is no way it can have 256 or more common viewers with another movie, so the common-viewer counts of this movie can be stored using "unsigned char" instead of "unsigned int", or 1 byte instead of 4 bytes. For movies that have between 256 and 65535 ratings, we can use "unsigned short" (2 bytes).

This means I need 3 versions of the correlation-calculation function, one each for unsigned int/short/char (or int32/int16/int8). Here C++'s function template feature comes to the rescue (in spite of the awkward syntax). I wrote one function that uses a type T to store the common-viewer counts:

  T vCount[5][MOVIE_CNT][5]; //initialize to zero

  for each viewer V who rated movie X {
    m=the rating viewer V gave to movie X 
    for each movie Y rated by viewer V {
      n=the rating viewer V gave to movie Y
      vCount[m][Y][n]++;
    }
  }
and I had 3 "instantiations" (those jargons are as bad as the awkward syntax) of this function where T is unsigned int/short/char. Notice the vCount array is arranged as [5][MOVIE_CNT][5] instead of the more "straight-forward" form of [MOVIE_CNT][5][5]. My tests showed [5][MOVIE_CNT][5] gave the best performance, no doubt because it minimized cache misses.
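Here's a sketch of the templated counting pass, reusing the assumed data layout from the earlier sketch (raw integer ratings this time), plus the three instantiations:

  #include <cstring>

  template <typename T>
  void countCommonRatings(int X, T (&vCount)[5][MOVIE_CNT][5]) {
    memset(vCount, 0, sizeof(vCount)); //zero the whole count array
    for (int i = movieStart[X]; i < movieStart[X + 1]; i++) {
      int v = movieViewers[i];
      int m = (int)movieRatings[i] - 1; //raw rating 1..5 -> index 0..4
      for (int j = viewerStart[v]; j < viewerStart[v + 1]; j++) {
        int Y = viewerMovies[j];
        int n = (int)viewerRatings[j] - 1;
        vCount[m][Y][n]++;
      }
    }
  }

  //the 3 instantiations:
  template void countCommonRatings<unsigned char>(int, unsigned char(&)[5][MOVIE_CNT][5]);
  template void countCommonRatings<unsigned short>(int, unsigned short(&)[5][MOVIE_CNT][5]);
  template void countCommonRatings<unsigned int>(int, unsigned int(&)[5][MOVIE_CNT][5]);

From the finished vCount, the Pearson sums fall out as weighted sums over the 25 rating-pair cells; for example, the xy sum for movie pair (X, Y) is just the sum of (m+1)*(n+1)*vCount[m][Y][n] over all m and n.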

With this change, the running time improved to 156 seconds.

But wait, there is more ...



Complicating the Inner Loop

The previous change simplified the inner loop, but most of the gains seem to come from reducing the demand on the memory subsystem and the number of cache misses. Can we do better ?

Consider a movie with 3000 ratings: the type of its vCount[][][] array is "unsigned short", or 2 bytes per element. But it's unlikely that the final value of any array element would reach 256, since at most 3000 common viewers are spread over 25 rating-pair cells (3000/25 = 120 on average). Even for movies with 6400 or more ratings (6400/25 = 256), because ratings are not evenly distributed, most vCount[][][] elements still fit in one byte.

So for most movies, one byte is enough for most vCount elements. But how do we use the 1-byte (unsigned char) version of vCount and still handle the occasional element that goes above 255 ? Well, use another array to count the overflows:

  unsigned char vCount[5][MOVIE_CNT][5]; //initialize to zero
  unsigned short overflows[5][MOVIE_CNT][5]; //initialize to zero

  for each viewer V who rated movie X {
    m=the rating viewer V gave to movie X 
    for each movie Y rated by viewer V {
      n=the rating viewer V gave to movie Y
      vCount[m][Y][n]++;
      if (0==vCount[m][Y][n]) overflows[m][Y][n]++;
    }
  }
The code above takes advantage of the fact that unsigned integer arithmetic in C++ wraps around: incrementing an unsigned char holding 255 gives zero.
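When the counts are read back to compute the correlations, the two arrays combine per cell, something like:

  //true common-viewer count for rating pair (m+1, n+1) of movies X and Y:
  unsigned int c = ((unsigned int)overflows[m][Y][n] << 8) + vCount[m][Y][n];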

At first I was very skeptical of this change, because it adds a branch to the innermost loop. How could it ever improve performance with that ? However, that branch is highly predictable, and everything I've read says that with modern CPUs' branch prediction logic, the cost of a predictable branch is almost nothing.

So with great skepticism, I tried the code above. To my relief it did run faster, bringing the running time down to 143 seconds.

But wait, there is more ....



Reordering the Movies

For any viewer V, the IDs of the movies V has rated are pretty much randomly distributed. Imagine the array of movies, with red dots indicating the movies rated by viewer V: as we go thru the movies rated by viewer V, most red dots incur cache misses.

However, if we renumber the movie IDs in increasing order of the movies' rating counts, then the distribution of red dots changes: statistically, we get the same or even more cache misses at the low indices, but as we move into higher movie indices (which now hold the popular movies that dominate every viewer's list), the chance that the viewer rated closely spaced movies increases, reducing the cache misses as vCount[][][] is updated.

Notice the "sorting" of the movie IDs by movie rating count is not an O(N*log(N)) quicksort for 100 million ratings, but just an O(N) remapping.

This change reduced the running time to 116 seconds, less than 2 minutes.



Summary

Well, that's it. I've picked all the low-hanging fruit, at least all of it that I could see.

Here's the list of performance gains from my initial implementation:

*Algorithm change: 10x
*Parallelization: 3.4x (4 cores vs 1 core)
*Low-level tweaks: 2.3x

In total about 78x performance gain, without writing a single line of assembly language.

Of the 3 items above, parallelization gave the best ratio of performance gain to implementation effort, because the KNN calculation can be parallelized so easily. I believe more cores would help further, but with diminishing returns.



The Memory Bandwidth Bottleneck

Memory bandwidth seems to be the performance bottleneck of KNN algorithms.

When I was coding the parallelization step, I installed some software to monitor MSRs, or model-specific registers. MSRs are very useful for performance monitoring, but unfortunately they're very sparsely documented. The "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide" has some details, but nowhere near as detailed as the documentation for individual CPU instructions.

One particular MSR I've looked at is DCU_MISS_OUTSTANDING. This webpage has the following to say about it:

DCU_MISS_OUTSTANDING gives a weighted measure of the cycles spent waiting for a cache line to be brought in. Each cycle spent waiting is weighted by the number of outstanding cache misses ... and has some caveats, but can be considered a rough estimate of the time spent waiting for a cache line to be brought in.

Anyway, when I first implemented multithreading, the code took 265 seconds to run, and I recorded about 400 billion CPU cycles spent on DCU_MISS_OUTSTANDING events. The final version took 116 seconds and recorded about 100 billion cycles. The difference: 149 seconds and 300 billion cycles. 300 billion cycles divided by 2.4 GHz (the CPU clock speed) gives 125 seconds. So it looks like a lot of the gain came from reducing the time spent waiting for data to be loaded into the cache.

I won't pretend I completely understand DCU_MISS_OUTSTANDING or many of the other MSRs; I just hope Intel provides better documentation.