...: August 2007

Wednesday, August 29, 2007

The Great Collaborative Filtering Groupthink, Thinks Faster

(Note: this post assumes you're familiar with The Netflix Prize.)
(Warning: long post)

As described in the “The Great Collaborative Filtering Groupthink” post, my viewer-clustering algorithm has an assignment step and an update step, repeated for many iterations.

In my initial implementation, each iteration took about 90 seconds, for a set of 64 clusters (see previous post for my hardware spec). I didn’t write down the initial performance numbers, but if I recall correctly, the assignment step took anywhere from 70 to 80 seconds, with the rest going to the update step.

Now if a set of clusters takes 10 iterations to build, and I need to build 8 sets (with different random starting values), that’s 2 hours of running time already. I’d rather not wait for 2 hours every time I tweak the algorithm or fix a bug, plus I need to experiment with different number of clusters, so I looked for ways to make it run faster, eventually reducing the per-iteration running time to 17.6 seconds.

Obviously the assignment step is the place to look first. The following is its pseudo-code:

//loop thru data in viewer-cluster-movie order
for each viewer {
    minSquare=FLT_MAX; //maximum float value
    for each cluster {
        clusterSquare=0;
        for each movie rated by the current viewer {
            clusterSquare+=(ratingOfViewer – ratingOfCluster)^2;
        } // for each movie rated by the current viewer
        if (clusterSqueare &lt minSquare) {
            Assign the current viewer to the current cluster.
            minSquare=clusterSquare;
        }
    } //for each cluster
} //for each viewer

First, I tried an early-out approach for the inner loop:

//loop thru data in viewer-cluster-movie order
for each viewer {
    minSquare=FLT_MAX; //maximum float value
    for each cluster {
        clusterSquare=0;
        for each 5% of the movies rated by the current viewer {
            for each movie in the current 5% {
                clusterSquare+=(ratingOfViewer – ratingOfCluster)^2;
            } // for each movie in the next 5%
            if (clusterSqure >= minSquare) skip current cluster.
        } //repeat 20 times
        Assign the current viewer to the current cluster.
        minSquare=clusterSquare;
    } //for each cluster
} //for each viewer

Basically, I'm tracking the current minimum square of distance value in minSquare, and as soon as the current cluster's clusterSquare value exceeds minSquare, I move onto the next cluster. However this improved performance less than 10%. I think the reason is that for most clusters, you need to go thru most of the data before clusterSquare exceeds minSquare. It’s time to look deeper.

For each set of clusters, I stored all the clusters’ rating data in a 2-D array:

unsigned short ratings[clusterCount][movieCount];

That’s right, I used a 16-bit unsigned integer for each rating. The value in ratings[][] array is the real rating value times 1024. The data is stored in the natural cluster-by-movie order: the first index of the array is cluster index, and the second index of the array is movie index. Here’s the memory layout of the data (from low memory address to high memory address):

c0m0, c0m1, c0m2, …… c0m[movieCount-2], c0m[movieCount-1],
c1m0, c1m1, c1m2, …… c1m[movieCount-2], c1m[movieCount-1],
c2m0, c2m1, c2m2, …… c2m[movieCount-2], c2m[movieCount-1],
...

where c0m0 is the cluster 0’s rating for movie 0, and c0m1 is the cluster 0’s rating for movie 1, and so on.

In the Netflix data set, there’re 17770 movies and about half million viewers, but only about 100 million ratings, so on average each viewer rated about one in 89 movies.

Therefore in my inner loops above (for each movie rated by the viewer), the code is skipping thru the rows of the array from left to right. On average, the memory address of each rating value accessed is about 89 ratings (178 bytes) away from the previous address. So it’s a guaranteed cache miss in most cases, with horrible performance penalties. Most of the time, the CPU is just sitting there waiting for the data to arrive from memory.

OK, given this scenario, the obvious solution is to turn the ratings[][] array sideways, swapping the first index and the second index:

unsigned short ratings[movieCount] [clusterCount];

and the new memory layout:

m0c0, m0c1, m0c2, …… m0c[clusterCount-2], m0c[clusterCount-1],
m1c0, m1c1, m1c2, …… m1c[clusterCount-2], m1c[clusterCount-1],
m2c0, m2c1, m2c2, …… m2c[clusterCount-2], m2c[clusterCount-1],
...

where m0c0 is the cluster 0’s rating for movie 0, and m0c1 is the cluster 1’s rating for movie 0, and so on. Now we go thru the data in a new order:

//loop thru data in viewer-movie-cluster order
float clusterSquares[clusterCount];
for each viewer {
    set all elements of clusterSquares[] to zero.
    for each movie rated by the current viewer {
        for each cluster {
            clusterSquares[clusterIndex]+=(ratingOfViewer – ratingOfCluster)^2;
        }
    } // for each movie rated by the current viewer
    Assign the viewer to the cluster that has the smallest value in clusterSquares[].
} // for each viewer {

Now in the inner loop, the code is scanning thru the array/memory sequentially with no skips, greatly reducing cache miss.

At this time I also changed the data type of ratings[][] array from unsigned short (2 bytes) to unsigned char (1 byte), to reduce the amount of data that need to be accessed. Yes I lose some accuracy this way, but the probe RMSE using 1-byte rating value is only a few 0.0001’s worse than before, so the effect is negligible for the result we’re trying to achieve here.

With both changes above, the assignment step now took about 44.3 seconds, and the update step took about 8.0 seconds (yes I wrote these numbers down), for a total of 52.3 seconds, a good improvement from the previous 90 seconds.

But wait, there’s more. The new inner-loop is scanning thru the array/memory sequentially, so we can easily take advantage of data prefetching.

However, given that cluster count is relatively small, prefetching within the cluster rating data of each movie (each row in the array/memory layout) is probably not useful. So I decided to prefetch by movie, at the second-most inner-loop level:

for each movie rated by the current viewer {
    movieToPrefetch=moviesRatedByViewer[currentIndex +prefetchOffset];
    prefetch memory address &ratings[movieToPrefetch][0]
    ...
} // for each movie rated by the current viewer

Now, there're a few caveats about prefetching:

First of all, for production code, you obviously want to write some CPU detection code to make sure the running CPU supports prefetching instructions.

Secondly, the optimal prefetch distance depends on a number of factors (yes the article talks about Itanium, but the general principle applies to Pentium too). Not a trivial task.

And finally, there’s a potential crash bug in the calculation of the prefetch memory address:

MovieToPrefetch=moviesRatedByViewer[currentIndex +prefetchOffset];

Because of prefetchOffset, we’re reading beyond the memory allocated for moviesRatedByViewer[], a potential crash.

In reality, it probably won’t crash, because moviesRatedByViewer[] sits in a block of memory pages allocated by the C++’s runtime memory manager, and the memory immediately after moviesRatedByViewer[] most likely belongs to the memory manager, so our program can still read it. The only way it could crash is if moviesRatedByViewer[] ends at or near the last byte of the last page of the block of pages it sits in.

But still, it’s better to just fix this. Fortunately, the fix is very simple: just allocate some extra bytes for moviesRatedByViewer[]. If you estimate your maximum prefetchOffset is N, just allocate an extra N*sizeof(*moviesRatedByViewer) bytes.

After I added prefetching, and did a few trial runs to find the best prefetchOffset value, the running time of assignment step went from 44.3 seconds to 32.2 seconds. Another 27% improvement.

But wait, there’s more. The new inner-loop is scanning thru array/memory sequentially, doing the same type of operation on each data element. This practically screams out for SIMD optimization.

It’s always fun to improve performance by writing assembly language code, but we rarely have good reasons to do it nowadays. Now, if processing 100 million pieces of data, repeatedly, in a non-production environment, isn’t a good excuse, then I don’t know what is.

So I translated the inner-loop to inline assembly code using SSE instructions. Even a straight-forward translation, without any tuning at all, reduced the running time of assignment step from 32.2 seconds down to 10.4 seconds! After a few runs to find a new optimal prefetch distance, the running time dropped further to 9.6 seconds. A huge improvement !

So in summary, the per-iteration running times in seconds:

Original implementation: total=90
Rearrangement/reduction of data: assignement=44.3, update=8.0, total=52.5
Prefetch: assignement=32.2, update=8.0, total=40.2
SSE inline assembly code: assignment= 9.6, update=8.0, total=17.6

An 80% reduction in running time, or 5 times boost in speed.

Next up: The Greater Collaborative Filtering Groupthink (KNN)

Friday, August 24, 2007

The Great Collaborative Filtering Groupthink

(Note: this post assumes you're familiar with The Netflix Prize.)

Now, some details on algorithms. I'll start with my viewer-clustering approach first.

The basic idea is to come up with a number of "representative" viewers, with each one representing a group of viewers with similar tastes. In various literatures these "representative" viewers are usually called "centroid" or "cluster" - I'll use the term "cluster" from now on.

In our case, each cluster is really a made-up viewer representing a group of viewers. The data for each cluster are just movie ratings, one rating value for each movie:

unsigned char clusterMovieRatings[MOVIE_CNT];

Note I used a single byte (unsigned char) for each rating. I'll write more on this detail later.

When it’s time to make a decision (prediction), each viewer just asks all the clusters he belongs to for a prediction, and then calculate the weighted average of the clusters' predictions. Of course, the predictions of each cluster is the average prediction of all viewers belonging to that cluster. Now you see why I call this algorithm the great collaborative filtering groupthink.

OK, how do you come up with the clusters ? I used the basic k-means algorithm:

Initialization: Setup some clusters, set their movie rating values to random.
Assignment step: For each viewer, determine which cluster the viewer belongs to, i.e. the cluster that is most similar to the viewer. I used the square of Euclidean distance as the measure of similarity, that is, the sum of squares of difference between the viewer’s rating and the cluster’s rating, over the movies the viewer has rated. Of course, the smaller the number the greater the similarity.
Update step: Go thru each cluster. For each movie, calculate the its average rating from the viewers who both belong to this cluster and have rated this movie. Replace the cluster’s rating for this movie with the average.
Repeat the previous two steps a number of times, or you can make a prediction for the probe data set and decide if to stop based on its RMSE.

To make a prediction for viewer v and movie m:

Find the K clusters that are most similar to viewer v. Again, the square of Euclidean distance is used to measure similarity. Calculate the weight for each cluster from the square of distance: weight=1/(squaredDistance + offset). The offset is a small positive number, it’s there so we don’t get a divide by zero if a viewer happen to match a cluster perfectly.
Calculate the weighted average of the K clusters' ratings for movie m, and that’s the predicted rating for viewer v and movie m.

That’s it. Pretty simple, isn’t it ? Of course there’re enhancement to the basic k-means algorithm such as soft k-means and k-means++ (and I don’t see any reason why they can’t be combined to produce a soft k-means++ algorithm), but I’m trying to get the simple and easy stuff working first. And given what I tried next, they probably don’t matter in the end.

With the algorithms above, I tried different combinations of the following parameters:

C: the total number of clusters.
I: the number of iterations used to build the clusters.
K: the number of most similar clusters used for prediction.

In the end, C=64, I=10, and K=5 seemed the best. Furthermore, since we set the clusters’ ratings to random values at the start of the algorithm, we can run the algorithm multiple times, using a different set of random initial values each time, and then average the predictions to get a better RMSE. Basically we create more groups among the viewers, and averaging nudges the predictions toward a more reliable, common value.

Here a mystery surfaced: K=6 or 7 consistently showed the best RMSE for raw predictions, but K=3 or 4 gave the best result when averaging multiple sets of predictions.

At this stage I started tweaking the algorithm. What I found was all tweaks improved the raw RMSE, but the averaged RMSEs stayed pretty much the same, another mystery. Here’re some probe RMSE from raw predictions with C=64 and I=10:

K=2, RMSE=0.9551
K=3, RMSE=0.9480
K=4, RMSE=0.9450
K=5, RMSE=0.9437
K=6, RMSE=0.9433
K=7, RMSE=0.9433
K=8, RMSE=0.9437
K=10, RMSE=0.9448
K=15, RMSE=0.9489
K=20, RMSE=0.9536
K=25, RMSE=0.9585
K=30, RMSE=0.9635

and here are some probe RMSEs from averaging multiple sets with C=64, I=10:

Average of 2, K=2, RMSE=0.9440
Average of 2, K=3, RMSE=0.9408
Average of 2, K=4, RMSE=0.9398
Average of 2, K=5, RMSE=0.9397
Average of 2, K=6, RMSE=0.9400
Average of 2, K=7, RMSE=0.9407

Average of 4, K=2, RMSE=0.9386
Average of 4, K=3, RMSE=0.9374
Average of 4, K=4, RMSE=0.9374
Average of 4, K=5, RMSE=0.9379
Average of 4, K=6, RMSE=0.9387
Average of 4, K=7, RMSE=0.9396

Average of 8, K=2, RMSE=0.9358
Average of 8, K=3, RMSE=0.9356
Average of 8, K=4, RMSE=0.9360
Average of 8, K=5, RMSE=0.9368
Average of 8, K=6, RMSE=0.9378
Average of 8, K=7, RMSE=0.9388

So in summary, groups are good, individuals are bad. We’re the Borg, you’ll be assimilated.

Next up: The Great Collaborative Filtering Groupthink Thinks Faster

Wednesday, August 22, 2007

The Great Collaborative Filtering Framework

(Note: this post assumes you're familiar with The Netflix Prize.)

Continuing from the last blog, I plodded along and wrote some boring C# code to generate my binary files. Another task is to extract the probe data from the full data set. In the end I had 3 sets of data: the full data set, the probe-less data set, and the probe data set. For each set, I created two binary files: one sorted by movie ID first and then by viewer ID, the other sorted by viewer ID first and then by movie ID.

Next, I generated a prediction file using weighted averages of movie and viewer average ratings. To do that I had to get a least squares fitting/linear regression program going. This program is also used later to blend results from different algorithms. Plus I had to write a program to grade the probe predictions. The RMSE number I got agreed with what other people were reporting, so I was more or less on the right track. Whew, it was a lot of work just to get a basic framework going. But it should be smooth sailing from now on, right ?

Well, before I go any further, there're questions to be answered:

Should I filter the data somehow ? As others have reported, there're oddball viewers in the data set. Like the guy who rated about 17000 movies, or the guy who rated thousands of movies in one day.
Should I do some sort of global offset or normalization ?
What to do about rating dates ? If you look at the global averages vs dates, you'll see the averages steadily going up, and there was an abrupt jump of about 0.2 around April 2004.

Hmm, I don't have any good answers, but a few hokey theories and SWAGs (scientific wild ass guess) came to the rescue:

Filtering: not needed, as the number of oddball viewers is too small and their effect is probably negligible.
Global offset/normalization: other people have reported getting RMSE in the low 0.98s using only movie/viewer averages. That’s within .03 of Netflix’s Cinematch algorithm, which is a non-trivial algorithm, which has to do more than just offset/normalization, if it does it at all. So the effect of offset/normalization is probably around 0.015. So I think as a data mining newbie just getting started on this, I can ignore this issue for now, I’ll worry about it after I exhaust other possibilities.
Rating dates: this might be nothing. It’s possible that as time progresses, Netflix collected more and more data and continuously updated their software, so people get better and better recommendations, and as a result the ratings got better and better. As for the abrupt jump around April 2004, maybe they did a major software upgrade/bug fix around that time.

Actually, by this time I was so tired of reading research papers and so eager to get a nontrivial algorithm going, I’ll make any excuse to take the “damn the torpedoes full speed ahead” approach.

Next up: the great collaborative filtering groupthink, aka clustering.

Thursday, August 16, 2007

Great Expectations and Database

(Note: this post assumes you're familiar with The Netflix Prize.)

When I first came across the Netflix prize, I just couldn't resist urge to participate. It looked like a well-planned contest with an interesting problem. It would test and improve your coding skills and your understanding of interesting algorithms. Plus there's a $1,000,000 prize. So I eagerly signed up, downloaded the (large) data set, read the documentations, and pondered, how to start ?

At first, I thought this was a good project with which to improve my relatively limited knowledge of C#. I also had the idea of putting all the data into a database, so I could easily run queries like "gimme all viewers who gave this movie a score of 3", and I could learn the C#/.NET interface to database at the same time. So the version 0 of my grand plan was:

Improve my C# skillz.
Learn C# and .NET database interface.
Read about and implement interesting algorithms.
Win the one million dollar prize.

Hehehe... good thing I didn't get XML and Haskell into the mess.

So I setup a database, and started writing a C# program to reformat the raw data files from Netflix. That worked out pretty well and soon I had some data in the database. But as soon as I tried accessing the database I noticed it is unbearably slow for this task. It was probably OK even if I wanted to run thousands of queries at a time, but a prediction algorithm needs to go thru all 100 million pieces of data, and more. With everything in the database, even the simplest algorithm would take a few hours to run. This seemed to mirror the experience of other people. Maybe it's workable if I have a cluster of 1000 fast PCs at my disposal, but with at most 2 computers at my disposal, the database approach would never work.

So I abandoned the database-based approach, and switched to sorting and saving data in binary files for fast access. This time I put a little more thought into it and decided to use memory-mapped file as my primary means of getting data off the disk. This simplified my data access and has worked out very well so far.

Friday, August 10, 2007

My Netflix Attempt

I've been working on the Netflix prize contest off and on, basically the it's a data-mining challenge with a $1 million dollar prize. I'll be describing some of the approaches I took, hopefully someone will give me some feedback that will improve my score (0.899 as of July 20, 2007).

So far I've the following algorithms up and running: SVD (based on Timely Development's code, thank you!), KNN, clustering, MLE(or is it EM ? I'm not sure, anyway it didn't work), and a couple of amateur-ish schemes I dreamt up, which also didn't work out (by not working I mean did not beat Netflix's Cinematch's RMSE). Oh by the way, this is the first time I've ever implemented these algorithms.

While I certainly have enough programming experience for this contest, I'm not a mathematician or a statistician, and I have never dabbled in data mining before. So you won't find deep theoretical musings on topics like sampling techniques, SVD, restricted Boltzmann machine guns here. I'll concentrate on my implementation detail and try to keep my understanding and my terminology correct. All malaprops may or may not be intestinal.

More details to follow.