Wednesday, August 29, 2007

The Great Collaborative Filtering Groupthink, Thinks Faster

(Note: this post assumes you're familiar with The Netflix Prize.)
(Warning: long post)

As described in the post "The Great Collaborative Filtering Groupthink," my viewer-clustering algorithm has an assignment step and an update step, repeated for many iterations.

In my initial implementation, each iteration took about 90 seconds, for a set of 64 clusters (see previous post for my hardware spec). I didn’t write down the initial performance numbers, but if I recall correctly, the assignment step took anywhere from 70 to 80 seconds, with the rest going to the update step.

Now if a set of clusters takes 10 iterations to build, and I need to build 8 sets (with different random starting values), that's 2 hours of running time already. I'd rather not wait 2 hours every time I tweak the algorithm or fix a bug, and I also need to experiment with different numbers of clusters, so I looked for ways to make it run faster, eventually reducing the per-iteration running time to 17.6 seconds.

Obviously the assignment step is the place to look first. The following is its pseudo-code:

//loop thru data in viewer-cluster-movie order
for each viewer {
    minSquare=FLT_MAX; //maximum float value
    for each cluster {
        clusterSquare=0;
        for each movie rated by the current viewer {
            clusterSquare+=(ratingOfViewer - ratingOfCluster)^2;
        } // for each movie rated by the current viewer
        if (clusterSquare < minSquare) {
            Assign the current viewer to the current cluster.
            minSquare=clusterSquare;
        }
    } //for each cluster
} //for each viewer

First, I tried an early-out approach for the inner loop:

//loop thru data in viewer-cluster-movie order
for each viewer {
    minSquare=FLT_MAX; //maximum float value
    for each cluster {
        clusterSquare=0;
        for each 5% chunk of the movies rated by the current viewer {
            for each movie in the current chunk {
                clusterSquare+=(ratingOfViewer - ratingOfCluster)^2;
            } // for each movie in the current chunk
            if (clusterSquare >= minSquare) skip to the next cluster.
        } // for each 5% chunk, up to 20 times
        // reaching this point means clusterSquare < minSquare
        Assign the current viewer to the current cluster.
        minSquare=clusterSquare;
    } //for each cluster
} //for each viewer

Basically, I'm tracking the current minimum squared distance in minSquare, and as soon as the current cluster's clusterSquare value exceeds minSquare, I move on to the next cluster. However, this improved performance by less than 10%. I think the reason is that for most clusters, you need to go thru most of the data before clusterSquare exceeds minSquare. It's time to look deeper.

For each set of clusters, I stored all the clusters’ rating data in a 2-D array:

unsigned short ratings[clusterCount][movieCount];

That's right, I used a 16-bit unsigned integer for each rating. The value in the ratings[][] array is the real rating value times 1024. The data is stored in the natural cluster-by-movie order: the first index of the array is the cluster index, and the second index is the movie index. Here's the memory layout of the data (from low memory address to high memory address):

c0m0, c0m1, c0m2, …… c0m[movieCount-2], c0m[movieCount-1],
c1m0, c1m1, c1m2, …… c1m[movieCount-2], c1m[movieCount-1],
c2m0, c2m1, c2m2, …… c2m[movieCount-2], c2m[movieCount-1],
...

where c0m0 is cluster 0's rating for movie 0, c0m1 is cluster 0's rating for movie 1, and so on.
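To make the storage concrete, here is roughly what the fixed-point conversion and the row-major indexing look like in C++. This is only a sketch; the helper names (toFixed, clusterRating) and the rounding are my own, not the actual code.

#include <cstddef>

const int SCALE = 1024; // stored value = real rating * 1024

// convert a real-valued rating to the 16-bit fixed-point form (rounding is an assumption)
inline unsigned short toFixed(float rating)
{
    return static_cast<unsigned short>(rating * SCALE + 0.5f);
}

// row-major access into a flat buffer laid out exactly as shown above:
// all of cluster 0's movie ratings, then all of cluster 1's, and so on
inline unsigned short& clusterRating(unsigned short* ratings,
                                     std::size_t movieCount,
                                     std::size_t cluster, std::size_t movie)
{
    return ratings[cluster * movieCount + movie];
}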

In the Netflix data set, there are 17770 movies and about half a million viewers, but only about 100 million ratings, so on average each viewer rated about one in 89 movies.

Therefore in my inner loops above (for each movie rated by the viewer), the code skips along each cluster's row of the array from left to right, touching only the movies the current viewer rated. On average, each rating value accessed is about 89 ratings (178 bytes) away from the previous one, so in most cases it's a cache miss, with horrible performance penalties. Most of the time, the CPU is just sitting there waiting for the data to arrive from memory.

OK, given this scenario, the obvious solution is to turn the ratings[][] array sideways, swapping the first index and the second index:

unsigned short ratings[movieCount][clusterCount];

and the new memory layout:

m0c0, m0c1, m0c2, …… m0c[clusterCount-2], m0c[clusterCount-1],
m1c0, m1c1, m1c2, …… m1c[clusterCount-2], m1c[clusterCount-1],
m2c0, m2c1, m2c2, …… m2c[clusterCount-2], m2c[clusterCount-1],
...

where m0c0 is cluster 0's rating for movie 0, m0c1 is cluster 1's rating for movie 0, and so on. Now we go thru the data in a new order:

//loop thru data in viewer-movie-cluster order
float clusterSquares[clusterCount];
for each viewer {
    set all elements of clusterSquares[] to zero.
    for each movie rated by the current viewer {
        for each cluster {
            clusterSquares[clusterIndex]+=(ratingOfViewer - ratingOfCluster)^2;
        }
    } // for each movie rated by the current viewer
    Assign the viewer to the cluster that has the smallest value in clusterSquares[].
} // for each viewer

Now the inner loop scans thru the array/memory sequentially with no skips, greatly reducing cache misses.

At this time I also changed the data type of the ratings[][] array from unsigned short (2 bytes) to unsigned char (1 byte), to reduce the amount of data that needs to be accessed. Yes, I lose some accuracy this way, but the probe RMSE using 1-byte rating values is only a few 0.0001's worse than before, so the effect is negligible for the result we're trying to achieve here.
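Putting the two changes together, the reordered assignment step looks roughly like the plain C++ sketch below. The names are illustrative, the per-viewer arrays are assumed to already exist, and I'm using 32-bit integer sums here (the pseudo-code above used floats; with fixed-point byte ratings either works).

#include <cstdint>
#include <cstddef>

// assign one viewer, walking the data in viewer-movie-cluster order
void assignViewer(const unsigned char* ratings,         // ratings[movie * clusterCount + cluster]
                  int clusterCount,
                  const int* moviesRatedByViewer,       // indices of the movies this viewer rated
                  const unsigned char* ratingsOfViewer, // viewer's ratings, same fixed-point scale
                  int ratingCount,
                  int* assignedCluster)                 // out: best cluster for this viewer
{
    std::uint32_t clusterSquares[256] = {0};            // one running sum per cluster (assumes clusterCount <= 256)

    for (int i = 0; i < ratingCount; ++i) {
        const unsigned char* row = ratings + std::size_t(moviesRatedByViewer[i]) * clusterCount;
        int vr = ratingsOfViewer[i];
        for (int c = 0; c < clusterCount; ++c) {         // sequential walk through one row of the array
            int d = vr - row[c];
            clusterSquares[c] += std::uint32_t(d * d);
        }
    }

    // pick the cluster with the smallest sum of squared differences
    int best = 0;
    for (int c = 1; c < clusterCount; ++c)
        if (clusterSquares[c] < clusterSquares[best]) best = c;
    *assignedCluster = best;
}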

With both changes above, the assignment step now took about 44.3 seconds, and the update step took about 8.0 seconds (yes, I wrote these numbers down), for a total of 52.3 seconds, a good improvement over the previous 90 seconds.

But wait, there’s more. The new inner-loop is scanning thru the array/memory sequentially, so we can easily take advantage of data prefetching.

However, given that the cluster count is relatively small, prefetching within each movie's cluster rating data (each row in the array/memory layout) is probably not useful. So I decided to prefetch by movie, at the second-innermost loop level:

for each movie rated by the current viewer {
    movieToPrefetch=moviesRatedByViewer[currentIndex + prefetchOffset];
    prefetch memory address &ratings[movieToPrefetch][0]
    ...
} // for each movie rated by the current viewer
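On x86, the prefetch itself is just the _mm_prefetch intrinsic from <xmmintrin.h>. Below is an illustrative version of that loop (variable names match the sketch above); note that it deliberately keeps the out-of-bounds read on currentIndex + prefetchOffset, which is the caveat discussed a little further down.

#include <xmmintrin.h>  // _mm_prefetch / _MM_HINT_T0
#include <cstddef>

// walk one viewer's movie list, prefetching a future movie's row of cluster ratings
void walkMoviesWithPrefetch(const unsigned char* ratings, int clusterCount,
                            const int* moviesRatedByViewer, int ratingCount,
                            int prefetchOffset)
{
    for (int currentIndex = 0; currentIndex < ratingCount; ++currentIndex) {
        // reads past the end of moviesRatedByViewer near the end of the loop;
        // see the padding fix below
        int movieToPrefetch = moviesRatedByViewer[currentIndex + prefetchOffset];
        _mm_prefetch(reinterpret_cast<const char*>(
                         ratings + std::size_t(movieToPrefetch) * clusterCount),
                     _MM_HINT_T0);

        // ... process the current movie's row as in the scalar sketch above ...
    }
}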

Now, there're a few caveats about prefetching:

First of all, for production code, you obviously want to write some CPU detection code to make sure the running CPU supports prefetching instructions.

Secondly, the optimal prefetch distance depends on a number of factors (yes, the article talks about Itanium, but the general principle applies to Pentium too), so finding a good value is not a trivial task.

And finally, there’s a potential crash bug in the calculation of the prefetch memory address:

movieToPrefetch=moviesRatedByViewer[currentIndex + prefetchOffset];

Because of prefetchOffset, near the end of the loop we read beyond the memory allocated for moviesRatedByViewer[], which is a potential crash.

In reality, it probably won't crash, because moviesRatedByViewer[] sits in a block of memory pages allocated by the C++ runtime's memory manager, and the memory immediately after moviesRatedByViewer[] most likely belongs to that memory manager too, so our program can still read it. The only way it could crash is if moviesRatedByViewer[] ends at or near the last byte of the last page in that block.

But still, it’s better to just fix this. Fortunately, the fix is very simple: just allocate some extra bytes for moviesRatedByViewer[]. If you estimate your maximum prefetchOffset is N, just allocate an extra N*sizeof(*moviesRatedByViewer) bytes.
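In code, the padding is just a slightly larger allocation, something like this (the element type int and the constant name are assumptions of mine):

// pad the per-viewer movie list so the prefetch address calculation can
// safely read up to PREFETCH_OFFSET_MAX entries past the real data
const int PREFETCH_OFFSET_MAX = 16;   // assumed upper bound on prefetchOffset

int* allocateMovieList(int ratingCountOfViewer)
{
    // the extra entries are only ever read by the prefetch near the end of the loop
    return new int[ratingCountOfViewer + PREFETCH_OFFSET_MAX];
}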

After I added prefetching and did a few trial runs to find the best prefetchOffset value, the running time of the assignment step went from 44.3 seconds to 32.2 seconds. Another 27% improvement.

But wait, there's more. The new inner loop scans thru the array/memory sequentially, doing the same type of operation on each data element. This practically screams out for SIMD optimization.

It's always fun to improve performance by writing assembly language code, but we rarely have a good reason to do it nowadays. Now, if processing 100 million pieces of data, repeatedly, in a non-production environment isn't a good excuse, then I don't know what is.

So I translated the inner loop to inline assembly code using SSE instructions. Even a straightforward translation, without any tuning at all, reduced the running time of the assignment step from 32.2 seconds down to 10.4 seconds! After a few runs to find a new optimal prefetch distance, the running time dropped further to 9.6 seconds. A huge improvement!
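I'm not going to paste the inline assembly here, but the sketch below shows the same idea using SSE2 intrinsics: 16 clusters per iteration, widening the byte differences so they can be squared and summed into 32-bit per-cluster accumulators. It assumes clusterCount is a multiple of 16, and the function and variable names are illustrative, not the actual code.

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// accumulate per-cluster sums of squared differences for one movie's row of 1-byte ratings
void accumulateSquaresSSE2(const unsigned char* clusterRatings, // one row: &ratings[movie * clusterCount]
                           unsigned char viewerRating,          // same fixed-point scale as the row
                           std::uint32_t* clusterSquares,       // running per-cluster sums
                           int clusterCount)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i vr16 = _mm_set1_epi16(viewerRating);           // viewer rating in every 16-bit lane

    for (int c = 0; c < clusterCount; c += 16) {
        // 16 cluster ratings (bytes) for this movie
        __m128i cr8 = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(clusterRatings + c));

        // widen to 16 bits so the difference can go negative without wrapping
        __m128i crLo = _mm_unpacklo_epi8(cr8, zero);
        __m128i crHi = _mm_unpackhi_epi8(cr8, zero);

        // per-cluster difference and its square (max 255^2, fits in 16 bits)
        __m128i dLo  = _mm_sub_epi16(vr16, crLo);
        __m128i dHi  = _mm_sub_epi16(vr16, crHi);
        __m128i sqLo = _mm_mullo_epi16(dLo, dLo);
        __m128i sqHi = _mm_mullo_epi16(dHi, dHi);

        // widen the squares to 32 bits and add them to the running sums
        __m128i* acc = reinterpret_cast<__m128i*>(clusterSquares + c);
        _mm_storeu_si128(acc + 0, _mm_add_epi32(_mm_loadu_si128(acc + 0),
                                                _mm_unpacklo_epi16(sqLo, zero)));
        _mm_storeu_si128(acc + 1, _mm_add_epi32(_mm_loadu_si128(acc + 1),
                                                _mm_unpackhi_epi16(sqLo, zero)));
        _mm_storeu_si128(acc + 2, _mm_add_epi32(_mm_loadu_si128(acc + 2),
                                                _mm_unpacklo_epi16(sqHi, zero)));
        _mm_storeu_si128(acc + 3, _mm_add_epi32(_mm_loadu_si128(acc + 3),
                                                _mm_unpackhi_epi16(sqHi, zero)));
    }
}

With 64 one-byte clusters, each movie's row is only 64 bytes, so a whole row is covered in four 16-byte chunks, and the prefetching from the previous step keeps those rows flowing in.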

So in summary, the per-iteration running times in seconds:
  • Original implementation: total=90
  • Rearrangement/reduction of data: assignment=44.3, update=8.0, total=52.3
  • Prefetch: assignment=32.2, update=8.0, total=40.2
  • SSE inline assembly code: assignment=9.6, update=8.0, total=17.6
An 80% reduction in running time, or a 5x boost in speed.

Next up: The Greater Collaborative Filtering Groupthink (KNN)

7 comments:

Anonymous said...

What do you mean by 'I need to build 8 sets'? Why? What is the difference between one set of clusters and another?

by321 said...

Each set is initialized with different random starting values, for blending later.

Anonymous said...

OK. That makes sense. So blending from random sources seems to have a beneficial effect.

Unknown said...

First, thanks for the interesting posts! I think I'll have to put in a clustering algorithm myself. (Yeah yeah, when I get time...)

I have to admit, I'm amazed pre-fetching made such a difference, and that SSE instructions could be so well used. I wouldn't have thought things were quite parallel enough for that. Very interesting.

Do you think you would have any gains by tracking a current minimum (the distance of the nth cluster, where n is the number you'll use at the end) and bailing out early with that? Or would the complexity negate any efficiency gains?

It also seems an amazingly rapid convergence -- 10 iterations from starting with random values. Wow!

by321 said...

I think your "tracking a current minimum" idea is the early-out approach I described. I modified the post to make it more clear.

Anonymous said...

I am unable to reproduce your results using the approach you described and using C=64, I=10, K=8. It's obviously something I'm doing wrong...

Anonymous said...

Could you clarify what you mean by the assignment and update times... Are the assignment times the time it takes to assign all ~480k users to clusters, or the time for assigning a single user to a cluster?

Are there any other general performance tricks you are using? It sounds like you are avoiding floating point operations based on your description.

Any chance you'll release your source code for this? Or maybe just the juicy bits, like the assignment and update routines?