Friday, August 24, 2007

The Great Collaborative Filtering Groupthink

(Note: this post assumes you're familiar with The Netflix Prize.)

Now, some details on algorithms. I'll start with my viewer-clustering approach.

The basic idea is to come up with a number of "representative" viewers, each one representing a group of viewers with similar tastes. In the literature these "representative" viewers are usually called "centroids" or "clusters" - I'll use the term "cluster" from now on.

In our case, each cluster is really a made-up viewer representing a group of viewers. The data for each cluster are just movie ratings, one rating value for each movie:
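/* one byte-sized rating per movie, for this cluster's made-up "representative" viewer */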
unsigned char clusterMovieRatings[MOVIE_CNT];
Note I used a single byte (unsigned char) for each rating. I'll write more on this detail later.

When it's time to make a prediction, each viewer just asks all the clusters he belongs to for a prediction, and then calculates the weighted average of the clusters' predictions. Of course, each cluster's prediction is the average rating of all the viewers belonging to that cluster. Now you see why I call this algorithm the great collaborative filtering groupthink.

OK, how do you come up with the clusters? I used the basic k-means algorithm:
  • Initialization: Set up some clusters and set their movie ratings to random values.
  • Assignment step: For each viewer, determine which cluster the viewer belongs to, i.e. the cluster that is most similar to the viewer. I used the square of the Euclidean distance as the measure of similarity, that is, the sum of the squared differences between the viewer's ratings and the cluster's ratings, over the movies the viewer has rated. Of course, the smaller the number, the greater the similarity.
  • Update step: Go through each cluster. For each movie, calculate its average rating from the viewers who both belong to this cluster and have rated this movie, and replace the cluster's rating for this movie with that average. (A rough C sketch of these two steps follows this list.)
  • Repeat the previous two steps a number of times, or make a prediction for the probe data set after each iteration and decide whether to stop based on its RMSE.
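
In case the assignment and update steps sound abstract, here's a rough C sketch of them. Don't take the details too literally: the data layout (viewerRatingCnt, viewerMovieIds, viewerRatings) is just one arrangement I made up for illustration, and the plain integer averaging glosses over the single-byte rating detail I'll get to later.

#include <limits.h>
#include <string.h>

#define MOVIE_CNT   17770    /* movies in the Netflix Prize data set */
#define VIEWER_CNT  480189   /* viewers in the Netflix Prize data set */
#define CLUSTER_CNT 64       /* C: total number of clusters */

/* Made-up data layout: each viewer's rated movies (0-based ids) and
   byte-sized ratings, kept in parallel arrays. */
extern int            viewerRatingCnt[VIEWER_CNT];
extern int           *viewerMovieIds[VIEWER_CNT];
extern unsigned char *viewerRatings[VIEWER_CNT];

static unsigned char clusterRatings[CLUSTER_CNT][MOVIE_CNT];
static int           assignment[VIEWER_CNT];

/* Similarity: sum of squared differences between the viewer's ratings
   and the cluster's ratings, over the movies the viewer has rated. */
static long squaredDistance(int v, int c)
{
    long d = 0;
    for (int i = 0; i < viewerRatingCnt[v]; i++) {
        int diff = (int)viewerRatings[v][i]
                 - (int)clusterRatings[c][viewerMovieIds[v][i]];
        d += (long)diff * diff;
    }
    return d;
}

/* Assignment step: each viewer joins the most similar cluster. */
static void assignViewers(void)
{
    for (int v = 0; v < VIEWER_CNT; v++) {
        long best = LONG_MAX;
        for (int c = 0; c < CLUSTER_CNT; c++) {
            long d = squaredDistance(v, c);
            if (d < best) { best = d; assignment[v] = c; }
        }
    }
}

/* Update step: a cluster's rating for a movie becomes the average rating
   from its members who have rated that movie. */
static void updateClusters(void)
{
    static long sum[CLUSTER_CNT][MOVIE_CNT];
    static int  cnt[CLUSTER_CNT][MOVIE_CNT];
    memset(sum, 0, sizeof sum);
    memset(cnt, 0, sizeof cnt);

    for (int v = 0; v < VIEWER_CNT; v++) {
        for (int i = 0; i < viewerRatingCnt[v]; i++) {
            int m = viewerMovieIds[v][i];
            sum[assignment[v]][m] += viewerRatings[v][i];
            cnt[assignment[v]][m]++;
        }
    }
    for (int c = 0; c < CLUSTER_CNT; c++)
        for (int m = 0; m < MOVIE_CNT; m++)
            if (cnt[c][m] > 0)
                clusterRatings[c][m] = (unsigned char)(sum[c][m] / cnt[c][m]);
}
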
To make a prediction for viewer v and movie m:

  • Find the K clusters that are most similar to viewer v. Again, the square of the Euclidean distance is used to measure similarity. Calculate the weight for each cluster from the squared distance: weight = 1/(squaredDistance + offset). The offset is a small positive number; it's there so we don't get a divide by zero if a viewer happens to match a cluster perfectly.
  • Calculate the weighted average of the K clusters' ratings for movie m, and that's the predicted rating for viewer v and movie m. (Sketched in code right below.)
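
Here's the prediction step in the same sketchy C, reusing squaredDistance() and clusterRatings from the sketch above; the 0.125 offset is just a placeholder, any small positive number will do.

#include <limits.h>

#define K_NEAREST 5        /* K: clusters used per prediction */
#define OFFSET    0.125    /* small positive number; placeholder value */

/* Predict a rating for viewer v and movie m from the K most similar
   clusters, weighting each by 1/(squaredDistance + offset). */
static double predict(int v, int m)
{
    int  bestIdx[K_NEAREST];
    long bestDist[K_NEAREST];

    for (int i = 0; i < K_NEAREST; i++) bestDist[i] = LONG_MAX;

    /* Keep a small sorted list of the K most similar clusters. */
    for (int c = 0; c < CLUSTER_CNT; c++) {
        long d = squaredDistance(v, c);
        for (int i = 0; i < K_NEAREST; i++) {
            if (d < bestDist[i]) {
                for (int j = K_NEAREST - 1; j > i; j--) {
                    bestDist[j] = bestDist[j - 1];
                    bestIdx[j]  = bestIdx[j - 1];
                }
                bestDist[i] = d;
                bestIdx[i]  = c;
                break;
            }
        }
    }

    /* Weighted average of the K clusters' ratings for movie m. */
    double wsum = 0.0, rsum = 0.0;
    for (int i = 0; i < K_NEAREST; i++) {
        double w = 1.0 / ((double)bestDist[i] + OFFSET);
        wsum += w;
        rsum += w * clusterRatings[bestIdx[i]][m];
    }
    return rsum / wsum;
}
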
That's it. Pretty simple, isn't it? Of course there are enhancements to the basic k-means algorithm, such as soft k-means and k-means++ (and I don't see any reason why they can't be combined to produce a soft k-means++ algorithm), but I'm trying to get the simple and easy stuff working first. And given what I tried next, they probably don't matter in the end.

With the algorithms above, I tried different combinations of the following parameters:

  • C: the total number of clusters.
  • I: the number of iterations used to build the clusters.
  • K: the number of most similar clusters used for prediction.
In the end, C=64, I=10, and K=5 seemed the best. Furthermore, since we set the clusters’ ratings to random values at the start of the algorithm, we can run the algorithm multiple times, using a different set of random initial values each time, and then average the predictions to get a better RMSE. Basically we create more groups among the viewers, and averaging nudges the predictions toward a more reliable, common value.
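
The averaging itself is as simple as it sounds. A sketch, with runKMeans() standing in for the initialize-then-iterate loop above and the probe arrays made up for illustration:

#include <stdlib.h>

void runKMeans(void);   /* stand-in: random initial clusters + I iterations */

/* Average the predictions of several independent runs, each starting from
   a different random initialization. */
static void averageRuns(int runs, const int *probeViewer,
                        const int *probeMovie, int probeCnt, double *avg)
{
    for (int p = 0; p < probeCnt; p++) avg[p] = 0.0;

    for (int r = 0; r < runs; r++) {
        srand(r + 1);    /* a different random seed for each run */
        runKMeans();
        for (int p = 0; p < probeCnt; p++)
            avg[p] += predict(probeViewer[p], probeMovie[p]);
    }
    for (int p = 0; p < probeCnt; p++)
        avg[p] /= runs;
}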

Here a mystery surfaced: K=6 or 7 consistently showed the best RMSE for raw predictions, but K=3 or 4 gave the best result when averaging multiple sets of predictions.

At this stage I started tweaking the algorithm. What I found was that all tweaks improved the raw RMSE, but the averaged RMSEs stayed pretty much the same - another mystery. Here are some probe RMSEs from raw predictions with C=64 and I=10:

K=2, RMSE=0.9551
K=3, RMSE=0.9480
K=4, RMSE=0.9450
K=5, RMSE=0.9437
K=6, RMSE=0.9433
K=7, RMSE=0.9433
K=8, RMSE=0.9437
K=10, RMSE=0.9448
K=15, RMSE=0.9489
K=20, RMSE=0.9536
K=25, RMSE=0.9585
K=30, RMSE=0.9635

and here are some probe RMSEs from averaging multiple sets with C=64, I=10:

Average of 2, K=2, RMSE=0.9440
Average of 2, K=3, RMSE=0.9408
Average of 2, K=4, RMSE=0.9398
Average of 2, K=5, RMSE=0.9397
Average of 2, K=6, RMSE=0.9400
Average of 2, K=7, RMSE=0.9407

Average of 4, K=2, RMSE=0.9386
Average of 4, K=3, RMSE=0.9374
Average of 4, K=4, RMSE=0.9374
Average of 4, K=5, RMSE=0.9379
Average of 4, K=6, RMSE=0.9387
Average of 4, K=7, RMSE=0.9396

Average of 8, K=2, RMSE=0.9358
Average of 8, K=3, RMSE=0.9356
Average of 8, K=4, RMSE=0.9360
Average of 8, K=5, RMSE=0.9368
Average of 8, K=6, RMSE=0.9378
Average of 8, K=7, RMSE=0.9388
So in summary, groups are good, individuals are bad. We’re the Borg, you’ll be assimilated.

Next up: The Great Collaborative Filtering Groupthink Thinks Faster

2 comments:

Anonymous said...

I pretty much follow everything you say except for the beginning: what exactly is a cluster? How do you 'generate some clusters'?

Anonymous said...

Slept on it.
Now I understand the concept of 'cluster'.
Like it very much.