Monday, September 10, 2007

The Greater Collaborative Filtering Groupthink: KNN

(Note: this post assumes you're familiar with The Netflix Prize.)

In this post I'll describe my implementation of the KNN algorithm, which stands for K Nearest Newbies (with me being the biggest newbie around). A more descriptive name for my implementation is "movie-to-movie Pearson's r-based KNN with slope-one correction and other tweaks" - that's how I see it anyway. I was able to get a 0.9174 qualifying score using this method.

In a nutshell, the KNN algorithm is this: to predict viewer v's rating for movie m, you first get the list of all the movies previously rated by viewer v. Among those movies, you find the K closest "neighbors" to movie m. Then you calculate the weighted-average of viewer v's ratings for those "neighbor" movies, and that’s your final prediction.

Making KNN Predictions

Conceptually, to predict the rating for movie m by viewer v, I do the following:

First get a list of all movies rated by viewer v, let's call this list L0.

Pick a value for parameter MinCV, which is the minimum number of common viewers required. By common viewers I mean viewers who rated both movies. Remove all movies in L0 that have less than MinCV common viewers with movie m. From trial runs, it seems 16 is a good number for MinCV.

For each movie n left in the list L0, calculate the following data structure:

  struct MovieNeighbor { //movie n as neighbor of movie m
      unsigned int commonViewers;
      float mAvg; //average rating of movie m
      float nAvg; //average rating of movie n

      float nRating; //current viewer's rating for movie n

      float rRaw;    //raw Pearson's r value
      float rLower; 
      float weight;

In the structure above, commonViewers is the number of viewers who rated both movie m and n, mAvg and nAvg are the average ratings of movie m and n by common viewers, respectively. And nRating is viewer v's rating for movie n.

The rRaw is Pearson's product moment coefficient, aka Pearson's r, for the movie pair in question. And rLower is the lower bound of the movie pair's Pearson's r at 95% confidence interval. This is calculated using rRaw, commonViewers, and Fisher Transform/Fisher Inverse. If rRaw has a small absolute value, rLower may end up having a different sign than rRaw, in this case, I chose to set rLower to 0.

Once rLower is calculated, I calculate the weight this way:
    weight = rLower*rLower*log(commonViewers);

This is just one of the many ways to calculate weight. For example, you can change the confidence interval, you can use a power function on commonViewers, or a sigmoid function, or just linear with a max limit. I tried many different ways of calculating the weight and found the best ones gave very similar end results.

Put all MovieNeighbor structures in a new list, let's call this list L1. Each item in list L1 represents a neighbor for movie m. And we don't need list L0 any more.

Create a new MovieNeighbor structure, and set the following fields:
    mAvg=average rating of movie m from all viewers

add this new structure to list L1. This creates a dummy neighbor that predicts movie m's average rating. The effect of this dummy neighbor depends on the movie m's real neighbors. If movie m has lots of weak neighbors (low correlation, low number of common viewers), then the dummy neighbor will pull the prediction toward movie m's average. On the other hand, if movie m has many good-quality neighbors (high correlation, high number of common viewers), then the dummy neighbor will have no effect on the final outcome.

Pick a value for parameter MaxN, which is the maximum number of neighbors used for prediction. Sort list L1 by weight and keep the MaxN items that have the greatest weights (implementation hint: don't call sort or qsort, use nth_element).

For each item remaining in L1, calculate its prediction:
    dif = ratingn - nAvg;
    if (rRaw<0 dif="-dif;<br">    prediction = mAvg + dif;

The final prediction is the weighted-average of predictions of all remaining items in list L1. That's it !

Implementation Notes

A straight-forward implementation of the above will work, but it will be slow and make it almost impractical to explore large number of parameter combinations. As you can see there're a number of parameters you can tune:

  • confidence interval
  • power rLower
  • function of common viewers (log, power, sigmoid, etc)
  • minimum number of common viewers required
  • max number of neighbors
  • etc ...

To try large combinations of parameter values efficiently, you want to pre-calculate and share the computational overhead as much as possible.

On the top of the list is the following for each movie pair in the data set:
    struct MovieCorrelation { //for movie pair m-n
    unsigned int commonViewers;
    float mAvg;    //average rating of movie m
    float nAvg;    //average rating of movie n
    float rRaw;    //raw Pearson's r value

There're 17770 movies in the data set, so there're 17770x17770 (of which half are unique) MovieCorrelation structures we can calculate and save to a large file. This sounds like a major task, but it only needs to be done once (once your code is bug-free).

And thanks to the raw speed of modern computers, it does not take as long as it looks. I managed to get it done in less than 20 minutes on my laptop, and my implementation is by no means optimized to the max. If you have a fast PC, it should take less than 15 minutes; and if you have enough memory to hold 17770x17770/2 (packed) records in memory, it will probably take less than 10 minutes.

Once you have all that MovieCorrelation calculated and saved in a file, arrange your prediction code like this:

  for each movie m to predict {
    array0=read all MovieCorrelation's of movie m from file
    for each value of 1st parameter {
      for each value of 2nd parameter {
        for each value of 3rd parameter {
          do predictions using array3;       

The 1st parameter might be confidence interval, the 2nd parameter might be maximum number of neighbors, etc. After I wrote my prediction code like above, it took about one minute to generate one probe prediction, but to generate 32 probe predictions, it only took 3 minutes !