Wednesday, August 22, 2007

The Great Collaborative Filtering Framework

(Note: this post assumes you're familiar with The Netflix Prize.)

Continuing from the last post, I plodded along and wrote some boring C# code to generate my binary files. Another task was to extract the probe data from the full data set. In the end I had three sets of data: the full data set, the probe-less data set, and the probe data set itself. For each set I created two binary files: one sorted by movie ID and then by viewer ID, the other sorted by viewer ID and then by movie ID.
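In case it helps, here's a minimal sketch of what that conversion step might look like. The record layout (4-byte movie ID, 4-byte viewer ID, 1-byte rating) and all the names are my own illustration; the post doesn't describe the actual format:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical record layout; the post doesn't specify the real one.
struct Rating
{
    public int MovieId;   // 1..17770 in the Netflix data
    public int ViewerId;
    public byte Stars;    // 1..5
}

static class BinaryConverter
{
    // Write the same ratings twice: once sorted by (movie, viewer) and
    // once by (viewer, movie), so either access pattern later becomes a
    // sequential scan of one file.
    public static void WriteBothOrders(List<Rating> ratings,
                                       string byMoviePath,
                                       string byViewerPath)
    {
        ratings.Sort((a, b) => a.MovieId != b.MovieId
            ? a.MovieId.CompareTo(b.MovieId)
            : a.ViewerId.CompareTo(b.ViewerId));
        Write(ratings, byMoviePath);

        ratings.Sort((a, b) => a.ViewerId != b.ViewerId
            ? a.ViewerId.CompareTo(b.ViewerId)
            : a.MovieId.CompareTo(b.MovieId));
        Write(ratings, byViewerPath);
    }

    static void Write(List<Rating> ratings, string path)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            foreach (Rating r in ratings)
            {
                writer.Write(r.MovieId);   // 4 bytes
                writer.Write(r.ViewerId);  // 4 bytes
                writer.Write(r.Stars);     // 1 byte
            }
        }
    }
}
```

At nine bytes per rating, the roughly 100 million training ratings fit in under a gigabyte per file, which matters on 2007 hardware.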

Next, I generated a prediction file using weighted averages of the movie and viewer average ratings. To do that I had to get a least-squares fitting/linear regression program going; the same program is also used later to blend results from different algorithms. Plus I had to write a program to grade the probe predictions. The RMSE number I got agreed with what other people were reporting, so I was more or less on the right track. Whew, it was a lot of work just to get a basic framework going. But it should be smooth sailing from now on, right?
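To make that concrete, here's a hedged sketch of the two pieces: a least-squares fit of blending weights via the normal equations, and an RMSE grader. The three-parameter model (an offset plus one weight each for the movie and viewer averages) is my guess at what "weighted averages" means here, and all names are hypothetical:

```csharp
using System;

static class AverageBlend
{
    // Fit weights (w0, w1, w2) minimizing the squared error of
    //   prediction = w0 + w1 * movieAvg + w2 * viewerAvg
    // over the training ratings, via the normal equations (X^T X) w = X^T y.
    // mAvg[i], vAvg[i], r[i] are the movie average, viewer average, and
    // actual rating of the i-th training example.
    public static double[] FitWeights(double[] mAvg, double[] vAvg, double[] r)
    {
        var A = new double[3, 3];  // accumulates X^T X
        var b = new double[3];     // accumulates X^T y
        for (int i = 0; i < r.Length; i++)
        {
            double[] x = { 1.0, mAvg[i], vAvg[i] };
            for (int p = 0; p < 3; p++)
            {
                b[p] += x[p] * r[i];
                for (int q = 0; q < 3; q++)
                    A[p, q] += x[p] * x[q];
            }
        }
        return Solve(A, b);
    }

    // Gaussian elimination with partial pivoting; plenty for a 3x3 system.
    static double[] Solve(double[,] A, double[] b)
    {
        int n = b.Length;
        for (int col = 0; col < n; col++)
        {
            int pivot = col;
            for (int row = col + 1; row < n; row++)
                if (Math.Abs(A[row, col]) > Math.Abs(A[pivot, col]))
                    pivot = row;
            for (int j = 0; j < n; j++)
            {
                double t = A[col, j]; A[col, j] = A[pivot, j]; A[pivot, j] = t;
            }
            double tb = b[col]; b[col] = b[pivot]; b[pivot] = tb;

            for (int row = col + 1; row < n; row++)
            {
                double f = A[row, col] / A[col, col];
                for (int j = col; j < n; j++)
                    A[row, j] -= f * A[col, j];
                b[row] -= f * b[col];
            }
        }
        var w = new double[n];  // back-substitution
        for (int i = n - 1; i >= 0; i--)
        {
            double s = b[i];
            for (int j = i + 1; j < n; j++)
                s -= A[i, j] * w[j];
            w[i] = s / A[i, i];
        }
        return w;
    }

    // Grade probe predictions: root mean squared error, the contest metric.
    public static double Rmse(double[] predicted, double[] actual)
    {
        double sse = 0.0;
        for (int i = 0; i < predicted.Length; i++)
        {
            double d = predicted[i] - actual[i];
            sse += d * d;
        }
        return Math.Sqrt(sse / predicted.Length);
    }
}
```

The Solve routine handles any number of columns unchanged, which is what makes the same program reusable later for blending the outputs of several algorithms.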

Well, before I go any further, there are questions to be answered:

  • Should I filter the data somehow? As others have reported, there are oddball viewers in the data set, like the guy who rated about 17000 movies, or the guy who rated thousands of movies in one day.
  • Should I do some sort of global offset or normalization?
  • What should I do about rating dates? If you plot the global average rating against date, you'll see the average steadily going up, with an abrupt jump of about 0.2 around April 2004 (a quick way to compute this is sketched after this list).
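For reference, here's one way that averages-vs-dates check could be computed. This is a minimal sketch under assumptions: the input is a flattened comma-separated file (`movieId,viewerId,rating,yyyy-MM-dd`), which is not how the real Netflix training files are laid out, and the class and method names are mine:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class AverageByMonth
{
    // Prints the global average rating per calendar month, which makes
    // the gradual rise and the April 2004 jump easy to see.
    public static void Print(string path)
    {
        var sum = new SortedDictionary<string, double>();
        var count = new SortedDictionary<string, int>();
        foreach (string line in File.ReadLines(path))
        {
            string[] f = line.Split(',');
            string month = f[3].Substring(0, 7);   // "yyyy-MM"
            double rating = double.Parse(f[2]);
            if (!count.ContainsKey(month)) { sum[month] = 0; count[month] = 0; }
            sum[month] += rating;
            count[month]++;
        }
        foreach (var kv in count)   // keys sort chronologically as strings
            Console.WriteLine("{0}: {1:F4}", kv.Key, sum[kv.Key] / kv.Value);
    }
}
```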

Hmm, I don't have any good answers, but a few hokey theories and SWAGs (scientific wild-ass guesses) came to the rescue:

  • Filtering: not needed. The number of oddball viewers is small, so their effect is probably negligible.
  • Global offset/normalization: other people have reported getting RMSE in the low 0.98s using only movie/viewer averages. That's within 0.03 of Netflix's Cinematch, a non-trivial algorithm that must be doing much more than offset/normalization, if it does that at all. Since offset/normalization can account for only a fraction of that 0.03 gap, its effect is probably around 0.015. As a data-mining newbie just getting started on this, I can ignore the issue for now; I'll worry about it after I exhaust other possibilities.
  • Rating dates: this might be nothing. It's possible that as time progressed, Netflix collected more and more data and continuously updated its software, so people got better and better recommendations, and as a result the ratings crept upward. As for the abrupt jump around April 2004, maybe they did a major software upgrade or bug fix around that time.

Actually, by this time I was so tired of reading research papers and so eager to get a nontrivial algorithm going that I was ready to make any excuse to take the "damn the torpedoes, full speed ahead" approach.

Next up: the great collaborative filtering groupthink, aka clustering.
