(Note: this post assumes you're familiar with The Netflix Prize.)
Continuing from the last blog, I plodded along and wrote some boring C# code to generate my binary files. Another task was to extract the probe data from the full data set. In the end I had three sets of data: the full data set, the probe-less data set, and the probe data set. For each set, I created two binary files: one sorted by movie ID first and then by viewer ID, the other sorted by viewer ID first and then by movie ID.
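For the curious, here's a minimal sketch of what that binary-file step might look like. This is not my actual code; the (movieId, viewerId, stars) record layout and all the names are assumptions:

```csharp
// Hypothetical sketch: dump ratings to a flat binary file in a given sort order.
// The record layout (int movieId, int viewerId, byte stars) is an assumption.
using System;
using System.Collections.Generic;
using System.IO;

class Rating
{
    public int MovieId;
    public int ViewerId;
    public byte Stars;
}

static class BinaryFiles
{
    // Writes one file sorted by movie then viewer; swap the two comparisons
    // (and the file name) to get the viewer-then-movie variant.
    public static void WriteSortedByMovie(List<Rating> ratings, string path)
    {
        ratings.Sort((a, b) =>
            a.MovieId != b.MovieId
                ? a.MovieId.CompareTo(b.MovieId)
                : a.ViewerId.CompareTo(b.ViewerId));

        using (var w = new BinaryWriter(File.Create(path)))
        {
            foreach (var r in ratings)
            {
                w.Write(r.MovieId);  // 4 bytes
                w.Write(r.ViewerId); // 4 bytes
                w.Write(r.Stars);    // 1 byte
            }
        }
    }
}
```

Fixed-size records like these make it easy to binary-search or memory-map the file later, which is presumably the point of keeping both sort orders around.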
Next, I generated a prediction file using weighted averages of movie and viewer average ratings. To do that I had to get a least squares fitting/linear regression program going. This program is also used later to blend results from different algorithms. Plus I had to write a program to grade the probe predictions. The RMSE number I got agreed with what other people were reporting, so I was more or less on the right track. Whew, it was a lot of work just to get a basic framework going. But it should be smooth sailing from now on, right?
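Here's a sketch of what that fit-and-grade machinery could look like, assuming predictions of the form w0 + w1*movieAvg + w2*viewerAvg with the weights fit by least squares on the probe set. The 3x3 normal-equation solve and every name here are my illustration, not the actual program:

```csharp
// Hypothetical sketch: fit blend weights by least squares, then grade by RMSE.
using System;

static class BlendAndGrade
{
    // Fit w in [1, movieAvg, viewerAvg] * w ~= rating by solving the
    // 3x3 normal equations (AtA) w = (At y) with Gaussian elimination.
    public static double[] FitWeights(double[][] x, double[] y)
    {
        var m = new double[3, 4]; // augmented [AtA | At*y]
        for (int n = 0; n < y.Length; n++)
        {
            double[] row = { 1.0, x[n][0], x[n][1] };
            for (int i = 0; i < 3; i++)
            {
                for (int j = 0; j < 3; j++) m[i, j] += row[i] * row[j];
                m[i, 3] += row[i] * y[n];
            }
        }
        for (int i = 0; i < 3; i++)             // forward elimination
            for (int k = i + 1; k < 3; k++)
            {
                double f = m[k, i] / m[i, i];
                for (int j = i; j < 4; j++) m[k, j] -= f * m[i, j];
            }
        var w = new double[3];
        for (int i = 2; i >= 0; i--)            // back substitution
        {
            w[i] = m[i, 3];
            for (int j = i + 1; j < 3; j++) w[i] -= m[i, j] * w[j];
            w[i] /= m[i, i];
        }
        return w;
    }

    // Grade probe predictions: root mean squared error vs. the actual ratings.
    public static double Rmse(double[] predicted, double[] actual)
    {
        double sum = 0;
        for (int i = 0; i < predicted.Length; i++)
        {
            double d = predicted[i] - actual[i];
            sum += d * d;
        }
        return Math.Sqrt(sum / predicted.Length);
    }
}
```

The same FitWeights routine generalizes to blending: treat each algorithm's prediction as a column of x and fit the weights against the probe ratings.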
Well, before I go any further, there are a few questions to answer:
- Should I filter the data somehow? As others have reported, there are oddball viewers in the data set, like the guy who rated about 17,000 movies, or the guy who rated thousands of movies in a single day.
- Should I do some sort of global offset or normalization?
- What to do about rating dates? If you plot the global average rating against date, you'll see it steadily climbing, with an abrupt jump of about 0.2 around April 2004.
Hmm, I don't have any good answers, but a few hokey theories and SWAGs (scientific wild-ass guesses) came to the rescue:
- Filtering: not needed; the number of oddball viewers is small, so their effect is probably negligible.
- Global offset/normalization: other people have reported RMSEs in the low 0.98s using only movie/viewer averages. That's within 0.03 of Netflix's Cinematch, a non-trivial algorithm that must do more than just offset/normalization, if it does that at all. So the effect of offset/normalization alone is probably around 0.015. As a data mining newbie just getting started on this, I can ignore the issue for now and worry about it after I exhaust other possibilities. (For reference, there's a sketch of the standard scheme right after this list.)
- Rating dates: this might be nothing. It's possible that as time went on, Netflix collected more and more data and kept updating its software, so people got better and better recommendations, and as a result the ratings crept upward. As for the abrupt jump around April 2004, maybe they did a major software upgrade or bug fix around that time.
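For reference, here's the kind of offset/normalization I'm deferring: the textbook global-mean-plus-per-movie/per-viewer-offsets baseline. This is a sketch of the general idea, not anything from my code, and all the names are made up:

```csharp
// Hypothetical sketch: baseline prediction from a global mean plus
// per-movie and per-viewer offsets.
using System;
using System.Collections.Generic;
using System.Linq;

static class Baseline
{
    public static Func<int, int, double> Fit(
        List<(int Movie, int Viewer, double Stars)> ratings)
    {
        double globalMean = ratings.Average(r => r.Stars);

        // Per-movie offset: how far each movie's mean sits from the global mean.
        var movieOffset = ratings
            .GroupBy(r => r.Movie)
            .ToDictionary(g => g.Key, g => g.Average(r => r.Stars) - globalMean);

        // Per-viewer offset, measured on the movie-adjusted residuals.
        var viewerOffset = ratings
            .GroupBy(r => r.Viewer)
            .ToDictionary(g => g.Key,
                g => g.Average(r => r.Stars - globalMean - movieOffset[r.Movie]));

        // Prediction = global mean + offsets, clamped to the valid 1..5 range.
        return (movie, viewer) =>
            Math.Max(1.0, Math.Min(5.0,
                globalMean
                + movieOffset.GetValueOrDefault(movie)
                + viewerOffset.GetValueOrDefault(viewer)));
    }
}
```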
Actually, by this time I was so tired of reading research papers and so eager to get a nontrivial algorithm going that I'd make any excuse to take the "damn the torpedoes, full speed ahead" approach.
Next up: the great collaborative filtering groupthink, aka clustering.