Thursday, August 16, 2007

Great Expectations and Database

(Note: this post assumes you're familiar with The Netflix Prize.)

When I first came across the Netflix prize, I just couldn't resist urge to participate. It looked like a well-planned contest with an interesting problem. It would test and improve your coding skills and your understanding of interesting algorithms. Plus there's a $1,000,000 prize. So I eagerly signed up, downloaded the (large) data set, read the documentations, and pondered, how to start ?

At first, I thought this was a good project with which to improve my relatively limited knowledge of C#. I also had the idea of putting all the data into a database, so I could easily run queries like "gimme all viewers who gave this movie a score of 3", and I could learn the C#/.NET interface to database at the same time. So the version 0 of my grand plan was:

  • Improve my C# skillz.
  • Learn C# and .NET database interface.
  • Read about and implement interesting algorithms.
  • Win the one million dollar prize.

Hehehe... good thing I didn't get XML and Haskell into the mess.

So I setup a database, and started writing a C# program to reformat the raw data files from Netflix. That worked out pretty well and soon I had some data in the database. But as soon as I tried accessing the database I noticed it is unbearably slow for this task. It was probably OK even if I wanted to run thousands of queries at a time, but a prediction algorithm needs to go thru all 100 million pieces of data, and more. With everything in the database, even the simplest algorithm would take a few hours to run. This seemed to mirror the experience of other people. Maybe it's workable if I have a cluster of 1000 fast PCs at my disposal, but with at most 2 computers at my disposal, the database approach would never work.

So I abandoned the database-based approach, and switched to sorting and saving data in binary files for fast access. This time I put a little more thought into it and decided to use memory-mapped file as my primary means of getting data off the disk. This simplified my data access and has worked out very well so far.

No comments: