Now for some details on the algorithms, starting with my viewer-clustering approach.

The basic idea is to come up with a number of "representative" viewers, each one representing a group of viewers with similar tastes. In the literature these "representative" viewers are usually called "centroids" or "clusters" - I'll use the term "cluster" from now on.

In our case, each cluster is really a made-up viewer representing a group of viewers. The data for each cluster are just movie ratings, one rating value for each movie:

```c
unsigned char clusterMovieRatings[MOVIE_CNT];
```

Note I used a single byte (unsigned char) for each rating. I'll write more on this detail later.

When it’s time to make a prediction, each viewer simply asks all the clusters he belongs to for a prediction, then calculates the weighted average of the clusters' predictions. Of course, the prediction of each cluster is just the average prediction of all the viewers belonging to that cluster. Now you see why I call this algorithm the great collaborative filtering groupthink.

OK, how do you come up with the clusters? I used the basic k-means algorithm:

- Initialization: set up some clusters and set their movie rating values to random.
- Assignment step: for each viewer, determine which cluster the viewer belongs to, i.e. the cluster most similar to the viewer. I used the square of the Euclidean distance as the measure of similarity: the sum of the squared differences between the viewer’s ratings and the cluster’s ratings, over the movies the viewer has rated. The smaller the number, the greater the similarity.
- Update step: go through each cluster. For each movie, calculate its average rating from the viewers who both belong to this cluster and have rated this movie. Replace the cluster’s rating for this movie with that average.
- Repeat the previous two steps a number of times, or make a prediction for the probe data set after each iteration and decide whether to stop based on its RMSE.
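The assignment and update steps above can be sketched in C on a toy data set. This is a hedged sketch, not the contest code: the array names, the tiny `MOVIE_CNT`/`VIEWER_CNT`/`CLUSTER_CNT` values, and the convention that a rating of 0 means "not rated" are my own assumptions.

```c
#include <assert.h>

/* Toy sizes for illustration only. */
#define MOVIE_CNT   4
#define VIEWER_CNT  3
#define CLUSTER_CNT 2

/* Ratings are 1-5; 0 marks "movie not rated by this viewer". */
static unsigned char viewerRatings[VIEWER_CNT][MOVIE_CNT] = {
    {5, 4, 0, 1},
    {5, 5, 0, 2},
    {1, 0, 5, 5},
};
static double clusterRatings[CLUSTER_CNT][MOVIE_CNT];
static int assignment[VIEWER_CNT];

/* Squared Euclidean distance, over the movies the viewer has rated. */
static double sqDistance(int v, int c) {
    double d = 0.0;
    for (int m = 0; m < MOVIE_CNT; m++) {
        if (viewerRatings[v][m] == 0) continue;   /* skip unrated movies */
        double diff = (double)viewerRatings[v][m] - clusterRatings[c][m];
        d += diff * diff;
    }
    return d;
}

/* Assignment step: each viewer joins the most similar (nearest) cluster. */
static void assignViewers(void) {
    for (int v = 0; v < VIEWER_CNT; v++) {
        int best = 0;
        for (int c = 1; c < CLUSTER_CNT; c++)
            if (sqDistance(v, c) < sqDistance(v, best)) best = c;
        assignment[v] = best;
    }
}

/* Update step: each cluster's rating for a movie becomes the average rating
   from members of the cluster who actually rated that movie. */
static void updateClusters(void) {
    for (int c = 0; c < CLUSTER_CNT; c++) {
        for (int m = 0; m < MOVIE_CNT; m++) {
            double sum = 0.0;
            int cnt = 0;
            for (int v = 0; v < VIEWER_CNT; v++) {
                if (assignment[v] != c || viewerRatings[v][m] == 0) continue;
                sum += viewerRatings[v][m];
                cnt++;
            }
            if (cnt > 0) clusterRatings[c][m] = sum / cnt;
        }
    }
}
```

On this toy data, the first two viewers (who like the same movies) land in one cluster and the third lands in the other, and the update step then pulls each cluster's ratings toward its members' averages.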

Once the clusters are built, here is how to predict the rating of viewer v for movie m:

- Find the K clusters most similar to viewer v. Again, the square of the Euclidean distance is used to measure similarity. Calculate a weight for each cluster from the squared distance: weight = 1/(squaredDistance + offset). The offset is a small positive number; it’s there so we don’t get a divide by zero if a viewer happens to match a cluster perfectly.
- Calculate the weighted average of the K clusters' ratings for movie m, and that’s the predicted rating for viewer v and movie m.
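The two steps above boil down to a small weighted-average function. In this sketch the function name, the offset value, and the calling convention (the caller has already found the K nearest clusters and supplies their squared distances and their ratings for the target movie) are illustrative assumptions, not the original code.

```c
#include <assert.h>
#include <math.h>

/* A small positive offset so a perfect match (squaredDistance == 0)
   doesn't cause a divide by zero. The value is an assumption. */
#define OFFSET 0.125

/* Weighted-average prediction over the K nearest clusters:
   weight = 1/(squaredDistance + offset). */
static double predict(const double *sqDist, const double *clusterRating, int k) {
    double wsum = 0.0, rsum = 0.0;
    for (int i = 0; i < k; i++) {
        double w = 1.0 / (sqDist[i] + OFFSET);
        wsum += w;
        rsum += w * clusterRating[i];
    }
    return rsum / wsum;
}
```

A cluster the viewer matches closely dominates the average, while distant clusters contribute only a little.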

With the algorithms above, I tried different combinations of the following parameters:

- C: the total number of clusters.
- I: the number of iterations used to build the clusters.
- K: the number of most similar clusters used for prediction.

**Here a mystery surfaced**: K=6 or 7 consistently showed the best RMSE for raw predictions, but K=3 or 4 gave the best results when averaging multiple sets of predictions.

At this stage I started tweaking the algorithm. What I found was that all tweaks improved the raw RMSE, but the averaged RMSEs stayed pretty much the same - **another mystery**. Here are some probe RMSEs from raw predictions with C=64 and I=10:

| K | RMSE |
|----|--------|
| 2 | 0.9551 |
| 3 | 0.9480 |
| 4 | 0.9450 |
| 5 | 0.9437 |
| 6 | 0.9433 |
| 7 | 0.9433 |
| 8 | 0.9437 |
| 10 | 0.9448 |
| 15 | 0.9489 |
| 20 | 0.9536 |
| 25 | 0.9585 |
| 30 | 0.9635 |

and here are some probe RMSEs from averaging multiple sets with C=64, I=10:

| Sets averaged | K | RMSE |
|---|---|--------|
| 2 | 2 | 0.9440 |
| 2 | 3 | 0.9408 |
| 2 | 4 | 0.9398 |
| 2 | 5 | 0.9397 |
| 2 | 6 | 0.9400 |
| 2 | 7 | 0.9407 |
| 4 | 2 | 0.9386 |
| 4 | 3 | 0.9374 |
| 4 | 4 | 0.9374 |
| 4 | 5 | 0.9379 |
| 4 | 6 | 0.9387 |
| 4 | 7 | 0.9396 |
| 8 | 2 | 0.9358 |
| 8 | 3 | 0.9356 |
| 8 | 4 | 0.9360 |
| 8 | 5 | 0.9368 |
| 8 | 6 | 0.9378 |
| 8 | 7 | 0.9388 |

So in summary, groups are good, individuals are bad. We’re the Borg, you’ll be assimilated.

Next up: The Great Collaborative Filtering Groupthink Thinks Faster

## 2 comments:

I pretty much follow everything you say except for the beginning: what exactly is a cluster? How do you 'generate some clusters'?

Slept on it.

Now I understand the concept of 'cluster'.

Like it very much.
