<h3>Looking at Reddit Comment Data, Part 2 (Trump, Bernie, Hillary, &amp; Harry Potter)</h3>
(July 21, 2017)<br />
<style type="text/css">
table.sortable { border: 1px solid; border-collapse: collapse; }
.sortable td { border: 1px solid; text-align: right; padding: 3px; }
.sortable th { text-align: right; background-color: #104E8B;
color: #FFF; font-weight: bold; padding: 3px; }
</style>
<script src="https://www.kryogenix.org/code/browser/sorttable/sorttable.js"></script>
(Note: This post was written using <a href="https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.all_starting_201501">reddit comment data</a> from Jan 2015 up to June 2017.)
<br />
<br />
To see the most popular subreddits in terms of comments, I calculated the following values for all subreddits:<br />
<ul>
<li>comments: number of comments in the subreddit</li>
<li>c_rank: subreddit rank by number of comments</li>
<li>users: number of commenters, or users who posted at least 1 comment</li>
<li>u_rank: subreddit rank by number of commenters</li>
<li>avg cmts: average number of comments per commenter</li>
</ul>
Again, AutoModerator and '[deleted]' users were not counted.<br /><br />
The following table shows the subreddits that are in the top 20 by either the number of comments
or the number of users (who posted at least 1 comment).
You can click on column headers to sort the table by column.<br />
<br />
<table class="sortable"><thead>
<tr><th title="subreddit">subreddit</th>
<th title="number of comments">comments</th>
<th title="subreddit rank by number of comments">c_rank</th>
<th title="number of commenters, or users who posted at least 1 comment">users</th>
<th title="subreddit rank by number of commenters">u_rank</th>
<th title="average number of comments per commenter">avg cmts</th>
</tr>
</thead><tbody>
<tr><td><a href="https://www.reddit.com/r/AskReddit/" target="_blank">AskReddit</a></td><td>129526018</td><td>1</td><td>4054395</td><td>1</td><td>31.95</td></tr>
<tr><td><a href="https://www.reddit.com/r/politics/" target="_blank">politics</a></td><td>33459498</td><td>2</td><td>722696</td><td>17</td><td>46.3</td></tr>
<tr><td><a href="https://www.reddit.com/r/leagueoflegends/" target="_blank">leagueoflegends</a></td><td>24259812</td><td>3</td><td>606736</td><td>21</td><td>39.98</td></tr>
<tr><td><a href="https://www.reddit.com/r/nba/" target="_blank">nba</a></td><td>20042862</td><td>4</td><td>280039</td><td>54</td><td>71.57</td></tr>
<tr><td><a href="https://www.reddit.com/r/nfl/" target="_blank">nfl</a></td><td>19536466</td><td>5</td><td>271890</td><td>56</td><td>71.85</td></tr>
<tr><td><a href="https://www.reddit.com/r/worldnews/" target="_blank">worldnews</a></td><td>18893913</td><td>6</td><td>1176059</td><td>8</td><td>16.07</td></tr>
<tr><td><a href="https://www.reddit.com/r/The_Donald/" target="_blank">The_Donald</a></td><td>18835574</td><td>7</td><td>393769</td><td>38</td><td>47.83</td></tr>
<tr><td><a href="https://www.reddit.com/r/funny/" target="_blank">funny</a></td><td>18417913</td><td>8</td><td>1844344</td><td>3</td><td>9.99</td></tr>
<tr><td><a href="https://www.reddit.com/r/pics/" target="_blank">pics</a></td><td>17125307</td><td>9</td><td>1851402</td><td>2</td><td>9.25</td></tr>
<tr><td><a href="https://www.reddit.com/r/news/" target="_blank">news</a></td><td>16750343</td><td>10</td><td>1019662</td><td>10</td><td>16.43</td></tr>
<tr><td><a href="https://www.reddit.com/r/videos/" target="_blank">videos</a></td><td>15047577</td><td>11</td><td>1439067</td><td>4</td><td>10.46</td></tr>
<tr><td><a href="https://www.reddit.com/r/pcmasterrace/" target="_blank">pcmasterrace</a></td><td>14464493</td><td>12</td><td>690385</td><td>19</td><td>20.95</td></tr>
<tr><td><a href="https://www.reddit.com/r/SquaredCircle/" target="_blank">SquaredCircle</a></td><td>14216591</td><td>13</td><td>126425</td><td>133</td><td>112.45</td></tr>
<tr><td><a href="https://www.reddit.com/r/todayilearned/" target="_blank">todayilearned</a></td><td>14207850</td><td>14</td><td>1359651</td><td>5</td><td>10.45</td></tr>
<tr><td><a href="https://www.reddit.com/r/soccer/" target="_blank">soccer</a></td><td>13818732</td><td>15</td><td>250960</td><td>66</td><td>55.06</td></tr>
<tr><td><a href="https://www.reddit.com/r/hockey/" target="_blank">hockey</a></td><td>12714481</td><td>16</td><td>153287</td><td>114</td><td>82.95</td></tr>
<tr><td><a href="https://www.reddit.com/r/gaming/" target="_blank">gaming</a></td><td>12157336</td><td>17</td><td>1225064</td><td>7</td><td>9.92</td></tr>
<tr><td><a href="https://www.reddit.com/r/movies/" target="_blank">movies</a></td><td>11684827</td><td>18</td><td>1007113</td><td>12</td><td>11.6</td></tr>
<tr><td><a href="https://www.reddit.com/r/DotA2/" target="_blank">DotA2</a></td><td>11243969</td><td>19</td><td>239691</td><td>69</td><td>46.91</td></tr>
<tr><td><a href="https://www.reddit.com/r/GlobalOffensive/" target="_blank">GlobalOffensive</a></td><td>11106482</td><td>20</td><td>360058</td><td>42</td><td>30.85</td></tr>
<tr><td><a href="https://www.reddit.com/r/gifs/" target="_blank">gifs</a></td><td>8700658</td><td>25</td><td>1247750</td><td>6</td><td>6.97</td></tr>
<tr><td><a href="https://www.reddit.com/r/Showerthoughts/" target="_blank">Showerthoughts</a></td><td>7162287</td><td>30</td><td>1061022</td><td>9</td><td>6.75</td></tr>
<tr><td><a href="https://www.reddit.com/r/aww/" target="_blank">aww</a></td><td>4890833</td><td>35</td><td>1015063</td><td>11</td><td>4.82</td></tr>
<tr><td><a href="https://www.reddit.com/r/IAmA/" target="_blank">IAmA</a></td><td>4005195</td><td>55</td><td>989388</td><td>13</td><td>4.05</td></tr>
<tr><td><a href="https://www.reddit.com/r/mildlyinteresting/" target="_blank">mildlyinteresting</a></td><td>4496801</td><td>44</td><td>950380</td><td>14</td><td>4.73</td></tr>
<tr><td><a href="https://www.reddit.com/r/WTF/" target="_blank">WTF</a></td><td>8188117</td><td>26</td><td>827009</td><td>15</td><td>9.9</td></tr>
<tr><td><a href="https://www.reddit.com/r/Music/" target="_blank">Music</a></td><td>3157128</td><td>78</td><td>736537</td><td>16</td><td>4.29</td></tr>
<tr><td><a href="https://www.reddit.com/r/AdviceAnimals/" target="_blank">AdviceAnimals</a></td><td>9202993</td><td>23</td><td>710905</td><td>18</td><td>12.95</td></tr>
<tr><td><a href="https://www.reddit.com/r/explainlikeimfive/" target="_blank">explainlikeimfive</a></td><td>4745297</td><td>40</td><td>683992</td><td>20</td><td>6.94</td></tr>
</tbody></table>
<br />
No real surprises here: AskReddit, politics, gaming, sports, news, and Donald Trump.
<br/><br/>
<h3>The Donald, Bernie Sanders, Hillary Clinton, and Harry Potter</h3><br />
From the data, we can see the popularity, or at least the celebrity, of Donald Trump. /r/The_Donald is the number one subreddit focused on an individual, by either the number of comments or the number of users.
The next individual-focused subreddit is for Bernie Sanders, and Hillary Clinton is way behind. I also noticed the Harry Potter subreddit while looking at the data.<br />
<br />
<table class="sortable"><thead>
<tr><th title="subreddit">subreddit</th>
<th title="number of comments">comments</th>
<th title="subreddit rank by number of comments">c_rank</th>
<th title="number of commenters, or users who posted at least 1 comment">users</th>
<th title="subreddit rank by number of commenters">u_rank</th>
<th title="average number of comments per commenter">avg cmts</th>
</tr>
</thead><tbody>
<tr><td><a href="https://www.reddit.com/r/The_Donald/" target="_blank">The_Donald</a></td><td>18835574</td><td>7</td><td>393769</td><td>38</td><td>47.83</td></tr>
<tr><td><a href="https://www.reddit.com/r/SandersForPresident/" target="_blank">SandersForPresident</a></td><td>4176126</td><td>51</td><td>208781</td><td>78</td><td>20.00</td></tr>
<tr><td><a href="https://www.reddit.com/r/harrypotter/" target="_blank">harrypotter</a></td><td>832561</td><td>363</td><td>82167</td><td>221</td><td>10.13</td></tr>
<tr><td><a href="https://www.reddit.com/r/hillaryclinton/" target="_blank">hillaryclinton</a></td><td>1193252</td><td>243</td><td>41418</td><td>493</td><td>28.81</td></tr>
</tbody></table>
<br />
<br />
The fact that /r/The_Donald is the top subreddit focused on an individual is also confirmed by <a href="http://redditlist.com/all">redditlist</a>, which ranks subreddits by recent activity and subscribers.
<br />
<br />
Now, would Harry Potter win a presidential race against Donald Trump, Bernie Sanders, or Hillary Clinton?
It's tough to predict. Potter is a great wizard, but by all indications, Trump is also a great wizard in his own right.<br /><br />
We know Potter has great name recognition and is popular with young voters, but he has yet to be attacked politically. All we have are some pro-Potter propaganda books by J.K. Rowling.
<h3>Looking At Reddit Comment Data, part 1</h3>
(May 10, 2017)<br />
<br />
Last month I came across this article, <a href="https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/" target="_blank">Dissecting Trump’s Most Rabid Online Following</a>. Despite the obvious political leaning of the author, it's still an interesting article. The discussion of calculating similarity between subreddits sounded vaguely familiar, reminding me of calculating movie and user similarities for the Netflix Prize. So I decided to take a look at the article's raw data (and try out Google BigQuery at the same time). Previously I didn't know <a href="https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments" target="_blank">there's a large, publicly accessible dataset of reddit comments</a>.<br />
<br />
The original article used data from Jan 2015 to Dec 2016. At the time I looked at the dataset, it had data up to Feb 2017, so I used data from Jan 2015 to Feb 2017, which has almost 1.7 billion comments. After a quick look, I decided to ignore the comments from two special "user" accounts:<br />
<ul>
<li>deleted users, with 7% of all comments</li>
<li>AutoModerator, a "moderation bot", with 0.7% of all comments</li>
</ul>
<div>
There are many other bots, but fortunately they seem minor relative to AutoModerator: I tallied all users whose names start with "auto" or end with "bot" and who have at least 10000 comments, and together they account for only 0.34% of all comments. So due to time constraints, I didn't go on a bot-hunting spree. After removing the two special "users" from the analysis, the dataset had:</div>
<ul>
<li>313,080 subreddits</li>
<li>14,718,265 users (who posted at least 1 comment)</li>
<li>1,566,721,864 comments</li>
</ul>
Notice the "users" here are those who posted at least 1 comment from Jan 2015 to Feb 2017, and users who never posted a comment in that time frame are not in the dataset.<br />
<br />
Like most social web sites, a small number of users are responsible for most content:<br />
<ul>
<li>top 0.01% of users posted 3% of all comments</li>
<li>top 0.1% of users posted 12% of all comments</li>
<li>top 1% of users posted 41% of all comments</li>
<li>top 10% of users posted 85% of all comments</li>
</ul>
Since the users who never posted a comment are not in the dataset, the percentages in terms of all reddit users are even more skewed.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-jrRHEFveLhc/WRpUaRQqswI/AAAAAAAAAG8/mZNsHWELMu8GwrKAWjD01CABZrZQGLMcACLcB/s1600/chart1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-jrRHEFveLhc/WRpUaRQqswI/AAAAAAAAAG8/mZNsHWELMu8GwrKAWjD01CABZrZQGLMcACLcB/s1600/chart1.png" /></a></div>
<br />
<br />
Similarly, the number of users/commenters across subreddits is also very skewed:<br />
<ul>
<li>8 subreddits have 1000000 or more users</li>
<li>149 subreddits have 100000 or more users</li>
<li>1723 subreddits have 10000 or more users</li>
<li>8588 subreddits have 1000 or more users</li>
<li>304492 subreddits have less than 1000 users</li>
</ul>
<div>
Again, remember this is a user comment dataset, and "users" includes only those who posted at least 1 comment from Jan 2015 to Feb 2017.</div>
<h3>Heritage Health Prize Round 1 Results</h3>
(September 28, 2011)<br />
<br />
The Heritage Health Prize <a href="http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1">round 1 milestone results were released</a>. I came in 3rd in the final private rankings, even though I was only 5th on the public leaderboard. So I guess I didn't overfit too badly.<br />
<br />
I wonder by how much I missed winning the lottery this time.
<br />
<br />
<h3>The Netflix Prize Announcement</h3>
(September 24, 2009)<br />
<br />
I went to the Netflix Prize announcement in New York on Sept 21. It was great to finally meet some of my team members, as well as our arch-nemesis <a href="http://www.research.att.com/~volinsky/netflix/bpc.html">BellKor's Pragmatic Chaos</a>. I also did some sightseeing in NYC, so it was a pretty good trip.<br /><br />The prize announcement was well covered by the media. Reporters from the New York Times, Forbes, AP, Business Week, the Wall Street Journal, etc. were there. Just do a search and you'll find plenty of press coverage of the event.<br /><br />From talking to my team members, it's not surprising that we all work in IT-related fields, but I'm still impressed by the breadth of experience on The Ensemble team, from health care to the entertainment industry to unmanned weapon systems. That's right, from saving your life to killing you and making a movie about it, we've got you covered.<br /><br />BellKor's Pragmatic Chaos revealed one surprising stroke of luck that explained some of the behind-the-scenes drama we saw in the last 48 hours of the contest.<br /><br />PS: I accidentally got quoted in <a href="http://www.businessweek.com/technology/content/sep2009/tc20090921_645345.htm">this Business Week article</a>.
<br />
<br />
<h3>We Are The Borg</h3>
(July 4, 2009)<br />
<br />
We are the borg. Your algorithm will be assimilated.
Resistance is irrelevant.<br /><br />We're also known as: <a href="http://sites.google.com/site/xlvector/">xlvector</a>, OfADifferentKind, &amp; Newman (me).<br /><br />Update: to clean up the leaderboard, we have voluntarily removed "We are the Borg" and other sub-teams of "Vandelay Industries !". The pre-Vandelay Industries team "Newman, George, and Peterman !" is gone too.
<br />
<br />
<h3>The End Is Near</h3>
(July 1, 2009)<br />
<br />
The end of the Netflix Prize is near: <a href="http://www.netflixprize.com//leaderboard">BPC has reached 10%</a>.<br /><br />Now, how do I apply what I've learned from participating in this contest? Maybe I should try to predict the stock market or commodity prices, or something. People are already doing that on the stock market; it's called <a href="http://en.wikipedia.org/wiki/Algorithmic_trading">algorithmic trading</a>.<br /><br />Of course, recommender systems and data mining can be applied to all kinds of products and services, not just movies. For example, truck dealers can try to find the people who are most likely to buy trucks and send them truck advertisements; online matchmaking services can match people based on user data and the history of past matches.<br /><br />Of course, for most products and services there are obvious "indicators" that we all know. For example, males in the construction industry are probably far more likely to buy trucks than people in general. But still, there might be subtle indicators and trends that can only be discovered by a computer algorithm, which might improve the accuracy of recommendations by 10%, as measured by RMSE. And as BellKor's research shows, even a few percent of RMSE improvement can translate into a huge increase in the quality of the recommendations you get.
You'll actually like the products recommended to you, or the people.<br /><br /><strong>How Can It Work For Matchmaking?</strong><br /><br />Well, now that I think about it, it probably won't work for matchmaking services. Why? Because for matchmaking services, it takes a long time to truly know the quality of the recommendations.<br /><br />Sure, you can find out about the bad ones pretty soon: people go on one date and can't stand each other. But how do you know which recommendations are really good, and which ones only look good now but will end up in a messy divorce 10 years later? I think for matchmaking services, you'd have to wait a generation to tell which recommendations are truly good. But by that time, society and people in general would have changed a lot, so whatever worked 20 years ago probably won't work so well today.<br /><br />So for matchmaking services, the value of recommender systems is probably in filtering out potentially incompatible dates.
<br />
<br />
<h3>Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours)</h3>
(June 11, 2009)<br />
<br />
<a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">KNN</a> is one of the most popular CF algorithms. Central to KNN is how you define the neighborhood relationship, or the distance between two objects. For this purpose, <a href="http://en.wikipedia.org/wiki/Pearson_r">Pearson correlation</a> is a pretty good measure and is used frequently (<a href="http://dmnewbie.blogspot.com/2007/09/greater-collaborative-filtering.html">see my previous post for my basic KNN approach</a>). In the Netflix dataset, there are 17770 movies, so we need to calculate 17770*17770, or about 316 million, Pearson correlations. Not a trivial task.<br />
<br />
In this post, I'll describe the optimizations I used for the calculation of Pearson Correlation, which cut my running time from 2.5 hours to less than 2 minutes. It won't help your RMSE directly, but it may help indirectly by allowing you to explore the KNN parameter space faster. And although I used Pearson Correlation, the methods described in this post can be applied to many other neighborhood measures too.<br />
<br />
Since this is about optimization, system spec is important. The numbers reported here were generated on a PC with an Intel Core2 Quad CPU (4 cores) @ 2.4 GHz, 2 GB of RAM, running Windows XP. Most code was written in C++, built with Visual Studio 2008.<br />
<br />
<hr />
<br />
<a href="http://1.bp.blogspot.com/_NEh4kUkAi48/Sb6kc-7tTnI/AAAAAAAAAA4/Edu5W_6iaNA/s1600-h/Untitled.png"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5313865428180487794" src="https://1.bp.blogspot.com/_NEh4kUkAi48/Sb6kc-7tTnI/AAAAAAAAAA4/Edu5W_6iaNA/s320/Untitled.png" style="cursor: hand; float: right; height: 320px; margin: 0px 0px 10px 10px; width: 300px;" /></a><br />
<strong>The Matrix: Halved</strong><br />
<br />
Before discussing any optimizations, I'll discuss one that I did <strong>NOT</strong> take. Look at Figure 1. It shows the correlation matrix of 5 movies. AB is the correlation of movies A and B, AC is the correlation of movies A and C, and so on.<br />
<br />
Now, if you assume movie correlations are symmetrical, then AB=BA, AC=CA, BC=CB, and so on, so only less than half of the matrix needs to be calculated. Now let’s say I calculated the blue-shaded part of the matrix, and saved the data to a big file. To make predictions, I need to read the data back:<br />
<br />
For movie A, I need to read AB to AE, and that’s 1 file-seek and 1 file-read operation.<br />
<br />
For movie B, I need to read AB, and BC to BE, that’s 2 file-seek and 2 file-read operations.<br />
<br />
For movie C, I need to read AC, BC, and CD and CE, that’s 3 file-seek and 3 file-read operations.<br />
<br />
And so on…<br />
<br />
In the Netflix data set, there are M=17770 movies, so by the time I get to the last movie, I’d need 17769 file seek and read operations per movie. All that file seeking will kill my performance and my hard disk.<br />
<br />
The simple solution is to have enough memory to hold half the matrix, so I can load all the data from disk in one shot. But for each matrix entry, I need not only the correlation value but also two average values and a common viewer count, making it 16 bytes per entry. So for M=17770 movies, I need about 2.35 GB, but I needed this thing to run on my laptop, which had only 1.25 GB of RAM when I started coding this. Even after I compacted it down to 5 bytes per entry, the memory situation was still tight.<br />
<br />
(Actually, I think you only need enough memory to hold one quarter of the matrix to eliminate the huge number of file seeks, but the in-memory indexing scheme gets pretty complicated. There might be clever ways to do this, but I'm not sure yet.)<br />
<br />
And all that applies only if you assume movie correlations are symmetrical, and there're ways to calculate correlations where they're not symmetrical. So out of concerns for memory usage and simplicity, I decided to ignore this half-matrix optimization.<br />
<br />
Now let's get back to the calculation of Pearson's r.<br />
<br />
<hr />
<br />
<strong>The Brute-Force Approach</strong><br />
<br />
To calculate Pearson's r of two movies X and Y, we need to find viewers who rated both X and Y, and calculate some intermediate values:<br />
<br />
<pre> struct PearsonIntermediate {
//from all viewers who rated both movie X and Y.
float x; //sum of ratings for movie X
float y; //sum of ratings for movie Y
float xy; //sum of product of ratings for movies X and Y
float xx; //sum of square of ratings for movie X
float yy; //sum of square of ratings for movie Y
unsigned int cnt; //number of viewers who rated both movies
};</pre>
and you can easily calculate <a href="http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient">Pearson's r</a> from these intermediate values.<br />
<br />
For Netflix data, float can be replaced by "unsigned int" when processing raw, integer ratings, which can only be 1, 2, 3, 4, or 5.<br />
<br />
As mentioned in my previous blog post, I have the rating data sorted by movie index and then viewer index, so finding the common viewers of two movies is simply finding the common values in two sorted arrays. Very easy, and very slow. The inner loop of this approach looks like this:<br />
<br />
<br />
<pre> while (not at the end of pViewerIndex0 and pViewerIndex1) {
     if ( *pViewerIndex0 < *pViewerIndex1) {
         pViewerIndex0++;
     } else if ( *pViewerIndex0 > *pViewerIndex1) {
         pViewerIndex1++;
     } else {
         //a viewer who rated both movies:
         //update PearsonIntermediate, then advance both pointers
         pViewerIndex0++;
         pViewerIndex1++;
     }
 }</pre>
The inner loop is conditional branch-heavy, and even worse, given that the viewer indices are randomly spaced, they are very unpredictable branches, probably completely defeating the <a href="http://en.wikipedia.org/wiki/Branch_prediction">branch prediction logic of the CPU</a>.<br />
<br />
I think the Big O complexity of this approach is roughly O(M*M*T/M) = O(M*T), where M is the total number of movies and T is the total number of ratings. Basically, each movie has to be compared against all other movies (M*M), and the average length of the comparisons is the average number of ratings per movie (T/M).<br />
<br />
My implementation of this approach took 4.5 hours to run on my little laptop, and 2.5 hours on the faster desktop. And I suffered thru this 4.5 hours a few times, due to the inevitable bugs in the initial implementation.<br />
<br />
You can definitely optimize the code somewhat, but once I got KNN working correctly, I moved onto other algorithms.<br />
<br />
<hr />
<br />
<strong>A Break-thru</strong><br />
<br />
So my KNN algorithm just sat there, for months. During this time, I worked on other CF algorithms, which is probably why one day, while working on something totally unrelated to KNN, I suddenly had one of those "Oh my God I'm an idiot !" moments.<br />
<br />
I realized there's a much better way of calculating correlations. You see, for each movie X, instead of comparing it to every other movie directly, I can go thru all the viewers V who rated movie X; for each viewer V, go thru all the movies Y rated by that viewer; and for each movie Y, update the PearsonIntermediate of the movie pair X and Y. At the end, I'll have the PearsonIntermediate of movie X paired with every movie in the dataset (including X itself).<br />
<br />
In pseudo-code, to calculate Pearson correlations of movie X and all movies in the dataset, you declare:<br />
<br />
<br />
<pre> PearsonIntermediate array[MOVIE_CNT]; //MOVIE_CNT is 17770
initialize array[] to zeros
for each viewer v who rated movie X {
for each movie Y rated by viewer v {
update values of array[Y]
}
}
go thru array[] to calculate all correlations for movie X</pre>
As a sanity check, you can verify the correlation from array[X] is 1 (or very close to 1, due to floating point inaccuracy), which is the correlation of movie X and itself.<br />
<br />
The Big O complexity of this approach is O(M*(T/M)*(T/V)) = O(T*T/V), where V is the total number of viewers. The brute-force approach is O(M*T). Since T/V &lt;&lt; M, I should get a significant speed-up. Furthermore, the inner loop does not have the unpredictable branches of the brute-force approach.<br />
<br />
This is just the motivation I needed to implement KNN on the residues of other algorithms. If the original algorithm is slow with raw ratings (1 to 5 integers), it will be even slower on residues (floating point). After I was done with residues, I retrofitted this approach to my raw-rating KNN, and the improvement was huge: calculation time dropped to about 15 minutes, a 10x speed-up !<br />
<br />
That's pretty good, but wait, there is more ...<br />
<br />
<hr />
<br />
<strong>Parallelization</strong><br />
<br />
At this point I implemented multi-threading to take advantage of multiple CPU cores. Movie correlation calculation is trivially parallelizable. The result is about 3.4x speed up on 4 cores, bringing the running time down to 265 seconds.<br />
<br />
But wait, there is more ...<br />
<br />
<hr />
<br />
<strong>Simplifying the Inner Loop</strong><br />
<br />
I thought since raw ratings can be only 1, 2, 3, 4, or 5, I should be able to speed things up significantly by replacing the PearsonIntermediate structure with:<br />
<pre> unsigned int32 vCount[5][5];</pre>
where vCount[m][n] is the number of viewers who gave movie X a score of m+1 and movie Y a score of n+1. This simplifies the inner-loop to just incrementing an integer !<br />
<br />
I quickly implemented this "clever" idea; however, the result was disappointing: the speed-up was only a few percent in some quick tests. Hmm, what could be wrong ?<br />
<br />
After some head-scratching, I guessed the problem was the amount of memory used. Each PearsonIntermediate structure takes only 24 bytes, but "unsigned int32 vCount[5][5]" takes 100 bytes. So as the code did the calculations, more memory had to be moved in and out of cache, and this seriously hurt performance.<br />
<br />
Fortunately we can use the nature of the data itself to work around this problem. Of the 17770 movies in the data set, 33% have fewer than 256 ratings, 65% have between 256 and 65535 ratings, and only 2% have 65536 or more ratings. Of course, if a movie has fewer than 256 ratings, there is no way it has 256 or more common viewers with another movie, so the common viewer counts of this movie can be stored using "unsigned char" instead of "unsigned int", or 1 byte instead of 4 bytes. For movies that have between 256 and 65535 ratings, we can use "unsigned short" (2 bytes).<br />
<br />
This means I need 3 versions of the correlation calculation function, one each for unsigned int/short/char (or int32/int16/int8). Here C++'s function template feature comes to the rescue (in spite of the awkward syntax). I wrote one function that uses a type T to store the common viewer counts:<br />
<br />
<pre> T vCount[5][MOVIE_CNT][5]; //initialize to zero
for each viewer V who rated movie X {
m=the rating viewer V gave to movie X
for each movie Y rated by viewer V {
n=the rating viewer V gave to movie Y
vCount[m][Y][n]++;
}
}</pre>
and I had 3 "instantiations" (the jargon is as bad as the syntax) of this function, with T being unsigned int/short/char. Notice the vCount array is arranged as [5][MOVIE_CNT][5] instead of the more "straight-forward" [MOVIE_CNT][5][5]. My tests showed [5][MOVIE_CNT][5] gave the best performance, no doubt because it minimized cache misses.<br />
<br />
With this change, the running time improved to 156 seconds.<br />
<br />
But wait, there is more ...<br />
<br />
<hr />
<br />
<strong>Complicating the Inner Loop</strong><br />
<br />
The previous change simplified the inner loop, but most of the gains seem to come from reducing the demand on the memory subsystem and cache misses. Can we do better ?<br />
<br />
Consider a movie with 3000 ratings: the type of its vCount[][][] array is "unsigned short", or 2 bytes per element. But it's unlikely that the final value of any array element would be 256 or bigger, given 3000/25=120. Even for movies with 6400 or more ratings (6400/25=256), because ratings are not evenly distributed, most vCount[][][] elements still end up below 256.<br />
<br />
So for most movies, one byte is enough for most vCount elements. But how do we use the 1-byte (unsigned char) version of vCount, and still be able to handle the occasional elements that go above 255 ? Well, use another array to count the overflows:<br />
<br />
<pre> unsigned char vCount[5][MOVIE_CNT][5]; //initialize to zero
unsigned short overflows[5][MOVIE_CNT][5]; //initialize to zero
for each viewer V who rated movie X {
m=the rating viewer V gave to movie X
for each movie Y rated by viewer V {
n=the rating viewer V gave to movie Y
vCount[m][Y][n]++;
if (0==vCount[m][Y][n]) overflows[m][Y][n]++;
}
}</pre>
The code above takes advantage of the fact that unsigned integer arithmetic wraps around: when an "unsigned char" holding 255 is incremented, it wraps to zero. The true count of a cell is then overflows*256 + vCount.<br />
<br />
At first I was very skeptical of this change, because it adds a branch to the inner-most loop. How could it ever improve performance with that ? However, that branch is highly predictable, and everything I read says that with modern CPUs' branch prediction logic, the cost of predictable branches is almost nothing.<br />
<br />
So with great skepticism, I tried the code above. To my relief it did run faster, bringing the running time down to 143 seconds.<br />
<br />
But wait, there is more ....<br />
<br />
<hr />
<br />
<strong>Reordering the Movies</strong><br />
<br />
For any viewer V, the IDs of the movies V has rated are pretty much randomly distributed. Imagine the array of movies, with red dots indicating the movies rated by viewer V:<br />
<br />
<a href="http://4.bp.blogspot.com/_NEh4kUkAi48/SihBDbkAhrI/AAAAAAAAABg/474-ci0Qw-8/s1600-h/1.PNG"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5343592485069293234" src="https://4.bp.blogspot.com/_NEh4kUkAi48/SihBDbkAhrI/AAAAAAAAABg/474-ci0Qw-8/s400/1.PNG" style="display: block; margin: 0px auto 10px; text-align: center;" /></a>and as we go thru the movies rated by viewer V, most red dots incur cache misses.<br />
<br />
However, if we sort the movie IDs in increasing order of rating count of movies, then the distribution of red dots changes:<br />
<br />
<a href="http://4.bp.blogspot.com/_NEh4kUkAi48/SihBnqj7KiI/AAAAAAAAABo/T_bg2Q-bzxA/s1600-h/2.PNG"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5343593107570764322" src="https://4.bp.blogspot.com/_NEh4kUkAi48/SihBnqj7KiI/AAAAAAAAABo/T_bg2Q-bzxA/s400/2.PNG" style="display: block; margin: 0px auto 10px; text-align: center;" /></a>Statistically, we have the same or even more cache misses at the beginning, but as we move into higher movie indices, the chance that the viewer rated closely spaced movies increases, reducing the cache misses as vCount[][][] is updated.<br />
<br />
Notice the "sorting" of the movie IDs by movie rating count is not an O(N*log(N)) sort of all 100 million ratings: only the 17770 movie IDs get sorted, and the rating records themselves are relabeled in a single O(N) pass.<br />
<br />
This change reduced the running time to 116 seconds, less than 2 minutes.<br />
<br />
<hr />
<br />
<strong>Summary</strong><br />
<br />
Well, that's it. I've picked all the low-hanging fruit, at least all of it that I could see.<br />
<br />
Here's the list of performance gains from my initial implementation:<br />
<br />
*Algorithm change: 10x<br />
*Parallelization: 3.4x (4 cores vs 1 core)<br />
*Low-level tweaks: 2.3x<br />
<br />
In total about 78x performance gain, without writing a single line of assembly language.<br />
<br />
Of the 3 items above, the best ratio of performance gain to implementation effort comes from parallelization, because the KNN calculation can be parallelized very easily. I believe more cores would help the performance further, but with diminishing returns.<br />
<br />
<hr />
<br />
<strong>The Memory Bandwidth Bottleneck</strong><br />
<br />
Memory bandwidth seems to be the performance bottleneck of KNN algorithms.<br />
<br />
When I was coding the parallelization step, I installed some software to monitor MSRs, or <a href="http://en.wikipedia.org/wiki/Model-specific_register">model-specific registers</a>. MSRs are very useful for performance monitoring, but unfortunately they're very sparsely documented. The "<em>Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide</em>" has some details, but nowhere near as detailed as the documentation of individual CPU instructions.<br />
<br />
One particular MSR I've looked at is DCU_MISS_OUTSTANDING. <a href="http://www.technovelty.org/code/arch/micro-analysis.html">This webpage</a> has the following to say about it:<br />
<br />
<blockquote>
DCU_MISS_OUTSTANDING gives a weighted measure of the cycles spent waiting for a cache line to be brought in. Each cycle spent waiting is weighted by the number of outstanding cache misses ... and has some caveats, but can be considered a rough estimate of the time spent waiting for a cache line to be brought in.</blockquote>
<br />
Anyway, when I first implemented multithreading, the code took 265 seconds to run, and I recorded about 400 billion CPU cycles spent on DCU_MISS_OUTSTANDING events. The final version took 116 seconds to run and I recorded about 100 billion CPU cycles. The difference is 149 seconds and 300 billion cycles. 300 billion cycles divided by 2.4G Hz (CPU clock speed) gives 125 seconds. So it looks like a lot of gains came from reducing the time spent waiting for data to be loaded into the cache.<br />
<br />
I won't pretend I understand DCU_MISS_OUTSTANDING or many other MSRs completely; I just hope Intel provides better documentation.by321http://www.blogger.com/profile/14537616692777428127noreply@blogger.com9tag:blogger.com,1999:blog-9207672005321238583.post-41397015914204659092009-02-23T13:26:00.000-08:002009-02-23T13:28:04.316-08:00Almost A Year Since Last UpdateWow, it's almost a year since the last update. What have I done in the last year ?<br /><br />*SVD++. This is similar to the SVD++ described by BellKor, or the transductive MF described by Gravity.<br /><br />*Expanded KNN, combinations of movie/viewer and raw/residue/binary, and much faster too (raw movie correlation calculation takes less than 2 minutes now). For all my KNNs, I no longer save huge correlation files; instead, the program just calculates correlations and predictions in one pass and saves predictions only.<br /><br />*Recently got basic RBM working. This is the most confusing model for me. Basically I got it working thru trial and error. I still don't really understand it, and I have yet to make it work on residues of other predictors.<br /><br />*Read a lot of research papers. I actually spent more time reading papers than writing code.by321http://www.blogger.com/profile/14537616692777428127noreply@blogger.com6tag:blogger.com,1999:blog-9207672005321238583.post-2342340708084515442008-02-25T17:11:00.001-08:002008-02-25T20:58:44.144-08:00Residues and ThreadsI'm slowly changing all my algorithms to work on either raw data or residue data. Because raw data are integer values in the range of 1 to 5, they enable some performance optimizations and shortcuts; residue data are essentially floats, so all my algorithms took a performance hit.<br /><br />At the same time, I'm adding multi-threading support to speed things up. 
The improvement is significant, but I'm already seeing diminishing returns: on a quad-core systems, matrix factorization using 4 cores runs at only 3.27x the speed of using one core.by321http://www.blogger.com/profile/14537616692777428127noreply@blogger.com2tag:blogger.com,1999:blog-9207672005321238583.post-80111024410485306242008-01-28T22:53:00.001-08:002008-01-28T23:17:39.500-08:00Back to BasicsNew year, now I'm playing with the most basic stuff: global effects. This time I'll try to be more patient and take a more systematic approach, instead of the "hack anything that looks promising" approach I've been taking.<br /><br />Right now I've a nice number for my qualifying RMSE: 0.8888. Hey, it takes real skill to hit such a nice number by chance ! I'll try to resist the urge to submit better results until I'm sure I've made a big improvement.by321http://www.blogger.com/profile/14537616692777428127noreply@blogger.com2tag:blogger.com,1999:blog-9207672005321238583.post-65195449790747688512007-12-14T11:11:00.001-08:002007-12-14T11:22:20.969-08:00Error distribution of different algorithms<a href="http://3.bp.blogspot.com/_NEh4kUkAi48/R2LXnuNoLDI/AAAAAAAAAAU/y207AiD5ylI/s1600-h/abc.PNG"><img id="BLOGGER_PHOTO_ID_5143910801830587442" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_NEh4kUkAi48/R2LXnuNoLDI/AAAAAAAAAAU/y207AiD5ylI/s400/abc.PNG" border="0" /></a>by321http://www.blogger.com/profile/14537616692777428127noreply@blogger.com1tag:blogger.com,1999:blog-9207672005321238583.post-59309117819554755162007-09-10T17:28:00.000-07:002017-04-07T16:39:13.612-07:00The Greater Collaborative Filtering Groupthink: KNN(Note: this post assumes you're familiar with <a href="http://www.netflixprize.com//rules">The Netflix Prize</a>.)<br />
<br />
In this post I'll describe my implementation of the KNN algorithm, which stands for K Nearest Newbies (with me being the biggest newbie around). A more descriptive name for my implementation is "movie-to-movie Pearson's r-based KNN with slope-one correction and other tweaks" - that's how I see it anyway. I was able to get a 0.9174 qualifying score using this method.<br />
<br />
In a nutshell, the KNN algorithm is this: to predict viewer v's rating for movie m, you first get the list of all the movies previously rated by viewer v. Among those movies, you find the K closest "neighbors" to movie m. Then you calculate the weighted-average of viewer v's ratings for those "neighbor" movies, and that’s your final prediction.<br />
<br />
<br />
<hr />
<br />
<br />
<b>Making KNN Predictions</b><br />
<br />
Conceptually, to predict the rating for movie m by viewer v, I do the following:<br />
<br />
First get a list of all movies rated by viewer v, let's call this list L0.<br />
<br />
Pick a value for parameter MinCV, which is the minimum number of common viewers required. By common viewers I mean viewers who rated both movies. Remove all movies in L0 that have less than MinCV common viewers with movie m. From trial runs, it seems 16 is a good number for MinCV.<br />
<br />
For each movie n left in the list L0, calculate the following data structure:<br />
<br />
<pre> struct MovieNeighbor { //movie n as neighbor of movie m
unsigned int commonViewers;
float mAvg; //average rating of movie m
float nAvg; //average rating of movie n
float nRating; //current viewer's rating for movie n
float rRaw; //raw Pearson's r value
float rLower;
float weight;
};</pre>
<br />
In the structure above, commonViewers is the number of viewers who rated both movie m and n, mAvg and nAvg are the average ratings of movie m and n by common viewers, respectively. And nRating is viewer v's rating for movie n.<br />
<br />
The rRaw is <a href="http://en.wikipedia.org/wiki/Pearson%27s_r">Pearson's product moment coefficient</a>, aka Pearson's r, for the movie pair in question. And rLower is the lower bound of the movie pair's Pearson's r at 95% <a href="http://en.wikipedia.org/wiki/Confidence_interval">confidence interval</a>. This is calculated using rRaw, commonViewers, and <a href="http://en.wikipedia.org/wiki/Fisher_transformation">Fisher Transform/Fisher Inverse</a>. If rRaw has a small absolute value, rLower may end up having a different sign than rRaw, in this case, I chose to set rLower to 0.<br />
<br />
Once rLower is calculated, I calculate the weight this way:<br />
<pre> weight = rLower*rLower*log(commonViewers);</pre>
<br />
This is just one of the many ways to calculate weight. For example, you can change the confidence interval, you can use a power function on commonViewers, or a sigmoid function, or just linear with a max limit. I tried many different ways of calculating the weight and found the best ones gave very similar end results.<br />
<br />
Put all MovieNeighbor structures in a new list, let's call this list L1. Each item in list L1 represents a neighbor for movie m. And we don't need list L0 any more.<br />
<br />
Create a new MovieNeighbor structure, and set the following fields:<br />
<pre> mAvg=average rating of movie m from all viewers
 nAvg=nRating=0;
 weight=log(MinCV); </pre>
<br />
add this new structure to list L1. This creates a dummy neighbor that predicts movie m's average rating. The effect of this dummy neighbor depends on the movie m's real neighbors. If movie m has lots of weak neighbors (low correlation, low number of common viewers), then the dummy neighbor will pull the prediction toward movie m's average. On the other hand, if movie m has many good-quality neighbors (high correlation, high number of common viewers), then the dummy neighbor will have no effect on the final outcome.<br />
<br />
Pick a value for parameter MaxN, which is the maximum number of neighbors used for prediction. Sort list L1 by weight and keep the MaxN items that have the greatest weights (implementation hint: don't call sort or qsort, use nth_element).<br />
<br />
For each item remaining in L1, calculate its prediction:<br />
<pre> dif = nRating - nAvg;
 if (rRaw &lt; 0) dif = -dif;
 prediction = mAvg + dif;</pre>
<br />
The final prediction is the weighted-average of predictions of all remaining items in list L1. That's it !<br />
<br />
<hr />
<br />
<br />
<b>Implementation Notes</b><br />
<br />
A straight-forward implementation of the above will work, but it will be slow, making it almost impractical to explore a large number of parameter combinations. As you can see, there're a number of parameters you can tune:<br />
<ul>
<li>confidence interval</li>
<li>the power applied to rLower</li>
<li>function of common viewers (log, power, sigmoid, etc)</li>
<li>minimum number of common viewers required</li>
<li>max number of neighbors</li>
<li>etc ...</li>
</ul>
<br />
To try large combinations of parameter values efficiently, you want to pre-calculate and share the computational overhead as much as possible.<br />
<br />
At the top of the list is precomputing the following structure for each movie pair in the data set:<br />
<pre> struct MovieCorrelation { //for movie pair m-n
unsigned int commonViewers;
float mAvg; //average rating of movie m
float nAvg; //average rating of movie n
float rRaw; //raw Pearson's r value
}; </pre>
<br />
There're 17770 movies in the data set, so there're 17770x17770 (of which half are unique) MovieCorrelation structures we can calculate and save to a large file. This sounds like a major task, but it only needs to be done once (once your code is bug-free).<br />
<br />
And thanks to the raw speed of modern computers, it does not take as long as it looks. I managed to get it done in less than 20 minutes on my laptop, and my implementation is by no means optimized to the max. If you have a fast PC, it should take less than 15 minutes; and if you have enough memory to hold 17770x17770/2 (packed) records in memory, it will probably take less than 10 minutes.<br />
<br />
Once you have all that MovieCorrelation calculated and saved in a file, arrange your prediction code like this:<br />
<br />
<pre> for each movie m to predict {
array0=read all MovieCorrelation's of movie m from file
for each value of 1st parameter {
array1=some_operation(array0);
for each value of 2nd parameter {
array2=some_operation(array1);
for each value of 3rd parameter {
array3=some_operation(array2);
do predictions using array3;
}
}
}
}</pre>
<br />
The 1st parameter might be confidence interval, the 2nd parameter might be maximum number of neighbors, etc. After I wrote my prediction code like above, it took about one minute to generate one probe prediction, but to generate 32 probe predictions, it only took 3 minutes !by321http://www.blogger.com/profile/14537616692777428127noreply@blogger.com9tag:blogger.com,1999:blog-9207672005321238583.post-37631650692777132642007-08-29T15:20:00.000-07:002017-04-07T16:34:07.621-07:00The Great Collaborative Filtering Groupthink, Thinks Faster(Note: this post assumes you're familiar with <a href="http://www.netflixprize.com//rules">The Netflix Prize</a>.)<br />
(Warning: long post)<br />
<br />
As described in the <a href="http://dmnewbie.blogspot.com/2007/08/great-collaborative-filtering.html">“The Great Collaborative Filtering Groupthink”</a> post, my viewer-clustering algorithm has an assignment step and an update step, repeated for many iterations.<br />
<br />
In my initial implementation, each iteration took about 90 seconds, for a set of 64 clusters (<a href="http://dmnewbie.blogspot.com/2007/08/hardware-specification.html">see previous post for my hardware spec</a>). I didn’t write down the initial performance numbers, but if I recall correctly, the assignment step took anywhere from 70 to 80 seconds, with the rest going to the update step.<br />
<br />
Now if a set of clusters takes 10 iterations to build, and I need to build 8 sets (with different random starting values), that’s 2 hours of running time already. I’d rather not wait for 2 hours every time I tweak the algorithm or fix a bug, plus I need to experiment with different numbers of clusters, so I looked for ways to make it run faster, eventually reducing the per-iteration running time to 17.6 seconds.<br />
<br />
Obviously the assignment step is the place to look first. The following is its pseudo-code:<br />
<br />
<pre>//loop thru data in viewer-cluster-movie order
for each viewer {
minSquare=FLT_MAX; //maximum float value
for each cluster {
clusterSquare=0;
for each movie rated by the current viewer {
clusterSquare+=(ratingOfViewer - ratingOfCluster)^2;
} // for each movie rated by the current viewer
if (clusterSquare &lt; minSquare) {
Assign the current viewer to the current cluster.
minSquare=clusterSquare;
}
} //for each cluster
} //for each viewer</pre>
<br />
First, I tried an early-out approach for the inner loop:<br />
<br />
<pre>//loop thru data in viewer-cluster-movie order
for each viewer {
minSquare=FLT_MAX; //maximum float value
for each cluster {
clusterSquare=0;
for each 5% of the movies rated by the current viewer {
for each movie in the current 5% {
clusterSquare+=(ratingOfViewer - ratingOfCluster)^2;
} // for each movie in the current 5%
if (clusterSquare >= minSquare) skip current cluster.
} //for each 5%, up to 20 times
Assign the current viewer to the current cluster.
minSquare=clusterSquare;
} //for each cluster
} //for each viewer</pre>
<br />
Basically, I'm tracking the current minimum squared distance in minSquare, and as soon as the current cluster's clusterSquare value exceeds minSquare, I move on to the next cluster. However, this improved performance by less than 10%. I think the reason is that for most clusters, you need to go thru most of the data before clusterSquare exceeds minSquare. It’s time to look deeper.<br />
<br />
For each set of clusters, I stored all the clusters’ rating data in a 2-D array:<br />
<br />
<pre>unsigned short ratings[clusterCount][movieCount];</pre>
<br />
That’s right, I used a 16-bit unsigned integer for each rating. The value in ratings[][] array is the real rating value times 1024. The data is stored in the natural cluster-by-movie order: the first index of the array is cluster index, and the second index of the array is movie index. Here’s the memory layout of the data (from low memory address to high memory address):<br />
<br />
<pre>c0m0, c0m1, c0m2, …… c0m[movieCount-2], c0m[movieCount-1],
c1m0, c1m1, c1m2, …… c1m[movieCount-2], c1m[movieCount-1],
c2m0, c2m1, c2m2, …… c2m[movieCount-2], c2m[movieCount-1],
...</pre>
<br />
where c0m0 is the cluster 0’s rating for movie 0, and c0m1 is the cluster 0’s rating for movie 1, and so on.<br />
<br />
In the Netflix data set, there’re 17770 movies and about half a million viewers, but only about 100 million ratings, so on average each viewer rated about one in 89 movies.<br />
<br />
Therefore in my inner loops above (for each movie rated by the viewer), the code is skipping thru the rows of the array from left to right. On average, the memory address of each rating value accessed is about 89 ratings (178 bytes) away from the previous address. So it’s a guaranteed cache miss in most cases, with horrible performance penalties. Most of the time, the CPU is just sitting there waiting for the data to arrive from memory.<br />
<br />
OK, given this scenario, the obvious solution is to turn the ratings[][] array sideways, swapping the first index and the second index:<br />
<br />
<pre>unsigned short ratings[movieCount][clusterCount];</pre>
<br />
and the new memory layout:<br />
<br />
<pre>m0c0, m0c1, m0c2, …… m0c[clusterCount-2], m0c[clusterCount-1],
m1c0, m1c1, m1c2, …… m1c[clusterCount-2], m1c[clusterCount-1],
m2c0, m2c1, m2c2, …… m2c[clusterCount-2], m2c[clusterCount-1],
...</pre>
<br />
where m0c0 is the cluster 0’s rating for movie 0, and m0c1 is the cluster 1’s rating for movie 0, and so on. Now we go thru the data in a new order:<br />
<br />
<pre>//loop thru data in viewer-movie-cluster order
float clusterSquares[clusterCount];
for each viewer {
set all elements of clusterSquares[] to zero.
for each movie rated by the current viewer {
for each cluster {
clusterSquares[clusterIndex]+=(ratingOfViewer – ratingOfCluster)^2;
}
} // for each movie rated by the current viewer
Assign the viewer to the cluster that has the smallest value in clusterSquares[].
} // for each viewer {</pre>
<br />
Now in the inner loop, the code is scanning thru the array/memory sequentially with no skips, greatly reducing cache miss.<br />
<br />
At this time I also changed the data type of ratings[][] array from unsigned short (2 bytes) to unsigned char (1 byte), to reduce the amount of data that need to be accessed. Yes I lose some accuracy this way, but the probe RMSE using 1-byte rating value is only a few 0.0001’s worse than before, so the effect is negligible for the result we’re trying to achieve here.<br />
<br />
With both changes above, the assignment step now took about 44.3 seconds, and the update step took about 8.0 seconds (yes I wrote these numbers down), for a total of 52.3 seconds, a good improvement from the previous 90 seconds.<br />
<br />
But wait, there’s more. The new inner-loop is scanning thru the array/memory sequentially, so we can easily take advantage of <a href="http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/83886.htm">data prefetching</a>.<br />
<br />
However, given that cluster count is relatively small, prefetching within the cluster rating data of each movie (each row in the array/memory layout) is probably not useful. So I decided to prefetch by movie, at the second-most inner-loop level:<br />
<br />
<pre>for each movie rated by the current viewer {
movieToPrefetch=moviesRatedByViewer[currentIndex +prefetchOffset];
prefetch memory address &ratings[movieToPrefetch][0]
...
} // for each movie rated by the current viewer</pre>
<br />
Now, there're a few caveats about prefetching:<br />
<br />
First of all, for production code, you obviously want to write some CPU detection code to make sure the running CPU supports prefetching instructions.<br />
<br />
Secondly, the optimal <a href="http://www.intel.com/cd/ids/developer/asmo-na/eng/195896.htm?page=6">prefetch distance depends on a number of factors</a> (yes the article talks about Itanium, but the general principle applies to Pentium too). Not a trivial task.<br />
<br />
And finally, there’s a potential crash bug in the calculation of the prefetch memory address:<br />
<br />
<pre>movieToPrefetch=moviesRatedByViewer[currentIndex+prefetchOffset];</pre>
<br />
Because of prefetchOffset, we’re reading beyond the memory allocated for moviesRatedByViewer[], a potential crash.<br />
<br />
In reality, it probably won’t crash, because moviesRatedByViewer[] sits in a block of memory pages allocated by the C++’s runtime memory manager, and the memory immediately after moviesRatedByViewer[] most likely belongs to the memory manager, so our program can still read it. The only way it could crash is if moviesRatedByViewer[] ends at or near the last byte of the last page of the block of pages it sits in.<br />
<br />
But still, it’s better to just fix this. Fortunately, the fix is very simple: just allocate some extra bytes for moviesRatedByViewer[]. If you estimate your maximum prefetchOffset is N, just allocate an extra N*sizeof(*moviesRatedByViewer) bytes.<br />
<br />
After I added prefetching, and did a few trial runs to find the best prefetchOffset value, the running time of assignment step went from 44.3 seconds to 32.2 seconds. Another 27% improvement.<br />
<br />
But wait, there’s more. The new inner-loop is scanning thru array/memory sequentially, doing the same type of operation on each data element. This practically screams out for SIMD optimization. <br />
<br />
It’s always fun to improve performance by writing assembly language code, but we rarely have good reasons to do it nowadays. Now, if processing 100 million pieces of data, repeatedly, in a non-production environment, isn’t a good excuse, then I don’t know what is.<br />
<br />
So I translated the inner-loop to inline assembly code using SSE instructions. Even a straight-forward translation, without any tuning at all, reduced the running time of assignment step from 32.2 seconds down to 10.4 seconds! After a few runs to find a new optimal prefetch distance, the running time dropped further to 9.6 seconds. A huge improvement !<br />
<br />
So in summary, the per-iteration running times in seconds:<br />
<ul>
<li>Original implementation: total=90</li>
<li>Rearrangement/reduction of data: assignment=44.3, update=8.0, total=52.3</li>
<li>Prefetch: assignment=32.2, update=8.0, total=40.2</li>
<li>SSE inline assembly code: assignment= 9.6, update=8.0, total=17.6</li>
</ul>
An 80% reduction in running time, or 5 times boost in speed.<br /><br />Next up: The Greater Collaborative Filtering Groupthink (KNN) by321http://www.blogger.com/profile/14537616692777428127noreply@blogger.com7tag:blogger.com,1999:blog-9207672005321238583.post-13345392407343786602007-08-24T14:45:00.000-07:002007-08-28T11:05:13.177-07:00The Great Collaborative Filtering Groupthink(Note: this post assumes you're familiar with <a href="http://www.netflixprize.com//rules">The Netflix Prize</a>.)<br /><br />Now, some details on algorithms. I'll start with my viewer-clustering approach first.<br /><br />The basic idea is to come up with a number of "representative" viewers, with each one representing a group of viewers with similar tastes. In various literatures these "representative" viewers are usually called "centroid" or "cluster" - I'll use the term "cluster" from now on.<br /><br />In our case, each cluster is really a made-up viewer representing a group of viewers. The data for each cluster are just movie ratings, one rating value for each movie:<br /><blockquote>unsigned char clusterMovieRatings[MOVIE_CNT];</blockquote>Note I used a single byte (unsigned char) for each rating. I'll write more on this detail later.<br /><br />When it’s time to make a decision (prediction), each viewer just asks all the clusters he belongs to for a prediction, and then calculate the weighted average of the clusters' predictions. Of course, the predictions of each cluster is the average prediction of all viewers belonging to that cluster. Now you see why I call this algorithm the great collaborative filtering groupthink.<br /><br />OK, how do you come up with the clusters ? I used the basic <a href="http://en.wikipedia.org/wiki/K-means_algorithm">k-means algorithm</a>:<br /><ul><li>Initialization: Setup some clusters, set their movie rating values to random.</li><li>Assignment step: For each viewer, determine which cluster the viewer belongs to, i.e. 
the cluster that is most similar to the viewer. I used the square of Euclidean distance as the measure of similarity, that is, the sum of squares of difference between the viewer’s rating and the cluster’s rating, over the movies the viewer has rated. Of course, the smaller the number the greater the similarity.</li><li>Update step: Go thru each cluster. For each movie, calculate its average rating from the viewers who both belong to this cluster and have rated this movie. Replace the cluster’s rating for this movie with the average.</li><li>Repeat the previous two steps a number of times, or you can make a prediction for the probe data set and decide whether to stop based on its RMSE.</li></ul>To make a prediction for viewer v and movie m:<br /><br /><ul><li>Find the K clusters that are most similar to viewer v. Again, the square of Euclidean distance is used to measure similarity. Calculate the weight for each cluster from the square of distance: weight=1/(squaredDistance + offset). The offset is a small positive number; it’s there so we don’t get a divide by zero if a viewer happens to match a cluster perfectly.</li><li>Calculate the weighted average of the K clusters' ratings for movie m, and that’s the predicted rating for viewer v and movie m.</li></ul>That’s it. Pretty simple, isn’t it ? Of course there’re enhancements to the basic k-means algorithm such as soft k-means and k-means++ (and I don’t see any reason why they can’t be combined to produce a soft k-means++ algorithm), but I’m trying to get the simple and easy stuff working first. And given what I tried next, they probably don’t matter in the end.<br /><br />With the algorithms above, I tried different combinations of the following parameters:<br /><br /><ul><li>C: the total number of clusters.</li><li>I: the number of iterations used to build the clusters.</li><li>K: the number of most similar clusters used for prediction.</li></ul>In the end, C=64, I=10, and K=5 seemed the best. 
Furthermore, since we set the clusters’ ratings to random values at the start of the algorithm, we can run the algorithm multiple times, using a different set of random initial values each time, and then average the predictions to get a better RMSE. Basically we create more groups among the viewers, and averaging nudges the predictions toward a more reliable, common value.<br /><br /><strong><span style="font-size:130%;">Here a mystery surfaced</span></strong>: K=6 or 7 consistently showed the best RMSE for raw predictions, but K=3 or 4 gave the best result when averaging multiple sets of predictions.<br /><br />At this stage I started tweaking the algorithm. What I found was all tweaks improved the raw RMSE, but the averaged RMSEs stayed pretty much the same, <strong><span style="font-size:130%;">another mystery</span></strong>. Here’re some probe RMSE from raw predictions with C=64 and I=10:<br /><br /><blockquote>K=2, RMSE=0.9551<br />K=3, RMSE=0.9480<br />K=4, RMSE=0.9450<br />K=5, RMSE=0.9437<br />K=6, RMSE=0.9433<br />K=7, RMSE=0.9433<br />K=8, RMSE=0.9437<br />K=10, RMSE=0.9448<br />K=15, RMSE=0.9489<br />K=20, RMSE=0.9536<br />K=25, RMSE=0.9585<br />K=30, RMSE=0.9635<br /></blockquote><br />and here are some probe RMSEs from averaging multiple sets with C=64, I=10:<br /><br /><blockquote>Average of 2, K=2, RMSE=0.9440<br />Average of 2, K=3, RMSE=0.9408<br />Average of 2, K=4, RMSE=0.9398<br />Average of 2, K=5, RMSE=0.9397<br />Average of 2, K=6, RMSE=0.9400<br />Average of 2, K=7, RMSE=0.9407<br /><br />Average of 4, K=2, RMSE=0.9386<br />Average of 4, K=3, RMSE=0.9374<br />Average of 4, K=4, RMSE=0.9374<br />Average of 4, K=5, RMSE=0.9379<br />Average of 4, K=6, RMSE=0.9387<br />Average of 4, K=7, RMSE=0.9396<br /><br />Average of 8, K=2, RMSE=0.9358<br />Average of 8, K=3, RMSE=0.9356<br />Average of 8, K=4, RMSE=0.9360<br />Average of 8, K=5, RMSE=0.9368<br />Average of 8, K=6, RMSE=0.9378<br />Average of 8, K=7, RMSE=0.9388</blockquote>So in summary, 
groups are good, individuals are bad. We’re the Borg, you’ll be assimilated.<br /><br />Next up: The Great Collaborative Filtering Groupthink Thinks Fasterby321http://www.blogger.com/profile/14537616692777428127noreply@blogger.com2tag:blogger.com,1999:blog-9207672005321238583.post-83371572999310851832007-08-22T22:51:00.000-07:002007-08-28T10:57:08.114-07:00The Great Collaborative Filtering Framework<p>(Note: this post assumes you're familiar with <a href="http://www.netflixprize.com//rules">The Netflix Prize</a>.)</p><p>Continuing from the last blog, I plodded along and wrote some boring C# code to generate my binary files. Another task is to extract the probe data from the full data set. In the end I had 3 sets of data: the full data set, the probe-less data set, and the probe data set. For each set, I created two binary files: one sorted by movie ID first and then by viewer ID, the other sorted by viewer ID first and then by movie ID.</p><p>Next, I generated a prediction file using weighted averages of movie and viewer average ratings. To do that I had to get a least squares fitting/linear regression program going. This program is also used later to blend results from different algorithms. Plus I had to write a program to grade the probe predictions. The RMSE number I got agreed with what other people were reporting, so I was more or less on the right track. Whew, it was a lot of work just to get a basic framework going. But it should be smooth sailing from now on, right ?</p><p>Well, before I go any further, there're questions to be answered:</p><ul><li>Should I filter the data somehow ? As others have reported, there're oddball viewers in the data set. Like the guy who rated about 17000 movies, or the guy who rated thousands of movies in one day.</li><li>Should I do some sort of global offset or normalization ?</li><li>What to do about rating dates ? 
If you look at the global averages vs dates, you'll see the averages steadily going up, and there was an abrupt jump of about 0.2 around April 2004.</li></ul><p>Hmm, I don't have any good answers, but a few hokey theories and SWAGs (scientific wild ass guess) came to the rescue:</p><ul><li>Filtering: not needed, as the number of oddball viewers is too small and their effect is probably negligible.</li><li>Global offset/normalization: other people have reported getting RMSE in the low 0.98s using only movie/viewer averages. That’s within .03 of Netflix’s Cinematch algorithm, which is a non-trivial algorithm, which has to do more than just offset/normalization, if it does it at all. So the effect of offset/normalization is probably around 0.015. So I think as a data mining newbie just getting started on this, I can ignore this issue for now, I’ll worry about it after I exhaust other possibilities.</li><li>Rating dates: this might be nothing. It’s possible that as time progresses, Netflix collected more and more data and continuously updated their software, so people get better and better recommendations, and as a result the ratings got better and better. 
As for the abrupt jump around April 2004, maybe they did a major software upgrade/bug fix around that time.</li></ul><p>Actually, by this time I was so tired of reading research papers and so eager to get a nontrivial algorithm going that I’d make any excuse to take the “damn the torpedoes, full speed ahead” approach.</p><p>Next up: the great collaborative filtering groupthink, aka clustering.</p>by321<br /><br /><b>Great Expectations and Database</b> (2007-08-16)<p>(Note: this post assumes you're familiar with <a href="http://www.netflixprize.com//rules">The Netflix Prize</a>.)</p><p>When I first came across the Netflix prize, I just couldn't resist the urge to participate. It looked like a well-planned contest with an interesting problem. It would test and improve your coding skills and your understanding of interesting algorithms. Plus there's a $1,000,000 prize. So I eagerly signed up, downloaded the (large) data set, read the documentation, and pondered: how to start?<br /><br />At first, I thought this was a good project with which to improve my relatively limited knowledge of C#. I also had the idea of putting all the data into a database, so I could easily run queries like "gimme all viewers who gave this movie a score of 3", and learn the C#/.NET database interface at the same time. So version 0 of my grand plan was:</p><ul><li>Improve my C# skillz.</li><li>Learn the C# and .NET database interface.</li><li>Read about and implement interesting algorithms.</li><li>Win the one million dollar prize.</li></ul><p>Hehehe... good thing I didn't get XML and Haskell into the mess.<br /><br />So I set up a database, and started writing a C# program to reformat the raw data files from Netflix. That worked out pretty well, and soon I had some data in the database.
But as soon as I tried accessing the database, I noticed it was unbearably slow for this task. It would probably have been fine if I only wanted to run thousands of queries at a time, but a prediction algorithm needs to go through all 100 million ratings, and more. With everything in the database, even the simplest algorithm would take a few hours to run. This seemed to mirror the experience of other people. Maybe it would be workable with a cluster of 1000 fast PCs, but with at most 2 computers at my disposal, the database approach would never work.</p><p>So I abandoned the database approach and switched to sorting and saving the data in binary files for fast access. This time I put a little more thought into it and decided to use memory-mapped files as my primary means of getting data off the disk. This simplified my data access and has worked out very well so far.</p>by321<br /><br /><b>My Netflix Attempt</b> (2007-08-10)<br /><br />I've been working on the <a href="http://www.netflixprize.com/">Netflix prize</a> contest off and on; basically, it's a data-mining challenge with a $1 million prize. I'll be describing some of the approaches I took; hopefully someone will give me some feedback that will improve my score (0.899 as of July 20, 2007).<br /><br />So far I have the following algorithms up and running: SVD (based on <a href="http://www.timelydevelopment.com/Demos/NetflixPrize.htm">Timely Development's code</a>, thank you!), KNN, clustering, MLE (or is it EM? I'm not sure; anyway, it didn't work), and a couple of amateur-ish schemes I dreamt up, which also didn't work out (by "not working" I mean they did not beat Cinematch's RMSE).
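For concreteness, the RMSE yardstick behind all these comparisons is easy to state. My code is C#, but here's a minimal Python sketch of what the probe-grading step computes (the example numbers are made up):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    assert len(predicted) == len(actual) and predicted
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(predicted))

# Example: predictions off by half a star, one star, half a star
print(rmse([3.5, 4.0, 2.5], [3, 5, 2]))  # ≈ 0.707
```

Lower is better; "beating Cinematch" just means producing a smaller number than its score on the same probe/quiz data.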
Oh, by the way, this is the first time I've ever implemented these algorithms.<br /><br />While I certainly have enough programming experience for this contest, I'm not a mathematician or a statistician, and I have never dabbled in data mining before. So you won't find deep theoretical musings on topics like sampling techniques, SVD, or restricted Boltzmann machine guns here. I'll concentrate on my implementation details and try to keep my understanding and my terminology correct. All malaprops may or may not be intestinal.<br /><br />More details to follow.<br /><br />by321
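The weighted-average baseline from the framework post above (predict each rating from the movie's and the viewer's average ratings, with the weights chosen by least-squares fitting) can be sketched as follows. This is an illustrative Python/NumPy version on a made-up toy rating set, not my actual C# code:

```python
import numpy as np

# Toy data (made up): (movie_id, viewer_id, rating)
ratings = [(0, 0, 4), (0, 1, 5), (1, 0, 2), (1, 1, 3), (2, 0, 3), (2, 1, 4)]

# Per-movie and per-viewer average ratings
by_movie, by_viewer = {}, {}
for m, v, r in ratings:
    by_movie.setdefault(m, []).append(r)
    by_viewer.setdefault(v, []).append(r)
movie_avg = {m: sum(rs) / len(rs) for m, rs in by_movie.items()}
viewer_avg = {v: sum(rs) / len(rs) for v, rs in by_viewer.items()}

# Least-squares fit of rating ~ a*movie_avg + b*viewer_avg + c
A = np.array([[movie_avg[m], viewer_avg[v], 1.0] for m, v, _ in ratings])
y = np.array([float(r) for _, _, r in ratings])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(movie, viewer):
    """Blend the two averages with the fitted weights."""
    return float(coef @ [movie_avg[movie], viewer_avg[viewer], 1.0])
```

On the real data, this kind of baseline reportedly lands in the low 0.98s RMSE; the same least-squares fitter also doubles as the blender for combining predictions from different algorithms.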