<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-9207672005321238583</id><updated>2011-11-27T15:52:47.641-08:00</updated><title type='text'>Hacking Netflix Prize</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>23</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-7797494765388188373</id><published>2011-09-28T15:01:00.000-07:00</published><updated>2011-09-28T15:01:00.562-07:00</updated><title type='text'>Heritage Health Prize Round 1 Results</title><content type='html'>The Heritage Health Prize &lt;a href="http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1"&gt;round 1 milestone results were released&lt;/a&gt;. I came in 3rd on the final private rankings, even though I was only the 5th on public leaderboard. So I guess I didn't overfit too badly.&lt;br /&gt;&lt;br /&gt;I wonder by how much I missed winning the lottery this time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-7797494765388188373?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/7797494765388188373/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=7797494765388188373' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/7797494765388188373'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/7797494765388188373'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2011/09/heritage-health-prize-round-1-results.html' title='Heritage Health Prize Round 1 Results'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-820290133692651343</id><published>2011-09-25T15:09:00.000-07:00</published><updated>2011-10-06T10:22:13.281-07:00</updated><title type='text'>On Zombie Apocalypse Survival</title><content type='html'>Some people like to discuss &lt;a href="http://en.wikipedia.org/wiki/Zombie_apocalypse"&gt;zombie apocalypse&lt;/a&gt; survival scenarios. The suggested preparations usually involve stockpiling guns and ammo, building defensive structures, practicing marksmanship and survival skills, etc.&lt;br /&gt;&lt;br /&gt;I used to not take zombie apocalypse survival seriously, but lately I gave it some serious thought.&lt;br /&gt;&lt;br /&gt;Almost all zombie apocalypse survivalists are living in a fantasy. Because based on all credible evidence available (movies, duh), when the zombie apocalypse comes, most of them will become zombies themselves. Stockpiling guns and ammo ? Well chances are your stockpile will be used against your future zombie self by the few lucky remaining humans.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Seriously, zombie apocalypse survival is&amp;nbsp;about surviving as&amp;nbsp;zombies, not as&amp;nbsp;humans.&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;If you're saying, "If I become a zombie, then I don't care about survival any more, and I'd rather be dead." Well, technically you're already dead, but &lt;strong&gt;hope springs eternal, even for zombies&lt;/strong&gt;. Your job is to survive long enough until &lt;a href="http://en.wikipedia.org/wiki/I_Am_Legend_(film)"&gt;someone comes up with a cure that turns zombies back into humans&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;What can you do today to increase your chance of survival as a zombie in the future ? Keep in mind that all credible evidence available (movies, duh) show that zombies are incapable of higher functioning, and survives by killing humans and eating flesh off the corpses.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The number 1 thing you can do to improve your chance of survival as a zombie is getting strong, sharp, flesh-tearing titanium teeth implants.&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Even if zombie apocalypse does not come in your lifetime, titanium teeth implants still have advantages: you'll be a better chewer, and you don't need to worry about cavities any more.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-820290133692651343?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/820290133692651343/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=820290133692651343' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/820290133692651343'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/820290133692651343'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2011/09/on-zombie-apocalypse-survival.html' title='On Zombie Apocalypse Survival'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-3389474358992929931</id><published>2011-07-18T00:11:00.000-07:00</published><updated>2011-07-18T00:12:29.928-07:00</updated><title type='text'>Another Tile-based Picture Puzzle Game in Javascript</title><content type='html'>Wrote&amp;nbsp;another little tile-based picture puzzle in Javascript. Again, due to blogger.com's layout interfering with picture display, I had to put it &lt;a href="http://members.shaw.ca/byang/blogsup/TileSwap.htm"&gt;on a separate page&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Enter the URL to a picture on the web, and the game will load that picture and split it into tiles. The tiles will be randomly shuffled, and to increase difficult, the center of each tile will be covered by a gray square.&lt;br /&gt;&lt;br /&gt;You click on two tiles to swap their positions. After all tiles are swapped back into their correct positions in the picture, the whole picture will be displayed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-3389474358992929931?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/3389474358992929931/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=3389474358992929931' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/3389474358992929931'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/3389474358992929931'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2011/07/another-tile-based-picture-puzzle-game.html' title='Another Tile-based Picture Puzzle Game in Javascript'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-5378508744039360166</id><published>2011-06-17T11:35:00.000-07:00</published><updated>2011-09-28T14:03:12.398-07:00</updated><title type='text'>Javascript 15-puzzle Sliding Tile Game</title><content type='html'>I have been using Javascript more&amp;nbsp;recently, and decided to cleanup the &lt;a href="http://en.wikipedia.org/wiki/Sliding_puzzle"&gt;15-puzzle sliding tile game&lt;/a&gt; I wrote in Javascript some time ago. Instead of using boring numbered tiles, my version let you use images and it displays tiled images.&lt;br /&gt;&lt;br /&gt;I had to put it &lt;a href="http://members.shaw.ca/byang/blogsup/slide15.htm"&gt;on a separate page&lt;/a&gt; because columns in blog layout interferes with displaying bigger images.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-5378508744039360166?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/5378508744039360166/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=5378508744039360166' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/5378508744039360166'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/5378508744039360166'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2011/06/javascript-15-puzzle-sliding-tile-game.html' title='Javascript 15-puzzle Sliding Tile Game'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-164460682871444569</id><published>2010-12-10T16:44:00.000-08:00</published><updated>2011-07-01T03:19:37.093-07:00</updated><title type='text'>Cheating at Video Games: Electronic Automatic Trigger</title><content type='html'>Some time ago I bought this little "Space Invader" game machine. It plugs into your TV and contains 5 classic games. Two of them, &lt;a href="http://en.wikipedia.org/wiki/Phoenix_(video_game)"&gt;"Pheonix"&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Colony_7"&gt;"Colony 7"&lt;/a&gt;, are rapid-fire type shooting games that requires you to press the fire button quickly and repeatedly. Of course, your hand/wrist/arm quickly get sore from pressing the button repeatedly and the it's not fun at all.&lt;br /&gt;&lt;br /&gt;I decided to see if I could "cheat" by make an electronic automatic trigger of some sort. I took the game box apart, and using a multimeter, quickly determined that when you press a button, it just connects one point in the circuit to the ground. So make an electronic automatic trigger should be pretty easy.&lt;br /&gt;&lt;br /&gt;I drilled a hole on the side on the game box, and run 4 wires in, soldering them to the power, ground, A button contact, and B button contact (unused in my circuit)&amp;nbsp;of the game circuit. I wanted to build a simple oscillator circuit using the &lt;a href="http://en.wikipedia.org/wiki/555_chip"&gt;555 IC&lt;/a&gt;, but I was out of 555 so I used one half of a 556 (which contains two 555's) instead. The "trigger" circuit is your basic 555 astable mode circuit:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/-1i4WxN1aS4k/TgJ_M4929AI/AAAAAAAAACo/L6EF63VD8gY/s1600/FullAutoConversion.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="272" i$="true" src="http://1.bp.blogspot.com/-1i4WxN1aS4k/TgJ_M4929AI/AAAAAAAAACo/L6EF63VD8gY/s320/FullAutoConversion.png" width="320" /&gt;&lt;/a&gt;&lt;br /&gt;When the output is high, it turns on the transistor and connects the button contact to the ground, thus electrically "press" the button. The rate of fire, using the component values shown, is about 18hz, or about 1000 rounds per minute. That's more than enough for the games since they allow far fewer bullets on screen at a time.&lt;br /&gt;&lt;br /&gt;I had to solder the power wire to the power pin of the 556 chip, but everything else was jump-wired on a small breadboard duct-taped to the game box:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-p1ZlMHnlcAA/TgJ_D3iGhJI/AAAAAAAAACk/xfMvB-qvJdM/s1600/GameUnit.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="263" i$="true" src="http://4.bp.blogspot.com/-p1ZlMHnlcAA/TgJ_D3iGhJI/AAAAAAAAACk/xfMvB-qvJdM/s320/GameUnit.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-164460682871444569?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/164460682871444569/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=164460682871444569' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/164460682871444569'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/164460682871444569'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2010/12/cheating-at-video-games-electronic.html' title='Cheating at Video Games: Electronic Automatic Trigger'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-1i4WxN1aS4k/TgJ_M4929AI/AAAAAAAAACo/L6EF63VD8gY/s72-c/FullAutoConversion.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-3687183901651258459</id><published>2009-10-06T01:13:00.001-07:00</published><updated>2009-11-22T21:32:44.861-08:00</updated><title type='text'>Coffee, water, anti-tank weapons, &amp; nuclear explosions</title><content type='html'>I was working with a camera development board from &lt;a href="http://www.aptina.com/"&gt;Aptina Imaging&lt;/a&gt;, and I decided to try to capture a few high-speed videos, and here're the results. The videos were captured at a few hundred frames per second, then slowed down for playback.&lt;br /&gt;&lt;br /&gt;The hardware I used is nowhere nearly fast enough to capture the flight of bullets, so I was stuck with the milk-drop variety. There are lots of slow motion videos of milk drops on youtube, so I decided to do something slightly different: drops of coffee falling into clear water. I know, very creative, but the choice of clear water turned out to be good. The video quality isn't good, but hey, I was using a cheap hardware system.&lt;br /&gt;&lt;br /&gt;Here's the first video, a drop of coffee into water:&lt;br /&gt;&lt;br /&gt;&lt;object width="320" height="265"&gt;&lt;param name="movie" value="http://www.youtube.com/v/VwNweO2TFBg&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/VwNweO2TFBg&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="320" height="265"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;Here's the second video, 2 drops of coffee into water:&lt;br /&gt;&lt;br /&gt;&lt;object width="320" height="265"&gt;&lt;param name="movie" value="http://www.youtube.com/v/-6zbirXhBtk&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/-6zbirXhBtk&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="320" height="265"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;Here's the third video, 3 drops of coffee into water. No, just kidding, it's side-view of a drop of coffee into water:&lt;br /&gt;&lt;br /&gt;&lt;object width="320" height="265"&gt;&lt;param name="movie" value="http://www.youtube.com/v/jHtFdTgN_V8&amp;hl=en_US&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/jHtFdTgN_V8&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="320" height="265"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;Now the third video is most interesting. Like everyone else, I've seen things falling into water countless times, but that third, side-view video revealed details I've never noticed before.&lt;br /&gt;&lt;br /&gt;Basically, when a drop of coffee falls into a glass of water, what happens can be roughly divided into 5 stages:&lt;br /&gt;&lt;br /&gt;1. The drop of coffee falls thru the surface of the water, creating a depression on the surface of the water.&lt;br /&gt;&lt;br /&gt;2. As the water rush in and fill that depression, a jet of water forms and shoots upward.&lt;br /&gt;&lt;br /&gt;3. Apparently different parts of the water jet travel at different speeds, because some water separate from the main body of the jet and forms a small water sphere above it. &lt;br /&gt;&lt;br /&gt;4. Everything fall back into the water, waves form.&lt;br /&gt;&lt;br /&gt;5. A "coffee ring" forms in the water, expanding slowly and sinking.&lt;br /&gt;&lt;br /&gt;The formation of the water jet from the depression on the water surface reminds me of the &lt;a href="http://en.wikipedia.org/wiki/Munroe_effect"&gt;Munroe effect&lt;/a&gt;, the physical phenomenon used by &lt;a href="http://en.wikipedia.org/wiki/Shaped_charge"&gt;shaped-charge&lt;/a&gt; anti-tank weapons to punch thru thick tank amour. And that expanding, sinking coffee ring reminds me of the expanding but rising mushroom cloud following a nuclear explosion.&lt;br /&gt;&lt;br /&gt;Now you know how to do weapons research using coffee and water.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-3687183901651258459?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/3687183901651258459/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=3687183901651258459' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/3687183901651258459'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/3687183901651258459'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2009/10/coffee-water-anti-tank-weapons-nuclear.html' title='Coffee, water, anti-tank weapons, &amp; nuclear explosions'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-8837089475451088517</id><published>2009-09-24T10:04:00.000-07:00</published><updated>2009-09-24T13:17:35.107-07:00</updated><title type='text'>The Netflix Prize Announcement</title><content type='html'>I went to the Netflix Prize announcement in New York on Sept 21. It was great to finally meet some of my team members, as well as our arch-nemesis &lt;a href="http://www.research.att.com/~volinsky/netflix/bpc.html"&gt;BellKor's Pragmatic Chaos&lt;/a&gt;. I also did some sightseeing in NYC so it was a pretty good trip.&lt;br /&gt;&lt;br /&gt;The prize announcement was pretty well-covered by the media. Reporters from New York Times, Forbes, AP, Business Week, Wall Street Journal, etc were there. Just do a search and you'll find plenty of press coverage of the event.&lt;br /&gt;&lt;br /&gt;From talking to my team members, it's not surprising that we all work in IT-related fields, but I'm still impressed by the breadth of experience in the Ensemble team, from health care to entertainment industry to unmanned weapon systems. That's right, from saving your life to killing you and making a movie about it, we got you covered.&lt;br /&gt;&lt;br /&gt;BellKor's Pragmatic Chaos revealed one surprising chance-of-luck happening that explained some of the behind-the-scene drama we saw in the last 48 hours of the contest.&lt;br /&gt;&lt;br /&gt;PS: I accidentally got quoted in &lt;a href="http://www.businessweek.com/technology/content/sep2009/tc20090921_645345.htm"&gt;this Business Week article&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-8837089475451088517?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/8837089475451088517/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=8837089475451088517' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/8837089475451088517'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/8837089475451088517'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2009/09/netflix-prize-announcement.html' title='The Netflix Prize Announcement'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-1463999180349168865</id><published>2009-08-23T23:57:00.000-07:00</published><updated>2011-07-11T11:08:51.864-07:00</updated><title type='text'>An Interview by Three Ensemblers</title><content type='html'>Three members of &lt;a href="http://www.the-ensemble.com/"&gt;The Ensemble&lt;/a&gt;, &lt;a href="http://expensivelunch.blogspot.com/"&gt;Joe Sill&lt;/a&gt;, &lt;a href="http://www.ventirx.com/about/mngmt_howbert.htm"&gt;Jeff Howbert&lt;/a&gt;, and &lt;a href="http://www.facebook.com/chris.hefele"&gt;Chris Hefele&lt;/a&gt; have done a great interview at &lt;a href="http://www.twit.tv/natn114"&gt;http://www.twit.tv/natn114&lt;/a&gt;. If you download the MP3 file, the actual interview starts at 25 minutes into the file.&lt;br /&gt;&lt;br /&gt;The host really got into the drama factor of the last 30 days. Still,&amp;nbsp;most of the behind-the-scene happenings have not been told yet, I hope that someday&amp;nbsp; the whole will be told and from BPC's perspective too. I guarantee you it will be 10 times more interesting. &lt;br /&gt;&lt;br /&gt;And it was an incredible &amp;amp; awesome experience, filled with uncertainty, suspense, excitement, debates... and the last 48 hours was absolutely thrilling !&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-1463999180349168865?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/1463999180349168865/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=1463999180349168865' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/1463999180349168865'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/1463999180349168865'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2009/08/interview-by-ensemble-members.html' title='An Interview by Three Ensemblers'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-402343170574031938</id><published>2009-07-29T01:03:00.000-07:00</published><updated>2010-08-27T00:26:08.769-07:00</updated><title type='text'>How I End Up In The Ensemble</title><content type='html'>When I signed up for the &lt;a href="http://www.netflixprize.com/"&gt;Netflix Prize&lt;/a&gt;, I had to pick a team name. I like the show Seinfeld, and &lt;a href="http://en.wikipedia.org/wiki/Newman_%28Seinfeld%29"&gt;Newman&lt;/a&gt; is my favorite character in the show, due to his sheer absurdity and hilariousness. So I chose "Newman!" as my team name.&lt;br /&gt;&lt;br /&gt;Later I formed a joint team with another contestant, and we called this team "Newman and &lt;a href="http://en.wikipedia.org/wiki/Cosmo_Kramer"&gt;Kramer&lt;/a&gt; !", named after the two most absurd characters in the show. Kramer helped me with my &lt;a href="http://en.wikipedia.org/wiki/Boltzmann_machine"&gt;RBM&lt;/a&gt; implementation and then quickly went MIA (but not presumed dead, where are you Eric ?).&lt;br /&gt;&lt;br /&gt;Then &lt;a href="http://www.linkedin.com/in/gregmcalpin"&gt;Greg McAlphin&lt;/a&gt; (ADifferentName/OfADifferentKind) and I formed a new team named "Newman and &lt;a href="http://en.wikipedia.org/wiki/George_Costanza"&gt;George&lt;/a&gt; !".&lt;br /&gt;&lt;br /&gt;Then &lt;a href="http://information-density.blogspot.com/"&gt;Bill Bame&lt;/a&gt; (clueless) joined us and we formed a new team named "Newman, George, and &lt;a href="http://en.wikipedia.org/wiki/Jacopo_Peterman"&gt;Peterman&lt;/a&gt; !".&lt;br /&gt;&lt;br /&gt;So at this point, we were "polluting" the &lt;a href="http://www.netflixprize.com//leaderboard?limit=100"&gt;leaderboard&lt;/a&gt; in four teams:&lt;br /&gt;&lt;br /&gt;*Newman!&lt;br /&gt;*Newman and Kramer !&lt;br /&gt;*Newman and George !&lt;br /&gt;*Newman, George, and Peterman !&lt;br /&gt;&lt;br /&gt;Then &lt;a href="http://www.facebook.com/chris.hefele"&gt;Chris Hefele&lt;/a&gt; (chef-ele) joined us, and we were going to be "Newman, George, Peterman, and &lt;a href="http://en.wikipedia.org/wiki/Kenny_Bania"&gt;Bania&lt;/a&gt; !". Now obviously we had a scalability problem: the team name was growing too long. And at this point, we fully expected other people to join later, so what were we going to be in the end ? "Newman, George, Peterman, Bania, &lt;a href="http://en.wikipedia.org/wiki/Justin_Pitt"&gt;Mr Pitt&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Uncle_Leo"&gt;Uncle Leo&lt;/a&gt;, and &lt;a href="http://en.wikipedia.org/wiki/Soup_nazi"&gt;Soup Nazi&lt;/a&gt; !" ?&lt;br /&gt;&lt;br /&gt;We decided to call our new 4-member team "Vandelay Industries".&lt;br /&gt;&lt;br /&gt;Then &lt;a href="http://howbert.netherweb.com/"&gt;Jeff Howbert&lt;/a&gt; (howbert) joined. Shortly afterwards, the contest entered the last 30 days, and a slew of others on the leaderboard joined Vandelay Industries. The team expanded faster than Newman's waistline (the character's, not mine).&lt;br /&gt;&lt;br /&gt;Members of Vandelay Industries created a few sub-teams to explore different blends. Later, in preparation to merge with the &lt;a href="http://www.grandprizeteam.com/"&gt;Grand Prize Team&lt;/a&gt;, &lt;a href="http://www.operasolutions.com/"&gt;Opera Solutions&lt;/a&gt;, and &lt;a href="http://www.feeds2.com/"&gt;Feeds2&lt;/a&gt; to form &lt;a href="http://www.the-ensemble.com/"&gt;The Ensemble&lt;/a&gt;, we got rid of all the sub-teams including the Newman-and-whoever teams.&lt;br /&gt;&lt;br /&gt;Except "Newman and Kramer !". Kramer is MIA but out of respect for him and his help, I left "Newman and Kramer !" on the leaderboard.&lt;br /&gt;&lt;br /&gt;That's my journey from Newman! to &lt;a href="http://www.the-ensemble.com/"&gt;The Ensemble&lt;/a&gt;. Here's &lt;a href="http://www.delineateme.com/"&gt;David Lin's story&lt;/a&gt; of Dinosaur Planets and The Ensemble.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-402343170574031938?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/402343170574031938/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=402343170574031938' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/402343170574031938'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/402343170574031938'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2009/07/how-i-end-up-in-ensemble.html' title='How I End Up In The Ensemble'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-861104715460541576</id><published>2009-07-04T18:49:00.001-07:00</published><updated>2009-07-16T12:17:01.733-07:00</updated><title type='text'>We Are The Borg</title><content type='html'>We are the borg. Your algorithm will be assimilated. Resistance is irrelevant.&lt;br /&gt;&lt;br /&gt;We're also known as: &lt;a href="http://sites.google.com/site/xlvector/"&gt;xlvector&lt;/a&gt;, OfADifferentKind, &amp; Newman(me).&lt;br /&gt;&lt;br /&gt;Update: to clean up the leaderboard, we have voluntarily removed "We are the Borg" and other sub-teams of "Vandelay Industries !". The pre-Vandelay Industries team "Newman, George, and Peterman !" is gone too.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-861104715460541576?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/861104715460541576/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=861104715460541576' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/861104715460541576'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/861104715460541576'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2009/07/we-are-borg.html' title='We Are The Borg'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-4733605637909505077</id><published>2009-07-01T00:19:00.001-07:00</published><updated>2009-07-01T02:17:41.111-07:00</updated><title type='text'>The End Is Near</title><content type='html'>The end Netflix Prize is near: &lt;a href="http://www.netflixprize.com//leaderboard"&gt;BPC has reached 10%&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Now, how do I apply what I've learning during the participation of this contest ? Maybe I should get into trying to predict the stock market or commodity prices, or something. People are already doing that on the stock market, it's called &lt;a href="http://en.wikipedia.org/wiki/Algorithmic_trading"&gt;algorithmic trading&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Of course recommender systems and data-mining can be applied to all kinds of products and services, not just movies. For example truck dealers can try to find people who are most likely to buy trucks and send them truck advertisement; online matchmaking services can match people based on user data and history of matches.&lt;br /&gt;&lt;br /&gt;Of course, for most products and service there are obvious "indicators" that we all know. For example, males in construction industry are probably far more likely to buy trucks than people in general. But still, there might be some subtle indicators and trends that can only be discovered by a computer algorithm, which might improve the accuracy of recommendations by 10%, as measured by RMSE. And as BellKor's research shows, even a few percentages of RMSE improvement can translate into huge increase in the quality of the recommendations you get. You'll actually like the products recommended to you, or the people.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;How Can It Work For Matchmaking ?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Well, now I think about it, it probably won't work for matchmaking services. Why ? Because for matchmaking services, it will take a long time to truly know the quality of the recommendations.&lt;br /&gt;&lt;br /&gt;Sure, you can find out about bad ones pretty soon: people go on one date and can't stand each other. But how do you know which recommendations are really good, and which ones only look good now but will end up in messy divorce 10 year later ? I think for matchmaking services, you'll have to wait for a generation to tell which recommendations are truly good. But by that time, society and people in general would have chance a lot, so whatever worked 20 years ago probably won't work so well today.&lt;br /&gt;&lt;br /&gt;So for matchmaking services, the value of recommender systems is probably to filter out potential incompatible dates.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-4733605637909505077?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/4733605637909505077/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=4733605637909505077' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/4733605637909505077'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/4733605637909505077'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2009/07/end-is-near.html' title='The End Is Near'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-2934509414861647777</id><published>2009-06-11T11:55:00.000-07:00</published><updated>2010-02-16T09:21:09.932-08:00</updated><title type='text'>Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours)</title><content type='html'>&lt;a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm"&gt;KNN&lt;/a&gt; is one of the most popular CF algorithms. Central to KNN is how to define the neighborhood relationship, or the distance between two objects. For this purpose, &lt;a href="http://en.wikipedia.org/wiki/Pearson_r"&gt;Pearson Correlation&lt;/a&gt; is a pretty good measure and is used frequently (&lt;a href="http://dmnewbie.blogspot.com/2007/09/greater-collaborative-filtering.html"&gt;see my previous post for my basic KNN approach&lt;/a&gt;). In the Netflix dataset, there are 17770 movies, so we need to calculate 17770*17770, or about 316 million Pearson correlations. Not a trivial task.&lt;br /&gt;&lt;br /&gt;In this post, I'll describe the optimizations I used for the calculation of Pearson Correlation, which cut my running time from 2.5 hours to less than 2 minutes. It won't help your RMSE directly, but it may help indirectly by allowing you to explore the KNN parameter space faster. And although I used Pearson Correlation, the methods described in this post can be applied to many other neighborhood measures too.&lt;br /&gt;&lt;br /&gt;Since this is about optimization, system spec is important. The numbers reported here were generated on a PC with Intel Core2 Quad CPU (4 cores) @ 2.4G Hz, 2 GB of RAM, running Window XP. Most code was written in C++, built by Visual Studio 2008. (TODO: get memory spec)&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_NEh4kUkAi48/Sb6kc-7tTnI/AAAAAAAAAA4/Edu5W_6iaNA/s1600-h/Untitled.png"&gt;&lt;img id="BLOGGER_PHOTO_ID_5313865428180487794" style="FLOAT: right; MARGIN: 0px 0px 10px 10px; WIDTH: 300px; CURSOR: hand; HEIGHT: 320px" alt="" src="http://1.bp.blogspot.com/_NEh4kUkAi48/Sb6kc-7tTnI/AAAAAAAAAA4/Edu5W_6iaNA/s320/Untitled.png" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;The Matrix: Halved&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Before discussing any optimization, I'll discuss an optimization that's &lt;strong&gt;NOT&lt;/strong&gt; taken. Look at Figure 1. It shows the correlation matrix of 5 movies. AB is the correlation of movie A and B, AC is the correlation of movie A and C, and so on.&lt;br /&gt;&lt;br /&gt;Now, if you assume movie correlations are symmetrical, then AB=BA, AC=CA, BC=CB, and so on, so only less than half of the matrix needs to be calculated. Now let’s say I calculated the blue-shaded part of the matrix, and saved the data to a big file. To make predictions, I need to read the data back:&lt;br /&gt;&lt;br /&gt;For movie A, I need to read AB to AE, and that’s 1 file-seek and 1 file-read operation.&lt;br /&gt;&lt;br /&gt;For movie B, I need to read AB, and BC to BE, that’s 2 file-seek and 2 file-read operations.&lt;br /&gt;&lt;br /&gt;For movie C, I need to read AC, BC, and CD and CE, that’s 3 file-seek and 3 file-read operations.&lt;br /&gt;&lt;br /&gt;And so on…&lt;br /&gt;&lt;br /&gt;In the Netflix data set, there are M=17770 movies, so by the time I get to the last movie, I’d need 17769 file seek and read operations per movie. All that file seeking will kill my performance and my hard disk.&lt;br /&gt;&lt;br /&gt;The simple solution is to have enough memory to hold half the matrix, so I can load all the data from disk in one shot. But for each matrix entry, I need not only the correlation value, but also two average values, and a common viewer count, making it 16 bytes per entry. So for M=17770 movies, I need about 2.35 GB, but I needed this thing to run on my laptop, which had only 1.25 G when I started coding this. Even after I compacted it down to 5 bytes per entry, memory situation was still tight.&lt;br /&gt;&lt;br /&gt;(Actually, I think you only need enough memory to hold one quarter of the matrix to eliminate the huge number of file seeks, but the in-memory indexing scheme gets pretty complicated. There might be clever ways to do this, but I'm not sure yet.)&lt;br /&gt;&lt;br /&gt;And all that applies only if you assume movie correlations are symmetrical, and there're ways to calculate correlations where they're not symmetrical. So out of concerns for memory usage and simplicity, I decided to ignore this half-matrix optimization.&lt;br /&gt;&lt;br /&gt;Now let's get back to the calculation of Pearson's r.&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;strong&gt;The Brute-Force Approach&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;To calculate Pearson's r of two movies X and Y, we need to find viewers who rated both X and Y, and calculate some intermediate values:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;  struct PearsonIntermediate {&lt;br /&gt;  //from all viewers who rated both movie X and Y.&lt;br /&gt;    float x; //sum of ratings for movie X&lt;br /&gt;    float y; //sum of ratings for movie Y&lt;br /&gt;    float xy; //sum of product of ratings for movies X and Y&lt;br /&gt;    float xx; //sum of square of ratings for movie X&lt;br /&gt;    float yy; //sum of square of ratings for movie Y&lt;br /&gt;    unsigned int cnt; //number of viewers who rated both movies&lt;br /&gt;  };&lt;/pre&gt;and you can easily calculate &lt;a href="http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient"&gt;Pearson's r&lt;/a&gt; from these intermediate values.&lt;br /&gt;&lt;br /&gt;For Netflix data, float can be replaced by "unsigned int" when processing raw, integer ratings, which can only be 1, 2, 3, 4, or 5.&lt;br /&gt;&lt;br /&gt;As mentioned on my previous blog, I have the rating data sorted by movie index followed by viewer index, so finding common viewers of two movies is simply finding common values in two sorted arrays. Very easy, and very slow. The inner loop of this approach looks this:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;  while (not at the end of pViewerIndex0 and pViewerIndex1) {&lt;br /&gt;    if ( *pViewerIndex0 &lt; *pViewerIndex1) {         &lt;br /&gt;        pViewerIndex0++;&lt;br /&gt;    } else if ( *pViewerIndex0 &gt; *pViewerIndex1) {&lt;br /&gt;        pViewerIndex1++;&lt;br /&gt;    } else {&lt;br /&gt;        //a viewer who rated both movies&lt;br /&gt;        //update PearsonIntermediate&lt;br /&gt;    }&lt;br /&gt;  }&lt;/pre&gt;The inner loop is conditional branch-heavy, and even worse, given that the viewer indices are randomly spaced, they are very unpredictable branches, probably completely defeating the &lt;a href="http://en.wikipedia.org/wiki/Branch_prediction"&gt;branch prediction logic of the CPU&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I think the Big O Notation of this approach is roughly O(M*M*T/M)=O(M*T), where M is the total number of movies, and T is the total number of ratings. Basically, for each movie, you have to compare against all other movies (M*M), and the average length of comparisions is the average number of ratings per movie (T/M).&lt;br /&gt;&lt;br /&gt;My implementation of this approach took 4.5 hours to run on my little laptop, and 2.5 hours on the faster desktop. And I suffered thru this 4.5 hours a few times, due to the inevitable bugs in the initial implementation.&lt;br /&gt;&lt;br /&gt;You can definitely optimize the code somewhat, but once I got KNN working correctly, I moved onto other algorithms.&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;strong&gt;A Break-thru&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;So my KNN algorithm just sat there, for months. During this time, I worked on other CF algorithms, which is probably why one day, I was working on something totally unrelated to KNN, and I suddenly had one of those "Oh my God I'm an idiot !" moment.&lt;br /&gt;&lt;br /&gt;I realized there's a much bette way of calculating correlations. You see, for each movie X, instead of comparing it to every other movie directly, I can go thru all the viewers V who rated movie X; and for each viewer V, go thru all the movies Y rated by that viewer; and for each movie Y, update the PearsonIntermediate of movie pair X and Y. At the end, I'll have PearsonIntermediate of movie X and all the movies in the dataset (including X itself).&lt;br /&gt;&lt;br /&gt;In pseudo-code, to calculate Pearson correlations of movie X and all movies in the dataset, you declare:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;  PearsonIntermediate array[MOVIE_CNT]; //MOVIE_CNT is 17770&lt;br /&gt;&lt;br /&gt;  initialize array[] to zeros&lt;br /&gt;  for each viewer v who rated movie X {&lt;br /&gt;    for each movie Y rated by viewer v {&lt;br /&gt;       update values of array[Y]&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;  go thru array[] to calculate all correlations for movie X&lt;/pre&gt;As a sanity check, you can verify the correlation from array[X] is 1 (or very close to 1, due to floating point inaccuracy), which is the correlation of movie X and itself.&lt;br /&gt;&lt;br /&gt;The Big O Notation for this approach is O(M*(T/M)*(T/V))=O(T*T/V). The brute-force approach is O(M*T). Since T/V&gt;&gt;M, I should get a significant speed-up. Furthermore, the inner loop does not have the unpredictable branchings of the brute-force approach.&lt;br /&gt;&lt;br /&gt;This is just the motivation I needed to implement KNN on residues of other algorithms. If the original algorithm is slow with raw ratings (1 to 5 integer), it will be even slower on residues (floating point). After I was done with residues, I retrofitted this approach to my raw-rating KNN, and the improvement was huge: calculation time dropped to about about 15 minutes, a 10x speed-up !&lt;br /&gt;&lt;br /&gt;That's pretty good, but wait, there is more ...&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;strong&gt;Parallelization&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;At this point I implemented multi-threading to take advantage of multiple CPU cores. Movie correlation calculation is trivially parallelizable. The result is about 3.4x speed up on 4 cores, bringing the running time down to 265 seconds.&lt;br /&gt;&lt;br /&gt;But wait, there is more ...&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;strong&gt;Simplifying the Inner Loop&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I thought since raw ratings can be only 1, 2, 3, 4, or 5, I should be able to speed things up significantly by replacing the PearsonIntermediate structure with:&lt;br /&gt;&lt;pre&gt;    unsigned int32 vCount[5][5];&lt;/pre&gt;where vCount[m][n] is the number of viewers who gave movie X a score of m+1 and movie Y a score of n+1. This simplifies the inner-loop to just incrementing an integer !&lt;br /&gt;&lt;br /&gt;I quickly implemented this "clever" idea, however, the result was disappointing: the speed-up was a few percent in some quick tests. Hmm, what could be wrong ?&lt;br /&gt;&lt;br /&gt;After some head-scratching, I guessed the problem was the size of memory used. Each PearsonIntermediate structure takes only 24 bytes, but "unsigned int32 vCount[5][5]" takes 100 bytes. So as the code does the calculations, more memory had to be swapped in and out of cache and this seriously harmed the performance.&lt;br /&gt;&lt;br /&gt;Fortunately we can use the nature of the data itself to work around this problem. Of the 17770 movies in the data set, 33% have less than 256 ratings, 65% have between 256 and 65535 ratings, and only 2% have more than 65536 ratings. Of course, if a movie has less than 256 ratings, there is no way the it has 256 or more common viewers with another movie, so the common viewer counts of this movie can be stored using "unsigned char" instead of "unsigned int", or 1 byte instead of 4 bytes. For movies that have between 256 and 65535 ratings, we can used "unsigned short" (2 bytes).&lt;br /&gt;&lt;br /&gt;This means I need 3 versions of the correlation calculation function, one each for unsigned int/short/char (or int32/int16/int8). Here C++'s template function feature comes to the rescue (in spite of the awkward syntax). I wrote one function that uses a type T to store the common viewer counts:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;  T vCount[5][MOVIE_CNT][5]; //initialize to zero&lt;br /&gt;&lt;br /&gt;  for each viewer V who rated movie X {&lt;br /&gt;    m=the rating viewer V gave to movie X &lt;br /&gt;    for each movie Y rated by viewer V {&lt;br /&gt;      n=the rating viewer V gave to movie Y&lt;br /&gt;      vCount[m][Y][n]++;&lt;br /&gt;    }&lt;br /&gt;  }&lt;/pre&gt;and I had 3 "instantiations" (those jargons are as bad as the awkward syntax) of this function where T is unsigned int/short/char. Notice the vCount array is arranged as [5][MOVIE_CNT][5] instead of the more "straight-forward" form of [MOVIE_CNT][5][5]. My tests showed [5][MOVIE_CNT][5] gave the best performance, no doubt because it minimized cache misses.&lt;br /&gt;&lt;br /&gt;With this change, the running time improved to 156 seconds.&lt;br /&gt;&lt;br /&gt;But wait, there is more ...&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;strong&gt;Complicating the Inner Loop&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The previous chanage simplified the inner loop, but most gains seem to come from reducing the demand on memory subsytem and cache miss. Can we do better ?&lt;br /&gt;&lt;br /&gt;Consider a movie with 3000 ratings, the type of its vCount[][][] array is "unsigned short", or 2 bytes per element. But it's unlikely that the final value of any array element would be 256 or bigger, given 3000/25=120. Even for movies with 6400 or more ratings (6400/25=256), because ratings are not evenly distributed, most vCount[][][] elements are still less than 255.&lt;br /&gt;&lt;br /&gt;So for most movies, one byte is enough for most vCount elements. But how do we use the 1-byte (unsigned char) version of vCount, and still be able to handle the occasional elements that go above 255 ? Well, use another array to count the overflows:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;  unsigned char vCount[5][MOVIE_CNT][5]; //initialize to zero&lt;br /&gt;  unsigned short overflows[5][MOVIE_CNT][5]; //initialize to zero&lt;br /&gt;&lt;br /&gt;  for each viewer V who rated movie X {&lt;br /&gt;    m=the rating viewer V gave to movie X &lt;br /&gt;    for each movie Y rated by viewer V {&lt;br /&gt;      n=the rating viewer V gave to movie Y&lt;br /&gt;      vCount[m][Y][n]++;&lt;br /&gt;      if (0==vCount[m][Y][n]) overflows[m][Y][n]++;&lt;br /&gt;    }&lt;br /&gt;  }&lt;/pre&gt;The code above takes advantage of the fact that in 2's complement system, when an integer value overflows, it wraps around to zero.&lt;br /&gt;&lt;br /&gt;At first I was very skeptical of this change, because it adds a branching to the inner-most loop. How could it ever improve performance with that ? However that branching is highly predictable, and everything I read says with modern CPU's branch prediction logic, the cost of predictable branches is almost nothing.&lt;br /&gt;&lt;br /&gt;So with great skepticism, I tried the code above. To my relief it did run faster, bringing the running time down to 143 seconds.&lt;br /&gt;&lt;br /&gt;But wait, there is more ....&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;strong&gt;Reordering the Movies&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;For any viewer V, the IDs of movies V has rated are pretty much randomly distributed. Imaging the array of movies, with red dots indicating the movies rated by viewer V:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_NEh4kUkAi48/SihBDbkAhrI/AAAAAAAAABg/474-ci0Qw-8/s1600-h/1.PNG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5343592485069293234" style="DISPLAY: block; MARGIN: 0px auto 10px; TEXT-ALIGN: center" alt="" src="http://4.bp.blogspot.com/_NEh4kUkAi48/SihBDbkAhrI/AAAAAAAAABg/474-ci0Qw-8/s400/1.PNG" border="0" /&gt;&lt;/a&gt;and as we go thru the movies rated by viewer V, most red dots incur cache misses.&lt;br /&gt;&lt;br /&gt;However, if we sort the movie IDs in increasing order of rating count of movies, then the distribution of red dots changes:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_NEh4kUkAi48/SihBnqj7KiI/AAAAAAAAABo/T_bg2Q-bzxA/s1600-h/2.PNG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5343593107570764322" style="DISPLAY: block; MARGIN: 0px auto 10px; TEXT-ALIGN: center" alt="" src="http://4.bp.blogspot.com/_NEh4kUkAi48/SihBnqj7KiI/AAAAAAAAABo/T_bg2Q-bzxA/s400/2.PNG" border="0" /&gt;&lt;/a&gt;statistically, we have the same or even more cache misses at the beginning, but as we go into higher movie indices, the chances that the viewer rated closely spaced movies increases, thus reducing the cache misses as vCount[][][] is updated.&lt;br /&gt;&lt;br /&gt;Notice the "sorting" of the movie IDs by movie rating count is not an O(N*log(N)) quicksort for 100 million ratings, but just an O(N) remapping.&lt;br /&gt;&lt;br /&gt;This change reduced the running time to 116 seconds, less than 2 minutes.&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Well, that's it. I've picked all the low-hanging fruits, at least those that I could see.&lt;br /&gt;&lt;br /&gt;Here's the list of performance gains from my initial implementation:&lt;br /&gt;&lt;br /&gt;*Algorithm change: 10x&lt;br /&gt;*Parallelization: 3.4x (4 cores vs 1 core)&lt;br /&gt;*Low-level tweaks: 2.3x&lt;br /&gt;&lt;br /&gt;In total about 78x performance gain, without writing a single line of assembly language.&lt;br /&gt;&lt;br /&gt;Of the 3 items above, the biggest performance gain vs ease of implementation is parallization, because KNN calculation can be parallized very easily. I believe more cores will help the performance more, but with diminishing returns.&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;strong&gt;The Memory Bandwidth Bottleneck&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Memory bandwidth seems to be the performance bottleneck of KNN algorithms.&lt;br /&gt;&lt;br /&gt;When I was coding the parallization step, I installed some software to monitor MSRs, or &lt;a href="http://en.wikipedia.org/wiki/Model-specific_register"&gt;model-specific registers&lt;/a&gt;. MSRs are very useful for performance monitoring, but unfortunately they're very sparsely documented. The "&lt;em&gt;Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide&lt;/em&gt;" has some details, but nowhere as detailed as the individual CPU instructions.&lt;br /&gt;&lt;br /&gt;One particular MSR I've looked at is DCU_MISS_OUTSTANDING. &lt;a href="http://www.technovelty.org/code/arch/micro-analysis.html"&gt;This webpage&lt;/a&gt; has the following to say about it:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;DCU_MISS_OUTSTANDING gives a weighted measure of the cycles spent waiting for a cache line to be brought in. Each cycle spent waiting is weighted by the number of outstanding cache misses ... and has some caveats, but can be considered a rough estimate of the time spent waiting for a cache line to be brought in.&lt;/blockquote&gt;&lt;br /&gt;Anyway, when I first implemented multithreading, the code took 265 seconds to run, and I recorded about 400 billion CPU cycles spent on DCU_MISS_OUTSTANDING events. The final version took 116 seconds to run and I recorded about 100 billion CPU cycles. The difference is 149 seconds and 300 billion cycles. 300 billion cycles divided by 2.4G Hz (CPU clock speed) gives 125 seconds. So it looks like a lot of gains came from reducing the time spent waiting for data to be loaded into the cache.&lt;br /&gt; &lt;br /&gt;I won't pretend I understand DCU_MISS_OUTSTANDING or many other MSRs completely, I just hope intel provide better documentation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-2934509414861647777?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/2934509414861647777/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=2934509414861647777' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/2934509414861647777'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/2934509414861647777'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2009/06/calculating-316-million-movie.html' title='Calculating 316 million movie correlations in 2 minutes (down from 2.5 hours)'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_NEh4kUkAi48/Sb6kc-7tTnI/AAAAAAAAAA4/Edu5W_6iaNA/s72-c/Untitled.png' height='72' width='72'/><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-4139701591420465909</id><published>2009-02-23T13:26:00.000-08:00</published><updated>2009-02-23T13:28:04.316-08:00</updated><title type='text'>Almost A Year Since Last Update</title><content type='html'>Wow, it's almost a year since last update. What have I done in the last year ?&lt;br /&gt;&lt;br /&gt;*SVD++. This is similar to the SVD++ described by BellKor, or the transdusive MF described by Gravity.&lt;br /&gt;&lt;br /&gt;*Expanded KNN, combinations of movie/viewer and raw/residue/binary, and much faster too (raw movie correlation calcuation takes less than 2 minutes now). For all my KNNs, I no longer save huge correlation files, instead the program just calculates correlations and predictions in one pass and saves predictions only.&lt;br /&gt;&lt;br /&gt;*Recently got basic RBM working. This is the most confusing model for me. Basically I got it working thru trial and error. I still don't really understand it, and I have yet to make it work on residues of other predictors.&lt;br /&gt;&lt;br /&gt;*Read a lot of research papers. I actually spent more time reading papers than writing code.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-4139701591420465909?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/4139701591420465909/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=4139701591420465909' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/4139701591420465909'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/4139701591420465909'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2009/02/almost-year-since-last-update.html' title='Almost A Year Since Last Update'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-234234070808451544</id><published>2008-02-25T17:11:00.001-08:00</published><updated>2008-02-25T20:58:44.144-08:00</updated><title type='text'>Residues and Threads</title><content type='html'>I'm slowly changing all my algorithms to work on either raw data or residue data. Because raw data are integer values in the range of 1 to 5, and thus enabling some performance optimizations and shortcuts, but residue data are essentially floats, all my algorithms took a performance hit.&lt;br /&gt;&lt;br /&gt;At the same time, I'm adding multi-threading support to speed things up. The improvement is significant, but I'm already seeing diminishing returns: on a quad-core systems, matrix factorization using 4 cores runs at only 3.27x the speed of using one core.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-234234070808451544?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/234234070808451544/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=234234070808451544' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/234234070808451544'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/234234070808451544'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2008/02/residues-and-threads.html' title='Residues and Threads'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-8011102441048530624</id><published>2008-01-28T22:53:00.001-08:00</published><updated>2008-01-28T23:17:39.500-08:00</updated><title type='text'>Back to Basics</title><content type='html'>New year, now I'm playing with the most basic stuff: global effects. This time I'll try to be more patient and take a more systematic approach, instead of the "hack anything that looks promising" approach I've been taking.&lt;br /&gt;&lt;br /&gt;Right now I've a nice number for my qualifying RMSE: 0.8888. Hey, it takes real skill to hit such a nice number by chance ! I'll try to resist the urge to submit better results until I'm sure I've made a big improvement.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-8011102441048530624?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/8011102441048530624/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=8011102441048530624' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/8011102441048530624'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/8011102441048530624'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2008/01/back-to-basics.html' title='Back to Basics'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-6519544979074768851</id><published>2007-12-14T11:11:00.001-08:00</published><updated>2007-12-14T11:22:20.969-08:00</updated><title type='text'>Error distribution of different algorithms</title><content type='html'>&lt;a href="http://3.bp.blogspot.com/_NEh4kUkAi48/R2LXnuNoLDI/AAAAAAAAAAU/y207AiD5ylI/s1600-h/abc.PNG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5143910801830587442" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_NEh4kUkAi48/R2LXnuNoLDI/AAAAAAAAAAU/y207AiD5ylI/s400/abc.PNG" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-6519544979074768851?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/6519544979074768851/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=6519544979074768851' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/6519544979074768851'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/6519544979074768851'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2007/12/error-distribution-of-different.html' title='Error distribution of different algorithms'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_NEh4kUkAi48/R2LXnuNoLDI/AAAAAAAAAAU/y207AiD5ylI/s72-c/abc.PNG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-5930911781955475516</id><published>2007-09-10T17:28:00.000-07:00</published><updated>2008-03-11T01:21:53.972-07:00</updated><title type='text'>The Greater Collaborative Filtering Groupthink: KNN</title><content type='html'>(Note: this post assumes you're familiar with &lt;a href="http://www.netflixprize.com//rules"&gt;The Netflix Prize&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;In this post I'll describe my implementation of the KNN algorithm, which stands for K Nearest Newbies (with me being the biggest newbie around). A more descriptive name for my implementation is &amp;quot;movie-to-movie Pearson's r-based KNN with slope-one correction and other tweaks&amp;quot; - that's how I see it anyway. I was able to get a 0.9174 qualifying score using this method.&lt;br /&gt;&lt;br /&gt;In a nutshell, the KNN algorithm is this: to predict viewer v's rating for movie m, you first get the list of all the movies previously rated by viewer v. Among those movies, you find the K closest &amp;quot;neighbors&amp;quot; to movie m. Then you calculate the weighted-average of viewer v's ratings for those &amp;quot;neighbor&amp;quot; movies, and that’s your final prediction.&lt;br /&gt;&lt;br /&gt;The groupthink part of this algorithms lies in how you determine the closest neighbors.&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;br /&gt;&lt;B&gt;Making KNN Predictions&lt;/B&gt;&lt;br /&gt;&lt;br /&gt;Conceptually, to predict the rating for movie m by viewer v, I do the following:&lt;br /&gt;&lt;br /&gt;First get a list of all movies rated by viewer v, let's call this list L0.&lt;br /&gt;&lt;br /&gt;Pick a value for parameter MinCV, which is the minimum number of common  viewers required. By common viewers I mean viewers who rated both movies. Remove all movies in L0 that have less than MinCV common viewers with movie m. From trial runs, it seems 16 is a good number for MinCV.&lt;br /&gt; &lt;br /&gt;For each movie n left in the list L0, calculate the following data structure:&lt;br /&gt;&lt;pre&gt;  struct MovieNeighbor { //movie n as neighbor of movie m&lt;br /&gt;      unsigned int commonViewers;&lt;br /&gt;      float mAvg; //average rating of movie m&lt;br /&gt;      float nAvg; //average rating of movie n&lt;br /&gt;&lt;br /&gt;      float nRating; //current viewer's rating for movie n&lt;br /&gt;&lt;br /&gt;      float rRaw;    //raw Pearson's r value&lt;br /&gt;      float rLower; &lt;br /&gt;      float weight;&lt;br /&gt;  };&lt;/pre&gt;&lt;br /&gt;In the structure above, commonViewers is the number of viewers who rated both movie m and n, mAvg and nAvg are the average ratings of movie m and n by common viewers, respectively. And nRating is viewer v's rating for movie n.&lt;br /&gt;&lt;br /&gt;The rRaw is &lt;a href="http://en.wikipedia.org/wiki/Pearson%27s_r"&gt;Pearson's product moment coefficient&lt;/a&gt;, aka Pearson's r, for the movie pair in question. And rLower is the lower bound of the movie pair's Pearson's r at 95% &lt;a href="http://en.wikipedia.org/wiki/Confidence_interval"&gt;confidence interval&lt;/a&gt;. This is calculated using rRaw, commonViewers, and &lt;a href="http://en.wikipedia.org/wiki/Fisher_transformation"&gt;Fisher Transform/Fisher Inverse&lt;/a&gt;. If rRaw has a small absolute value, rLower may end up having a different sign than rRaw, in this case, I chose to set rLower to 0.&lt;br /&gt;&lt;br /&gt;Once rLower is calculated, I calculate the weight this way:&lt;br /&gt;&lt;pre&gt;    weight = rLower*rLower*log(commonViewers);&lt;/pre&gt;&lt;br /&gt;This is just one of the many ways to calculate weight. For example, you can change the confidence interval, you can use a power function on commonViewers, or a sigmoid function, or just linear with a max limit. I tried many different ways of calculating the weight and found the best ones gave very similar end results.&lt;br /&gt;&lt;br /&gt;Put all MovieNeighbor structures in a new list, let's call this list L1. Each item in list L1 represents a neighbor for movie m. And we don't need list L0 any more.&lt;br /&gt;&lt;br /&gt;Create a new MovieNeighbor structure, and set the following fields:&lt;br /&gt;&lt;pre&gt;    mAvg=average rating of movie m from all viewers&lt;br /&gt;    nAvg=ratingn=0;  &lt;br /&gt;    weight=log(MinCV); &lt;/pre&gt;&lt;br /&gt;add this new structure to list L1. This creates a dummy neighbor that predicts movie m's average rating. The effect of this dummy neighbor depends on the movie m's real neighbors. If movie m has lots of weak neighbors (low correlation, low number of common viewers), then the dummy neighbor will pull the prediction toward movie m's average. On the other hand, if movie m has many good-quality neighbors (high correlation, high number of common viewers), then the dummy neighbor will have no effect on the final outcome.&lt;br /&gt;&lt;br /&gt;Pick a value for parameter MaxN, which is the maximum number of neighbors used for prediction. Sort list L1 by weight and keep the MaxN items that have the greatest weights (implementation hint: don't call sort or qsort, use nth_element).&lt;br /&gt;&lt;br /&gt;For each item remaining in L1, calculate its prediction:&lt;br /&gt;&lt;pre&gt;    dif = ratingn - nAvg;&lt;br /&gt;    if (rRaw&lt;0) dif = -dif;&lt;br /&gt;    prediction = mAvg + dif;&lt;/pre&gt;&lt;br /&gt;The final prediction is the weighted-average of predictions of all remaining items in list L1. That's it !&lt;br /&gt;&lt;br /&gt;&lt;hr&gt;&lt;br /&gt;&lt;br /&gt;&lt;B&gt;Implementation Notes&lt;/B&gt;&lt;br /&gt;&lt;br /&gt;A straight-forward implementation of the above will work, but it will be slow and make it almost impractical to explore large number of parameter combinations. As you can see there're a number of parameters you can tune:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;confidence interval&lt;/li&gt;&lt;li&gt;power rLower&lt;/li&gt;&lt;li&gt;function of common viewers (log, power, sigmoid, etc)&lt;/li&gt;&lt;li&gt;minimum number of common viewers required&lt;/li&gt;&lt;li&gt;max number of neighbors&lt;/li&gt;&lt;li&gt;etc ...&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;To try large combinations of parameter values efficiently, you want to pre-calculate and share the computational overhead as much as possible.&lt;br /&gt;&lt;br /&gt;On the top of the list is the following for each movie pair in the data set:&lt;br /&gt;&lt;pre&gt;    struct MovieCorrelation { //for movie pair m-n&lt;br /&gt;    unsigned int commonViewers;&lt;br /&gt;    float mAvg;    //average rating of movie m&lt;br /&gt;    float nAvg;    //average rating of movie n&lt;br /&gt;    float rRaw;    //raw Pearson's r value&lt;br /&gt;}; &lt;/pre&gt;&lt;br /&gt;There're 17770 movies in the data set, so there're 17770x17770 (of which half are unique) MovieCorrelation structures we can calculate and save to a large file. This sounds like a major task, but it only needs to be done once (once your code is bug-free).&lt;br /&gt;&lt;br /&gt;And thanks to the raw speed of modern computers, it does not take as long as it looks. I managed to get it done in less than 20 minutes on my laptop, and my implementation is by no means optimized to the max. If you have a fast PC, it should take less than 15 minutes; and if you have enough memory to hold 17770x17770/2 (packed) records in memory, it will probably take less than 10 minutes.&lt;br /&gt;&lt;br /&gt;Once you have all that MovieCorrelation calculated and saved in a file, arrange your prediction code like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;  for each movie m to predict {&lt;br /&gt;    array0=read all MovieCorrelation's of movie m from file&lt;br /&gt;    for each value of 1st parameter {&lt;br /&gt;      array1=some_operation(array0);&lt;br /&gt;      for each value of 2nd parameter {&lt;br /&gt;        array2=some_operation(array1);&lt;br /&gt;        for each value of 3rd parameter {&lt;br /&gt;          array3=some_operation(array2);&lt;br /&gt;          do predictions using array3;       &lt;br /&gt;        }&lt;br /&gt;      }&lt;br /&gt;    }&lt;br /&gt;  }&lt;/pre&gt;&lt;br /&gt;The 1st parameter might be confidence interval, the 2nd parameter might be maximum number of neighbors, etc. After I wrote my prediction code like above, it took about one minute to generate one probe prediction, but to generate 32 probe predictions, it only took 3 minutes !&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-5930911781955475516?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/5930911781955475516/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=5930911781955475516' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/5930911781955475516'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/5930911781955475516'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2007/09/greater-collaborative-filtering.html' title='The Greater Collaborative Filtering Groupthink: KNN'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-3763165069277713264</id><published>2007-08-29T15:20:00.000-07:00</published><updated>2007-08-31T19:05:16.903-07:00</updated><title type='text'>The Great Collaborative Filtering Groupthink, Thinks Faster</title><content type='html'>(Note: this post assumes you're familiar with &lt;a href="http://www.netflixprize.com//rules"&gt;The Netflix Prize&lt;/a&gt;.)&lt;br /&gt;(Warning: long post)&lt;br /&gt;&lt;br /&gt;As described in the &lt;a href="http://dmnewbie.blogspot.com/2007/08/great-collaborative-filtering.html"&gt;“The Great Collaborative Filtering Groupthink”&lt;/a&gt; post, my viewer-clustering algorithm has an assignment step and an update step, repeated for many iterations.&lt;br /&gt;&lt;br /&gt;In my initial implementation, each iteration took about 90 seconds, for a set of 64 clusters (&lt;a href="http://dmnewbie.blogspot.com/2007/08/hardware-specification.html"&gt;see previous post for my hardware spec&lt;/a&gt;). I didn’t write down the initial performance numbers, but if I recall correctly, the assignment step took anywhere from 70 to 80 seconds, with the rest going to the update step.&lt;br /&gt;&lt;br /&gt;Now if a set of clusters takes 10 iterations to build, and I need to build 8 sets (with different random starting values), that’s 2 hours of running time already. I’d rather not wait for 2 hours every time I tweak the algorithm or fix a bug, plus I need to experiment with different number of clusters, so I looked for ways to make it run faster, eventually reducing the per-iteration running time to 17.6 seconds.&lt;br /&gt;&lt;br /&gt;Obviously the assignment step is the place to look first. The following is its pseudo-code:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;//loop thru data in viewer-cluster-movie order&lt;br /&gt;for each viewer {&lt;br /&gt;    minSquare=FLT_MAX; //maximum float value&lt;br /&gt;    for each cluster {&lt;br /&gt;        clusterSquare=0;&lt;br /&gt;        for each movie rated by the current viewer {&lt;br /&gt;            clusterSquare+=(ratingOfViewer – ratingOfCluster)^2;&lt;br /&gt;        } // for each movie rated by the current viewer&lt;br /&gt;        if (clusterSqueare &amp;lt minSquare) {&lt;br /&gt;            Assign the current viewer to the current cluster.&lt;br /&gt;            minSquare=clusterSquare;&lt;br /&gt;        }&lt;br /&gt;    } //for each cluster&lt;br /&gt;} //for each viewer&lt;/pre&gt;&lt;br /&gt;First, I tried an early-out approach for the inner loop:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;//loop thru data in viewer-cluster-movie order&lt;br /&gt;for each viewer {&lt;br /&gt;    minSquare=FLT_MAX; //maximum float value&lt;br /&gt;    for each cluster {&lt;br /&gt;        clusterSquare=0;&lt;br /&gt;        for each 5% of the movies rated by the current viewer {&lt;br /&gt;            for each movie in the current 5% {&lt;br /&gt;                clusterSquare+=(ratingOfViewer – ratingOfCluster)^2;&lt;br /&gt;            } // for each movie in the next 5%&lt;br /&gt;            if (clusterSqure &gt;= minSquare) skip current cluster.&lt;br /&gt;        } //repeat 20 times&lt;br /&gt;        Assign the current viewer to the current cluster.&lt;br /&gt;        minSquare=clusterSquare;&lt;br /&gt;    } //for each cluster&lt;br /&gt;} //for each viewer&lt;/pre&gt;&lt;br /&gt;Basically, I'm tracking the current minimum square of distance value in minSquare, and as soon as the current cluster's clusterSquare value exceeds minSquare, I move onto the next cluster. However this improved performance less than 10%. I think the reason is that for most clusters, you need to go thru most of the data before clusterSquare exceeds minSquare. It’s time to look deeper.&lt;br /&gt;&lt;br /&gt;For each set of clusters, I stored all the clusters’ rating data in a 2-D array:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;unsigned short ratings[clusterCount][movieCount];&lt;/pre&gt;&lt;br /&gt;That’s right, I used a 16-bit unsigned integer for each rating. The value in ratings[][] array is the real rating value times 1024. The data is stored in the natural cluster-by-movie order: the first index of the array is cluster index, and the second index of the array is movie index. Here’s the memory layout of the data (from low memory address to high memory address):&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;c0m0, c0m1, c0m2, …… c0m[movieCount-2], c0m[movieCount-1],&lt;br /&gt;c1m0, c1m1, c1m2, …… c1m[movieCount-2], c1m[movieCount-1],&lt;br /&gt;c2m0, c2m1, c2m2, …… c2m[movieCount-2], c2m[movieCount-1],&lt;br /&gt;...&lt;/pre&gt;&lt;br /&gt;where c0m0 is the cluster 0’s rating for movie 0, and c0m1 is the cluster 0’s rating for movie 1, and so on.&lt;br /&gt;&lt;br /&gt;In the Netflix data set, there’re 17770 movies and about half million viewers, but only about 100 million ratings, so on average each viewer rated about one in 89 movies.&lt;br /&gt;&lt;br /&gt;Therefore in my inner loops above (for each movie rated by the viewer), the code is skipping thru the rows of the array from left to right. On average, the memory address of each rating value accessed is about 89 ratings (178 bytes) away from the previous address. So it’s a guaranteed cache miss in most cases, with horrible performance penalties. Most of the time, the CPU is just sitting there waiting for the data to arrive from memory.&lt;br /&gt;&lt;br /&gt;OK, given this scenario, the obvious solution is to turn the ratings[][] array sideways, swapping the first index and the second index:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;unsigned short ratings[movieCount] [clusterCount];&lt;/pre&gt;&lt;br /&gt;and the new memory layout:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;m0c0, m0c1, m0c2, …… m0c[clusterCount-2], m0c[clusterCount-1],&lt;br /&gt;m1c0, m1c1, m1c2, …… m1c[clusterCount-2], m1c[clusterCount-1],&lt;br /&gt;m2c0, m2c1, m2c2, …… m2c[clusterCount-2], m2c[clusterCount-1],&lt;br /&gt;...&lt;/pre&gt;&lt;br /&gt;where m0c0 is the cluster 0’s rating for movie 0, and m0c1 is the cluster 1’s rating for movie 0, and so on. Now we go thru the data in a new order:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;//loop thru data in viewer-movie-cluster order&lt;br /&gt;float clusterSquares[clusterCount];&lt;br /&gt;for each viewer {&lt;br /&gt;    set all elements of clusterSquares[] to zero.&lt;br /&gt;    for each movie rated by the current viewer {&lt;br /&gt;        for each cluster {&lt;br /&gt;            clusterSquares[clusterIndex]+=(ratingOfViewer – ratingOfCluster)^2;&lt;br /&gt;        }&lt;br /&gt;    } // for each movie rated by the current viewer&lt;br /&gt;    Assign the viewer to the cluster that has the smallest value in clusterSquares[].&lt;br /&gt;} // for each viewer {&lt;/pre&gt;&lt;br /&gt;Now in the inner loop, the code is scanning thru the array/memory sequentially with no skips, greatly reducing cache miss.&lt;br /&gt;&lt;br /&gt;At this time I also changed the data type of ratings[][] array from unsigned short (2 bytes) to unsigned char (1 byte), to reduce the amount of data that need to be accessed. Yes I lose some accuracy this way, but the probe RMSE using 1-byte rating value is only a few 0.0001’s worse than before, so the effect is negligible for the result we’re trying to achieve here.&lt;br /&gt;&lt;br /&gt;With both changes above, the assignment step now took about 44.3 seconds, and the update step took about 8.0 seconds (yes I wrote these numbers down), for a total of 52.3 seconds, a good improvement from the previous 90 seconds.&lt;br /&gt;&lt;br /&gt;But wait, there’s more. The new inner-loop is scanning thru the array/memory sequentially, so we can easily take advantage of &lt;a href="http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/ia32/pentium4/83886.htm"&gt;data prefetching&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;However, given that cluster count is relatively small, prefetching within the cluster rating data of each movie (each row in the array/memory layout) is probably not useful. So I decided to prefetch by movie, at the second-most inner-loop level:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;for each movie rated by the current viewer {&lt;br /&gt;    movieToPrefetch=moviesRatedByViewer[currentIndex +prefetchOffset];&lt;br /&gt;    prefetch memory address &amp;ratings[movieToPrefetch][0]&lt;br /&gt;    ...&lt;br /&gt;} // for each movie rated by the current viewer&lt;/pre&gt;&lt;br /&gt;Now, there're a few caveats about prefetching:&lt;br /&gt;&lt;br /&gt;First of all, for production code, you obviously want to write some CPU detection code to make sure the running CPU supports prefetching instructions.&lt;br /&gt;&lt;br /&gt;Secondly, the optimal &lt;a href="http://www.intel.com/cd/ids/developer/asmo-na/eng/195896.htm?page=6"&gt;prefetch distance depends on the number of cycles per iteration, the stride of the data access and the latency to main memory&lt;/a&gt; (yes the article talks about Itanium, but the general principle applies to Pentium too). Again, for production code, you should write some extra code to detect the ideal prefetch distance on the running PC, not a trivial task.&lt;br /&gt;&lt;br /&gt;And finally, there’s a potential crash bug in the calculation of the prefetch memory address:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;MovieToPrefetch=moviesRatedByViewer[currentIndex +prefetchOffset];&lt;/pre&gt;&lt;br /&gt;Because of prefetchOffset, we’re reading beyond the memory allocated for moviesRatedByViewer[], a potential crash.&lt;br /&gt;&lt;br /&gt;In reality, it probably won’t crash, because moviesRatedByViewer[] sits in a block of memory pages allocated by the C++’s runtime memory manager, and the memory immediately after moviesRatedByViewer[] most likely belongs to the memory manager, so our program can still read it. The only way it could crash is if moviesRatedByViewer[] ends at or near the last byte of the last page of the block of pages it sits in.&lt;br /&gt;&lt;br /&gt;But still, it’s better to just fix this. Fortunately, the fix is very simple: just allocate some extra bytes for moviesRatedByViewer[]. If you estimate your maximum prefetchOffset is N, just allocate an extra N*sizeof(*moviesRatedByViewer) bytes.&lt;br /&gt;&lt;br /&gt;After I added prefetching, and did a few trial runs to find the best prefetchOffset value, the running time of assignment step went from 44.3 seconds to 32.2 seconds. Another 27% improvement.&lt;br /&gt;&lt;br /&gt;But wait, there’s more. The new inner-loop is scanning thru array/memory sequentially, doing the same type of operation on each data element. This practically screams out for SIMD optimization. &lt;br /&gt;&lt;br /&gt;It’s always fun to improve performance by writing assembly language code, but we rarely have good reasons to do it nowadays. Now, if processing 100 million pieces of data, repeatedly, in a non-production environment, isn’t a good excuse for programming down to the metal, then I don’t know what is.&lt;br /&gt;&lt;br /&gt;So I translated the inner-loop to inline assembly code using SSE instructions. Even a straight-forward translation, without any tuning at all, reduced the running time of assignment step from 32.2 seconds down to 10.4 seconds! After a few runs to find a new optimal prefetch distance, the running time dropped further to 9.6 seconds. A huge improvement !&lt;br /&gt;&lt;br /&gt;So in summary, the per-iteration running times in seconds:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Original implementation: total=90&lt;/li&gt;&lt;li&gt;Rearrangement/reduction of data: assignement=44.3, update=8.0, total=52.5&lt;/li&gt;&lt;li&gt;Prefetch: assignement=32.2, update=8.0, total=40.2&lt;/li&gt;&lt;li&gt;SSE inline assembly code: assignment= 9.6, update=8.0, total=17.6&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;An 80% reduction in running time, or 5 times boost in speed.&lt;br /&gt;&lt;br /&gt;Next up: The Greater Collaborative Filtering Groupthink (KNN) &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-3763165069277713264?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/3763165069277713264/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=3763165069277713264' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/3763165069277713264'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/3763165069277713264'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2007/08/great-collaborative-filtering_29.html' title='The Great Collaborative Filtering Groupthink, Thinks Faster'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-2935557659601742479</id><published>2007-08-28T15:30:00.000-07:00</published><updated>2007-09-10T21:00:29.760-07:00</updated><title type='text'>Development Environment</title><content type='html'>(Note: this post assumes you're familiar with &lt;a href="http://www.netflixprize.com//rules"&gt;The Netflix Prize&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;Before I get into how fast "The Great Collaborative Filtering Groupthink" can think, it’s useful to specify the hardware that’s doing the thinking. My main development PC is a Dell Inspiron 1300 laptop, with the following specs:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;CPU: 1.5 GHz Intel Celeron M&lt;/li&gt;&lt;li&gt;L1 Cache: 32K ? (not sure)&lt;/li&gt;&lt;li&gt;L2 Cache: 1 MB or 512K (not sure)&lt;/li&gt;&lt;li&gt;RAM: 1.25 GB PC2-4200 SDRAM (minus 128MB for video RAM)&lt;/li&gt;&lt;li&gt;DRAM bus width: 64 bits&lt;/li&gt;&lt;li&gt;FSB: 400 MHz or 533 MHz (not sure)&lt;/li&gt;&lt;li&gt;Hard disk: ?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Two things to note here:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;It’s actually quite difficult it is to nail down the detailed spec. Of course for most daily uses they don’t matter that much.&lt;/li&gt;&lt;li&gt;Most of the items listed above have something to do with memory, that’s because for the Netflix task, memory bandwidth is probably the biggest bottleneck once the data is in memory. How you access memory will make a big difference in performance.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Other misc info:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;OS: Windows XP Professional&lt;/li&gt;&lt;li&gt;IDE: &lt;a href="http://msdn.microsoft.com/vstudio/express/"&gt;Microsoft Visual C++ Express and Visual C# Express&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Scripting: &lt;a href="http://www.ruby-lang.org/"&gt;Ruby&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Random number library: &lt;a href="http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html"&gt;Mersenne Twister&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Linear algebra package: &lt;a href="http://math.nist.gov/tnt/index.html"&gt;TNT/JAMA&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;If you can suggest better alternatives to the two libraries above, please let me know.&lt;/p&gt;&lt;p&gt;Now, onto "The Great Collaborative Filtering Groupthink Thinks Faster" …&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-2935557659601742479?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/2935557659601742479/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=2935557659601742479' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/2935557659601742479'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/2935557659601742479'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2007/08/hardware-specification.html' title='Development Environment'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-1334539240734378660</id><published>2007-08-24T14:45:00.000-07:00</published><updated>2007-08-28T11:05:13.177-07:00</updated><title type='text'>The Great Collaborative Filtering Groupthink</title><content type='html'>(Note: this post assumes you're familiar with &lt;a href="http://www.netflixprize.com//rules"&gt;The Netflix Prize&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;Now, some details on algorithms. I'll start with my viewer-clustering approach first.&lt;br /&gt;&lt;br /&gt;The basic idea is to come up with a number of "representative" viewers, with each one representing a group of viewers with similar tastes. In various literatures these "representative" viewers are usually called "centroid" or "cluster" - I'll use the term "cluster" from now on.&lt;br /&gt;&lt;br /&gt;In our case, each cluster is really a made-up viewer representing a group of viewers. The data for each cluster are just movie ratings, one rating value for each movie:&lt;br /&gt;&lt;blockquote&gt;unsigned char clusterMovieRatings[MOVIE_CNT];&lt;/blockquote&gt;Note I used a single byte (unsigned char) for each rating. I'll write more on this detail later.&lt;br /&gt;&lt;br /&gt;When it’s time to make a decision (prediction), each viewer just asks all the clusters he belongs to for a prediction, and then calculate the weighted average of the clusters' predictions. Of course, the predictions of each cluster is the average prediction of all viewers belonging to that cluster. Now you see why I call this algorithm the great collaborative filtering groupthink.&lt;br /&gt;&lt;br /&gt;OK, how do you come up with the clusters ? I used the basic &lt;a href="http://en.wikipedia.org/wiki/K-means_algorithm"&gt;k-means algorithm&lt;/a&gt;:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Initialization: Setup some clusters, set their movie rating values to random.&lt;/li&gt;&lt;li&gt;Assignment step: For each viewer, determine which cluster the viewer belongs to, i.e. the cluster that is most similar to the viewer. I used the square of Euclidean distance as the measure of similarity, that is, the sum of squares of difference between the viewer’s rating and the cluster’s rating, over the movies the viewer has rated. Of course, the smaller the number the greater the similarity.&lt;/li&gt;&lt;li&gt;Update step: Go thru each cluster. For each movie, calculate the its average rating from the viewers who both belong to this cluster and have rated this movie. Replace the cluster’s rating for this movie with the average.&lt;/li&gt;&lt;li&gt;Repeat the previous two steps a number of times, or you can make a prediction for the probe data set and decide if to stop based on its RMSE.&lt;/li&gt;&lt;/ul&gt;To make a prediction for viewer v and movie m:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Find the K clusters that are most similar to viewer v. Again, the square of Euclidean distance is used to measure similarity. Calculate the weight for each cluster from the square of distance: weight=1/(squaredDistance + offset). The offset is a small positive number, it’s there so we don’t get a divide by zero if a viewer happen to match a cluster perfectly.&lt;/li&gt;&lt;li&gt;Calculate the weighted average of the K clusters' ratings for movie m, and that’s the predicted rating for viewer v and movie m.&lt;/li&gt;&lt;/ul&gt;That’s it. Pretty simple, isn’t it ? Of course there’re enhancement to the basic k-means algorithm such as soft k-means and k-means++ (and I don’t see any reason why they can’t be combined to produce a soft k-means++ algorithm), but I’m trying to get the simple and easy stuff working first. And given what I tried next, they probably don’t matter in the end.&lt;br /&gt;&lt;br /&gt;With the algorithms above, I tried different combinations of the following parameters:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;C: the total number of clusters.&lt;/li&gt;&lt;li&gt;I: the number of iterations used to build the clusters.&lt;/li&gt;&lt;li&gt;K: the number of most similar clusters used for prediction.&lt;/li&gt;&lt;/ul&gt;In the end, C=64, I=10, and K=5 seemed the best. Furthermore, since we set the clusters’ ratings to random values at the start of the algorithm, we can run the algorithm multiple times, using a different set of random initial values each time, and then average the predictions to get a better RMSE. Basically we create more groups among the viewers, and averaging nudges the predictions toward a more reliable, common value.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;span style="font-size:130%;"&gt;Here a mystery surfaced&lt;/span&gt;&lt;/strong&gt;: K=6 or 7 consistently showed the best RMSE for raw predictions, but K=3 or 4 gave the best result when averaging multiple sets of predictions.&lt;br /&gt;&lt;br /&gt;At this stage I started tweaking the algorithm. What I found was all tweaks improved the raw RMSE, but the averaged RMSEs stayed pretty much the same, &lt;strong&gt;&lt;span style="font-size:130%;"&gt;another mystery&lt;/span&gt;&lt;/strong&gt;. Here’re some probe RMSE from raw predictions with C=64 and I=10:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;K=2, RMSE=0.9551&lt;br /&gt;K=3, RMSE=0.9480&lt;br /&gt;K=4, RMSE=0.9450&lt;br /&gt;K=5, RMSE=0.9437&lt;br /&gt;K=6, RMSE=0.9433&lt;br /&gt;K=7, RMSE=0.9433&lt;br /&gt;K=8, RMSE=0.9437&lt;br /&gt;K=10, RMSE=0.9448&lt;br /&gt;K=15, RMSE=0.9489&lt;br /&gt;K=20, RMSE=0.9536&lt;br /&gt;K=25, RMSE=0.9585&lt;br /&gt;K=30, RMSE=0.9635&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;and here are some probe RMSEs from averaging multiple sets with C=64, I=10:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Average of 2, K=2, RMSE=0.9440&lt;br /&gt;Average of 2, K=3, RMSE=0.9408&lt;br /&gt;Average of 2, K=4, RMSE=0.9398&lt;br /&gt;Average of 2, K=5, RMSE=0.9397&lt;br /&gt;Average of 2, K=6, RMSE=0.9400&lt;br /&gt;Average of 2, K=7, RMSE=0.9407&lt;br /&gt;&lt;br /&gt;Average of 4, K=2, RMSE=0.9386&lt;br /&gt;Average of 4, K=3, RMSE=0.9374&lt;br /&gt;Average of 4, K=4, RMSE=0.9374&lt;br /&gt;Average of 4, K=5, RMSE=0.9379&lt;br /&gt;Average of 4, K=6, RMSE=0.9387&lt;br /&gt;Average of 4, K=7, RMSE=0.9396&lt;br /&gt;&lt;br /&gt;Average of 8, K=2, RMSE=0.9358&lt;br /&gt;Average of 8, K=3, RMSE=0.9356&lt;br /&gt;Average of 8, K=4, RMSE=0.9360&lt;br /&gt;Average of 8, K=5, RMSE=0.9368&lt;br /&gt;Average of 8, K=6, RMSE=0.9378&lt;br /&gt;Average of 8, K=7, RMSE=0.9388&lt;/blockquote&gt;So in summary, groups are good, individuals are bad. We’re the Borg, you’ll be assimilated.&lt;br /&gt;&lt;br /&gt;Next up: The Great Collaborative Filtering Groupthink Thinks Faster&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-1334539240734378660?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/1334539240734378660/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=1334539240734378660' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/1334539240734378660'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/1334539240734378660'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2007/08/great-collaborative-filtering.html' title='The Great Collaborative Filtering Groupthink'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-8337157299931085183</id><published>2007-08-22T22:51:00.000-07:00</published><updated>2007-08-28T10:57:08.114-07:00</updated><title type='text'>The Great Collaborative Filtering Framework</title><content type='html'>&lt;p&gt;(Note: this post assumes you're familiar with &lt;a href="http://www.netflixprize.com//rules"&gt;The Netflix Prize&lt;/a&gt;.)&lt;/p&gt;&lt;p&gt;Continuing from the last blog, I plodded along and wrote some boring C# code to generate my binary files. Another task is to extract the probe data from the full data set. In the end I had 3 sets of data: the full data set, the probe-less data set, and the probe data set. For each set, I created two binary files: one sorted by movie ID first and then by viewer ID, the other sorted by viewer ID first and then by movie ID.&lt;/p&gt;&lt;p&gt;Next, I generated a prediction file using weighted averages of movie and viewer average ratings. To do that I had to get a least squares fitting/linear regression program going. This program is also used later to blend results from different algorithms. Plus I had to write a program to grade the probe predictions. The RMSE number I got agreed with what other people were reporting, so I was more or less on the right track. Whew, it was a lot of work just to get a basic framework going. But it should be smooth sailing from now on, right ?&lt;/p&gt;&lt;p&gt;Well, before I go any further, there're questions to be answered:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Should I filter the data somehow ? As others have reported, there're oddball viewers in the data set. Like the guy who rated about 17000 movies, or the guy who rated thousands of movies in one day.&lt;/li&gt;&lt;li&gt;Should I do some sort of global offset or normalization ?&lt;/li&gt;&lt;li&gt;What to do about rating dates ? If you look at the global averages vs dates, you'll see the averages steadily going up, and there was an abrupt jump of about 0.2 around April 2004.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Hmm, I don't have any good answers, but a few hokey theories and SWAGs (scientific wild ass guess) came to the rescue:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Filtering: not needed, as the number of oddball viewers is too small and their effect is probably negligible.&lt;/li&gt;&lt;li&gt;Global offset/normalization: other people have reported getting RMSE in the low 0.98s using only movie/viewer averages. That’s within .03 of Netflix’s Cinematch algorithm, which is a non-trivial algorithm, which has to do more than just offset/normalization, if it does it at all. So the effect of offset/normalization is probably around 0.015. So I think as a data mining newbie just getting started on this, I can ignore this issue for now, I’ll worry about it after I exhaust other possibilities.&lt;/li&gt;&lt;li&gt;Rating dates: this might be nothing. It’s possible that as time progresses, Netflix collected more and more data and continuously updated their software, so people get better and better recommendations, and as a result the ratings got better and better. As for the abrupt jump around April 2004, maybe they did a major software upgrade/bug fix around that time.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Actually, by this time I was so tired of reading research papers and so eager to get a nontrivial algorithm going, I’ll make any excuse to take the “damn the torpedoes full speed ahead” approach.&lt;/p&gt;&lt;p&gt;Next up: the great collaborative filtering groupthink, aka clustering.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-8337157299931085183?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/8337157299931085183/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=8337157299931085183' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/8337157299931085183'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/8337157299931085183'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2007/08/great-collaborative-filtering-framework.html' title='The Great Collaborative Filtering Framework'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-1063130005568918381</id><published>2007-08-16T11:29:00.000-07:00</published><updated>2010-08-27T00:22:57.853-07:00</updated><title type='text'>Great Expectations and Database</title><content type='html'>&lt;p&gt;(Note: this post assumes you're familiar with &lt;a href="http://www.netflixprize.com//rules"&gt;The Netflix Prize&lt;/a&gt;.)&lt;/p&gt;&lt;p&gt;When I first came across the Netflix prize, I just couldn't resist urge to participate. It looked like a well-planned contest with an interesting problem. It would test and improve your coding skills and your understanding of interesting algorithms. Plus there's a $1,000,000 prize. So I eagerly signed up, downloaded the (large) data set, read the documentations, and pondered, how to start ?&lt;br /&gt;&lt;br /&gt;At first, I thought this was a good project with which to improve my relatively limited knowledge of C#. I also had the idea of putting all the data into a database, so I could easily run queries like "gimme all viewers who gave this movie a score of 3", and I could learn the C#/.NET interface to database at the same time. So the version 0 of my grand plan was:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Improve my C# skillz.&lt;/li&gt;&lt;li&gt;Learn C# and .NET database interface.&lt;/li&gt;&lt;li&gt;Read about and implement interesting algorithms.&lt;/li&gt;&lt;li&gt;Win the one million dollar prize.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Hehehe... good thing I didn't get XML and Haskell into the mess.&lt;br /&gt;&lt;br /&gt;So I setup a database, and started writing a C# program to reformat the raw data files from Netflix. That worked out pretty well and soon I had some data in the database. But as soon as I tried accessing the database I noticed it is unbearably slow for this task. It was probably OK even if I wanted to run thousands of queries at a time, but a prediction algorithm needs to go thru all 100 million pieces of data, and more. With everything in the database, even the simplest algorithm would take a few hours to run. This seemed to mirror the experience of other people. Maybe it's workable if I have a cluster of 1000 fast PCs at my disposal, but with at most 2 computers at my disposal, the database approach would never work.&lt;/p&gt;&lt;p&gt;So I abandoned the database-based approach, and switched to sorting and saving data in binary files for fast access. This time I put a little more thought into it and decided to use memory-mapped file as my primary means of getting data off the disk. This simplified my data access and has worked out very well so far.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-1063130005568918381?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/1063130005568918381/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=1063130005568918381' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/1063130005568918381'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/1063130005568918381'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2007/08/great-expectations-and-database.html' title='Great Expectations and Database'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9207672005321238583.post-651929629078278777</id><published>2007-08-10T14:34:00.001-07:00</published><updated>2007-08-20T16:58:51.191-07:00</updated><title type='text'>My Netflix Attempt</title><content type='html'>I've been working on the &lt;a href="http://www.netflixprize.com/"&gt;Netflix prize&lt;/a&gt; contest off and on, basically the it's a data-mining challenge with a $1 million dollar prize. I'll be describing some of the approaches I took, hopefully someone will give me some feedback that will improve my score (0.899 as of July 20, 2007).&lt;br /&gt;&lt;br /&gt;So far I've the following algorithms up and running: SVD (based on &lt;a href="http://www.timelydevelopment.com/Demos/NetflixPrize.htm"&gt;Timely Development's code&lt;/a&gt;, thank you!), KNN, clustering, MLE(or is it EM ? I'm not sure, anyway it didn't work), and a couple of amateur-ish schemes I dreamt up, which also didn't work out (by not working I mean did not beat Netflix's Cinematch's RMSE). Oh by the way, this is the first time I've ever implemented these algorithms.&lt;br /&gt;&lt;br /&gt;While I certainly have enough programming experience for this contest, I'm not a mathematician or a statistician, and I have never dabbled in data mining before. So you won't find deep theoretical musings on topics like sampling techniques, SVD, restricted Boltzmann machine guns here. I'll concentrate on my implementation detail and try to keep my understanding and my terminology correct. All malaprops may or may not be intestinal.&lt;br /&gt;&lt;br /&gt;More details to follow.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9207672005321238583-651929629078278777?l=dmnewbie.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dmnewbie.blogspot.com/feeds/651929629078278777/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9207672005321238583&amp;postID=651929629078278777' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/651929629078278777'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9207672005321238583/posts/default/651929629078278777'/><link rel='alternate' type='text/html' href='http://dmnewbie.blogspot.com/2007/08/my-netflix-attempt.html' title='My Netflix Attempt'/><author><name>Bo Yang</name><uri>http://www.blogger.com/profile/14537616692777428127</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
