I'm slowly changing all my algorithms to work on either raw data or residue data. Raw data are integer values in the range 1 to 5, which enables some performance optimizations and shortcuts. Residue data, however, are essentially floats, so all my algorithms took a performance hit.
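Here is a minimal sketch of the idea. It is not the code behind this post. One hypothetical template function, `sum_squared_error`, accepts either raw ratings stored as `std::uint8_t` or residuals stored as `float`. The sketch assumes both are kept in `std::vector`s next to float predictions, and the names and layout are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sum of squared errors between stored values and
// predictions. V is std::uint8_t for raw ratings (1..5, one byte each) or
// float for residuals (typically four bytes each). The compiler generates
// one copy of this function per value type used.
template <typename V>
double sum_squared_error(const std::vector<V>& values,
                         const std::vector<float>& predictions) {
    assert(values.size() == predictions.size());
    double sse = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        double err = static_cast<double>(values[i]) - predictions[i];
        sse += err * err;
    }
    return sse;
}
```

A `std::uint8_t` occupies one byte per value, while a `float` typically occupies four. That size difference is one reason integer raw data can be scanned more cheaply than float residue data.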
At the same time, I'm adding multi-threading support to speed things up. The improvement is significant, but I'm already seeing diminishing returns: on a quad-core system, matrix factorization using 4 cores runs only 3.27x as fast as it does on one core.
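The 3.27x figure can be read through parallel efficiency and Amdahl's law. The sketch below assumes, for illustration only, that the shortfall comes from a fixed serial fraction of the work. Memory-bandwidth contention between cores is another plausible cause, and Amdahl's law does not model it. The function names are hypothetical.

```cpp
#include <cassert>

// Parallel efficiency: measured speedup divided by the number of threads.
double parallel_efficiency(double speedup, int threads) {
    assert(threads > 0);
    return speedup / threads;
}

// Amdahl's law: S(N) = 1 / ((1 - p) + p / N).
// Solving for the parallel fraction p from a measured speedup S on N threads:
//   p = (1 - 1/S) / (1 - 1/N)
double amdahl_parallel_fraction(double speedup, int threads) {
    assert(threads > 1);  // N = 1 would divide by zero
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / threads);
}

// Speedup that Amdahl's law predicts on `threads` cores for parallel fraction p.
double amdahl_speedup(double p, int threads) {
    return 1.0 / ((1.0 - p) + p / threads);
}
```

With S = 3.27 and N = 4, efficiency comes out to about 0.82 and the implied parallel fraction to about 0.93. If that fraction were the only limit, `amdahl_speedup(p, 8)` would predict only about 5.3x on eight cores.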
2 comments:
Hi Newman!
This is a little off-topic; I couldn't find a better place to ask.
I found your comment about a C++ optimisation technique that I'm trying to understand:
"I didn't mention that replacing int32 with int16 or int8 gave me the biggest performance boost, no doubt because the absolute majority of viewers have fewer than 65536 ratings, and a large portion of them (more than 40%, if I remember correctly) have fewer than 256 ratings, so the amount of memory you access is significantly smaller. In my C++ implementation, I just wrote a template function and have one copy each for int32, int16, and int8."
Does that mean you have 3 functions with the same signature apart from the integer type: one that works with int32, another with int16, and a third with int8? I'm not familiar with C++ templates... :-(
Could you please explain it in more detail? Or could you point me to a web page that describes the technique?
Thanks in advance.
Wojtek
I wrote a template function and have 3 "instantiations" of the template. So in the source code there's only one function, which works with a type T. In the compiled code there are 3 functions, where T is unsigned int32, int16, or int8.
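Below is a minimal sketch of the pattern this reply describes. It rests on a few assumptions. The function `sum_selected`, the helper `index_bytes_for`, and the data layout are all hypothetical and are not the author's actual code. In the assumed layout, each viewer's entries are indices into that viewer's own rating list. The source contains one function template, and the three explicit instantiations make the compiler emit three separate functions, one per unsigned width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Single function template in the source code. T is the unsigned type used
// to store a viewer's entries (hypothetical layout: each entry indexes into
// that viewer's own rating list, so a narrower T means less memory to read).
template <typename T>
float sum_selected(const std::vector<T>& indices,
                   const std::vector<float>& ratings) {
    float sum = 0.0f;
    for (T idx : indices) {
        assert(static_cast<std::size_t>(idx) < ratings.size());
        sum += ratings[idx];
    }
    return sum;
}

// Explicit instantiations: the compiled code contains three distinct
// functions, with T = uint32_t, uint16_t and uint8_t respectively.
template float sum_selected<std::uint32_t>(const std::vector<std::uint32_t>&,
                                           const std::vector<float>&);
template float sum_selected<std::uint16_t>(const std::vector<std::uint16_t>&,
                                           const std::vector<float>&);
template float sum_selected<std::uint8_t>(const std::vector<std::uint8_t>&,
                                          const std::vector<float>&);

// Hypothetical helper: bytes per entry for a viewer with `count` ratings,
// using the thresholds from the comment (fewer than 256 / 65536 ratings).
int index_bytes_for(std::size_t count) {
    if (count < 256) return 1;    // use the uint8_t instantiation
    if (count < 65536) return 2;  // use the uint16_t instantiation
    return 4;                     // use the uint32_t instantiation
}
```

Explicit instantiation is strictly needed only when the template's definition lives in a .cpp file that other files call into. Otherwise, each call such as `sum_selected(indices8, ratings)` automatically instantiates the matching copy.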