I'm slowly changing all my algorithms to work on either raw data or residue data. Because raw data are integer values in the range of 1 to 5, and thus enabling some performance optimizations and shortcuts, but residue data are essentially floats, all my algorithms took a performance hit.
At the same time, I'm adding multi-threading support to speed things up. The improvement is significant, but I'm already seeing diminishing returns: on a quad-core systems, matrix factorization using 4 cores runs at only 3.27x the speed of using one core.