Your handcrafted routine isn't *that* slow; just turn on compiler optimization.
As for resampling, I had started to port my parallel bitmap resampler to FMX, but then I thought, hey, these guys can use DirectDraw, so there won't be any demand.
Now, seeing how poor the quality of the (supposedly) bilinear rescaling is, I have continued working on it. A first version is usable, on Windows only for the time being. I just have to add some demos, and I'll probably upload it later today to
https://github.com/rmesch/Parallel-Bitmap-Resampler
Just in case you might be interested.