You could use hardware acceleration to speed up the blit.

DirectDraw has BltFast, and you can also use GDI+ to perform the blit. Both methods can take advantage of hardware acceleration, so any video card with decent 2D support should execute the copy almost instantly.

There is even a wrapper DLL so you can use GDI+ without writing your own C DLL.

Do a Google search.