Sure, GPU-acceleration is helpful when it comes to "speeding things up" -
but these 3D-APIs are not as comfortable to use as a 2D-lib ...

Though a combination of the two is possible as well of course...

E.g. you could build up your scene (with all the small Objects and render-outputs) via the 2D-cairo-API -
and then "hand over" (upload) your ScreenSurface to the GPU -
doing the final "BlendOps" hardware-accelerated (including superfast stretching to your target-hWnd or target-hDC).

I have to say though, that on "decently sized" Game-Surfaces (e.g. 1024x768),
a normal "Overlay-BlendOp" takes about 0.3msec.
And even the more expensive "Multiply-BlendOp" will take only 5-7msec.
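
For anyone unsure what a "Multiply-BlendOp" actually computes, here is a minimal per-channel sketch in Python. It assumes fully opaque 8-bit pixels and ignores alpha; the function names are made up for illustration and are not part of cairo or any wrapper around it.

```python
def multiply_channel(src, dst):
    """Multiply-blend two 8-bit channel values (0..255), rounding to nearest."""
    return (src * dst + 127) // 255


def multiply_pixel(src_rgb, dst_rgb):
    """Multiply-blend two opaque RGB pixels channel by channel (alpha ignored)."""
    return tuple(multiply_channel(s, d) for s, d in zip(src_rgb, dst_rgb))


# White leaves the backdrop unchanged, black always gives black,
# anything in between darkens the backdrop.
print(multiply_pixel((255, 255, 255), (200, 100, 50)))
print(multiply_pixel((128, 128, 128), (128, 128, 128)))
```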

So, if your whole "scene-buildup" needs 13msec - and the final "multiply-blendop" adds 7msec -
you'll end up with 20msec total, which is still good for about 50FPS.
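
The same frame-budget arithmetic as a small Python sketch, in case you want to plug in your own timings. The helper name is hypothetical, and the 13msec and 7msec values are just the example figures from above, not new measurements.

```python
def frame_budget(*stage_msec):
    """Return (total msec per frame, resulting frames per second)."""
    total = sum(stage_msec)
    return total, 1000.0 / total


# scene-buildup + final multiply-blendop, in msec
total, fps = frame_budget(13, 7)
print(total, fps)  # 20 50.0
```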

Olaf