There is probably more than one way to do it. You could build a queue of MP3s to play and then iterate over it using either sync or async calls, either via ActiveMovie/DirectShow or another mechanism.
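
As a rough sketch of the queue idea, assuming the goal is to read numbers aloud from one pre-recorded clip per digit: the file names (`0.mp3` to `9.mp3`) and the `play` callback are made up for illustration, and you would plug in whatever playback call you actually use (DirectShow or otherwise).

```python
from collections import deque
from typing import Callable, Deque


def build_clip_queue(number: int) -> Deque[str]:
    """Return one clip file name per digit, in spoken order.

    The "<digit>.mp3" naming scheme is hypothetical.
    """
    if number < 0:
        raise ValueError("this sketch only handles non-negative numbers")
    return deque(f"{digit}.mp3" for digit in str(number))


def play_queue(queue: Deque[str], play: Callable[[str], None]) -> None:
    """Drain the queue front to back, calling play() once per clip.

    play() is assumed to block until the clip finishes (the sync case).
    """
    while queue:
        play(queue.popleft())
```

With asynchronous playback you would instead pop the next clip from the queue in the player's "finished" notification rather than in a loop.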

But Windows Text to Speech is getting pretty good. Take a look at this SAPI demo. It runs fine on Windows 10 without installing anything extra, and it should work all the way back to XP SP3, if not earlier.

I typed numbers into the textbox and it pronounced them reasonably well.
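
If you want to drive the same SAPI voice from a script instead of the demo, something like the following might work. It is only a sketch: it assumes Python with the pywin32 package (which is an extra install, unlike SAPI itself), and the helper names `make_voice` and `speak` are mine. `SAPI.SpVoice` and its `Speak(text, flags)` method are the standard SAPI automation interface.

```python
SVSF_ASYNC = 1  # SpeechVoiceSpeakFlags.SVSFlagsAsync; 0 is the default (synchronous)


def make_voice():
    """Create the SAPI voice COM object (Windows only; needs pywin32)."""
    import win32com.client  # imported lazily so the module loads elsewhere too
    return win32com.client.Dispatch("SAPI.SpVoice")


def speak(text, voice, wait=True):
    """Speak text through the given voice object.

    With wait=True the call returns after speech finishes; otherwise
    SAPI queues the text and returns immediately.
    """
    flags = 0 if wait else SVSF_ASYNC
    voice.Speak(str(text), flags)
```

On a Windows machine, `speak(1234, make_voice())` should read the number aloud. Passing `wait=False` queues each call, so SAPI handles the sequencing for you.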