Fixes an issue with `wait_for_tasks()` and adds a lower-overhead
`push_loop` helper. We replace our usage of `parallelize_loop` with
`push_loop` as we didn't use the multi-future vector return and don't
need the extra overhead.
Thread pools are long-lasting executors that have close to zero overhead
when launching new jobs. This is advantageous over creating new threads
as we can use this for threading smalling jobs and smaller quanta. It
also avoids the heuristics needed to determine the optimal number of
threads to spawn