This was flagged by Coverity but doesn't seem to be an actual issue with
g++/clang. It technically leaves the moved-from object in a "valid but
unspecified state", so it is best avoided. The single copy into an
lvalue is (I think) cheap.
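
As a rough illustration of the pattern (not the actual code touched by
this change), the sketch below contrasts reading an object after it has
been moved from with making one explicit copy into an lvalue first. The
names `Queue`, `risky`, and `safe` are hypothetical and exist only for
this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sink that takes ownership of whatever it is given.
struct Queue {
    std::vector<std::string> items;
    void push(std::string s) { items.push_back(std::move(s)); }
};

// The kind of code a static analyzer may flag: `name` is read after it
// has been moved from. The read is legal, but the contents of `name`
// are unspecified at that point, so the result cannot be relied on.
std::size_t risky(Queue& q, std::string name) {
    q.push(std::move(name));
    return name.size();  // use after move
}

// Alternative: make one explicit copy into an lvalue and move the copy.
// `name` is never moved from, so reading it afterwards is well defined.
std::size_t safe(Queue& q, const std::string& name) {
    std::string copy = name;   // the single copy
    q.push(std::move(copy));   // `copy` is not touched again
    return name.size();
}
```

Calling `safe(q, "hello")` stores `"hello"` in `q.items` and returns its
length. What `risky` returns depends on the standard library
implementation.
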

Fixes an issue with `wait_for_tasks()` and adds a lower-overhead
`push_loop` helper. We replace our usage of `parallelize_loop` with
`push_loop` because we never used the multi-future vector it returns
and don't need the extra overhead.

Thread pools are long-lived executors with close to zero overhead when
launching new jobs. This is an advantage over creating new threads, as
we can thread smaller jobs and smaller quanta. It also avoids the
heuristics needed to determine the optimal number of threads to spawn.
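
To make the idea concrete, here is a minimal sketch of a fixed-size pool
built only on the standard library. It is not the thread pool library
used by this change. `MiniPool`, `submit`, `wait_all`, and `sum_squares`
are hypothetical names, and the sketch assumes jobs do not throw.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: workers are created once and reused, so
// submitting a job costs a queue push rather than a thread launch.
class MiniPool {
public:
    explicit MiniPool(std::size_t n) {
        if (n == 0) n = 1;
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~MiniPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        work_cv_.notify_all();
        for (auto& w : workers_) w.join();  // queued jobs are drained first
    }
    MiniPool(const MiniPool&) = delete;
    MiniPool& operator=(const MiniPool&) = delete;

    // Queue a job; it runs on whichever worker becomes free first.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        work_cv_.notify_one();
    }

    // Block until every job submitted so far has finished.
    void wait_all() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_cv_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                work_cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so workers do not serialize
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            idle_cv_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable work_cv_;
    std::condition_variable idle_cv_;
    std::queue<std::function<void()>> jobs_;
    std::size_t pending_ = 0;  // queued plus currently running jobs
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Usage sketch: many tiny jobs share the same long-lived workers.
long sum_squares(MiniPool& pool, int n) {
    std::atomic<long> total{0};
    for (int i = 1; i <= n; ++i)
        pool.submit([&total, i] { total += static_cast<long>(i) * i; });
    pool.wait_all();  // `total` must outlive the jobs that reference it
    return total.load();
}
```

The same pool can be reused across calls, which is where the saving over
spawning fresh threads comes from: the worker count is chosen once, when
the pool is constructed, rather than per workload.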