dask cluster manager using wireguard and ansible

leylines

this repo manages a dask cluster whose nodes are linked over wireguard, so they may be separated by WAN[^1]. it includes an opinionated mini wireguard manager (on the server side only; workers use wg-quick) that doubles as an ansible inventory plugin. finally, ansible playbooks run setup and deployment for the dask nodes

how to

install the server

(cd leylines-monocypher && pip3 install --user .)
(cd leylines && pip3 install --user .)
mkdir -p ~/.config/leylines

ok now take a moment to edit leylines-support/leylines-daemon.service so it runs as your user (change User= and Group=). put that file into your /etc/systemd/system and then do

sudo systemctl enable --now leylines-daemon

congrats, wireguard should be up. next, edit leylines-support/nginx.conf: change the listen address and the SSL certificate paths, pointing them at the letsencrypt directories for a domain you have already provisioned and that your nginx is serving. put that block into your /etc/nginx/nginx.conf. to expose your dask dashboard publicly, also adjust leylines-support/nginx-http.conf to your needs and include it in an http server block. it may be easier to do the http part first, then run certbot on the domain to get the certs provisioned, and then set up the stream block using the same certs that certbot inserted for https

then run

sudo nginx -s reload

install client

now that the server is running, you may choose to access it remotely. run leylines print-token on the server and make a note of the output -- this is the auth token you will need. then, on your client (local laptop, or something)

(cd leylines && pip3 install --user .)
mkdir -p ~/.config/leylines
echo "auth token here" > ~/.config/leylines/token
echo "mycluster.domain.lgbt" > ~/.config/leylines/host
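
for reference, here is a minimal sketch of how a client could read those two files back. load_client_config is a made-up helper for illustration, not part of the leylines package, and the real CLI may well parse things differently; only the directory and the two file names come from the steps above

```python
from pathlib import Path
from typing import Optional, Tuple


def load_client_config(config_dir: Optional[Path] = None) -> Tuple[str, str]:
    """Return (host, token) read from a leylines-style config directory.

    hypothetical helper: only the file names come from the README steps.
    """
    if config_dir is None:
        config_dir = Path.home() / ".config" / "leylines"
    # echo appends a trailing newline, so strip whitespace from both values
    host = (config_dir / "host").read_text().strip()
    token = (config_dir / "token").read_text().strip()
    if not host or not token:
        raise ValueError(f"empty host or token in {config_dir}")
    return host, token
```

if this raises, the most likely cause is a missing or empty file in ~/.config/leylines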

now you can access your server using the CLI. initialize it and add some nodes. in the init command provide the server's externally-facing public IP, and provide an SSH key that can be used to access it for ansible. then, to add workers provide a name for each one and an SSH key

leylines init -n myserver -i 1.2.3.4 -k path/to/ssh-key
leylines add -n worker-0 -k path/to/ssh-key
...
leylines add -n worker-n -k path/to/ssh-key

sync wireguard settings (this applies the configuration to the server's wireguard interface)

leylines sync

get status

leylines status

connect a worker

get config for a node

leylines get-conf <id>

manually copy that config to /etc/wireguard/leyline-wg.conf on your worker node, and then run systemctl enable --now wg-quick@leyline-wg there

currently the wireguard topology is a star. this doesn't actually work optimally for my config, where some nodes are colocated and should have direct connections to each other and others should go over WAN to reach distant nodes. this will be changed in a later version
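
to make the star layout concrete, here is a toy sketch (not code from this repo; star_peers and route are invented names) showing who peers with whom and why worker-to-worker traffic takes a detour through the server

```python
from typing import Dict, List


def star_peers(server: str, workers: List[str]) -> Dict[str, List[str]]:
    """Map each node to the peers it holds a direct tunnel with (star layout)."""
    peers = {server: list(workers)}      # the hub peers with every worker
    for worker in workers:
        peers[worker] = [server]         # each spoke only peers with the hub
    return peers


def route(peers: Dict[str, List[str]], server: str, src: str, dst: str) -> List[str]:
    """Nodes a packet visits going from src to dst in the star."""
    if dst in peers[src]:
        return [src, dst]                # direct tunnel exists
    return [src, server, dst]            # otherwise relay through the hub


peers = star_peers("myserver", ["worker-0", "worker-1"])
print(route(peers, "myserver", "worker-0", "worker-1"))
# -> ['worker-0', 'myserver', 'worker-1']
```

two colocated workers therefore still bounce every packet off the server, which is the inefficiency described above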

provision workers

run the ansible playbook. this will provision the needed components for dask on the server and all workers

cd leylines-ansible
ansible-playbook -i leylines_inv.py playbook-setup.yml

the first run will take a while. it builds python 3.9.5 and installs it, then builds a virtualenv with python dependencies in it, and then installs and starts systemd user services
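
if you are curious what ansible gets from an inventory script: by convention it runs the script with --list and reads JSON from stdout. below is a rough sketch of that JSON shape, assuming leylines_inv.py follows the script-inventory convention. the group names, addresses, and the build_inventory helper are all made up for the example; the real output may differ

```python
import json
from typing import Dict


def build_inventory(nodes: Dict[str, dict]) -> dict:
    """Build the JSON structure ansible expects from `--list`.

    nodes maps a node name to {"ip": ..., "key": ..., "server": bool};
    this input shape is invented for the example.
    """
    inventory = {
        "server": {"hosts": []},
        "workers": {"hosts": []},
        "_meta": {"hostvars": {}},
    }
    for name, info in nodes.items():
        group = "server" if info.get("server") else "workers"
        inventory[group]["hosts"].append(name)
        # standard ansible connection variables, set per host
        inventory["_meta"]["hostvars"][name] = {
            "ansible_host": info["ip"],
            "ansible_ssh_private_key_file": info["key"],
        }
    return inventory


nodes = {
    "myserver": {"ip": "10.0.0.1", "key": "path/to/ssh-key", "server": True},
    "worker-0": {"ip": "10.0.0.2", "key": "path/to/ssh-key"},
}
print(json.dumps(build_inventory(nodes), indent=2))
```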

now you can open <your server's wireguard ip>:31336 to view the dask dashboard (or if you are proxying it with nginx, it should be available there too)
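
if the dashboard does not load, a quick reachability check over the tunnel can narrow things down. this is just a sketch using the standard library; port_open and check_cluster are illustrative names, and the ports are the ones this README uses (31336 for the dashboard, 31337 for the scheduler)

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_cluster(host: str) -> dict:
    """Check the two ports this README uses on the server's wireguard ip."""
    return {
        "scheduler": port_open(host, 31337),
        "dashboard": port_open(host, 31336),
    }
```

call check_cluster with your server's wireguard ip. if both come back False, the wireguard link itself is the first thing to look at (leylines status)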

use the cluster with

from dask.distributed import Client
client = Client("<your server's wireguard ip>:31337")

or more easily

from leylines.dask import init_dask
client = init_dask()

or

from leylines.dask import init_dask_async
client = await init_dask_async()

leylines.dask also provides tqdmprogress which can be used in the place of distributed.diagnostics.progress for a task monitor using tqdm, and tqdm_await which can be used with an iterable of dask futures to display progress as they go (but only for async clients)

from leylines.dask import tqdm_await

futures = [...]  # some list of dask futures
# optionally pass pbar=<a tqdm instance to use>
async for fut in tqdm_await(futures):
    print(fut.result())
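
the general pattern behind this kind of helper can be sketched with plain asyncio. this is not the tqdm_await implementation: as_completed_with_progress is an invented name, it works on ordinary awaitables rather than dask futures, it yields results rather than futures, and it prints a counter where a tqdm bar would be updated

```python
import asyncio
from typing import AsyncIterator, Awaitable, Iterable


async def as_completed_with_progress(
    aws: Iterable[Awaitable], report=print
) -> AsyncIterator:
    """Yield results as the awaitables finish, reporting progress each time."""
    tasks = [asyncio.ensure_future(aw) for aw in aws]
    total = len(tasks)
    done = 0
    for next_done in asyncio.as_completed(tasks):
        result = await next_done
        done += 1
        report(f"{done}/{total} done")   # a tqdm bar would call pbar.update(1) here
        yield result


async def main() -> list:
    async def work(delay: float, value: int) -> int:
        await asyncio.sleep(delay)
        return value

    jobs = [work(0.03, 3), work(0.01, 1), work(0.02, 2)]
    return [r async for r in as_completed_with_progress(jobs)]


print(asyncio.run(main()))
```

results come back in completion order, not submission order, which is what makes a per-item progress bar meaningful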

time for magic

copy leylines-support/02-dask.py into ~/.ipython/profile_default/startup

this provides 2 new spells: %dask connects to your cluster, and %daskworker splits off a new ipython console on a worker selected by having free RAM available and not being busy. this is useful for ad-hoc code testing on a real worker
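
the worker selection can be pictured roughly like this. it is a guess at the heuristic described above, not the code in 02-dask.py: pick_worker and the dictionary keys are invented for the example

```python
from typing import Dict, Optional


def pick_worker(workers: Dict[str, dict]) -> Optional[str]:
    """Pick the idle worker with the most free memory, or None if all are busy.

    workers maps an address to {"memory_limit": bytes, "memory_used": bytes,
    "processing": number of running tasks}; these keys are invented here.
    """
    idle = {addr: w for addr, w in workers.items() if w["processing"] == 0}
    if not idle:
        return None
    return max(idle, key=lambda a: idle[a]["memory_limit"] - idle[a]["memory_used"])


workers = {
    "tcp://10.0.0.2:4000": {"memory_limit": 8e9, "memory_used": 6e9, "processing": 0},
    "tcp://10.0.0.3:4000": {"memory_limit": 8e9, "memory_used": 1e9, "processing": 0},
    "tcp://10.0.0.4:4000": {"memory_limit": 16e9, "memory_used": 0, "processing": 3},
}
print(pick_worker(workers))
# -> tcp://10.0.0.3:4000
```

the third worker has the most free RAM but is skipped because it is busy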

%dask also installs client, a reference to the client, and tqdmprogress from leylines.dask, and upload which uploads a file and returns a delayed function which will fetch the filename on a worker

resources

there is an abstract idea of nodes having resources, which can be controlled by leylines add-resource and leylines del-resource (and leylines status shows you the resources). currently each resource is assigned with quantity 1 when the workers start. due to a limitation of dask every worker process inherits the same quantity of resources. you can assign resources in a more ad-hoc way by opening an ipython session to a worker and then calling await distributed.get_worker().set_resources(someresource=1), which will temporarily assign that to the worker. if you modify resources through leylines you will need to run the ansible playbook again to apply the changes. you can use --start-at-task "install systemd task" to save some time
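
as an illustration of the "quantity 1" behaviour, here is a sketch of how a worker command line with resources could be assembled. worker_args is a made-up helper and the resource names are examples; how the ansible playbook actually builds the command is an assumption. the --resources flag itself is the dask-worker option, which takes a space-separated string and applies it to each worker process

```python
from typing import Iterable, List


def worker_args(scheduler: str, resources: Iterable[str]) -> List[str]:
    """Build a dask-worker command line that advertises each resource once."""
    args = ["dask-worker", scheduler]
    # dask-worker takes resources as one space-separated string, e.g. "gpu=1 ssd=1"
    spec = " ".join(f"{name}=1" for name in sorted(resources))
    if spec:
        args += ["--resources", spec]
    return args


print(worker_args("tcp://10.0.0.1:31337", ["gpu", "bigdisk"]))
# -> ['dask-worker', 'tcp://10.0.0.1:31337', '--resources', 'bigdisk=1 gpu=1']
```

on the client side, a task asks for a resource with client.submit(fn, resources={"gpu": 1}), and dask will only schedule it on workers that advertise it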