# leylines
this repo manages a dask cluster over wireguard, linking nodes which may be separated by WAN[^1]. it includes an opinionated mini wireguard manager (on the server side; workers use wg-quick) that doubles as an ansible inventory plugin. finally, ansible playbooks handle setup and deployment for the dask nodes.
## how to
### install the server
```
(cd leylines-monocypher && pip3 install --user .)
(cd leylines && pip3 install --user .)
mkdir -p ~/.config/leylines
```
ok now take a moment to edit `leylines-support/leylines-daemon.service` so it runs as your user (change `User=` and `Group=`). put that into your `/etc/systemd/system` and then do

```
sudo systemctl enable --now leylines-daemon
```
congrats, wireguard should be up. next, edit `leylines-support/nginx.conf` (change the listen address and the SSL certificate paths -- point those towards the letsencrypt directories for a domain you have already provisioned and that your nginx is serving). put that block into your `/etc/nginx/nginx.conf`.

to expose your dask dashboard publicly, also adjust `leylines-support/nginx-http.conf` to your needs and include it in an http server block. it may be advantageous to do that first, run certbot on the domain to get the certs provisioned, and then set up the stream block using the same certs certbot inserted for https.
then run

```
sudo nginx -s reload
```
### install client
now that the server is running, you may choose to access it remotely. make a note of the output of `leylines print-token` -- this is the auth token you will need. on your client (local laptop, or something):
```
(cd leylines && pip3 install --user .)
mkdir -p ~/.config/leylines
echo "auth token here" > ~/.config/leylines/token
echo "mycluster.domain.lgbt" > ~/.config/leylines/host
```
now you can access your server using the CLI. initialize it and add some nodes. in the `init` command, provide the server's externally-facing public IP and an SSH key that can be used to access it for ansible. then, to add workers, provide a name and an SSH key for each one:
```
leylines init -n myserver -i 1.2.3.4 -k path/to/ssh-key
leylines add -n worker-0 -k path/to/ssh-key
...
leylines add -n worker-n -k path/to/ssh-key
```
sync wireguard settings (this applies the configuration to the server's wireguard interface):

```
leylines sync
```

get status:

```
leylines status
```
### connect a worker
get the config for a node:

```
leylines get-conf <id>
```
manually copy that config to your worker node as `/etc/wireguard/leyline-wg.conf` and then

```
systemctl enable --now wg-quick@leyline-wg
```
currently the wireguard topology is a star. this doesn't actually work optimally for my setup, where some nodes are colocated and should have direct connections to each other while others should go over WAN to reach distant nodes. this will be changed in a later version.
### provision workers
run the ansible playbook. this will provision the needed components for dask on the server and all workers:

```
cd leylines-ansible
ansible-playbook -i leylines_inv.py playbook-setup.yml
```
the first run will take a while: it builds and installs python 3.9.5, builds a virtualenv with the python dependencies, and then installs and starts the systemd user services.
now you can open `<your server's wireguard ip>:31336` to view the dask dashboard (or, if you are proxying it with nginx, it should be available there too).
use the cluster with

```python
from dask.distributed import Client
client = Client("<your server's wireguard ip>:31337")
```

or more easily

```python
from leylines.dask import init_dask
client = init_dask()
```
or

```python
from leylines.dask import init_dask_async
client = await init_dask_async()
```
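either way, once `client` is connected you can sanity-check the cluster with a throwaway task (a sketch using the standard dask API; the task here is just an illustration):

```python
import operator

# submit a trivial task and wait for the result to confirm workers are reachable
fut = client.submit(operator.add, 1, 2)
print(fut.result())  # 3

# list the workers the scheduler currently knows about
print(list(client.scheduler_info()["workers"].keys()))
```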
`leylines.dask` also provides `tqdmprogress`, which can be used in place of `distributed.diagnostics.progress` for a task monitor using tqdm, and `tqdm_await`, which can be used with an iterable of dask futures to display progress as they go (but only for async clients):
```python
futures = [ some list of futures ... ]
async for fut in tqdm_await(futures, pbar=<optional tqdm instance to use>):
    print(fut.result())
```
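for a synchronous client, a minimal sketch, assuming `tqdmprogress` accepts futures the same way `distributed.diagnostics.progress` does:

```python
from leylines.dask import init_dask, tqdmprogress

client = init_dask()
futures = client.map(lambda x: x ** 2, range(100))  # placeholder workload
tqdmprogress(futures)                               # tqdm-based task monitor
results = client.gather(futures)
```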
## time for magic
copy `leylines-support/02-dask.py` into `~/.ipython/profile_default/startup`. this provides 2 new spells: `%dask` connects to your cluster, and `%daskworker` splits off a new ipython console on a worker selected for having free RAM available and not being busy. this is useful for ad-hoc code testing on a real worker.
`%dask` also installs `client`, a reference to the client, `tqdmprogress` from `leylines.dask`, and `upload`, which uploads a file and returns a delayed function that will fetch the filename on a worker.
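in a fresh ipython session the flow looks roughly like this (a sketch; the submitted task is just an illustration):

```python
%dask                                  # connect to the cluster; installs client and friends
fut = client.submit(sum, range(1000))  # submit work through the injected client
fut.result()
%daskworker                            # split off an ipython console on a free worker
```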
## resources
there is an abstract idea of nodes having resources, which can be controlled with `leylines add-resource` and `leylines del-resource` (and `leylines status` shows you the resources). currently this assigns them with quantity 1 when starting the workers. due to a limitation of dask, every worker process inherits the same quantity of resources. you can assign resources in a more ad-hoc way by opening an ipython session to a worker and calling `await distributed.get_worker().set_resources(someresource=1)`, which will temporarily assign that resource to the worker. if you modify resources through leylines you will need to run the ansible playbook again to apply the changes. you can use `--start-at-task "install systemd task"` to save some time.
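on the consuming side, tasks can be pinned to workers that advertise a resource with dask's standard `resources=` argument (a sketch; `someresource` stands in for whatever name you added with `leylines add-resource`):

```python
from leylines.dask import init_dask

client = init_dask()

def needs_special_hardware(x):
    # placeholder for work that actually requires the resource
    return x * 2

# dask will only schedule this task on workers advertising `someresource`
fut = client.submit(needs_special_hardware, 21, resources={"someresource": 1})
print(fut.result())
```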