Modification to run on Summit and support login / batch / compute architecture #467
dustinvanstee wants to merge 1 commit
… node architecture
Here is a sample of how I create an LSFCluster using this idea:
from dask_jobqueue import LSFCluster
Thanks a lot for your PR! This is certainly a bit of a hack, but I think what you want can currently be achieved by doing:
import sys
dask_worker_prefix = "jsrun -n1 -a1 -g0 -c1"
cluster = LSFCluster(
    ...,
    python=f"{dask_worker_prefix} {sys.executable}")
For how to do it in a cleaner way in the longer term, using Jinja templates (and allowing users to tweak the Jinja template) seems a good way forward. Unfortunately I don't think I will find time to look at #370 any time soon.
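To illustrate why prefixing the `python` argument is enough, here is a minimal sketch. It assumes the worker command is formed by placing the `python` value in front of `-m distributed.cli.dask_worker`, which is what the sample job script in this PR suggests. `build_worker_command` is a hypothetical helper written for this illustration, not a dask_jobqueue function.

```python
import sys


def build_worker_command(python, scheduler_address, nthreads=8, memory_limit="4.00GB"):
    """Hypothetical helper: compose a dask-worker command line.

    This is NOT dask_jobqueue's implementation; it only mirrors the layout
    of the command seen in the sample job script from this PR.
    """
    return (
        f"{python} -m distributed.cli.dask_worker {scheduler_address} "
        f"--nthreads {nthreads} --memory-limit {memory_limit}"
    )


# Launcher prefix taken from the discussion above.
dask_worker_prefix = "jsrun -n1 -a1 -g0 -c1"

# Passing "prefix + interpreter" as the python value makes the launcher
# the first token of the resulting command.
command = build_worker_command(
    python=f"{dask_worker_prefix} {sys.executable}",
    scheduler_address="tcp://10.41.0.33:40063",
)
print(command)
```

Because the prefix becomes the first token of the command, the scheduler sees a `jsrun` invocation that in turn starts the Python interpreter on a compute node.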
@lesteve thanks for your feedback. This does exactly what I need, appreciate it! Closing PR.
I am a user on Summit, which uses LSF for job submission. It has a unique login / batch / compute node architecture, so a job submitted via LSF needs the jsrun wrapper to precede any command run on the cluster, as shown in this sample job script (otherwise the job just stays on the batch node and never gets to the compute node).
#!/usr/bin/env bash
#BSUB -J dask-worker
#BSUB -P xxx201
#BSUB -W 00:30
#BSUB -nnodes 1
jsrun -n1 -a1 -g0 -c1 /ccs/home/vanstee/.conda/envs/powerai-ornl/bin/python -m distributed.cli.dask_worker tcp://10.41.0.33:40063 --nthreads 8 --memory-limit 4.00GB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://
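The job script above can be pictured as a template with an optional launcher slot in front of the worker command. The following is a minimal sketch of that idea using Python's standard-library `string.Template`. It is not Jinja and not dask_jobqueue's actual job template; `JOB_SCRIPT`, `render_job_script`, and the field names are assumptions made for illustration.

```python
from string import Template

# Illustrative template only; the header fields mirror the sample script above.
JOB_SCRIPT = Template("""\
#!/usr/bin/env bash
#BSUB -J $job_name
#BSUB -P $project
#BSUB -W $walltime
#BSUB -nnodes $nnodes
$command_line
""")


def render_job_script(worker_command, launcher="", **header):
    """Hypothetical helper: render a job script, optionally prefixing a launcher."""
    # Join only the non-empty parts so an empty launcher leaves no stray space.
    command_line = " ".join(part for part in (launcher, worker_command) if part)
    return JOB_SCRIPT.substitute(command_line=command_line, **header)


script = render_job_script(
    "python -m distributed.cli.dask_worker tcp://10.41.0.33:40063",
    launcher="jsrun -n1 -a1 -g0 -c1",
    job_name="dask-worker",
    project="xxx201",
    walltime="00:30",
    nnodes=1,
)
print(script)
```

With an empty `launcher`, the same template renders a plain script for clusters that do not need a wrapper, which is the kind of flexibility a user-tweakable template would give.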
I modified core.py to achieve this goal, and would like to see if some version of this idea could make it into the dask_jobqueue library. Thanks.