4 releases

0.3.2 Feb 14, 2023
0.3.1 Feb 12, 2023
0.2.2 Feb 11, 2023
0.2.0 Feb 7, 2023

#815 in Concurrency

40 downloads per month

GPL-3.0 license

59KB
1K SLoC

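gosh-remote turns any multi-process parallel script into one that distributes its work across multiple nodes in an HPC environment.

Usage

  1. Install the scheduler and workers automatically using mpirun in the batch system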

    run-gosh-remote.sh

    # install scheduler
    gosh-remote bootstrap as-scheduler &
    
    # install workers on allocated nodes from batch system
    mpirun gosh-remote -v bootstrap as-worker
    

    The script above is an ordinary batch script and can be submitted to the batch system with a command such as bsub:

    bsub -J test -R "span[ptile=24]" -n 72 ./run-gosh-remote.sh
    

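    The command above requests 3 nodes for remote execution (72 slots at 24 per node).

  2. Change the job script

    The main script calls test.sh with 3 processes in parallel: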

    xargs -P 3 -n 1 gosh-remote client run <<<$'./test.sh ./test.sh ./test.sh'
    

    The job script test.sh:

    #! /usr/bin/env bash
    echo running on $(hostname)
    

    Example output:

    "Ok(\"{\\\"JobCompleted\\\":\\\"running on node037\\\\n\\\"}\")"
    "Ok(\"{\\\"JobCompleted\\\":\\\"running on node038\\\\n\\\"}\")"
    "Ok(\"{\\\"JobCompleted\\\":\\\"running on node042\\\\n\\\"}\")"
    
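Each result line above is a quoted, escaped string that wraps the job's standard output. The sketch below pulls the plain text back out with sed. It is only an illustration based on the three sample lines shown here: the function name `extract_job_output` is hypothetical, the output format may differ between gosh-remote versions, and the pattern assumes the job output starts with a letter or digit and contains no backslash.

```shell
#!/usr/bin/env bash

# Sample result lines, copied verbatim from the example output above.
sample_output=$(cat <<'EOF'
"Ok(\"{\\\"JobCompleted\\\":\\\"running on node037\\\\n\\\"}\")"
"Ok(\"{\\\"JobCompleted\\\":\\\"running on node038\\\\n\\\"}\")"
"Ok(\"{\\\"JobCompleted\\\":\\\"running on node042\\\\n\\\"}\")"
EOF
)

# Hypothetical helper: read result lines on stdin and print the text that
# follows the JobCompleted key, up to the first backslash (the escaped newline).
extract_job_output() {
    sed -n 's/.*JobCompleted[^[:alnum:]]*\([^\\]*\)\\.*/\1/p'
}

printf '%s\n' "$sample_output" | extract_job_output
# prints:
# running on node037
# running on node038
# running on node042
```

In a real run the function could be placed at the end of the xargs pipeline. Lines that do not contain JobCompleted produce no output, so failures would need separate handling.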

Example in action (for magman)

run.sh

The main script, which installs the scheduler and workers and then runs magman

#! /usr/bin/env bash

set -x
#export SPDKIT_RANDOM_SEED=2227866437669085292

LOCK_FILE="gosh-remote-scheduler.lock"
# run at most MAX_NPROC processes at a time
MAX_NPROC=8

# step 1: start remote execution services
(
# install scheduler on the master node; the service address will be recorded in LOCK_FILE
gosh-remote -v bootstrap -w "$LOCK_FILE" as-scheduler &

# use mpirun to install one worker on each node by creating a machinefile for mpirun
which mpirun
# for LSB batch system, we can read nodes from env var
#echo $LSB_HOSTS |xargs -n 1| uniq | xargs -I{} echo {}:1>machines
# or
mpirun hostname | sort | uniq |xargs -I{} echo {}:1 >machines
# works for Intel MPI, MPICH, MVAPICH
mpirun -bootstrap=ssh -prepend-rank -machinefile machines gosh-remote -vv bootstrap as-worker -w "$LOCK_FILE"
) 2>&1 | tee gosh-remote.log &
sleep 2

# step 2: run magman
# NOTE: to run VASP remotely through the scheduler, run-vasp.sh needs to be set up accordingly
# write clean output to magman.out and everything (including stderr) to magman.log
(magman -j $MAX_NPROC -r -vvv | tee magman.out) 2>&1 | tee magman.log

# step 3: when magman done, kill background services
sleep 1
pkill gosh-remote
# could be better if using mpirun?
# mpirun pkill gosh-remote
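
run.sh gives the background services a fixed `sleep 2` before starting magman. A possible alternative is to poll for the lock file instead. The sketch below rests on an assumption the source does not confirm: that the scheduler writes a non-empty LOCK_FILE only once it is ready. The helper name `wait_for_file` and the default timeout are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: wait_for_file PATH [TIMEOUT_SECONDS]
# Returns 0 as soon as PATH exists and is non-empty, or 1 after the timeout.
wait_for_file() {
    local path=$1
    local timeout=${2:-30}
    local waited=0

    # -s is true when the file exists and has a size greater than zero
    until [ -s "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for $path" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

In run.sh this could replace the sleep with `wait_for_file "$LOCK_FILE" 60 || exit 1`. A stale lock file left over from an earlier run would make the check pass immediately, so it may be worth removing any old file before starting the scheduler.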

run-vasp.sh

The script that calls VASP

#! /usr/bin/env bash

# get root directory path of this script file
SCRIPT_DIR=$(dirname "$(realpath "${BASH_SOURCE[0]:-$0}")")
LOCK_FILE="$SCRIPT_DIR/gosh-remote-scheduler.lock"

# NOTE: the "-host" option is required for avoiding process migration due to
# nested mpirun call
gosh-remote -vv client -w "$LOCK_FILE" run "mpirun -np 72 -host \$(hostname) vasp"
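
The backslash in `\$(hostname)` matters: it keeps the command substitution as literal text in the string handed to gosh-remote, so that it can be expanded later on the worker rather than on the node where run-vasp.sh runs. The sketch below shows only the quoting difference in plain bash. That the worker evaluates the string with a shell is an assumption suggested by the escaping, not something the source states. The variable names are illustrative.

```shell
#!/usr/bin/env bash

# Escaped: the text $(hostname) stays in the string unexpanded,
# so whoever runs the string later decides which host name is used.
deferred="mpirun -np 72 -host \$(hostname) vasp"

# Unescaped: the local shell substitutes its own host name immediately.
eager="mpirun -np 72 -host $(hostname) vasp"

printf 'deferred: %s\n' "$deferred"
printf 'eager:    %s\n' "$eager"
```

With the eager form, every job would be pinned to the submitting node's name, which is the opposite of what the "-host" option is meant to achieve here.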

FAQ

How to avoid conflicts from nested mpirun calls

In the client script, make sure vasp runs on the worker node and is not migrated to other nodes:

mpirun -host `hostname` vasp

Dependencies

~40–56MB
~1M SLoC