Here is a simple example of how to run a Julia script on a SLURM cluster. To run a Julia script with multiple workers, you need to allocate some nodes and then have ClusterManagers use srun to start Julia on those nodes.
See the main.sh script for an example.
    #!/usr/bin/sh
    # start an allocation with 4 nodes 2 cpus per node and run the sbatch script
    # which will start multiple julia process in a Julia Cluster.
    salloc --nodes=4 --cpus-per-task 2 | sbatch julia.sbatch
This runs the following batch script, which acquires resources within the allocation and starts the main Julia process.
    #!/bin/sh
    #SBATCH --time=00:15:00
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=1
    # the resources requested above must be within the allocation

    # we need to load the julia module so that the paths are set up right.
    module load julia
    # this starts the julia script which will srun its own processes
    julia slurm.jl
The Julia script, slurm.jl, is then responsible for starting the workers:
    #=
    Julia code for launching jobs on the slurm cluster.

    This code is expected to be run from an sbatch script after a module load julia command has been run.
    It starts the remote processes with srun within an allocation.
    If you get an error make sure to Pkg.checkout("ClusterManagers").
    =#

    try
        using ClusterManagers
    catch
        Pkg.add("ClusterManagers")
        Pkg.checkout("ClusterManagers")
    end

    using ClusterManagers

    # Arguments to the Slurm srun(1) command can be given as keyword
    # arguments to addprocs.  The argument name and value is translated to
    # a srun(1) command line argument as follows:
    # 1) If the length of the argument is 1 => "-arg value",
    #    e.g. t="0:1:0" => "-t 0:1:0"
    # 2) If the length of the argument is > 1 => "--arg=value"
    #    e.g. time="0:1:0" => "--time=0:1:0"
    # 3) If the value is the empty string, it becomes a flag value,
    #    e.g. exclusive="" => "--exclusive"
    # 4) If the argument contains "_", they are replaced with "-",
    #    e.g. mem_per_cpu=100 => "--mem-per-cpu=100"
    np = 4  # number of worker processes to start
    addprocs(SlurmManager(np), t="00:5:00")

    # Collect the hostname and process id reported by each worker.
    hosts = []
    pids = []
    println("We are all connected and ready.")
    for i in workers()
        host, pid = fetch(@spawnat i (gethostname(), getpid()))
        println(host, pid)
        push!(hosts, host)
        push!(pids, pid)
    end

    # The Slurm resource allocation is released when all the workers have
    # exited
    for i in workers()
        rmprocs(i)
    end
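The four translation rules listed in the comments of slurm.jl can be tried out without a cluster. The following is a minimal shell sketch of those rules, not the ClusterManagers implementation: the function name `srun_flag` is hypothetical, and printing a bare `-x` for a one-letter argument with an empty value is an assumption, since the rules do not cover that case.

```sh
# Translate one keyword argument (name, value) into the srun(1)
# command-line form described by the rules in slurm.jl.
# srun_flag is a hypothetical helper used only for illustration.
srun_flag() {
    name=$(printf '%s' "$1" | tr '_' '-')    # rule 4: "_" becomes "-"
    value=${2-}
    if [ "${#name}" -eq 1 ]; then
        # rule 1: one-letter names become "-arg value"
        if [ -n "$value" ]; then
            printf '%s\n' "-$name $value"
        else
            # assumption: a one-letter flag with no value is printed bare
            printf '%s\n' "-$name"
        fi
    elif [ -z "$value" ]; then
        # rule 3: an empty value turns the argument into a flag
        printf '%s\n' "--$name"
    else
        # rule 2: longer names become "--arg=value"
        printf '%s\n' "--$name=$value"
    fi
}

srun_flag t 0:1:0          # prints: -t 0:1:0
srun_flag time 0:1:0       # prints: --time=0:1:0
srun_flag exclusive ""     # prints: --exclusive
srun_flag mem_per_cpu 100  # prints: --mem-per-cpu=100
```

The four calls reproduce the examples given with each rule, so the printed strings show what ends up on the srun command line for a given keyword argument to addprocs.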
You can print out the output using
    head *.out

which will look like:
    ==> julia/job0000.out <==
    julia_worker:9009#172.30.0.146

    ==> julia/job0001.out <==
    julia_worker:9009#172.30.0.147

    ==> julia/job0002.out <==
    julia_worker:9009#172.30.0.148

    ==> julia/job0003.out <==
    julia_worker:9009#172.30.0.149

    ==> julia/slurm-10495.out <==
    connecting to worker 1 out of 4
    connecting to worker 2 out of 4
    connecting to worker 3 out of 4
    connecting to worker 4 out of 4
    We are all connected and ready.
    node1.domain.tld146677
    node2.domain.tld109050
    node3.domain.tld140934
    node4.domain.tld48648
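Each job file holds one line naming the port and IP address the worker listens on. If you want to pull those values out, for example to check that every worker reported in, something like the following sketch may help. It assumes the `julia_worker:PORT#IP` layout seen in the sample output; the function name `worker_addr` is hypothetical.

```sh
# Print "IP:PORT" for a line of the form "julia_worker:PORT#IP".
# Returns non-zero for lines that do not match that layout.
# worker_addr is a hypothetical helper used only for illustration.
worker_addr() {
    line=$1
    case $line in
        julia_worker:*'#'*) ;;
        *) return 1 ;;
    esac
    rest=${line#julia_worker:}   # strip the prefix, e.g. "9009#172.30.0.146"
    port=${rest%%\#*}            # text before the "#"
    ip=${rest#*\#}               # text after the "#"
    printf '%s:%s\n' "$ip" "$port"
}

worker_addr 'julia_worker:9009#172.30.0.146'   # prints: 172.30.0.146:9009
```

On the cluster this could be applied to the first line of each job file, for instance `worker_addr "$(head -n 1 job0000.out)"`.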