
Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

 

The Slurm workload manager is an important part of Hive: it is through Slurm that users access and share the cluster's resources in the right way.

Some basic necessary commands that every user should know: 

sinfo - show all available partitions and nodes
sbatch - submit a job from a script
squeue / smap - show the list of jobs in the queue
salloc - submit an allocation request
srun - submit an interactive job from a terminal
scancel - delete/remove a job

 * You can find all the information and options for each command by reading its manual page or running it with --help, for example: man sinfo or sinfo --help


 

Partitions and nodes information

hive - the default partition, containing the public compute nodes (bee's). Running jobs on the hive partitions is recommended for new users and for users who want to be sure that their job will run until it ends without any preemption.

queen - a partition containing the high-memory compute nodes (queen's), made for jobs that require a lot of memory.

preempt - a preemptable partition that contains all the compute nodes (bee's), public and private alike. Preemptable means that any job submitted to a higher-priority partition (hive, queen, or a private partition) will kick your running job off the preempt partition; the preempted job is returned to the queue and restarted from the beginning. In exchange, the preempt partition gives you the benefit of lower limitations per user.

ckpt - a preemptable partition that contains all the HP or Dell compute nodes (bee's), public and private alike. As with preempt, any job submitted to a higher-priority partition (hive, queen, or a private partition) will kick your running job off the ckpt partition and return it to the queue. In exchange, the ckpt partitions give you the benefit of lower limitations per user. These partitions are highly recommended for jobs that can be checkpointed: by combining a checkpointable job with the lower limitations of the ckpt partitions, you will be able to use more resources than the public hive partition allows.

mpi - a partition that contains all the public compute nodes (bee's), made for MPI jobs that use a large amount of resources.

vespa - a partition that contains the GPU nodes (vespa's), made for jobs that use GPU resources.

Private partitions - partitions that contain the private nodes that different groups have bought; these partitions are usually named after their group. Running jobs on a private partition is allowed only for users who belong to that partition's group. Private nodes have no limitations and get 100% priority over any other partition containing them (the ckpt partitions, for example).

Please see the below table with information regarding Hive compute nodes, different types and range of each type:

Name                Quantity  Model                    CPUs  RAM    Notes/Name and range in Slurm
Compute (bee's)     32        Dell PowerEdge C6220ii   20    128GB  bee001-032
Compute (bee's)     30        HP XL170r                24    128GB  bee033-063
Fat node (queen's)  1         Dell PowerEdge R820      32    760GB  queen01
Fat node (queen's)  2         HP DL560                 56    760GB  queen02-03
GPU (vespa's)       1         HP XL190r                28    256GB  vespa01, Nvidia K80 GPU

 * For information regarding the limitations per partition, please check the limitations section in the website menu.


 

The Slurm sbatch command and job submission scripts

The command "sbatch" should be the default command for running batch jobs. 

With "sbatch" you can run simple batch jobs from a command line, or you can execute complicated jobs from a prepared batch script.

First, you should understand the basic options you can add to the sbatch command in order to request the right allocation of resources for your jobs:

Commonly used options for srun, sbatch and salloc:

-p partitionName

Submit a job to the partition partitionName

-o output.log

Write the job's output to output.log instead of slurm-%j.out in the current directory

-e error.log

Write the job's STDERR to error.log instead of the job output file (see -o above)

--mail-type=type

Email submitter on job state changes. Valid type values are BEGIN, END,FAIL, REQUEUE and ALL (any state change).

--mail-user=email

User to receive email notification of state changes (see --mail-type above)

-n N

--ntasks N

Set the number of processors (cores) to N (default 1); SLURM chooses which cores are allocated

-N N

--nodes N

Set the number of nodes that will be part of the job. On each node, --ntasks-per-node processes will be started. If the option --ntasks-per-node is not given, 1 process per node will be started.

--ntasks-per-node N

How many tasks per allocated node to start (see -N above)

--cpus-per-task N

Needed for multithreaded (e.g. OpenMP) jobs. This option tells SLURM to allocate N cores per task; typically N should equal the number of threads the program spawns, e.g. it should be set to the same number as OMP_NUM_THREADS

-J

--job-name

Set the job name shown in the queue. The job name (limited to the first 24 characters) is used in emails sent to the user

-w node1,node2,...

Restrict job to run on specific nodes only

-x node1,node2,...

Exclude specific nodes from job
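Putting several of the options above together, a submission script might look like the following sketch. The script name, partition choice, core count and email address are placeholders, not a recommended configuration. Note that because #SBATCH directives are shell comments, Slurm reads them but the script still runs as ordinary bash.

```shell
#!/bin/bash
#SBATCH -p hive1d                     # partition (placeholder choice)
#SBATCH -J options_demo               # job name shown in the queue
#SBATCH -n 4                          # request 4 tasks (cores)
#SBATCH -N 1                          # all on one node
#SBATCH -o output.log                 # write STDOUT here instead of slurm-%j.out
#SBATCH -e error.log                  # write STDERR here
#SBATCH --mail-type=END,FAIL          # email when the job ends or fails
#SBATCH --mail-user=user@example.com  # placeholder address

# Slurm sets SLURM_NTASKS inside a real job; the default below is only
# so the script can also be tried by hand with plain bash.
echo "Job running on $(hostname) with ${SLURM_NTASKS:-4} tasks"
```

Saved as, say, options_demo.sh (a hypothetical name), it would be submitted with sbatch options_demo.sh.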

Below is an example of running a simple batch job directly from a command line/terminal on Hive with a minimal allocation of resources (1 node and 1 CPU/core):

$ sbatch -N1 -n1 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726

Where:

-N1 requests 1 node

-n1 requests 1 CPU/core

. /etc/profile.d/modules.sh ; module load blast/2.2.30 is the command that loads the software your job uses (in this example, the blast program)

blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn is your job command

* The job will start on the default partition, named hive1d. If your job is going to run for more than 1 day, you should add the -p option to your command and specify the hive7d or hiveunlim partition, which allows your job to run for up to 7 or 31 days respectively.

Below is an example of running a simple job on the hive7d partition with a time limit of 5 days:

$ sbatch -N1 -n1 -p hive7d --time=5-00:00:00 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726

So basically the formula for executing a job is: [sbatch command] [allocation of resources] [--wrap] ['your job command']

The --wrap option has to come AFTER the allocation of resources, not before. You have to use --wrap to run jobs from the command line, because by default sbatch runs batch jobs from a batch script rather than from the command line.

Running jobs from prepared slurm submission script:

The following is a typical Slurm submission script example.

* Please note that there are a few types of thin compute nodes (bee's) on Hive: the Dell (bee001-032) compute nodes have 20 cores/CPUs each, while the HP (bee033-063) compute nodes have 24 cores/CPUs each.

#!/bin/sh
#SBATCH --ntasks 20           # use 20 cores
#SBATCH --ntasks-per-node=20  # use 20 cpus per each node
#SBATCH --time 1-03:00:00     # set job timelimit to 1 day and 3 hours
#SBATCH --partition hive1d    # partition name
#SBATCH -J my_job_name        # sensible name for the job

# load up the correct modules, if required
. /etc/profile.d/modules.sh
module load openmpi/1.8.4 RAxML/8.1.15

# launch the code
mpirun ...

How to submit a job from a batch script

To submit this, run the following command:

sbatch myscript.sh

Warning: do not execute the script

The job submission script file is written to look like a bash shell script. However, you do NOT submit the job to the queue by executing the script.

In particular, the following is INCORRECT:

# this is the INCORRECT way to submit a job
./myscript.sh  # wrong! this will not submit the job!

The correct way is noted above (sbatch myscript.sh).

Please refer to the sbatch manual (man sbatch) for more helpful information about how to use the sbatch command in the right way.

 


 

Running a job with multiple partitions

It is possible to select more than one partition when submitting your job to the queue. To do that, specify a comma-separated list of the partitions you would like to submit your job to.

Your job will be submitted to the first partition in your list, read from left to right, that has free resources. Examples:

All the examples below use the standard -p / --partition option of Slurm with a comma-separated list of partitions.

To run short jobs (less than 24 hours):

-p hive1d,hive7d,hiveunlim
If you also have some private nodes: (replace "private" with the partition name)
-p private,hive1d,hive7d,hiveunlim
If you are willing to risk your jobs being killed and rerun: (for short jobs)
-p hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d
Same with a private partition:
-p private,hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d

To run longer jobs: (up to 7 days)
-p hive7d,hiveunlim
Same with private partition:
-p private,hive7d,hiveunlim
With checkpointing:
-p hive7d,hiveunlim,ckptdell7d,ckptdell31d

or

-p hive7d,hiveunlim,ckpthp7d,ckpthp31d
Same with private partition:
-p private,hive7d,hiveunlim,ckptdell7d,ckptdell31d

or

-p private,hive7d,hiveunlim,ckpthp7d,ckpthp31d

To run very long jobs: (up to 31 days)
-p hiveunlim
Same with private partition:
-p private,hiveunlim
With checkpointing:
-p hiveunlim,ckptdell31d

or

-p hiveunlim,ckpthp31d
Same with private partition:
-p private,hiveunlim,ckptdell31d

or

-p private,hiveunlim,ckpthp31d

To run jobs that require more than 128GB memory:
-p queen
With checkpointing:
-p queen,queenckpt
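The comma-separated partition list can also go inside the submission script instead of on the command line. Here is a minimal sketch, assuming the public partition names listed above; the echo payload is a placeholder for your real job commands:

```shell
#!/bin/bash
#SBATCH -J multi_partition_demo
#SBATCH -n 1
#SBATCH -p hive1d,hive7d,hiveunlim  # first listed partition with free resources wins

# Slurm sets SLURM_JOB_PARTITION inside a real job; the default here is
# only so the script can be tried by hand. Replace the echo with your job.
echo "Running in partition: ${SLURM_JOB_PARTITION:-unknown}"
```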

 Job Arrays

The job array is a very helpful sbatch option for users who need to run a lot of single-core computations. With this option you can start hundreds of jobs from one sbatch script over a specified index range; in the Slurm queue each job gets a unique ID number, and the pending jobs are grouped into one entry.

Below is an example of an sbatch job script with the array option that will start 399 jobs, each executing with an index number from 1-399:

#!/bin/bash

#SBATCH --job-name=CodemlArrayJob
#SBATCH --partition=hive1d,hive7d,hiveunlim
#SBATCH --array=1-399
#SBATCH --output=out_%A_%a_%j.out
#SBATCH --error=error_%A_%a_%j.err

## To make things simple: ${i} == $SLURM_ARRAY_TASK_ID
i=${SLURM_ARRAY_TASK_ID}

# Check $WORKDIR existence
function checkWorkdir() {
    if [ ! -d "$1" ]; then
        echo "ERROR! Directory $1 doesn't exist."
        exit 1;
    fi
}

# load the modules environment
. /etc/profile.d/modules.sh
module load PAML/4.8

# Job commands
echo "branch${i}"
WORKDIR="/data/home/privman/rwilly/2.Like.Clade12/guidence/guidence.results/BranchSiteModelA/BranchSiteWithScript/like.Clade12.branch${i}.workingDir"
checkWorkdir $WORKDIR
cd $WORKDIR
codeml LikeClade12.branch${i}.ctl

Note: In the --error and --output options you have %A, %a and %j: %A expands to the master array job ID, %a to the array index of the task, and %j to the unique job ID of each submitted array task. These job IDs can be useful when the job array is checkpointable: the job ID numbers (array and original) will help you find which jobs have been checkpointed.
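To see how the array index behaves, the mechanism can be sketched with a minimal script; outside Slurm you can set SLURM_ARRAY_TASK_ID by hand to try the body (the range and filename pattern are placeholders):

```shell
#!/bin/bash
#SBATCH --array=1-3                 # three tasks, indices 1..3
#SBATCH --output=out_%A_%a_%j.out   # master array id, array index, unique job id

# Inside a real array job Slurm sets SLURM_ARRAY_TASK_ID itself;
# the default (1) is only so the script can be tried by hand.
i=${SLURM_ARRAY_TASK_ID:-1}
echo "processing task ${i}"
```

For instance, running SLURM_ARRAY_TASK_ID=7 bash array_demo.sh (a hypothetical filename) prints "processing task 7".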


 

Jobs With Checkpoint options:

Slurm has an option to checkpoint your running jobs every X minutes. Checkpointable jobs are needed to secure your progress on the preemptable partitions; likewise, if you are running a very long job, you will want checkpoints so that you have the option of stopping and continuing the job from the last checkpoint.

Currently there are two sets of partitions, each with time limits of 1, 7 and 31 days, that are made for checkpointing: ckptdell (ckptdell1d, ckptdell7d, ckptdell31d) and ckpthp (ckpthp1d, ckpthp7d, ckpthp31d).

Below is an example of an sbatch job script that makes a checkpoint every 6 hours:

#!/bin/bash
#SBATCH --job-name=CheckpointableJobExample
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --partition=ckptdell1d
#SBATCH --checkpoint=360  # Time in minutes, every X minutes to make a checkpoint
#SBATCH --checkpoint-dir=/data/ckpt # Default place where your checkpoints will be created, you can change it to other place in your home folder


#
# The following lines are needed for checkpointing.
##restarterOpts="StickToNodes" # Use the StickToNodes option only if your job cannot be resubmitted to machines other than the ones it started on
restarterOpts=""
. /data/scripts/include_for_checkpointing

### From here you can start editing your job commands

for i in {1..300}; do
echo -n "$i:    "
date
sleep 2
done

* The ckpt partitions are made for checkpointable jobs and have lower priority than the hive partitions. That means any ckpt job can be preempted and restarted by a higher-priority job from a hive or private partition.

Attention: If your checkpointed job is stopped and then started again, it will continue from the progress of the last checkpoint that was made, not from the moment the job was stopped.

Limitations: The checkpoint option does not work with a few types of jobs; below is the list of job types that will not work with the checkpoint option:


 

Using SRUN command

The "srun" command is used to run interactive jobs on the compute nodes of the HPC. The following example runs "matlab" on a compute node of the cluster:

$ module load matlab/r2014b
$ srun matlab


MATLAB is selecting SOFTWARE OPENGL rendering.

                  < M A T L A B (R) >
        Copyright 1984-2014 The MathWorks, Inc.
         R2014b (8.4.0.150421) 64-bit (glnxa64)
                   September 15, 2014

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

>>   (the MATLAB command line)

In the example above we load the matlab module from the public software and then execute matlab as an interactive job with the srun command; as you can see in the output, you end up working interactively at the MATLAB command line.

With the "srun" command you can run jobs with many different allocation options, such as the number of nodes for the job, how many tasks to use, how many tasks to use per node, how many CPUs to use per task, and more. Use the "man srun" command to see all the possible options.


 

Using SALLOC Command 

With the "salloc" command you can obtain an interactive SLURM job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished.

If you would like to allocate resources on the cluster and then have the flexibility of using those resources interactively, you can use the "salloc" command. In the next example we request 5 tasks and a 2-hour allocation:

 

$ salloc -n 5 --time=2:00:00

salloc: Pending job allocation 45924
salloc: job 45924 queued and waiting for resources
salloc: job 45924 has been allocated resources
salloc: Granted job allocation 45924

$ srun hostname
bee025
bee025
bee025
bee025
bee025

$ hostname
hive01.haifa.ac.il

$ exit
exit
salloc: Relinquishing job allocation 45924
salloc: Job allocation 45924 has been revoked.

The request enters the job queue just like any other job, and "salloc" will tell you that it is waiting for the requested resources if there aren't enough at the moment. When "salloc" tells you that your job has been allocated resources, you can interactively run programs on those resources with the "srun" command; commands run with "srun" are executed on the resources your job has been allocated. If you finish your work before the allocated time ends, or if you didn't set a time limit at all, use the "exit" command to release the allocation permanently.

Warning: All commands that you want to execute on the allocated resources must be run with the "srun" command; otherwise they will be executed on the access node and not on the allocated resources you asked for, as you can see in the example above.


Running GPU jobs

To execute a job that uses GPU power, you should prepare an sbatch script containing the --gres option (generic resources) with the 4 GPU devices that our GPU node contains (each device uses 7 of the node's CPUs). In addition, you should load the module named cuda/7.5, which loads the GPU driver packages. The GPU partition is named vespa and the GPU node is named vespa01.

Below you can see an example of a GPU job that executes a CUDA benchmark:

 

#!/bin/bash
#SBATCH --error=/data/benchmark/logs/cuda.%J.errors
#SBATCH --output=/data/benchmark/logs/cuda.%J.output
#SBATCH -p vespa
#SBATCH --gres gpu:4

. /etc/profile.d/modules.sh
module load cuda/7.5

/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery

 


 

 

More Information:

List Partitions

To view the current status of all partitions accessible by the user:

$ sinfo -l

To view the current status of a partition named partitionName run:

$ sinfo -l -p partitionName

Display Partition Contents

To get the list of all jobs running in the partition named partitionName, run:

$ squeue -p partitionName

Same, limited to user userName:

$ squeue -p partitionName -u userName

 

Control Nodes

Get nodes state

One of the following commands can be used to get node(s) state, depending on desired verbosity level:

# sinfo -N

or

# sinfo -o "%20N %.11T %.4c %.8z %.15C %.10O %.6m %.8d"

 


  


A few more helpful notes about Slurm commands:

Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive (for example, -n and -N mean different things).

If you need more information about Slurm, please go to the official Slurm documentation page, where you can find more useful information: https://computing.llnl.gov/linux/slurm/documentation.html