
Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

 

The Slurm workload manager is an important part of Hive: it is through Slurm that users access and share the cluster's resources in the right way.

Some basic necessary commands that every user should know: 

sinfo - show all available partitions and nodes
sbatch - submit a job from a script
squeue / smap - show the list of jobs in the queue
salloc - submit an allocation request
srun - submit an interactive job from a terminal
scancel - delete/remove a job

 * You can find all the information and options for each command by reading its manual page or running it with --help, for example: man sinfo or sinfo --help


 

Partitions and nodes information

hive - the default partition, containing the public compute nodes (bee's). Running jobs on the hive partitions is recommended for new users and for users who want to be sure that their job will run until it ends without any preemption.

queen - a partition containing the high-memory compute nodes (queen's), made for jobs that require a lot of memory.

preempt - a preemptable partition that contains all the compute nodes (bee's), public and private alike. Preemptable means that any job submitted to a higher-priority partition (hive, queen, or a private partition) will kick your running job off the preempt partition; the preempted job is returned to the queue and restarted from the beginning. In exchange, the preempt partition gives you the benefit of lower limitations per user.

ckpt - a preemptable partition that contains all the HP or Dell compute nodes (bee's), public and private alike. As with preempt, any job submitted to a higher-priority partition (hive, queen, or a private partition) will kick your running job off the ckpt partition and return it to the queue. In exchange, the ckpt partitions give you the benefit of lower limitations per user. These partitions are highly recommended for jobs that can be checkpointed: by combining a checkpointable job with the lower limitations of the ckpt partitions, you will be able to use more resources than the public hive partition allows.

mpi - a partition that contains all the public compute nodes (bee's), made for MPI jobs that use a large amount of resources.

vespa - a partition that contains the GPU nodes (vespa's), made for jobs that use GPU resources.

Private partitions - partitions that contain the private nodes that different groups have bought; these partitions are usually named after their group. Running jobs on a private partition is allowed only for users who belong to that partition's group. Private nodes have no limitations and get 100% priority over any other partition containing them (the ckpt partitions, for example).

Please see the below table with information regarding Hive compute nodes, different types and range of each type:

Name                Quantity  Model                    CPUs  RAM    Notes/Name and range in Slurm
Compute (bee's)     32        Dell PowerEdge C6220ii   20    128GB  bee001-032
Compute (bee's)     30        HP XL170r                24    128GB  bee033-063
Fat node (queen's)  1         Dell PowerEdge R820      32    760GB  queen01
Fat node (queen's)  2         HP DL560                 56    760GB  queen02-03
GPU (vespa's)       1         HP XL190r                28    256GB  vespa01, Nvidia K80 GPU

 * For information regarding the limitations per partition, please check the limitations section in the website menu.


 

The Slurm sbatch command and job submission scripts

The command "sbatch" should be the default command for running batch jobs. 

With "sbatch" you can run simple batch jobs from a command line, or you can execute complicated jobs from a prepared batch script.

First, you should understand the basic options you can add to the sbatch command in order to request the right allocation of resources for your jobs:

Commonly used options for srun, sbatch and salloc:

-p partitionName

Submit a job to the partition partitionName

-o output.log

Write the job's output to output.log instead of slurm-%j.out in the current directory

-e error.log

Write the job's STDERR to error.log instead of the job output file (see -o above)

--mail-type=type

Email submitter on job state changes. Valid type values are BEGIN, END,FAIL, REQUEUE and ALL (any state change).

--mail-user=email

User to receive email notification of state changes (see --mail-type above)

-n N

--ntasks N

Set the number of processors (cores) to N (default 1); SLURM chooses which cores are allocated

-N N

--nodes N

Set the number of nodes that will be part of the job. On each node, --ntasks-per-node processes will be started. If the option --ntasks-per-node is not given, 1 process per node will be started.

--ntasks-per-node N

How many tasks per allocated node to start (see -N above)

--cpus-per-task N

Needed for multithreaded (e.g. OpenMP) jobs. This option tells SLURM to allocate N cores per task; typically N should equal the number of threads the program spawns, e.g. it should be set to the same number as OMP_NUM_THREADS

-J

--job-name

Set the job name shown in the queue. The job name (limited to the first 24 characters) is used in emails sent to the user

-w node1,node2,...

Restrict job to run on specific nodes only

-x node1,node2,...

Exclude specific nodes from job
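Putting several of the options above together, a submission script might look like the following sketch. The script name, partition choice, core count and email address are placeholders, not a recommended configuration. Note that because #SBATCH directives are shell comments, Slurm reads them but the script still runs as ordinary bash.

```shell
#!/bin/bash
#SBATCH -p hive1d                     # partition (placeholder choice)
#SBATCH -J options_demo               # job name shown in the queue
#SBATCH -n 4                          # request 4 tasks (cores)
#SBATCH -N 1                          # all on one node
#SBATCH -o output.log                 # write STDOUT here instead of slurm-%j.out
#SBATCH -e error.log                  # write STDERR here
#SBATCH --mail-type=END,FAIL          # email when the job ends or fails
#SBATCH --mail-user=user@example.com  # placeholder address

# Slurm sets SLURM_NTASKS inside a real job; the default below is only
# so the script can also be tried by hand with plain bash.
echo "Job running on $(hostname) with ${SLURM_NTASKS:-4} tasks"
```

Saved as, say, options_demo.sh (a hypothetical name), it would be submitted with sbatch options_demo.sh.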

Below is an example of running a simple batch job directly from a command line/terminal on Hive with a minimal allocation of resources (1 node and 1 CPU/core):

$ sbatch -N1 -n1 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726

Where:

-N1 requests 1 node

-n1 requests 1 CPU/core

. /etc/profile.d/modules.sh ; module load blast/2.2.30 is the command that loads the software your job uses (in this example, the blast program)

blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn is your job command

* The job will start on the default partition, named hive1d. If your job is going to run for more than 1 day, you should add the -p option to your command and specify the hive7d or hiveunlim partition, which allows your job to run for up to 7 or 31 days respectively.

Below is an example of running a simple job on the hive7d partition with a time limit of 5 days:

$ sbatch -N1 -n1 -p hive7d --time=5-00:00:00 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726

So basically the formula for executing a job is: [sbatch command] [allocation of resources] [--wrap] ['your job command']

The --wrap option has to come AFTER the allocation of resources, not before. You have to use --wrap to run jobs from the command line, because by default sbatch runs batch jobs from a batch script rather than from the command line.

Running jobs from prepared slurm submission script:

The following is a typical Slurm submission script example.

* Please note that there are a few types of thin compute nodes (bee's) on Hive: the Dell (bee001-032) compute nodes have 20 cores/CPUs each, while the HP (bee033-063) compute nodes have 24 cores/CPUs each.

#!/bin/sh
#SBATCH --ntasks 20           # use 20 cores
#SBATCH --ntasks-per-node=20  # use 20 cpus per each node
#SBATCH --time 1-03:00:00     # set job timelimit to 1 day and 3 hours
#SBATCH --partition hive1d    # partition name
#SBATCH -J my_job_name        # sensible name for the job

# load up the correct modules, if required
. /etc/profile.d/modules.sh
module load openmpi/1.8.4 RAxML/8.1.15

# launch the code
mpirun ...

How to submit a job from a batch script

To submit this, run the following command:

sbatch myscript.sh

Warning: do not execute the script

The job submission script file is written to look like a bash shell script. However, you do NOT submit the job to the queue by executing the script.

In particular, the following is INCORRECT:

# this is the INCORRECT way to submit a job
./myscript.sh  # wrong! this will not submit the job!

The correct way is noted above (sbatch myscript.sh).

Please refer to the sbatch manual (man sbatch) for more helpful information about how to use the sbatch command in the right way.

 


 

Running a job with multiple partitions

It is possible to select more than one partition when submitting your job to the queue. To do that, specify a comma-separated list of the partitions you would like to submit your job to.

Your job will be submitted to the first partition in your list, read from left to right, that has free resources. Examples:

All the examples below use the standard -p / --partition option of Slurm with a comma-separated list of partitions.

To run short jobs (less than 24 hours):

-p hive1d,hive7d,hiveunlim
If you also have some private nodes: (replace "private" with the partition name)
-p private,hive1d,hive7d,hiveunlim
If you are willing to risk your jobs being killed and rerun: (for short jobs)
-p hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d
Same with a private partition:
-p private,hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d

To run longer jobs: (up to 7 days)
-p hive7d,hiveunlim
Same with private partition:
-p private,hive7d,hiveunlim
With checkpointing:
-p hive7d,hiveunlim,ckptdell7d,ckptdell31d

or

-p hive7d,hiveunlim,ckpthp7d,ckpthp31d
Same with private partition:
-p private,hive7d,hiveunlim,ckptdell7d,ckptdell31d

or

-p private,hive7d,hiveunlim,ckpthp7d,ckpthp31d

To run very long jobs: (up to 31 days)
-p hiveunlim
Same with private partition:
-p private,hiveunlim
With checkpointing:
-p hiveunlim,ckptdell31d

or

-p hiveunlim,ckpthp31d
Same with private partition:
-p private,hiveunlim,ckptdell31d

or

-p private,hiveunlim,ckpthp31d

To run jobs that require more than 128GB memory:
-p queen
With checkpointing:
-p queen,queenckpt
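The comma-separated partition list can also go inside the submission script instead of on the command line. Here is a minimal sketch, assuming the public partition names listed above; the echo payload is a placeholder for your real job commands:

```shell
#!/bin/bash
#SBATCH -J multi_partition_demo
#SBATCH -n 1
#SBATCH -p hive1d,hive7d,hiveunlim  # first listed partition with free resources wins

# Slurm sets SLURM_JOB_PARTITION inside a real job; the default here is
# only so the script can be tried by hand. Replace the echo with your job.
echo "Running in partition: ${SLURM_JOB_PARTITION:-unknown}"
```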

 Job Arrays

The job array is a very helpful sbatch option for users who need to run a lot of single-core computations. With this option you can start hundreds of jobs from one sbatch script over a specified index range; in the Slurm queue each job gets a unique ID number, and the pending jobs are grouped into one entry.

Below is an example of an sbatch job script with the array option that will start 399 jobs, each executing with an index number from 1-399:

#!/bin/bash

#SBATCH --job-name=CodemlArrayJob
#SBATCH --partition=hive1d,hive7d,hiveunlim
#SBATCH --array=1-399
#SBATCH --output=out_%A_%a_%j.out
#SBATCH --error=error_%A_%a_%j.err

## To make things simple: ${i} == $SLURM_ARRAY_TASK_ID
i=${SLURM_ARRAY_TASK_ID}

# Check $WORKDIR existence
function checkWorkdir() {
    if [ ! -d "$1" ]; then
        echo "ERROR! Directory $1 doesn't exist."
        exit 1;
    fi
}

# load the modules environment
. /etc/profile.d/modules.sh
module load PAML/4.8

# Job commands
echo "branch${i}"
WORKDIR="/data/home/privman/rwilly/2.Like.Clade12/guidence/guidence.results/BranchSiteModelA/BranchSiteWithScript/like.Clade12.branch${i}.workingDir"
checkWorkdir $WORKDIR
cd $WORKDIR
codeml LikeClade12.branch${i}.ctl

Note: In the --error and --output options you have %A, %a and %j: %A expands to the master array job ID, %a to the array index of the task, and %j to the unique job ID of each submitted array task. These job IDs can be useful when the job array is checkpointable: the job ID numbers (array and original) will help you find which jobs have been checkpointed.
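To see how the array index behaves, the mechanism can be sketched with a minimal script; outside Slurm you can set SLURM_ARRAY_TASK_ID by hand to try the body (the range and filename pattern are placeholders):

```shell
#!/bin/bash
#SBATCH --array=1-3                 # three tasks, indices 1..3
#SBATCH --output=out_%A_%a_%j.out   # master array id, array index, unique job id

# Inside a real array job Slurm sets SLURM_ARRAY_TASK_ID itself;
# the default (1) is only so the script can be tried by hand.
i=${SLURM_ARRAY_TASK_ID:-1}
echo "processing task ${i}"
```

For instance, running SLURM_ARRAY_TASK_ID=7 bash array_demo.sh (a hypothetical filename) prints "processing task 7".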


 

Jobs With Checkpoint options:

Slurm has an option to checkpoint your running jobs every X minutes. Checkpointable jobs are needed to secure your progress on the preemptable partitions; likewise, if you are running a very long job, you will want checkpoints so that you have the option of stopping and continuing the job from the last checkpoint.

Currently there are two sets of partitions, each with time limits of 1, 7 and 31 days, that are made for checkpointing: ckptdell (ckptdell1d, ckptdell7d, ckptdell31d) and ckpthp (ckpthp1d, ckpthp7d, ckpthp31d).

Below is an example of an sbatch job script that makes a checkpoint every 6 hours:

#!/bin/bash
#SBATCH --job-name=CheckpointableJobExample
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --partition=ckptdell1d
#SBATCH --checkpoint=360  # Time in minutes, every X minutes to make a checkpoint
#SBATCH --checkpoint-dir=/data/ckpt # Default place where your checkpoints will be created, you can change it to other place in your home folder


#
# The following lines are needed for checkpointing.
##restarterOpts="StickToNodes" # Use the StickToNodes option only if your job cannot be resubmitted to machines other than the ones it started on
restarterOpts=""
. /data/scripts/include_for_checkpointing

### From here you can start editing your job commands

for i in {1..300}; do
echo -n "$i:    "
date
sleep 2
done

* The ckpt partitions are made for checkpointable jobs and have lower priority than the hive partitions. That means any ckpt job can be preempted and restarted by a higher-priority job from a hive or private partition.

Attention: If your checkpointed job is stopped and then started again, it will continue from the progress of the last checkpoint that was made, not from the moment the job was stopped.

Limitations: The checkpoint option does not work with a few types of jobs; below is the list of job types that will not work with the checkpoint option:


 

Using SRUN command

The "srun" command is used to run interactive jobs on the compute nodes of the HPC. The following example runs "matlab" on a compute node of the cluster:

$ module load matlab/r2014b
$ srun matlab


MATLAB is selecting SOFTWARE OPENGL rendering.

                  < M A T L A B (R) >
        Copyright 1984-2014 The MathWorks, Inc.
         R2014b (8.4.0.150421) 64-bit (glnxa64)
                   September 15, 2014

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

>>   (the MATLAB command line)

In the example above we load the matlab module from the public software and then execute matlab as an interactive job with the srun command; as you can see in the output, you end up working interactively at the MATLAB command line.

With the "srun" command you can run jobs with many different allocation options, such as the number of nodes for the job, how many tasks to use, how many tasks to use per node, how many CPUs to use per task, and more. Use the "man srun" command to see all the possible options.


 

Using SALLOC Command 

With the "salloc" command you can obtain an interactive SLURM job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished.

If you would like to allocate resources on the cluster and then have the flexibility of using those resources interactively, you can use the "salloc" command. In the next example we request 5 tasks and a 2-hour allocation:

 

$ salloc -n 5 --time=2:00:00

salloc: Pending job allocation 45924
salloc: job 45924 queued and waiting for resources
salloc: job 45924 has been allocated resources
salloc: Granted job allocation 45924

$ srun hostname
bee025
bee025
bee025
bee025
bee025

$ hostname
hive01.haifa.ac.il

$ exit
exit
salloc: Relinquishing job allocation 45924
salloc: Job allocation 45924 has been revoked.

The request enters the job queue just like any other job, and "salloc" will tell you that it is waiting for the requested resources if there aren't enough at the moment. When "salloc" tells you that your job has been allocated resources, you can interactively run programs on those resources with the "srun" command; commands run with "srun" are executed on the resources your job has been allocated. If you finish your work before the allocated time ends, or if you didn't set a time limit at all, use the "exit" command to release the allocation permanently.

Warning: All commands that you want to execute on the allocated resources must be run with the "srun" command; otherwise they will be executed on the access node and not on the allocated resources you asked for, as you can see in the example above.


Running GPU jobs

To execute a job that uses GPU power, you should prepare an sbatch script containing the --gres option (generic resources) with the 4 GPU devices that our GPU node contains (each device uses 7 of the node's CPUs). In addition, you should load the module named cuda/7.5, which loads the GPU driver packages. The GPU partition is named vespa and the GPU node is named vespa01.

Below you can see an example of a GPU job that executes a CUDA benchmark:

 

#!/bin/bash
#SBATCH --error=/data/benchmark/logs/cuda.%J.errors
#SBATCH --output=/data/benchmark/logs/cuda.%J.output
#SBATCH -p vespa
#SBATCH --gres gpu:4

. /etc/profile.d/modules.sh
module load cuda/7.5

/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery

 


 

 

More Information:

List Partitions

To view the current status of all partitions accessible by the user:

$ sinfo -l

To view the current status of a partition named partitionName run:

$ sinfo -l -p partitionName

Display Partition Contents

To get the list of all jobs running in the partition named partitionName, run:

$ squeue -p partitionName

Same, limited to user userName:

$ squeue -p partitionName -u userName

 

Control Nodes

Get nodes state

One of the following commands can be used to get node(s) state, depending on desired verbosity level:

# sinfo -N

or

# sinfo -o "%20N %.11T %.4c %.8z %.15C %.10O %.6m %.8d"

 


  


A few more helpful notes about Slurm commands:

Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive (for example, -n and -N mean different things).

If you need more information about Slurm, please go to the official Slurm documentation page, where you can find more useful information: https://computing.llnl.gov/linux/slurm/documentation.html