Hive currently enforces limits on its public partitions. The limits are set per partition and apply equally to all users of the public partitions.

There are currently 3 hive public partitions, 3 preempt public partitions, 3 ckptdell public partitions, 3 ckpthp public partitions, one ckptqueen partition, one queen partition (fat memory node) and one mpi partition.

To view real-time partition information from a Hive terminal, run the command sinfo. In addition, the login message shown to every user contains the real-time QOS limitations of each partition, as well as user and node information.
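For example, running sinfo produces output similar to the following (the output below is illustrative only; the actual partitions, node counts and states will differ, and the partition marked with * is the default):

$ sinfo
PARTITION   AVAIL   TIMELIMIT  NODES  STATE  NODELIST
hive1d*        up  1-00:00:00     35   idle  bee[001-035]
hive1d*        up  1-00:00:00      5  alloc  bee[036-040]
hive7d         up  7-00:00:00     40   idle  bee[001-040]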

The hive partition contains 40 public compute nodes and is the default. Running jobs on the hive partition is recommended for new users and for users who want to be sure that their jobs will run to completion without any preemption.

The queen partition contains the high-memory compute nodes (queens); it is intended for jobs that require a lot of memory.

The preempt partitions are preempt-able and contain all the compute nodes (bees), both public and private. Preempt-able means that any job submitted to a higher-priority partition (hive, queen or a private partition) will kick your job off the preempt partition and return it to the queue, where it restarts from the beginning. In exchange, the preempt partitions impose lower per-user limits.

The ckpt partitions are preempt-able and contain all the HP or Dell compute nodes (bees), both public and private. Preempt-able means that any job submitted to a higher-priority partition (hive, queen or a private partition) will kick your job off the ckpt partition and return it to the queue. In exchange, the ckpt partitions impose lower per-user limits. These partitions are highly recommended for jobs that can be checkpointed: by combining a checkpoint-able job with the lower limits of the ckpt partitions, you can use more resources than on the public hive partitions.

The mpi partition contains all the public compute nodes (bees) and is intended for MPI jobs that use a large amount of resources.

The vespa partition contains the GPU nodes (vespas) and is intended for jobs that use GPU resources.

The guest partition contains thin compute nodes and is intended for Hive guest users (non-Faculty of Science members); it is a low-priority partition and is preempt-able by the other partitions.

Private partitions contain the private nodes that different groups have bought; these partitions are usually named after their group. Running jobs on a private partition is allowed only for users that belong to the partition's group. Private nodes have no limitations and take 100% priority over any other partition that contains the same private nodes (the ckpt partitions, for example).

Below you can see information about the limits each partition has:

 
hive1d - DEFAULT partition. Limits: WallTime = 1 day, MaxCPUsPerUser = 400, MaxNodesPerUser = 20
hive7d - Limits: WallTime = 7 days, MaxCPUsPerUser = 100, MaxNodesPerUser = 10
hiveunlim - Limits: WallTime = 31 days, MaxCPUsPerUser = 60, MaxNodesPerUser = 7
* Hive contains 40 public compute nodes
preempt1d - Limits: WallTime = 1 day, MaxCPUsPerUser = 800, Preemption = REQUEUE. Low-priority, preempt-able partition
preempt7d - Limits: WallTime = 7 days, MaxCPUsPerUser = 600, Preemption = REQUEUE. Low-priority, preempt-able partition
preempt31d - Limits: WallTime = 31 days, MaxCPUsPerUser = 400, Preemption = REQUEUE. Low-priority, preempt-able partition
* Preempt contains 63 compute nodes
ckpthp1d - Limits: WallTime = 1 day, MaxCPUsPerUser = 400, Preemption = REQUEUE. Low-priority, preempt-able partition
ckpthp7d - Limits: WallTime = 7 days, MaxCPUsPerUser = 300, Preemption = REQUEUE. Low-priority, preempt-able partition
ckpthp31d - Limits: WallTime = 31 days, MaxCPUsPerUser = 200, Preemption = REQUEUE. Low-priority, preempt-able partition
* HP contains 30 compute nodes
ckptdell1d - Limits: WallTime = 1 day, MaxCPUsPerUser = 400, Preemption = REQUEUE. Low-priority, preempt-able partition
ckptdell7d - Limits: WallTime = 7 days, MaxCPUsPerUser = 300, Preemption = REQUEUE. Low-priority, preempt-able partition
ckptdell31d - Limits: WallTime = 31 days, MaxCPUsPerUser = 200, Preemption = REQUEUE. Low-priority, preempt-able partition
* Dell contains 32 compute nodes
queen - Limits: WallTime = 31 days, MaxNodesPerUser = 1
ckptqueen - Limits: WallTime = 31 days, Preemption = REQUEUE. Low-priority, preempt-able partition
mpi - Limits: MinCPUsPerUser = 200
guest - Limits: WallTime = 7 days, MaxCPUsPerUser = 400. Low-priority partition
 
*WallTime = the maximum time each job can run.
*MaxCPUsPerUser = the maximum number of CPUs a user can use in a specific partition.
*MaxNodesPerUser = the maximum number of nodes on which a user can run jobs in a specific partition.
*MinCPUsPerUser = the minimum resource allocation per job.
*Preemption=REQUEUE = a low-priority partition; all jobs running on it can be preempted/checkpointed by jobs from a higher-priority partition. A preempted job is restarted from the beginning or from its last checkpoint.
*The ckpt partitions are made for checkpoint-able jobs and have lower priority than the hive and private partitions, which means that any ckpt job can be preempted and restarted by a higher-priority job from a hive or private partition.
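You can also query the live configuration of any partition directly from Slurm with scontrol; for example, using hive1d from the list above:

$ scontrol show partition hive1d

The output includes fields such as MaxTime, MaxNodes, Default and PreemptMode; the per-user QOS limits listed above are also shown in the login message, as noted earlier.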
 
Hive has two types of slim compute nodes (bees) and two types of fat compute nodes (queens); the ranges of the different nodes are listed below:
 
bee001-032 have 20 CPUs per node and 128 GB of memory
bee033-063 have 24 CPUs per node and 128 GB of memory
queen01 has 32 CPUs and 768 GB of memory
queen02-03 have 56 CPUs and 768 GB of memory
vespa01 is a GPU node with an Nvidia K80 GPU providing 4 GPU devices that use the node's 28 CPUs
 
For additional information regarding node hardware, please refer to the hardware section.

The fair-share component of the job priority is calculated differently. The goal is to make sure that the priority strictly follows the account hierarchy, so that jobs under accounts with usage lower than their fair share will always have a higher priority than jobs belonging to accounts which are over their fair share.

The algorithm is based on ticket scheduling, where at the root of the account hierarchy one starts with a number of tickets, which are then distributed per the fairshare policy to the child accounts and users. Then, the job whose user has the highest number of tickets is assigned the fairshare priority of 1.0, and the other pending jobs are assigned priorities according to how many tickets their users have compared to the highest priority job.

In Hive, groups that buy more resources are assigned more fairshare tickets, and their jobs therefore take priority over jobs submitted to the queue by groups that bought fewer resources.

The fairshare factor is not determined only by the tickets assigned to each group according to its investment in Hive; it also takes the group's usage history into account. If one group submits many jobs and another submits fewer, the fairshare factor can move the less active group's jobs to the top of the job queue.

The fair-share algorithm used by SLURM is described on the SLURM website; if you are interested in understanding the fairshare factor, please refer to http://slurm.schedmd.com/fair_tree.html

Below you will find some command examples of how to monitor your fairshare:

Check current fairshare definitions for your group account:

$ sshare -A ACCOUNTNAME

             Account       User  Raw Shares  Norm Shares    Raw Usage  Effectv Usage  FairShare
-------------------- ---------- ----------- ------------ ------------ -------------- ----------
         ACCOUNTNAME                    109     0.054801      1579497       0.004294    0.94713

Where Raw Shares is the number of tickets the group received (depending on how many resources the group bought) and FairShare is the value you get after the fairshare factor is calculated (refer to http://slurm.schedmd.com/fair_tree.html to understand how it works).
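To see how the fairshare factor feeds into the priority of pending jobs, the following commands can also be useful: sshare -a additionally lists the users under the account, and sprio -l shows the per-factor priority breakdown, including fairshare, of every pending job.

$ sshare -a -A ACCOUNTNAME
$ sprio -l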

to be continued....

 

Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

 

The Slurm workload manager is an important part of Hive: it is what lets us allocate and use Hive's resources in the right way.

Some basic necessary commands that every user should know: 

sinfo - show all available partitions and nodes
sbatch - submit a job from a script
squeue / smap - show the list of jobs in the queue
salloc - submit an allocation request
srun - submit an interactive job from a terminal
scancel - delete/remove a job

 * You can find all the information and options of each command by running man or --help, for example: man sinfo or sinfo --help


 

Partitions and nodes information

The hive partition contains the public compute nodes (bees) and is the default partition. Running jobs on the hive partition is recommended for new users and for users who want to be sure that their jobs will run to completion without any preemption.

The queen partition contains the high-memory compute nodes (queens); it is intended for jobs that require a lot of memory.

The preempt partitions are preempt-able and contain all the compute nodes (bees), both public and private. Preempt-able means that any job submitted to a higher-priority partition (hive, queen or a private partition) will kick your job off the preempt partition and return it to the queue, where it restarts from the beginning. In exchange, the preempt partitions impose lower per-user limits.

The ckpt partitions are preempt-able and contain all the HP or Dell compute nodes (bees), both public and private. Preempt-able means that any job submitted to a higher-priority partition (hive, queen or a private partition) will kick your job off the ckpt partition and return it to the queue. In exchange, the ckpt partitions impose lower per-user limits. These partitions are highly recommended for jobs that can be checkpointed: by combining a checkpoint-able job with the lower limits of the ckpt partitions, you can use more resources than on the public hive partitions.

The mpi partition contains all the public compute nodes (bees) and is intended for MPI jobs that use a large amount of resources.

The vespa partition contains the GPU nodes (vespas) and is intended for jobs that use GPU resources.

Private partitions contain the private nodes that different groups have bought; these partitions are usually named after their group. Running jobs on a private partition is allowed only for users that belong to the partition's group. Private nodes have no limitations and take 100% priority over any other partition that contains the same private nodes (the ckpt partitions, for example).

Please see the table below for information about the Hive compute nodes, the different types and the node range of each type:

Name               Quantity  Model                   CPUs  RAM    Name and range in Slurm / Notes
Compute (bees)     32        Dell PowerEdge C6220ii  20    128GB  bee001-032
Compute (bees)     30        HP XL170r               24    128GB  bee033-063
Fat node (queens)  1         Dell PowerEdge R820     32    760GB  queen01
Fat node (queens)  2         HP DL560                56    760GB  queen02-03
GPU (vespas)       1         HP XL190r               28    256GB  vespa01, Nvidia K80 GPU

 * For information regarding the limitations of each partition, please check the limitations section in the website menu.


 

Slurm SBATCH command and slurm job submission Scripts

The command "sbatch" should be the default command for running batch jobs. 

With "sbatch" you can run simple batch jobs from a command line, or you can execute complicated jobs from a prepared batch script.

First, you should understand the basic options you can add to the sbatch command in order to request the right allocation of resources for your jobs:

Commonly used options in #srun, #sbatch, #salloc:

-p partitionName

Submit the job to partition partitionName

-o output.log

Write the job's output to output.log instead of the default slurm-%j.out in the current directory

-e error.log

Write the job's STDERR to error.log instead of to the job output file (see -o above)

--mail-type=type

Email the submitter on job state changes. Valid type values are BEGIN, END, FAIL, REQUEUE and ALL (any state change).

--mail-user=email

User to receive email notification of state changes (see --mail-type above)

-n N

--ntasks N

Set the number of tasks (cores) to N (default = 1); the specific cores are chosen by SLURM

-N N

--nodes N

Set the number of nodes that will be part of the job. On each node, --ntasks-per-node processes will be started; if --ntasks-per-node is not given, 1 process per node is started.

--ntasks-per-node N

How many tasks per allocated node to start (see -N above)

--cpus-per-task N

Needed for multithreaded (e.g. OpenMP) jobs. This option tells SLURM to allocate N cores per task; typically N should equal the number of threads the program spawns, e.g. the same value as OMP_NUM_THREADS. A short example script is sketched after this list.

-J

--job-name

Set the job name shown in the queue. The job name (limited to the first 24 characters) is also used in the emails sent to the user.

-w node1,node2,...

Restrict job to run on specific nodes only

-x node1,node2,...

Exclude specific nodes from job
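As mentioned for --cpus-per-task above, below is a minimal sketch of an sbatch script for a multithreaded (OpenMP) job; the partition, time limit and program name (my_openmp_prog) are placeholders for illustration only:

#!/bin/bash
#SBATCH --job-name=openmp_example
#SBATCH --partition=hive1d
#SBATCH --ntasks=1              # a single process
#SBATCH --cpus-per-task=8       # 8 cores for that process's threads
#SBATCH --time=0-02:00:00       # 2 hours

# spawn as many threads as cores were allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./my_openmp_prog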

Below is an example of running a simple batch job directly from the Hive command line/terminal with a minimal resource allocation (1 node and 1 CPU/core):

$ sbatch -N1 -n1 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726

Where: 

-N1 requests 1 node

-n1 requests 1 CPU/core

. /etc/profile.d/modules.sh ; module load blast/2.2.30 is the command that loads the software your job uses (in this example, the blast program)

blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn is your job command

* The job will start on the default partition, hive1d. If your job is going to run for more than 1 day, you should add the -p option to your command and specify the hive7d or hiveunlim partition, which allow your job to run for up to 7 or 31 days respectively.

Below is an example of running a simple job on the hive7d partition with a time limit of 5 days:

$ sbatch -N1 -n1 -p hive7d --time=5-00:00:00 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726

So the general form of the command is: sbatch [allocation of resources] --wrap 'your job command'

The --wrap option has to come AFTER the resource allocation options, not before. You need --wrap in order to execute a job from the command line, because by default sbatch runs batch jobs from a batch script rather than from the command line.

Running jobs from a prepared Slurm submission script:

The following is a typical Slurm submission script example.

* Please note that there are two types of thin compute nodes (bees) on Hive: the Dell compute nodes (bee001-032) have 20 cores/CPUs, while the HP compute nodes (bee033-063) have 24 cores/CPUs.

#!/bin/sh
#SBATCH --ntasks 20           # use 20 cores
#SBATCH --ntasks-per-node=20  # use 20 cpus per each node
#SBATCH --time 1-03:00:00     # set job timelimit to 1 day and 3 hours
#SBATCH --partition hive1d    # partition name
#SBATCH -J my_job_name        # sensible name for the job

# load up the correct modules, if required
. /etc/profile.d/modules.sh
module load openmpi/1.8.4 RAxML/8.1.15

# launch the code
mpirun ...

How to submit a job from a batch script

To submit this, run the following command:

sbatch myscript.sh

Warning: do not execute the script

The job submission script file is written to look like a bash shell script. However, you do NOT submit the job to the queue by executing the script.

In particular, the following is INCORRECT:

# this is the INCORRECT way to submit a job
./myscript.sh  # wrong! this will not submit the job!

The correct way is noted above (sbatch myscript.sh).

Please refer to the sbatch manual (man sbatch) for more helpful information about using the sbatch command correctly.
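After submitting a job, you can check on it with the following commands (replace jobID with the number reported by sbatch):

$ squeue -u $USER            # list all of your jobs in the queue
$ scontrol show job jobID    # detailed information about a specific job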

 


 

Running a job with multiple partitions

It is possible to select more than one partition when submitting your job to the queue; to do that, specify the partitions you would like to submit to as a comma-separated list.

Your job will start on the first partition in your list, going from left to right, that has free resources. Examples:

All the examples below use the standard -p / --partition option of Slurm with a comma-separated list of partitions that you add to your job.

To run short jobs (less than 24 hours):

-p hive1d,hive7d,hiveunlim
If you also have some private nodes: (replace "private" with the partition name)
-p private,hive1d,hive7d,hiveunlim
If you are willing to risk your jobs being killed and rerun: (for short jobs)
-p hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d
Same with a private partition:
-p private,hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d

To run longer jobs: (up to 7 days)
-p hive7d,hiveunlim
Same with private partition:
-p private,hive7d,hiveunlim
With checkpointing:
-p hive7d,hiveunlim,ckptdell7d,ckptdell31d

or

-p hive7d,hiveunlim,ckpthp7d,ckpthp31d
Same with private partition:
-p private,hive7d,hiveunlim,ckptdell7d,ckptdell31d

or

-p private,hive7d,hiveunlim,ckpthp7d,ckpthp31d

To run very long jobs: (up to 31 days)
-p hiveunlim
Same with private partition:
-p private,hiveunlim
With checkpointing:
-p hiveunlim,ckptdell31d

or

-p hive7d,hiveunlim,ckpthp31d
Same with private partition:
-p private,hiveunlim,ckptdell31d

or

-p private,hiveunlim,ckpthp31d

To run jobs that require more than 128GB memory:
-p queen
With checkpointing:
-p queen,ckptqueen
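For example, a command-line submission that tries the public hive partitions first and falls back to the preempt partitions could look like this (the resource options and the job command are illustrative):

$ sbatch -N1 -n1 -p hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d --wrap 'your job command'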

 Job Arrays

The job array is a very helpful sbatch option for users who need to run a lot of single-core computations. With this option you can start hundreds of jobs from one sbatch script over a specified index range; in the Slurm queue each of those jobs gets a unique ID number, and the pending jobs are grouped into one entry.

Below is an example of an sbatch job script with the array option that starts 399 jobs, each executing with an index value from 1 to 399:

#!/bin/bash
#SBATCH --job-name=CodemlArrayJob
#SBATCH --partition=hive1d,hive7d,hiveunlim
#SBATCH --array=1-399
#SBATCH --output=out_%A_%a_%j.out
#SBATCH --error=error_%A_%a_%j.err

## To make things simple: ${i} == $SLURM_ARRAY_TASK_ID
i=${SLURM_ARRAY_TASK_ID}

# Check $WORKDIR existence
function checkWorkdir() {
    if [ ! -d "$1" ]; then
        echo "ERROR! Directory $1 doesn't exist."
        exit 1
    fi
}

# load the modules environment
. /etc/profile.d/modules.sh
module load PAML/4.8

# Job commands
echo "branch${i}"
WORKDIR="/data/home/privman/rwilly/2.Like.Clade12/guidence/guidence.results/BranchSiteModelA/BranchSiteWithScript/like.Clade12.branch${i}.workingDir"
checkWorkdir $WORKDIR
cd $WORKDIR
codeml LikeClade12.branch${i}.ctl

Note: In the --error and --output options you can use %A, %a and %j: %A prints the master array job ID, %a prints the array index of the task, and %j prints the unique job ID of each job submitted in array mode. This information can be useful when the job array is checkpoint-able; the job ID numbers (array and original) help you find which jobs have been checkpointed.
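A few related commands and options can be handy when working with arrays; for example (the job ID 214726 and the throttle value 50 are illustrative):

$ squeue -j 214726              # show all tasks of array job 214726
$ scancel 214726_17             # cancel only array task number 17
#SBATCH --array=1-399%50        # in the script: run at most 50 array tasks at a time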


 

Jobs With Checkpoint options:

Slurm has an option to checkpoint your running jobs every X minutes. Checkpoint-able jobs are needed to secure your progress on the preempt-able partitions, and if you are running a very long job you will also want checkpoints so that you can stop the job and continue it later from the last checkpoint.

Currently there are two partition families with time limits of 1, 7 and 31 days that are made for checkpointing: ckptdell (ckptdell1d, ckptdell7d, ckptdell31d) and ckpthp (ckpthp1d, ckpthp7d, ckpthp31d).

Below is an example of an sbatch job script that makes a checkpoint every 6 hours:

#!/bin/bash
#SBATCH --job-name=CheckpointableJobExample
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --partition=ckptdell1d
#SBATCH --checkpoint=360  # Time in minutes, every X minutes to make a checkpoint
#SBATCH --checkpoint-dir=/data/ckpt # Default place where your checkpoints will be created, you can change it to other place in your home folder


#
# The following lines are needed for checkpointing.
##restarterOpts="StickToNodes"   # Use the StickToNodes option only if your job cannot be resubmitted to machines other than the ones it started on
restarterOpts=""
. /data/scripts/include_for_checkpointing

### From here you can start editing your job commands

for i in {1..300}; do
echo -n "$i:    "
date
sleep 2
done

* The ckpt partitions are made for checkpoint-able jobs and have lower priority than the hive partitions, which means that any ckpt job can be preempted and restarted by a higher-priority job from a hive or private partition.

Attention: If your checkpointed job is stopped and then started again, it will continue from the last checkpoint that was made, not from the moment the job was stopped.

Limitations: The checkpoint option does not work for a few types of jobs; below is a list of cases that will not work with the checkpoint option:

  • BLCR will not checkpoint and/or restore open sockets (TCP/IP, Unix domain, etc.). At restart time any sockets will appear to have been closed.
  • BLCR will not checkpoint and/or restore open character or block devices (e.g. serial ports or raw partitions). At restart time any devices will appear to have been closed.
  • BLCR does not handle SysV IPC objects (man 5 ipc). Such resources are silently ignored at checkpoint time and are not restored.
  • If a checkpoint is taken of a process with any "zombie" children, then these children will not be recreated at restart time. A "zombie" is defined as a process that has exited, but whose exit status has not yet been reaped by its parent (via wait() or a related function). This means that a wait()-family call made after a restart will never return a status for such a child.

 

Using SRUN command

The "srun" command is used to run interactive jobs on the compute nodes of the HPC, The following example will run "matlab" on the compute node of the cluster:

$ module load matlab/r2014b
$ srun matlab


MATLAB is selecting SOFTWARE OPENGL rendering.

                         < M A T L A B (R) >
               Copyright 1984-2014 The MathWorks, Inc.
                R2014b (8.4.0.150421) 64-bit (glnxa64)
                          September 15, 2014

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

>>

In the example above we load the matlab module from the public software and then execute MATLAB as an interactive job with the srun command; you can see in the output that you are working with MATLAB interactively from the MATLAB command line.

With "srun" command you can run jobs with a lot of different allocation options like assigning number of nodes for the job, assigning how many tasks to use, how many tasks to use per node, how many CPU's to use per task and more. Use the "man srun" command to see all the possible options.


 

Using SALLOC Command 

With the "salloc" command you can obtain an interactive SLURM job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished.

If you would like to allocate resources on the cluster and then have the flexibility of using those resources in an interactive manner, you can use the command "salloc" to allow interactive use of resources allocated to your job. In the next example we will request 5 tasks and 2 hours for allocation:

 

$ salloc -n 5 --time=2:00:00

salloc: Pending job allocation 45924
salloc: job 45924 queued and waiting for resources
salloc: job 45924 has been allocated resources
salloc: Granted job allocation 45924

$ srun hostname
bee025
bee025
bee025
bee025
bee025

$hostname
hive01.haifa.ac.il

$exit
exit
salloc: Relinquishing job allocation 45924
salloc: Job allocation 45924 has been revoked.

The request enters the job queue just like any other job, and "salloc" will tell you that it is waiting for the requested resources if there aren't enough at the moment. When "salloc" tells you that your job has been allocated resources, you can interactively run programs on those resources with the "srun" command; the commands you run with "srun" are executed on the resources your job has been allocated. If you finish your work before the allocated time runs out, or if you did not set a time limit at all, use the "exit" command to release the allocation.

Warning: All commands that you execute after your job has been allocated resources must be run with the "srun" command; otherwise they will be executed on the access node and not on the allocated resources you asked for, as you can see in the example above (the plain hostname command runs on hive01, the access node).


Running GPU jobs

To execute a job that uses GPU power, prepare an sbatch script containing the --gres option (generic resources) with the 4 devices our GPU node contains (each device uses 7 of the node's CPUs); in addition, load the module named cuda/7.5, which loads the GPU driver packages. The GPU partition is named vespa and the GPU node is named vespa01.

Below is an example of a GPU job that executes a CUDA benchmark:

 

#!/bin/bash
#SBATCH --error=/data/benchmark/logs/cuda.%J.errors
#SBATCH --output=/data/benchmark/logs/cuda.%J.output
#SBATCH -p vespa
#SBATCH --gres gpu:4

. /etc/profile.d/modules.sh
module load cuda/7.5

/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
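Inside the job you can verify which GPU devices were allocated to it; for example, adding the following line to the script prints the device IDs that Slurm assigned via the CUDA_VISIBLE_DEVICES variable:

echo $CUDA_VISIBLE_DEVICES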

 


 

 

More Information:

List Partitions

To view the current status of all partitions accessible by the user:

$ sinfo -l

To view the current status of a partition named partitionName run:

$ sinfo -l -p partitionName

Display Partition Contents

To get a list of all jobs running in the partition named partitionName, run:

$ squeue -p partitionName

Same, limited to user userName:

$ squeue -p partitionName -u userName

 

Control Nodes

Get nodes state

One of the following commands can be used to get node(s) state, depending on desired verbosity level:

# sinfo -N

or

# sinfo -o "%20N %.11T %.4c %.8z %.15C %.10O %.6m %.8d"

 



List of a few more helpful Slurm commands:

Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive.

  • sacct is used to report job or job step accounting information about active or completed jobs.
  • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
  • sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.
  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
  • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
  • scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
  • scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
  • smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
  • srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
  • strigger is used to set, get or view event triggers. Event triggers include things such as nodes going down or jobs approaching their time limit.
  • sview is a graphical user interface to get and update state information for jobs, partitions, and nodes managed by SLURM.
If you need more information about Slurm, please go to the official Slurm documentation, where you can find more detailed information: https://computing.llnl.gov/linux/slurm/documentation.html

 

High-level Infrastructure Architecture

The Ethernet network is used for cluster management:

1. Console access (iDRAC/BMC)

2. Compute nodes provisioning using xCAT

3. Grid internal communications (scheduling, remote login, naming services, etc.)

Infiniband

InfiniBand is used for storage access and MPI traffic.

Access from HaifaU LAN

All grid components, including the compute nodes, are accessible from the University of Haifa LAN.

Storage

Three BeeGFS nodes provide nearly 115TB of usable shared storage space.

Servers:

Master node - R620, 2 x E5-2620v2, 64GB RAM, 4 x 1TB SATA - hosts the SLURM scheduler, BeeGFS, cluster management services and monitoring services.

Access node - a virtual machine hosted on the cluster management node, used by grid users to access the system and submit jobs to the grid.

Storage nodes - 4 storage servers with the BeeGFS file system - 2 x Dell PowerEdge R720XD (2 x E5-2620v2, 64GB RAM, 12 x 4TB storage) and 1 x HP DL380 (2 x E5-2620v3, 64GB RAM, 12 x 4TB storage).

Backup node - RX620, 2 x E5-2620v2, 64GB RAM, 2 x 300GB - a dedicated storage node connected directly to the management node, used as a backup store for a portion of the user data.

 

Name               Quantity  Model                   CPUs  RAM    Notes
Compute (bees)     32        Dell PowerEdge C6220ii  20    128GB  bee001-032
Compute (bees)     30        HP XL170r               24    128GB  bee033-063
Fat node (queens)  1         Dell PowerEdge R820     32    760GB  queen01
Fat node (queens)  2         HP DL560                56    760GB  queen02-03
GPU (vespas)       1         HP XL190r               28    256GB  vespa01, Nvidia K80 GPU

 

Operating Systems

The operating system on all of the grid's nodes is CentOS 6 update 7.

 

 

The computational needs of the Faculty of Natural Sciences are rapidly growing, particularly in bioinformatics research. The research of a large and growing proportion of biological labs is becoming critically dependent on high-performance computing (HPC), especially for genomic analyses. Researchers in physics and mathematics have also raised the need for HPC infrastructure.
 
The university recognized this need and allocated large funds to establish a general purpose computer cluster for the Faculty. Individual labs also contributed funds. These funds were used to contract EMET Computing, a specialist in the field. EMET's engineers designed and constructed a system to address our requirements. The key features are:
1. As many independent computing servers with as many computing cores as our money can buy (Dell PowerEdge servers, HP XL170r)
2. Fast interconnectivity allowing parallel computation (Infiniband network)
3. Large and fast central file server, suitable for genomic data and their analysis (BeeGFS distributed file system)
4. Backup of large volumes of critical data in a remote machine
5. Central management of the system, including an advanced job scheduler (SLURM)

 
