LMOD PLUGIN INFO

Shared software is located in /lustre1/data/apps and is managed with the Lmod module utility, which lets you load and unload software packages and specific versions of them. You can read more about Lmod here: https://lmod.readthedocs.io/

 

LMOD USER'S COMMANDS

 

The module command sets the appropriate environment variable independent of the user’s shell. Typically the system will load a default set of modules. A user can list the modules loaded by:

$ module list

To find out what modules are available to be loaded a user can do:

$ module avail

To load packages a user simply does:

$ module load package1 package2 ...

To unload packages a user does:

$ module unload package1 package2 ...

A user might wish to change from one compiler to another:

$ module swap gcc intel

The above command is short for:

$ module unload gcc
$ module load intel

To remove all modules do:

$ module purge

This will remove all modules. Lmod will try to reload any sticky modules.

To remove all modules including sticky modules do:

$ module --force purge

A user may wish to go back to an initial set of modules:

$ module reset

This will unload all currently loaded modules, including the sticky ones, then load the list of modules specified by LMOD_SYSTEM_DEFAULT_MODULES. There is a related command:

$ module restore

This command will also unload all currently loaded modules, including the sticky ones, and then load the system default unless the user has a default collection. See User Collections for more details.

If there are many modules on a system, it can be difficult to see what modules are available to load. Lmod provides the overview command to provide a concise listing. For example:

$ module overview

------------------ /opt/apps/modulefiles/Core -----------------
StdEnv    (1)   hashrf    (2)   papi        (2)   xalt     (1)
ddt       (1)   intel     (2)   singularity (2)
git       (1)   noweb     (1)   valgrind    (1)

--------------- /opt/apps/lmod/lmod/modulefiles/Core ----------
lmod (1)   settarg (1)

This shows the short name of each module (e.g. git or singularity); the number in parentheses is the number of available versions. The list above shows that there is one version of git and two versions of singularity.

If a module is not available then an error message is produced:

$ module load packageXYZ
Warning: Failed to load: packageXYZ

It is also possible to try to load a module without producing an error message if it does not exist; any other failure to load will still be reported:

$ module try-load packageXYZ

Modulefiles can contain help messages. To access a modulefile’s help do:

$ module help packageName

To get a list of all the commands that module knows about do:

$ module help

The module avail command has search capabilities:

$ module avail cc

will list any modulefile whose name contains the string “cc”.

Users may wish to test whether certain modules are already loaded:

$ module is-loaded packageName1 packageName2 ...

Lmod will return a true status if all modules are loaded and a false status if one is not. Note that Lmod is setting the status bit, there is nothing printed out. This means that one can do the following:

$ if module is-loaded pkg ; then echo "pkg is loaded"; fi

Users also may wish to test whether certain modules can be loaded with the current $MODULEPATH:

$ module is-avail packageName1 packageName2 ...

Lmod will return a true status if all modules are available and a false status if one cannot be loaded. Again, this command only sets the status bit.
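For example, mirroring the is-loaded test above, one can guard a load on availability (a minimal sketch; packageXYZ is just a placeholder name):

$ if module is-avail packageXYZ ; then module load packageXYZ ; fi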

Modulefiles can have a description section known as “whatis”. It is accessed by:

$ module whatis pmetis
pmetis/3.1  : Name: ParMETIS
pmetis/3.1  : Version: 3.1
pmetis/3.1  : Category: library, mathematics
pmetis/3.1  : Description: Parallel graph partitioning..

There is a keyword search tool:

$ module keyword word1 word2 ...

This will search any help message or whatis description for the word(s) given on the command line.

Another way to search for modules is with the “module spider” command. This command searches the entire list of possible modules. The difference between “module avail” and “module spider” is explained in the “Module Hierarchy” and “Searching for Modules” sections:

$ module spider

Users can also find which categories a site provides:

$ module category

------------------ List of Categories --------------------
Compiler           Programming tools        library
Graph partitioner  System Environment/Base  mpi
MPI library        Visual Tool              tools

To see which modules belong to a category, users can pick one or more names from the list of categories; the number of modules in each is shown:

$ module category library

---------------------- MPI library -----------------------
mpich (30)   openmpi (6)

----------------------- library --------------------------
boost     (1)   hdf5  (18)   pdtoolkit (1)    pmetis (6)
fftw2     (6)   metis (2)    petsc     (13)   tau    (3)
gotoblas2 (1)   papi  (1)    phdf5     (40)

Here we see that there are 30 versions of mpich and 6 versions of openmpi. Note also that the category name given, in this case “library”, is matched partially and case-insensitively.

Specifying modules to load

Modules are a way to ask for a certain version of a package. For example a site might have two or more versions of the gcc compiler collection (say versions 7.1 and 8.2). So a user may load:

$ module load gcc

or:

$ module load gcc/7.1

In the second case, Lmod will load gcc version 7.1, whereas in the first case Lmod will load the default version of gcc, which would normally be 8.2 unless the site marks 7.1 as the default.

In this user guide, we call gcc/7.1 the fullName of the module and gcc the shortName. What the user asked for is called the userName, which can be either the fullName or the shortName, depending on what the user typed on the command line.

Showing the contents of a module

There are several ways to use the show sub-command to show the contents of a modulefile. The first is to show the module functions instead of executing them:

$ module show gcc

This shows the functions such as setenv () or prepend_path () but nothing else. If you want to know the contents of the modulefile you can use:

$ module --raw show gcc

This will show the raw text of the modulefile. This is same as printing the modulefile, but here Lmod will find the modulefile for you. If you want to know just the location of a modulefile do:

$ module --redirect --location show gcc

You will probably use the --redirect option so that the output goes to stdout and not stderr.

If you want to know how Lmod will parse a TCL modulefile you can do:

$ tclsh $LMOD_DIR/tcl2lua.tcl  <path_to_TCL_modulefile>

This is useful when there is some question about how Lmod will treat a TCL modulefile.

ml: A convenient tool

For those of you who can’t type the mdoule, err moduel, err module command correctly, Lmod has a tool for you. With ml you won’t have to type the module command again. The two most common commands are module list and module load <something>, and ml does both:

$ ml

means module list. And:

$ ml foo

means module load foo while:

$ ml -bar

means module unload bar. It won’t come as a surprise that you can combine them:

$ ml foo -bar

means module unload bar; module load foo. You can do all the module commands:

$ ml spider
$ ml avail
$ ml show foo

If you ever have to load a module named spider you can do:

$ ml load spider

If you are ever forced to type the module command instead of ml then that is a bug and should be reported.

clearLmod: Completely remove the Lmod setup

It is rare, but sometimes a user might need to remove the Lmod setup from their current shell. This command can be used with bash/zsh/csh/tcsh to remove the Lmod setup:

$ clearLmod

This command prints a message telling the user what it has done. This message can be silenced with:

$ clearLmod --quiet

SAFETY FEATURES

(1): Users can only have one version active: The One Name Rule

If a user does:

$ module avail xyz

--------------- /opt/apps/modulefiles ----------------
xyz/8.1   xyz/11.1 (D)   xyz/12.0

$ module load xyz
$ module load xyz/12.0

The first load command will load the 11.1 version of xyz. In the second load, the module command knows that the user already has xyz/11.1 loaded so it unloads that and then loads xyz/12.0. This protection is only available with Lmod.

This is known as the One Name rule. This feature is core to how Lmod works and there is no way to override this.

(2) : Users can only load one compiler or MPI stack at a time.

Lmod provides an additional level of protection. If each of the compiler modulefiles add a line:

family("compiler")

Then Lmod will not load more than one compiler modulefile at a time. Another benefit of the family directive is that the environment variable LMOD_FAMILY_COMPILER is set to the family member's name (and not the version), which can be useful for specifying different options for different compilers. In the High Performance Computing (HPC) world, the Message Passing Interface (MPI) libraries are important. The MPI modulefiles can contain a family("MPI") directive, which prevents users from loading more than one MPI implementation at a time; likewise, the environment variable LMOD_FAMILY_MPI is set to the name of the MPI library.
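For example, a job script might branch on LMOD_FAMILY_COMPILER to pick compiler-specific flags (a minimal sketch; the flag values shown are illustrative only, not site recommendations):

case "$LMOD_FAMILY_COMPILER" in
  intel) CFLAGS="-O2 -xHost" ;;         # illustrative Intel flags
  gcc)   CFLAGS="-O2 -march=native" ;;  # illustrative GCC flags
  *)     CFLAGS="-O2" ;;
esac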

Module Hierarchy

Libraries built with one compiler need to be linked with applications with the same compiler version. If sites are going to provide libraries, then there will be more than one version of the library, one for each compiler version. Therefore, whether it is the Boost library or an mpi library, there are multiple versions.

There are two main choices for system administrators. For the XYZ library compiled with either the UCC compiler or the GCC compiler, there could be an xyz-ucc modulefile and an xyz-gcc modulefile. This gets much more complicated when there are multiple versions of the XYZ library and different compilers. How does one label the various versions of the library and the compiler? Even if the version labeling makes sense, when a user changes compilers, the user will have to remember to unload ucc and xyz-ucc when changing to gcc and xyz-gcc. If users have mismatched modules, their programs are going to fail in very mysterious ways.

A much saner strategy is to use a module hierarchy. Each compiler module adds to the MODULEPATH a compiler version modulefile directory. Only modulefiles that exist in that directory are packages that have been built with that compiler. When a user loads a particular compiler, that user only sees modulefile(s) that are valid for that compiler.

Similarly, applications that use libraries depending on MPI implementations must be built with the same compiler-MPI pairing. This leads to a modulefile hierarchy. Therefore, when users start with the minimum set of loaded modules, all they will see are compilers, not any of the packages that depend on a compiler. Once they load a compiler they will see the modules that depend on that compiler. After choosing an MPI implementation, the modules that depend on that compiler-MPI pairing become available. One of the nice features of Lmod is that it handles this hierarchy easily. If a user swaps compilers, Lmod automatically unloads any modules that depend on the old compiler and reloads the versions that depend on the new compiler.

$ module list

1) gcc/4.4.5 2) boost/1.45.0

$ module swap gcc ucc

Due to MODULEPATH changes the following modules have been reloaded: 1) boost

If a modulefile is not available with the new compiler, then the module is marked as inactive. Every time MODULEPATH changes, Lmod attempts to reload any inactive modules.

Searching For Modules

When a user enters:

$ module avail

Lmod reports only the modules that are in the current MODULEPATH. Those are the only modules that the user can load. If there is a modulefile hierarchy, then a package the user wants may be available but not with the current compiler version. Lmod offers a new command:

$ module spider

which lists all possible modules and not just the modules that can be seen in the current MODULEPATH. This command has three modes. The first mode is:

$ module spider

lmod: lmod/lmod
Lmod: An Environment Module System

ucc: ucc/11.1, ucc/12.0, ...
Ucc: the ultimate compiler collection

xyz: xyz/0.19, xyz/0.20, xyz/0.31
xyz: Solves any x or y or z problem.

This is a compact listing of all the possible modules on the system. The second mode describes a particular module:

$ module spider ucc
----------------------------------------------------------------------------
ucc:
----------------------------------------------------------------------------

Description:
Ucc: the ultimate compiler collection

Versions:
ucc/11.1
ucc/12.0

The third mode reports on a particular module version and where it can be found:

$ module spider parmetis/3.1.1
----------------------------------------------------------------------------
parmetis: parmetis/3.1.1
----------------------------------------------------------------------------
Description:
Parallel graph partitioning and fill-reduction matrix ordering routines

This module can be loaded through the following modules:
ucc/12.0, openmpi/1.4.3
ucc/11.1, openmpi/1.4.3
gcc/4.4.5, openmpi/1.4.3

Help:
The parmetis module defines the following environment variables: ...
The parmetis/3.1.1 module has been built with three different compilers (two versions of ucc and one version of gcc) and a single MPI implementation.

Controlling Modules During Login

Normally when a user logs in, there are a standard set of modules that are automatically loaded. Users can override and add to this standard set in two ways. The first is adding module commands to their personal startup files. The second way is through the “module save” command.

To add module commands to users’ startup scripts requires a few steps. Bash users can put the module commands in either their ~/.profile file or their ~/.bashrc file. It is simplest to place the following in their ~/.profile file:

if [ -f ~/.bashrc ]; then
   .   ~/.bashrc
fi

and place the following in their ~/.bashrc file:

if [ -z "$BASHRC_READ" ]; then
   export BASHRC_READ=1
   # Place any module commands here
   # module load git
fi

By wrapping the module command in an if test, the module commands need only be read in once. Any sub-shell will inherit the PATH and other environment variables automatically. On login shells the ~/.profile file is read which, in the above setup, causes the ~/.bashrc file to be read. On interactive non-login shells, the ~/.bashrc file is read instead. Obviously, having this setup means that module commands need only be added in one file and not two.

Csh users need only specify the module commands in their ~/.cshrc file as that file is always sourced:

if ( ! $?CSHRC_READ ) then
   setenv CSHRC_READ 1
   # Place any module command here
   # module load git
endif

User Collections

User defined initial list of login modules:

Assuming that the system administrators have installed Lmod correctly, there is a second way which is much easier to set up. A user logs in with the standard modules loaded. Then the user modifies the default setup through the standard module commands:

$ module unload XYZ
$ module swap gcc ucc
$ module load git

Once users have the desired modules load then they issue:

$ module save

This creates a file called ~/.config/lmod/default which holds the list of desired modules. Note that only the current set of modules is recorded in the collection. If module X loads module A and the user unloads module A before doing module save, then module A will NOT be loaded when the collection is restored. All load(), always_load(), and depends_on() statements inside the modulefiles are ignored when restoring a collection; instead, Lmod loads just the list of modulefiles stored in the collection.

Once this is set-up a user can issue:

$ module restore

and only the desired modules will be loaded. If Lmod is setup correctly (see Providing A Standard Set Of Modules for all Users) then the default collection will be the user’s initial set of modules.

If a user doesn’t have a default collection, the Lmod purges ALL currently loaded modules, including the sticky ones, and loads the list of module specified by LMOD_SYSTEM_DEFAULT_MODULES just like the module reset command.

Users can have as many collections as they like. They can save to a named collection with:

$ module save <collection_name>

and restore that named collection with:

$ module restore <collection_name>

A user can print the contents of a collection with:

$ module describe <collection_name>

A user can list the collections they have with:

$ module savelist

Finally a user can disable a collection with:

$ module disable <collection_name>

If no collection_name is given then the default collection is disabled. Note that the collection is not removed, just renamed: if a user disables the foo collection, the file foo is renamed to foo~. To restore the foo collection, a user will have to do the following:

$ cd ~/.config/lmod;  mv foo~ foo

Rules for loading modules from a collection

Lmod has rules on what modules to load when restoring a collection. Remember that the userName is what the user asked for, the fullName is the exact module name, and the shortName is the name of the package (e.g. gcc, fftw3).

  1. Lmod records the fullName and the userName in the collection.

  2. If the userName is the same as the fullName then it loads fullName independent of the default.

  3. if the userName is not the same as the fullName then it loads the default.

  4. If LMOD_PIN_VERSIONS=yes, then the fullName is always loaded.

In other words if a user does:

$ module --force purge; module load A B C
$ module save

then “module restore” will load the default versions of A, B, and C. So if the default for module A changed between when the collection was saved and when it is restored, the new default version of A will be loaded. This assumes that LMOD_PIN_VERSIONS is not set. If it is set (or Lmod is configured that way), and A/1.1, B/2.4 and C/3.3 were the defaults when the collection was saved, then those exact versions will be loaded in the future, independent of what the defaults become.

On the other hand:

$ module --force purge; module load A/1.0 B/2.3 C/3.4
$ module save

then “module restore” will load A/1.0, B/2.3, and C/3.4, independent of what the defaults are now or in the future.

User Collections on shared home file systems

If your site has a shared home file system, then things become a little more complicated. A shared home file system means that your site has a single home file system shared between two or more clusters. See Lmod on Shared Home File Systems for a system administrator's point of view.

A collection on one cluster needs to be independent of collections on another cluster, so your site should set $LMOD_SYSTEM_NAME uniquely for each cluster. Suppose you have clusters A and B; then $LMOD_SYSTEM_NAME will be either A or B. A default collection will be named “default.A” on the A cluster and “default.B” on the B cluster. The names a user sees have the extension removed. In other words, on the A cluster a user would see:

$ module savelist

  1) default

where the default file is named “default.A”.
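A minimal sketch of what this looks like from the shell on the A cluster (assuming the site exports LMOD_SYSTEM_NAME as described above):

$ echo $LMOD_SYSTEM_NAME
A
$ module save        # writes ~/.config/lmod/default.A on this cluster
$ module savelist    # lists the collection simply as "default"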

Showing hidden modules

Site modules (or a user's personal modules) can be hidden from the normal “module avail” or “module spider” output through different mechanisms. See Hidden Modules.

To see hidden modules, one can do:

$ module --show_hidden avail
$ module --show_hidden spider

 

The computational needs of the Faculty of Natural Sciences are growing rapidly, particularly in bioinformatics research. The work of a large and growing proportion of biological labs is becoming critically dependent on high-performance computing (HPC), especially for genomic analyses. Researchers in physics and mathematics have also raised needs for HPC infrastructure.
 
The university recognized this need and allocated large funds to establish a general purpose computer cluster for the Faculty. Individual labs also contributed funds. These funds were used to contract EMET Computing, a specialist in the field. EMET's engineers designed and constructed a system to address our requirements. The key features are:
1. As many independent computing servers with as many computing cores as our money can buy (HPE Cray supercomputing)
2. Fast interconnectivity allowing parallel computation (Infiniband network)
3. Large and fast central file server, suitable for genomic data and their analysis (Lustre file system)
4. Backup of large volumes of critical data in a remote machine
5. Central management of the system, including an advanced job scheduler (SLURM)

 


Currently, Hive has limits on the public partitions. The limits are set per partition and are the same for all users of the public partitions.

At present there are 3 Hive public partitions, one queen partition (fat-memory nodes) and one mpi partition.

To view real-time partition information from the Hive terminal, type the command sinfo. In addition, every user sees a login message containing the real-time QOS limits of each partition, user information and node information.

The hive partition contains 40 public compute nodes and is the default. Running jobs on the hive partition is recommended for new users and for users who want to be sure that their job will run to completion without any preemption.

The queen partition contains the high-memory compute nodes (queens); it is meant for jobs that require a lot of memory.

The preempt partition is preemptable and contains all the compute nodes (bees), both public and private. Preemptable means that any job submitted to a higher-priority partition (hive, queen or a private partition) will kick off your job running on the preempt partition, which is then requeued and restarted from the beginning. In return, the preempt partition gives you the benefit of looser per-user limits.

The mpi partition contains all the public compute nodes (bees) and is meant for MPI jobs that use a large amount of resources.

The guest partition contains thin compute nodes and is meant for Hive guest users (non-Faculty of Science members); it is low-priority and preemptable by other partitions.

Private partitions contain the private nodes that different groups have bought; these partitions are usually named after the group. Running jobs on a private partition is allowed only for users who belong to that group. Private nodes have no limitations and get 100% priority over any other partition containing those private nodes (the ckpt partitions, for example).

Below you can see information about the limits each partition has:

 
hive1d - DEFAULT partition. Limits: WallTime: 1 day, MaxCPUsPerUser ~ 1200
hive7d - Limits: WallTime: 7 days, MaxCPUsPerUser ~ 350
hiveunlim - Limits: WallTime: unlimited, MaxCPUsPerUser ~ 180
preempt1d - Limits: WallTime: 1 day, MaxCPUsPerUser ~ 2400
preempt7d - Limits: WallTime: 7 days, MaxCPUsPerUser ~ 1800
queen - Limits: WallTime: 31 days, MaxNODEsPerUser ~ 80
mpi - Limits: MinCPUsPerUser = 1
guest - Limits: WallTime: 7 days, MaxCPUsPerUser = 400 (low priority)

*WallTime = maximum time each job can run.
*MaxCPUsPerUser = maximum number of CPUs a user can use in the given partition.
*MaxNODEsPerUser = maximum number of nodes on which a user can run jobs in the given partition.
*MinCPUsPerUser = minimum resource allocation per job.
*Preemption=REQUEUE = low-priority partition; all jobs running on such a partition can be preempted/checkpointed by a higher-priority partition. A preempted job is restarted from the beginning or from its last checkpoint.
 
Hive has two types of slim compute nodes (bees) and two types of fat compute nodes (queens); please see the ranges of the different nodes below:

bee033-071 have 24 CPUs per node and 128 GB memory
bee073-145 have 64 CPUs per node and 250 GB memory
queen02-03 have 56 CPUs and 768 GB memory
queen04 has 80 CPUs and 1.5 TB memory
queen05 has 80 CPUs and 360 GB memory
 
For additional information regarding node hardware, please refer to the hardware section.
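To check the CPU count and memory of the nodes yourself, sinfo's format options can be used, for example (a sketch; hive1d is the default partition named above):

$ sinfo -N -p hive1d -o "%N %c %m"    # node name, CPUs per node, memory (MB)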

 

Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

 


 

Partitions and nodes information

The hive partition contains the public compute nodes (bees) and is the default partition. Running jobs on the hive partitions is recommended for new users and for users who want to be sure that their job will run to completion without any preemption.

The queen partition contains the high-memory compute nodes (queens); it is meant for jobs that require a lot of memory.

The preempt partition is preemptable and contains all the compute nodes (bees), both public and private. Preemptable means that any job submitted to a higher-priority partition (hive, queen or a private partition) will kick off your job running on the preempt partition, which is then requeued and restarted from the beginning. In return, the preempt partition gives you the benefit of looser per-user limits.

The mpi partition contains all the public compute nodes (bees) and is meant for MPI jobs that use a large amount of resources, especially parallel jobs.

Private partitions contain the private nodes that different groups have bought; these partitions are usually named after the group. Running jobs on a private partition is allowed only for users who belong to that group. Private nodes have no limitations and get 100% priority over any other partition.

Please see the below table with information regarding Hive compute nodes, different types and range of each type:

Name                       Quantity  Model                      CPU's  RAM     Notes/Name and range in Slurm
OldBees - Compute (bee's)  38        HP XL170r                  24     128GB   bee33-71
Bees - Compute (bee's)     73        HP ProLiant XL220n Gen10   64     250GB   bee73-145
Fat node (old queen's)     2         HP DL560                   56     760GB   queen2-3
Fat node (queen's)         2         HP DL560-G10               80     1.5 TB  queen4, queen5 (350 GB)
GPU (vespa's)              N/A at hive02

 * For information regarding the limits of each partition, please check the limitations section in the website menu.


 

Slurm SBATCH command and slurm job submission Scripts

The command "sbatch" should be the default command for running batch jobs. 

With "sbatch" you can run simple batch jobs from a command line, or you can execute complicated jobs from a prepared batch script.

First, you should understand the basic options you can add to the sbatch command in order to request the right allocation of resources for your jobs:

Commonly used options in #srun, #sbatch, #salloc:

-p partitionName

Submit the job to the partition partitionName

-o output.log

Append the job's output to output.log instead of slurm-%j.out in the current directory

-e error.log

Append the job's STDERR to error.log instead of the job output file (see -o above)

--mail-type=type

Email the submitter on job state changes. Valid type values are BEGIN, END, FAIL, REQUEUE and ALL (any state change).

--mail-user=email

User to receive email notification of state changes (see --mail-type above)

-n N

--ntasks N

Set the number of tasks (cores) to N (default = 1); the cores will be chosen by SLURM

-N N

--nodes N

Set the number of nodes that will be part of the job. On each node, --ntasks-per-node processes will be started. If --ntasks-per-node is not given, 1 process per node will be started.

--ntasks-per-node N

How many tasks to start per allocated node (see -N above)

--cpus-per-task N

Needed for multithreaded (e.g. OpenMP) jobs. The option tells SLURM to allocate N cores per task; typically N should be equal to the number of threads the program spawns, i.e. it should be set to the same value as OMP_NUM_THREADS (see the sketch after this list).

-J

--job-name

Set the job name shown in the queue. The job name (limited to the first 24 characters) is used in emails sent to the user.

-w node1,node2,...

Restrict the job to run only on the specified nodes

-x node1,node2,...

Exclude the specified nodes from the job
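As an illustration of the --cpus-per-task option above, a multithreaded (OpenMP) batch script might look like this (a minimal sketch; my_openmp_program is a placeholder for your own executable):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8     # one task with 8 cores for 8 OpenMP threads
#SBATCH --partition=hive1d
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the thread count to the allocation
./my_openmp_program           # placeholder for your own program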

Below is an example of running a simple batch job directly from the Hive command line/terminal with a minimal resource allocation (1 node and 1 CPU/core):

$ sbatch -N1 -n1 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726

Where:

-N1 requests 1 node

-n1 requests 1 CPU/core

. /etc/profile.d/modules.sh ; module load blast/2.2.30 loads the software that your job uses (in this example, the blast program)

blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn is your job command

* The job will start on the default partition, hive1d. If your job is going to run for more than 1 day, you should add the -p option to your command and specify the hive7d or hiveunlim partition, which allow your job to run for up to 7 days or without a time limit, respectively.

Below is an example of running a simple job on hive7d partition with time limit for 5 days:

$ sbatch -N1 -n1 -p hive7d --time=5-00:00:00 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726

So basically you should execute the job following this pattern: [sbatch command] [allocation of resources] [--wrap] ['your job command']

The --wrap option has to come AFTER the allocation of the needed resources, not before. You have to use the --wrap option in order to execute jobs from the command line, because by default sbatch runs batch jobs from a batch script rather than from the command line.

Running jobs from prepared slurm submission script:

The following is a typical Slurm submission script example.

* Please note that there are a few types of thin compute nodes (bees) on Hive: the Dell compute nodes (bee001-032) have 20 cores/CPUs, while the HP compute nodes (bee033-063) have 24 cores/CPUs.

#!/bin/sh
#SBATCH --ntasks 20           # use 20 cores
#SBATCH --ntasks-per-node=20  # use 20 cpus per each node
#SBATCH --time 1-03:00:00     # set job timelimit to 1 day and 3 hours
#SBATCH --partition hive1d    # partition name
#SBATCH -J my_job_name        # sensible name for the job

# load up the correct modules, if required
. /etc/profile.d/modules.sh
module load openmpi/1.8.4 RAxML/8.1.15

# launch the code
mpirun ...

How to submit a job from a batch script

To submit this, run the following command:

sbatch myscript.sh

Warning: do not execute the script

The job submission script file is written to look like a bash shell script. However, you do NOT submit the job to the queue by executing the script.

In particular, the following is INCORRECT:

# this is the INCORRECT way to submit a job
./myscript.sh  # wrong! this will not submit the job!

The correct way is noted above (sbatch myscript.sh).

Please refer to the sbatch manual (man sbatch) for more information about how to use the sbatch command correctly.

 


 

Running a job with multiple partitions

You can select more than one partition when submitting your job to the queue; to do that, specify the partitions you would like to submit your job to as a comma-separated list.

Your job will start in the first partition from the list that has free resources, going from left to right. Examples:

All the examples below show the standard -p / --partition option of Slurm with a comma-separated list of partitions; a complete sbatch example appears after the list.

To run short jobs (less than 24 hours):
-p hive1d,hive7d,hiveunlim

If you also have some private nodes (replace "private" with the partition name):
-p private,hive1d,hive7d,hiveunlim

If you are willing to risk your jobs being killed and rerun (for short jobs):
-p hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d

Same with a private partition:
-p private,hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d

To run longer jobs (up to 7 days):
-p hive7d,hiveunlim

Same with a private partition:
-p private,hive7d,hiveunlim

With checkpointing:
-p hive7d,hiveunlim,ckptdell7d,ckptdell31d
or
-p hive7d,hiveunlim,ckpthp7d,ckpthp31d

Same with a private partition:
-p private,hive7d,hiveunlim,ckptdell7d,ckptdell31d
or
-p private,hive7d,hiveunlim,ckpthp7d,ckpthp31d

To run very long jobs (up to 31 days):
-p hiveunlim

Same with a private partition:
-p private,hiveunlim

With checkpointing:
-p hiveunlim,ckptdell31d
or
-p hive7d,hiveunlim,ckpthp31d

Same with a private partition:
-p private,hiveunlim,ckptdell31d
or
-p private,hiveunlim,ckpthp31d

To run jobs that require more than 128GB memory:
-p queen

With checkpointing:
-p queen,queenckpt
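Putting this together, a complete command-line submission with a partition list might look like the following (a sketch that reuses the blast example from above):

$ sbatch -N1 -n1 -p hive1d,hive7d,hiveunlim --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'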

 Job Arrays

The job array is a very helpful sbatch option for users who need to run a lot of single-core computations. With this option you can start hundreds of jobs from one sbatch script over a specified index range; in the Slurm queue each of those jobs has a unique ID number, and the pending jobs are grouped into one entry.

Below is an example of an sbatch job script with the array option that starts 399 jobs, each executing with an index variable in the range 1-399:

#!/bin/bash

#SBATCH --job-name=CodemlArrayJob
#SBATCH --partition=hive1d,hive7d,hiveunlim
#SBATCH --array=1-399
#SBATCH --output=out_%A_%a_%j.out
#SBATCH --error=error_%A_%a_%j.err

## To make things simple: ${i} == $SLURM_ARRAY_TASK_ID
i=${SLURM_ARRAY_TASK_ID}

# Check $WORKDIR existence
function checkWorkdir() {
    if [ ! -d "$1" ]; then
        echo "ERROR! Directory $1 doesn't exist."
        exit 1
    fi
}

# load the modules environment
. /etc/profile.d/modules.sh
module load PAML/4.8

# Job commands
echo "branch${i}"
WORKDIR="/data/home/privman/rwilly/2.Like.Clade12/guidence/guidence.results/BranchSiteModelA/BranchSiteWithScript/like.Clade12.branch${i}.workingDir"
checkWorkdir $WORKDIR
cd $WORKDIR
codeml LikeClade12.branch${i}.ctl

Note: In the --error and --output options you have %A, %a and %j: %A expands to the master array job ID, %a to the array index of the task, and %j to the unique job ID of each job submitted in array mode. This job ID information can be useful when the job array is checkpointable, since the job ID numbers (array and original) will help you find which jobs have been checkpointed.
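Submitting and monitoring such an array works like any other batch script (a sketch; array_job.sh is a placeholder filename):

$ sbatch array_job.sh
$ squeue -u $USER    # pending array tasks appear grouped under one entry such as <jobid>_[1-399]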


 

Jobs With Checkpoint options:

Slurm has an option to checkpoint your running jobs every X minutes. Checkpointable jobs are useful for securing your progress on a preemptable partition; likewise, if you are running a very long job, you will want checkpoints so that you have the option of stopping the job and continuing it from the last checkpoint.

Currently there are two groups of partitions, with time limits of 1, 7 and 31 days, that are made for checkpointing: ckptdell (ckptdell1d, ckptdell7d, ckptdell31d) and ckpthp (ckpthp1d, ckpthp7d, ckpthp31d).

Below you can see an example of an sbatch job script that makes a checkpoint every 6 hours:

#!/bin/bash
#SBATCH --job-name=CheckpointableJobExample
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --partition=ckptdell1d
#SBATCH --checkpoint=360            # Time in minutes, every X minutes to make a checkpoint
#SBATCH --checkpoint-dir=/data/ckpt # Default place where your checkpoints will be created; you can change it to another place in your home folder

#
# The following lines are needed for checkpointing.
##restarterOpts="StickToNodes"      # Use the StickToNodes option only if your job cannot be resubmitted to machines other than the ones it started on
restarterOpts=""
. /data/scripts/include_for_checkpointing

### From here you can start editing your job commands

for i in {1..300}; do
    echo -n "$i:    "
    date
    sleep 2
done

Attention: If your checkpointed job is stopped and then started again, it will continue from the last checkpoint that was made, not from the moment just before the job was stopped.

Limitations: The checkpoint option does not work with a few types of jobs; below is a list of job types that will not work with the checkpoint option:

  • BLCR will not checkpoint and/or restore open sockets (TCP/IP, Unix domain, etc.). At restart time any sockets will appear to have been closed.
  • BLCR will not checkpoint and/or restore open character or block devices (e.g. serial ports or raw partitions). At restart time any devices will appear to have been closed.
  • BLCR does not handle SysV IPC objects (man 5 ipc). Such resources are silently ignored at checkpoint time and are not restored.
  • If a checkpoint is taken of a process with any "zombie" children, then these children will not be recreated at restart time. A "zombie" is defined as a process that has exited, but whose exit status has not yet been reaped by its parent (via wait() or a related function). This means that a wait()-family call made after a restart will never return a status for such a child.

 

Using SRUN command

The "srun" command is used to run interactive jobs on the compute nodes of the HPC, The following example will run "matlab" on the compute node of the cluster:

$ module load matlab/r2014b
$ srun matlab


MATLAB is selecting SOFTWARE OPENGL rendering.

                          < M A T L A B (R) >
                Copyright 1984-2014 The MathWorks, Inc.
                 R2014b (8.4.0.150421) 64-bit (glnxa64)
                            September 15, 2014

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

>>

In the example above we load the matlab module from the public software and then execute matlab as an interactive job with the srun command; you can see from the output that you are working with matlab interactively from the MATLAB command line.

With "srun" command you can run jobs with a lot of different allocation options like assigning number of nodes for the job, assigning how many tasks to use, how many tasks to use per node, how many CPU's to use per task and more. Use the "man srun" command to see all the possible options.


 

Using SALLOC Command 

With the "salloc" command you can obtain an interactive SLURM job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished.

If you would like to allocate resources on the cluster and then have the flexibility of using those resources interactively, you can use the "salloc" command to allow interactive use of the resources allocated to your job. In the next example we request 5 tasks and a 2-hour allocation:

 

$ salloc -n 5 --time=2:00:00

salloc: Pending job allocation 45924
salloc: job 45924 queued and waiting for resources
salloc: job 45924 has been allocated resources
salloc: Granted job allocation 45924

$ srun hostname
bee025
bee025
bee025
bee025
bee025

$ hostname
hive01.haifa.ac.il

$ exit
exit
salloc: Relinquishing job allocation 45924
salloc: Job allocation 45924 has been revoked.

The request enters the job queue just like any other job, and "salloc" will tell you that it is waiting for the requested resources if there aren't enough at the moment. When "salloc" tells you that your job has been allocated resources, you can interactively run programs on those resources with the "srun" command. The commands you run with "srun" will then be executed on the resources your job has been allocated. If you finish your work before the allocated time runs out, or if you didn't request a time limit at all, use the "exit" command to end the allocation.

Warning: All commands that you execute after your job has been allocated resources must be run with the "srun" command; otherwise those commands will be executed on the access node and not on the allocated resources you asked for, as you can see in the example above.


Running GPU jobs

At present there is no GPU unit at hive02. To execute a job that uses GPU power, you should contact Nikolai and request access to the Deep Learning Cluster belonging to the Department of Computer Sciences.

 

 


 

 

More Information:

List Partitions

To view the current status of all partitions accessible by the user:

$ sinfo -l

To view the current status of a partition named partitionName run:

$ sinfo -l -p partitionName

Display Partition Contents

To get a list of all jobs running in a partition named partitionName run:

$ squeue -p partitionName

Same, limited to user userName:

$ squeue -p partitionName -u userName

 

Control Nodes

Get nodes state

One of the following commands can be used to get node(s) state, depending on desired verbosity level:

# sinfo -N

or

# sinfo -o "%20N %.11T %.4c %.8z %.15C %.10O %.6m %.8d"

 


  


List of a few more helpful Slurm commands:

Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive.

  • sacct is used to report job or job step accounting information about active or completed jobs.
  • salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
  • sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.
  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
  • sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
  • scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
  • scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
  • smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
  • srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
  • strigger is used to set, get or view event triggers. Event triggers include things such as nodes going down or jobs approaching their time limit.
  • sview is a graphical user interface to get and update state information for jobs, partitions, and nodes managed by SLURM.
If you need more information about Slurm, please go to the official Slurm documentation, where you can find much more detail: https://computing.llnl.gov/linux/slurm/documentation.html

 

The fair-share component of the job priority is calculated differently. The goal is to make sure that the priority strictly follows the account hierarchy, so that jobs under accounts with usage lower than their fair share will always have a higher priority than jobs belonging to accounts which are over their fair share.

The algorithm is based on ticket scheduling, where at the root of the account hierarchy one starts with a number of tickets, which are then distributed per the fairshare policy to the child accounts and users. Then, the job whose user has the highest number of tickets is assigned the fairshare priority of 1.0, and the other pending jobs are assigned priorities according to how many tickets their users have compared to the highest priority job.

In Hive, groups that buy more resources are assigned more fairshare tickets, and their jobs have priority over jobs submitted to the queue by groups that bought fewer resources.

The fairshare factor is not defined only by the fairshare tickets assigned to groups according to their investment in Hive; it also takes the group's job history into account. If one group submits a lot of jobs and another group submits fewer, the fairshare factor can move the less active group's jobs toward the top of the job queue.

The fair-share algorithm in SLURM is described on the Slurm website; please refer to the following link if you are interested in understanding the fairshare factor: http://slurm.schedmd.com/fair_tree.html

Below you will find some command examples of how to monitor your fairshare:

Check current fairshare definitions for your group account:

$ sshare -A ACCOUNTNAME

             Account       User Raw Shares Norm Shares   Raw Usage Effectv Usage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
         ACCOUNTNAME                   109    0.054801     1579497      0.004294    0.94713

Here Raw Shares is the number of tickets the group received (depending on how many resources the group bought), and FairShare is the number obtained after the fairshare factor is calculated (refer to http://slurm.schedmd.com/fair_tree.html to understand how it works).
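To see how the fairshare factor enters the priority of your own pending jobs, Slurm's sprio command can be used (a sketch; it lists the priority components, including the fairshare one, per pending job):

$ sprio -l -u $USER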

to be continued....

 

High-level Infrastructure Architecture

 

The Ethernet network is used for cluster management:

1. Console access (iDRAC/BMC)

2. Compute nodes provisioning using xCAT

3. Grid internal communications (scheduling, remote login, naming services, etc.)

Infiniband

InfiniBand is used for storage access and MPI traffic.

Access from HaifaU LAN

All grid components, including compute nodes, are accessible from the University of Haifa LAN.

Storage

Lustre is a high-performance parallel distributed file system designed for use in HPC and supercomputing environments. It offers scalable and high-speed storage solutions, enabling efficient data access and storage for large-scale computational workloads. 

Servers:

Lustre Storage Servers - Redundant servers support storage fail-over, while metadata and data are stored on separate servers, allowing each file system to be optimized for different workloads. Lustre can deliver fast IO to applications across high-speed network fabrics, such as Ethernet, InfiniBand (IB), Omni-Path (OPA), and others.

Storage server - The storage system of the cluster is an HPE Cray E1000 ClusterStor server. Note that this storage is meant for ongoing analyses; it is not an archive system. It is a distributed file system that allows high performance, i.e. fast reading and writing of many files, both large and small. It is composed of many disks but functions as one storage volume, with a total of 919 TB. This total volume is made up of a hybrid set of disks, including both HDD and SSD, which ensures high performance for different use cases.

Please note that your files on the Hive2 storage system are NOT BACKED UP by default. It is your responsibility to back up your vital files and irreplaceable data. While the E1000 server is a highly resilient solution, any system carries a risk of failure and data loss. Therefore, please make sure to back up important files.

Backup server - The Hive2 backup system can automatically back up files, but only for users who buy space on the backup server. The backup system uses rsync replication of users' data. Rsnapshot is used to create daily snapshots, keeping up to 14 snapshots.
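Since files on the cluster are not backed up by default, users may want to copy vital results off the cluster themselves, for example with rsync (a sketch; the destination host and paths are hypothetical placeholders):

$ rsync -avz ~/my_project/results/ username@backup.example.org:/backup/my_project/results/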

 

Name                    Quantity  Model                          CPU's  RAM     Notes
Compute (old bee's)     38        HP XL170r                      24     128GB   bee033-071
Compute (bee's)         73        HP ProLiant XL220n Gen10 Plus  64     250GB   bee073-145
Fat node (old queen's)  1         HP DL560                       56     760GB   queen02-03
Fat node (queen's)      2         HP DL560-G10                   80     1.5TB   queen4 (1.5 TB), queen5 (360GB)
GPU (vespa's)           -         -                              -      -       -

 

Operating Systems

For xCAT we are using RHEL 8.5, since it is an operating system supported by xCAT. For the compute nodes and other xCAT-managed hosts we are using RHEL 9.1. The operating system can be upgraded easily as needed using xCAT.