LMOD PLUGIN INFO
Shared software is located in /lustre1/data/apps and is managed with the Lmod module utility, which allows you to load and unload software packages and specific versions of them. You can read more about the module command here: https://lmod.readthedocs.io/
LMOD USER'S COMMANDS
The module command sets the appropriate environment variable independent of the user’s shell. Typically the system will load a default set of modules. A user can list the modules loaded by:
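For example, with the standard Lmod sub-command:
$ module list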
To find out what modules are available to be loaded a user can do:
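For example:
$ module avail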
To load packages a user simply does:
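For example (the package names here are placeholders):
$ module load package1 package2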
To unload packages a user does:
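For example:
$ module unload package1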
A user might wish to change from one compiler to another:
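For example, taking intel and gcc as illustrative compiler module names:
$ module swap intel gcc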
The above command is short for:
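That is:
$ module unload intel; module load gcc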
To remove all modules do:
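Namely:
$ module purge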
This will remove all modules. Lmod will try to reload any sticky modules.
To remove all modules including sticky modules do:
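Namely, with the force option:
$ module --force purge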
A user may wish to go back to an initial set of modules:
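The Lmod sub-command for this is:
$ module reset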
This will unload all currently loaded modules, including the sticky ones, then load the list of modules specified by LMOD_SYSTEM_DEFAULT_MODULES. There is a related command:
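Namely:
$ module restore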
This command will also unload all currently loaded modules, including the sticky ones, and then load the system default unless the user has a default collection. See User Collections for more details.
If there are many modules on a system, it can be difficult to see what modules are available to load. Lmod provides the overview command to provide a concise listing. For example:
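For example; the sample output is illustrative only, reconstructed to match the description below:
$ module overview
git (1)   singularity (2)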
This shows the short name of the module (i.e. git or singularity); the number in parentheses is the number of versions of each. The list above shows that there is one version of git and two versions of singularity.
If a module is not available then an error message is produced:
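For example (packageXYZ being a placeholder for a module that does not exist on the system):
$ module load packageXYZ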
It is possible to try to load a module without an error message if it does not exist; any other failure to load will still be reported:
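For example:
$ module try-load packageXYZ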
Modulefiles can contain help messages. To access a modulefile’s help do:
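For example (packageName is a placeholder):
$ module help packageName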
To get a list of all the commands that module knows about do:
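Namely:
$ module help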
The module avail command has search capabilities:
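For example:
$ module avail cc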
will list any modulefile whose name contains the string "cc".
Users may wish to test whether certain modules are already loaded:
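For example (the package names are placeholders):
$ module is-loaded package1 package2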
Lmod will return a true status if all the modules are loaded and a false status if one is not. Note that Lmod only sets the exit status; nothing is printed out. This means that one can do the following:
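A minimal sketch of such a test in a shell script, using gcc as a placeholder module name:
if module is-loaded gcc; then
   echo "the gcc module is loaded"
fi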
Users also may wish to test whether certain modules can be loaded with the current $MODULEPATH:
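For example:
$ module is-avail package1 package2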
Lmod will return a true status if all the modules are available and a false status if one cannot be loaded. Again, this command only sets the exit status.
Modulefiles can have a description section known as “whatis”. It is accessed by:
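For example:
$ module whatis packageName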
There is a keyword search tool:
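For example (the words are placeholders):
$ module keyword word1 word2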
This will search any help message or whatis description for the word(s) given on the command line.
Another way to search for modules is with the "module spider" command. This command searches the entire list of possible modules. The difference between "module avail" and "module spider" is explained in the "Module Hierarchy" and "Searching for Modules" sections:
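Namely:
$ module spider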
Users can also find which categories a site provides:
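Namely:
$ module category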
To see which modules are in a category, users can pick one or more names from the list of categories (each shown with its number of modules):
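For example, using the category name discussed below:
$ module category library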
Here we see that there are 30 versions of mpich and 6 versions of openmpi. Also, the category name given, in this case "library", allows partial matches and is case-insensitive.
Specifying modules to load
Modules are a way to ask for a certain version of a package. For example a site might have two or more versions of the gcc compiler collection (say versions 7.1 and 8.2). So a user may load:
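For example:
$ module load gcc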
or:
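The version-specific form:
$ module load gcc/7.1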
In the second case, Lmod will load gcc version 7.1, whereas in the first case Lmod will load the default version of gcc, which would normally be 8.2 unless the site marks 7.1 as the default.
In this user guide, we call gcc/7.1 the fullName of the module and gcc the shortName. We call what the user asked for the userName, which could be either the fullName or the shortName, depending on what the user typed on the command line.
Showing the contents of a module
There are several ways to use the show sub-command to show the contents of a modulefile. The first is to show the module functions instead of executing them:
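For example (packageName is a placeholder):
$ module show packageName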
This shows the functions such as setenv () or prepend_path () but nothing else. If you want to know the contents of the modulefile you can use:
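Namely:
$ module --raw show packageName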
This will show the raw text of the modulefile. This is the same as printing the modulefile, but here Lmod will find the modulefile for you. If you want to know just the location of a modulefile do:
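Namely, with the --location option to show:
$ module --location show packageName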
You will probably want to use the --redirect option so that the output goes to stdout and not stderr.
If you want to know how Lmod will parse a TCL modulefile you can do:
This is useful when there is some question about how Lmod will treat a TCL modulefile.
ml: A convenient tool
For those of you who can't type the mdoule, moduel, err module command correctly, Lmod has a tool for you. With ml you won't have to type the module command again. The two most common commands are module list and module load <something>, and ml does both: ml by itself means module list, ml foo means module load foo, and ml -bar means module unload bar. It won't come as a surprise that you can combine them: ml -bar foo means module unload bar; module load foo. You can also run all the other module sub-commands through ml, and if you ever have to load a module actually named spider you can do that too, as summarized below. If you are ever forced to type the module command instead of ml then that is a bug and should be reported.
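A compact summary of the ml shortcuts just described (foo and bar are placeholder module names):
$ ml              # module list
$ ml foo          # module load foo
$ ml -bar         # module unload bar
$ ml -bar foo     # module unload bar; module load foo
$ ml avail        # any module sub-command also works, e.g. ml avail, ml spider, ml show foo
$ ml +spider      # load a module actually named "spider"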
clearLmod: Completely remove the Lmod setup
It is rare, but sometimes a user might need to remove the Lmod setup from their current shell. This command can be used with bash/zsh/csh/tcsh to remove the Lmod setup:
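Namely:
$ clearLmod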
This command prints a message telling the user what it has done. This message can be silenced with:
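With the quiet option, i.e.:
$ clearLmod --quiet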
SAFETY FEATURES
(1): Users can only have one version active: The One Name Rule
If a user does:
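For example, using the versions discussed below:
$ module load xyz/11.1
$ module load xyz/12.0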
The first load command will load the 11.1 version of xyz. In the second load, the module command knows that the user already has xyz/11.1 loaded so it unloads that and then loads xyz/12.0. This protection is only available with Lmod.
This is known as the One Name rule. This feature is core to how Lmod works and there is no way to override this.
(2) : Users can only load one compiler or MPI stack at a time.
Lmod provides an additional level of protection. If each of the compiler modulefiles add a line:
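Namely, a line such as:
family("compiler")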
Then Lmod will not load another compiler modulefile. Another benefit of the modulefile family directive is that an environment variable "LMOD_FAMILY_COMPILER" is assigned the name (and not the version). This can be useful for specifying different options for different compilers. In the High Performance Computing (HPC) world, the message passing interface (MPI) libraries are important. The mpi modulefiles can contain a family("MPI") directive which will prevent users from loading more than one MPI implementation at a time. Also, the environment variable "LMOD_FAMILY_MPI" is set to the name of the mpi library.
Module Hierarchy
Libraries built with one compiler need to be linked with applications with the same compiler version. If sites are going to provide libraries, then there will be more than one version of the library, one for each compiler version. Therefore, whether it is the Boost library or an mpi library, there are multiple versions.
There are two main choices for system administrators. For the XYZ library compiled with either the UCC compiler or the GCC compiler, there could be the xyz-ucc modulefile and the xyz-gcc module file. This gets much more complicated when there are multiple versions of the XYZ library and different compilers. How does one label the various versions of the library and the compiler? Even if one makes sense of the version labeling, when a user changes compilers, the user will have to remember to unload the ucc and the xyz-ucc modulefiles when changing to gcc and xyz-gcc. If users have mismatched modules, their programs are going to fail in very mysterious ways.
A much saner strategy is to use a module hierarchy. Each compiler module adds to the MODULEPATH a compiler version modulefile directory. Only modulefiles that exist in that directory are packages that have been built with that compiler. When a user loads a particular compiler, that user only sees modulefile(s) that are valid for that compiler.
Similarly, applications that use libraries depending on MPI implementations must be built with the same compiler - MPI pairing. This leads to a modulefile hierarchy. Therefore, as users start with the minimum set of loaded modules, all they will see are compilers, not any of the packages that depend on a compiler. Once they load a compiler they will see the modules that depend on that compiler. After choosing an MPI implementation, the modules that depend on that compiler-MPI pairing will be available. One of the nice features of Lmod is that it handles the hierarchy easily. If a user swaps compilers, then Lmod automatically unloads any modules that depend on the old compiler and reloads those modules that are dependent on the new compiler.
If a modulefile is not available with the new compiler, then the module is marked as inactive. Every time MODULEPATH changes, Lmod attempts to reload any inactive modules.
Searching For Modules
When a user enters:
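Namely:
$ module avail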
Lmod reports only the modules that are in the current MODULEPATH. Those are the only modules that the user can load. If there is a modulefile hierarchy, then a package the user wants may be available but not with the current compiler version. Lmod offers a new command:
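Namely:
$ module spider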
which lists all possible modules and not just the modules that can be seen in the current MODULEPATH. This command has three modes. The first mode is the bare command shown above, with no arguments.
This is a compact listing of all the possible modules on the system. The second mode describes a particular module:
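For example:
$ module spider gcc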
The third mode reports on a particular module version and where it can be found:
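For example:
$ module spider gcc/7.1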
Controlling Modules During Login
Normally when a user logs in, there are a standard set of modules that are automatically loaded. Users can override and add to this standard set in two ways. The first is adding module commands to their personal startup files. The second way is through the “module save” command.
Adding module commands to users' startup scripts requires a few steps. Bash users can put the module commands in either their ~/.profile file or their ~/.bashrc file. It is simplest to place the following in the ~/.profile file:
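A minimal sketch of that ~/.profile fragment (assuming the usual pattern of sourcing ~/.bashrc from ~/.profile; adapt to your own setup):
if [ -f ~/.bashrc ]; then
   . ~/.bashrc
fi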
and place the following in the ~/.bashrc file:
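A minimal sketch, with BASHRC_READ as an arbitrary guard variable name and git as a placeholder module:
if [ -z "$BASHRC_READ" ]; then
   export BASHRC_READ=yes
   # place the module commands here; they are executed only once per login
   module load git
fi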
By wrapping the module commands in an if test, the module commands need only be read in once. Any sub-shell will inherit the PATH and other environment variables automatically. On login shells the ~/.profile file is read which, in the above setup, causes the ~/.bashrc file to be read. On interactive non-login shells, the ~/.bashrc file is read instead. Obviously, having this setup means that module commands need only be added in one file and not two.
Csh users need only specify the module commands in their ~/.cshrc file, as that file is always sourced:
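For example (git as a placeholder module name):
module load git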
User Collections
User defined initial list of login modules:
Assuming that the system administrators have installed Lmod correctly, there is a second way which is much easier to set up. A user logs in with the standard modules loaded. Then the user modifies the default setup through the standard module commands:
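For example (foo and bar as placeholder module names):
$ module load foo
$ module unload bar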
Once users have the desired modules load then they issue:
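Namely:
$ module save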
This creates a file called ~/.config/lmod/default which has the list of desired modules. Note that only the current set of modules is recorded in the collection. If module X loads module A and the user unloads module A before doing module save collectionName, then module A will NOT be loaded when the collection is restored. All load(), always_load() and depends_on() statements inside the modulefiles are ignored when restoring a collection; instead, Lmod loads just the list of modulefiles stored in the collection.
Once this is set-up a user can issue:
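Namely:
$ module restore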
and only the desired modules will be loaded. If Lmod is setup correctly (see Providing A Standard Set Of Modules for all Users) then the default collection will be the user’s initial set of modules.
If a user doesn't have a default collection, then Lmod purges ALL currently loaded modules, including the sticky ones, and loads the list of modules specified by LMOD_SYSTEM_DEFAULT_MODULES, just like the module reset command.
Users can have as many collections as they like. They can save to a named collection with:
and restore that named collection with:
A user can print the contents of a collection with:
A user can list the collections they have with:
Finally a user can disable a collection with:
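In summary, the standard Lmod sub-commands for the operations just listed (collection_name is a placeholder):
$ module save collection_name        # save the currently loaded modules to a named collection
$ module restore collection_name     # restore that named collection
$ module describe collection_name    # print the contents of a collection
$ module savelist                    # list the user's collections
$ module disable collection_name     # disable a collection (renames it)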
If no collection_name is given, then the default collection is disabled. Note that the collection is not removed, just renamed. If a user disables the foo collection, the file foo is renamed to foo~. To restore the foo collection, a user will have to do the following:
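Since collections live in ~/.config/lmod (as noted above), the rename can be undone there, for example:
$ cd ~/.config/lmod
$ mv foo~ foo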
Rules for loading modules from a collection
Lmod has rules on what modules to load when restoring a collection. Remember that userName is what the user asked for, fullName is the exact module name, and shortName is the name of the package (e.g. gcc, fftw3).
- Lmod records the fullName and the userName in the collection.
- If the userName is the same as the fullName, then it loads the fullName, independent of the default.
- If the userName is not the same as the fullName, then it loads the default.
- Unless LMOD_PIN_VERSIONS=yes, in which case the fullName is always loaded.
In other words if a user does:
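For example (A, B and C as placeholder module names):
$ module load A B C
$ module save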
then "module restore" will load the default A, B, and C. So if the default for module A changed between when the collection was saved and when it is restored, the new default version of A will be loaded. This assumes that LMOD_PIN_VERSIONS is not set. If it is set, or Lmod is configured that way, and A/1.1, B/2.4 and C/3.3 were the defaults when the collection was saved, then those exact versions will be loaded in the future, independent of what the defaults become.
On the other hand:
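For example, saving explicit versions:
$ module load A/1.0 B/2.3 C/3.4
$ module save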
then "module restore" will load A/1.0, B/2.3, and C/3.4, independent of what the defaults are now or in the future.
Currently Hive has limits on the public partitions; the limits are set per partition and are the same for all users of the public partitions.
There are currently 3 Hive public partitions, one queen partition (fat memory node) and one mpi partition.
To view real-time partition information, type the command sinfo in a Hive terminal. There is also a login message for every user that contains the real-time QOS limitations of each partition, user information and node information.
hive - the partition containing 40 public compute nodes; this is the default partition. Running jobs on the hive partition is recommended for new users and for users who want to be sure that their job will run until it ends without any preemption.
queen - the partition containing the high-memory compute nodes (queens); this partition is made for jobs that require a lot of memory.
preempt - a preemptable partition that contains all the compute nodes (bees), both public and private. Preemptable means that any job submitted to a higher-priority partition (hive, queen or a private partition) will kick off your job running on the preempt partition, and it will be restarted from the beginning of the queue. On the preempt partition you get the benefit of lower per-user limits.
mpi - a partition that contains all the public compute nodes (bees), made for MPI jobs that use a large amount of resources.
guest - a partition of thin compute nodes, made for Hive guest users (non Faculty of Science members); the partition is low priority and preemptable by the other partitions.
Private partitions - partitions that contain the private nodes that different groups have bought; these partitions usually carry the name of their group. Running jobs on a private partition is allowed only for users that belong to the partition's group. Private nodes have no limitations and get 100% priority over any other partition containing the same nodes (the ckpt partitions, for example).
Below you can see information about the limits each partition has:
Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Table of contents
Partitions and nodes information
hive - the partition containing the public compute nodes (bees); this is the default partition. Running jobs on the hive partition is recommended for new users and for users who want to be sure that their job will run until it ends without any preemption.
queen - the partition containing the high-memory compute nodes (queens); this partition is made for jobs that require a lot of memory.
preempt - a VIP, preemptable partition that contains all the compute nodes (bees), both public and private. Preemptable means that any job submitted to a higher-priority partition (hive, queen or a private partition) will kick off your job running on the preempt partition, and it will be restarted from the beginning of the queue. On the preempt partition you get the benefit of lower per-user limits.
mpi - a partition that contains all the public compute nodes (bees), made for MPI jobs that use a large amount of resources, especially parallel jobs.
Private partitions - partitions that contain the private nodes that different groups have bought; these partitions usually carry the name of their group. Running jobs on a private partition is allowed only for users that belong to the partition's group. Private nodes have no limitations and get 100% priority over any other partition.
Please see the table below with information regarding the Hive compute nodes, the different types and the node range of each type:
| Name | Quantity | Model | CPUs | RAM | Notes / name and range in Slurm |
| --- | --- | --- | --- | --- | --- |
| Old bees - Compute (bees) | 38 | HP XL170r | 24 | 128GB | bee33-71 |
| Bees - Compute (bees) | 73 | HP ProLiant XL220n Gen10 | 64 | 250GB | bee73-145 |
| Fat node (old queens) | 2 | HP DL560 | 56 | 760GB | queen2-3 |
| Fat node (queens) | 2 | HP DL560-G10 | 80 | 1.5 TB | queen4, queen5 (350 GB) |
| GPU (vespas) - N/A at hive02 | --- | --- | --- | --- | --- |
* For information regarding limitations per partition, please check the limitations section in the website menu.
Slurm SBATCH command and slurm job submission Scripts
The command "sbatch" should be the default command for running batch jobs.
With "sbatch" you can run simple batch jobs from a command line, or you can execute complicated jobs from a prepared batch script.
First, you should understand the basic options you can add to the sbatch command in order to request the right allocation of resources for your jobs:
Commonly used options in #srun, #sbatch, #salloc:
| Option | Description |
| --- | --- |
| -p partitionName | Submit the job to the partition partitionName |
| -o output.log | Write the job's output to output.log instead of slurm-%j.out in the current directory |
| -e error.log | Write the job's STDERR to error.log instead of the job output file (see -o above) |
| --mail-type=type | Email the submitter on job state changes. Valid type values are BEGIN, END, FAIL, REQUEUE and ALL (any state change) |
| --mail-user=email | User to receive email notification of state changes (see --mail-type above) |
| -n N, --ntasks N | Set the number of processors (cores) to N (default = 1); the cores will be chosen by SLURM |
| -N N, --nodes N | Set the number of nodes that will be part of the job. On each node, --ntasks-per-node processes will be started; if --ntasks-per-node is not given, 1 process per node will be started |
| --ntasks-per-node N | How many tasks to start per allocated node (see -N above) |
| --cpus-per-task N | Needed for multithreaded (e.g. OpenMP) jobs. Tells SLURM to allocate N cores per task; typically N should equal the number of threads the program spawns, e.g. the same number as OMP_NUM_THREADS |
| -J name, --job-name name | Set the job name shown in the queue. The job name (limited to the first 24 characters) is used in emails sent to the user |
| -w node1,node2,... | Restrict the job to run on the specified nodes only |
| -x node1,node2,... | Exclude the specified nodes from the job |
Below is an example of running a simple batch job directly from the command line/terminal of Hive with an allocation of minimum resources (1 node and 1 CPU/core):
$ sbatch -N1 -n1 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726
Where:
-N1 requests 1 node
-n1 requests 1 CPU/core
. /etc/profile.d/modules.sh ; module load blast/2.2.30 is the command that loads the software your job is using (in this example, the blast program)
blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn is your job command
* The job will start on the default partition, named hive1d. If your job is going to run for more than 1 day, you should add the -p option to your command and specify the hive7d or hiveunlim partition, which allow your job to run for up to 7 or 31 days respectively.
Below is an example of running a simple job on the hive7d partition with a time limit of 5 days:
$ sbatch -N1 -n1 -p hive7d --time=5-00:00:00 --wrap '. /etc/profile.d/modules.sh ; module load blast/2.2.30 ; blastn -query sinv_traA.fa -db sinv_genome.fa -out sinv_traA.blastn'
Submitted batch job 214726
So basically you should execute the job following this formula: [sbatch command] [allocation of resources] [--wrap] ['your job command']
The --wrap option has to come AFTER the allocation of the needed resources, not before. You have to use the --wrap option in order to run jobs from the command line, because the standard behaviour of the sbatch command is to run batch jobs from a batch script and not from the command line.
Running jobs from prepared slurm submission script:
The following is a typical Slurm submission script example.
* Please note that there are a few types of thin compute nodes (bees) on Hive. Dell (bee-001-032) compute nodes have 20 cores/CPUs while HP (bee033-063) compute nodes have 24 cores/CPUs.
#!/bin/sh
#SBATCH --ntasks 20              # use 20 cores
#SBATCH --ntasks-per-node=20     # use 20 cpus per each node
#SBATCH --time 1-03:00:00        # set job timelimit to 1 day and 3 hours
#SBATCH --partition hive1d       # partition name
#SBATCH -J my_job_name           # sensible name for the job
# load up the correct modules, if required
. /etc/profile.d/modules.sh
module load openmpi/1.8.4 RAxML/8.1.15
# launch the code
mpirun ...
How to submit a job from a batch script
To submit this, run the following command:
sbatch myscript.sh
Warning: do not execute the script
The job submission script file is written to look like a bash shell script. However, you do NOT submit the job to the queue by executing the script.
In particular, the following is INCORRECT:
# this is the INCORRECT way to submit a job
./myscript.sh    # wrong! this will not submit the job!
The correct way is noted above (sbatch myscript.sh
).
Please, refer to the manual of SBATCH (man sbatch) to see more helpful information about how to use the sbatch command in the right way.
Running job with multiple partition
There is an option to select more than one partition when submitting your job to the queue. To do that, you need to specify a comma-separated list of the partitions you would like to submit your job to.
Your job will be submitted to the first partition with free resources, in the hierarchy from left to right of the list of partitions that you entered. Examples:
All the examples below show the standard -p / --partition option of Slurm with a comma-separated list of partitions that you should add to your job.
To run short jobs (less than 24 hours):
-p hive1d,hive7d,hiveunlim
If you also have some private nodes: (replace "private" with the partition name)
-p private,hive1d,hive7d,hiveunlim
If you are willing to risk your jobs being killed and rerun: (for short jobs)
-p hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d
Same with a private partition:
-p private,hive1d,hive7d,hiveunlim,preempt1d,preempt7d,preempt31d
To run longer jobs: (up to 7 days)
-p hive7d,hiveunlim
Same with private partition:
-p private,hive7d,hiveunlim
With checkpointing:
-p hive7d,hiveunlim,ckptdell7d,ckptdell31d
or
-p hive7d,hiveunlim,ckpthp7d,ckpthp31d
Same with private partition:
-p private,hive7d,hiveunlim,ckptdell7d,ckptdell31d
or
-p private,hive7d,hiveunlim,ckpthp7d,ckpthp31d
To run very long jobs: (up to 31 days)
-p hiveunlim
Same with private partition:
-p private,hiveunlim
With checkpointing:
-p hiveunlim,ckptdell31d
or
-p hive7d,hiveunlim,ckpthp31d
Same with private partition:
-p private,hiveunlim,ckptdell31d
or
-p private,hiveunlim,ckpthp31d
To run jobs that require more than 128GB memory:
-p queen
With checkpointing:
-p queen,queenckpt
Job Arrays
The job array is a very helpful sbatch option for users who need to run a lot of single-core computations. With this option you can start hundreds of jobs from one sbatch script over a specified range; in the Slurm queue all those jobs will have a unique ID number and the pending jobs will be grouped into one entry.
Below is an example of an sbatch job script with the array option that will start 399 jobs, each executing a file with the task number from the range 1-399:
#!/bin/bash
#SBATCH --job-name=CodemlArrayJob
#SBATCH --partition=hive1d,hive7d,hiveunlim
#SBATCH --array=1-399
#SBATCH --output=out_%A_%a_%j.out
#SBATCH --error=error_%A_%a_%j.err
## To make things simple: ${i} == $SLURM_ARRAY_TASK_ID
i=${SLURM_ARRAY_TASK_ID}
# Check $WORKDIR existence
function checkWorkdir() {
if [ ! -d "$1" ]; then
echo "ERROR! Directory $1 doesn't exist. "
exit 1;
fi
}
#load the modules environment
. /etc/profile.d/modules.sh
module load PAML/4.8
#Job commands
echo "branch${i}";
WORKDIR="/data/home/privman/rwilly/2.Like.Clade12/guidence/guidence.results/BranchSiteModelA/BranchSiteWithScript/like.Clade12.branch${i}.workingDir"
checkWorkdir $WORKDIR
cd $WORKDIR
codeml LikeClade12.branch${i}.ctl
Note: in the --error and --output options you have %A, %a and %j: %A will print your master array job ID number, %a will print the array task number of the job, and %j will print the unique ID number of each job that you submitted in array mode. The job ID information can be useful when the job array is checkpointable; the job ID numbers (array and original) will help you find which jobs have been checkpointed.
Jobs With Checkpoint options:
Slurm has an option to checkpoint your running jobs every X minutes. Checkpointable jobs are needed for securing your progress on the preemptable partitions, and if you are running a very long job you will want to make checkpoints so you have the option of stopping and continuing the job from your last checkpoint.
Currently there are two groups of partitions, with time limits of 1, 7 and 31 days, that are made for checkpointing: ckptdell (ckptdell1d, ckptdell7d, ckptdell31d) and ckpthp (ckpthp1d, ckpthp7d, ckpthp31d).
Below you can see an example of an sbatch job script that will make a checkpoint every 6 hours:
#!/bin/bash
#SBATCH --job-name=CheckpointableJobExample
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --partition=ckptdell1d
#SBATCH --checkpoint=360 # Time in minutes, every X minutes to make a checkpoint
#SBATCH --checkpoint-dir=/data/ckpt # Default place where your checkpoints will be created; you can change it to another place in your home folder
## The following lines are needed for checkpointing.
##restarterOpts="StickToNodes" # Use the StickToNodes option only if your job cannot be resubmitted to machines other than the ones it started on
restarterOpts=""
. /data/scripts/include_for_checkpointing
### From here you can start editing your job commands
for i in {1..300}; do
echo -n "$i: "
date
sleep 2
done
Attention: if your job with checkpoints is stopped and then started again, it will continue from the progress of the last checkpoint that was made, not from the second before the job was stopped.
Limitations: the checkpoint option has some limitations for a few types of jobs; below you can see the list of job types that will not work with the checkpoint option:
- BLCR will not checkpoint and/or restore open sockets (TCP/IP, Unix domain, etc.). At restart time any sockets will appear to have been closed.
- BLCR will not checkpoint and/or restore open character or block devices (e.g. serial ports or raw partitions). At restart time any devices will appear to have been closed.
- BLCR does not handle SysV IPC objects (man 5 ipc). Such resources are silently ignored at checkpoint time and are not restored.
- If a checkpoint is taken of a process with any "zombie" children, then these children will not be recreated at restart time. A "zombie" is defined as a process that has exited, but whose exit status has not yet been reaped by its parent (via wait() or a related function). This means that a wait()-family call made after a restart will never return a status for such a child.
Using SRUN command
The "srun" command is used to run interactive jobs on the compute nodes of the HPC. The following example will run "matlab" on a compute node of the cluster:
$ module load matlab/r2014b
$ srun matlab
MATLAB is selecting SOFTWARE OPENGL rendering.

                       < M A T L A B (R) >
             Copyright 1984-2014 The MathWorks, Inc.
              R2014b (8.4.0.150421) 64-bit (glnxa64)
                        September 15, 2014

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

>>    (the MATLAB command line)
In the example above we load the matlab module from the public software and then execute matlab as an interactive job with the srun command; you can see in the output that you work with matlab interactively from the matlab command line.
With the "srun" command you can run jobs with a lot of different allocation options, such as assigning the number of nodes for the job, how many tasks to use, how many tasks to use per node, how many CPUs to use per task and more. Use the "man srun" command to see all the possible options.
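For instance, a hypothetical interactive run combining several of these options (my_program is a placeholder executable):
$ srun -N 2 --ntasks-per-node=4 --cpus-per-task=2 -p hive1d --time=1:00:00 ./my_program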
Using SALLOC Command
With the "salloc" command you can obtain an interactive SLURM job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished.
If you would like to allocate resources on the cluster and then have the flexibility of using those resources in an interactive manner, you can use the command "salloc" to allow interactive use of resources allocated to your job. In the next example we will request 5 tasks and 2 hours for allocation:
$ salloc -n 5 --time=2:00:00
salloc: Pending job allocation 45924
salloc: job 45924 queued and waiting for resources
salloc: job 45924 has been allocated resources
salloc: Granted job allocation 45924
$ srun hostname
bee025
bee025
bee025
bee025
bee025
$ hostname
hive01.haifa.ac.il
$ exit
exit
salloc: Relinquishing job allocation 45924
salloc: Job allocation 45924 has been revoked.
After that, the request enters the job queue just like any other job, and "salloc" will tell you that it is waiting for the requested resources if there aren't enough at the moment. When "salloc" tells you that your job has been allocated resources, you can interactively run programs on those resources with the "srun" command. The commands you run with "srun" will then be executed on the resources your job has been allocated. If you finish your work before the allocated time, or if you didn't allocate time at all, use the "exit" command to stop the allocation permanently.
Warning: all commands that you execute after your job has been allocated resources must be run with the "srun" command, otherwise those commands will be executed on the access node and not on the allocated resources you asked for, as you can see in the example above.
Running GPU jobs
At present there is no GPU unit at hive02. To execute a job that uses GPU power, you should contact Nikolai by email and hopefully get access to the Deep Learning Cluster belonging to the Department of Computer Sciences.
More Information:
List Partitions
To view the current status of all partitions accessible by the user:
$ sinfo -l
To view the current status of a partition named partitionName run:
$ sinfo -l -p partitionName
Display Partition Contents
To get a list of all jobs running in the partition named partitionName, run:
$ squeue -p partitionName
Same, limited to user userName:
$ squeue -p partitionName -u userName
Control Nodes
Get nodes state
One of the following commands can be used to get node(s) state, depending on desired verbosity level:
# sinfo -N
or
# sinfo -o "%20N %.11T %.4c %.8z %.15C %.10O %.6m %.8d"
List of a few more helpful Slurm commands:
Man pages exist for all SLURM daemons, commands, and API functions. The command option --help also provides a brief summary of options. Note that the command options are all case sensitive.
- sacct is used to report job or job step accounting information about active or completed jobs.
- salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
- sattach is used to attach standard input, output, and error plus signal capabilities to a currently running job or job step. One can attach to and detach from jobs multiple times.
- sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
- sbcast is used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system.
- scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
- scontrol is the administrative tool used to view and/or modify SLURM state. Note that many scontrol commands can only be executed as user root.
- sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
- smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
- squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
- srun is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
- strigger is used to set, get or view event triggers. Event triggers include things such as nodes going down or jobs approaching their time limit.
- sview is a graphical user interface to get and update state information for jobs, partitions, and nodes managed by SLURM.
The fair-share component of the job priority is calculated differently. The goal is to make sure that the priority strictly follows the account hierarchy, so that jobs under accounts with usage lower than their fair share will always have a higher priority than jobs belonging to accounts which are over their fair share.
The algorithm is based on ticket scheduling, where at the root of the account hierarchy one starts with a number of tickets, which are then distributed per the fairshare policy to the child accounts and users. Then, the job whose user has the highest number of tickets is assigned the fairshare priority of 1.0, and the other pending jobs are assigned priorities according to how many tickets their users have compared to the highest priority job.
In Hive, groups that buy more resources get more fairshare tickets assigned per group and will have priority over jobs submitted to the queue by groups that bought fewer resources.
The fairshare factor is not defined only by the fairshare tickets assigned to groups according to their investment in Hive; the fairshare calculation also takes the groups' job history into account, so if one group submits a lot of jobs and another group submits fewer, the fairshare factor can move the less active group's jobs to the top of the job queue.
The fair-share algorithm in SLURM is described on the Slurm website; please refer to the following link if you are interested in understanding the fairshare factor: http://slurm.schedmd.com/fair_tree.html
Below you will find some command examples of how to monitor your fairshare:
Check current fairshare definitions for your group account:
$ sshare -A ACCOUNTNAME
Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
ACCOUNTNAME 109 0.054801 1579497 0.004294 0.94713
Where Raw Shares is the number of tickets the group received (depending on how many resources the group bought) and the FairShare number is the number you get after calculation of the fairshare factor (refer to http://slurm.schedmd.com/fair_tree.html to understand how it works).
to be continued....
High-level Infrastructure Architecture
The Ethernet network is used for cluster management:
1. Console access (iDRAC/BMC)
2. Compute nodes provisioning using xCAT
3. Grid internal communications (scheduling, remote login, naming services, etc.)
Infiniband
InfiniBand is used for storage access and MPI traffic.
Access from HaifaU LAN
All grid components, including compute nodes, are accessible from the University of Haifa LAN.
Storage
Lustre is a high-performance parallel distributed file system designed for use in HPC and supercomputing environments. It offers scalable and high-speed storage solutions, enabling efficient data access and storage for large-scale computational workloads.
Servers:
Lustre Storage Servers - Redundant servers support storage fail-over, while metadata and data are stored on separate servers, allowing each file system to be optimized for different workloads. Lustre can deliver fast IO to applications across high-speed network fabrics, such as Ethernet, InfiniBand (IB), Omni-Path (OPA), and others.
Storage server - The storage system of the cluster is an HPE Cray E1000 ClusterStor server. Note that this storage is meant for ongoing analyses. This is not an archive system. This is a distributed file system that allows high performance - fast reading and writing of many files, both large and small. It is composed of many disks, but functions as one storage volume, with a total of 919 TB. This total volume is made up from a hybrid set of disks, including both HDD and SSD, which ensures high performance for different usecases.
Please note that your files on the Hive2 storage system are NOT BACKED UP by default. It is your responsibility to backup your vital files and irreplaceable data. While the E1000 server is a highly resilient solution, any system has risk of failure and loss of data. Therefore, please make sure to backup important files.
Backup server - The Hive2 backup system can automatically back up files, but only for users who buy space on the backup server. The backup system uses rsync replication of users' data; rsnapshot is used to create daily snapshots, keeping up to 14 snapshots.
| Name | Quantity | Model | CPUs | RAM | Notes |
| --- | --- | --- | --- | --- | --- |
| Compute (old bees) | 38 | HP XL170r | 24 | 128GB | bee033-071 |
| Compute (bees) | 73 | HP ProLiant XL220n Gen10 Plus | 64 | 250GB | bee073-145 |
| Fat node (old queens) | 1 | HP DL560 | 56 | 760GB | queen02-03 |
| Fat node (queens) | 2 | HP DL560-G10 | 80 | 1.5TB | queen4 (1.5 TB), queen5 (360GB) |
| GPU (vespas) | | | | | |
Operating Systems
The xCAT management server runs RHEL 8.5, as that is the OS version supported by xCAT. Compute nodes and the other xCAT-managed hosts run RHEL 9.1. The operating system can be upgraded easily as needed using xCAT.