We would like to inform Hive users that the default output of a few commands has been modified on the access node to make it more informative and user-friendly:

sacct (job history and running jobs):

Old output table: JobID   JobName  Partition    Account  AllocCPUS      State ExitCode
New output table: JobID   JobName Partition   Account   User  AllocCPUS    State  ExitCode   Start      End    Elapsed   NodeList  Comment

squeue (scheduler job queue):
Old output table: JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
New output table: JOBID PARTITION     NAME     USER    GROUP ST        TIME  TIME_LIMIT  NODES CPU NODELIST(REASON)

sinfo (view partitions and nodes):
Old output table: PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
New output table: PARTITION    AVAIL  TIMELIMIT     GROUPS NODES(A/I/O/T) NODELIST
**NODES(A/I/O/T) is the node state:

A = Allocated
I = Idle
O = Other
T = Total
*In the old output, each node state of each partition appeared on a separate line; now the output is clearer because lines are no longer duplicated per node status.


**Every user can change the default output of each command by modifying his environment; refer to the command manuals to find the output format options**
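For example, one possible way (a sketch, not a site requirement) is to export the format environment variables in ~/.bashrc; the format string below is only an illustration, see man squeue for the full list of specifiers (squeue uses SQUEUE_FORMAT, sinfo uses SINFO_FORMAT, sacct uses SACCT_FORMAT):

# Example: set personal default squeue columns
export SQUEUE_FORMAT="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"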

We would also like to inform you about a Slurm feature that could be useful for our users.

The option is called --dependency and it lets you configure your sbatch job to start in "dependency" mode; see the example below:

#!/bin/bash
#SBATCH --dependency=<type>

The available dependency types for job chains are:

  • after:<jobID> job starts when the job with <jobID> has begun execution
  • afterany:<jobID> job starts when the job with <jobID> terminates
  • afterok:<jobID> job starts when the job with <jobID> terminates successfully
  • afternotok:<jobID> job starts when the job with <jobID> terminates with failure
  • singleton job starts when any previously submitted job with the same job name and user terminates


This can be very convenient for running multiple jobs that depend on each other.
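As an illustration, here is a minimal sketch of chaining two jobs from the command line (the script names first_step.sh and second_step.sh are placeholders):

#!/bin/bash
# Submit the first job; --parsable makes sbatch print only the job ID.
jobid=$(sbatch --parsable first_step.sh)
# Submit the second job so it starts only if the first one finishes successfully.
sbatch --dependency=afterok:${jobid} second_step.sh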

More interesting Slurm features can be found on the Slurm manual page: http://slurm.schedmd.com/documentation.html

 

If you are running checkpoint-able jobs in Slurm array mode, you will probably want to know where the checkpoint file of each job is.

The thing is that when you run jobs in array mode, the default squeue output shows the ID of the array job and not the unique ID of the job that was checkpointed. Likewise, the default stdout and stderr files are saved under the array job ID, while all the checkpoints are saved in a folder named with the original unique ID of the job, not the array job ID.

Below is an example header of a checkpoint-able job in array mode that saves each Slurm job output file with both the array ID and the unique ID of the job:

#!/bin/bash

#SBATCH --job-name=ckptJobArray
#SBATCH --partition=ckpt
#SBATCH --array=1-399
#SBATCH --output=out_%A_%a_%j.out
#SBATCH --error=error_%A_%a_%j.err
#SBATCH --checkpoint=360
#SBATCH --checkpoint-dir=/data/ckpt

# The following lines are needed for checkpointing.
##restarterOpts="StickToNodes"
restarterOpts=""
. /data/scripts/include_for_checkpointing


## To make things simple: ${i} == $SLURM_ARRAY_TASK_ID
i=${SLURM_ARRAY_TASK_ID}

## from here you can write your job commands

Pay attention to the --output and --error lines:
%A is the master array job ID, %a is the array task number and %j is the unique job ID. Inside the --checkpoint-dir location you will find the checkpoint of your job under its unique job ID.

So basically, to find which of the created checkpoints refers to your job, all you need to do is compare the Slurm output file name with the checkpoint names (unique job ID).
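As a hypothetical illustration (the file name and IDs below are made up), the unique job ID can be cut out of the output file name and matched against the checkpoint directory:

# out_<%A>_<%a>_<%j>.out -> the last underscore-separated field is the unique job ID
f=out_1234_5_1240.out
uid=$(echo "$f" | sed 's/\.out$//' | awk -F_ '{print $NF}')
# List checkpoints whose names contain that unique job ID
ls -d /data/ckpt/*${uid}*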

Below are some examples:

View squeue output with the unique job ID + array job ID + array task number:

squeue -o "%.10A %.13i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"


Detect which of the array jobs were preempted and restarted at least one time:

If you want to investigate which of the N array jobs were preempted and restarted at least once, you need to read the Slurm stdout/stderr job files. You can use a command that reads all the Slurm array output files and searches for the word "PREEMPTION"; below is an example of the command you can use to detect which jobs were preempted:

grep "DUE TO PREEMPTION" /YOUR/WORKDIR/*

This command will find which jobs were cancelled due to preemption.

Limitations have been added to Hive resources! This means that each user is now limited in the resources available under the public partitions, which will make Hive more "fair" and usable by a higher number of users.

Currently there are 3 hive and 3 ckpt partitions instead of one hive and one ckpt partition:
 
hive1d - DEFAULT partition, limits: WallTime: 1 day, MaxCPUsPerUser = 200
hive7d - limits: WallTime: 7 days, MaxCPUsPerUser = 100
hive31d - limits: WallTime: 31 days, MaxCPUsPerUser = 50
ckpt1d - limits: WallTime: 1 day, MaxCPUsPerUser = 800, Preemption=REQUEUE (low-priority preempt-able partition)
ckpt7d - limits: WallTime: 7 days, MaxCPUsPerUser = 600, Preemption=REQUEUE (low-priority preempt-able partition)
ckpt31d - limits: WallTime: 31 days, MaxCPUsPerUser = 400, Preemption=REQUEUE (low-priority preempt-able partition)
queen - limits: WallTime: 31 days, MaxNODEsPerUser = 1
ckptqueen - limits: WallTime: 31 days, Preemption=REQUEUE (low-priority preempt-able partition)
mpi - limits: MinCPUsPerUser = 200
 
 
 
*WallTime = the maximum time that each job can run
*MaxCPUsPerUser = the maximum number of CPUs a user can use under the specific partition
*MaxNODEsPerUser = the maximum number of nodes a user can use under the specific partition
*MinCPUsPerUser = the minimum resource allocation per job
*Preemption=REQUEUE = low-priority partition; all jobs running on that partition can be preempted/checkpointed by a higher-priority partition. A preempted job will restart from the beginning or from the last checkpoint.
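If you want to check a partition's configured limits yourself, one way (a sketch; hive7d is just an example partition name) is:

# Show the configuration of a single partition, including its wall-time limit
scontrol show partition hive7d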
 
Job submission recommendations:

Now that we have limitations on the system, I recommend that all users give as much information as they can when submitting jobs; this will help you keep good statistics and a high priority over other users. Users who do not provide this information will have lower priority, because the system takes everything into account.

* When submitting a job, ask for the resources your job needs and not MORE than it needs
* If your job uses significant memory, do not forget to allocate the required amount of memory during submission (flag: --mem=<MB>)
* Do not forget to add the time flag when submitting jobs; if you are running a 3-day job under the 7-day partition, you could save 4 days in your statistics… (flag: -t, --time=<time>); see the example below
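For example, a submission that follows these recommendations could look roughly like this (the partition, CPU, memory and time values are placeholders only):

# Request 4 CPUs, 4000 MB of memory and 3 days of wall time on the hive7d partition
sbatch -p hive7d -c 4 --mem=4000 --time=3-00:00:00 my_job.sh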

 

The sbatch command should be used to run jobs by default.

The srun and salloc commands should be used only if the user wants to run an interactive job and see the output on the screen.
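As an illustration, an interactive shell could be requested roughly like this (the partition and time are placeholders):

# Ask for an interactive shell on one node for one hour
srun -p hive1d -N1 --time=01:00:00 --pty bash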

TIP:  You do not have to create a special batch script file in order to run a job with the sbatch command; you can use the --wrap option to run the job directly from the command line. Example:

sbatch -N1 -J testjob -o job_%j.out -e job_%j.err --wrap 'hostname'

So basically, you should first specify the resource allocation for the job (number of nodes, job name, stdout and stderr files), only then add the --wrap option (not before!), and after --wrap you give the command to execute.

More information can be found on the Slurm page: http://hivehpc.haifa.ac.il/index.php/slurm

 

If anyone would like to connect to the University network from an iPhone/Android phone, here is a quick manual on how to set it up:

iPhone:

- Go to the App Store > search for the app named Junos Pulse > install the app (it is free!) > start the application > go to the VPN tab > Configuration > and enter the required information:

Name (optional): Haifa univ network
URL: sslt.haifa.ac.il
Username: your Juniper VPN username
Authentication type: should be set to Password
Realm - leave it empty
Role - leave it empty

Android:

- Go to Google Play > search for the app named Junos Pulse > install the app (free!) > start the app > go to the VPN tab > go to Connections > set up a new connection > enter the required information:

Name (optional): Haifa univ network
URL: sslt.haifa.ac.il
Username: your Juniper VPN username
Authentication type: should be set to Password
Realm - leave it empty
Role - leave it empty

 

After connecting to the VPN, you can connect to Hive using any of the free SSH clients available from Google Play or the App Store.