If you are running checkpoint-able jobs in Slurm array mode, you will probably want to know where the checkpoint file of each job is.
When you run jobs in array mode, the default squeue command shows the id of the array job and not the unique id of the job that was checkpointed. The default stdout and stderr files are also named after the array job id, while all the checkpoints are saved in a folder named after the unique id of the job and not after the array job id.
Below is an example header of a checkpoint-able job in array mode. It names each Slurm job output file with both the array id and the unique id of the job:
#!/bin/bash
#SBATCH --job-name=ckptJobArray
#SBATCH --partition=ckpt
#SBATCH --array=1-399
#SBATCH --output=out_%A_%a_%j.out
#SBATCH --error=error_%A_%a_%j.err
#SBATCH --checkpoint=360
#SBATCH --checkpoint-dir=/data/ckpt
# The following lines are needed for checkpointing.
##restarterOpts="StickToNodes"
restarterOpts=""
. /data/scripts/include_for_checkpointing
## To make things simple: ${i} == $SLURM_ARRAY_TASK_ID
i=${SLURM_ARRAY_TASK_ID}
## from here you can write your job commands
Pay attention to the --output and --error lines:
%A is the master array job id, %a is the index of the task within the array and %j is the unique id number of the job. The checkpoint for your job is stored under its unique job id inside the --checkpoint-dir location.
So to find which of the created checkpoints belongs to your job, compare the unique job id in the Slurm output file name with the checkpoint names (which are unique job ids).
Below are some examples:
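The comparison described above can be sketched as a small shell function. This is only an illustration: the function name ckpt_dir_for_output is hypothetical, and it assumes the output files follow the out_%A_%a_%j.out / error_%A_%a_%j.err naming from the header and that each checkpoint sits in a folder named after the unique job id directly under the checkpoint directory.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print the checkpoint folder that
# matches an output file named <prefix>_<arrayJobId>_<taskIndex>_<uniqueJobId>.<ext>
# Assumption: checkpoints are stored in <ckpt_root>/<uniqueJobId>.
ckpt_dir_for_output() {
    local file ckpt_root name job_id
    file=$1
    ckpt_root=${2:-/data/ckpt}   # default taken from --checkpoint-dir in the header
    name=$(basename "$file")
    name=${name%.*}              # strip the .out / .err extension
    job_id=${name##*_}           # last underscore-separated field is %j
    case $job_id in
        ''|*[!0-9]*)
            echo "cannot parse a job id from: $file" >&2
            return 1
            ;;
    esac
    printf '%s/%s\n' "$ckpt_root" "$job_id"
}
```

For example, ckpt_dir_for_output out_12345_7_12352.out prints /data/ckpt/12352, the folder to look in for array task 7 of array job 12345 (the ids here are made up).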
Show the unique job id, the array job id and the array task number in squeue:
squeue -o "%.10A %.13i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R"
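If you only need the unique id of one array task, the same squeue fields can be filtered with awk. The sketch below is an assumption-laden example: the function name unique_id_for_task is hypothetical, and it expects input lines of the form "<unique job id> <array job id>_<task number>", which is what squeue prints with -h -o "%A %i" for running array tasks.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): read lines of the form
#   <uniqueJobId> <arrayJobId>_<taskIndex>
# on stdin and print the unique job id of the requested array task.
unique_id_for_task() {
    awk -v task="$1" '$2 == task { print $1 }'
}

# Usage on a cluster (ids are made up, not run here):
#   squeue -h -u "$USER" -o "%A %i" | unique_id_for_task 12345_7
```

Pending tasks that Slurm has not split off yet are shown as a range such as 12345_[8-399], so this lookup only matches tasks that already have their own entry.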
Detect which of the array jobs were preempted and restarted at least once:
To find out which of your N array jobs were preempted and restarted at least once, you need to read the Slurm stdout/stderr job files. You can use a command that reads all the Slurm array output files and searches for the word "PREEMPTION". Below is an example of a command you can use to detect which jobs were preempted:
grep "DUE TO PREEMPTION" /YOUR/WORKDIR/*
This command prints the matching lines, each prefixed with the name of the output file, so you can see which jobs were cancelled due to preemption.
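To get a short list of the affected array tasks instead of the raw grep lines, the file names can be split back into their ids. This is a minimal sketch, not a site-provided tool: the function name list_preempted_tasks is hypothetical, and it assumes the files are named out_%A_%a_%j.out and error_%A_%a_%j.err as in the header above.

```shell
#!/bin/bash
# Hypothetical helper: list array tasks whose output files mention preemption.
# Assumption: files are named <prefix>_<arrayJobId>_<taskIndex>_<uniqueJobId>.out|.err
# Prints one line per task: <arrayJobId>_<taskIndex> <uniqueJobId>
list_preempted_tasks() {
    local dir=${1:-.}
    # -l prints only the names of the files that contain the message
    grep -l "DUE TO PREEMPTION" "$dir"/*.out "$dir"/*.err 2>/dev/null |
    while IFS= read -r path; do
        name=$(basename "$path")
        name=${name%.*}                      # strip extension
        job_id=${name##*_}; rest=${name%_*}  # %j
        task=${rest##*_};   rest=${rest%_*}  # %a
        array_id=${rest##*_}                 # %A
        printf '%s_%s %s\n' "$array_id" "$task" "$job_id"
    done | sort -u                           # .out and .err may both match
}

# Usage (not run here):
#   list_preempted_tasks /YOUR/WORKDIR
```

The second column is the unique job id taken from the file name, which is the name to compare against the folders in the checkpoint directory.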