Jobs With Checkpoint options:

Slurm can checkpoint your running jobs at a fixed interval. Checkpoint-able jobs protect your progress on preemptible partitions, and if you are running a very long job, checkpoints let you stop the job and later continue it from the last checkpoint.

Currently there are two groups of partitions made for checkpointing, each with time limits of 1, 7 and 31 days: ckptdell (ckptdell1d, ckptdell7d, ckptdell31d) and ckpthp (ckpthp1d, ckpthp7d, ckpthp31d).
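As a rough illustration of how the partition names map to time limits, the sketch below picks the shortest checkpoint partition that covers an expected runtime. The function name pick_ckpt_partition is hypothetical and not part of the cluster tooling, and it assumes the runtime is given in whole days and should fit within the partition's time limit.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not provided by the cluster):
# choose the shortest checkpoint partition whose limit covers the runtime.
#   $1 = partition family: ckptdell or ckpthp
#   $2 = expected runtime in whole days
pick_ckpt_partition() {
    local family="$1" days="$2"
    case "$family" in
        ckptdell|ckpthp) ;;
        *) echo "unknown partition family: $family" >&2; return 1 ;;
    esac
    if   [ "$days" -le 1 ];  then echo "${family}1d"
    elif [ "$days" -le 7 ];  then echo "${family}7d"
    elif [ "$days" -le 31 ]; then echo "${family}31d"
    else
        echo "no checkpoint partition allows more than 31 days" >&2
        return 1
    fi
}

# Example: a job expected to run about 3 days on the Dell nodes.
pick_ckpt_partition ckptdell 3
```

The printed name can then be used as the value of --partition in the job script.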

Below is an example of an sbatch job script that makes a checkpoint every 6 hours:

#!/bin/bash
#SBATCH --job-name=CheckpointableJobExample
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --partition=ckptdell1d
#SBATCH --checkpoint=360  # Checkpoint interval in minutes (360 = every 6 hours)
#SBATCH --checkpoint-dir=/data/ckpt # Default directory where your checkpoints are created; you can change it to a directory in your home folder


#
# The following lines are needed for checkpointing.
##restarterOpts="StickToNodes" # Use the StickToNodes option only if your job cannot be resubmitted to machines other than the ones it started on
restarterOpts=""
. /data/scripts/include_for_checkpointing

### From here you can start editing your job commands

for i in {1..300}; do
echo -n "$i:    "
date
sleep 2
done

*The ckpt partitions are made for checkpoint-able jobs and have lower priority than the hive partitions. This means that any ckpt job can be preempted and restarted by a higher-priority job from the hive or private partitions.

Attention: If your checkpointed job is stopped and then started again, it continues from the last checkpoint that was made, not from the moment just before the job was stopped.
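To make the cost of a restart concrete, the sketch below estimates how much work has to be redone. It assumes a simplified model in which checkpoints land at exact multiples of the interval counted from job start; the real timing on the cluster may differ. The function name lost_minutes and the stop time of 1000 minutes are hypothetical.

```shell
#!/bin/bash
# Simplified model (an assumption): checkpoints are taken at exact
# multiples of the interval, measured from the start of the job.
#   $1 = checkpoint interval in minutes (the --checkpoint value)
#   $2 = minute at which the job was stopped
lost_minutes() {
    local interval="$1" stopped_at="$2"
    # Integer division gives the number of completed checkpoints.
    local last_ckpt=$(( (stopped_at / interval) * interval ))
    echo $(( stopped_at - last_ckpt ))
}

# Example: --checkpoint=360, job preempted 1000 minutes after start.
# The last checkpoint was at minute 720, so 280 minutes are redone.
echo "minutes of work to redo: $(lost_minutes 360 1000)"
```

In this model the work lost is always less than one full interval, so a shorter interval loses less progress but checkpoints more often.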

Limitations: The checkpoint option has limitations for a few types of jobs. The list below shows what will not work with the checkpoint option:

  • BLCR will not checkpoint and/or restore open sockets (TCP/IP, Unix domain, etc.). At restart time any sockets will appear to have been closed.
  • BLCR will not checkpoint and/or restore open character or block devices (e.g. serial ports or raw partitions). At restart time any devices will appear to have been closed.
  • BLCR does not handle SysV IPC objects (man 5 ipc). Such resources are silently ignored at checkpoint time and are not restored.
  • If a checkpoint is taken of a process with any "zombie" children, then these children will not be recreated at restart time. A "zombie" is defined as a process that has exited, but whose exit status has not yet been reaped by its parent (via wait() or a related function). This means that a wait()-family call made after a restart will never return a status for such a child.
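One way to check a program against the socket limitation above is to count the sockets it holds open before relying on checkpoints. The sketch below is only a rough aid and covers only that one limitation: it assumes a Linux /proc filesystem and read permission on the process's fd directory, and the function name count_socket_fds is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not a BLCR or Slurm tool):
# count the entries in a /proc/<pid>/fd directory that point to sockets.
#   $1 = path to an fd directory, e.g. /proc/12345/fd
count_socket_fds() {
    local fd_dir="$1" count=0 fd target
    for fd in "$fd_dir"/*; do
        # Each entry is a symlink; sockets show up as "socket:[inode]".
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            socket:*) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Example: inspect the current shell process.
echo "open sockets in this shell: $(count_socket_fds "/proc/$$/fd")"
```

A non-zero count suggests the program would see those sockets as closed after a restart, so it would need to reopen them itself.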