Batch queueing and job scheduling

Research Computing uses a queueing system called Slurm to manage compute resources and to schedule jobs that use them. Users issue Slurm commands to submit batch and interactive jobs and to monitor their progress during execution.

Access to Slurm is provided by the slurm module.

$ module load slurm/cluster-name
where you should replace cluster-name with "summit" to submit jobs to Summit, and with "blanca" to submit jobs to Blanca. If you do not specify a cluster-name, it will default to Janus.
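
For example, to submit jobs to Summit:

$ module load slurm/summit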

Batch jobs

Slurm is primarily a resource manager for batch jobs: a user writes a job script that Slurm schedules to run non-interactively when resources are available. Users primarily submit computational jobs to the Slurm queue using the sbatch command.

$ sbatch job-script.sh

sbatch accepts a number of arguments, which can be supplied on the command line:

$ sbatch --ntasks 16 job-script.sh

or embedded in the header of the job script itself using #SBATCH directives:

#!/bin/bash
#SBATCH --ntasks 16

You can use the scancel command to cancel a job that has been queued, whether the job is pending or currently running. Jobs are cancelled by specifying the job id that is assigned to the job during submission.
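
For example, using the job ID reported by sbatch at submission time:

$ scancel $jobid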

Example batch job script: hello-world.sh

#!/bin/bash

#SBATCH --ntasks 1
#SBATCH --output hello-world.out
#SBATCH --qos debug
#SBATCH --time=00:05:00

echo Running on $(hostname --fqdn):  'Hello, world!'

This minimal example job script, hello-world.sh, when submitted with sbatch, writes the name of the cluster node on which the job ran, along with the standard programmer's greeting, "Hello, world!", into the output file hello-world.out.

$ sbatch hello-world.sh

Note that any Slurm arguments must precede the name of the job script.
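
For example, the following overrides the script's time limit at submission time (command-line arguments take precedence over the corresponding #SBATCH directives):

$ sbatch --time=00:10:00 hello-world.sh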

Example: Serial jobs

Job requirements

Slurm uses the requirements declared by job scripts and submission arguments to schedule and execute jobs as efficiently as possible. To minimize the time your jobs spend waiting to run, define your job's resource requirements as accurately as possible.

--nodes
The number of nodes your job requires to run.
--mem
The amount of memory required on each node.
--ntasks
The number of simultaneous tasks your job requires. (These tasks are analogous to MPI ranks.)
--ntasks-per-node
The number of tasks (or cores) your job will use on each node.
--time
The amount of time your job needs to run.

The --time requirement (also referred to as "walltime") deserves special mention. Job execution time can be somewhat variable, leading some users to overestimate (or even maximize) the defined time limit to prevent premature job termination; but an unnecessarily long time limit may delay the start of the job and allow undetected stuck jobs to waste more resources before they are terminated.

For all resources, --time included, smaller resource requirements generally lead to shorter wait times.
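
As an illustrative sketch (the values here are hypothetical, not recommendations), a job that runs 24 tasks across two nodes for two hours might declare:

#SBATCH --nodes=2
#SBATCH --ntasks=24
#SBATCH --ntasks-per-node=12
#SBATCH --time=02:00:00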

Summit nodes can be shared, meaning each such node may execute multiple jobs simultaneously, even from different users.

Additional job parameters are documented with the sbatch command.


Summit Partitions

On Summit, nodes with the same hardware configuration are grouped into partitions. You will need to specify a partition using --partition in order for your job to run on the appropriate type of node.

Partition name   Description                                   # of nodes   Cores/node   RAM/core (GB)   Max walltime   Billing weight
shas             General Compute with Haswell CPUs (default)   380          24           5.25            24H            1
sgpu             GPU-enabled                                   10           24           5.25            24H            2.5
smem             High-memory                                   5            48           42              7D             6
sknl             Phi (Knights Landing) CPU                     20           64           TBD             24H            0.1

More details about each type of node can be found here.
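
For example, to direct a job to the general compute nodes:

#SBATCH --partition=shas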


Quality of service (QOS)

On Blanca, a QoS is specified to submit a job to either a group's high-priority queue or to the shared low-priority queue.

On Summit, QoSes are used to constrain or modify the characteristics that a job can have. For example, by selecting the "debug" QoS, a user can obtain higher queue priority for a job with the tradeoff that the maximum allowed wall time is reduced from what would otherwise be allowed on that partition. We recommend specifying a QoS (using the --qos flag or directive in Slurm) as well as a partition for every job.

The available Summit QoSes are:

QOS name   Description                                  Max walltime             Max jobs/user   Node limits               Priority boost
normal     Default                                      Derived from partition   n/a             256/user                  0
debug      For quicker turnaround when testing          1H                       1               32/job                    Equiv. of 3-day queue wait time
long       For jobs needing longer wall times           7D                       n/a             22/user; 40 nodes total   0
condo      For groups who have purchased Summit nodes   7D                       n/a             n/a                       Equiv. of 1-day queue wait time
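
Putting partition and QoS together, a submission might look like this (the partition and QoS values are illustrative):

$ sbatch --partition=shas --qos=normal job-script.sh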

Shell variables and environment

Jobs submitted to Summit are not automatically set up with the same environment variables as the shell from which they were submitted. You must therefore load any required modules and set any environment variables the job needs within the job script itself. These settings should appear after the #SBATCH directives in the job script.
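
A minimal sketch of this pattern is shown below; the module, variable, and program names are hypothetical placeholders for whatever your job actually needs:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# The job does not inherit the submitting shell's environment, so load
# modules and set variables here. (Module and program names are placeholders.)
module load gcc
export OMP_NUM_THREADS=1

./my_program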


Job arrays

Job arrays provide a mechanism for running several instances of the same job with minor variations.

Job arrays are submitted using sbatch, similar to standard batch jobs.

$ sbatch --array=0-9 job-script.sh

Each job in the array has access to an environment variable, $SLURM_ARRAY_TASK_ID, set to that job's index in the array. By consulting this variable, the running job can perform its variant of the task.

Example array job script: array-job.sh

#!/bin/bash

#SBATCH --array 0-9
#SBATCH --ntasks 1
#SBATCH --output array-job.out
#SBATCH --open-mode append
#SBATCH --qos debug
#SBATCH --time=00:05:00

echo "$(hostname --fqdn): index ${SLURM_ARRAY_TASK_ID}"

This minimal example job script, array-job.sh, when submitted with sbatch, runs ten jobs with indexes 0 through 9. Each job appends the name of the cluster node on which it ran, along with the job's array index, to the output file array-job.out.

$ sbatch array-job.sh
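
A common pattern is to use the array index to select a distinct input file for each task; a minimal sketch, with a hypothetical program and input files:

# Inside a job script: each array task processes its own input file.
./my_program input_${SLURM_ARRAY_TASK_ID}.txt
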
Example: Array jobs

Allocations

Access to computational resources is allocated via shares of CPU time assigned to Slurm allocation accounts. You can determine your default allocation account using the sacctmgr command.

$ sacctmgr list Users Users=$USER format=DefaultAccount

Use the --account argument to submit a job for an account other than your default.

#SBATCH --account=crcsupport

You can use the sacctmgr command to list your available accounts.

$ sacctmgr list Associations Users=$USER format=Account

Job mail

Slurm can be configured to send email notifications at different points in a job's lifetime. This is configured using the --mail-type and --mail-user arguments.

#SBATCH --mail-type=END
#SBATCH --mail-user=user@example.com

The --mail-type argument configures which points during job execution generate notifications. Valid values include BEGIN, END, FAIL, and ALL.


Resource accounting

Resources used by Slurm jobs are recorded in the Slurm accounting database. This accounting data is used to track allocation usage.

The sacct command displays accounting data from the Slurm accounting database. To query the accounting data for a single job, use the --job argument.

$ sacct --job $jobid

sacct queries can take some time to complete. Please be patient.

You can change the fields that are printed with the --format option, and the fields available can be listed using the --helpformat option.

$ sacct --job=200 --format=jobid,jobname,qos,user,nodelist,state,start,maxrss,end

If you don't have a record of your job IDs, you can use date-range queries in sacct to find your job.

$ sacct --user=$USER --starttime=2017-01-01 --endtime=2017-01-03

To query the resources being used by a running job, use sstat instead:

$ sstat -a -j JobID.batch

where you should replace JobID with the actual ID of your running job. sstat is especially useful for determining how much memory your job is using; see the "MaxRSS" field.

Monitoring job progress

The squeue command can be used to inspect the Slurm job queue and a job's progress through it.

By default, squeue will list all jobs currently queued by all users. This is useful for inspecting the full queue; but, more often, users simply want to inspect the current state of their own jobs.

$ squeue --user=$USER

Slurm can provide an estimate of when your jobs will start, along with what resources it expects to dispatch your jobs to. Please keep in mind that this is only an estimate!

$ squeue --user=$USER --start

More detailed information about a specific job can be accessed using the scontrol command.

$ scontrol show job $SLURM_JOB_ID

Memory limits

To better balance the allocation of memory to CPU cores (for example, to prevent users from letting their jobs use all the memory on a shared node while only requesting a single core), we have limited each core to a fixed amount of memory. This limit depends on the requested node type. You can either specify how much memory you need (in MB) and let Slurm assign the correct number of cores, or set the number of cores in proportion to the memory your job will need.

Node type   Per-CPU limit   Per-node limit
crestone    3,942 MiB
shas        4,944 MiB       118,658 MiB
sgpu        4,944 MiB       118,658 MiB
smem        42,678 MiB      2,048,544 MiB
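
As an illustrative sketch, a job on a shas node that needs two cores' worth of memory (2 x 4,944 MiB = 9,888 MiB) could request it as follows; the values are examples, not recommendations:

#SBATCH --partition=shas
#SBATCH --ntasks=2
#SBATCH --mem=9888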

Interactive jobs

Interactive jobs allow users to log in to a compute node to run commands interactively on the command line. They are commonly run with the debug QoS as part of an interactive programming and debugging workflow. The simplest way to establish an interactive session is to use the sinteractive command:

$ sinteractive --qos=debug  --time=01:00:00

This will open a login shell using one core on one node for one hour. It also provides X11 forwarding via the submit host and can thus be used to run GUI applications.

If you prefer to submit an existing job script or other executable as an interactive job, use the salloc command.

$ salloc --qos debug job-script.sh

If you do not provide a command to execute, salloc starts a Slurm job and assigns nodes to it, but it does not log you in to the allocated node(s).

The sinteractive and salloc commands each support the same parameters as sbatch, and can override any default configuration. Note that any #SBATCH directives in your job script will not be interpreted by salloc when it is executed in this way. You must specify all arguments directly on the command line.


Topology-aware scheduling

Summit's general compute nodes are arranged into "islands" of about 30 nodes on a single Omni-Path switch. Nodes connected to the same switch have full interconnect bandwidth to other nodes in that same island. The bandwidth between islands is only half as much (i.e., 2:1 blocking). Thus, a job that does a lot of inter-node MPI communication may run faster if it is assigned to nodes in the same island.

If the --switches=1 directive is used, Slurm will put all of the job's tasks on nodes connected to a single switch. Keep in mind that jobs requesting topology-aware scheduling can use a maximum of 32 nodes and may spend a long time in the queue waiting for switch-specific nodes to become available. To specify the maximum amount of time a job should wait for a single switch, use --switches=1@DD-HH:MM:SS and replace DD-HH:MM:SS with the desired number of days, hours, minutes, and seconds. After that time elapses, Slurm will schedule the job on any available nodes.
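
For example, the following submission (with an illustrative node count and wait time) asks that all nodes share a single switch, but lets Slurm fall back to any available nodes after waiting one day:

$ sbatch --switches=1@01-00:00:00 --nodes=16 job-script.sh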