Slurm job submission

This tutorial shows how to submit a compute job to Slurm, the Research Computing batch queueing system. The example job runs long enough for you to see it in the queue and to watch its output file get created and updated.

Note: The example job does nothing but waste computer time. Please resist the temptation to run it at a larger scale.

Prerequisites

Before you begin, you need

  • an RC account
  • a registered OTP authenticator
  • an SSH client application

Log in

First, log in to an RC login node. This step depends on your local environment, but in an OS X or Linux environment you should be able to use the standard OpenSSH command-line client.

$ ssh -l $username login.rc.colorado.edu

Prepare a job directory

In your home directory (the default working directory when you first log in), create a subdirectory to contain your test job script and your job's eventual output.

$ mkdir test-job
$ cd test-job

The cd command changes your working directory to the new test-job directory, which you can confirm with the pwd command.

$ pwd
/home/ralphie/test-job

Write the job script

In a batch queueing environment like that at Research Computing, compute tasks are submitted as scripts that will be executed by the queueing system on your behalf. This script often contains embedded metadata about the resources required to complete the job (e.g., the number of compute nodes and cores and for how long you intend to use them).

You can write this script in any text editor. For ease of instruction, this tutorial uses the cat command with a here-document to write the script text into a file. Paste the following at the dollar prompt and press Enter after the final line to save the file. If you paste into an editor instead, omit the first line ("cat...") and the last line ("EOF").

cat > test-job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name test-job
#SBATCH --time 05:00
#SBATCH --nodes 1
#SBATCH --output test-job.out

echo "The job has begun."
echo "Wait one minute..."
sleep 60
echo "Wait a second minute..."
sleep 60
echo "Wait a third minute..."
sleep 60
echo "Enough waiting: job completed."
EOF

This script describes a job named "test-job" that will run for no longer than five minutes. The job consists of a single task running on a single node, with output directed to a test-job.out file.
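As a quick check that a five-minute limit leaves headroom, the sketch below converts the minutes:seconds time string to seconds and compares it with the script's total sleep time. The function name to_seconds is hypothetical, and it handles only the MM:SS form used above, not Slurm's other time formats.

```shell
#!/bin/bash
# Hypothetical helper: convert a "minutes:seconds" limit (the form used by
# "#SBATCH --time 05:00") into seconds. Other Slurm time formats are not handled.
to_seconds() {
    local minutes=${1%%:*}   # text before the colon
    local seconds=${1##*:}   # text after the colon
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#$minutes * 60 + 10#$seconds ))
}

limit=$(to_seconds "05:00")   # time limit requested by the script
runtime=$(( 3 * 60 ))         # the three "sleep 60" commands

echo "limit=${limit}s runtime=${runtime}s"
if (( runtime < limit )); then
    echo "The job should finish within its time limit."
fi
```

The echo commands add negligible time, so the job needs roughly three of its five requested minutes.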

You can use the cat command again to confirm the content of the new test-job.sh script.

$ cat test-job.sh

Submit the job

The test-job.sh file is a Bash shell script that serves as the initial executable for the job. The #SBATCH directives at the top of the script inform the scheduler of the job's requirements.

To submit the script as a batch job, first load the slurm module, which will provide access to the Slurm commands.

$ module load slurm/summit

Use the sbatch command to submit the script to Slurm.

$ sbatch --qos=debug test-job.sh
Submitted batch job 56

The --qos argument causes the job to be treated as a "debug" job, which grants it additional priority at the cost of tighter restrictions on its size and length. (The --qos option could also have been included as an #SBATCH directive in the script itself.)
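A minimal sketch of that in-script variant follows, shortened to a single echo so it is quick to read. The file name test-job-debug.sh is only an illustration, not part of the tutorial's required files.

```shell
# Write a variant of the job script that carries the QOS as a directive.
cat > test-job-debug.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name test-job
#SBATCH --time 05:00
#SBATCH --nodes 1
#SBATCH --qos debug
#SBATCH --output test-job.out

echo "The job has begun."
EOF

# List the embedded directives to confirm that --qos is among them.
grep '^#SBATCH' test-job-debug.sh
```

With the directive in place, a plain "sbatch test-job-debug.sh" should request the same QOS without the command-line argument.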

When Slurm accepts a new job, it responds with the job id (a number) that can be used to identify the specific job in the queue. Referencing this job id when contacting rc-help@colorado.edu can expedite support.
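If you want to reuse the job id in later commands, you can capture it from the sbatch response. The sketch below parses the "Submitted batch job N" line shown above. The function name job_id_from is hypothetical, and a sample string stands in for real sbatch output so the snippet runs without a scheduler.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric id from sbatch's response line.
job_id_from() {
    # Keep only the last space-separated word of the message.
    echo "${1##* }"
}

# Sample response copied from the example above. On a real system this
# would be: response=$(sbatch --qos=debug test-job.sh)
response="Submitted batch job 56"
jobid=$(job_id_from "$response")
echo "Job id: $jobid"
```

The captured value can then be passed to other Slurm commands, for example squeue --jobs "$jobid". sbatch also has a --parsable option intended to print the id in a script-friendly form; check the sbatch manual page on your system before relying on it.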

Monitor job execution

Use the squeue command to monitor the status of pending or running jobs.

$ squeue --user $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                56     janus test-job  ralphie  R       0:06      1 node1701

If your job has not yet started (i.e., it has a state of PD instead of R), you can use the --start flag to query its estimated start time.

$ squeue --user $USER --start
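The ST column of squeue uses short state codes. The sketch below maps a few common ones to their meanings. The function name describe_state is hypothetical, and the list is deliberately not exhaustive; the squeue manual page has the full set.

```shell
#!/bin/bash
# Hypothetical helper: describe a few common squeue state codes.
describe_state() {
    case "$1" in
        PD) echo "pending: waiting for resources or priority" ;;
        R)  echo "running" ;;
        CG) echo "completing: finishing up" ;;
        CD) echo "completed" ;;
        *)  echo "other state: see the squeue manual page" ;;
    esac
}

describe_state PD
describe_state R
```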

Once the job has started, the output it generates will be directed to the test-job.out file referenced in the job script. You can watch the output as it is written using the tail command.

$ tail -F test-job.out

Once the script has finished, the job passes briefly through the CG (completing) state and then leaves the squeue listing; its final state is CD (completed).

Further reading