Interpreting scontrol job information

Slurm's scontrol command can be used to inspect the status of a job in the queue, but its output can be verbose, and intimidating if you don't already know how to read it.

$ scontrol show job 6286
JobId=6286 Name=test_job
  UserId=ralphie(00001) GroupId=ralphiepgrp(00001)
  Priority=7 Nice=0 Account=ralphie QOS=normal
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
  RunTime=00:00:24 TimeLimit=01:00:00 TimeMin=N/A
  SubmitTime=2014-05-29T10:31:43 EligibleTime=2014-05-29T10:31:43
  StartTime=2014-05-29T10:31:47 EndTime=2014-05-29T11:31:47
  PreemptTime=None SuspendTime=None SecsPreSuspend=0
  Partition=janus AllocNode:Sid=login01:8396
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=node[1342-1345,1362-1367]
  BatchHost=node1342
  NumNodes=10 NumCPUs=120 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  Socks/Node=* NtasksPerN:B:S:C=12:0:*:* CoreSpec=0
  MinCPUsNode=12 MinMemoryCPU=1700M MinTmpDiskNode=0
  Features=(null) Gres=(null) Reservation=(null)
  Shared=0 Contiguous=0 Licenses=(null) Network=(null)
  Command=/home/ralphie/testjob/testjob_submit.sh
  WorkDir=/home/ralphie/testjob
  StdErr=/home/ralphie/testjob/6286.out
  StdIn=/dev/null
  StdOut=/home/ralphie/testjob/6286.out

Here, we'll inspect the output of this example job, and explain each element in detail.
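Before walking through the fields one by one, note that this key=value layout is straightforward to parse programmatically. The following Python sketch handles the simple whitespace-separated case, which covers most fields shown above (a few fields in some Slurm versions contain embedded spaces, which this does not handle):

```python
# Parse `scontrol show job` output into a dictionary.
# Sketch only: assumes whitespace-separated KEY=VALUE tokens.
def parse_scontrol(text):
    fields = {}
    for token in text.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

# Abridged from the example record above.
sample = """JobId=6286 Name=test_job
  JobState=RUNNING Reason=None Dependency=(null)
  RunTime=00:00:24 TimeLimit=01:00:00 TimeMin=N/A"""

job = parse_scontrol(sample)
print(job["JobId"], job["JobState"], job["TimeLimit"])
```

With a record parsed this way, each of the fields discussed below can be looked up by name.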

JobId=6286
Each job that is submitted to Slurm is assigned a unique numerical ID. This ID appears in the output of several Slurm commands, and can be used to refer to the job for modification or cancellation.
Name=test_job
When submitting your job, you can define a descriptive name using --job-name (or -J). Otherwise, the job name will be the name of the script that was submitted.
UserId=ralphie(00001)
GroupId=ralphiepgrp(00001)
Each job runs using the credentials of the user who submitted it. These are the same credentials reported by the id command.
Priority=7
The current scheduling priority for the job, calculated based on the current scheduling policy for the cluster. Jobs with a higher priority are more likely to start sooner.
Nice=0
The nice value is a subtractive adjustment to a job's priority. You can voluntarily reduce your job priority using the --nice argument.
Account=ralphie
Access to Research Computing compute resources is moderated by the use of core-hour allocations to compute accounts. This account is specified using the --account (or -A) argument.
QOS=normal
Slurm uses a "quality of service" system to control job properties. The Research Computing environment also uses QOS values to map jobs to node types.
The QOS is selected during job submission using the --qos argument. More information is available on the Batch queueing and job scheduling page.
JobState=RUNNING
Slurm jobs pass through a number of different states. Common states are PENDING, RUNNING, and COMPLETED.
Reason=None
For PENDING jobs, an explanation for why the job is not yet RUNNING is listed here.
Dependency=(null)
If the job depends on another job (as defined by --dependency or -d), that dependency will be indicated here.
Requeue=1
If a job fails due to certain scheduler conditions, Slurm may re-queue the job to run at a later time. Re-queueing can be disabled using --no-requeue.
Restarts=0
If the job has been restarted (see Requeue above) the number of restarts will be reflected here.
BatchFlag=1
Whether or not the job was submitted using sbatch.
ExitCode=0:0
The exit code and terminating signal (if applicable) for exited jobs.
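The two numbers are separated by a colon: the first is the exit status of the job script, and the second is the signal that terminated it, if any. Splitting the pair is a one-liner (illustrative sketch, not part of Slurm itself):

```python
# Split Slurm's ExitCode field ("status:signal") into its parts.
def split_exit_code(exit_code):
    status, signal = exit_code.split(":")
    return int(status), int(signal)

print(split_exit_code("0:0"))   # clean exit, no signal
print(split_exit_code("0:9"))   # terminated by signal 9 (SIGKILL)
```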
RunTime=00:00:24
How long the job has been running.
TimeLimit=01:00:00
The time limit for the job, specified by --time or -t.
TimeMin=N/A
SubmitTime=2014-05-29T10:31:43
When the job was submitted.
EligibleTime=2014-05-29T10:31:43
When the job became eligible to run. Examples of reasons a job might be ineligible to run include being bound to a reservation that has not started; exceeding the maximum number of jobs allowed to be run by a user, group, or account; having an unmet job dependency; or specifying a later start time using --begin.
StartTime=2014-05-29T10:31:47
When the job last started.
EndTime=2014-05-29T11:31:47
For a RUNNING job, this is the predicted time that the job will end, based on the time limit specified by --time or -t. For a COMPLETED or CANCELLED job, this is the time that the job ended.
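For the running job above, you can verify that the predicted EndTime is simply StartTime plus TimeLimit. A quick check with Python's datetime module (this handles the plain HH:MM:SS case; Slurm time limits can also carry a day prefix, e.g. 1-00:00:00):

```python
from datetime import datetime, timedelta

# StartTime and TimeLimit from the example record above.
start = datetime.fromisoformat("2014-05-29T10:31:47")
h, m, s = (int(x) for x in "01:00:00".split(":"))

end = start + timedelta(hours=h, minutes=m, seconds=s)
print(end.isoformat())  # 2014-05-29T11:31:47, matching EndTime
```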
PreemptTime=None
If the scheduler preempts a running job to allow the start of another job, the time that the job was last preempted will be recorded here.
SuspendTime=None
If a job is suspended (e.g., using scontrol suspend) the time that it was last suspended will be recorded here.
SecsPreSuspend=0
Sparsely documented, but this appears to record how long the job had been running before it was most recently suspended (see SuspendTime below).
Partition=janus
The partition of compute resources targeted by the job. While the partition can be manually set using --partition or -p, the Research Computing environment automatically selects the correct partition when the user specifies the desired QOS using --qos.
AllocNode:Sid=login01:8396
Which node the job was submitted from, along with the system id. (It's safe to ignore the system id for now.)
ReqNodeList=(null)
The list of nodes explicitly requested by the job, as specified by the --nodelist or -w argument.
ExcNodeList=(null)
The list of nodes explicitly excluded by the job, as specified by the --exclude or -x argument.
NodeList=node[1342-1345,1362-1367]
The list of nodes that the job is currently running on.
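Slurm compresses node lists into a bracketed range syntax. A rough expansion of the simple form used here, a single prefix followed by numeric ranges (Slurm's real hostlist syntax allows more, including zero-padding and multiple bracket groups, which this sketch ignores):

```python
import re

# Expand a simple Slurm hostlist like "node[1342-1345,1362-1367]".
def expand_hostlist(expr):
    match = re.fullmatch(r"(\w+)\[([\d,-]+)\]", expr)
    prefix, ranges = match.group(1), match.group(2)
    nodes = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(f"{prefix}{i}" for i in range(int(lo), int(hi) + 1))
        else:
            nodes.append(f"{prefix}{part}")
    return nodes

nodes = expand_hostlist("node[1342-1345,1362-1367]")
print(len(nodes))  # 10, matching NumNodes below
```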
BatchHost=node1342
The "head node" for the job. This is where the job script itself actually runs.
NumNodes=10
The number of nodes requested by the job. May be specified using --nodes or -N.
NumCPUs=120
The number of CPUs requested by the job, calculated from the number of nodes requested, the number of tasks requested, and the allocation of CPUs to tasks.
CPUs/Task=1
The number of CPU cores assigned to each task. May be specified using, for example, --ntasks (-n) and --cpus-per-task (-c).
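For this job the values are consistent: NumCPUs is the product of the node count, the tasks per node (from the NtasksPerN field in the record above), and the CPUs per task.

```python
# Values taken from the example record above.
num_nodes = 10       # NumNodes
tasks_per_node = 12  # NtasksPerN:B:S:C=12:0:*:*
cpus_per_task = 1    # CPUs/Task

num_cpus = num_nodes * tasks_per_node * cpus_per_task
print(num_cpus)  # 120, matching NumCPUs
```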
ReqB:S:C:T=0:0:*:*
An undocumented breakout of the node hardware.
Socks/Node=*
Reflects the specific allocation of CPU sockets per node (where a single socketed CPU may contain many cores). This can be specified using --sockets-per-node, implied by --cores-per-socket, or affected by other node specification arguments.
NtasksPerN:B:S:C=12:0:*:*
An undocumented breakout of the tasks per node.
CoreSpec=0
Sparsely documented, but this appears to count specialized cores reserved for system use rather than for the job's tasks.
MinCPUsNode=12
The minimum number of CPU cores per node requested by the job. Useful for jobs that can run on a flexible number of processors, as specified by --mincpus.
MinMemoryCPU=1700M
The minimum amount of memory required per CPU. Set automatically by the scheduler, but explicitly configurable with --mem-per-cpu.
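Because this limit is per CPU, the job's total memory allocation scales with the CPU count. For the example job, a quick back-of-the-envelope check:

```python
# Values taken from the example record above.
mem_per_cpu_mb = 1700  # MinMemoryCPU=1700M
num_cpus = 120         # NumCPUs

total_mb = mem_per_cpu_mb * num_cpus
print(total_mb)  # 204000 MB across the whole job
```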
MinTmpDiskNode=0
The amount of temporary disk space required per node, as requested by --tmp. Note that Janus nodes do not have local disks attached, and it is expected that most file IO will take place in the shared parallel filesystem.
Features=(null)
Node features required by the job, as specified by --constraint or -C. Node features are not currently used in the Research Computing environment.
Gres=(null)
Generic consumable features required by the job, as specified by --gres. Generic resources are not currently used in the Research Computing environment.
Reservation=(null)
If the job is running as part of a resource reservation (using --reservation), that reservation will be identified here.
Shared=0
Whether or not the job can share resources with other running jobs, as specified with --share or -s.
Contiguous=0
Whether or not the nodes allocated for the job must be contiguous, as specified by --contiguous.
Licenses=(null)
List of licenses requested by the job, as specified by --licenses or -L. Note that Slurm is not used for license management in the Research Computing environment.
Network=(null)
System-specific network specification information. Not applicable to the Research Computing environment.
Command=/home/ralphie/testjob/testjob_submit.sh
The command that will be executed on the head node to start the job. (See BatchHost, above.)
WorkDir=/home/ralphie/testjob
The initial working directory for the job, as specified by --workdir or -D. By default, this will be the working directory when the job is submitted.
StdErr=/home/ralphie/testjob/6286.out
The output file for the stderr stream (fd 2) of the main process of the job, running on the head node. Set by --output or -o, or explicitly by --error or -e.
StdIn=/dev/null
The input file for the stdin stream (fd 0) of the main process of the job, running on the head node. Set to /dev/null by default, but can be configured with --input or -i.
StdOut=/home/ralphie/testjob/6286.out
The output file for the stdout stream (fd 1) of the main process of the job, running on the head node. Set by --output or -o.