How to reduce your queue wait time on Janus


(aka “Use these old weird tricks to get jaw-droppingly short wait times.  #7 will blow your mind!”)


With the recent reduction in available Janus nodes, job queue wait times have increased.  Here are some hints for maximizing your computational output and minimizing the time your jobs spend waiting in the queue.


Game the scheduler.  

It’s easier for Slurm to schedule shorter jobs than longer ones.  Thus, if you know your job will complete in 6 hours, don’t take the default 24-hour wall time.  See https://www.rc.colorado.edu/support/user-guide/batch-queueing.html#simple-table-of-contents-2 for how to specify an appropriate wall time.  However, jobs that run for under 15 minutes are inefficient because Slurm takes a minute or so to start and stop each job; the fractional cost of these prolog/epilog tasks is substantial for very short jobs.
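
For example, if you expect your job to finish in about 6 hours, you might request a slightly padded limit (the exact value here is illustrative) in your batch script with

#SBATCH --time 07:00:00

or on the command line using

sbatch --time=07:00:00 …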


Fewer than 10% of Janus nodes are available in the janus-long QoS.  If you can make your jobs finish in less than 24 hrs, or if you can checkpoint them so they can be broken into 24-hr chunks, you can use the much larger janus QoS.
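
If your application can checkpoint and restart, one way to chain 24-hr chunks is with Slurm job dependencies.  A sketch (the job ID is a placeholder, and chunk.sh is a hypothetical script that resumes from the most recent checkpoint):

sbatch --qos=janus chunk.sh

sbatch --qos=janus --dependency=afterany:1234567 chunk.sh

The second submission sits in the queue but will not start until the first chunk (job 1234567) has finished.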


Some Janus nodes are set aside in a special reservation called janus-serial because their inter-node communication may be slow.  These nodes are perfectly fine otherwise.  If you are running single-node jobs, you can submit them into this reservation, where there will be less competition from multi-node jobs.  You can check how busy the janus-serial reservation is by running

squeue --reservation=janus-serial

You can specify this reservation at submit time by including

#SBATCH --reservation=janus-serial

in your batch script, or on the command line using

sbatch --reservation=janus-serial …

If you want to move a pending job to janus-serial, use

scontrol update jobid=1234567 reservation=janus-serial


The Slurm scheduling software is tuned to give higher priority to jobs that request a lot of nodes.  If you are submitting many (more than ~30) single-node jobs, you might bundle the individual tasks into one big job using RC’s “load balancer” software (documentation at https://www.rc.colorado.edu/support/examples-and-tutorials/load-balancer.html ).  Note, though, that jobs bigger than ~150 nodes compete against other large jobs and are difficult to backfill.  Thus, the effect of job size on queue wait time depends on the mix of other job sizes currently in the queue.


See https://www.rc.colorado.edu/support/user-guide/batch-queueing.html for more background on using Slurm.


Use as many cores per node as possible.

On Janus, jobs are assigned a minimum of one full node (which has 12 CPU cores), even if the job can only use a single core.  Thus, if your workflow involves single-core (i.e., completely non-parallel) jobs, you will want to use special techniques to pack as many of these single-core tasks as possible onto a single node.  Besides reducing the number of jobs you have to get through the queue, you will also waste less of your allocation of compute time, because each Janus job is charged 12 CPU-hours per node per hour even if it is only running on one core.  In addition, fewer wasted cores means that more of this limited resource is available to other users.
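
For example, a 6-hour single-core job that occupies a node by itself is charged 6 × 12 = 72 core-hours; packing 12 such tasks into one 6-hour, single-node job costs the same 72 core-hours but does 12 times the work, and puts 11 fewer jobs in the queue.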


In order to determine how many single-core tasks can fit into the memory of a Janus node, you’ll need to know how much memory each task requires.  You can check how much memory a completed job has used with

sacct -j 1234567 --format=JobID,MaxRSS

which will help you determine how many single-core computations you can pack onto a single node without running out of memory.  (A Janus node has about 20 GB of RAM available to user jobs.)
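
For example, if MaxRSS shows that each task uses about 1.5 GB (an illustrative figure), roughly 13 tasks would fit in 20 GB, but the 12-core limit means you would pack 12 tasks per node; if each task needed 3 GB, memory would be the limiting factor and you would pack only 6.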


If you are running 12 or fewer independent tasks, it’s straightforward to pack them all into one job on a single node just using a basic batch script.  For example:


#!/bin/bash
#SBATCH --job-name efficient
#SBATCH --qos janus
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 12
#SBATCH --time 03:10:00

# load any modules that may be needed
module load intel/impi-15.0.1

# change to the scratch filesystem for faster I/O to output files
cd /lustre/janus_scratch/username

# put the individual computational tasks here, one per line.
# the & indicates that the script should not wait for each
# task to complete before starting the next.
/projects/username/bin/application.x input1.txt > output1.txt &
/projects/username/bin/application.x input2.txt > output2.txt &
/projects/username/bin/application.x input3.txt > output3.txt &
/projects/username/bin/application.x input4.txt > output4.txt &
/projects/username/bin/application.x input5.txt > output5.txt &
/projects/username/bin/application.x input6.txt > output6.txt &
/projects/username/bin/application.x input7.txt > output7.txt &
/projects/username/bin/application.x input8.txt > output8.txt &
/projects/username/bin/application.x input9.txt > output9.txt &

# the next line tells the script to wait until all tasks have
# finished before ending the job.
wait


RC’s “load balancer” software (documentation at https://www.rc.colorado.edu/support/examples-and-tutorials/load-balancer.html) is extremely efficient at farming many single-core tasks out over multiple nodes.  If you have more than 12 non-parallel tasks to run at once, please give it a try.
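
The input to the load balancer is essentially a text file listing one task per line (the application and file names below are illustrative); see the documentation linked above for how to launch it from a batch script.

/projects/username/bin/application.x input01.txt > output01.txt
/projects/username/bin/application.x input02.txt > output02.txt
/projects/username/bin/application.x input03.txt > output03.txt

The load balancer spreads these lines across all of the cores in your job and hands out new lines as earlier ones finish, so no core sits idle while tasks with different run times complete.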


RC's Crestone cluster (qos=crestone) is designed for running serial workloads and, as such, allows multiple jobs to share a single node.  If for some reason you can't use the options above, you can submit single-core jobs directly to Crestone.  Keep in mind that Crestone nodes have a slow network connection to storage systems, including the Janus Lustre scratch filesystem, so Crestone may not be the best choice for data-intensive tasks.
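
A single-core Crestone submission might start with directives like these (a sketch; since Crestone nodes are shared, --ntasks 1 requests just one core rather than a whole node):

#SBATCH --qos crestone
#SBATCH --ntasks 1
#SBATCH --time 04:00:00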

Make sure your applications are running as efficiently as possible.

Ensure that your executables are optimized.  

  • Test with different compiler optimization flags (e.g., -O2, -xHost, -fast for the Intel compilers) to see whether they improve speed while maintaining the necessary accuracy; a sample compile line appears after this list.

  • Build your software on a compile node (not a login node) so you can get architecture-specific optimization built in.

  • Make sure you are using optimized math libraries for things like matrix operations and FFTs.  We strongly recommend the Intel Math Kernel Library (MKL) rather than the reference BLAS or LAPACK routines that may come packaged with your software.  We have seen performance improvements of 5x-10x for some applications after switching to MKL.

  • Understand how your application scales.  If doubling the number of nodes reduces your wall time by only 10%, request the smaller number of nodes.

  • Consider profiling your application using the tools listed in the “Debuggers and Optimizers” section of https://www.rc.colorado.edu/support/user-guide/software.html to find out if there are any inefficient sections that could be improved.
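
As a concrete illustration of the compiler-flag and math-library points above (the source file name is illustrative, and the exact flags your code needs may differ), an Intel-compiler build might look like

icc -O2 -xHost -mkl -o application.x application.c

where -O2 and -xHost turn on general and architecture-specific optimization, and -mkl links against MKL's optimized BLAS, LAPACK, and FFT routines.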



Be sure that any nontrivial disk I/O is happening on /lustre/janus_scratch rather than /home or /projects.  /lustre/janus_scratch can be over 10x faster for some I/O operations.  For more information on how to use Lustre most efficiently, see https://www.rc.colorado.edu/support/examples-and-tutorials/parallel-io-on-janus-lustre.html .


RC staff are available for consulting on ways to optimize your workflow and your applications themselves.  Send an email to rc-help@colorado.edu to ask a question or set up an appointment.