Examples and tutorials

Array jobs

Job arrays run multiple instances of the same base job. Each instance is assigned a unique index, provided in the environment variable $SLURM_ARRAY_TASK_ID.

This example job array uses the sha1sum command to compute a hash for each of a set of files.

Input dataset

Ordinarily you have an input dataset, perhaps one file per intended job index. The example input dataset here is simply a set of randomly-generated, one-megabyte files.

for i in $(seq 0 9); do
  dd if=/dev/urandom of=input-data-${i} bs=1M count=1
done

Job script

sha1sum-array.sh is a Slurm job that expects to run as an index of a Slurm job array. It writes the calculated SHA-1 hash to the --output file as well as to a dedicated hash data file.


#!/bin/bash

#SBATCH --job-name sha1sum-array
#SBATCH --time 5:00
#SBATCH --nodes 1
#SBATCH --output example-array-%a.out

if [ -z "${SLURM_ARRAY_TASK_ID}" ]; then
    echo 1>&2 "Error: not running as a job array."
    exit 1
fi

echo "Array index: ${SLURM_ARRAY_TASK_ID}"

data_file=input-data-${SLURM_ARRAY_TASK_ID}
sha1sum "${data_file}" | tee "${data_file}.sha1"
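You can dry-run the per-index logic without Slurm by exporting the array index yourself. This is a local sketch, not a real array job; data_file follows the input-data-N naming used above:

```shell
# Fake the array index, create a small input file, and run the
# hashing step exactly as the job script would.
export SLURM_ARRAY_TASK_ID=0
data_file=input-data-${SLURM_ARRAY_TASK_ID}

printf 'hello\n' > "${data_file}"
sha1sum "${data_file}" | tee "${data_file}.sha1"
```

If the dry run produces a .sha1 file with the expected "HASH  FILENAME" line, the script body is ready to submit as an array.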

Job submission

The example input dataset consists of 10 input files intended to be processed by 10 job array indices. The janus-debug QOS is restricted to four queued jobs per user, though, so we can start by running the first four indices as debug jobs.

$ sbatch --qos janus-debug --array 0-3 sha1sum-array.sh

If output from those job indices is correct, you can follow up by submitting the rest of the indices with the regular janus QOS.

$ sbatch --qos janus --array 4-9 sha1sum-array.sh
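Because each .sha1 file stores the standard "HASH  FILENAME" format, sha1sum -c can verify the whole dataset after the array finishes. A self-contained sketch:

```shell
# Recreate one input/hash pair and verify it, the same way you would
# verify the real job outputs with `sha1sum -c *.sha1`.
printf 'abc' > input-data-0
sha1sum input-data-0 > input-data-0.sha1

sha1sum -c input-data-0.sha1    # prints "input-data-0: OK"
```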

Environment modules transcript

The following transcript starts with no modules loaded. We obtain a listing of available modules, then load the Intel compiler, followed by the Intel MPI implementation, Jasper, NetCDF, and Parallel NetCDF.

We list the modules currently loaded and see that the dependencies for MPI (namely slurm) and the dependencies for NetCDF (namely HDF5 and szip) were loaded automatically.

All these loaded modules are then saved as a collection called wrf, since these are the modules needed to build WRF.

We then purge all the loaded modules, show that nothing is loaded and then restore this collection of modules.

$ ml
No modules loaded

$ ml av

---------------------------------- Compilers -----------------------------------
   gcc/5.1.0    intel/15.0.2 (m)    pgi/15.3

--------------------------- Independent Applications ---------------------------
   allinea/forge/5.0.1 (m)      loadbalance/0.1    slurm/blanca      (S)
   allinea/forge/5.1   (m,D)    matlab/2015a       slurm/slurm       (S,D)
   autotools/2.69               ncl/6.3.0          stdenv
   cmake/3.2.2                  papi/5.4.1         subversion/1.8.13
   cuda/7.0.28         (g)      paraview/4.3.1     tcltk/8.6.4
   curc-bench/master            pdtoolkit/3.20     totalview/8.15.4
   expat/2.1.0                  perl/5.22.0        udunits/2.2.19
   git/2.4.2                    qt/5.5.0           valgrind/3.10.1
   itac/      (m)      rocoto/1.2
   jdk/1.8.0                    ruby/2.2.3

---------------------------- Lmod Internal Modules -----------------------------
   lmod/6.0.8    settarg/6.0.8

   S:  Module is Sticky, requires --force to unload or purge
   g:  built for GPU
   m:  built for host and native MIC
   D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".

$ ml intel
$ ml impi
$ ml jasper netcdf pnetcdf
$ ml

Currently Loaded Modules:
  1) intel/15.0.2 (m)   3) impi/   5) szip/2.1      7) netcdf/
  2) slurm/slurm  (S)   4) jasper/1.900.1   6) hdf5/1.8.15   8) pnetcdf/1.6.1

   S:  Module is Sticky, requires --force to unload or purge
   m:  built for host and native MIC

$ ml save wrf
$ ml --force purge
$ ml
No modules loaded

$ ml restore wrf
Restoring modules to user's wrf
$ ml

Currently Loaded Modules:
  1) intel/15.0.2 (m)   3) impi/   5) szip/2.1      7) netcdf/
  2) slurm/slurm  (S)   4) jasper/1.900.1   6) hdf5/1.8.15   8) pnetcdf/1.6.1

   S:  Module is Sticky, requires --force to unload or purge
   m:  built for host and native MIC

Interpreting scontrol job information

Slurm's scontrol command can be used to inspect the status of a job in the queue, but its output can be verbose and intimidating if you don't already know how to read it.

$ scontrol show job 6286
JobId=6286 Name=test_job
  UserId=ralphie(00001) GroupId=ralphiepgrp(00001)
  Priority=7 Nice=0 Account=ralphie QOS=normal
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
  RunTime=00:00:24 TimeLimit=01:00:00 TimeMin=N/A
  SubmitTime=2014-05-29T10:31:43 EligibleTime=2014-05-29T10:31:43
  StartTime=2014-05-29T10:31:47 EndTime=2014-05-29T11:31:47
  PreemptTime=None SuspendTime=None SecsPreSuspend=0
  Partition=janus AllocNode:Sid=login01:8396
  ReqNodeList=(null) ExcNodeList=(null)
  NumNodes=10 NumCPUs=120 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  Socks/Node=* NtasksPerN:B:S:C=12:0:*:* CoreSpec=0
  MinCPUsNode=12 MinMemoryCPU=1700M MinTmpDiskNode=0
  Features=(null) Gres=(null) Reservation=(null)
  Shared=0 Contiguous=0 Licenses=(null) Network=(null)

Here, we'll inspect the output of this example job, and explain each element in detail.

  • JobId: Each job submitted to Slurm is assigned a unique numerical ID. This ID appears in the output of several Slurm commands, and can be used to refer to the job for modification or cancellation.
  • Name: When submitting your job, you can define a descriptive name using --job-name (or -J). Otherwise, the job name will be the name of the script that was submitted.
  • UserId, GroupId: Each job runs using the credentials of the user that submitted it. These are the same credentials indicated by the id command.
  • Priority: The current scheduling priority for the job, calculated from the current scheduling policy for the cluster. Jobs with a higher priority are more likely to start sooner.
  • Nice: The nice value is a subtractive adjustment to a job's priority. You can voluntarily reduce your job priority using the --nice argument.
  • Account: Access to Research Computing compute resources is moderated by core-hour allocations to compute accounts. The account is specified using the --account (or -A) argument.
  • QOS: Slurm uses a "quality of service" system to control job properties. The Research Computing environment also uses QOS values to map jobs to node types. The QOS is selected during job submission using the --qos argument. More information is available on the Batch queueing and job scheduling page.
  • JobState: Slurm jobs pass through a number of different states. Common states are PENDING, RUNNING, and COMPLETED.
  • Reason: For PENDING jobs, an explanation of why the job is not yet RUNNING.
  • Dependency: If the job depends on another job (as defined by --dependency or -d), that dependency will be indicated here.
  • Requeue: If a job fails due to certain scheduler conditions, Slurm may re-queue it to run at a later time. Re-queueing can be disabled using --no-requeue.
  • Restarts: If the job has been restarted (see Requeue, above), the number of restarts will be reflected here.
  • BatchFlag: Whether or not the job was submitted using sbatch.
  • ExitCode: The exit code and terminating signal (if applicable) for exited jobs.
  • RunTime: How long the job has been running.
  • TimeLimit: The time limit for the job, specified by --time or -t.
  • SubmitTime: When the job was submitted.
  • EligibleTime: When the job became eligible to run. A job may be ineligible because it is bound to a reservation that has not started, exceeds the maximum number of jobs allowed for a user, group, or account, has an unmet job dependency, or specifies a later start time using --begin.
  • StartTime: When the job last started.
  • EndTime: For a RUNNING job, the predicted end time based on the time limit specified by --time or -t. For a COMPLETED or CANCELLED job, the time that the job ended.
  • PreemptTime: If the scheduler preempts a running job to allow the start of another job, the time that the job was last preempted will be recorded here.
  • SuspendTime: If a job is suspended (e.g., using scontrol suspend), the time that it was last suspended will be recorded here.
  • Partition: The partition of compute resources targeted by the job. While the partition can be set manually using --partition or -p, the Research Computing environment automatically selects the correct partition when the user specifies the desired QOS using --qos.
  • AllocNode:Sid: The node the job was submitted from, along with the system id. (It's safe to ignore the system id for now.)
  • ReqNodeList: The list of nodes explicitly requested by the job, as specified by the --nodelist or -w argument.
  • ExcNodeList: The list of nodes explicitly excluded by the job, as specified by the --exclude or -x argument.
  • NodeList: The list of nodes that the job is currently running on.
  • BatchHost: The "head node" for the job. This is where the job script itself actually runs.
  • NumNodes: The number of nodes requested by the job. May be specified using --nodes or -N.
  • NumCPUs: The number of CPUs requested by the job, calculated from the nodes requested, the number of tasks requested, and the allocation of CPUs to tasks.
  • CPUs/Task: The number of CPU cores assigned to each task. May be specified using, for example, --ntasks (-n) and --cpus-per-task (-c).
  • ReqB:S:C:T: An undocumented breakout of the node hardware.
  • Socks/Node: The specific allocation of CPU sockets per node (where a single socketed CPU may contain many cores). This can be specified using --sockets-per-node, implied by --cores-per-socket, or affected by other node specification arguments.
  • NtasksPerN:B:S:C: An undocumented breakout of the tasks per node.
  • MinCPUsNode: The minimum number of CPU cores per node requested by the job, as specified by --mincpus. Useful for jobs that can run on a flexible number of processors.
  • MinMemoryCPU: The minimum amount of memory required per CPU. Set automatically by the scheduler, but explicitly configurable with --mem-per-cpu.
  • MinTmpDiskNode: The amount of temporary disk space required per node, as requested by --tmp. Note that Janus nodes do not have local disks attached, and it is expected that most file IO will take place in the shared parallel filesystem.
  • Features: Node features required by the job, as specified by --constraint or -C. Node features are not currently used in the Research Computing environment.
  • Gres: Generic consumable resources required by the job, as specified by --gres. Generic resources are not currently used in the Research Computing environment.
  • Reservation: If the job is running as part of a resource reservation (using --reservation), that reservation will be identified here.
  • Shared: Whether or not the job can share resources with other running jobs, as specified with --share or -s.
  • Contiguous: Whether or not the nodes allocated for the job must be contiguous, as specified by --contiguous.
  • Licenses: The list of licenses requested by the job, as specified by --licenses or -L. Note that Slurm is not used for license management in the Research Computing environment.
  • Network: System-specific network specification information. Not applicable to the Research Computing environment.
  • Command: The command that will be executed on the head node to start the job. (See BatchHost, above.)
  • WorkDir: The initial working directory for the job, as specified by --workdir or -D. By default, this is the working directory when the job is submitted.
  • StdErr: The output file for the stderr stream (fd 2) of the main job process, running on the head node. Set by --output or -o, or explicitly by --error or -e.
  • StdIn: The input file for the stdin stream (fd 0) of the main job process, running on the head node. Set to /dev/null by default, but configurable with --input or -i.
  • StdOut: The output file for the stdout stream (fd 1) of the main job process, running on the head node. Set by --output or -o.
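It is often handy to pull a single field out of this output. A sketch using grep and cut against a saved copy (on a live system, pipe scontrol show job <id> directly instead):

```shell
# Extract one field (here JobState) from saved scontrol output.
cat > job.txt <<'EOF'
JobId=6286 Name=test_job
JobState=RUNNING Reason=None Dependency=(null)
EOF

state=$(grep -o 'JobState=[A-Z]*' job.txt | cut -d= -f2)
echo "$state"    # RUNNING
```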

Load balancer

This is a simple example illustrating how to use the load balancer tool. Essentially you enumerate the commands you would like to execute in parallel in a file, making sure that you only put one command per line. If you have multiple commands as part of the same task, separate them with a semicolon (;). This means that you can use your favorite scripting language to create your set of tasks, or you can simply copy-and-paste if you only have a few.

There are two ways to execute this code: interactively and batched. More information about submitting jobs in the Research Computing environment is available in the documentation for batch queueing and job scheduling.

  1. Create the cmd_lines file in the directory you created.
    $ for i in {1..100} ; do \
        echo "sleep 2; echo process $i" >> cmd_lines ; \
      done
    Your ending file should look like the following:
    sleep 2; echo process 1
    sleep 2; echo process 2
    sleep 2; echo process 3
    ...
    sleep 2; echo process 99
    sleep 2; echo process 100
  2. Load the load balancer module.
    $ module load loadbalance
  3. Submit your job, for example using srun.
    $ module load slurm
    $ srun -N 1 --ntasks-per-node=12 lb cmd_lines
  4. You may also submit your job by creating and submitting an sbatch script:
    $ module load slurm
    $ module load loadbalance
    $ vi submit.sh
    Your sbatch script should look something like this:
    #!/bin/bash
    #SBATCH -N 1
    #SBATCH --ntasks-per-node 12
    #SBATCH --output output.out
    #SBATCH --qos janus
    srun lb cmd_lines
    Finally submit your script.
    $ sbatch submit.sh
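Step 1 above can be scripted and checked end to end without Slurm; a minimal sketch that builds the task file and confirms it has exactly one task per line:

```shell
# Build the cmd_lines task file (one task per line) and count the lines.
rm -f cmd_lines
for i in $(seq 1 100); do
    echo "sleep 2; echo process $i" >> cmd_lines
done

wc -l < cmd_lines    # 100
```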

Parallel IO on Janus Lustre

Shared filesystems benefit from maximizing each user’s efficient utilization of resources, in turn increasing collective performance and reliability. Appropriation and misuse by a single user may reduce performance or interrupt service across the entire computing infrastructure. While Janus’ Lustre scratch file system presents a familiar POSIX interface to the user, the structure differs substantially from common file systems like ext4, NTFS, or HFS+. Examining some of Lustre’s operational details may increase the end user’s application performance and facilitate equitable usage of the file system.

Best practices

  • Reading, writing, and removing many small files burdens the file system and can slow it down for all users.
  • Placing too many files in a single directory negatively impacts performance.
  • Avoid using ls -l (or any ls command that uses the -l flag), or any command that must stat many files in quick succession.
  • Do not use wildcards (*) in directories containing thousands of files.
  • Avoid unnecessary use of stderr and stdout streams from parallel processes.
  • Place large files into a directory with a high stripe count. (On Janus Lustre, a reasonable starting point for testing is a stripe count of 6 for files of about 100 MB. A stripe count above 10 may not improve performance except for files larger than 10s of GB.)
  • Store small files, or directories containing many small files, on a single OST (stripe count 1) to reduce contention.
  • Open files read-only, specifying O_NOATIME if there is no reason to update the access time.
  • If many processes need the stat information from a single file, it is most efficient to have a single process perform the stat call and broadcast the results.
  • Avoid frequently opening files in append mode, writing small amounts of data, and closing the file.
  • Instead of reading a small file from every task, read the entire file from one task and broadcast the contents to all other tasks.


Lustre is an object storage system, rather than a block storage system. While block file systems access data as sectors on hard disks, object storage systems perform IO on higher-level data structures called objects. Typically IO is restricted to whole-object manipulation via an Application Programming Interface (API). Lustre is POSIX compliant, meaning that it complies with the IEEE standard suite for operating system compatibility. Users may access the file system without any knowledge of the underlying objects.

As adapted from the Lustre 2.x Filesystem Operations Manual:

Metadata Server (MDS)
The MDS serves metadata stored in the metadata logical disk (MDT) to Lustre clients. Each MDS manages the Lustre file system namespace and provides network request handling for one or more local MDTs.
Metadata Target (MDT)
The MDT stores metadata (such as filenames, directories, permissions and file layout) on storage attached to an MDS. Each file system has one MDT. An MDT on a shared storage target can be available to multiple MDSs, although only one can access it at a time. If an active MDS fails, a standby MDS can serve the MDT and make it available to clients. This is referred to as MDS failover.
Object Storage Target (OST)
On Janus Lustre, an OST is a LUN created from a RAID6 pool of 2TB SATA disks. User file data is stored in one or more objects, with each object on a separate OST in a Lustre file system. The number of objects per file is configurable by the user and can be tuned to optimize performance for a given workload.
Object Storage Servers (OSS)
The OSS provides file I/O service and network request handling for one or more local OSTs. Typically, an OSS serves between two and eight OSTs, up to 16 TB each. A typical configuration is an MDT on a dedicated node, two or more OSTs on each OSS node, and a client on each of a large number of compute nodes.
Lustre clients
Lustre clients are computational, visualization, desktop, or data transfer nodes that are running Lustre client software, allowing them to mount the Lustre file system. The Lustre client software provides an interface between the Linux virtual file system and the Lustre servers.
  • management client (MGC)
  • metadata client (MDC)
  • object storage clients (OSCs)
Logical Object Volume (LOV)
A LOV aggregates the OSCs to provide transparent access across all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent, synchronized namespace. Several clients can write to different parts of the same file simultaneously, while other clients can read from the file.

When a Lustre client submits an IO request to the MDS, the MDS returns the “layout EA” (Extended Attributes) and File IDentifier (FID). FIDs are used to identify a file and where it resides on disk (i.e., the file's "metadata", the closest equivalent to an inode). An FID is comprised of a unique 64-bit sequence number, a 32-bit Object ID (OID), and a 32-bit version number. The layout EA provides the client with the object to OST mapping, and OST to OSS association. Subsequent transactions occur simultaneously between the client and the OSS hosts managing the OSTs containing object segments, avoiding the bottleneck of communications with the single MDS. The single active MDS is the source of the HPC adage: “Lustre doesn’t have great metadata performance.”

File striping

One of the primary features of Lustre is the ability to distribute sections of a file (stripes) across a specified number of OSTs using a round-robin algorithm. Lustre assigns a default stripe count of 1, which instructs the file system to place the entire file on 1 OST, and a stripe size of 1 MB. Users can alter the striping of a directory or file, with files inheriting the stripe count of their parent directory. File striping increases the bandwidth available to the client by parallelizing IO to multiple servers and ensures that a single OST isn’t filled by a small number of large files. However, setting the stripe count too high can result in unnecessary network utilization and server contention.

Aligned IO is an important component of optimal file system utilization. A file striping is said to be aligned if it can be “accessed at offsets that correspond to stripe boundaries.”

In an example 4 MB write, using the default stripe size of 1 MB and a stripe count of 3, the file would be written in four stripes.

  • Process 0 writes 0.5 MB starting at offset 0 MB
  • Process 1 writes 1.25 MB starting at offset 0.5 MB
  • Process 2 writes 2.25 MB starting at offset 1.75 MB

Since these three writes do not start and end on integer multiples of the 1 MB stripe size, they are considered unaligned.

This access pattern is inefficient, necessitating 6 writes. The OSS hosts process the additional requests, inducing unnecessary IO on the OSTs. Unaligned access also causes additional contention from IO requests that are not aligned to file extent (byte-range) lock boundaries, potentially reducing application performance significantly.

An aligned alternative to this write pattern could be for process 0 and 1 to each write 1 MB, leaving Proc 2 to write two (1 MB) stripes, totaling 4 operations.
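The operation counts in this example follow from counting how many stripe boundaries each write spans. A small sketch of that arithmetic, working in 0.25 MB units so everything stays integral (stripe size = 4 units):

```shell
# Number of 1 MB stripes touched by a write [START, END) in 0.25 MB units.
ops() {
    echo $(( ( ($2 - 1) / 4 ) - ( $1 / 4 ) + 1 ))
}

# Unaligned pattern: 0.5 MB @ 0 MB, 1.25 MB @ 0.5 MB, 2.25 MB @ 1.75 MB
unaligned=$(( $(ops 0 2) + $(ops 2 7) + $(ops 7 16) ))

# Aligned alternative: 1 MB @ 0 MB, 1 MB @ 1 MB, 2 MB @ 2 MB
aligned=$(( $(ops 0 4) + $(ops 4 8) + $(ops 8 16) ))

echo "unaligned=$unaligned aligned=$aligned"    # unaligned=6 aligned=4
```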

Performing aligned IO can be automated using MPI-IO with the ADIO/ROMIO Lustre drivers.

Stripe settings for a given file or directory can be accessed using the lfs getstripe and lfs setstripe commands.

$ lfs getstripe /lustre/janus_scratch/$USER/testfile
lmm_stripe_count:   60
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  16
        obdidx      objid           objid                     group
        16          41470855        0x278cb87                 0
        43          39823010        0x25fa6a2                 0
        25          42664951        0x28b03f7                 0
         2          38995264        0x2530540                 0
        39          39838779        0x25fe43b                 0
         4          41093613        0x27309ed                 0
        13          40132541        0x2645fbd                 0
        58          39266789        0x25729e5                 0
        54          37891979        0x2422f8b                 0
        46          37399198        0x23aaa9e                 0
        36          38519155        0x24bc173                 0
        47          40141032        0x26480e8                 0

As indicated by lmm_stripe_count and lmm_stripe_size, the file is comprised of 60 1 MiB objects (stripes). The obdidx column lists the OSTs containing the objects in question.

For circumstances where striping is not desirable, setting the stripe count to 1 disables file striping.

$ lfs setstripe /lustre/janus_scratch/$USER/testdir -c 1

Setting the stripe count to -1 stripes across all available OSTs. (Not recommended for files smaller than 1 TB.)

$ lfs setstripe /lustre/janus_scratch/$USER/testdir -c -1

For more information on the lfs getstripe and lfs setstripe commands, run the commands lfs help getstripe and lfs help setstripe.

Directory structures for many files

Use a two-tier directory structure for large numbers of files (approximately 10,000 or more), with √n directories each containing √n files. Even reading a single file from a directory containing tens of thousands of files can severely reduce performance across the entire system! For more information, see Access Patterns and IO.
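A sketch of laying out such a two-tier tree; the directory names are illustrative only, and only the first few directories are actually created here:

```shell
# For n files, use ~sqrt(n) directories of ~sqrt(n) files each.
n=10000
tiers=$(awk -v n="$n" 'BEGIN { print int(sqrt(n)) }')
echo "$tiers"    # 100

# Create just 3 of the $tiers directories as a demonstration.
mkdir -p demo_tree
for d in $(seq 0 2); do
    mkdir -p demo_tree/dir-$d
done
ls demo_tree | wc -l    # 3
```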

Listing and finding files (ls and find)

The -l argument to ls requires Lustre to connect to every OST that contains a listed file, and is correspondingly expensive. A simple ls command retrieves the listing from the MDS. (ls -f is the fastest option, avoiding the sort operation.)

Many systems alias ls to use --color=auto, which may behave similarly to ls -l. Specifying /bin/ls explicitly, or running unalias ls, will avoid this potential performance impact.

The most efficient option is to use the Lustre lfs find command.

$ lfs find --maxdepth 0 /lustre/janus_scratch/$USER/

The lfs find command is preferred even to the system find command, and supports many of the POSIX find operations. For more information on using lfs find, run the command lfs help find.

Operating on many files at once

Executing rm -rf * or tar in a directory with tens of thousands of files can paralyze Lustre's MDS. Users attempting to issue wildcards or globbing with Linux binaries have brought the entire cluster down.

An alternative to these operations is to use the lfs find command to select files for passing to a later executable.

$ lfs find /lustre/janus_scratch/$USER/subdirectory --type f -print0 |
  xargs -0 rm -f

This ensures that the files are deleted serially, substantially reducing load on the MDS, and OST contention. A corollary to this is to restrict the creation of many small files during a compute job. Sustained file creation rates of more than about 500 files per second are likely to cause problematic load.
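The same deletion pattern can be exercised anywhere with the system find (substitute lfs find on Lustre); the directory and file names here are illustrative:

```shell
# Create some throwaway files, then delete them with the
# find -print0 | xargs -0 pattern recommended above.
mkdir -p scratch_demo
for i in $(seq 1 50); do
    touch scratch_demo/file-$i
done

find scratch_demo -type f -print0 | xargs -0 rm -f
find scratch_demo -type f | wc -l    # 0
```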


MPI-IO forms an entire category of research in parallel file system utilization. Data-sieving, interleaving, and collective IO can yield substantial application performance gains. We recommend the reading listed below for more information on optimizing your use of Lustre with MPI-IO.

Further reading

Parallel MPI jobs

There are three different mechanisms by which MPI jobs may be dispatched using Slurm. The recommended mechanism uses srun to directly launch tasks and initialize inter-process communication. More information on other mechanisms is available in the Slurm MPI and UPC Users Guide.

MPI dispatch with srun

These are example parallel job scripts using current best practices. They each run a 24-task MPI test job on two Janus nodes with a ten-minute time limit.

Intel MPI


#!/bin/bash

#SBATCH --job-name mpi_test
#SBATCH --qos janus
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 12
#SBATCH --time 00:10:00
#SBATCH --output mpi_test.out

# the slurm module provides the srun command
module load slurm

module load intel/impi-13.0.0

srun mpi_test



OpenMPI

#!/bin/bash

#SBATCH --job-name mpi_test
#SBATCH --qos janus
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 12
#SBATCH --time 00:10:00
#SBATCH --output mpi_test.out

# the slurm module provides the srun command
module load slurm

module load openmpi/1.8.3_intel-13.0.0

srun mpi_test



MPICH

#!/bin/bash

#SBATCH --job-name mpi_test
#SBATCH --qos janus
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 12
#SBATCH --time 00:10:00
#SBATCH --output mpi_test.out

# the slurm module provides the srun command
module load slurm

module load mpich/mpich-3.1.2_intel-13.0.0

# PMI-2 must be specified explicitly when using MPICH
srun --mpi=pmi2 mpi_test

Multiple program, multiple data (MPMD)

Slurm supports multi-program MPI through the use of the --multi-prog argument to srun. In this case, the executable supplied to srun is expected to be a configuration file that maps MPI ranks to executables and arguments.

# mpmd-example.conf

0-9 ./a.out
10-23 ./b.out

Compared to the above SPMD examples, simply modify the srun command to run with the defined MPMD configuration.

srun --multi-prog mpmd-example.conf

More information is available in the srun manpage, under the heading "MULTIPLE PROGRAM CONFIGURATION."
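Before submitting, it can help to sanity-check that an MPMD configuration covers every rank exactly once. This sketch assumes a 24-task job (ranks 0-23) and the hypothetical a.out/b.out executables:

```shell
# Write an MPMD config and total the rank ranges it covers.
cat > mpmd-example.conf <<'EOF'
0-9 ./a.out
10-23 ./b.out
EOF

covered=0
while read -r range prog; do
    lo=${range%-*}
    hi=${range#*-}
    covered=$(( covered + hi - lo + 1 ))
done < mpmd-example.conf

echo "$covered"    # 24
```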

Custom task geometry

If your MPI program requires a custom task geometry you will need to redefine the environment variable SLURM_TASKS_PER_NODE and use mpiexec to execute your program.

In this case we request 4 nodes with 12 tasks per node, allocating a total of 48 task slots. However, we want the root rank (rank 0) to run on the first node by itself, the next two nodes to run 12 tasks each (ranks 1-24), and the last node to run only 6 tasks (ranks 25-30), for 31 tasks in total.


#!/bin/bash

#SBATCH --job-name mpi_test
#SBATCH --qos janus
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 12
#SBATCH --time 00:10:00
#SBATCH --output mpi_test.out

# the slurm module provides the srun command
module load slurm

module load intel/impi-13.0.0

export SLURM_TASKS_PER_NODE='1,12(x2),6'
mpiexec ./mpi_test
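Slurm's compact notation can be expanded to confirm the total task count. A sketch (assumes bash) that mirrors how the value above yields 31 tasks:

```shell
# Expand a SLURM_TASKS_PER_NODE value like '1,12(x2),6' into a total
# task count: COUNT or COUNT(xREPS), comma-separated.
expand_tasks() {
    local total=0 part count reps parts
    IFS=',' read -ra parts <<< "$1"
    for part in "${parts[@]}"; do
        case "$part" in
            *'(x'*) count=${part%%'('*}; reps=${part#*x}; reps=${reps%')'} ;;
            *)      count=$part; reps=1 ;;
        esac
        total=$(( total + count * reps ))
    done
    echo "$total"
}

expand_tasks '1,12(x2),6'    # 31
```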

Parallel Programming with JupyterHub

This tutorial demonstrates simple parallel processing examples using the CURC JupyterHub web service, in both ipyparallel and MPI for Python.

This tutorial can currently be found here

OpenMP Performance Tips for Summit


Summit compute nodes are two-socket nodes. As a programmer, you can use a two-socket node as if it were a single socket, but you might miss out on performance. Why? Memory allocated by a processor on socket 0 is accessed fastest by processors on socket 0; memory allocated by processors on socket 1 takes longer to access from socket 0 because it's farther away. Your memory allocation impacts your performance even more than it used to! This post will cover some hints and tips to make sure you're not losing performance with OpenMP on our two-socket Haswell compute nodes.

"First Touch" policy and data initialization examples

Linux uses a "first touch" memory allocation policy by default. The process that first touches memory causes that memory to be allocated close to the processor on which the process is running. The following two diagrams explain "first touch" on a two-processor node where each processor has its own memory (similar in behavior to a two-socket node). Figures were taken from this presentation: http://www.compunity.org/training/tutorials/4%20OpenMP_and_Performance.pdf. In the first example, a serial initialization of an array in C puts the entire array close to one processor.

Memory allocated on one of two processors.

Similarly, the following parallel data initialization would put memory on both processors.

Memory allocated on both of two processors.

Why does this matter? As mentioned before, memory that's closer is faster to access. Thus you want to allocate memory as close as possible to the processor that will use it. This will be more apparent in the matrix*vector example later on, but first we'll go over which cores are where and how to place threads.

Hardware Information

How do you know which cores belong to which processor? You can figure this out by looking at the /proc/cpuinfo file. For example, cat /proc/cpuinfo on a Summit node shows two 12-core Haswell processors, each running at 2.5 GHz. The physical id field shows which core is on which processor. On Summit, all even-numbered core IDs are on one processor and all odd-numbered core IDs are on the other.
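You can also count the sockets programmatically. This sketch parses a trimmed, hypothetical cpuinfo sample; on a real node, read /proc/cpuinfo itself:

```shell
# Count distinct physical processors from /proc/cpuinfo-style output.
cat > cpuinfo.sample <<'EOF'
processor   : 0
physical id : 0
processor   : 1
physical id : 1
processor   : 2
physical id : 0
processor   : 3
physical id : 1
EOF

sockets=$(grep 'physical id' cpuinfo.sample | sort -u | wc -l)
echo "$sockets"    # 2
```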

Core Pinning

How can you select which cores to use? By default the OS chooses which cores you're using. If you're using the Intel compilers, you can set the KMP_AFFINITY environment variable to control and report thread placement. The following shows the default case (OS chooses thread placement) when running the MFiX program (https://mfix.netl.doe.gov/) using the Intel 16 compiler with 4 threads:

## Default setting
-bash-4.2$ export KMP_AFFINITY="verbose"  # verbose reports how OpenMP places threads.
-bash-4.2$ ./mfix
OMP: Info #242: KMP_AFFINITY: pid 125634 thread 0 bound to OS proc set
OMP: Info #242: KMP_AFFINITY: pid 125634 thread 2 bound to OS proc set
OMP: Info #242: KMP_AFFINITY: pid 125634 thread 1 bound to OS proc set
OMP: Info #242: KMP_AFFINITY: pid 125634 thread 3 bound to OS proc set

From this you can see that the threads are free to run on any of the 24 cores. You can select which cores to run on using KMP_AFFINITY with the "scatter" and "compact" options. The following two examples assume an OpenMP program running with 4 threads. The first places threads on both processors, while the second places all 4 threads on one processor.

## Threads on both sockets:
-bash-4.2$ export KMP_AFFINITY="scatter,verbose"  # Put 2 threads on each processor, cores 0, 1, 2, 3
-bash-4.2$ ./mfix
OMP: Info #242: KMP_AFFINITY: pid 125556 thread 0 bound to OS proc set {0}   # Thread 0 on core 0
OMP: Info #242: KMP_AFFINITY: pid 125556 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 125556 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 125556 thread 3 bound to OS proc set {3}

## Threads on one socket:
-bash-4.2$ export KMP_AFFINITY="compact,verbose"  # All threads on one processor, cores 0, 2, 4, 6
-bash-4.2$ ./mfix
OMP: Info #242: KMP_AFFINITY: pid 125482 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 125482 thread 1 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 125482 thread 2 bound to OS proc set {4}
OMP: Info #242: KMP_AFFINITY: pid 125482 thread 3 bound to OS proc set {6}

Performance Impacts

Which option should you use and when? For MOST OpenMP applications the "scatter" option performs best. The following examples show several ways to code a matrix times a vector in C. The three main configurations are:

  • Serial
  • OpenMP with serial data initialization
  • OpenMP with parallel data initialization

The following shows the data initialization and compute loops of the code. The version shown uses OpenMP with parallel data initialization. The OpenMP case with serial data initialization can be obtained by removing the #pragma immediately under the "Data initialization" comment; for the serial version, remove both #pragma statements.

  // Data initialization
  #pragma omp parallel for default(none) \
      private(i,j) shared(m,n,a,b,c)
  for (i=0; i < m; i++){
      a[i] = 1.0;
      c[i] = 1.0;
      for (j=0; j<n; j++){
          b[i*n + j] = i*n + j;
      }
  }

  // matrix * vector
  #pragma omp parallel for default(none) \
      private(i,j) shared(m,n,a,b,c)
  for (i=0; i<m; i++){
      for (j=0; j<n; j++){
          a[i] += b[i*n + j]*c[j];
      }
  }

The following figure shows the performance results of the various configurations of the matrix * vector code. In the figure, SI means serial data initialization, PI means parallel data initialization, and Scatter and Compact indicate the corresponding KMP_AFFINITY settings. All OpenMP configurations offered significant speedup over the serial version. The OpenMP cases with parallel data initialization performed about 20% better than those with serial data initialization: Linux places memory pages on the NUMA node of the thread that first touches them, so parallel initialization puts each chunk of the matrix near the threads that later use it. The KMP_AFFINITY options made little difference in this case, though the fastest configuration overall was KMP_AFFINITY set to scatter.

Figure: Matrix times vector performance results. Two-socket scatter performs best.

Environment Variables

GCC 6.1.0

GOMP_CPU_AFFINITY - Binds threads to specific CPUs. Example:
-bash-4.2$ export GOMP_CPU_AFFINITY="0 3 1-2 4-15"
Binds the first thread to cpu 0, the second to cpu 3, the third and fourth to cpus 1 and 2, and the fifth through sixteenth to cpus 4-15. Only cpus 0-15 will be used; if more than 16 threads are started, some cpus will host multiple threads.
OMP_DISPLAY_ENV - If set to TRUE, the OpenMP version number and the values associated with the OpenMP environment variables are printed to stderr. If set to VERBOSE, it additionally shows the value of the environment variables which are GNU extensions. If undefined or set to FALSE, this information will not be shown.
OMP_PROC_BIND - Specifies whether threads may be moved between cpus. If set to TRUE, threads are bound and will not be moved. When undefined, OMP_PROC_BIND defaults to TRUE when OMP_PLACES or GOMP_CPU_AFFINITY is set, and FALSE otherwise.
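Putting the GCC variables together, a typical session might look like the following sketch (the commented-out binary name is a placeholder, not a real program):

```shell
# Run 4 threads pinned to the first four CPUs, and have the GCC OpenMP
# runtime print its effective settings to stderr when the program starts.
export OMP_NUM_THREADS=4
export GOMP_CPU_AFFINITY="0-3"
export OMP_DISPLAY_ENV=TRUE
# ./my_openmp_program    # placeholder: run your OpenMP binary as usual
```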


Intel 16.0.3

KMP_AFFINITY - Binds threads to specific CPUs. Example:
-bash-4.2$ export KMP_AFFINITY="explicit,proclist=[0,1,2,3],verbose"
Binds the first thread to cpu 0, the second to cpu 1, the third to cpu 2, and the fourth to cpu 3. The verbose keyword displays information on the OpenMP settings.
-bash-4.2$ export KMP_AFFINITY="scatter"
Specifying scatter distributes the threads as evenly as possible across the entire system. On a Summit compute node, thread placement will alternate between sockets.
-bash-4.2$ export KMP_AFFINITY="compact"
Put threads physically as close to each other as possible. On a Summit compute node, all threads will be placed onto one socket first.


PGI 16.5

MP_BIND - Set to yes or y to bind processes or threads executing in a parallel region to physical processors. Example:
-bash-4.2$ export MP_BIND=yes
MP_BLIST - Binds threads to specific CPUs. Example:
-bash-4.2$ export MP_BLIST="1,3,5,7"
Binds thread 0 to cpu 1, thread 1 to cpu 3, thread 2 to cpu 5, and thread 3 to cpu 7.


Full Source Code

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/types.h>
#include <sys/time.h>

double calctime(struct timeval start, struct timeval end){
    double time = 0.0;

    time = end.tv_usec - start.tv_usec;
    time = time/1000000;
    time += end.tv_sec - start.tv_sec;

    return time;
}

int main(){

    double *a;
    double *b;
    double *c;
    long long int i, j, m, n;
    m = 200*128; // 200 KB vector; the m*n matrix is ~5.2GB
    //At 1000*128, the problem no longer fits in Summit memory: segfault.
    n = m;

    struct timeval start, end;

    a = (double*) malloc(m*sizeof(double));
    b = (double*) malloc(m*n*sizeof(double));
    c = (double*) malloc(m*sizeof(double));
    if (a == NULL || b == NULL || c == NULL){
        fprintf(stderr, "malloc failed\n");
        return 1;
    }

    // Data initialization
#pragma omp parallel for default(none) \
    private(i,j) shared(m,n,a,b,c)
    for (i=0; i<m; i++){
        a[i] = 2.0;
        c[i] = 1.0;
        for (j=0; j<n; j++){
            b[i*n + j] = i*n + j;
        }
    }

    // matrix * vector
    gettimeofday(&start, NULL);
#pragma omp parallel for default(none) \
    private(i,j) shared(m,n,a,b,c)
    for (i=0; i<m; i++){
        for (j=0; j<n; j++){
            a[i] += b[i*n + j]*c[j];
        }
    }
    gettimeofday(&end, NULL);

    printf("Time in seconds: %lf(s) for matrix size %lldx%lld\n",
               calctime(start, end), m, n);

    free(a);
    free(b);
    free(c);
    return 0;
}



Reduced OTP password entries

Because one-time passwords (like those generated by the Research Computing Vasco tokens) are, by design, only valid once, they cannot be cached for automatic entry by an ssh client. The Research Computing environment does not permit SSH keys for remote access, either; so users often find themselves having to frequently type in an OTP password.

OpenSSH (a common SSH client present on most Linux and Unix systems, including Mac OS X) provides a feature that allows multiple SSH sessions to share a single connection. The practical result is that a user need log in only once, with all other connections being directed over the existing, pre-authenticated connection.

This configuration takes place on your local client / workstation, not on a Research Computing login node or other system. Run these commands on your local system.

$ mkdir -p ~/.ssh
$ echo >>~/.ssh/config 'Host *login*.rc.colorado.edu
ControlMaster auto
ControlPath /tmp/ssh_mux_%u_%h_%p_%r'

With this stanza in place at the bottom of your ~/.ssh/config file, you can ssh to login.rc.colorado.edu (or any specific Research Computing login node) and every subsequent connection will share the first connection. You will not be required to re-authenticate unless all open connections are closed.

There are many more customization opportunities available in the ~/.ssh/config file (notably the remote username and keepalive settings). For more information, see the ssh_config(5) manpage.

$ man 5 ssh_config
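A related ssh_config option worth knowing is ControlPersist, which keeps the shared master connection alive in the background after your initial session exits, so subsequent logins remain instant. A sketch extending the stanza above (the 4-hour timeout is just an example value):

```
Host *login*.rc.colorado.edu
ControlMaster auto
ControlPath /tmp/ssh_mux_%u_%h_%p_%r
ControlPersist 4h
```

You can check whether a master connection is currently active with ssh -O check login.rc.colorado.edu, and shut it down with ssh -O exit login.rc.colorado.edu.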

SSH config file

The OpenSSH client has a number of configuration options. Specifying these options in your ~/.ssh/config file can simplify the process of accessing remote servers that you use frequently.

Configuration options in ~/.ssh/config can be thought of as being grouped into logical stanzas, each with a Host pattern as its header. (This is not quite true, but it's a usable enough model for most cases.)


The first thing you might want to do is specify host aliases for a server that you access frequently. You might have been doing this already with a shell script, shell alias, or some other mechanism; but it's easy to do it with OpenSSH itself, and it has the benefit of initializing your configuration stanza, as well.

Host login curc rc login.rc.colorado.edu
HostName login.rc.colorado.edu

With this simple stanza defined in ~/.ssh/config, you can now ssh into a Research Computing login node (login.rc.colorado.edu) with any of the defined aliases.

$ ssh login
$ ssh curc
$ ssh rc
$ ssh login.rc.colorado.edu


If you're lucky enough to have the same username on all the local and remote systems you use, then this feature won't mean much to you; but if you're like most people, you have one username on your office workstation, a different username on your home workstation, and quite possibly a different username at each remote system you access.

Rather than having to remember and specify your username each time you log into a remote system (with -l or $user@$host), you can record the correct username for each remote host in ~/.ssh/config, and the defined username will be used by default when accessing that host. (Otherwise, OpenSSH will attempt to use your client-side username by default.)

Host login curc rc login.rc.colorado.edu
HostName login.rc.colorado.edu
User joan5896


Do you frequently use X11 forwarding to run graphical utilities on a Research Computing login node? If so, you probably usually specify -X or -Y when logging in. You can, however, configure OpenSSH to forward X11 connections automatically.

Host login curc rc login.rc.colorado.edu
HostName login.rc.colorado.edu
User joan5896
ForwardX11 yes

More options

Virtually any option that can be specified during an ssh login can be added to ~/.ssh/config. For a full list of the available options, see the ssh_config(5) manpage.

$ man 5 ssh_config
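Keepalive settings are another common addition to a stanza. ServerAliveInterval and ServerAliveCountMax are standard ssh_config options: the sketch below probes the server every 60 seconds and disconnects after three unanswered probes, which keeps idle sessions from being silently dropped (the values are examples only):

```
Host login curc rc login.rc.colorado.edu
HostName login.rc.colorado.edu
User joan5896
ForwardX11 yes
ServerAliveInterval 60
ServerAliveCountMax 3
```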

Of particular additional note are the ControlMaster and ControlPath options, which can be used to reduce the burden of using a one-time password (OTP) authenticator; but these options are covered in a separate tutorial.

Serial jobs

This is an example serial job script using current best practices. More information about submitting jobs in the Research Computing environment is available in the documentation for batch queueing and job scheduling.


#!/bin/bash

# example-serial-job.sh

# The job name will be displayed in the output of squeue
#SBATCH --job-name example-serial-job

# Set a time limit for the job (HH:MM:SS)
#SBATCH --time 0:00:30

# Start a single task (de facto on a single node)
#SBATCH --ntasks 1

# Set the output file name (default: slurm-${jobid}.out)
#SBATCH -o example-serial-job-%j.out

# Load modules used by your job
module load slurm

# Execute the program.

Slurm job submission

This tutorial demonstrates the process of submitting a compute job to the Research Computing batch queueing system, Slurm. The example job should run long enough to allow you to see it running and see the output file get created and updated.

Note: The example job does nothing but waste computer time. Please resist the temptation to run it at a larger scale.


Before you begin, you need

  • an RC account
  • a registered OTP authenticator
  • an SSH client application

Log in

First, log into an RC login node. This step is dependent on your local environment; but in an OS X or Linux environment, you should be able to use the standard OpenSSH command-line client.

$ ssh -l $username login.rc.colorado.edu

Prepare a job directory

In your home directory (which is the directory you will be using by default when you first log in) create a subdirectory to contain your test job script and your job's eventual output.

$ mkdir test-job
$ cd test-job

The cd changes your working directory to the new test-job directory, which you can confirm with the pwd command.

$ pwd

Write the job script

In a batch queueing environment like that at Research Computing, compute tasks are submitted as scripts that will be executed by the queueing system on your behalf. This script often contains embedded metadata about the resources required to complete the job (e.g., the number of compute nodes and cores and for how long you intend to use them).

You can write this script in any text editor. For ease of instruction here, use the cat command to redirect the script text into a file. Paste the following in at the dollar prompt (the final "EOF" line terminates the input), or omit the first line "cat..." and the last line "EOF" if you paste into an editor.

$ cat >test-job.sh << EOF
#!/bin/bash
#SBATCH --job-name test-job
#SBATCH --time 05:00
#SBATCH --nodes 1
#SBATCH --output test-job.out

echo "The job has begun."
echo "Wait one minute..."
sleep 60
echo "Wait a second minute..."
sleep 60
echo "Wait a third minute..."
sleep 60
echo "Enough waiting: job completed."
EOF

This script describes a job named "test-job" that will run for no longer than five minutes. The job consists of a single task running on a single node, with output directed to a test-job.out file.

You can use the cat command again to confirm the content of the new test-job.sh script.

$ cat test-job.sh

Submit the job

The test-job.sh file is a Bash shell script that serves as the initial executable for the job. The #SBATCH directives at the top of the script inform the scheduler of the job's requirements.

To submit the script as a batch job, first load the slurm module, which will provide access to the Slurm commands.

$ module load slurm/summit

Use the sbatch command to submit the script to Slurm.

$ sbatch --qos=debug test-job.sh
Submitted batch job 56

The --qos argument causes the job to be treated as a "debug" job which will grant it additional priority at the cost of tighter restrictions on its size and length. (--qos could have also been included in an #SBATCH directive in the script itself.)

When Slurm accepts a new job, it responds with the job id (a number) that can be used to identify the specific job in the queue. Referencing this job id when contacting rc-help@colorado.edu can expedite support.

Monitor job execution

Use the squeue command to monitor the status of pending or running jobs.

$ squeue --user $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                56     janus test-job  ralphie  R       0:06      1 node1701

If your job has not already started (i.e., it has a state of PD instead of R), you can use the --start flag to query the estimated start time for your job.

$ squeue --user $USER --start

Once the job has started, the output it generates will be directed to the test-job.out file referenced in the job script. You can watch the output as it is written using the tail command.

$ tail -F test-job.out

Once the script has finished, the job will briefly show the completing state (CG) and then exit the queue.

Further reading