Summit Deployment Updates

July 10, 2017
Summit's 20-node Xeon Phi "Knights Landing" partition has been installed and tested successfully.  It will be available for general use shortly.

Please note that this completes the deployment of Summit, so this page will not be updated further.


March 3, 2017

The Summit "condo" expansion consisting of 65 compute nodes owned by individual research groups has been installed and networked into the existing Summit cluster.


February 6, 2017
The "Haswell", "GPU", and "High-Memory" partitions of Summit are now in full production and available to any CU-Boulder or CSU researcher. 


January 18, 2017

We are in the final stages of preparing Summit for general availability to UCB and CSU researchers by the end of January. 

Since the last update, we have added capacity to Summit's scratch storage system and reconfirmed that its performance exceeds the acceptance criteria.

Approximately 40 users from nearly 10 research groups at UCB and CSU have built and tested applications on Summit so far.  Overall, performance and scaling have met expectations.  Scheduling and prioritization of jobs through Slurm's "fairshare" mechanism are working well.
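
For readers unfamiliar with fairshare, the idea can be illustrated with the classic formula from Slurm's multifactor priority documentation, F = 2^(-U/S), where U is an account's normalized recent usage and S is its normalized share of the machine.  This is a minimal sketch only: whether Summit uses this classic formula or a different fairshare algorithm is an assumption, and the account names and numbers below are hypothetical.

```python
def fairshare_factor(normalized_usage: float, normalized_shares: float) -> float:
    """Classic-style fairshare factor: 2 ** -(usage / shares).

    Returns 1.0 for an account with no recent usage and decays toward 0
    as usage grows relative to the account's share.
    """
    if normalized_shares <= 0:
        return 0.0  # an account with no shares gets no fairshare priority
    return 2.0 ** (-(normalized_usage / normalized_shares))


# Hypothetical accounts: (share of the machine, fraction of recent usage).
accounts = {
    "group_a": (0.25, 0.05),  # used far less than its share
    "group_b": (0.25, 0.25),  # used exactly its share
    "group_c": (0.25, 0.60),  # used far more than its share
}

factors = {
    name: fairshare_factor(usage, shares)
    for name, (shares, usage) in accounts.items()
}

# Print accounts from highest to lowest fairshare factor.
for name, factor in sorted(factors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: fairshare factor {factor:.3f}")
```

In this sketch, a group that has used exactly its share gets a factor of 0.5, lighter users score higher, and heavier users score lower, so their pending jobs are prioritized accordingly.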

We still need to perform one fairly major hardware replacement in the scratch storage system, provide remote access to the scratch filesystem, and finalize the configuration of Slurm and the allocation system, but these tasks are all expected to be complete before the end of the month.

Watch for upcoming announcements on how to apply for access to Summit's "general" allocation account, which is the analog of a "startup" allocation on Janus.
 

November 11, 2016
We are ready for large-scale application test jobs on Summit.  Please read carefully through this document, which has been updated with additional information about optimized compiling, the GPFS scratch storage, and Slurm: https://docs.google.com/document/d/15HuROjqlOcZn2MOUFAcYAiU9u_hJmCYE5kjhuukzrD0

We hope to expand the availability of Summit to a wider range of early users starting in a couple of weeks, so if you are a tester, it would be ideal if you could get started as soon as possible.

As always, let us know via this list if you have any questions or difficulties.  Please also report performance results here, and whether they match expectations.  And if your app runs way better than expected, that'd be great to know too!

We expect Summit to be open for general users in January 2017.

September 15, 2016
 
Two Summit GPU nodes are now available for application building and initial testing.

The storage and interconnect vendors have made excellent progress this week toward improving the performance of Summit's GPFS storage system.  Barring unexpected setbacks, it seems likely that it will be possible to begin storage acceptance testing within the next couple of weeks.

Also, two nodes with Xeon Phi ("Knights Landing") host processors are now available for any CU researcher who would like to test an application on a platform with 256 threads per node.  These Phi nodes are connected by a 100 Gbps Omni-Path network, so we are especially interested in hybrid MPI/OpenMP applications that can span both nodes.  RC staff will be available to assist with optimizing applications for Phi.  Any application that currently runs on a standard Intel Xeon processor should build easily on Phi, but will almost certainly require some tuning for best performance.
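
As a rough planning aid for hybrid MPI/OpenMP test runs, the sketch below lists the ways to split the 256 hardware threads per node evenly into MPI ranks per node and OpenMP threads per rank across the two test nodes.  The function name and the restriction to even splits are illustrative assumptions; the best layout for a given application can only be found by measurement.

```python
THREADS_PER_NODE = 256  # hardware threads per Phi node, as stated above
NODES = 2               # number of Phi test nodes currently available


def hybrid_layouts(threads_per_node: int, nodes: int):
    """Yield (ranks_per_node, omp_threads_per_rank, total_mpi_ranks).

    Only layouts that divide the node's hardware threads evenly are
    produced, so every hardware thread is used exactly once.
    """
    for ranks_per_node in range(1, threads_per_node + 1):
        if threads_per_node % ranks_per_node == 0:
            omp_threads = threads_per_node // ranks_per_node
            yield ranks_per_node, omp_threads, ranks_per_node * nodes


for rpn, omp, total in hybrid_layouts(THREADS_PER_NODE, NODES):
    print(f"{rpn:3d} ranks/node x {omp:3d} OpenMP threads/rank "
          f"= {total:3d} MPI ranks total")
```

For each candidate layout, the OpenMP threads-per-rank value would typically be supplied to the application through the OMP_NUM_THREADS environment variable.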
 

September 7, 2016

Summit's GPU and high-memory nodes have passed the initial round of acceptance testing and will be available for application testing as soon as RC staff can re-install them with the RC-specific operating system configuration.

 

The storage and interconnect vendors are continuing to make progress toward understanding the source of the performance issues around Summit's GPFS scratch storage system.  However, at this time we still do not have a firm estimate of when the scratch filesystem will be production-ready.

August 24, 2016

Application acceptance testing on the Summit general compute nodes is underway, with at least three scientific and engineering programs having been built successfully.

However, the GPFS scratch storage system is still underperforming significantly as a result of poor GPFS communication speeds between compute nodes and the storage servers over the Omni-Path network.  The storage vendor has built a full-scale copy of the Summit storage system in their lab and is working with the Omni-Path (OPA) vendor on a fix, but we do not yet have a good estimate of when that will be in place.

Large-scale application testing will likely have to wait until the scratch storage is performing fully.  As a result, "early user" access is postponed until then.

 

It appears that the performance issues on the GPU and high-memory nodes have been successfully resolved, and hardware acceptance testing on them is now in progress.  The results from the intermediate test phases have been good, so we expect to be able to start operating system installation and application building on them early in the week of August 29.
 

August 17, 2016

The largest group of Summit nodes, the 380 general compute nodes, has passed the first phase of hardware acceptance testing.  These tests involve synthetic applications designed to put maximum stress on the hardware and include High-Performance LINPACK (for CPU and memory), STREAM (for memory), and the OSU MPI benchmarks (for the Omni-Path network).  We expect to begin the second phase of acceptance tests, involving a set of science and engineering applications, by early next week.

The remaining Summit components are still being tuned.  We expect the high-memory nodes to pass the initial hardware tests soon, but the GPU nodes and scratch filesystem are still experiencing some performance difficulties associated with their Omni-Path connections.  Their respective vendors have assigned Level-3 engineering teams to work on the remaining issues, but we do not yet have an estimate of when these systems will perform well enough to pass our acceptance criteria.

August 9, 2016

There have been unexpected delays in the delivery and installation of Summit, but we are happy to announce that all of the hardware has now been delivered and installed, and hardware acceptance testing is well underway.
 

The vast majority of the hardware is performing better than expected, although there are still some kinks to be worked out with regard to several of Summit’s more cutting-edge features.  We are on track for the vendors to hand the cluster over to us around August 22nd so that we can move forward with application testing.
 

Thank you to everyone who volunteered to be an early tester for Summit.  As of now we have enough early testers and are excited to get some solid data on Summit application performance for our users.
 

We anticipate that early user access will be available beginning the week of Labor Day with a targeted general availability date of mid-October.  During the early user phase we will be ensuring that scheduling, accounting, and the new software stack are behaving as expected.

 


As we move forward with getting Summit into production, please check our website at www.rc.colorado.edu/news/summitupdates every Tuesday for a weekly update on our progress.