Roger Goff, Data Direct Networks, and the SFA14k

Roger Goff from Data Direct Networks visits to talk about the SFA14k, the storage system that serves the scratch filesystem for our upcoming 'Summit' system. We also end up on a number of tangents, including procurement processes, Omni-Path, and the minutiae of I/O design in a high-performance storage system.

Research Computing’s Jonathon Anderson and Scott Ferguson interviewed Roger Goff, a pre-sales engineer who works with customers to design solutions to storage problems. He works for Data Direct Networks (DDN), the company supplying scratch storage for the Summit supercomputer that will become available this summer. The company’s Storage Fusion Architecture (SFA) aims to optimize storage while minimizing cost. It runs a real-time storage operating system on which every operation is a storage operation, which eliminates interference from extraneous processes and makes performance highly consistent between runs. The SFA 14K, which will be used for Summit, is the first storage system to support Intel’s Omni-Path interconnect. Goff said, “It really is a great collaboration. We love partnering with our customers to do this type of solution, so we’re really looking forward to working with your group on this one.”

The SFA 14K optimizes storage in several ways. It includes the storage fusion fabric, an interconnect that provides excess bandwidth to maximize the system’s performance and availability. In this model, an I/O node that is connected to storage is connected to two controllers. If one controller is down, the I/O node retains a path to the other controller, which maintains availability. Even if an entire enclosure is lost, host I/O is maintained, so storage does not go down.

The 14K’s embedded platform includes 4 virtual machines inside each controller, each of which will own one Omni-Path port with about 10 GB/s of throughput. Although the virtual machines consume controller resources, a higher-core-count processor and more RAM offset that cost, and running them inside the controller means fewer hops through interface cards to reach storage.

SFA also supports partial rebuilds, a capability of every DDN platform, by journaling all I/O to each drive. Thanks to this journaling, a drive that is removed can be brought back online within ten minutes of removal without a full rebuild. Disks in the storage system are distributed across multiple enclosures, creating multiple paths to each disk and making the system more resilient: 20% can be lost without the system going down.
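The partial-rebuild idea described above can be illustrated with a toy model. This is a minimal sketch and not DDN's implementation: the class and method names (`ToyArray`, `pull_drive`, `reinsert_drive`) are hypothetical, the `good_copy` list simply stands in for data the rest of the array could reconstruct, and the ten-minute window is not modeled. It only shows why tracking writes that a missing drive did not receive lets the drive catch up by copying a handful of blocks instead of being rebuilt from scratch.

```python
class ToyDrive:
    """One drive: a flat list of blocks plus an online flag."""

    def __init__(self, num_blocks):
        self.blocks = [None] * num_blocks
        self.online = True


class ToyArray:
    """Hypothetical array that journals which blocks an offline drive missed."""

    def __init__(self, num_blocks):
        # Assumption: stands in for data recoverable from the other drives.
        self.good_copy = [None] * num_blocks
        self.drive = ToyDrive(num_blocks)
        # Journal of block numbers written while the drive was offline.
        self.missed = set()

    def write(self, block, data):
        self.good_copy[block] = data
        if self.drive.online:
            self.drive.blocks[block] = data
        else:
            self.missed.add(block)  # remember what the drive did not see

    def pull_drive(self):
        self.drive.online = False

    def reinsert_drive(self):
        """Bring the drive back and replay only the journaled blocks.

        Returns the number of blocks copied.
        """
        self.drive.online = True
        copied = 0
        for block in sorted(self.missed):
            self.drive.blocks[block] = self.good_copy[block]
            copied += 1
        self.missed.clear()
        return copied


if __name__ == "__main__":
    array = ToyArray(num_blocks=1000)
    for i in range(10):
        array.write(i, b"old")
    array.pull_drive()
    array.write(3, b"new")    # overwritten while the drive is out
    array.write(500, b"new")  # written for the first time while it is out
    print("blocks copied on reinsert:", array.reinsert_drive())
```

In this toy run only the two blocks written during the outage are copied when the drive returns; the other 998 blocks are left untouched, which is the saving a journal-based partial rebuild is after.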

Looking towards the future, DDN has been investing resources in the archive tier of storage. When researchers complete their degrees and publish their work, their data no longer needs to be accessible in the fast parallel storage system, but it has to be permanently saved. Archiving this data in a multi-level tiering space will keep it from occupying resources in the parallel file system. Additionally, DDN will soon release its Infinite Memory Engine, which has been seen to improve I/O speeds by as much as three orders of magnitude. After deploying this system at the Ohio Supercomputer Center, DDN will wait to hear from customers before making further enhancements. Goff emphasized that DDN is customer-driven. “We are the number one provider of HPC storage in the industry,” he said, “because HPC storage is all we do.”

Publication date: Tue, 08 Mar 2016 17:00 -0700
Host: jonathon.anderson@colorado.edu (Jonathon Anderson)
UUID: 05574b65-e8a9-4492-9fe0-ecfe3ebaa6c8