GPFS down event on Summit

Update, 7 September 2017

GPFS has remained stable on Summit since being brought back up yesterday, and the logs also indicate no issues. If the problem recurs, we will investigate further; for now I'm going to treat this as a transient failure occurring due to the end of maintenance (e.g., there were issues with the fabric manager for a bit that were later resolved), rather than an ongoing operational issue.


I happened to notice that GPFS was down on the vast majority of Summit compute nodes tonight, preventing jobs from starting. I'm uncertain why GPFS was shut down (the logs claim that it was shut down normally), but I've brought it back up and cleared the network performance counters. I'll check back in the morning to see whether the problem has recurred and whether there's any network problem that might be causing GPFS to shut down.

Wed, 06 Sep 2017 -0600