Potential power interruption in HPCF (affecting Summit, likely weather-related)

Update (10:52, 19 May 2017)

Summit haswell, gpu, and himem resources are back in production and running jobs, with no long-term fallout observed so far from the power interruption. That said, if you notice anything that seems wrong, please contact us at rc-help@colorado.edu.

---

Update (09:00, 19 May 2017)

The weather appears to have stabilized, and I haven't seen any new failures overnight, so we're bringing Summit back into production this morning. We will provide at least one more update when the work is complete.

---

Update (12:52, 18 May 2017)

Groups of Summit nodes have continued to reboot throughout the day, coincident with power interruptions on the CU campus, likely related to the ongoing storm.

The good news is that we seem to have confirmed that these reboots, which previously had happened overnight, are in fact related to power supply interruptions. We have already planned to install a whole-system UPS in the HPCF this summer, which should dramatically improve the power supply consistency for Summit and prevent these types of failures in the future.

The bad news is that we're likely to continue to experience power interruptions as the storm continues.

We've configured Summit not to start new jobs for now, to prevent further unintended job failures, and will restore service once the storm subsides and power is reasonably stable again. (You may notice some new jobs starting until approximately 4pm, from an ongoing tutorial that we're trying to continue to support.)

As always, if you have any questions, please contact us at rc-help@colorado.edu. We do apologize for the interruption, and we look forward to being able to provide more reliable service after the installation of the HPCF UPS.

~ Jonathon Anderson

---

Original message:

A significant number of Summit compute nodes have just rebooted. We have reason to believe that this was caused by a momentary power glitch immediately beforehand, likely related to the local weather. We are investigating the downed nodes and will return them to service as soon as possible.

Thu, 18 May 2017 -0600