Off-cycle Planned Maintenance - 19-20 July

17:34, 20 July 2017

Summit has been returned to production, and jobs are once again running.

This concludes the second of three UPS installation outages. During this outage, all HPCF IT infrastructure (including Summit) was re-wired to be powered through our new, UPS-equipped power distribution system. When it is activated, this will provide not only power conditioning (remediating a power quality issue that has led to several past Summit compute outages) but also at least 15 minutes of UPS-backed runtime in the event of a complete utility power outage (which should eventually provide us sufficient time to power the system off in a controlled manner).

Our third and final UPS outage is scheduled to begin during our regularly-scheduled maintenance period on 2 August, but will extend through Friday, 4 August. During this outage, the power cabling between the UPS infrastructure and the in-row power-distribution infrastructure will be re-routed for future maintainability; the legacy UPS will be decommissioned; additional in-row power distribution infrastructure will be installed; and the new UPS will be put into full production.

If you notice any problems, or have any questions or concerns, please contact us at rc-help@colorado.edu

---

16:02, 20 July 2017

Power and cooling have been restored at the HPCF, and we are now bringing Summit back into production.

---

15:21, 20 July 2017

The facilities work at the HPCF is extending beyond schedule. I'm onsite and actively monitoring the remaining work, and will start bringing the system up as soon as the building is back up.

---

09:18, 20 July 2017

The report from the field is that work deploying the new HPCF UPS, which supports Summit, progressed as expected yesterday, if not more smoothly. Work continues today and is scheduled to conclude by 15:00 (3 PM). We'll require a bit of additional time to bring the system back up at that point, but we haven't scheduled any additional work to coincide with this bringup, so unless something goes wrong, that should be a relatively straightforward process.

Further updates as they are available.

---

06:21, 19 July 2017

All jobs on Summit have already stopped in anticipation of the reservation for today's off-cycle planned maintenance, so I'm going to go ahead and start shutting the system down in hopes that it might allow work to begin early.

Details about this outage are below, but be reminded that this is an extended outage, with work scheduled to end at 15:00, 20 July (3 PM tomorrow). After that, we still have to bring the system back up, but we'll do that as quickly as we can, and we won't be doing any additional work during the bringup that might delay the process.

Progress on the work will be reported here as it becomes available.

---

Research Computing will perform off-cycle planned maintenance Wednesday and Thursday, 19-20 July 2017. During this period, no jobs will run on Summit compute, and Summit scratch will be unavailable.

As previously announced, Research Computing is currently upgrading the HPCF power infrastructure with an uninterruptible power supply (UPS) component. This will improve the reliability of the power supply to Summit and any future systems installed along with it. In particular, this improved power reliability should prevent future Summit compute outages like those experienced multiple times this year due to sags in utility power and brief weather-related interruptions.

The installation of this UPS is being done in three phases, each of which requires a full power down of the HPCF and a full Summit outage. The first of these outages was completed during the last regularly-scheduled planned maintenance period, during which the UPS was installed and connected to utility power.

The next outage is scheduled to occur off-cycle from our regular PM schedule, starting 07:00 Wednesday, 19 July and ending approximately 17:00 Thursday, 20 July. During this outage, existing equipment (including Summit compute and Summit scratch) will be connected to power via the UPS.

The third and final outage is scheduled to begin alongside our regularly-scheduled August PM, starting at 05:00 Wednesday, 2 August and ending 17:00 Friday, 4 August. During this outage, the HPCF power distribution system will be audited and improved for future expansion and maintainability.

If you have any questions or concerns, please contact us at rc-help@colorado.edu

Wed, 19 Jul 2017 -0600