Resolved Issues

Default Intel compiler module changed from 16.0.3 to 17.4 due to OS incompatibility

The default Intel compiler module was changed because of an unanticipated incompatibility between the updated operating system version installed on Summit (Red Hat Enterprise Linux 7.3) and the previous default Intel compiler version, 16.0.3. Intel announced this incompatibility on 18 January 2017, and it affects every release of Intel compiler version 16.

GPFS down event on Summit

Update, 7 September 2017

GPFS has remained stable on Summit after being brought back up yesterday, and the logs indicate no further issues. If the problem recurs, we will investigate further; but for now I'm going to treat this as a transient failure that occurred at the end of maintenance, rather than an ongoing operational issue. (For example, there were issues with the fabric manager for a while that were later resolved.)

---

Planned Maintenance - Wednesday, 6 September 2017

20:45, 6 September 2017

Today's planned maintenance activities have concluded, and Summit has been returned to service. A small set of nodes has been reserved after being observed to be particularly slow on the network, and we'll follow up on those tomorrow.

DNS interruption in the RC environment

Update, 22 August (next day)

The word from the greater OIT department is that Internet traffic from people on campus streaming the eclipse yesterday disrupted DNS services across the campus. I'm still confused as to why this affected our ability to resolve internal addresses, so we have a point of architectural inquiry into our internal DNS infrastructure; but the actual incident has passed and service has resumed as normal, so I'm marking this issue resolved.

---

Update, 13:08

Campus border firewall maintenance

University of Colorado Boulder campus border firewalls will undergo maintenance on the morning of Wednesday, August 16, from 5:30 a.m. to 6:30 a.m. MDT.

Planned Maintenance - Wednesday, 2-4 August 2017

4 August 2017, 17:37

Summit has been returned to production. This concludes the third and final outage in support of the HPCF UPS installation. (!)

blogin01 reboot and update of Nvidia Drivers

We have opened up logins again on blogin01.

Should you have any questions or issues, please send a ticket to rc-help@colorado.edu.

---

Today we will be taking down blogin01 at 5:30 p.m. to apply some Nvidia driver updates and to reboot the node. Jobs running on Blanca will continue to run; however, all VNC sessions on blogin01 will be terminated. We are hoping that the Nvidia driver update will resolve issues seen by users trying to use indirect GLX on blogin01.

Science Network outage affecting RC services

Update 00:30

A failed supervisor card in one of the Science Network core routers was replaced and service restored. The Research Computing environment weathered the outage relatively well, considering its extent. Summit and Blanca compute, as well as login services, appear to be functional and able to communicate with Core Storage as expected. I've resumed the starting of new jobs, and will follow up further in the morning.

Off-cycle Planned Maintenance - 19-20 July

17:34, 20 July 2017

Summit has been returned to production, and jobs are once again running.

Partial HPCF power interruption

Update, 7 July 2017

The source of the power interruption was found and corrected during Wednesday's planned maintenance. Nodes in Summit rack 3 (shas03*, sgpu03*, and smem0301) have been returned to service.

---
