Resolved Issues

Possible network connectivity issue affecting RC login

Update 14:17, 13 November

The network team believes that it understands the root cause of this routing issue. Full remediation will require a partial outage and will be implemented after SC17 has concluded, likely during the next planned maintenance period. The workaround that is in place appears to be holding for now, but please contact rc-help@colorado.edu if you have any trouble connecting to the RC login environment.

Update 11:25, 10 November

Summit interconnect/storage issue causing stale file handles

Update, 20 November 2017

We're continuing to pursue a resolution for the GPFS problems we're having on Summit. We've just received a procedure to upgrade the OPA software and firmware on the Summit storage system to bring it in line with the version installed on the Summit compute environment, and we should be able to implement this procedure today. We do not know whether this will resolve the issue, but it is at least one more thing that *could* be the cause.

Planned Maintenance - Wednesday, 1 November

Update 19:57, 1 November:

Maintenance activities have concluded, and we have released Summit for jobs.

Be advised that we are aware of some unfortunately persistent instability in the Summit interconnect, particularly affecting Summit storage. We're actively pursuing resolution with both Intel and DDN.

If you have trouble, as always, please contact us at rc-help@colorado.edu.

---

Update 19:10, 1 November:

We have concluded performance testing on blanca-ics, and have released it for jobs.

Unintended quota errors on Summit scratch

Final information / root cause:

I've been going through tickets and see that this quota error impacted quite a few people with jobs dying. This is not unexpected, given the type of error; but please accept our apologies for this disruption.

Due to the impact from this error, I wanted to provide some explicit detail about what happened, so you could at least have the confidence to know that it was a simple mistake, and should not be interpreted as a fundamental instability or problem with the system.

Planned Maintenance - Wednesday, 4 October 2017

Update 17:45

Maintenance activities affecting DTN concluded at 15:45 today. There's still a little work we might do for Blanca, but this should have very little service impact, and may even be completed on a different day.

With that, today's planned maintenance activities have concluded. If you have trouble, please contact rc-help@colorado.edu.

---

Update 13:33

Core storage issue affecting RC environment (Summit, Blanca)

Update, 12:16 PM

Core storage has been brought back into full functionality, and both Summit and Blanca have resumed processing jobs.

A set of jobs with a high-IOPS load was processing against either /home or /projects. This appears to have caused a fault in the core storage infrastructure, causing a subset of its redundant IP addresses to become unresponsive to existing clients. We have held these jobs for now, terminated those which were running, and will be reaching out to the jobs' owner to adjust the workload.

shas0136 (1/2 of scompile) rebooting after OOM

shas0136 has been returned to service.

---

shas0136 ran out of memory earlier today, which has made it inaccessible via ssh. We're rebooting the host remotely, which should return it to service.

Default Intel compiler module changed from 16.0.3 to 17.4 due to OS incompatibility

There is an unanticipated incompatibility between the updated operating system version installed on Summit (Red Hat Enterprise Linux 7.3) and the default Intel compiler version, 16.0.3. This incompatibility was announced by Intel on 18 January 2017, and covers the entirety of the Intel Compiler version 16.

GPFS down event on Summit

Update, 7 September 2017

GPFS has remained stable on Summit after being brought back up yesterday. Logs further indicate no issues. If the problem recurs, we will investigate further; but for now I'm going to treat this as a transient failure occurring due to the end of maintenance, rather than an ongoing operational issue (e.g., there were issues with the fabric manager for a bit that were later resolved).

---

Planned Maintenance - Wednesday, 6 September 2017

20:45, 6 September 2017

Today's planned maintenance activities have concluded, and Summit has been returned to service. A small set of nodes have been reserved after being observed to be particularly slow on the network, and we'll follow up with those tomorrow.
