Resolved Issues

Unplanned service interruption

We believe that we've resolved the remaining NFS client issues, but will continue to monitor in case the problem recurs. Our networking admin is further investigating a possible cause of the issue, which appears to have originated from an unplanned reboot of one of the core Science Network switches.

Please contact rc-help@colorado.edu if you are experiencing any issues.

--

Planned maintenance Wednesday, 7 February 2018

The rest of Blanca was subsequently returned to service successfully.

-

14:54, 8 February 2018

Duo two-factor authentication issues

Working with IAM, Security, and Networking, we have finally discovered the root cause of the Duo authentication failures that started at 7am this morning. We are working with those three groups now to ensure that processes are in place to prevent this kind of downtime in the future.

Planned Maintenance Wednesday, 3 January 2018

Maintenance activities have concluded, and we have released Summit for jobs.

If you have trouble, as always, please contact us at rc-help@colorado.edu.

---

This is a reminder about the maintenance tomorrow. 

Please be aware that it also includes a reboot of Summit Scratch SFA that will affect /scratch/summit and /gpfs/summit/datasets. 

Possible network connectivity issue affecting RC login

Update 14:17, 13 November

The network team believes that it understands the root cause of this routing issue. Full remediation will require a partial outage and will be implemented after SC17 has concluded, likely during the next planned maintenance period. The work-around that is in place appears to be holding for now, but please contact rc-help@colorado.edu if you have any trouble connecting to the RC login environment.

Update 11:25, 10 November

Planned Maintenance - Wednesday, 1 November

Update 19:57, 1 November:

Maintenance activities have concluded, and we have released Summit for jobs.

Be advised that we are aware of some unfortunately persistent instability in the Summit interconnect, particularly affecting Summit storage. We're actively pursuing resolution with both Intel and DDN.

If you have trouble, as always, please contact us at rc-help@colorado.edu.

---

Update 19:10, 1 November:

We have concluded performance testing on blanca-ics, and have released it for jobs.

Unintended quota errors on Summit scratch

Final information / root cause:

I've been going through tickets and see that this quota error impacted quite a few people with jobs dying. This is not unexpected, given the type of error; but please accept our apologies for this disruption.

Due to the impact from this error, I wanted to provide some explicit detail about what happened, so you could at least have the confidence to know that it was a simple mistake, and should not be interpreted as a fundamental instability or problem with the system.

Planned Maintenance - Wednesday, 4 October 2017

Update 17:45

Maintenance activities affecting DTN concluded at 15:45 today. There's still a little work we might do for Blanca, but this should have very little service impact, and may even be completed on a different day.

With that, today's planned maintenance activities have concluded. If you have trouble, please contact rc-help@colorado.edu.

---

Update 13:33

Core storage issue affecting RC environment (Summit, Blanca)

Update, 12:16 PM

Core storage has been brought back into full functionality, and both Summit and Blanca have resumed processing jobs.

A set of jobs with a high-IOPS load was processing against either /home or /projects. This appears to have caused a fault in the core storage infrastructure, causing a subset of its redundant IP addresses to become unresponsive to existing clients. We have held these jobs for now, terminated those that were running, and will be reaching out to the jobs' owner to adjust the workload.

shas0136 (1/2 of scompile) rebooting after OOM

shas0136 has been returned to service.

---

shas0136 ran out of memory earlier today, which has made it inaccessible via ssh. We're rebooting the host remotely, which should return it to service.
