Science Network outage affecting RC services

Update 00:30

A failed supervisor card in one of the Science Network core routers was replaced and service restored. The Research Computing environment weathered the outage relatively well, considering its extent. Summit and Blanca compute, as well as login services, appear to be functional and able to communicate with Core Storage as expected. I've resumed the starting of new jobs, and will follow up further in the morning.

Update 21:50

We believe that we have identified the failed component, and are working on replacing it with an on-site spare. We'll assess the extent of the disruption when the network is back up, and update here.


The Science Network appears to have suffered a hardware failure that is causing widespread connectivity issues between RC services. Notably, connectivity to RC Core Storage has been disrupted, including /home and /projects directories.

Summit has been marked down to prevent new jobs from starting, but the legacy Slurm deployment on Blanca prevents us from doing the same there, because Blanca Slurm itself depends on Core Storage to function.

We are investigating the cause of the failure, and are working toward resolution. Further updates will be posted here as they become available.

