Janus scratch degradation (potential disruption)

Update 17 April, 9:04 PM

Our storage system support vendor has determined that the affected controller has failed and should be replaced. They are arranging a replacement, but be reminded that Janus, including this scratch storage system, is no longer covered by a support contract. This controller failure is indicative of the age of the system and should serve as a warning that Janus scratch should be vacated if possible.

In the meantime, we've returned the filesystem to a stable state and have reactivated Slurm for dispatch on Janus compute resources.

Be reminded that Janus is scheduled for decommissioning, starting 24 May with the cessation of compute work. The current plan is to begin decommissioning Janus scratch on 31 May by changing the filesystem to read-only, but further failures in the system may lead to the scratch filesystem becoming unavailable earlier.

Janus users are strongly encouraged to move compute workflows and any data remaining on Janus scratch to Summit resources as soon as possible.

For more information on the Janus decommissioning schedule, see https://www.rc.colorado.edu/janusdecommissioning.

If you have any further questions, please contact rc-help@colorado.edu.

---

Update 17 April, 12:44 PM

We're working to restore full functionality and redundancy on Janus scratch, but these efforts may be causing some disruption in the functionality of the scratch filesystem. To be on the safe side, we're preventing new jobs from starting for now.

---

Janus scratch (Lustre) experienced a controller failure today. The remaining controller in the affected couplet has remained functional and appears to be providing the expected redundancy, but Janus scratch may provide lower than expected performance unless and until we are able to bring the failed controller back online.

At this time we are going to continue to allow jobs to run normally on Janus, but please contact us at rc-help@colorado.edu if you experience any unexpected trouble.

Sun, 16 Apr 2017 -0600