Computing at SLAC

Partial Scientific Computing Outage, August 25

Affecting some UNIX fileservers, batch machines, HPSS, and some other servers

Updated: 19 August 2009



What Is Happening

On Tuesday, August 25th, 2009, there will be an outage of some UNIX (NFS and xrootd) fileservers, HPSS, the tersks, and some batch machines. Shutdown of the servers will begin at 6:00 am, and we expect to have most services restored by 5:00 pm. Some exceptions are noted below.

In order to perform necessary electrical work on the first floor of building 50, we will need to shut down some UNIX fileservers (primarily for scientific data), the HPSS system (used for tape data storage and backup), the tersk pool of machines (interactive Solaris machines), and some other machines: red, the euclids, and some BaBar machines associated with data distribution.

In addition, Sun needs to perform repair/replacement work on the temperature sensors in the "black boxes" (Modular Datacenters), which will require shutting down the several hundred batch machines in them. Batch workers within building 50 will not be affected. However, because of the large number of unavailable fileservers, we will be inactivating the general batch queues starting Monday evening (8/24). This should limit the fallout (failed jobs, hung batch workers) caused by taking the fileservers down. Jobs already running on batch workers within building 50 will continue to run, but new jobs will not be started.

We also need to perform maintenance on the batch system. During this maintenance period (approximately 10 am to noon), batch jobs will continue to run on the available batch workers, but commands like "bjobs" will be unavailable.

Note that enterprise and network services will not be affected.

Examples of enterprise services that will remain in operation include:

  • Telephones
  • Email
  • Network infrastructure
  • Windows and UNIX home directories; Windows group space; UNIX AFS group space
  • Web servers; SharePoint; Citrix
  • Business IT

What Is Going Down

There are several classes of machines which will be shut down. First, there are the UNIX batch workers in the two "black boxes": bali0001-0252, boer0001-0135, and yili0001-0131. These machines will stop accepting new batch jobs beginning Friday night, August 21st, and will be powered off at 6 am, Tuesday, August 25th.
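For anyone who wants to check mechanically whether a given batch worker is in one of the black-box ranges above, the following is a minimal sketch. The ranges are copied from this announcement; the function name, the assumption that hostnames have a lowercase prefix followed by exactly four digits, and the handling of a trailing domain suffix are illustrative choices, not part of any SLAC tool.

```python
import re

# Ranges copied from the announcement: prefix -> (first, last), inclusive.
AFFECTED_RANGES = {
    "bali": (1, 252),
    "boer": (1, 135),
    "yili": (1, 131),
}

# Assumption: worker names are a lowercase prefix plus a 4-digit number.
HOST_PATTERN = re.compile(r"^([a-z]+)(\d{4})$")


def is_affected_worker(hostname):
    """Return True if hostname falls in one of the black-box ranges."""
    # Drop any domain suffix, e.g. "bali0042.example.org" -> "bali0042".
    short_name = hostname.split(".", 1)[0].lower()
    match = HOST_PATTERN.match(short_name)
    if match is None:
        return False
    prefix, number = match.group(1), int(match.group(2))
    if prefix not in AFFECTED_RANGES:
        return False
    first, last = AFFECTED_RANGES[prefix]
    return first <= number <= last
```

For example, `is_affected_worker("boer0135")` is true, while `is_affected_worker("boer0136")` is false because 136 is past the end of the boer range.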

The other servers to be shut down are located on the first floor of building 50. The complete list can be found here:

     1st floor servers going down

The list includes a number of NFS file servers. The file systems they export are listed here:

     unavailable NFS filesystems

Users should avoid running batch jobs that depend on any of the filesystems in the above list during the outage.

Also note that users who have symbolic links to these NFS filesystems in their home directories may experience slowness when performing certain commands (including 'ls') because of the unavailable fileservers.
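One way to find such links ahead of time is sketched below. It reads each link's target with `os.readlink` rather than following it, so it should not itself block on an unavailable fileserver. The function name is illustrative, the list of prefixes must be filled in by hand from the list of unavailable NFS filesystems, and only the top level of the given directory is examined.

```python
import os


def find_links_into(directory, unavailable_prefixes):
    """Return sorted (link_path, target) pairs for symlinks directly inside
    `directory` whose target lies at or under one of the given prefixes."""
    prefixes = [os.path.normpath(p) for p in unavailable_prefixes]
    hits = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # is_symlink() inspects the link itself and does not follow it.
            if not entry.is_symlink():
                continue
            target = os.readlink(entry.path)
            if not os.path.isabs(target):
                # Relative targets are interpreted relative to the link's directory.
                target = os.path.join(directory, target)
            # normpath is purely lexical; it does not touch the filesystem.
            target = os.path.normpath(target)
            for prefix in prefixes:
                if target == prefix or target.startswith(prefix + os.sep):
                    hits.append((entry.path, target))
                    break
    return sorted(hits)
```

A call might look like `find_links_into(os.path.expanduser("~"), ["/nfs/hypothetical01"])`, where `/nfs/hypothetical01` is a made-up placeholder for a real entry from the list.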

 

The HPSS system and its associated servers will be going down. This means that the astore, bstore, and mstore services will be unavailable, and it will not be possible to restore AFS, NFS, or Windows backups during the outage.

Other servers which will be down:

  • tersks (interactive Solaris machines)
  • euclids (KIPAC)
  • red (KIPAC). Note: red will be shut down at noon on Monday, 8/24, and may not be back in service until Wednesday, 8/26.

Finally, while some batch workers will run without interruption, we will also take this opportunity to perform some maintenance on the batch system itself. During this time, approximately 10 am until noon, batch commands such as bsub, bjobs, and bpeek will be unavailable. Jobs already running will be unaffected. The general batch queues will be inactivated Monday evening and reactivated once the outage is over.


John Bartelt