On Tuesday, August 25th, 2009, there will be an outage of some unix (NFS and xrootd) fileservers, HPSS, the tersks, and some batch machines. Shutdown of the servers will begin at 6:00 am, and we expect to have most services restored by 5:00 pm. Some exceptions are noted below.
In order to perform necessary electrical work on the first floor of building 50, we will need to shut down some unix fileservers (primarily for scientific data), the HPSS system (used for tape data storage and backup), the tersk pool of machines (interactive Solaris machines), and some other machines: red, the euclids, and some BaBar machines associated with data distribution.
In addition, Sun needs to perform repair/replacement work on the temperature sensors in the "black boxes" (Modular Datacenters), which will require shutting down the several hundred batch machines in them. Batch workers within building 50 will not be affected. However, because of the large number of unavailable fileservers, we will be inactivating the general batch queues starting Monday evening (8/24). This should limit the fallout (failed jobs, hung batch workers) caused by taking the fileservers down. Jobs already running on batch workers within building 50 will continue to run, but new jobs will not be started.
We also need to perform maintenance on the batch system. During this maintenance period (approximately 10am to Noon), batch jobs will continue to run on the available batch workers, but commands like "bjobs" will be unavailable.
Note that enterprise and network services will not be affected.
Examples of enterprise services that will remain in operation include:
There are several classes of machines which will be shut down. First, there are the unix batch workers in the two "black boxes": bali0001-0252, boer0001-0135, and yili0001-0131. These machines will stop accepting new batch jobs beginning Friday night, August 21st, and will be powered off at 6am, Tuesday, August 25th.
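For users who want to check whether a particular host is one of the affected batch workers, the ranges above can be expanded into individual host names. The following is only a minimal sketch: the helper name `expand` and the assumption that each range is written as a lowercase prefix followed by zero-padded first and last numbers are ours, not part of any SLAC tool.

```python
import re

# Assumed range syntax: lowercase prefix, first number, "-", last number.
RANGE = re.compile(r"^([a-z]+)(\d+)-(\d+)$")

def expand(spec):
    """Expand a range such as 'bali0001-0252' into a list of host names."""
    match = RANGE.match(spec)
    if match is None:
        raise ValueError(f"not a host range: {spec!r}")
    prefix, first, last = match.groups()
    width = len(first)  # keep the zero padding of the first number
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]

# The three ranges listed in this notice.
affected = set()
for spec in ("bali0001-0252", "boer0001-0135", "yili0001-0131"):
    affected.update(expand(spec))
```

With this set in hand, a test such as `"boer0042" in affected` tells you whether a given worker is in one of the black boxes.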
The other servers to be shut down are located on the first floor of building 50. The complete list can be found here:
The list includes a number of NFS file servers. The file systems they export are listed here:
Users should avoid running batch jobs that depend on any of the filesystems in the above list during the outage.
Also note that users who have symbolic links to these NFS filesystems in their home directories may experience slowness when performing certain commands (including 'ls') because of the unavailable fileservers.
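One way to find such links ahead of time is to read each link's target without touching the target itself, so the check cannot hang on an unavailable server. The sketch below is illustrative only: the function name `links_into` is hypothetical, and the mount-point prefixes you pass in must come from the filesystem list referenced above (the path in the usage note is a placeholder, not a real SLAC filesystem).

```python
import os

def links_into(directory, prefixes):
    """Return sorted (link_path, target) pairs for symlinks directly inside
    `directory` whose targets fall under any of `prefixes`.

    Only the link itself is read (os.readlink); the target is never
    opened or stat'ed, so an unreachable fileserver is not contacted.
    """
    cleaned = [p.rstrip("/") for p in prefixes]
    hits = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_symlink():
                continue
            target = os.readlink(entry.path)
            if not os.path.isabs(target):
                # Resolve relative targets textually, without filesystem access.
                target = os.path.join(directory, target)
            target = os.path.normpath(target)
            if any(target == p or target.startswith(p + "/") for p in cleaned):
                hits.append((entry.path, target))
    return sorted(hits)
```

For example, `links_into(os.path.expanduser("~"), ["/nfs/placeholder01"])` would list the links in your home directory that point under that (made-up) mount point. Only the top level of the directory is scanned; subdirectories are not searched.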
The HPSS system and its associated servers will be going down. This means that the astore, bstore, and mstore services will be unavailable, and restoring backups from AFS, NFS, and Windows will not be possible during the outage.
Other servers which will be down:
Finally, while there will be some batch workers running without interruption, we will also take this opportunity to perform some maintenance on the batch system itself. During this time, approximately 10am until noon, batch commands such as bsub, bjobs, bpeek, etc., will be unavailable. Jobs already running will be unaffected. The general batch queues will be inactivated Monday evening, and reactivated once the outage is over.