LCLS Channel Archiver Routine Maintenance

This document describes the routine maintenance of the LCLS archiver system. It does not address more complicated maintenance activities, such as adding new archiver engines or changing the current top-level archiver index as it grows large (e.g., approaching the current Channel Archiver 2 GB limit).

New Method to Copy Current Engine Directory Data and Rebuild NFS Regular Density Indexes

This section describes the new method of restarting archiver engines to close engine directory data/index files, copying these files to the archiver NFS long-term storage area, and rebuilding the NFS regular density indexes. It supersedes the old method described in the next section. The new method is automated and requires minimal maintenance by the archiver administrators.

There are two subsystems to accomplish this functionality. The first is the automatic restart of the archive engines every workday to close engine directory data/index files and the monitoring of this activity for unexpected conditions. This occurs on the LCLS archiver sampling system, lcls-archeng. The second subsystem is the copying of recently closed engine directory data/index files and rebuilding the NFS regular density indexes. This occurs on the LCLS archiver server system, lcls-archsrv.

Automatic Restart of Archive Engines and Monitoring

Archiver engines are restarted every workday by the Archive Daemon process. This approach differs from the old method of manually restarting the Archive Daemon process and initiating the stopping of all the engine processes running under its control so that the new Archive Daemon process would restart them. Engines are restarted one minute apart in order to reduce the CPU and network load for Channel Access reconnection. Currently the 16 archiver engines are restarted in order starting at 10:00 AM every workday (the 16th and last archive engine is therefore restarted at 10:15 AM every workday) as specified by the /arch/archiveconfig.xml configuration file.
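The restart timing described above can be checked with a short calculation. The following Python sketch is illustrative only and is not part of the Archive Daemon; the function name and the example date are assumptions, and only the 10:00 AM start, the one-minute spacing, and the count of 16 engines come from this document.

```python
from datetime import datetime, timedelta

def restart_schedule(first_restart, engine_count, spacing_minutes=1):
    """Return the planned restart time of each engine, in restart order."""
    return [first_restart + timedelta(minutes=i * spacing_minutes)
            for i in range(engine_count)]

# The date is arbitrary; only the time of day matters here.
schedule = restart_schedule(datetime(2011, 4, 29, 10, 0), engine_count=16)
for number, when in enumerate(schedule, start=1):
    print(f"engine {number:2d}: {when:%H:%M}")
```

With 16 engines spaced one minute apart, the last entry falls 15 minutes after the first, which matches the 10:15 AM time given above.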

The Archive Daemon process has been modified to not restart the engines on weekends or on holidays, as specified in the /arch/holidays.txt file. This has been done so engines are not restarted when archiver administrators are unavailable for manual intervention if an unexpected event occurs. To modify the holidays.txt file (e.g., yearly or to conform to the archiver administrators' vacation schedule), follow the procedure below. It restarts the Archive Daemon process, the archive_engine_monitor.pl process, and the auto_update_server.pl process so that each rereads the new holidays.txt file upon initialization:

  1. Log in as laci on the lcls-archeng machine.
  2. cd /arch
  3. Modify the holidays.txt file.
  4. scripts/stop_daemons.pl -p
  5. scripts/start_daemons.pl
  6. Stop the archive_engine_monitor.pl process.
  7. cd /arch/scripts
  8. ./st.archive_engine_monitor.pl
  9. Log in as laci on the lcls-archsrv machine.
  10. Stop the auto_update_server.pl process.
  11. cd /nfs/slac/g/archiver/arch_lcls/scripts
  12. ./st.auto_update_server.pl
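
The workday rule that these processes apply can be illustrated with the sketch below. It is not the actual Perl code. The format of holidays.txt is not described in this document, so the one-date-per-line YYYY-MM-DD layout with optional "#" comments, the function names, and the sample dates are all assumptions made for the example.

```python
from datetime import date

def parse_holidays(text):
    """Parse one YYYY-MM-DD date per line; blank lines and '#' comments are skipped.

    This layout is an assumption, not the documented holidays.txt format.
    """
    holidays = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            holidays.add(date.fromisoformat(line))
    return holidays

def is_restart_day(day, holidays):
    """True on a workday: Monday to Friday and not a listed holiday."""
    return day.weekday() < 5 and day not in holidays

# Hypothetical file contents for illustration.
holidays = parse_holidays("2011-05-30  # example holiday\n2011-07-04\n")
for day in (date(2011, 5, 27), date(2011, 5, 28), date(2011, 5, 30)):
    print(day, is_restart_day(day, holidays))
```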

The /arch/scripts/st.archive_engine_monitor.pl script spawns the archive_engine_monitor.pl script, which runs continuously as a daemon process on lcls-archeng. It monitors the restarting of the archive engines and performs corrective actions if unexpected conditions occur. Currently it monitors every workday from 10:00 AM (when the first archive engine is restarted) to 10:30 AM (15 minutes after the last archive engine is restarted), with processing beginning at the start of each minute. It sends email to the archiver administrators to inform them of any corrective actions it took due to unexpected conditions, and again after all the engines have been verified to have restarted correctly. Once all of the archiver engines have been verified to have restarted and to be storing index and data files, a file whose name contains the timestamp of the current day (e.g., 2011_04_29_ready_for_copy_and_index.txt) is written to the /arch/log directory to signal to the lcls-archsrv auto_update_server.pl process that the data/index copy and index rebuilding activity can begin.
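
The monitoring window and the name of the signal file can be sketched as follows. This is a simplified illustration and not the archive_engine_monitor.pl logic: the function names are assumptions, the window is treated as inclusive at both ends, and the holiday check is omitted for brevity.

```python
from datetime import date, datetime, time

MONITOR_START = time(10, 0)   # first engine restart
MONITOR_END = time(10, 30)    # 15 minutes after the last engine restart

def in_monitor_window(now):
    """True Monday to Friday between 10:00 and 10:30 (holidays not considered)."""
    return now.weekday() < 5 and MONITOR_START <= now.time() <= MONITOR_END

def ready_marker_name(day):
    """Build the per-day signal file name, e.g. 2011_04_29_ready_for_copy_and_index.txt."""
    return day.strftime("%Y_%m_%d") + "_ready_for_copy_and_index.txt"

print(in_monitor_window(datetime(2011, 4, 29, 10, 15)))
print(ready_marker_name(date(2011, 4, 29)))
```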

Automatic Copy of Current Engine Directory Data and Rebuild NFS Regular Density Indexes

The /nfs/slac/g/archiver/arch_lcls/scripts/st.auto_update_server.pl script spawns the auto_update_server.pl script, which runs continuously as a daemon process on lcls-archsrv. It monitors for the creation of a new "ready_for_copy_and_index.txt" file with the current day timestamp (e.g., 2011_04_29_ready_for_copy_and_index.txt) in the /arch/log directory. When it detects that "ready_for_copy_and_index.txt" with the current day timestamp exists (and its own activities have not been completed), it copies the recently closed engine directory data/index files and rebuilds the NFS regular density indexes. It sends email to the archiver administrators to inform them of any error conditions it detected or of the successful completion of the copy and index activities. It also writes a file whose name contains the timestamp of the current day (e.g., 2011_04_29_completed_copy_and_indexing.txt) to the /arch/log directory indicating the success or failure of its processing.
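
The decision described above (start only when the current day's ready file exists and the completed file does not) might look like the sketch below. It is an illustration, not the auto_update_server.pl code; the helper names are assumptions, and a scratch directory stands in for /arch/log so the example can be run anywhere.

```python
from datetime import date
from pathlib import Path
import tempfile

def marker(log_dir, day, suffix):
    """Path of a per-day marker file such as 2011_04_29_<suffix>.txt."""
    return Path(log_dir) / f"{day:%Y_%m_%d}_{suffix}.txt"

def should_start_copy(log_dir, day):
    """True when the ready file exists and the completed file does not."""
    ready = marker(log_dir, day, "ready_for_copy_and_index")
    done = marker(log_dir, day, "completed_copy_and_indexing")
    return ready.exists() and not done.exists()

# Demonstration in a scratch directory instead of /arch/log.
with tempfile.TemporaryDirectory() as log_dir:
    day = date(2011, 4, 29)
    before = should_start_copy(log_dir, day)
    marker(log_dir, day, "ready_for_copy_and_index").touch()
    after_ready = should_start_copy(log_dir, day)
    marker(log_dir, day, "completed_copy_and_indexing").touch()
    after_done = should_start_copy(log_dir, day)

print(before, after_ready, after_done)
```

Because the completed file is written on both success and failure, its presence alone is treated here as "already processed today".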

Old Method to Copy Current Engine Directory Data and Rebuild NFS Regular Density Indexes

This section describes the old method of copying current engine directory data/indexes and rebuilding the NFS regular density indexes. This method has been replaced by the new method described above. It remains in the documentation in case unexpected future events call for using this procedure again.

To provide the greatest level of safety in preventing data corruption, all LCLS archiver engines should be restarted daily so that index and data files can be closed in the local disk buffer current engine data directories and later copied to the NFS LCLS archiver storage disk while new archiver data is stored in new engine data directories. After the data copy has completed, the NFS regular density indexes should be rebuilt to reference the newly copied data on the NFS LCLS archiver storage disk.

Restart the Archive Daemon Process, which restarts all archiver engines

Perform the following procedure every workday:

  1. In window A, login to a machine where the LCLS Archive Daemon web interface can be accessed, such as lcls-archsrv:
    ssh lcls-archsrv -l laci
  2. In this same window (window A), bring up the Firefox browser:
    firefox
  3. In the browser brought up in window A, enter the URL for the LCLS Archive Daemon web interface:
    http://lcls-archeng:4900/
  4. In this browser, note the "Status" column for each engine. The format is "nnnn/nnnn channels connected", where the first number is the number of channels currently connected for the engine and the second number is the total number of channels requested to be archived for that engine. Both numbers should be in the thousands for every engine. It is normal for some channels not to be connected. However, if the number of connected channels is not in the thousands, there is a problem: only a small fraction of the engine's channels are connected. The response, which is to restart the affected engine, is described later in this procedure, after all engines have been restarted.
  5. In window B, login to the lcls-archeng machine as laci:
    ssh lcls-archeng -l laci
  6. In the same window (window B on lcls-archeng):
    cd /arch
  7. In the same window (window B on lcls-archeng), stop the Archive Daemon process and all of the archive engines it controlled:
    scripts/stop_daemons.pl -p
  8. In the same window (window B on lcls-archeng), immediately restart the Archive Daemon process:
    scripts/start_daemons.pl
  9. In the same window (window B on lcls-archeng), verify that a new Archive Daemon process was started:
    ps -ef | grep -i daemon
  10. In the same window (window B on lcls-archeng), it is useful to determine when the archive engines for the old stopped Archive Daemon process have died and all of the new archive engine processes have been started by the newly started Archive Daemon process. This can be done by periodically entering the following command:
    ps -ef | grep -i engine
    It may take some time (e.g., 2-3 minutes) for this process to complete. The timestamps for the engines will indicate whether each engine was started on a previous day or today. When all of the engine processes (e.g., 16) are running and all start timestamps show a time (and not a previous date), proceed to the next step.
  11. Observe the "Status" column for each engine in the browser brought up in window A. Up to two or three minutes may elapse before statuses of the form "nnnn/nnnn channels connected" appear for all engines. Until then, statuses of the form "Unknown. Lock file, but no response. (Check again, see if issue persists)" and "Not running." will appear. After statuses of the form "nnnn/nnnn channels connected" appear for all engines, verify that the first number (the number of channels currently connected) is in the thousands (4 digits). This is the case for most restarts of the LCLS archiver engines. However, occasionally for one (and rarely, two) engines only a small fraction of the channels connect. In this case, enter the following URL in the browser window brought up from window A to kill each such engine (the Archive Daemon will immediately restart killed engines):

Alternative procedure to restart the Archive Daemon Process, which restarts all archiver engines

Occasionally the Archive Daemon process may die, which will make the LCLS Archive Daemon web interface unavailable until the Archive Daemon process is restarted. In this case invoke the following procedure to cause all of the LCLS archive engines to be first stopped and then a new LCLS Archive Daemon process to be started (which will restart all of the LCLS archive engine processes). This is an alternative to the "stop_daemons.pl -p" and "start_daemons.pl" procedure described above. This procedure can also be used as a quick "emergency" procedure if unexpected problems occur with the running of archive engines.
  1. cd /arch
  2. Stop any Archive Daemon process that may be running:
  3. scripts/start_daemons.pl -p

After restarting the LCLS Archive Daemon process, one should check whether this process has successfully restarted all of the LCLS archive engines. This is done through the following web page:

It may take a minute or two from the time the LCLS Archive Daemon is restarted for an indication to appear on this web page that each LCLS archive engine is running. For each LCLS archive engine, a red "Not running" message usually means that the system has not finished detecting whether the engine is running. The indication that an LCLS archive engine is running is also shown in red and shows the number of channels connected and the total number of channels in the archiver configuration file for that engine (e.g., "6376/6426 channels connected").
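
The status text can also be interpreted programmatically. The sketch below parses a status string of the form shown above and applies the "in the thousands" rule of thumb from the restart procedure. It is only an illustration: the function names and the exact threshold of 1000 are assumptions, and the sketch does not fetch anything from the web interface.

```python
import re

STATUS_PATTERN = re.compile(r"(\d+)/(\d+) channels connected")

def parse_status(status):
    """Return (connected, total), or None for statuses such as 'Not running.'"""
    match = STATUS_PATTERN.search(status)
    return (int(match.group(1)), int(match.group(2))) if match else None

def looks_healthy(status, minimum_connected=1000):
    """True when the connected count is at least minimum_connected (assumed threshold)."""
    counts = parse_status(status)
    return counts is not None and counts[0] >= minimum_connected

for text in ("6376/6426 channels connected",
             "412/6426 channels connected",
             "Not running."):
    print(text, "->", parse_status(text), looks_healthy(text))
```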

If a message appears indicating that a lock file may be present, remove all lock files for the associated archive engine. For example, for the LCLS_1 archive engine:

  1. cd /arch/lcls/lcls_1
  2. rm *.lck

Rare cases procedures

In very rare cases, an engine may not start. This will be indicated by the persistent Archive Daemon web interface status for the engine: "Unknown. Lock file, but no response. (Check again, see if issue persists)". This indicates that the engine lock file was not removed when the engine was previously stopped. In this case, do the following:

  1. In window B on lcls-archeng, change the present working directory to the engine directory. For example, if engine 1 is not restarting:
    cd /arch/lcls/lcls_1
  2. Note when the engine lock file was created:
    ls -l *.lck
  3. Remove the lock file:
    rm *.lck
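
The "ls -l *.lck" check in the steps above can be reproduced in a script, for example to log lock file times before removing the files. The following is a minimal sketch with an assumed function name; the file names in the demonstration are hypothetical, and a scratch directory is used instead of a real engine directory.

```python
from datetime import datetime
from pathlib import Path
import tempfile

def find_lock_files(engine_dir):
    """Return (path, last-modified time) for each *.lck file in engine_dir."""
    return [(p, datetime.fromtimestamp(p.stat().st_mtime))
            for p in sorted(Path(engine_dir).glob("*.lck"))]

# Demonstration in a scratch directory; both file names are hypothetical.
with tempfile.TemporaryDirectory() as engine_dir:
    (Path(engine_dir) / "example_engine.lck").touch()
    (Path(engine_dir) / "other_file.txt").touch()
    found = [p.name for p, _ in find_lock_files(engine_dir)]

print(found)
```

Note that the sketch reports the last-modified time, which on most Linux file systems is the closest available stand-in for a creation time.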

In very rare cases, an engine may not store data after it has been restarted. This will be indicated by an email message from the /arch/script/archive_data_file_monitor.pl cronjob (which runs every 5 minutes) indicating that data is not being stored for an engine and this condition should be investigated. In this case, do the following in window B on lcls-archeng:

  1. Set the present working directory to the current data storage directory for the problem engine. For example, if the problem engine is engine 1, the year is 2010, the month is July (the 7th month), and the day is the 12th:
    cd /arch/lcls/lcls_1/2010/07_12
  2. Verify that no data is being stored in this directory (there is only an index file):
    ls -alt
  3. cd /arch
  4. Stop the Archive Daemon process and all of the archive engines it controlled:
    scripts/stop_daemons.pl -p
  5. Wait for all engines to be stopped by repeatedly issuing the following command:
    ps -ef | grep -i engine
  6. Delete the directory in which no data is being stored. For the above example, (WARNING: THIS IS JUST FOR THE EXAMPLE CASE ABOVE. BE EXTREMELY CAREFUL TO ONLY DELETE THE CONTENTS OF THE DIRECTORY FOR THE PROBLEM ENGINE IN WHICH NO DATA IS BEING STORED!):
    rm -rf /arch/lcls/lcls_1/2010/07_12
  7. Restart the Archive Daemon process:
    scripts/start_daemons.pl
  8. Continue following the procedure described above in the normal case after the Archive Daemon process has been restarted to verify that all of the engines were restarted successfully.
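
The directory naming used in the example above (year, then month and day joined by an underscore) can be expressed as a small helper, which may reduce the chance of typing the wrong path before a deletion. This is a sketch only: the function name is an assumption, and the lcls_N pattern for engines other than engine 1 is extrapolated from the lcls_1 example rather than stated in this document.

```python
from datetime import date
from pathlib import PurePosixPath

def engine_data_dir(engine_number, day, root="/arch/lcls"):
    """Build <root>/lcls_<N>/<YYYY>/<MM_DD> (naming pattern assumed from the lcls_1 example)."""
    return PurePosixPath(root) / f"lcls_{engine_number}" / f"{day:%Y}" / f"{day:%m_%d}"

print(engine_data_dir(1, date(2010, 7, 12)))
```

The helper only builds a path string. It does not touch the file system, so the result should still be checked by hand against "ls -alt" before anything is removed.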

Invoke the copy data and rebuild indexes script

After performing the procedure to restart the Archive Daemon process (which restarts all archiver engines) during a workday, the data/index files in the engine data directories that were current before the restart have been closed, and new archiver data is stored in new engine data directories. The closed data/index files in these previous engine data directories may now be copied to the NFS LCLS archiver storage disk, and the NFS regular density indexes can be updated to reference this copied data. This is done by invoking the /nfs/slac/g/archiver/arch_lcls/scripts/update_server.pl script. This should be done immediately after restarting the Archive Daemon process. It is recommended that this be done early in the workday (especially on Mondays, since extra archiver data needs to be copied from the preceding weekend) because the network seems to be faster during the mornings and early afternoon than later, resulting in faster copying.

  1. Log on to the lcls-archsrv machine as laci. Note this is the lcls-archsrv machine, not the lcls-archeng machine. The lcls-archsrv machine is considerably more powerful, which considerably reduces the time spent copying the data (which requires much more time than updating the NFS regular density indexes).
    ssh lcls-archsrv -l laci
  2. Change the present working directory (do not try to change this directory to the scripts subdirectory):
    cd /nfs/slac/g/archiver/arch_lcls
  3. Invoke the script which performs the copy and updates indexes:
    scripts/update_server.pl
  4. A prompt will appear asking whether data should be copied from the found source archiver engine directories containing the most recently closed data/index files. Enter "y":
    y
  5. The script will take a considerable time to complete (e.g., one hour for a day of archiver data and perhaps three hours for three days of archiver data, which would accumulate if the script is run on a Monday after last being run the preceding Friday).

Rebuild LCLS/FACET Archiver Top-Level Current Index

It is desirable to rebuild the LCLS (and FACET) archiver top-level current index periodically (e.g., once per week). The LCLS archiver top-level current index contains extra index information, and it is best to keep this extra information to a minimum to limit the size of the file. However, this rebuilding does not absolutely need to be done every week (it can be postponed until the LCLS archiver top-level current index grows to approximately 1.5 GB). The extra index information results because the index contains indexes from previous-day current data directory indexes as well as NFS regular density indexes. While it is necessary for the LCLS archiver top-level current index to contain indexes from each current engine data index, once previous current data directory data and index information is copied to the NFS LCLS archiver storage disk and the NFS regular density indexes are rebuilt to reference this copied data, the previous current data directory indexes are no longer needed in the LCLS archiver top-level index. Also, once the LCLS archiver top-level current index has been rebuilt, the lcls-archeng local disk buffer directories older than the current engine data directories may be deleted without affecting the LCLS archiver top-level current index (although this is not done immediately, since it is desirable to retain at least two weeks of lcls-archeng local disk buffer archiver data in case of an unanticipated long unavailability of the NFS LCLS archiver storage disk).

The FACET archiver setup is very similar to the LCLS archiver setup and has the same scripts and so forth. For FACET, use /nfs/slac/g/archiver/arch_facet in place of /nfs/slac/g/archiver/arch_lcls. The FACET archiver top-level current index does not grow as fast as the LCLS archiver top-level current index; however, it is necessary to check the size of the index periodically and compact it by running scripts/rebuild_archiver_indexes.pl from within /nfs/slac/g/archiver/arch_facet.
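
A periodic size check for either top-level index could follow the thresholds mentioned in this document (rebuild at roughly 1.5 GB, hard limit of 2 GB). The sketch below is illustrative: the function name and status labels are assumptions, GB is interpreted as 2**30 bytes (also an assumption), and the index file location is deliberately left to the caller since it is not given here. The size itself could be obtained with os.path.getsize on the index file.

```python
GIB = 1024 ** 3
REBUILD_THRESHOLD = int(1.5 * GIB)   # suggested rebuild point from this document
HARD_LIMIT = 2 * GIB                 # Channel Archiver index size limit

def index_status(size_bytes):
    """Classify an index size as 'ok', 'rebuild', or 'at limit'."""
    if size_bytes >= HARD_LIMIT:
        return "at limit"
    if size_bytes >= REBUILD_THRESHOLD:
        return "rebuild"
    return "ok"

for size in (1 * GIB, int(1.6 * GIB), 2 * GIB):
    print(f"{size / GIB:.1f} GiB -> {index_status(size)}")
```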

FIRST IMPORTANT NOTE: Restarting the Archive Daemon process to restart all of the archiver engines and the subsequent running of the "scripts/update_server.pl" script for the current day MUST BE DONE PRIOR TO rebuilding the LCLS Archiver Top-Level Current Index. Otherwise, the resulting LCLS archiver top-level current index will not contain the indexes for the previous current data directory indexes, resulting in a "gap" of data that cannot be retrieved.

SECOND IMPORTANT NOTE: The rebuild LCLS archiver top-level current index procedure MUST BE RUN on lcls-archsrv. The regular LCLS archiver top-level index update script runs on lcls-archsrv. The rebuild script first builds a temporary top-level index in /nfs/slac/g/archiver/arch_lcls_2, continually updates it, and points retrieval to it. It then stops the regular LCLS archiver top-level index update script, builds a new top-level index in /nfs/slac/g/archiver/arch_lcls, and continually updates it. This becomes the new regular LCLS archiver top-level index update process. Retrieval is pointed to the new rebuilt index in /nfs/slac/g/archiver/arch_lcls. The temporary (backup) update index process is then stopped. This entire procedure (invoked by the scripts/rebuild_archiver_indexes.pl Perl script) allows the rebuilding of the LCLS archiver top-level current index with no disruption in archiver retrieval.

  1. Log on to the lcls-archsrv machine as laci. IT IS IMPORTANT THAT THIS PROCEDURE IS RUN ON THE lcls-archsrv MACHINE AS "laci".
    ssh lcls-archsrv -l laci
  2. READ THE FIRST IMPORTANT NOTE ABOVE BEFORE PROCEEDING.
  3. Change the present working directory (do not try to change this directory to the scripts subdirectory):
    cd /nfs/slac/g/archiver/arch_lcls
  4. Invoke the script which rebuilds the LCLS archiver top-level current index:
    scripts/rebuild_archiver_indexes.pl
    Depending on the size of the rebuilt index, this procedure can take several minutes or up to three hours to complete.

Author:  Bob Hall 14-Jul-2010

Rev:  Bob Hall 09-Nov-2010 Added the alternative "start_daemons.pl -p" procedure.

Rev:  Bob Hall 04-May-2011 Added "New Method to Copy Current Engine Directory Data and Rebuild NFS Regular Density Indexes" section.