Summary

This document covers the regular maintenance of the EPICS archiver appliances for LCLS, FACET and the Test Facilities. It describes how we currently upgrade, maintain and monitor the appliances.

General Information

  1. The various files used are located in /afs/slac/g/lcls/package/ArchiverAppliance.
  2. On the appliances, the various folders of interest are in /arch.
  3. LCLS, FACET and TestFac archivers connect to the IOCs through a gateway. The LCLS and FACET gateways run on lcls-prod01; the TestFac gateway runs on testfac-daemon2. The gateways are specific to the archivers; use ps -afde | grep ARCH to locate the gateway processes and configuration folders.
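
As a minimal sketch (the account used on the gateway host is site specific; <account> below is only a placeholder):

$ ssh <account>@lcls-prod01     # or testfac-daemon2 for the TestFac gateway
$ ps -afde | grep ARCH          # the gateway processes and their configuration folders show up in the command lines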

Upgrading the appliances to a new release

  1. Change to the snapshot folder using cd /afs/slac/g/lcls/package/ArchiverAppliance/snapshot and create a new folder there using the snapshot_MMDDYYYY naming convention (for example, snapshot_03182014).
    
    $ cd /afs/slac/g/lcls/package/ArchiverAppliance/snapshot
    $ mkdir snapshot_03182014 
    $ cd snapshot_03182014
    
    
  2. cd into this newly created folder.
  3. Copy the new release snapshot to this newly created folder.
  4. Untar the new snapshot using tar zxf archappl_xxxxxxxxx.tar.gz in the newly created folder. This expands the tarball into several files; you should see at least four WAR files (mgmt.war, engine.war, etl.war and retrieval.war).
    
    $ tar zxf archappl_v0.0.1_SNAPSHOT_18-March-2014T09-40-30.tar.gz
    $ ls -ltra
    total 215129
    -rwxr-xr-x  1 mshankar cd      7752 Mar 18 09:38 quickstart.sh*
    -rw-r--r--  1 mshankar cd      3512 Mar 18 09:38 LICENSE
    -rw-r--r--  1 mshankar cd  26114814 Mar 18 09:40 retrieval.war
    -rw-r--r--  1 mshankar cd  27595096 Mar 18 09:40 engine.war
    -rw-r--r--  1 mshankar cd  26111821 Mar 18 09:40 etl.war
    -rw-r--r--  1 mshankar cd  31049103 Mar 18 09:41 mgmt.war
    drwxrwxr-x 27 mshankar cd      8192 Mar 18 10:41 ../
    -rw-r--r--  1 mshankar cd 109392409 Mar 18 10:46 archappl_v0.0.1_SNAPSHOT_18-March-2014T09-40-30.tar.gz
    drwxrwxr-x  3 mshankar cd      2048 Mar 18 10:46 sample_site_specific_content/
    drwxrwxr-x  2 mshankar cd      2048 Mar 18 10:46 install_scripts/
    drwxrwxr-x  4 mshankar cd      2048 Mar 18 10:46 ./
    
    
  5. In /afs/slac/g/lcls/package/ArchiverAppliance/snapshot, move the current softlink to point to the newly created folder.
  6. For example:
    $ rm current
    rm: remove symbolic link `current'? y
    $ ln -s snapshot_03182014 current
    $ ls -ltrd current 
    lrwxr-xr-x 1 mshankar cd 17 Mar 18 10:50 current -> snapshot_03182014/
    
    
  7. Log in to the appliance(s)
    1. For TESTFAC ssh acctf@testfac-archapp
    2. For FACET ssh laci@facet-archapp
    3. For LCLS ssh laci@lcls-archapp01 or ssh laci@lcls-archapp02 or ssh laci@lcls-archapp03
  8. cd /afs/slac/g/lcls/package/ArchiverAppliance/tools/script
  9. Take a screenshot of the metrics page. This helps in making sure that the appliances connect to all the PVs that we normally connect to. See the Controls Archiver Admin Access page for more information.
  10. Stop the Tomcat processes before deployment using /etc/init.d/st.tomcat stop.
    1. Note that this is not strictly necessary; the deployWARfiles.bash step also does this, but I have found stopping Tomcat explicitly to be a more reliable way to make sure the processes have stopped.
  11. Check for Tomcat processes using ps -afde | grep jsvc.
    1. It is extremely important that you wait until all the processes terminate. The ETL process flushes data from STS to MTS on shutdown; this can take 2-5 minutes.
      1. If the processes are still running after 5-10 minutes, consider terminating them using kill -9. Note, however, that there is a chance you may lose data in this case.
  12. Run ./deployWARfiles.bash facility, where facility is one of lcls, facet or testfac. This deploys the WAR files from the /afs/slac/g/lcls/package/ArchiverAppliance/snapshot/current folder onto this appliance and starts the Tomcat servers in /arch/tomcats.
  13. Check for Tomcat processes using ps -afde | grep jsvc. You should see 8 processes: two for each component.
    
    [laci@facet-archapp ~]$ ps -afde | grep jsvc
    laci     19914 19761  0 11:02 pts/0    00:00:00 grep jsvc
    laci     21819     1  0 Mar18 ?        00:00:00 jsvc.exec -server ... -Dcatalina.base=/arch/tomcats/mgmt ...
    laci     21820 21819  4 Mar18 ?        10:14:26 jsvc.exec -server ... -Dcatalina.base=/arch/tomcats/mgmt ...
    laci     21822     1  0 Mar18 ?        00:00:00 jsvc.exec -server ... -Dcatalina.base=/arch/tomcats/engine ...
    laci     21823 21822 28 Mar18 ?      2-19:32:18 jsvc.exec -server ... -Dcatalina.base=/arch/tomcats/engine ...
    laci     21862     1  0 Mar18 ?        00:00:00 jsvc.exec -server ... -Dcatalina.base=/arch/tomcats/etl ...
    laci     21864 21862  5 Mar18 ?        12:23:17 jsvc.exec -server ... -Dcatalina.base=/arch/tomcats/etl ...
    laci     21904     1  0 Mar18 ?        00:00:00 jsvc.exec -server ... -Dcatalina.base=/arch/tomcats/retrieval ...
    laci     21905 21904  4 Mar18 ?        10:32:24 jsvc.exec -server ... -Dcatalina.base=/arch/tomcats/retrieval ...
    
    
  14. Watch the mgmt logs using tail -f /arch/tomcats/mgmt/logs/arch.log. Once you see the message "All components in this appliance have started up. We should be ready to start accepting UI requests", this appliance has started correctly.
    
    48401 [Startup executor] INFO  config.org.epics._..._.mgmt.MgmtPostStartup  - Finished post startup for the mgmt webapp
    49503 [http-bio-17665-exec-7] INFO  config.org.epics._..._.mgmt.WebappReady  - Received webAppReady from RETRIEVAL
    49684 [http-bio-17665-exec-8] INFO  config.org.epics._..._.mgmt.WebappReady  - Received webAppReady from ETL
    50812 [http-bio-17665-exec-9] INFO  config.org.epics._..._.mgmt.WebappReady  - Received webAppReady from ENGINE
    60566 [http-bio-17665-exec-8] INFO  config.org.epics._..._.mgmt.MgmtRuntimeState  - All components in
     this appliance have started up. We should be ready to start accepting UI requests
    
    
  15. Check for exceptions on startup using find /arch/tomcats -wholename '*/logs/*' -exec grep -l xception {} \; . Ideally, you should not see any.
  16. Check for other FATAL log messages on startup using find /arch/tomcats -wholename '*/logs/*' -exec grep -l FATAL {} \; . Ideally, you should not see any.
  17. For LCLS, we usually restart the entire cluster. That is, we stop the Tomcat processes on all the appliances, make sure they are indeed gone, and then start the processes on each appliance one by one; the per-appliance command sequence is summarized in the sketch after this list.
    1. One can also completely stop and restart the processes on one appliance before moving on to the next appliance.
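
The per-appliance deployment sequence above, condensed into a sketch (host, account and facility name are examples; substitute the appropriate ones):

$ ssh laci@facet-archapp
$ cd /afs/slac/g/lcls/package/ArchiverAppliance/tools/script
$ /etc/init.d/st.tomcat stop
$ ps -afde | grep jsvc        # wait until only the grep itself remains
$ ./deployWARfiles.bash facet
$ ps -afde | grep jsvc        # expect 8 jsvc processes, two per component
$ tail -f /arch/tomcats/mgmt/logs/arch.log
$ find /arch/tomcats -wholename '*/logs/*' -exec grep -l xception {} \;
$ find /arch/tomcats -wholename '*/logs/*' -exec grep -l FATAL {} \;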

Configuration backups

The MySQL config databases for the various facilities are backed up daily. Database backups are in /nfs/slac/g/acctest/tools/ArchiveAppliance/mysql for TestFac and in /nfs/slac/g/lcls/tools/ArchiveAppliance/mysql for LCLS and FACET. There is an automated script to validate these backups; it should be run periodically. To avoid accidents, please run it on a development machine that has the appropriate setup. We need:
  1. The MySQL server and client (you can install these using yum install mysql mysql-server)
  2. The Python MySQL connector (pip install --allow-all-external mysql-connector-python)
  3. In MySQL, create a schema called RESTORE_TEST using
    
    CREATE DATABASE RESTORE_TEST;
    GRANT ALL ON RESTORE_TEST.* TO 'mshankar' identified by 'slac123';
    
    
  4. A hardcoded folder, /scratch/Work/prodMysqlBackups. The scripts cd to this folder to minimize accidents.
To validate the backups, create softlinks to /afs/slac/g/lcls/package/ArchiverAppliance/tools/script/validateBackup.sh and /afs/slac/g/lcls/package/ArchiverAppliance/tools/script/validateBackup.py in /scratch/Work/prodMysqlBackups.
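
A minimal sketch of that one-time setup, assuming the hardcoded folder does not exist yet; validateBackup.sh is then run from there as in the transcript below:

$ mkdir -p /scratch/Work/prodMysqlBackups
$ cd /scratch/Work/prodMysqlBackups
$ ln -s /afs/slac/g/lcls/package/ArchiverAppliance/tools/script/validateBackup.sh .
$ ln -s /afs/slac/g/lcls/package/ArchiverAppliance/tools/script/validateBackup.py .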

$ ./validateBackup.sh 
Copying over backups from lcls-dev2. This will take some time
Done copying
Validating testfac
We have 29001 PVs for this appliance from the database
We have 29001 PVs for this appliance from the appliance runtime
Validating FACET
We have 30708 PVs for this appliance from the database
We have 30708 PVs for this appliance from the appliance runtime
Validating LCLS
We have 43790 PVs for this appliance from the database
We have 43790 PVs for this appliance from the appliance runtime
We have 52246 PVs for this appliance from the database
We have 52246 PVs for this appliance from the appliance runtime
We have 59740 PVs for this appliance from the database
We have 59740 PVs for this appliance from the appliance runtime
$

Note that validateBackup.sh is just a thin wrapper around validateBackup.py, which does the bulk of the validation. Validation consists of importing the backup into MySQL and comparing PV counts and typeinfos for a small set of PVs against the runtime. Because backups happen once a day, you may occasionally see mismatches, as people could have made changes to the runtime since the backup was made.
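
To spot-check a backup by hand, the same idea can be applied manually. A rough sketch, assuming the standard PVTypeInfo table in the config schema and a hypothetical dump file name:

$ mysql RESTORE_TEST < lcls_config_backup.sql                  # hypothetical dump file name
$ mysql RESTORE_TEST -e "SELECT COUNT(*) FROM PVTypeInfo;"     # compare with the PV count reported by the appliance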