################################################################
################################################################

DATA production:
----------------

Before the Kanga conversion with release 10.4.6f, Kanga streams were
converted from Objy stream by stream in separate jobs. With release
10.4.6f, all streams were produced and written out in one job for the
first time.

---- Login:

    ssh noric03
    ssh -2 -i ~/.ssh/id_rsa bbrprod@tersk
    klog colberg

  In order to perform production, the public key of the user needs to
  be added to ~bbrprod/.ssh/authorized_keys. Your private key file
  ~/.ssh/identity.bbrprod has to be password protected.
  (create public key:
      cd .ssh/.public/
      ssh-keygen -t rsa
   private key: .ssh/id_rsa
   public key:  .ssh/id_rsa.pub)

---- Steps for setting up a new production release

  (1) Make a new test release
  (2) Register the new release in the oracle DB with skimRel

- Step (1) Creation of a new production release, e.g. 10.4.6f:

    bbrprod > cd KanGA
    KanGA   > newrel -t 10.4.6f 10.4.6f
    KanGA   > cd 10.4.6f
    10.4.6f > srtpath
    10.4.6f > gmake installdirs
    10.4.6f > addpkg workdir
    10.4.6f > gmake workdir.setup
    10.4.6f > less $BFROOT/repo/SkimTools/History,v
    10.4.6f > addpkg SkimTools V00-02-06
    10.4.6f > gmake SkimTools.bin
    10.4.6f > cd workdir/

- Step (2) Register release 10.4.6f for Kanga production (the capital K)
  in the oracle DB:

    skimRel -c K10.4.6f 40104604

  The number is used to determine the order of the releases (e.g. in
  skimData). The meaning of 40104604:

    40    is the prefix (corresponds to the "K")
    1046  is release 10.4.6
    04    is some unique numbering (versioning) - do a "skimRel" to
          list all previous numbers

  Running skimRel is required only once for a release.

---- Steps for the production in a release ------

  (1) Configure the skim jobs for one (the) stream(s) with skimJob;
      all jobs of one stream belong to a stream name
  (2) Choose the appropriate federation of the objy data
  (3) Request a number of runs to be converted to Kanga !
  (4) Submit the jobs for a (the) stream(s) with skimSubmit
  (5) Check the status of the jobs with skimReq
  (6) Check finished jobs with skimCheck, which also sets their status
      to ok
  (7) Update web page

- Step (1) Configure the skim jobs

    skimJob -l /nfs/farm/babar/physics/commonSkims/log
            -I @/groups/skims/AllEvents/
            -i KanGA/AllEvents
            -O /groups/skims/isPhysicsEvents/
            -m colberg
            -s "bsub -qbfpsol -GbabarProd -P -J -o "
            -w RELEASE/bin/SunOS58/jobWrapper
            -L JobCheckLib
            -P /afs/slac.stanford.edu/u/ea/bbrprod/KanGA/10.4.6f/workdir
            --create Skim1046fKanga K10.4.6f
            "PARENT/bin/SunOS58/SkimTagApp PARENT/FilterTools/tagSkim.tcl"
            Stream01Kanga,Stream02Kanga,Stream03Kanga,
            Stream04Kanga,Stream06Kanga,Stream07Kanga,Stream11Kanga,Stream12Kanga,
            Stream13Kanga,Stream14Kanga,Stream15Kanga,Stream16Kanga,Stream17Kanga,
            Stream18Kanga,Stream19Kanga

  For a description of all options, do a "skimJob -h".

  This configures the submit command, log directory etc. for the
  streams under a common name ("Skim1046fKanga" here). "convert.job" is
  the script that is called to do the actual work (see below). All jobs
  submitted under this configuration are listed under this name when
  using skimReq. With SkimTools it is not possible to have more than
  one configuration if they differ only in the queue.

  To check the configuration, do:

    skimJob -d Skim1046fKanga

- Step (2) Choose the objy federation according to the runs that need
  to be converted: if the bridge federation works, it's just
  "physboot"; otherwise you need to specify e.g.
  "physboot6"

- If step 1) and step 2) have been done already, it is sufficient to
  type "initKanga 10.4.6f" after the login and you will be in the same
  state as after step 1) and step 2)

- Step (3) Check disks:

    df -b `cat ~/KanGA/allnfs2.lis`

  If the disks are badly balanced, you should run the staticBalancer.
  First check if a staticBalancer is running:

    bjobs -q long -u bbrprod -w
    ls -lrt log/KangaBalancerLogs/

  Then, submit the staticBalancer job:

    bsub -o /nfs/farm/babar/physics/commonSkims/log/KangaBalancerLogs/staticbalancer%J.log -q long -N -u colberg@slac.stanford.edu staticBalancer -i 10 -s 0002

  Request a number of runs to be converted to Kanga !

  You could use the official goodruns list:

    ln -s /afs/slac.stanford.edu/g/babar/www/Physics/BaBarData/GoodRuns/ GoodRuns

  When you have already converted parts of a dataset and you want to
  find out which ones are still missing, you do:

    ~colberg/kanga-stuff/Compare_streams.pl -a ../../10.4.6c/workdir/GoodRuns/good_all.txt -s all -m Skim1046fKanga

  The run list can be obtained by e.g.:

    skimDataRel -s AllEvents --format "RUN BLOCK EVENTS_IN INREL OUTPUT" -S ">=Skim1044" -g 9000-10000

  The above ensures the right format:

    "13965 1 140560 P10.2.3b /groups/AllEvents/0001/3900/P10.2.3bV01fb/00013965/cb001/allevents"

  which means

    "runNumber block_number n_input_events recoRelease inputCollection"

  Then, randomize the list:

    mv AllEvents.dat AllEvents_9000_10000.dat
    ~colberg/kanga-stuff/randomizeLists.pl -o rand_AllEvents_9000_10000.dat AllEvents_9000_10000.dat

  You now load those runs:

    skimReq -R colberg --import Skim1046fKanga rand_AllEvents_9000_10000.dat

  This tells the oracle DB that the runs listed in the imported file
  (rand_AllEvents_9000_10000.dat here) will be converted to Kanga.

- Step (4) Submit the jobs with

    skimSubmit -j 10 -s 1 Skim1046fKanga

  This submits 10 jobs every 3 seconds for processing in the queue
  defined for the stream.

- Step (5) To see what the status of the submitted jobs is, do

    skimReq

  which lists all (ever) requested job configurations and their job
  states.
  For more details do e.g.

    skimReq -l Skim1046fKanga -L 1000

- Step (6) Once the jobs are done, you can check the result with:

    skimCheck --fast Skim1046fKanga

  This checks the log files, root files etc. If everything seems fine,
  the job status is set to "ok". The "--fast" option checks and sets
  the status immediately, but only for the "ok" jobs. The failed jobs
  are not handled. Without "--fast" it checks everything and then asks
  how to set the status for every single job (which is quite
  time-consuming).

  To check if the files are really there, you can do

    skimReq -l Skim1046fKanga -L 1000 -s ok
    ls -la $BFROOT/kanga/EventStore/groups/skims/isPhysicsEvents/0000/9900/P10.2.3gV08fb/00009966/cb001/Skim1044cObjy/P0000/Skim1046fKanga/P0000/Merged/

  To reset the failed runs, you do the following.
  To get the request ids:

    skimReq -l Skim1046fKanga -L 1000 -s done

  or for a certain run:

    skimReq -l Skim1046fKanga -L 1000 -s done | grep 22753

  Then:

    undoOk -r 1059825-1059828 -n failed
    skimClean -j Skim1046fKanga --confirm
    undoOk -r 1059825-1059828 -n pending

  These four steps can be done at once for all done runs by:

    ~colberg/kanga-stuff/resetDoneRuns.pl -h
    ~colberg/kanga-stuff/resetDoneRuns.pl Skim1046fKanga

- Step (7)

  Status:
  Edit table headers on web page
  http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Kanga/Production/reskimming1046/status.shtml

  Create rows of overall table:

    ~/KanGA/10.4.6f/workdir > perl ~colberg/kanga-stuff/GetStatus.pl -c Skim1046fKanga --html all all
    ~/KanGA/10.4.6f/workdir > mv status_Skim1046fKanga_all.html $BFROOT/www/Computing/Offline/Kanga/Production/reskimming1046/
    ~/KanGA/10.4.6f/workdir > mv status_Skim1046fKanga_lastrun.html $BFROOT/www/Computing/Offline/Kanga/Production/reskimming1046/

  Create rows of years' tables:

    ~/KanGA/10.4.6f/workdir > perl ~colberg/kanga-stuff/GetStatus.pl -c Skim1046fKanga --html 2001 all
    ~/KanGA/10.4.6f/workdir > mv status_Skim1046fKanga_2001.html $BFROOT/www/Computing/Offline/Kanga/Production/reskimming1046/

  Sizes: You
  want to show the file sizes of the data that is actually already
  available at RAL, therefore you do the next step there:

    [csfa] ~ > perl ~colberg/kanga-stuff/skimSize -o '-S ">=K10.4.6c"' --html Stream01Kanga
    [csfa] ~ > scp size_table_Stream01Kanga.html colberg@noric03.slac.stanford.edu:/afs/slac.stanford.edu/g/babar/www/Computing/Offline/Kanga/Production/rel1046

  Monitoring Plots:

    skimDataRel -s Stream01Kanga --format "RUN BLOCK EVENTS_IN INREL OUTPUT" -S ">=Skim1044" --goodruns 0-99000
    less Stream01Kanga.dat
    perl ~colberg/kanga-stuff/CreateMonitoringPlots.pl -l Stream01Kanga.dat Stream01

  The plots are automatically copied into the directory
  /afs/slac.stanford.edu/u/ea/bbrprod/KanGA/DataProdPlots/www/Stream01/
  where they can be viewed from
  http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Kanga/Production/index.shtml
  (only Stream01 needs to be done as a representative since all streams
  are converted at once)

---- older tools (in ~mueller/kanga-stuff)

  prepareList.pl        ... make list for request from good run files
  randomizeLists.pl     ... randomize the above list
  resetRuns.pl          ... for one or more runs cleanup in one or all
                            streams and reset to pending
  getReqId.pl           ... get the request IDs of runs from a file
                            "runnumber prod_ver"
  Compare.pl            ... compare a good run file with what is in the DB
  GetStatus.pl          ... get the numbers of already converted jobs
  scan_log.pl           ... scan logfiles to get job statistics for
                            monitoring plots
  skimSize              ... determine the size of a data set (selected
                            with skimData options)
  ZipLogFiles.pl        ... compress the conversion log file of runs
                            marked "ok"
  SkimTools/checkKanga  ... checks if the pointers in $BFROOT/kanga are
                            ok and if there are duplicate dirs on the
                            nfs kanga disks

---- remarks

  In case of crashed jobs, rerun

    workdir > skimCheck --fast Kanga1010c${stream}

  to get the (hopefully short) list of problem jobs. The --fast option
  spares you from having to look at the log files during the loop over
  all jobs and does not change the job status.
  Then look at the log files in the directory
  /nfs/farm/babar/physics/commonSkims/log
  (a symlink to workdir is useful).

---- FAQ ----

How to reprocess a run that was already done and marked ok
(e.g. run 22753) ?

  1. Get the run's request ID
       skimReq -l Kanga1041aLStream17 -L 5000 -s ok | grep 22753
  2. Set the status bit of 22753 to failed
       undoOk -r ReqID-ReqID -n failed
  3. Clean up the log and kanga file
       skimClean -j Kanga1041aLStream1 --confirm
  4. Set the status of the run to pending
       undoOk -r ReqID-ReqID -n pending

What is the current disk occupancy of the Kanga nfs volumes ?

    staticBalancer -l

How to balance the load of the kanga disks ?

  1. cd                            [from a test release]
  2. addpkg SkimTools V00-01-41    [get new version]
  3. gmake SkimTools.bin           [make bin/staticBalancer]
  4. ssh shire -l bbrprod          [need to run as bbrprod]
  5. /bin//staticBalancer -l       [make sure version >= 1.14]
  6. bjobs | grep long             [check existing staticBalancer]
  7. Stop here if there is already a job!!
  8. bsub -o /nfs/farm/babar/physics/commonSkims/log/KangaBalancerLogs/staticbalancer%J.log -q long -N -u mueller@slac.stanford.edu staticBalancer -i 10 -s 0002

     This will write a log file with the JobID (%J) to the directory
     /nfs/farm/babar/physics/commonSkims/log/KangaBalancerLogs/

     Options:
       (1) -c30:00 is 30 hours
       (2) -N: allow -u option
       (3) -u: send email upon completion
       (4) -i: iteration, how many events to balance
       (5) -s: the serial number, 0000, 0001, or 0002.
     See man page staticBalancer.1 for details.

How to deal with crashed jobs, which seem to be recoverable easily?

    workdir > skimCheck Kanga1010c${stream}
    # you are asked to change status of the crashed jobs, change it to failed!
  delete log file, root file and subdirs for runs marked as failed:

    workdir > skimClean -v -j Kanga1010c${stream} --confirm

  get REQ ID for failed runs:

    workdir > skimReq -l Kanga1010c${stream} -s failed -L 100 > failedRequests.dat

  edit failedRequests.dat, so that it contains one REQ ID per line only

  change status of jobs from failed to pending in order to be able to
  resubmit them:

    workdir > undoOk -f failedRequests.dat -n pending

  resubmit jobs:

    workdir > skimSubmit -j 10 -s 5 Kanga1010c${stream}
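
  The manual edit of failedRequests.dat could probably be scripted. The
  following is only a sketch for a Bourne-compatible shell, and it rests
  on an assumption that these notes do not confirm: that the REQ ID is
  the first whitespace-separated field of each data line in the skimReq
  listing and consists of digits only. The function name
  extract_req_ids is made up for this illustration. Check the result
  against a real listing before feeding it to undoOk.

```shell
#!/bin/sh
# Sketch only: reduce a saved skimReq listing to one REQ ID per line.
# ASSUMPTION (not confirmed by the notes above): the REQ ID is the first
# whitespace-separated field of each data line and is purely numeric.
# extract_req_ids is a hypothetical helper, not part of SkimTools.
extract_req_ids() {
    # Skip empty lines and any line whose first field is not a number
    # (for example headers or separators), then print the first field.
    awk 'NF > 0 && $1 ~ /^[0-9]+$/ { print $1 }' "$1"
}
```

  After defining the function, something like
  "extract_req_ids failedRequests.dat > failedRequests.ids" would write
  the cleaned list to a second file (the name failedRequests.ids is
  likewise made up), which could then be inspected by eye and passed to
  undoOk -f in place of the hand-edited file.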