Routine checks
- All problems need to be investigated. The PR manual can help; otherwise,
search Opr-SOS or Padova-SOS for a solution to the problem.
- Always post a report of the problem to Opr-SOS or Padova-SOS.
If you know the solution, add it to your report.
This is a good way of keeping track of what is going on with the farms. It
will also help new PR managers sort out similar situations in the future.
- Use
oprruns -f%,prfarm --since 1
to see what was done on the ER farm since yesterday, and
oprruns -f%,prfarm -PCalibration --since 1
for the status of the PC farms.
- Monitor the PR Run Progression page, which shows any delay in run
processing at SLAC or Padova, or in the import/export of files between the
two sites.
- At SLAC:
- In the morning (after 9:30am SLAC time) check that the snapshot
was taken successfully:
cd /nfs/oprserv0x/u1/PCx/prod/workdir
ls -alrt *global200*
pick the latest file and check its end:
tail -40 <latest_global200*file>
if the output looks like
Thu Aug 5 08:42:57 2004-Snapshot succeeded
Thu Aug 5 08:42:57 2004-Resuming farm PC1
where the date and time correspond to this morning, then it is fine. Otherwise,
look more carefully for what could be wrong. If both PC1 and PC2 are running,
check the snapshot of each farm (/nfs/oprserv01/u1/PC1/prod/workdir, /nfs/oprserv02/u1/PC2/prod/workdir)
cd /nfs/oprserv0x/u1/PCx/prod/workdir
../bin/$BFARCH/OprCmd.pl -iUser,Farm -n oprserv0x farminfo PCx
Do this for each farm currently in use (PC1 and/or PC2).
Watch for
- Check the log file of the farm
cd /nfs/oprserv0x/u1/PCx/prod/workdir
ls -alrt *global-200*
pick the latest file and check its end:
tail -40 <latest_global200*file>
If the output looks suspicious, investigate further. If the date of the
last entry is old, check what is wrong.
- Watch out for emails from "Babar Program Manager" or for calls on
your pager.
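The morning snapshot check can be sketched as a small shell helper. This is only a sketch, not part of the PR tools: the function names newest_log, today_regex and snapshot_ok are hypothetical, and it assumes that each log line starts with a ctime-style timestamp followed by "-Snapshot succeeded", as in the sample output above.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names exist in the PR tools.

# newest_log FILE...: print the most recently modified of the given files.
newest_log() {
    ls -1t "$@" 2>/dev/null | head -n 1
}

# today_regex: print a regex prefix for today's date as it is assumed to
# appear in the log, e.g. "Thu Aug +5 [0-9:]+ 2004". The " +" tolerates a
# day number padded with one or two spaces.
today_regex() {
    day=$(LC_ALL=C date +%e | tr -d ' ')
    printf '%s +%s [0-9:]+ %s' "$(LC_ALL=C date '+%a %b')" "$day" "$(date +%Y)"
}

# snapshot_ok LOGFILE [DATE_REGEX]
# Succeeds if the last 40 lines of LOGFILE contain a "Snapshot succeeded"
# entry whose timestamp matches DATE_REGEX (default: today's date).
snapshot_ok() {
    regex=${2:-$(today_regex)}
    tail -n 40 "$1" | grep -Eq "^${regex}-Snapshot succeeded"
}
```

For example, from a farm's workdir:
snapshot_ok "$(newest_log *global200*)" && echo "snapshot OK"
If both PC1 and PC2 are running, repeat this in each workdir. A failure only means that no matching line was found; the log still has to be read by hand to find out why.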
- In Padova:
The checks are similar to those at SLAC, but include PostProcessing.
- Check that each farm is running (log in as
<yourself>@bbr-user02.pd.infn.it;
for most things you don't need to use the "common" account
bbrprmgr):
workx ;
../bin/$BFARCH/OprCmd.pl -iFarm,User -noprx farminfo ERx ;
ls -alrt *global-* ;
tail -40 <latest_*global-*> ;
- Pay attention to the FatalErrors; failure of the previous runs;
speed of run processing; number of nodes used.
- Watch out for emails from "Babar Program Manager". There are
currently no synchronised pager calls from the Padova farms.
- Check the status of the ER feeder:
workx
../bin/$BFARCH/OprCmd.pl -i Farm,User -nbbr-lock geterfeederstatus
It should show all ER farms in use, and the run queue should contain
recent IR2 runs which have passed the PC1 path and whose QA flag at
IR2 is marked as anything other than "Unusable".
Also, if you are doing reprocessing, it should show the latest runs to
be reprocessed.
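Two of the farm log checks (an old last entry, and fatal errors near the end of the log) can be roughly automated. The sketch below is hypothetical and rests on two assumptions: that the file's modification time is a fair proxy for the date of its last entry, and that fatal errors appear in the log as lines containing the word "fatal" in any case. The names log_is_stale and count_fatal are made up for this example, and find -mmin is a GNU/BSD extension rather than POSIX.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names exist in the PR tools.

# log_is_stale FILE MINUTES
# Succeeds if FILE was last modified more than MINUTES minutes ago.
# Assumption: the modification time tracks the last log entry.
log_is_stale() {
    [ -n "$(find "$1" -prune -mmin +"$2" 2>/dev/null)" ]
}

# count_fatal FILE
# Prints how many of the last 40 lines mention "fatal" (case-insensitive).
# Assumption: this wording matches how the farm logs report FatalErrors.
# "|| true" keeps the exit status zero when the count is 0.
count_fatal() {
    tail -n 40 "$1" | grep -ci 'fatal' || true
}
```

For example, after picking the latest log by hand as described above:
log_is_stale <latest_*global-*> 60 && echo "no log activity for over an hour"
count_fatal <latest_*global-*>
The 60-minute threshold is arbitrary; choose one that suits the expected speed of run processing.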
Last modified: Wed Apr 20 17:51:25 CEST 2005