Workbook for BaBar Offline Users - Quick Tour Trouble-Shooting
The Quicktour was becoming overburdened with notes about what might
go wrong and what to do when something did. Those notes have been
moved here. This page will be updated whenever we find other
problems or "gotchas" in the workbook.
If you logged into yakut using ssh yakut.slac.stanford.edu -l
<username> (or noric or tersk, similarly) and you have
logged into yakut in the past, you may have got a response like
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the host key has just been changed.
...
In this case it should be sufficient to note the name of
the machine you are currently logged into (yakut0x, where
x is a number), log out, and then log in again with the command
ssh yakut0x.slac.stanford.edu -l <username>. The
display should then work.
Return to Quicktour
If you get an error message like:
bin/Linux24SL3_i386_gcc323/BetaMiniApp: error while loading shared libraries:
libCore_pkgid_4.04-02.so: cannot open shared object file:
No such file or directory
it probably means that you have forgotten to run the "srtpath" command.
If you get an error message like:
No boot file has been set, either explicitly or using OO_FD_BOOT
it means that you have forgotten to set up the data path with the
cond18boot command.
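The second error can be caught before the job starts. The sketch below only tests whether OO_FD_BOOT is non-empty. The function name check_boot_file is hypothetical, and the sketch assumes that cond18boot sets OO_FD_BOOT in the current shell, which the error message suggests but this page does not state.

```shell
# Hypothetical helper: succeed only if OO_FD_BOOT is set and non-empty.
# Assumption: cond18boot sets OO_FD_BOOT in the current shell.
check_boot_file() {
    if [ -z "${OO_FD_BOOT:-}" ]; then
        echo "OO_FD_BOOT is not set; run cond18boot first" >&2
        return 1
    fi
    echo "boot file: $OO_FD_BOOT"
}

# Usage: report the state without aborting the shell.
if ! check_boot_file; then
    echo "fix the data path before running the job" >&2
fi
```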
"Could not find datafile" messages
If you get an error message like:
Could not find datafile "BetaPid/PidDRCLike.dat"
Path was ".:RELEASE:ONLINEPARENT:PARENT:/afs/slac.stanford.edu/g/babar"
Could not find datafile "BetaPid/PidDRCLike.dat"
Path was ".:RELEASE:ONLINEPARENT:PARENT:/afs/slac.stanford.edu/g/babar"
Could not find datafile "BetaPid/PidDRCLike.dat"
Path was ".:RELEASE:ONLINEPARENT:PARENT:/afs/slac.stanford.edu/g/babar"
Could not find datafile "BetaPid/PidDRCLike.dat"
check that you are actually in workdir. (Running the job from other
directories is a common error.)
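Since running from the wrong directory is so common, a one-line guard may help. This is a minimal sketch that assumes your working directory is literally named "workdir"; the function name in_workdir is hypothetical.

```shell
# Hypothetical helper: succeed only if the given directory
# (default: the current directory) is named "workdir".
in_workdir() {
    dir=${1:-$PWD}
    [ "$(basename "$dir")" = "workdir" ]
}

# Usage: warn before starting a job from the wrong place.
if in_workdir; then
    echo "in workdir"
else
    echo "not in workdir: $PWD" >&2
fi
```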
You might occasionally (it should happen only once) get an error
message about BetaMiniApp being unknown. If so, first make sure
you are actually in workdir (a common error). If that is not the
problem, type
gmake setup
in workdir. This resets the workdir configuration
(the symbolic links in workdir sometimes get mangled if you do a
gmake clean from within workdir rather than in the
release directory). You should then be able to run the BetaMiniApp
executable without problems.
Return to Quicktour
When you try to put in a collection with
> mod talk KanEventInput
KanEventInput> input add /store/SP/R18/001237/200309/18.6.0b/SP_001237_013238
if you see output of the form:
The Federated database [/afs/slac/g/babar-ro/objy/databases/boot/physics
/V7/ana/conditions/BaBar.BOOT] is currently unavailable - waiting...
and it's Monday 8:00 am to 4:00 pm or Thursday 4:00 pm to midnight
(SLAC time), you have tried to run your job during the time set aside
for making data skims and general database maintenance. Press
CTRL-C to exit the job and try again later.
If it isn't during these times, there may be a problem with the
databases. Give it a while to be fixed, and look at HyperNews to see if
anyone is reporting a problem there.
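The two maintenance windows can be checked before submitting. The sketch below encodes only the times quoted above; the function name in_maintenance_window is hypothetical, and it assumes that the America/Los_Angeles time zone is what is meant by SLAC time.

```shell
# Hypothetical helper: succeed if the given day and hour fall inside
# a maintenance window. day: 1-7 with Monday = 1 (as from `date +%u`);
# hour: 0-23 (as from `date +%H`, leading zero allowed).
in_maintenance_window() {
    day=$1
    hour=${2#0}          # strip one leading zero ("08" -> "8")
    hour=${hour:-0}      # "0" would otherwise become empty
    case $day in
        1) [ "$hour" -ge 8 ] && [ "$hour" -lt 16 ] ;;  # Monday 8:00-16:00
        4) [ "$hour" -ge 16 ] ;;                       # Thursday 16:00-24:00
        *) return 1 ;;
    esac
}

# Usage: check the current time (assumed zone for SLAC time).
day=$(TZ=America/Los_Angeles date +%u)
hour=$(TZ=America/Los_Angeles date +%H)
if in_maintenance_window "$day" "$hour"; then
    echo "database maintenance window: try again later"
fi
```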
Return to Quicktour
If you try to start up a PAW session and get an error message such as:
X connection to shire01:11.0 broken (explicit kill or server shutdown).
then PAW has been unable to open a window on your desktop.
Exit PAW, check that the X Window software on your desktop is running, and
then try the pawX11 command again.
Another possible source of difficulty is running over ssh,
which is required for enhanced security. In this case,
instead of hitting <CR> at the workstation type prompt, try:
Workstation type (?=HELP) <CR>=1 : 1.my_workstation (your workstation name)
or
Workstation type (?=HELP) <CR>=1 : 1.aaa.bbb.ccc.ddd (your workstation ip address)
If the HIGZ window does appear, but the PAW session never returns to
the "PAW >" prompt, exit (ctrl-C), and try again by
typing the command:
ana30/workdir> /cern/95a/bin/pawX11
Finally, if the PAW session seems to start ok and you get a prompt,
but no HIGZ window appears, try quitting the session, and at the start
when you are asked for "workstation type", try entering
"2".
If all of this fails, you have a major problem and should consult
someone who works on machines similar to yours, or your department
system manager.
Return to Quicktour
There are two possible causes for an executable file to "go
missing":
- you haven't used the file in a week
- you accidentally (or intentionally) issued a "gmake
clean" command from within your workdir
In the first case the binary is indeed gone. The Quicktour
setup includes a command to put all potentially large temporary files
in the BaBar "scratch" area. Files like binaries and
library files can easily be regenerated, and people tend to
hoard files they will not actually use again "just in case".
To save file space, BaBar automatically cleans out the scratch
space by deleting files that
have been there for more than a week. In the case of binaries, the
solution is simply to re-compile and re-link.
In the second case, your binary is almost certainly still there. What
has happened is that you issued the gmake clean command
from the workdir, rather than from the release directory where one
would normally issue the command. The result is that a symbolic link
from your workdir to the area on the scratch disk where your binary is
actually stored will have been deleted. This is easily solved - simply
issue a
gmake setup
command from your workdir (or gmake workdir.setup from
your release directory).
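To tell whether a workdir link has been removed or merely points at something that no longer exists, a small test may help. This is a sketch only: the function name link_state is hypothetical, and which workdir entry to test (for example bin) is an assumption based on the executable path shown in the error messages on this page.

```shell
# Hypothetical helper: classify a path that is expected to be a
# symbolic link into the scratch area.
link_state() {
    if [ -L "$1" ] && [ ! -e "$1" ]; then
        echo "dangling"   # the link exists but its target is gone
    elif [ ! -e "$1" ]; then
        echo "missing"    # the link itself is gone
    else
        echo "ok"
    fi
}
```

For example, link_state bin run from workdir printing "missing" would be consistent with the gmake clean case, where gmake setup restores the link.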
Return to Quicktour
When writing output with the -o option of the
bsub command, you should never refer to NFS disks by their
"automount" names, namely:
/a/...
but by the full NFS path:
/nfs/...
as the former will only work if there has previously been a job
submitted to the same batch machine that mounted the disk via the NFS
path and the automounter hasn't unmounted it yet.
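A submit script can refuse automount names up front. The sketch below only tests the /a/ prefix described above and does not try to translate the path, since the mapping to the /nfs/ name is not given here. The function name uses_automount_path and the example log path are hypothetical.

```shell
# Hypothetical helper: succeed if the path uses an automount name (/a/...).
uses_automount_path() {
    case $1 in
        /a/*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Usage: check the log path before passing it to bsub -o.
out=/nfs/example/logs/job.log   # hypothetical path
if uses_automount_path "$out"; then
    echo "use the full /nfs/ path instead of $out" >&2
fi
```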
Return to Quicktour
Generally when a job dies (e.g. the ntuples are not completely filled,
or are not even made), an error message with an exit code is written
to the output (which you will presumably have written to a log file
so that you can check for just such a problem).
One of the most common exit codes is exit code 130. This usually means
that the job has exceeded the CPU time allowed for the particular queue it is
running in. The solution here is to use a queue with a larger CPU limit, or
run your job on a smaller number of events.
More information about exit codes can be found on the exit codes webpage.
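If you check many logs, a small lookup keeps the advice at hand. This sketch covers only the code discussed on this page plus the usual convention that 0 means success; the function name explain_exit_code is hypothetical, and how the exit code appears in your log file is not assumed.

```shell
# Hypothetical helper: print a hint for a batch job exit code.
explain_exit_code() {
    case $1 in
        0)   echo "job finished normally" ;;
        130) echo "probably exceeded the queue CPU limit" ;;
        *)   echo "see the exit codes webpage" ;;
    esac
}

explain_exit_code 130
```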
Return to Quicktour
Author:
Jenny Williams
Last updated: 13 February 2006
Last significant update: 3 June 2005