Workbook for BaBar Offline Users - Quick Tour Trouble-Shooting

The Quick Tour was becoming overburdened with notes about what might go wrong and what to do about it. Those notes have been moved here. This page will be updated whenever we find other problems or "gotchas" in the workbook.


Logging into SLAC from a remote machine

If you logged into yakut (or, similarly, noric or tersk) using ssh -l <username>, and you have logged into yakut in the past, you may have received a response like:
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that the host key has just been changed.
In this case it should be sufficient to note the number of the machine you are currently logged into (yakut0x, where x is a number), log out, and then log in again with the command ssh -l <username>. The display should then work OK.

Shared libraries error

If you get an error message like:
bin/Linux24SL3_i386_gcc323/BetaMiniApp: error while loading shared libraries: cannot open shared object file: 
No such file or directory
it probably means that you have forgotten the "srtpath" command.

No BOOT file

If you get an error message like:
No boot file has been set, either explicitly or using OO_FD_BOOT
it means that you have forgotten to set up the data path with the cond18boot command.

"Could not find datafile" messages

If you get an error message like:

Could not find datafile "BetaPid/PidDRCLike.dat"
Could not find datafile "BetaPid/PidDRCLike.dat"
Could not find datafile "BetaPid/PidDRCLike.dat"
Could not find datafile "BetaPid/PidDRCLike.dat"
check that you are actually in workdir. (Running the job from other directories is a common error.)

You might occasionally (this should only happen once) get an error message about BetaMiniApp being unknown. If so, first make sure you are actually in workdir (a common error). If that is not the problem, type

gmake setup
in workdir and this will reset the workdir configurations correctly (the symbolic links in workdir sometimes get mangled if you do a gmake clean from within workdir rather than in the release directory). Then you should be able to run the BetaMiniApp executable without problems.
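Because running from the wrong directory is behind both of the errors above, a small guard in front of the job can catch it early. The following is only a sketch: it assumes a POSIX shell (sh or bash) and that the directory is literally named workdir. The function name in_workdir is made up for illustration and is not part of the BaBar software.

```shell
# Hypothetical helper (not a BaBar tool): succeeds only if the given
# directory (default: the current one) is named "workdir".
in_workdir() {
    dir="${1:-$PWD}"
    [ "$(basename "$dir")" = "workdir" ]
}

# Report where we are before starting the job.
if in_workdir; then
    echo "OK: running from workdir"
else
    echo "Not in workdir: cd to workdir before running the job" >&2
fi
```

If the check passes and BetaMiniApp is still unknown, gmake setup is the next thing to try.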

Federated Database Unavailable, waiting...

When you try to put in a collection with
   > mod talk KanEventInput
   KanEventInput> input add /store/SP/R18/001237/200309/18.6.0b/SP_001237_013238
if you see output of the form:
The Federated database [/afs/slac/g/babar-ro/objy/databases/boot/physics
/V7/ana/conditions/BaBar.BOOT] is currently unavailable - waiting...
and it is Monday 8:00 am to 4:00 pm or Thursday 4:00 pm to midnight (SLAC time), you have tried to run your job during the time set aside for making data skims and general database maintenance. Press CTRL-C to exit the job and try again later. Outside these times, there may be a problem with the databases: give it a while to be fixed, and check HyperNews to see whether anyone has reported a problem.
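The two maintenance windows can also be checked from the command line before starting a job. This is a minimal sketch, assuming a POSIX shell and assuming that "SLAC time" means US Pacific time (the America/Los_Angeles time zone). The function name in_maintenance_window is hypothetical.

```shell
# Hypothetical helper: given a weekday number (1=Monday ... 7=Sunday, as
# printed by `date +%u`) and an hour (0-23), succeed if that time falls in
# the windows quoted above: Monday 8:00-16:00 or Thursday 16:00-midnight.
in_maintenance_window() {
    day="$1"
    hour="$2"
    case "$day" in
        1) [ "$hour" -ge 8 ] && [ "$hour" -lt 16 ] ;;
        4) [ "$hour" -ge 16 ] ;;
        *) return 1 ;;
    esac
}

# Assumption: "SLAC time" is US Pacific time.
now_day=$(TZ=America/Los_Angeles date +%u)
now_hour=$(TZ=America/Los_Angeles date +%H)
now_hour=${now_hour#0}   # turn "08" into "8" so it is a plain decimal number

if in_maintenance_window "$now_day" "$now_hour"; then
    echo "Inside the maintenance window: try again later"
else
    echo "Outside the maintenance window: check HyperNews for database problems"
fi
```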

Problems Running PAW

If you try to start up a PAW session and get an error message such as:

   X connection to shire01:11.0 broken (explicit kill or server shutdown).
then you do have a problem: PAW has been unable to open a window on your desktop. Exit PAW, check that the X Window server on your desktop is running, and then try the pawX11 command again.

Another possible source of difficulty is confusion caused by running over ssh, which is required for enhanced security. In this case, instead of hitting <CR> at the workstation type prompt, try entering your workstation name or IP address:

   Workstation type (?=HELP) <CR>=1 :  1.my_workstation     (your workstation name)
   Workstation type (?=HELP) <CR>=1 :    (your workstation ip address) 

If the HIGZ window does appear, but the PAW session never returns to the "PAW >" prompt, exit (ctrl-C), and try again by typing the command:

   ana30/workdir> /cern/95a/bin/pawX11 

Finally, if the PAW session seems to start ok and you get a prompt, but no HIGZ window appears, try quitting the session, and at the start when you are asked for "workstation type", try entering "2".

If all of this fails, you have a major problem, and should consult someone who works on machines similar to yours, or your department system manager.

Where did my binary go?

There are two possible causes for an executable file to "go missing":
  1. you haven't used the file in a week
  2. you accidentally (or intentionally) issued a "gmake clean" command from within your workdir
In the first case the binary is indeed gone. The Quick Tour setup includes a command that puts all potentially large temporary files in the BaBar "scratch" area. Binaries and library files can easily be regenerated, and people tend to hoard files that they no longer use but want to keep around "just in case", so BaBar saves file space by automatically deleting files that have been in the scratch space for more than a week. For binaries, the solution is simply to re-compile and re-link.

In the second case, your binary is almost certainly still there. You issued the gmake clean command from workdir, rather than from the release directory where one would normally issue it. As a result, the symbolic link from your workdir to the area on the scratch disk where your binary is actually stored has been deleted. This is easily solved: simply issue a
gmake setup
command from your workdir (or gmake workdir.setup from your release directory).
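To tell the two cases apart, it can help to test the link and the binary separately. The sketch below assumes a POSIX shell. The function name diagnose_missing and both of its arguments are hypothetical: the actual name of the symbolic link in your workdir and the path of your executable depend on your release and are not given here.

```shell
# Hypothetical helper (not a BaBar tool).
#   $1: the workdir entry that should be a symbolic link into scratch
#   $2: the path of the executable as reached through that link
diagnose_missing() {
    link="$1"
    exe="$2"
    if [ ! -L "$link" ]; then
        # Case 2: the link itself was removed, e.g. by gmake clean in workdir.
        echo "link missing: run 'gmake setup' from workdir"
    elif [ ! -e "$exe" ]; then
        # Case 1: the link is there but the file behind it has been cleaned up.
        echo "binary missing: re-compile and re-link"
    else
        echo "binary present"
    fi
}
```

Call it from workdir with the link name and the executable path used in your own setup.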

Problems writing to NFS disks

When writing output with the -o option of the bsub command, you should never refer to NFS disks by their "automount" names, but always by the full NFS path.

Why did my job die?

Generally, when a job dies (e.g. the ntuples are not completely filled, or are not made at all), an error message with an exit code is written to the output, which you will presumably have directed to a log file so that you can check for just such a problem.

One of the most common exit codes is exit code 130. This usually means that the job has exceeded the CPU time allowed for the queue it is running in. The solution is to use a queue with a larger CPU limit, or to run your job on a smaller number of events.
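Checking a log file for the exit code can be scripted. The following is a rough sketch under stated assumptions: it assumes a POSIX shell and that the log contains a line of the form "Exited with exit code 130." (the exact wording in your log may differ). The function names job_exit_code and explain_exit_code are hypothetical, and only the meaning of code 130 described above is encoded.

```shell
# Hypothetical helper: print the first exit code found in a log file,
# assuming a line that contains "exit code <N>".
job_exit_code() {
    sed -n 's/.*[Ee]xit code \([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: turn an exit code into a hint. Only code 130 is
# interpreted here; everything else is referred to the exit codes webpage.
explain_exit_code() {
    case "$1" in
        130) echo "probably exceeded the queue CPU limit: use a longer queue or fewer events" ;;
        "")  echo "no exit code found in log" ;;
        *)   echo "exit code $1: see the exit codes webpage" ;;
    esac
}

# Usage (the log file name is an example only):
#   explain_exit_code "$(job_exit_code myjob.log)"
```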

More information about exit codes can be found on the exit codes webpage.

Author: Jenny Williams

Last updated: 13 February 2006
Last significant update: 3 June 2005