2 Apr 2001: Except for mirroring the Kanga conditions
data ($BFROOT/kanga/CondDB/) and small or personal
exports, the recommended method for managing exports of Kanga event data is now
with the skimData database.
This method (syncslac/rsync) still works for event data,
though a full scan of the Kanga directory
trees can take many hours.
Exporting KanGA (née NOTMA)
files from SLAC should be much
simpler than exporting Objectivity files.
Each KanGA file contains a single run's data and is under about 100 MB. The
files are logically arranged in a simple (if deep!) directory tree (though they
may be physically on different disks, referenced by Unix soft links).
A general purpose file transfer
tool (ftp, scp, or what have you) can be used for
network copies.
For a few files this is probably the best course. For thousands of files
it is too cumbersome, particularly when one needs to keep track of
which files have already been exported.
For this reason, I suggest that the freely available (GPL-licensed) rsync
mirroring package be used for network transfers of KanGA files.
rsync can copy directory trees over the network
(including or excluding certain files if need be), checking for files
that have already been copied.
If the connection is broken, it can easily pick up where it left off.
During the transfer, files are hidden so
that users at the remote site will not be confused by partially copied
files. Although we don't expect to change already-written KanGA files
(except to correct any screwups), rsync will check for modifications
(efficiently recognising identical files, even if the modification date
has inadvertently been changed). The transfers can be compressed, though
this probably won't help us much as the ROOT files are already compressed.
Since rsync can use ssh, we needn't worry about getting
through the SLAC firewall (a problem with ftp). Using ssh identity files
removes the requirement for a login password, so the whole update could
be performed in batch or as a cron job.
rsync has already been very successfully used for more than a year to
provide a mirror of the BaBar web and CVS files at RAL
(updated automatically every night).
rsync needs to be available on both local and remote machines.
rsync is installed at SLAC at
/afs/slac.stanford.edu/public/software/rsync/bin/rsync
(This provides versions for Solaris, OSF, Linux, (AIX), and HP; these
are accessible explicitly as /afs/slac.stanford.edu/public/software/rsync/2.4.3-tja4/*/bin/rsync .)
If AFS is available at your site, these binaries can of course be executed
directly or copied to a local disk. Alternatively the rsync binaries (with man
files) for your architecture can be downloaded from the appropriate one of:
rsync-2.4.3-tja4.Solaris26.sparc.tar.gz
rsync-2.4.3-tja4.OSF1V4.alpha.tar.gz
rsync-2.4.3-tja4.Linux22.i386.tar.gz
rsync-2.4.3-tja4.HPUX1020.tar.gz
rsync-2.4.3-tja1.AIX42.tar.gz (only version 2.4.3-tja1)
(or copied from /afs/slac.stanford.edu/public/software/rsync/dist/).
Please contact me if you need some
different build (eg. for another OS version). The standard distribution
(rsync 2.4.3) already includes our fixes to rsync 2.3.2 for large
file support (on Solaris at least). This is not currently required for
KanGA files, but it makes rsync a useful general tool for other transfers as well.
The SLAC version has additional patches (not
yet in the official rsync distribution) to correctly handle soft links
(--copy-unsafe-links bugs),
to show better statistics, and a fix to the
return code. See the top of the patch file for
details.
I have created a wrapper script to specify some useful defaults
for KanGA transfers. Copy
/afs/slac.stanford.edu/public/software/package/rsync/syncslac
to your machine (or execute directly from AFS if you prefer).
Make sure that rsync is in your PATH
(if this is not convenient, then modify the $rsync variable
at the top of syncslac).
You can check which versions are in your PATH with
syncslac --version
which, with the latest versions, should show
syncslac V2.5 - Copies files from SLAC using rsync
rsync version 2.4.3-tja4 protocol version 25
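For scripted checks (eg. before a batch transfer), the rsync version can be picked out of that output. The following is only a minimal sketch: the function name rsync_version_from is made up for illustration, and it assumes the output format shown above.

```shell
# Hypothetical helper (not part of syncslac): read `syncslac --version`
# style output on stdin and print the rsync version string, if present.
rsync_version_from() {
    sed -n 's/^rsync version \([^ ]*\).*/\1/p'
}
```

It would be used as syncslac --version | rsync_version_from, and the result compared with the version you expect (2.4.3-tja4 for the current SLAC build).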
For the record, you also need an ssh client and Perl installed
on your machine. I have discovered that Perl 5.005 is required
and syncslac does not work with Perl 5.004.
I may have a fix, so if you want to get it running in Perl 5.001-5.004,
please let me know.
Run syncslac (or rsync) with no options for a
summary of the command usage.
syncslac specifies as default (all these can be overridden): transfer via ssh
(with the faster blowfish cipher), remote
host tersk.slac.stanford.edu (this may change),
location of rsync at SLAC,
extra statistics, directory tree copy, correct handling of soft links,
and preservation of file dates. The rsync command is printed out so
you can see exactly what it's doing.
So, to copy all the KanGA data and the KanGA conditions files to a
local directory /localdisk/kanga, you could say
syncslac /afs/slac.stanford.edu/g/babar/kanga/EventStore/groups/ \
/localdisk/kanga/EventStore/groups/
syncslac /afs/slac.stanford.edu/g/babar/kanga/CondDB/ \
/localdisk/kanga/CondDB/
Note that the trailing slash (/) on the first parameter is significant
(see the rsync
man page for the reason). To perform an update of the files (or to
pick up again after a network outage), simply rerun the command.
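To illustrate the trailing-slash rule: rsync copies the contents of a source directory given with a trailing slash, but creates the directory itself under the destination when the slash is omitted. The helper below is hypothetical (it is not part of rsync or syncslac) and transfers nothing. It only prints where a file would land, assuming syncslac passes the paths to rsync unchanged.

```shell
# dest_for SRC DST REL
# Print the local path where the file REL (relative to SRC) would end up.
dest_for() {
    src=$1
    dst=${2%/}          # drop one trailing slash from the destination, if any
    rel=$3
    case $src in
        */) printf '%s/%s\n' "$dst" "$rel" ;;                        # contents of SRC
        *)  printf '%s/%s/%s\n' "$dst" "$(basename "$src")" "$rel" ;; # SRC itself
    esac
}
```

With the source written as .../EventStore/groups/ the files land directly in /localdisk/kanga/EventStore/groups/. Without the slash they would land one level deeper, in /localdisk/kanga/EventStore/groups/groups/.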
Currently (6 Jun 2000) $BFROOT/kanga/EventStore/groups/ contains
41400 files taking 645 GB.
You may want to transfer a subtree first, either because the full tree is
too much data or to get an idea of how long the transfer will take. If you
put the subtree in the right place locally and later transfer the entire
tree, rsync won't have to copy those files again.
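As a rough guide to the scale involved, the figures above can be turned into an average file size and a lower bound on the transfer time. This is a back-of-envelope sketch only: the function names are made up, 1 GB is taken as 1000 MB, the 10 Mbit/s link speed in the usage note is an assumed figure and not a measurement, and protocol and ssh overheads are ignored.

```shell
# average_file_mb TOTAL_GB NFILES: mean file size in MB
average_file_mb() {
    awk -v gb="$1" -v n="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / n }'
}

# estimate_hours TOTAL_GB MBIT_PER_SEC: ideal transfer time in hours
# (1 GB = 8000 Mbit; no allowance for overheads or retries)
estimate_hours() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", gb * 8000 / mbps / 3600 }'
}
```

For the 6 Jun 2000 figures, average_file_mb 645 41400 gives 15.6 (MB per file), and estimate_hours 645 10 gives 143.3, ie. about six days at a sustained 10 Mbit/s. Timing a small subtree will give a more realistic rate for your own connection.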
Hints
- The -n option does a "dry-run", just listing the files that would be copied.
- You can specify --delete to also remove local files that are no longer at
  SLAC. This should be checked carefully (perhaps initially with the -n
  dry-run option) if updating an existing large local directory tree, because
  if the wrong SLAC directory specification is given, then the entire local
  tree will probably be removed.
- If your SLAC username is different from your local username, you can change
  the default by specifying, eg.,
  user@/afs/slac.stanford.edu/g/babar/kanga/EventStore/.
- You can specify an ssh identity file (with syncslac -i file) to allow
  automatic logon (no password).
- If you have an unreliable connection to SLAC, you can specify, eg.,
  --retries=2. This causes syncslac to retry the rsync command (at most
  twice, with, by default, a 15 minute delay between each try) if it fails.
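The retry behaviour described in the last hint can be sketched as a small shell function. This is not syncslac's actual implementation (syncslac is a Perl script). It is a hypothetical stand-in, named retry_cmd here, that shows the same idea and could wrap a plain rsync command.

```shell
# retry_cmd RETRIES DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND; if it fails, retry up to RETRIES more times, sleeping
# DELAY_SECONDS between tries. Returns 0 on success, otherwise the
# exit status of the last failed try.
retry_cmd() {
    retries=$1
    delay=$2
    shift 2
    attempt=0
    while :; do
        "$@" && return 0
        status=$?
        [ "$attempt" -ge "$retries" ] && return "$status"
        attempt=$((attempt + 1))
        echo "try $attempt failed (status $status); retrying in ${delay}s" >&2
        sleep "$delay"
    done
}
```

For example, retry_cmd 2 900 followed by the transfer command makes one initial try plus at most two retries, 15 minutes (900 seconds) apart, which matches the description of --retries=2 above.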
Many sites may not want to mirror the
entire $BFROOT/kanga/EventStore/groups/ directory tree. It is
easy to select some subsamples by using a subtree,
eg. $BFROOT/kanga/EventStore/groups/SP/ for Simulation
Production. However the files for the different skims are distributed
throughout the $BFROOT/kanga/EventStore/groups/skims/ tree with
the stream name identified by the filename in a directory named after the
Analysis Working Group (AWG).
There are several ways to select these files. To
exclude the AWG directories
you aren't interested in,
create an exclusion file, eg. exclude.lis, listing the
AWG directories and/or files you don't want copied, eg.
TauQED/
ClHBD/
PID/BPCElectronKanga-micro.root
excludes the Tau/QED and Charmless Hadronic B AWG
skims and the Particle ID AWG's BPCElectronKanga stream.
Then specify this file with the --exclude-from=exclude.lis option
on the syncslac (or rsync) command line.
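Put together, the exclusion example might look like the sketch below. The helper name sync_skims_excluding and the local destination path are assumptions for illustration, and the SLAC path assumes $BFROOT is /afs/slac.stanford.edu/g/babar, as in the earlier examples. The transfer is wrapped in a function so that nothing is copied until you call it.

```shell
# Write the exclusion list from the example above to exclude.lis.
cat > exclude.lis <<'EOF'
TauQED/
ClHBD/
PID/BPCElectronKanga-micro.root
EOF

# Hypothetical wrapper: mirror the skims tree, skipping the listed
# AWG directories and files. Paths are illustrative assumptions.
sync_skims_excluding() {
    syncslac --exclude-from=exclude.lis \
        /afs/slac.stanford.edu/g/babar/kanga/EventStore/groups/skims/ \
        /localdisk/kanga/EventStore/groups/skims/
}
```

Adding -n to the syncslac line lists what would be copied, which is a cheap way to confirm that the exclusions behave as intended before starting a long transfer.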
It's slightly more complicated to include
just the AWG directories you are interested in.
Eg., in a file include.lis use
*/
**/Charmonium/*
**/Dstar/KpiKanga-micro.root
- *
to copy only all the Charmonium AWG skims
and the Dstar AWG's KpiKanga stream.
Specify this file with the --include-from=include.lis option
on the syncslac (or rsync) command line.
You probably don't want to know why all those asterisks are necessary
(they aren't in every case, but it's probably simplest to include them), but
if you do, check out
the rsync man page.
The first line should be "*/" (to include all directories) and
the last line should be "- *" (to exclude everything else).
Unlike the exclusions, these selections create the full directory structure
(ie. including deselected AWG directories), but since no data is copied, this
probably isn't such a problem. (If anyone has a way
round this restriction, I'd be interested to hear it.)
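Since the first and last lines of the include file are easy to get wrong, a quick check before a long transfer may help. The sketch below writes the include.lis example from above and tests only those two lines. The function name check_include_list is made up for illustration, and it does not validate the patterns in between.

```shell
# Write the inclusion list from the example above to include.lis.
cat > include.lis <<'EOF'
*/
**/Charmonium/*
**/Dstar/KpiKanga-micro.root
- *
EOF

# check_include_list FILE
# Succeed only if FILE starts with "*/" (include all directories)
# and ends with "- *" (exclude everything else).
check_include_list() {
    first=$(sed -n '1p' "$1")
    last=$(sed -n '$p' "$1")
    [ "$first" = "*/" ] && [ "$last" = "- *" ]
}
```

A typical use would be to run check_include_list include.lis and, if it succeeds, run syncslac with -n and --include-from=include.lis to preview the selection.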
Note that these inclusions and exclusions can also be specified using
multiple --include=name and/or --exclude=name
options, but it's probably simpler to use a file as described above.
This procedure has been performed to transfer $BFROOT/kanga/EventStore to
RAL (csfsun02.rl.ac.uk) and from there to a number
of other UK sites (also Rome).
European sites may prefer to copy from RAL, rather than directly
from SLAC. Please contact me if
you want to do this but don't have access to the RAL machines.
It is probably infeasible to copy the entire dataset from scratch across the
internet.
We plan to develop a method of transferring KanGA files by tape.
This is more complicated than network transfers: the tape contents will
need some form of catalogue and (for efficiency) may have to be
combined into larger (eg. tar) files.
Once a mirror is established at a remote site, it can probably keep up
with additional data-taking/OPR via the network.
KanGA
Data Distribution
rsync and SSH
Please let me know if you
have any comments, suggestions, or questions about exporting KanGA files.
/BFROOT/www/Computing/Offline/DataDist/kanga_export.html last
modified 22nd May 2001 by Tim Adye, <T.J.Adye@rl.ac.uk>