Requirements for the BaBar data distribution system
CCG
Dominique Boutigny - LAPP Annecy
Last modified : January 28, 1999
Introduction :
The BaBar computing model assumes that while the
bulk data reconstruction is done at SLAC, the analysis tasks are distributed
among the institutes of the collaboration through regional centers which
will receive a copy of the data.
The BaBar collaboration is convinced that computing
in regional centers is of prime importance for the quality of the analysis.
A real effort will be made to ensure an efficient distribution
of the data within the collaboration.
The French regional center will be a mirror of
SLAC and will receive a complete copy of the data after some physics cuts
to reduce the amount of background. The mirror site will also play the
important role of the SLAC database backup. The regional centers in Italy
and in the UK will receive a copy at the ESD level. The US and Canadian universities
will receive a copy at the AOD level; we assume that 10 such copies are
made.
The German and Russian institutes will also receive
the AODs but the copy will be made by the mirror site.
The Italian and UK regional centers are now in the
process of buying hardware to store and process BaBar data. Once
it is installed, it will be very difficult for these sites to get new money to
upgrade their configuration. It is therefore important to settle the
ESD size now and to estimate the necessary amount of CPU with good accuracy.
For the French mirror site, the situation is somewhat different, as the
computing center already exists and is continuously upgraded.
In the following we will try to estimate
the resources and constraints involved in setting up an
efficient data distribution system.
Size of the data to be copied :
It is assumed that only the qqbar sample will be
copied plus a fraction of the Bhabha + tau + 2-gamma sample. The
details of the sample composition will be fixed by the CCG representatives.
In the following, we distinguish 3 cases :
- The first year, with 8 weeks of running and an integrated
luminosity of 2.4 fb^-1, corresponding to a total of 9 × 10^6
events to be transferred.
- The second year, with 40 weeks of data taking and an
integrated luminosity of 24 fb^-1, corresponding to 9 × 10^7
events to be transferred.
- The nominal years, with an integrated luminosity of
54 fb^-1 spread over 30 weeks.
We suppose that the collaboration will have the capacity
to produce a Monte Carlo sample of at most 6 times the real
data sample. The MC production will be spread
over the entire year; no reprocessing will be done on Monte Carlo.
We assume the following event sizes for real data
and Monte Carlo. The numbers correspond to qqbar events, which dominate
the sample.
|                 | Real Data | Monte Carlo                |
| RAW             | 90 K      | 140 K (Truth Table : 50 K) |
| REC             | 50 K      | 100 K (Truth Table : 50 K) |
| ESD             | 20 K      | 26 K (ESD + 30%)           |
| AOD             | 2 K       | 2.6 K (AOD + 30%)          |
| TAG             | 0.1 K     | 0.1 K                      |
| Full event size | 162.1 K   | 268.7 K                    |
| ESD event size  | 22.1 K    | 28.7 K                     |
| AOD event size  | 2.1 K     | 2.7 K                      |
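The three summary rows of the table can be cross-checked with a few lines of Python. This is only a sketch: the grouping of tiers into "Full", "ESD" and "AOD" copies (each level including the smaller tiers below it) is a reading of the totals in the table, not something the text states explicitly, and the names `REAL`, `MC`, `COPY_LEVELS` and `event_size` are illustrative.

```python
# Per-tier event sizes in K, copied from the table.
REAL = {"RAW": 90.0, "REC": 50.0, "ESD": 20.0, "AOD": 2.0, "TAG": 0.1}
MC = {"RAW": 140.0, "REC": 100.0, "ESD": 26.0, "AOD": 2.6, "TAG": 0.1}

# ASSUMPTION: each copy level ships its own tier plus all smaller tiers.
# This grouping is inferred from the totals in the table.
COPY_LEVELS = {
    "Full": ("RAW", "REC", "ESD", "AOD", "TAG"),
    "ESD": ("ESD", "AOD", "TAG"),
    "AOD": ("AOD", "TAG"),
}


def event_size(sizes, level):
    """Return the per-event size in K for one copy level, rounded to 0.1 K."""
    return round(sum(sizes[tier] for tier in COPY_LEVELS[level]), 1)


for level in COPY_LEVELS:
    real_k = event_size(REAL, level)
    mc_k = event_size(MC, level)
    print(f"{level:4s}  real: {real_k:6.1f} K   MC: {mc_k:6.1f} K")
```

With this grouping the sums reproduce the "Full", "ESD" and "AOD" event sizes listed in the table.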
For detector studies and reconstruction development,
a small sample (1%) of REC should also be copied. We do not take it
into account here.
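To give a feeling for the orders of magnitude, the real-data volume to copy per year can be sketched as the number of events times the event size. Several inputs here are assumptions and not figures from this note: "K" is taken to mean 1000 bytes, the nominal-year event count is not given in the text and is scaled linearly from the integrated luminosity, and the 1% REC sample is ignored as stated above. The names `EVENTS`, `LUMI_FB`, `SIZE_K` and `volume_gb` are illustrative.

```python
# Events to be transferred, from the text (first and second year only).
EVENTS = {"first year": 9e6, "second year": 9e7}

# Integrated luminosity in fb^-1, from the text.
LUMI_FB = {"first year": 2.4, "second year": 24.0, "nominal year": 54.0}

# ASSUMPTION: the nominal-year event count is not quoted in the text;
# it is scaled here from the second year, proportionally to luminosity.
events_per_fb = EVENTS["second year"] / LUMI_FB["second year"]
EVENTS["nominal year"] = events_per_fb * LUMI_FB["nominal year"]

# Real-data event sizes in K for each copy level, from the table.
SIZE_K = {"Full": 162.1, "ESD": 22.1, "AOD": 2.1}

BYTES_PER_K = 1000  # ASSUMPTION: K is read as 1000 bytes


def volume_gb(n_events, size_k):
    """Real-data volume in GB (10^9 bytes) for n_events of size_k each."""
    return n_events * size_k * BYTES_PER_K / 1e9


for year, n_events in EVENTS.items():
    row = ", ".join(
        f"{level}: {volume_gb(n_events, size_k):9.1f} GB"
        for level, size_k in SIZE_K.items()
    )
    print(f"{year:13s} {row}")
```

The figures in the spreadsheet remain the reference; this sketch only shows how such numbers follow from the event counts and sizes.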
Reprocessing scheme :
- First year : We
assume 3 reprocessings done after the data taking period.
- Second year :
The data taking starts in November and runs for 10 months. We assume 1
reprocessing in January, 1 in May (to be ready for the summer conferences)
and 1 in September after the data taking (3 reprocessings in total).
- Nominal year :
The data taking starts in November and runs for 8 months. We assume 2 reprocessings,
1 in March and 1 in July.
In the Excel Spreadsheet,
one can find the following information :
- The amount of data which should be transferred to
the mirror site, the regional centers and smaller institutes such as the US universities.
- The amount of storage necessary to keep the most
recent processing and an older one. For the Monte Carlo, we suppose that
we need to store only a sample equal in size to the real data
sample.
- The necessary bandwidth to transfer the
data. Note that the given numbers must be multiplied
by the number of sites (2 for the regional centers and ~10 for the universities).
The numbers correspond to the real data only; the bandwidth required for the
Monte Carlo depends on how much of the Monte Carlo sample is transferred.
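The storage and bandwidth entries can be illustrated with a minimal sketch. The spreadsheet formulas are not reproduced in the text, so the following is an assumed model: two stored versions of the real data plus one Monte Carlo sample of the same number of events, and an average bandwidth obtained by spreading one copy evenly over the data-taking period (re-transfers after reprocessing are ignored). "K" is read as 1000 bytes, and `storage_gb` and `bandwidth_mbit_s` are illustrative names.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600
BYTES_PER_K = 1000  # ASSUMPTION: K is read as 1000 bytes


def storage_gb(n_events, real_size_k, mc_size_k, versions=2, mc_factor=1.0):
    """Disk or tape space in GB for one site.

    versions  : stored processings of the real data (most recent + one older)
    mc_factor : stored Monte Carlo sample, in units of the real data sample
    """
    real_k = versions * n_events * real_size_k
    mc_k = mc_factor * n_events * mc_size_k
    return (real_k + mc_k) * BYTES_PER_K / 1e9


def bandwidth_mbit_s(n_events, size_k, weeks, n_sites=1):
    """Average bandwidth in Mbit/s to copy the real data to n_sites.

    ASSUMPTION: the copy is spread evenly over `weeks` of data taking.
    """
    bits_per_site = n_events * size_k * BYTES_PER_K * 8
    return n_sites * bits_per_site / (weeks * SECONDS_PER_WEEK) / 1e6


# Second year, ESD level: 9 x 10^7 events over 40 weeks, 2 regional centers.
print(f"storage per site : {storage_gb(9e7, 22.1, 28.7):.0f} GB")
print(f"bandwidth, 1 site: {bandwidth_mbit_s(9e7, 22.1, 40):.2f} Mbit/s")
print(f"bandwidth, 2 sites: {bandwidth_mbit_s(9e7, 22.1, 40, n_sites=2):.2f} Mbit/s")
```

The `n_sites` argument makes explicit the multiplication by the number of sites mentioned in the list above. Peak rates, for example when a reprocessing has to be shipped quickly, would be higher than this average.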
The Excel spreadsheet
is reproduced here in gif format
  
Tools to be developed :
Sophisticated automated tools should be developed
to handle this amount of data.
The database group is currently developing the
necessary tools for exporting and importing data from and to the database.
The following tools are also needed and will
be provided by the operation group.
- A (relational) database for the book-keeping of the
data copies and transfers.
- A possibility for each site to configure the copy
request :
    - Selection of the type of data to copy
    - Fraction of Bhabha, taus, junk ... (fixed once and for
    all before processing)
    - Fraction of RAW or REC (fixed once and for all before
    processing)
    - It is assumed here that the configuration is set up
    before starting the copy process and that, in principle, it cannot be modified
    during the process.
- A tool to trigger the copy on a regular basis or
based on the amount of data available to copy.
- A tool to load and install the data in the remote
database.
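The per-site copy request and the trigger rule listed above could look roughly like the following sketch. Nothing here describes an existing BaBar tool: the class `CopyRequest`, its field names, the function `should_trigger` and all example values are hypothetical. The frozen dataclass mirrors the requirement that the configuration is fixed before the copy process starts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)  # frozen: the request cannot be modified once created
class CopyRequest:
    """Hypothetical per-site copy configuration."""
    site: str
    data_levels: tuple       # type of data to copy, e.g. ("ESD", "AOD", "TAG")
    bhabha_fraction: float   # fraction of the Bhabha sample to copy
    tau_fraction: float      # fraction of the tau sample to copy
    rec_fraction: float      # fraction of REC to copy
    period: timedelta        # trigger a copy at least this often
    min_volume_gb: float     # or as soon as this much data is waiting

    def __post_init__(self):
        # All fractions must lie between 0 and 1.
        for name in ("bhabha_fraction", "tau_fraction", "rec_fraction"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def should_trigger(request, last_copy, now, pending_gb):
    """Start a copy on a regular basis OR when enough data is available."""
    return (now - last_copy) >= request.period or pending_gb >= request.min_volume_gb


# Example values are purely illustrative.
request = CopyRequest(
    site="example-regional-center",
    data_levels=("ESD", "AOD", "TAG"),
    bhabha_fraction=0.1,
    tau_fraction=0.1,
    rec_fraction=0.01,
    period=timedelta(days=7),
    min_volume_gb=50.0,
)
last_copy = datetime(1999, 2, 1)
print(should_trigger(request, last_copy, datetime(1999, 2, 3), pending_gb=12.0))
print(should_trigger(request, last_copy, datetime(1999, 2, 3), pending_gb=80.0))
```

In a real system each triggered copy would also be recorded in the book-keeping database, so that a site can tell which data it has already received.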
The data distribution should be part of the global
production system in order to limit data movement
on the network as much as possible.
Manpower :
- It is estimated that 0.5 FTE is necessary during
the next 6 months to develop and test the copying tools.
- 1 person is necessary to operate the copying system,
to collect the tapes and to ship them to the remote sites. This is estimated
at 1 day of work per week during the first year and 2 days of work per
week in the following years.