
Requirements for the BaBar data distribution system

CCG
Dominique Boutigny - LAPP Annecy
Last modified : January 28, 1999

Introduction :

The BaBar computing model assumes that while the bulk data reconstruction is done at SLAC, the analysis tasks are distributed among the institutes of the collaboration through regional centers which will receive a copy of the data.

The BaBar collaboration is convinced that computing in regional centers is of prime importance for the quality of the analysis. A real effort will be made to ensure an efficient distribution of the data within the collaboration.

The French regional center will be a mirror of SLAC and will receive a complete copy of the data after some physics cuts to reduce the amount of background. The mirror site will also play the important role of backup for the SLAC database. The regional centers in Italy and in the UK will receive a copy at the ESD level. The US and Canadian universities will receive a copy at the AOD level; we assume that 10 such copies are made.
The German and Russian institutes will also receive the AODs, but their copy will be made by the mirror site.

The Italian and UK regional centers are now in the process of buying hardware to store and process BaBar data. Once it is installed, it will be very difficult for these sites to get new money to upgrade their configuration. It is therefore important to settle the ESD size now and to estimate the necessary amount of CPU with good accuracy. For the French mirror site the situation is somewhat different, as the computing center already exists and is continuously upgraded.
 
In the following we try to estimate the resources that are necessary and the constraints that must be met in order to set up an efficient data distribution system.

Size of the data to be copied :

It is assumed that only the qqbar sample will be copied, plus a fraction of the Bhabha + tau + 2-gamma sample. The details of the sample composition will be fixed by the CCG representatives.
In the following, we distinguish 3 cases :
  • The first year, with 8 weeks of running and an integrated luminosity of 2.4 fb^-1, corresponding to a total of 9 x 10^6 events to be transferred.
  • The second year, with 40 weeks of data taking and an integrated luminosity of 24 fb^-1, corresponding to 9 x 10^7 events to be transferred.
  • The nominal years, with an integrated luminosity of 54 fb^-1 spread over 30 weeks.
We suppose that the collaboration will have the capacity to produce Monte Carlo statistics equal to at most 6 times the real data sample. The MC production will be spread over the entire year, and no reprocessing will be done on Monte Carlo.
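The event counts above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the text quotes event counts for the first two years, and the nominal-year figure is obtained here by assuming the same number of events per fb^-1, which is an assumption of this sketch and not a number stated in this note. All names are hypothetical.

```python
# Figures quoted in the text.
LUMINOSITY_FB = {"first year": 2.4, "second year": 24.0, "nominal year": 54.0}
EVENTS_QUOTED = {"first year": 9e6, "second year": 9e7}

MC_FACTOR_MAX = 6  # Monte Carlo statistics: at most 6 times the real data


def events_per_inverse_fb():
    """Events to transfer per fb^-1, derived from the first-year figures."""
    return EVENTS_QUOTED["first year"] / LUMINOSITY_FB["first year"]


def estimated_events(period):
    """Quoted count where the text gives one, otherwise a linear
    extrapolation in luminosity (assumption of this sketch)."""
    if period in EVENTS_QUOTED:
        return EVENTS_QUOTED[period]
    return events_per_inverse_fb() * LUMINOSITY_FB[period]


for period in LUMINOSITY_FB:
    n_real = estimated_events(period)
    n_mc_max = MC_FACTOR_MAX * n_real
    print(f"{period}: {n_real:.3g} real events, up to {n_mc_max:.3g} MC events")
```

The first two years give the same yield of about 3.75 x 10^6 events per fb^-1, which is why a linear extrapolation is used for the nominal year.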

We assume the following event sizes for real data and Monte Carlo. The numbers correspond to qqbar events, which dominate the sample.
 

                   Real Data   Monte Carlo
RAW                90 K        140 K     (Truth Table : 50 K)
REC                50 K        100 K     (Truth Table : 50 K)
ESD                20 K        26 K      (ESD + 30%)
AOD                2 K         2.6 K     (AOD + 30%)
TAG                0.1 K       0.1 K
Full Event size    162.1 K     268.7 K
ESD Event size     22.1 K      28.7 K
AOD Event size     2.1 K       2.7 K
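The three summary rows of the table are sums of the component sizes: the full event includes every component, the ESD event size is ESD + AOD + TAG, and the AOD event size is AOD + TAG. The sketch below reproduces these sums and turns an event count into a data volume. It assumes that "K" means 10^3 bytes and that 1 GB is 10^9 bytes; the function and variable names are hypothetical.

```python
# Per-event component sizes from the table, in "K" (taken here as kilobytes).
SIZES_K = {
    "real": {"RAW": 90.0, "REC": 50.0, "ESD": 20.0, "AOD": 2.0, "TAG": 0.1},
    "mc": {"RAW": 140.0, "REC": 100.0, "ESD": 26.0, "AOD": 2.6, "TAG": 0.1},
}

# Components contained in each distribution level.
TIERS = {
    "full": ("RAW", "REC", "ESD", "AOD", "TAG"),
    "esd": ("ESD", "AOD", "TAG"),
    "aod": ("AOD", "TAG"),
}


def event_size_k(kind, tier):
    """Size of one event, in K, for kind 'real' or 'mc' at the given tier."""
    return sum(SIZES_K[kind][component] for component in TIERS[tier])


def volume_gb(n_events, kind, tier):
    """Total volume in GB (assumption: 1 K = 1e3 bytes, 1 GB = 1e9 bytes)."""
    return n_events * event_size_k(kind, tier) * 1e3 / 1e9


for kind in SIZES_K:
    for tier in TIERS:
        print(f"{kind:4s} {tier:4s} {event_size_k(kind, tier):7.1f} K")

# First-year real data at the ESD level: 9 x 10^6 events.
print(f"first year ESD: {volume_gb(9e6, 'real', 'esd'):.1f} GB")
```

Under these unit assumptions, the first-year ESD copy of the real data comes to roughly 200 GB per regional center for a single processing.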
 
For detector studies and reconstruction development, a small sample (1%) of REC should also be copied. It is not taken into account here.

Reprocessing scheme :

  • First year : We assume 3 reprocessings, done after the data taking period.
  • Second year : The data taking starts in November and runs for 10 months. We assume 1 reprocessing in January, 1 in May (to be ready for the summer conferences) and 1 in September after the data taking (3 reprocessings in total).
  • Nominal year : The data taking starts in November and runs for 8 months. We assume 2 reprocessings, 1 in March and 1 in July.
The Excel spreadsheet gives the following information :
  • The amount of data which should be transferred to the mirror site, to the regional centers and to smaller institutes such as those in the US.
  • The amount of storage necessary to keep the most recent processing and an older one. For the Monte Carlo, we suppose that we need to store only the statistics corresponding to 1 times the real data sample.
  • The bandwidth necessary to transfer the data. Note that the numbers given are per site and must be multiplied by the number of sites (2 for the regional centers and ~10 for the universities). The numbers correspond to the real data only; the bandwidth required for the Monte Carlo depends on how much of the statistics is transferred.
The Excel spreadsheet is reproduced here in gif format
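The kind of calculation behind the storage and bandwidth figures can be sketched as follows. This is not the spreadsheet itself: the formulas are a minimal reading of the rules stated above (two processings of the real data kept, one Monte Carlo sample, per-site bandwidth multiplied by the number of sites), the bandwidth is a simple average over the transfer period, and the unit conventions (1 GB = 10^9 bytes) and all names are assumptions.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def storage_gb(real_volume_gb, mc_volume_gb, versions_kept=2):
    """Storage for the most recent processing plus one older one of the
    real data, and a single copy of the Monte Carlo sample."""
    return versions_kept * real_volume_gb + mc_volume_gb


def average_bandwidth_mbit_s(volume_gb, weeks, n_sites=1, n_passes=1):
    """Average rate, in Mbit/s, needed to ship volume_gb to each of n_sites
    within the given number of weeks.

    n_passes is the number of times the sample is shipped (for example once
    per processing); whether every reprocessing is shipped is an assumption
    left to the caller.
    """
    total_bits = volume_gb * 1e9 * 8 * n_sites * n_passes
    return total_bits / (weeks * SECONDS_PER_WEEK) / 1e6


# Illustrative inputs only: 200 GB of real data and 260 GB of Monte Carlo.
print(f"storage: {storage_gb(200.0, 260.0):.0f} GB")
print(f"one site : {average_bandwidth_mbit_s(200.0, weeks=8):.3f} Mbit/s")
print(f"two sites: {average_bandwidth_mbit_s(200.0, weeks=8, n_sites=2):.3f} Mbit/s")
```

An average rate understates what the network must sustain if the copies are made in bursts, for example right after a reprocessing, so these numbers are best read as lower bounds.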

 

Tools to be developed :

Sophisticated automated tools should be developed to handle this amount of data.
The database group is currently developing the necessary tools for exporting and importing data from and to the database.
The following tools are also needed and will be provided by the operation group.
  • A (relational) database for the book-keeping of the data copies and transfers.
  • A possibility for each site to configure the copy request :
    • Selection of the type of data to copy
    • Fraction of Bhabha, taus, junk ... (fixed once and for all before processing)
    • Fraction of RAW or REC (fixed once and for all before processing)
    • It is assumed here that the configuration is set up before starting the copy process and that, in principle, it cannot be modified during the process.
  • A tool to trigger the copy on a regular basis or according to the amount of data available to copy
  • A tool to load and install the data in the remote database.
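As an illustration of the per-site copy request and of the trigger described in the list above, here is a minimal sketch. The class, field names and threshold values are hypothetical and are not part of any existing BaBar tool. The request is made immutable to reflect the assumption that the configuration is fixed before the copy process starts, and the background selection is reduced to a single fraction for brevity.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class CopyRequest:
    """Per-site copy configuration, fixed before the copy process starts."""
    site: str
    data_type: str              # type of data to copy, e.g. "ESD" or "AOD"
    background_fraction: float  # fraction of Bhabha, taus, junk ... to copy
    raw_rec_fraction: float     # fraction of RAW or REC to copy

    def __post_init__(self):
        for name in ("background_fraction", "raw_rec_fraction"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


def should_trigger(hours_since_last_copy, pending_gb,
                   max_hours=24.0, max_pending_gb=50.0):
    """Trigger a copy on a regular basis or once enough data is waiting.

    The default thresholds are placeholders, not values from this note.
    """
    return hours_since_last_copy >= max_hours or pending_gb >= max_pending_gb


request = CopyRequest("uk-regional-center", "ESD", 0.10, 0.01)
try:
    request.data_type = "AOD"  # rejected: the request is frozen
except FrozenInstanceError:
    print("configuration cannot be modified during the copy process")

print(should_trigger(hours_since_last_copy=3.0, pending_gb=80.0))
```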
The data distribution should be part of the global production system in order to limit as much as possible the data movement on the network.
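The book-keeping database listed among the tools could, in its simplest form, be a single table of transfers per site. The sketch below uses SQLite from the Python standard library purely for illustration; the schema, the status values and the function names are assumptions, and a production system would presumably use whatever relational database the operation group selects.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS transfers (
    id         INTEGER PRIMARY KEY,
    site       TEXT NOT NULL,
    data_type  TEXT NOT NULL,
    processing TEXT NOT NULL,
    size_gb    REAL NOT NULL,
    status     TEXT NOT NULL
               CHECK (status IN ('requested', 'shipped', 'installed'))
);
"""


def open_bookkeeping(path=":memory:"):
    """Open (or create) the book-keeping database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_transfer(conn, site, data_type, processing, size_gb):
    """Register a new copy request and return its identifier."""
    cur = conn.execute(
        "INSERT INTO transfers (site, data_type, processing, size_gb, status) "
        "VALUES (?, ?, ?, ?, 'requested')",
        (site, data_type, processing, size_gb),
    )
    conn.commit()
    return cur.lastrowid


def mark(conn, transfer_id, status):
    """Update the status of a transfer ('shipped' or 'installed')."""
    conn.execute("UPDATE transfers SET status = ? WHERE id = ?",
                 (status, transfer_id))
    conn.commit()


def pending(conn, site):
    """Transfers for a site that are not yet installed remotely."""
    rows = conn.execute(
        "SELECT id, data_type, processing FROM transfers "
        "WHERE site = ? AND status != 'installed' ORDER BY id",
        (site,),
    )
    return rows.fetchall()


conn = open_bookkeeping()
first = record_transfer(conn, "mirror-site", "ESD", "processing-1", 200.0)
mark(conn, first, "shipped")
print(pending(conn, "mirror-site"))
```

Keeping the status in the same table as the request lets each site check what is still outstanding with one query.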

Manpower :

  • It is estimated that 0.5 FTE is necessary during the next 6 months to develop and test the copying tools.
  • 1 person is necessary to operate the copying system, to collect the tapes and to ship them to the remote sites. This is estimated at 1 day of work per week during the first year and 2 days of work per week in the following years.