This session covers a number of topics:

  o Data distribution/management and bookkeeping/metadata

Some very high-level requirements on data distribution appear in section 8
of the CMWG2 Requirements document (see that document; I won't copy them
here). There were a number of more detailed requirements in the OCP
document (BaBar note 548) related to data distribution:

\item Data should be distributed in a way that does not require the use
      of a particular vendor.
\item Data distribution must be architecturally neutral.
\item It must be possible to distribute data in an unpinned way (that is,
      references within the data must be self-contained and
      independently relocatable).
\item Distribution (sending and receiving) must be able to be performed
      in parallel with analysis and production.
\item Metadata about the data must be stable. That is, it should not be
      necessary to resend a lot of metadata each time a file is
      transferred.
\item The distribution model must be compatible with emerging grid
      technologies.
\item It should be possible to exchange data between sites without
      clashes between the identifiers used (database ids, file names, or
      whatever else is used).

And some related to data management and access:

\item The cost of recovering from unexpected problems like a power
      outage, disk failure, machine reboot or job failure should be
      considered (length of outage, people-effort) and should not exceed
      that of the existing system.
\item System inhibit: The system should allow access to all of the data,
      or to selected parts of it, to be inhibited so that administrative
      tasks (maintenance) can be carried out.
\item Authorization: Data should be protected by access control; for
      instance, a user should not be able to modify production data or
      other users' data (unless approved and explicitly configured to
      allow that). It should be possible to restrict administrative
      functions.
\item It should be possible to {\it transparently} load balance the data
      on data servers to improve access and recover from disk crashes.
\item File size to HPSS: the average target size should be about 250~MB,
      and certainly should not be smaller than 100~MB. There is no upper
      limit, though it probably should not go much above 10~GB.
\item The system should provide easy access to files on tape, including
      a file staging mechanism, purging and migration.
\item It should be possible to keep some parts of an event on disk and
      some on tape.

*****************************************************************

Both Ulrik and Alessandra wrote down some specific questions in discussing
how data distribution for a new Kanga eventstore might work. Ulrik's
questions were:

On Thu, Jan 09, 2003 at 01:58:30PM +0000, Ulrik Egede wrote:
> I promised to give you a few questions we should aim to answer regarding
> the data distribution in the future.
>
> Multiple files:
> ===============
> In the present system we copy files in a one-to-one correspondence. So a
> file at a Tier A site is always copied to an identical file at a Tier C
> site.
>
> With the pointer collections we can extract the tag and micro of a kanga
> file and export that.
>
> With the new system an event might consist of more parts
>
>  - A pointer collection
>  - A file with the tag.
>  - A file with the micro.
>  - A file with an extended micro.
>  - A file with the mini.
>
> So the questions deal with how we now make the copy
>  - How will the import job know what parts to pick up?
>  - Can I specify that for a given set of events I want the tag, the micro
>    and the mini but not the extended micro? or not the mini or whatever
>    other combination?
>  - Will the import job save everything into a single file at the
>    receiving end or put it back into multiple files?
>  - How do we make sure we put together the correct parts?
>  - If skimData knows about all the files involved you would be able to
>    delete collections based only on skimData information.
>    If it doesn't you would need to read through the pointer collection
>    of all events in all runs you wish to delete. What do we do?
>
> Other questions might also be:
>  - How do we tell the import job if something is available locally
>    already? At the moment we assume for the pointer collections that
>    either everything is there already (like a SLAC to RAL copy) or that
>    nothing is there (like a RAL to Imperial copy). Do we need to allow
>    some kind of in-between solution?
>
> Cheers
> Ulrik.

Alessandra's questions were:

On Fri, Jan 10, 2003 at 05:26:21PM +0000, Alessandra Forti wrote:
> here are my questions (with my worries)
>
> Do the event IDs have to go into the bookkeeping?
>
> I find the idea of having bookkeeping per event ID a bit monstrous
>
> Are the event IDs serial in a run so that a range of IDs can be defined
> for each run?
>
> How will all these files, associated with an event (or run),
> be produced? Together or one type at a time?
>
> In my opinion skimData should know about all the files, not only for the
> import but also for a normal user selection. A type of file should
> correspond to a configuration of jobs (I guess a task in Gregory's
> language)
>
> Will OPR and SP and SkimTools or their future equivalent still be treated
> as different production and bookkeeping systems?
>
> This would leave open the issue of redundancy of information and
> consistency between the tables.
>
> cheers
> alessandra
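
To make a few of the questions above concrete, here is a minimal sketch in
Python of one way the bookkeeping could answer them. It is an illustration
for discussion only, not a design decision and not an existing skimData
schema. It assumes that event IDs are serial within a run (Alessandra's
open question), so that one record per run with an ID range is enough, and
that each run has one file per component (Ulrik's list). The names
RunRecord, plan_import and COMPONENTS, and all the file names, are
hypothetical.

```python
from dataclasses import dataclass, field

# Component names follow Ulrik's list; the identifiers are hypothetical.
COMPONENTS = ("pointer", "tag", "micro", "extended_micro", "mini")


@dataclass
class RunRecord:
    """Hypothetical per-run bookkeeping entry (one row per run, not per event)."""
    run: int
    first_event: int   # assumes event IDs are serial within a run
    last_event: int
    files: dict = field(default_factory=dict)  # component name -> file name

    def contains(self, event_id: int) -> bool:
        """True if the event ID falls inside this run's ID range."""
        return self.first_event <= event_id <= self.last_event


def plan_import(records, wanted, local_files):
    """Return (run, component, file) triples that still need to be copied.

    wanted      -- set of component names the receiving site asked for
    local_files -- set of file names already present at the receiving site
    """
    unknown = set(wanted) - set(COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    plan = []
    for rec in records:
        for comp in COMPONENTS:
            if comp not in wanted:
                continue
            fname = rec.files.get(comp)
            if fname is None:
                # The bookkeeping does not know this part, so we cannot
                # guarantee that the correct parts are put together.
                raise KeyError(f"run {rec.run} has no {comp} file")
            if fname not in local_files:
                plan.append((rec.run, comp, fname))
    return plan


# Usage: ask for tag, micro and mini (no extended micro); the tag file is
# already available locally, so only micro and mini are scheduled.
records = [
    RunRecord(1001, 1, 5000, {
        "pointer": "r1001.ptr",
        "tag": "r1001.tag",
        "micro": "r1001.micro",
        "extended_micro": "r1001.xmicro",
        "mini": "r1001.mini",
    }),
]
plan = plan_import(records, {"tag", "micro", "mini"}, {"r1001.tag"})
print(plan)
```

In this sketch the "in-between" case Ulrik asks about falls out of the
per-file check against local_files, and deleting a collection would only
need the per-run records rather than a pass over the pointer collection.
Whether a per-run record is sufficient depends on the answer to the
serial-ID question.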