USE CASE DATA DISTRIBUTION

Here are my use cases for the new Kanga distribution scheme, with particular emphasis on the bookkeeping aspects. I also include some related data management requirements for analysis sites. I do not consider SP/PR site to Tier A transfers (the SP/PR use cases should cover these).

One general comment: the data distribution and management tools will deal mainly with files. Assuming the data access systems (e.g. XRootd) can be used, we may often be able to work in terms of the logical filename (LFN). However, information on the physical file name (PFN) and location will be required for tasks such as moving data between servers (unless that is handled entirely by the access system).

It may be useful for import managers to refer to collections in some selections, but the translation of collection (1) -> files (N) should be made using the SQL database (except during deep copy).

In all cases, import should be controlled by the destination site. This doesn't preclude automatic ssh connections to run commands on the source site, but management is simpler if user commands and cron jobs run at the destination.

1. Tier A -> Tier A

a) Allow the import manager to select datasets for import. Dataset selection uses very broad criteria (e.g. all 2002 data, and generic SP). The manager should also be able to specify priorities based on these criteria.

b) Import should be able to run in a cron job. This job will need a list of the files (LFN) required for import and should be able to sort them by priority as well as by a few other attributes (e.g. stream, job, or run number, so that files appear in the expected order). Note that the list of files for import will need to be updated automatically for each night's import, based on the criteria specified in 1a.

c) Data integrity testing (e.g. cksum) should be done after import (but probably in parallel with first analysis access) and later on request.

d) If a subset of the data is lost or corrupted, it should be possible to request a reimport.

e) It should be possible to remove old data, using criteria similar to those above. If required, it should be possible to remove data from the source Tier A when export and check (and, perhaps, archive) are complete.

f) Data should be distributed between different servers/filesystems, both on import and when redistributed afterwards. One should be able to query the contents of a server/filesystem with SQL (e.g. to determine what was lost when a disk went down).

2. Tier A -> large Tier C

a) (1a) to (1e) also apply here, though the file selection may be on finer criteria (e.g. run range, or specific signal modes). With the existing skimImport (skimSqlSelect), user training has been greatly simplified because the selection uses the same options as the analysis selection (skimData).

b) Pointer skims referring to files that are not imported to the Tier C (e.g. AllEvents) should be imported with a deep copy. Should this test (whether a deep copy is required) be automatic, or triggered by command (the site manager knows that the Tier C doesn't include AllEvents, so specifies deep copy explicitly)? Users at the local site should be able to query what sort of data they have (pointer or deep copy).

3. Tier A -> small Tier C (single user)

If the site has just a single BaBar analysis user (or is even her own machine or laptop), we may want to consider a reduced infrastructure. Can a remote SQL server be used? Do we combine selection and import?

These are just my ideas. Please post suggestions and comments in the Bookkeeping HN.
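
To make the bookkeeping requirements in 1a, 1b and 1f a little more concrete, here is a minimal sketch of how the nightly import list and the per-server query might look. It is only an illustration: the table and column names (files, replicas, import_rules), the site and server labels, and the use of SQLite are all assumptions made for the example, not the actual bookkeeping schema or database.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the real bookkeeping tables.
SCHEMA = """
CREATE TABLE files (
    lfn        TEXT PRIMARY KEY,   -- logical file name
    collection TEXT NOT NULL,      -- collection (1) -> files (N)
    dataset    TEXT NOT NULL,      -- broad selection tag, e.g. 'data-2002'
    run        INTEGER NOT NULL,
    cksum      TEXT                -- checksum recorded at the source site
);
CREATE TABLE replicas (
    lfn    TEXT NOT NULL REFERENCES files(lfn),
    site   TEXT NOT NULL,
    server TEXT NOT NULL,          -- server/filesystem holding the copy
    pfn    TEXT NOT NULL,          -- physical file name at that site
    PRIMARY KEY (lfn, site)
);
CREATE TABLE import_rules (        -- criteria set by the import manager (1a)
    site     TEXT NOT NULL,
    dataset  TEXT NOT NULL,
    priority INTEGER NOT NULL      -- lower number = import first
);
"""


def files_to_import(db, site):
    """LFNs that match the site's rules and are not yet held there (1b).

    Sorted by priority, then run number, so files arrive in a
    predictable order.
    """
    rows = db.execute(
        """
        SELECT f.lfn
        FROM files f
        JOIN import_rules r ON r.dataset = f.dataset AND r.site = ?
        WHERE NOT EXISTS (
            SELECT 1 FROM replicas p WHERE p.lfn = f.lfn AND p.site = ?
        )
        ORDER BY r.priority, f.run, f.lfn
        """,
        (site, site),
    )
    return [lfn for (lfn,) in rows]


def files_on_server(db, site, server):
    """LFNs held on one server/filesystem, e.g. after a disk loss (1f)."""
    rows = db.execute(
        "SELECT lfn FROM replicas WHERE site = ? AND server = ? ORDER BY lfn",
        (site, server),
    )
    return [lfn for (lfn,) in rows]
```

A cron job at the destination site could call files_to_import() each night, so the list follows the manager's criteria without being edited by hand. After a disk failure, the output of files_on_server() could be used to drop the lost replica rows, which would make those files show up again in the next import list (1d).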