RFC: file naming conventions in the new Kanga/ROOT eventstore

Hi,

  We clearly need to converge on the file naming convention for the new
Kanga/ROOT eventstore. There were multiple discussions about this at the
Lyon workshop and the thinking evolved somewhat during the meeting. Tim
proposed something to me in the days following the meeting. What I write
below is basically a slightly revised version of that.

First note the following:

  o we want to sequentially number all of the "owned" files of a
    collection starting from 1

  o we want to have some means of recognizing which data components are
    in which files, to allow data management to do some basic clustering
    of the files on filesystems

  o we want to retain the flexibility of clustering anything (except the
    header) into any number of files (up to N files for N components,
    plus one for the header)

  o we need to provide for the fact that there is a 2GB limit on the ROOT
    file size, and even once that is removed (apparently sometime in the
    next few months) we may still want to limit file sizes to be smaller
    than some particular specified size. This may mean that the data
    corresponding to a particular written component winds up in two
    files.

  o We want to have some clear visual separator to help users understand
    which part of the filename is part of the collection name and which
    part is the

  o The first file "01" is special as it is the one found by a naming
    convention. It must contain the event header.

So the proposal is that:

    LFN = <CollectionName>.<FileNumber><ComponentList>.root

where the "ComponentList" can be one or more of the following letters:

    0) H   HDR
    1) U   USR
    2) B   TAG
    3) C   CND
    4) A   AOD
    5) T   TRU
    6) E   ESD
    7) S   SIM
    8) R   RAW

Thus the (logical) files for a collection:

    /store/SP/R14/modes/001/237/Jan2001/14.2.1b/sp001237_0035

would be something like the following:

    /store/SP/R14/modes/001/237/Jan2001/14.2.1b/sp001237_0035.01.root
    /store/SP/R14/modes/001/237/Jan2001/14.2.1b/sp001237_0035.02E.root
    /store/SP/R14/modes/001/237/Jan2001/14.2.1b/sp001237_0035.03E.root
    /store/SP/R14/modes/001/237/Jan2001/14.2.1b/sp001237_0035.04HBCAT.root

with the following notes:

  o a "." is used to separate the eventstore extension from the
    collection part of the name to make it (somewhat) easy to parse it
    out. (This replaces the "-", which is slightly more commonly used in
    collection names; the "." is also slightly more obvious as a
    separator.)

  o the first file is generically named "01" without any indication of
    the components contained inside, so a universal naming convention

        LFN = <CollectionName>.01.root

    always works to find this file without knowledge of its contents.

  o the file number is zero-padded to two digits.

  o the component letters should be given in the order listed above, so
    "whatever.04HBCAT.root" is correct, "whatever.04TABH.root" is not.

  o the component numbers 0 to 8 listed above may be used as bit numbers
    in a "component mask" in the bookkeeping.

  o I have clustered the "TRU" in with the "AOD" in this example. I'll
    post about this separately; it looks like this is what we will do
    for MC (i.e. the "TRU" will now become part of the "micro").

  o the USR component is a bit more complicated in that a collection may
    have multiple USR components (internally named USR1, USR2, etc.).
    Externally this will just be labelled USR, however, and I do not
    think that the bookkeeping needs to differentiate between these.

  o I think that each file should internally also store the _expected_
    eventstore extension, e.g.
    inside the file someplace there should be a string like "04HBCAT"
    for file "whatever.04HBCAT.root", so that one can ask the file what
    its extension is if someone does

        mv whatever.04HBCAT.root foo.out

    by accident. (Files can be renamed, but only the "whatever" part can
    be changed. They need to retain the same eventstore extension
    ".04HBCAT.root".)

  o If one actually has a file in hand there is also a way (and tool) to
    ask which components are inside. The point of the naming convention
    is to allow some (gross) classification of files for data management
    even without opening the files.

  Please comment on this proposal (in particular Matthias, Tim,
Alessandra and Artem). I'd like to converge in the next day or so.

  (Sorry if this is again a core dump, but it is hard to find time to
concentrate while one is at SLAC.)

                                   Pete

Changes 031113 PE - Swapped order of B and C in "ComponentList" to match
  what was actually implemented in KanCompMap. As KanCompMap defines the
  order of the slots in the event data, this choice is effectively
  persisted, so swap the order here. Added missing "C" in a few of the
  examples to reflect what one would normally find for an SP collection.

Changes 1st October 2003: reordered release and type (R14/PR -> PR/R14).
  Removed duplicate part of run/mode number (0001/00013400 -> 0001/3400).

Changes 10 July 2003 [TJA]: zero-pad file number, add CND component,
  order and number component list, add comment on USR component.

-------------------------------------------------------------------------
Peter Elmer     E-mail: Peter.Elmer@cern.ch      Phone: +41 (22) 767-4644
Address: CERN Division PPE, Bat. 32 2C-14, CH-1211 Geneva 23, Switzerland
-------------------------------------------------------------------------
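
For illustration only, the naming convention and the "component mask"
described above could be sketched in Python roughly as follows. This is a
sketch under the assumptions of the proposal, not an existing tool: the
names make_lfn, parse_lfn and component_mask are hypothetical, and the code
assumes file numbers in the range 1 to 99 (two digits).

```python
import re

# Component letters in the order listed in the proposal. The position of a
# letter in this string is its component number (0..8), which can serve as
# the bit number in a component mask.
LETTERS = "HUBCATESR"  # HDR USR TAG CND AOD TRU ESD SIM RAW

# <collection>.<two-digit file number><component letters>.root
LFN_RE = re.compile(
    r"^(?P<collection>.+)\.(?P<number>\d{2})(?P<letters>[A-Z]*)\.root$"
)


def component_mask(letters):
    """Return the bit mask for a string of component letters."""
    mask = 0
    for letter in letters:
        mask |= 1 << LETTERS.index(letter)  # ValueError on unknown letter
    return mask


def make_lfn(collection, number, letters=""):
    """Build an LFN; letters are put into the order of the proposal."""
    ordered = "".join(l for l in LETTERS if l in letters)
    return f"{collection}.{number:02d}{ordered}.root"


def parse_lfn(lfn):
    """Split an LFN into (collection, file number, letters, mask).

    For the unlabelled first file ("01") the letters are empty and the
    mask is 0, i.e. the name alone says nothing about the contents.
    """
    match = LFN_RE.match(lfn)
    if match is None:
        raise ValueError(f"not an eventstore LFN: {lfn}")
    letters = match.group("letters")
    positions = [LETTERS.find(l) for l in letters]
    # Reject unknown letters, repeated letters and letters out of order.
    if -1 in positions or positions != sorted(set(positions)):
        raise ValueError(f"bad component list: {letters}")
    return (
        match.group("collection"),
        int(match.group("number")),
        letters,
        component_mask(letters),
    )


if __name__ == "__main__":
    print(make_lfn("sp001237_0035", 4, "TACBH"))
    print(parse_lfn("sp001237_0035.02E.root"))
```

With this sketch, "whatever.04HBCAT.root" parses to file number 4 with the
bits for H, B, C, A and T set, while "whatever.04TABH.root" is rejected
because the letters are out of order.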