
CM2 Production Status

This page documents the status of known issues with the deployment of the complete Computing Model 2 system, primarily the conversion and skimming for the summer conferences. Some old problems have been removed but can still be read here and some more here.

20th September 2004


Data Conversion

    Complete at the >99.5% Level (34 unrecoverable, 101 not good, 17 also not good and 58 that may be recovered not good out of 13149)

Data Skimming

    Started around 22nd January (done: 12938 out of 12939, 1 cannot be skimmed; 1728M events merged: Run1 297533928, Run2 958020236, Run3 316399434 + 156777829)
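
    As a quick consistency check, the per-run counts quoted in the status line above can be summed to confirm the rounded 1728M total. This is only an arithmetic sketch: the dictionary keys, in particular the labels for the two Run3 terms, are illustrative names and not official dataset names.

```python
# Per-run merged event counts, copied from the status line above.
# The key names are illustrative labels only; the source does not say
# what distinguishes the two Run3 terms.
merged_events = {
    "Run1": 297533928,
    "Run2": 958020236,
    "Run3 (first term)": 316399434,
    "Run3 (second term)": 156777829,
}

total = sum(merged_events.values())
total_millions = total // 1_000_000  # truncate to whole millions

print(f"total merged events: {total} (~{total_millions}M)")
```

    The sum truncates to 1728 million, in agreement with the figure quoted.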
  1. Production Issue: Some runs missing from Green Circle (David Hutchcroft) (Updated 27/07/2004)

    The following runs (amounting to 280ipb) are missing from the merges they were expected to be in. The reason is not yet understood, but temporary network or file server problems probably caused them to be missed during the merge.

    22214
    22308
    26777
    26789
    29376
    33592
    34488
    38150
    38369
    38374
    38399
    38408
    38466
    38485
    38501
    38558
    38947
    39093
    

    These runs were reskimmed and added to the BlueSquarePrime dataset (and BlackDiamond).

SP5 Conversion

    Restarted on 3rd March (1.38B done, target is 1.4B)
  1. Production Issue: Runs found in two collections (Will Roethel) (Added 19/06/2004)

    A set of runs has been found in two collections, while the Bookkeeping records them in only one. This needs to be understood.
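
    A check for runs that appear in more than one collection could look like the sketch below. It assumes the run lists per collection are already available as plain Python data; the function name, the collection names and the run numbers are all hypothetical, and nothing here reflects the actual Bookkeeping interface.

```python
from collections import defaultdict


def runs_in_multiple_collections(collections):
    """Return {run: sorted collection names} for runs found in more than one collection.

    `collections` maps a collection name to an iterable of run numbers.
    """
    seen = defaultdict(set)
    for name, runs in collections.items():
        for run in runs:
            seen[run].add(name)
    return {run: sorted(names) for run, names in seen.items() if len(names) > 1}


# Hypothetical example data; not real BaBar collections or run numbers.
example = {
    "collection_A": [1001, 1002, 1003],
    "collection_B": [1003, 1004],
    "collection_C": [1005],
}
duplicates = runs_in_multiple_collections(example)
print(duplicates)
```

    The output of such a scan could then be compared with what the Bookkeeping reports for each run.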

SP5 Skimming

    Done.
  1. Production Issue: Skimming one merged collection (Will Roethel) (Updated 1/06/2004)

    As the SP collections are roughly 150k events each, they take many days to skim. In addition, the local disks on the noma are not large enough to hold two skim outputs. Updates to Task Management are in progress to allow it to split jobs up. For the moment only the toris are used for SP skims.

    Jobs have been split up into ~4k events to avoid this problem.

  2. Production Issue: NFS Limitation (Will Roethel, Andreas Petzold) (Updated 19/06/2004)

    Have been having many problems with NFS servers causing jobs to crash. Working on adding more servers and updating tools to avoid too much load on one server.

    Serialised the sub-merges (from the split skim technology) to solve this issue.

  3. Production Issue: Some merge collections lost (Will Roethel, Remi Mommsen) (Updated 19/06/2004)

    It was discovered that for some merge collections the rollover files were not in HPSS. They are not on disk either. It looks like some merges will need to be redone from the individual skims which have been backed up to tape. The reason for this problem hasn't been discovered yet.

    It was the first five merges that had this problem. They have been redone (as new merges) and the old ones will be marked bad in the Bookkeeping (no one can use them anyway due to the missing files).
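
    The job splitting described under issue 1 of this section (collections of roughly 150k events divided into jobs of ~4k events) can be sketched as follows. This is an illustration under stated assumptions, not the Task Management implementation: the function name and the half-open (first, last) range convention are hypothetical.

```python
def split_event_ranges(n_events, chunk_size=4000):
    """Split n_events into consecutive (first, last) ranges, with last exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, n_events))
        for start in range(0, n_events, chunk_size)
    ]


# Roughly the collection and job sizes quoted in issue 1.
ranges = split_event_ranges(150_000, 4_000)
print(len(ranges), ranges[0], ranges[-1])
```

    With these sizes a single collection becomes 38 jobs, the last one covering only the remaining 2000 events. Each job then produces a sub-skim that has to be merged afterwards, which is where the serialised sub-merges of issue 2 come in.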

Run4 Reprocessing

    PC done (some cleanup going on), ER done (O(20) runs awaiting fix)
  1. Production Issue: Merge of node files didn't use KanRecover (Teela Pulliam) (Updated 8/06/2004)

    The intention was to use KanRecover to fix any problems with ROOT files from Elf nodes that crashed. This didn't happen (and it is not clear that it would actually have helped), and the merge application used the ROOT automatic recovery instead. Some runs lost metadata when this happened (just for the crashed node's events). May need to reprocess runs that had an Elf node crash (~10%) and then update PR so this isn't an issue in the future.

    The 10% figure was actually based on runs in which more than one Elf crashed; 20% of runs have one or more crashed Elf. For the moment these runs are marked as "problem" runs. They need to be classified into different crash modes. Will also try skimming the various kinds to determine whether we always have this problem with lost metadata. Will also deploy 14.4.1(a) ASAP to avoid as many reasons for crashes as possible.

    All but O(20) runs have had reconstruction code fixes for their crashes. These will need the underlying checkpointing problem fixed.

SP6

    Started around 12th February (1385M of 1.2B done)

SP6/Run4 Skimming

    Run4 (110ifb done, 110ifb merged) SP6 (421M events done, 346M merged)
  1. Production Issue: SP6 skims also affected by SP5 skim issue
  2. Production Issue: Conditions database broken in Padova (Guglielmo De Nardo, Fulvio Galeazzi) (Updated 1/07/2004)

    Jobs have not been running successfully since a conditions database update in Padova for the skimming. Many jobs run there before the update were not good because the conditions were stale.

    There was an incomplete sweep into the federation. Fulvio will do a full sweep to fix this.

    The sweep fixed the problem.

  3. Development Issue: Tag information incorrect (Pete Elmer, Tulay Donszelmann) (Updated 8/07/2004)

    It was discovered that, in the data added to make the Blue Square dataset, some tag bits were set incorrectly. Circumstantially it seems that this problem only occurs in collections that were produced from subskimmed collections (this is the mechanism used to divide a large job into many smaller parts). The reason for the problem isn't understood and it is still being investigated. It is suspected that much of the SP5 and SP6 merges will therefore also have this problem, as they used the subskim mechanism.

    The current understanding is outlined in this post.

    The problem is suspected to be due to a problem discovered a while ago and fixed in 14.5.2 (aka analysis-21). However, it was thought this fix was not needed for the existing production. This was prior to the switch to split-skimming. The fix will be backported to 14.4.3 and tested.

    Fix deployed in 14.4.3d (in addition it was requested for the future builds 14.4.4a and 14.4.5). Initial testing looks good. Bringing submerges onto disk to begin production remerge. Will mark the existing merges as bad and add the new merges into the dataset to make a BlueSquarePrime dataset. This would include everything that was in BlueSquare plus some other data that can't easily be excluded (for example runs taken earlier in the year that were not available when BlueSquare was created but are now and some runs in the final merge of BlueSquarePrime).

    Confirmed fix works. Remerging ongoing (3fb-1 available now). Will mark old merges as bad.

    Merging ongoing. Problem has been fixed.


Stephen J. Gowdy, Pete Elmer
Created: 28th January 2004