CM2 Production Status
This page documents the status of known issues with the deployment
of the complete Computing Model 2 system, primarily the conversion and
skimming for the summer conferences. Some old problems have been
removed but can still be read here and some more
here.
20th September 2004
Data Conversion
Complete at the >99.5% level (34 unrecoverable, 101 not good, 17 also
not good and 58 not good that may be recovered, out of 13149)
Data Skimming
Started around 22nd January (done: 12938 out of 12939, 1 cannot be skimmed, 1728M events merged: Run1 297533928, Run2 958020236, Run3 316399434+156777829)
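The merged total quoted above can be cross-checked by adding up the per-run figures. This is only a minimal sketch: it assumes the two Run3 numbers are two parts of Run3 that are meant to be summed, and the names `merged_events` and `total_merged` are illustrative, not part of any production tool.

```python
# Per-run merged event counts as quoted in the status line.
# Assumption: the two Run3 figures are parts of Run3 and are added together.
merged_events = {
    "Run1": [297533928],
    "Run2": [958020236],
    "Run3": [316399434, 156777829],
}


def total_merged(counts):
    """Sum every per-run figure into a single event count."""
    return sum(sum(parts) for parts in counts.values())


total = total_merged(merged_events)
# Report the exact count and the truncated count in millions of events.
print(f"{total} events, i.e. {total // 10**6}M")
```

Under that assumption the sum agrees with the 1728M figure given in the status line.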
Production Issue: Some runs missing from Green Circle
(David Hutchcroft) (Updated 27/07/2004)
The following runs (amounting to 280ipb) are missing from the
merges they are expected to be in. The reason is not yet understood,
but temporary network or file server problems probably led to them
being missed during the merge.
22214
22308
26777
26789
29376
33592
34488
38150
38369
38374
38399
38408
38466
38485
38501
38558
38947
39093
These runs were reskimmed and added to the BlueSquarePrime dataset
(and BlackDiamond).
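A check for runs that are absent from their expected merges can be sketched as a set difference. This is a hedged illustration, not the Bookkeeping tooling: the function name `missing_from_merges`, the merge names and run 22215 are made up; only runs 22214 and 22308 come from the list above.

```python
def missing_from_merges(expected_runs, merges):
    """Return the expected runs that appear in none of the merges, sorted."""
    merged = set()
    for runs in merges.values():
        merged.update(runs)
    return sorted(set(expected_runs) - merged)


# Illustrative input only: 22214 and 22308 are taken from the list above,
# run 22215 and the merge names are hypothetical.
expected = [22214, 22215, 22308]
merges = {"merge-A": {22215}, "merge-B": set()}
missing = missing_from_merges(expected, merges)
print(missing)
```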
SP5 Conversion
Restarted on 3rd March (1.38B
done, target is 1.4B)
- Production Issue: Runs found in two collections (Will Roethel) (Added 19/06/2004)
A set of runs has been found to be in two collections. The
Bookkeeping believes they are in only one of them. Needs to be
understood.
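Finding runs that sit in more than one collection amounts to inverting the collection-to-runs mapping and keeping the runs with several owners. The sketch below assumes the contents of each collection are available as a plain list of run numbers; the function name, collection names and run numbers are all hypothetical.

```python
from collections import defaultdict


def runs_in_multiple_collections(collections):
    """Map each run found in more than one collection to those collections."""
    seen = defaultdict(list)
    for name, runs in collections.items():
        # set() guards against a run being listed twice in one collection.
        for run in set(runs):
            seen[run].append(name)
    return {run: sorted(names) for run, names in seen.items() if len(names) > 1}


# Hypothetical collection names and run numbers, for illustration only.
example = {"coll-A": [1001, 1002], "coll-B": [1002, 1003]}
duplicates = runs_in_multiple_collections(example)
print(duplicates)
```

The resulting list could then be compared against what the Bookkeeping records for each run.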
SP5 Skimming
Done.
Production Issue: Skimming one merged
collection (Will Roethel) (Updated 1/06/2004)
As the SP collections are roughly 150k events each, they take many
days to skim. In addition, the local disks on the noma are not large
enough to contain two skim outputs. Working on updates to Task
Management to allow it to split jobs up. For the moment only using
toris for SP skims.
Jobs have been split up into pieces of ~4k events to avoid this problem.
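The splitting can be pictured as cutting one collection into consecutive event ranges. This is a minimal sketch assuming a collection of exactly 150000 events and pieces of exactly 4000 events (the text gives both only approximately); `split_job` is a hypothetical helper, not the Task Management interface.

```python
def split_job(total_events, chunk_size=4000):
    """Return (first_event, n_events) pairs that cover the collection in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (first, min(chunk_size, total_events - first))
        for first in range(0, total_events, chunk_size)
    ]


# Assumed sizes: 150000 events per collection, 4000 events per piece.
pieces = split_job(150_000)
print(len(pieces), pieces[0], pieces[-1])
```

With these assumed sizes one collection becomes 38 pieces, the last of which holds 2000 events.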
Production Issue: NFS Limitation (Will Roethel, Andreas Petzold)
(Updated 19/06/2004)
Have been having many problems with NFS servers causing jobs to
crash. Working on adding more servers and updating tools to avoid too
much load on one server.
Serialised the sub-merges (from the split skim technology) to solve
this issue.
Production Issue: Some merge collections lost (Will Roethel,Remi
Mommsen) (Updated 19/06/2004)
It was discovered that for some merge collections the rollover
files were not in HPSS. They are not on disk either. It looks like
some merges will need to be redone from the individual skims which
have been backed up to tape. The reason for this problem hasn't been
discovered yet.
It was the first five merges that had this problem. They have been
redone (as new merges) and the old ones will be marked bad in the
Bookkeeping (no one can use them anyway due to the missing files).
Run4 Reprocessing
PC done (some cleanup going on), ER done (O(20) runs awaiting fix)
- Production Issue: Merge of node files didn't use KanRecover (Teela Pulliam) (Updated 8/06/2004)
The intention was to use KanRecover to fix any problems with ROOT
files from Elf nodes that crashed. This didn't happen (and it is not
clear it would have actually helped), so the merge application used
the ROOT automatic recovery. Some runs lost metadata when this
happened (just for the crashed node's events). May need to reprocess
runs that had an Elf node crash (~10%) and then update PR so this
isn't an issue in the future.
The 10% figure was actually based on more than one Elf crashing in
a run; 20% of runs have one or more crashed Elf nodes. For the moment
these runs are marked as "problem" runs. They need to be classified
into different crash modes. Will also try skimming various kinds to
determine if we always have this problem with lost metadata. Will also
deploy 14.4.1(a) ASAP to avoid as many reasons for crashes as
possible.
All but O(20) runs have had reconstruction code fixes for their
crashes. These will need the underlying checkpointing problem
fixed.
SP6
Started around 12th February (1385M of 1.2B done)
SP6/Run4 Skimming
Run4 (110ifb done,
110ifb merged)
SP6 (421M events done,
346M merged)
Production Issue: SP6 skims also affected by SP5 skim issue
Production Issue: Conditions database broken in Padova
(Guglielmo De Nardo, Fulvio Galeazzi) (Updated 1/07/2004)
Jobs are not running successfully since a conditions database
update in Padova for the skimming. Many jobs run there before the
update were not good due to the conditions being stale.
There was an incomplete sweep into the federation. Fulvio will do a
full sweep to fix this.
The sweep fixed the problem.
Development Issue: Tag information incorrect (Pete Elmer, Tulay
Donszelmann) (Updated 8/07/2004)
It was discovered that, in the data added to make the Blue Square
dataset, some tag bits were set incorrectly. Circumstantially it
seems that this problem only occurs in collections that were produced
from subskimmed collections (this is the mechanism used to divide a
large job into many smaller parts). The reason for the problem isn't
understood and it is still being investigated. It is suspected that
much of the SP5 and SP6 merges will therefore also have this problem
as they used the subskim mechanism.
The current understanding is outlined in this post.
The problem is suspected to be due to a problem discovered a while
ago and fixed in 14.5.2 (aka analysis-21). However, it was thought
this fix was not needed for the existing production. This was prior to
the switch to split-skimming. The fix will be backported to 14.4.3 and
tested.
Fix deployed in 14.4.3d (in addition it was requested for the
future builds 14.4.4a and 14.4.5). Initial testing looks
good. Bringing submerges onto disk to begin production remerge. Will
mark the existing merges as bad and add the new merges into the
dataset to make a BlueSquarePrime dataset. This would include
everything that was in BlueSquare plus some other data that can't
easily be excluded (for example runs taken earlier in the year that
were not available when BlueSquare was created but are now and some
runs in the final merge of BlueSquarePrime).
Confirmed fix works. Remerging ongoing (3 fb^-1 available now). Will
mark old merges as bad.
Merging ongoing. Problem has been fixed.
Stephen J. Gowdy, Pete Elmer
Created: 28th January 2004