If you got paged
with one of these messages:
Root file copy failed..
can't stage
Requesting_xtc timeout.
Timeout MonitorProcessing.


For a Root file copy failure, run farminfo to find which node had the problem, then set the NP on that node back to the state CopyFilesToServerRoo:
../bin/$BFARCH/OprCmd.pl -iDebug,User,Farm -noprserv0x farminfo XXx
You will see one (or more) nodes that are
still in 'Started', e.g.

IP - Port - Elf ID - Status
...
10.100.2.117 - 1487 - 034 - Finished
10.100.2.123 - 1353 - 051 - Started
10.100.2.111 - 1560 - 019 - Finished

Sometimes the copy fails on one node while the others are still copying,
so wait until all the nodes are Finished except the ones
that failed. Alternatively, you can get the elf number of the failed node from the email.
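
When the farm is large, picking the stuck rows out of the listing by eye is error prone. Below is a minimal sketch that filters pasted farminfo output for nodes that are not yet Finished. It assumes the rows look exactly like the sample above ("IP - Port - Elf ID - Status"); the function names parse_farminfo and unfinished are made up for this illustration and are not part of OprCmd.pl.

```python
import re

# Row layout assumed from the sample listing: "IP - Port - Elf ID - Status".
ROW = re.compile(r"^\s*(\d+\.\d+\.\d+\.\d+)\s+-\s+(\d+)\s+-\s+(\d+)\s+-\s+(\w+)\s*$")


def parse_farminfo(text):
    """Return (ip, port, elf_id, status) tuples, one per node row.

    Header lines, '...' lines and anything else that does not match
    the assumed row layout are skipped.
    """
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            ip, port, elf, status = m.groups()
            # Keep the elf ID as a string so leading zeros survive.
            rows.append((ip, int(port), elf, status))
    return rows


def unfinished(rows):
    """Nodes whose status is anything other than 'Finished'."""
    return [r for r in rows if r[3] != "Finished"]


SAMPLE = """\
IP - Port - Elf ID - Status
...
10.100.2.117 - 1487 - 034 - Finished
10.100.2.123 - 1353 - 051 - Started
10.100.2.111 - 1560 - 019 - Finished
"""

if __name__ == "__main__":
    for ip, port, elf, status in unfinished(parse_farminfo(SAMPLE)):
        print(ip, port, elf, status)
```

A node listed here is only a candidate: it may simply still be copying, so confirm it with npcurstate before forcing any state.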

Check the current state of the NP on that node, e.g.

pulliam@bbr-user:ER3:prod/workdir> ../bin/Linux24RH72_i386_gcc2953/OprCmd.pl -iFarm,User,Debug -n10.100.2.123 -p 1353 npcurstate
connected
Message [OprMessage=HASH(0x86cec78)]:
From : OprFSMFwkModule:NP
To : OprMpxServer
Content : AnswerGetCurrentState
Attributes : FatalError

The FatalError attribute means this is the right node.
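
If you want to check the reply in a script rather than by eye, here is a rough sketch that reads the "Key : value" lines of a reply like the one above and tests for FatalError. It assumes the reply format is exactly what the sample shows; parse_reply and is_fatal are hypothetical helper names, not OprCmd.pl features.

```python
def parse_reply(text):
    """Collect the 'Key : value' lines of a pasted reply into a dict.

    Lines without ' : ' (such as 'connected' or the 'Message [...]:'
    line) are ignored. Format assumed from the sample session only.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_fatal(fields):
    """True if the reply answers npcurstate with the FatalError attribute."""
    return (fields.get("Content") == "AnswerGetCurrentState"
            and fields.get("Attributes") == "FatalError")


SAMPLE_REPLY = """\
connected
Message [OprMessage=HASH(0x86cec78)]:
From : OprFSMFwkModule:NP
To : OprMpxServer
Content : AnswerGetCurrentState
Attributes : FatalError
"""

if __name__ == "__main__":
    reply = parse_reply(SAMPLE_REPLY)
    print(reply["From"], is_fatal(reply))
```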

Then set the state:

pulliam@bbr-user:ER3:prod/workdir> ../bin/Linux24RH72_i386_gcc2953/OprCmd.pl -iFarm,User,Debug -n10.100.2.123 -p 1353 npsetstate CopyFilesToServerRoo
connected
Message [OprMessage=HASH(0x86cef14)]:
From : OprFSMFwkModule:NP
To : OprMpxServer
Content : AnswerSetCurrentState
Attributes : OK forced transition to CopyFilesToServerRoo

And check again:

pulliam@bbr-user:ER3:prod/workdir> ../bin/Linux24RH72_i386_gcc2953/OprCmd.pl -iFarm,User,Debug -n10.100.2.123 -p 1353 npcurstate
connected
Message [OprMessage=HASH(0x86cec78)]:
From : OprFSMFwkModule:NP
To : OprMpxServer
Content : AnswerGetCurrentState
Attributes : CopyFilesToServerRoo

After that the run should finish on its own.
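
The check / set / check sequence above can be written down as three command lines built from the node's IP and port. The sketch below only builds and prints the commands; it does not run them. The options are copied from the session above, the OprCmd.pl path is the one from that session (adjust it for your $BFARCH), and np_command and recovery_commands are hypothetical helper names.

```python
import shlex

# Path and options copied from the example session; adjust for your $BFARCH.
OPRCMD = "../bin/Linux24RH72_i386_gcc2953/OprCmd.pl"
TARGET_STATE = "CopyFilesToServerRoo"


def np_command(ip, port, *action):
    """Build the argv for one OprCmd.pl call against the NP at ip:port."""
    return [OPRCMD, "-iFarm,User,Debug", f"-n{ip}", "-p", str(port), *action]


def recovery_commands(ip, port):
    """The three steps from the text: check state, force state, check again."""
    return [
        np_command(ip, port, "npcurstate"),
        np_command(ip, port, "npsetstate", TARGET_STATE),
        np_command(ip, port, "npcurstate"),
    ]


if __name__ == "__main__":
    # Print only, so each line can be reviewed before it is pasted into a shell.
    for argv in recovery_commands("10.100.2.123", 1353):
        print(shlex.join(argv))
```

Printing instead of executing is deliberate: npsetstate forces a transition, so it is safer to look at the first npcurstate reply before running the second command.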


Subject: PC1 run 44402: can't stage 44402.
Run 44402 requested but unable to transfer! (disk space?)
This part of the manual still needs to be edited. For now, search HyperNews for a solution.
 

Subject: PC1 run 44481: Requesting_xtc timeout.
Farm Manager has been waiting for xtc file for more than 1800 seconds.
This part of the manual still needs to be edited. For now, search HyperNews for a solution.
 

Subject: PC1 run 44508: Timeout MonitorProcessing.
Please check!
Previous state: StartMonitorRun
This part of the manual still needs to be edited. For now, search HyperNews for a solution.
 
Last modified: Thu Sep 16 17:48:09 PDT 2004 by Olga Igonkina