There are three distinct timers that may included in a framework
job. These are AppActionTime,
AppActionSumTime, and AppActionConfTime.
All of these actions are contained in the
RecoUtils
package, as is the
documentation.
AppActionTime was written in 1997 by Marc Turcotte, and provides a rudimentary way to detemine how long modules took to execute. The output however, was rather verbose and difficult to parse, as it extended over multiple lines.
APPExecutable: GenBuildEnv spent 0.00 sec(s) in Begin Job
and spent 0.00 sec(s) in all steps this far.
APPExecutable: DhpTupleEnv spent 0.02 sec(s) in Begin Job
and spent 0.02 sec(s) in all steps this far.
AppActionSumTime was designed as a replacement to AppActionTime in order to make the output a bit more readable. AppAST is now in common use throughout the collaboration.
AppAST: SumBeg | NEvt | SumEvent | AvgEvent (ms) |SumEnd | Module AppAST: ------- |-------- |---------- |--------------- |------- |------------------ AppAST: 0.01 | 0 | 0.00 | 0.00000 | 0.00 |GenBuildEnv AppAST: 0.01 | 0 | 0.00 | 0.00000 | 0.00 |DhpTupleEnv AppAST: 0.00 | 100 | 0.05 | 0.50000 | 0.00 |OepFEventSetupModule ...
Parsing this, notice that GenBuildEnv took some time in beginJob and beginRun, but was not called in the event level. OepFEventSetupModule was called at the event level 100 times, and had a total of 0.05 seconds of CPU used. For clarity, this is averaged in the AvgEvent column and reported in milliseconds.
Both AppActionTime and AppActionSumTime report the CPU time used by the Framework modules. However some BaBar internals (e.g., Database access) are limited by wall-clock time. AppActionConfTime was designed to be a general, all-purpose timer that is configurable at compile time. It has several modes of operation:
timex or time
There are a few reasons for this. 1) AppAST doesn't keep track of regular Framework overhead, 2) AppAST doesn't time ~AbsEvent, and 3) AppAST doesn't keep track of 'other' transisitons
Profiling is a method of determining in exactly which routines or portions of routines time is spent. Unlike the various actions described above, which can be added in AppUserBuild, profiling is compiled in to each module.
Compiling for a profiled executable is somewhat tedious. If you are lucky enough to have a profiled build done, then use it for your base release and check out the modules you are going to change on top of it. If you aren't lucky, then this is going to take a while.
For the lucky users: gmake lib on all the modules
you have checked out with the following options:
CXXFLAGS=-pg ROPT="-Optimize -noDebug" . Link your
program with the same options.
For you unlucky users, check out the packages you are
changing, as well as the binary you are linking. gmake
lib normally. Now, gmake Package.bin LDFLAGS=-m >
whatever.map Examine this map file and check out any
libraries that are explicitly mentioned. Ignore Roguewave
and Objectivity. For something small like L3, this is about 130
libraries. Elf will be rather larger. (You probably should wait
for a profile build.) Start to compile what you have checked out
as above. Come back in a day or so.
If you are only trying to profile a limited set of code, and are willing to live with some under-reporting, check out the packages you are interested in. Additionally, check out CLHEP, ProxyDict, and ErrLogger. Compile these along with your packages.
Run your program normally. There will be a file called
gmon.out created.
Parse your output using
gprof -C <path to binary> gmon.out > gprof.log
Don't send this to stdout, there will be about 8 megabytes of
output. Edit the gprof.log file in your favorite
editor with a nice wide screen.
And now the fun begins! The output has the following format:
8000 APPSequence::event(AbsEvent*) [9]
[__0fLAPPSequenceFeventP6IAbsEvent]
0.24 14.95 25000/25000 AppFramework::event(AbsEvent*) [8]
[__0fMAppFrameworkFeventP6IAbsEvent]
[9] 49.2 0.24 14.95 25000+8000 APPSequence::event(AbsEvent*) [9]
[__0fLAPPSequenceFeventP6IAbsEvent]
0.79 2.68 1000/1000 L3Dch::event(AbsEvent*) [10]
[__0fFL3DchFeventP6IAbsEvent]
0.36 2.48 1000/1000 L3DDchTC::event(AbsEvent*) [12]
[__0fIL3DDchTCFeventP6IAbsEvent]
0.24 1.04 1000/1000 L3EmcDigiMaker::event(AbsEvent*) [25]
[__0fOL3EmcDigiMakerFeventP6IAbsEvent]
0.03 0.85 96435/96435 AppActionConfTime::afterEvent(APPExecutable*, AbsEvent*) [32]
[__0fRAppActionConfTimeKafterEventP6NAPPExecutableP6IAbsEvent]
0.02 0.80 96435/96435 AppActionConfTime::beforeEvent(APPExecutable*, AbsEvent*) [33]
[__0fRAppActionConfTimeLbeforeEventP6NAPPExecutableP6IAbsEvent]
0.38 0.38 1000/1000 L3EClusterFinder::event(AbsEvent*) [35]
[__0fQL3EClusterFinderFeventP6IAbsEvent]
Filter out the [mangled] lines, they are not important.
8000 APPSequence::event(AbsEvent*) [9]
0.24 14.95 25000/25000 AppFramework::event(AbsEvent*) [8]
[9] 49.2 0.24 14.95 25000+8000 APPSequence::event(AbsEvent*) [9]
0.79 2.68 1000/1000 L3Dch::event(AbsEvent*) [10]
0.36 2.48 1000/1000 L3DDchTC::event(AbsEvent*) [12]
0.24 1.04 1000/1000 L3EmcDigiMaker::event(AbsEvent*) [25]
0.03 0.85 96435/96435 AppActionConfTime::afterEvent(APPExecutable*, AbsEvent*) [32]
0.02 0.80 96435/96435 AppActionConfTime::beforeEvent(APPExecutable*, AbsEvent*) [33]
0.38 0.38 1000/1000 L3EClusterFinder::event(AbsEvent*) [35]
AppSequence::event :
Note that every routine is associated with an internal gprof index. L3Dch has [10], L3DDchTC has [11], etc. These indices are allocated by total cpu time spent in that module (including daughter modules). [1] and [2] will be _start and main (usually) for this reason. Now look at one of the other modules in detail.
[10] 11.2 0.79 2.68 1000 L3Dch::event(AbsEvent*) [10]
0.00 1.32 5813/5813 L3DFinderData::L3DDoFit(int, const L3DTableNode**, int, int*, int, HepAList<L3DTrk>*, double*) [23]
0.00 1.01 1000/2001 static Ifd<L3DTrkTableT>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [14]
0.06 0.05 32529/32529 L3DFinderData::L3DSimpleTCT0(double, double, double, double, double, int, int, int, int) [173]
0.07 0.00 2000/2000 L3DFinderData::makeWork(void) [227]
...
0.00 1.32 5813/5813 L3Dch::event(AbsEvent*) [10]
[23] 4.3 0.00 1.32 5813 L3DFinderData::L3DDoFit(int, const L3DTableNode**, int, int*, int, HepAList*, double*) [23]
0.11 1.18 5813/5813 L3DFinderData::L3DFit5P(int, double*, L3DHelixTrk&, double*, double*, int*, int, int*, int*, int*, int*, double*) [24]
0.00 0.03 1171/54387 HepAListBase::newAppend(int) [22]
0.00 0.00 1171/1171 L3DTrk::L3DTrk(const L3DHelixTrk&, int, const int*, const int*) [7676]
0.11 1.18 5813/5813 L3DFinderData::L3DDoFit(int, const L3DTableNode**, int, int*, int, HepAList*, double*) [23]
[24] 4.2 0.11 1.18 5813 L3DFinderData::L3DFit5P(int, double*, L3DHelixTrk&, double*, double*, int*, int, int*, int*, int*, int*, double*) [24]
0.57 0.14 102890/102890 L3DFinderData::phi(int, int, double, double, double&) [36]
0.18 0.17 10330/10330 L3DFinderData::L3DTrkFit(int, L3DHelixTrk&, double*, double*, L3DHelixTrk&, double*) [68]
L3Dch is taking 3.37 milliseconds per event, on average. 1.3 ms is spent in L3DFinderData::L3DDoFit and 1 second in Ifd::get. L3DDoFit spends much of it's time in L3DFit5P, which (after reading the source) seems logical.
Interesting, Ifd is taking a long time...
0.00 0.00 1/2001 L3DDchTC::beginJob(AbsEvent*) [609]
0.00 1.01 1000/2001 L3Dch::event(AbsEvent*) [10]
0.00 1.01 1000/2001 L3DDchTC::event(AbsEvent*) [12]
[14] 6.6 0.00 2.03 2001 static Ifd<L3DTrkTableT>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [14]
0.00 2.01 2001/2001 IfdDataProxyTemplate<L3DTrkTableT>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [15]
...
0.00 2.01 2001/2001 static Ifd<L3DTrkTableT>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [14]
[15] 6.5 0.00 2.01 2001 IfdDataProxyTemplate<L3DTrkTableT>>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [15]
0.00 2.01 1/1 BdbConfigTransientProxyBase<L3DTrkTableT, L3DTrkList>::faultHandler(IfdProxyDict*, const IfdKey&, AbsArg&) [16]
Here see the main caveat of using gprof:
The actual call to BdbConfigTransientProxyBase<L3DTrkTableT, L3DTrkList>::faultHandler is called identically once, probably underneath L3DDchTC::beginJob. The CPU time used (2 seconds) is accounted to the parent, Ifd::get. The CPU for Ifd::get is accounted to it's parents weighted by the number of calls. Roughly half get accounted to L3DDchTC::event and half to L3Dch::event. Almost none gets sent to L3DDchTC::beginJob, which actually made the call that took the CPU.
Recently, (4 February, 2000) I discovered another problem with using gprof. At the end of an Elf job built with the 8.5.0 profiled build, the following cryptic message was printed, and the gmon.out file wasn't generated.
mcount: tos overflow
The original Berkeley code for gmon.c has a hardcoded maximum number of modules it can profile, 65534. As the L3 job discussed above has ~11k modules, I can believe that Elf has more than six times that number. This is going to be a problem, at this point, I am not aware of a solution on Sun. Recent patches to gcc may have fixed this, but we are using Sun's CC compiler on SunOS
Profiling is broken with the current version we are using of cxx on OSF, this was reported and was fixed in a newer version of the compiler.
There are two main methods in BaBar to determine code performance, module level timing (AppActionConfTime) and profiling (gprof). Each has drawbacks, AppACT can't see Framework overhead, and gprof does certain accounting incorrectly. Each has benefits, AppAST is easy to use and gives a point of reference, gprof gives lots of detailed information showing exactly where the problems are.
Using these in combination will give the most information. However, you cannot use AppACT/AppAST on profiled builds. The profiler adds significant time to the actual execution speed, which it factors out later. Your best bet is to have two compiled versions, one with profiling, one without.