Code Profiling and Performance Measurement

General Timing

Code Profiling

Summary


General Timing

There are three distinct timers that may included in a framework job. These are AppActionTime, AppActionSumTime, and AppActionConfTime. All of these actions are contained in the RecoUtils package, as is the documentation.

AppActionTime

AppActionTime was written in 1997 by Marc Turcotte, and provides a rudimentary way to detemine how long modules took to execute. The output however, was rather verbose and difficult to parse, as it extended over multiple lines.

APPExecutable: GenBuildEnv          spent 0.00      sec(s) in Begin Job
        and spent 0.00      sec(s) in all steps this far.
APPExecutable: DhpTupleEnv          spent 0.02      sec(s) in Begin Job
        and spent 0.02      sec(s) in all steps this far.

AppActionSumTime

AppActionSumTime was designed as a replacement to AppActionTime in order to make the output a bit more readable. AppAST is now in common use throughout the collaboration.

AppAST:  SumBeg  |  NEvt   | SumEvent  | AvgEvent (ms)  |SumEnd  |      Module      
AppAST:  ------- |-------- |---------- |--------------- |------- |------------------
AppAST:     0.01 |       0 |      0.00 |        0.00000 |   0.00 |GenBuildEnv
AppAST:     0.01 |       0 |      0.00 |        0.00000 |   0.00 |DhpTupleEnv
AppAST:     0.00 |     100 |      0.05 |        0.50000 |   0.00 |OepFEventSetupModule
...

Parsing this, notice that GenBuildEnv took some time in beginJob and beginRun, but was not called in the event level. OepFEventSetupModule was called at the event level 100 times, and had a total of 0.05 seconds of CPU used. For clarity, this is averaged in the AvgEvent column and reported in milliseconds.

AppActionConfTime

Both AppActionTime and AppActionSumTime report the CPU time used by the Framework modules. However some BaBar internals (e.g., Database access) are limited by wall-clock time. AppActionConfTime was designed to be a general, all-purpose timer that is configurable at compile time. It has several modes of operation:

Problems and Caveats


Code Profiling

Profiling is a method of determining in exactly which routines or portions of routines time is spent. Unlike the various actions described above, which can be added in AppUserBuild, profiling is compiled in to each module.

Compiling

Compiling for a profiled executable is somewhat tedious. If you are lucky enough to have a profiled build done, then use it for your base release and check out the modules you are going to change on top of it. If you aren't lucky, then this is going to take a while.

For the lucky users: gmake lib on all the modules you have checked out with the following options: CXXFLAGS=-pg ROPT="-Optimize -noDebug" . Link your program with the same options.

For you unlucky users, check out the packages you are changing, as well as the binary you are linking. gmake lib normally. Now, gmake Package.bin LDFLAGS=-m > whatever.map Examine this map file and check out any libraries that are explicitly mentioned. Ignore Roguewave and Objectivity. For something small like L3, this is about 130 libraries. Elf will be rather larger. (You probably should wait for a profile build.) Start to compile what you have checked out as above. Come back in a day or so.

If you are only trying to profile a limited set of code, and are willing to live with some under-reporting, check out the packages you are interested in. Additionally, check out CLHEP, ProxyDict, and ErrLogger. Compile these along with your packages.

Run your program normally. There will be a file called gmon.out created.

gprof

Parse your output using

    gprof -C <path to binary> gmon.out > gprof.log

Don't send this to stdout, there will be about 8 megabytes of output. Edit the gprof.log file in your favorite editor with a nice wide screen.

File Format

And now the fun begins! The output has the following format:

                                    8000             APPSequence::event(AbsEvent*) [9]
                                                     [__0fLAPPSequenceFeventP6IAbsEvent]
                0.24       14.95   25000/25000       AppFramework::event(AbsEvent*) [8]
                                                     [__0fMAppFrameworkFeventP6IAbsEvent]
[9]     49.2    0.24       14.95   25000+8000    APPSequence::event(AbsEvent*) [9]
                                                 [__0fLAPPSequenceFeventP6IAbsEvent]
                0.79        2.68    1000/1000        L3Dch::event(AbsEvent*) [10]
                                                     [__0fFL3DchFeventP6IAbsEvent]
                0.36        2.48    1000/1000        L3DDchTC::event(AbsEvent*) [12]
                                                     [__0fIL3DDchTCFeventP6IAbsEvent]
                0.24        1.04    1000/1000        L3EmcDigiMaker::event(AbsEvent*) [25]
                                                     [__0fOL3EmcDigiMakerFeventP6IAbsEvent]
                0.03        0.85   96435/96435       AppActionConfTime::afterEvent(APPExecutable*, AbsEvent*) [32]
                                                     [__0fRAppActionConfTimeKafterEventP6NAPPExecutableP6IAbsEvent]
                0.02        0.80   96435/96435       AppActionConfTime::beforeEvent(APPExecutable*, AbsEvent*) [33]
                                                     [__0fRAppActionConfTimeLbeforeEventP6NAPPExecutableP6IAbsEvent]
                0.38        0.38    1000/1000        L3EClusterFinder::event(AbsEvent*) [35]
                                                     [__0fQL3EClusterFinderFeventP6IAbsEvent]

Filter out the [mangled] lines, they are not important.

                                    8000             APPSequence::event(AbsEvent*) [9]
                0.24       14.95   25000/25000       AppFramework::event(AbsEvent*) [8]
[9]     49.2    0.24       14.95   25000+8000    APPSequence::event(AbsEvent*) [9]
                0.79        2.68    1000/1000        L3Dch::event(AbsEvent*) [10]
                0.36        2.48    1000/1000        L3DDchTC::event(AbsEvent*) [12]
                0.24        1.04    1000/1000        L3EmcDigiMaker::event(AbsEvent*) [25]
                0.03        0.85   96435/96435       AppActionConfTime::afterEvent(APPExecutable*, AbsEvent*) [32]
                0.02        0.80   96435/96435       AppActionConfTime::beforeEvent(APPExecutable*, AbsEvent*) [33]
                0.38        0.38    1000/1000        L3EClusterFinder::event(AbsEvent*) [35]

AppSequence::event :

  1. has an internal gprof index of [9]
  2. takes 49% of total execution cpu
  3. is called by AppFramework::event and APPSequence::event
  4. is called 25k times in total
  5. calls at least 6 routines
  6. is active for a total time of 15.19 CPU seconds, 14.95 seconds of which is spent in called routines

Note that every routine is associated with an internal gprof index. L3Dch has [10], L3DDchTC has [11], etc. These indices are allocated by total cpu time spent in that module (including daughter modules). [1] and [2] will be _start and main (usually) for this reason. Now look at one of the other modules in detail.

[10]    11.2    0.79        2.68    1000         L3Dch::event(AbsEvent*) [10]
                0.00        1.32    5813/5813        L3DFinderData::L3DDoFit(int, const L3DTableNode**, int, int*, int, HepAList<L3DTrk>*, double*) [23]
                0.00        1.01    1000/2001        static Ifd<L3DTrkTableT>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [14]
                0.06        0.05   32529/32529       L3DFinderData::L3DSimpleTCT0(double, double, double, double, double, int, int, int, int) [173]
                0.07        0.00    2000/2000        L3DFinderData::makeWork(void) [227]
...
                0.00        1.32    5813/5813        L3Dch::event(AbsEvent*) [10]
[23]     4.3    0.00        1.32    5813         L3DFinderData::L3DDoFit(int, const L3DTableNode**, int, int*, int, HepAList*, double*) [23]
                0.11        1.18    5813/5813        L3DFinderData::L3DFit5P(int, double*, L3DHelixTrk&, double*, double*, int*, int, int*, int*, int*, int*, double*) [24]
                0.00        0.03    1171/54387       HepAListBase::newAppend(int) [22]
                0.00        0.00    1171/1171        L3DTrk::L3DTrk(const L3DHelixTrk&, int, const int*, const int*) [7676]


                0.11        1.18    5813/5813        L3DFinderData::L3DDoFit(int, const L3DTableNode**, int, int*, int, HepAList*, double*) [23]
[24]     4.2    0.11        1.18    5813         L3DFinderData::L3DFit5P(int, double*, L3DHelixTrk&, double*, double*, int*, int, int*, int*, int*, int*, double*) [24]
                0.57        0.14  102890/102890      L3DFinderData::phi(int, int, double, double, double&) [36]
                0.18        0.17   10330/10330       L3DFinderData::L3DTrkFit(int, L3DHelixTrk&, double*, double*, L3DHelixTrk&, double*) [68]

L3Dch is taking 3.37 milliseconds per event, on average. 1.3 ms is spent in L3DFinderData::L3DDoFit and 1 second in Ifd::get. L3DDoFit spends much of it's time in L3DFit5P, which (after reading the source) seems logical.

Interesting, Ifd is taking a long time...

                0.00        0.00       1/2001        L3DDchTC::beginJob(AbsEvent*) [609]
                0.00        1.01    1000/2001        L3Dch::event(AbsEvent*) [10]
                0.00        1.01    1000/2001        L3DDchTC::event(AbsEvent*) [12]
[14]     6.6    0.00        2.03    2001         static Ifd<L3DTrkTableT>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [14]
                0.00        2.01    2001/2001        IfdDataProxyTemplate<L3DTrkTableT>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [15]

...

                0.00        2.01    2001/2001        static Ifd<L3DTrkTableT>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [14]
[15]     6.5    0.00        2.01    2001         IfdDataProxyTemplate<L3DTrkTableT>>::get(IfdProxyDict*, const IfdKey&, AbsArg&) [15]
                0.00        2.01       1/1           BdbConfigTransientProxyBase<L3DTrkTableT, L3DTrkList>::faultHandler(IfdProxyDict*, const IfdKey&, AbsArg&) [16]

Here see the main caveat of using gprof:

Accounting to parent modules is weighted by number of calls.

The actual call to BdbConfigTransientProxyBase<L3DTrkTableT, L3DTrkList>::faultHandler is called identically once, probably underneath L3DDchTC::beginJob. The CPU time used (2 seconds) is accounted to the parent, Ifd::get. The CPU for Ifd::get is accounted to it's parents weighted by the number of calls. Roughly half get accounted to L3DDchTC::event and half to L3Dch::event. Almost none gets sent to L3DDchTC::beginJob, which actually made the call that took the CPU.

Profile Problems

  1. Recently, (4 February, 2000) I discovered another problem with using gprof. At the end of an Elf job built with the 8.5.0 profiled build, the following cryptic message was printed, and the gmon.out file wasn't generated.

    mcount: tos overflow

    The original Berkeley code for gmon.c has a hardcoded maximum number of modules it can profile, 65534. As the L3 job discussed above has ~11k modules, I can believe that Elf has more than six times that number. This is going to be a problem, at this point, I am not aware of a solution on Sun. Recent patches to gcc may have fixed this, but we are using Sun's CC compiler on SunOS

  2. Profiling is broken with the current version we are using of cxx on OSF, this was reported and was fixed in a newer version of the compiler.

Summary


There are two main methods in BaBar to determine code performance, module level timing (AppActionConfTime) and profiling (gprof). Each has drawbacks, AppACT can't see Framework overhead, and gprof does certain accounting incorrectly. Each has benefits, AppAST is easy to use and gives a point of reference, gprof gives lots of detailed information showing exactly where the problems are.

Using these in combination will give the most information. However, you cannot use AppACT/AppAST on profiled builds. The profiler adds significant time to the actual execution speed, which it factors out later. Your best bet is to have two compiled versions, one with profiling, one without.


References

  1. AppActionConfTimer
  2. GNU gprof, not exactly Sun's, but the differences are well noted.



Jim Panetta
Last modified: Tue Feb 15 12:14:33 PST 2000