NuMI-L-499

MINOS Software Planning Group Report

The Committee:

The Report

Answers to questions in the original charge:
 
  1. Should the MINOS software system be OO or Fortran based? If the Group is unable to decide at this time, what is the suggested mechanism for this decision? What is the time-scale necessary for reaching a timely decision? What are the risks associated with either course of action?

    The Committee reports that MINOS should use an OO-based system, and proposes a system based on C++ and ROOT as a strong candidate.

    The arguments in favor of the Fortran/ADAMO system are:

    The difficulties of a FORTRAN/ADAMO choice stem from:

    In contrast, the Committee has decided to use a ROOT-based system as its OO straw man, rather than just OO/C++. See Appendix 3 for the justification of this choice. The strengths of such a system are, in decreasing order of importance:

    It is the view of the Committee that the first two items in favour of OO are so compelling that the choice should be made now in favour of OO despite the fact that this will require significantly more effort in the short term and considerable pain as people convert to C++. We ask the Collaboration to recognize the long term, permanent advantages of this choice and to accept the short term costs.

    As we are recommending adoption now, this answers the questions about a decision mechanism and time-scale. We proceed directly to an assessment of risks and their management.

    Risk: Lack of Expertise. We have very little practical experience of OO design. We identify two specific risks:

    Risk: Collaboration Resistance. Without question the transition is going to be very painful for some. We see two ways to minimize this risk:

    Risk: Not getting the job done on time. This is a question of required resources that will be addressed, albeit inadequately, in answer to a later question. Clearly, though, the risk is minimized by maximizing our resources: the challenge is to scale up the effort from the current level. We make two specific recommendations:

    Risk: ROOT will not last 10 years. Assessment of this risk was one of the primary reasons why the four of us attended the ROOT workshop held at Fermilab. ROOT is maintained by a small group, and there was the perception that an essential, highly technical subsystem, CINT (C++ interpreter), is particularly vulnerable. We were encouraged to learn that, in regard to CINT:

    As for ROOT as a whole, the most compelling evidence for its long term prospects can be found in the list of experiments that have already committed to it. The most relevant (amongst quite a long list) to us are:
    CERN:        Alice
    BROOKHAVEN:  Star, Phenix, Phobos, Brahms
    FERMILAB:    CDF, D0
    
    In the case of CDF and D0, their adoption is only partial. However, CDF does use, and would continue to support even if ROOT folded, the I/O system, which is the most essential element of the framework. Since CINT is important for defining the data dictionary used in I/O, CDF and D0 have an interest in its survival. Both CDF and D0 also use the analysis tools.

    Risk: Other external packages do not last 10 years.

  2. Is a hybrid system a sensible possibility? How would such a system be structured?

    The Committee is strongly against a hybrid system for the following reasons:

    However, it is recognized that there is important legacy code, specifically NEUGEN, which depends on BOS, STDHEP and LUND. The proposal is to wrap this code so that, externally, it appears as just another object and is stored in a dynamic library like the rest of the OO code. Doing this implies control over the support packages. This is not a problem for BOS or STDHEP. However, LUND is still being actively developed. Our belief is that ATLAS is now supporting a ROOT-compatible version, although this needs to be confirmed. So, within our framework, Fortran would be limited to the internals of the wrapped NEUGEN. It would be a low priority, long term goal to rewrite NEUGEN in C++.
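    The wrapping idea can be illustrated with a minimal sketch. Everything named here is hypothetical: `legacy_generate_` merely stands in for a Fortran routine (the trailing underscore and pass-by-pointer arguments mimic a common Fortran calling convention), and it is not NEUGEN's actual interface. In a real build the routine would be compiled from Fortran and only declared on the C++ side.

```cpp
#include <cassert>
#include <vector>

// Stand-in for a legacy Fortran subroutine (hypothetical name and signature).
// It is defined here in C++ only so that the sketch is self-contained.
extern "C" void legacy_generate_(const double* energy_gev, int* n_particles,
                                 double* momenta) {
    // Toy behaviour: two particles sharing the energy equally.
    *n_particles = 2;
    momenta[0] = 0.5 * (*energy_gev);
    momenta[1] = 0.5 * (*energy_gev);
}

// Wrapper class: callers see an ordinary C++ object and never touch the
// Fortran-style interface directly.
class LegacyGenerator {
private:
    static constexpr int kMaxParticles = 16;  // size of the exchange buffer

public:
    // Returns one momentum value per generated particle.
    std::vector<double> Generate(double energy_gev) const {
        assert(energy_gev >= 0.0);
        double buffer[kMaxParticles] = {0.0};
        int n = 0;
        legacy_generate_(&energy_gev, &n, buffer);
        assert(n >= 0 && n <= kMaxParticles);
        return std::vector<double>(buffer, buffer + n);
    }
};
```

    A caller would write `LegacyGenerator gen; gen.Generate(10.0);` and remain unaware that the work is done by non-C++ code, which is the property the proposal relies on.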

    Scripting languages are needed for support tools; the Committee recommends perl for user scripts and sh for make files.

  3. If an OO system were adopted, how does one make the transition in a way that assures that MINOS has the ability to continue the required work during the transition period?

    The proposal would be to add functionality to the existing system to support essential activity for the next ~6 months and then to freeze the code. Through-going muons and radioactivity (both from the walls and in the steel) are known deficiencies. Although overlapped events are now supported, the generation of event files awaits neutrino fluxes, through-going muons and radioactive noise. We need feedback from the full collaboration for any anticipated additional demands with their time-scales. We would ask people to recognize, however, that such work taxes very limited resources and would have to be duplicated in the new system.

  4. Within the framework of the suggested action, what is a reasonable schedule for completing various tasks? What would be the important milestones?

    In order to answer this question, we have first to identify the list of tasks encompassed by the framework. The traditional role of the off-line software focusses on the Monte Carlo, Event Reconstruction and Analysis. However the choice of OO and C++ opens up the opportunity to share code with the on-line system and so blurs the division of responsibility between the two areas. To exploit this opportunity to the fullest requires close cooperation between the off-line and the on-line groups in such areas as:

        detector monitoring
        event visualization
        database interface
        on-line reconstruction (to act as a 2nd level filter?)
    
    Moreover, there is a wide range of support activities that, while they do not fit into the orthodox input -> process -> output model, could benefit from using sub-systems such as the database interface and the event display. Support for such non-standard jobs should be an important framework design consideration. The Collaboration may take the view that certain support activities are so essential that it is unacceptable that they be "black boxes" to all but those who developed them, and may insist they be placed in the public domain. The framework could provide the environment for such public software. Of course, those software tasks that remain outside the framework are not restricted in their choice of language or support packages, but their authors must take responsibility for these choices. Framework designers should work with those who wish to remain outside the framework to facilitate communication, predominantly via files or pipes, between external systems and the framework.

    Returning to the question of the tasks that the framework encompasses, a draft of the User Requirements can be found in Appendix 2. Members of the Collaboration are encouraged to review this list and offer suggestions for its improvement. The Committee has made an attempt to schedule the work from the Requirements List according to a priority scheme which would expedite the following:

    1. Work on the data model and its relationship support mechanisms would be the first priority; a study of the ALIROOT model may provide valuable insights for the underlying architecture. An initial proposal should be drafted by the June 1999 Ely Workshop. An orientation session on the draft framework would be given.
    2. The next priority would be to establish a minimal working environment in which physicist-programmers can develop reconstruction algorithms to gain experience with C++ and ROOT. The framework would implement those parts of the new architecture necessary to support reconstruction activities. Initially, this would be interfaced to the ADAMO-based data structures of MINFAST, so that re-engineering of GMINOS can be deferred. User experience with the data model will provide valuable feedback on objects, relationships and methods of navigation. Several physicist-programmers should be usefully engaged by October 1999.
    3. During the year 2000, work can proceed on the Monte Carlo elements of the framework, with particular emphasis on digitization, which is required as an adjunct to reconstruction studies.
    4. Late in the year 2000 framework support must be provided for calibration and checkout activities starting in 2001.
    A schedule with preliminary start and completion dates has been generated.

  5. What are the human resources required for the suggested tasks? How many new software-dedicated MINOS people would the proposed schedule require?

    At the time of writing, it is not possible to answer this question beyond the statement that the current off-line team (George, Robert, 0.3*Nick) is insufficient for the complete program. There are a number of imponderables:

    In the early days we need effort on the core design. Two more people with a good understanding of C++ and OO design, who could work essentially full time, would be very beneficial. The core team should remain small; good architectures are seldom produced by large committees. Once we can start to harness the power of the physicist-programmer, we probably need the equivalent of another 2 FTE.

  6. What is the optimum (from MINOS point of view) relationship between the MINOS software group and the Fermilab Computing Division?

    For this project we are currently short, both in effort and expertise. However it would be unwise to "contract out" the core design, even if the Computing Division were willing to undertake the job. The design involves many choices, balancing pros and cons, and it is important that members of the Collaboration play an active role here. It is important to understand why the design was chosen. Without this deep understanding, it is likely mistakes would be made as the core evolves. Later, as the applications are fleshed out, effort should not be an acute problem.

    This leaves our shortfall in expertise, and here we should think seriously about the support we could tap into, both within the Computing Division and the wider HEP community, particularly the ROOT based experiments. It is likely others have insights, if not solutions, to all the major problems we have to solve. So, if we could establish a consultancy role with the Computing Division, both for direct advice and to help us understand the subtleties in designs of other experiments, this could be helpful.

  7. What resources should we request from the Fermilab Computing Division?

    As stated above we should seek to establish a consultancy role with the Computing Division, in which they help with specific design issues, offering advice and helping us to exploit the growing pool of expertise. However, this stops short of initiating design work themselves, so we should ask for something like 8 hours per month of access to someone knowledgeable in ROOT-based experiments.

  8. What are the most important immediate issues that have to be addressed so that we can proceed with the detector design in the most expeditious way possible?

    We cannot answer this question without more feedback from the Collaboration. However, we would repeat that hardware design issues in the next 6 months can only be addressed with the legacy Fortran system. Software effort is a limited resource in the next year and we would ask others to limit requests to a reasonable level.

  9. (added later) Choice of Database

    The short answer is that it is not yet necessary to decide the database needed in this context. It is clear that we have two quite different needs:

    An inventory database
    to record and track steel, electronics, etc., mainly to support detector fabrication.
    A calibration database
    to hold calibration and other data sets to support detector operation.

    ORACLE has already been chosen for the inventory database for very good reasons. We are assembling a very expensive detector. Mistakes made during manufacture may be impossible to rectify. The work will involve the coordination of a large group of people spread over a number of laboratories. The clearest possible communication within this group is essential and the robustness and integrity of the database vital. It was for this reason that ORACLE, with its proven track record, was chosen. The model is to have PCs equipped with "ORACLE Lite" running locally and then send out updates via the network as required.

    When it comes to the calibration database, the natural inclination to use ORACLE again must be justified. This application has a very different access pattern to that of a commercial database. So we start by listing a set of acceptance criteria. It is not a requirement that all the functionality listed below be supplied with the database; we expect to do some work, if only to write query functions. However, the amount of additional work required to develop and maintain full functionality will play a part in deciding between competitors.

    1. Robustness
    2. Distribution
      the master/slave model with a simple distribution system is essential; network bandwidth limitations mean that it is impossible for all groups to use a single database repository.
    3. Performance
      it must be possible to retrieve large blocks of calibration constants rapidly and in such a form that event calibration is optimized.
    4. Functionality
    5. Schema Evolution
      it must be possible to evolve the contents of the database over time.
    6. Communication with the Installation Database
      communication is essential with the installation database. Ideally, information should not be duplicated in the two databases. However many elements have both an "inventory aspect" and a "calibration aspect" and this, if nothing else, ties the two together.
    7. Access
      it must be possible to access the database from all sites at reasonable cost. It must also be possible to do limited processing isolated from the network, e.g., on a laptop (flying to/from Chicago). In the case where licenses are involved, we could add a feature to the Database Interface so that it would disconnect from the server during quiet periods minimizing the number of concurrent licenses required.
    8. Support
      as mentioned under risk management, support over the lifetime of the experiment is essential.

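    The license-saving idea mentioned under Access (disconnecting from the server during quiet periods) might look something like the following sketch. The class name, the explicit time arguments and the idle limit are all assumptions made for illustration; no real database client API is used.

```cpp
#include <cassert>

// Hypothetical connection manager that holds a server license only while
// the database is actually being used. Times are passed in explicitly
// (in seconds) so that the behaviour is easy to follow and to test.
class LicensedConnection {
public:
    explicit LicensedConnection(long idle_limit_s)
        : idle_limit_s_(idle_limit_s), connected_(false),
          last_use_s_(0), connect_count_(0) {
        assert(idle_limit_s > 0);
    }

    // Called for every query: reconnects on demand and records the time.
    void Query(long now_s) {
        if (!connected_) {
            connected_ = true;   // a real client would open the session here
            ++connect_count_;
        }
        last_use_s_ = now_s;
    }

    // Called periodically: releases the license after a quiet period.
    void Tick(long now_s) {
        if (connected_ && now_s - last_use_s_ >= idle_limit_s_) {
            connected_ = false;  // a real client would close the session here
        }
    }

    bool Connected() const { return connected_; }
    int ConnectCount() const { return connect_count_; }

private:
    long idle_limit_s_;
    bool connected_;
    long last_use_s_;
    int connect_count_;
};
```

    The trade-off is the cost of reconnecting after each quiet period against the number of concurrent licenses held; the idle limit would have to be tuned to the real access pattern.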
    ORACLE has a clear advantage over any other database when it comes to item 6 (Communication with the Installation Database): there is no communications problem if the two databases are one and the same. It also simplifies the framework interface that has to access both. In other areas, and in particular item 3 (Performance), ORACLE has to "earn its spurs" like any other database. What is required is a set of performance tests that model the calibration procedure. However, if it passes these and the remaining tests, it should be the default and only be replaced if another system has a significant advantage. The ability to work disconnected from the network is an important one, for which several solutions are possible:
    1. A mini-database and server.
      In the case of ORACLE the smallest server is "ORACLE Lite" costing ~ $400/seat.
    2. A substitute mini database.
      For example:
      1. mSQL
      2. ASCII files
      Both of these require an additional interface and a method of converting a subset of the official database to the substitute format. However, there is a plan to have the Database Interface accept ASCII files as a way of masking out the database (see Appendix 4), so this solution would then only require the converter.
    3. In-line constants.
      Write the constants into the same file as the data. This requires that database constants can be stored in an event file and that the Database Interface can accept data from the event file.
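
    As an illustration of the ASCII-file option, the sketch below shows one way an interface could let values read from a text file take precedence over ("mask out") values from the database. The class, its methods and the "key value" line format are assumptions for the example only; the report does not specify any of them.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical constants store: values normally come from the database,
// but entries parsed from ASCII text take precedence over them.
class ConstantsStore {
public:
    void SetFromDatabase(const std::string& key, double value) {
        database_[key] = value;
    }

    // Parses lines of the form "<key> <value>"; malformed lines are skipped.
    // Returns the number of overrides loaded.
    int LoadAscii(std::istream& in) {
        int loaded = 0;
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            std::string key;
            double value = 0.0;
            if (fields >> key >> value) {
                ascii_[key] = value;
                ++loaded;
            }
        }
        return loaded;
    }

    // ASCII overrides win; otherwise fall back to the database value.
    double Get(const std::string& key) const {
        std::map<std::string, double>::const_iterator it = ascii_.find(key);
        if (it != ascii_.end()) return it->second;
        it = database_.find(key);
        assert(it != database_.end());  // the key must exist somewhere
        return it->second;
    }

private:
    std::map<std::string, double> database_;
    std::map<std::string, double> ascii_;
};
```

    With this arrangement a laptop user would only need the converter that dumps the relevant subset of the official database to text; client code calling `Get` would not change.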

APPENDIX 1

Stan's original mail: The charge to the committee
  1. Should the MINOS software system be OO or Fortran based? If the Group is unable to decide at this time, what is the suggested mechanism for this decision? What is the time-scale necessary for reaching a timely decision? What are the risks associated with either course of action?
  2. Is a hybrid system a sensible possibility? How would such a system be structured?
  3. If an OO system would be adopted, how does one make the transition in a way that assures that MINOS has the ability to continue the required work during the transition period?
  4. Within the framework of the suggested action, what is a reasonable schedule for completing various tasks? What would be the important milestones?
  5. What are the human resources required for the suggested tasks? How many new software-dedicated MINOS people would the proposed schedule require?
  6. What is the optimum (from MINOS point of view) relationship between the MINOS software group and the Fermilab Computing Division?
  7. What resources should we request from the Fermilab Computing Division?
  8. What are the most important immediate issues that have to be addressed so that we can proceed with the detector design in the most expeditious way possible?
  9. (added later) Choice of Database

APPENDIX 2

User requirements list
  1. Data Model
    1. Structure (what objects describe the detector and the events and what relationships exist between them)
    2. Machinery (how relationships are supported, i.e., how they are navigated in either direction and how they are maintained as objects are written and read, created and destroyed, or relationships reassigned)
    3. Flexibility (reading old data - schema evolution, reading partial data - e.g., can run Event Reconstruction on MC output and Reconstruction output as well as raw data)
    4. Standard DST (Collaboration format for high level analysis)
  2. Database Interface (upper layer of 2 tier client)
    1. API (the interface that other framework components use to access the database)
    2. Synchronization (the machinery that ensures calibration constants remain in sync. with the current event).
    3. Retrieved objects (the objects that the interface returns to its clients)
  3. Database Server
    1. API (the interface that the Database Interface sees)
    2. Data Model (the data model used between the Database interface and the Server)
  4. Detector Calibration
    1. Committing values (storing the data in the database)
    2. Framework support for application (common tools to assist the physicist-programmer extract calibration constants from a variety of sources)
    3. Strategies (for obtaining values from muons, light injection, and radioactive noise)
  5. Detector Checkout Framework Support (common tools to assist the physicist-programmer to establish and maintain correct detector operation)
    1. Checkout visualization
    2. Histogram manipulation
  6. Event Visualization
    1. Event views
    2. Custom views
    3. Interaction with reconstruction
  7. Monte Carlo
    1. Beam
      1. Beam simulation (the production of a flux)
      2. Beam API (the interface the flux presents to the user)
    2. Geometry GEANT interface (the geometry is part of the data model, see item 1, but will need to be translated for GEANT)
    3. Interaction (NEUGEN)
      1. Vertex position from cross-section, geometry and detector media.
      2. Kinematics and fragmentation
    4. Noise (radioactivity)
    5. Transport and Physics (e.g., GEANT)
    6. Hit Extraction
    7. Overlays
    8. Digitization
      1. Strip collection and fiber attenuation
      2. PMT effects including noise
      3. Optical summing
      4. Front end electronics (preamps, ADC, TDC, noise)
    9. DAQ including trigger
  8. Reconstruction
    1. Framework
    2. Demultiplexing
    3. Event calibration
    4. Pattern recognition
      1. Vertex
      2. Shower
      3. Track
      4. Noise
    5. Event Classification
      1. NC vs. CC
      2. Neutrino flavour
      3. Neutrino energy
  9. User Interface
    1. Job Control
    2. Job Output

APPENDIX 3

Choice of Root as the straw man

There is a range of OO languages, although only two, C++ and Java, have found significant application in HEP MC and reconstruction. C++ is thought by many to be overheavy with features that can confuse under-supervised programmers. Java has avoided many of these traps and is altogether a more lightweight language with the additional appeal that it is platform neutral. However, this neutrality is bought at the cost of a "compiler" that really just translates the code into byte codes that are truly compiled into machine code as needed at execution time. This can result in significantly slower execution speeds.

We would not start with just a compiler but need a framework. There are 3 possibilities:

Based on the existence of large user groups involving major experiments with our time-scales, we select ROOT as the straw man. In principle, some future, partial migration to Java might be desirable. To first order, Java looks like a subset of C++, so a transition should not be difficult. It is more restrictive and, in particular, does not have templates. At this time C++ templates are only partially supported in ROOT, and we would probably not use them in the core framework.

APPENDIX 4

Use Case: Database Interface Synchronization

The following is a simple sketch of the way the Database Interface maintains a set of database constants in synchronization with the current event. The model has been successfully used in Fortran on both Soudan and SNO but is more natural when set in an OO context.

This simple model has a number of advantages:

As stated above, this model has been implemented (twice) in FORTRAN, but certain aspects are very clumsy in that language, specifically the relationship between DBI and its clients (CLx). In FORTRAN, the only way for DBI to call any CLx is to have a dispatch table in DBI that calls EVERY POSSIBLE CLx! Each time a new client is written, DBI has to be updated to allow it to call the new client. When linking, references to every possible client have to be resolved even though they probably will not all be called. In C++, all DBI needs to know is that there is a DBI client call-back object that it can use to talk to clients. Then, as clients request constants, they pass DBI this call-back object. This is a standard OO technique for decoupling components.
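
The call-back arrangement can be sketched as follows. All class and method names are hypothetical, and the trigger used here (a table being updated) is an assumption chosen for illustration; the point is only that the DBI class never names a concrete client.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Abstract call-back interface: the only thing DBI knows about its clients.
class DbiClient {
public:
    virtual ~DbiClient() {}
    virtual void ConstantsChanged(const std::string& table) = 0;
};

// Hypothetical Database Interface. It stores the call-back objects handed
// over by clients and never refers to any concrete client class.
class Dbi {
private:
    struct Subscription {
        std::string table;
        DbiClient* client;
    };
    std::vector<Subscription> subscriptions_;

public:
    // A client requests a table and passes its call-back object.
    void Request(const std::string& table, DbiClient* client) {
        assert(client != nullptr);
        subscriptions_.push_back({table, client});
    }

    // Notify every client that requested the given table.
    void TableUpdated(const std::string& table) {
        for (const Subscription& s : subscriptions_) {
            if (s.table == table) s.client->ConstantsChanged(table);
        }
    }
};

// A new client, written without any change to Dbi.
class GainClient : public DbiClient {
public:
    GainClient() : refreshes_(0) {}
    void ConstantsChanged(const std::string&) override { ++refreshes_; }
    int Refreshes() const { return refreshes_; }

private:
    int refreshes_;
};
```

Adding another client means deriving another class from `DbiClient`; `Dbi` itself is neither edited nor relinked against the new client, in contrast to the FORTRAN dispatch table described above.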

APPENDIX 5

ROOT workshop impressions

This is called "impressions" rather than a report as it will be very brief and only touch on a few points of relevance to MINOS. The workshop was well attended with over 50 taking part, twice what the organizers had expected. Rene Brun gave a talk about the motivation for the ROOT framework and future development. He placed great emphasis on the robustness of the I/O and on features to improve forward/backward compatibility.

There were a number of site reports including ones from CDF, D0, BaBar (ROOT not official), Alice, Star, Brahms, Blast, Phenix, Phobos, TJNAF, LCD and Minos. ROOT is being used across a full spectrum of software activities, although there were some complaints, specifically about aspects of I/O and STL support. One of the high points was the talk from ALICE on ALIROOT, a way of wrapping GEANT3 in such a way that it would be possible to replace it with either GEANT4 or FLUKA. Other interesting talks, such as those on parallel processing, network based server/client models and web based I/O, served to underline the scope of ROOT.

There were some confrontations, the principal one over the charge that ROOT is too monolithic. The antagonists felt that, with more careful design, ROOT could evolve to be more modular. This would make it easier to adapt to limited domains like DAQ now, and more adaptable to future changes in the external software environment. There was, however, some recognition that its high degree of integration reflects a degree of pragmatism which has allowed ROOT to develop extensive functionality so quickly.