-
Should the MINOS software system be OO or Fortran based? If the
Group is unable to decide at this time, what is the suggested mechanism
for this decision? What is the time-scale necessary for reaching a timely
decision? What are the risks associated with either course of
action?
The Committee reports that MINOS should use an OO-based system, and
proposes a system based on C++ and ROOT as a strong candidate.
The arguments in favor of the Fortran/ADAMO system are:
- The framework is tried and tested.
- There is considerable Fortran expertise within the
Collaboration.
- MINOS is a relatively simple experiment that does not present
technical challenges beyond the reach of ADAMO.
The difficulties of a FORTRAN/ADAMO choice stem from:
- The uncertainty of support for FORTRAN packages like ZEBRA,
ADAMO, GEANT3, BOS, TAUOLA, LUND, STDHEP and PAW through 2010.
Support covers bug fixes and porting to new architectures,
including 64-bit machines.
- The difficulty of interfacing a FORTRAN system with future
HEP packages like GEANT4 and with emerging database,
visualization and networking products.
- The lack of enthusiasm of younger collaborators for working
exclusively with a procedural language and for having to
maintain otherwise unsupported legacy packages.
In contrast, the Committee has decided to use a ROOT-based system as
its OO straw man, rather than just OO/C++. See Appendix 3 for the
justification of this choice. The strengths of such a system are,
in decreasing order of importance:
- Direct contact with a large and growing band of experts and
software systems ranging over the full spectrum of software
activities from DAQ monitoring to the parallel analysis of results.
Beyond the wealth of expertise there are also the more tangible
rewards of software we can directly use. In particular:
- ROOT - a full function framework (I/O, containers,
2-D and 3-D graphics, GUI, ntuple/histograms, fitting,
network support, interactive code development tools)
- ALIROOT - A Monte Carlo developed by ALICE that addresses
the GEANT problem (GEANT3 unsupported, GEANT4 not yet here) by
defining an interface behind which the existing GEANT3 can be
replaced by GEANT4 or FLUKA
It is hard to overstate just how much we can benefit over the
years by being part of this mainstream activity. No one knows
what will happen in the next 10 years. The ALIROOT project is
an example of the opportunities that will open up.
In a recent video-conference we recognized that GEANT was one of
the central problems we have to address, and now a potential
solution has emerged. ALICE has constructed a shared library
for GEANT3, which they have ported to their ROOT interface. The
lead developer of ALIROOT encourages others to take the code,
offers help to those who want to use it and invites others to
suggest improvements to it. The benefit to ALICE is that outside
users will expose weaknesses and that the end product will be better.
CDF, which had begun parallel development on a similar GEANT3
interface to ROOT, is now working with the ALIROOT team to unify
their interfaces and standardize the shared library. These
developers have stated their intentions to port GEANT4 to the
same interface in a timely way. At the ROOT workshop we also
heard about work on an interface to JAS, the Java Analysis Studio.
It will mean that simple analysis code can be written in Java
with C++ at the back-end to take care of the compute intensive jobs.
- Future students will use OO regardless of the choice of the
official code. The problem is already apparent:
- Oxford has lost a possible student because they could not
promise that MINOS code will be OO.
- The SNO group has a complete Fortran system (Monte Carlo,
Fitters, Analysis tools, Display, Database). Despite this, some
collaborators have developed a superior event display using ROOT
that cannot be used within the official Fortran framework.
- OO is a superior technology. It is important to understand that
OO encompasses procedural code. Experiments that are already
developing OO systems exercise a degree of pragmatism that recognizes
that procedural code should be used where the problem demands it.
There remains a strong processor data model, while the use of OO
allows a superior engineering of interfaces between components
internally and packages externally.
- It allows the sharing of software components between the on-line
and off-line code, for example, the event display and the database
interface.
It is the view of the Committee that the first two items in favour of
OO are so compelling that the choice should be made now in favour of
OO despite the fact that this will require significantly more effort
in the short term and considerable pain as people convert to C++.
We ask the Collaboration to recognize the long term, permanent
advantages of this choice and to accept the short term costs.
As we are recommending adoption now, this answers the questions about
a decision mechanism and time-scale. We proceed directly to an
assessment of risks and their management.
Risk: Lack of Expertise
We have very little practical experience of OO design.
We identify two specific risks:
- Inability to use the technology. George has demonstrated proof
of principle with MINFAST that we can use the technology and,
just as importantly, we can benefit from the work of others
(chunks of MINFAST come from ALICE and ATLAS). We plan to repeat
this in a more significant way with ALIROOT.
- Too many choices. There is no single best way of designing an
OO system. Too much
choice may have already proved a problem on some experiments.
Although ROOT provides what we need in a software framework,
it does limit certain choices. For example,
use of C++ templates is discouraged. Such restrictions can be
seen to one's advantage, particularly early on. Studying and
adapting what others have done also reduces design choices.
Risk: Collaboration Resistance
Without question the transition is going to be very painful for some.
We see two ways to minimize this risk:
- An unequivocal mandate from the Collaboration that it
is committed to OO programming
will encourage collaborators to make the effort
and avoid
a fracture within the software community.
The most pervasive theme in response to our survey of other
experiments is that
collaboration commitment to their choice is essential.
- A program of user support and training
- We need to study the training programs of other
experiments.
- A basic "user reference" has already been set
up and has proved useful.
- We propose to have a mentoring scheme with a
regional contact for each site that uses the new
software. We ask sites to identify candidate mentors
for whom we can establish a training program in a timely
fashion.
- The establishment of a mailing list
in conjunction with Hypernews as an archive method.
Risk: Not getting the job done on time.
This is a question of required resources that will be addressed,
albeit inadequately, in answer to a later question. Clearly, though,
the risk is minimized by maximizing our resources: the challenge is
to scale up the effort from the current level. We make two specific
recommendations:
Risk: ROOT will not last 10 years.
Assessment of this risk was one of the primary reasons why the four
of us attended the ROOT workshop held at Fermilab. ROOT is
maintained by a small group, and there was the perception that an
essential, highly technical subsystem, CINT (C++ interpreter), is
particularly vulnerable. We were encouraged to learn that, in
regard to CINT:
- The expertise was covered (Rene Brun has an explicit
requirement that every essential component has a backup line
of support).
- There was an existence proof (in the form of a D0 postdoc) that
outsiders can get in and understand the system sufficiently well
that they could extend it.
- Plans are being made for a rewrite to improve its
maintainability and to extend its support base.
As for ROOT as a whole, the most compelling evidence for its long term
prospects can be found in the list of experiments that have already
committed to it. The most relevant (amongst quite a long list) to us
are:
CERN: Alice
BROOKHAVEN: Star, Phenix, Phobos, Brahms
FERMILAB: CDF, D0
In the case of CDF and D0, their adoption is only partial. However,
CDF does use, and would continue to support even if ROOT folded,
the I/O system, which is the most essential element of the framework.
Since CINT is important for defining the data dictionary used in I/O,
CDF and D0 have an interest in its survival. Both CDF and D0 also
use the analysis tools.
Risk: Other external packages do not last 10 years.
- GEANT
This is of primary concern, although, with ALIROOT,
we have a strategy to manage this risk. The Fortran scenario
has the same risk exposure but without a concrete solution.
Adam has suggested that a major part of the problem is the GEANT
framework and that our detector is so simple that we should take
responsibility for it and just take the underlying physics package.
This alternative should be examined.
- ORACLE
ORACLE has already been chosen as the inventory database
for MINOS. It is also a candidate for the MINOS calibration
database, which is needed over the same time-scale. The only
additional risk which comes with the calibration database
application is in the interface, which is significantly different
from that of the inventory database. This risk is not deemed to be
significant: the interface is not demanding and ORACLE would not
have been so successful for 20 years and have 60% of the market
if simple interfaces were hard to implement. ORACLE expertise
exists within Fermilab. Other ROOT users have advertised their
interest in an ORACLE interface and the ROOT team is sponsoring
a generic TSQL query class to help in this area.
- ANALYSIS SUITE (Histogram, n-tuples, plots, fits)
ROOT has its own analysis suite; there is no additional risk.
The JAS-ROOT connection promises future extension to Java for
user analysis tasks.
-
Is a hybrid system a sensible possibility? How would such a system
be structured?
The Committee is strongly against a hybrid system
for the following reasons:
- It leads to a bifurcation of both the code and the community
with unavoidable duplication of effort.
- It increases risk exposure by extending the number of
external packages.
- There is a necessary compromise in the communication between
the components of the system and between the maintainers of
those systems.
However, it is recognized that there is important legacy code,
specifically NEUGEN, which depends on BOS, STDHEP and LUND.
The proposal is to wrap this code so that, externally, it appears
just another object and is stored in a dynamic library like the rest
of the OO code. To do this implies control over support packages.
This is not a problem for BOS or STDHEP. However, LUND is still
being actively developed. Our belief is that ATLAS is now supporting
a ROOT compatible version, although this needs to be confirmed.
So, within our framework, Fortran would be limited to the internals
of the wrapped NEUGEN. It would be a low priority, long term goal
to rewrite NEUGEN in C++.
Scripting languages are needed for support tools; the Committee
recommends Perl for user scripts and sh for make files.
-
If an OO system were adopted, how does one make the transition
in a way that assures that MINOS has the ability to continue the
required work during the transition period?
The proposal would be to add functionality to the existing system
to support essential activity for the next ~6 months and then to
freeze the code.
Through-going muons and radioactivity (both from the walls and in
the steel) are known deficiencies. Although overlapped events are
now supported, the generation of event files awaits neutrino fluxes,
through-going muons and radioactive noise. We need feedback from the
full collaboration for any anticipated additional demands with their
time-scales. We would ask people to recognize, however, that such work
taxes very limited resources and would have to be duplicated in the
new system.
-
Within the framework of the suggested action, what is a reasonable
schedule for completing various tasks? What would be the important
milestones?
In order to answer this question, we have first to identify the list
of tasks encompassed by the framework. The traditional role of the
off-line software focusses on the Monte Carlo, Event Reconstruction
and Analysis. However the choice of OO and C++ opens up the
opportunity to share code with the on-line system and so blurs the
division of responsibility between the two areas. To exploit this
opportunity to the fullest requires close cooperation between the
off-line and the on-line groups in such areas as:
detector monitoring
event visualization
database interface
on-line reconstruction (to act as a 2nd level filter?)
Moreover, there is a wide range of support activities that, while they
do not fit into the orthodox input -> process -> output model,
could benefit by using sub-systems, such as the database interface and
the event display. Support for such non-standard jobs should be an
important framework design consideration. The Collaboration may take
the view that certain support activities are so essential that it is
unacceptable that they be "blackboxes" to all but those who
developed them and may insist they be placed in the public domain.
The framework could provide the environment for such public software.
Of course, those software tasks that remain outside the framework are
not restricted in their choice of language or support packages but
must be responsible for these choices. Framework designers should work
with those who wish to remain outside the framework to facilitate
communication, predominantly via files or pipes, between external
systems and the framework.
Returning to the question of the tasks that the framework encompasses,
a draft of the User Requirements can be found in Appendix 2.
Members of the Collaboration are encouraged to review this list
and offer suggestions for its improvement. The Committee has made
an attempt to schedule the work from the Requirements List according
to a priority scheme which would expedite the following:
- Work on the data model and its relationship support
mechanisms would be the first priority;
a study of the ALIROOT model may provide valuable insights
for the underlying architecture.
An initial proposal should be drafted by the June 1999
Ely Workshop.
An orientation session on the draft framework would be given.
- The next priority would be to establish a minimal working
environment in which physicist-programmers
can develop reconstruction algorithms to gain experience with
C++ and ROOT.
The framework would implement those parts of
the new architecture necessary to support reconstruction
activities. Initially, this would be interfaced to the ADAMO-based
data structures of MINFAST, so that re-engineering of GMINOS
can be deferred.
User experience with the data model will provide valuable
feedback on objects, relationships and methods of navigation.
Several physicist-programmers should be
usefully engaged by
October 1999.
- During the year 2000, work can proceed on the Monte Carlo
elements of the framework, with particular emphasis on
digitization, which is required as an adjunct to
reconstruction studies.
- Late in the year 2000 framework support must be provided
for calibration and checkout activities starting in 2001.
A schedule with preliminary start
and completion dates has been generated.
-
What are the human resources required for the suggested tasks?
How many new software-dedicated MINOS people would the proposed
schedule require?
At the time of writing, it is not possible to answer this question
beyond the statement that the current off-line team (George, Robert,
0.3*Nick) is insufficient for the complete program.
There are a number of imponderables:
- How long will it take us to become competent in OO design?
- How much effort is involved in the tasks we have set ourselves
in the first year? Until we have real OO design experience,
we cannot tell.
- How well will the core programmer, physicist-programmer model
work? If well, then we may be able to tap into a large pool of
effort even if individuals work less than the canonical 50%.
- How readily can we adapt design, and code, from other
experiments?
In the early days we need effort with the core design. Two more
people with a good understanding of C++ and OO design, able to work
essentially full time, would be very beneficial.
The core team should remain small; good
architectures are seldom produced by large committees. Once we can
start to harness the power of the physicist-programmer, we probably
need the equivalent of another 2 FTE.
-
What is the optimum (from MINOS point of view) relationship between
the MINOS software group and the Fermilab Computing Division?
For this project we are currently short, both in effort and expertise.
However it would be unwise to "contract out" the core design,
even if the Computing Division were willing to undertake the job.
The design involves many choices, balancing pros and cons, and it is
important that members of the Collaboration play an active role here.
It is important to understand why the design was chosen. Without this
deep understanding, it is likely mistakes would be made as the core
evolves. Later, as the applications are fleshed out, effort should
not be an acute problem.
This leaves our shortfall in expertise, and here we should think
seriously about the support we could tap into, both within the
Computing Division and the wider HEP community, particularly the ROOT
based experiments. It is likely others have insights, if not solutions,
to all the major problems we have to solve. So, if we could establish
a consultancy role with the Computing Division, both for direct advice
and to help us understand the subtleties in designs of other experiments,
this could be helpful.
-
What resources should we request from the Fermilab
Computing Division?
As stated above we should seek to establish a consultancy role
with the Computing Division,
in which they help with specific design issues, offering advice and
helping us to exploit the growing pool of expertise. However, this
stops short of initiating design work themselves, so we should ask
for something like 8 hours per month of access to someone knowledgeable
in ROOT-based experiments.
-
What are the most important immediate issues that have to be
addressed so that we can proceed with the detector design in the
most expeditious way possible?
We cannot answer this question without more feedback from the
Collaboration.
However, we would repeat that hardware design issues in the next 6
months can only be addressed with the legacy Fortran system. Software
effort is a limited resource in the next year and we would ask others
to limit requests to a reasonable level.
-
(added later) Choice of Database
The short answer is that it is not yet necessary to decide the
database needed in this context.
It is clear that we have two quite different needs:
- An inventory database
- to record and track steel, electronics, etc., mainly to
support detector fabrication.
- A calibration database
- to hold calibration and other data sets to support
detector operation.
ORACLE has already been chosen for the inventory database for very good
reasons. We are assembling a very expensive detector. Mistakes made
during manufacture may be impossible to rectify. The work will involve
the coordination of a large group of people spread over a number of
laboratories. The clearest possible communication within this group
is essential and the robustness and integrity of the database vital.
It was for this reason that ORACLE, with its proven track record,
was chosen. The model is to have PCs equipped with "ORACLE Lite"
running locally and then send out updates via the network as required.
When it comes to the calibration database, the natural inclination to
use ORACLE again must be justified. This application has a very
different access pattern to that of a commercial database. So we start
by listing a set of acceptance criteria. It is not a requirement that
all the functionality listed below be supplied with the database;
we expect to do some work, if only to write query functions. However,
the amount of additional work required to develop and maintain full
functionality will play a part in deciding between competitors.
- Robustness
- the product must have a proven record for reliability
- recovery from disk crash using backup plus roll-forward
with journal files desirable.
- Distribution
the master/slave model with a simple distribution system
is essential; network bandwidth limitations mean that it is
impossible for all groups to use a single database repository.
- Performance
it must be possible to retrieve large blocks of calibration
constants rapidly and in such a form that event calibration
is optimized.
- Functionality
- selection based on relatively simple criteria is essential.
The primary key will be time or run number. Although the heaviest
use will be to serve event processing, it must also be possible
to generate reports with histograms and time charts, etc.
- it should also be possible to exclude entries added after
a supplied date and effectively recover an earlier state of the
database. This is essential for studies that need database
stability.
- it should be possible to export data in a simple format that
can be used by programs without a database interface.
- there should also be a GUI for use by people without
the necessary programming skills to make software queries.
- Schema Evolution
it must be possible to evolve the contents of the database
over time.
- Communication with the Installation Database
communication is essential with the installation database.
Ideally, information should not be duplicated in the two
databases. However many elements have both an "inventory
aspect" and a "calibration aspect" and
this, if nothing else, ties the two together.
- Access
it must be possible to access the database from all sites
at reasonable cost. It must also be possible to do limited
processing isolated from the network, e.g., on a laptop (flying
to/from Chicago). In the case where licenses are involved, we
could add a feature to the Database Interface so that it would
disconnect from the server during quiet periods minimizing the
number of concurrent licenses required.
- Support
as mentioned under risk management, support over the lifetime
of the experiment is essential.
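The Functionality requirement above, that entries added after a supplied
date can be excluded so as to recover an earlier state of the database,
can be sketched as follows. This is only an illustration under stated
assumptions: the CalibEntry layout, the use of plain integers for times
and the function name selectAsOf are all hypothetical, and a real
database would perform the selection in a query rather than in a loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record layout for one calibration entry.
struct CalibEntry {
    long validFrom;   // start of validity (inclusive)
    long validTo;     // end of validity (inclusive)
    long insertedAt;  // when the entry was added to the database
    double value;     // the calibration constant itself
};

// Among entries valid at eventTime and inserted no later than asOf,
// return the most recently inserted one, or nullptr if there is none.
const CalibEntry* selectAsOf(const std::vector<CalibEntry>& entries,
                             long eventTime, long asOf) {
    const CalibEntry* best = nullptr;
    for (const CalibEntry& e : entries) {
        if (e.insertedAt > asOf) continue;  // added after the cut-off
        if (eventTime < e.validFrom || eventTime > e.validTo) continue;
        if (!best || e.insertedAt > best->insertedAt) best = &e;
    }
    return best;
}
```

Because entries are never overwritten, only superseded by later
insertions, a study can pin asOf to a fixed date and see a stable set of
constants however the database evolves afterwards.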
ORACLE has a clear advantage over any other database when it comes to
item 6 (Communication with the Installation Database): there is no
communications problem if the two databases are one and the same. It
also simplifies the framework interface that has to access both. In
other areas, in particular item 3 (Performance), ORACLE has to "earn
its spurs" like any other database. What is required is a set of
performance tests that model the calibration procedure. However, if it
passes these and the remaining tests, it should be the default and only
be replaced if another system has a significant advantage. The ability
to work disconnected from the network is an important requirement for
which several solutions are possible:
- A mini-database and server.
In the case of ORACLE the
smallest server is "ORACLE lite" costing ~ $400/seat.
- A substitute mini database.
For example:
- mSQL
- ASCII files
Both of these require an additional interface and a method of
converting a subset of the official database to the substitute
format. However, there is a plan to have the Database Interface
accept ASCII files as a way of masking out the database (see
Appendix 4), so this solution would then only require the converter.
- In-line constants.
Write the constants into the same file
as the data. This requires that database constants can be stored
in an event file and that the Database Interface can accept data
from the event file.
There is a range of OO languages, although only two, C++ and Java,
have found significant application in HEP MC and reconstruction.
C++ is thought by many to be overheavy with features that can confuse
under-supervised programmers. Java has avoided many of these traps and is
altogether a more lightweight language with the additional appeal that it
is platform neutral. However, this neutrality is bought at the cost of a
"compiler" that really just translates the code into byte codes
that are truly compiled into machine code as needed at execution time.
This can result in significantly slower execution speeds.
We would not start with just a compiler but need a framework.
There are 3 possibilities:
Based on the existence of large user groups involving major experiments
with our time-scales, we select ROOT as the straw man. In principle, some
future, partial migration to Java might be desirable. To first order,
Java looks like a subset of C++, so a transition should not be difficult.
It is more restrictive and, in particular, does not have templates.
At this time C++ templates are only partially supported in ROOT, and
we would probably not use them in the core framework.
The following is a simple sketch of the way the Database Interface
maintains a set of database constants in synchronization with the current
event. The model has been successfully used in Fortran on both Soudan and
SNO but is more natural when set in an OO context.
- When DBI is created it asks EIO to inform it each time a new event
is read in (DBI -> EIO). From then on, as each new event is read in,
EIO informs DBI (EIO -> DBI).
- Database Interface clients CL1 and CL2 ask DBI to provide synchronized
access to specified objects in the database (CLx -> DBI). DBI uses the
current context (event date, type and detector/data type) to retrieve
the required database objects and then tells the clients where they are
to be found (DBI -> CLx).
- As each event is input, the signal from EIO to DBI initiates a check
that all database objects remain valid. A simple trick makes this very
fast. As DBI collects database objects, it forms a global time window
by ANDing all the individual object valid time windows. So it first
checks this global window and only checks individual objects if outside
this window. If objects are no longer valid, they are replaced by fresh
copies drawn from the database. Clients whose constants are changed are
informed of the change.
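The global time window check described above can be sketched in C++.
This is a minimal illustration, not the Soudan or SNO code: the type and
member names are hypothetical, times are plain integers, and the step
that replaces stale objects with fresh copies (after which the global
window would be recomputed) is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical validity interval, inclusive at both ends.
struct TimeWindow {
    long start;
    long end;
    bool contains(long t) const { return start <= t && t <= end; }
};

struct DbObject {
    int id;
    TimeWindow valid;
};

class ValidityCache {
public:
    void add(const DbObject& obj) {
        objects_.push_back(obj);
        // "ANDing" the windows means intersecting them.
        global_.start = std::max(global_.start, obj.valid.start);
        global_.end = std::min(global_.end, obj.valid.end);
    }

    // Returns the ids of objects no longer valid at event time t.
    // Fast path: inside the global window no object is inspected.
    std::vector<int> staleAt(long t) {
        std::vector<int> stale;
        if (global_.contains(t)) return stale;
        ++slowChecks_;
        for (const DbObject& o : objects_)
            if (!o.valid.contains(t)) stale.push_back(o.id);
        return stale;
    }

    int slowChecks() const { return slowChecks_; }
    TimeWindow global() const { return global_; }

private:
    std::vector<DbObject> objects_;
    // Starts as "all time" and shrinks as objects are added.
    TimeWindow global_{std::numeric_limits<long>::min(),
                       std::numeric_limits<long>::max()};
    int slowChecks_ = 0;
};
```

For consecutive events that fall inside the global window, the cost per
event is a single interval comparison regardless of how many database
objects are held.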
This simple model has a number of advantages:
- DBI Clients see a simple interface. Clients remain unaffected by
underlying database changes.
- Multiple requests for the same data result in them sharing a
single copy.
- Individual clients are relieved of the chore of checking constant
validity.
- If we follow the Soudan and SNO models, DBI will also have access
to database data in ASCII files that take priority over the database.
This has 3 advantages:
- In the short term it means that the system can be set up with
just ASCII files before a database is ready. When available, the
database can be incorporated without clients being affected by the
change.
- In the long term it means that new and modified constants can
be tested before being committed to the database.
- It can be used for the "laptop on the airplane problem"
so long as database banks can be extracted from the full database
in the right ASCII format.
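The priority of ASCII files over the database can be sketched as a
two-level lookup. The sketch assumes a flat table of named constants and
a "name value" line format for the ASCII file; both are stand-ins chosen
for brevity, as are all the identifiers, and the real bank format is not
specified here.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical flat table standing in for real database banks.
typedef std::map<std::string, double> ConstantTable;

// Parse whitespace-separated "name value" pairs (assumed format).
ConstantTable parseAscii(const std::string& text) {
    ConstantTable table;
    std::istringstream in(text);
    std::string name;
    double value;
    while (in >> name >> value) table[name] = value;
    return table;
}

class ConstantSource {
public:
    ConstantSource(ConstantTable ascii, ConstantTable database)
        : ascii_(std::move(ascii)), database_(std::move(database)) {}

    // ASCII entries take priority; the database is consulted only
    // when the ASCII table has no entry. Returns false if neither has it.
    bool lookup(const std::string& name, double& value) const {
        ConstantTable::const_iterator it = ascii_.find(name);
        if (it != ascii_.end()) { value = it->second; return true; }
        it = database_.find(name);
        if (it != database_.end()) { value = it->second; return true; }
        return false;
    }

private:
    ConstantTable ascii_;
    ConstantTable database_;
};
```

An empty database table models the short-term case where only ASCII
files exist, and an empty ASCII table models normal running; clients
call lookup() in the same way in both cases.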
As stated above, this model has been implemented (twice) in FORTRAN, but
certain aspects are very clumsy in that language, specifically the
relationship between DBI and CLx. In FORTRAN, the only way for DBI to call
any CLx is to have a dispatch table in DBI that calls EVERY POSSIBLE CLx!
Each time a new client is written, DBI has to be updated to allow it to
call the new client. When linking, references to every possible client
have to be resolved even though they probably will not all be called.
In C++ all DBI needs to know is that there is a DBI client call-back object
that it can use to talk to clients. Then, as clients request constants, they
pass DBI this call-back object. This is a standard technique in OO
that can be used to effectively decouple components.
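A minimal sketch of the call-back technique follows. The class and
method names (DbiClient, Dbi, GainClient, constantsChanged) are
hypothetical and chosen only for illustration; table names are plain
strings and the retrieval of constants is omitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The only thing DBI knows about its clients: an abstract call-back.
class DbiClient {
public:
    virtual ~DbiClient() = default;
    virtual void constantsChanged(const std::string& table) = 0;
};

class Dbi {
private:
    struct Subscription {
        std::string table;
        DbiClient* client;
    };
    std::vector<Subscription> subscriptions_;

public:
    // A client asks for a table and hands over its call-back object.
    void request(const std::string& table, DbiClient* client) {
        subscriptions_.push_back(Subscription{table, client});
    }

    // Invoked when stale constants for `table` have been replaced.
    void notifyChanged(const std::string& table) {
        for (const Subscription& s : subscriptions_)
            if (s.table == table) s.client->constantsChanged(table);
    }
};

// A new client is added without touching or relinking Dbi.
class GainClient : public DbiClient {
public:
    int updates = 0;
    void constantsChanged(const std::string&) override { ++updates; }
};
```

Dbi refers only to the abstract DbiClient, so there is no dispatch table
to maintain and no client symbols to resolve when linking the interface.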
APPENDIX 5
ROOT workshop impressions
This is called "impressions" rather than a report as it will
be very brief and only touch on a few points of relevance to MINOS.
The workshop was well attended with over 50 taking part - twice what the
organizers had expected. Rene Brun gave a talk about the motivation for
the ROOT framework and future development. He placed great emphasis on
the robustness of the I/O and on features to improve forward/backward
compatibility. There were a number of site reports including ones from
CDF, D0, BaBar (ROOT not official), Alice, Star, Brahms, Blast, Phenix,
Phobos, TJNAF, LCD and Minos. ROOT is being used across a full spectrum
of software activities, although there were some complaints, specifically
about aspects of I/O and STL support. One of the high points was the talk
from ALICE on ALIROOT, a way of wrapping GEANT3 in such a way that it would
be possible to replace it with either GEANT4 or FLUKA. Other interesting
talks, such as those on parallel processing, network based server/client
models and web based I/O served to underline the scope of ROOT.
There were some confrontations, the principal one over the charge that
ROOT is too monolithic. The antagonists felt that, with more careful
design, ROOT could evolve to be more modular. This would make it easier
to adapt to limited domains like DAQ now, and more adaptable to future
changes in the external software environment. There was, however, some
recognition that its high degree of integration reflects a degree of
pragmatism which has allowed ROOT to develop extensive functionality
so quickly.