Computing Job Descriptions
This page describes a number of the jobs that need to be done in
the computing system.
This page has not been updated since 2004. The job descriptions are
still useful, but many of the names of people involved have changed.
Coordination
The BABAR physics program involves very large data sets and many
parallel sophisticated analyses, each of which presents significant
computing challenges. The BaBar Computing Coordinator provides overall
leadership and coordination of computing activities throughout the
international BABAR Collaboration, as a full participant in the BABAR
physics program. In this capacity, they are responsible for leadership
and coordination of:
- Operation and ongoing development activities on all aspects of
BABAR computing, which includes offline software, offline operations,
online software and computing hardware, and computing tools, and
- Upgrades of BABAR Computing that address new requirements arising
from PEP II luminosity improvements, from detector upgrades, and from
experience gained via physics analysis efforts.
The BaBar Computing Coordinator ensures that BABAR Computing
satisfies the needs of the BABAR physics program within the existing
financial resources provided by the International Finance Committee,
while maximizing the capability to perform computing and analysis
activities in Tier-A centres and collaborating institutions. The
coordinator serves as a member of the BABAR Management Group, which
advises the Spokesperson. In this capacity, the BaBar Computing
Coordinator is expected to:
- Coordinate BABAR computing activities with BABAR management, the
BABAR Technical Board, BABAR analysis activities, BABAR collaborators,
the Computing Steering Committee and SLAC Computing Services and
management, and
- Identify the resources, both material and human, required to
accomplish the goals of BABAR computing activities and, with BABAR and
SLAC management, ensure that necessary resources are available.
The role also requires management of a substantial staff of both
computing professionals and collaborators working within the Computing
Organisation. This includes direct and indirect supervision, and
working with people in this highly distributed organisation.
Deputy Computing Coordinator
Assists the Computing Coordinator in all aspects of the Coordinator's role,
as designated. May also be delegated responsibilities. Provides backup
during the Computing Coordinator's absence.
Offline
The offline project is defined as all offline code that is centrally
maintained, run and distributed by the BaBar experiment. This includes
various aspects of simulation, reconstruction and physics analysis tools.
Except for questions of interfacing, it does not include code that we don't
develop and/or distribute, and it doesn't include code that is used by only
a small number of localized physicists. Everything not part of the offline
project belongs either in computing operations or online, but this note is
not intended to list them all.
The offline project is organized by an "offline project manager" (OPM). He
or she works with a deputy (DOPM), and through a number of "area managers".
Currently, those areas are "simulation", "reconstruction", "database",
"releases", "physics tools architecture" and "physics contacts".
The purpose of this note is to lay out, in some detail, expectations for who
will be primarily doing what. We have a good team in the offline, and we
know how to work together - this is not meant to change that. Rather, this
note presents to the larger community a snapshot of how this is working now.
Each area manager has responsibilities and authority listed below. Each
area manager is delegated whatever authority the offline project manager
has within his or her specific area, except as listed. Each area manager
is responsible (with the project manager and deputy) for setting goals &
priorities within their area, and then accomplishing them.
In addition, there are likely to be various temporary special case
projects. These will be explicitly somebody's responsibility, either an
area manager, the DOPM, or the OPM. An example would be the effort to
re-support the HP machines.
Offline project manager
Responsible for the union of the area managers' and deputy's
responsibilities. Individually responsible (i.e. can't get away from it) for
overall schedule, resource allocation and manpower procurement. Responsible
for the overall balance of priorities within and among the area managers'
efforts. Responsible for documenting progress or its lack, and explaining
why.
As a specific responsibility, until delegated, the project manager will
oversee the development of the event display(s) and related graphics.
See further below for things explicitly _not_ part of the offline project,
and hence not within the authority/responsibility of the OPM.
Authority: Makes and maintains own schedule, routinely reported. Most
additional authority has been delegated, and resides with the area
managers. Can replace deputy and area managers after consulting with the
computing coordinator and computing standing committee (which is defined
somewhere to be computing coordinator plus some set of people from the
computing group).
Resources: No computing professionals, as these have been redirected to the
area managers. Students and postdocs as they are attracted.
Reconstruction Manager
Responsible for development and deployment of the common aspects of
reconstruction, including the reconstruction geometry & alignment
calibration model & code, the overall organization of reconstruction
executables, and the reco infrastructure packages. Responsible for the
system-specific aspects of the reconstruction code to the level desired by
the specific system managers (Note: This is not a shared responsibility.
The reco manager is responsible for the technical aspects of the reco
software within the five systems and their integration into the whole. The
system manager's offline designee is responsible for organizing the people,
setting priorities within their systems, and making decisions about which
algorithms are most suitable, etc. The offline reco manager is responsible
for setting up means of validating that the specific system code works when
integrated into the whole). Responsible for the "tracking system"
reconstruction and monitoring. Responsible for the checkout and
development of the Bear package and executable.
As an additional responsibility, responsible for the basic offline
infrastructure, including Framework, the transient event and environment
model, SRT and the basic build structure, PackageList, etc., including the
overall dependency structure of the code, but not the specific dependencies
in any package outside common & reconstruction. (This is historically a
reconstruction responsibility, and as reconstruction stresses this area, we
chose to keep it one.) Responsible for negotiating changes to this
infrastructure with the online organization to the extent they are
sensitive to it.
Authority: Full authority over the reconstruction packages, including the
associated infrastructure packages. Authority over the reconstruction and
infrastructure contents of the development releases, subject to specific
scheduling issues currently delegated to the deputy offline project
manager. Authority over content and scheduling of changes to the basic
offline infrastructure, including Framework, the transient event and
environment model, SRT and the basic build structure, etc. Works directly
with the QC and QA groups under operations.
Resources: Directs the work of Asoka Desilva (Framework and
infrastructure), Terry Hung (SRT and script support) and
Wouter Hulsbergen (Tracking).
Simulation Manager
Responsible for the fast, intermediate and detailed simulations, including
development and validation. Responsible for development and deployment of
the common aspects of simulation, including the simulation geometry & code,
the overall organization of simulation executables, and the simulation
infrastructure packages. Responsible for the system-specific aspects of the
simulation code to the level desired by the specific system managers (Note:
This is not a shared responsibility. The simulation manager is responsible
for the technical aspects of the simulation software within the five
systems and their integration into the whole. The system manager's offline
designee is responsible for organizing the people, setting priorities
within their systems, and making decisions about which algorithms are most
suitable, etc. The offline simulation manager is responsible for setting
up means of validating that the specific system code works when integrated
into the whole). Responsible for integration of the
generators, and the technical and consistency aspects of the generator
internals. Responsible for offline aspects of background mixing, L1 and L3
trigger simulation.
Although the Simulation Manager is responsible for all of the above, most
of those aspects require little attention (beyond deployment of new L1
simulation). BaBar has been running for more than four years now and
we have learned a lot about our detector and the underlying
physics. The Simulation Manager should work towards including these
lessons in the simulation so that it can mirror the data more
closely.
Authority: Full authority over the simulation packages, including the
generators. Has authority over non-hardware-specific aspects of mixing &
trigger simulation software development.
Resources: SLAC G4 developers and other developers doing Common
Task work on Core Simulation. Helps set priority for subsystem
developers.
Release Manager
Responsible for the operation, maintenance and implementation of the
release system used to build the various forms of common releases. Also
responsible for the operation, maintenance and implementation of the testing systems.
Authority: Full authority over the construction and operation of the
system. Must coordinate release build schedule, special cases and overall
capacity with the deputy offline project manager. Must coordinate
development of the general structure of the release build system with the
reconstruction manager.
Physics Contact and Physics Tools Architect are now combined as
Physics Software Manager.
Physics Tools Architect
Responsible for the architecture and implementation upon which specific
physics analysis tools are built. This includes Beta, the structure of the
physics tools packages, and relevant other packages & base classes.
Responsible for the integration, either proactively or retroactively, of
provided tools with the offline system support for physics tools. Not
responsible for the provision, documentation and/or tuning of any specific
tool; these are the responsibility of the respective physics tools groups
via the physics contact.
Open question: What role does this manager have in the provision of "the
PAW replacement"?
Authority: Within the general software structures previously established,
full authority to design and implement software infrastructure for physics
tools, including relevant modifications to pre-existing tools. Is expected
to make decisions of priority for tool support development, is expected to
chair decision-making meetings on tool architecture and support, and is
expected to organize code migrations when necessary. Chairs a routine phone
meeting where new and updated tools are presented and their architecture and
updates to Beta are discussed.
Resources: Second call on the time of Akbar Mohktarani.
Physics Contact
Responsible for the inclusion of physics-related software into the offline
executables. Responsible for obtaining feedback from the physics analysis
organization on event store use, generator quality, specific physics
analysis tools, etc. Responsible for obtaining and deploying documentation
of the performance of specific tools made part of the offline system, and
recommendations on expected use. Responsible for event selections (cut
values and algorithm selections) that will be run so that remote
institutions can get access to the data they need. Will consult widely
within the specific physics analysis and physics tools groups, and arrange
the integration of appropriate tools into the offline system.
Authority: Final call on the inclusion of a specific tool or tool update in
the offline system, within the parameters of the development cycle. Selects
which tools will be run in the offline productions, and how they will be
configured. Specifies how selections, etc., will be run.
Resources: None, unless provided out of the physics organization. But as
this is primarily a communication role, it's not clear how much assistance
is needed.
Note: There is a built-in tension between the Physics Contact and the
Physics Tools Architect. In the case of a "useful, but technically
incompatible" tool, there will be forces pressing for and against
inclusion. The decision on inclusion belongs to the Physics Contact, but
it is expected that the Physics Tools Architect's opinion will be weighed,
and that both parties will work toward improvement of that particular tool
after inclusion. This emphasis is consistent with our "physics first, but
maintainability for the long term" approach.
Distributed
Database Manager
Responsible for the safety, consistency and efficiency of the data stored
in the common offline databases. Responsible for the common database code,
including utility programs and classes. Responsible for event catalogs and
collection utilities, query support. Responsible for the overall operation
of the databases, including technical oversight of the HPSS system.
Responsible for testing integrity of storage systems, including backup
provisions.
Authority: Makes technical decisions regarding the structure and
implementation of the common database code. Makes operational decisions
about clustering and placement, structuring data for transport, schema
modifications and changes to the database build procedures. Determines
operational procedures for use of the databases, by both individual users
and production systems.
Resources: Directs the event store engineers (Yemi Adesanya and Daniel Wang)
and the conditions database engineer (Igor Gaponenko), and aids the online
database engineer (Andy Salnikov). (Titles may not be exactly right.)
Open issue: This role is in the distributed organization, but there is
significant online overlap. I don't intend to let the database manager play
both ends against the middle. The online and operations managers should make
really clear what they need from the database group and when; then it's the
distributed organization's job to make sure they get it.
Operations
The operations manager is responsible for the operational activities of
BaBar Computing. This includes Simulation Production, Prompt
Reconstruction Production, Skim Production, Kanga Production, Database
Operations, Quality Control and the upkeep of BaBar Documentation. The role
requires very close liaison with the Operations Managers of each Tier-A
facility. Because the operations manager is also the Operations Manager for
the SLAC Tier-A, they are also responsible for developing deployment plans
for SLAC resources in cooperation with SCS. They are the primary conduit for
information between BaBar and SCS.
Online
The Online Event Processing system provides the infrastructure for
Level 3 triggering, online data quality monitoring, and parts of
online calibrations. The OEP group develops and maintains the
associated software, and provides on-call support for operational
problems involving it. The Fast Monitoring Operations manager
provides support and training for user and detector system interaction
with the data quality monitoring system.
Provides basic support for all uses of the Java language in the
online system. In particular, supports the application in BaBar of
the JAS ("Java Analysis Studio") program, and the BaBar-specific
components that connect with it. Provides assistance to the OEP
group in the maintenance of the user interfaces for data quality
monitoring (on-call support is provided by the Fast Monitoring
Operations manager).
The DAQ Operations Manager is responsible for maintaining the health
of the BABAR Data-AcQuisition system (DAQ) and instructing others in
its use. To perform this role, the DAQ Operations Manager must
implement software upgrades to the "Production" (data-taking)
software, coordinate hardware and software changes to the system
amongst the many online groups (Dataflow, Event Processing, Level 3
Trigger, Run Control, and Detector Controls), train the shift-takers
in the operation and troubleshooting of the data-acquisition system,
and develop tools and documentation for the overall improvement of DAQ
operations.