2003 DoE SciDAC PI Meeting

Napa Valley, March 10-11, 2003

Rough Notes by Les Cottrell

Introduction - Laub, SciDAC Director

The first PI meeting was last year in Washington. The Office of Science is trying to carve out leadership in computational science. The lasting value of SciDAC will be the culture shift toward working together as teams (of computational scientists, physicists, and mathematicians), as well as the tools. We are now 1.5 years into SciDAC and need to review progress. See http://www.osti.gov/scidac/updates.html and updatesbenson.htm. There was heavy representation from ANL, LBNL, and ORNL; LLNL was also well represented. This matched the speakers and who was involved in and funded by the SciDAC projects.

SciDAC Role in the DoE Office of Science - Ray Orbach

The Office of Science web site has several talks which give the direction SciDAC and the Office of Science are going. Computational science is a third leg (alongside experiment and theory) of scientific exploration. The US will join ITER based on the belief that ITER will work; this belief rests on computational simulations, but the computational facilities are not yet powerful enough. The OS philosophy is to develop computation into a fully-fledged third leg of science. An understanding of the need for UltraScale simulation is emerging in Washington. OS is taking the leadership in UltraScale simulation (large-scale facilities), while NSF takes the lead in Grid computation; this is a very complementary set-up. The national security community, DoD, and NASA also have big needs for high-end computation, yet many programs (e.g. at NASA) that were once powerful in computation have been allowed to wither. NASA does not have the computational power to handle the requirements for analyzing gravitational collapse, colliding black holes, etc. for data from the three-satellite LISA (gravitational physics) project. OS wants to reverse this decline in high-end computation. They also want to drive it from the science end, rather than saying "here is a fast machine, figure out how to use it." The ASCI machine program's efficiency on non-ASCI-specific problems is in the single-digit percents. The Earth Simulator (26.5 TFlops) just developed in Japan has raised recognition of the need for computing aimed at general-purpose science computations. The president's budget contains $15M for computational architectures. They expect to be able to build such general-purpose machines within 4-5 years. Ray believes the US has fallen behind and needs to regain the lead.

They are working with IBM. NERSC has a 10 TeraFLOPS general-purpose machine from IBM (turned on in March 2003); 50% of this will be available to the scientific community, about 20% is for SciDAC, and 30% is for Grand Challenges. IBM has only guaranteed 12% of the cycles for general-purpose computing, which makes it difficult to see how to use the machine for general advanced science. Supernova (nuclear science) codes get 35% of peak, radiation hydrodynamics gets 18%, accelerator modeling 25%, and Wilkinson 50%.

A contract with Cray for an X1 vector machine at ORNL was just announced. They are eager to try out various codes on that machine. Vectorization is non-trivial; the climate folks are working on how to vectorize their codes. They want to see what the right balance of vector and parallel computing is, and for which problems. A new machine from HP is also going into EMSL in the Pacific Northwest. It is unclear whether SGI can get into the UltraScale region; that may depend on DoE funding. HEP is lucky in that we have asymptotic freedom, where the coupling decreases as the scale decreases. This is different from condensed matter, where the coupling increases as one goes to smaller distances.

Industry has big requirements for ultra-scale computing. An example is GE, who could build a virtual jet engine if they had a 25 TFlops computer. GM needs ultrascale computing to simulate ignition properties. Both of these could result in enormous savings over how measurements are made today and in a quicker turn-around from idea to product, and hence greatly improve US business competitiveness.

Ari Shoshani pointed out that data/storage management and networking are also critically important to ultrascale computing.

Getting the Performance into HPC - Jack Dongarra (UTK/ORNL)

Moore's law has pretty much predicted the performance of HPC. IBM BlueGene/L (64 cabinets x 32 x 32 nodes, about 131K processors) and ASCI Purple are coming. The problems are in complexity, robustness, etc.; in particular there is a gap between hardware and software. In 12 years we have seen an increase in LINPACK 100x100 performance of 8231x, where Moore's law gives 256x; a factor of 128 came from clock speed, with the rest from external bus width and caching, floating point, and compiler technology. The potential for the P4 is a factor of 4 more (5.6 GFlop/s). Memory speed is only increasing at 9%/year, while microprocessors increase at 60%/year. For a vector update like y = a*x + y (2 flops against 3 memory words per element), 5.6 GFlop/s requires about 8400 MWords/s of bandwidth from memory, but the P4 has only 266 MWords/s. They try to overcome this by recognizing locality and using caching.

We need to identify the bottlenecks (e.g. in the processor, so they can be mapped back to the application) and improve compiler code-generator efficiency. We want to automate this, since it is very costly to do by hand and has to be repeated for new hardware and OS releases. In the future, numerical software will be adaptive and intelligent; determinism in numerical computing will be gone; the importance of floating point will be undiminished; and applications will need built-in fault tolerance and auditability.
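A back-of-envelope version of that bandwidth argument, as a minimal Python sketch; the AXPY-style kernel (2 flops against 3 memory words per element) is the usual illustration and is assumed here, and the 5.6 GFlop/s and 266 MWords/s figures are the ones quoted above.

    # Memory-bandwidth limit for an AXPY-style update y[i] = a*x[i] + y[i]:
    # 2 flops per element against 3 memory words (load x, load y, store y).
    FLOPS_PER_ELEMENT = 2
    WORDS_PER_ELEMENT = 3

    def required_mwords_per_s(flop_rate_mflops):
        """Memory traffic (MWords/s) needed to sustain the given flop rate."""
        return flop_rate_mflops / FLOPS_PER_ELEMENT * WORDS_PER_ELEMENT

    peak_mflops = 5600     # 5.6 GFlop/s P4 peak quoted in the talk
    bus_mwords = 266       # ~266 MWords/s memory bandwidth quoted for the P4

    needed = required_mwords_per_s(peak_mflops)    # 8400 MWords/s
    print(f"need {needed:.0f} MWords/s, have {bus_mwords} MWords/s "
          f"-> at most {bus_mwords / needed:.1%} of peak when memory-bound")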

To get high utilization we need: close interaction of applications with computer science and mathematics; dramatic improvements in the adaptability of software to the execution environment; improved processor-memory bandwidth; and new large-scale systems and architectures. SciDAC is designed to help make this happen.

BER Applications

Climate modeling: the debate is over a change of 0.5-1.0 K on a baseline of about 285 K, looking for the effects of 1-5 W/m^2 of forcing out of about 250 W/m^2. Early climate modeling papers date from 1972. The figure of merit is to simulate 5-10 years per day of computing; as machines get faster the detail will increase. Modeling indicates that, starting in 1980, non-natural causes can be identified as having increased the average temperature (i.e. the increase is outside statistical fluctuations). Computing capability is increasing dramatically, but it is expensive and hard to use, so people and money have to be diverted from hiring scientists to hiring support staff and building computing infrastructure.

The Community Climate System Model Software Engineering Consortium - John Drake (ORNL)

The Geodesic Climate Model of the Future - Dave Randall (CSU)

Terascale science of chemistry-turbulence interactions - Larry Rahn (SNL)

Computer Science ISICs

Performance Engineering - David Bailey (LBNL)

We want to understand how the application, system, and hardware need to be optimized to get the best performance. There can be big gains from doing this, given the costs of supercomputer centers. It includes compile-time optimization and self-tuning software. They have built many GUI tools to look at internal performance, etc. See http://perc.nersc.gov/

Math ISICs

Terascale Optimal PDE Simulations (TOPS) - David Keyes (ODU)

ANL, LLNL, and LBNL, plus some universities. Advances in algorithms have provided as much improvement as Moore's law has for FLOPS.

Terascale Simulation Tools and Technologies Center - James Glimm (SUNY SB, BNL)

Want to make the tools easier to use. They are working with climate, fusion, accelerator design, and diesel design, and with Kwok Ko of SLAC on understanding the effect of mesh quality on Tau3P.

Collaboratories, Networking/Middleware

Introduction & Overviews on National Collaboratory Pilot Projects - Mary Anne Scott (DoE)

Goal: Enable geographically separated scientists to work as a team.

Several projects: the DoE Science Grid (a big accomplishment was setting up a Certificate Authority for security); the Fusion Grid; PPDG (which has a large number of accomplishments; the Internet2 Land Speed Record was broken at SC2002); the Collaboratory for Multiscale Chemical Sciences (which has a prototype with a set of tools and technologies they are developing); and the Earth System Grid (ESG).

Progress on Earth Systems Grid (ESG) Project - Don Middleton (NCAR)

A joint project among ANL, LLNL, LBNL/NERSC, NCAR, and ISI. The goal is to enable management, discovery, distributed access, processing, and analysis of distributed climate research data; it is built on Globus, and a major goal is deployment. They have 72 TBytes of data scattered around the sites. The amount of computing depends on the resolution: typically today they use 250 km cells and want to get down to at least 70 km (which increases the data by a factor of 10-20). Another dimension is reducing the output time scale to an hour. They also want to improve the quality of boundary layers, clouds, convection, ocean, etc., which adds another factor of 10. The Earth Simulator in Japan is running T1279, approximately 10 km. Overall they want to increase computing by a factor of 1000-10000. At the same time a huge number of satellites will be launched.

ESG challenges: data management including distribution, and workflow for knowledge development. Move a minimal amount of data, keep it close to its computational point of origin, and when it must move, move it fast with minimal human intervention (storage resource management, fast networks); keep track of what they have got (metadata and replica catalogs). They have a Hierarchical Resource Manager (from LBNL/Ari Shoshani) running across the DoE HPSS systems. It works, is 100 times faster than what was done before, and reduces the researcher time required to restart FTP transfers, etc.
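A quick sanity check of those resolution numbers, as a sketch (the CFL-style time-step scaling and the breakdown are my assumptions; the 250 km and 70 km cell sizes and the factor-of-10 physics estimate are from the talk):

    # Data volume grows with the number of horizontal cells; compute cost also
    # pays for a shorter (CFL-limited) time step and for richer physics.
    coarse_km, fine_km = 250.0, 70.0

    cells = (coarse_km / fine_km) ** 2   # ~13x more horizontal cells
    steps = coarse_km / fine_km          # ~3.6x more time steps (CFL assumption)
    physics = 10                         # boundary layers, clouds, convection, ocean

    print(f"data from refinement alone: ~{cells:.0f}x (talk quoted 10-20x)")
    print(f"compute: ~{cells * steps * physics:.0f}x before higher output "
          f"frequency, heading toward the quoted 1000-10000x")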

OPeNDAP (the Open-source Project for a Network Data Access Protocol) is very valuable when you do not need to move all of the data.
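To illustrate why that matters, a hedged sketch of a DAP subset request: the server host, dataset path, and variable name below are hypothetical, but the var[start:stride:stop] constraint syntax is standard DAP, and only the requested slice crosses the network.

    # Ask an OPeNDAP server for a slice of one variable rather than the whole file.
    from urllib.request import urlopen

    dataset = "http://esg.example.org/dap/ccsm/cloud_monthly.nc"  # hypothetical server/dataset
    # One year, one lat/lon region of a (time, lat, lon) variable (names hypothetical).
    constraint = "cldtot[0:1:11][40:1:60][100:1:140]"

    with urlopen(f"{dataset}.ascii?{constraint}") as resp:  # .ascii is a standard DAP response
        subset = resp.read()

    print(f"fetched {len(subset)} bytes instead of the whole dataset")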

They showed a demo of requesting some data (e.g. cloud data for a particular year), being provided with a list of places that hold the data, selecting where to get the data from, and receiving the data together with its metadata description.

Managing the collaboration is a major effort.

Overview of Middleware and Network Research Projects - Ray Bair (PNNL)

The aim is to create the next generation of infrastructure to support SciDAC. There are four categories of projects.

An ANL project has produced the Access Grid. Indiana U. has middleware technology to support science portals; the goal is to provide easy access to Grid resources. Portal web services (UT, Indiana, UCSD, GA) are developing generalized toolkits for accessing the Grid. The Pervasive Collaborative Computing Environment (LBNL, UW) supports the day-to-day collaboration needs of scientists; it has a secure messaging and presence tool, a reliable transfer tool, and a collaborative workflow tool.

Data transfer and management: the High Performance Data Grid Toolkit (ANL, USC, UWisc) wants efficient, high-throughput, reliable, secure, and policy-aware management of large-scale data movement. They achieved a 2.8 Gbps transfer rate with GridFTP, and transferred 236 GB in 54 hours, surviving multiple failures with no human intervention. SRM (LBNL and FNAL) can robustly replicate thousands of files using a single command. Scientific Annotation Middleware (SAM) from PNNL and ORNL is developing lightweight, flexible middleware to support the creation of metadata and annotations.

Middleware for secure sharing, multicast, and more: Peer-to-Peer Reliable and Secure Group Communication (LBNL) is developing the InterGroup protocol, a reliable multicast protocol intended to scale to large group sizes and to the Internet. Distributed Security Architectures (Mary Thompson, LBNL) provides usable policy-based access control for computer-mediated resources; it uses Akenti. The SciDAC Commodity Grid (CoG) Kits (ANL, LBNL) allow application developers to make use of Grid services. The eServices Infrastructure for Collaborative Services (ANL, LBNL) is developing a unifying architecture for the Grids (the Open Grid Services Architecture, OGSA). The Distributed Monitoring Framework (Tierney, LBNL) provides the ability to do performance analysis and fault detection in a Grid computing environment (it includes pyGMA and NetLogger).

SciDAC network research: optimizing performance and enhancing the functionality of distributed applications using logistical networking (Micah Beck, UT and UCSB). INCITE (Rice, SLAC, LANL): understand internal Internet traffic and infer dynamic internal network characteristics by looking at the edge (MAGNET, tomo, topo, PingER). BWEST (CAIDA, GATech): pathload and pathrate; they will hold the 2nd bandwidth estimation workshop this summer. SciDAC security and policy for group collaborations (Steve Tuecke, ANL, UWisc, USC): develop a community authorization structure.

Panel: Closing the Performance Gap

Closing the performance gap used to be a major reason for SciDAC. What is it, who does it affect, and who cares? The first panel addressed software issues. Panelists: David Bailey (LBNL), Tom Dunigan (ORNL), Bill Gropp (ANL), and Steve Jardin (PPPL).

Peak performance is skyrocketing (it has increased by 100x in the past 10 years), but efficiency has declined from 40-50% on the vector computers of the 1990s to 5-10% on today's parallel supercomputers. The challenge is in the software. It takes heroic efforts to achieve high performance; we need to work with vendors and others to come up with better tools, faster machines, and new testbeds, to evaluate them and provide feedback to the vendors, and to develop new languages that get easier/better performance out of parallel computers. It is a continuing problem as new computers, compilers, etc. are introduced. We need to measure not the highest floating-point rate but the most science per unit time. The Earth Simulator indicates the US was asleep at the wheel.

Comments/questions: we need to ensure the optimization target is science per dollar. SPEC is going through a new round of solicitations for applications, so it would be good to get some standard science applications in there.

Panel: Future SC Computing/Infrastructure Needs

Panelists: Dave Bader (PNNL), Don Batchelor (ORNL), and Rick Stevens (ANL).

The climate modeling figure of merit is 4-10 simulated years per day. The death of the vector machine was announced prematurely. Special-purpose computers are needed for the earth sciences.

Discussions

Had a long set of discussions with our INCITE collaborators from Rice (Rolf Riedi and Richard Baraniuk) and LANL (Wu-chun Feng and ). We discussed where to focus in the coming year. Possibilities include: take the packet dispersion techniques and build them into the TCP stack to assist in bandwidth estimation (a sketch of the basic idea follows below); test out the packet chirp technique on high-speed networks and compare it with ABwE; find out whether Tom Dunigan's integration of Wu's dynamic window estimation into Web100 was included in the version we used at Sunnyvale; and evaluate Rice's TCP-LP (a low-priority version of TCP) to see how it works on a high-performance network and how it behaves compared to, say, QBSS.
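A minimal sketch of the packet-pair dispersion idea behind that first item (illustrative numbers only, not taken from any of these tools):

    # Two back-to-back packets get spread apart by the bottleneck link; the
    # spacing seen at the receiver gives a capacity estimate.
    def packet_pair_capacity_bps(packet_size_bytes, dispersion_seconds):
        """Bottleneck capacity estimate (bits/s) from one packet pair."""
        return packet_size_bytes * 8 / dispersion_seconds

    # 1500-byte packets arriving 120 microseconds apart -> ~100 Mbit/s bottleneck
    print(f"{packet_pair_capacity_bps(1500, 120e-6) / 1e6:.0f} Mbit/s")

    # Tools like pathrate and ABwE send many pairs or trains and filter the
    # resulting distribution, since cross traffic distorts individual samples.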

Talked to KC Claffy about pathload. There will be a new release in April that will address speeds > OC3.

Gave Mary Anne Scott of DoE information on the LSR, hoping she could say something in her "Collaboratories, networking/middleware" session on Tuesday, which she kindly did.

Talked to Stan Woosley of UCSC, who works on supernovae. They need to send a GByte/hour from the simulation at NERSC to the customer site. The way they do things today is constrained by today's achievable data rates.

Talked to Don Petravick of FNAL, who now leads the network and security groups there. He wants to re-invigorate advanced network technology at FNAL and is keen to get more involved with the IEPM-BW/PingER work. He would like to set up a meeting between SLAC and FNAL folks to share ideas and discuss futures.

Action items: send ABwE to Mark Gardner at LANL and to Don Petravick at FNAL; set up a meeting between the FNAL and SLAC folks interested in IEPM.