Author: Les Cottrell. Created: October 15
|Intro to Grid & Collab Security||Search for InterLab VPN||Washington Update||Grid Monitoring||Applied Research at ESnet|
|Grid CP||Experiences with Cisco IDS||ESnet Update||Detective Story||Internet2 status & Futures|
|ESnet proposal for PKI and Dir Serv.||Creating an Enhanced IDS at Argonne||High Performance Network Research Update||QBSS & PingER status||ESnet Program Review|
|DoE Science Grid Security||DNS/Win2K Issues||Achieving high throughput|
The problem is flexible, secure coordinated resource sharing among dynamic collections of individuals, institutions and resources. Want to enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals, assuming the absence of a central location, central control, omniscience, existing trust relations (see Anatomy of the Grid: Enabling Scalable Virtual Organizations).
He gave several examples of Grids including the APS, CERN/LHC and Earthquake Engineering Simulation, & Entropia (home computers to evaluate AIDS drugs). He also stated that IBM's interest in Grid is at the same level as its Linux interest.
Globus will have an open source reference software base for building grid infrastructure & applications, supports the GGF. The Globus toolkit is a collection of tools to bring it together. Start with protocols and APIs and then implement.
For security they are focusing on Authentication, Authorization and ... Problem is resources are often located in distinct administrative domains with own policies and procedures. Has to be easy to use, single sign on (requires delegation), can be used to run applications.
GridFTP big new requirement is striping a la DPSS. They are working on a prototype striped GridFTP. Demonstrated early version at SC2k. Concerns over parallel streams and effect on peak utilizations. May hurt network but meets requirements of users.
This is an ESnet activity to be involved with the IETF working group to work with International grid communities to develop a certificate policy etc. See www.gridcp.es.net for documents, references (existing CPs and background i.e. RFC 2527). Open issues: certificate levels, PKI architecture ... Final release of first version of document Feb 2002 (Toronto). Need consensus of WG, then go to GGF to get approval, then publish as an informational RFC. Then need to work on separate documents for current work, and Acceptable Use Policies. PKI architecture repository model of trust. Certificate profiles decide what goes into a certificate; the goal of the WG is to establish required fields, and look at the naming issues. PMA (Policy Management Authority - people responsible for key policy and auditing) lead is Peter Gietz. They are reviewing setting up a volunteer PMA with a kernel standing WG.
Recently approved PKI directory proposal. This reported on a survey of tools etc. The proposal covered the name space (resources & PKI), resource meta directory, CSA infrastructure with a root CA and support for subordinate CAs; certificate repositories, policy documents including a Certificate Policy (CP), and a policy management authority (PMA). Long approval process, changes in Globus, changes in the Grid, and the emergence/approval of the DoE Science Grid project. Multi-level certificate is modeled after the Federal Bridge 4 levels.
PPDG has an immediate DoE Science Grid customer. Unclear where it fits in with DoE PKI. PPDG is an example of a "Virtual Organization", with known requirements including interoperability with other grids (namely trustable authentication infrastructure), and has known unknowns (e.g. assurance levels, distribution of registration, interaction with sites' security policies).
The DoE PKI is aimed at DoE employees as clientele. The environment is the Federal Bridge CA (DoE CA is the only portal to the Federal Bridge), and applications are confidentiality for secure documents with strong authentication and key recovery, digitally signed certificates and a key life cycle. It will be centrally run with a single CA but multiple remote CAs. It is well supported, highly structured, X.500 based, with high quality and auditing.
ESnet PKI clientele is the SciDAC & DoE Science Grid; it is an open Grid environment. The applications today are authentication only, with confidentiality and authorization tomorrow. It will need some central resources. The assurance level for certificates is still a matter of debate. Unclear whether it needs a directory, probably later.
Mike feels the ESnet PKI and DoE PKI are complementary. Grid is still under a lot of development so PKI will need a lot of flexibility compared to the DoE PKI.
The directory needs remain the same, Globus response is to develop a complex metadirectory service. Repositories will be needed for an OID, certificate signing etc.
Project scope is aimed first at DoE Science Grid, then other SciDAC related efforts, ESnet projects, ESnet internal PKI and ESnet site PKIs.
The project plan is to deal with immediate needs, namely PKI issuing certificates for key projects, interoperability. Will need outreach. Policy needs to be defined, namespaces need to be thrashed out.
They are working with Grid Forum, IETF, DoE PKI and Federal bridge.
Policy Management Authority (PMA) will provide project oversight, project timeline, certificate assurance levels, Grid CP effort. For the longer term must also focus on interoperability, auditing, convergence with other PKIs. There are 4 identity assurance levels: rudimentary (possession of private keys and a token), basic (site info, 2 pieces of approved ID), medium (notarized ID), high (personal appearance).
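One way to picture the four assurance levels is as an ordered scale against which a resource states the minimum it will accept. The sketch below is only an illustration under that assumption; the names Assurance and acceptable are made up here and do not come from the ESnet proposal.

```python
from enum import IntEnum

class Assurance(IntEnum):
    """Identity assurance levels as listed in the notes (ordering assumed)."""
    RUDIMENTARY = 1  # possession of private keys and a token
    BASIC = 2        # site info, 2 pieces of approved ID
    MEDIUM = 3       # notarized ID
    HIGH = 4         # personal appearance

def acceptable(cert_level, required):
    """Hypothetical policy check: does the certificate meet the minimum level?"""
    return cert_level >= required

if __name__ == "__main__":
    print(acceptable(Assurance.MEDIUM, Assurance.BASIC))       # meets minimum
    print(acceptable(Assurance.RUDIMENTARY, Assurance.BASIC))  # below minimum
```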
Namespace "we want a flat name space that's highly structured". There are ownership/trademark/dispute resolution. Branding. Globus MDS. Every object must have a unique name.
PKI architecture will probably have multiple CAs (Grid CP WG) with distribution list of CAs, cross-signing and hierarchy.
There is a lot of utility in a rudimentary assurance level. They intend to issue few "high" assurance certificates with all that implies about operations. Customer support will require growth management, and expectations.
They would like to get something simple going soon with a timescale of weeks.
This presentation was by video. VPNs are typically implemented with IPSec and he only covered such implementations. He did not cover L2TP or PPTP. IPSec is specified in several RFCs: 2401, 2402, 2406, 2408, 2409. He described how VPNs work at a high level, and the uses of VPNs. Can use for remote host to corporate site or for site to site (all traffic between sites flows through the tunnel and anyone monitoring traffic over the Internet cannot see which systems are talking). VPNs do not eliminate backdoors, nor do they guarantee the security of the systems at the other end of the wire, nor do they authenticate users.
How do you trust another site: need comparable user populations, similar security policies, similar boundary protections. Sites should negotiate security agreements and document security environments; firewalls may still be used between sites (usually more open than Internet firewalls if sufficient trust is established).
How can you trust a remote VPN client? Must have strong user authentication (2 factor), must trust the user to operate in a defined security environment (user must not have back doors, user cannot talk to Internet and VPN at the same time). Can establish some technical controls - download security policies from VPN server, policies can prohibit simultaneous connections to other networks while VPN is being used, can implement personal firewalls. Also Sandia says VPN can only be used from Sandia owned home machines.
Introduced the idea of a Yellow network (between secure and open). This is the tough problem, since security for Red networks is well defined and non-negotiable, & Green networks are pretty open. Sandia have documented their Yellow net (site restricted network). Others can review the requirements of another site's Yellow network and decide what they will accept from that site. A definition could be: controlled user population with procedures for managing foreign national access, provide citizenship information to other sites upon request; site must have a boundary control system that is physically and logically protected.
FIPS 140-1 certification has been met by some vendors, but these devices have a problem in that they cannot support > 100Mbps.
ANL implemented major changes to its Cyber Security posture during FY'01. Installed lab FWs (Cisco PIX 535). Installed central VPN concentrator (Cisco 3060), plus an IDS (part of NetRanger series).
Have 2 IDSM line cards (~ NetRanger on a card for 6509s). Four NR devices deployed throughout the Lab, maintain Oracle DB with historical information. Most alarms are from outside the firewall. There are 367 types of alarms, 23 of them are set to cause an action to be taken. The top 10 IDS alarms are IP fragmentation, SYN sweeps, IIS enable. Alarms are associated with Nimda, Code Red, Lion worm. NB a sweep of a class B will show as 656K alarms. Need to update signatures (from Cisco) on a regular basis. This is separate from the normal 6509 line card updates. Can write one's own signatures. Can customize to ignore/filter alarms. Can use ACLs to hide problems from alarm. Signal to noise ratio is still not where they want it, cannot currently enable an "automated response". Legitimate scanning machines required special handling.
Harder to integrate into a network management environment (e.g. MRTG or NMS) than regular network equipment.
Code Red experience: they obtained a special signature manually with help from Cisco. At the onset the signature was VERY accurate (high water mark for in-bound Code Red was ~ 10,800/day from 5000 hosts). Signature has gotten very inaccurate on "outbound" alarms, needs more event correlation. Finding infected machines at ANL was faster with Cisco IDS than with NetFlow. Two infected machines were connected via VPN. Can require that VPN connections are only accepted from machines running Black Ice or one other home firewall. One false alarm occurred at the ANL-W site; the interesting aspect was that the IP reported as infected was a web proxy address.
Nimda experience: shows up as a collection of Cisco IDS alarms. S/N ratio not as good as Code Red but it was fairly reliable at the beginning. Like the Code Red signature it has become more inaccurate as people try to check the site for Nimda.
Plans: continue to coordinate & manage Lab wide IDS efforts; develop better event correlation to improve S/N; develop better notification and escalation. Need much better S/N before enabling active response. Will deploy new 6509 IDSM line cards. Current ones can only sustain 100Mbps traffic.
Significant issues to date: IDSM line cards chronically go to sleep - lose contact with the IDS director software (24 times in 3 months). Cisco COTS system is not yet ready for enterprise deployment (too much manual effort to deploy consistent signatures to multiple cards). About 10 IDS signatures are completely erroneous in our environment.
Do not block IPs that probe the Lab; bulk of energies are spent looking for compromised machines. DOE-HQ has purchased 9 6509s with IDSM line cards (being deployed by EDS).
PIX 535 firewall has been fairly reliable. Started using with conduit statements, now looking to move to ACLs. Has GE and FE interfaces.
They have about 1.25 FTEs working on Cisco IDS. It seems to be more accurate than some of the other tools ANL uses.
IDS is just one part of Cyber security architecture. Architecture, host based, firewall, IDS. Oracle database for forensics. IDS provides notice of intrusion attempt but not success or failure of attempt. Today's systems require too much manual analysis to detect long term trends, have too many false alarms, and rely on vendor for updates (most information is proprietary so hard to build on).
So want an enhanced IDS. Goals: Reduce false positives, improve accuracy. Use existing IDS sensors where applicable. Differentiate between probes, worms & intrusion attacks. Identify vulnerable lab hosts that respond to specific attacks.
Steps: Work with existing vendors to understand & enhance commercial products. Incorporate portions of existing public domain IDS (e.g. Bro). Add tcpdump capability. Add new sensors to gather detailed packet traces. Add sensor based control systems to reconfigure sensors when specific events occur. Add state based decision level algorithms to correlate events and to decide what to do. They are developing STAT/STATL based IDS components from UCSB. Will enhance STATL to define new types of signatures based & COTS based). Enhance STAT modules to detect script based signatures. Enhance STAT to support higher level signatures. They have installed and are comparing existing sensors (such as STAT, snort, Bro ...).
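The state based correlation step might look something like the following. This is a minimal sketch, not ANL's design: the alarm tuple layout, the LAB_NET placeholder address range, the one hour window and the function names are all assumptions made for illustration. It flags an internal host only when it emits a signature outbound shortly after receiving the same signature inbound, which is one plausible way to separate probes from successful infections.

```python
import ipaddress

# Placeholder for the lab's address space (assumption for this sketch).
LAB_NET = ipaddress.ip_network("10.0.0.0/8")

def is_internal(addr):
    """True if addr falls inside the (assumed) lab network."""
    return ipaddress.ip_address(addr) in LAB_NET

def correlate(alarms, window=3600):
    """Find internal hosts that look compromised.

    alarms: time-ordered iterable of (timestamp, signature, src_ip, dst_ip).
    A host is reported if it sends a signature outbound within `window`
    seconds of having received the same signature from outside.
    """
    last_inbound = {}  # (internal host, signature) -> time of last inbound hit
    suspects = set()
    for ts, sig, src, dst in alarms:
        if not is_internal(src) and is_internal(dst):
            # State update: remember that this host was attacked.
            last_inbound[(dst, sig)] = ts
        elif is_internal(src) and not is_internal(dst):
            # Outbound alarm: only meaningful if preceded by an inbound hit.
            seen = last_inbound.get((src, sig))
            if seen is not None and ts - seen <= window:
                suspects.add(src)
    return suspects

if __name__ == "__main__":
    sample = [
        (0, "codered", "192.0.2.7", "10.1.1.5"),
        (120, "codered", "10.1.1.5", "198.51.100.9"),
    ]
    print(correlate(sample))
```

Outbound alarms with no matching inbound event (for example someone checking a remote site) are ignored, which is the kind of noise reduction the notes ask for.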
They are working with outside organizations: participate in IETF WG IDMEF, establish joint academic and industrial partnerships, establish partnership with other DoE labs. They will purchase new network sensor equipment (rack mounted PC running Linux). Develop operational skills needed to configure and manage IDS.
Will need funding and proposals from Labs to get them involved.
Windows 2K requires dynamic DNS. W2K workstations register themselves; Domain controllers register service (SRV) records. These do not change much after the DC is brought online. If the DC is shut down, the netlogon process will unregister the SRV records; while running it re-registers them every hour, and will also attempt to re-register the "A" record for the DC.
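For readers who have not seen them, the SRV records a domain controller registers have the form name, TTL, class, SRV, priority, weight, port, target. The sketch below prints a few of the well known names; the domain example.gov, host dc1, and the TTL, priority and weight values are placeholders chosen for illustration, and the list of services is not exhaustive.

```python
# Illustrative only: zone-file style SRV lines of the kind a Windows 2000
# domain controller registers through dynamic DNS. Domain, host name, TTL,
# priority and weight below are made-up example values.

SERVICES = [
    ("_ldap._tcp", 389),            # LDAP service
    ("_ldap._tcp.dc._msdcs", 389),  # used to locate a domain controller
    ("_kerberos._tcp", 88),         # Kerberos KDC over TCP
    ("_kerberos._udp", 88),         # Kerberos KDC over UDP
    ("_kpasswd._tcp", 464),         # Kerberos password change
]

def srv_records(domain, dc_host, ttl=600, priority=0, weight=100):
    """Return SRV resource records (as text lines) for one domain controller."""
    target = f"{dc_host}.{domain}."
    return [
        f"{label}.{domain}. {ttl} IN SRV {priority} {weight} {port} {target}"
        for label, port in SERVICES
    ]

if __name__ == "__main__":
    for line in srv_records("example.gov", "dc1"):
        print(line)
```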
There was a DOE HQ audit focusing on DOEnet telecommunications & videoconferencing. Requested input from ESnet. Numerous HQ & site meetings. Recommend use of VPN solutions at ESnet (mainly the operations offices), will require installation of equipment at some sites.
ESnet program review held in Santa Fe NM Sept 11, 2001. Steve Wolff chaired. Very complimentary to ESCC & ESSC. Meets current community needs. Concerns about funding in particular to meet growth rate (SciDAC, Genomes to life & other programs), need more focus on long range planning. Fragmentation of networking responsibilities at DOE HQ. ESnet initiative to address Genomes to life and nanotech. Target with FY05 budget cycle, draft document this year, workshops planned starting next year, define long term requirements and focus on emerging requirements.
SciDAC PI meeting postponed to December or January. Awards made
Peer reviews have become very important; there will be periodic reviews of the FWP, which needs to be reviewable. Emphasis on paper trail for peer review. Proposal reviews. OK for research, hard for ongoing.
Budget (FY2001 in parentheses): ATM contract $7M ($6.9M), operations $6.5M ($6.4M), ESnet International $1.2M ($1.187M), Video $350K ($346K), testbed $1M ($1M), equipment $660K ($890K), PKI/DS implementation $1M ($0M).
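The line items can be totalled as a quick check. A small sketch, working in $K to avoid rounding, and taking the FY2001 ESnet International figure as $1.187M (an assumption about the intended value); the dictionary and function names are illustrative only.

```python
# Budget line items in thousands of dollars: (listed figure, FY2001 figure).
BUDGET_K = {
    "ATM contract": (7000, 6900),
    "Operations": (6500, 6400),
    "ESnet International": (1200, 1187),  # FY2001 value assumed to be $1.187M
    "Video": (350, 346),
    "Testbed": (1000, 1000),
    "Equipment": (660, 890),
    "PKI/DS implementation": (1000, 0),
}

def totals(budget):
    """Return (listed total, FY2001 total) in $K."""
    listed = sum(new for new, _ in budget.values())
    fy2001 = sum(old for _, old in budget.values())
    return listed, fy2001

if __name__ == "__main__":
    listed, fy2001 = totals(BUDGET_K)
    print(f"listed ${listed / 1000:.2f}M, FY2001 ${fy2001 / 1000:.3f}M, "
          f"change ${(listed - fy2001) / 1000:+.3f}M")
```

On these numbers almost all of the increase is the new PKI/DS line, partly offset by lower equipment spending.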
Flat funding scenarios. Need to prioritize: sites need to submit justification to ESCC & ESSC, DOE HQ, ESnet project mgr, DOE HQ program support. Justification needs to show need, what is required, what happens if not done.
Time frame for upgrades, get paperwork in now if need something over a year ahead.
FY02/03 goals: New ESnet FY05 initiative. Cure problems with QWEST ATM core. Support increased connectivity with flat budget. Support network research program activities. SC advisory committee (Oct 2002). QoS services research accomplishments documented (Oct 2001). Revise ESnet Strategic plan (June 2002). PKI implemented and in use (Sept 2002). Revise progress report (Oct 2002). Develop strong case for SciDAC ESnet.
Transition to QWEST almost done (>98.8%). ESnet program review completed, excellent marks for ESnet. Overall health of network looks good, Public peering problems continue as a concern. International access still well ahead of current demand. DCS is growing. ECS is next, focus is on H.323 & interoperability with H.320. NASA will share ESnet/QWEST contract (but not facilities). The OIG is "fascinated" by ESnet and DOEnet. A new testbed is being planned (QWEST will fund). SC01 is closing in on us once again.
Bytes accepted almost doubled in last year (Sep 01), average packet size almost doubled (650> 1130) shows exponential growth.
Backbone with OC12 SONET connections to 5 hubs. Now peering at PAIX-W with 20 peers. Peer with Australia in Seattle GigaPoP.
QWEST contract multi-year (3+2+2 years), $50M +. Overlapped for ~ 2 years with existing Sprint (expired Aug 25, 2001). Provides advanced services & technology for production network, a high performance test-bed, research collaboration. Includes very competitive ATM pricing.
There were problems with the QWEST ATM "swamp" (due to Lucent switch problems) so had to link hubs with SONET. This reduced us from a full mesh ATM to just SONET between adjacent hubs. Hope to get back to full ATM mesh soon. However, there may be problems with upgrading ATM to OC48c backbone due to long term viability of Lucent switches. Sunnyvale traffic peaks averaged over a day are about 800Mbps. It is the heaviest used hub. Chicago to Sunnyvale is busiest trunk with about 130Mbps. Distinction between research use of network and production users, as people routinely copy large files between sites and learn how to use all the bandwidth.
Site to hub status. LBNL 120-150Mbps (OC3>OC12), LLNL/Sandia 60-70Mbps (OC3>OC12), NERSC 100-150Mbps (OC12), SLAC 120-150Mbps (OC3>OC12), JGI ~ 30Mbps (T3>OC12). ANL~100Mbps (OC12 overbuilt), FNAL 60-90Mbps (OC3), Ames 10Mbps (T3). BNL 90-120Mbps (OC3), PPPL 12-20Mbps (OC3), MIT 12-20Mbps (T3). ORNL 160Mbps (OC12). DC (OC3) looks OK. LANL 250Mbps (OC12). On average backbone is currently in pretty good shape. May have difficulty keeping up with growth. Upgrades from OC3>OC12 cost about $10K/month.
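As a rough check on these figures, the observed loads can be compared with nominal line rates. A minimal sketch: the rates are the standard T3 and SONET line rates (usable payload is somewhat lower), the function names are made up for illustration, and for sites marked OC3>OC12 it is an assumption that the load is currently carried on the OC3.

```python
# Nominal line rates in Mbps (standard T-carrier / SONET values).
LINE_RATE_MBPS = {"T3": 44.736, "OC3": 155.52, "OC12": 622.08, "OC48": 2488.32}

def utilization(load_mbps, link):
    """Fraction of the nominal line rate consumed by load_mbps."""
    return load_mbps / LINE_RATE_MBPS[link]

def report(site, link, low, high):
    """One formatted line showing the low-high utilization range for a site."""
    lo, hi = utilization(low, link), utilization(high, link)
    return f"{site:5s} {link:4s} {lo:5.0%} - {hi:5.0%}"

if __name__ == "__main__":
    # (site, link, low Mbps, high Mbps) taken from the notes.
    for row in [("LBNL", "OC3", 120, 150), ("LBNL", "OC12", 120, 150),
                ("FNAL", "OC3", 60, 90), ("BNL", "OC3", 90, 120),
                ("ORNL", "OC12", 160, 160), ("LANL", "OC12", 250, 250)]:
        print(report(*row))
```

On these assumptions LBNL's 120-150Mbps is roughly 77-96% of an OC3 but only about 19-24% of an OC12, which is consistent with the OC3>OC12 upgrades listed.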
Casual peering arrangements are coming to a close. Large carriers are raising the bar on how big you have to be in order to get free peering. Can always use QWEST public network but lower performance than more direct peering.
MAE-E FDDI shut down. NY-NAP atrophying, PAIX-W rapidly growing in use, peering with Pac Bell finally resolved. PAIX-E (VA) is new but of questionable value. Picked up some new peers (including Australia) in Seattle GPoP. Mid Atlantic Crossing (MAX) around Washington DC, ESnet comes in via QWEST PoP at DCNE. Peering traffic shows NY-NAP at ~ 30Mbps and dropping. MAE-E 25 Mbps on a T3. MAX 10-12Mbps. Chicago peaks about 100Mbps. MAE-W 35Mbps on T3 (upgrading to OC3). PAIX 30Mbps (OC3). FIX-W, MIX-W, MAE-W 250Mbps (OC12).
Academic access: Dropped T1s to Caltech, UCLA, UTA, FSU, NYU; all have much better alternates. Plans for enhanced peering with Abilene - OC12 at SNV (exists), OC3 at CHI (exists, OC12 direct cross-connect planned), OC12 at NY just installed, will eliminate other peering points. Direct academic peerings at: Atlanta, Chi, LBNL, San Diego, Seattle. ESnet access to major US academic collaboratories seems to be in good shape at the moment. Traffic to Abilene at SNV peaks at 200Mbps (OC12), CHI ~100Mbps (OC3>OC12).
DANTE - ESnet traffic solid about 12Mbps (well within the capacity of the OC3). JAnet 30-50Mbps exports more from UK than imports (OC12). CERN 70-120Mbps (on OC3). Phynet France is red lining at 35Mbps (80 Kcells/sec) (T3 ATM link limit).
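The cell rate and bit rate quoted for the French link can be related with simple arithmetic, since an ATM cell is 53 bytes on the wire of which 48 are payload. A small sketch; the function name is illustrative.

```python
CELL_BYTES = 53     # ATM cell size on the wire (5 byte header + payload)
PAYLOAD_BYTES = 48  # payload bytes carried in each cell

def cells_to_mbps(cells_per_sec, bytes_per_cell=CELL_BYTES):
    """Convert an ATM cell rate to Mbps for the given bytes counted per cell."""
    return cells_per_sec * bytes_per_cell * 8 / 1e6

if __name__ == "__main__":
    rate = 80_000  # 80 Kcells/sec from the notes
    print(f"on the wire: {cells_to_mbps(rate):.2f} Mbps")
    print(f"payload only: {cells_to_mbps(rate, PAYLOAD_BYTES):.2f} Mbps")
```

80 Kcells/sec works out to about 33.9Mbps counting whole cells and about 30.7Mbps of payload, in line with the ~35Mbps quoted.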
Japan: KEK now terminates in Bay Area. JAERI peaking within their 1Mbps circuit, NIFS 800kbps, KEK has 10Mbps but is lightly loaded.
DCS usage continues to grow. Supports on average 3 conferences 24 hours/day. Will reverify all people who can schedule a conference. This is to enable trace back to a DoE PI. Then who is a PI will be validated by ESSC. PI will be able to delegate. One concern is use by Europeans for non DoE related activities.
DoE review: Bill Turnbull, Vicky White, Ricky Kendall, Paul Love, George Strawn, Steve Wolff. Findings: concern about growing requirements/flat budgets. Well respected by community. Network asset management tools are noteworthy. Find operation cost-effective (lean but adequate). Internal, i.e. self directed technology exploration and research must be protected. ESnet is the obvious & natural organization to provide new grid related services because of level of trust it enjoys in the community.
SC01: Level 3 offered ESnet OC48 SNV DNV/SC POS, OC12 SNV to DNV SC (IPv6 ATM). An OC48 CHI DNV SC (POS). Juniper has loaned added interfaces. QWEST OC48 to NERSC & LBNL.
OIG has new team dedicated to networking and communications issues. ESnet was an obvious initial target of interest. DOEnet has garnered a great deal of interest. They have been vague & inconsistent about exact status of their investigation, sometimes calling it an audit, other times saying only an initial investigation. There has been much talk about merging DOEnet and ESnet. Seem to understand ESnet reluctance to jeopardize current success to solve HQ issues. Initial report has been written. Conclusion: a new architecture to manage all data and voice in DOE.
New program requirements. Most coherent at the moment are HENP. QWEST QWave is an optical wavelength service that utilizes DWDM technology, available in 2.5Gbps (OC48) & 10Gbps (OC192) quantities. One wavelength delivered on site as a fiber pair (one transmit, one receive). QWEST does not have much support for GE access at the moment, prefer SONET.
Services convergence is an attractive buzzword for a magic bullet. Allows reduction of number of devices, aggregate line costs (and thus cost savings). Build on IP, use MPLS. ESnet will work with QWEST on a testbed. Trial using Junipers to connect some ATM services (not production) to see if it works. No added bandwidth or hardware costs to ESnet.
Next step: increase access bandwidth for selected sites: SLAC & FNAL to OC12, BNL review for OC12. LBNL to OC48. Move backbone back to 550-ATM for inter-hub offering more overall bandwidth (full mesh OC12 circuits). Deal with growth by putting in OC48 SNV to CHI and OC48 to SLAC & FNAL to their respective hubs, then to NY.
Research = investigation & testing with an eye to utilization in near future. Research areas:
Meeting in Santa Fe Sep 10-12. Review on 3 year cycle, last one 1998. Regular meeting: receive feedback on review & initial responses, idea of initiative within DoE for networking, new requirements etc.
Six reviewers from NSF, NOAA, Cisco, I2, Ames & FNAL. Charge: are the present mechanisms for providing ESnet with goals, directions & user requirements effective. Does ESnet meet needs of DoE science wrt connectivity, capacity, tools & services, cost effectiveness. Is planning for future appropriate & effective from point of view of SC programmatic needs, special apps like SciDAC, research into emerging technologies.
Committee found present mechanisms are effective for 100% growth, but are stressed by requirement for a large step increase in capacity and a requirement for deployment of non-traditional network technology. If needs are not met, DOE network will fragment as it did prior to 1985.
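To see what 100% growth implies for a single link, the time until a load reaches capacity can be estimated from the doubling rate. A minimal sketch, assuming the 100% figure means traffic doubles every year (the notes do not state the period) and using an illustrative function name.

```python
import math

def years_until_full(load_mbps, capacity_mbps, annual_growth=1.0):
    """Years for load to reach capacity under compound growth.

    annual_growth is the fractional increase per year; 1.0 means 100%,
    i.e. doubling every year (assumed interpretation of the notes).
    """
    if load_mbps >= capacity_mbps:
        return 0.0
    return math.log(capacity_mbps / load_mbps) / math.log(1.0 + annual_growth)

if __name__ == "__main__":
    # Example: 150 Mbps today on a link with a 622.08 Mbps (OC12) line rate.
    print(f"{years_until_full(150, 622.08):.1f} years")
```

With doubling every year, a 4x step in capacity buys only about two years, which is why a large step increase in requirements strains the present planning mechanisms.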
ESnet meets needs of DOE science, net & asset management tools are especially noteworthy.
Planning has worked well for forecasting needs in 1 to 2 year time frame, understanding longer term requirement would benefit from stronger interactions with the program office's planning processes. Special applications will put a tremendous load on ESnet. They are unfunded mandates that ESnet must absorb. Need process so network requirements can be relayed directly to ESnet program from special applications. The work of ESnet must include certain amount of advanced technology exploration & research, needs to be included in ESnet FWP. ESSC should be extended to more formally have purview over all networking facilities (e.g. some area of distributed computing). ESSC agrees to provide input on infrastructure components of all 3 MICS programs. Mary Ann, Thomas, in addition to George agree to provide regular updates at ESSC meetings. Question of relationship of ESSC to ASCAC.
Resources are adequate to keep up with 100% growth, but new requirements do not fit. There needs to be additional emphasis on planning within each SC office to provide a long range plan which needs to be updated & presented to the ESSC at least annually.
DoE networking initiative is motivated by added net requirements generated by SciDAC etc., growing requirements from programs, extensive plans of "peer" networks such as Abilene & GEANT (Europe) (both of which appear to be more aggressive than ESnet's plans), belief that high performance networking could be an important enabler of DOE science.
ESnet Research Support Committee is a good idea, it needs to get going.
Steps towards initiative:
Questions: how does one get key people involved at the upper levels to provide support for the networking initiative? It will need added leadership.
12 router nodes with Cisco 12008s, QWest colocation. OC48 interior circuits interconnect, POS in all cases. Access 54 total, OC3, OC12 and some OC48. Partners: QWest: SONET & colo; Nortel: OC192 SONET ADMs; Cisco: 12008; Indiana NOC. Abilene core Chi, Snv, SEA, LA, KC, Denver, Houston, Cleveland, NY, Washington, Atlanta. September 11th 60 Hudson (early carrier Hotel, labelled Western Union on entrance) was evacuated but continued to run (unlike 25 Broadway which lost power). Circuits to DANTE went down for about a week. Peering at OC12 (POS) in Chicago, NYC (POS) & OC12 at SNV. Peerings with DANTE, NorduNet, Australia (300Mbps SEA to SYD), JAnet. Lot of motivation and work to improve connections to Australia. 197 participants in 50 states, just added Puerto Rico.
Post Abilene planning, ongoing needs beyond 2003 (Abilene-QWest MOU extended from Mar-03 to Sep-05, allows upgrade of circuits to use 10Gbps lambda, continuing partnership in exploring implications of Internet2), leverage growing DWDM / fiber provisioning with many 10Gb/s lambda. Needs are to leverage backbone / GigaPoP / campus structures, serious attention to international & federal peering, current advanced services such as IPv6.
Implications: bring on new circuits (10Gbps) Spring '02 LA to NYC, eventual replacement of almost all OC48 interior circuits, expect little topology changes/router locations (maybe changes within cities to get better colocation). Need new routers (10Gbps interfaces don't work in 2.5Gbps backplane 12008 routers), first class support for IPv6, renewed emphasis on measurement support (netflow, Surveyor, OCxMon, E2EPI support, iperf boxes).
Houston flood from tropical storm Allison (26" rain in 24 hours). Houston router node went down Saturday morning, no news until mid-day Monday. Cisco tech then was able to access the facility for the first time and said router beyond repair. Actually untrue, water reached inch below router. Router came up OK next day. QWest SONET ATMs were shut down.
Longer term WG issues:
Raise expectations, encourage aggressive use, deliver on performance/functionality to key constituencies. Not the easy way but necessary for success. Otherwise users think network is congested and can't help them. Planners see only 20% utilization and no further investment is needed.
About 30 in person attendees. The meeting was videocast and many of the questions came in by email. At one time they said they had 15 attendees coming in remotely. They had access via a wireless network. Over 50% of the attendees used their laptops during the sessions. Next meeting in San Diego in Spring 2002. After Easter.
Uploaded slides to incoming at anonymous ftp at ftp.mcs.anl.gov. They will be made available on the web.