XIWT Meeting, San Francisco 3/4/98

XIWT Meeting, San Francisco 3/4/98

Rough notes by Les Cottrell

Overview:

There were 28 attendees from places such as HP, Intel, Motorola, Intertrust, GTE, SBC, USWest, Bellcore, Nortel, Bellsouth, NEC, Cisco, 3COM, Purdue, SLAC, Stanford, NSA, and CNRI. There were 11 laptops. There were 28 attendees from places such as HP, Intel, Motorola, Intertrust, GTE, SBC, USWest, Bellcore, Nortel, Bellsouth, NEC, Cisco, 3COM, Purdue, SLAC, Stanford, NSA, and CNRI. There were 11 laptops. For me the highlights were the security discussions (in particular Protection Techniques - Gene Spafford, Purdue); the presentation on the automobile industries plans (ANX) for setting up an extranet (Manufacturing - Bryan Whittle, Bellcore/ANX); and the talk by Cisco on reliability availability and serviceability (Challenges in Network Reliability Availability and Serviceability - Scott Cherf, Cisco) and of course (J ) my own talk.

Table of Contents:

XIWT Meeting, San Francisco 3/4/98 *

Agenda: *

Overviews *

Commercial Views *

Government Views *

Technical Perspectives *

Telecommunications Network Security: Perspective *

Critical Infrastructure: A Societal View - Dave Elliott, Stanford International Center for Security *

Banking & Finance - Dan Schutzer, CitiCorp *

Manufacturing - Bryan Whittle, Bellcore/ANX *

What are the ANX business drivers. *

How is the ANX service defined. *

Contracts *

Regulators Point of View - Stagg Newman, FCC *

NSA's Perspective as a User and a Representative of other Government Users with Special Needs - Bruce Bottomley, NSA *

Protection Techniques - Gene Spafford, Purdue *

Challenges facing End-to-end Performance and SLA Measurements - John Leong, InverseNet *

Practical End-to-end Dependability Metrics, Methods and Actions - Jeff Sedayao, Intel *

Challenges in Network Reliability Availability and Serviceability - Scott Cherf, Cisco *

Agenda:

Overviews

View from the provider of Information Assurances Professional Services - John Kimmins BellCore

Critical Infrastructure: A Societal View - Sy Goodman, Stanford

Commercial Views

Panel Chair Stu Personick, BellCore

Dan Shutzer, CitCorp

Bryan Whittle - ANX/BellCore

Government Views

Stegg Newman FCC

Bruce Bottomley NSA

Technical Perspectives

Panel Chair Dan Sincoskie - BellCore

Hilaire Orman - DARPA

Gene Spafford

John Leong _ Inverse

Cindy Bickerstaff - Intel

Les Cottrell - SLAC

Scott Cherf

Telecommunications Network Security: Perspective

The characteristics of the environment are: existing and emerging carriers, mergers and acquisitions, TCP/IP emphasis (both for delivery and operations), new services & technologies, interconnection (not just data/voice networks but with signaling networks which were never designed to have a firewall), customer access (customers want to know more about the service, need to add hooks for provisioning, SNMP, which is at the heart of the critical are of managing network, is struggling with security).

The current threat environment includes a new mindset that may create denials of service. There is a loose federation of intruders. In the past problems were often accidental, more & more it is now intentional.

We do not have a good framework to define/characterize the impact of an intrusion, how badly has the device been compromised, what is the impact of such a compromise of this device on what services, and how critical/widespread are these services, how is this quantized. For example, for phone systems there are levels of impact that must be reported to the FCC.

Philosophy is not just to try and prevent crackers getting through, but to contain the impact if and when they do get through.

There is a lack of security features being offered by the suppliers. Then even if there is an offering the ISPs have to embrace and implement correctly. There is an incomplete identification of the security issues.

The state of the art today includes heavy use of one-time authentication, emerging use of encryption, security interoperability issues, inconsistent characterization of events, and unsolved security issues.

The conclusions are that:

The potential for disrupting critical components has increased, however the potential for a total network outage is low.
The intruder mindset has changed to include denial of service.
The range of targets is increasing and the impacts vary considerably.
The intrusion detection is not real-time.

Recommendations include: examine policies for sharing information, promote development of simulators & tools for better risk analysis, support enhancements of critical protocols, promote R&D for secure software development, develop and promote guidelines for service providers, enhance process to develop security requirements for products.

Critical Infrastructure: A Societal View - Dave Elliott, Stanford International Center for Security

The center became an outpost in collaboration with LLNL for the Presidential Commission on Security, organizing conferences and interfacing with Silicon Valley etc. There is a new report from the commission on Critical Foundations for Protecting America's Infrastructure, so one of the immediate issues is prioritizing the 73 or so recommendations in the report. The Presidential commission came out of an Executive Order 13010. It considers physical attack and focuses on electronic, radio-frequency etc., but mainly focuses on cyber-attacks. The main areas considered vulnerable and focussed on were telecommunications and electric power. Banking was recognized as a major area but the banking industry recognizes this and is working on it. The threat focussed on was a professionally organized wide-spread attack (even though none has actually occurred outside films) on the infrastructure by foreign governments, sub-national groups etc. The threat spectrum as perceived by industry on the perpetrator axis starts with a malicious outsider to thieves to inside malefactors to organized crime. The government view is more starting at militant sub-national groups to state-sponsored terrorism to state conflict (overt or covert).

The types of attacks include: penetration for information gathering and theft; denial of access (saturation / jamming); causing system malfunction.

How does one assess the vulnerability & threats. This is based on experience, analysis, friendly attacks (white hats / red teams), mirror imaging (i.e. thinking about what could the U.S. do to another country is a useful way of thinking about what another country could do to the U.S.), and intelligence.

The key thought in the report was the need to form quickly a government/industry partnership to address the issues. If industry does not self-police then the government will need to step in. An important step in getting trust is information exchange, i.e. industry quickly provides information on attacks with analysis which will be kept anonymous, the government will evaluate, provide a perspective and coordinate R&D. This may result in screening people working in these areas. The government will bring to the table its experience and "red team" attacks which will make recommendations. Some people are concerned that implementing all the recommendations will not be cost effective in some cases, but having a (non-implemented) recommendation to be discoverable after a penetration could lead to stock holder legal suits.

Impediments to solutions include lack of a security culture, little experience, inadequately tested & insecure COTS software, difficulty of modernizing large info systems, rapid IT evolution, cannot test real infrastructure, employment practices, globalization, commonality, growing use of public networks, deregulation, more automated control/remote access, legacy systems, cost / timeliness of customized solutions.

Banking & Finance - Dan Schutzer, CitiCorp

Banking is unclear that it would be a primary terrorist target. An attack on the U.S. financial system would be an attack on the world financial system.

The challenge for banking is that increasingly they are dependent on the public networks and end user devices that are not controlled by the banks, yet they want to provide consistent, reliable services. In case of a communication failure, it is necessary to be able to recover down to the transaction level. They also want to recover as fast as possible, and need fall-back systems such as paper transactions, or store and forward electronic systems (e.g. Point of Sale) that send when the communication path recovers. The outage times allowable are measured in terms of how long it takes to lose a customer (e.g. a customer at a supermarket checkout counter). For example, if there is an outage of the Citicorp system and Visa can't get through to Citibank to verify a payment, then Visa will approve and take the customer payment, so there is a risk to the bank that the customer may go over the customer's limit. This loss of connectivity impacting approval is more of a problem for ATM machines. Since the bank has multiple points that can verify a payment a loss of one part of the system is not catastrophic, since others can kick in. If the whole Northeast was blacked out by power outage then customers will not be lost due to going elsewhere, since there is nowhere else to go. It is more an issue of a customer having an alternative to go to, and the customer is lost. The customer idea can be expanded to be a supermarket chain who might chose a different bank since the response to credit card calls may be quicker, or more available.

Manufacturing - Bryan Whittle, Bellcore/ANX

bwhittle@notes.cc.bellcore.com ANX overseer

New communications infrastructure based on TCP/IP starting in the U.S.

What are the ANX business drivers.

The current state is connections between suppliers and manufacturers with little standardization, deploying new applications is slow, it is costly. They looked at the trading partner/business process needs today. EDI drives availability requirements (e.g. request to an assembly line can result in production of a part an hour later), a loss of 10 minutes is critical, can stop a production line. CAD drives file the file transfer delay. Interactive applications drive the packet delay (human factors are critical). CAD, EDI and email drive security, the transaction can be next year's car model.

How is the ANX service defined.

His focus is on the business aspect. They wanted no new large scale infrastructure development, price competition among multiple providers, a single TCP/IP link for each training partner, multiple service provider consolidated services, a user-centric ANX certified IP service quality, managed evolution of services and service levels (i.e. continuous upgrades).

The ANX Certified Service Provider (CSP) can use the Internet or build their own network to meet the needs. They expect that ISPs will be certified as CSPs. The ISP must sign a contract with the ANX overseer and get registered, this then results in an application which results in assessment which if passed will result in verification. This verification is repeated at regular intervals. If they fall out of compliance then if not fixed in 30 days, they go into probation, if fail then trading partner is asked to change to another certified CSP and have 90 days to do so.

There are about 100 metrics in 8 categories with allowed values and criteria, also measurement techniques are provided this then all goes into ANX certification verification requirements. The 8 metric categories include network services, interoperability, performance, reliability, business continuity, security, customer care, trouble handling. The data collection includes data supplied by the ANX CSP/ANX CEPO to the ANX/CEPO. ANX will do audits of the data. The maximum number of CSPs between any two trading partners is 2. This helps accountability and problem identification. Trading partner (TP) identifies it has a problem that is caused by the network. The TP reports it to the CSP who is responsible for resolving the problem, and may call the ANX/CEPO to escalate. For security the target is a public key infrastructure with CAs at the ANX/CSPs and an ANX overseer CA. There will be IPSEC gateways between the TP application and the ANX public key infrastructure. They are running an ANX pilot at the moment, they start production with release 1 which will run for a year before certification for an upgrade. Release 2 may include Internet telephony, videoconferencing.

Contracts

There are a lot of contracts required. The TP will have contracts with the ANX Overseer and with the ANX CSP. The ANX CSP has contracts with the TP the overseer and the ANX/CEPO. The ANX CEPO contracts with the ANX overseer and ANX CSPs. The business and legal stuff takes a lot of time to set up properly. Services are defined in the contract by guarantees, payments require CAs, termination is included to provide graceful exits, there is confidentiality required for overseer reporting (e.g. requires confidentiality in reporting of some performance information), there are liability limitations (to reduce the fright factor of joining up) such as liability per transaction. They are trying to head off litigation, they have an escalation process, e.g. problems with the overseer go to the AIAG, if that fails then they are working on an arbitration process.

They expect the service charges to be about double what they are for public Internet.

The potential impacts are it implements multi-service provider IP VPNs, establishes certification of ISPs and EPOs, benchmarks ISP & EPO service levels, accelerates the maturity of multi-vendor interoperable IPSec (and certificate authority) products.

Regulators Point of View - Stagg Newman, FCC

(see the http://www.fcc.gov/ Web site)

The telecommunications act of 1966 was a mandate for competition and deregulation. It promotes non-discriminatory access to the broadest base. There are engineering contradictions, so we must deal with congressional intent not engineering logic. There is a natural tension between the principles (pro competitive and deregulatory) and the desire for seamless interoperability, interconnectivity and widespread accessibility. How it applies also depends on who you are (common carrier, CMRS (mobile providers have different state regulation requirements), LEC, ESP).

The drivers for dependability are health & safety, commercial viability (e.g. 3min/year downtime), consumer confidence.

The Network Reliability and Interoperability Council was chartered by the FCC in '92 to respond to SS7 outages. It was an industry led effort with self-governance (initially the NRC). The NRC produced a 1000 page report to the nation. Made many recommendations that addressed and reduced network outages. There was a requirement to do a continued review and the NRIC was founded for this. In 1997 the NRIC issued the next report focussing on the areas of interconnectivity, the role of the NRIC should be oversight monitoring forums and standards development organizations, to ensure interoperability goals are being met, develop an agreed upon list of national services. They also looked at the impact of the Internet on holding times (i.e. how long a circuit is kept up for voice (~5 mins) versus for data (~25 mins)) and where the congestion points are. In general the Internet does not look like POTS (plain old telephone system). It is understood that the traditional legal and regulatory approaches become unworkable

Questions include what is the relationship of Internet Services and national telecommunications services, is the Internet & its architecture robust enough to prevent catastrophes that precipitate government action.

NSA's Perspective as a User and a Representative of other Government Users with Special Needs - Bruce Bottomley, NSA

See http://www.nsa.gov/

NSA has 3 responsibilities with regard to telecommunications: as a user of a significant portion of telecommunication network, to provide solutions to the governments community, to gather information on others. Bruce is involved in the first 2. The NSA network is worldwide, it is fixed and mobile, with tens of thousands of users, it uses commercial, military, fiber, satellite circuits. The characteristics of the network include very asymmetric use of bandwidth (much more going in than going out). It is largely dedicated, moving to IP with a ratio of 3:1 (dedicated circuits:IP), it is also moving to ATM. Because of the dedicated nature many parts are overloaded while overall the usage is only about 17%. The traffic characteristics are message/email/web, store & forward, and realtime for collection and advisory warnings (e.g. for deployed forces).

The reliability requirements include: confidentiality, availability and integrity; network management; operations management; a wide range of QOS; and bit position integrity (required for enciphering/decoding).

Internet "Outage" database similar to the equivalent in the phone industry (reports to FCC any incident with 30000 customer-hours of outage). There is a "declining goodwill between competing ISPs. While Internet loads and outages increase ISPs argue without supervision in the dark, they don't often report their outages to one another, and never to us." - Bob Metcalf on InfoWorld.

Protection Techniques - Gene Spafford, Purdue

See http://www.cs.purdue.edu/coast/ for information on COAST, of which Gene is the director. Email <spafford@cs.purdue.edu>

His belief is that biggest problems right now are not technological, existing technology could be deployed. For example it would be cheaper from a security viewpoint to move Web servers to MacOS instead of WNT. We have not addressed the non technology problems such as:

economics,
problems with standards (real and de-facto),
policy formation and application, threats of COTS/GOTS (Government Off The Shelf) mind-set (we accept what's off the shelf with a disclaimer we never read from companies that use students, off-shore programmers (e.g. Microsoft shipped Windows 95 to the Republic of China where the translation to Chinese had been done in Taiwan and the translators inserted derogatory remarks about mainland China), temporary hires, are more worried about time to market than any security issues etc., vendors advertise function, speed (such as SpecMarks), disk space/RAID,but do not advertise or even have metrics for how well the application has been tested for security),
accountability education and awareness.

To be serious about security one has to set policy and if security is important then have to have a business case for why must have insecure applications like Web browsers, why need Java, ActiveX etc.

We need:

Policy capture and representation (many people are looking at)
Models of availability (no one looking at a formal approach, mainly ad hoc approaches at the moment)
A replacement for multi-level security models (not being addressed in the academic realm)
Appropriate risk models (undergoing constant development but don't have the data)
Interaction with external security mechanisms and policy
Load adaptive security design (e.g. load sharing firewalls, some vendors have on the horizon for future products)

We have (but are not built into systems):

Finer grained access control mechanisms

Construction requires:

Composable security properties
Zero based OS configurations (only enable what you need, not an OS that comes with everything including Web browser, this is something that is understood but nobody has picked up on)
A usable, safer systems language with libraries (no stack overflows etc.) Java is a step in the right direction
Application of targetted testing (get vendor to test for certain known problem areas, but are we willing to pay for the extra testing)
Better integration of past lessons-learned (the causes of many bugs being introduced with Windows have been well known for years)
Push more functionality into the hardware (it is a safer place to do things, e.g. hacker can't easily change the hardware function after getting into the system)

Deployment:

Integrated alarm structure
Application audit support mechanisms
Active defenses per applications, per host, per LAB, per enterprise
Support for deployment and monitoring in the large
Integrated IDS (Intrusion Detection System) systems
Insider misuse detection systems
Secure evolution of systems (is the new revision safer than the previous one, are there no new holes)

Investigation

Reliable tracking of traffic
Reliable global identification of parties
Reliable auditing for prosecution
Forensic methods not requiring escrowed protection
Coordinating reporting and response mechanisms
Feedback into the development process (e.g. if bridge collapses we tell the engineers so the fault will not appear in a new bridge, yet we don't even report today's problems)

Coming Up

Serious ATM hacking
Wireless & mobile platforms
Agent-based attacks (mobile evolving viruses)
Professionalization of hacking (as opposed to most of today's truants)
Coordinated physical and logical attacks (e.g. blow up a building or cut a large power or communication line, and then attack using the chaos ensuing)
Consumer backlash

Challenges facing End-to-end Performance and SLA Measurements - John Leong, InverseNet

They have had a network centric view but it is difficult to map low level metric performance to application performance. Also user vewpoint is that everything must be a network problems. So Inverse want to emphasize the user's experience and gather data that can provide actions.

There are 2 major areas, passive and active monitoring. Passive (e.g. a sniffer) is probably closest to the user's experience, but it is uncontrolled and it gathers too much data. Active testing emulates the user or user's key attributes, it is controlled and focussed (easier to do comparisons, can eliminate a lot of noise), but it adds load to the network and servers. Inverse does both types of measurements.

Internet is complex with many autonomous administrative domains that may not talk to one another, may have very different quality of services etc. This can lead to much finger pointing, under-provisioning, hot-potato routing (handing off packets to somebody else ASAP).

Web measurement issues involve: server, network, modem, browser, page content and changes (so need a standard incompressible page to be found in every server as well as a standard server to test against), network caching and/or compression, throughput & time to download.

Possible measurement locations include the client, the server, the path, the test infrastructure or some combination of the above. Inverse are doing all of the above apart from the server.

Challenges with SLA (Service Level Agreement) specification.

Possible metrics include availability, packet loss, latency, throughput, network services performance.
What is the scope of responsibility
Blame-O-Metering
Dialup access has its own challenges, it is customer biased (i.e. may have no control over the local loop, it maybe a hotel's PBX, a bad local loop). An alternative is to limit the scope of the testing but this may not reflect the user perception.
Even with a direct connection the customer may be under- or improperly provisioned (and not an ISP responsibility).
One needs a test infrastructure at the edge.

Inverse is proposing to put active measurement devices at critical ISP network connection points.

Practical End-to-end Dependability Metrics, Methods and Actions - Jeff Sedayao, Intel

Current scope for dependability, definition of "Dependability". They will start with a narrow scope to enable results and broaden based on success. They are focussing on end-to-end performance. Raj Jains definition (The Art of Systems Performance) says make a request, it can be done correctly, done incorrectly (reliability), or nothing can be done (availability). To measure & assess they are using paired indicators for pairwise comparisons, , with distribution free analysis algorithms. The key dependability metrics are packet loss, median packet delay, HTTP errors versus HTTP rate, availability MTTR and MTBF. They want to pair effect and counter-effect indicators. These need to be incorporated into SLAs.

They use distribution free assessment methods such as medians and inter-quartile ranges. The network data rarely shows consistently Gaussian distributions.

The problem statement for pairwise comparison is how do we verify end to end dependability for SLA purposes in diverse independently owned and operated networks which are interconnected?

One solution is for each sample time interval, pairwise comparison of identical landmark site performance.

The SLA agrees on quantitative metrics and the methods to measure. The availability measures include scheduled and unscheduled events for all root causes as unavailable service. This drives improved processes for scheduled down time. Agree how to change metrics.

They are working to define standard definitions for metrics and methods with end-to-end reach. Some sites turn off ping to key nodes (XIWT) can help provide landmarks. Attack deterrence and detection is a possible cooperative activity that impacts dependability. The availability is from Intel to ISP POPs.

Challenges in Network Reliability Availability and Serviceability - Scott Cherf, Cisco

Availability is the probability that a system is operational when needed. This does not define the system, it is more than just the router. They want to define the network as being what the customer needs. The availability is defined by MTBF/(MTBF+MTTR) in this equation the M is for Mean. It depends on component availability (i.e. the availability of a single network component). The network availability measures the availability of the network as seen by the end user. No standards exist for measuring the availability of a distributed system (network). We must develop a standard based on our customer needs.

A working definition of the "network" is determined by the availability of a virtual circuit. From an end to end perspective the circuit is the network. This is not the same as saying any one component is available, nor is it the same as saying that all the network resources needed by an application are available.

Improving repair time can improve the availability. Large amounts of time are spent isolating failures before they can be repaired which increases the MTTR. This diagnostic time often exceeds the repair time (e.g. finding which router to power cycle can take longer than the power cycling). Multiple jurisdictions complicate the diagnostic problem. We need to develop end to end diagnostics to quickly and authoritatively isolate failures.

A barrier to fault tolerance (hot standby) is an effective means for increasing availability. For FT (Fault Tolerant) designs to work, failures must be unambiguous. Robust network protocols can work against FT solutions since it makes it harder to recognize the failures. We need to move towards a Fail Fast design approach if we intend to incorporate FT solutions.

Other industries have addressed problems of how to improve the availability of systems. For example Bell's work on the SS7 switch design, or process control work, or disk mirroring design where a failure on one is not noticed and the mirror kicks in, then eventually the mirror dies too (this is called a blind failure).