PingER Measurement Pathology Examples

Page created May 23, 1999 by Les Cottrell. Last updated July 16, 2002.


Introduction

While using ping to measure LAN performance, we encountered a pathology when pinging from doris (Linux) to atreides (Windows NT). In the plot of ping RTT versus ping sequence number, the pings with RTT > 10 msec. showed an unusual regularity in their sequence-number separation (see the Linux to WNT (atreides) graph below). The effect was reproducible.

In this page we provide an index to some interesting plots created while tracking down the pathology. When Linux sent the ping requests, the requesting host was doris.slac.stanford.edu, running Redhat Linux 5.2. The Windows NT (WNT) hosts were running Windows NT 4 with Service Pack (SP) 4. Some of the PC hosts could dual boot and so may show up as running either WNT or Linux. The Sun host (gryphon) was running Solaris 5.6. The host names are given in parentheses; all are in the .slac.stanford.edu domain, and all, including doris, are on the same subnet (PUB6). All hosts are also on 10 Mbps shared Ethernet hubs. Unless otherwise noted, the ping application was the standard version delivered with the operating system. The pings were sent once a second, had a timeout of 20 seconds, and contained 100 bytes. All the graphs show the ping sequence number along the x-axis (unless otherwise noted) and the ping round trip time (RTT) in milliseconds along the y-axis.
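To produce RTT-versus-sequence-number plots like these, the sequence number and RTT first have to be pulled out of each ping reply line. The sketch below assumes Linux ping reply lines contain "icmp_seq=<n>" followed later by "time=<rtt> ms"; the function name parse_linux_ping is hypothetical and is not part of any PingER tool.

```python
import re

# Assumed reply-line shape (Linux ping): "... icmp_seq=<n> ttl=<t> time=<rtt> ms".
# Header and summary lines do not contain "icmp_seq=" and are skipped.
REPLY_RE = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+)\s*ms")


def parse_linux_ping(lines):
    """Return (sequence number, RTT in ms) pairs, one per reply line."""
    points = []
    for line in lines:
        m = REPLY_RE.search(line)
        if m:
            points.append((int(m.group(1)), float(m.group(2))))
    return points
```

Lost pings simply produce no reply line, so they show up as missing sequence numbers rather than as entries in the list.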

Pinging involving PCs

Ping from Linux to WNT

- Linux to WNT (procrustes)
- Linux to WNT (atlas)
The following two graphs are for pings from a Linux host (doris) on one Catalyst 5000 switch port to two different WNT hosts on a shared hub attached to a different switch port.
- Linux to WNT (eccles)
- Linux to WNT (atreides): note the pathological regularity of the data. In this example the points with RTT > 10 msec. are separated from their neighboring high-RTT points by 1141 in sequence number.
- NIKHEF ping from Linux (doris) to WNT (atreides): note that the regularity appears to be more pronounced during the early morning and evening.
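The regularity can be quantified by selecting the pings whose RTT exceeds a threshold and tabulating the sequence-number gaps between consecutive selected pings; for the atreides graph the gap reported above is 1141. A minimal sketch, assuming the data are already (sequence number, RTT in ms) pairs; high_rtt_gaps is a hypothetical helper name.

```python
from collections import Counter


def high_rtt_gaps(points, threshold_ms=10.0):
    """Count the sequence-number separations between consecutive pings
    whose RTT exceeds threshold_ms. points are (seq, rtt_ms) pairs."""
    seqs = sorted(seq for seq, rtt in points if rtt > threshold_ms)
    return Counter(b - a for a, b in zip(seqs, seqs[1:]))
```

high_rtt_gaps(points).most_common(3) then lists the most frequent separations; a single dominant value indicates the kind of regularity seen in the atreides data.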
We temporarily swapped the hub ports for atlas and atreides, and the effect appears to follow the host rather than the port:
- Linux to WNT (atreides on atlas' hub port): though the regularity is not as marked as in the previous Linux to atreides graph, it is apparent, especially for the 18 points with RTT > 15 msec. and sequence number > 13000. Of these 18 points, 9 are separated by 1141 in sequence number and 3 by 2342. The "wire-times" (as measured by NetXray running on atreides) for the 12 pings with a sequence number separation of 1141 or 2342 were all < 200 usec.
- Linux to WNT (atlas on atreides' hub port).
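Counts such as "9 are separated by 1141 and 3 by 2342" come from restricting attention to a window of the data (here sequence number > 13000 and RTT > 15 msec.) and then checking the gaps against a few candidate separations. A sketch under that assumption; windowed_gap_counts is a hypothetical name, and the input is (sequence number, RTT in ms) pairs.

```python
def windowed_gap_counts(points, rtt_threshold_ms, min_seq, targets):
    """Select pings with seq > min_seq and RTT > rtt_threshold_ms, then
    count how many separations between consecutive selected pings equal
    each target value. Returns (counts per target, total number of gaps)."""
    seqs = sorted(s for s, rtt in points
                  if s > min_seq and rtt > rtt_threshold_ms)
    gaps = [b - a for a, b in zip(seqs, seqs[1:])]
    return {t: gaps.count(t) for t in targets}, len(gaps)
```

For the swapped-port data the call would be windowed_gap_counts(points, 15.0, 13000, (1141, 2342)); 18 selected points give 17 gaps to compare against the targets.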
We installed the Linux host (doris) on the same 10 Mbps shared hub (TT-CGB8, a Centercom model 3624TR) as the WNT host (atreides) and repeated the pings.
- Linux to WNT (atreides) on the same hub: though much noisier, the regularity is still observable. The green dots are the pathologically regular RTTs, with their sequence numbers identified. Of these 50 pathologically regular points, 36 are separated in sequence number by 1141, 3 by 2342, and 7 by 1201. Looking at the RTT differences between adjacent pathologically regular points, over 80% lie in the range 10-12 msec.
- NIKHEF ping from Linux to WNT (yemint), with both hosts on the same shared 10 Mbps hub.

WNT to Linux

Windows NT (hector) to Linux. The graph shows the results from 32K pings. Note that the WNT ping RTT resolution is no better than 10 msec., i.e. ping on WNT reports < 10 ms as the smallest RTT. Also, of the 36 pings reporting > 11 ms RTT, only two were not a multiple of 10 (91 and 151 ms were the RTTs reported). Since WNT ping does not report the sequence number, the x value is the ping output line number; these line numbers include 9 lines that said "Request timed out." Looking at the separation of the line numbers of the pings with > 11 ms RTT, there is no obvious pattern.
Windows NT (atreides) to Linux. The graph shows 32K pings; all pings reported as < 10 ms are plotted as 9 msec.
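Because WNT ping prints no sequence number and reports sub-resolution replies as "time<10ms", its output needs slightly different handling. The sketch below assumes reply lines of the form "Reply from <addr>: bytes=100 time<10ms TTL=<t>" or "... time=20ms ...", numbers only reply and "Request timed out." lines (an assumption; the original graphs may have counted other output lines too), and plots "<10ms" as 9 msec. as above. Both function names are hypothetical.

```python
import re

# Assumed WNT ping reply shapes: "time<10ms" or "time=<n>ms".
REPLY_RE = re.compile(r"time([<=])(\d+)ms")


def parse_wnt_ping(lines, below_resolution_ms=9):
    """Return (line number, RTT in ms) pairs. Timeouts get RTT None;
    replies reported as "<10ms" are mapped to below_resolution_ms."""
    points = []
    n = 0
    for line in lines:
        if "Request timed out" in line:
            n += 1
            points.append((n, None))
            continue
        m = REPLY_RE.search(line)
        if m:
            n += 1
            rtt = below_resolution_ms if m.group(1) == "<" else int(m.group(2))
            points.append((n, rtt))
    return points


def rtts_not_multiple_of_10(points, min_ms=11):
    """RTTs above min_ms that are not multiples of 10 ms."""
    return [rtt for _, rtt in points
            if rtt is not None and rtt > min_ms and rtt % 10 != 0]
```

rtts_not_multiple_of_10 is the kind of check behind the observation that only two of the 36 slow hector pings (91 and 151 ms) broke the 10 ms granularity.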

Linux to Linux

NIKHEF ping from Linux to Linux (odin)

Ping between Linux and Sun (gryphon)

Ping from Linux to self

NIKHEF ping from Linux to self

Network connections

Host	Switch port	IP address	MAC address	OS
atlas/dhcp-24-179	CGB3: 3/4	134.79.24.179	00-10-04-f5-f5-53	WNT 4/SP4
atreides	CGB3: 3/16	134.79.24.12	00-c0-4f-76-18-36	WNT 4/SP4
doris	CGB3: 3/7	134.79.24.122	00-c0-4f-98-6b-f7	Linux Redhat 5.2
eccles	CGB3: 3/16	134.79.24.95	00-c0-4f-a3-8d-04	Linux Redhat 5.2
gryphon	CGB3: 3/5	134.79.25.130	08-00-20-22-ed-4b	Solaris 5.6
hector	CGB3: 3/2	134.79.24.97	00-60-97-cc-50-26	WNT 4/SP4
odin	CGB3: 3/6	134.79.24.46	00-c0-4f-a3-b8-51	Linux Redhat 5.2
procrustes	CGB3: 3/8	134.79.24.84	00-c0-4f-b9-6a-65	WNT 4/SP4
yemint	CGB3: 3/7	134.79.24.86	00-c0-4f-c2-77-cd	WNT 4/SP4

Possible resolution

On Tuesday July 15, '02 I received the following email from Stephan Bohacek [bohacek@math.usc.edu]:
Hi Les, We have been doing extensive high frequency ping measurements. We had noticed a similar effect you noted in http://www.slac.stanford.edu/comp/net/wan-mon/pathology-eg.html. However, we have since determined that this is due to the operating system stalling. For example, our windows machines stalls every 300 ms for 16 ms. Thus, a packet will be delayed anywhere from 0 to 16 ms depending how far along the stall is when the packet arrives. Since the stalling is periodic and the pings are sent periodically, the delay pattern can be quite complex (as you noted). We have fixed the problem by using real-time operating systems (RTAI a real time linux). We are writing up some details. I'll send them on when they are complete.
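The mechanism described in this email can be illustrated with a small simulation: pings arrive at fixed intervals, the receiving host is assumed to be unresponsive for the first 16 ms of every 300 ms (the figures quoted in the email), and a ping arriving during a stall waits until the stall ends. This is a toy model under those assumptions, not the method used by Bohacek or by PingER; the function names, the 0.5 ms base RTT and the stall_offset_ms parameter are hypothetical.

```python
def stall_delay_ms(arrival_ms, stall_period_ms=300.0, stall_len_ms=16.0,
                   stall_offset_ms=0.0):
    """Extra delay for a packet arriving at arrival_ms, assuming the host
    stalls for stall_len_ms at the start of every stall_period_ms.
    A packet arriving mid-stall waits for the rest of the stall."""
    phase = (arrival_ms - stall_offset_ms) % stall_period_ms
    return stall_len_ms - phase if phase < stall_len_ms else 0.0


def simulate_rtts(n_pings, interval_ms=1000.0, base_rtt_ms=0.5, **stall):
    """Simulated RTTs for pings sent every interval_ms; extra keyword
    arguments are passed through to stall_delay_ms."""
    return [base_rtt_ms + stall_delay_ms(i * interval_ms, **stall)
            for i in range(n_pings)]
```

With the defaults, each ping's phase within the 300 ms cycle advances by 1000 mod 300 = 100 ms, so every third ping lands at the start of a stall and is delayed by the full 16 ms. If stall_period_ms does not divide the ping schedule so neatly (for example 300.1, an arbitrary illustrative value), the phase drifts slowly and the delayed pings come in bursts separated by long quiet stretches. The sketch is not tuned to reproduce the 1141 separation seen above.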
