Attendees: Brian Tierney, Martin Stoufer, Dan Gurney (LBL), Les Cottrell, Connie Logg, Jiri Navratil (SLAC).
Funded by Thomas Ndousse, with some additional DARPA funding. Part of Net100. Currently validating Sally Floyd's slow start and AIMD. Uses NetLogger. Can log to a local file, to a logging daemon, or over the WAN to a remote network logging daemon. Has a C interface as well, but no Perl interface at the moment. Expect that a Perl interface could be generated in a day.
Allows periodic tests with some variance (flat random distribution). Has fault tolerance to recover, reschedule, and report failed tests. Restarts the iperf servers at the remote end by cron.
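The periodic scheduling with flat random variance described above could look roughly like the following. This is a minimal sketch only: the function name, the 10-minute interval and the 60-second jitter are assumptions for illustration, not values from the actual tool.

```python
import random

def jittered_schedule(start, interval, jitter, count, rng=None):
    """Return `count` start times spaced `interval` seconds apart,
    each shifted by a uniform (flat) random offset in [-jitter, +jitter]."""
    rng = rng or random.Random()
    return [start + i * interval + rng.uniform(-jitter, jitter)
            for i in range(count)]

# Hypothetical values: one test every 10 minutes, +/- 1 minute of jitter.
times = jittered_schedule(start=0.0, interval=600.0, jitter=60.0,
                          count=5, rng=random.Random(1))
```

The jitter keeps tests on different hosts from staying locked in step with each other or with other periodic traffic.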
NetLogger events include Web100 events read out at a rate of 1/sec. NetLogger data is stored by netarchd into a local MySQL db. The parsing and insertion are done asynchronously to avoid a client write bottleneck.
For analysis, web interfaces are used to extract data for display in nlv. Also have web interfaces that can get data from the database and create statistical summary reports.
Future design requirements of NTAF call for the ability to subscribe via the pyGMA model and retrieve NetLogger events as they are generated. Near-real-time capability will allow an attenuated feedback loop.
Have found that the capabilities of an SQL database have been extremely valuable for accessing and analyzing the data. They allow access to the data going back 3 weeks. The interface is a web form. MySQL has a rudimentary statistical package (means, standard deviations). The form uses a cgi-bin Python script, allows selection of metrics, src & dst hosts, and time windows, and can access NetLogger raw data (with yyyymmddhhmmss.uuuuuu time stamps). Results are shown as Gnuplot plots and statistical tables. See http://www-didc.lbl.gov/net100/, scroll to the bottom and see results and demos for examples.
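For anyone post-processing the raw data, the yyyymmddhhmmss.uuuuuu time stamps can be parsed with the Python standard library. A minimal sketch; the function name and the sample time stamp are made up for illustration.

```python
from datetime import datetime

def parse_netlogger_ts(stamp):
    """Parse a yyyymmddhhmmss.uuuuuu time stamp into a datetime."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")

# Hypothetical sample value, not taken from real data.
ts = parse_netlogger_ts("20020815143005.123456")
```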
NetLogger visualizations are based on time-correlated & object-correlated events. Can associate events so they can be joined together. Can also color groups of events. In the new implementation, all languages interface to SWIG and SWIG interfaces to the C implementation. Only Python done so far, Java & Perl to come. SWIG needs an interface file and is a standalone interface wrapper generator for many languages (e.g. Perl). Look on the net for swig.org.
New NetLogger features can be exposed now that there is a SWIG interface. Now has a read API as well as a write API. Added reliability, a trigger API, and a binary format for saving space. The trigger API activates monitoring from an external configuration file or activation daemon. So a consumer might sign up via a GMA producer to request various pieces of information that are obtained from an activation service. They have a version of iperf that is NetLogger enabled. Can turn on or off, say, the Web100, the bandwidth, or the net100 reporting, and then can request the level of reporting. The binary format uses IEEE floating point, is much faster over the WAN, and uses reader-makes-right to minimize CPU usage at the sender.
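The reader-makes-right idea can be illustrated as follows: the sender writes IEEE doubles in its own native byte order plus a marker saying which order that is, and only the reader does any conversion. This is a sketch under assumed conventions. The one-byte order flag and the record layout are hypothetical and are not the actual NetLogger binary format.

```python
import struct
import sys

def encode(values, byteorder=sys.byteorder):
    """Sender side: pack doubles in the sender's own byte order.
    A one-byte flag ('<' little, '>' big) records that order (assumed layout)."""
    flag = "<" if byteorder == "little" else ">"
    return flag.encode() + struct.pack("%s%dd" % (flag, len(values)), *values)

def decode(buf):
    """Reader side: read the flag and convert if needed (reader makes right)."""
    order = buf[:1].decode()
    count = (len(buf) - 1) // 8  # 8 bytes per IEEE double
    return list(struct.unpack("%s%dd" % (order, count), buf[1:]))

# Round trip for both sender byte orders.
from_big = decode(encode([1.5, -2.25], "big"))
from_little = decode(encode([1.5, -2.25], "little"))
```

The sender never swaps bytes, so the cost of conversion falls on the reader, and only when the two byte orders differ.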
A goal may be to use Netlogger as a way to write and read measurements from measurement applications.
If the remote NetLogger receiver becomes unavailable, it will automatically fail over to an alternate location (e.g. local disk or a 2nd receiver).
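The failover behavior can be pictured as trying an ordered list of destinations until one accepts the event. A minimal sketch only: the function and sink names are hypothetical and this is not the NetLogger API.

```python
def write_with_failover(event, sinks):
    """Try each (name, send) destination in order; return the name that worked."""
    for name, send in sinks:
        try:
            send(event)
            return name
        except OSError:
            continue  # destination unavailable, fall through to the next one
    raise RuntimeError("no destination accepted the event")

def unreachable_receiver(event):
    # Stand-in for a remote receiver that is down.
    raise ConnectionRefusedError("remote receiver down")

local_log = []  # stand-in for a local disk file
used = write_with_failover("event=demo",
                           [("remote", unreachable_receiver),
                            ("local", local_log.append)])
```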
GMA tries to abstract how one does monitoring in a grid environment. There are producers and consumers. There is also a directory service to provide information publication and discovery. pyGMA is a prototype Python implementation of the producer/consumer publish/subscribe library. Uses SOAP, the emerging standard for web services, for communication between producers and consumers. Control channel is SOAP. SOAP uses XML. Will add GSI-enabled SOAP soon. Allows subscriptions by consumers. If message formats can be agreed on, then it can be language independent. Directory service is just a proof of principle by a summer student at the moment.
Through the grid forum they are trying to come up with a standard naming convention for events, coming out of DAMED.
Have found that netest can reliably hang a Linux host due to a Linux kernel bug. Want to minimize impact on network. Want to run netest continuously; it takes < 2Mbits/s, more typical is a few hundred kbits/s. Want to build expertise into tools to tell a non-expert what window and number of streams to use. Also want to provide disk I/O, memory bw, cpu power, OS version, etc. This is all part of version 2, which Jiri has been testing.
Uses UDP to determine burst size and optimal window size. The bw*delay product can exceed the buffer size of an intermediate router; find this buffer size by pushing out bigger and bigger bursts. UDP might give 400Mbits/s throughput, while a single TCP stream gets 90Mbits/s. Has a formula y=a-bx^(-w), which is fitted to a few measurements at low numbers of streams; then calculate where the slope of the curve is 27.5 degrees to give the optimum number of streams. Need to measure the multiple streams multiple times (in order to avoid effects of cross-traffic).
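For the curve y=a-bx^(-w) the slope is dy/dx = b*w*x^(-w-1), so the point where the slope angle is 27.5 degrees is x = (b*w/tan(27.5 deg))^(1/(w+1)). A small sketch of that calculation follows. It assumes x is the number of streams and that a, b, w have already been fitted. The parameter values are invented for illustration, and the angle only has meaning for a fixed choice of axis units.

```python
import math

def throughput(x, a, b, w):
    """Model from the notes: y = a - b * x**(-w)."""
    return a - b * x ** (-w)

def optimum_streams(b, w, angle_deg=27.5):
    """Solve b*w*x**(-w-1) = tan(angle) for x (requires b*w > 0)."""
    slope = math.tan(math.radians(angle_deg))
    return (b * w / slope) ** (1.0 / (w + 1.0))

# Hypothetical fitted parameters, not measured values.
a, b, w = 100.0, 80.0, 1.0
x_opt = optimum_streams(b, w)
slope_at_opt = b * w * x_opt ** (-w - 1.0)
```

In practice x_opt would be rounded to a whole number of streams.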
Deb Agarwal indicated that Sally Floyd's new slow start may require changes in Quick iperf, but they should be relatively small.
LBL GridFTP version saves data in NetLogger. Brian has a web page on how to configure GridFTP for monitoring (addresses the certificate issues), and will send a copy.
The codeanal provides an analysis of failures that is very useful.
Brian asked whether we still need to run bbcpmem, or whether iperf is sufficient. Still need a disk-involved estimator (e.g. bbcpdisk, bbftp, GridFTP). When we get the new version (which enables easier addition of new measurements) we will add udpmon, Quick iperf, GridFTP, netest etc.
We need to meet again in October to go over Web Services. We had to defer this since Warren was not present.
We discussed the MAGGIE proposal. We need to determine whether LBL will be a funded or unfunded collaborator. Warren will send an updated proposal to Brian to assist Brian in deciding.