1  Network-aware middleware for multi-server data distribution

The enormous popularity of Peer-to-Peer (P2P) file sharing has shown that it is indeed possible for a large number of people (well into the millions) to exchange terabytes of data over a significant period of time. Over the course of the last few years well over a dozen popular protocols have been developed, and several are open source. Recent estimates show that P2P traffic accounts for well over half of the bytes carried on the Internet, and its share has only increased; as recently as four years ago HTTP was responsible for 75% of the bytes. The rapid shift in traffic is mainly due to the popularity of the content exchanged in P2P networks. A very large fraction of the content exchanged is questionable in terms of ownership: television-quality copies of movies, DVD images, entire CDs, individual songs in MP3 format, etc. are the primary content formats exchanged. Popular protocols include KaZaa/Morpheus, eDonkey/eMule, BitTorrent, DirectConnect, etc. There is a reasonable separation of protocol and content; specific protocols appear to be used for specific kinds of content.

A significant amount of work has gone into optimizing the delivery of the `shared' content from a smaller number of sources to a large number of users with varying degrees of connectivity. For example, the primary problem noticed with Gnutella, the protocol that was popular early on in the P2P world, was free-riding: many `peers' downloaded data but did not necessarily share data in return. Freeloading has been largely solved in recent P2P protocols (e.g., eMule), where peers essentially keep track of the upload/download ratio and downgrade delivery to peers who do not maintain a fair ratio (a minimal sketch of this bookkeeping appears below). The other problem in the P2P world has been the introduction of `decoys' on behalf of content owners who want to reduce the `free' downloads (`stealing'). eDonkey has enabled checksum comparison to reduce the risk of downloading decoys, while KaZaa has maintained control of content downloading by encrypting its transfers (including many of the headers).

A key technical advance in P2P has been breaking large media files into chunks (`parts'), allowing P2P clients to download them from multiple servers in parallel and reassemble them; this has sped up the fetching of large objects. The selection of chunk sizes is driven by individual site considerations, although there has been a rough consensus within certain protocols. The chunks have individual signatures (often MD4 hashes, computed offline) and the headers include the resource size to ensure content integrity; the second sketch below illustrates this chunk-and-verify scheme.
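To make the incentive mechanism concrete, the following minimal sketch (in Python) shows the kind of per-peer bookkeeping such protocols perform. The class name, thresholds, and priority levels are illustrative assumptions and are not taken from any particular protocol; eMule's actual credit system differs in its details.

```python
# A minimal sketch of ratio-based incentive bookkeeping.  The thresholds and
# priority levels are illustrative assumptions, not any protocol's real rules.
from dataclasses import dataclass, field


@dataclass
class PeerLedger:
    """Tracks bytes exchanged with remote peers and derives a service priority."""
    uploaded_to_us: dict = field(default_factory=dict)      # bytes a peer sent us
    downloaded_from_us: dict = field(default_factory=dict)  # bytes we sent a peer

    def record_upload_from(self, peer_id: str, nbytes: int) -> None:
        self.uploaded_to_us[peer_id] = self.uploaded_to_us.get(peer_id, 0) + nbytes

    def record_download_by(self, peer_id: str, nbytes: int) -> None:
        self.downloaded_from_us[peer_id] = self.downloaded_from_us.get(peer_id, 0) + nbytes

    def ratio(self, peer_id: str) -> float:
        """Upload/download ratio from our point of view (higher = fairer peer)."""
        given = self.uploaded_to_us.get(peer_id, 0)
        taken = self.downloaded_from_us.get(peer_id, 0)
        return given / taken if taken else float("inf")

    def service_priority(self, peer_id: str) -> int:
        """Map the ratio to a coarse priority: 2 = normal, 1 = degraded, 0 = last."""
        r = self.ratio(peer_id)
        if r >= 0.5:
            return 2
        if r >= 0.1:
            return 1
        return 0
```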
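Similarly, the chunk-and-verify scheme can be sketched as follows. The chunk size constant, function names, and the use of SHA-1 (rather than the MD4 digests used by eDonkey, which modern crypto libraries do not always expose) are assumptions made for illustration.

```python
# A minimal sketch of chunking with per-chunk digests: digests are computed
# offline by the publisher, and a chunk fetched from any server is accepted
# only if its digest matches.  SHA-1 stands in here for protocol-specific
# hashes such as eDonkey's MD4; the chunk size below is an assumed constant.
import hashlib
from typing import List

CHUNK_SIZE = 9_728_000  # ~9.28 MB, the part size commonly associated with eDonkey


def chunk_digests(path: str, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Compute, offline, one digest per fixed-size chunk of the file at `path`."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.sha1(chunk).hexdigest())
    return digests


def verify_chunk(data: bytes, expected_digest: str) -> bool:
    """Accept a downloaded chunk only if its digest matches the published one."""
    return hashlib.sha1(data).hexdigest() == expected_digest
```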
A key point about the existing work on P2P is that it is driven largely by an open-source-friendly community (with the signal exception of KaZaa), and that the primary motivation has been to get around the difficulties of large-scale delivery of fat content in the face of legal troubles, bandwidth constraints, free-riding, etc. Our proposal faces none of the above constraints and is more closely related to the problem of content distribution.

Traditional Content Distribution Networks (CDNs) arose in the context of the World Wide Web to reduce the overhead on busy Web sites. If cnn.com receives tens of millions of hits (each of which can be a separate TCP connection in the absence of HTTP/1.1 persistent connections) for the many small images on its Web site, it may be unable to handle the load. CDNs offloaded this work and, using DNS as the request-routing and load-balancing mechanism, delivered the small images on behalf of the busy Web sites. The various models of CDN delivery and their effectiveness in reducing the latency perceived by the user have been examined (Krishnamurthy, Wills, and Zhang, "On the Use and Performance of Content Distribution Networks", ACM SIGCOMM Internet Measurement Workshop, 2001). Of late, CDNs have been delivering streaming media content as well, but the motivation of CDNs has never been to deliver large files to many users.

The advantage of grid-oriented computing is that the user base, and therefore its connectivity, is known in advance. Similarly, the file size ranges are generally known, so chunking can be determined readily with the classification of users' connectivity in mind. Several informed algorithms can be tried to find the right chunk sizes and their placement locations; a rough sketch of one such heuristic appears below. The set of algorithms to try based on users' connectivity classes, the distributions of delays on the respective paths, and the ability to efficiently replicate the chunks at the right sites are all problems not examined in either the P2P or the CDN world.
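As a strawman, the heuristic below picks a chunk size from the connectivity class of the receiving users (so that a single chunk transfers in a bounded time) and greedily replicates chunks across the lowest-delay candidate sites. The connectivity classes, bandwidth figures, target transfer time, and site/delay inputs are all illustrative assumptions, not measured values or a settled algorithm.

```python
# A rough sketch of connectivity-aware chunking and placement.  The classes,
# target transfer time, and delay table are assumptions; real middleware would
# derive them from measurements of the known (grid-style) user population.
from typing import Dict, List

# Nominal downstream bandwidth per connectivity class, in bytes/second (assumed).
CLASS_BANDWIDTH = {
    "dialup": 56_000 // 8,
    "dsl": 1_500_000 // 8,
    "campus": 100_000_000 // 8,
}

TARGET_CHUNK_SECONDS = 30  # aim for one chunk to transfer in ~30 s for its class


def chunk_size_for(conn_class: str) -> int:
    """Pick a chunk size so one chunk transfers in roughly TARGET_CHUNK_SECONDS."""
    return CLASS_BANDWIDTH[conn_class] * TARGET_CHUNK_SECONDS


def place_chunks(num_chunks: int,
                 candidate_sites: List[str],
                 delay_to_class: Dict[str, float]) -> Dict[int, str]:
    """Greedy placement: round-robin chunks across the lower-delay half of the
    candidate sites, so no single site has to hold every chunk."""
    ranked = sorted(candidate_sites, key=lambda site: delay_to_class[site])
    top = ranked[: (len(ranked) + 1) // 2]
    return {i: top[i % len(top)] for i in range(num_chunks)}


# Example: chunk a 700 MB image for DSL users and place it across three sites.
if __name__ == "__main__":
    size = chunk_size_for("dsl")
    chunks = (700 * 1024 * 1024 + size - 1) // size
    layout = place_chunks(chunks, ["site-a", "site-b", "site-c"],
                          {"site-a": 0.020, "site-b": 0.045, "site-c": 0.120})
    print(f"chunk size {size} bytes, {chunks} chunks")
    print(layout)
```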