Clustering web content for efficient replication

Yan Chen, Lili Qiu, Weiyu Chen, Luan Nguyen, R. H. Katz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

17 Scopus citations


Content distribution networks (CDNs) that offer hosting services to Web content providers are increasingly widely deployed. We first compare uncooperative pulling of Web content, the scheme used by commercial CDNs, with cooperative pushing. Cooperative pushing achieves user-perceived performance comparable to uncooperative pulling with only 4-5% of the replication and update traffic. We therefore explore how to push content to CDN nodes efficiently. Using trace-driven simulation, we show that replicating content in units of URLs can reduce clients' latency by 60-70% compared with replicating in units of Web sites. However, such fine-grained replication is very expensive. We propose to replicate content in units of clusters, each containing objects that are likely to be requested by clients that are topologically close. We describe three clustering techniques and evaluate their performance using various topologies and several large Web server traces. Cluster-based replication achieves a 40-60% improvement over per-Web-site replication. By adjusting the number of clusters, we can smoothly trade off management and computation cost against client performance. We also explore incremental clustering, which adaptively adds new documents to the existing content clusters, and examine both offline and online variants. Offline incremental clustering performs close to complete re-clustering at much lower overhead. Online incremental clustering and replication reduce the retrieval cost by a factor of 4.6-8 compared with no replication and with random replication, which makes the approach especially useful for improving document availability during flash crowds.
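To make the idea of replicating in units of clusters concrete, the following is a minimal sketch and not the authors' algorithm (the abstract does not specify the three clustering techniques). It assumes each URL is described by a hypothetical vector of request counts per client group, and it greedily groups URLs whose vectors are similar under cosine distance. The names `cosine_distance`, `cluster_urls`, `access`, and `threshold` are illustrative assumptions.

```python
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # a URL with no requests is treated as dissimilar to everything
    return 1.0 - dot / (norm_a * norm_b)


def cluster_urls(access, threshold):
    """Greedy clustering (an assumed, simplified scheme).

    Each URL joins the first cluster whose seed vector lies within
    `threshold`; otherwise it starts a new cluster with itself as seed.
    """
    clusters = []  # list of (seed_vector, member_urls)
    for url, vec in sorted(access.items()):
        for seed, members in clusters:
            if cosine_distance(seed, vec) <= threshold:
                members.append(url)
                break
        else:
            clusters.append((vec, [url]))
    return [members for _, members in clusters]


# Hypothetical request counts per URL from three client groups.
access = {
    "/a.html": [90, 5, 0],
    "/b.html": [80, 10, 0],
    "/c.html": [0, 3, 70],
    "/d.html": [1, 0, 95],
}
clusters = cluster_urls(access, threshold=0.2)
print(clusters)
```

In this sketch the threshold plays the role of the cluster-count knob mentioned in the abstract: a threshold of 0 leaves every URL in its own cluster (per-URL replication), while a threshold of 1 merges the whole example site into a single cluster (per-Web-site replication).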

Original language: English (US)
Title of host publication: Proceedings - 10th IEEE International Conference on Network Protocols, ICNP 2002
Publisher: IEEE Computer Society
Number of pages: 10
ISBN (Print): 0769518567, 9780769518565
State: Published - Jan 1 2002
Event: 10th IEEE International Conference on Network Protocols, ICNP 2002 - Paris, France
Duration: Nov 12 2002 to Nov 15 2002

Publication series

Name: Proceedings - International Conference on Network Protocols, ICNP
ISSN (Print): 1092-1648


Other: 10th IEEE International Conference on Network Protocols, ICNP 2002

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

