TY - JOUR
T1 - Efficient and adaptive Web replication using content clustering
AU - Chen, Yan
AU - Qiu, Lili
AU - Chen, Weiyu
AU - Nguyen, Luan
AU - Katz, Randy H.
N1 - Funding Information:
Manuscript received August 18, 2002; revised April 9, 2003. The work of Y. Chen and R. H. Katz was supported in part by the California MICRO Program, Nokia, Ericsson, HRL Laboratories, and Siemens. This paper is an extended version of an earlier paper that was presented at the 10th IEEE International Conference on Network Protocols (ICNP’02), November 2002.
PY - 2003/8
Y1 - 2003/8
N2 - Recently, there has been an increasing deployment of content distribution networks (CDNs) that offer hosting services to Web content providers. In this paper, we first compare the uncooperative pulling of Web contents used by commercial CDNs with the cooperative pushing. Our results show that the latter can achieve comparable users' perceived performance with only 4%-5% of replication and update traffic compared with the former scheme. Therefore, we explore how to efficiently push content to CDN nodes. Using trace-driven simulation, we show that replicating content in units of URLs can yield 60%-70% reduction in clients' latency, compared with replicating in units of Websites. However, it is very expensive to perform such a fine-grained replication. To address this issue, we propose to replicate content in units of clusters, each containing objects which are likely to be requested by clients that are topologically close. To this end, we describe three clustering techniques and use various topologies and several large Web server traces to evaluate their performance. Our results show that the cluster-based replication achieves performance close to that of the URL-based scheme, but only at 1%-2% of computation and management cost. In addition, by adjusting the number of clusters, we can smoothly trade off management and computation cost for better client performance. To adapt to changes in users' access patterns, we also explore incremental clustering that adaptively adds new documents to the existing content clusters. We examine both offline and online incremental clustering, where the former assumes access history is available while the latter predicts access pattern based on the hyperlink structure. Our results show that the offline clustering yields performance close to that of the complete re-clustering at much lower overhead. The online incremental clustering and replication cut down the retrieval cost by 4.6 times compared with random and by 8 times compared with no replication. Therefore, it is especially useful to improve document availability during flash crowds.
AB - Recently, there has been an increasing deployment of content distribution networks (CDNs) that offer hosting services to Web content providers. In this paper, we first compare the uncooperative pulling of Web contents used by commercial CDNs with the cooperative pushing. Our results show that the latter can achieve comparable users' perceived performance with only 4%-5% of replication and update traffic compared with the former scheme. Therefore, we explore how to efficiently push content to CDN nodes. Using trace-driven simulation, we show that replicating content in units of URLs can yield 60%-70% reduction in clients' latency, compared with replicating in units of Websites. However, it is very expensive to perform such a fine-grained replication. To address this issue, we propose to replicate content in units of clusters, each containing objects which are likely to be requested by clients that are topologically close. To this end, we describe three clustering techniques and use various topologies and several large Web server traces to evaluate their performance. Our results show that the cluster-based replication achieves performance close to that of the URL-based scheme, but only at 1%-2% of computation and management cost. In addition, by adjusting the number of clusters, we can smoothly trade off management and computation cost for better client performance. To adapt to changes in users' access patterns, we also explore incremental clustering that adaptively adds new documents to the existing content clusters. We examine both offline and online incremental clustering, where the former assumes access history is available while the latter predicts access pattern based on the hyperlink structure. Our results show that the offline clustering yields performance close to that of the complete re-clustering at much lower overhead. The online incremental clustering and replication cut down the retrieval cost by 4.6 times compared with random and by 8 times compared with no replication. Therefore, it is especially useful to improve document availability during flash crowds.
KW - Content distribution network (CDN)
KW - Replication
KW - Stability
KW - Web content clustering
UR - http://www.scopus.com/inward/record.url?scp=0042025136&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0042025136&partnerID=8YFLogxK
U2 - 10.1109/JSAC.2003.814608
DO - 10.1109/JSAC.2003.814608
M3 - Article
AN - SCOPUS:0042025136
VL - 21
SP - 979
EP - 994
JO - IEEE Journal on Selected Areas in Communications
JF - IEEE Journal on Selected Areas in Communications
SN - 0733-8716
IS - 6
ER -