Exploiting Heterogeneity in Peer-to-Peer Systems Using
A thesis submitted to the University of Dublin, Trinity College
in fulllment of the requirements for the degree of
Doctor of Philosophy (Computer Science)
A peer-to-peer system can be defined as an overlay network built by a set of nodes on top of a physical network infrastructure and its operating protocols, such as the Internet. In a peer-to-peer network, each node maintains a limited number of connections with other nodes, called peers, and the graph of peer connections constitutes the overlay's topology. One of the most fundamental properties of existing large-scale peer-to-peer systems is a very high heterogeneity and dynamism of peers participating in the system. Studies show that the distributions of peer characteristics, such as peer session duration, available bandwidth, and storage space, are highly skewed and often heavy-tailed, with small fractions of peers possessing disproportionally large fractions of the total system resources. Such heterogeneity introduces both challenges and opportunities when designing peer-to-peer systems. The use of low-performance or low-stability nodes for maintaining system data or services can easily lead to a poor performance of the entire system, while the placement of critical data and services on the most reliable, high-capacity nodes may improve the overall system stability and performance.
Current state-of-the-art peer-to-peer systems exploit their heterogeneity by introducing two-level hierarchies of peers. High capability peers, so called super-peers, form an independent sub-topology within the existing peer-to-peer network and handle the core system functionality, such as indexing peer data and handling search queries, or relaying traffic on behalf of firewalled peers. Ordinary peers connect directly to super-peers and act as their clients. However, many existing systems lack an efficient, decentralised super-peer election algorithm. In many systems, super-peers are selected manually, through an out-of-band mechanism, or are elected using simple local heuristics, which are likely to generate suboptimal super-peer sets. Sophisticated super-peer election algorithms exist, but they are usually highly specific to particular systems and are not easily portable to other application areas.
This thesis presents a novel class of peer-to-peer topologies, called gradient topologies, which generalise the concept of super-peer networks. In gradient topologies, the position of each peer is determined by a continuous utility function, and the highest utility peers are clustered in a logical centre of the topology, while peers with lower utility are located at gradually increasing distance from the centre. The utility metric captures application-specific peer requirements and reflects peers' ability to contribute resources and services to the system. The gradient structure of the topology has two fundamental properties. Firstly, all peers in the system with utility above a given threshold are located close to each other in terms of overlay hops and form a connected sub-topology. Such high-utility peers can be exploited by higher level applications in a similar fashion to super-peers in traditional two-level hierarchies. Secondly, the information captured in the topology enables a search heuristic, called gradient search, that enables efficient discovery of such high utility peers.
The gradient topologies have been evaluated using a custom-built simulator and compared with state-of-the-art super-peer systems. The evaluation shows that the election techniques based on gradient topologies allow more flexible super-peer criteria specification compared with the other systems. Moreover, the super-peer sets elected using gradient topologies are closer to the theoretical optimum, compared with the other systems, and have a higher average utility and stability. The experiments also show that the maintenance cost of gradient topologies, in terms of generated messages and established connections, is similar to that in the state-of-the-art systems.
First and foremost, I am very grateful to my supervisors for their mentoring, expertise and support. I would like to thank Jim Dowling, who proposed my research topic and without whom this thesis would never have been written, René Meier, who supervised me in the middle of my studies, and Mads Haahr, who guided me during the difficult time of writing up. A special acknowledgement must be given to Raymond Cunningham, my de facto fourth supervisor, for his great help in my research. I am also grateful to my reviewers, Anthony Rowstron and Ciaran McGoldrick, for their valuable contributions to this work.
Over the course of my studies, I collaborated with a number of colleagues in the Distributed Systems Group (DSG). I would like to especially mention here Bartek Biskupski, Dominik Dahlem, Andy Edmonds, Serena Fritsch, Anthony Harrington, Marcin Karpinski, Dave McKitterick, Andronikos Nedos, As'ad Salkham, Aline Senart, Kulpreet Singh, Tim Walsh, and Stefan Weber. The DSG has created a phenomenal environment for working, learning and living. I would also like to thank my family, my parents, brother, and fiancee Aja, who continuously supported me despite the physical distance between us.
The research described in this thesis was partly funded by the Information Society Technology Programme of the European Union, Microsoft Research Cambridge, and the Irish Council for Science, Engineering and Technology (IRCSET). Their generous support is greatly appreciated.
The purpose of this chapter is to give a general introduction to the area of peer-to-peer system and to the problems discussed in this thesis. It describes briefly the history of peer-to-peer systems and discusses the main characteristics found in these systems. It is shown that existing large-scale peer-to-peer systems are highly heterogeneous. This chapter also describes the thesis goals, explains the motivation behind these goals, and outlines the thesis organisation.
Peer-to-Peer (P2P) systems belong to the fastest growing applications in the area of distributed computing. Within a short number of years, P2P systems gained an extreme popularity, attracted millions of users, entered a number of different application areas, and became one of the main contributors of global traffic in computer networks.
Although the first widely-used systems considered P2P appeared already in 1970's, with Usenet and the Network News Transfer Protocol (NNTP) being examples of such systems, most of the modern P2P applications have been invented during the last decade. Particularly, the first application that gained an extraordinary popularity was Napster, a file-sharing application developed and published in 1999. By the end of 2000, Napster had been downloaded by 50 million people around the world and became the fastest growing application on the Web . The success of Napster was quickly followed by other file-sharing P2P systems, such as Gnutella, KaZaA (also known as FastTrack), DirectConnect, eDonkey, Overnet, BitTorrent and many others. Soon, P2P systems became one of the two most dominant applications on the Internet, in terms of global traffic contribution, along with the Web. On a typical day, KaZaA has more than three million active users, sharing over 5,000 terabytes of content . According to the measurements by Tier 1 Internet providers, P2P applications generate between 15% and 20% of the total Internet traffic, and up to 60% of traffic on some Internet backbone links [57,87]. Moreover, a number of regional Internet service providers, such as university network operators or national operators, estimate that P2P applications account for approximately 30-70% of total generated traffic, and their share is still growing [12,11,111,53].
Freenet [40,39], which appeared in 1999, was another system that pioneered the development of P2P networks. Freenet was a distributed storage system that allowed for publication, replication, and retrieval of data, while protecting the anonymity of both authors and readers. Along with file sharing and file storage, P2P systems entered a number of other application areas, such as Internet telephony, video conferencing, and multimedia streaming [37,32,84,33,92,140,204]. P2P systems were also used for content distribution , for example by the Debian organisation for distributing Linux distribution image files. An extremely large popularity was achieved by Skype [15,67], an Internet telephony and conferencing application made available in 2003. In 2006, Skype was reported to have 83 million registered users around the world and was used by approximately 3-4 million simultaneous users at any one time .
Another initiative that contributed to the development P2P systems was SETI@home [8,194], a scientific project launched in 1999, whose aim was to exploit unused computing resources, such as CPU idle time, on machines connected to the Internet for running scientific computations, such as analysing radio telescope signals in search of intelligent life outside Earth. Although SETI@home is not generally considered a P2P system, since its nodes communicate with a centralised server rather than with each other, it shares a number of common characteristics with P2P systems, and belongs to a wider class of distributed systems known as Public Resource Computing (PRC) or volunteer computing. Both P2P and PRC systems rely on large and dynamic populations of autonomous nodes that contribute resources to the system.
The SETI@home project received a strong public response, and by 2002, over 3.8 million people registered in the program. This enabled an unprecedented computing power (order of 100 TeraFLOPS), and by 2002 the program performed a total of 1.7e21 floating-point operations, the largest computation on record at that time .
As the popularity of P2P grew, P2P systems also became a significant research area within the greater domain of distributed systems. A number of conferences and scientific journals have been devoted solely to P2P systems and their applications, and a large body of work has been published in the area.
A number of independent definitions for P2P system have been proposed [171,10,2,167]. Most definitions describe P2P systems as large-scale, decentralised systems maintained by a large number of autonomic nodes, called peers, which voluntarily contribute resources, such as storage space, processor cycles, bandwidth, or physical human presence, and self-organise in order to provide a useful services to a community of users. The membership in P2P systems is open and dynamic, as peers can freely join and leave during the system's operation. There is no distinction between dedicated servers and clients, as all nodes participating in the system provide services to each other as peers. Every peer can function as both a client and a server. Some definitions also assume that the responsibilities and capabilities of all peers in a P2P system are identical [162,44,176].
Due to the scale and dynamism of a P2P system, it is not feasible for each peer to possess and maintain an accurate model of the entire system. Instead, each peer usually has a knowledge of and communicates with a limited number of other peers, called its neighbours. The connections between neighbouring peers form an overlay network, on top of the physical infrastructure, such as the Internet. This overlay network is used by peers for routing messages and exchanging information. Unlike in wireless networks, where the communication range is limited, the nodes in P2P systems are directly reachable and the overlay structure can be adapted arbitrarily depending of the system needs. The graph of peer connections is called the system topology.
Contrary to the definitions that assume identical roles and capabilities of all peers in the system, measurements show that deployed P2P systems are characterised by a very high diversity of participating peers. The distributions of basic peer characteristics, such as the processing power, storage space, bandwidth, or session duration, are highly skewed and radically different from the normal or uniform distributions. In the normal distribution, practically all values are located within several standard deviations from the mean (e.g., 99.7% of all values are within three standard deviations from the mean) and values beyond this range are extremely unlikely. In P2P systems, it has been observed that the characteristics of peers (i.e., storage, bandwidth, etc.) vary by several orders of magnitude between individual peers (see Table 1.1 for a comparison). Furthermore, while a large majority of peers have relatively few resources, small subsets of peers possess significant fractions of the total system resources. Such distribution is often modelled using the Pareto distribution, also known as the Zipf's law and power law [135,4], and other heavy-tailed distributions, such as the Weibull and log-normal distributions.
One of the basic peer properties that shows a high P2P system heterogeneity is the session duration, defined as the amount of time a peer stayed (or is expected to stay) on-line in the system without disconnecting. According to a number of independent measurements, the average session duration in existing P2P systems is relatively short, and varies between 1 minute and 1 hour, depending on the system and the measurement method [181,144,27,36,167]. However, in nearly all studied systems, groups of peers were found which stayed on-line for considerably longer periods. For example, Stutzbach et al.  observe that roughly 10%-20% of peers in Gnutella and Kad have an uptime (i.e., amount of time since joining the system) longer than one day, and around 1-3% of BitTorrent peers have an uptime longer that two weeks. Similarly, Pouwelse et al.  report that 17% of BitTorrent peers stay in the system for longer than one hour after they finished downloading, 3.1% of peers stay on-line for at least 10 hours after downloading, 0.34% of peers stay for more than 100 hours after downloading, and their longest observed session is 83.5 days.
Related to the peer session duration is peer availability, a property defined as the fraction of time a peer spends on-line in the system within a longer period. Measurements indicate that peer availability ranges from almost 0% to 100% between peers, and the availability distribution is highly skewed [173,18]. The majority of peers have a poor availability, while small subsets of peers stay on-line almost all the time. Furthermore, while some peers frequently join and leave the system over time, other peers connect to the system only once and never come back. For instance, Bhagwan et al.  discover that in a two-week trace collected from the Overnet system, on each day, new hosts never seen before in the trace comprise over 20% of the peer population. A number of experiments reveal also diurnal patterns in the peer participation in P2P systems [18,167,36,67,27].
Bandwidth is another example of an unevenly distributed resource between peers. Although technically bandwidth is a property of a network connection between two machines, in practice the bottleneck bandwidth between a peer and the rest of the Internet is determined by the peer's direct link to the Internet [97,167], and hence, is a property of this peer. Saroiu et al.  show that the median upstream bottleneck bandwidth of peers in Gnutella is roughly 1Mpbs, while about 22% of peers have a bottleneck bandwidth below 0.1Mbps, 8% of peers have a bottleneck bandwidth above 10Mbps, and the highest observed bandwidth capacities reach 100Mbps. Similarly, Pouwelse et al.  show that while the average download speed of a peer in BitTorrent is 240 Kbps, a number of peers download at much higher rates, with a maximum around 4,000 Kbps. A number of other measurements show consistent results [9,173,188].
Other peer properties that have been analysed and shown to follow skewed distributions include peers' computing power and available storage space and memory (RAM) . Moreover, it has been shown that a significant fraction of peers in P2P systems are located behind firewalls or Network Address Translators (NAT) which limit their ability to communicate with other peers [102,181]. The types of peers' firewalls or NATs add one more dimension to the diversity of peers and hence to the heterogeneity of P2P systems.
Apart from hardware parameters, peers also very between each other in terms of their behaviour. An important peer characteristic is willingness to share resources. Studies show that the number of files shared by Gnutella peers vary between 0 and 10,000 [167,208]. In particular, a large fraction of users (so called free riders, up to 70% of all users) share no files . Some Gnutella users also take steps to discourage other users from downloading files from them, for example by advertising the lowest possible upload speed (64 Kbps or less) . On the other hand, groups of peers have been observed which exhibit a contrary behaviour. These peers do not download any files but stay on-line and let other peers download from them. It is estimated that 7% of peers in Gnutella together offer more files than all of the other peers combined .
The amount of traffic generated by individual peers is also extremely variable. Sen et al.  analyse network traffic in a large Tier 1 Internet provider and show that less than 10% of peer IP addresses contribute around 99% of the total traffic volume, and the top 1% of peer addresses transmit 73% of the total traffic. A single peer may transmit over 10GB of data during a one day. Similarly, Leibowitz et al.  observe that while the majority of KaZaA peers initiate less than 10 downloads during their life time, a number of peers request several hundred files. Such highly active peers, generating relatively large amounts of traffic, are often called heavy-hitters.
Table 1.1 shows a summary of measurements of
peer characteristics in deployed P2P systems.
One of the main challenges in building P2P systems is dealing with the difficult substrate on which these systems are based - the heterogeneous and dynamic population of peers discussed in section 1.2. Since the average peer session duration in a P2P system is short, large numbers of peers continuously join and leave the system. This process is often referred to as churn, and the rate of peer arrivals, which is approximately equal to the rate of peer departures in the long run, is also called churn rate [181,153].
In the presence of high churn, the entries in peers' neighbourhood tables become quickly outdated and may point to peers that no longer participate in the system . This either increases message loss rates, since messages routed using these entries are not delivered, or increases delays in communication, as message failure detection and retransmission is usually slow. In order to keep neighbourhood tables up to date, peers need to frequently exchange messages with their neighbours. However, this increases the overhead related to the overlay network maintenance [113,100]. For example, it is estimated that in the early versions of Gnutella, keep-alive messages accounted for over 50% of all generated traffic . As the churn rate increases, the system also needs to increase the level of data redundancy (e.g., the replication factor) in order to guarantee a certain level of data availability. This again increases the overlay maintenance cost, since more data need to be transferred when peers are joining and leaving the system and more replicas need to be synchronised when updates are issued . Consequently, a number of P2P systems have been shown to perform poorly in the presence of high churn .
Furthermore, P2P systems may suffer poor performance if they do not address their heterogeneity and do not adapt their structure to the properties of individual peers. The lowest performance peers, with poor processing capacity or insufficient network throughput, are likely to become bottlenecks. For example, in August 2000, the entire Gnutella network experienced deteriorated performance, with slow response time and fewer available resources, as it grew to a larger size. This was caused by peers connected by dial-up modems, which became saturated by the increased load and caused network fragmentation. Indirectly, this was caused by Gnutella's lack of ability to control its topology and to adapt it to the capabilities of individual peers .
However, the heterogeneity of P2P systems, as well as being a challenge, is also an opportunity that can be exploited. By assigning more responsibilities to stable, high capability peers, a P2P system may improve its overall reliability and performance [167,110]. The subset containing the most stable peers is less subject to churn, and hence, is more suitable for hosting data or routing traffic [181,27]. Likewise, a set of peers with the largest amount of resources is most suitable for performing resource-intensive tasks, such as handling search queries .
Consequently, nearly all widely used P2P systems today attempt to exploit the diversity of participating peers. In most systems, peers are divided into two categories. The highest capability peers, called super-peers, act as servers to the other peers. Usually, they form an independent sub-topology within the system overlay and handle the core system functionality. Ordinary peers maintain connections to selected super-peers and act as their clients . For example, KaZaA uses super-peers (called supernodes) to index data stored by clients and to handle the search protocol . A similar concept has been employed by Gnutella, where ultrapeers are used for routing search queries , and in eDonkey, where eDonkey servers are used by client peers as rendezvous points [188,75]. In Skype, supernodes are mainly used for relaying messages between peers whose Internet access is restricted by a firewall or a Network Address Translator (NAT) [15,67]. While different systems and research documents introduce their own vocabulary for describing peers, they refer in principle to the same general concept of distinguished peers that have higher capabilities and perform more tasks than ordinary peers. For consistency, this thesis uses the term super-peers for describing such distinguished peers, and the terms ordinary peers and clients for the remaining peers. In most contexts, the terms super-peer, ultrapeer, supernode, and superpeer can be treated as synonyms.
There are other approaches to exploiting heterogeneity in P2P systems, in which the system structure is adapted to the properties of participating peers, but no division between super-peers an clients is made. Such approaches include spanning tree and mesh structure optimisations in streaming systems [150,6,139,112,21,168], topology and message flow adaptation in Gnutella [110,34,27], and the introduction of virtual servers to distributed hash tables [148,88]. However, these approaches are highly specific to their application areas, and hence are not as universal as the super-peer based approaches.
The design based on super-peers has two main advantages. First, certain system tasks, such as hosting data or services, can be assigned to stable, high-performance peers, i.e., super-peers, in order to improve the overall system reliability and performance. Second, the use of super-peers allows the system to limit the number of participants in certain distributed algorithms, such as search [109,199,187,99], which do not scale well and become too expensive when the system size is large. Thus, the super-peer design can improve the scalability of a P2P system.
However, the use of super-peers introduces a number of new challenges that need to be addressed. In order to elect super-peers, the system needs to decide on how many super-peers are needed and which peers are most appropriate to take the role of super-peers. Furthermore, the system needs to maintain and continuously adjust the super-peer set in response to peer arrivals and departures, changes in the current load and changes in peer capabilities. The system also needs to distribute clients between super-peers and migrate them when super-peers leave or fail. Ideally, the load between super-peers should be balanced to ensure the system's scalability and fault-tolerance. This must all be performed in a dynamic and decentralised manner, with no central coordination or authority.
It should be noted that traditional election algorithms for distributed systems, such as the bully algorithm , and classic approaches to group communication [156,73,19], are not applicable to large-scale P2P systems due to their cost and message overhead. Most of these approaches require strong global consensus between all peers in the system  and rely on broadcasting messages between all peers in the system [60,70]. As such, they can be only applied to small systems (order of thousands of nodes ) or smaller subsets of peers within a larger system.
Few existing P2P systems attempt to elect super-peers in an efficient, decentralised way. Many P2P systems rely on manual super-peer selection, through an out-of-band mechanism, or employs simple heuristics, which are likely to generate suboptimal super-peer sets. In some systems, the addresses of super-peers are simply hardcoded into the application. Only few P2P systems attempt to elect optimal super-peers sets according to some metric, but most of these systems are specific to particular applications and their super-peer election mechanisms are not easily portable to other domains.
This thesis describes a novel approach to dealing with heterogeneity in P2P systems. This approach is based on peer utility metrics and gradient topologies. A utility metric is a function evaluated at each peer locally that reflects peer's ability to contribute resources to the system and to provide services to other peers. A utility metric is domain-specific, and captures peer requirements imposed by the higher-level application built on top of the gradient topology.
A gradient topology is a P2P topology where the highest utility peers are clustered in the logical centre of the topology, while peers with lower utility are found at gradually increasing distance from the centre. In contrast to super-peer topologies or hierarchies, where peers are divided into two or more discrete groups, gradient topologies introduce a continuous spectrum of peers, from the highest utility peers in the centre to the lowest utility peers at the periphery.
Gradient topologies have an elementary property that for any given utility threshold, all peers in the system with utility above this threshold are located close to each other, in terms of overlay hops, and form a connected sub-topology within the system overlay. Such high utility peers can be exploited by the system in a similar way as super-peers in traditional two-level hierarchies. Furthermore, high utility peers in gradient topologies can be easily and efficiently discovered by lower utility peers using an heuristic called gradient search, which routes messages from outer peers towards the centre of the topology, as in hill climbing and similar techniques based on the notion of gradient.
The main advantage of gradient topologies over the traditional super-peer topologies is that utility thresholds can be increased or decreased, adjusting the number of peers above the thresholds according to the system requirements, without the need to reconfigure any peer connections. For any selected threshold, peers above the threshold are clustered at the centre of the gradient topology and gradient search can be used by low utility peers to efficiently discover them. Moreover, without any additional mechanisms, multiple thresholds can be calculated to elect multiple concentric sets of high utility peers, similar to a hierarchy.
Gradient topologies, together with election algorithms described in this thesis, allow P2P system to select the most suitable peers for performing tasks such as hosting system data, running services, or participating in certain distributed algorithms. By selecting the highest utility peers in the network for these tasks, P2P systems can exploit the heterogeneity in peer populations for improving the overall system stability and performance.
Gradient topologies are domain-independent and can support many different classes of P2P applications, such as storage systems, name services, file-sharing applications, and semantic registries. As gradient topologies are based on the notion of peer utility rather than peer connection utility, they are not directly applicable to P2P systems that need to adapt their structures to the properties of the underlying low-level network, such as link latencies and link throughputs. Systems that belong to this category, for example multi-media streaming applications, are not addressed in this thesis.
The main contributions of this thesis are: (i) a number of utility metrics for the characterisation of peers in a P2P system, (ii) neighbour selection algorithms that generate and maintain gradient topologies with desired properties, such as a low degree of peers and a low distance between the highest and the lowest utility peers, growing logarithmically with the system size, (iii) election algorithms, based on decentralised aggregation techniques and adaptive utility thresholds, which create and manage close-to-optimal super-peer sets in the gradient topologies and minimise the number of swappings between super-peers and ordinary peers, (iv) routing heuristics that enable high-utility peer discovery in the gradient topologies.
The proposed algorithms have been validated using a custom-built P2P simulator. In a range of experiments, it is shown that gradient topologies, together with aggregation-based election techniques, generate better-quality and higher-stability super-peer sets, and have similar maintenance cost, compared with state-of-the-art super-peer systems. Moreover, it is shown that gradient topologies offer more flexible and powerful super-peer election mechanisms compared with the existing P2P systems, and thus extend the current state-of-the-art knowledge on super-peers, and more generally, heterogeneity exploitation in P2P systems.
The remainder of this thesis is organised as follows.
Chapter 2 reviews a wide range of P2P systems that introduce super-peers, with a particular emphasis on the super-peer criteria and election mechanisms, highlighting the achievements and limitations of each system.
Chapter 3 formally defines the class of gradient P2P topologies and describes their main characteristics.
Chapter 4 presents a collection of utility metrics and algorithms that generate gradient topologies and enable super-peer election and discovery in these topologies.
Chapter 5 evaluates the algorithms introduced in chapter 4 and verifies that they construct topologies defined in chapter 4. It also compares the functionality and performance of gradient topologies with a number of state-of-the-art super-peer election systems.
Chapter 6 summarises the thesis and discusses future work.
This chapter surveys the area of P2P systems that exploit the diversity in peer characteristics. In the large majority of cases, these systems are based on super-peers. For each reviewed system, the super-peer functions, election mechanisms, and topology maintenance algorithms are described, and the advantages and limitations of each proposed approach are discussed. The last section in this survey covers systems that do not use super-peers but are based on alternative principles.
Systems that exploit heterogeneity between peer connections rather than individual peers are not covered in the survey. These systems include in particular streaming and multicasting applications, which optimise dissemination trees, mesh overlays, and other structures based on the latency, bandwidth, and throughput of connections between peers [150,6,139,112,21,168]. Due to the different requirements and objectives, these systems cannot be directly compared to the gradient topologies and election strategies described in this thesis.
The concept of super-peers has been first studied by Yang and Garcia-Molina in , who divide P2P systems into three categories. In pure P2P systems, such as Freenet and initial versions of Gnutella, all peers have equal roles and responsibilities in all aspects, and the system's functionality is fully decentralised. In hybrid systems, such as Napster, some functionality is handled by a centralised component (e.g., search), but otherwise the system is decentralised (e.g., downloads are performed directly between peers). Super-peer networks, such as KaZaA and early versions of Morpheus, present a cross between pure and hybrid P2P systems.
A super-peer is a node that acts as a centralised server to a set of clients and as an equal to other super-peers. Clients communicate with their super-peers as with servers in hybrid P2P systems or in traditional client-server architectures. However, super-peers are also connected to each other as peers in pure P2P systems, forming a super-peer overlay that handles the core system functionality, such as search. A super-peer together with the set of its clients is called a cluster, and the cluster size is the number of nodes in the cluster, including the super-peer.
In a sense, the introduction of super-peers enables a trade-off between full decentralisation and partial centralisation in a P2P system. The advantage of super-peer systems over hybrid P2P systems is that they do not have any centralised components. The advantage of super-peer systems over pure P2P systems is that they can exploit the heterogeneity in peers by assigning relevant system functions to high-capability peers and that they reduce the number of participants in expensive algorithms, such as search, by running these algorithms on super-peers only.
Super-peer topologies found in the existing P2P systems and literature can be divided into four general types. In classic super-peer topologies, every client is connected to exactly one super-peer, as shown in Figure 2.1(a). These topologies, however, have the drawback that in the case of a super-peer failure, all clients of the failed super-peer become disconnected from the network and isolated.
In order to address this problem, some systems allow clients to maintain connections to multiple super-peers, as shown in 2.1(b). Similarly, for improved system reliability and fault tolerance, Yang and Garcia-Molina  introduce a -redundant super-peer topology, where each super-peer, called a virtual super-peer, is replicated on physical peers. Each of the physical peers maintains connections to all neighbours of the virtual super-peer (clients and other super-peers) and hosts a full copy of the super-peer data. In this way, a virtual super-peer can tolerate up to peer failures without service disruption. A sample 2-redundant super-peer topology is shown in Figure 2.1(c).
Finally, in some systems, any two peers are allowed to establish a connection between each other, including two clients, in which case the super-peer topology is embedded in the overall system topology, as shown in Figure 2.1(d). Such a topology is also known as a hub topology .
In order to introduce super-peers, a P2P system needs to decide on how many super-peers are desired and which peers are the best candidates for super-peers. These two problems are correlated, as the desired number of super-peers may depend on the properties of available peers. For example, fewer super-peers may be elected if high-capacity candidates are available in the system, while more super-peers may be required if all peers have low capacity.
As the system elects super-peers, it needs to decide on how to connect the super-peers with each other and how to assign clients to super-peers. The connections between super-peers, as well as the super-peer functionality, are usually application-specific. However, the algorithms for super-peer election and client distribution are often application-independent, and given an optimality criteria, can be directly compared with the election algorithms used in gradient topologies. For that reason, a special emphasis is put on the super-peer election mechanisms used in the systems reviewed in this chapter. Systems with the more elaborated super-peer election algorithms are covered in more detail.
The reviewed systems are divided into four loose categories based on the super-peer election mechanisms. The proposed organisation is not an exhaustive formal taxonomy, and some systems may fit in more than one category. The classification is introduced for convenience only.
The first category, described in section 2.2, comprises systems that do not specify any super-peer election algorithm or have a very simple super-peer election approach. This includes manual or centralised super-peer selection, and selection performed locally at each peer based on a static criteria. Many of the first P2P systems that introduced super-peers belong to this category.
The second class, covered in section 2.3, consists of systems where the population of peers is divided into disjoint groups, and super-peers are elected within each group independently. The groups are usually based on peer properties such as physical location, network proximity, or semantic content.
The third class (section 2.4) contains systems where the super-peer election method is based on a Distributed Hash Table (DHT). The DHTs [179,149,162,46] are a well-known class of P2P systems, with a well-defined functionality (i.e., that of a hash table), which constitutes a coherent subset of P2P systems. Due to the distinctive characteristics of the DHTs, systems that use DHTs for the election of super-peers are considered as a separate class in this review.
Finally, the last category (section 2.5)
contains adaptive systems that elect super-peers based on global demand,
for example defined as the number of clients, rate of client requests,
or current load on super-peers. These systems usually define some
optimality criteria and continuously strive to optimise the super-peer
set. Table 2.1 lists the systems reviewed in
each of the four categories.
This section reviews P2P systems with the simplest super-peer election and management mechanisms. This category includes systems that do not specify any super-peer election method, or rely on external, out-of-band mechanisms, or leave the super-peer election to higher-level applications; systems where the super-peer sets are hardcoded; systems that use centralised components for handling super-peer management; systems where super-peers are selected manually, either by a global system administrator or by the local user at each peer; systems where super-peers are elected based on fixed threshold, such as minimum amount of bandwidth or storage space. A number of research papers reviewed in this section assume that a certain super-peer topology is given and focus on the exploitation of such a topology. The creation and maintenance of the super-peer topology is treated in these papers as a separate research topic. Moreover, a large number of papers specify super-peer election criteria, usually as a combination of peer characteristics such as bandwidth, storage space, or processing power, but do not elaborate on any super-peer election algorithm.
The reviews begin with OceanStore, one of the first P2P systems that postulated the use of super-peers, and Brocade, a system inspired by OceanStore. Next, the section describes a number of file-sharing systems that rely on super-peers, including KaZaA and Gnutella. Finally, the review covers Skype, an Internet telephony system, and a number of general P2P frameworks and algorithms such as JXTA and the Schelling algorithm.
One of the first systems that proposed the use of super-peers, published in 2000, was OceanStore [94,152], a global-scale distributed storage system for persistent data. OceanStore can be seen as a predecessor of P2P systems, as it was designed to run on a large number of nodes distributed around the world and maintained by multiple independent providers. The nodes in OceanStore are considered unreliable and untrusted, as in P2P systems.
For reliability and performance reasons, data stored in OceanStore is replicated and spread evenly between nodes using a proactive replication algorithm. Every data object has a primary replica, which serialises updates, verifies access control credentials, and constructs a dissemination tree between secondary replicas. The primary replicas are hosted on a selected set of nodes, called primary tier or inner ring. These nodes function in OceanStore in a similar way to super-peers in many P2P systems.
As stated in , ``the primary tier consists of a small number of replicas located in high-bandwidth, high-connectivity regions of the network''. However, OceanStore does not provide any algorithm for the election of nodes in the primary tier. According to the replication protocol specification , nodes that participate in the inner ring are selected manually by system operators.
The data placement and discovery in OceanStore is controlled by a variation of the Plaxton algorithm , which later evolved into the Tapestry , one of the first distributed hash table systems. In this algorithm, every data item, as well as every peer in the system, is assigned a unique identifier. Data items are assigned to peers based on their identifiers and independently of their physical locations. Peers maintain a system topology and routing tables that enable efficient query routing and data access. Both Tapestry and the original Plaxton algorithm spread data and traffic equally between all peers in the system, and treat all peers in the system as if they possessed uniform resources.
In order to address peer heterogeneity and to improve routing performance in OceanStore, an extension to Tapestry has been proposed called Brocade . In Brocade, high-capability super-peers, called landmark nodes, are used for routing messages through wide-area networks on behalf of other peers that act as their clients. Landmark nodes maintain a Tapestry network between each other and use the Tapestry routing algorithm. Landmark nodes also maintain lists of their clients.
Brocade elects super-peers from nodes that ``have significant processing power, minimal number of IP hops to the wide-area network, and high-bandwidth outgoing links'' . Amongst the nodes that satisfy these requirements, super-peers are selected by the Internet Service Providers (ISP) in each local network. According to , ``gateways routers or machines close to them are attractive candidates'' for super-peers. An election algorithm is mentioned, but no details of such an algorithm are provided.
Brocade also includes two mechanisms that associate clients with super-peers. In the first approach, super-peers monitor the traffic in their local networks and intercept all messages destined to peers located in remote networks. These messages are subsequently tunnelled and routed over the Tapestry overlay by the super-peers. However, this approach requires that every local network in the system must have at least one super-peer and the network must be configured in such a way that the super-peers can intercept messages from all local peers. Such a requirement may be a serious obstruction in the system deployment.
The second method relies on the use of predefined names in the Domain Name System (DNS) to identify super-peers. Every super-peer, when elected, binds its address to a fixed name in the local DNS domain. A client can discover its super-peer by resolving the fixed name in its own local DNS domain. If no super-peer is found, the client can become a super-peer itself. However, this approach again requires that at least one super-peer must be created for every local DNS domain in the system, and furthermore, super-peers must be allowed to alter their DNS domains in order to register.
Amongst the first P2P systems that introduced super-peers were file-sharing applications. In these systems, each peer specifies a collection of files that it agrees to share with other peers and each peer can download potentially any file made available by other peers in the system. Most file-sharing systems also provide a search facility that allows peers to discover files shared by other peers that satisfy certain search criteria. In Napster, search was provided by a centralised server that kept track of all files hosted by peers in the system. However, most P2P file-sharing systems provide a decentralised search facility. In these systems, a peer performs search by generating a search query. The query is propagated in the overlay, and peers that store files that match the query reply back to the searching peer.
Early file-sharing systems, such as the first versions of Gnutella, commonly used flooding to propagate queries between peers. Other search techniques often used include random walks, Breadth First Search (BFS), Depth First Search (DFS), and iterative deepening [199,109,187]. However, these search strategies, sometimes called blind search techniques, have the drawback that they need to disseminate search queries potentially to all peers in the system in order to find matching results. As a consequence, when the size of the system grows and more search queries are generated, an average peer receives more search messages. At certain point, the system does not scale as the search overhead becomes prohibitive.
Gnutella addressed this problem by restricting the maximum number of times a query could be forwarded, limiting the scope of search queries. However, this introduced the problem known as the search horizon - peers were able to discover only a subset of files available in the system. Furthermore, Gnutella suffered poor performance due to the fact that the slowest peers in the network easily became overloaded, as the system did not adapt its topology to the peer capabilities.
In KaZaA, peers are divided into two classes - high-capability super-peers (called supernodes) that handle search, and ordinary peers that act as their clients. Each super-peer maintains an index of all files stored on its clients. The search protocol in KaZaA is called FastTrack. A client performs search by submitting a search query to its super-peer. The super-peers disseminates the query to other super-peers, and as it receives results from other super-peers, it forwards the results to the client.
This design has two advantages over traditional search algorithm for P2P systems. First, it limits the number of peers that participate in the search protocol, as search messages are exchanged between super-peers only. This way, search is quicker, less expensive, and more scalable. Second, it prevents low-capability peers from being overloaded, and further improves search performance, as search is handled by selected, high-capability peers.
However, it is not precisely known how KaZaA elects super-peers, since it is a proprietary application and its specification and source code are not publicly available. Based on reverse engineering, it is believed that peers in KaZaA decide to become super-peers using local knowledge about their own characteristics, such as bandwidth, processing power, unrestricted access to the Internet and availability . Furthermore, becoming a super-peer is voluntary, as each peer has the option to suppress the super-peer functionality.
As KaZaA became popular and the super-peer approach proved valid, super-peers were also introduced in Gnutella version 0.6 . Like KaZaA, Gnutella 0.6 used super-peers (called ultrapeers) for indexing files stored by clients (called leaves) and for handling search. The introduction of super-peers reduced the load on the lowest-performance peers and improves the scalability of the Gnutella network.
Gnutella 0.6 divides leaves into two categories: ultrapeer-capable and ultrapeer-incapable. The distinction is based on minimum performance requirements. The Gnutella 0.6 protocol  specifies that a peer is capable of becoming an ultrapeer if it has a non-firewalled connection to the Internet (allowing incoming TCP connections and UDP packets), minimum of 20 KB/s downstream bandwidth and 10 KB/s upstream bandwidth, an operating system that can handle large numbers of simultaneously open sockets, such as Linux, Windows 2000/NT/XP and Mac OS/X, and a sufficiently high uptime.
Every ultrapeer can accept up to 32 connections from leaves and up to 30 connections from other ultrapeers. In order to become an ultrapeer, a leaf peer must meet the ultrapeer requirements and must connect to an ultrapeer with at least 27 clients (90% of maximum 32). If an ultrapeer-capable leaf connects to an ultrapeer that has fewer than 27 clients, it connects as a leaf; otherwise, it connects an ultrapeer .
This way, a new ultrapeer is created when an existing ultrapeer has utilised 90% of its capacity. Generally, ultrapeers are created when new peers are added to the system, and ultrapeers are removed when peers leave the system. Moreover, every Gnutella user can force its peer to act as an ultrapeer or a leaf, disabling the election algorithm.
An interesting feature of the Gnutella 0.6 protocol is that it is backward compatible with earlier versions. For this reason, it is possible for ultrapeers and leaves to coexist in one Gnutella overlay with peers that do not distinguish between super-peers and clients and connect to all of them in the same manner.
eDonkey is another P2P file-sharing application that introduced super-peers, called eDonkey servers, for handling search [188,75]. eDonkey servers do not share any files and do not initiate downloads, but index files stored on their clients and enable search. Every peer in the eDonkey network is eligible to setup a server, and the decision is made manually by each eDonkey user.
Yang and Garcia-Molina  analyse the performance of super-peer based file-sharing networks and give practical guidelines for designing such networks. They investigate the relationships between the number of super-peers in the network, load on super-peers, and search performance, in order to find the formula for an optimal system configuration. Their model is specific to file-sharing systems and is not easily generalised to other application areas. Moreover, they do not address the problem of super-peer election and super-peer topology maintenance, as they only analyse static properties of super-peer networks.
Skype [15,67] is a P2P telephony system that enables voice communication between users using computers connected through a wide-area network. It uses super-peers (called supernodes) for relaying traffic between firewalled peers.
Peers are classified into two categories: firewalled peers, located behind a firewall or Network Address Translator (NAT), which usually have a private IP address and cannot receive incoming TCP connections and UDP packets, and non-firewalled peers, which have full access to the network and can accept all connections. Super-peers are elected amongst non-firewalled peers only.
Depending on the type of the calling peer and the callee, four scenarios are possible. If both the caller and the callee are non-firewalled, they can communicate directly. Similarly, if the caller is firewalled but the callee is non-firewalled, the caller can establish a direct connection with the callee.
Every firewalled peer, in order to receive phone calls, maintains a permanent connection with a super-peer. If a non-firewalled peer attempts to call a firewalled peer, it first contacts the super-peer associated with the callee, the super-peer then notifies the callee, and the callee initiates a connection to the caller. When the connection has set up, the caller and the callee can communicate directly.
In the last possible scenario, where both the caller and the callee are firewalled, the connection procedure consists of two steps. First, the caller connects to the super-peer of the callee, and synchronised by the super-peer, the caller and the callee attempt to establish direct connectivity using STUN [161,160], a NAT traversal protocol. If this step fails, peers fall back to the TURN protocol , where all messages between the caller and the callee are relayed by the super-peer .
Although the Skype protocol specification and source-code are not publicly available, it is believed that peers are promoted to super-peers in Skype if they are non-firewalled and have a high amount of bandwidth .
Juxtapose (JXTA)  is a network programming and computing platform, created by Sun Microsystems, specifically designed to be the foundation for creating P2P systems. JXTA standardises a common set of protocols that allow groups of hosts to establish P2P overlay networks. JXTA can be implemented on top of TCP/IP, HTTP, Bluetooth, HomePNA, and many other transport-layer protocols.
JXTA uses two types of super-peers, called rendezvous peers and relay peers .
Rendezvous peers are peers that enable search. They cache so-called advertisement indices, i.e., pointers to client peers that advertise particular resources, and act as directory services. Rendezvous peers also connect to each other and form a semi-structured super-peer network, where each rendezvous peer maintains a loosely-consistent list of all other rendezvous peers in the system. Rendezvous peers forward search queries between each other using a limited-range rendezvous walker algorithm .
Relays are used for bridging peers that do not have direct physical connectivity, for example due to firewalls or NAT. Like rendezvous peers, relay peers connect to each other and maintain loosely-consistent lists of all relay peers available in the system. A routing algorithm is used to establish a connection between any two peers in the system through a series of relays.
Furthermore, JXTA uses two sets of peers that act as permanent rendezvous and permanent relays, called seeding rendezvous and seeding relays. The seeding peers are used for bootstrapping peers that join the overlay.
The super-peer election algorithm in JXTA is not fully specified. According to , ``any peer can become a rendezvous peer assuming it has the right credentials''. Similarly, ``any peer may become a relay peer assuming it has the right level of credentials and capacity (bandwidth, low churn rate and direct connectivity)''. It is up to the higher-level application to decide on the minimum capacity and credentials required for rendezvous and relay peers. The JXTA white paper specifies also that ``if none of the seeding rendezvous is reachable after a tunable period (5 minutes), the edge [i.e., client] peer will try to become a rendezvous (if it has the right credential)'' . Seeding rendezvous and seeding relays must be hard-coded in the application.
Kleis et al.  propose the use of Lightweight Super-Peer Topologies (LST) for routing in P2P networks. Their approach is based on Yao-Graphs and the Highways scheme, which allow the construction of Euclidean minimum spanning tree. Such a tree structure can be used for efficient multicasting. LST relies on super-peers, however, Kleis et al.  do not specify any super-peer election mechanism. They only mention that a super-peer ``should have enough resources to serve other peers'' and ``should be reliable in the sense that it is not joining and leaving the P2P network frequently''. Furthermore, ``trust and security incentive schemes could be layered in this election process.''
Another example of a P2P system that does not specify any super-peer election algorithm is the Supernode Based Peer-to-Peer File Sharing System (SBARC) . SBARC improves the routing performance in Pastry DHT  by routing messages through high-capability super-peers. Xu and Hu describe a routing algorithm and a data caching scheme that take advantage of super-peers, however, they do not elaborate on the super-peer election mechanism. The following paragraph describes their super-peer election criteria.
The basic requirement is a high bandwidth connection. [...] The second criteria is it [i.e., super-peer] must have enough computation power because supernodes need to deal with most system workloads. [...] Third, supernodes can not join/leave system frequently, otherwise its efficiency will be greatly reduced. Also it is helpful if supernodes have large amount of storage space. 
It is not specified what amount of available bandwidth or computational power or storage space is required for a super-peers, and no algorithm is given that could calculate such minimum super-peer requirements. Furthermore, it is not explained how to estimate peer stability in order to elect super-peers that do not join or leave the system frequently.
A very similar approach is described by Zhu et al. in , where the routing performance in a DHT is improved through the introduction of super-peers. However, as in , Zhu et al.  focus on routing strategies and do not describe any super-peer election algorithm. They only specify the following criteria for super-peers: ``significant processing power, high bandwidth, high availability, and large amounts of memory and storage space''. The details of the super-peer election are not discussed ``due to space constraints.''
Scalable Unstructured P2P System (SUPS)  is another system that relies on super-peers but does not contain any super-peer election mechanism. The main goal of SUPS is to produce a low-diameter and balanced super-peer topology. A neighbour selection algorithm is shown, inspired by the theory of random graphs, that connects super-peers in such as way that the topology diameter is and the average super-peer degree is minimised, given a system with super-peers.
However,  does not address the problem of super-peer election. It only mentions that ``super-peers are selected from normal peers that have high bandwidth, high computing power, a long resistance time, and a low likelihood of failure [...] the detailed choice of super-peer remains as a separate research topic not addressed in this paper.''
Singh and Haahr  propose a neighbour selection algorithm, inspired by the Schelling's model, that constructs and maintains hub topologies. A hub topology can be defined as P2P topology, where selected peers, called hubs, are highly connected with each other, while ordinary peers are connected to both hubs and other ordinary peers. Hub topologies are considered more robust than classic super-peer topologies, since in the former, a super-peer failure causes an isolation of its clients, while in the latter, ordinary peers are connected to multiple super-peers and can handle hub failures.
The original Schelling's model is a sociological model proposed to explain the existence of segregated neighbourhoods in the United States. In this model, each agent defines its satisfaction condition, and changes its neighbourhood whenever the satisfaction condition is not met by performing an adaptation procedure.
The customised Schelling algorithm that creates hub topologies is shown in Figure 2.2. A hub is satisfied if it is connected to at least one, but not more than , other hubs, where is a system constant (lines 1-3). If the number of hub neighbours, , is above , the hub drops one hub connection (lines 5-9). If is zero, the hub connects to another hub discovered by performing a Depth First Search (lines 10-13). An ordinary peer is satisfied if it is connected to at least one hub (lines 14-17). If this condition is not met, the peer discovers and connects to a hub (lines 23-24). Furthermore, if its number of neighbours is above , where is a system parameter, it removes a random neighbour (lines 18-22).
The algorithm does not determine which peers act as hubs. According to , hubs are ``high-availability and high-capacity peers''. They can be be exploited in a similar manner as super-peers in traditional super-peer systems, for example to perform resource-intensive tasks such as ``maintaining a directory of resources in the network and processing search queries''. The main advantage of the Schelling algorithm is its simplicity and robustness. However, it does not address the super-peer election problem.
All systems described in this section employ very simple approaches to super-peer election and management, which are not likely to produce optimal, or close to optimal, system configurations. Centralised approaches introduce reliability and scalability concerns, and are in obvious contradiction with basic principles of P2P systems. Manual super-peer selection is difficult due to the large scale, dynamism, and complexity of P2P systems. A global system administrator is not likely to possess sufficient knowledge about the system to select an optimal super-peer set. A local user at each peer is even less likely to obtain such a knowledge about the system.
Most of the reviewed systems specify criteria for super-peer election. In most cases, these criteria are described as a high amount of available bandwidth, storage space, processing power, and a long session time. However, many systems do not provide any mechanism for the estimation of peer properties in order to apply these criteria. For example, it is not obvious how a peer can estimate its remaining session time. More importantly, given a criteria such as a high amount of property , many systems do not specify precisely what high is.
In many reviewed systems, the super-peer criteria is defined as static threshold, e.g., , such that all peers with property above the threshold automatically become super-peers and all the remaining peers become clients. This simple approach has the advantage that every peer can make the decision about becoming a super-peer locally, without any communication with other peers. In some cases, the threshold may be directly defined by the requirements of a higher-level application.
However, this approach has a serious drawback. A static threshold does not allow the system to control the number of super-peers in dynamic populations of peers. If peer properties change, the super-peer sets changes accordingly. This can lead to extreme cases, where no super-peers are elected, if all peers fall below the threshold, or where every peer is a super-peer, if all peers are above the threshold.
In many P2P applications, the number of super-peers is critical for the system performance. For example, in file-sharing applications, the number of super-peers must be significantly smaller than the total number of peers in order to reduce the amount of search traffic to an acceptable level. At the same time, the number of super-peers must be large enough to be able to handle the load associated with serving clients. Often, the optimum number of super-peers in the system depends on the current load, e.g., the number of user requests. A static threshold does not allow the system to adapt the size of the super-peer set to the current demand.
Setting a threshold that limits the number of super-peers to a desired level requires a global knowledge of peer characteristics. In many systems, it is assumed that peer properties follow a certain distribution. Such a distribution can be estimated using domain-specific knowledge, or measured experimentally in deployed systems.
However, there are a number of reasons why the distribution of peer characteristics may change. A natural change may occur over time, as new technologies are developed and the Internet evolves. According to the Moore's Law , the capabilities of computers, measured as the number of transistors placed on an integrated circuit, is increasing exponentially, doubling approximately every two years. The global network conditions and the general behaviour of users are another two important and unpredictable factors. Peer properties may depend on the time of the day, day of the week, and external events which may generate flash crowds. Peer properties may also change when the system is deployed in a different environment, for example in a different country, or exposed to a different group of users.
Figure 2.3 shows a sample super-peer election scenario with a static threshold. Each peer is assigned a capacity value, and the super-peer election threshold is 10. In Figure 2.3(a), four super-peers are elected, and the ratio of clients to super-peers is balanced. In Figure 2.3(b), the capacity of all peers is reduced by a half. As the threshold remains static, only one super-peer is elected in the system. Such a single super-peer is likely to become overloaded. In Figure 2.3(c), the capacity of all peers is twice increased. Nearly all peers in the system become super-peers and the system is again likely to becomes inefficient.
To conclude, the systems reviewed in this section clearly show that there are many applications scenarios where the performance and scalability of a P2P network can be significantly improved by introducing super-peers. Moreover, as the reviewed systems offer relatively simple super-peer election mechanisms, there is a general need for more sophisticated techniques that would allow peers to better control the number of super-peers in the system and to adapt the super-peer set to the changing system conditions. Such approaches to super-peer election are covered in the next three sections.
The systems described in this section rely on a common mechanism of peer grouping. In these systems, the population of peers is divided into disjoint groups, based on peer properties such as physical location or data semantics, and a super-peer, or multiple super-peers, are elected within each group independently.
The main advantage of this approach is that the general problem of super-peer election in the system is decomposed into a number of local election subproblems within groups, which are easier to solve since groups are much smaller than the total system size. In the most straight-forward case, the characteristics of all peers in a group are directly compared with the existing super-peer. Furthermore, systems that belong to this class usually impose restrictions on the super-peer topology, such as a certain maximum distance between a super-peer and a client, according to some metric, in order to satisfy application-specific requirements.
In most approaches, super-peers are created on demand, as peers join the system. The general structure of the join procedure consists of the following steps, as shown in Figure 2.4. First, the joining peer, , determines the group, , that it belong to (line 2 in Figure 2.4), and contacts the super-peer, , responsible for group (line 3). These steps may involve interactions between and other peers in the system, and are system-specific. It is assumed that every peer must know at least one other peer that already participates in the system in order to join.
If no super-peer exists in group , peer becomes a new super-peer in group (lines 4-5). If a super-peer is found, peer compares its characteristics with in order to decide which of the two peers is more suitable to serve as a super-peer for the group (line 7). If peer has higher capabilities than , it joins the group as a client of (line 8). Otherwise, it becomes a new super-peer and swaps its roles with . In the latter case, all existing clients are transferred from to and besoms a client of (lines 10-12).
The join procedure is graphically shown in Figure 2.5. An initial configuration consists of four groups, three super-peers and eight clients (a). If a new peer enters an empty group, it becomes a new super-peer in this group (b). If a peer joins a non-empty group, it connects to the existing super-peer in this group (c), compares its capabilities with the super-peer using some application-specific metric, and potentially swaps its role with the super-peer (d).
Apart from peer arrivals, the super-peer election algorithm is run whenever an existing super-peer becomes unavailable. Often, the super-peer maintains a list of all peers in the group and periodically broadcast this list to all its clients. The list is ordered based on the characteristics of individual peers, from the highest-capability peer to the lowest-capability peer. In case of a super-peer failure, an election is performed and the first peer on the list becomes a new super-peer. A number of variations of the super-peer election algorithm exist, for example where multiple super-peers are elected within each group, and where groups are dynamically split and merged with each other during the course of the system's operation. If peer characteristics are dynamic, the super-peer may occasionally swap with one of its clients.
The systems reviewed in this section are divided into three subcategories, where peers are organised into groups based on physical location, semantic description, and administration domain (in Grids).
In a number of systems, peers are organised into groups based on their physical location, for example using information about their country, ZIP code, or ISP. Many systems also introduce peer distance metrics, for example defined as the communication latency or number of IP hops on the route between two peers. In these systems, the goal is usually to assign clients to super-peers that are close to them according to the distance metric.
Many systems also use Internet coordinate systems, such as GNP  and Vivaldi , where every peer is positioned in a -dimensional virtual coordinate space. The coordinates of each peer are calculated by measuring the distance of each peer to well-known landmark hosts. If the landmark nodes are evenly distributed in the system, the Euclidean distance between peers in the virtual coordinate space approximates the distance between these peers in the physical network.
Figure 2.6 shows an example of a two-dimensional coordinate space generated using two landmark nodes, and . Peer measures its distance to and and obtains as its coordinates . Similarly, peer determines its coordinates as . The distance between and can be estimated as .
A -dimensional virtual space can be mapped onto a lower-dimensional space (one-dimensional space in particular) using a Space Filling Curve (SFC), such as the Hilbert space-filling curve and the -order curve. These curves have the locality-preserving property. Points that are close in the original -dimensional space are mapped onto points that are also close in the target space.
Crown  is a distributed resource management protocol, similar to a distributed hash table. It allows a P2P system to distribute resources, such as files, programs, and services, uniformly between peers, and it enables an efficient resource lookup protocol. Each peer in Crown is assigned two identifiers: a peer identifier, which is unique, and a group identifier, which is potentially shared with other peers. A peer identifier is defined as the SHA-1 hash of the peer's IP address, and a group identifier is defined as the SHA-1 hash of the first bits of a peer's IP address. It is assumed that peers which share a common IP address prefix are likely to be located in the same local network or autonomous system, and hence, the connections between such peers are likely to have low latency and high throughput.
Peers that belong to each group elect a super-peer. The details of the super-peer election algorithm are not described in , but it can be expected that a similar algorithm to the one described in Figure 2.4 can be applied in Crown. The criteria for super-peer election in Crown are: high peer bandwidth, high availability (uptime), large computational power, and a low load. Additionally, users are allowed to manually select which peers become super-peers.
Super-peers connect to all clients in their respective groups and to other super-peers in the system, forming a ring topology that spans all peer groups. Super-peers run a decentralised lookup protocol, which allows them, and their clients, to locate resources available in the system.
A similar approach to super-peer election is proposed in the Peer-to-peer Asymmetric file Sharing System (PASS) , a P2P file-sharing application. Peers in PASS are divided into multiple areas based on their geographical location, using information such as ZIP codes or administrative network domain names. PASS does not impose any particular division scheme, but it assumes that the latency between peers that belong to the same area is low compared with the entire system.
Each group of peers independently elects its own super-peers. The super-peers maintain a distributed directory of system data and handle search. As in other file-sharing applications, search requests are propagated between super-peers only and are not forwarded to clients in order to reduce search overhead. Furthermore, one of the super-peers in each area is elected as a Representative SuperNode (RSN), which handles inter-area communication. It is assumed that due to the location-aware division of peers into groups, operations executed locally in a group, such as super-peer elections, directory updates and lookups, are performed at low cost.
As the performance of super-peers (particularly RSNs) is critical for the system's operation, the super-peers are selected from the most stable peers in the system. The first peer that enters an area, becomes a super-peer and RSN for this area. If an existing super-peer becomes overloaded, e.g., due to a high number of clients, the super-peer selects one of its clients, promotes it to a super-peer, and splits the remaining clients between itself and the new super-peer. Furthermore, each super-peer appoints a backup peer amongst its clients; when a super-peer leaves, it is replaced by its backup peer.
A very different approach is followed in PoPCorn , which assumes an -dimensional Euclidean space generated using an Internet coordinate system such as GNP and Vivaldi described above. PoPCorn elects super-peers and distributes them evenly in the coordinate space. The distribution criteria, dispersal, is achieved by maximising the sum of inter-node distances between pairs of super-peers.
Super-peers are elected using tokens exchanged between peers using a repulsion model. The initial token placement is random. Every peer that holds a token performs a scoped gossip with its neighbours in order to notify them about tokens in their vicinity. When a peer receives a gossip message, it updates its model of nearby tokens and calculates a combined repulsion force of these tokens. If the repulsion force exceeds a threshold, , the token is passed to another peer, according to the force direction. If a token stays on a peer for a certain number of time steps, , this peer becomes a super-peer. Each peer calculates its threshold based on its capabilities. Peers that are better qualified to serve as super-peers have higher values of the threshold , and hence, are more likely to be elected super-peers.
The algorithm description in  is very brief and is missing many details. In particular, it it not entirely clear how peers select their neighbours and what topology they maintain, how they perform gossip, and how they discover peers that are closest to the virtual locations defined by the repulsion forces. No evaluation is shown in  and there is no evidence that the algorithm generates the desired system configurations. Furthermore, it is not obvious how to deal with peer failures and departures, and in particular, how to address the problem of lost tokens.
Wolf and Merz  consider a similar problem of constructing a super-peer topology where the distance between super-peers and their clients, as well as between connected super-peers, is minimised. The motivation for this problem is to minimise the total communication cost in the system. Wolf and Merz state that the problem is NP-difficult, but show a heuristic based on evolutionary algorithms enhanced with local search that generates close-to-optimal super-peer topologies. However, the proposed heuristic algorithm requires a global view of the entire network and as such cannot be directly deployed in a P2P system.
A number of P2P systems elect super-peers based on peer semantics, for example using tags, text descriptions, or XML descriptions, or ontology concepts associated with peers. A number of standards have been proposed for the semantic description of peers, which include the Resource Description Framework (RDF), Web Ontology Language (OWL), and RDF Schema. Many systems also introduce formal metrics that measure the semantic similarity between peers. Furthermore, virtual coordinate spaces can be generated based on peer semantic properties, as in location-based systems.
Tang et al.  describe an approach where peers are mapped onto a coordinate space using the Vector Space Model (VSM). In this model, every peer is associated with a set of text documents, and a fixed set of terms is used to calculate peers' coordinates. The 'th coordinate of peer is defined as , where is the frequency of the 'th term in 's documents, and is the total frequency of the 'th term in all documents. The distance between peers and , assigned coordinates and is given as .
Figure 2.7 shows a sample two-dimensional coordinate space based on terms 'aaa' and 'bbb'. Peer is described by terms 'aaa', 'bbb', and 'ccc'. Given the global frequencies of 'aaa' as 0.3 and 'bbb' as 0.3, both coordinates of peer are equal to . Similarly, peer , associated with terms 'aaa', 'bbb', 'ddd', and 'bbb', is assigned coordinates and , and peer , with keywords 'aaa', 'ccc', and 'bbb' has coordinates and .
VSM has the disadvantage that it does not recognise term synonyms and generates coordinate spaces with large dimensions. These problems are addressed in the Latent Semantic Indexing (LSI) algorithm , which projects high-dimension vectors generated by VSM onto low-dimension semantic subspaces, selecting the most relevant terms for each peer.
Another approach is proposed by Smidth and Parashar , where every peer is described by a sequence of keywords and positioned in a -dimensional coordinate space. The 'th coordinate of a peer is defined by the 'th keyword of this peer. Each keyword is treated as a -digit base- number. Keywords that are longer than digits are truncated by removing their last letters. Keywords shorter than letters are padded with a special zero character.
For example, if keywords were constructed over the plain Latin alphabet with 26 letters, each keyword would be a base-26 number, with 'a' representing 1, 'b' representing 2, and so on. Figure 2.8 shows a sample two-dimensional coordinate space ( ) with three-letter keywords ( ) over a 26-character alphabet ( ). Peer , associated with keywords 'abc' and 'ddd' is assigned coordinates and . Similarly, peer with keywords 'kk' and 'sss' has coordinates and , and peer with keywords 'abcde' and 'pqr' has coordinates and .
A number of other peer distance metrics based on peer semantics are described in .
Edutella  is a file-sharing P2P network where every shared file and every peer is described using RDF schemas and RDF metadata. As many other file-sharing networks, Edutella uses super-peers for handling search in order to reduce the overall search cost and overhead [134,133]. Super-peers are arranged in a hypercube topology, maintained using the HyperCuP protocol , which enables efficient routing of search queries, and is also robust to multiple peer failures. In a hypercube with peers, the path between any two peers has at most overlay hops, and the minimum number of peers that need to be removed to partition the network is , while each peer maintains neighbours. A hypercube overlay also allows efficient broadcast and multicast.
In Edutella, every peer is described using a combination of structuring concepts, defined through a set of global ontologies, known to all peers in the system [169,108]. These structuring concepts determine the coordinates of each peer in the hypercube. An ordinary peer connects to the semantically closest super-peer. A super-peer joins the hypercube overlay using the HyperCuP protocol. This way, peers that have similar semantic characteristics are located close to each in the system topology.
According to , ``super-peers provide the necessary bandwidth and processing power to enable efficient and reliable query processing and answering''. However, no algorithm is described that would allow the selection of such high-performance peers. In all algorithms described in [169,134,133,108], it is assumes that every peer joins the system either as a super-peer or an ordinary peer and it is not explained how a peer decides on its role.
The Repository of Objects with Semantic Access (ROSA)  is an e-learning system, whose purpose is to support teachers in the preparation of didactic materials. ROSA-P2P  is a P2P network that provides a distributed storage and search facility for ROSA.
As in Edutella, ROSA-P2P organises peers into groups based on their semantic characteristics. Each group consists of peers that have similar subjects and localisation. Furthermore, each group elects a super-peer, which is responsible for storing data and handling search. The super-peer election criteria are based on peer characteristics such as stability, bandwidth, processing power, available memory and storage capacity. In order to act as a super-peer, a peer needs to satisfy certain performance requirements. Additionally, peers that belong academic institutions automatically become super-peers.
Each peer can declare one of three preferences: to refuse being a super-peer, to accept becoming a temporary super-peer, or to accept being a permanent super-peer. In the first case, the peer never becomes a super-peer. In the second case, the peer becomes a super-peer only if there is no other peer in its group that agrees to be a permanent super-peer. In the third case, the peer is elected a super-peer if it is the highest-capability peer in its group.
When a peer joins the ROSA-P2P network, it first determines the group it belongs to and contacts the super-peer in this group, as in the general scheme shown in Figure 2.4. The list of available groups and corresponding super-peers is obtained from a centralised ROSA portal. The further steps depend on the preferences of the current super-peer and the joining peer. If no super-peer is found in the group, the joining peer is requested to become a super-peer. If it rejects the request, it is either assigned to a different group of is not allowed to join the system. If a temporary super-peer is found in the group, and the joining peer agrees to become a permanent super-peer, the existing super-peer and the joining peer swap their roles. In all other cases, the joining peer becomes a client of the existing super-peer in the group.
Each super-peer periodically compares the characteristics of its clients and selects the highest-capability peer in its group that agrees to serve as a super-peer. If the selected peer has better characteristics than the existing super-peer, the roles of the two peers are swapped. Moreover, each super-peer defines a limit on the maximum number of clients it can simultaneously support. When the number of peers in a group approaches the limit, the super-peer assigns a new super-peer from its clients and splits the group.
Even though ROSA-P2P has been thought as a decentralised P2P system, it is partly centralised, as it relies on a centralised ROSA portal, which maintains the information about the current structure of groups in the system. The centralised portal allows all peers in the system to achieve a synchronised view on the peer groups.
Qiao et al. [147,146] propose a P2P system where peers are arranged using a globally-known taxonomy of concepts. Such a taxonomy is a tree, or a hierarchy, where vertices represent concepts, and edges represent relationships between concepts. Every peer is characterised by a number of concepts, but the taxonomy tree is maintained by super-peers only. Each super-peer is responsible for one or multiple subtrees of the taxonomy tree, and maintains an index of all peers that are characterised by the concepts in these subtrees. A super-peer together with associated clients form a semantic cluster.
Super-peers are created on demand. Each concept is associated with a load, defined for example as the frequency of queries related to the concept, or the number of clients or data items associated with the concept. Moreover, every super-peer has a maximum capacity that defines the maximum load it can handle. When the load in a cluster exceeds the capacity of the super-peer, the super-peer selects one of its clients as a new super-peer and splits the cluster.
There are two algorithms for splitting clusters. In the Semantic First Clustering Algorithm (SFCA), a cluster always contains a single subtree of the global hierarchy. When a cluster is divided, the super-peer selects a subtree with load approximately equal to half of the current cluster's load. The resultant clusters always have a parent-child relationship.
The Load Balance First Clustering Algorithm (LBFCA) allows more flexible cluster splitting in order to balance the load between clusters more evenly. A cluster is divided into two subclusters according to the following three rules: (i) if the cluster consists of a single semantic tree, the tree is split into one or multiple subtrees and these subtrees are divided between subclusters; (ii) if the cluster consists of multiple subtrees, these trees are divided between subclusters; (iii) if the cluster consists of a single concept, the load associated with this concept is shared between subclusters.
Figure 2.9 shows a sample hierarchy consisting of concepts A, B, ...K. Each number represents the load associated with a concept. The maximum super-peer load is 20. Initially, the entire hierarchy belongs to one cluster X. When the load associated with every concept is doubled, the cluster is split into clusters X and Y using rule (i) in LBFCA. Next, the load in cluster Y is doubled again, and cluster Y is split into Y and Z using rule (ii). In the last scenario, the load in cluster Y is doubled once again, and cluster Y is divided into Y and Y' using rule (iii), so that the single concept A is shared by both clusters Y and Y'.
Although SFCA and LBFCA offer a flexible and adaptive division of the semantic space between super-peers, they raise a number of questions. Both algorithms do not specify how clusters are merged, for example when the load in the system decreases. Furthermore, they require a static and globally known hierarchy of concepts. It is not obvious how such a hierarchy is created and what authority maintains it. One approach would be to adapt a centralised server, as in ROSA-P2P, but this would conflict with the system's decentralisation.
Finally, it can be questioned if a hierarchy is a generally a suitable structure for P2P systems. If a super-peer hierarchy is used for routing or searching, as described in , the top-level super-peers may easily become overloaded. Moreover, the failure of the top-level super-peers, potentially due to a malicious attach, may disable the entire P2P system.
A large-scale Grid can be viewed as a network interconnecting small-scale, proprietary Grids, where each of such small-scale Grids is a network composed of hosts located within one administrative domain, called a Virtual Organisation (VO) [116,175]. Grid systems and P2P systems, although considered distinct areas, share a number of common characteristics. In particular, both Grid and P2P systems consist of large numbers of nodes physically distributed between different geographical locations, which cooperate with each other in order to provide system-level services.
The main differences between Grids and P2P systems are their scale, dynamism, and the degree to which these systems are controlled. In the currently largest deployed Grids, the numbers of participating nodes are on the order of thousands , while in the most popular P2P systems, such as Skype and KaZaA, the numbers of simultaneous users have already exceeded millions [35,102]. Furthermore, Grid nodes are more stable, as they are usually run by large enterprises and public institutions and are often hosted on dedicated servers. In P2P networks, nodes can freely join and leave the system. Grid systems are also more tightly controlled by system administrators and are more secure than P2P systems.
Mastroianni et al. [116,117] propose the adoption of super-peers in large-scale Grids in order to improve the efficiency of resource discovery and membership management. In their approach, super-peers maintain metadata about their clients and run a search protocol in a similar manner as in P2P systems. The authors believe that super-peers can be selected manually in Grids, unlike in P2P systems, due to the fact that large-scale Grids are composed of smaller, locally-managed Virtual Organisations. In each VO, local administrators can select a set of the most ``powerful'' super-peers using local knowledge available within the organisation.
A more self-managing approach to super-peer election is proposed in the Grid Activity Registration, Deployment and Provisioning Framework (GLARE) . Here, the Grid is divided into multiple peer groups, as in , and each group elects a super-peer. Super-peers are used for relaying search queries and caching search results. However, unlike in , peer groups are not defined by VOs, but are rather created using the Globus Toolkit 4 (GT4) built-in hierarchical aggregation and indexing mechanism.
One member of each peer group is selected as an election coordinator using the GT4. In order to initiate an election, the coordinator notifies all peers in the group, and peers reply back to the coordinator with their static characteristics, such as processor speed, amount of memory, uptime, and site name. The coordinator ranks the peers and selects the highest-capability peer as a super-peer. In case the group consists of a large number of peers, the coordinator may decide to split the group by appointing multiple super-peers and splitting the group members equally between the super-peers. If a super-peer fails, the first peer that discovers the failure selects the highest rank peer as a new coordinator, who initiates a new election. If the majority of peers in the group confirm that the current super-peer is not available, the highest rank peer becomes a new super-peer.
One of the main advantages of dividing a P2P system into peer groups is that many system-level problems can be decomposed into simpler to solve group-level subproblems. In particular, the general problem of super-peer election in the system can be decomposed into simpler subproblems of super-peer election in each group. Additionally, if super-peers are elected locally in each peer group, they are usually located close to their clients in terms of physical distance, communication latency, bandwidth, semantic similarity, or other metrics, depending on the mechanism used to construct peer groups. Such a proximity-aware organisation of super-peers and clients in the overlay topology is often required by higher-level applications.
However, the use of peer groups introduces the problem of group creation and maintenance in a decentralised P2P system. Every peer needs to know the group it belongs to.
In the simplest case, a peer determines its group based on its static properties, such as IP address, ZIP code, ISP, administrative domain, or position in some global semantic ontology. This has the advantage that every peer can independently decide on its group, without negotiating with other peers. Moreover, in many systems, every peer group elects exactly one super-peer. This simplifies greatly the super-peer election, as it can be performed using a single coordinator in each group, which collects complete information about every peer in the group and selects a super-peer.
However, such simple approaches have a number of drawbacks. First of all, they do not allow the system to actively control and adapt the number of super-peers to changing system conditions. The number of super-peers is determined by the current number of groups and cannot be tuned at runtime. Furthermore, in many systems, the maximum number of groups is fixed and is determined by the number of possible values of peer properties, such as IP address prefixes, ISPs, ZIP codes, etc. If the total number of peers in these systems is low, groups have relatively few peers, and the overall ratio of clients to super-peers is low. Contrarily, if the number of peers is large, groups consist of large numbers of peers, and the ratio of clients to super-peers is high. For a certain system size, super-peers become overloaded. Inherently, such a system does not scale.
Figure 2.10 shows a sample system with four peer group. The capacity of each peer is marked by a number. In figure 2.10(a), the number of super-peers and clients is balanced. In figure 2.10(b), the overall number of peers is reduced, resulting in a very low client to super-peer ratio. In figure 2.10(c), the number of peers is increased, resulting in a high client to super-peer ratio and a high load on super-peers.
If the distribution of peers between groups is non-uniform, some groups may become highly populated by peers, while others may contain relatively few peers. For example, in systems where peers are assigned to groups based on their physical locations, it can be expected that groups corresponding to high-density, developed areas, such as cities, may contain more peers than other groups. In such cases, the load between super-peers is imbalanced. Moreover, if the distribution of peer capabilities between groups is also skewed, some groups may contain significantly more high-capability peers than other groups. If the goal of the system is to elect the globally highest-capability super-peers, such a skewed distribution may lead to an suboptimum super-peer election. An example of such a scenario is shown in Figure 2.10(a). The highest capability values are 18, 16, 15, and 14. However, the elected super-peers have lower capabilities, i.e., 18, 15, 11, and 9. Peers with capabilities 14 and 16 are not elected super-peers because they are clustered in one group with peer 18.
These shortcomings can be addressed in two ways. First, multiple super-peers can be elected in each group. However, this requires additional mechanisms for the synchronisation and coordination of multiple super-peers elected in one group, and dividing clients between super-peers in a group. Ultimately, as the system size grows to a large number of peers, the problem of multiple super-peer election in a group becomes equivalent to the general problem of super-peer election in a P2P system.
The second approach is to adjust groups in the system dynamically, for example by splitting and merging them. However, this requires dedicated mechanisms for the management of dynamic groups in the system. Every peer needs to be able to determine the group it belongs to, and super-peers need to achieve an agreement between each other on the structure of peer groups. As this is non-trivial in a decentralised system, some systems rely on centralised servers that manage information about peer groups. Furthermore, merging and splitting groups requires an additional algorithm that decides when and which groups to merge and split. In the extreme case, where groups are fully flexible and can be arbitrarily changed, the problem of group construction is equivalent to the general super-peer election problem.
This section describes the class of P2P systems known as Distributed Hash Tables (DHT). A DHT is a P2P system that distributes values, such as objects, chunks of data, and user requests, between peers in the system. Each value is associated with a key, and the system provides an efficient and deterministic mapping from keys to peers. DHTs support three operations: an insert operation that associates a given key with a given value and stores the key-value pair on a peer in the system; a lookup operation that retrieves the value associated with a given key; and a delete operation that removes from the system a given key together with its associated value.
Every peer in a DHT is assigned a unique identifier and is responsible for the maintenance of a part of the key space. Usually, a peer is responsible for the keys that are numerically closest to its own identifier.
The three operations, insert, lookup, and delete, require multi-hop routing between peers in order to access keys and values stored by the system. Typically, DHTs support routing from a peer to any other peer in the system in overlay hops, while every peer maintains neighbours, where is the number of peers in the system.
Chord [179,180] is one of the the earliest, most often cited, and most popular to date DHT systems. In Chord, every peer calculates its -bit identifier by applying the SHA-1 hash function to its IP address. Similarly, every key is assigned an -bit identifier generated by hashing this key. In some variations of Chord, the key's identifier is identical to the key itself and is obtained by hashing the value associated with this key. Due to the properties of the SHA-1 functions, both peer identifiers and key identifiers are scattered evenly in the identifier space, and the probability of generating duplicated identifiers for different peers or keys is extremely low.
The identifiers are ordered in an identifier circle modulo . The successor of identifier is defined as the first peer whose identifier is equal to or follows in the identifier circle. Similarly, the predecessor of is the first peer whose identifier precedes in the identifier circle.
The keys, and the values associated with them, are assigned to peers in the system using consistent hashing. The basic rule is that key is assigned to the predecessor of . This scheme has the property that keys are distributed evenly between peers, and the addition or removal of a peer in the system does not change the mapping of keys to peers significantly. In particular, in a system with peers and keys, with high probability, every peer is responsible for at most keys, and when a peer joins or leaves the system, keys are re-assigned between peers, and only between the joining or leaving peer and its neighbours .
Figure 2.11(a) shows an example Chord system with a 4-bit identifier space ( ) and nine peers. Since all operations are performed modulo , identifier 15 is followed by 0. Each identifier is assigned to the closest preceding peer, and hence, peer 0 is responsible for identifiers 0 and 1, peer 2 is responsible for identifier 2, peer 3 is responsible for identifiers 3 and 4, and so on.
Peers in Chord connect to their successors and predecessors by which they organise into a ring topology. In some variants of Chord, peers maintain lists of successors and predecessors, i.e., lists of peers that immediately precede and follow them in the identifier space, in order to improve the system's fault-tolerance. The ring topology enables a simple routing algorithm, where peers forward messages to their successors until each message reaches its destination. However, in this simple routing strategy, a message is forwarded on average between peers before it is delivered.
In order to improve the routing performance, peers maintain additional neighbours that act as ``shortcuts'' in the ring topology. These neighbours are selected using a finger table, which consists of identifiers. For a peer identified by , the finger table contains
where all arithmetic is modulo . In generality, the 'th entry in the finger table of peer is , where . The entries in the finger table do not change over time, since peer identifiers are constant. Figure 2.11(b) shows a sample finger table in the 4-bit Chord system. For peer 0, the table contains identifiers 1, 2, 4, and 8.
For each identifier in the finger table, each peer maintains a connection to the peer that is responsible for , i.e., to the predecessor of . Such a neighbour, associated with the 'th entry in the finger table, is called the 'th finger neighbour, or simply the 'th finger. Unlike entries in the finger table, finger neighbours depend on the system configuration and may change over time.
The maximum number of finger neighbours of a peer is . However, if the size of the identifier space (i.e., ) is significantly larger than the total number of peers in the system, it is likely that a peer is responsible for the first entries in its own finger table. In this case, the number of peer's finger neighbours is lower than . It can be shown that in a system with peers, the average number of neighbours per peer in Chord is .
Figure 2.11(c) shows finger neighbours of peer 0 in the sample Chord system. The neighbours, which are peers 2, 3 and 7, corresponds to identifiers 2, 4 and 8 in the finger table of peer 0, as shown in picture (b). Peer 0 has only three finger neighbours, as no neighbour is associated with the first identifier in the finger table, i.e., 1.
The structure of finger connections enables a very efficient routing algorithm, where a message can be routed between any two peers in the system in overlay hops. A peer routes a message addressed to by forwarding it to the neighbour that most immediately precedes in the identifier space. This way, each message is first routed over long (in terms of identifier distance) finger connections, and is gradually forwarded over shorter links, until it finds its destination.
Figure 2.11(d) shows a sample message routed from peer 0 to peer 10. Dashed lines represent peers' finger neighbours, and solid lines indicate the selected routing path. The message is first forwarded to peer 7, using the longest finger connection of peer 0, and is next forwarded to peers 9 and 10 using shorter-distance connections.
The algorithm that maintains the Chord topology, known as the stabilisation algorithm, as well as the details of the routing algorithm, are described in .
A number of other DHT systems have been proposed, including Pastry , CAN , Tapestry , Kademlia , Kelips , P-Grid , and Symphony . DHTs have been shown in both theoretical and experimental evaluations to exhibit good performance and stability in the presence of high peer churn, and scalability to millions of participating peers. In particular, DHTs usually provide routing between any peers in the system in hops, given the system size , and have an expected cost of insert, delete, and lookup operations of message transmissions. Viceroy  is the first system that achieved routing in hops with neighbours per peer. Kaashoek and Karger  show that the lower bound for routing in a DHT is hops per lookup request with 2 neighbours per node, and hops per lookup request with neighbours per node. These bounds are met in the Koorde system . A good comparison of several DHT systems can be found in  and .
Kad, based on Kademlia , is considered the largest DHT system that has been implemented and deployed in the Internet. As a part of eMule, a popular file-sharing application, Kad has been reported to support over one million simultaneous users . Bamboo , also known as OpenDHT, an open-source implementation of Pastry , has been shown to handle higher peer churn rates than other state-of-the-art DHT systems . It has been deployed on PlanetLab , a wide-area testbed for P2P systems.
An approach to super-peer election based on a DHT, called Scalable Supernode Selection (SOLE), is described in . The approach relies on the DHT functionality, but does not require any particular DHT scheme and can be customised to any DHT system. It allows a P2P system to elect and maintain super-peers, where is a global system constant.
The main idea of SOLE is relatively simple. The system defines super-peer identifiers, and every peer that is responsible for at least one super-peer identifier, according to the mapping provided by the DHT, becomes a super-peer.
The definition of the super-peer identifiers depends on the type of the DHT. If Chord is used as the underlying DHT, the super-peer identifiers can be defined as
where is the size of the identifier space. The 'th super-peer identifier is , for . This way, the super-peer identifiers divide the DHT identifier circle into approximately equal arcs, each containing identifiers ( or , to be precise). Each peer belongs to exactly one arc, and using simple arithmetic, peer can calculate the super-peer identifier that directly follows it in the DHT identifier circle. Furthermore, the peer can perform a DHT lookup in order to discover the super-peer responsible for . If is assigned to the peer itself, the peer becomes a super-peer. Otherwise, the peer becomes a client of the super-peer responsible for .
Figure 2.12(a) shows an example Chord system with four super-peers ( ) elected using SOLE. The super-peer identifiers are 0, 4, 8 and 12. Peer 0 is elected super-peer as it is responsible for super-peer identifier 0. Similarly, peers 3, 6 and 11 are elected super-peers as they are responsible for identifiers 4, 8 and 12, respectively. Figure 2.12(b) shows the same system with extra peers added. As the number of peers is increased, the super-peers are distributed more evenly in the identifier space. Figure 2.12(c) shows the same system with a fraction of peers removed. In this scenario, the number of super-peers is lower than , as multiple super-peer identifiers are assigned to the a single peer. In particular, both identifiers 4 and 8 are assigned to peer 3.
Lo et al.  describe variants of SOLE that use other DHT systems than Chord, such as Pastry and CAN. The general structure of the SOLE algorithm is shown in Figure 2.13. A super-peer periodically checks if it is responsible for its super-peer identifier (line 2). If the identifier is assigned to a different peer (line 3), the peer appoints as a new super-peer (line 4), transfers all clients to (line 5), and becomes a client of (line 6). A client performs a DHT lookup only when it is not assigned to any super-peer, for example when it is joining the system or when the client's previous super-peer has become unavailable (lines 9). The client determines peer responsible for the closest super-peer identifier (line 10), and depending on the result, it either becomes a super-peer (line 12), or connects to as a client (line 14).
Since the super-peer identifiers divide the key space into equal parts, and due to the properties of Chord, the distance from a peer to its super-peer is overlay hops. Hence, the lookup operation needed to discover a super-peer requires message transmissions, given peers in the system.
The number of super-peers elected by SOLE does not exceed , as there are super-peer identifiers and each of them can be assigned to at most one peer. Furthermore, if every arc in the identifier circle contains at least one peer, the number of elected super-peers is equal to . Given that peer identifiers are uniformly distributed in the identifier space, the probability that no peers belong to a particular arc is . The probability that every arch contains at least one peer can be then estimated as greater or equal to . Hence, if , with high probability, the algorithm elects super-peers.
The main advantages of SOLE are its simplicity and wide-applicability. The use of DHTs assures full system decentralisation, high resilience, and good scalability. The algorithm enables the restriction of the number of super-peers in the system, balances the load between super-peers, and provides an efficient super-peer discovery mechanism.
However, the algorithm has a number of disadvantages. First of all, it elects super-peers based on DHT identifiers rather than their performance or capacity. If peer identifiers are generated randomly or using hash functions, as in many DHTs, the election of super-peer can be seen as a random process, where all peers, including the lowest-performance peers, have equal probability of becoming super-peers. Lo et al.  mention criteria for super-peer election, such as high CPU speed, high network connectivity, and high amount of other resources, but it is not clear how the additional super-peer criteria can be combined with the DHT-based super-peer election and discovery algorithms.
Another disadvantage of SOLE is that it elects super-peers, where is a predefined value, known to every peer, which cannot be easily changed at runtime. The algorithm does not allow super-peer addition in response to an increased system size or load. Furthermore, SOLE requires that every peer in the system participates in the DHT, which may cause a high load on the lowest performance peers.
In hierarchical DHT systems , peers are organised into groups, and each group maintains its own autonomous overlay network. Each group also elects one or more super-peers, and the super-peers maintain a top-level DHT overlay that connects all peer groups. The communication between peers within groups is handled by the intra-group overlays. The communication between peers located in different groups is relayed by super-peers over the DHT. In order to communicate with peers in other groups, a peer contacts the super-peer in its group, the super-peer routes the message over the DHT to the super-peer in the destination group, and the message is forwarded to the destination peer. Figure 2.14 shows a sample hierarchical DHT system that consists of four groups connected by Chord.
The groups may be heterogeneous. Each group, including the top-level group, may establish a different P2P overlay. In particular, peers in a group may deploy an internal DHT overlay, as in two right-hand-side groups shown in Figure 2.14. The advantage of this approach is that intra-group communication is fully decentralised between all peers in the group, and hence the group can scale to a large number of peers. Alternatively, all peers in a group can directly connect to their super-peer, forming a local star topology, where all intra-group communication, as well as inter-group communication, is handled by the group's super-peer. This configuration is shown in Figure 2.14 in the bottom-left group. The advantage of this approach is that the load on clients is very low, but it has a drawback that the super-peer can easily become a performance bottleneck limiting the size of the group. One more approach is to elect multiple super-peers within each group, as shown in in Figure 2.14 in the top-left group.
Depending on the application-specific requirements, peers in a group may or may not be topologically close to each other. It is assumed that every peer belongs to exactly one group. A peer is uniquely identified by a pair , where is the identifier of peer's group and is the peer's identifier used for the communication within the group. Garcés-Erice et al.  do not specify how peer identifiers and group identifiers are created and how peers are assigned to groups. They only mention that the group identifier of a peer may correspond to the peer's local ISP or university campus. A more self-managing approach would be to generate group identifiers based on peer IP addresses, or using a virtual coordinate system, as described in section 2.3.
Each group elects its super-peers from the ``most powerful'' peers available in the group. The super-peer election criteria are based on peer properties such as high uptime and good connectivity (primarily), and high CPU power and network connection bandwidth (secondarily). It is assumed that higher uptime peers are more likely to stay on-line in the future. By selecting the most stable and bandwidth-rich peers as super-peers, the system improves the reliability and throughput of the inter-group overlay. Super-peers also cache values frequently accessed in the top-level DHT by the super-peers' clients. If peer groups are organised based on proximity, such a caching strategy can improve data access performance.
When peers join the system, they follow a similar procedure to that described in section 2.3 (as shown in Figure 2.4). In order to join, a peer is required to know at least one other peer that already participates in the system. Peer obtains the address of a super-peer in its group by querying peer , and becomes a client of . If no super-peer in group exists, joins as a super-peer. In a configuration with super-peers per group, the first peers that join a group become group super-peers.
Garcés-Erice et al.  introduce an interesting approach to handling super-peer failures. Peers run the algorithm shown in Figure 2.15. Each super-peer maintains a list that contains all peers that belong to the super-peer's group (line 4). The super-peer periodically probes its clients in order to keep the list of peers up-to-date. The list is sorted by peer characteristics, from the best candidates for super-peers to the worst candidates. Furthermore, each super-peer creates a list that contains neighbours of this super-peer in the top-level overlay (line 5). These two lists, and , are periodically broadcast by the super-peer to all peers in the group (line 6). Each client also obtains a copy of the two lists when joining its group.
In case of a super-peer failure or departure from the system, each client checks the first entry on its list (lines 9-10). The first peer on the list, , becomes a new super-peer for the group (line 12) and joins the top-level super-peer overlay by contacting super-peers on its list (line 13). All other peers connect to and become its clients (line 15). If a super-peer fails and the lists and are not available at a peer , peer executes a join procedure as if it was joining the system for the first time.
The main open issue in hierarchical DHTs, as they are described in , is the mechanism for the group construction and maintenance. Furthermore, although Garcés-Erice et al. assume that multiple super-peers can be elected in a single peer group, they do not specify how clients are distributed between such multiple super-peers and how lists and are generated and synchronised between multiple super-peers in a peer group. In many aspects, hierarchical DHTs are similar to other group-based systems described in section 2.3, and most points made in the discussion in section 2.3.4 are also valid for hierarchical DHTs. In particular, hierarchical DHTs with static structures of groups may not scale well with the system size, do not guarantee load balancing between groups, and do not assure optimal super-peer election in terms of super-peer capabilities.
Hierarchical DHTs can be generalised to an arbitrary number of levels. For example, in a three-level hierarchy, each group of peers elects a super-peer, and groups of super-peers elect higher-level super-peers (super-super-peers), which connect through a top-level overlay. However, in this design, not only must peers be assigned to groups, but also super-peers must be assigned to second-level groups (super-peer groups) in order to elect third-level super-peers (super-super-peers). More generally, a mapping must be defined between peers and groups at each level of the hierarchy, excluding the top level. The more groups and levels in the hierarchy, the more knowledge is required at peers to construct and maintain the hierarchy. The problem of group definition and peer assignment to groups is not discussed in .
A similar system to hierarchical DHTs, called Hybrid Overlay Network (HONet), is proposed by Tian et al. in . A HONet system consists of non-overlapping peer groups, called clusters. Every group elects a super-peer, called root node, and the super-peers connect with each other forming a top-level overlay that binds all the groups. Peers in each group also maintain a local overlay, as in hierarchical DHTs. Both the intra-group overlays and the top-level overlay are DHTs with independent identifier spaces. Every peer is uniquely identified by a cluster identifier (CID), which is the local cluster root's identifier in the top-level DHT, and a member identifier (MID), which is the peer's identifier in the cluster-level DHT.
Clusters are created using an Internet coordinate system, such as GNP  and Vivaldi  described at the beginning of section 2.3.1. When a peer joins the system, it determines its coordinates in the virtual space, and calculates its identifier in the top-level DHT, , by projecting the -dimensional coordinates onto the one-dimensional space of the top-level DHT using a locality-preserving Space-Filling Curve, such as the -order or Hilbert curve described in section 2.3.1. Next, it performs a lookup in the top-level DHT and discovers the closest peer to , denoted . If the distance in the DHT identifier space between and is below a threshold , assumes that it belongs to the same cluster as and becomes a client of . Due to the properties of the identifier space and the SFC mapping, and are located in close physical proximity. If the distance between and is higher than , joins the system as a super-peer and creates its own cluster. In this case, it sets as its CID.
Apart from DHT connections, every peer maintains random shortcut links to peers located in different clusters. The number of such links depends on the peer's capacity, and these links are created using a customised random walk algorithm. The system provides two routing algorithms: hierarchical routing, where messages are transferred over both the top-level DHT and cluster-level DHTs, as in the hierarchical DHTs, and fast routing, where messages are transferred directly between clusters using shortcut links, if they are available, bypassing the super-peer overlay.
According to , ``each cluster can choose the most stable member to serve as cluster root, resulting in improved global and local stability''. However, it is not obvious how a cluster can change its super-peer from a less stable peer to a more stable peer, since the algorithm described in  does not take into account any other peer characteristics than coordinates. Similarly, it is not obvious how the system addresses super-peer failures and departures. If a super-peer is changed in a cluster, the distance between the new super-peer and some of the cluster members may increase above . In such a case, the cluster may need to elect more than one super-peer in order to cover all peers. Ultimately, the system needs to find a trade-off between electing the highest-capability super-peers in the system and electing super-peers that are evenly distributed in the DHT identifier space.
This leads to another issue associated with HONet. Almost all DHTs rely on a uniform distribution of peer identifiers in the DHT space. This is required to assure efficient routing and balanced load in the DHT. However, in HONet, peer identifiers are calculated based on peer coordinates, which depend on the physical peer location. If a large number of peers belong to one area, the distribution of DHT identifiers becomes skewed and the DHT may exhibit suboptimum performance.
In Super-Peer Chord (SPChord) , super-peers maintain a Chord overlay with an -bit identifier space, and the remaining peers connect to them as clients. Every peer generates its own -bit unique identifier, including clients, and a client connects to the super-peer that is responsible for this client's identifier in the DHT overlay. Figure 2.16(a) shown a sample SPChord topology with four super-peers and eleven clients distributed between them.
Super-peers are selected from the most stable peers. Peer uptime is used as a predictor for peer stability, and it is assumed that higher-uptime peers are more likely to stay on-line than lower-uptime peers.
Peer identifiers are generated in a two-step process. First, each peer determines its coordinates in a virtual coordinate system, such as GNP  and Vivaldi  described in section 2.3.1. Second, each peer maps its virtual coordinates onto the -bit Chord identifier space using space-filling curves that preserve locality, as described in section 2.3.1. This way, super-peers that are close in the physical network are located close to each other in the DHT identifier space, and are associated with clients that are in physical network proximity.
Figure 2.17 shows the algorithm executed at each peer in SPChord. The join procedure is described in lines 1-8. The first peer that joins the system becomes a super-peer (lines 2-3). All other joining peers perform a DHT lookup in order to determine the super-peer responsible for their identifier (line 5) and become clients of this super-peer (line 6).
SPChord uses the same approach for handling super-peer failures as the hierarchical DHT described in the previous section. Every super-peer maintains a list of its clients, , and a list of its neighbours in the DHT, . The lists are periodically broadcast to all clients of the super-peer (lines 36-38), and a joining peer obtains the lists and from the super-peer it connects to (line 7). If a super-peer becomes unavailable, the highest-uptime client, , on list takes over the role of the super-peer (lines 10-13). All other clients connect either to or the predecessor of , depending on their identifiers, so that each client is associated with the super-peer that is responsible for 's identifier in the DHT (lines 15-19).
The most novel element of SPChord is the algorithm that creates super-peers based on load. If a super-peer is overloaded (line 23), it selects its highest-uptime client, , as a new super-peer (lines 24-26), introduces to the DHT overlay, and splits its remaining clients between itself and , based on the clients' DHT identifiers (lines 26-30). According to , a super-peer is overloaded when the number of its clients exceeds , where is a system parameter, equal for all super-peers. The algorithm can be easily extended to allow for individual limits in the number of clients per each super-peer.
A similar algorithm can be used to reduce the number of super-peers when the load is low. When a super-peer is underloaded (e.g., has fewer clients than ), the super-peer attempts to merge its cluster with the cluster of its predecessor, (lines 32-39). However, before can delegate its clients to , it must check if can handle the extra load without splitting its own cluster (line 34). This extra check is necessary to avoid a cyclic behaviour of super-peers, where continuously merges its cluster with and continuously splits its cluster and selects as a super-peer. If can accommodate 's clients without exceeding its maximum load limit, migrates all clients to and becomes a client of (lines 35-36).
In the absence of peer failures and departures, the peer join procedure and the cluster splitting mechanism guarantee that the highest-uptime peer in each cluster is elected super-peer. However, these algorithms do not guarantee that the uptime of super-peers is globally maximised. For that reason, Liu et al.  propose a topology adjustment algorithm, periodically performed by every super-peer in the system, as shown in Figure 2.18. In this algorithm, each super-peer selects its highest uptime client, (line 2), and a random super-peer, , discovered by performing a DHT lookup on a randomly generated peer identifier (lines 3-4). If 's uptime is higher than that of (line 5), the roles of and are reversed. Client becomes a super-peer (line 6), the DHT identifiers and neighbour connections of and are swapped (lines 7-8), and is demoted to a client of (line 9).
The topology adjustment algorithm probabilistically improves the uptime of super-peers, but it also introduces a significant maintenance overhead. Each time a super-peer is swapped with a higher-uptime client, the super-peer needs to notify all its clients and all its DHT neighbours about its identity change. Such frequent changes in the topology may affect the DHT's performance and its stabilisation cost. Since the evaluation of SPChord described in  is based on relatively simple simulation experiments, it is not obvious if the proposed algorithm is feasible in a large-scale P2P environment.
As in HONets, peer identifiers in SPChord are calculated based on peer location so that peers that are located close to each other in the physical network also have close DHT identifiers. This way, the system reduces the communication cost for super-peers and their clients. However, many DHT systems, including Chord, require a uniform distribution of peer identifiers in order to provide logarithmic routing and even load distribution between peers. If the distribution of peer locations in the virtual coordinate space is non-uniform, the DHT overlay is likely to suffer reduced performance.
SPChord regulates the number of super-peers according to the system size. When new peers join the system, it creates super-peers, and when peers leave the system, it removes super-peers. Furthermore, it elects super-peers from the highest-uptime clients. However, even in a stable peer population with no arrivals, departures and failures, SPChord does not balance clients evenly between super-peers, hence does not reduce the number of super-peers in the system to minimum, and hence does not guarantee that super-peers have globally-highest uptime.
This is because clients in SPChord are assigned to super-peers based on their DHT identifiers, which imposes additional constraints on the configurations that the system can generate. For example, if two high-uptime peers have very close DHT identifiers, the system can elect both peers as super-peers, but in this case the preceding peer is very likely to have very few or no clients due to the short distance to the other super-peer. Conversely, if the system elects only one of these two peers as a super-peer, the uptime of super-peers in the system is not maximised. Hence, the system can only choose between electing super-peers with maximum uptime, and electing super-peers that are evenly distributed in the DHT space, which is required to balance clients evenly between super-peers and to minimise the number of super-peers in the network. The topology adjustment algorithm can improve the uptime of super-peers, but it does not assign clients uniformly to super-peers.
This is further illustrated in Figure 2.19, which shows a sample SPChord system. The numbers written on peers, within the circles, represent peers' uptimes. The numbers written next to peers, outside the circles, are peers' DHT identifiers. The maximum number of clients per super-peer is 4. Given that there are 14 peers in the system, 3 super-peers are sufficient to handle the remaining 11 peers as clients. However, SPChord elects 4 super-peers, and due to the constraints imposed by the DHT, cannot reduce this number to 3. Super-peer 0 (with uptime 2) cannot merge its cluster with super-peer 3 (uptime 6) or super-peer 10 (uptime 9), as this would require creating a cluster with more than 4 clients. Similarly, super-peer 8 (uptime 5) cannot merge its cluster with the neighbouring super-peers.
Due to the failures of super-peers and merging of clusters, some clients in SPChord may have higher uptimes than their super-peers. In the presented example, the highest-uptime peers in the system are 10, 1, 3 and 12, with uptimes values of 9, 7, 6 and 6, respectively. However, the uptimes of elected super-peers are 9, 6, 5 and 2. The topology adjustment algorithm can gradually swap super-peers with clients and eventually achieve a configuration where peers 10, 1, 3 and 12 are elected super-peers. However, the adjustment algorithm is not able to balance the clients evenly between the super-peers, and hence is not able to reduce the total number of super-peers to 3.
Structured Superpeers, a system introduced by Mizrak et al. , is in many aspects similar to SPChord. In Structured Superpeers, every peer generates a unique identifier and participates in a global Chord overlay called outer ring. The system also elects super-peers, which maintain an additional, fully-connected overlay, called inner ring. Every super-peer has a full information about all other super-peers in the inner ring. The outer ring is divided into arcs, and each arc is assigned to one super-peer in the inner ring. Figure 2.20 shows an example of such a topology, where the outer ring is split between five super-peers.
The system guarantees routing between any two peers in overlay hops. In order to send a message from peer to peer , peer forwards the message to its super-peer . Due to the full-connectivity of the inner ring, super-peer locates super-peer responsible for the arc enclosing and forwards the message to . Super-peer delivers the message to .
The outer ring is not used for routing messages. It's purpose is to improve the network's robustness, as it prevents peer isolation and overlay partitioning in case of a super-peer failure. It is also used by peers to monitor each other and detect potential failures. Moreover, for increased fault-tolerance, super-peers replicate their state on their neighbours in the inner ring.
Super-peers are elected using a volunteer service. Mizrak et al.  do not specify how the volunteer service is implemented, but only require that every peer joining the system registers in the service, and that the service, when queried, returns the best candidates for super-peers available amongst clients. It is assumed that every super-peer knows its maximum capacity and is able to measure its load. When the load on a super-peer approaches the maximum capacity, two scenarios are possible. If the load on the super-peer's neighbours is below a threshold, the super-peer shares its load with the neighbours using a load-balancing algorithm. Otherwise, the super-peer splits its outer ring arc and creates a new super-peer, selected using the volunteer service. Analogously, when the load on a super-peer is lower than a certain threshold, a super-peer may dismiss one of the existing super-peers and return it to the volunteer service.
The description of Structured Superpeers, published by Mizrak et al. , is brief and lacks detail. The main element missing from the description is the algorithm used by super-peers to synchronise their views on the outer ring division. Every super-peer needs to maintain full knowledge of other super-peers in the system and their corresponding outer ring arcs in order to provide constant time routing. Given that super-peers are created and removed on demand, and arcs are dynamically split and merged, it is not known how the information about changes in the arc division is propagated between super-peers. A straight-forward solution would require either a centralised coordinator or an expensive super-peer synchronisation protocol. Furthermore, it is not clear if clients are migrated between super-peers when they are sharing load, and it is not obvious how to implement the volunteer service.
Systems reviewed in this section use DHT overlays to elect super-peers. In these systems, peers use the DHT overlay to discover other peers that are located close to them in the DHT identifier space, and super-peers are elected locally amongst peers that are close in the DHT space.
The advantage of DHT-based approaches over group-based approaches is that in the DHT-based systems, peer clusters (i.e., super-peers together with their clients) can be easily split and merged at runtime. The DHT manages the information about current cluster division in an efficient and decentralised manner. The system can dynamically adapt the number of super-peers to the current overlay size or load, while at the same time, every client joining the system is able to discover its super-peer and all super-peers agree with each other about the current division of clients between super-peers.
However, the DHT-based approaches have also disadvantages. Due to the constraints imposed by the DHT, the system cannot at the same time elect the highest capability super-peers and distribute clients evenly between the super-peers. Furthermore, if the clients are not evenly balanced between the super-peers, the total number of super-peers in the system cannot be minimised.
For example, if the best candidates for super-peers have close DHT identifiers, which corresponds to the situation where high-capability peers are located in one group in a group-based approaches, the system is forced to choose between electing the highest-capability super-peers, and electing lower-capability super-peers that evenly divide the DHT space. Such a trade-off is particularly likely if the DHT identifiers are not purely random, but are rather generated based on peer properties, such as peer location.
Finally, in some systems described in this section, such as SOLE and Structured Superpeers, every peer is required to participate in a global DHT overlay. However, running a DHT protocol may introduce a significant overhead on the lowest-performance peers. This is important, since in many P2P systems, super-peers are introduced in order to reduce the load on the lowest-capability peers, allowing a ordinary peer to have only one connection to a super-peer and letting the super-peers handle more expensive protocols. Systems where the DHT overlays are maintained by super-peers only, such as SPChord, are more consistent with this idea.
This section contains reviews of three systems that elect and optimise super-peer sets according to some well defined criteria. In SG-1 , the optimality criteria is based on peer capacities. In SG-2 , the optimality criteria is based on peer capacities and distances. In DLM , the optimality criteria is derived from a file-sharing systems workload model.
The goal of the SG-1 algorithm, proposed by Montresor in , is to generate and maintain general-purpose super-peer topologies with the following characteristics
(i) every client is associated with exactly one super-peer,
(ii) super-peers are connected through a pseudo-random overlay network,
(iii) the number of super-peers is minimised.
The last condition, (iii), is based on the notion of peer capacity. SG-1 assumes that peers are heterogeneous and associates each peer with a parameter , called capacity, which determines the maximum number of clients that can handle if elected super-peer. SG-1 also assumes that every peer is able to calculate its capacity upon joining the system and that peer capacity does not change over time. Condition (iii) states that SG-1 generates a topology with minimum number of super-peers such that the total super-peers capacity is higher than or equal to the total number of clients. A P2P topology that satisfies conditions (i-iii) is called an SG-1 target topology.
As many other P2P systems, SG-1 assumes that all peers are mutually reachable through some lower-level network, such as the Internet, and that any peer can potentially connect to any other peer. The SG-1 algorithm is based on periodic gossipping. Every peer periodically exchanges with selected neighbours its information about its current capacity, numbers of clients, and neighbours. In response to an information exchange, a super-peer may transfer its clients to a higher-capacity super-peer, a super-peer may become a client of another super-peer, and a client may be promoted to a super-peer. The general goal of the algorithm is to migrate clients from lower-capacity super-peers to higher-capacity super-peers and to eliminate superfluous super-peers, i.e., those that have no clients.
In SG-1, each peer maintains four neighbourhood sets: , , , and . The set contains a random sample of all peers in the system. It is used by peers to exchange information with each other in order to maintain the other three neighbour sets. It also assures full overlay connectivity. The set contains a random sample of super-peers in the system. It is used to establish connectivity between super-peers and is required to generate the target topology. The set contains super-peers that have fewer clients than their capacity value. This set is used in to obtain candidates for client transfers. Finally, the set manages the relationship between clients and super-peers. For a client, this set contains at most one entry, which represents the current super-peer of this peer. For a super-peer, this set consists of clients currently associated with this super-peer.
Figure 2.21 shows a sample topology generated by the SG-1 protocol. Peer capacity values are indicated by numbers. For clarity, only a subset of peer connections are shown. Super-peers are linked with each other through their neighbour sets, client are linked with their super-peers through set, and random peers are linked with each other through sets. Additionally, one super-peer is underloaded (with capacity equal to 7), and a number of peers are connected with this super-peer through their sets.
In order to create and maintain the four neighbourhood sets, each peer periodically runs four neighbour selection algorithms. The set is maintained using Newscast [80,82], a gossip-based neighbour selection algorithm that generates approximately random P2P topologies.
The general structure of Newscast is shown in Figure 2.22(a). Each peer maintains a limited-size partial view, , which contains its neighbours' descriptors. The maximum size of a partial view, denoted as , is a system constant. A neighbour descriptor consists of a neighbour address and a timestamp.
It is assumed that a TIMEOUT event is generated every time units at peer , which triggers the execution of the algorithm. As the event is raised (in line 4 in Figure 2.22(a)), peer selects a random neighbour from its partial view (line 5), adds its own address and the current time to the partial view (line 6), and sends a request to the selected neighbour (line 7). The request contains 's partial view . When receives a request from a neighbour (line 9), it adds its own address and the current time to its partial view (line 10), sends a response to (line 11), and updates its partial view by applying the operation on and (line 12). The response again contains the partial view of . Finally, when peer receives a response from a neighbour (line 14), it merges its partial view with the received view using the operation.
Operation consists of the following steps. First, all entries in and are combined into one collection, which contains at maximum peer descriptors. Second, all duplicated entries are removed from the collection. If multiple descriptors are associated with the same address, only one descriptor with the most recent timestamp is preserved. Finally, the most recent descriptors are selected from the collection and all other descriptors are discarded.
The Newscast algorithm is also modelled using two threads, as shown in Figure 2.22(b). An active thread initiates a gossip exchange with a random neighbour every time units, while a passive thread continuously listens for incoming exchange requests and responds to them. The operation is identical in both variants of the algorithm. The algorithms are also equivalent in terms of generated topologies and message costs.
Both the and sets are maintained by running two instances of a modified version of Newscast. The modification to the original Newscast algorithm is twofold. First, the neighbour chosen for gossip exchange (in line 4 in both Figure 2.22(a) and Figure 2.22(b)) is selected randomly from the set, and not from the or set. This way, all peers in the system participate in the dissemination of super-peer and underloaded super-peer information, as the set contains a random sample of all peers in the system. Second, the peer adds itself to the set of descriptors (in line 5 in Figure 2.22(a) and in lines 5 and 13 in Figure 2.22(b)) only if it satisfies a condition. For the set, this condition is that the peer must be a super-peer, and for the set, the peer must be a super-peer with fewer clients than its capacity.
The algorithm that maintains the sets is the most sophisticated component of SG-1. It is run periodically, as the other neighbour selection algorithms in SG-1, and its pseudocode is shown in Figure 2.23(a). Each super-peer periodically invokes the procedure (in line 1 in Figure 2.23(a)), while each client executes the procedure (line 23).
A super-peer iterates over all entries in its set and attempts to find a candidate for a client exchange (lines 2-10). Such a candidate must satisfy a number of conditions. First, it must be a super-peer with free client slots. Second, it must either have a higher capacity than that of , or it must have an equal capacity as but a higher number of clients than (lines 3-5).
If a candidate is found, peer transfers as many clients to as is able to handle (line 13). If is left with no clients, and has at least one free client entry (line 14), peer becomes a client of (line 15). This way, clients are migrated to a higher-capacity super-peers and the total number of super-peers in the system is reduced.
In the last step, if still has some clients after an exchange with (line 16), peer selects the highest-capacity client of , denoted as (line 17), and if 's capacity is higher than the capacity of (line 18), peer swaps its role with (line 19). For this purpose, becomes a super-peer, transfers all its clients to , and becomes client of . This step again assures that higher-capacity peers are promoted to super-peers.
The procedure is very simple. Every peer joins the system as a super-peer, and whenever a client loses its super-peer, it becomes a super-peer again (lines 24-26).
Contrary to the evaluation described in , the original version of the SG-1 algorithm does not converge to the target topology (this fact was acknowledged by Alberto Montresor, the author of SG-1, in a private conversation held with the author of this thesis in March 2008). The following scenario can be given as a counterexample. The highest capacity peer in the system, , becomes a super-peer and accepts the second highest capacity peer in the system, , as its client. Subsequently, other peers connect to and the capacity of becomes fully utilised. In the absence of failures and peer departures, the configuration becomes stable, as does not belong to the sets of other peers and does not participate in client exchanges. Peer is a super-peer while peer is a client. However, in the target topology, it is very likely that both and should be elected super-peers, since they are the two highest-capacity peers in the system.
This problem can be addressed by extending the set maintenance algorithm, as shown in Figure 2.23(b). It should be noted, however, that the proposed extension does not belong to the original SG-1 algorithm and is only a suggestion of this thesis' author. In the extended SG-1, each super-peer selects candidates for client transfers (lines 2-10) and performs the procedure (lines 18-27) in exactly the same way as in the original version of SG-1. However, two extra steps are added. In lines 11-16, each super-peer iterates over its entries in the set and searches for a super-peer such that its capacity is lower that the capacity of the highest-capacity client of peer . This can be done without incurring any extra communication cost. If a suitable super-peer is found, the roles of and are swapped. Client becomes a new super-peer, all clients of are transferred to , and becomes a client of . This way, a higher-capacity client replaces a lower-capacity super-peer.
The second extension to the original algorithm, in lines 30-39, allows peers to join the system as clients rather than super-peers. Every client that is not associated with a super-peer, either when it is joining the system or when it loses a previous super-peer, attempts to find a new super-peer in its set. If a super-peer with free capacity is found, which has a higher capacity than that of , peer becomes a client of . Otherwise, peer becomes a super-peer.
It can be shown that a topology managed by the extended SG-1 algorithm, in the absence of peer arrivals, departures, and failures, always converges to the target topology. If churn is present, the algorithm approximates the target topology. SG-1 is also capable of dealing with catastrophic failures, where a large percentage of super-peers (even 100%) are suddenly removed from the network.
SG-1 allows every super-peer to specify its capacity, i.e., the maximum number of clients it can handle. However, in many applications, the load associated with handling clients may vary between different peers and may also change with time. SG-1 does not model this. More importantly, the load on a super-peer may depend not only on the number of clients directly connected to this super-peer, but also on the general activity of other peers in the system. For example, in systems where super-peers handle search, the load on a super-peer is generated by serving its own clients as well as processing queries received from other peers. SG-1 has the drawback that it does not allow the system to explicitly control the number of super-peers and to adapt the super-peers set to the current total demand in the system.
SG-2, proposed by Jesi et al. in , is an algorithm inspired by SG-1 that generates proximity-aware super-peer topologies. In such topologies, super-peers are elected from the highest-capacity peers, as in SG-1, but the system imposes additional constraints on the maximum distance between peers using a distance metric. Each peer is assigned a capacity value , which represents the maximum number of clients it can handle if elected super-peer. Peer capacity is static and it is assumed that every peer knows its own capacity value, as in SG-1. Additionally, for any pair of peers the system defines a latency distance . SG-2 generates topologies where:
(i) every client is associated with exactly one super-peer,
(ii) the number of clients of any super-peer does not exceed ,
(iii) the latency between a client and its super-peer does not exceed ,
(iv) two super-peers are connected if the latency between them is below ,
(v) the number of super-peers in the system is minimised.
Parameters and are configurable system constants. The distance metric is defined as the average round-trip time (RTT) between two peers, and is calculated using the Vivaldi virtual coordinate system , described in section 2.3.1. It is assumed that every peer is able to determine its distance to any other peer in the system. A topology described by conditions (i-v) is called a SG-2 target topology.
The target topology can be described using geometrical concepts. Each peer in the system is represented as a point in the virtual coordinate space. The influence zone of a peer is an -dimensional sphere of radius centred at that peer. The goal of SG-2 is to cover the virtual space with a minimum number of super-peers in such a way that every peer is either a super-peer or belongs to the influence zone of a super-peer.
Figure 2.24 shows a sample topology generated by SG-2 in a two-dimensional Euclidean space. Five super-peers are elected in order to cover all peers in the space, and their influence zones are marked with circles. Each client is connected to the highest-capacity super-peer in its influence zone.
In SG-2, peers communicate with each other using a local broadcast service, which efficiently disseminates messages to all peers within the influence zone of the sender. The local broadcast service is provided by Spherecast, a gossip algorithm based on Newscast  and Lightweight Probabilistic Broadcast .
In Spherecast, every peer runs an instance of Newscast that manages its set of neighbours. The fan-out set of a peer is defined as the subset of the peer's neighbours that belong to the peer's influence zone. A peer broadcasts a message by sending it to all peers in its fan-out set. When a peer receives a message from a neighbour, it either forwards this message to all neighbours in its fan-out set or drops it. The decision is made probabilistically, based on the number of times the message has been encountered by the peer in the past. A message is dropped with probability , where is the number of previous message occurrences, and is a constant threshold parameter.
Every client that is not associated with a super-peer periodically broadcasts a ``request for super-peer'' message, denoted CL-BCAST, in its influence zone. Super-peers reply to these request messages, and clients connect to them. If a client discovers multiple super-peers in its influence zone, it connects to the highest capacity super-peer. Furthermore, a client may probabilistically decide to become a super-peer, depending on its capacity and the frequency of requests received from other clients.
At the same time, every super-peer periodically broadcasts a ``super-peer advertisement'' message, denoted SP-BCAST, to all neighbours within range in order to announce its presence. Super-peers whose influence zones overlap compete with each other, as clients are migrated from lower-capacity super-peers to higher-capacity super-peers. A super-peer that loses all its clients is demoted to a client.
The three parallel processes, which create super-peers based on demand, transfer clients to high-capacity super-peers, and remove idle super-peers, together generate a topology that approximates the system target topology.
Figure 2.25 shows pseudocode for the SG-2 algorithm. Given that there are two types of peers, super-peers and clients, and two types of messages, CL-BCAST and SP-BCAST, four scenarios of message exchange are possible.
When a super-peer receives an SP-CAST message from a super-peer (line 1 in Figure 2.25), whose capacity is lower than , it requests a client migration (lines 2-14). All clients of that belong the influence zone of are transferred to , provided has enough free capacity (lines 3-7). If is left with no clients, it is demoted to a client (lines 8-9), and if it belongs to the influence zone of , it becomes a client of (lines 10-12).
When a super-peer receives a CL-CAST message from a client , it accepts as its client, given it has enough free capacity (lines 15-16).
When a client receives an SP-CAST message from a super-peer , if is not currently associated with any super-peer, it connects to as a client. If is associated with a super-peer with a lower capacity than the capacity of , it disconnects from the current super-peer and becomes a client of , provided has enough free capacity. Super-peer can refuse the connection from if it does not have enough free capacity.
Finally, when a client receives a CL-CAST message from another client , it switches its role to a super-peer with probability , where is the number of times has encountered this message in the past, and is a threshold variable maintained by . Intuitively, as the frequency of requests grows, the probability of becoming a super-peer increases, and as the frequency of requests decreases, the probability of becoming super-peer decreases. Parameter is initialised at each peer as , where is the maximum peer capacity in the system, and is periodically updated at each peer according to the following formula
where is the current time, is the last time when became a super-peer, and is a system parameter. The update formula is used to decrease the probability of role switching between clients and super-peers over time, in order to stabilise the topology and reduce the maintenance overhead.
According to , the problem of finding the target topology, even in a static system, is NP-difficult. In a dynamic environment, with joining and leaving peers and with communication failures, the problem is even more difficult. However, as shown in , SG-2 generates topologies where a large majority of clients manage to connect to super-peers in their latency range and the number of super-peers is close to optimum.
SG-2 uses Spherecast for handling local communication between peers located in physical proximity. However, it is not clear if the Spherecast algorithm scales. When the density of peers in the system grows, more peers belong to each influence zone, and an average peer receives more messages. Similarly, when the range of influence zones is increased, a peer receives on average more messages. If a low-performance peer is located in an area with a high density of other peers, it may easily become overloaded.
On the other hand, when the range of influence zones is decreased, fewer neighbours of a peer belong to the peer's fan-out set. If the system size is large and the range of influence zones is small, the probability that fan-out sets contain any peers may become very low, and as a consequence, Spherecast may stop to work correctly. Thus, the range of peers' influence zones appears to be a critical factor affecting the system's performance. Setting this range appropriately may be non-trivial when deploying an SG-2 system.
Another approach to the construction of optimal super-peer topologies is proposed by Xiao et al. in . They ask the following three fundamental questions.
The workload model assumes that every peer stores a collection of files, shared with other peers in the system, and generates a search query with an average frequency . Super-peers index files stored by their clients and handle search. The model does not depend on any particular search algorithm, but assumes that every search query is propagated to at least super-peers before the results are returned to the originating peer. A client connects on average to super-peers, and a super-peer connects on average to other super-peers. The average durations of client to super-peer and super-peer to super-peer connections are and , respectively.
Ordinary peers are subject to very little message traffic, as they communicate with their super-peers only when updating indices of shared files or issuing search queries or receiving search results. Super-peers are subject to much higher load, as they maintain connections with multiple clients and super-peers, and relay queries received from both their own clients and other super-peers.
Xiao et al.  introduce two types of workload. The workload on a super-peer, , is defined as the average message cost incurred by a super-peer when performing a search operation. The workload on the overall network, , is defined as the total message cost of an average search operation in the P2P network. Furthermore, Xiao et al.  derive the following upper and lower bounds of the two workloads
where is the number of peers in the system, and is the ratio of clients to super-peers. They also introduce a weighted workload, , as
where and are weights such that . The optimal super-peer ratio, , is defined as a ratio between super-peers and clients that minimises the weighted workload . It can be estimated using formulas (2.2) and (2.3). For the most efficient search algorithm, which corresponds to the lower bound on the workload, the optimal super-peer ratio is
For the least efficient search algorithm, the optimal super-peer ratio is given by
Since , and are defined by the P2P protocol, and and can be measured experimentally, the formula allows the calculation of an optimal super-peer ratio .
Using their workload model, Xiao et al.  calculate that the optimum number of clients per super-peer in a typical P2P file-sharing application is between 30 and 65. They also notice that this number corresponds to the typical super-peer ratios found in popular file-sharing systems such as KaZaA.
The Dynamic Layer Management (DLM) algorithm elects super-peers from peers with longer lifetimes and higher capacities and maintains a given ratio of clients to super-peers in a P2P system. Each peer is assigned a static capacity value, , which reflects the peer's ability to process and relay search queries and search responses, and an age , which is defined as the length of time since the peer joined the system. According to , the capacity can be given as a weighted sum of low-level peer properties, such as available bandwidth, CPU speed, and storage space. The age of a peer is used as an estimator of the peer's lifetime. It is expected that peers with higher uptimes are more likely to stay on-line in the future.
In order to elect super-peers and maintain the desired super-peer ratio, each peer performs algorithm shown in Figure 2.26. The algorithm consists of three main blocks. First, each peer collects information about its neighbours, estimates the current super-peer ratio and determines how much it diverges from the optimal ratio (lines 1-8 in Figure 2.26). Second, each peer compares its capacity and age with the corresponding characteristics of its neighbours in order to determine if it is an appropriate candidate for a super-peer (lines 9-16). Finally, each peer decides whether it should become a super-peer or a client (lines 17-25).
Each peer defines its related set, , as a subset of peers in the system that it uses for the estimation of system properties. For a super-peer, the related set is defined as the current set of clients (line 2). For a client, the related set is defined as super-peers that the client knows about (line 5). Xiao et al.  suggest that the related set for a client may be defined as the set of super-peers that has connected to within the last time units.
Using , each peer approximates the average number of client connections per super-peer, . A super-peer simply assumes that is equal to its own number of clients (i.e., ), and a client calculates as an average number of clients per super-peer in (line 6). Given the optimal clients to super-peers ratio, , and the fact that each client connects to super-peers, the optimal number of client connections per super-peer in the system is .
Each peer estimates the divergence of the current system topology from the optimal topology as (lines 3 and 7). A positive value of indicates that super-peers currently have too many clients, and hence, new super-peers should be elected, while a negative value of indicates that too many super-peers exist in the system and some of them should be demoted.
Next, each peer compares its capacity and age with the capacity and age of peers in the related set (lines 10-17). It calculates as the fraction of peers in that have a higher capacity than (lines 11-13) and as the fraction of peers in that have a higher age than (lines 14-16). In the comparison, the capacity of each peer in is weighted by (line 11), and the age of each peer in is weighted by (line 14).
The aim of the scale parameters and is to regulate the probability of a super-peer promotion and demotion. However, it is not explained in  how and are calculated. Xiao et al. only mention that
and are adjusted according to the value of . For a superpeer, if it finds that the system needs more superpeers, it will decrease the possibility of its demotion by decreasing the two scale parameters. Otherwise, it will increase the possibility of its demotion by increasing the scale parameters, while for a leaf peer, if it finds that more superpeers are needed, it will decrease the scale parameters in hoping to increase the promotion possibility; otherwise, it will increase the scale parameters to decrease the promotion possibility. 
In the last step, peer compares its values of and with threshold values of and (lines 18-26). A super-peer with both and higher than and , respectively, becomes a client (lines 19-21). In order to switch its role, it drops all connections to clients and preserves only connections with selected super-peers. It is not specified in  what happens to the disconnected clients and how they find new super-peers to connect to.
A client with the values of both and above and , accordingly, is promoted to a super-peer (lines 23-25). It preserves its current super-peer connections and starts to accept incoming connections from clients. However, again, it is not entirely clear how and are calculated. Xiao et al. state that
The values of threshold variables, and , are also adjusted according to the value of . When more superpeers are needed, superpeers will increase the values of the threshold variables to reduce the demotion tendencies and leaf-peers will reduce the values of the threshold variables to increase the promotion tendencies. For the case there there are too many superpeers, inverse measures will be taken accordingly. 
Furthermore, it is not obvious why the algorithm needs both the thresholds variables and and the scale parameters and . Their roles appear to be redundant, as they both are used to regulate the probability of super-peer promotion and demotion, and they are both adjusted based on .
Two extensions to the DLM algorithm, labelled DLM-2 and DLM-3, are proposed in . In DLM-2, all super-peers periodically gossip with each other and exchange information about their clients. This information is used by super-peers to approximate the average number of client connections per super-peer more accurately than in the original version of DLM. An experimental evaluation in  shows that DLM-2 achieves better performance than DLM-1 and DLM-3.
In DLM-3, the election algorithm is run only on super-peers in order to reduce load on clients. The decision about a super-peer demotion is made as in the DLM-1 algorithm. Super-peer promotion is managed by existing super-peers. When a super-peer decides a new super-peer is needed, it selects a client with maximum value of , where and are constant weighing parameters and , and promotes to a super-peer.
DLM and the super-peer election algorithm used in the gradient topology, described later in this thesis, share a number of similarities. Both algorithms introduce metrics that capture peer capabilities and quantify the suitability of individual peers to become super-peers. In both approaches, the optimum number of super-peers in the system is calculated based on estimated system properties. Furthermore, in both approaches, super-peers are elected using adjustable thresholds.
However, unlike the gradient topology, DLM uses relatively simple heuristics for the estimation of global system properties. For example, the average number of clients per super-peer is estimated in DLM as the current number of clients of one super-peer. At the same time, dedicated algorithms exist, such as aggregation algorithms described later in this thesis, that have been specifically designed to efficiently approximate global system properties. These algorithms have been shown to achieve good scalability and performance, with average approximation error decreasing exponentially with time.
The theoretical model for P2P systems proposed in  is specific to file-sharing applications and cannot be easily applied to other areas. Furthermore, it requires the knowledge of the average duration of peer connections, which may depend on the system deployment environment, and hence, can only be obtained at runtime. Moreover, the description of DLM in  is missing details, which makes the algorithm very difficult to analyse and implement. In particular,  does not explain how the threshold parameters, and , and scale parameters, and , are calculated.
This short section describes P2P systems that adapt their structure to available peer resources, but do not elect super-peers. The purpose of this section is to complete the state-o-the-art systems review in this chapter by presenting alternative approaches to dealing with P2P system heterogeneity. Some of the concepts introduced in the systems covered in this section are similar to the gradient topology.
Astrolabe  is a large-scale information management system that continuously monitors the dynamically changing state of a collection of distributed resources, reporting summaries of this information to its users. Astrolabe computes these summaries using on-the-fly aggregation controlled by SQL queries.
Astrolabe has a hierarchical structure that can be viewed as a tree. Each node in this tree represents a zone. A zone is ``recursively defined to be either a host or a set of non-overlapping [i.e. not having any hosts in common] zones'' . The leaves in the tree represent individual hosts. For each zone, Astrolabe computes aggregates of information from all hosts that belong to this zone (i.e. host being descendants of this zone). These aggregates are computed by nodes continuously gossipping with each other within their administrative zones. Additionally, each zone has representative agents that gossip with representatives agents of other zones on behalf of these zones. This way, the information is summarised and gradually propagated from tree leaves to the root node, which receives the system-wide aggregates of all hosts.
Astrolabe is similar to the gradient topology in that it has a hierarchical structure. The representative nodes in each zone are elected using the same gossipping mechanism that produces data aggregations. Thus, Astrolabe can exploit the most stable and best performing nodes for representing zones at higher hierarchy levels. However, Astrolabe does not provide any specific mechanisms that allows nodes to actively manage and optimise the zone structure. Zones in Astrolabe are implicitly defined by node names. Each node that enters the system autonomously chooses its own name and joins the corresponding zones, creating new zones if necessary. For example, a node identified /USA/Cornell/pc3 belongs to the root zone /, the /USA zone, and the /USA/Cornell subzone withing /USA. Thus, the zone tree grows spontaneously.
An interesting approach to address P2P system heterogeneity is proposed in Gia [34,110]. Gia extends the original Gnutella protocol in order increase its scalability. It is based on four main principles. First, it introduces a dynamic topology adjustment mechanism that increases the degree of high-capacity nodes. Node capacity is calculated based on node properties such as ``processing power, disk latency, and access bandwidth'' . Each node occasionally compares its capacity with that of its neighbours, and computes its level of satisfaction. The satisfaction level is low if the total capacity of node's neighbours, normalised by their degree, is lower than the node's own capacity. In such a case, the node runs a topology adjustment algorithm, where it discovers and connects a new neighbour.
The second main principle in Gia is one-hop replication. All nodes in Gia maintain pointers to the content stored by their immediate neighbours. This mechanism, together with the topology adjustment algorithm, allows high-capacity nodes to act as hubs, which can resolve queries on behalf of other nodes. Third, Gia replaces the flooding-based search in Gnutella with a biased random walk, which directs queries to high-capacity nodes and allows the utilisation of hubs. Finally, Gia uses an active flow control algorithm that balances the load in the overlay by probabilistically routing queries towards nodes with more available capacity. By avoiding hot-spots (i.e., overloaded peers), Gia can significantly reduce query latency and drop rate.
The similarity between Gia and the gradient topology is that higher-capacity nodes are promoted in the system structure such that they handle more system traffic and load. However, Gia is specifically designed for keyword query processing and its design principles cannot be easily ported beyond the file-sharing domain.
A powerful mechanism to exploit high-capacity nodes in a P2P system is proposed in Chord . In this system, a single physical host can run multiple virtual nodes, i.e., instances of the P2P protocol, in order to utilise its available capacity. By creating, migrating, and removing virtual nodes, the system can balance the load between physical hosts and effectively use available resources. A number of virtual node allocation algorithms, based on Distributed Hash Tables, are proposed in [148,88].
The main advantages of virtual nodes are their great simplicity and very high applicability. Virtual nodes can be used in a straight-forward way in many different P2P applications. However, virtual nodes also have a number of disadvantages. First, they usually increase the system overhead. For example, in a DHT overlay of size , a node typically maintains neighbours. When a host runs virtual nodes, it needs to maintain DHT neighbours, and hence, must generate more background traffic to detect neighbour failures and to keep its routing tables up-to-date. Generally, virtual nodes increase the number of nodes in a P2P protocol, while super-peers can be used to reduce the number of participants in a P2P protocol. If the protocol does not scale well, generating virtual nodes may become nonviable. Similarly, if node stability is concerned, virtual nodes are of limited use, since they can only increase the number of stable nodes in the system but do not exclude the non-stable nodes from the P2P protocol.
This chapter surveys the area of heterogeneous P2P systems. These systems are based on super-peers in the large majority of cases, but examples are also given for systems that are based on other design principles. The covered domains include storage systems (OceanStore, Brocade), file-sharing systems (Gnutella, KaZaA, eDonkey), telephony and video-conferencing systems (Skype), e-learning systems (Edutella, ROSA), Grid systems (GLARE), and distributed hash tables (HONets, SPChord, Structured Superpeers, and others). A large part of the reviewed systems can be classified as general P2P frameworks and algorithms.
The functions assigned to super-peers, as well as the overlays run by the super-peers, are generally application-specific. File-sharing, e-learning, and Grid systems use super-peers for indexing files (and other resources) and handling search protocols. Grid systems also use super-peers for managing membership information. Skype uses super-peers for relaying traffic between firewalled peers. OceanStore introduces super-peers for coordinating updates on replicated objects. Brocade, and many other systems based on DHTs, use super-peers for routing messages.
Most reviewed systems attempt to elect super-peers that have certain desired characteristics. These characteristics are often described as high stability, high processing capability, large available storage space, and a high-quality network connection. Peer stability is usually estimated using peer's current uptime. The processing capability is often defined as a function of peer's CPU clock speed and amount of RAM. The quality of a peer's network connection is typically measured using the upstream or downstream bandwidth of the peer's Internet link. Furthermore, some systems, such as Gnutella, require that super-peers have non-firewalled Internet connections and run a certain version of the operating system.
The reviewed systems vary in the requirements for the desired number of super-peers in the network. In PoPCorn and SOLE, the goal is to elect a fixed number of super-peers, given as a system parameter. In more adaptive systems, such as SG-1, SG-2, DLM, HONets and SPChord, the number of super-peers is regulated according to the current system size and load. In particular, DLM maintains a fixed ratio of super-peers to clients, and both SG-1 and SG-2 elect super-peer sets that have sufficient capacity to handle the remaining peers as clients. In some systems, such as Crown, PASS, eDonkey and Grids, the numbers of super-peers is not strictly controlled and depends on external factors such as peer IP addresses (Crown), peer locations (PASS), and local users or administrators (eDonkey, Grids). Furthermore, some systems introduce additional constraints on super-peers and clients, such as a maximum distance in a virtual coordinate system or semantic space.
The super-peer election methods in the reviewed systems can be divided into four general categories. The first category, comprising the simplest approaches, includes systems where super-peers are hardcoded, configured manually, or elected based on fixed thresholds. Centralised approaches are not scalable and introduce security and reliability risks. Manual or static selection is not likely to produce optimal super-peer sets due to the complexity and dynamism in most P2P systems. Fixed thresholds can only be applied to systems where the distribution of system-wide peer characteristics does not change significantly in time and is known to the system designer or administrator.
In systems that belong to the second category, peers are divided into groups based on properties such as physical location, position in a virtual space, semantic content, or membership in a Virtual Organisation. This approach has the advantage that the super-peer election problem can be decomposed into local election subproblems which are solved independently in each group. It also allows the system to bind clients with super-peers that are close to them according to a system metric. However, this approach introduces the problem of group management. Simple schemes, based on peer properties such as IP address or ZIP code, do not allow peers to actively control the number of super-peers in the system. Furthermore, election in groups does not guarantee that all peers with globally-highest capability become super-peers, and does not guarantee that clients are evenly distributed between super-peers.
The third category consists of systems that elect super-peers using a DHT overlay. In these systems, super-peer clusters can be dynamically split and merged, and the number of super-peers can be regulated based on the current network size and load. However, due to the constraints imposed by the DHT, these systems cannot guarantee that the elected super-peer sets are optimal in terms of size and super-peer capabilities.
The last category contains systems that continuously optimise super-peer sets according to a formally defined criterion. SG-1 generates a topology with minimum number of super-peers such that the total super-peer capacity is equal to the total number of clients. SG-2 extends SG-1 and introduces additional constraints on the maximum distance between super-peers and clients. Finally, DLM derives an optimal ratio of super-peers to clients from a file-sharing workload model, and provides a mechanism for maintaining such an optimal ratio in a P2P system, electing high-uptime and high-capacity super-peers.
In summary, the systems reviewed in this chapter clearly show that there is a general need for introducing super-peers in P2P applications in order to improve their overall performance and scalability. However, the majority of existing systems offer simple and limited mechanisms for the super-peer election, and only a handful of systems attempt to optimise super-peer sets according to well-define criteria. These few sophisticated systems are specific to particular application scenarios.
The remaining chapters in this thesis show that gradient topologies, in combination with aggregation-based election techniques, extend the current state-of-the-art knowledge on super-peers, and allow more flexible and adaptive super-peer election in large-scale heterogeneous systems. In particular, gradient topologies can generate super-peer sets equivalent to that in SG-1, DLM, and many other reviewed systems, and can adapt super-peer sets according to the requirements in the system.
This chapter formally defines gradient topologies (GT) and describes their main properties. It then introduces a subset of gradient topologies, called tree-based gradient topologies (TGT), which have a number of attractive, formally proven properties, such as a diameter growing logarithmically with the overlay size. Due to these properties, the thesis mainly focuses on the TGTs, although many features are common to both GTs and TGTs. The last section presents a brief design of two large-scale applications, a P2P storage systems and a P2P name service, that take advantages of the TGTs. The purpose of this last section is to show, at a high-level of detail, how the TGTs can be used in practical application scenarios. The algorithms that generate TGTs and elect super-peers are presented in the next chapter.
Gradient Topologies (GT) are a class of P2P overlay topologies, where peers are arranged according to their utility such that the highest utility peers are clustered in the logical centre of the topology (also called core) while lower utility peers are located at gradually increasing distance. The higher the utility of a peer, the closer this peer is, in terms of overlay hops, to the maximum utility peers in the system.
Peer utility metric is defined by the application that runs on top of the gradient P2P topology. It is assumed that the higher-level application requires the selection of the highest-utility peers in the network for its application-specific purposes. For example, in a content distribution network, the utility may be defined as a function of a peer's maximum upstream bandwidth. In a P2P storage system, the utility may combine a peer's available storage space and bandwidth, while in a grid computing system, the utility may be a function of a peer's processing speed and expected availability.
The gradient topology is independent from the higher-level application in the sense that all algorithms used for the construction of the topology, message routing, and election of super-peers, do not make any assumptions about peer utility. They only require that for every peer in the system, a utility value, , is defined. Thus, the utility metric encapsulates application-specific peer requirements.
Formally, gradient topologies can be defined as P2P topologies where for any two peers, and , if than , where is a peer distance metric defined as the shortest path length between two peers and , and is the highest utility peer in the system. Figure 3.1(a) shows a conceptual diagram of a gradient topology, and 3.1(b) shows a sample visualisation of a gradient topology generated in a P2P simulator . Darker nodes and darker edges indicate higher-utility peers and connections between high-utility peers.
Gradient topologies have two main properties. First, all peers in a gradient topology with utility above a given utility threshold form a connected sub-overlay, which is itself a gradient topology and is concentric with the total system topology. Such high-utility peers can be exploited in a similar manner as super-peers in traditional P2P systems.
Second, the information captured in the gradient topology enables efficient routing of messages from low-utility peers to high-utility peers. This is achieved by forwarding messages at each peer to the highest-utility neighbour, as in hill-climbing and similar search heuristics. This strategy, called gradient search, guarantees that messages are eventually delivered to the highest-utility peers in the system.
Assuming that a higher-level application uses the highest-utility peers in the system for running certain services, gradient search allows lower-utility peers to discover these high-utility peers in order to access the services hosted by them. Gradient search does not require any global knowledge at peers, as it requires only that each peer estimates the utility of its immediate neighbours. Gradient search is also deterministic, and it does not require peers to duplicate message, as in flooding and parallel random walks.
Given that peers are characterised by a common utility metric , the super-peer election problem in a gradient topology can be solved by calculating a super-peer utility threshold. All peers with utility above the selected threshold become super-peers, and the remaining peers become clients. The super-peer set elected this way is optimal in the sense that the utility of peers in the set is maximised. Each super-peer has a higher utility than any client.
The number of elected super-peers is directly controlled by the super-peer utility threshold. By decreasing the threshold, the system can increase the number of super-peers, and by increasing the threshold, the system can decrease the number of super-peers. Due to the structure of the gradient topology, no peer connections need to be reconfigured as super-peers are added or removed, since super-peers always are clustered at the centre of the topology and can be discovered by clients using gradient search.
A number of different criteria can be applied when calculating super-peer thresholds. In the simplest case, the threshold can be explicitly given by a higher-level application. However, as discussed in section 2.2.14, setting a threshold that limits the number of super-peers to a desired level requires a global knowledge of peer utility. Furthermore, if the super-peer election threshold is fixed, the system is not able to adapt the number of super-peers to the existing demand and is likely to generate a suboptimal super-peer set if the characteristics of peers in the system significantly change over time.
A top-K threshold is defined as a utility value, , such that exactly peers in the system have utility equal or above and all remaining peers have utility below . Given the cumulative peer utility distribution in the system, , where is the number of peers with utility above ,
the top-K threshold must satisfy the following equation
Assuming that peers have a knowledge of the utility distribution , a top-K threshold allows a precise restriction of the number of super-peers in a dynamic system. It has the property that, regardless of the system size (as long as ) and utility of participating peers, it elects exactly super-peers, and the utility of these super-peers is maximised.
Similarly, a proportional threshold is defined as a utility value, , such that a fixed fraction of peers in the system have utility greater than or equal to . In a system with peers, a proportional threshold is described by the following equation
A proportional threshold allows peers to adapt the number of super-peers to the total system size. As the system grows and shrinks in size, the proportional threshold increases and decreases, adjusting the number of super-peers in the system so that the ratio of super-peers to ordinary peers remains constant.
In many applications, the desired number of super-peers depends not only on the system size but also on the capabilities of available peers. For example, systems such as SG-1, SG-2 and DLM introduce the notion of peer capacity and generate super-peer sets that have sufficient capacity to handle all remaining peers as clients.
Assuming that a capacity value is defined for each peer , a fixed-capacity threshold can be introduced as a utility value, , such that peers with utility above have a total capacity of . A fixed-capacity threshold can be calculated using a cumulative peer capacity distribution, , where is the total capacity of peers with utility above ,
The threshold must satisfy the following formula, similar to the top-K threshold definition
In order to elect super-peers that have a total capacity equal to the number of clients in the system, as in SG-1 and SG-2, a clients threshold is defined as a utility value, , such that
In the above equation, is the number of elected super-peers, is the number of clients, and is the total super-peer capacity. If peer utility is defined as peer capacity, i.e., for every peer , the super-peer set generated using a clients threshold is equivalent to that in the SG-1 target topology. If , super-peers are selected from the highest-utility peers in the system, and the number of super-peers is determined by the system size and super-peer capacity.
A more general approach to elect super-peers is based on the concepts of peer capacity and peer load. Depending on the higher-level application, load can represent connected clients, stored data, network transfers, handled requests, running jobs, or other application-specific concepts. The goal of the super-peer election is to generate a super-peer set that has a sufficient total capacity to handle the load generated in the system.
More formally, the capacity of a peer is defined as the maximum load peer can handle at a time, and represents load that peer is currently handling. Any peer can generate load, and the total amount of system load fluctuates over time. A load-based threshold is defined as a utility value, , such that peers with utility above have a total capacity equal to the total system load, i.e.,
The utilisation of peer is defined as the ratio of a peer's current load to the peer's capacity, i.e., . If the super-peer threshold is calculated using formula 3.7, all super-peers must achieve a full utilisation (i.e., equal to one) in order to handle the total system load. This requirement can be relaxed by allowing super-peers to have a higher total capacity than the system load. In order to maintain an average super-peer utilisation of , where , the super-peer election threshold, , must satisfy the following formula
This way, super-peers maintain a margin of spare capacity and can accommodate extra load in case of a rapid load level increase. Moreover, when the average super-peer utilisation is reduced, more super-peers in the system have free capacity, and the distribution of load between super-peers that have free capacity becomes easier.
Finally, multiple criteria for super-peer sets can be combined into one threshold. For example, if the super-peer set must satisfy two conditions, to have a minimum size of , and a minimum capacity of , a composite threshold, , is defined as , where and are derived from formulas 3.2 and 3.5, respectively. Similarly, in order to elect a super-peer set that has a size of or capacity of , the super-peer election threshold is set to .
Many measurements on existing P2P systems show that peer characteristics, such as session times, availability, and bandwidth, are closely approximated by Pareto distributions [173,144,181], and many theoretical models of P2P systems assume that peer properties follow Pareto distributions, also called power-law [135,4,14,7]. Typically, the Pareto shape parameter, , in P2P systems is approximately equal to two . Given a system with Pareto distributed peer utility, the following theorem describes the utility of super-peers.
Theorem 3..1 In a system where peer utility follows a Pareto distribution with exponent , and the highest-utility peers are elected super-peers and constitute a fraction of all peers in the system, the ratio of mean peer utility to mean super-peer utility is .
Proof. Given that peer utility follows a Pareto distribution, the probability that a peer has a utility value above is
where is the minimum peer utility and is a constant system parameter, . In order to maintain a ratio of super-peers to the total number of peers, the super-peer utility threshold, , must satisfy
Hence, . The utility of super-peers also follows a Pareto distribution, but the minimum super-peer utility is . The probability that a super-peer has a utility value above is then given by
The mean of the peer utility is . The mean of the super-peer utility is
and the ratio between mean peer utility, and mean super-peer utility, , is then .
Given that peer utility follows a Pareto distribution, the probability that a peer has a utility value above is
where is the minimum peer utility and is a constant system parameter, . In order to maintain a ratio of super-peers to the total number of peers, the super-peer utility threshold, , must satisfy
Hence, . The utility of super-peers also follows a Pareto distribution, but the minimum super-peer utility is . The probability that a super-peer has a utility value above is then given by
The mean of the peer utility is . The mean of the super-peer utility is
and the ratio between mean peer utility, and mean super-peer utility, , is then .
This section introduces tree-based gradient topologies and describes their main properties, including average peer degree, diameter, path lengths, and average distance to a super-peer.
Given a utility metric , peers can be ordered according to their utility, from the highest-utility peer to the lowest-utility peer , where is the total number of peers. The position of a peer in such a ranking, denoted , is called rank. A tree-based gradient topology (TGT) is defined as a gradient topology such that each peer (excluding ) is connected with a peer ranked , where is a constant system parameter called branching factor, . Peer ranked is called 's parent.
Figure 3.2 shows a sample TGT with 27 peers and branching factor . For clarity, peers have only these connections that are required by the TGT definition, i.e., to their parents. In most realistic use cases, peers need to maintain additional links with each other in order to increase the system's fault-tolerance and reduce the probability of the topology partitioning.
In a tree-based gradient topology, assuming no peer connections other than between a peer and its parent, peers ranked less than (with the exception of ) are connected with other peers: one parent and children. Peers ranked above have no children and are connected with their parents only. The average peer degree is , since each peer (excluding ) has one outgoing connection to a parent, and the total number of incoming parent connections is equal to the total number of outgoing parent connections. Given that is usually large in P2P systems, the average peer degree is close to two.
One of the main properties of TGT is that the shortest path between peer and has at most edges, where is the highest utility peer in the system. This fact can be shown by a straight-forward induction.
In a tree-based gradient topology, the shortest path between a peer, , and the highest utility peer, , where , has at most edges.
Proof. Let denote the shortest path length between peer and , i.e., . The proof is by induction on the peer rank. The base case is for peer , which is directly connected to , and hence .
Inductive step. Assume for all such that . It needs to be shown that . Peer is connected with its parent, ranked , and hence . Using the induction hypothesis, .
Let denote the shortest path length between peer and , i.e., . The proof is by induction on the peer rank. The base case is for peer , which is directly connected to , and hence .
Inductive step. Assume for all such that . It needs to be shown that . Peer is connected with its parent, ranked , and hence . Using the induction hypothesis, .
Theorem 3.2 immediately implies that the diameter of a TGT with peers is at most , that is .
A single fact that a P2P topology has a short diameter does not necessarily indicate that such a topology is useful. For example, the diameter of purely random topologies grows logarithmically with the number of peers, but determining the shortest paths between peers in such topologies is expensive due to the lack of knowledge about the topology structure. Searching algorithms used in random P2P topologies, such as flooding, random walks, Breadth First Search (BFS), Depth First Search (DFS), and iterative deepening [109,187,199,64], require sending messages to large numbers of peers, potentially all peers in the system.
Gradient topologies, in contrary to random and unstructured P2P topologies [31,109,42,78], contain information about peer utility and enable efficient routing from low-utility peers to high-utility peers using gradient search. It can be shown that in a TGT, a message from peer is routed by gradient search to peer through at most intermediate peers. Thus, the worst-case cost of gradient search is message transmissions. This fact can be shown by a simple induction on the peer rank, almost identically to Proof 3.2.
Super-peers are elected in a TGT using utility thresholds, described in section 3.1.1, exactly in the same way as in GTs. According to Theorem 3.2, the highest utility peers in a TGT form a TGT sub-overlay with a diameter of . Moreover, the topology structure imposes bounds on the maximum distance between a super-peer and a client.
Theorem 3..3 In a tree-based gradient topology, where the highest utility peers are super-peers, the shortest path between an ordinary peer, , and a super-peer has at most edges.
Proof. Let denote the shortest path length from peer to any super-peer. The proof is by induction on the peer rank. Obviously, for all peers where , . Furthermore, all peers such that are directly connected to super-peers through their parent links, and hence for .
Inductive step. Assume for all such that . It needs to be shown that . Peer is connected with its parent, ranked , and hence . Using the induction hypothesis, .
Let denote the shortest path length from peer to any super-peer. The proof is by induction on the peer rank. Obviously, for all peers where , . Furthermore, all peers such that are directly connected to super-peers through their parent links, and hence for .
According to Theorem 3.3, every peer in a TGT is located within at most overlay hops from a super-peer, where is the overlay size and is the number of super-peers in the overlay. Given a super-peer ratio , the maximum distance from a client to the closest super-peer is at most overlay hops. Hence, if the branching factor, , is set to the reciprocal of the super-peer ratio, , then every peer in the system is directly connected to a super-peer.
Theorem 3..4 In a tree-based gradient topology where , the maximum distance between a client and its closest super-peer is one.
Proof. Let be the total number of peers in the system and be the number of super-peers. The super-peer ration, , is equal to . From Theorem 3.3, the maximum distance between a client and its closest super-peer is . This distance is equal to one if , which is equivalent to , which is equivalent to . This can be achieved by setting equal to .
Let be the total number of peers in the system and be the number of super-peers. The super-peer ration, , is equal to . From Theorem 3.3, the maximum distance between a client and its closest super-peer is . This distance is equal to one if , which is equivalent to , which is equivalent to . This can be achieved by setting equal to .
An analogous reasoning to Proof 3.3 can be used to determine bounds on the performance of gradient search. It can be shown that in a TGT with a super-peer ratio of , gradient search routes a message from any peer in the system to a super-peer through at most intermediate peers, and the maximum cost of a super-peer discovery in a TGT is message transmissions. This cost is equal to one if .
This section describes the design of two sample applications, a P2P storage system and a P2P name service, that take advantage of the properties of gradient topologies.
The storage system is designed to permanently store user-provided data. It supports the following operations: (i) create an empty file, (ii) delete a file, (iii) read from a file, and (iv) write to a file. All operations (i-iv) take a file name as a parameter.
For performance and reliability reasons, the data stored by the system is hosted by super-peers only. Furthermore, the data is partitioned between super-peers using a DHT, such as Chord described in section 2.4.1. Each file name is mapped onto the DHT identifier space using a hash function, such as SHA-1, and assigned to a super-peer.
In order to create or delete a file, a peer generates a request and routes it using gradient search to the closest super-peer, . The contacted super-peer, , forwards the request using the DHT protocol to the super-peer, , that is responsible for the given file. Super-peer then creates or deletes the file, as requested by .
Similarly, in order to read or write a file, peer performs a gradient search to discover a super-peer, , which uses the DHT overlay to contact , the super-peer responsible for the given file. Super-peer directly contacts and transfers the contents of the file. The design can be further extended to allow for file encryption and access verification, for example using public-key cryptography.
In a gradient topology with peers and super-peers, super-peer discovery requires at most message transmissions, as shown in Theorem 3.3. Furthermore, the DHT guarantees that the cost of a DHT lookup is at most message transmissions. Thus, as shown in Figure 3.3, the total cost of locating a file is message transmissions, as in a traditional DHT. This cost can be further reduced if clients cache super-peer addresses and reuse them when performing sequences of operations.
Super-peers are elected using a load-based utility threshold. Peer capacity is defined as the amount of storage space available at a peer, and the load at a peer is defined as the amount of data stored by the peer. The system maintains a set of super-peers that has a sufficient storage space to accommodate all data uploaded to the system. In order to distribute the data between the super-peers, one of the well-known approaches to load-balancing in a DHT is applied [88,148,76].
Furthermore, for improved data availability, each file can be replicated on super-peers with numerically closest DHT identifiers to the hash of the file's name, as described in [179,162,52]. In this case, the total super-peer capacity must be times higher than the amount of data uploaded to the system.
The choice of the peer utility function depends on the system requirements. In order to minimise the number of super-peers in the system, peer utility is defined as peer capacity. In order to maximise the throughput of read and write operations, peer utility is defined as a function of peer's downstream and upstream bandwidth capacity. According to theorem 3.1, if peer bandwidth follows a Pareto distribution with exponent , and is the ratio of super-peers to the total number of peers, the average super-peer bandwidth is times higher than the average peer bandwidth. Thus, the expected file transfer rate, assuming no contention, is improved by a factor of compared with a traditional DHT where no super-peers are elected and all peers host data.
Another approach is to define peer utility as the expected peer session duration, which can be estimated using the history of previous peer sessions and current peer uptime . If peer sessions follow a Pareto distribution with exponent , than according to theorem 3.1, the average leave rate of super-peers is lower by a factor of compared with the average leave rate for all peers in the system. This has two advantages. First, the probability of data loss is lower compared with a traditional DHT, since peers hosting data are less likely to fail. Second, file replica maintenance cost is reduced, since peers join and leave the DHT less frequently (assuming the super-peer election threshold does not change) and hence, file replicas need to be transferred or re-created less frequently.
The P2P registry service stores a collection of domain-specific records, and allows peers to add new records, update existing records, delete records, and search for records that satisfy certain criteria. Each record can be updated or deleted only by its owner, which is the peer that created the record, but can be read by all peers in the system.
For fault-tolerance and performance reasons, the registry service is replicated between a limited number of high-utility super-peers, determined by an adaptive election threshold. Unlike the storage system described in the previous section, where the data is partitioned between super-peers, each super-peer in the P2P registry service maintains a full replica of the entire registry, i.e., has a copy of all records stored in the system. It is assumed that the average size of a record in the registry is relatively small (order of kilobytes), and hence, a single super-peer is likely to have enough storage space to host a full registry replica.
It is assumed that search operations are significantly more frequent than update operations, and hence, the registry is optimised for handling search. Due to the applied replication scheme, every super-peer can independently handle any search query without communicating with other super-peers. This is important, since complex search, for example based on attributes, keywords, or range queries, is known to be expensive in distributed systems [99,141,199,109].
In order to perform a search on the registry, a peer generates a query and routes it to the closest super-peer using gradient search, as shown in 3.4. The super-peer processes the query and returns the search results directly to the originating peer. It can be shown that in a system with peers and super-peers, a query passes through at most peers before it is delivered to a super-peer. Optionally, if the super-peer is heavily-loaded, it may forward the query to another super-peer which has enough capacity to handle it. Clients may also cache super-peer addresses and contact the super-peers directly in order to reduce the routing overhead.
In order to create, delete, or update a record in the registry, a peer generates an update request and routes it to the closest super-peer using gradient search. The update is then gradually disseminated to all super-peers using a probabilistic gossip protocol.
Every record in the registry is associated with a timestamp of the most recent update operation on this record. The timestamps are issued by the records' owners. Super-peers periodically gossip with each other and synchronise their registry replicas, as in . Each super-peer periodically initiates a replica synchronisation with a randomly chosen super-peer neighbour, and exchanges with this neighbour all updates that it has received since the last time the two super-peers gossipped with each other.
Super-peers do not need to maintain a membership list of all replicas in the system. Due to the properties of the gradient topology, all super-peers are located within a connected component, and hence, every super-peer eventually receives every update. Conflicts between concurrent updates are resolved based on the update timestamps. Every record can be updated only by its owner, and it is assumed that the owner is responsible for assigning consistent timestamps for its own update operations.
Super-peers are elected using a load-based utility threshold. Each peer defines its capacity as the maximum number of queries it can handle at one time. The load at a peer is defined as the number of queries the peer is currently processing. The super-peer election threshold is calculated in such a way that the super-peers have sufficient capacity to handle all queries issued in the system. When the load in the system grows, new replicas are automatically created.
In order to reduce the probability of a super-peer departure or failure, peer utility is defined as the expected peer session duration. Hence, super-peers are elected amongst the most stable peers in the system. Furthermore, every peer, once elected a super-peer, never switches back to the role of a client, and maintains a registry replica until it permanently leaves the system.
This chapter describes a set of algorithms that allow a P2P system to generate and maintain a gradient topology, elect super-peers, and route messages from clients to super-peers. The chapter is organised as follows. The first section gives a high-level overview of all algorithms. The second section describes a number of peer utility metrics and shows how they can be computed by peers. The third section presents neighbour selection algorithms that generate GTs and TGTs. The fourth section describes aggregation algorithms that approximate global system properties. The fifth section covers super-peer election strategies for gradient topologies. The sixth section describes peer ranking algorithms. The seventh section addresses gradient search. Finally, the last section shows a bootstrapping mechanism for peers joining the system. The algorithms are evaluated in the next chapter.
Figure 4.1 shows a general overview of the algorithms introduced in this chapter, with dependencies between them indicated by arrows.
In a TGT, the neighbours of each peer are divided into three subsets: random, successors and tree, and each of these subsets is managed by a different neighbour selection algorithm. Random sets are used by an aggregation algorithm that estimates global system properties, such as the system size and peer utility distribution, and allows peers to calculate adaptive utility thresholds, which in turn are the basis for super-peer election. Moreover, the estimation of global system properties, together with successor sets, allows peers to estimate their ranks and generate tree sets. The topology structure created by the tree sets is exploited by gradient search, which routes messages from clients to super-peers. In addition, every peer has a black-box component that calculates peer's current utility. Since almost all algorithms rely on this component, for clarity, it is not drawn in Figure 4.1.
The construction of a GT is similar. The main difference is that GTs do not require tree neighbour sets, and therefore, do not run the peer ranking algorithm.
The utility of peer is a number that reflects the appropriateness of peer to act as a super-peer. The higher the utility, the more suitable a peer is to become a super-peer.
Peer utility is application-specific, and can be defined by the higher-level application in an arbitrary way. However, it can be expected that many P2P applications aim to elect super-peers that have the best possible hardware parameters, such as CPU clock speed, amount of RAM, and storage space. In such applications, peer utility can be defined as a function, such as a weighted sum or product, of these hardware parameter. This approach has the advantage that the computation of peer utility is simple and straight-forward. Each peer can obtain from the operating system, or measure directly, the values of its relevant parameters and independently compute its utility.
More sophisticated utility metrics may involve feedback from neighbouring peers. In particular, in untrusted environments, a decentralised approach to trust or reputation management [86,138,61] may be adopted in order to prevent malicious peers from providing fake utility information. However, trust-based approaches to utility computation are beyond the focus of this thesis.
Network characteristics, such as bandwidth, latency, and firewall status, are more challenging to estimate due to the decentralised and complex nature of wide-area networks. Moreover, many network properties, including bandwidth and latency, are properties of pairs of peers, i.e., connections between two peers, rather than individual peers.
Nevertheless, a peer can estimate the average latency and bandwidth of all its connections over time and use the average value as a general indication of its network connectivity and overall utility for the system. Furthermore, it has been shown that the bottleneck bandwidth of a connection between a peer and another machine on the Internet is often determined by the upstream bandwidth of the peer's direct link to the Internet [97,167]. Thus, available bandwidth can be treated as a property of single peers.
For many applications, peer stability is amongst the most important peer characteristics, since in typical P2P systems the session times vary by orders of magnitude between peers, and only a relatively small fraction of peers stay on-line for a long time, as discussed in section 1.2.
One way of measuring the peer stability is to estimate the expected peer session duration. Stutzbach et al  show that the durations of consecutive peer sessions are highly correlated and the durations of previous peer sessions are good estimates for the duration of the current peer session. A number of sophisticated models, based on patterns in the history of peer sessions, have been proposed for predicting the behaviour of peers [106,121].
In some applications, peer availability, defined as the fraction of time a peer is on-line, may be more adequate for expressing peer utility than the expected peer session length. As shown in a number of analyses, peer availability can be estimated based on the history of previous peer sessions [121,172,51,18].
However, the information about previous peer sessions may not always be available, for example when peers are joining the system for the first time. In these cases, the remaining peer session duration, as well as the expected peer availability, can be estimated based on the peer's current uptime. Stutzbach et al  show that the peer's uptime is on average a good indicator of the remaining peer session time, although it exhibits high variance.
It can be shown that in systems where peer session times follow the power-law (i.e., Pareto distribution), the expected remaining session time of a peer is proportional to the current peer uptime. Similar properties can be derived for other session time distributions, such as the Weibull and log-normal distributions, which are often used in P2P system modelling.
Formally, if the peer session times in a system follow a Pareto distribution, the probability that a peer session duration, , is greater than some value is given by , where is the minimum session duration and is a system constant such that . The expected peer session duration is .
For peers with uptime of , where , session durations also follow a Pareto distribution, but with the minimum value of , i.e., the probability that a peer's session is greater than is equal to . Hence, the expected session duration for peers with an uptime of is . From this, the expected remaining session time is derived as .
Multiple peer properties, such as hardware parameters, network characteristics, peer stability and availability, can be combined into one utility function. Given a set of measurable peer properties, , , ..., , the simplest utility definition for a peer is the product
where all properties have an equal impact on the overall peer utility. In order to assign individual weights for each peer parameter , peer utility is defined as a weighted sum
Naturally, any function of peer properties, depending on application-specific knowledge, can be used to define the peer utility metric.
Many P2P systems, such as SG-1, SG-2 and DLM, introduce the notion of peer capacity and load. Peer capacity, , usually refers to the total amount of resources, such as storage space, processing power, and bandwidth, available at peer , while peer load, , usually represents the mount of resources that are currently being used.
There are two general approaches to defining peer utility based on capacity and load. One approach is to specify peer utility as a function of the peer's available capacity, i.e., . This, however, has a significant drawback that peer utility changes over time, as it depends on the peer's current load. In particular, a peer decreases its utility when it is elected a super-peer and starts to receive load, and a peer increases its utility when it falls below the super-peer election threshold and stops receiving load. Such cyclic peer behaviour may destabilise the overlay and even prevent peers from generating a gradient topology. Moreover, depending on the application, frequent switches between ordinary peers and super-peers may introduce a significant overhead.
A better approach is to define the peer utility as a function of the total peer capacity, , and use the information about system load for the calculation of the super-peer election threshold. This way, peer utility, and hence the system topology, remain stable, while the super-peer set grows and shrinks as the total system load increases and decreases.
If an external process consumes peer resources (i.e., storage space, processing power, bandwidth, etc.), the utility of peers perceived by the P2P application may change over time. Such utility changes are unpredictable from the P2P application's point of view, and hence, they force peers to occasionally re-compute their utility. In the simplest case, a peer may calculate its utility every time is requested by an algorithm running at . However, this may introduce a significant overhead, especially if the evaluation of requires measurements to be performed at peer .
In order to reduce the overhead, a peer may cache the most recently calculated utility value, and re-evaluate its utility only if the cached value is older than a predefined parameter . Figure 4.2 shows the pseudo-code of such a lazy utility evaluation strategy. Another approach is to calculate peer utility periodically.
Since peers are not able to measure or predict the utility of their neighbours, each peer needs to maintain a cache that contains the most recent utility value, , for each neighbour . Every entry in the cache is associated with a timestamp created by when the utility of is calculated. Neighbouring peers exchange and merge their caches every time their neighbour selection algorithms exchange messages, preserving the most recent entries in the caches. Clocks do not need to be synchronised between peers since all utility values for a peer are timestamped by .
Highly variable peer utility is generally not desired, as it may impede peers' ability to create and maintain a gradient topology, and may cause frequent switches between super-peers and ordinary peers. In order to reduce utility fluctuations, a peer may calculate a moving average of its latest utility samples. Assuming a peer periodically measures its utility and obtains utility samples
the utility of peer at time can be defined as
where is a system parameter. This strategy, called simple moving average (SMA), requires peers to maintain a FIFO list with most recent utility samples.
An alternative approach, called exponential moving average (EMA), does not require state to be maintained at peers and can be combined with both periodic and on-demand utility sampling. When a peer obtains a new utility sample, , it updates its current utility according to the following formula
where is a constant weighing factor, . In an infinite system run, the utility calculated using EMA converges to a sum of all samples with exponentially decreasing weights
Assuming peer utility changes over time, two important utility properties are monotonicity and predictability. When peer utility grows or decreases monotonically, peers can cross the super-peer election threshold only once, assuming the threshold is constant. Moreover, if the utility changes are predictable, a peer is able to accurately estimate its own utility and the utility of its neighbours at any given time.
For example, if peer defines its utility as the expected session duration, , and estimates it based on the history of its previous sessions, is constant during one session. When is elected a super-peer, it is not demoted to a client unless the super-peer election threshold increases above .
If the utility of is defined as 's current uptime, , the utility increases monotonically with time. Again, when is elected a super-peer, it is not demoted unless the election threshold changes significantly. Furthermore, the utility function is fully predictable. Any peer , at any time , can compute the utility of , given has a knowledge of 's birth time, i.e., the time when peer entered the system, since
Clocks do not need to be synchronised between peers, and can estimate the birth time of using its own clock. At time when receives the current uptime from , it assumes that .
If peer utility is defined as the expected remaining session time, i.e., , the utility of a peer decreases monotonically over time. Peer calculates the utility of at time as
Peer utility in this case may be negative, since is only the expected session length and may differ from the actual session length. Furthermore, as peer utility decreases monotonically over time, every super-peer gradually drifts away from the gradient topology centre and may eventually be demoted.
Many algorithms described in this thesis assume that the utility value of each peer is unique, i.e., for any peers . This property may not hold for some utility definition, particularly if peer utility is based on hardware parameters such as CPU clock speed, amount of RAM, etc. If the utility function is significantly coarse-grained, the ranking of peers based on utility and the construction of a gradient topology may become impossible.
In order to address this problem, each peer can add a relatively small random number to its utility value to break the symmetry with other peers. Thus, the utility of is defined as
where is 's utility measured using one of methods described in sections 4.2.1 up to 4.2.6, and is pseudo-random variable initialised when joins the system.
Table 4.1 summarises the main properties
of the utility metrics described in this section.
In a P2P system, a peer has a knowledge of and communicates with a limited number of peers, i.e., its neighbours, denoted . The addition and removal of neighbours at peer is controlled by a neighbour selection algorithm. This section describes a number of neighbour selection algorithms that generate gradient topologies.
There are two general approaches to modelling neighbourhood in P2P systems. In one approach, the neighbourhood relation is asymmetric, i.e., if for two peers and , it is not required that . Thus, the system topology is a directed graph. This model is relatively straight-forward to implement, as it only requires a peer to store contact addresses of each of its neighbour. However, it also has the drawback that peers may store stale addresses of neighbours that have already left the system. This is especially likely in the presence of heavy churn in the system. Such dangling references are disseminated between peers unless an additional mechanism is imposed that eliminates them from the system, such as timestamps used in the peer sampling service  and Newscast . Moreover, in the asymmetric neighbourhood model, a peer has no knowledge about other peers that decide to add it to their neighbourhood sets, and hence, cannot control its in-degree in the topology structure.
In the second approach, the neighbourhood relation between peers is symmetric. If , peers and are connected, and it is required that . It should be noticed, however, that the symmetric model does not require any particular communication protocols. It only assumes that the neighbouring peers are able to contact each other. For example, in systems where messages are small and peers frequently change their neighbours (e.g., in gossip-based protocols), neighbouring peers may simply store the addresses of each other and may communicate over UDP. In applications where large amounts of data are exchanged between neighbouring peers and communication needs to be reliable, neighbouring peers may establish TCP connections. A peer can also create and close TCP connections using some algorithm, for example based on the Least Recently Used (LRU) strategy.
The main advantage of the symmetric connection model is that peers can notify each other when changing their neighbourhoods or leaving the system, which helps to keep the neighbourhood sets up to date. Furthermore, outdated neighbour entries are not propagated between peers in the system, as each peer verifies references received from other peers by contacting each new neighbour directly. Symmetric peer connections also reduce the risk of peer overloading. When a peer receives too many incoming connection requests, it may simply reject them. In the case of neighbours crashing, or leaving without notice, broken connections can be detected either by the operating system (e.g., when using the keep alive protocol for TCP connections) or through periodic polling of neighbours at the application level. The main drawback of the symmetric connection model is that every time two peers set up or close a connection, they need to exchange at least two messages.
In the remaining part of this thesis, it is assumed that peers maintain symmetric neighbour relations. Furthermore, it is assumed that any two peers are directly and mutually reachable. Firewall bypassing, using techniques such as STUN [161,160], TURN , and ICE , is not addressed in this thesis.
The neighbours at each peer are assigned into neighbour subsets. Each of these subsets is managed by a different neighbour selection algorithm, which adds and removes entries in it. A single neighbour may belong to multiple subsets at peer .
The neighbour selection algorithms are periodic. As shown in figure 4.3, peer periodically invokes an procedure on each neighbour subset, which performs a cycle (also called time step) of a neighbour selection algorithm. If is empty, peer invokes a bootstrap procedure described in section 4.8.
Periodic neighbour selection algorithms generally perform better than reactive algorithms in heavy churn conditions, as they have a fixed invocation frequency and usually a bound communication cost. It has been observed that in systems with reactive neighbour exchange, peers generate bursts of messages in response to local failures, which congest local connections and results in a chain-reaction from other peers that send more messages, potentially leading to a major overlay congestion and failure .
The neighbourhood at each peer has a maximum size of . For each neighbour , peer maintains an estimation of 's utility, as described in section 4.2. Additionally, for each neighbour , peer maintains an estimate of 's rank. Peer ranks are estimated using an algorithm described in section 4.6, and are exchanged and timestamped by neighbouring peers in a similar way as utility values.
Moreover, for each neighbour , peer maintains a variable, , which counts the number of 's neighbour subsets that belongs to. These counter variables serve two purposes. First, a neighbour can be removed from by a neighbour selection algorithm only if , i.e., when does not belong to any neighbour subset. By adhering to this rule, neighbour selection algorithms running at peer do not interfere with each other when adding and removing neighbours. Second, a neighbour can be removed from only if , that is when the connection between and is not used by the neighbour . This way, peers achieve an agreement on which connections can be closed, and do not remove connections initiated by neighbours. Connection thrashing is harmful, as it increases the system overhead and may prevent peers from generating a desired topology.
When a neighbour selection algorithm at peer decides to add peer to a neighbour subset , it follows the procedure shown in Figure 4.4. In this procedure, peer first checks if belongs to the neighbourhood set (line 2). If , peer attempts to sends a connect request to and waits for a response (lines 4-5). If none of the two peers reached the maximum number of neighbours, the peers agree to establish a connection and add their addresses to each other's neighbourhood sets (lines 7-8). Finally, peer increments the counter and adds to (lines 13-14).
In order to remove peer from a neighbour subset , peer executes the procedure shown in Figure 4.4. First, peer decrements by one (line 17). If is equal to zero, peer concludes that does not belong to any of its neighbour subset and hence is a candidate for removal (line 18). To that end, peer sends a disconnect request to peer (lines 19-20). If , peer agrees to close the connection (line 21), and both peers remove each other from their neighbourhood sets (lines 22-23). Otherwise, peer preserves neighbour in .
Thus, two peers need to exchange two messages with each other in order to connect or disconnect.
When peer is leaving the system, it performs the procedure shown in Figure 4.4. In this procedure, it sends a disconnect message and unilaterally closes the connection to each neighbour in , without waiting for a reply (lines 27-28). A neighbour , when receiving a disconnect message from , removes from from its neighbourhood set (line 29) and from all its neighbour subsets (lines 30-32). The purpose of the leave procedure is to help peers keep their neighbourhood sets up-to-date.
However, in an open P2P system, it cannot be assumed that all peers perform a leave procedure. A fraction of peers may silently crash, become unreachable, or leave without notifying their neighbours. Furthermore, due to message loss, the neighbourhood sets maintained at neighbouring peers may become inconsistent. In particular, it may happen that , but for two peers and .
In order to remove stale entries from the neighbourhood set and resolve discrepancies with neighbours, each peer periodically runs a neighbour verification algorithm shown in Figure 4.4. In this algorithm, peer attempts to contact each neighbour (lines 35-36). If peer is unreachable (line 37), peer removes from (line 38) and all its neighbour subsets (lines 39-41). If is responsive, but (line 43), the discrepancy between peers is resolved based on (line 44). If , none of the peers are interested in maintaining the connection, and hence it is closed (lines 45). If , peer adds to in order to remove the inconsistency between the two peers (lines 46-47). Finally, in the case where (line 49), the peers remove the connection between them (lines 50-51).
Furthermore, a neighbour is removed from if a communication failure occurs when contacting this neighbour, for example, if no acknowledgement has been received within a certain period, or a certain number of message re-transmissions have failed, depending on the low-level communication protocol.
A random neighbour subset contains a close-to-uniformly-random sample of all peers in the system. The main purpose of this set is to supply the aggregation algorithm with candidates for gossipping. As they are relatively easy to maintain and bootstrap, they can also be used by peers that have just joined the system for routing messages to higher utility neighbours, and as a simple means of system exploration. Furthermore, random sets practically prevent overlay partitioning, as the probability of a random graph partitioning is extremely low.
The neighbour selection algorithm that maintains a random set at peer is shown in Figure 4.5. The algorithm aims to keep approximately entries in . If the set is empty, it is initialised with a random neighbour from (lines 2-5). If the size of is equal to or higher than (line 6), peer removes a random peer, , from (line 7-8), and in order to maintain the symmetry, peer removes from its random set (line 9). Finally, if the size of is below (line 11), peer selects a random neighbour, , from (line 12) and obtains 's random set, (line 13). Peer then selects a random entry, , from , such that (lines 14-15), and both peers and add each other to their random sets (lines 16-17).
By removing excessive neighbours (i.e., when ) and adding new neighbours when , the algorithm strives to maintain neighbours in the random set. Moreover, as all entries inserted and removed from the set are selected at random, the algorithm generates pseudo-random samples of peers in the system.
The algorithm initiates at most one neighbour exchange with another peer per time step. Hence, on average, a peer participates in less than two neighbour exchanges per time step. In every exchange, a peer connects to at most one new neighbour. Hence, on average, a peer connects to less than two other peers per time step, and disconnects less than two neighbours per time step. Each of these operations, connecting, disconnecting, and exchanging neighbours, requires sending one message by the peer. Thus, the average cost of the neighbour selection algorithm can be bounded by 6 messages per time step.
A number of other gossip-based algorithms for the generation of pseudo-random topologies are known, including Cyclone , Newscast , and other approaches described in . The main difference between these algorithms and the algorithm shown in Figure 4.5 is the connection model. The algorithm described in this thesis is based on a symmetric neighbourhood model, while all the other mentioned algorithms assume an asymmetric model.
RanSub  is another system that generates uniformly random neighbourhood sets. However, it is not based on randomised gossipping, but instead constructs a global overlay tree. For this reason, RanSub is less robust to churn and random failures compared to gossip-based approaches.
The algorithm presented here belongs to the rand, pull, rand class from the the taxonomy proposed in . However, it does not converge to a star topology, as expected in , due to the symmetric neighbourhood model, which allows peers to eliminate excessive connections from other peers.
In order to generate a topology that is more sophisticated than a random graph, peers must be able to select their neighbours according to individual criteria. In the preference-based neighbour selection algorithm, each peer defines a preference function, which allows to order any set of peers, , from the most preferred neighbour, , to the least preferred neighbour, . The ordering is achieved by comparing pairs of peers using a preference operator . If for two peers and , then prefers over as its neighbour.
Figure 4.6 shows the pseudo-code of the preference-based neighbour selection algorithm. The algorithm is divided into two parts, which correspond to the left-hand side and right-hand side of Figure 4.6. In the first part, peer updates its set based on entries available in . This part of the algorithm does not require any communication with other peers, as it operates on peers that are already connected to and belong to . These steps are necessary, as the neighbourhood may have changed between the algorithm cycles due to churn, actions taken by other peers, and actions taken by other neighbour selection algorithms running at peer . Furthermore, the preference function of peer may have changed since the last algorithm cycle, for example due to the peer utility change.
If , nothing can be improved in and the set update is finished (lines 3-5). Otherwise, peer selects the most preferred neighbour, , in that does not belong to (line 6). If the size of is below , peer adds to (lines 7-8). If the size of has already reached , peer selects the least preferred neighbour, , in (line 10) and compares it with (line 11). If , and are swapped in (lines 12-13). If , cannot be improved with entries in and the algorithm continues in line 19.
In the second part of the algorithm (lines 19-31), peer attempts to obtain a new neighbour through gossipping with one of its current neighbours, as in the T-Man topology generator framework . First, it chooses a gossip partner, , that belongs to (line 19). In the simplest case, is selected randomly from . Alternatively, may be selected from using a round-robin strategy. In principle, the choice of the gossip partner, , enables a trade-off between the exploration of random connections in and greedy exploitation of the knowledge about neighbours in . In some systems, the selection of results in the fastest topology convergence to an optimal configuration, but randomised approaches are more robust in the general case.
When has been chosen, peer sends a gossip request to and obtains the set of 's neighbours, (line 20). From this set, peer selects the most preferred neighbour, , that does not belong to (lines 22-23). If such a peer exists, and the size of is below , peer adds to (lines 24-25). If has reached the maximum size, peer selects the least desired neighbour, , in (lines 26-27) and compares it with . If , peer removes from and replaces it with .
In the presence of churn and communication failures, if either a request or response message is lost, a neighbour exchange fails. Thus, churn and message loss reduce the convergence speed of the system topology to the desired configuration, but do not introduce any side-effects such as a bias towards a particular non-desired topology.
The performance of the algorithm can be further improved by introducing ``age bias'' . With this technique, a peer does not initiate gossip exchange with low-uptime neighbours, because such neighbours have not had enough time to optimise their neighbourhood sets according to the preference function, and therefore are not likely to provide good neighbours for .
As in the random set, on average, a peer gossips with two other peers per time step, connects at most two new neighbours, and disconnects at most two neighbours. The number of neighbours changed in the set per time step depends on the stability of the topology. A peer generally swap its neighbours more frequently upon joining the system, and less frequently when it has already found its optimal position in the topology. Neighbours are also more often swapped when the utility of peers changes dynamically. In all cases, the average number of messages sent by a peer per time step is between two and six.
A simple gradient topology can be constructed using utility successor sets. In these sets, peers aim to connect with neighbours that have higher but similar utility. Formally, a peer prefers over for its successor set ( ) if and only if
for and .
The advantage of the above preference function is that it does not require any knowledge of peer ranks and can be easily evaluated by every peer in the system. Gradient topologies generated using this preference function have been studied in [164,163]. In principle, they do not guarantee the gradient search cost and provide inferior support for message routing compared with TGTs.
Similar to successor sets, utility predecessor sets are generated by a preference function where if and only if
for and .
Predecessor sets, when used together with successor sets, improve the topology convergence to a configuration where every peer is connected to the most similar peers in terms of utility [164,163]. Again, such gradient topologies are outperformed by TGTs.
In order to generate a tree-based gradient topology, as defined in section 3.1, peers maintain tree sets. In these sets, each peer aims to connect with neighbours that have ranks as close as possible to , where is the topology branching factor and is 's current estimation of its own rank.
Tree sets are generated by a preference function where for peers , and if and only if
In the simplest case, the size of the set is defined as , which produces a TGT where each peer has one parent. In order to improve the fault tolerance of the overlay, is increased, so that each peer has more than one parent. However, this significantly increases the number of connections at high-utility peers, since in a TGT with peers, where for each peer , a peer with utility above has on average neighbours.
The aggregation algorithm, described in this section, allows peers to compute and periodically update approximations of global system properties, such as the minimum, maximum, and mean of peer utility, system size, and others. The approximations of these properties are used in turn for the calculation of peer ranks and super-peer election thresholds. The aggregation algorithm requires that a peer is able to obtain a close-to-random sample of all peers in the system. Such samples can be generated based on random neighbour subsets, described in 4.3.4, as well as the Newscast  and Cyclon  protocols.
The aggregation algorithm is based on periodic gossipping. Each peer maintains estimates (also called aggregates) , , ... of global system properties. The true values of these properties are denoted , , ... . Additionally, each peer stores a set that contains the currently executing aggregation instances and auxiliary data. The aggregation algorithm can be seen as a meta-algorithm, since the selection of the system properties , , ... can be tailored to application-specific requirements.
Each peer runs an active and a passive thread. The active thread periodically initiates a gossip exchange and the passive thread responds to all gossip requests received from neighbours. Thus, on average, a peer sends and receives two aggregation messages per time step, i.e., protocol cycle. When initiating a gossip exchange, peer randomly selects a neighbour from its random neighbourhood set, and sends to . Peer responds immediately by sending to . Upon receiving their sets, both peers merge them using a operation described later. The general structure of the algorithm, shown in figure 4.7, is based on Jelasity's push-pull epidemic aggregation [81,125].
The aggregation algorithm can be intuitively explained using the concept of aggregation instances. An aggregation instance is a computation that generates new approximations of the system properties , , ... at all peers in the overlay. Aggregation instances may overlap in time and each instance is associated with a unique identifier . Potentially any peer can start a new aggregation instance by generating a new and creating a new entry in . As the new entry is propagated throughout the system, other peers join the instance by creating corresponding entries with the same . Thus, each entry stored by a peer corresponds to one aggregation instance that this peer is participating in. Every instance has a finite time-to-live, and when an instance ends, all entries corresponding to this instance are removed and all peers obtain new approximations of , , ... .
Formally, each entry, , in of peer is a tuple consisting of values,
where is the unique aggregation instance identifier, is the time-to-live for the instance, and , , ... are variables that contain the current estimations of , , ... .
A peer starts a new aggregation instance by creating a local tuple
where is chosen randomly, is a system constant, and , , ... are initial estimations of system properties , , ... by , as explained later.
As the initial tuple is disseminated by gossipping, peers join the new aggregation instance. It can be shown that in push-pull epidemic protocols, the dissemination speed is super-exponential, and with a very high probability, every peer in the system joins an aggregation instance in just a few time steps [80,81].
The tuple merge procedure, , consists of the following steps. First, for each individual tuple received by from , peer checks if its local set contains a tuple that is identified by . If such a tuple is not found, and , peer creates a new tuple
and adds it to . Values , , ... depend on the approximated system properties , , ... and are covered later.
By creating a local tuple, peer joins a new aggregation instance and introduces its own input to the approximation of , , ... . However, if , peer should not join the aggregation, as there is not enough time before the end of the aggregation instance to disseminate the information about and to calculate accurate aggregates. This usually happens if has just joined the P2P system and received an aggregation message that belongs to an already running aggregation instance. In this case, the update operation is aborted by .
In the next step, for each tuple received from , peer replaces its own tuple, , with
where again, functions , , ... are specific to the approximated system properties. By merging its local tuples with the tuples received from , peer contributes to the calculation of the aggregation result.
Finally, for each tuple in such that , peer removes from and updates its current estimates of the system properties by setting .
The definitions of , , and , for , depend on the approximated system properties , , ... . For example, in order to approximate the average peer utility in the system, , and are defined as for every peer , is defined as , and is simply . It can be shown that the values generated by the aggregation algorithm at each peer approximate the the true utility average with the average error and variance decreasing exponentially over time [82,125,81].
Figure 4.8 shows graphically a simple scenario where 6 peers estimate the average of their utility. The initial configuration is shown in picture (a). The peer in the top left corner initiates a new aggregation instance. As peers exchange tuples with each other, they gradually join the aggregation instance and update their estimations of the average (b-e). The variance between peers' estimates decreases over time. After 5 time steps, peers terminate the aggregation instance (f).
The estimation of the total number of peers in the system is based on the calculation of the average. The peer that starts the aggregation instance creates a tuple containing . The remaining peers join the aggregation instance by adding tuples with . Tuples are merged as in the calculation of the average. The values obtained at the end of the aggregation approximate the reciprocal of the system size, and hence, . Moreover, given the number of peers in the system, , and the average peer utility, , the sum of all peers' utility values is calculated as .
Figure 4.9 shows a sample aggregation execution where 6 peers estimate the system size . Initially, one peer holds a value of one and the remaining peers are initialised with zeros (a). Over a few random gossip exchanges, peers average out their values and obtain close approximations of (b-e). After six time steps, four peers estimate the system size as , and two peers estimate the system size as (f).
Despite an extreme skew in the initial value distribution (1 for the peer that starts the instance and 0 for all other peers), the algorithm efficiently approximates the system size. This can be intuitively explained in the following way. At the beginning of an instance, very few peers hold non-zero values. When a non-zero holding peer performs a random gossip, with a very high probability, it passes half of its value to a zero-holding peer. Hence, the value held by the peer decreases at an exponential rate. At the same time, the number of peers participating in the instance grows exponentially, as in a push-pull epidemic. When only few non-participating peers are left, they join the instance within just few time steps.
It can be shown that the gossip exchange operations do not change the sums of values held by peers. Thus, the sum of all values is always equal to one, as in the initial configuration. Moreover, each time two peers perform an exchange, they can only reduce (or leave unchanged) the global variance. Intuitively, extreme values (either high or low compared to the mean ) have higher probability of being averaged out in a random exchange, since the expected peer value is equal to the mean . Over time, the global variance converges to zero and the values held by peers converge to the mean . A theoretical proof of this behaviour, as well as an experimental evaluation, are described in full detail in [82,125,81].
In order to estimate the minimum and maximum peer utility in the system, the and functions are defined as for each peer , is defined as and , respectively, and is defined as . This way, the true system minimum (or maximum, respectively) is broadcast between peers in the system using a push-pull epidemic and the convergence speed is exponential .
The aggregation algorithm can also be used to estimate the fraction of peers in the system that satisfy a certain condition . To that end, and are set to one for peers that satisfy the condition and zero for peers that do not satisfy . Merge and update are defined as in the average utility calculation.
Given the fraction
of peers that satisfy
the system size
, the number of peers that satisfy
can be calculated as
. Similar approaches allow the approximation
of the average utility of peers that satisfy a given condition, and
the sum of utility values for peers that satisfy a condition. Table
4.2 lists the definitions of
for the approximation of a number of different system properties.
In order to construct a tree-based gradient topology and elect super-peers using thresholds as described in section 3.1.1, peers need to have a knowledge of the current number of peers in the system, , the total load in the system, , and most importantly, the distribution of of peer utility, , and capacity, .
It is straight-forward to approximate the system size and load using the aggregation algorithm described in the previous section. However, and are functions rather than scalars and cannot be directly approximated through aggregation. Instead, they are interpolated using histograms, which divide the peer utility spectrum into fixed intervals and approximate and in a finite number of points.
A cumulative peer utility histogram, , is defined as an array consisting of elements, called bins, where the 'th element, , is equal to the number of peers in the system with utility above
where is a parameter called bin width, defined as , and and are two variables that determine the range of the histogram. The number of bins, , is also called histogram resolution, . A utility histogram, , is a discrete version of the utility distribution function, , with points positioned between and , since for each such that .
Similarly, a cumulative peer capacity histogram, , is defined as a an array consisting of elements, where the 'th element, , is equal to the total capacity of peers with utility above
with the definitions of , , and as above for utility histograms. A capacity histogram is a discrete version of the capacity distribution function, as for such that .
Utility and capacity histograms can be generated using aggregation. According to the classification in Table 4.2, each bin in a utility histogram is conditional count, and each bin in a capacity histogram is a conditional sum, with the condition in both histograms defined as .
A utility histogram produced by the aggregation algorithm approximates the utility distribution function, , in a limited number of points. In order to obtain the value of for any utility , a peer interpolates its utility histogram, , and generates an approximation of , denoted .
When using linear interpolation, is calculated as follows. For , the interpolant is outside of the histogram range, and is defined as . Similarly, for , is set to . For , value is approximated based on two histogram bins, and , where . The interpolant must lie between and , as shown in Figure 4.10, where the values of and are known and are equal to and . Using linear interpolation,
Naturally, any other interpolation methods can be used by peers to approximate , such as polynomial and spline interpolation.
Interpolation techniques can also be used by peers to estimate utility thresholds, , such that for some variable . If the inverse function for exists, . A peer calculates , which approximates , based on a utility histogram .
For , is set to , and similarly, for , is set to . For , there must be histogram bins and such that . The two histogram bins are correspond to the values of for and . Using a linear interpolation,
Similar approach can be applied by peers to approximate the values of the capacity distribution function, , based on capacity histograms, .
This section describes a full version of the aggregation algorithm that approximates all system properties needed for the super-peer election. Each tuple in at peer consists of 10 values
where is the unique instance identifier, is the time-to-live value, is called the weight of the tuple and is used to estimate the system size, is the current estimation of the system load, and are the current estimations of the minimum and maximum peer utility, and are two -bin arrays used in the estimation of and , and and are the histogram range parameters used in this aggregation instance.
A peer joining the system obtains the current estimations of the system properties from one of its initial neighbours. At each time step, peer starts a new aggregation instance with probability by creating a tuple
where and are the current estimations at peer of the minimum and maximum peer utility in the system, is an initial utility histogram created by such that
and is an initial capacity histogram where
for . The bin width is defined as , and is a constant system parameter determining the histogram resolution, . Probability is calculated as , where is the current estimation of at peer , and is a system constant that regulates the frequency of peers' starting aggregation instances. In a stable state, with a steady number of peers in the system, a new aggregation instance is created on average with frequency .
A peer joins an aggregation instance by creating a tuple
where , and are obtained from the tuple received from the neighbour that invited to join this aggregation instance.
When a tuple is merged with tuple , it is replaced with
where , , and all arithmetics on histograms and are performed pair-wise on each bin.
Finally, when an aggregation instance ends, a peer updates its estimates of the system properties by setting , , , , and for each such that
In a stable population of peers, with no message loss, the aggregation algorithm has the following invariant. For each aggregation instance , the weights of all tuples in the system associated with sum up to . Moreover, the average of , minimum of , and maximum of fields in all tuples associated with an aggregation instance are equal to the average load, minimum utility, and maximum utility, respectively, amongst all peers that participate in the instance. It can be shown that the aggregates converge over time to the true values of system properties [82,125,81].
In the presence of churn, the results produced by the aggregation algorithm may diverge from the true system properties, since the system is changing while the aggregation is running. However, the approximation error depends on the type of the aggregated system property. When estimating an average, such as the average peer utility, the expected value introduced by a joining peer, as well as the expected value removed from the system by a leaving peer, is equal to the overall system average. Hence, the expected value produced by aggregation is equal to the true average, and there is no bias towards higher or lower values. Churn increases the variance of the aggregated results, but does not change their expected value.
This property does not hold when estimating peer counts and sums, both conditional and unconditional, and histograms. This can be shown by the following analysis. A single run of an aggregation instance can be divided into two halves. In the first half, when the instance time-to-live is above , peers can join the instance as well as leave the instance by departing from the P2P system. In the second half, peers can only leave and are not allowed to join.
Leaving peers always have non-negative weights, which they remove from the system. At the same time, joining peers initialise their weights to zero and hence do not increase the instance weight. These two facts cause the phenomenon of the aggregation weight loss. In the presence of churn, the longer an aggregation instance runs, the lower the weight of this instance.
Since the system size is approximated by the reciprocal of the tuple weight, peers tend to overestimate due to the aggregation weight loss. Similarly, sums and histograms generated by the aggregation algorithm have a bias towards higher values.
In order to improve the accuracy of aggregation results in systems with churn, peers departing from the system, if they do not crash, perform a leave procedure, where they send all currently stored tuples with time-to-live values above to randomly chosen neighbours. The receiving neighbours add the weight of each received tuple to their corresponding tuples, joining aggregation instances when necessary. This way, peers prevent the aggregation weight loss and help to preserve the weight invariant.
Peers do not perform the leave procedure for tuples with time-to-live values below for two reasons. First, extra weight passed between peers shortly before the end of an aggregation instance may distort the aggregation results. Second, in the second half of an aggregation instance, no peers are allowed to join, and the total instance weight must be reduced when peers are leaving the system in order to obtain correct weight at each peer at the end of the aggregation instance.
In the presence of communication failures, similar to churn, aggregation may produce inaccurate estimations of system properties. When an aggregation request fails, a peer simply does not perform an gossip exchange with a selected neighbour and does not update its tuples. Hence, request loss decreases the convergence speed of aggregates towards the true system properties, but does not produce any bias in the generated results.
When a response message is lost, the requestee updates its tuples, as if an exchange took place, while the requester dismisses the exchange assuming a communication failure. Thus, the algorithm's invariants, such as mass conservation, are violated and the results produced by aggregation become inaccurate. Moreover, a systematic bias may appear, similar to the weight loss effect in the presence of churn.
However, in many P2P systems, it can be expected that response messages are significantly less likely to fail compared with request messages. A request message is lost when a peer attempts to send it to a neighbour that no longer exists in the system. In dynamic systems, peers are always likely to have such outdated entries in their neighbourhood sets due to churn, communication latency, and failures. At the same time, response messages are only sent to peers that have just issued an aggregation request, and hence are likely to be on-line. Furthermore, with some probability, request messages are blocked by firewalls and NATs installed between peers. However, a response message is less likely to be blocked by a firewall, since it is only sent when two peers have already transmitted a request, and hence there is no firewall between the peers, or the firewall permits a communication between them.
Since push-pull aggregation algorithms can handle request message loss, it is argued that they are more applicable to P2P systems than push-based algorithms, such as those proposed by Kempe et al.  and Sacha et al. . Moreover, it is believed that push-pull aggregation algorithms to outperform push-based approaches since push-pull epidemics spread faster than push-only epidemics .
In the aggregation algorithm described in section 4.4.5, an aggregation instance is started with an average frequency of . Since a peer performs on average two gossip exchanges per time step and the time-to-live of a tuple is decreased by at each gossip exchange, an aggregation instance lasts approximately time steps and a peer participates on average in less than aggregation instances. Hence, a peer sends and receives on average two messages per time step and the message size is proportional to .
When the frequency of aggregation is increased, the aggregates maintained by peers are more up-to-date, but the cost of the algorithm increases proportionally to . Similarly, the cost grows when TTL is increased. A third approach is increase the algorithm fan-out, i.e., the number of neighbours contacted per time step. In this version of the algorithm, the active thread at peer performs one gossip exchange if participates in no aggregation instances (i.e., ), and performs gossip exchanges with randomly selected neighbours if runs at least one aggregation instance (i.e., ), where is a system constant. The passive thread remains unchanged.
An aggregation instance is started with an average frequency of , as before, but it lasts only time steps, since peers perform on average gossip exchanges when they participate in aggregation instances. Thus, on average, a peer participates in approximately instances. The average number of messages generated by a peer per time step, assuming , is then
since a peer initiates gossip exchanges with probability and one exchange with probability , and for each exchange peers generate two messages. Thus, neither the average number of generated messages nor the average message size grow with increasing . When , a peer exchanges approximately messages per time step.
As is increased, aggregation instances become shorter ( time steps), and hence, fewer peers join and leave the system when an instance is executing. This way, the algorithm reduces the impact of churn on the aggregation results. In particular, it reduces the instance weight loss. Furthermore, as the instance is shorter, aggregates are calculated more quickly and are hence more up-to-date. The only cost associated with increasing fan-out is that peers need to send bursts of messages per time step, which may cause a network congestion and a higher message loss rate.
The super-peer election algorithm, executed locally by each peer in the system, classifies each peer as either a super-peer or an ordinary peer. The algorithm has the property that it elects super-peers with globally highest utility, and it maintains a highest-utility super-peer set as the system evolves. The algorithm also limits the number of switches between ordinary peers and super-peers in order to reduce the associated overhead.
Given the approximations of the system size, , system load, , peer utility distribution , peer capacity distribution , and other aggregates generated by the algorithm described in section 4.4.5, every peer can calculate super-peer election thresholds, as defined in section 3.1.1. In the simplest case, every peer periodically calculates the super-peer threshold, , and compares it with its own utility , as shown in Figure 4.11(a). All peers with utility above the threshold become super-peers, while peers below the threshold become clients.
However, in a dynamic system, the super-peer election threshold constantly changes over time due to peer arrivals and departures, system load fluctuations, and statistical error produced by the aggregation algorithm. Moreover, the utility of individual peers may change, as discussed in section 4.2. As a consequence, peers may frequently cross the super-peer threshold and switch their roles between being super-peers and clients, increasing the system overhead, for example due to the required data migration and synchronisation between super-peers.
In order to prevent peers from frequently switching their roles between super-peers and clients, the system uses two separate thresholds for the super-peer election, an upper threshold, , and a lower threshold, , where . An ordinary peer becomes a super-peer when its utility exceeds , while a peer stops to be super-peer when its utility falls below , as shown in figure 4.11(b). This way, peers with utility above are always super-peers, peers with utility below are always clients, while peers between the two thresholds can be either super-peers or clients, depending on their previous utility values and previous election thresholds. Thus, the system exhibits the property of hysteresis. The minimum utility change required for a peer to switch its status, assuming constant election thresholds, is . A gradient topology with super-peers elected using two utility thresholds is shown in figure 4.12.
The upper and lower thresholds determine the minimum and maximum number of super-peers in the system. Thus, they restrict the number of super-peers in the system to an interval. For example, in order to maintain between and super-peers, where , thresholds and are calculated based on the utility distribution, , so that and . Likewise, in order to adapt the number of super-peers to the system load, , while maintaining the average super-peer utilisation between and , where , and must satisfy and . Similar formulas can be derived for other threshold defined in section 3.1.1.
As the gap between the upper and lower threshold increases, the restriction on the number of super-peers in the system is relaxed, but the switches between super-peers and clients become less probable. In order to eliminate super-peer demotions completely, peers may use a variant of the super-peer election algorithm, shown in figure 4.13, super-peers are elected for their lifetimes, i.e, until they leave the system.
In order to control the number of super-peers in this algorithm, two changes are required in the threshold calculation and histogram generation. First, the histograms and are calculated in such a way that they contain only peers that can be promoted to super-peers, i.e., clients. For that purpose, in the aggregation algorithm, when peers start or join an aggregation instance, they define their initial utility histogram as
and for the capacity histogram as
This way, contains the number of clients with utility above , and contains the total capacity of clients with utility above . Through interpolation, histogram can be used to estimate the client utility distribution function, , where
Similarly, can be used to estimate the client capacity distribution, , where
In order to maintain super-peers using a top-K threshold, a peer checks at each time step how many new super-peers need to be created in the current state of the system. Assuming the peer can estimate the current number of super-peers, , using aggregation, the number of required new super-peers is . If , the threshold is set to infinity, as no super-peers need to be created. If , the super-peer election threshold, , is derived from equation . This way, in a system with dynamic peer population, fluctuating peer utility, load, and election thresholds, super-peers are never demoted to clients, but the system creates no more than super-peers at any given time.
Similarly, in order to maintain a super-peer set with a total capacity of , for example equal to the total system load, , peers estimate the current super-peer capacity in the system, , using aggregation, and calculate a super-peer election threshold according to formula .
The calculation of the clients threshold, as defined by formula 3.6 in section 3.1.1, requires the knowledge of both and . This section describes an approach that allows the estimation of the clients threshold using only.
Each peer estimates the current system size, , and the current super-peer ratio, . The current number of clients, and hence the required super-peer capacity to handle these clients, is given by . Peers calculate the super-peer election threshold as a utility value, , such that .
Although super-peer sets elected this way do not necessarily have the exact capacity required to handle the clients in the system, they converge to the optimum size when the election is repeated. When the algorithm elects too many super-peers, the number of clients decreases, and fewer super-peers are elected in the following iteration. Similarly, when too few super-peers are elected, the number of clients increases, and the algorithm elects more super-peers in the following run.
This section describes three algorithms that allow peers to estimate their ranks. The first method is based on utility histograms, the second method is based on utility successor sets, and the last method is a combination of the first two. Peer ranks are required for the construction of tree-based gradient topologies.
The simplest way to calculate peer ranks is to use utility histograms, assuming they are generated by aggregation. The rank of a peer is equal to , where is a utility distribution function approximated using histogram at peer , and is 's estimation of its own utility.
The main drawback of this rank estimation method is that it generates a significant error for the highest-utility peers. This is because the highest utility peers belong to the last bins in the histogram (i.e., close to ), which contain very few samples, and the interpolation of these few samples is likely to produce a significant approximation error.
Another approach to the peer rank estimation is to use the information about neighbours in successor sets. In order to calculate , a peer first creates a subset, , of its successor set, , which contains peers with utility higher than
In a stable state of the system, where the successor sets are fully optimised according to the preference function defined in section 126.96.36.199, should contain peers with ranks equal to
where is equal to . The sum of the ranks of all peers in is then equal to
This information can be used by peer to estimate its rank . If is empty, peer assumes that it has the highest utility in the system and sets its rank to zero. Otherwise, peer calculates the sum of the ranks of its neighbours in and sets its rank to
where the rank estimation formula 4.37 is derived directly from 4.36.
In principle, any neighbour in can be used for the estimation of , given formula 4.35. However, by summing and averaging the ranks of all neighbours in , peer uses all available information for the estimation of its rank, and reduces the estimation error, since the individual errors introduced by each variable may cancel out.
In order to initialise its rank when joining the system, peer selects then the highest utility neighbour, , in its initial neighbour set, , such that , and the lowest utility neighbour, , such that . The rank of peer is calculated as the average between and
The successor-based rank estimation method requires that peers maintain globally optimal successor sets. If a peer does not connect to a higher-utility neighbour that would belong to its successor set in an optimal topology configuration, the peer is likely to underestimate its rank, i.e., calculate its rank such that . Moreover, as peers calculate their ranks based on the rank estimates of their higher-utility neighbours, rank estimation errors are propagated, and gradually magnified, from the highest-utility peers to the lowest-utility peers.
Thus, in large and dynamic systems, due to churn and communication failures, peers tend to underestimate their ranks, and the rank estimation error increases together with peer rank. Due to that, the successor-based rank estimation method is only applicable to small P2P systems.
In order to combine the advantages of both the histogram-based and successor-based rank estimation methods, a mixed approach is proposed, where the highest-utility peers use successor sets to estimate their ranks, and all remaining peers use histograms for the rank estimation.
The pseudo-code of the mixed method at peer is shown in Figure 4.14. If peer has no successor set (line 2), it approximates its rank using a utility histogram (line 3), as described in section 4.6.1. Furthermore, if the estimated rank of is lower than a threshold value (line 4), peer creates successor and predecessor sets and joins the highest utility peers (line 5).
If peer estimates its rank when it already has a successor set (line 8), it applies the successor-based method described in section 4.6.2 (line 8). If the estimated rank is equal to or higher than (line 9), peer generates another rank estimation, , using its current histogram (line 10), and removes its successor and predecessor sets if (line 11-12).
The extra check between and in line 11 of the algorithm is necessary to prevent peer from cyclically adding and removing its successor and predecessor sets. Such a situation could have happened, if the rank estimation obtained using the histogram-based method was below , while the rank estimated using the successor set at peer was above . In the algorithm shown in figure 4.14, a peer removes the successor and predecessor sets only if both estimation methods produce ranks below . Moreover, in case of an inconsistency between the ranks obtained using the two methods, the algorithm assumes that the successor-based rank estimation method is more accurate.
Further improvements to the algorithm may be introduced in order to allow a gradual transition between rank estimates obtained using the two methods. For example, the rank of peer can be calculated as , where is a system parameter, and and are two rank estimates produces by the two methods.
Gradient search is a multi-hop message routing algorithm, that can deliver a message from potentially any peer in the system to a high utility peer, i.e., a peer with utility above a given utility threshold. In gradient search, a peer greedily forwards each message that it currently holds to its highest-utility neighbour, i.e., to a neighbour whose utility is equal to
Thus, messages are forwarded along the utility gradient, as in hill climbing and similar techniques.
If gradient search is performed over a TGT, peers are most likely to forward messages to their parents. As shown in theorem 3.3, the worst-case cost for routing a message from a peer to a super-peer in a TGT is overlay hops, where is the total number of super-peers in the overlay. Message paths can be even shorter than hops, if peers use their random sets for routing, which potentially contain neighbours with higher utility than that of parents.
By exploiting the information contained in the topology, gradient search achieves a significantly better performance than general-purpose search techniques for unstructured P2P networks, such as flooding or random walking . Gradient search also reduces message loss rate by preferentially forwarding messages to high utility peers, assuming peer utility is based on a notion of peer stability .
Local maxima should not occur in an idealised gradient topology, however, every P2P system is under constant churn and a gradient topology may undergo local perturbations from the idealised structure.
There are a number of different approaches to prevent message looping in the presence of a local maxima in a gradient topology. One approach is to assign a unique identifier to every generated message, and to maintain at each peer a cache of recently forwarded messages. For each message, the cache contains the neighbours to which the message was forwarded. When a peer receives a message that it has already routed, it forwards the message to a neighbour that is not yet present in the cache for this message. If no such neighbour can be chosen, the message is forwarded to a random neighbour in order to escape the local deviation in the topology structure.
Another approach is to append a list of visited peers to each search message, and to impose a constraint that forbids forwarding messages to previously visited peers. Again, if a peer is not able to forward a message to a non-visited neighbour, it routes the message randomly. Additionally, a time-to-live value is added to each message, which is decremented by one at each overlay hop. Messages with negative time-to-live values are discarded by peers in order to prevent infinite message circulation in the overlay.
Bootstrap is a process in which a peer obtains an initial configuration in order to join the system. In P2P systems, this primarily involves obtaining addresses of initial neighbours. Once a peer connects to at least one neighbour, it can receive from this neighbour the addresses of other peers in the system as well as other initialisation data, such as the current values of aggregates.
However, initial neighbour discovery is challenging in wide-area networks, such as the Internet, since a broadcast facility is not widely available and peers cannot simply broadcast a join request to all addresses in the network. In particular, the IP multicast protocol has not been commonly adopted by Internet service providers due to design and deployment difficulties . Most existing P2P systems rely on centralised bootstrap servers that maintain lists of peer addresses.
This section describes a bootstrap procedure that consists of two stages. In the first stage, a peer attempts to obtain initial neighbour addresses from a local cache saved during the previous session, for example on a local disk. This can be very effective; Stutzbach et al  analyse statistical properties of peer session times in a number of deployed P2P systems and show that if a peer caches the addresses of several high-uptime neighbours, there is a high probability that some of these high-uptime neighbours will be on-line during the peer's subsequent session. Furthermore, such a bootstrap strategy is fully decentralised, as it does not require any fixed infrastructure, and it scales with the system size.
However, if all addresses in the cache are unavailable or the cache is empty, for example if the peer is joining the system for the first time, the peer needs to have an alternative bootstrap mechanism. In the second stage, peers obtain initial neighbour addresses from a bootstrap node. The IP addresses of the bootstrap nodes are either hard-coded in the application, or preferably, are obtained by resolving well known domain names. This latter approach allows greater flexibility, as bootstrap nodes can be added or removed over the course of the system's lifetime. Moreover, the domain name may resolve to a number of bootstrap node addresses, for example selected using a round-robin strategy, in order to balance the load between bootstrap nodes.
Each bootstrap node is independent and maintains its own cache containing peer addresses. The cache size and the update strategy are critical in a P2P system, as the bootstrap process may have a strong impact on the system topology, particularly in the case of high churn rates. If the cache is too small, subsequently joining peers have similar initial neighbours, and as a consequence, the system topology may become highly clustered or even disconnected if the peers in the cache become overloaded. On the other hand, a large cache is more difficult to keep up to date and may contain addresses of peers that have already left the system.
A simple cache update strategy is to add the addresses of currently bootstrapped peers to the cache and to remove them in the FIFO order. However, this strategy has the drawback that it generates topologies where joining peers are highly connected with each other. In such topologies, joining peers are less likely to discover their optimal neighbours, since most peers they communicate with are also joining the system and have not yet optimised their configurations.
A better approach is to continuously ``crawl'' the P2P network and ``harvest'' available peer addresses. In this case, the bootstrap node periodically selects a random peer from the cache, obtains the peer's current neighbours, preferably from the peer's random neighbour subset, adds these neighbours to the cache, and removes the oldest entries in the cache. This has the advantage that the addresses stored in the cache are close to a random sample from all peers in the system.
Figure 4.15 shows the bootstrap process of Peer A. In step (1), Peer A obtains a list of initial neighbour addresses from its local cache. If any of the initial neighbours are alive, the peer goes to step (4). Otherwise, in step (3), the peer resolves a well-known domain name and obtains an IP address of a bootstrap node. From the bootstrap node, in step (3), Peer A receives a list of addresses of initial neighbours. Finally, in step (4), Peer A contacts one of the initial neighbours (Peer B) and obtains additional information about the system, such as current values of the aggregates.
This chapter describes the design of the gradient topology and its main component. The chapter shows a number of utility functions and neighbour selection algorithms that generate TGTs, aggregation and ranking algorithms that estimate global system properties, super-peer election strategies, gradient search variations, and an approach to bootstrap peers joining the system. The next chapter evaluates these algorithms, shows their performance characteristics and trade-offs, and compares them to a number of state-of-the-art systems.
The description of the gradient topology evaluation consists of six sections. The first section contains a functional comparison between gradient topologies and a number of state-of-the-art super-peer election systems. It shows that gradient topologies offer more flexible and powerful election mechanisms than the other considered systems. The second section describes a custom-built P2P simulator, which is used later to run performance experiments. The third section compares the performance of gradient topologies and selected state-of-the-art systems, and shows that the election algorithms used in the gradient topologies generate higher-quality super-peer sets, according to a number of different metrics, at a similar cost, compared to the other systems. The fourth section describes a detailed evaluation and analysis of gradient topologies and their specific features that are not supported in the other systems. The purpose of this section is to verify the theoretical properties of gradient topologies introduced earlier in chapter 3, and to validate the design of the algorithms introduced in chapter 4. The fifth section validates the custom-built simulator. Finally, the last section demonstrates the practical viability of gradient topologies by applying the experimental results to a sample application scenario.
The functional comparison is intended to examine the ability of systems to generate, control and adapt super-peer sets in the overlay according to the higher-level application requirements. For this reason, the comparison is concerned with the state-of-the-art systems that are classified as adaptive and DHT-based in chapter 2, sections 2.5 and 2.4. Group-based super-peer election systems, described in section 2.3, are not considered in the comparison, since most of these systems depend on application-specific peer properties, such as physical location, position in a virtual space, and semantic properties, and introduce additional application-specific constraints in the super-peer election which cannot be directly compared with general-purpose super-peer election algorithms addressed in this thesis. The simplest approaches to super-peer election, described in section 2.2, are also excluded from the comparison.
The systems are compared based on eight features. Six of these features correspond to super-peer election criteria identified in the super-peer systems reviewed in chapter 2. These criteria can be described as a fixed number of super-peers, fixed ratio of super-peers to clients, fixed super-peer capacity, super-peer capacity based on the client set, super-peer capacity based on load, and latency-based super-peer constraints. Additionally, two features are added, which are believed to be highly relevant to many P2P applications. These are the support for peers that dynamically change their properties, and election of multiple super-peer sets.
SOLE , covered in section 2.4.3, allows the election of fixed numbers of super-peers, determined by global super-peer identifiers. It supports the election of multiple super-peer sets, as each application running on top of SOLE can define its own set of super-peer identifiers. However, SOLE does not introduce any notion of peer capability and elects super-peers solely based on their positions in the DHT.
Hierarchical DHTs , described in section 2.4.4, also elect fixed numbers of super-peers, determined by peer groups. Each group in the system elects its highest-capability peer as the group's super-peer. In cases where peer capabilities change over time, the system can swap super-peers with higher-capability clients, but does not provide any mechanisms to reduce the associated overhead. Hierarchical DHTs can be generalised to multi-level hierarchies with super-peers elected at each level in the system structure, but the design of such an approach is not described in .
HONet , introduced in section 2.4.5, elects super-peers based on peer coordinates in a virtual space. A super-peer is created each time a peer is located beyond a fixed distance, , from the existing super-peers. Thus, the maximum number of super-peers elected in a HONet system is fixed and is determined by the size of the virtual space and the super-peer coverage, . Hypothetically, each application built on top of a HONet system could elect its own super-peer set with a custom value for , using the virtual peer space and the DHT overlay, but this approach is not considered in .
In SPChord , described in section 2.4.6, a new super-peer is created when the size of a cluster exceeds a fixed threshold, , and a super-peer is removed when the size of a cluster falls below . Thus, SPChord maintains a ratio of clients to super-peers between and . Peers are positioned in a virtual coordinate system, and each client is associated with the closest super-peer, while super-peers are elected based on their uptimes. The super-peer election criteria can be potentially changed, but the system does not provide any mechanisms to stabilise the topology and reduce the number of swappings between super-peers and clients if peer properties fluctuate over time. Parameter can be theoretically defined for each peer individually, reflecting peer capacity, in which case SPChord would generate super-peer sets with total capacity equal to or higher than the number of clients in the system. Moreover, it may also be possible to split and merge super-peer clusters depending on the load on super-peers, but all these approaches are not addressed in .
Structured Superpeers , covered in section 2.4.7, is a system similar in many respects to SPChord. It splits and merges clusters, defined as arcs in an outer DHT ring, based on the number of clients and load associated with each super-peer. However, due to the lack of detailed system description, it is difficult to determine exactly how the super-peers are elected and what are the properties of the election mechanism.
DLM , described in section 2.5.3, is an election algorithm that maintain a fixed ratio of super-peers to clients in a P2P system. Super-peers are elected based on their capacity and uptime, but similar to hierarchical SPChord and Structured Superpeers, other super-peer election criteria could potentially be applied in DLM. The DLM algorithm can also handle peers with dynamically changing capacity, but  does not address the problem of super-peer stability and switches between super-peers and clients.
SG-1 , addressed in section 2.5.1, generates a P2P topology where the total capacity of super-peers is approximately equal to the number of clients in the system, and the capacity of super-peers is maximised. The algorithm assumes that peer capacity is constant, and continuously attempts to transfer clients from lower-capacity super-peers to higher-capacity super-peers, reducing any redundant super-peers. In order to elect two super-peer sets, each peer would have to participate in two superpeer and two underloaded overlays, and would have to maintain two clients sets. Hence, the system overhead would almost double.
SG-2 , discussed in section 2.5.2, extends the SG-1 algorithm and adds latency constraints to the generated P2P topology. Super-peers in SG-2 cover the system space and periodically broadcast advertisement messages in their neighbourhood, while clients are transferred from lower-capacity super-peers to higher-capacity super-peers. Hypothetically, independent sets of super-peers can be created in SG-2 if each set of super-peers generates and broadcasts it own type of super-peer messages, but such an approach is not followed in . Moreover, since super-peer election is localised in SG-2, the algorithm should work with peers dynamically changing their capacity.
Gradient topologies, together with the aggregation and election algorithms
described in this thesis, allow the election of super-peer sets with
fixed sizes, proportional sizes to the number of peers, fixed capacity,
capacity based on the number of clients, and capacity based on the
system load. Double super-peer election thresholds, and other election
techniques described in section 4.5, allow
peers to minimise the number of switches between super-peers and clients
as the utility of peers changes. Gradient topologies also support
the election of multiple super-peer sets, since peers can calculate
any number of super-peer election thresholds, and the topology structure
guarantees that each super-peer set is connected and can be discovered
using gradient search.
Table 5.1 summarises the results of the comparison. It can be concluded that gradient topologies, together with aggregation-based election techniques, offer more flexible and powerful super-peer election mechanisms compared with the existing P2P systems, and thus extend the current state-of-the-art knowledge on super-peers, and more generally, heterogeneity exploitation in P2P systems.
Performance evaluation is especially important when designing a novel P2P topology, such as the gradient topology, since P2P systems usually exhibit complex, dynamic behaviour that is difficult to predict a priori. Theoretical system analysis is difficult, and often infeasible in practice, due to the system complexity. At the same time, a full implementation and deployment of a P2P system on a realistic scale requires extremely large amounts of resources, such as machines and users, that are prohibitive in most circumstances. Consequently, the approach followed in this thesis is simulation experiments.
However, designing P2P simulations is also challenging. The designer has to decide upon numerous system assumptions and parameters, where the appropriate choices or parameter values are non-trivial to determine. Furthermore, dependencies between different elements of a complex system are often non-linear, and a relatively small change of one parameter may result in a dramatic change in the system behaviour. Moreover, due to the large size and complexity, full-scale P2P systems are not amenable to visualisation techniques, as a display millions of peers, connections, and messages is not human-readable. P2P simulations must continuously collect and aggregate statistical information about the system in order to detect topology partitions, identify bottlenecks, measure global system properties, etc. Such frequent and extensive measurements are often computationally expensive, which adds further challenges to analysing P2P systems.
A number of P2P simulators have been used for evaluating P2P systems, as summarised in [131,130]. The choice of the P2P simulator is an important part of the evaluation strategy, since many existing simulators differ in the system models and assumptions, and introduce different constraints on the performed experiments. Typically, P2P systems are evaluated in Discrete Event Simulators (DES), where the operation of a system is represented as a chronological sequence of instant events that atomically switch the system state . The state is usually stored on a single machine and all the processing is often performed in one thread. However, simulators vary in the extent to which they model the low-level network underlying the P2P system.
At one extreme are simulators such as NS-2 [25,13], which model the entire TCP/IP stack at every network node. Such approaches are expensive, and simulators based on NS-2 can usually support at most a few thousand simultaneous nodes [74,131,130]. Moreover, it is argued that rigid network modelling at the packet level is not necessary when evaluating P2P applications, since most interesting features of P2P systems can be observed at the application protocol level [120,107].
At the other extreme are P2P simulators where the network layer is extremely simplified and messages are passed between peers through direct method calls, with no delay. This approach is followed in PeerSim, a P2P simulator which has been used for evaluating algorithms such as SG-1 , SG-2 , Newscast , and push-pull aggregation . Apart from the networking model, PeerSim also simplifies the control flow. Instead of ordering and processing discrete events using a priority queue, as do most other P2P simulators, PeerSim uses a simple cycle model, where every peer in the system executes a step procedure at each cycle of a global simulation loop. Due to these simplifications, it is believed that PeerSim can simulate up to one million simultaneous nodes [131,124].
A number of other P2P simulators exist and have a different balance between performance and accurate low-level network modelling. Many simulators, such as p-sim , PlanetSim , and the simulator proposed by He et al. , model network communication latency using Internet topology models and generators, such as the Georgia Tech Internetwork Topology Model (GT-ITM) [203,29,202], Boston University Representative Internet Topology Generator (BRITE) , and Global Network Positioning (GNP) . Furthermore, simulators such as Narses  and the General Peer-to-Peer Simulator (GPS)  model network bandwidth constraints and simulate data flows based on connection latencies and throughputs. Table 5.2 shows a summary of a few well-known sequential P2P simulators.
Simulators that execute on multiple machines (e.g., a computer cluster) can be considered a separate category. This category contains in particular Parallel Discrete Event Simulators (PDES) [58,122], which are based on the same system model as DES, but process system events in parallel threads in order to shorten the overall execution time. In principle, PDES produce the same (or close to) simulation results as sequential DES.
Another approach is taken in simulators such as ModelNet , which provide interfaces to the evaluated application at the operating system level, simulating an Internet-like environment. ModelNet allows thus a very accurate system performance evaluation, but also requires a significant implementation cost, which is often comparable to a complete system development.
Finally, WiDS  is a toolkit that allows a protocol simulation in both the DES or PDES mode and a real-network deployment. Once a distributed protocol is developed using WiDS interfaces, it can be simulated within a single address space on a single machine, simulated on a cluster of machines, or deployed and run in the target environment.
Gradient topologies and the algorithms described in this thesis have been specifically designed to achieve a high scalability and resilience in the presence of churn and communication failures. Therefore, the evaluation of gradient topologies must be performed in a setup that involves a large-scale, dynamic and heterogeneous peer population and unreliable communication.
At the same time, it can be argued that a detailed model of the low-level network is not necessary to perform experiments on gradient topologies. First of all, gradient topologies and super-peer election algorithms discussed in this thesis are based on peer characteristics, such as peer utility, and do not exploit the properties of peer connections. Low-level network concerns play a minor role in the construction of gradient topologies. Furthermore, since nearly all algorithms described in this thesis are periodic (with the period length of a few seconds) and generate relatively small messages (approximately a kilobyte or less per message), they are not likely to congest network connections. Higher-level applications running over gradient topologies, which may use network resources more extensively, are not evaluated in this thesis. Thus, it can be expected the system behaviour can be adequately analysed without considering peer connections' bandwidth, cross traffic and in-network queueing. A simplified network model also enables running experiments on larger-scale, and hence more realistic, P2P systems.
Moreover, the evaluated algorithms are not sensitive to message latency, as they generally require that messages are delivered and responses are returned within each algorithm cycle. Since the cycle lengths are on the order to seconds, while a typical message round-trip time (RTT) on the Internet is approximately 100-200 milliseconds, message latency can be assumed insignificant in the evaluated protocols.
Table 5.3 summarises a number of measurements on the average RTT between machines connected to the Internet. The results vary due to differences in the applied measurement methodologies. Some values in the table are approximated based on graphs, and hence may contain minor estimation errors. In the cases where multiple results are available in one publication, the average RTT is put into the table. The average RTT over all measurements is 169 milliseconds.
Most measurements indicate that the distributions of message RTT are skewed, and relatively small fractions of messages are significantly delayed, even by many seconds. However, peers can deal with such extremely delayed (or lost) messages by estimating the average RTT to their neighbours. When a message is not responded to within a certain timeout, such as three times the RTT, it is assumed to have been lost. This way, peers can decide whether their message have been delivered or lost in a relatively short time (below a second), which is significantly shorter than the simulation time step.
An alternative approach to evaluating the performance of a P2P system is to implement it and deploy it on PlanetLab [38,178], a wide-area test bed for distributed systems. PlanetLab consists of a collection of machines (914 hosts at 473 sites at the moment of writing), physically distributed across the Globe, which are available for running scientific experiments.
However, there are at least three relevant reasons for which gradient topologies and the compared systems have not been evaluated on PlanetLab. First, gradient topologies have been specifically designed for heterogeneous P2P environments, while in PlanetLab, most nodes have similar performance characteristics, since the PlanetLab Consortium specifies a minimum hardware specification for the participating nodes. At the time of writing, the minimum requirement for a node was 3.2 GHz CPU clock speed, 4 GB RAM, and 320 GB disk space. Moreover, most nodes in PlanetLab are hosted by large research institutions and are connected with each other through fast and reliable Internet links, such as university and corporate networks. In order to perform experiments on gradient topologies, node capabilities in PlanetLab would have to be artificially limited, as in a traditional P2P simulator, in order to model a heterogeneous peer population.
Second, most machines in PlanetLab are relatively stable and have high uptimes, since they are used by the PlanetLab community for running long-term experiments. In order to perform experiments with realistic churn conditions, with peers joining and leaving the system according to some distribution, peer life cycle would have to be explicitly modelled, as in a P2P simulator.
Finally, PlanetLab consists of a few hundred machines only, while many existing P2P systems have already reached the scale of millions of nodes [35,102,181]. In order to evaluate a P2P system on a realistic scale, each physical node in PlanetLab would have to emulate hundreds of virtual peers.
At the time when the experiments described in this thesis were implemented, none of the available P2P simulators were suitable for running experiments on gradient topologies, and as a consequence, a custom P2P simulator was developed. The simulator is based on a cycle model, similar to PeerSim, as most of the evaluated algorithms are periodic. The main structure in the simulator is a global loop which controls the flow of time. At each cycle of this loop, every peer in the system executes one cycle of its neighbour selection, aggregation, super-peer election and routing algorithms. The order of peers at each time step is chosen randomly.
The simulator models an open, heterogeneous and dynamic population of peers, with continuous peer arrivals and departures. The churn rate (i.e., the average fraction of peers leaving the system per time unit) in the experiments is carefully chosen to reflect the conditions that could be expected in a real P2P system. Table 5.4 lists a number of independent measurements on peer session times in various P2P systems. Notation is used to indicate that percent of peers in a system have a session time below . Some values in the table are approximated based on graphs in the original publications. Overall, most studies show that median peer session durations in existing P2P systems are between a few minutes and and a few hours. In order to be consistent with these real-world measurements, the mean peer session time, , in most experiments described in this thesis is assumed to be 30 minutes, which corresponds to a churn rate of peers per minute. Assuming a time step of 5 seconds in the simulation, this is equivalent to mean session time of 360 time steps and a churn rate of peers per time step.
While almost all published reports agree that peer session distributions are highly-skewed, there is no general consensus whether these distributions are heavy-tailed and which mathematical model best fits the empirically observed peer session times. Sen at al. , Kutzner et al. , Bustamante , and Pouwelse et al.  report power-law (i.e., Pareto) or close to power-law peer session distributions, and Gummadi et al.  conclude that session distributions are heavy-tailed (see Table 5.4 for comparison). However, Chu et al.  suggest a log-quadratic peer session time distribution, while Stutzbach and Rejaie  propose log-normal and Weibull distributions. Saroiu et al. , Sen et al. , Chu et al. , and Bhagwan et al.  also observe diurnal patterns in peer session times.
Moreover, Stutzbach and Rejaie estimate that the best power-law fit for the peer session times in a number of BitTorrent overlays has a shape parameter between 2.1 and 2.7 (hence the distribution is not heavy-tailed), and the best Weibull fit has a shape parameter between 0.34 and 0.59. Similarly, Nurmi et al.  find that peer session times are best fitted with Weibull distributions with a shape parameter between 0.5 and 0.6.
In this thesis, three session models are used. In the Pareto model, a peer is assigned a session longer than with probability
The shape parameter, , is set to 2 (border case between heavy-tailed and non-heavy-tailed distributions) and the minimum session duration, , is calculated in such a way that the mean session is equal to
In the Weibull session model, peer has a session longer than with probability
where is a shape parameter, , and is a scale parameter, such that the mean session duration is equal to
and is the Gamma function. In the third session model, used for a theoretical system analysis, peer session times are infinite, and peers can join the system but never leave or fail.
In all session models, joining peers are bootstrapped by a centralised server. The bootstrap server obtains peer addresses through ``crawling'' the P2P overlay and maintaining a FIFO buffer with 1,000 entries, as described in section 4.8. Additionally, the bootstrap server is used for starting aggregation instances.
The network layer in the simulator is extremely simple. For the reasons explained in section 5.2.2, connection bandwidth and latency are not modelled, and it is assumed that messages are transferred between peers instantly. However, the simulator supports three message loss models.
In the simplest model, each transmitted message has a fixed loss probability . Since gradient topologies do not exploit peer network proximity, and most messages in the evaluated protocols traverse wide-area networks, the message failure probability can be adequately modelled by the overall packet loss rate on the Internet.
Table 5.5 summarises a number of measurements on the message loss rates between hosts on the Internet. In cases where multiple experiments are reported in one publication, the average result is calculated. Heckmann et al.  is the most relevant measurement in the context of P2P systems, since it has been performed in e-Donkey, a P2P file-sharing application. The results vary between 0.12% and 13.5%, depending on the measurement methodology, and the average loss rate over all measurements is 5.19%. In order to be consistent with these measurements, message loss probability in the experiments is set to .
In the second model, the probability of a message loss is proportional to the reciprocal of the recipients session time. A message sent from peer to peer fails with probability
where is the total session duration of (known to the simulator only), and is a system constant scaled in such a way that the average message loss between all peers in the system is equal to . This model is based on observations that peer neighbourhood tables often contain stale entries, due to churn, and large numbers of messages are lost because peers attempt to forward messages using neighbours that no longer exist in the system [30,113]. Hence, messages are less likely to fail if they are forwarded to more stable neighbours.
The third message loss model is specific to request-response protocols, such as aggregation and neighbour selection. The probability of a request loss is calculated using the fixed-loss of the proportional-loss model, but responses are never lost. The rationale behind this approach is the following. First, it can be assumed that the recipient of a response message, i.e., the requester, is on-line, since the average request-response exchange time is only 100-200 milliseconds, and the probability of the requester leaves during the exchange is negligible. Second, it is estimated that large fractions of peers in P2P systems are connected to the Internet through firewalls or NATs, which significantly contribute to the overall message loss [102,181]. Typical firewalls allow peers to generate outgoing traffic, but reject traffic initiated by peers outside of the firewalled zone. In the request-response protocols, if a peer has already received a request and generates a response message, it can be assumed that the peer is either not firewalled at all or has a firewall that allows communication with the requester.
It is assumed that all peers are mutually reachable and any pair of peers can potentially establish a connection. In all algorithms, except for Newscast, the neighbourhood model is symmetric, as discussed in section 4.3.1. Each peer is assigned a capacity value, , and the maximum number of connections peer can establish at a time is limited to . Capacity values follow a Pareto distribution with the shape parameter (again, border case between heavy-tailed and non-heavy-tailed distributions) and a mean of . This way, an average peer can have up to 160 neighbours. However, as shown later, peers rarely approach this limit and in most cases have approximately 50 neighbours. The highest utility peers in the simulated systems maintain approximately a few hundred neighbours. For a comparison, in e-Donkey, a P2P file-sharing system, peers have on average about 30-50 connections, and servers (i.e., super-peers) have on average 700 connections . In Gnutella, an ultrapeer can accept up to 32 connections from leaves and up to 30 connections from other ultrapeers .
The neighbour verification algorithm and the leave procedure, described in section 4.3.3, are not executed by peers. The reason is twofold. First, it is very hard for the experiment designer to determine the fraction of peers that perform the leave procedure when departing from the system, without a detailed knowledge of the system deployment environment, application scenario and user community. In particular, no study on peer crashes (as opposed to intentional departures) in P2P systems is known to the authors of this thesis. Second, a simplified model of peer connections significantly improves the performance and scalability of experiments, allowing larger networks to be simulated.
For these reasons, the simulator assumes that peers always have consistent views of their connections. However, outdated neighbour entries, and the resulting message loss, are simulated in the proportional message loss model described above in section 5.2.6.
In order to establish a connection, peers exchange two messages. If the exchange fails, the connection is unsuccessful. Similarly, peers exchange two messages when they disconnect. Connection closing always succeeds, as it can be assumed that the neighbour verification algorithm eventually resolves any discrepancies between neighbouring peers.
The simulator has been implemented in Java using RePast , a multi-agent simulation toolkit, and Colt, a high performance scientific and technical computing library. For performance and scalability reasons, the simulator uses regular arrays instead of Java collections, primitive data types (such as int, double, etc.) instead of object types (such as Integer, Double, etc.) for storing numbers, and the Colt's pseudo-random number generator instead of the default generator in Java. A number of further optimisations have been performed using a performance profiler, the Java Runtime Analysis Toolkit (JRat), which allow the simulator to support up to 100,000 gradient topology nodes using 2GB of memory. The simulator can also generate system visualisations, which are especially useful for preliminary system analysis and debugging at early stages of development.
Table 5.6 summarises the main parameters
used in the experiments. The parameters at the top of the table, i.e.,
churn rate, session time distribution, message loss rate, and maximum
number of connections per peer, define the simulated systems' environment,
and are initialised based on measurements on existing P2P systems,
as explained in sections 5.2.5, 5.2.6,
and 5.2.7. The remaining parameters
are configurable algorithm properties, and their impact on the system
performance is discussed further in this chapter. In particular, section
5.4.6 covers the aggregation settings (
), section 5.4.1 discusses
the gradient topology branching factor and neighbourhood set sizes,
and section 5.4.7 evaluates the impact
of the branching factor and message time-to-live on routing performance.
The parameters for the state-of-the-art systems (Newscast, SG-1, and
Chord) are set based on the original code and publications describing
This section describes a performance comparison between aggregation-based super-peer election algorithms, used in gradient topologies, and selected state-of-the-art systems, SG-1, SPChord, Hierarchical DHT (abbreviated to H-DHT), and SOLE, conducted using the P2P simulator described in the previous section.
For SG-1, the original code has been obtained from the PeerSim's website and ported to the P2P simulator used in this study. Moreover, an extended version of SG-1, labelled SG-1-fix, based on the algorithm proposed in section 188.8.131.52, has been implemented and compared with the other systems.
SOLE, H-DHT, and SPChord have been implemented on top of a custom-built Chord implementation. In the H-DHT, peers are randomly assigned to static groups, and each group elects one super-peer only. In SPChord, peer identifiers are chosen randomly and do not correspond to peer locations in a virtual coordinate space. Two version of SPChord are considered, with and without the adjustment algorithm, where the latter is labelled SPChord*.
The systems are compared with a tree-based gradient topology (labelled TGT), generated using tree sets, described in section 184.108.40.206, and Newscast sets, covered in section 220.127.116.11. Ranks are estimated using the mixed method (section 4.6.3), and super-peers are elected using top-K and clients thresholds (section 3.1.1), depending on the experiment.
Additionally, a variant of TGTs is considered, labelled TGT*, where the random neighbourhood sets, described in section 4.3.4, are used instead of Newscast sets. TGTs are also compared with simpler gradient topologies, described in [163,164], generated using the utility successor, predecessor, and Newscast neighbourhood sets. These topologies, labelled GT, do not have the tree sets, and hence do not guarantee logarithmic routing performance.
The evaluated systems are compared with respect to the quality of the super-peer sets they generate and the overall cost of the super-peer topology maintenance. The comparison is divided into three sections. In the first section, the goal of each system is to maximise the super-peer capacity, and the systems are compared based on the numbers of elected super-peers, average super-peer capacity, and fraction of optimal super-peers (defined by peer capacity values). In the second section, super-peers are elected based on their stability, and the systems are compared with respect to the average super-peer session duration, super-peer leave rate, and the number of switches between super-peers and clients. Finally, the last section evaluates the super-peer set election and maintenance cost, measured as the average number of neighbours maintained by peers and the number of messages generated by peers. The exact definitions of the comparison metrics are formally introduced in the sections below.
The systems are compared by running a series of simulation experiments. In each experiment, an initial network consisting of a single peer is gradually expanded by adding 0.5% peers at each time step until the system size grows to peers. At subsequent time steps, the system is still under continuous churn, as peers leave according to their session times and are replaced by new peers, but the rates of peer arrivals and departures are equal and the system size remains constant. The system is run for additional time steps, during which measurements are performed and various statistical data are collected. The obtained samples are averaged at the end of each experiment.
The experiment is repeated for each of the evaluated systems, and the results are summarised using graphs. The error bars on the graphs indicate the standard deviation in the measurements. In each experiment, all system parameters are initialised with values shown in Table 5.6, unless the experiment description explicitly states that a different setup is used.
In the first set of experiments, described in this section, super-peers are elected based on their capacity, and the goal of the election algorithm is to maximise the super-peer capacity.
In SOLE, H-DHT, and TGT with a top-K threshold, the desired number of super-peers is given as a system parameter . Given the total system size of peers, the optimum super-peer ratio is , which roughly corresponds to typical super-peer ratios observed in existing P2P systems, such as 30,000 super-peers for 3 million peers in KaZaA according to Liang et al. , approximately 1 super-peer per 30-65 clients in KaZaA according to Xiao et al. , and 1 super-peer per 32 clients in Gnutella version 0.6 .
In SG-1, SPChord, and TGT with a clients threshold, super-peers are elected in such a way that the super-peer set has a capacity equal to or higher than the total number of clients in the system. The capacity values are assigned in such a way, that the optimum super-peer ratio is approximately equal to . More precisely, given that peer capacity values follow a Pareto distribution with shape parameter of and mean , from Theorem 3.1, the average capacity of the 1000 highest-capacity peers in a 100,000 peer system is 100,000. The exact optimum size for the super-peer set changes with time and depends on the current peer capacity values.
Given the optimum number of super-peers, , in a simulated system at time , and the current number of super-peers at time , denoted , the average relative error in the number of super-peers over all time steps is defined as
where is the measurement duration. Similarly, the average relative error in super-peer capacity is defined as
where is the total super-peer capacity at time , and is the optimum super-peer capacity at time .
Figure 5.1(a) shows the value of for each of the compared systems. Clearly, SPChord performs worse than the other systems (note the logarithmic scale), both in conditions with churn and without churn. This is caused by the DHT constraints imposed on super-peers in SPChord, which prevent the system from reducing the size of the super-peer set to the required minimum. The topology adjustment algorithm significantly reduces the number of redundant super-peers in SPChord, reducing by a half, but is not sufficient to outperform the other systems.
Similarly, SG-1 elects too many super-peers in the presence of churn, since every joining peer, and every client that becomes disconnected from a leaving super-peer, becomes a super-peer. This problem is greatly reduced in the extended version of the protocol (SG-1-fix), where peers always attempt to discover super-peers using the underloaded sets and connect as clients before deciding to become super-peers. In the absence of churn, both SG-1 and SG-1-fix manage to elect a nearly optimum number of super-peers. All the remaining systems perform relatively well, and the super-peer election error is below a few percent in experiments with churn.
Figure 5.1(b) shows the value of for SG-1, SG-1-fix, SPChord, and TGT with a clients threshold. Again, SPChord and SG-1 generate super-peer sets with a significantly higher capacity than the system optimum. However, in both systems, the relative error in super-peer capacity ( ) is about 5-6 times lower that the error in the number of super-peers ( ), since the low-capacity super-peers that are elected in both systems significantly increase , but have a relatively low impact on . In the gradient topology, the average super-peer capacity error is below 2%.
At each time step , an optimal super-peer set, , is defined based on the current peer capacity values in the system. In SOLE, H-DHT, and TGT with a top-K threshold, this set contains the highest-capacity peers. In SG-1, SPChord, and TGT with a clients threshold, the optimum set contains the minimum number of highest capacity-peer that have a total capacity equal to or higher than the number of clients.
Given the current super-peer set, , at time , the average fraction of optimal super-peers is defined as
Similarly, the average fraction of suboptimal super-peers in the elected sets is defined as
Figures 5.2(a) and 5.2(b) show the values of and in each of the compared systems. As expected, the gradient topology performs very well, since in threshold-based elections, by definition, super-peer sets contain the highest-capacity peers only.
In SG-1, even in the absence churn, optimum super-peer sets are not achieved. This is caused by the lack of swapping between super-peers and clients in the SG-1 protocol when super-peers are fully utilised, as discussed in section 18.104.22.168. If churn is present, SG-1 elects a large number of suboptimal super-peers, since all clients that are not associated with super-peers are automatically promoted to super-peers, decreasing the value of and increasing . The results are significantly improved in the extended version of the SG-1 protocol, which maintains close-to-optimum super-peer sets in experiments with churn and fully-optimised sets in the absence of churn.
SOLE generates mainly suboptimal super-peer sets, as it elects super-peers based on peer DHT identifiers and does not take into account peer capacity. Similarly, in SPChord, the generated super-peer sets are largely suboptimal, since the election is based not only on peer capacity values but also on peer positions in the DHT overlay, as previously discussed. The topology adjustment algorithm in SPChord clearly improves the quality of super-peers.
In H-DHT, the fraction of optimum super-peers does not change significantly between the experiments and is close to 60% in both experiments with churn and without churn. This stems from the fact that each super-peer in H-DHT is elected independently in its peer group, and peers located in different groups are not compared. This simplifies the election process, but does not produce optimum super-peer sets at the global system level.
Figure 5.3(a) shows the average super-peer capacity in the compared systems. The standard error in the results is high, due to the skewed distribution of peer capacity. In particular, Pareto distributions with a shape of have an unbounded variance. In order to reduce the standard error and increase the experiment repeatability, the maximum capacity in all experiments is limited to 500, i.e., 50 times the mean.
The results shown in Figure 5.3(a) are consistent with the super-peer optimality shown in Figure 5.2(a). Systems with higher fractions of optimum super-peers have also a higher average super-peer capacity.
In order to get more insight into the election algorithms, this section analyses the super-peer session lengths in the evaluated systems. A super-peer session starts when a peer is promoted to a super-peer, and ends either when a super-peer leaves the system or is demoted to a client. Thus, the average super-peer leave rate is defined as
where is the number of super-peers that leave the system at time , and is the total number of super-peers at time . Similarly, the average super-peer demote rate is defined as
where is the number of super-peers demoted to clients at time step .
Figure 5.3(b) shows the average durations of super-peer sessions in the compared systems. Similar to Figure 5.3(a), the standard error is high due to the skewed peer session distribution. In SG-1, super-peer sessions are extremely short, since all peers join the system as super-peers, and are in most cases demoted to clients within just few time steps. In SG-1-fix, super-peer session are longer, since peers can join the system as clients, but the average super-peer session is still short compared with the other systems. This results from two facts. First, joining peers, as well as clients that are disconnected from super-peers, are often unable to discover new super-peers, since the underloaded neighbour sets often contain outdated peer entries, i.e., peers that do not have any free capacity, are not super-peer anymore, or have already left the system. Moreover, super-peer sessions are frequently terminated when super-peers are swapped and replaced with higher-capacity clients.
The SG-1 paper  suggests that super-peer stability can be improved by allowing peers to change their status in an initial period, when joining the system or searching for a super-peer, and assuming that a super-peer is created only if the peer's status does not change for a fixed number of algorithm rounds, . This approach is evaluated and compared with the other systems in Figure 5.3(b). It is labelled SG-1-delay for the original SG-1 algorithm and SG-1-fix-delay for the extended SG-1 protocol, and the parameter is set to 10. As expected, the average super-peer session duration is significantly increased in the new measurement method.
Overall, SOLE, H-DHT, SPChord and SG-1 with delayed super-peer election have comparable super-peer session durations, but are all outperformed by gradient topologies. This is further analysed in Figure 5.4, which shows the average super-peer leave rate and demote rate in the evaluated systems. The super-peer leave rate is nearly identical in all systems, and is equal to the overall system churn rate, since super-peers are elected exclusively based on their capacity in the considered experiments. However, the frequency of swappings between super-peers and clients (i.e., super-peer demote rate) varies considerably between the systems. SG-1 and SG-1-fix are not shown in Figure 5.4 for clarity reasons, since the super-peer demote rates in these systems are more than an order of magnitude higher than the corresponding rates in the other systems. The gradient topologies show a significant advantage over the other systems as they have the lowest rate of switches between super-peers and clients. The figure also shows the cost of the topology adjustment algorithm in SPChord. The average rate of super-peer swapping in SPChord is twice higher compared to SPChord*. As a consequence, super-peers in SPChord have twice shorter sessions compared to SPChord*.
This section describes a series of experiments that evaluate the ability of the compared systems to maximise the stability of super-peers. To that end, each system is configured in such a way that super-peers are elected based on their session characteristics rather than their capacity, as in the previous section.
For each system, three super-peer election criteria are considered: uptime, expected session duration, and expected remaining session duration. The calculation of a peer's uptime is straight-forward. For the estimation of the peer's session, three models are used. In the first model, a peer simply obtains the exact session duration from the simulator, which corresponds to a situation where a peer can predict its session length with a 100% certainty. In the second model, labelled ``Error1'' on the graphs, a peer calculates its expected session duration as
where is the true session length of , known by the simulator only, and is a parameter randomly initialised by at startup which follows a uniform distribution in the range from to . Thus, is assigned randomly between 0 and . In the third model, denoted ``Error2'', peer calculates its expected session duration using formula
where is a variable initialised by which follows the same distribution as peer sessions in the system, i.e., Pareto or Weibull, depending on the experiment. The remaining peer session is calculated as the estimated session duration, , minus the current peer uptime.
In order to obtain accurate measurement results, due to the high variance of peer session times (in Pareto distributions with a shape parameter of 2, the variance and standard deviation are infinite), each experiment is run for 10,000 time steps. However, due to the increased computational cost, the number of peers in each experiment is reduced to 10,000. As previously, in SOLE, H-DHT, and TGT with a top-K threshold, the desired super-peer ratio is (i.e., the optimum number of super-peers is ), and in SG-1, SPChord, and TGT with a clients threshold, the super-peer set is created in such a way that its total capacity is equal to or higher than the total number of clients in the system. Moreover, in the gradient topology, the election strategy described in section 4.5.3 is followed, where super-peers are never demoted to clients.
Figure 5.5 shows the relative error in the number of elected super-peers (i.e., , as defined above) in SOLE, H-DHT, and TGT, and the relative error in the super-peer capacity ( ) in SG-1, SPChord, and TGT. The experiments show that the choice of the super-peer election criteria has a minor impact on the election error for most algorithms, and the results are generally consistent with the previous measurements in section 22.214.171.124. In H-DHT, TGT, and SG-1, a better performance is observed with the uptime-based and session-based super-peer election criteria, compared with the capacity-based election, since with the former criteria the super-peer sets are more stable and hence easier to manage.
Figure 5.6 shows the average super-peer session duration in the compared systems. Peer sessions are modelled using a Pareto distribution, but similar results are obtained with Weibull distributions. In all considered scenarios, TGT and H-DHT clearly outperform the other systems, achieving several times longer average super-peer sessions. In SOLE, SPChord, and SG-1 (all versions), the choice of super-peer election criteria has little impact on the super-peer stability, since in these systems the super-peer sessions durations are mainly determined by the swappings between super-peers and clients. In all examined scenarios, the topology adjustment algorithm in SPChord only reduces the average super-peer session duration. Moreover, in SOLE, and to a large extent in SPChord, super-peers are elected based on their position in the DHT overlay rather than their estimated stability, resulting in short super-peer sessions.
In the gradient topology, top-K threshold appears to produce longer super-peer sessions than the clients threshold. This results from the fact that the latter threshold elects more super-peers. When peer utility is defined as capacity, the optimum super-peer ratio is approximately , both in systems with top-K and clients thresholds. As the utility function is changed from capacity to uptime, session, or remaining session, the clients threshold requires a selection of more super-peers in order to achieve a sufficient capacity to handle all clients in the system.
As expected, in all systems, the stability of super-peers is lowest when the election is based on peer capacity. Uptime-based super-peer election comes second, in terms of super-peer stability, and the best results are obtained with session and remaining session based election, with little difference between the last two methods.
The session and remaining session based election methods outperform the uptime-based and capacity-based methods also in the experiments with peer session estimation error. Furthermore, super-peer sessions are longer in experiments with the first error model (Error1) compared with the second model (Error2). This can be explained in the following way. In the first model, if a peer has a short session , then according to Formula 5.12, it must also have a low session estimation , and hence is not likely to become a super-peer. In the second model, all peers in the system, including those with the shortest sessions, have a non-zero chance of becoming super-peers, since is unbounded in Figure 5.13.
Figure 5.7 shows the super-peer leave rates and demote rates in the same set of experiments. As previously, in the capacity-based super-peer election, the super-peer leave rate is approximately equal for all systems. However, in experiments with the other super-peer election criteria (i.e., uptime, session, and remaining session), the super-peer leave rate is significantly lower, due to the selection of more stable super-peers.
In SOLE, as previously, the super-peer leave and demote rates are exactly the same in all experiments, since the election mechanism is based on peer DHT identifiers. Similarly, in SPChord, only a small reduction in the super-peer leave and demote rates are observed in the performed experiments, as the system elects large numbers of suboptimal super-peers (i.e., low stability in this case).
In contrast, in TGT, H-DHT, and SG-1, the super-peer leave rate is greatly reduced in experiments where super-peers are elected based on their expected session or remaining session, compared with the capacity and uptime experiments. However, SG-1 suffers from a relatively high frequency of super-peer demotions, which overall causes that the average super-peer sessions in SG-1 are short. In particular, in SG-1 where super-peers are elected based on their remaining sessions, super-peer leave rate is almost zero, since nearly all super-peers are swapped with clients before they leave the system (i.e., when their remaining session times are low). TGT and H-DHT generally outperform all other systems, as they manage to minimise both the super-peer leave rates and demote rates.
The following section compares the cost of super-peer election in the evaluated systems. This cost is measured as the peer degrees (i.e., the number of neighbours maintained by peers) and the number of messages generated by peers per time step.
The system topology, , is an undirected graph , where is the set of peers in the system, and is the set of edges determined by the peer neighbourhood sets. In most algorithms described in this thesis, the neighbourhood model is symmetric, however, some algorithms, such as Newscast, are based on an asymmetric neighbourhood model. In order to introduce a common notation and evaluation metrics for all systems, if either or .
Moreover, for each type of neighbour subsets, , a sub-topology is defined, such that if or , where and are neighbourhood subsets maintained by peers and . The degree of a peer in a sub-topology is defined as the number of edges that connect to peer in . The out-degree of a peer in a sub-topology is simply defined as the size of .
Figure 5.8(a) shows the average peer degree in the system overlay and generated sub-overlays in each of the evaluated systems. SG-1 has a clearly highest (over four times) average peer degree compared with the other systems, which is a consequence of the fact that every peer in SG-1 participates in three Newscast instances, i.e., connected, superpeers, and underloaded. Conversely, in H-DHT and SPChord, the average peer degree is many times lower compared with the other systems, since a large majority of peers in these two systems, i.e., all clients, maintain only one connection to a super-peer. This reduces the average number of connections per peer to about two in H-DHT and four in SPChord (almost identical result is obtained for SPChord*). In SOLE, where all peers participate in a global DHT overlay, the overall number of connections is significantly higher (about 40), and is mainly determined by the peer finger tables.
In TGT, an average peer participates in one Newscast overlay, and maintains two connections in the tree set (one outgoing and one incoming). In a Newscast overlay, a peer has on average 20 outgoing connections (30 in the original SG-1 code), and hence the average degree in Newscast is approximately 40 (60 for SG-1). In the presence of churn, the average degree gradually decreases, due to the outdated entries in the Newscast neighbourhood sets.
If the Newscast sets in TGT are replaced with the random sets described in section 4.3.4, the average peer degree is reduced from 37 to 12. This configuration is labelled TGT* in Figure 5.8(a). The random sets are generally more robust to topology partitioning, and hence require fewer neighbours. Furthermore, if a gradient topology is generated using the successor and predecessor sets instead of the tree sets (which is labelled GT in Figure 5.8(a)), the average peer degree grows by about 10.
Figure 5.8(b) shows the average peer out-degree. Since the average peer out-degree must be equal to the average peer in-degree in the system, the out-degree distribution is almost identical (scaled by 0.5) to the general peer degree distribution.
Figure 5.9 shows the relationship between peer degree and peer rank. The plot represents a histogram with logarithmically growing bins. Each point in the plot is generated by calculating the average peer degree, , over all peers that belong to the 'th histogram bin, i.e., peers ranked from to . The histogram is generated periodically during the experiment, and the values in each histogram bin are averaged at the end of the simulation run.
The average peer degree does not exceed approximately 50 in most systems, with SG-1 being the only exception. In SG-1, each client maintains up to 60 connections to super-peers through the superpeers and underloaded sets, and since the super-peer ratio is approximately , the average number of connections per super-peer is approximately 6,000. As the actual super-peer ratio is higher than , especially in the original SG-1 algorithm, where a large number of sub-optimal super-peers is elected, the average super-peer degree is close to 2000 is SG-1 and 5000 in SG-1-fix.
In SOLE, the average peer degree does not change with peer rank, since all peers, clients and super-peers, participate in a global DHT. This is different in H-DHT and SPChord, where only the super-peers maintain the DHT, and low-utility peers have a degree close to one, as they are connected to super-peers only. Furthermore, the average degree of low-utility peers in SPChord* is higher compared to SPChord and H-DHT, since SPChord* is more likely to elect low-utility (and hence sub-optimal) super-peers.
In TGT, the average degree of low-utility peers is 36, as these peers participate in one Newscast overlay and have only one tree connection. The average degree increases to 46 for peers ranked 10,000 and less, since each of these peers has 10 incoming tree connections. Moreover, the 100 highest-utility peers have on average 56 connections, since they maintain additional utility successor and predecessor sets for the rank estimation. In GT, all peers have approximately 45 neighbours, as they all participate in a Newscast overlay and have utility successor and predecessor sets.
One of the most commonly used metrics for measuring the cost of a distributed algorithm, or a system overhead, is the number of messages generated by a node per time step. This approach is also taken in this thesis. Figure 5.10(a) shows the average number of messages generated by each algorithm running at peers in the evaluated systems per time step. Similar to in-degrees and out-degrees, the average number of messages received by peers is equal to the average number of messages sent by peers. Given that each time step in the simulation represents 5 seconds of the real time, Figure 5.10(a) must be scaled by a factor of 0.2 to obtain the numbers of messages generated by peers per second.
The algorithms considered in the comparison, and the corresponding message types, are neighbour selection (specific to each system), connection handling (i.e., connect and disconnect messages in the symmetric neighbourhood model), and super-peer election, which is again specific to each system, and corresponds to aggregation in the gradient topology, clients set maintenance in SG-1, and super-peer discovery using a DHT and client transfer in SOLE, H-DHT, and SPChord.
Similar to peer degree analysed in the previous section, the average number of messages generated by a peer is lowest in H-DHT and SPChord, since the clients in these systems, which constitute a large fraction of all peers, communicate with their super-peer only and generate very few messages. The cost of the DHT maintenance is H-DHT and SPChord is nearly negligible, since the participation in the DHT overlays is restricted to super-peer only. Interestingly, the overall message costs in SPChord and SPChord* are almost equal. SPChord generates extra messages for the topology adjustment, but elects fewer super-peers and thus reduces the DHT maintenance cost.
SOLE incurs a much higher message cost compared to SPChord and H-DHT, since all peers in SOLE run the DHT overlay. In SG-1, the overall message cost is dominated by the three instances of Newscast run by all peers in the system. Each of these instances requires the exchange of 2 messages per peer per time step.
In TGT, peers generate on average two messages for the maintenance of Newscast sets, two messages for the tree sets, and 3-4 messages for aggregation. In total, 7-8 messages per time step, which is comparable to SG-1 and SOLE, but significantly higher than the message cost in H-DHT and SPChord. If the Newscast sets are replaced by the random sets in a TGT, which is labelled TGT* in Figure 5.10(a), the overall message cost is significantly increased, since the symmetric connection model used in the random sets requires that peers exchange a pair of messages each time they set up or close a connection. As the connections in the random set are constantly shuffled, the maintenance cost of random sets is significantly higher compared with Newscast (by approximately 2 messages per time step), and for that reason, in the remaining experiments in this thesis, peers use Newscast sets only in the gradient topology.
A GT overlay is more expensive to create and maintain than a TGT, since each peer in a GT participates in three neighbourhood sets, i.e., Newscast, successor, and predecessor, and generates on average two messages per time step for each set. In TGT, only the 100 top-ranked peers construct the successor and predecessor sets, and the cost of their maintenance is negligible compared to the total number of messages generated in the system.
Figure 5.10(b) shows the relationship between peer rank and the number of generated messages in each of the compared systems. The plots represent rank-based histograms, generated in a similar way as in Figure 5.9. A more detailed analysis of messages generated in each of the evaluated systems is shown in Figure 5.11.
In SG-1, each peer incurs a constant cost of 6 messages per time step, required for the maintenance of three Newscast overlays. Additionally, super-peer gossip and transfer clients between each other. Super-peers with ranks close to 1,000 are most likely to pass their clients to higher-utility super-peers, and hence have the highest connection and client transfer cost.
In SG-1-fix, in addition to the message exchanges required by the original SG-1 protocol, super-peers periodically contact neighbours in their superpeer set and potentially swaps them with higher-utility clients. Moreover, clients that join the system or become disconnected from their super-peers, attempt to discover new super-peers using the underloaded overlay. Super-peers gossip more often with each other in SG-1-fix compared with SG-1, since the underloaded sets in SG-1 contain a high proportion of low-utility and temporary super-peers, and hence, permanent super-peers, ranked below 1,000, are less likely to discover gossip partners.
In SOLE, all peers participate in a global DHT and exchange two messages per time step for the maintenance of their successor, predecessor, and finger sets. The connection maintenance cost is approximately one message per time step.
In H-DHT, in contrast to SOLE, only super-peers participate in the DHT and run the successor, predecessor, and finger sets. Additionally, super-peers broadcast client lists to all members in their peer groups at every 10 time steps. Given the super-peer ratio of approximately , the average cost of the client list broadcast is approximately 10 messages per super-peer per time step. Due to continuous connections and disconnections of clients, super-peers generate approximately 7 messages per time step for the management of their connections with neighbours.
In SPChord, similar to H-DHT, super-peers maintain the DHT structures, and periodically broadcast client lists to all group members. Additionally, super-peers periodically contact their neighbours in the DHT and occasionally merge or split their clusters or swap positions with higher-uptime clients. Due to the frequent topology modifications, the overall connection handling cost is relatively high.
In TGT, every peer participates in a Newscast overlay, which requires an exchange of two messages per time step, and runs the aggregation algorithm, which generates approximately 4 messages per time step. Each peer ranked below 10,000 generates 11 messages per time step for the maintenance of the tree sets, as it receives on average 10 neighbour exchange requests per time step from its children and initiates on average one message exchange with the parent node. Additionally, the 100 top-rank peers maintain utility successor and predecessors sets, for which they exchange 4 messages per time step. Furthermore, peers with ranks close to 100 suffer a high connection handling cost, since they frequently join and leave the successor and predecessor sub-overlays.
In GT, all peers participate in the successor, predecessor and Newscast overlays, and run the aggregation algorithm, and hence, generate approximately 10 messages per time step.
Overall, the average message cost at super-peers varies between 7 and 30 messages per peer per time step, and is comparable in all systems. Moreover, in all systems, super-peers generate more messages per time step than clients, with the exception of SOLE and GT, where the message cost is equal for all peers. H-DHT and SPChord achieve the lowest message cost per peer, since clients in these systems maintain only single connections to their super-peers and rarely generate messages, reducing the average message cost to approximately one message per peer in H-DHT and two messages per peer in SPChord.
The evaluated systems have been compared in a series of experiments based on three general criteria: super-peer set quality, stability, and cost. In the first set of experiments, described in section 5.3.3, the systems are compared with respect to the average size, capacity, and optimality of elected super-peer sets. The experiments show that only the TGTs (both with top-K and clients thresholds) and systems with simple super-peer criteria (i.e., SOLE and H-DHT, which elect fixed numbers of super-peers) manage to control the number of super-peers in the overlay, while SG-1 and SPChord elect significantly too many super-peers. Moreover, TGTs outperforms all other systems in terms of super-peer optimality, and generate super-peer sets with the highest average capacity.
The second set of experiments, in section 5.3.4, evaluates the systems' ability to maximise the average super-peer session duration. In these experiments, SOLE and SPChord achieve poor results, since they elect low-stability super-peers and frequently swap super-peers with clients. In SG-1 and SG-1-fix, super-peers rarely depart from the system, but the average super-peer session is short due to frequent switches between super-peers and clients. TGT and H-DHT greatly outperform all other systems, as they manage to minimise both the super-peer leave rates and demote rates.
In the last set of experiments, in section 5.3.5, the systems are compared with respect to the maintenance cost, measured as the number of established connections and generated messages. In summary, TGTs are cheaper than SG-1, comparable to SOLE (and hence to Chord), and more expensive than H-DHT and SPChord, in terms of peer connections. Moreover, TGTs are comparable to SG-1 and SOLE in terms of generated messages, and more expensive than H-DHT and SPChord. The cost at super-peers is comparable in all systems, and can be estimated as being between 13 and 27 messages per time step and approximately 50 connections, with the exception of SOLE, where all peers, including super-peers, generate only 7 messages per time step.
Overall, it can be concluded that TGTs elect better-quality and higher-stability super-peer sets, and have similar maintenance cost, compared with the state-of-the-art systems.
This section provides an in-depth analysis of the gradient topology and the algorithms introduced in chapter 4, based on a series of experiments conducted in the P2P simulator. The main purpose of this section is to verify that the proposed algorithms produce topologies and super-peer sets that have the properties theoretically derived in chapter 3.
The section is organised as follows. The first experiments evaluate the neighbour selection algorithms through an analysis of the generated topologies, where the studied properties include peer degrees, average path lengths, distances to high-utility peers, and parent ranks. The second set of experiments evaluates the accuracy of the aggregation and rank estimation algorithms in a variety of system configurations. The third set of experiments focuses on the performance of gradient search, both in GTs and TGTs, and measures the average number of message hops and failure rates in gradient search. Finally, the last set of experiments examine the behaviour of the super-peer election algorithms in systems where the utility of peers and the total system load are dynamically changed at each time step.
One of the most basic topology properties is the average peer degree. In the gradient topology, the degree of a peer is defined as the size of , and the out-degree of peer is defined as the number of neighbours such that . Analogously, peer degrees and out-degrees are defined in sub-overlays generated by individual neighbour selection algorithms running at peers.
Figure 5.12(a) shows the average peer out-degree in a TGT and its sub-topologies as a function of peer rank. As in the previous experiments, each point plotted in the figure represents one bin of a rank-based histogram. As expected, the figure confirms that every peer has one neighbour in the tree set (i.e., parent), and 20 neighbours in the Newscast set. Additionally, each peer ranked below 100 has 5 neighbours in its successor set and 3 neighbour in the predecessor set.
Figure 5.12(b) shows the average peer degree in the same experiment. For most peers, the degree is simply twice as high as the out-degree. However, in the tree sub-topology, peers ranked above approximately 10,000 have no incoming connections (both their degree and out-degree are equal to one), while peers ranked below 10,000 have only one outgoing connection to a parent and 10 incoming connections from lower-rank peers.
Figure 5.13(a) shows the average peer degree in four different TGTs with branching factors of 2, 4, 8 and 16. In all systems, the degree of high-ranked (i.e., low-utility) peers is approximately 35-40, mainly due to Newscast sets, while for the high-utility peers, the degrees is increased by incoming tree connections, as expected in a TGT.
Figure 5.13(b) shows the average peer degree distribution in a GT, and two TGTs with Newscast and random sets (the latter is labelled TGT*). Each plotted point represents the total number of peers in the system that have a given degree. The obtained degree distributions resemble normal distributions, where majority of peers have respectively around 12, 32, and 42 neighbours in TGT*, TGT, and GT. The variance in the degree distribution in TGT* is much smaller compared with GT and TGT, since the random set management algorithm actively removes neighbours when the set size grows above 10 and adds neighbours when the set size is below 10.
Similar peer degree distributions are obtained in TGTs with different utility metrics and branching factors, as shown in Figure 5.14. In all configurations, the fraction of peers that have more than 80 neighbours is marginal. Hence, it can be concluded that the neighbour selection algorithm balances the load between peers and the probability that a peer becomes overloaded by incoming connections is very low.
In an idealised TGT, as defined in chapter 3, a peer is connected with a parent peer ranked . However, in a dynamic system, it is extremely unlikely that all peers strictly adhere to this rule, and minor deviations are likely to occur in the system topology. In order to measure the extent to which the system topology diverges from the ideal TGT structure, the parent error is defined for each peer as
where is the current parent peer of . For peers that do not have any parent, is defined as one. Figure 5.15 shows the average parent error as a function of peer rank in topologies generated using four different utility metrics. The divergence between and is partly caused by the peer's inability to discover and stay connected with the desired neighbours in a dynamically changing system, and partly by the inaccuracy of the peer's estimation of its own and its neighbours' ranks. In all performed experiments, is below 20%. As shown later, this level of does not have a significant impact on the topology properties.
In order to get more insight into the structure of the generated topologies,
a number of experiments are performed that measure the average path
lengths between high utility peers. The following definitions and
metrics are used. The system topology,
, is defined as an undirected
graph with vertices
determined by the peer neighbour
sets, as previously.
is defined as the shortest path length
in the system topology
, and analogously,
is defined as the shortest path length between
in a sub-topology
. The average path length in the
, is the average value of
over all possible pairs of peers
The average path length can be calculated using the Dijkstra shortest path algorithm at an cost, where is the average peer degree in . However, in the experiments described in thesis, with and , this would require performing over basic operations.
This cost can be reduced by selecting a random subset from and approximating with
Such approximation requires running the Dijkstra algorithm for peers, and hence, incurs the computational cost of operations. In practice, generates accurate results.
In the unlikely case where two peers and are not connected in the system topology, the distance is not defined and the pair is omitted in the calculation of . The number of such pairs is extremely low in the reported experiments and such pairs only occur when a peer becomes isolated and needs to be re-bootstrapped. With the exception of isolated peers, topology partitions were never observed in any of the experiments described in this thesis.
In order to investigate the correlation between peer utility and the path lengths in the topology, is defined as a subset of peers in the system, , that contains highest utility peers. Formally,
The average path length between the highest utility peers is then given by . Similarly, is defined as the average path length between peers in in a sub-topology .
Figure 5.16(a) shows the average path lengths between peers in sets in a Newscast overlay and in four TGT sub-overlays generated by tree sets with branching factors of 2, 4, 8 and 16. As expected, the function for Newscast is almost flat with , indicating a low correlation between peer utility and Newscast topologies (which are generated randomly). In the tree-based overlays, the average path length in grows linearly with . Therefore, the experiment confirms that the generated topologies have a gradient structure, where the highest utility peers are strongly clustered, and lower utility peers are located at gradually increasing distances from them.
Figure 5.16(b) shows the average path length in five tree-based sub-overlays created in TGTs with different utility functions. Apart from the previously used utility metrics, based on peer capacity, uptime, expected session, and remaining session, a fifth system is considered (labelled ``DynCapacity'') where the capacity of peers changes over time. In this system, each peer is assigned a constant maximum capacity, , and its current capacity value, , is calculated at each time step according to formula
where is randomly chosen between 0 and . The parameter models the interference of external, unpredictable applications, which consume resources at peer . In Figure 5.16(b), is set to 0.1, so that the capacity of a peer changes by up 10% at each time step. In all experiments, the average path length between peers grows linearly with , indicating a high robustness of the neighbour selection algorithm to the utility dynamism.
Another approach to analyse the structure of the generated topologies is to measure the distance from each peer to the highest utility peer in the system, , denoted at time step . Figure 5.17(a) shows the average value of as a function of peer rank in Newscast and tree sub-topologies in TGTs with branching factors of 2, 4, and 10. As previously, the graph generated as a histogram, where each point represents one histogram bin, where
and is the set of peers ranked between and at time
In the Newscast overlay, all peers have an average distance to approximately equal to 4. This is expected, as the Newscast topology is random and is constantly shuffled by peers. In the other overlays, a clear tree structure is visible, with peers gradually increasing their distance from , the root of the tree. In particular, for , peers ranked from 1 to 9 are directly connected to , peers ranked between 10 and 99 are two overlay hops away from , peers ranked between 100 and 999 are three hops away, peers ranked between 1,000 and 9,999 are four hops away, and so on.
Moreover, the distance to in the tree-based overlays grows above 4 for low-utility peers, and is higher than the distance to in the Newscast overlay. This leads to an interesting observation that Newscast connections can be efficiently used for routing by low utility peers, since random neighbours in Newscast sets at these peers are statistically closer to than the parents in the tree sets.
Figure 5.17(b) shows the average distance to over all neighbour subsets in five different TGTs with . The utility function does not have a strong impact on the system topology, and a clear tree structure is visible even when peer utility (defined as capacity) randomly fluctuates at each time step, which shows that the neighbour selection algorithm is resilient to varied system configurations. Again, the distance to does not grow above 4, since peers ranked between 10,000 and 100,000 use their Newscast links to find the shortest paths to .
The following section evaluates peer rank estimation algorithms. The relative error in the rank estimation at peer is defined as
where is the peer's estimate of its own rank. Figure 5.18 show the average rank estimate and the average rank estimation error, obtained using the three rank estimation methods described in section 4.6, i.e., based on utility successors, utility histograms, and mixed. Both graphs are generated as histograms, based on the peers' true ranks, in the same fashion as in the previous sections.
The successor-based method is relatively accurate for high-utility peers, but provides poor rank estimates for low utility peers. Due to the error propagation between peers, as discussed in section 4.6.2, increases together with the peer's rank, and as a consequence, peers ranked 100,000 estimate their rank as approximately 4,000.
The histogram-based method, conversely, generates relatively good estimations of peer ranks for low-utility peers, but is significantly inaccurate for high-utility peers. This is caused by the fact that the last histogram bins (those with the highest-utility peers) contain fewer peer samples, and an approximation based on them is statistically less accurate. Furthermore, some of the highest-utility peers may fall outside of the histogram range.
The mixed method combines the advantages of both approaches, achieving the best efficiency and overall estimation error below 20%.
Figure 5.19(a) shows the relative rank estimation error for the mixed method in a TGT with churn and without churn. In both systems, an increase in the estimation error is observed for the lowest-utility peers, which is caused by the linear interpolation of peer utility histograms. Even with perfectly accurate aggregates, as in the case of no churn, the interpolation produces a distortion. Furthermore, a certain estimation error is produced by the successor-based method (used by peers ranked from 0 to 100) due to churn.
Figure 5.19(b) shows the relative rank estimation error for the mixed method in experiments with different peer utility functions. The error pattern varies between the systems, due to the different peer utility distributions, but it does not exceed 20% in any experiment. As already shown in the previous sections, peers are able to generate gradient topologies with desired properties despite the reported rank estimation error.
The experiments described in this section evaluate the accuracy of the aggregation algorithm. The following notation and metrics are used. Variables , , and denote the current estimations at peer of the current system size, , average peer utility, , utility histogram, , and capacity histogram , respectively, at time step . The average relative error in the system size approximation, calculated over all peers and all time steps, is defined as
where is the experiment duration. Similarly, the average error in utility histogram estimation, , is defined as
where is a histogram distance function defined as
Analogously, is defined as the average error in the average utility estimation and is defined as the average error in the capacity histogram estimation.
Figure 5.20 shows , , and in a number of experiments with varied system configurations. In the experiments with no churn and no message loss, the aggregation algorithm produces almost perfectly accurate system property approximations, with the approximation error below 0.001%. A similarly low approximation error is observed in the configuration with no churn, where request messages are lost with probability , but response messages are always delivered (labelled ``ReqLoss''). This result is consistent with the expectations outlined in section 4.4.7.
In the system where both request and response messages can fail, but still in the absence of churn (``ReqRespLoss''), the approximation error reaches approximately 15%. In a system with no message loss, but a positive churn rate (``Churn''), the observed error is lower, approximately 10% for the histograms and 3% for . The error does not exceed 0.3% in all experiments, since the approximation of a system average does not introduce any systematic bias, such as an aggregation weight loss, and the errors incurred by peers in individual gossip exchanges cancel out.
The experiment labelled ``LeaveProc'' evaluates the performance of aggregation when peers perform the leave procedure described in section 4.4.6. The procedure significantly improves the accuracy of estimation, as it prevents aggregation weight loss. However, it does not significantly affect the error in the histogram estimation, since the population of peer changes during the execution of aggregation, unlike , which is constant, and when an aggregation instance ends, the produced results are diverge from the current system state. As it is hard to estimate how many peers in a realistic P2P system perform a leave procedure when disconnecting from the network, in all experiments reported in this thesis, it is conservatively assumed that no peers execute the procedure when leaving.
An additional experiment, labelled ``RandomSet'', compares the performance of aggregation in topologies generated using random sets instead of Newscast sets, which are used in all other experiments. The approximation errors produced in the two systems are similar, indicating that both topologies are suitable for running aggregation algorithms.
The last experiment included in Figure 5.20, labelled ``ChurnReqLoss'', shows the aggregation error in a TGT with the settings that are used by default in all other sections, i.e., with churn, request loss, aggregation over Newscast topologies, and no leave procedure.
There are three parameters that control the cost and accuracy of aggregation, which are the frequency of instance initiation, , an instance time-to-live, , and the aggregation fan-out, . Additionally, the histogram resolution, , impacts on the accuracy of utility distribution approximation.
When is decreased, peers perform aggregation more frequently, and have more up-to-date estimations of the system properties. However, in the experiments described in this thesis, the system size and the probability distributions of peer utility and capacity are constant, and hence, running aggregation more often does not affect the results. At the same time, when is decreased, the average message size increases, as an average peer participates in a higher number of aggregation instances.
Similarly, when the parameter is increased, aggregation instances last longer, peers store more local tuples, and aggregation messages become larger. Moreover, as shown in Figure 5.21(a), when aggregation instances run longer, they suffer higher weight loss, and as a consequence, generate less accurate results. Conversely, if is low, the aggregation instances are too short to produce high quality results, since a certain number of algorithm steps is required to distribute the weight and average out the tuples stored by peers. The optimum performance is achieved for , and this value for is used in the experiments described in this thesis. The frequency parameter, , is also set to 60 so that nodes run on average one aggregation instance at a time.
Further, the performance of aggregation can be improved by increasing the fan-out factor, . As discussed in section 4.4.8, a higher fan-out setting requires that peers exchange more messages when they participate in aggregation instances, but it also shortens the duration of the instances. As a consequence, high fan-out does not increase the average number of messages sent by a peer per time step, as long as , according to formula 4.29. However, when instances are shorter, the aggregation results are more accurate, since fewer peers join and leave the system when an aggregation instance is running, and the system is less likely to change during the instance execution. This can be confirmed experimentally, as shown in Figure 5.21(b). Based on these results, is set to 4 in the other experiments described in this thesis.
Finally, the accuracy of utility distribution approximation can be improved by increasing the histogram resolution, . Clearly, the message size grows linearly with the number of histogram bins. The actual accuracy improvement depends on the shape of the distribution function and the histogram interpolation method. In this thesis, linear interpolation is used and histograms have 200 bins. This way, aggregation messages have approximately 1.6kB, and would fit well into UDP packets, assuming this protocol was used for the aggregation implementation.
This section describes a series of experiments that evaluate the performance of gradient search. In each experiment, super-peers are elected using a top-K threshold, and messages are routed using gradient search from random peers in the topology to super-peers. The experiments measure the average number of edges (also called overlay hops) a message traverses before it is delivered to a super-peer, and the average message loss rate.
Figure 5.22(a) shows the average number of message hops in gradient search as a function of the system size ( ) and the number of super-peers ( ) in a TGT with a branching factor of 10. Furthermore, Figure 5.22(b) shows the average number of message hops as a function of the branching factor ( ) and the number of super-peers ( ) in a TGT with 100,000 peers. Both figures demonstrate the average number of message hops grows proportionally to and decreases proportionally to . Thus, the experiment empirically confirms Theorem 3.3, which states that every peer in TGT is connected to a super-peer through at most overlay edges.
Figure 5.23(a) shows the average number of message hops when routed using gradient search in five TGTs with different utility metrics, where and . As expected (see Figures 5.16(b) and 5.17(b) for comparison), the choice of utility metric does not affect significantly the structure of the topology and the performance of routing. The only noticeable difference in gradient search performance is found in the experiment with dynamic and unpredictable peer utility (``DynCapacity''), which puts more stress on the neighbour selection algorithm.
In the simplest message failure model, where each message transmission has a fixed failure probability, the average message loss rate is simply proportional to the number of message hops. However, in the proportional model, where the failure probability for a message transmission from peer to is proportional to , as explained in section 5.5, the choice of the peer utility metric has a strong impact on the overall message loss rate in a gradient topology, as shown in Figure 5.23(b).
In the capacity-based TGT, the utility metric and the system topology are independent from peer stability, and hence message loss rate grows linearly with the message path length, as in the simple message model. In the uptime-based TGT, message loss rate is greatly reduced compared with the capacity-based TGT, since messages are gradually forwarded to peers with increasingly higher uptimes, and hence more stable and less likely to lose messages. In the TGT with session-based utility metrics, the average message loss rate is even lower, and is nearly constant with . In these systems, a message forwarded over two or more overlay hops reaches high-stability peers for which the probability of message loss is extremely low.
In the following set of experiments, TGTs are compared with GTs based on the performance of gradient search. Figures 5.24(a) and 5.24(b) show the average number of message hops, and the message loss rate, respectively, in two topologies generated using a capacity-based utility metric and 100,000 peers. With the exception of the configuration where the super-peer ratio is equal to , where super-peer can be trivially discovered using random sets, TGTs clearly outperform GTs, both in terms of message hops and loss. GTs exhibit particularly poor performance when the super-peer ratio is below .
Figure 5.25 shows in more detail the message loss in GT. In gradient search, a message is lost either when is exceeds its time-to-live (TTL) or when a failure occurs when it is forwarded. As shown in Figure 5.25, GT with a low number of super-peers ( ) suffers a very high message loss rate (more than 80%), since peers are unable to discover the super-peers and discard messages as they exceed their TTL. In TGTs, due to the topology structure, gradient search has a guaranteed cost of overlay hops, and if there is only one super-peer in the system, messages are delivered to this super-peer and TTL is exceeded in a marginal number of cases.
The following experiments evaluate the stability of super-peers, and the super-peer election error, in a system where peers dynamically change their utility. The capacity of a peer is calculated at each time step as , as in section 5.4.3, where is the maximum peer capacity, is randomly chosen between 0 and , and is parameter that models the influence of external applications on the peer's capacity. In order to reduce the number of switches between super-peers and clients, super-peers are elected using two top-K utility thresholds, following the approach in outlined in section 4.5.2.
Each experiment is set up with two parameters: , labelled ``Epsilon'' on the graphs, which determines the amplitude of peer capacity change, and , denoted ``Delta'' on the graphs, which determines the distance between the super-peer election thresholds. The upper and lower thresholds, and , are calculated in such a way that the number of super-peers is between and , i.e., and , where is a peer utility distribution. The system size is and .
Figure 5.26(a) shows the average rate of super-peer switches with clients as a function of . The experiment demonstrates that the rate of switches sharply decreases as is increased. However, it does not converge to zero, but rather to a constant positive value, since some super-peers always leave the system due to churn, and are continuously replaced by ordinary peers.
Hence, super-peers are swapped with clients for two reasons. First, as super-peers leave the system, ordinary peers are promoted in order to maintain super-peers in the system. Second, since the utility of individual peers and the super-peer election thresholds constantly fluctuate, some super-peers are occasionally demoted to clients, and clients are occasionally promoted to super-peers.
Figure 5.26(b) shows the average rates of super-peer demotions and departures in a system with . The rate of super-peer departures does not depend on and is determined by the overall system churn. However, the experiment shows that the number of super-peer demotions, caused by peer utility and threshold fluctuations, can be reduced to a negligible level by using an appropriate .
Figure 5.27(a) shows the impact of on the super-peer election error, , as defined in section 126.96.36.199. As expected, the error grows together with , since a larger gap between and relaxes the constraints on the number of super-peers in the system. Similarly, the fraction of globally optimum super-peers, , defined in section 188.8.131.52, decreases as is increased, as shown in Figure 5.27(b). Hence, enables a trade-off between restricting the constraints on the super-peers set and reducing the frequency of switches between super-peers and clients.
The final set of experiments evaluates the load-based approach to super-peer election described in section 3.1.1. In these experiments, at each time step, each peer generates a request, or more generally, a unit of load, with probability . From the central limit theorem (law of big numbers), the total load in the system, , produced at a time step follows a normal distribution with a mean of and a variance of .
The requests are routed to super-peers and distributed between them. For the purpose of these experiments, each super-peer receives a fraction of load proportional to its capacity. Load-balancing strategies are not in evaluated in this study. Peer capacity values follow a Pareto distribution with a mean of 1 and shape parameter of 2, and it is assumed that a super-peer cannot handle more requests than it capacity value, .
Super-peers are elected using a load-based threshold, defined by formula 3.8 in section 3.1.1. In the performed experiments, three values for the super-peer utilisation parameters, , are considered: 1 (full utilisation), 0.9, and 0.75.
Figure 5.28(a) shows the relationship between the request probability and the total number of super-peers in the system. It can be seen that the system adapts the super-peer set to the increasing load. The number of super-peers initially grows slowly, as high capacity super-peers are available, but the growth rate gradually increases as becomes higher, and eventually all peers in the system become super-peers. Moreover, the growth in the number of super-peer is faster for lower values of , as expected.
Figure 5.28(b) shows the total system load and super-peer capacity in the same set of experiments. The load in the system increases proportionally to the request probability, as predicted, and the total super-peer capacity scales linearly with the system load. For , the total super-peer capacity exceeds the total system load. Thus, the system achieves its objective and adjusts the super-peer set according to the existing demand. For , the super-peer capacity is marginally below , due to the aggregation error (weight loss) and histogram interpolation error.
This section describes a series of experiments that examine whether the algorithms described in chapter 4 generate topologies and super-peer sets that have the properties derived analytically in chapter 3.
The first set of experiments, presented in sections 5.4.1 to 5.4.4, evaluate the neighbour selection algorithms through an analysis of the average path lengths and distances between peers in the generated topologies. The results clearly show that the created topologies have a tree-based structure. In particular, the distance from a peer, , to the highest utility peer in the system, , grows logarithmically with 's rank, where the logarithm base is equal to the topology branching factor, , as expected in Theorem 3.2. Moreover, the neighbour selection algorithms manage to construct and maintain such tree-based topologies in a variety of system configurations, with different branching factors and peer utility metrics. The analysis also shows that peers have low degrees, and the fractions of peers that become overloaded by excessive neighbour connections are negligible.
A further set of experiments in section 5.4.7 evaluate gradient search, and show that the generated topologies can be efficiently exploited for routing. As predicted in Theorem 3.3, gradient search delivers a message from a peer to a super-peer in overlay hops, where is the system size, is the number of super-peers, and is the branching factor, which is verified in a number of experiments with varied , , , and peer utility metrics. Moreover, gradient search significantly reduces message loss rate in topologies where the peer utility metric is based on peer stability.
Another set of experiments, described in section 5.4.6, evaluates the aggregation algorithm and shows that it approximates global system properties with an average relative error below approximately 15%. The algorithm generally conforms to the expectations outlined in sections 4.4.6, 4.4.7, and 4.4.8, and its performance can be tuned using parameters such as instance frequency, , instance duration, , and gossip fan-out, .
The rank estimation algorithms are evaluated in section 5.4.5. As expected in section 4.6.3, the average error produced by the mixed method, based on aggregation and utility successor sets, is the lowest, and is below 20%. Moreover, such a level of rank estimation error does not prevent peers from generating tree-based gradient topologies.
The thesis also introduces three super-peers election techniques for gradient topologies, i.e., based on single-thresholds, double-thresholds, and with no super-peer demotion. The first of these techniques is evaluated in section 5.3.3, which shows that a single-threshold election allows a precise restriction on the number of super-peers and generates close-to-optimum super-peer sets. The second election method, with no super-peer demotions, is evaluated in section 5.3.4, which confirms that this election technique entirely eliminates switches between super-peers and clients and maximises the average super-peer session length. The two-threshold election method is evaluated in section 5.4.8, which shows that the use of two thresholds enables a trade-off between imposing constraints on the number and utility of super-peers and the reducing the frequency of switches between super-peers and clients.
This section validates the custom-built simulator, which has been used to generate the results presented in this thesis. The validation is performed by running two identical sets of experiments in the custom simulator and PeerSim. In the latter, the event-driven mode is enabled and node communication is asynchronous. In order to determine message latencies, PeerSim is fed with a trace containing nearly 3 million wide-area network latency measurements generated by the King tool . Due to the computational cost, the system size is reduced to 10,000 nodes, but all other simulation parameters are set in the same way as in the previous sections, as summarised in Table 5.6.
Figures 5.29(a), 5.29(b) and 5.30(a) show the average path lengths, distances to the highest utility peers, and peer degrees, respectively, in TGT topologies generated in PeerSim and the custom simulator. Remarkably, the results generated by the two simulators overlap almost ideally in most experiments. This empirically confirms the hypothesis stated in section 5.2.2 that message latencies observed on the Internet do not have a significant impact on the neighbour selection algorithms described in this thesis. In both simulators, the algorithms generate nearly identical topologies.
Figure 5.30(b) shows the aggregation error for a number of system properties in PeerSim and the custom simulator. As expected, message latencies on the order of 100 milliseconds (typically experienced on the Internet) only marginally reduce the accuracy of aggregation, where the gossip period is 5 seconds. As a consequence, the super-peer election algorithm generates almost the same results in PeerSim and the custom simulator.
Finally, Figures 5.31(a) and 5.31(b) show the average number of overlay hops and failure rates for messages routed using gradient search in PeerSim and the custom simulator. The number of super-peers is denoted by . Again, both simulators produce similar results in most experiments. The only exception is the experiment for , where PeerSim significantly differs from the custom simulator. Higher message latency in PeerSim causes relatively more frequent and severe deviations in the TGT structure, which in turn impact on the routing performance. Overall, the experiments described in this section demonstrate that the custom-built simulator generates consistent results with PeerSim.
This section demonstrates the practical viability of gradient topologies by applying the experimental results, derived in the previous sections, to a sample application scenario. The selected application scenario is based on the storage system described in section 3.3. In order to evaluate the impact of a gradient topology, two variations of this system are considered: a traditional DHT overlay, where all nodes participate in the data storage, and a system based on a gradient topology, where only the super-peers store the data and run the DHT protocol. For consistency with the previous experiments, it is assumed that peer properties follow a Pareto distribution, mean peer session duration is equal to time steps, mean peer downstream bandwidth capacity is equal to , and the ratio of super-peers to the system size is equal to . The storage system is run by peers and is used to permanently store bytes of data.
In a traditional DHT, in order to access a data item, a user first discovers the node that hosts the requested item and then communicates directly with this node. Thus, the discovery operation requires message transmissions, and the average download rate from the hosting node, assuming a low contention ratio, is approximately equal to .
The average peer departure rate in the DHT is equal to . If the data stored in the DHT is not replicated, the average rate at which the data is lost is proportional to the average peer leave rate, and is equal to . In a DHT with -replication, i.e., where each data item is replicated at independent nodes, the probability of a data item loss can be estimated in the following way. Let be the minimum amount of time needed to transfer a data item between two nodes, . A data item is lost if all nodes that host it leave the system during time units. Hence, the probability for a data item loss during time can be estimated as .
Finally, -replication requires a certain maintenance cost. Each time a node leaves, its data needs to be restored on another node using the remaining replicas. Since nodes store in total data and leave the system at an average rate of , the average background traffic needed to maintain -replication is equal to bytes per time unit.
In a storage system with a tree-based gradient topology, the cost of data discovery is comparable to a traditional DHT. As shown in section 3.3, and confirmed in the experiments in section 5.4.7, a DHT running on top of a TGT requires ) message transmissions to route a messages to a selected data item.
However, by selecting the highest utility peers for storing data, a gradient topology can improve data access performance and reliability. In a gradient topology, where utility is defined as peer bandwidth capacity, the average super-peer bandwidth capacity is equal to , as predicted by Theorem 3.1 and verified empirically in Figure 5.3(a). Thus, the average data item download rate in a gradient topology, assuming a low system utilisation, is 10 times faster compared to a traditional DHT.
In should be noted that a tenfold improvement in super-peer bandwidth capacity is not possible with the other evaluated super-peer systems. Due to the reasons discussed in the previous sections, these algorithms elect suboptimal super-peer sets, and the ratio between the mean super-peer bandwidth capacity and mean peer capacity is only 5 for SPChord and between 8 and 9 for H-DHT and SG-1 (see Figure 5.3(a)). SOLE does not improve super-peer capacity over due to its simple election criteria.
Super-peers can also be used to reduce the data loss rate and maintenance cost (in case of replication) if peer utility is based on peer stability. Depending on the heuristic used to estimate peer session duration, super-peer sessions in a gradient topology are from 3 times (in case of an uptime-based utility function) to 10 times (accurate knowledge of peer session durations) higher than the average peer session , as shown in Figure 5.6. Thus, in the gradient topology, the average super-peer leave rate is between and , and is 3 to 10 times lower compared to the overall peer leave rate in a traditional DHT overlay. If no data replication scheme is applied, the average data loss rate in the gradient topology is then up to 10 times lower (i.e., ) compared to the traditional DHT. Moreover, in a gradient topology with data -replication at super-peers, the probability of a data item loss rate during interval can be estimated as , and is up to times lower compared to the traditional DHT.
At the same time, gradient topology reduces the replication overhead. A super-peer stores on average data. During a short interval , approximately super-peers leave the system, and hence, data must be transferred between the remaining replicas. This implies that the average maintenance traffic is bytes per time unit, and hence is 10 times lower compared to a traditional DHT.
Finally, it should be noted that the other evaluated super-peer systems achieve significantly lower super-peer stability due to suboptimum super-peer election and swappings between super-peers and clients. In particular, even with a perfect knowledge of peer session times, H-DHT extends super-peer session only 7 times compared to the mean , SG-1 (with all optimisations enabled) generates super-peer session with lengths merely equal to (i.e., there is no improvement in data stability), and both SOLE and SPChord significantly reduce super-peer session lengths compared to the mean .
This section evaluates the performance benefits of using a gradient topology in a DHT-based storage system. By increasing the utility of data storing peers, the gradient topology improves both the data access performance and reliability. In particular, while keeping the same latency and message cost for data discovery, it improves the data download rate and data loss probability by an order of magnitude compared to a traditional P2P storage system. Moreover, due to the increased data stability, the gradient topology also reduces the replica maintenance cost. Comparable performance improvements cannot be achieved using other known super-peer election techniques.
This chapter summarises the main contributions of the thesis, and outlines directions for future work.
This thesis introduces a novel approach to dealing with heterogeneity in P2P systems using gradient topologies. A gradient topology has a fundamental property that for any given utility threshold, all peers with utility above this threshold are located close to each other, in terms of overlay hops, and form a connected sub-overlay. Such high-utility peers can be then exploited by higher-level applications in a similar fashion as super-peers in traditional P2P systems. Furthermore, the information captured in the topology enables a search heuristic, called gradient search, that enables efficient discovery of high-utility peers.
The thesis introduces a subclass of gradient topologies, called tree-based gradient topologies, which have a logarithmic diameter and allow routing messages from an ordinary peer to a super-peer in overlay hops, where is the system size, is the number of super-peers, and is a constant system parameter called branching factor. TGTs are simple and easy to generate. A node in a TGT maintains only one link to a parent node and a few randomly links to other nodes.
TGTs have been designed to support a wide class of large-scale P2P applications, such as storage systems, name services, file-sharing applications, and semantic registries, where the system performance and reliability can be improved by assigning relevant system tasks, such as hosting system data, running services, or participating in certain distributed algorithms, to the most stable and best performing peers. The thesis describes two such proof-of-concept applications.
TGTs can be generated by a periodic neighbour selection algorithm executed at each peer. The thesis describes the design of such an algorithm and evaluates it using a custom-built P2P simulator. The evaluation confirms that the algorithm constructs topologies that have the desired tree-based structure.
The thesis describes a number of super-peer election thresholds, which impose different constraints on the super-peers sets. In particular, the thesis introduces a top-K threshold that elects a fixed number of super-peers, a proportional threshold that maintains a fixed ratio of super-peers to clients, a capacity threshold that restricts the total super-peer capacity, and other thresholds that allow more sophisticated super-peer management. All thresholds are calculated using a decentralised aggregation algorithm which approximates global system properties, such as the system size, total load, and peer utility distribution. The thesis also describes super-peers election techniques, based on utility thresholds, that restrict the number of super-peers in the overlay and reduce the frequency of switches between super-peers and clients.
In a range of experiments, it is shown that gradient topologies, together with aggregation-based election techniques, generate better-quality and higher-stability super-peer sets, at a similar maintenance cost, compared to state-of-the-art super-peer systems. Moreover, it is shown that gradient topologies offer more flexible and more powerful super-peer election mechanisms compared with the existing P2P systems, and thus extend the current state-of-the-art knowledge on heterogeneous P2P systems.
Even though the thesis describes gradient topology as a single and unified architecture, its individual components can be treated as separate contributions, and can be used as independent building blocks in other P2P systems. In particular, the aggregation algorithm can be used as a generic tool for estimating global properties in decentralised systems. Through a distribution function approximation, the algorithm can address classic problems such node ranking  and slicing [79,56,66,127].
Similarly, the election strategies described in this thesis can be applied separately from gradient topologies. For example, a system with a random P2P topology can elect super-peers by aggregating peer utility information and calculating utility thresholds using the described algorithms. The information about such elected super-peers can be then disseminated to ordinary peers using a gossip-based broadcast algorithm [54,80].
Nevertheless, gradient topologies complement well with these components, since they guarantee that super-peers, elected using utility thresholds, are well connected with each other and can be discovered using gradient search. While the election thresholds can be changed, peers in the gradient topology do not need to be migrated or re-connected. Moreover, in a natural way, gradient topologies can support multiple super-peer sets by assigning multiple utility thresholds.
This section briefly outlines a sample of promising research topics that can be identified based on the work described in this thesis.
One of the most important aspects for future work on gradient topologies is security. Nearly all techniques and algorithms developed in this thesis assume a fully collaborative P2P environment. This section outlines the main challenges, and points out potential approaches, when designing a security model for gradient topologies.
The main security threat in a gradient topology is posed by malicious peers which may take harmful actions against other peers. A secure P2P system should be able to function correctly (without a significant performance degradation) despite an existence of such malicious peers. In the context of this thesis, this means that peers should be able to construct a gradient topology, elect appropriate super-peers, and enable access to these super-peers.
The most critical information in a gradient topology is that about peer utility. By providing fake utility status, a malicious peer may change its position in the topology and become a super-peer. A secure gradient topology must then provide a mechanism for peers to verify the utility of their neighbours. For example, peers may give each other feedback in order to assess their neighbours' utility. Such techniques, based on the notion of trust in a decentralised system, are described in [86,193,28]. Another potential approach is to compute, in a decentralised fashion, the reputation for each node, and discard peers that have a low reputation (i.e., provide fake utility information) [61,47,138,72].
Another vulnerable component in the gradient topology, and perhaps the most challenging to secure, is the aggregation algorithm. By disseminating fake information, a malicious node can influence the global aggregation outcome, and this way, manipulate the super-peer election and node ranking (and hence topology construction). A malicious peer can also intentionally increase the periodicity of its aggregation algorithm in order to have a greater impact on the aggregation result. Moreover, a peer can initiate a large number of aggregation instances in order to increase the system overhead and potentially cause a denial of service.
The non-malicious nodes can protect themselves from these attacks in a number of different ways. First, peers may try to identify and discard fake responses from their neighbours based on the knowledge from the current and previous aggregation instances. As each instance gradually converges over time, peers can refine their knowledge on the expected instance outcome and detect outliers (i.e., forged responses). Second, peers may cache the responses received from their neighbours and compare them with the final aggregation outcome. Misbehaving peer can then be identified and isolated. Again, reputation management techniques [61,47,138,72] can be used to black-list malicious nodes. Similarly, peers can correlate the information on gossip exchanges in order to detect nodes that increase their periodicity or maliciously initiate large numbers of instances. This may require using cryptographic mechanisms to trace node identity. Finally, peers can impose restrictions on which nodes can initiate aggregation instances, for example based on their reputation scores.
A common threat in a P2P system is a refusal to route messages by certain peers. Malicious peers may also want to route messages in an illegal way and tamper with forwarded messages. In the gradient topology, this may prevent access from clients to super-peers. A simple way to alleviate this problem is to randomise the routing algorithm and allow message retransmissions in case of failures. Assuming that malicious peers constitute only a small fraction of all peers in the system, this approach allows each peer, with a high probability, to successfully route messages. Techniques to randomise gradient search are described below in section 6.2.3.
Malicious nodes in a gradient topology may also refuse to participate in the neighbour selection algorithm, or to provide illegal neighbour candidates to gossipping peers. Since the gossipping algorithm is highly randomised and each peer verifies its new neighbours through direct connections (symmetric neighbourhood model), the neighbour selection algorithm may be already able to tolerate a certain fraction of malicious peers.
A special situation occurs when a malicious peer becomes a super-peer. However, since the super-peers functionality is entirely application-specific, it is difficult to asses in the general case the impact of a malicious super-peers on the system performance. For the same reason, incentives for peers to serve as super-peers can only be considered in a particular application scenario.
Another interesting research issue is whether a single gradient topology can support multiple applications with different utility requirements. Without loss of generality, two applications can be considered, and , where each application introduces its own utility function, and , respectively. In such a scenario, the goal of a gradient topology is to elect super-peers with high values of for use by application , and super-peers with high values of for use by application .
A naive approach to this problem is to generate two independent gradient overlays, using the two utility functions and the algorithms described in this thesis. However, this would double the system overhead. A better approach is to combine the two utility functions into one general utility function and to generate one gradient overlay shared by both applications. A convenient way of defining such a common utility function is
This has the advantage that both peers with high value of and peers with high value of have high utility , and hence are located in the core and can be discovered using gradient search. The only change required in the routing algorithm is that a search message, once delivered to a high utility peer in the core, may have to be forwarded to a different peer in the core, since either has a high value or . This last step, however, with a high probability can be achieved in one hop, since peers in the core are well-connected.
The proposed approach is illustrated in Figure 6.1. A sample gradient topology supports two applications, and . Ordinary peers , , and perform gradient search to discover super-peers for application . Peers and locate an `` -type'' super-peer in the core and their request is forwarded to a `` -type'' super-peer. Peer discovers a `` -type'' super-peer directly.
The super-peer election thresholds for the two applications, and , can be estimated using the aggregation algorithm, where the histograms for both and are generated through the same aggregation instance in order to reduce the number of generated messages. However, a potential problem may appear if the two utility functions, and , have significantly different value ranges, since the composed utility may be dominated by one of the utility functions. For example, if has values within range and has values in range , then is essentially equal to , and searching for peers with high becomes inefficient.
One way to mitigate this problem is to define the two utility functions in such a way that both have the same value ranges, e.g., . However, this requires system-wide knowledge about peers. Simple transformations or projections onto a fixed interval, for example using a sigmoid function, do not fix the problem, since if one function has higher values than the other function, the same relation holds when the transformation has been applied. A better approach is to scale one of the two utility functions using the current values of the super-peer election thresholds, for example in the following way
This has the advantage that the core of the gradient topology, determined by the threshold , contains peers with above and peers with above , since if for a peer then either or .
The gradient search algorithm has the drawback that it always forwards messages along the same paths, unless the topology changes, which may lead to an imbalance in the routed traffic between peers. This is especially probable in the presence of ``heavy hitters'', i.e., peers generating large amounts of traffic, as commonly seen in P2P systems . Moreover, faulty or non-cooperating peers, as well as broken peer connections, may prevent peers from reaching their super-peers.
These problems can be addressed by allowing multi-path routing. In a TGT where peers maintain multiple parent neighbours (i.e., the size of the tree set is greater than one), a message can be forwarded to a parent neighbour chosen randomly or using a round-robin strategy.
A more general approach, suitable for any gradient topology, would be to route messages probabilistically. In Boltzmann routing, peer selects neighbour for the next-hop destination for a routed message with a probability calculated using the Boltzmann exploration formula 
where is a parameter called the temperature that determines the ``greediness'' of the algorithm. Setting close to zero causes the algorithm to be more greedy and deterministic, as in gradient search, while if grows to infinity, all neighbours are selected with equal probability as in random walking. Thus, the temperature enables a trade-off between exploitative (and deterministic) routing of messages towards high-utility peers, and random exploration that spreads the traffic more equally between peers.
The thesis describes a relatively simple approach to the construction of utility histograms and utility distribution approximation, where all histograms bins have equal width and evenly divide the interval between the minimum and maximum peer utility. The accuracy of the peer utility distribution approximation may be potentially improved, if the bins in the histogram, and hence the interpolation points, are dynamically adjusted by peers at system runtime. In particular, is may be desirable to increase the density of bins close to the super-peer election thresholds. Potentially, the choice of the histogram bins can be dictated by the interpolation method, especially if a more sophisticated approach, for example based on cubic splines, is used instead of the linear interpolation. A proactive adjustment of histogram bins may also eliminate the need for utility successor sets in the rank estimation algorithm.
Gradient topologies have been designed in this thesis with one main intention: to allow a P2P system to exploit its highest utility peers. The next research question then is whether gradient topologies can be built based on two constraints: peer utility and location. In a locality-aware gradient topology, a peer selects its neighbours based on not only their utility, but also their distance, defined according to some metric. Potentially, node locations and distances can be determined using a virtual coordinate system, such as GNP  and Vivaldi . The goal of the neighbour selection algorithm is to construct a topology that preserves its general gradient structure (in particular, supports gradient search and maintains super-peer connectivity), but also preferentially connects nodes that are close to each other. This way, the topology can improve the performance of gradient search and allows peers to discover super-peers in their proximity.
The thesis describes the design of two proof-of-concept applications for gradient topologies, a storage system and a name service. These two applications are described only briefly, and a significant amount of work is required to fully specify them, validate, and implement. The proposed design can be then treated as a starting point for a wider research topic.
As shown through a theoretical analysis and simulation experiments, the properties of gradient topologies are promising and suggest that gradient topologies can be successfully applied to many different large-scale and heterogeneous P2P systems. Supporting such systems is ultimately the farthest-reaching goal of this thesis.