Social network graphs, like the ones captured by Facebook and Twitter exhibit small-world characteristics [2][3]. In 2011, Facebook documented that among all Facebook users at the time of their research (721 million users with 69 billion friendship links), there is an average average shortest path distance of 4.74 between users. This simply means that on average, any two people in the world are separated by just five other people. It’s a small world indeed ! Formally, a small-world network is defined to be a network where the typical distance L between two randomly chosen nodes grows proportionally to the logarithm of the number of nodes N in the network [4].
Consider the following scenario. You have a social network profile and you want someone to introduce you to the person in that profile. But luckily you are given the entire friendship graph captured by this social network. If there are mutual friends, then you just ask one of them to help you out. If not, you need some sequence of friend introductions to finally meet that person. What is the minimum number of intermediate friend introductions that you need to meet that person you are interested in ? This is equivalent to finding the shortest path in the social network graph between you and that person of interest. The solution is to run Breadth First Search on that social network graph with your profile as the starting vertex. The other interesting question is, if we have extra information about our graph exhibiting small-world properties, can we make the exhaustive Breadth-First Search (BFS) faster? The ideas expressed on this topic appeared in Beamer et al [1], where the authors optimized BFS for the number of edges traversed.
Breadth-First Search:
BFS uses the idea of a frontier that separates the visited nodes from unvisited nodes. The frontier holds the nodes of the recently visited level and is used to find the next set of nodes to be visited. On every step of BFS, the current frontier is used to identify the next frontier from the set of unvisited nodes.
Figure 1. A simple graph
Looking at the example in the figure, the current frontier consists of the nodes 1, 2, 3, 4 and 5. The edges from these nodes are examined to find a node that has not been visited. In the above case node 2’s edges are used to mark H and add it to the next frontier. But note that even though H has been marked by 2, nodes 3, 4 and 5 still inspect H to see whether it is visited or not.
Pseudocode for Naive BFS [5] :
Input: A graph G = (V,E) containing V vertices and E edges and source vertex s
Output: parent: Array[Int], where parent[v] gives the the parent of v in the graph or -1 is if a parent does not exist
One of the observations of conventional BFS (henceforth referred to as top-down BFS) is that it always performs in the worst case complexity, i.e., O(|V| + |E|) where V and E are the number of vertices and number of edges respectively. For example, if a node v has p parents, then we just need to explore one edge from any p parents to v to check for connectivity. But top-down BFS checks all incoming edges to v.
The redundancy of these additional edge lookups is more pronounced when top-down BFS is run on graphs exhibiting small-world properties. As a consequence of the definition of small-world networks, the number of nodes increases exponentially with the effective diameter of the network, which result in large networks with very low diameters. The low diameter of these graphs forces them to have a larger number of nodes at a particular level and leads to top-down BFS visiting a larger number of nodes in every step, making the frontier very large. Traversing the edges of the nodes in a frontier is the major computation that is performed, and top-down BFS unfortunately ends up visiting all the outgoing edges from the frontier. Moreover, it has also been shown in [1] that most of the edge lookups from the frontier nodes end up in visited nodes (marked by some other parent), which gives further evidence that iterating through all edges from the frontier can be avoided.
The idea behind bottom-up BFS [1] is to avoid visiting all the edges of the nodes in the frontier, which is a pretty useful thing to do for the reasons mentioned above. To accomplish this, bottom-up BFS traverses the edges of the unvisited nodes to find a parent in the current frontier. If an unvisited node has at least one of its parents in the current frontier, then that node is added to the next frontier. To efficiently find if a node’s parent is present in the frontier, the frontier data structure is changed to a bitmap.
Figure 2. Bottom up BFS
In the above example, {H, I, J, K } are the unvisited nodes. However only nodes { J, H } have a neighbor in the current frontier and as a result the next frontier now becomes {H , J}. In the next iteration the set of unvisited nodes will be {I, K} and each of them have a parent in the current frontier which is {H, J}. So {I, K} will be visited and the search will complete in the next iteration since there will be no more nodes to be added to the next frontier, since all nodes will be visited.
Pseudocode for Bottom-up BFS:
Input: A graph G = (V,E) containing V vertices and E edges and source vertex s
Output: parent: Array[Int], where parent[v] gives the the parent of v in the graph or -1 is if a parent does not exist
The major advantage to this approach is that the search for an unvisited node’s parent will terminate once any one parent is found in the current frontier. Contrast this with top-down BFS, which needs to visit all the neighbors of a node in the frontier during every step.
Top-down, Bottom-up, or both?
When the frontier is large, you gain by performing bottom-ups BFS as it only examines some edges of the unvisited nodes. But when the frontier is small, it may not be advantageous to perform bottom-up BFS, as apart from having to go over, it incurs the additional overhead of identifying the unvisited nodes. Small-world networks usually start off with small frontiers in the initial step and have an exponential increase in the frontier size in the middle stages of the search procedure. These tradeoffs lead us to another approach for small-world networks where we combine combine both top-down and bottom-up BFS—hybrid BFS [1]. In hybrid BFS, the size of the frontier is used to define a heuristic, which is used to switch between the two approaches, top-down and bottom-up. A thorough analysis of this heuristic is presented in [1].
How about parallelizing these approaches ?
When trying to parallelize the two approaches, observe that bottom-up BFS is easier to parallelize than top-down BFS. For bottom-up BFS, you can introduce parallelism in the stage where you populate the next frontier. Each of the unvisited nodes can be examined in parallel, and since every node just updates itself in the next data structure, it does not require the use of locks.
On inspecting the top-down BFS pseudo-code for sources of parallelism, observe that the nodes in the current frontier can be explored in parallel. The parallel top-down pseudo-code is:
In terms of correctness, the above pseudo-code looks good, but there is a benign race condition introduced by updating parents and next. This may result in a node being added more than once, making it inefficient. But it does not affect the correctness of the algorithm. Cleaner code would have a synchronized block to ensure only one thread updates the frontier.
The hybrid approach combining the parallel versions of top-down and bottom-up BFS provides one of the fastest single node implementation of Parallel BFS [1].
References:
-
Beamer, Scott, Krste Asanović, and David Patterson. “Direction-optimizing breadth-first search.” Scientific Programming 21.3 (2013): 137-148.
-
Ugander, Johan, et al. “The anatomy of the facebook social graph.” arXiv preprint arXiv:1111.4503 (2011).
-
Li, Jun, Shuchao Ma, and Shuang Hong. “Recommendation on social network based on graph model.” Control Conference (CCC), 2012 31st Chinese. IEEE, 2012.
-
Watts, Duncan J., and Steven H. Strogatz. “Collective dynamics of ‘small-world’networks.” nature 393.6684 (1998): 440-442.
-
Introduction to Algorithms (1990) by T H Cormen, C E Leiserson, R L Rivest