Butterfly-Core Community Search over Labeled Graphs
Zheng Dong¹, Xin Huang², Guorui Yuan¹, Hengshu Zhu¹, Hui Xiong³
¹ Baidu Talent Intelligence Center, Baidu Inc.
² Hong Kong Baptist University
³ Rutgers University
{dongzheng01, yuanguorui, zhuhengshu}@baidu.com, xinhuang@comp.hkbu.edu.hk, xionghui@gmail.com
arXiv:2105.08628v2 [cs.SI] 20 May 2021

ABSTRACT
Community search aims at finding densely connected subgraphs for query vertices in a graph. While this task has been studied widely in the literature, most existing works focus only on finding homogeneous communities rather than heterogeneous communities with different labels. In this paper, we motivate a new problem of cross-group community search, namely Butterfly-Core Community (BCC), over a labeled graph, where each vertex has a label indicating its properties and an edge between two vertices indicates their cross relationship. Specifically, for two query vertices with different labels, we aim to find a densely connected cross community that contains the two query vertices and consists of butterfly networks, where each wing of the butterflies is induced by a k-core search based on one query vertex and the two wings are connected by these butterflies. Indeed, the BCC structure admits structural cohesiveness and minimum diameter, and thus can effectively capture a heterogeneous and concise collaborative team. Moreover, we theoretically prove that this problem is NP-hard and analyze its non-approximability. To efficiently tackle the problem, we develop a heuristic algorithm, which first finds a BCC containing the query vertices, then iteratively removes from the graph the vertices farthest from the query vertices. The algorithm can achieve a 2-approximation to the optimal solution. To further improve the efficiency, we design a butterfly-core index and develop a suite of efficient algorithms for butterfly-core identification and maintenance as vertices are eliminated. Extensive experiments on seven real-world networks and four novel case studies validate the effectiveness and efficiency of our algorithms.

PVLDB Reference Format:
Zheng Dong, Xin Huang, Guorui Yuan, Hengshu Zhu, Hui Xiong. Butterfly-Core Community Search over Labeled Graphs. PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at http://vldb.org/pvldb/format_vol14.html.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.

1 INTRODUCTION
Graphs are extensively used to represent real-life network data, e.g., social networks, academic collaboration networks, expertise networks, professional networks, and so on. Indeed, most of these networks can be regarded as labeled graphs, where vertices are usually associated with attributes as labels (e.g., roles in IT professional networks). A unique topological structure of a labeled graph is the cross-group community, which refers to the subgraph formed by two tightly knit groups with close collaborations but different labels. For example, a cross-role business collaboration naturally forms a cross-group community in IT professional networks.

In the literature, numerous community models have been proposed for community search based on various kinds of dense subgraphs, e.g., quasi-clique [10], $k$-core [11, 26, 36], $k$-truss [20], and densest subgraph. For example, in the classical $k$-core based community model, a $k$-core subgraph requires that each vertex has at least $k$ neighbors within the $k$-core [3, 35]. The cohesive structure of the $k$-core ensures that group members are densely connected with at least $k$ members. However, most existing studies focus only on finding homogeneous communities [13, 18], which treat the semantics of all vertices and edges without differences.

Figure 1: An example of a labeled graph in IT professional networks with three labels denoted by different shapes and colors: SE, UI and PM. Collaborations between two employees of the same role (across different roles) are denoted by solid edges (dashed edges).

Motivating example. Figure 1 shows an example of an IT professional network $G$, where each vertex represents an employee and an edge represents the collaboration relationship between two employees. The vertices have three shapes and colors, which represent three different roles: "Software Engineer (SE)", "UI Designer (UI)", and "Product Manager (PM)". The edges are drawn as either solid or dashed lines. A solid edge represents a collaboration between two employees of the same role. A dashed edge represents a collaboration across different roles, e.g., between an SE employee and a UI employee. Our motivation is how to effectively find communities formed by these cross-group collaborations given two employees with different roles. Interestingly, considering a search for cross-group communities containing two query vertices $Q = \{q_l, q_r\}$, we find that conventional community search models cannot discover satisfactory results:
 • Structural community search. Algorithms of this kind find communities containing all query vertices over a simple graph,
 which ignores the vertex labels and treats $G$ as a homogeneous graph. W.l.o.g., we select the $k$-core [35] as an example. The maximum core values of $q_l$ and $q_r$ are 4 and 3, respectively; one limitation of this model is that the smaller vertex coreness dominates the value of $k$ needed to contain all query vertices. Each vertex in $G$ has a degree of at least 3, thus the whole graph $G$ is returned as the answer. However, the model suffers from several disadvantages: (1) it fails to capture the different community densities of the two teams; (2) it treats the semantics of all edges equally, which not only ignores the semantics of different edges but also mixes different teams; (3) the vertices span a long distance to others, e.g., the distance between $v_8$ and $u_7$ is 8, and many vertices are irrelevant to the query vertices, such as the vertices $\{v_6, v_7, v_8, v_9, v_{10}\}$ and $\{u_4, u_5, u_6, u_7\}$; (4) it includes vertices whose labels are irrelevant to the query vertices, e.g., a vertex with label PM, which differs from both SE and UI. Other community search models add graph size constraints such as the minimum size of the $k$-core [23] or the minimum diameter [20]. However, such improved models find the answer $\{q_l, q_r, v_5, u_3\}$, which misses many group members that have no cross-group edges.
 • Attributed community search. Studies of attributed community search [13, 18] focus on identifying communities that have a cohesive structure and share homogeneous labels. For instance, given the query vertices $Q = \{q_l, q_r\}$ and a set of two query keywords as input, [13] returns a $k$-core subgraph that maximizes the number of keywords all the vertices share. Since each vertex carries only one label (keyword) in the labeled graph, the vertices of a cross-group community share no common attribute; the keyword cohesiveness is then always 0, and the model returns an empty result.
 • Ours. Our proposed Butterfly-Core Community (BCC) search aims to find cross-group communities using two query vertices; the expected answer is shown in Figure 2. The cross-group community has three key parts. The first is the induced subgraph formed by the vertices with the label SE, which is a 4-core. The second is the 3-core of vertices with the label UI. The third part is the bipartite graph (the subgraph induced by the vertices $\{q_l, q_r, v_5, u_3\}$ with only dashed edges) across the two groups of SE and UI vertices, which contains a butterfly, i.e., a complete 2 × 2 biclique.

Figure 2: An example of a butterfly-core community in the graph of Figure 1. $B$ is a bow tie formed by all dashed edges across the two labeled groups.

Motivated by the above example, in this paper, we study a novel problem of cross-group community search in labeled graphs, namely BCC search. Specifically, given a labeled graph $G$, two query vertices with different labels $Q = \{q_l, q_r\} \subseteq V$ and integers $\{k_1, k_2, b\}$, we define the $(k_1, k_2, b)$-BCC search problem as finding a densely connected cross-group community that is related to the query vertices $Q$.

In light of the above, we are interested in developing efficient algorithms for the BCC search problem. However, the efficient extraction of BCCs raises significant challenges. We theoretically prove that the BCC search problem is NP-hard and cannot be approximated in polynomial time within a factor $(2 - \varepsilon)$ of an optimal answer with the smallest diameter for any small $\varepsilon \in (0, 1)$ unless P = NP. Therefore, we develop a greedy algorithmic framework, which first finds a BCC containing $Q$ and then iteratively maintains the BCC while removing from the graph the vertices farthest from $Q$. The method achieves a 2-approximation to the optimal BCC answer, i.e., its diameter is no greater than twice the smallest diameter. To further improve efficiency, we construct an offline butterfly-core index and develop efficient algorithms for butterfly-core identification and maintenance. In addition, we develop a fast algorithm L2P-BCC, which integrates several optimization strategies, including bulk deletion that removes multiple vertices each time, fast query distance computation, a leader pair strategy, and local exploration to generate a small candidate graph. We further discuss how to extend the BCC model to handle queries with multiple vertex labels. To summarize, we make the following contributions.
 • We study and formulate a novel problem of BCC search over labeled graphs. We propose a $(k_1, k_2, b)$-BCC model to find a cross-group community containing two query vertices $q_l$ and $q_r$ with different labels. Moreover, we give the BCC-problem analysis and illustrate useful applications (Section 3).
 • We show that the BCC problem is NP-hard and cannot be approximated in polynomial time within a factor $(2 - \varepsilon)$ of the optimal diameter for any small $\varepsilon \in (0, 1)$ unless P = NP (Section 4).
 • We develop a greedy algorithm for finding a BCC containing two query vertices, which achieves a 2-approximation to an optimal BCC answer with the smallest diameter. The algorithm iteratively deletes the farthest vertices from a BCC, which achieves a small diameter (Section 5).
 • We develop several improved strategies for fast BCC search. First, we design an efficient bulk deletion strategy to remove multiple vertices at each iteration; second, we optimize the shortest path computations of the two query vertices; third, we devise a leader pair algorithm for butterfly count maintenance; finally, we propose an index-based local search method (Section 6).
 • We extend the BCC model to handle cross-group communities with multiple vertex labels and leverage our L2P-BCC techniques to develop an efficient search solution (Section 7).
 • We conduct extensive experiments on seven real-world datasets with ground-truth communities. Four interesting case studies of cross-group communities are discovered by our BCC model on real-world global flight networks, international trade networks, complex fiction networks, and academic collaboration networks. The results show that our proposed algorithms can efficiently and effectively discover BCCs and significantly outperform other approaches (Section 8).
We discuss related work in Section 2, and conclude the paper with a summary in Section 9.
2 RELATED WORK
Attributed community discovery. The studies of attributed community discovery involve two problems: attributed community detection and attributed community search. Attributed community detection is to find all communities in an attributed graph where vertices have attributed labels [5, 17, 30, 47]. Thus, attributed community detection is not the same as our problem in terms of community properties and input data. A survey of clustering on attributed graphs can be found in [5]. In addition, given a set of query vertices and query attributes, attributed community search finds the query-dependent communities in which vertices share homogeneous query attributes [13, 18, 29]. Most recently, Zhang et al. [45] proposed an attributed community search model using only query keywords but no query vertices. Other works related to ours are community detection in heterogeneous networks [38], where vertices have various vertex labels. However, heterogeneous communities are defined based on meta patterns, which are different from our communities across two labeled groups. Compared with all the above studies, our butterfly-core community search is a novel problem over labeled graphs, which has not been studied before.

Community search. Community search finds the query-dependent communities in a graph [6, 14, 19]. Community search models can be categorized based on different dense subgraphs including $k$-core [2, 11, 26, 27, 36, 40], $k$-truss [20], quasi-clique [10], and densest subgraph [43]. Sozio and Gionis defined the problem of community search and proposed a $k$-core based model with distance and size constraints [36]. All these community models work on simple structural graphs, which ignore the vertex labels. Recently, several complex community models have been studied for various graph data, such as directed graphs [15, 24, 28], weighted graphs [37, 46], spatial-social networks [1, 7, 8, 22] and so on. Most attributed community search studies aim at finding communities that have a dense structure and similar attributes [13, 18, 29, 45]. This is different from our studies over labeled graphs, which tend to find two groups with different labels. Our community model requires the dense structure to appear not only within each group but also between the two groups. Most recently, two studies investigated community search on heterogeneous information networks [16, 21], where vertices belong to multiple labeled types. Fang et al. [16] leveraged meta-path patterns to find communities where all vertices have the same label as a query vertex and close relationships following the given meta-paths. Jian et al. [21] proposed the relational constraint to require connections between labeled vertices in a community. They developed heuristic solutions for detecting and searching relational communities because the problem is hard to approximate. Both studies are different from our BCC search model, which takes two query vertices with different labels and finds a leader pair based community integrating two cross-over groups. Our problem is NP-hard but can be approximately tackled in polynomial time with an approximation ratio of 2.

Butterfly counting. In bipartite graph analytics [34, 42], the butterfly is a cohesive structure of a 2 × 2 biclique. Butterfly counting is to calculate the number of butterflies in a bipartite graph [25, 32, 33, 41, 44]. Sanei-Mehri et al. [32] proposed exact butterfly counting and approximation solutions using randomized strategies. Wang et al. [41] further optimized butterfly counting by assigning high-degree vertices a high priority to visit wedges. However, these studies focus on global butterfly counting, which computes the butterfly number over an entire graph, whereas our BCC search algorithms aim at finding a few vertices with large butterfly degrees as the leaders of cross-group communities. Moreover, our algorithms can dynamically update such leader vertices and their butterfly degrees when the graph structure changes. Overall, our proposed butterfly search solutions find leader vertices and update butterfly degrees locally, which avoids running global butterfly counting multiple times.

Table 1: Frequently used notations.
Notation                  Description
$G = (V, E, \ell)$        a labeled graph with a vertex label function $\ell$
$q_l$, $q_r$              the query vertices
$L$, $B$, $R$             left $k_1$-core, bipartite graph, right $k_2$-core
$V_L$, $V_R$              the vertex sets of the left $k_1$-core and the right $k_2$-core
$dist_G(u, v)$            the length of the shortest path between $u$ and $v$ in $G$
$diam(G)$                 the diameter of a graph $G$
$\chi(v)$                 the butterfly degree of vertex $v$
$\ell(v)$                 the label associated with vertex $v$
$N_G(v)$                  the set of neighbor vertices of vertex $v$ in graph $G$
$N(u) \cap N(v)$          the common neighbors of $u$ and $v$
$N_v^2$                   the vertices within $v$'s 2-hop neighborhood (excluding $v$)

3 PROBLEM FORMULATION
In this section, we introduce the definitions and our problem.

3.1 Labeled Graph
Let $G = (V, E, \ell)$ be a labeled graph, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of undirected edges, and $\ell : V \to \mathcal{A}$ is a vertex label function mapping vertices to labels in $\mathcal{A}$. Each vertex $v \in V$ is associated with a label $\ell(v) \in \mathcal{A}$. The edges have two types: given two vertices $u, v \in V$, if they have the same label, $\ell(u) = \ell(v)$, then $(u, v)$ is a homogeneous edge; otherwise, if $\ell(u) \neq \ell(v)$, $(u, v)$ is a heterogeneous (cross) edge. For example, consider $G$ in Figure 1, which has three labels: SE, UI and PM. The vertex $v_1$ has the label SE. The edge $(v_1, v_2)$ is a homogeneous edge (SE-SE). The edge $(v_5, u_3)$ is a heterogeneous edge (SE-UI).

Given a subgraph $H \subseteq G$, the degree of a vertex $v$ in $H$ is denoted as $deg_H(v) = |N_H(v)|$, where $N_H(v)$ is the set of $v$'s neighbors in $H$. For two vertices $u, v$, we denote by $dist_H(u, v)$ the length of the shortest path between $u$ and $v$ in $H$, where $dist_H(u, v) = \infty$ if $u$ and $v$ are disconnected. The diameter of $H$ is defined as the maximum length of a shortest path in $H$, i.e., $diam(H) = \max_{u, v \in V(H)} dist_H(u, v)$ [20].

3.2 K-Core and Butterfly
We give the definitions of $k$-core [35] and butterfly [32, 41].

Definition 1 ($k$-core). Given a subgraph $H \subseteq G$ and an integer $k$, $H$ is a $k$-core if each vertex $v \in V(H)$ has at least $k$ neighbors within $H$, i.e., $deg_H(v) \geq k$.

The coreness of a vertex $v \in V$ is defined as the largest number $k$ such that there exists a connected $k$-core containing $v$. In Figure 2, $L$ is a 4-core as each vertex has at least 4 neighbors within $L$. Next, we define the butterfly [32, 41] in a bipartite graph.

Definition 2 (Butterfly). Given a bipartite graph $B = (V_L, V_R, E)$ where $E \subseteq V_L \times V_R$, a butterfly $H$ is a 2 × 2 biclique
of $B$ induced by four vertices $v_{l1}, v_{l2} \in V_L$, $v_{r1}, v_{r2} \in V_R$ such that all four edges $(v_{l1}, v_{r1})$, $(v_{l1}, v_{r2})$, $(v_{l2}, v_{r1})$ and $(v_{l2}, v_{r2})$ exist in $H$.

Definition 3 (Butterfly Degree). Given a bipartite graph $B = (V_L, V_R, E)$, the butterfly degree of vertex $v$ is the number of butterfly subgraphs containing $v$ in $B$, denoted by $\chi(v)$.

Example 1. In Figure 2, the subgraph $B$ is a butterfly since it is a 2 × 2 biclique formed by the four vertices $\{q_l, v_5, q_r, u_3\}$. There exists a unique butterfly containing the vertex $q_l$. Thus, the butterfly degree of $q_l$ is $\chi(q_l) = 1$.

3.3 Butterfly-Core Community Model
We next discuss a few choices to model the cross-group relationships between two groups $L$ and $R$ with different labels in the community $H$, and analyze their pros and cons. To quantify the strength of cross-group connections, we use the number of butterflies between the two groups, denoted as $b$.
 • First, we consider requiring $\chi(v) \geq b$ for each vertex $v \in H$, i.e., each vertex's butterfly count is at least $b$. This constraint is too strict and may miss vertices without heterogeneous edges. Take Figure 1 as an example: some vertices act like leaders or liaisons who are in charge of communications across the groups, i.e., $\{q_l, q_r, v_5, u_3\}$, while other vertices mostly link within their own group with fewer interactions across the groups, such as $\{v_1, v_2, v_3, v_4\}$. If we model in this way, an input $Q = \{v_1, q_r\}$ requires that $v_1$ exists in at least one butterfly, which is impossible.
 • Second, we alternatively consider $\sum_{v \in V(H)} \chi(v) \geq b$, which requires the total butterfly count in $H$ to be at least $\lceil b/4 \rceil$, since each butterfly is counted once for each of its four vertices. However, it is hard to determine the parameter $b$ as we cannot estimate a qualified number of butterflies in community $H$; this is a global criterion that varies significantly over different kinds of graphs.
 • Finally, we consider a constraint between the two groups of vertices $V_L$ and $V_R$: $\exists v_l \in V_L$ and $\exists v_r \in V_R$ such that $\chi(v_l) \geq b$ and $\chi(v_r) \geq b$ hold. It is motivated by real applications. Generally, a collaboration community has at least one leader or liaison in each group, so we require that there exists at least one vertex in each group whose butterfly count is at least $b$. In this setting, no matter whether the input query vertices are biased toward leaders (e.g., $Q = \{q_l, q_r\}$) or toward juniors (e.g., $Q = \{v_1, u_1\}$), the underlying community is identical.
In view of these considerations, we define the butterfly-core community as follows.

Definition 4 (Butterfly-Core Community). Given a labeled graph $G$, a $(k_1, k_2, b)$-butterfly-core community (BCC) $H \subseteq G$ satisfies the following conditions:
 1. Two labels: there exist two labels $A_l, A_r \in \mathcal{A}$, $V_L = \{v \in H : \ell(v) = A_l\}$ and $V_R = \{v \in H : \ell(v) = A_r\}$ such that $V_L \cap V_R = \emptyset$ and $V_L \cup V_R = V_H$;
 2. Left core: the subgraph of $H$ induced by $V_L$ is a $k_1$-core;
 3. Right core: the subgraph of $H$ induced by $V_R$ is a $k_2$-core;
 4. Cross-group interactions: $\exists v_l \in V_L$ and $\exists v_r \in V_R$ such that the butterfly degrees satisfy $\chi(v_l) \geq b$ and $\chi(v_r) \geq b$.

In terms of vertex labels, condition (1) requires that the BCC contains exactly two labels for all vertices. In terms of homogeneous groups, conditions (2) and (3) ensure that each homogeneous group satisfies the cohesive structure of a $k$-core, in which community members are internally densely connected. In terms of cross-group interactions, condition (4) targets two representative vertices of the two homogeneous groups, which have the required number of butterflies with dense cross-group interactions. Moreover, we call the two vertices $v_l$ and $v_r$ with $\chi(v_l) \geq b$ and $\chi(v_r) \geq b$ a leader pair.

Example 2. Figure 2 shows a (4, 3, 1)-BCC. The subgraphs $L$ and $R$ are the left 4-core group and the right 3-core group, respectively. The subgraph $B$ is a butterfly across the two groups $L$ and $R$, and $\chi(q_l) = \chi(q_r) = 1$.

3.4 Problem Formulation
We formulate the BCC-Problem studied in this paper.

Problem 1 (BCC-Problem). Given $G(V, E, \ell)$, two query vertices $Q = \{q_l, q_r\} \subseteq V$ and three integers $\{k_1, k_2, b\}$, the BCC-Problem finds a BCC $H \subseteq G$ such that:
 1. Participation & Connectivity: $H$ is a connected subgraph containing $Q$;
 2. Cohesiveness: $H$ is a $(k_1, k_2, b)$-BCC;
 3. Smallest diameter: $H$ has the smallest diameter, i.e., $\nexists H' \subseteq G$ such that $diam(H') < diam(H)$ and $H'$ satisfies conditions 1 and 2.

The BCC-Problem prefers a tight BCC with the smallest diameter, so that group members have a small communication cost and query-unrelated vertices are removed. In addition, we further study the BCC-problem for multiple query vertices, generalizing the BCC model from two group labels to an arbitrary number (at least two) of group labels in Section 7.

Example 3. Consider the labeled graph $G$ in Figure 1. Assume that the inputs are $Q = \{q_l, q_r\}$, $k_1 = 4$, $k_2 = 3$, and $b = 1$. The answer is the (4, 3, 1)-butterfly-core community containing $Q$, as shown in Figure 2.

3.5 Why Butterfly-Core Community Model?
Why butterfly. A butterfly is a complete bipartite subgraph of 2 × 2 vertices, which serves as the fundamental motif in bipartite graphs. For two groups $V_L$ and $V_R$ with different labels, we model the collaborative interactions between the two groups using the butterfly model [4, 12, 31]. More butterflies indicate 1) stronger connections between the two groups and 2) similar properties shared among members of the same group, which are validated in many application scenarios. For instance, in a user-item bipartite graph, an edge connects a user to each item that the user buys. If two users buy the same two items, the four vertices form a butterfly. Two users purchasing the same items indicates more similar purchasing preferences and more butterflies in the community. Similar cases arise with the common members of the boards of directors of two different companies and with the common members of the steering committees of two conference organizations. Moreover, in email communication networks, threads of emails are delivered between two partner groups and are also cc'ed to superiors on both sides. The superiors of
two partner groups receive the most emails and play the leader pair roles in our BCC model. Overall, the butterfly plays an essential role as the basic higher-order motif in bipartite graphs, and can be regarded as an implicit connection measure between two vertices with the same label.

Why BCC model. The BCC model inherits several good structural properties and efficient computations. First, the community structure enjoys high computational efficiency. The $k$-core is a natural and cohesive subgraph model of communities in real applications, requiring that every person has at least $k$ neighbors in social groups, and it can be computed faster than the $k$-clique. In addition, butterfly listing takes polynomial time and admits efficient enumeration, which can be optimized by assigning the wedge visiting priority based on vertex degrees [31, 41]. Second, the two labeled groups in the BCC model admit practical cases of different group densities in real-world applications. Our BCC model crosses over two labeled groups, which may have different group sizes and densities. Thus, two different $k$-core parameters, i.e., $k_1$ and $k_2$, are helpful to capture the different community structures of the two groups. One simple way to set the parameters is to automatically set $k_1$ and $k_2$ to the coreness of the two queries $q_l$ and $q_r$, respectively. Third, the BCC model automatically identifies a leader pair during BCC discovery. The constraint of cross-group interactions is motivated by real-world scenarios in which leaders or liaisons in each group take most of the interactions with the other group.

3.6 Applications
In the following, we illustrate representative applications of butterfly-core community search.
 • Interdisciplinary collaboration search. Consider two principal investigators from different university departments who intend to form a team to apply for an interdisciplinary research grant. The team is better formed by two cohesive groups with good inner-group communications. Moreover, the principal investigators or liaisons should also have cross-group communications.
 • Professional team discovery. In high-tech companies, there are usually many cross-department projects between two teams with different numbers of employees. Moreover, the technical leader and product manager of each team always take charge of the cross-group communications and information sharing, which naturally form a butterfly, i.e., a 2 × 2 biclique.
 • Various real-world cross-group mining tasks. BCC search can be applied on various real-world labeled graphs, e.g., global flight networks, international trade networks, complex fiction networks, and academic collaboration networks, as reported in four interesting case studies in Section 8.

4 HARDNESS AND APPROXIMATION
In this section, we analyze the hardness and non-approximability of the BCC-Problem.

Hardness. We define a decision version of the BCC-Problem.

Problem 2 (BCC-Decision Problem). Given a labeled graph $G(V, E, \ell)$, two query vertices $Q = \{q_l, q_r\} \subseteq V$ with different labels, and parameters $\{k_1, k_2, b\}$, test whether $G$ has a connected butterfly-core subgraph containing $Q$ with a diameter at most $d$.

Theorem 1. The BCC-Problem is NP-hard.

Proof. We reduce the well-known NP-hard problem of Maximum Clique (decision version) to the BCC-Problem. Given a graph $G(V, E)$ and a number $k$, the maximum clique decision problem is to check whether $G$ contains a clique of size $k$. From this, construct an instance $G'(V', E', \ell)$ of the BCC-Problem as follows: $G' = (V' : V + V_c, E' : E + E_c + E_b, \ell)$. For each vertex $v \in V$ we assign the label $A_1$, i.e., $\ell(v) = A_1, \forall v \in V$. $G_c(V_c, E_c)$ is a copy of $G(V, E)$ associated with the label $A_2$, i.e., $\ell(v) = A_2, \forall v \in V_c$. $E_b$ is the edge set that connects any two vertices $u$ and $v$ where $u \in V$ and $v \in V_c$, i.e., $E_b = V \times V_c$. Set the parameters $k_1 = k_2 = k - 1$, $b = 1$ (actually $b$ could be any value with $b \leq (k - 1) \times \binom{k}{2}$), $d = 1$ and the query vertices $Q = \{q_l, q_r\}$ where $q_l \in V$ and $q_r \in V_c$, i.e., $\ell(q_l) \neq \ell(q_r)$. We show that the instance of the maximum clique decision problem is a YES-instance iff the corresponding instance of the BCC-Problem is a YES-instance.

($\Rightarrow$): Clearly, any clique with at least $k$ vertices is a connected $(k - 1)$-core; since $G_c$ is a copy of $G$, there is also a connected $(k - 1)$-core in $G_c$. Because every vertex in $V$ is connected to every vertex in $V_c$, each vertex $v$ of the clique in $V$ forms a butterfly with any other clique vertex in $V$ and any two vertices of the copied clique in $V_c$; the same argument applies to each vertex $v \in V_c$. Then the subgraph $H$ formed by the clique, its copy, and the cross edges between them is a $(k - 1, k - 1, 1)$-BCC with diameter 1.

($\Leftarrow$): Given a solution $H$ for the BCC-Problem, we split $H$ into two parts: $L$, whose vertices have the label $\ell(q_l)$, and $R$, whose vertices have the label $\ell(q_r)$. Since $H$ is a $(k - 1, k - 1, 1)$-BCC, $L$ and $R$ must each contain at least $k$ vertices, and $diam(H) = d = 1$ implies that $L$ and $R$ are both cliques, which implies that $G$ has a clique of size $k$ since $G_c$ is a copy of $G$. □

Given the NP-hardness of the BCC-Problem, it is interesting to ask whether it can be approximately tackled. We analyze the approximation and non-approximability as follows.

Approximation and non-approximability. For $\alpha \geq 1$, we say that an algorithm achieves an $\alpha$-approximation to the BCC-Problem if it outputs a connected $(k_1, k_2, b)$-BCC $H \subseteq G$ such that $Q \subseteq H$ and $diam(H) \leq \alpha \cdot diam(H^*)$, where $H^*$ is the optimal BCC. That is, $H^*$ is a connected $(k_1, k_2, b)$-BCC s.t. $Q \subseteq H^*$, and $diam(H^*)$ is the minimum among all such BCCs containing $Q$.

Theorem 2. Unless P = NP, for any small $\varepsilon \in (0, 1)$ and given parameters $\{k_1, k_2, b\}$, the BCC-Problem cannot be approximated in polynomial time within a factor $(2 - \varepsilon)$ of the optimal.

Proof. We prove it by contradiction. Assume that there exists a $(2 - \varepsilon)$-approximation algorithm for the BCC-Problem with polynomial time complexity, no matter how small $\varepsilon \in (0, 1)$ is. This algorithm can distinguish between the YES and NO instances of the maximum clique decision problem: diameters are integers, so on an instance whose optimal diameter is 1 the algorithm must return an answer with diameter strictly less than 2, i.e., exactly 1. That is, if an approximate answer of the reduction problem has a diameter of 1, it corresponds to a YES-instance of the maximum clique decision problem; otherwise, an answer with a diameter of no less than 2 corresponds to a NO-instance of the maximum clique decision problem. This is impossible unless P = NP. □
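The construction used in the reduction of Theorem 1 can be checked on a small instance. The following Python sketch is illustrative only: the function names `build_bcc_instance` and `butterfly_degree`, the adjacency-dictionary representation, and the tagging of copied vertices as `("c", v)` are assumptions made for this example and do not come from the paper. It builds the labeled instance from an input graph (original vertices with label 1, a copy with label 2, and all cross edges between them) and counts, for a given vertex, the butterflies formed by cross edges.

```python
from math import comb


def build_bcc_instance(vertices, edges):
    """Build the labeled graph of the reduction (illustrative representation).

    Each original vertex v gets label 1; its copy ("c", v) gets label 2.
    Cross edges connect every original vertex to every copied vertex.
    Returns (adjacency dict of sets, label dict).
    """
    label = {v: 1 for v in vertices}
    label.update({("c", v): 2 for v in vertices})
    adj = {v: set() for v in label}

    def add_edge(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for u, v in edges:
        add_edge(u, v)                  # original edge, label-1 side
        add_edge(("c", u), ("c", v))    # copied edge, label-2 side
    for u in vertices:
        for v in vertices:
            add_edge(u, ("c", v))       # complete bipartite cross edges
    return adj, label


def butterfly_degree(adj, label, v):
    """Number of 2 x 2 bicliques over cross edges that contain v."""
    def cross(x):
        # Neighbors of x that carry a different label (cross edges only).
        return {y for y in adj[x] if label[y] != label[x]}

    total = 0
    for w in adj:
        if w != v and label[w] == label[v]:
            # v and w plus any two common cross neighbors form one butterfly.
            total += comb(len(cross(v) & cross(w)), 2)
    return total
```

For a triangle input ($k = 3$), the constructed graph is the complete graph on six vertices, so its diameter is 1, and every vertex has butterfly degree $(k - 1)\binom{k}{2} = 2 \cdot 3 = 6$, which satisfies the requirement $b = 1$.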
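Algorithm 1 BCC Online Search ($G$, $Q$)
Require: $G = (V, E, \ell)$, $Q = \{q_l, q_r\}$, three integers $\{k_1, k_2, b\}$.
Ensure: A connected $(k_1, k_2, b)$-BCC $O$ with a small diameter.
 1: Find a maximal connected $(k_1, k_2, b)$-BCC containing $Q$ as $G_0$; //see Algorithm 2
 2: $l \leftarrow 0$;
 3: while $connect_{G_l}(Q)$ = true do
 4:    Compute $dist_{G_l}(q, u)$, $\forall q \in Q$ and $\forall u \in G_l$;
 5:    $u^* \leftarrow \arg\max_{u \in G_l} dist_{G_l}(u, Q)$;
 6:    $dist_{G_l}(G_l, Q) \leftarrow dist_{G_l}(u^*, Q)$;
 7:    Delete $u^*$ and its incident edges from $G_l$;
 8:    Maintain $G_l$ as a $(k_1, k_2, b)$-BCC; //see Algorithm 4
 9:    $G_{l+1} \leftarrow G_l$; $l \leftarrow l + 1$;
10: $O \leftarrow \arg\min_{G' \in \{G_0, ..., G_{l-1}\}} dist_{G'}(G', Q)$;

Algorithm 2 Find $G_0$ ($G$, $Q$)
Require: $G = (V, E, \ell)$, $Q = \{q_l, q_r\}$, three integers $\{k_1, k_2, b\}$.
Ensure: A connected $(k_1, k_2, b)$-BCC $G_0$ containing $Q$.
 1: $V_L \leftarrow \{v \in V \mid \ell(v) = \ell(q_l)\}$; $V_R \leftarrow \{v \in V \mid \ell(v) = \ell(q_r)\}$;
 2: Let $L$ be a $k_1$-core induced subgraph of $G$ by $V_L$;
 3: Let $R$ be a $k_2$-core induced subgraph of $G$ by $V_R$;
 4: $B = \{V_B, E_B\}$, where $V_B = V_L \cup V_R$ and $E_B = \{V_L \times V_R\} \cap E$;
 5: Butterfly Counting($B$); // See Algorithm 3
 6: $b_l \leftarrow \max_{v \in V_L} \chi(v)$;
 7: $b_r \leftarrow \max_{v \in V_R} \chi(v)$;
 8: if $b_l < b$ or $b_r < b$ then
 9:    return $\emptyset$;
10: $G_0 \leftarrow L \cup B \cup R$;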
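As a rough illustration of Algorithm 2, the sketch below computes a candidate $G_0$ in Python. It is not the authors' implementation: the helper names (`k_core`, `component`, `butterfly_degrees`, `find_g0`), the adjacency representation as a dictionary of sets, and the simple per-vertex butterfly counting through common cross neighbors (used here in place of Algorithm 3) are assumptions made for this example. For brevity it returns only the vertex set of the result, or an empty set when no candidate exists.

```python
from collections import deque
from math import comb


def k_core(adj, vertices, k):
    """Peel vertices of degree < k from the subgraph induced by `vertices`."""
    alive = set(vertices)
    deg = {v: len(adj[v] & alive) for v in alive}
    queue = deque(v for v in alive if deg[v] < k)
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue  # already removed through an earlier queue entry
        alive.discard(v)
        for u in adj[v] & alive:
            deg[u] -= 1
            if deg[u] < k:
                queue.append(u)
    return alive


def component(adj, alive, source):
    """Vertices of `alive` reachable from `source` without leaving `alive`."""
    if source not in alive:
        return set()
    seen, queue = {source}, deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v] & alive:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen


def butterfly_degrees(adj, left, right):
    """Butterfly degree of every vertex, using only left-right cross edges."""
    chi = {}
    for side, other in ((left, right), (right, left)):
        for v in side:
            nv = adj[v] & other  # cross neighbors of v
            chi[v] = sum(comb(len(nv & adj[w]), 2) for w in side if w != v)
    return chi


def find_g0(adj, label, q_l, q_r, k1, k2, b):
    """Vertex set of a candidate (k1, k2, b)-BCC containing q_l and q_r."""
    v_l = {v for v in adj if label[v] == label[q_l]}
    v_r = {v for v in adj if label[v] == label[q_r]}
    left = component(adj, k_core(adj, v_l, k1), q_l)
    right = component(adj, k_core(adj, v_r, k2), q_r)
    if not left or not right:
        return set()
    chi = butterfly_degrees(adj, left, right)
    if max(chi[v] for v in left) < b or max(chi[v] for v in right) < b:
        return set()
    return left | right
```

The sketch restricts each core to the connected component of its query vertex, which keeps the core property because every neighbor of a surviving vertex lies in the same component. It does not separately verify that the two sides are connected to each other; for $b \geq 1$ the required butterflies already supply cross edges between them.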

5 BCC ONLINE SEARCH ALGORITHMS
In this section, we present a greedy algorithm for the BCC-problem, which searches for a BCC online. Then, we show that the greedy algorithm achieves a 2-approximation to the optimal answer. Finally, we discuss an efficient implementation of the algorithm and analyze its time and space complexity.

5.1 BCC Online Search Algorithm
We begin with a definition of query distance as follows.

DEFINITION 5 (QUERY DISTANCE). Given a graph $G$, a query set $Q$, and a set of vertices $X$, the query distance of $X$ is the maximum length of the shortest path from $v \in X$ to a query vertex $q \in Q$, i.e., $\mathrm{dist}_G(X, Q) = \max_{v \in X, q \in Q} \mathrm{dist}_G(v, q)$.

For simplicity, we use $\mathrm{dist}_G(H, Q)$ and $\mathrm{dist}_G(v, Q)$ to represent the query distance for the vertex set $V_H$ in $H \subseteq G$ and for a vertex $v \in V$. Motivated by [20], we develop a greedy algorithm to find a BCC with the smallest diameter. Here is an overview of the algorithm. First, it finds a maximal connected $(k_1, k_2, b)$-BCC containing $Q = \{q_l, q_r\}$, denoted as $G_0$. As the diameter of $G_0$ may be large, it then iteratively removes from $G_0$ the vertices far from $Q$, while maintaining the remaining graph as a $(k_1, k_2, b)$-BCC.

Algorithm. Algorithm 1 outlines a greedy algorithmic framework for finding a BCC. The algorithm first finds $G_0$, a maximal connected $(k_1, k_2, b)$-BCC containing $Q$ (line 1). Then, we set $l = 0$. For all $u \in G_l$ and $q \in Q$, we compute the shortest distance between $u$ and $q$, and obtain the vertex query distance $\mathrm{dist}_{G_l}(u, Q)$ (line 4). Among all vertices, we pick a vertex $u^*$ with the maximum distance $\mathrm{dist}_{G_l}(u^*, Q)$, which equals $\mathrm{dist}_{G_l}(G_l, Q)$ (lines 5-6). Next, we remove the vertex $u^*$ and its incident edges from $G_l$ and also delete vertices/edges to maintain $G_l$ as a $(k_1, k_2, b)$-BCC (lines 7-8). Then, we repeat the above steps until $G_l$ no longer qualifies as a BCC containing $Q$ (lines 3-9). Finally, the algorithm terminates and returns a BCC $O$, where $O$ is one of the graphs $G' \in \{G_0, \ldots, G_{l-1}\}$ with the smallest query distance $\mathrm{dist}_{G'}(G', Q)$ (line 10). Note that each intermediate graph $G' \in \{G_0, \ldots, G_{l-1}\}$ is a $(k_1, k_2, b)$-BCC.

5.2 Butterfly-Core Discovery and Maintenance
We present two important procedures for the BCC online search algorithm: finding $G_0$ (line 1 in Algorithm 1) and butterfly-core maintenance (line 8 in Algorithm 1).

5.2.1 Finding $G_0$. An essential step, finding $G_0$, is to identify a maximal connected $(k_1, k_2, b)$-BCC containing $Q$ in graph $G$. The challenge lies in finding a butterfly-core structure, which requires shrinking the graph by vertex removals. However, deleting vertices may change the coreness and butterfly degree of vertices in the remaining graph. To address this, our algorithm runs the $k$-core decomposition algorithm twice and then runs the butterfly counting method once. The general idea is to first identify a candidate subgraph formed by two groups of vertices sharing the same labels as $q_l$ and $q_r$. Then, it shrinks the graph by applying the core decomposition algorithm, which deletes disqualified vertices to identify the $k_1$-core and the $k_2$-core, denoted by $L$ and $R$ respectively. Then, it counts the butterfly degree for all vertices and checks whether there exist two vertices $v_l \in L$ and $v_r \in R$ such that $\chi(v_l) \ge b$ and $\chi(v_r) \ge b$ hold.

Algorithm 2 presents the details of finding $G_0$. For the query vertices $Q$, we first pick out all vertices with the same labels as the query vertices (line 1). Each of the vertex sets $V_L$ and $V_R$ induces a subgraph, on which we run the $k$-core algorithm and find the connected components $L$ and $R$ containing the query vertices $q_l$ and $q_r$ (lines 2-3). Next, we construct a bipartite graph $B$ to find cross-group butterfly structures in the community. $B$ consists of the vertex sets $V_L$ and $V_R$, and $E_B$ are the cross-group edges (line 4). Then, we compute the number of butterflies for each vertex in $B$ using Algorithm 3 (line 5), which is presented in detail in the next paragraph. Algorithm 3 returns the butterfly degree of all the vertices, and we maintain two values $b_l$ and $b_r$ to record the maximum butterfly degree on each side. Then we check whether there exists at least one vertex whose butterfly degree is no less than $b$ on each side, i.e., $b_l \ge b$ and $b_r \ge b$; otherwise we return $\emptyset$ (lines 8-9). Finally, we merge the three subgraph parts to form $G_0$ (line 10).

Next, we describe the details of the butterfly counting algorithm. $N_u^2 = \{w \mid \exists v \in N(u) : (w, v) \in E \wedge w \ne u\}$ is the set of vertices that are exactly at distance 2 from $u$, i.e., neighbors of the neighbors of $u$ (excluding $u$ itself). To calculate the butterfly degree of each vertex, we take a vertex $u \in L$ as an example (the same holds for $u \in R$). Each butterfly it participates in has one other vertex $w \in L$ ($w \ne u$) and two vertices $\{v, x\} \in R$. By definition, $w \in N_u^2$. In order to find the number of pairs $v, x$ with which $u$ and $w$ form a butterfly, we compute the intersection of the neighbor sets of $u$ and $w$. We use $|N(u) \cap N(w)|$ to denote the number of common neighbors of $u$ and $w$, so the number of such pairs is $\binom{|N(u) \cap N(w)|}{2}$. The butterfly degree equation
Algorithm 3 Butterfly Counting [32, 41]
Require: $B = (V_B, E_B)$.
Ensure: $\chi(v)$ for all vertices $v \in V_B$.
 1: for $u \in V_B$ do
 2:   $P \leftarrow$ hashmap; // initialized with zero
 3:   for $v \in N(u)$ do
 4:     for $w \in N(v) \setminus \{u\}$ do
 5:       $P[w] \leftarrow P[w] + 1$; //the number of 2-hop paths
 6:   for $w \in P$ do
 7:     $\chi(u) \leftarrow \chi(u) + \binom{P[w]}{2}$;
 8: return $\{\chi(v) \mid \forall v \in V_B\}$;

Algorithm 4 Butterfly Core Maintenance $(G, S)$
Require: $G = L \cup B \cup R$, vertex set $S$ to be removed, three integers $\{k_1, k_2, b\}$.
Ensure: A $(k_1, k_2, b)$-BCC graph.
 1: Split $S$ into $S_L$ and $S_R$ according to their labels;
 2: Remove vertices $S_L$ from $L$ and maintain $L$ as a $k_1$-core, update $B$;
 3: Remove vertices $S_R$ from $R$ and maintain $R$ as a $k_2$-core, update $B$;
 4: Run Butterfly Counting on $B$ and check that there exists one vertex on each side with butterfly degree no less than $b$;
 5: return $G$;
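The hash-map counting of Algorithm 3 can be sketched in Python as follows, assuming the bipartite graph is given as a dictionary mapping each vertex to the set of its cross-group neighbors. The function name `butterfly_degrees` is hypothetical; the sketch is a direct transcription of the 2-hop path counting idea and is not tuned for large graphs.

```python
from collections import defaultdict
from math import comb


def butterfly_degrees(adj):
    """Butterfly degree of every vertex of a bipartite graph.

    adj maps each vertex to the set of its neighbors on the other side.
    For each vertex u we count the 2-hop paths u - v - w to every other
    same-side vertex w; c such paths yield comb(c, 2) butterflies.
    """
    chi = {}
    for u in adj:
        paths = defaultdict(int)  # w -> number of common neighbors with u
        for v in adj[u]:
            for w in adj[v]:
                if w != u:
                    paths[w] += 1
        chi[u] = sum(comb(c, 2) for c in paths.values())
    return chi
```

On the complete bipartite graph with 2 vertices on one side and 4 on the other, each vertex of the small side shares 4 neighbors with the other one and each vertex of the large side shares 2 neighbors with each of its 3 peers, so the two sides receive $\binom{4}{2}$ and $3\binom{2}{2}$ butterflies per vertex respectively.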

for $u$ is as follows: $\chi(u) = \sum_{w \in N_u^2} \binom{|N(u) \cap N(w)|}{2}$. Instead of performing a set intersection at each step, we count and store the number of paths from a vertex $u \in B$ to each of its distance-2 neighbors $w \in N_u^2$ by using a hash map $P$ (lines 1-7) [32].

5.2.2 $(k_1, k_2, b)$-Butterfly-Core Maintenance. Algorithm 4 describes the procedure for maintaining $G$ as a $(k_1, k_2, b)$-BCC after the deletion of vertices $S$ from $G$. In Algorithm 1, $S = \{u^*\}$ (line 8). Generally speaking, after removing vertices $S$ and their incident edges from $G$, $G$ may no longer be a $(k_1, k_2, b)$-BCC, or $Q$ may be disconnected. Thus, Algorithm 4 iteratively deletes vertices having degree less than $k_1$ ($k_2$), until $G$ becomes a connected $(k_1, k_2, b)$-BCC containing $Q$. It first splits the vertex set $S$ into two parts by their labels (line 1). When $S$ is a single vertex, it suffices to run the maintenance on the corresponding side. Then it runs the core maintenance algorithm on $L$ and $R$ respectively to maintain $L$ ($R$) as a $k_1$-core ($k_2$-core) (lines 2-3). Next, we count the butterfly degree again on the updated $B$ with Algorithm 3 to check whether it meets the butterfly constraint of the BCC model (line 4). Finally, Algorithm 4 produces a BCC (line 5).

5.3 Approximation and Complexity Analysis
We first analyze the approximation of Algorithm 1.

THEOREM 3. Algorithm 1 achieves a 2-approximation to an optimal solution $O^*$ of the BCC-problem, that is, the obtained $(k_1, k_2, b)$-BCC $O$ has $\mathrm{diam}(O) \le 2\,\mathrm{diam}(O^*)$.

PROOF. First, we have $\mathrm{diam}(O) = \max_{u, v \in O} \mathrm{dist}_O(u, v)$ and $\mathrm{dist}_O(O, Q) = \max_{u \in O, q \in Q} \mathrm{dist}_O(u, q)$. Because $Q \subseteq O$, $\mathrm{dist}_O(O, Q) \le \mathrm{diam}(O)$ holds. Suppose the longest shortest path in $O$ is from vertex $s$ to $t$, i.e., $\mathrm{diam}(O) = \mathrm{dist}_O(s, t)$. For all $q \in Q$, $\mathrm{diam}(O) = \mathrm{dist}_O(s, t) \le \mathrm{dist}_O(s, q) + \mathrm{dist}_O(q, t) \le 2\,\mathrm{dist}_O(O, Q)$.

Next, we prove $\mathrm{dist}_O(O, Q) \le \mathrm{dist}_{O^*}(O^*, Q)$, motivated by [20]. Algorithm 1 outputs a sequence of intermediate graphs $\{G_0, \ldots, G_{l-1}\}$, which are BCCs containing the query vertices $Q$. $O$ is the one with the smallest query distance, i.e., $\mathrm{dist}_O(O, Q) \le \mathrm{dist}_{G_i}(G_i, Q)$, $\forall i \in \{0, \ldots, l-1\}$. We consider two cases. (1) $O^* \not\subseteq G_{l-1}$. Suppose the first deleted vertex $v^* \in O^*$ is removed from graph $G_i$, i.e., $O^* \subseteq G_i$, where $0 \le i \le l-2$. The vertex $v^*$ must be deleted because of the distance constraint and not because of the butterfly-core structure maintenance. Thus, $\mathrm{dist}_{G_i}(G_i, Q) = \mathrm{dist}_{G_i}(v^*, Q) = \mathrm{dist}_{G_i}(O^*, Q)$. As a result, we have $\mathrm{dist}_O(O, Q) \le \mathrm{dist}_{G_i}(G_i, Q) = \mathrm{dist}_{G_i}(O^*, Q) \le \mathrm{dist}_{O^*}(O^*, Q)$, where the first inequality holds because $O$ has the smallest query distance, and the second inequality holds because $O^*$ is a subgraph of $G_i$. (2) $O^* \subseteq G_{l-1}$. We prove $\mathrm{dist}_{G_{l-1}}(G_{l-1}, Q) \le \mathrm{dist}_{G_{l-1}}(O^*, Q)$ by contradiction. If $\mathrm{dist}_{G_{l-1}}(G_{l-1}, Q) > \mathrm{dist}_{G_{l-1}}(O^*, Q)$, then $G_{l-1}$ will not be the last feasible BCC. There must exist a vertex $u^* \in G_{l-1} \setminus O^*$ with the largest query distance $\mathrm{dist}_{G_{l-1}}(u^*, Q)$, so that $\mathrm{dist}_{G_{l-1}}(u^*, Q) = \mathrm{dist}_{G_{l-1}}(G_{l-1}, Q) > \mathrm{dist}_{G_{l-1}}(O^*, Q)$. In the next iteration, Algorithm 1 will delete $u^*$ from $G_{l-1}$ and maintain the butterfly-core structure of $G_{l-1}$. As $O^*$ is a BCC, Algorithm 1 can find a feasible BCC $G_l$ s.t. $O^* \subseteq G_l$, which contradicts that $G_{l-1}$ is the last feasible BCC. Overall, $\mathrm{dist}_O(O, Q) \le \mathrm{dist}_{G_{l-1}}(G_{l-1}, Q) \le \mathrm{dist}_{G_{l-1}}(O^*, Q) \le \mathrm{dist}_{O^*}(O^*, Q)$.

From the above, we have proved that $\mathrm{diam}(O) \le 2\,\mathrm{dist}_O(O, Q)$ and $\mathrm{dist}_O(O, Q) \le \mathrm{dist}_{O^*}(O^*, Q)$. We have $\mathrm{diam}(O) \le 2\,\mathrm{dist}_O(O, Q) \le 2\,\mathrm{dist}_{O^*}(O^*, Q) \le 2\,\mathrm{diam}(O^*)$. □

To clarify, Algorithm 1 returns a BCC community $O$ that has the minimum query distance, $O = \arg\min_{G' \in \{G_0, \ldots, G_{l-1}\}} \mathrm{dist}_{G'}(G', Q)$ (line 10), rather than the BCC community $O'$ with the smallest diameter, i.e., $O' = \arg\min_{G' \in \{G_0, \ldots, G_{l-1}\}} \mathrm{diam}(G')$. This improves efficiency by avoiding the expensive diameter calculation. The BCC community $O'$ also inherits the 2-approximation of the optimal answer, i.e., $\mathrm{diam}(O') \le \mathrm{diam}(O) \le 2\,\mathrm{diam}(O^*)$.

Complexity analysis. We analyze the time and space complexity of Algorithm 1. Let $t$ be the number of iterations, with $t \le |V|$. We assume that $|V| - 1 \le |E|$, w.l.o.g., considering that $G$ is a connected graph.

THEOREM 4. Algorithm 1 takes $O(t(\sum_{u \in V} d_u^2 + |E|))$ time and $O(|E|)$ space.

PROOF. Let $|V|$ and $|E|$ denote the number of vertices and edges respectively, and let $d_u$ be the vertex degree in the bipartite graph $B$. First, the time complexity of finding $G_0$ in Algorithm 2 is $O(\sum_{u \in V} d_u^2 + |E|)$, since it runs the $k$-core computation once in $O(|E|)$ and calls Algorithm 3 once in $O(\sum_{u \in V} d_u^2)$ [32]. Next, we analyze the time complexity of shrinking the community diameter. First, the computation of shortest distances by a BFS traversal starting from each query vertex $q \in Q$ takes $O(t|Q||E|)$ time for $t$ iterations; since $|Q| = 2$, $O(t|Q||E|)$ simplifies to $O(t|E|)$. Second, the time complexity of the maintenance Algorithm 4 is $O(t\sum_{u \in V} d_u^2 + |E|)$ in $t$ iterations, i.e., $t$ runs of butterfly counting, while the $t$ runs of the $k$-core maintenance algorithm take $O(|E|)$ in total. The number of removed edges in each iteration is no less than $\min\{k_1, k_2\} - 1$, thus the total number of iterations is $t \le O(\min\{|V| - \min\{k_1, k_2\}, |E| / \min\{k_1, k_2\}\})$, which is $O(|V|)$ since $|V| < |E|$. As a result, the time complexity of Algorithm 1 is $O(t(\sum_{u \in V} d_u^2 + |E|))$. Next, we analyze the space complexity of Algorithm 2. It takes $O(|V| + |E|)$ space to store graph $G$ and $O(|V|)$ space to keep the vertex information including the
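Algorithm 5 Fast Query Distance Computation $(G_i, q, D_i)$
Require: A graph $G_i$, query vertex $q$, a set of removal vertices $D_i$.
Ensure: The updated distance $\mathrm{dist}(v, q)$ for all vertices $v$.
 1: Remove all vertices $D_i$ and their incident edges from $G_i$;
 2: $d_{min} \leftarrow \min_{v \in D_i} \mathrm{dist}_{G_i}(v, q)$;
 3: $S_s = \{v \in V(G_i) \setminus D_i \mid \mathrm{dist}_{G_i}(v, q) = d_{min}\}$;
 4: $S_u = \{v \in V(G_i) \setminus D_i \mid \mathrm{dist}_{G_i}(v, q) > d_{min}\}$;
 5: Apply the BFS traverse on $G_i$ starting from $S_s$ to update the query distance of vertices in $S_u$;
 6: return all updated distance $\mathrm{dist}(v, q)$ for $v \in S_u$;

Table 2: The shortest distances of queries to other vertices. The symbol “−” represents the vertex set unchanged. The vertices with changed distance are depicted in bold, i.e., $u_4$ and $u_7$.

 query | 1 | 2 | 3 | 4
 $q_l$ | $\{v_1, v_2, v_3\}$ | $\{u_2, u_3, u_5, u_6\}$ | $\{q_r, u_1, u_4, u_7\}$ | $\{u_9\}$
 $q_r$ | $\{u_1, u_2, u_3, u_9\}$ | $\{v_1, v_3, u_4, u_5, u_7\}$ | $\{q_l, v_2, u_6\}$ | $\emptyset$
 after the deletion of $u_9$
 $q_l$ | − | − | − | $\emptyset$
 $q_r$ | $\{u_1, u_2, u_3\}$ | $\{v_1, v_3, u_5\}$ | $\{q_l, v_2, u_6, \mathbf{u_4}, \mathbf{u_7}\}$ | $\emptyset$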
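The partial update of Algorithm 5 can be sketched in Python as below, under the assumptions that the graph is a dictionary of neighbor sets, that it was connected before the removal (so every vertex has a stored distance), and that the query vertex itself is never removed. The names `bfs_distances` and `fast_update` are hypothetical. Vertices at distance at most $d_{min}$ keep their distance; only farther vertices are revisited by a multi-source BFS started at the surviving vertices at distance exactly $d_{min}$.

```python
import math
from collections import deque


def bfs_distances(adj, q):
    """Plain BFS distances from q (reference implementation)."""
    dist = {q: 0}
    queue = deque([q])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def fast_update(adj, dist, removed):
    """Delete `removed` from adj and repair dist in place.

    dist holds the distances to the query vertex before the deletion.
    Vertices that become unreachable end up with distance math.inf.
    """
    removed = set(removed)
    d_min = min(dist[v] for v in removed)
    for v in removed:
        for w in adj[v]:
            if w not in removed:
                adj[w].discard(v)
    for v in removed:
        del adj[v]
        del dist[v]
    starts = [v for v in adj if dist[v] == d_min]   # unchanged frontier
    stale = {v for v in adj if dist[v] > d_min}     # may have changed
    for v in stale:
        dist[v] = math.inf
    queue = deque(starts)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w in stale and dist[w] == math.inf:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist
```

Because all starting vertices share the same distance $d_{min}$, the FIFO queue still processes vertices level by level, so the repaired values agree with a full BFS on the reduced graph whenever the graph stays connected.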

coreness, butterfly degree, and query distance. Overall, Algorithm 2 takes $O(|V| + |E|) = O(|E|)$ space, since the graph is connected and thus $|V| - 1 \le |E|$. □
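The first step in the proof of Theorem 3, namely that the query distance of a connected graph containing $Q$ is bounded by its diameter, which in turn is at most twice the query distance, can be checked numerically with a small Python sketch. The helper names `bfs`, `query_distance`, and `diameter` are hypothetical, and the code assumes a connected graph stored as a dictionary of neighbor sets.

```python
from collections import deque


def bfs(adj, s):
    """Hop distances from s to every vertex of a connected graph."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def query_distance(adj, queries):
    """Largest distance from any vertex to any query vertex (Definition 5)."""
    return max(max(bfs(adj, q).values()) for q in queries)


def diameter(adj):
    """Largest shortest-path length over all vertex pairs (all-pairs BFS)."""
    return max(max(bfs(adj, s).values()) for s in adj)
```

The query distance needs one BFS per query vertex, while the diameter needs one BFS per vertex, which is the cost that line 10 of Algorithm 1 avoids by ranking the intermediate graphs by query distance.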

6 L2P-BCC: LEADER PAIR BASED LOCAL BCC SEARCH
Based on our greedy algorithmic framework in Algorithm 1, we propose three methods for fast BCC search in this section. The first method, in Section 6.1, is the fast computation of query distance, which only updates the vertices whose query distances change. The second method, in Section 6.2, quickly identifies a pair of leader vertices, both of which have butterfly degrees no less than $b$. The leader pair tends to keep large butterfly degrees even after a phase of graph removal, which saves much of the computation in butterfly counting. The third method, local BCC search, is presented in Section 6.3; it finds a small candidate graph for bulk removal refinement, instead of starting from the whole graph $G$.

6.1 Fast Query Distance Computation
Here, we present a fast algorithm to compute the query distance for vertices in $G$. Line 4 of Algorithm 1 needs to compute the query distance for all vertices, which incurs expensive computation costs. However, we observe that a majority of vertices keep their query distance unchanged after each phase of graph removal. This suggests that a partial update of query distances can remain exact while improving efficiency.

The key idea is to identify the vertices whose query distances need to be updated. Given a set of vertices $D_i$ deleted in graph $G_i$, let $d_{min} = \min_{v \in D_i} \mathrm{dist}_{G_i}(v, q)$ denote the smallest distance from a deleted vertex to the query vertex $q$. Let the vertex set $S_u = \{v \in V(G_i) \setminus D_i \mid \mathrm{dist}_{G_i}(v, q) > d_{min}\}$. We have two useful observations as follows.
 • For each vertex $u \in S_u$, we need to recompute the query distance $\mathrm{dist}_{G_{i+1}}(u, q)$. It is due to the fact that the vertex $u$ with $\mathrm{dist}_{G_i}(u, q)$

Figure 3: An example of labeled graph $G$ and its bipartite subgraph $B$.

starting points (line 3). Let $S_u$ be the set of vertices to be updated, whose shortest path is larger than $d_{min}$ (line 4). Then, we run the BFS algorithm starting from the vertices in $S_s$, treating $S_u$ as unvisited and all the other vertices as visited (line 5). The algorithm terminates when all vertices of $S_u$ are visited or the BFS queue becomes empty. Finally, we return the shortest path to $q$ for all vertices (line 6). Note that each iteration of Algorithm 1 always keeps one query vertex's distances to other vertices unchanged and only needs to update the distances of the other query vertex, because $S_u = \emptyset$ always holds for one of the query vertices.

EXAMPLE 4. Consider the graph $G$ in Figure 3 as $G_0$ and $Q = \{q_l, q_r\}$. From Table 2 we know the vertex $u_9$ has the maximum query distance to $Q$, i.e., $\mathrm{dist}(u_9, Q) = 4$ (line 5 in Algorithm 1). Thus, the removal vertex set is $D = \{u_9\}$. Now, we apply Algorithm 5 on $G$ to update the query distance. (1) For $q_l$, the vertex $u_9$ is the farthest vertex from $q_l$, thus $S_u = \emptyset$ (line 4), which indicates that no vertices' distances to $q_l$ need to be updated. (2) For $q_r$, $D = \{u_9\}$ and $d_{min} = \min_{v \in \{u_9\}} \mathrm{dist}(v, q_r) = \mathrm{dist}(u_9, q_r) = 1$ (line 2). The vertex set to update is $S_u = \{v \in V(G) \setminus \{u_9\} \mid \mathrm{dist}(v, q_r) > 1\} = \{v \in V(G) \setminus \{u_9\} \mid \mathrm{dist}(v, q_r) = 2\} \cup \{v \in V(G) \setminus \{u_9\} \mid \mathrm{dist}(v, q_r) = 3\} = \{v_1, v_3, u_4, u_5, u_7\} \cup \{q_l, v_2, u_6\} = \{q_l, v_1, v_2, v_3, u_4, u_5, u_6, u_7\}$ (line 4). Then, we apply BFS search starting from the vertex set $S_s = \{v \in V(G) \setminus \{u_9\} \mid \mathrm{dist}(v, q_r) = 1\} = \{u_1, u_2, u_3\}$ (line 3),
Algorithm 6 Leader Pair Identification $(X, q, \rho, b)$
Require: A graph $X = L$ (or $R$), a query vertex $q = q_l$ (or $q_r$), $\rho$, $b$.
Ensure: A lead vertex $p = p_l$ (or $p_r$).
 1: $p \leftarrow q$; //Initiate as query vertex.
 2: $b_{max} \leftarrow \max_{v \in V(X)} \chi(v)$;
 3: $b_p \leftarrow b_{max}/2$;
 4: if $\chi(q) > b_p$ then
 5:   return $p$; //Query vertex has a large enough butterfly degree.
 6: else
 7:   while $b_p \ge b$ do
 8:     $d \leftarrow 1$; //Search from the 1-hop neighbors of $q$.
 9:     while $d \le \rho$ do
10:       $S \leftarrow \{v \mid \mathrm{dist}_X(v, q) = d, v \in V(X)\}$;
11:       if $\exists v \in S$ that $\chi(v) \ge b_p$ then
12:         return $v$;
13:       else
14:         $d \leftarrow d + 1$; //Increase the search distance $d$.
15:     $b_p \leftarrow b_p/2$;
16: return $p$;

Algorithm 7 Butterfly Degree Update for Leader Pair $(B, p, v)$
Require: A graph $B = (V_B, E_B)$, a leader vertex $p$, a deletion vertex $v$.
Ensure: Butterfly degree $\chi(p)$.
 1: if $\ell(p) = \ell(v)$ then
 2:   $\alpha \leftarrow |N(p) \cap N(v)|$; //The number of common neighbors of $p$, $v$.
 3:   $\chi(p) \leftarrow \chi(p) - \binom{\alpha}{2}$;
 4: else
 5:   if $v \in N(p)$ then
 6:     for $u \in N(v)$ and $u \ne p$ do
 7:       $\beta \leftarrow \beta + |N(u) \cap N(p)| - 1$; //Add the count of removed butterflies.
 8:     $\chi(p) \leftarrow \chi(p) - \beta$;
 9: return $\chi(p)$;

large enough w.r.t. $b$ in two groups $L$ and $R$, even after a number of graph removal iterations. Thus, it can avoid finding a new leader pair and save time. In the following, we show two key observations for finding a good leader pair $(p_l, p_r)$ where $p_l \in L$ and $p_r \in R$.

OBSERVATION 1. The leader pair should have large butterfly degrees $\chi(p_l)$ and $\chi(p_r)$, which do not easily violate the constraint of cross-group interactions.

OBSERVATION 2. The leader pair should have small query distances $\mathrm{dist}(p_l, Q)$ and $\mathrm{dist}(p_r, Q)$, so that the leaders are close to the query vertices and not easily deleted by graph removal.

We present the algorithm of leader pair identification in Algorithm 6. Here, we use the graph $X$ to represent a graph $L$/$R$ and find a leader vertex $p_l$/$p_r$; $\rho$ restricts the search for leaders to the $\rho$-hop neighbors of the query vertex $q$. We first initialize $p$ as the query vertex $q$, since it is the closest with distance 0 (line 1). This is especially effective when the input query vertex is leader-biased, i.e., has the largest butterfly degree. If its butterfly degree is large enough, i.e., greater than $b_{max}/2$, we return $q$ as the leader vertex (lines 2-5). Otherwise, we find the leader vertex in the following manner. We increase $d$ from 1 to $\rho$ (line 14), and decrease $b_p$ in $\{b_{max}/2, b_{max}/4, \ldots, b\}$ (lines 7-15). We get the set of vertices $S$ whose distance to the query $q$ is $d$ (line 10). Then, we identify one vertex $v \in S$ with $\chi(v) \ge b_p$ and return it as the leader vertex; otherwise we increase $d$ by 1 to search the next hop (lines 9-14). Note that we return the initial $p$ if no better answer is identified (line 16).

EXAMPLE 5. We apply Algorithm 6 on $G$ in Figure 3 for $Q = \{q_l, q_r\}$ and select $\rho = 3$. In the graph $G$, the non-zero butterfly degrees of vertices are $\chi(v_1) = \chi(v_3) = 6$ and $\chi(u_2) = \chi(u_3) = \chi(u_5) = \chi(u_6) = 3$. (1) For the subgraph $L$, it first initializes $p$ as the query vertex $q_l$ (line 1). Since $b_{max} = 6$ and $b_p = b_{max}/2 = 3$ (lines 2-3), and $\chi(q_l) = 0$ is less than $b_p = 3$, it goes to line 6. Then, it starts from $d = 1$ (line 8) to search $q_l$'s 1-hop neighbors, i.e., $S = \{v_1, v_2, v_3\}$ (line 10), finds that there exists a vertex $v = v_1$ such that $\chi(v) \ge b_p = 3$, and returns $v_1$ as the leader vertex (lines 11-12). (2) For the subgraph $R$, it follows a similar process where $b_{max} = 3$ and $b_p = b_{max}/2 = 1.5$, and returns $u_2$ as the leader vertex. Finally, we obtain $\{v_1, u_2\}$ as the leader pair.

Butterfly degree update for leader pair. Here, we consider how to efficiently update the butterfly degrees of the leader pair vertices, $\chi(p_l)$ and $\chi(p_r)$, after graph removal in Algorithm 1.

The algorithm for updating the butterfly degrees of the leader pair is outlined in Algorithm 7. First, we check the labels of $p$ and $v$ (line 1). If $\ell(p) = \ell(v)$, we find the common neighbors $N(p) \cap N(v)$ shared by $p$ and $v$, whose number is denoted as $\alpha$; the number of butterflies containing $p$ and $v$ is then $\binom{\alpha}{2}$, so $\chi(p)$ decreases by $\binom{\alpha}{2}$ (lines 2-3). If $\ell(p) \ne \ell(v)$, there are no butterflies to remove if $p$ and $v$ are not connected (line 5). Otherwise, we enumerate each vertex $u$ among $v$'s neighbors and check its common neighbors with $p$, i.e., $N(u) \cap N(p)$. Note that $\beta$ keeps the number of butterflies involving $v$, so we directly update $\chi(p)$ by subtracting $\beta$ (lines 6-8).

EXAMPLE 6. We apply Algorithm 7 on $G$ in Figure 3 to update the butterfly degree of the leader pair $\{v_1, u_2\}$. First, the deletion of vertex $u_9$ has no influence on the butterfly degree. The next vertex $u^*$ to delete is selected from $\{v_2, u_1, u_4, u_6, u_7\}$, which have the maximum query distance $\mathrm{dist}(u^*, Q) = 3$. To illustrate, we assume that $u_6$ is deleted. (1) For $u_2$, $\chi(u_2) = 3$ before the deletion. Since $\ell(u_2) = \ell(u_6)$ (line 1), their common neighbors are $N(u_2) \cap N(u_6) = \{v_1, v_3\}$ and $\alpha = |\{v_1, v_3\}| = 2$ (line 2). The updated butterfly degree is $\chi(u_2) = 3 - \binom{2}{2} = 3 - 1 = 2$ (line 3). (2) For $v_1$, $\chi(v_1) = 6$ before the deletion. Since $\ell(v_1) \ne \ell(u_6)$ and $u_6 \in N(v_1)$ (lines 4-5), we enumerate $u_6$'s neighbors except $v_1$, i.e., $\{v_3\}$ (line 6). Since $\beta = \sum_{u \in \{v_3\}} (|N(u) \cap N(v_1)| - 1) = |N(v_3) \cap N(v_1)| - 1 = |\{u_2, u_3, u_5, u_6\}| - 1 = 3$ (line 7), the updated butterfly degree is $\chi(v_1) = 6 - 3 = 3$ (line 8).

Complexity analysis. Next, we analyze the time and space complexity of leader pair identification and update in Algorithms 6 and 7. First, Algorithm 6 takes $O(|E| \log(b_{max} - b))$ time in $O(|E|)$ space, identifying the leader pair by a binary search of the approximate butterfly degree $b_p \in [b, b_{max}]$ within the query vertex's neighborhood. Next, we analyze the leader pair update in Algorithm 7.

THEOREM 5. One run of butterfly degree update in Algorithm 7 takes $O(d_{max}^2)$ time and $O(|E|)$ space, where $d_{max} = \max_{v \in V(B)} d_v$.

PROOF. Let the degrees of the leader vertex and the deleted vertex be $d_p$ and $d_v$. The time complexity of Algorithm 7 is $O(\min(d_p, d_v))$ if $\ell(p) = \ell(v)$, and $O(d_p d_v)$ if $\ell(p) \ne \ell(v)$. Assume that the maximum degree in the bipartite graph $B$ is $d_{max}$. Thus, the total complexity of
Algorithm 7 is $O(d_{max}^2)$. In addition, we analyze the space complexity. Algorithm 7 takes $O(|V| + |E|) = O(|E|)$ space to store the vertices and their incident edges in the graph. □

In summary, a successful leader pair identification by Algorithm 6 can significantly reduce the number of calls to the butterfly counting in Algorithm 3. In addition, it only needs to update the butterfly degree of the leader vertices using Algorithm 7, and not of the entire vertex set in the BCC candidate graph, which is a further improvement over butterfly counting. This butterfly computing strategy for leader pair identification and update is very efficient for BCC search, as validated in Exp-5 in Section 8.

6.3 Index-based Local Exploration
In this section, we develop a query processing algorithm which efficiently detects a small subgraph candidate around the query vertices $Q$, which tends to be densely connected both in its own label subgraph and in the bipartite graph.

First, we construct the BCindex for all vertices in $G$. The data structure of BCindex consists of two components: the coreness and the butterfly number. For the $k$-core index construction of each vertex, we apply the existing core decomposition [3] on graph $G$ to compute its coreness. The $k$-core has a nested property, i.e., the $k_1$-core of graph $G$ must be a subgraph of the $k_2$-core if $k_1 > k_2$. The offline $k$-core index can efficiently find $k$-core subgraphs from $G$ and reduce the size of the bipartite graph $B$. Moreover, we keep the butterfly degree index of each vertex on the bipartite graph with different labels using Algorithm 3.

Based on the obtained BCindex, we present our method of index-based local exploration, which is briefly presented in Algorithm 8. The algorithm starts from the query vertices and finds the shortest path connecting the two query vertices. A naive method of shortest path search is to find a path with the minimum number of edges, but this may produce a path along vertices of small coreness and small butterfly degree. To this end, we give a new definition of butterfly-core path weight as follows.

DEFINITION 6 (BUTTERFLY-CORE PATH WEIGHT). Given a path $P$ between two vertices $s$ and $t$ in $G$, the butterfly-core weight of path $P$ is defined as $\widehat{\mathrm{dist}}(s, t) = \mathrm{dist}_P(s, t) + \gamma_1 (c_{max} - \min_{v \in P} c(v)) + \gamma_2 (\chi_{max} - \min_{v \in P} \chi(v))$, where $\mathrm{dist}_P(s, t)$ is the length of $P$, $c(v)$ is the coreness of vertex $v$, $\chi(v)$ is the butterfly degree of $v$, and $c_{max}$ and $\chi_{max}$ are the maximum coreness and the maximum butterfly degree in $G$.

The values of $(c_{max} - \min_{v \in P} c(v))$ and $(\chi_{max} - \min_{v \in P} \chi(v))$ respectively measure the shortfall in the coreness and butterfly degree of vertices in path $P$ w.r.t. the corresponding maximum value in $G$. The smaller the shortfall of a path, the lower its distance. Here, $\gamma_1$ controls the extent to which a small vertex coreness is penalized, and $\gamma_2$ controls the extent to which a small butterfly degree is penalized. Using BCindex, for any vertex $v$ we can access the structural coreness $c(v)$ and the butterfly degree $\chi(v)$ in $O(1)$ time. Thus, the shortest path $P$ can be extracted based on the butterfly-core path weight definition.

Algorithm 8 Index-based Local Exploration
Require: $G = L \cup B \cup R$, $Q = \{q_l, q_r\}$, three integers $\{k_1, k_2, b\}$.
Ensure: A connected $(k_1, k_2, b)$-BCC $O$ with a small diameter.
 1: Compute a shortest path $P$ connecting $Q$ using the butterfly-core path weight;
 2: $k_l \leftarrow \min_{v \in V_L} c(v)$; $k_r \leftarrow \min_{v \in V_R} c(v)$;
 3: Iteratively expand $P$ into graph $G_t = \{v \in L \mid c(v) \ge k_l\} \cup \{v \in R \mid c(v) \ge k_r\}$ by adding adjacent vertices $v$, until $|V(G_t)| > \eta$;
 4: Compute a connected $(k_1, k_2, b)$-BCC containing $Q$ of $G_t$ with the largest coreness on each side;
 5: Remove disqualified subgraphs on $G_t$ to identify the final BCC;

We then expand the extracted path $P$ to a large graph $G_t$ as a candidate BCC by involving the local neighborhood of the query vertices. We start from the vertices in $P$, split the vertices by their labels into $V_L$ and $V_R$, and obtain the minimum coreness of the vertices on each side as $k_l = \min_{v \in V_L} c(v)$ and $k_r = \min_{v \in V_R} c(v)$ (line 2). Because the two groups have different densities, we expand $P$ with different core values. We then expand the path by iteratively inserting adjacent vertices with coreness no less than $k_l$ and $k_r$ respectively, in a BFS manner into $G_t$, until the vertex size exceeds a threshold $\eta$, i.e., $|V(G_t)| > \eta$, where $\eta$ is empirically tuned. After that, for each vertex $v \in V(G_t)$, we add all its adjacent edges into $G_t$ (line 3). Since $G_t$ is a local expansion, the coreness of $L$, $R$ will be at most $k_l$ and $k_r$. Based on $G_t$, we extract the connected $(k_1, k_2, b)$-BCC containing $Q$. If the input parameters $k_1$ and $k_2$ are not supplied, they are set automatically to the largest coreness on each side (line 4). Then it iteratively removes the farthest vertices from this BCC. Moreover, since $\mathrm{dist}_G(G, Q)$ is monotone non-decreasing with decreasing graphs, to reduce the number of iterations of butterfly-core maintenance, we propose to delete a batch of vertices with the same farthest query distance, i.e., $S = \{u^* \mid \mathrm{dist}_G(u^*, Q) \ge d\}$, which can further improve the search efficiency (line 5). Although Algorithm 8 does not preserve the 2-approximation guarantee of optimal answers, it quickly achieves results of good quality in practice, as validated in our experiments.

7 HANDLE MULTI-LABELED BCC SEARCH
In this section, we generalize the BCC model from 2 group labels to $m$ group labels where $m \ge 2$, and formulate the multi-labeled BCC-problem. Then, we discuss how to extend our existing techniques to handle a BCC search query with multiple labels.

Multi-labeled BCC search. First, we extend the cross-group interaction to a new definition of cross-group connectivity. According to the 4th constraint of Def. 4, two groups labeled with $A_l$ and $A_r$ have cross-group interaction iff $\exists v_l \in V_L$ and $\exists v_r \in V_R$ such that the butterfly degrees satisfy $\chi(v_l) \ge b$ and $\chi(v_r) \ge b$ in the induced subgraph of $H$ by $V_L \cup V_R$. We say that two labels $A_l$ and $A_r$ have cross-group interaction, denoted as $(A_l, A_r)$.

DEFINITION 7 (CROSS-GROUP CONNECTIVITY). Two labels $A_l$ and $A_r$ have cross-group connectivity if and only if there exists a cross-group path $P = \{A_1, \ldots, A_j\}$ where $A_1 = A_l$, $A_j = A_r$, and $(A_i, A_{i+1})$ has cross-group interaction for any $1 \le i < j$.

To model group connection in a multi-labeled BCC, it requires that for any two labels $A_l$ and $A_r$, there exists either a direct cross-group interaction or a cross-group path between the two group core structures, implying a strong connection between two different groups within a BCC community. Specifically, we extend the definition of the multi-labeled BCC (mBCC) model as follows.

DEFINITION 8 (MULTI-LABELED BUTTERFLY-CORE COMMUNITY). Given a labeled graph $G$, an integer $m \ge 2$, group core
parameters $\{k_i : 1 \le i \le m\}$, a multi-labeled butterfly-core community (mBCC) $H \subseteq G$ satisfies the following conditions:
 1. Multiple labels: there exist $m$ different labels $A_1, A_2, \ldots, A_m \in \mathcal{A}$ such that $\forall 1 \le i \le m$, $V_i = \{v \in V(H) : \ell(v) = A_i\}$ and $V_i \cap V_j = \emptyset$ for $i \ne j$, and $V_1 \cup V_2 \cup \ldots \cup V_m = V_H$;
 2. Core groups: the induced subgraph of $H$ by $V_i$ is a $k_i$-core where $1 \le i \le m$;
 3. Cross-group connectivity: for $1 \le i, j \le m$, any two labels $A_i$ and $A_j$ have cross-group connectivity in $H$.

For $m = 2$, this definition is equivalent to our butterfly-core community model in Def. 4. From conditions (1) and (2), the mBCC has exactly $m$ labeled groups and each group is a $k_i$-core. Condition (3) requires that these groups are cross-group connected by the bipartite interactions. To ensure the cross-group connectivity, each group should have one leader vertex whose butterfly degree is no less than $b$. Note that the accumulated number of butterflies from multiple bipartite graphs does not all count into the butterfly degree. Specifically, we only count the butterfly degree within the bipartite graph between two labeled groups.

mBCC-search problem. Based on the multi-labeled BCC community model, the corresponding mBCC-search problem can be defined similarly to Problem 1 as follows. Consider a query $Q = \{q_1, \ldots, q_m\}$ where each query vertex $q_i$ has a distinct label $A_i$, the group core parameters $\{k_i : 1 \le i \le m\}$, and the butterfly degree parameter $b$; the mBCC-search problem is to find a connected mBCC containing $Q$ with the smallest diameter.

Algorithm. We propose a multi-labeled BCC search framework in Algorithm 9, which leverages our previous techniques of the search framework in Algorithm 1 and the fast strategies in Section 6. The algorithm first finds a maximal connected mBCC containing all query vertices $Q$ as $G_0$ (line 1) and then iteratively removes the farthest vertex $u^*$ from the graph $G_l$ (lines 3-7). The mechanism is to maintain each labeled group as a $k_i$-core (line 5) and update the leader pairs (line 6). Based on the identified leader pairs and cross-group interactions, the algorithm can check the cross-group connectivity for the $m$ labeled groups in $H$ by Def. 7 as follows. We first construct a new graph $\mathcal{H}$ with $m$ isolated vertices, where each vertex represents a labeled group. Then, we insert into $\mathcal{H}$ an edge between two vertices if the two labeled groups have a cross-group interaction. Finally, the cross-group connectivity of $H$ is equivalent to the graph connectivity of $\mathcal{H}$. In addition, our fast query distance computation in Algorithm 5 and the local search strategies in Algorithms 6 and 7 can also be easily extended to improve the efficiency.

Algorithm 9 Multi-labeled BCC Search Framework $(G, Q)$
Require: $G = (V, E, \ell)$, multi-labeled query $Q = \{q_1, \ldots, q_m\}$, group core parameters $\{k_i : 1 \le i \le m\}$, butterfly degree parameter $b$.
Ensure: A connected mBCC $O$ with a small diameter.
 1: Find a maximal connected subgraph of multi-labeled BCC containing $Q$ as $G_0$ by Def. 8; //using Alg. 2
 2: $l \leftarrow 0$;
 3: while $\mathrm{connect}_{G_l}(Q) = \mathbf{true}$ do
 4:   Delete vertex $u^* \in V(G_l)$ with the largest $\mathrm{dist}_{G_l}(u^*, Q)$ from $G_l$; //using Alg. 5
 5:   Maintain each labeled group $V_i$ as a $k_i$-core, $\forall i \in \{1, 2, \ldots, m\}$;
 6:   Update the leader pairs and check the cross-group connectivity among labeled groups by Def. 7; //Alg. 3 & 4, optimized by Alg. 6 & 7
 7:   $G_{l+1} \leftarrow G_l$; $l \leftarrow l + 1$;
 8: $O \leftarrow \arg\min_{G' \in \{G_0, \ldots, G_{l-1}\}} \mathrm{dist}_{G'}(G', Q)$;

Complexity analysis. We analyze the complexity of Algorithm 9. At each iteration of Algorithm 9, the shortest path computation takes $O(|Q| \cdot (|V| + |E|)) = O(m|E|)$ time, due to $|V| - 1 \le |E|$. Moreover, it takes $O(|V| + |E|)$ time to find and maintain all $k_i$-cores for different labels. To identify the leader pairs, it runs the butterfly counting over the whole graph once in $O(\sum_{u \in V} d_u^2)$ time. The extra cost of checking cross-group connectivity is $O(m^2) \subseteq O(m|E|)$ time, due to the queries $Q \subseteq V$ and $|V| - 1 \le |E|$. In fact, checking cross-group connectivity can be further optimized to $O(m)$ time using the union-find algorithm in any conceivable application [9]. Thus, for a query $Q$ with $m$ differently labeled vertices, Algorithm 9 takes $O(t(m|E| + \sum_{u \in V} d_u^2))$ time for $t$ iterations and $O(|E|)$ space.

8 EXPERIMENTS
In this section, we conduct experiments to evaluate our proposed model and algorithms.

Datasets. We use seven real datasets as shown in Table 3. Two new real-world datasets of labeled graphs are collected from Baidu, a high-tech company in China. They are IT professional networks with ground-truth communities of joint projects between two department teams, denoted as Baidu-1 and Baidu-2. In both graphs, each vertex represents an employee and the label represents the working department. An edge exists between two employees if they have communicated through the company's internal instant messaging software. Baidu-1 and Baidu-2 are generated based on data logs for three months and one whole year, respectively. In addition, we use five graphs with ground-truth communities from SNAP, namely Amazon, DBLP, Youtube, LiveJournal and Orkut, to which we add synthetic vertex labels. Specifically, we split the vertices based on communities into two parts and assign all vertices in each part one label. We also generate the query pairs by picking any two vertices with different labels. To add cross edges within communities, we randomly assign vertices 10% cross edges to simulate the collaboration behaviors between two communities. Moreover, we add 10% noise data of cross edges globally on the whole graph.

Table 3: Network statistics (K = $10^3$ and M = $10^6$).

 Network | $|V|$ | $|E|$ | #labels | $k_{max}$ | $d_{max}$
 Baidu-1 | 30K | 508K | 383 | 43 | 12
 Baidu-2 | 41K | 2M | 346 | 189 | 13
 Amazon | 335K | 926K | 2 | 6 | 549
 DBLP | 317K | 1M | 2 | 113 | 342
 Youtube | 1.1M | 3M | 2 | 51 | 28,754
 LiveJournal | 4M | 35M | 2 | 360 | 14,815
 Orkut | 3.1M | 117M | 2 | 253 | 33,313

Compared methods. We compare our three BCC search approaches with two community search competitors as follows.
 • CTC: finds a closest $k$-truss community containing a set of query vertices [20].
 • PSA: progressively finds a minimum $k$-core containing a set of query vertices [23].
 • Online-BCC: our online BCC search in Algorithm 1.