A Query-Driven System for Discovering Interesting Subgraphs in Social Media

Subhasis Dasgupta · Amarnath Gupta

arXiv:2102.09120v1 [cs.SI] 18 Feb 2021


Abstract  Social media data are often modeled as heterogeneous graphs with multiple types of nodes and edges. We present a discovery algorithm that first chooses a "background" graph based on a user's analytical interest and then automatically discovers subgraphs that are structurally and content-wise distinctly different from the background graph. The technique combines the notion of a group-by operation on a graph with the notion of subjective interestingness, resulting in an automated discovery of interesting subgraphs. Our experiments on a socio-political database show the effectiveness of our technique.

Keywords  social network · interesting subgraph discovery · subjective interestingness

1 Introduction

An information system designed for the analysis of social media must consider a common set of properties that characterize all social media data.

– Information elements in social media are essentially heterogeneous in nature – users, posts, images, external URL references, although related, all bear different kinds of information.
– Most social information is temporal – a timestamp is associated with user events like the creation of or response to a post, as well as system events like user account creation, deactivation and deletion. The system should therefore allow both temporal and time-agnostic analyses.
– Information in social media evolves fast. In one study (Zhu et al., 2013), it was shown that the number of users of a social medium is a power function of time. More recently, (Antonakaki et al., 2018) showed that Twitter's growth is supralinear and follows Leskovec's model of graph evolution (Leskovec et al., 2007). Therefore, an analyst may first have to perform exploration tasks on the data before figuring out their analysis plan.
– Social media has significant textual content, sometimes with specific entity markers (e.g., mentions) and topic markers (e.g., hashtags). Therefore, any information element derived from text (e.g., named entities, topics, sentiment scores) may also be used for analysis. To be practically useful, the system must accommodate semantic synonyms – #KamalaHarris, @KamalaHarris and "Kamala Harris" refer to the same entity.
– Relationships between information items in social media data must capture not only canonical relationships like (tweet-15 mentions user-392) but also a wide variety of computed relationships over base entities (users, posts, ...) and text-derived information (e.g., named entities).

It is also imperative that such an information system support three styles of analysis tasks.

1. Search, where the user specifies content predicates without specifying the structure of the data. For example, seeking the number of tweets related to Kamala Harris should count tweets where she is the author, as well as tweets where any synonym of "Kamala Harris" is in the tweet text.
2. Query, where the user specifies query conditions based on the structure of the data. For example, tweets with a create date between 9 and 9:30 am on January 6th, 2021, with text containing the string "Pence", and that were favorited at least 100 times during the same time period.
3. Discovery, where the user may or may not know the exact predicates on the data items to be retrieved, but can specify analytical operations (together with some post-filters) whose results will provide insights into the data. For example, we call a query like "Perform community detection on all tweets on January 6, 2021 and return the users from the largest community" a discovery query.

In general, a real-life analytics workload will freely combine these modalities as part of a user's information exploration process.

In this paper, we present a general-purpose graph-based model for social media data and a subgraph discovery algorithm atop this data model. Physically, the data model is implemented on AWESOME (Dasgupta et al., 2016), an analytical platform designed to enable large-scale social media analytics over continuously acquired data from social media APIs. The platform, developed as a polystore system, natively supports relational, graph and document data, and hence enables a user to perform complex analyses that include arbitrary combinations of search, query and discovery operations. We use the term query-driven discovery to reflect the scenario where the user does not want to run the discovery algorithm on a large and continuously collected body of data; rather, the user knows a starting point that can be specified as an expressive query (illustrated later), and puts bounds on the discovery process so that it terminates within an acceptable time limit.

Contributions. This paper makes the following contributions. (a) It offers a new formulation for the subgraph interestingness problem for social media; (b) based on this formulation, it presents a discovery algorithm for social media; (c) it demonstrates the efficacy of the algorithm on multiple data sets.

Organization of the paper. The rest of the paper is organized as follows. Section 2 describes the related research on interesting subgraph finding as investigated by researchers in Knowledge Discovery, Information Management, as well as Social Network Mining. Section 3 presents the abstract data model over which the discovery algorithm operates and the basic definitions that establish the domain of discourse for the discovery process. Section 4 presents our method of generating candidate subgraphs that will be tested for interestingness. Section 5 first presents our interestingness metrics and then the testing process based on these metrics. Section 6 describes the experimental validation of our approach on multiple data sets. Section 7 presents concluding discussions.
2 Related Work

The problem of finding "interesting" information in a data set is not new. (Silberschatz and Tuzhilin, 1996) described that an "interestingness measure" can be "objective" or "subjective". A measure is "objective" when it is computed solely based on the properties of the data. In contrast, a "subjective" measure must take into account the user's perspective. They propose that (a) a pattern is interesting if it is "surprising" to the user (unexpectedness) and (b) a pattern is interesting if the user can act on it to his advantage (actionability). Of these criteria, actionability is hard to determine algorithmically; unexpectedness, on the other hand, can be viewed as a departure from the user's beliefs. For example, a user may believe that the 24-hour occurrence patterns of all hashtags are nearly identical. In this case, a discovery would be to find a set of hashtags and sample dates for which this belief is violated. Following (Geng and Hamilton, 2006), there are three possibilities regarding how a system is informed of a user's knowledge and beliefs: (a) the user provides a formal specification of his or her knowledge, and after obtaining the mining results, the system chooses which unexpected patterns to present to the user (Bing Liu et al., 1999); (b) according to the user's interactive feedback, the system removes uninteresting patterns (Sahar, 1999); and (c) the system applies the user's specifications as constraints during the mining process to narrow down the search space and provide fewer results. Our work roughly corresponds to the third strategy. Early research on finding interesting subgraphs focused primarily on finding interesting substructures, in two directions: (a) finding frequently occurring subgraphs in a collection of graphs (e.g., chemical structures) (Kuramochi and Karypis, 2001; Yan and Han, 2003; Thoma et al., 2010), and (b) finding regions of a large graph that have high edge density compared to other regions in the graph (Lee et al., 2010; Sariyuce et al., 2015; Epasto et al., 2015; Wen et al., 2017). Note that while a dense region in the graph can definitely be interesting, the inverse situation, where a sparsely connected region is surrounded by an otherwise dense periphery, can be equally interesting for an application.

We illustrate the situation in Figure 1. The primary data set is a collection of tweets on COVID-19 vaccination, but this specific graph shows a sparse core on Indian politics that is loosely connected to nodes of an otherwise dense periphery on the primary topic. Standard network features like the hashtag histogram and the node degree histograms do not reveal this substructure, requiring us to explore new methods of discovery.
Fig. 1 An example of an interesting subgraph. Tweets on Indian politics form a sparse center in a dense periphery (separately
shown in the top right). The bottom two figures show the hashtag distribution of the center and the degree distribution of the
center (log scale) respectively.

A serious limitation of the above class of work is that the interestingness criteria do not take into account node content (resp. edge content), which may be present in a property graph data model (Angles, 2018) where nodes and edges have attributes. (Bendimerad, 2019) presents several discovery algorithms for graphs with vertex attributes and edge attributes. They perform both structure-based graph clustering and subspace clustering of attributes to identify interesting (in their domain, "anomalous") subgraphs.

On the "subjective" side of the interestingness problem, one approach considers interesting subgraphs as a subgraph matching problem (Shan et al., 2019). Their general idea is to compute all matching subgraphs that satisfy a user query and then rank the results based on the rarity and the likelihood of the associations among entities in the subgraphs. In contrast, (Adriaens et al., 2019) uses the notion of "subjective interestingness", which roughly corresponds to finding subgraphs whose connectivity properties (e.g., the average degree of the vertices) are distinctly different from an "expected" background graph. This approach uses a constrained optimization problem that maximizes an objective function over the information content (IC) and the description length (DL) of the desired subgraph pattern.

Our work is conceptually most inspired by (Bendimerad et al., 2019), which explores the subjective interestingness problem for attributed graphs. Their main contribution centers around CSEA (Cohesive Subgraph with Exceptional Attributes) patterns that inform the user that a given set of attributes has exceptional values throughout a set of vertices in the graph. The subjective interestingness is given by

    S(U, S) = IC(U, S) / DL(U)
where U is a subset of nodes and S is a set of restrictions on the value domains of the attributes. The system models the prior beliefs of the user as the Maximum Entropy distribution subject to any stated prior beliefs the user may hold about the data (e.g., the distribution of an attribute value). The information content IC(U, S) of a CSEA pattern (U, S) is formalized as the negative of the logarithm of the probability that the pattern is present under the background distribution. The description length of U is derived from the intersection of all neighborhoods in a subset X ⊆ N(U), along with the set of "exceptions" – vertices that are in the intersection but not part of U. However, we have a completely different, more database-centric formulation of the background and the user's beliefs.

3 The Problem Setup

3.1 Data Model

Our abstract model of social media data takes the form of a heterogeneous information network (an information network with multiple types of nodes and edges), which we view as a temporal property graph G. Let N be the node set and E be the edge set of G. N can be viewed as a disjoint union of different subsets (called node types) – users U, posts P, topic markers (e.g., hashtags) H, term vocabulary V (the set of all terms appearing in a corpus), and references (e.g., URLs) R(τ), where τ represents the type of resource (e.g., image, video, web site, ...). Each type of node has a different set of properties (attributes) Ā(.). We denote the attributes of U as Ā(U) = a1(U), a2(U), ... such that ai(U) is the i-th attribute of U. An attribute of a node type may be temporal – a post p ∈ P may have a temporal attribute called creationDate. Edges in this network can be directional and have a single edge type. The following is a set of base (but not exhaustive) edge types:

– writes: U → P
– uses: P → H
– mentions: P → U maps a post p to a user u if u is mentioned in p
– repostOf: P → P maps a post p2 to a post p1 if p2 is a repost of p1. This implies that ts(p1) < ts(p2), where ts is the timestamp attribute
– replyTo/comment: P → P maps a post p2 to a post p1 if p2 is a reply to p1. This implies that ts(p1) < ts(p2), where ts is the timestamp attribute
– contains: P → V × ℕ, where ℕ is the set of natural numbers and represents the count of a token v ∈ V in a post p ∈ P

We realistically assume that the inverse of these mappings can be computed, i.e., if v1, v2 ∈ V are terms, we can perform a contains⁻¹(v1 ∧ ¬v2) operation to yield all posts that used v1 but not v2.

The AWESOME information system allows users to construct derived or computed edge types depending on the specific discovery problem they want to solve. For example, they can construct a standard hashtag co-occurrence graph using a non-recursive aggregation rule in Datalog (Consens and Mendelzon, 1993):

    HC(h1, h2, count(p)) ← p ∈ P, h1 ∈ H, h2 ∈ H, uses(p, h1), uses(p, h2)

We interpret HC(h1, h2, count(p)) as an edge between nodes h1 and h2, with count(p) as an attribute of the edge. In our model, a computed edge has the form ET(Nb, Ne, B̄), where ET is the edge type, Nb, Ne are the tail and head nodes of the edge, and B̄ designates a flat schema of edge properties. The number of such computed edges can be arbitrarily large and complex for different subgraph discovery problems. A more complex computed edge may look like UMUHD(u1, u2, mCount, d, h), where an edge from u1 to u2 is constructed if user u1 creates a post that contains hashtag h and mentions user u2 a total of mCount times on day d. Note that in this case, hashtag h is a property of edge type UMUHD and is not a topic marker node of graph G.
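The HC rule above can be mirrored with a few lines of ordinary code. The sketch below assumes the `uses` relation is available as a mapping from each post to its hashtags; the function name and data layout are illustrative and not part of the AWESOME system.

    from collections import Counter
    from itertools import combinations

    def hashtag_cooccurrence(post_hashtags):
        # post_hashtags: mapping post id -> iterable of hashtags (the `uses` relation).
        # Returns {(h1, h2): count}, i.e., the weighted edges HC(h1, h2, count(p)).
        edges = Counter()
        for hashtags in post_hashtags.values():
            for h1, h2 in combinations(sorted(set(hashtags)), 2):
                edges[(h1, h2)] += 1
        return edges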
3.2 Specifying User Knowledge

Since the discovery process is query-driven, the system has no a priori information about the user's interests, prior knowledge and expectations, if any, for a specific discovery task. Hence, given a data corpus, the user needs to provide this information through a set of parameterized queries. We call these queries "heterogeneous" because they can place conditions on any arbitrary node (resp. edge) properties, network structure and text properties of the graph.

User Interest Specification. The discovery process starts with the specification of the user's universe of discourse, identified with a query Q0. We provide some illustrative examples of user interest with queries of increasing complexity.

Example 1. "All tweets related to COVID-19 between 06/01/2020 and 07/31/2020 that refer to Hydroxychloroquine". In this specification, the condition "related to COVID-19" amounts to finding tweets containing any k terms from a user-provided list, and the condition on "Hydroxychloroquine" is expressed as a fuzzy search.
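One way such a specification could be evaluated is sketched below; the record layout and the spelling-variant list are assumptions used for illustration, and the variant matching is only a crude stand-in for the fuzzy search used by the actual system.

    from datetime import date

    def q0_example_1(tweets, covid_terms, k=2):
        # tweets: iterable of dicts with 'created_at' (a datetime) and 'text' fields.
        # Keep tweets in the date window that contain at least k COVID-19 terms and
        # any spelling variant of "hydroxychloroquine".
        variants = ("hydroxychloroquine", "hydroxychloroquin", "hcq")
        start, end = date(2020, 6, 1), date(2020, 7, 31)
        selected = []
        for t in tweets:
            text = t["text"].lower()
            if not (start <= t["created_at"].date() <= end):
                continue
            if sum(term.lower() in text for term in covid_terms) < k:
                continue
            if any(v in text for v in variants):
                selected.append(t)
        return selected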
Fig. 2 Hydroxychloroquine and COVID related hashtags

Figure 2 shows the top hashtags related to this search – note that the hashtag co-occurrence graph around COVID-19 and hydroxychloroquine includes "FakeNewsMedia".

Example 2. "All tweets from users who mention Trump in their user profile and Fauci in at least n of their tweets". Notice that this query is about users with a certain behavioral pattern – it captures all tweets from users who have a combination of specific profile features and tweet content.

Example 3. "All tweets from users U1 whose tweets appear in the 3-neighborhood of #ados in the hashtag co-occurrence graph, together with all tweets of users U2 who belong to the user-mention-user networks of these users (i.e., U1)", where #ados refers to "American Descendant of Slaves", which represents an African American cause.

The end result of Q0 is a collection of posts that we call the initial post set P0. Using the posts in P0, the system creates a background graph as follows.

Initial Background Graph. The initial background graph G0 is the graph derived from P0 over which the discovery process runs. However, to define the initial graph, we first develop the notion of a conversation.

Definition 1 (Semantic Neighborhood.) N(p), the semantic neighborhood of a post p, is the graph connecting p to the instances of U ∪ H ∪ P ∪ V that directly relate to p.

Definition 2 (Conversation Context.) The conversation context C(p) of post p is a subgraph satisfying the following conditions:
1. P1: The set of posts reachable to/from p along the relationships repostOf and replyTo belongs to C(p).
2. P2: The union of the semantic neighborhoods of the posts in P1 belongs to C(p).
3. E: The subgraph induced by P1 ∪ P2 belongs to C(p).
4. Nothing else belongs to C(p).

Clearly, we can assert that C(p) is a connected graph and that N(p) ⊑g C(p), where ⊑g denotes a subgraph relationship.

Definition 3 (Initial Background Graph.) The initial background graph G0 is the merger of all conversation contexts C(pi), pi ∈ P0, together with all computed edges induced by the nodes of ∪i C(pi).
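Definitions 1–3 translate directly into a graph traversal. The sketch below, written against NetworkX, assumes the heterogeneous graph stores the edge type in a 'type' attribute; the attribute name and the function names are illustrative rather than part of the system.

    import networkx as nx

    def conversation_context(G, p):
        # Definition 2: C(p). G is a directed (multi)graph whose edges carry a 'type'
        # attribute such as 'repostOf', 'replyTo', 'writes', 'uses', 'mentions'.
        thread = nx.Graph()
        thread.add_nodes_from(G.nodes)
        thread.add_edges_from((u, v) for u, v, d in G.edges(data=True)
                              if d.get("type") in ("repostOf", "replyTo"))
        P1 = nx.node_connected_component(thread, p)   # posts reachable to/from p
        P2 = set()
        for q in P1:                                  # semantic neighborhoods of P1
            P2.update(G.predecessors(q))
            P2.update(G.successors(q))
        return G.subgraph(P1 | P2).copy()             # induced subgraph, nothing else

    def initial_background_graph(G, P0):
        # Definition 3: merge the conversation contexts of every post in P0.
        nodes = set()
        for p in P0:
            nodes |= set(conversation_context(G, p).nodes)
        return G.subgraph(nodes).copy()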
The initial background graph itself can be a gateway to finding interesting properties of the graph. To illustrate this with the graph obtained from our Example 2, Figure 3 presents two views of the #ados cluster of hashtags from January 2021. The left chart shows the time vs. count of the hashtags, while the right chart shows the dominant hashtags of the same period in this cluster. The strong peak in the timeline was due to an intense discussion, revealed by topic modeling, on the creation of an office on African American issues. The occurrence of this peak is interesting because most of the social media conversation in this time period was focused on the Capitol attack of January 6.

Given the G0 graph, we discover subgraphs Si ⊂ G0 whose content and structure are distinctly different from those of G0. However, unlike previous approaches, we apply a generate-and-test paradigm for discovery. The generate-step (Section 4) uses a graph-cube-like technique (Zhao et al., 2011) to generate candidate subgraphs that might be interesting, and the test-step (Section 5.2) computes whether (a) a candidate is sufficiently distinct from G0, and (b) the candidates in the collection are sufficiently distinct from each other.

Subgraph Interestingness. For a subgraph Si to be considered as a candidate, it must satisfy the following conditions.

C1. Si must be connected and should satisfy a size threshold θn, the minimal number of nodes.

C2. Let Aij (resp. Bik) be the set of local properties of node j (resp. edge k) of subgraph Si. A property is called "local" if it is not a network property like vertex degree. All nodes (resp. edges) of Si must satisfy some user-specified predicate φN (resp. φE) specified over Aij (resp. Bik). For example, a node predicate might require that all "post" nodes in the subgraph have a re-post count of at least 300, while an edge predicate may require that all hashtag co-occurrence relationships have a weight of at least 10. A user-defined constraint on the candidate subgraph improves the interpretability of the result. Typical subjective interestingness techniques (van Leeuwen et al., 2016; Adriaens et al., 2019) use only structural features of the network and do not consider attribute-based constraints, which limits their pragmatic utility.
Fig. 3 Two views of the #ados cluster of hashtags

C3. For each text-valued attribute a of Aij, let C(a) be the collection of the values of a over all nodes of Si, and let D(C(a)) be a textual diversity metric computed over C(a). For Si to be interesting, it must have at least one attribute a such that D(C(a)) does not have the usual power-law distribution expected in social networks. Zheng et al. (Zheng and Gupta, 2019) used vocabulary diversity and topic diversity as textual diversity measures.

4 Candidate Subgraph Generation

Section 3.2 describes the creation of the initial background graph G0 that serves as the domain of discourse for discovery. Depending on the number of initial posts P0 resulting from the initial query, the size of G0 might be too large – in this case the user can specify followup queries on G0 to narrow down the scope of discovery. We call this narrowed-down graph of interest G′0 – if no followup queries were used, G′0 = G0. The next step is to generate candidate subgraphs that will be tested for interestingness.

Node Grouping. A node group is a subset of the nodes of G′0 in which all nodes share some similar property. We generalize the groupby operation, commonly used in relational database systems, to heterogeneous information networks. To describe the generalization, let us assume R(A, B, C, D, ...) is a relation (table) with attributes A, B, C, D, ... A groupby operation takes as input (a) a subset of grouping attributes (e.g., A, B), (b) a grouped attribute (e.g., C) and (c) an aggregation function (e.g., count). The operation first computes each distinct cross-product value of the grouping attributes (in our example, A × B), creates a list of all values of the grouped attribute corresponding to each distinct value of the grouping attributes, and then applies the aggregation function to the list. Thus, the result of the groupby operation is a single aggregated value for each distinct cross-product value of the grouping attributes.

To apply this operation to a social network graph, we recognize that there are two distinct ways of defining the "grouping-object".
(1) Node properties can be used directly, just like in the relational case. For example, for tweets a grouping condition might be getDate(Tweet.created_at) ∧ bin(Tweet.favoriteCount, 100), where the getDate function extracts the date of a tweet and the bin function creates buckets of size 100 from the favorite count of each tweet.
(2) The grouping-object is a subgraph pattern. For example, the subgraph pattern

    (:tweet{date})-[:uses]->(:hashtag{text})    (P1)

states that all "tweet" nodes having the same posting date, together with every distinct hashtag text, will be placed in a separate group. Notice that while (1) produces disjoint groups of tweets, (2) produces a "soft" partitioning of the tweets and hashtags due to the many-to-many relationship between tweets and hashtags.
In either case, the result is a set of node groups, designated here as Ni. The grouping pattern P1 is written in a Cypher-like syntax (Francis et al., 2018), as implemented in the Neo4j graph data management system. Because of the many-to-many relationship between tweets and hashtags, this "fuzzy" partitioning means that the same tweet node can belong to two different groups because it has multiple hashtags, and a hashtag node can belong to multiple groups because tweets from different dates may have used the same hashtag. While the grouping condition specification language can express more complex grouping conditions, in this paper we use simpler cases to highlight the efficacy of the discovery algorithm. We denote the node set in each group as Ni.
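To make the two grouping styles concrete, the following sketch groups a list of tweets first by (date, favorite-count bucket) and then by (date, hashtag); the field names are assumptions about the tweet records, not the system's actual schema.

    from collections import defaultdict

    def group_nodes(tweets):
        # tweets: iterable of dicts with 'id', 'created_at' (a datetime),
        # 'favorite_count' and 'hashtags' fields (an assumed record layout).
        # (1) attribute grouping: (posting date, favorite-count bucket of width 100);
        # (2) pattern grouping P1: (posting date, hashtag) -- a tweet with k hashtags
        #     falls into k groups, hence the "soft" partitioning.
        by_attributes, by_pattern = defaultdict(set), defaultdict(set)
        for t in tweets:
            day = t["created_at"].date()
            by_attributes[(day, t["favorite_count"] // 100)].add(t["id"])
            for h in t["hashtags"]:
                by_pattern[(day, h.lower())].add(t["id"])
        return by_attributes, by_pattern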
Graph Construction. To complete the groupby operation, we also need to specify the aggregation function in addition to the grouping-object and the grouped-object. This function takes the form of a graph construction operation that constructs a subgraph Si by expanding on the node set Ni. Different expansion rules can be specified, leading to the formation of different graphs. Here we list three rules that we have found fairly useful in practice.

G1. Identify all the tweet nodes in Ni. Construct a relaxed induced subgraph of the tweet-labeled nodes in Ni. The subgraph is induced because it only uses tweets contained within Ni, and it is relaxed because it contains all nodes directly associated with these tweet nodes, such as authors, hashtags, URLs, and mentioned users (a sketch follows G3 below).

G2. Construct a mention network from within the tweet nodes in Ni – the mention network initially connects all tweet- and user-labeled nodes. Extend the network by including all nodes directly associated with these tweet nodes.

G3. A third construction relaxes the grouping constraint. We first compute either G1 or G2, and then extend the graph by including the first-order neighborhood of mentioned users or hashtags. While this clearly breaks the initial group boundaries, a network thus constructed includes tweets of similar themes (through hashtags) or audience (through mentions).
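As an illustration of rule G1 (the other rules follow the same pattern), a minimal NetworkX sketch is shown below; the 'label' attribute and the 'tweet' label are assumptions about how node types are stored.

    import networkx as nx

    def relaxed_induced_subgraph(G, group_nodes):
        # Rule G1: keep the tweet nodes of a group Ni plus every node directly
        # attached to them (author, hashtags, URLs, mentioned users).
        tweets = {n for n in group_nodes if G.nodes[n].get("label") == "tweet"}
        keep = set(tweets)
        for t in tweets:
            keep.update(G.predecessors(t))   # e.g., the author via a 'writes' edge
            keep.update(G.successors(t))     # e.g., hashtags, URLs, mentioned users
        return G.subgraph(keep).copy()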
Automated Group Generation. In a practical setting, as shown in Section 6, the parameters for the node grouping operation can be specified by a user, or they can be generated automatically. Automatic generation of grouping-objects is based on the considerations described below. To keep the autogeneration manageable, we consider only single objects and pairs of objects for attribute grouping, and only a single edge for subgraph patterns.

– Since temporal shifts in social media themes and structure are almost always of interest, the posting timestamp is always a grouping variable. For our purposes, we set the granularity to a day by default, although a user can change it.
– The frequency of most nontemporal attributes (like hashtags) follows a mixture of a double-Pareto lognormal distribution and a power law (Bhattacharya et al., 2020), so we adopt the following strategy.
  ◦ Let f(A) be the frequency distribution of attribute A, and κ(f(A)) be the curvature of f(A). If A is a discrete variable, we find a∗, the maximum-curvature (elbow) point of f(A), numerically (Antunes et al., 2018); a simple numerical sketch follows this list.
  ◦ We compute A0, the values of attribute A to the left of a∗, for all attributes and choose the attribute for which the cardinality of A0 is maximum. In other words, we choose attributes that have the highest number of pre-elbow values.
– We adopt a similar strategy for subgraph patterns. If T1(ai) –L→ T2(bj) is an edge where T1, T2 are node labels, ai, bj are node properties and L is an edge label, then ai and bj will be selected based on the conditions above. Since the number of edge labels is fairly small in our social media data, we evaluate the estimated cardinality of the edge for all such triples and select the one with the lowest cardinality.
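A minimal way to locate such an elbow is the distance-to-chord heuristic sketched here (a stand-in rather than the exact estimator cited above); the helper names are illustrative.

    import numpy as np

    def elbow_index(freqs):
        # Sort frequencies in descending order and return the index of the point
        # farthest from the straight line joining the first and last points.
        y = np.sort(np.asarray(freqs, dtype=float))[::-1]
        if len(y) < 3:
            return 0
        x = np.arange(len(y), dtype=float)
        chord = np.array([x[-1] - x[0], y[-1] - y[0]])
        norm = np.linalg.norm(chord)
        if np.isclose(norm, 0.0):
            return 0
        chord /= norm
        vecs = np.column_stack([x - x[0], y - y[0]])
        proj = np.outer(vecs @ chord, chord)
        dist = np.linalg.norm(vecs - proj, axis=1)
        return int(np.argmax(dist))

    def pick_grouping_attribute(freqs_by_attribute):
        # Choose the attribute with the most pre-elbow values, per the strategy above.
        return max(freqs_by_attribute, key=lambda a: elbow_index(freqs_by_attribute[a]))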
5 The Discovery Process

5.1 Measures of Relative Interestingness

We compute the interestingness of a subgraph S in reference to a background graph Gb (e.g., G0); it consists of a structural as well as a content component. We first discuss the structural component. To compare a subgraph Si with the background graph, we first compute a set of network properties Pj (see below) for nodes (or edges) and then compute the frequency distribution f(Pj(Si)) of these properties over all nodes (resp. edges) of (a) the subgraph Si and (b) the reference graph (e.g., G0). A distance between f(Pj(Si)) and f(Pj(Gb)) is computed using the Jensen–Shannon divergence (JSD). In the following, we use ∆(f1, f2) to refer to the JS-divergence of distributions f1 and f2.

Eigenvector Centrality Disparity: The testing process starts by identifying the distributions of nodes with high node centrality in the networks. While there is no shortage of centrality measures in the literature, we choose eigenvector centrality (Das et al., 2018), defined below, to represent the dominant nodes. Let A = (ai,j) be the adjacency matrix of a graph. The eigenvector centrality xi of node i is given by

    xi = (1/λ) Σk ak,i xk

where λ ≠ 0 is a constant. The rationale for this choice follows from earlier studies (Bonacich, 2007; Ruhnau, 2000; Yan et al., 2014), which establish that since eigenvector centrality can be seen as a weighted sum of direct and indirect connections, it represents the true structure of the network more faithfully than other centrality measures.
Further, (Ruhnau, 2000) proved that the eigenvector centrality under the Euclidean norm can be transformed into node centrality, a property not exhibited by other common measures. Let the distributions of the eigenvector centrality of subgraphs A and B be βa and βb respectively, and that of the background graph be βt. Then |∆e(βt, βa)| > θ indicates that A is sufficiently structurally distinct from Gb, and |∆e(βt, βa)| > |∆e(βt, βb)| indicates that A contains significantly more or significantly fewer influential nodes than B.
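A sketch of this comparison, assuming NetworkX graphs and SciPy's Jensen–Shannon implementation (which returns the JS distance, i.e., the square root of the divergence), is given below; the binning choices are illustrative.

    import numpy as np
    import networkx as nx
    from scipy.spatial.distance import jensenshannon

    def eigenvector_centrality_divergence(subgraph, background, bins=20):
        # Delta_e: JS distance between the eigenvector-centrality distributions of a
        # candidate subgraph and the background (reference) graph. Bin edges are
        # taken from the reference distribution, mirroring cut2bin.
        ref = np.array(list(nx.eigenvector_centrality_numpy(background).values()))
        cand = np.array(list(nx.eigenvector_centrality_numpy(subgraph).values()))
        ref_hist, edges = np.histogram(ref, bins=bins)
        cand_hist, _ = np.histogram(cand, bins=edges)
        return jensenshannon(ref_hist, cand_hist)

    # A is structurally distinct from Gb when eigenvector_centrality_divergence(A, Gb)
    # exceeds the analyst-chosen threshold theta.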
Topical Navigability Disparity: Navigability measures ease of flow. If subgraph S is more navigable than subgraph S′, then there will be more traffic through S compared to S′. However, the likelihood of seeing a higher flow through a subgraph depends not just on the structure of the network, but on extrinsic covariates like time and topic. So, a subgraph is interesting in terms of navigability if, for some values of a covariate, its navigability measure is different from that of a background subgraph.

Inspired by its application in biology (Seguin et al., 2018), traffic analysis (Scellato et al., 2010), and network attack analysis (Lekha and Balakrishnan, 2020), we use edge betweenness centrality (Das et al., 2018) as the generic (non-topical) measure of navigability. Let αij be the number of shortest paths from node i to node j and αij(k) be the number of those paths that pass through edge k. Then the edge betweenness centrality is

    Ceb(k) = Σ(i,j)∈V αij(k) / αij

By this definition, the edge betweenness centrality is the portion of all-pairs shortest paths that pass through an edge. Since the edge betweenness centrality of edge e measures the proportion of paths that pass through e, a subgraph S with a higher proportion of high-valued edge betweenness centrality may be more navigable than Gb or than another subgraph S′ of the graph; i.e., information propagation is higher through this subgraph than through the whole background network or, for that matter, any other subgraph of the network having a lower proportion of edges with high edge betweenness centrality. Let the distributions of the edge betweenness centrality of two subgraphs A and B be c1 and c2 respectively, and that of the reference graph be c0. Then |∆b(c0, c1)| < |∆b(c0, c2)| means the second subgraph is more navigable than the first.
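The corresponding check, a sketch assuming NetworkX and SciPy with the same reference-binned comparison as above:

    import numpy as np
    import networkx as nx
    from scipy.spatial.distance import jensenshannon

    def navigability_divergence(subgraph, background, bins=20):
        # Delta_b: JS distance between the edge-betweenness-centrality distributions
        # of a candidate subgraph and the reference graph.
        ref = np.array(list(nx.edge_betweenness_centrality(background).values()))
        cand = np.array(list(nx.edge_betweenness_centrality(subgraph).values()))
        ref_hist, edges = np.histogram(ref, bins=bins)
        cand_hist, _ = np.histogram(cand, bins=edges)
        return jensenshannon(ref_hist, cand_hist)

    # Subgraph B is deemed more navigable than subgraph A (relative to Gb) when
    # navigability_divergence(A, Gb) < navigability_divergence(B, Gb).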
To associate navigability with topics, we detect topic clusters over the background graph and the subgraph being inspected. The exact method of topic cluster finding is independent of the use of topical navigability; in our setting, we have used topic modeling and dense-region detection in hashtag co-occurrence networks. For each topic cluster, we identify the posts (within the subgraph) that belong to the cluster. If the number of posts is greater than a threshold, we compute the navigability disparity.

Propagativeness Disparity: The concept of propagativeness builds on the concept of navigability. Propagativeness attempts to capture how strongly the network is spreading information through a navigable subgraph S. We illustrate the concept with a network constructed over tweets in which a political personality (Senator Kamala Harris in this example) is mentioned in January 2021. The three rows in Figure 10 show the network characteristics of the subregions of this graph related, respectively, to the themes of #ados and "Black Lives Matter" (Row 1), the Capitol insurrection (Row 2), and socioeconomic issues related to COVID-19, including stimulus funding and business reopening (Row 3). In earlier work (Zheng and Gupta, 2019), we have shown that a well-known propagation mechanism for tweets is to add user mentions to improve the reach of a message – hence the user-mention subgraph is indicative of propagative activity. In Figure 10, we compare the hashtag activity (measured by the hashtag subgraph) and the mention activity (the size of the mention graph) in these three subgraphs. Figure 10(e) shows a low and fairly steady amount of user-mention activity in relation to the hashtag activity on the same topic, and these two indicators are not strongly correlated. Further, Figure 10(f) shows that the mean and standard deviation of the node degree of the hashtag activity are fairly close, and the average degree of the user co-mention (two users mentioned in the same tweet) graph is relatively steady over the period of observation – showing low propagativeness. In contrast, Rows 1 and 2 show sharper peaks. But the curve in Figure 10(c) declines and has low, uncorrelated user-mention activity; hence, for this topic, although there is a lot of discussion (leading to high-navigability edges), the propagativeness is quite low. In comparison, Figure 10(a) shows a strong peak and a stronger correlation between the two curves, indicating more propagativeness. The higher standard deviation of the co-mention node degree over time (Figure 10(b)) also indicates more propagation around this topic compared to the others.

We capture propagativeness using current flow betweenness centrality (Brandes and Fleischer, 2005), which is based on Kirchhoff's current laws. We combine this with the average neighbor degree of the nodes of S to measure the spreading propensity of S. The current flow betweenness centrality is the portion of all-pairs shortest paths that pass through a node, and the average neighbor degree is the average degree of the neighborhood of each node.
If a subgraph has a higher current flow betweenness centrality plus a higher average neighbor degree, the network should have faster communicability. Let αij be the number of shortest paths from node i to node j and αij(n) be the number of those paths that pass through node n. Then the current flow betweenness centrality is

    Cnb(n) = Σ(i,j)∈V αij(n) / αij

Suppose the distributions of the current flow betweenness centrality of two subgraphs A and B are p1 and p2 respectively, and the distribution of the reference graph is pt. Also, let the distribution of βn, the average neighbor degree of node n, be γ1 for subgraph A, γ2 for subgraph B, and γt for the reference graph. If the condition

    ∆(pt, p1) · ∆(γt, γ1) < ∆(pt, p2) · ∆(γt, γ2)

holds, we can conclude that subgraph B is a faster-propagating network than subgraph A. This measure is of interest in social media based on the observation that misinformation/disinformation propagation groups either try to increase the average neighbor degree by adding fake nodes or try to involve influential nodes with high edge centrality to propagate the message faster (Besel et al., 2018).
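A sketch of this combined score, assuming undirected, connected NetworkX graphs (current-flow betweenness is only defined on connected graphs) and the same reference-binned comparison as before:

    import numpy as np
    import networkx as nx
    from scipy.spatial.distance import jensenshannon

    def propagativeness_score(subgraph, background, bins=20):
        # Product of the divergences of (i) current-flow betweenness centrality and
        # (ii) average neighbor degree, each measured against the reference graph.
        def divergence(metric):
            ref = np.array(list(metric(background).values()))
            cand = np.array(list(metric(subgraph).values()))
            ref_hist, edges = np.histogram(ref, bins=bins)
            cand_hist, _ = np.histogram(cand, bins=edges)
            return jensenshannon(ref_hist, cand_hist)
        return (divergence(nx.current_flow_betweenness_centrality)
                * divergence(nx.average_neighbor_degree))

    # B propagates faster than A when
    # propagativeness_score(A, Gb) < propagativeness_score(B, Gb).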
Subgroups within a Candidate Subgraph: The purpose of the last metric is to determine whether a candidate subgraph identified using the previous measures needs to be further decomposed into smaller subgraphs. We use subgraph centrality (Estrada and Rodriguez-Velazquez, 2005) and the coreness of nodes as our metrics. The subgraph centrality measures the number of subgraphs a vertex participates in, and the core number of a node is the largest value k of a k-core containing that node. So a subgraph for which the core number and subgraph centrality distributions are right-skewed compared to those of the background graph is (i) either split around high-coreness nodes, or (ii) reported to the user as a mixture of diverse topics.
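One way to operationalize this skewness check is sketched below with NetworkX and SciPy; the skewness-gap threshold is illustrative only.

    import networkx as nx
    from scipy.stats import skew

    def needs_repartitioning(subgraph, background, threshold=0.5):
        # Flag a candidate whose core-number and subgraph-centrality distributions
        # are markedly right-skewed relative to the background graph.
        def simple(g):
            h = nx.Graph(g)                          # both metrics need simple graphs
            h.remove_edges_from(nx.selfloop_edges(h))
            return h
        s, b = simple(subgraph), simple(background)
        core_gap = (skew(list(nx.core_number(s).values()))
                    - skew(list(nx.core_number(b).values())))
        sc_gap = (skew(list(nx.subgraph_centrality(s).values()))
                  - skew(list(nx.subgraph_centrality(b).values())))
        return core_gap > threshold and sc_gap > threshold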
The node grouping, per-group subgraph generation and candidate subgraph identification process is presented in Algorithm 1. In the algorithm, the function cut2bin extends the cut function, which compares the histograms of two distributions whose domains (X-values) must overlap, and produces equi-width bins to ensure that the two histograms (i.e., frequency distributions) have compatible bins.

Algorithm 1: Graph Construction Algorithm
INPUT: Qout, the output of the query; L, the graph construction rules; gv, the grouping variable; thsize, the minimum size of the subgraph
Function gmetrics(Qout, L, gv)
    G[] ← ConstructGraph(Qout, L)
    T ← []
    for g ∈ G do
        tα ← ComputeMetrics(g)
        T.push(tα)
    end
    return T
end
Function ComputeMetrics(Graph g)
    m ← []
    m.push(eigenVectorCentrality(g))
    ...
    m.push(coreNumber(g))
    return m
end
Function CompareHistograms(List t1, List x2)
    binedges ← getBinEdges(x2)
    sg ← cut2bin(x2, binedges)
    tg ← cut2bin(t1, binedges)
    βjs ← distance.jensenShannon(tg, sg)
    ht ← histogram(tg, sg, binedges)
    return βjs, ht, binedges
end

Algorithm 2: Graph Discovery Algorithm
Input: σ, the set of divergence values of all subgraphs
Output: Feature vectors v1, v2, v3; list l of re-partition recommendations
ev: eigenvector centrality; ec: edge current flow betweenness centrality; nc: current flow betweenness centrality; µ: core number; sc: subgraph centrality; z: average neighbor degree
Function discover(σ)
    for any two sets of divergences σ1 and σ2 from σ do
        if σ2(ev) > σ1(ev) then
            v1(σ2) = v1(σ2) + 1
            if σ2(ec) > σ1(ec) then
                v2(σ2) = v2(σ2) + 1
                if (σ2(nc) + σ2(µ)) > (σ1(nc) + σ1(µ)) then
                    v3(σ2) = v3(σ2) + 1
                end
                if (σ2(sc) + σ2(z)) > (σ1(sc) + σ1(z)) then
                    l(σ2) = 1
                end
            end
        end
    end
end

5.2 The Testing Process

The discovery algorithm's input is the list of divergence values of two candidate sets computed against the same reference graph. It produces four lists at the end. Each of the first three lists captures one specific factor of interestingness of the subgraph; the most interesting subgraphs should be present in all three vectors. If a subgraph has many cores and is sufficiently dense, the system considers the subgraph to be uninterpretable and sends it for re-partitioning; therefore, the fourth list contains the subgraphs that should be partitioned again. Currently, our repartitioning strategy is to take subsets of the original keyword list provided by the user at the beginning of the discovery process and re-initiate the discovery process for the dense, uninterpretable subgraph.
Fig. 4 #ADOS and Black Lives Matter          Fig. 5 Avg. degree distributions and std. dev.
Fig. 6 Insurrection and Capitol attack       Fig. 7 Avg. degree distributions and std. dev.
Fig. 8 Socioeconomic issues during COVID-19  Fig. 9 Avg. degree distributions and std. dev.
Fig. 10 Various metrics for the daily partitioned hashtag co-occurrence and user-mention graphs from the political personality dataset. Q0 retrieves tweets where Kamala Harris is mentioned in a hashtag, text body or user mention.

The output of each metric produces a value for each participating node of the input. However, to compare two different candidates in terms of the metrics mentioned above, we need to convert them into comparable histograms by applying a binning function that depends on the data type of the grouping function.

Bin Formation (cut2bin): cut is a conventional operator (available in R, Matlab, Pandas, etc.) that segments and sorts data values into bins. cut2bin is an extension of the standard cut function that compares the histograms of two distributions whose domains (X-values) must overlap. The cut function accepts as input a set of node property values (e.g., the centrality metrics) and, optionally, a set of edge boundaries for the bins, and it returns the histogram of the distribution.
Using cut, we first produce n equi-width bins from the distribution with the narrower domain; we then extract the bin edges from the result and use them as the input bin edges to create the wider distribution's cut. This forces the two histograms to be compatible. In case one of the distributions is known to be a reference distribution (a distribution from the background graph) against which the second distribution is compared, we use the reference distribution for the equi-width binning and bin the second distribution relative to the first.
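A compact sketch of cut2bin and CompareHistograms in Python (NumPy/SciPy); the exact signatures are illustrative, not the system's API.

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def cut2bin(values, bin_edges=None, n_bins=20):
        # Bin a list of node-property values; when bin_edges is given, reuse the
        # edges of the reference distribution so the two histograms are compatible.
        values = np.asarray(values, dtype=float)
        if bin_edges is None:
            hist, bin_edges = np.histogram(values, bins=n_bins)
        else:
            hist, _ = np.histogram(values, bins=bin_edges)
        return hist, bin_edges

    def compare_histograms(reference_values, candidate_values):
        # Bin the reference first, reuse its edges for the candidate, and return the
        # Jensen-Shannon distance between the two histograms.
        ref_hist, edges = cut2bin(reference_values)
        cand_hist, _ = cut2bin(candidate_values, bin_edges=edges)
        return jensenshannon(ref_hist, cand_hist), (ref_hist, cand_hist), edges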
The CompareHistograms function uses the cut2bin               ment, and c) American economic crisis. and recovery
function to produce the histograms, and then computes         efforts. For the vaccine data set, we selected two subsets
the JS Divergence on the comparable histograms. The           of posts, one for vaccine-related concerns, anti-vaccine
CompareHistograms function returns the set of diver-          movements, and related issues, and the second for covid
gence values for each metric of a subgraph, which is          testing and infection-related issues. The vaccine data
the input of the discovery algorithm. The function re-        set is larger, and the content is more diverse than the
quires the user to specify which of the compared graphs       first two. All the data sets are publicly available from
should be considered as a reference – this is required        The Awesome Lab 1
to ensure that our method is scalable for large back-
                                                                  In the experiments, we constructed two subgraphs
ground graphs (which are typically much larger than
                                                              from each subquery. The first is the “hashtag co-occurrence”
the interesting subgraphs). If the background graph is
                                                              graph, where each hashtag is a node, and they are con-
very large, we take several random subgraphs from this
                                                              nected through an edge if they coexist in a tweet. The
graph to ensure they are representative before the ac-
                                                              second is the “user co-mention” graph, where each user
tual comparisons are conducted. To this end, we adopt
                                                              is a node, and there is an edge between two nodes if a
the well-known random walk strategy.
                                                              tweet mentioned them jointly. Intuitively, the hashtag
In the algorithm v1 , v2 and v3 are the three vectors
                                                              co-occurrence subgraph captures topic prevalence and
to store the interestingness factors of the subgraphs,
                                                              propagation, whereas the co-mention subgraph captures
and l is the list for repartitioning. For two subgraphs, if
                                                              the tendency to influence and propagate messages to a
one of them qualified for v1 means, the subgraph con-
                                                              larger audience. Our goal is to discover surprises (and
tains higher centrality than the other. In that case, it
                                                              lack thereof) in these two aspects for our data sets.
increases the value of that qualified bit in the vector
by one. Similarly, it increases the value of v2 by one,           We note that the dataset chosen is from a month
if the same candidate has high navigability. Finally, it      where the US experienced a major event in the form
increases the v3 , if it has higher propagativeness. The      of the Capitol Attack, and a new administration was
algorithm selects the top-k scores of candidates from         sworn in. This explains why the number of tweets in
each vector, and marks them interesting.                      “Capitol Attack” subgraph is high for both politicians
6 Experiments

6.1 Data Sets

Data sets used for the experiments are tweets collected through the Twitter Streaming API using a set of domain-specific, hand-curated keywords. We used three data sets, all collected between the 1st and the 31st of January, 2021. The first two are for political personalities. The Kamala Harris data set was collected using variants of Kamala Harris's name together with her Twitter handle and hashtags constructed from her name. The second data set was similarly constructed for Joe Biden. The third data set was collected during the COVID-19 pandemic; its keywords were selected from popularly co-occurring terms on Google Trends, and we screened the terms manually to ensure that they are related to the pandemic and vaccine-related issues (and not, for example, partisan politics). Table 1 presents a summary of these data collection efforts. For the vaccine data set, we selected two subsets of posts, one for vaccine-related concerns, anti-vaccine movements, and related issues, and the second for covid testing and infection-related issues. The vaccine data set is larger than the first two, and its content is more diverse. All the data sets are publicly available from The Awesome Lab (https://code.awesome.sdsc.edu/awsomelabpublic/datasets/int-springer-snam/).

In the experiments, we constructed two subgraphs from each subquery. The first is the "hashtag co-occurrence" graph, where each hashtag is a node and two hashtags are connected by an edge if they co-occur in a tweet. The second is the "user co-mention" graph, where each user is a node and there is an edge between two nodes if a tweet mentions them jointly. Intuitively, the hashtag co-occurrence subgraph captures topic prevalence and propagation, whereas the co-mention subgraph captures the tendency to influence and propagate messages to a larger audience. Our goal is to discover surprises (and the lack thereof) in these two aspects for our data sets.
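For illustration, a minimal sketch of this construction is given below; it assumes each tweet is available as a dict with hashtags and mentions lists, and it is not the actual ingestion pipeline used in our system.

from itertools import combinations
import networkx as nx

def _add_cooccurrence_edges(G, items):
    """Connect every pair of items appearing in the same tweet; weight counts co-occurrences."""
    for a, b in combinations(sorted(set(items)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

def build_subquery_graphs(tweets):
    """Returns (hashtag co-occurrence graph, user co-mention graph) for one subquery."""
    hashtag_g, comention_g = nx.Graph(), nx.Graph()
    for tw in tweets:
        _add_cooccurrence_edges(hashtag_g, tw.get("hashtags", []))
        _add_cooccurrence_edges(comention_g, tw.get("mentions", []))
    return hashtag_g, comention_g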
We note that the data sets come from a month in which the US experienced a major event in the form of the Capitol Attack, and a new administration was sworn in. This explains why the number of tweets in the "Capitol Attack" subgraph is high for both politicians in this week; not surprisingly, it is also the most discussed topic, as evidenced by the high average node degree. This "selection bias" therefore sets our expectation for subjective interestingness – given the specific week we have chosen, this issue will dominate most social media conversations in the USA. We also observe a low ratio of unique nodes to tweets, signifying a high number of retweets, which signals a form of information propagation over the network. The propagativeness of the network during this eventful week is further evidenced by the fact that the unique node count of a co-mention network is, on average, almost 75%–88% higher than that of the hashtag co-occurrence network of the same class. In Section 6.3, we show how our interestingness technique performs in the face of this data set.

Table 1 Dataset Descriptions

Data Set        Collection Size   Sub Query          Network Type        Total Tweets   Unique Nodes   Unique Edges   Self Loops   Density   Avg Degree
Kamala Harris   12469480          Capitol Attack     Hashtag co-occur    164397         1398           7801           16           0.0025    4.3
Kamala Harris   12469480          Capitol Attack     User co-mention     164397         8012           48604          87           0.00012   3.19
Kamala Harris   12469480          #ADOS              Hashtag co-occur    158419         3671           10738          29           0.0015    5.8
Kamala Harris   12469480          #ADOS              User co-mention     158419         30829          39865          49           8.3       2.5
Kamala Harris   12469480          Economic Issues    Hashtag co-occur    36678          1278           1828           4            0.0022    2.8
Kamala Harris   12469480          Economic Issues    User co-mention     36678          6971           11584          19           0.0004    3.4
Joe Biden       45258151          Capitol Attack     Hashtag co-occur    676898         7728           21422          50           0.00071   5.49
Joe Biden       45258151          Capitol Attack     User co-mention     676898         82046          101646         130          3.0146    2.473
Joe Biden       45258151          #ADOS              Hashtag co-occur    183765         3007           11008          29           0.002     3.85
Joe Biden       45258151          #ADOS              User co-mention     158419         29547          40932          56           9.3       2.7
Joe Biden       45258151          Economic Issues    Hashtag co-occur    138754         2961           5733           10           0.0013    3.87
Joe Biden       45258151          Economic Issues    User co-mention     138754         21417          19691          23           8.5       1.83
Vaccine         24172676          Vaccine Anti-vax   Hashtag co-occur    1000000        18809          24195          44           2.52      2.5
Vaccine         24172676          Vaccine Anti-vax   User co-mention     1000000        203211         41877          46           2.02      0.4
Vaccine         24172676          Covid Test         Hashtag co-occur    1000000        26671          45378          69           0.00012   3.4
Vaccine         24172676          Covid Test         User co-mention     1000000        188761         83656          109          4.67      0.886
Vaccine         24172676          Economy            Hashtag co-occur    917890         3002           4395           9            0.0009    2.9
Vaccine         24172676          Economy            User co-mention     917890         20590          8528           13           4.023     0.8
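For reference, per-network statistics of the kind reported in Table 1 can be computed directly from the constructed graphs. The sketch below uses standard networkx measures; the exact conventions behind Table 1 (for example, how self-loops, multi-edges, or degree are counted) may differ, so the numbers need not match exactly.

import networkx as nx

def graph_summary(G, total_tweets):
    """Basic descriptive statistics for one constructed subgraph."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    return {
        "total_tweets": total_tweets,
        "unique_nodes": n,
        "unique_edges": m,
        "self_loops": nx.number_of_selfloops(G),
        "density": nx.density(G),
        "avg_degree": (2.0 * m / n) if n else 0.0,   # one common definition of average degree
    }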

6.2 Experimental Setup

The experimental setup has three steps: a) data collection and archival storage, b) indexing and storing the data, and c) executing analytical pipelines. We used the AWESOME project's continuous tweet ingestion system, which collects all the tweets through the Twitter 1% streaming API using a set of hand-picked keywords. We used the AWESOME Polystore for indexing, storing, and searching the data. For computation, we used the Nautilus facility of the Pacific Research Platform (PRP). Our hardware configuration is as follows: the AWESOME server has 64 GB of memory and 32 cores, and the Nautilus nodes have 32 cores and 64 GB of memory. The data ingestion process requires a large amount of memory, and this requirement varies with the density of the data; similarly, centrality computation is CPU-bound. The performance optimizations that we implemented are outside the scope of this paper.

Fig. 11 Capitol Attack                  Fig. 12 #ADOS                         Fig. 13 Economic issues
Fig. 14 Top Hashtag Distributions of the Kamala Harris Data set

Fig. 15 Capitol Attack                  Fig. 16 #ADOS Issues                  Fig. 17 Economic issues
Fig. 18 Top Hashtag Distributions of the Joe Biden Data set

6.3 Result Analysis

Kamala Harris Network: Figure 29 summarizes the Kamala Harris network. This network is interesting because, even in the context of the Capitol Attack and the Senate approval of the election results, it is dominated by #ADOS conversations rather than by the other political issues, including economic policies. An analysis of the three interestingness measures for this network reveals the following. The Eigenvector Centrality Disparity for the hashtag network (Figure 23) shows that all the groups generated by the subqueries are equally distributed, while the random graph has a higher volume with similar trends; hence, the three subqueries are equally important. However, a few spikes on #ADOS and "Capitol Attack" indicate the possibility of interestingness. Figure 24 shows that most of the "economic issues" activity spikes on lower-centrality nodes and drops to zero as centrality increases, while "Capitol Attack" and "#ADOS" show many more spikes across centrality levels. Hence, we conclude that this network is much more navigable for "Capitol Attack" and "#ADOS". Figure 25 presents the Propagativeness Disparity, and it is clear from this plot that African American issues dominated the conversation here: any subtopic related to them, including non-US issues such as "Tigray" (on the Ethiopian genocide), propagated quickly in this network.

Biden Network: While the predominance of #ADOS issues might still be expected for the Kamala Harris data set, we discovered it to be a "dominant enough" subgraph in the Joe Biden data set as well (Figure 36). The eigenvector centrality disparity for the "Joe Biden" data set (Figure 30) shows that the three subgroups are equally dominant in this network. However, the navigability of the network (Figure 31) shows that it is navigable for all three subgraphs: it has two big spikes for "economic issues" and "Capitol Attack," plus many mid-sized spikes for "#ADOS". Interestingly, the propagativeness plot (Figure 32) shows that the network is strongly propagative for the economic issues and the #ADOS issue, which shows up in both the Capitol Attack and the #ADOS subgroups. The "Joe Biden" co-mention network also shows more propagativeness than the hashtag co-occurrence network, which indicates that exploring the co-mention subgraph will be useful. We further note the occurrence of certain completely unexpected topics (e.g., COVIDIOT, AUSvIND – Australia vs. India) within the #ADOS group, while the Economic Issues for Biden do not exhibit surprising results.

Vaccine Network: In the vaccine network, we found that "economic issues" and "covid tests" are more propagative than the "vaccine and anti-vaccine" related topics (Figure 43). The surprising result here is that the "vaccine – anti-vaccine" topics show a strong correlation with "Economy" in the other two charts. We observe that while the vaccine issues are navigable through the network, this topic cluster is not very propagative. In contrast, in the co-mention network, vaccine and anti-vaccine issues are both very navigable and strongly propagative. Further, the propagativeness in the co-mention network for the Covid-test subquery shows many spikes at different levels, which signifies that for testing-related issues the network serves as a vehicle of message propagation and influencing.
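As a rough illustration of how disparity plots such as Figure 23 can be reproduced, the sketch below overlays per-node eigenvector-centrality histograms for the subquery subgraphs and a random background sample. This is our own hedged reconstruction of the plotting step, and the input graphs are placeholders.

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

def plot_centrality_disparity(named_graphs, bins=30):
    """named_graphs: e.g. {"Capitol Attack": g1, "#ADOS": g2, "Economic Issues": g3, "Random": gr}."""
    for name, G in named_graphs.items():
        centrality = np.array(list(nx.eigenvector_centrality_numpy(G).values()))
        plt.hist(centrality, bins=bins, histtype="step", label=name)
    plt.xlabel("eigenvector centrality")
    plt.ylabel("number of nodes")
    plt.legend()
    plt.show()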

Fig. 19 Vaccine anti-vaccine Issues    Fig. 20 COVID-19 Test                  Fig. 21 Economic issues
Fig. 22 Top Hashtag Distributions of the Vaccine Data set

6.4 Result Validation

There is not a lot of work in the literature on interesting subgraph finding, and there are no benchmark data sets against which our "interestingness finding" technique can be compared. This prompted us to evaluate the results using a core-periphery analysis, as indicated earlier in Figure 1. The idea is to demonstrate that the parts of the network claimed to be interesting stand out in comparison to a random sample of comparable size drawn from the background graph. These results are presented in Figures 48, 53, and 58. In each case, we show a representative random graph in the rightmost subfigure to represent the background graph. To us, the large and dense core formation in Figure 48(b) is an expected, non-surprising result. However, the lack of core formation in Figure 48(c) is interesting because it shows that, while there was a sizeable network around economics-related terms for Kamala Harris, the topics never "gelled" into a core-forming conversation and showed little propagation. Figure 48(a) is somewhat more interesting because the density of its periphery is far weaker than that of the random graph, while its core has about the same density as the random graph. The core-periphery separation is much more prominent in the first three plots of Figure 53. Unlike the Kamala Harris random graph, Figure 53(d) shows that the random graph of this data set itself has a very large dense core and a moderately dense periphery. Among the three subfigures, the small (but lighter) core of Figure 53(b) has the maximum difference compared to the random graph, although we find Figure 53(a) to be conceptually more interesting. For the vaccine data set, Figure 58(c) is closest to the random graph, showing that most conversations around the topic touch on economic issues, while the discussion on the vaccine itself (Figure 58(a)) and that of COVID testing (Figure 58(b)) are more focused and propagative.
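The core-periphery comparison itself can be approximated in several ways; the sketch below uses a k-core decomposition as a simple proxy (the figures in this paper may be based on a different core-periphery model) to split a graph into a core and a periphery and to compare their densities against a random background sample.

import networkx as nx

def core_periphery_densities(G, core_k=None):
    """Split G into a k-core 'core' and the remaining 'periphery' and report their densities."""
    H = nx.Graph(G)
    H.remove_edges_from(list(nx.selfloop_edges(H)))   # core_number requires no self-loops
    core_num = nx.core_number(H)
    k = core_k if core_k is not None else max(core_num.values())
    core_nodes = {n for n, c in core_num.items() if c >= k}
    periphery_nodes = [n for n in H if n not in core_nodes]
    return {
        "k": k,
        "core_density": nx.density(H.subgraph(core_nodes)),
        "periphery_density": nx.density(H.subgraph(periphery_nodes)),
    }

# A claimed "interesting" subgraph can then be contrasted with a random background
# sample of comparable size by comparing the two dictionaries returned above.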
7 Conclusion

In this paper, we presented a general technique to discover interesting subgraphs in social networks, with the intent that social network researchers from different domains will find it a viable research tool. The results obtained from the tool will help researchers probe deeper into the underlying phenomena that lead to the surprising results the tool discovers. While we used Twitter as our example data source, a similar analysis can be performed on other social media, where the content features would be different. Further, in this paper we used a few centrality measures to compute divergence-based features – but the system is designed so that other measures can be plugged in as well.

Fig. 23 Eigenvector Centrality Disparity Hashtag Network    Fig. 24 Topical Navigability Disparity Hashtag Network

Fig. 25 Propagativeness Disparity Hashtag Network           Fig. 26 Eigenvector Centrality Disparity co-mention Network

Fig. 27 Topical Navigability Disparity co-mention Network   Fig. 28 Propagativeness Disparity co-mention Network
Fig. 29 Comparative studies of all sub-queries using the ”Kamala Harris” data set.

Fig. 30 Eigenvector Centrality Disparity Hashtag Network     Fig. 31 Topical Navigability Disparity Hashtag Network

Fig. 32 Propagativeness Disparity Hashtag Network            Fig. 33 Eigenvector Centrality Disparity co-mention Network

Fig. 34 Topical Navigability Disparity co-mention Network    Fig. 35 Propagativeness Disparity co-mention Network
Fig. 36 Comparative studies of all sub-queries using the ”Joe Biden” data set.

Fig. 37 Eigenvector Centrality Disparity: ”Vaccine” Hashtag Network    Fig. 38 Topical Navigability Disparity: Vaccine Hashtag Network

Fig. 39 Propagativeness Disparity: ”Vaccine” Hashtag Network           Fig. 40 Eigenvector Centrality Disparity: ”Vaccine” co-mention Network

Fig. 41 Topical Navigability Disparity: ”Vaccine” co-mention Network   Fig. 42 Propagativeness Disparity: ”Vaccine” co-mention Network
Fig. 43 Comparative studies of all sub-queries using the ”Vaccine” data set.

Fig. 44 #ADOS                 Fig. 45 Capitol attack        Fig. 46 Economic issues   Fig. 47 Random Graph
Fig. 48 Core and Periphery visualization of ”Kamala Harris” Data set

Fig. 49 #ADOS                 Fig. 50 Capitol attack        Fig. 51 Economic issues   Fig. 52 Random Data
Fig. 53 Core and Periphery visualization of ”Joe Biden” Data set

Fig. 54 Vaccine Issues        Fig. 55 COVID-19 Test         Fig. 56 Economic issues   Fig. 57 Random Graph
Fig. 58 Core and Periphery visualization of the Vaccine Data set

Acknowledgements This work was partially supported by NSF Grants #1909875 and #1738411. We also acknowledge SDSC cloud support, particularly Christine Kirkpatrick and Kevin Coakley, for their generous help and support in collecting and managing our tweet collection.