Detecting Beneficial Feature Interactions for Recommender Systems

Yixin Su,¹ Rui Zhang,² Sarah Erfani,¹ Zhenghua Xu³
¹ University of Melbourne   ² Tsinghua University   ³ Hebei University of Technology
yixins1@student.unimelb.edu.au, rui.zhang@ieee.org, sarah.erfani@unimelb.edu.au, zhenghua.xu@hebut.edu.cn

* Rui Zhang and Zhenghua Xu are corresponding authors.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Feature interactions are essential for achieving high accuracy in recommender systems. Many studies take into account the interaction between every pair of features. However, this is suboptimal because some feature interactions may not be relevant to the recommendation result, and taking them into account may introduce noise and decrease recommendation accuracy. To make the best out of feature interactions, we propose a graph neural network approach to effectively model them, together with a novel technique to automatically detect those feature interactions that are beneficial in terms of recommendation accuracy. The automatic feature interaction detection is achieved via edge prediction with an L0 activation regularization. Our proposed model is proved to be effective through the information bottleneck principle and statistical interaction theory. Experimental results show that our model (i) outperforms existing baselines in terms of accuracy, and (ii) automatically identifies beneficial feature interactions.

Introduction

Recommender systems play a central role in addressing information overload issues in many Web applications, such as e-commerce, social media platforms, and lifestyle apps. The core of recommender systems is to predict how likely a user will interact with an item (e.g., purchase, click). An important technique is to discover the effects of features (e.g., contexts, user/item attributes) on the target prediction outcomes for fine-grained analysis (Shi, Larson, and Hanjalic 2014). Some features are correlated with each other, and the joint effects of these correlated features (i.e., feature interactions) are crucial for recommender systems to achieve high accuracy (Blondel et al. 2016; He and Chua 2017). For example, it is reasonable to recommend a user to use Uber on a rainy day at off-work hours (e.g., during 5-6pm). In this situation, considering the feature interaction < 5-6pm, rainy > is more effective than considering the two features separately. Therefore, in recent years, many research efforts have been put into modeling feature interactions (He and Chua 2017; Lian et al. 2018; Song et al. 2019). These models take into account the interaction between every pair of features.

However, in practice, not all feature interactions are relevant to the recommendation result (Langley et al. 1994; Siegmund et al. 2012). Modeling the feature interactions that provide little useful information may introduce noise and cause overfitting, and hence decrease the prediction accuracy (Zhang et al. 2017; Louizos, Welling, and Kingma 2018). For example, a user may use Gmail on a workday no matter what the weather is. However, if the interaction of workday and a specific weather condition is taken into account by the model, then, due to bias in the training set (e.g., the weather on the days when the Gmail usage data are collected happens to be cloudy), the interaction < Monday, Cloudy > might be picked up by the model and lead to less accurate recommendations on Gmail usage. Some work considers each feature interaction's importance through the attention mechanism (Xiao et al. 2017; Song et al. 2019). However, these methods still take all feature interactions into account. Moreover, they capture each individual feature interaction's contribution to the recommendation prediction, failing to capture the holistic contribution of a set of feature interactions together.

In this paper, we focus on identifying the set of feature interactions that together produce the best recommendation performance. To formulate this idea, we propose the novel problem of detecting beneficial feature interactions, which is defined as identifying the set of feature interactions that contribute most to the recommendation accuracy (see Definition 1 for details). Then, we propose a novel graph neural network (GNN)-based recommendation model, L0-SIGN, that detects the most beneficial feature interactions and utilizes only the beneficial feature interactions for recommendation, where each data sample is treated as a graph, features as nodes, and feature interactions as edges. Specifically, our model consists of two components. One component is an L0 edge prediction model, which detects the most beneficial feature interactions by predicting the existence of edges between nodes. To ensure the success of the detection, an L0 activation regularization is proposed to encourage unbeneficial edges (i.e., feature interactions) to take the value 0, which means that the edge does not exist. The other component is a graph classification model, called Statistical Interaction Graph neural Network (SIGN). SIGN takes nodes (i.e., features) and detected edges (i.e., beneficial feature interactions) as the input graph, and outputs recommendation predictions by effectively modeling and aggregating the node pairs that are linked by an edge.
Theoretical analyses are further conducted to verify the effectiveness of our model. First, the most beneficial feature interactions are guaranteed to be detected in L0-SIGN. This is proved by showing that the empirical risk minimization procedure of L0-SIGN is a variational approximation of the Information Bottleneck (IB) principle, which is a gold-standard criterion for finding the most relevant information in the inputs correlating with the target outcomes (Tishby, Pereira, and Bialek 2000). Specifically, only the most beneficial feature interactions will be retained in L0-SIGN, because, in the training stage, our model simultaneously minimizes the number of detected feature interactions through the L0 activation regularization and maximizes the recommendation accuracy with the detected feature interactions. Second, we further show that the modeling of the detected feature interactions in SIGN is very effective. By accurately leveraging the relational reasoning ability of GNNs, a feature interaction is modeled in SIGN as a statistical interaction (an interaction is called a statistical interaction if the joint effects of the variables are modeled correctly) if and only if it is detected to be beneficial by the L0 edge prediction component.

We summarize our contributions as follows:
• This is the first work to formulate the concept of beneficial feature interactions, and to propose the problem of detecting beneficial feature interactions for recommender systems.
• We propose a model, named L0-SIGN, to detect the beneficial feature interactions via a graph neural network approach and L0 regularization.
• We theoretically prove the effectiveness of L0-SIGN through the information bottleneck principle and statistical interaction theory.
• We have conducted extensive experimental studies. The results show that (i) L0-SIGN outperforms existing baselines in terms of accuracy; and (ii) L0-SIGN effectively identifies beneficial feature interactions, which results in the superior prediction accuracy of our model.

Related Work

Feature Interaction based Recommender Systems  Recommender systems are one of the most critical research domains in machine learning (Lu et al. 2015; Wang et al. 2019). The factorization machine (FM) (Rendle 2010) is one of the most popular algorithms that consider feature interactions. However, FM and its deep learning-based extensions (Xiao et al. 2017; He and Chua 2017; Guo et al. 2017) consider all possible feature interactions, while our model detects and models only the most beneficial feature interactions. Recent work considers the importance of feature interactions by giving each feature interaction an attention value (Xiao et al. 2017; Song et al. 2019), or selects important interactions by using gates (Liu et al. 2020) or by searching in a tree-structured space (Luo et al. 2019). In these methods, the importance is not determined by the holistic contribution of the feature interactions, which limits their performance. Our model detects beneficial feature interactions that together produce the best recommendation performance.

Graph Neural Networks (GNNs)  GNNs can facilitate learning entities and their holistic relations (Battaglia et al. 2018). Existing work leverages GNNs to perform relational reasoning in various domains. For example, Duvenaud et al. (2015) and Gilmer et al. (2017) use GNNs to predict molecules' properties by learning their features from molecular graphs. Chang et al. (2016) use GNNs to learn object relations in dynamic physical systems. Besides, some relational reasoning models in computer vision, such as (Santoro et al. 2017; Wang et al. 2018), have been shown to be variations of GNNs (Battaglia et al. 2018). Fi-GNN (Li et al. 2019) uses GNNs for feature interaction modeling in CTR prediction; however, it still models all feature interactions. Our model theoretically connects the beneficial feature interactions in recommender systems to the edge set in graphs and leverages the relational reasoning ability of GNNs to model the beneficial feature interactions' holistic contribution to the recommendation predictions.

L0 Regularization  L0 regularization sparsifies models by penalizing non-zero parameters. Because it is non-differentiable, it attracted little attention in deep learning until Louizos, Welling, and Kingma (2018) addressed this problem by proposing a hard concrete distribution for L0 regularization. Since then, L0 regularization has been commonly utilized to compress neural networks (Tsang et al. 2018; Shi, Glocker, and Castro 2019; Yang et al. 2017). We utilize L0 regularization to limit the number of detected edges in feature graphs for beneficial feature interaction detection.

Problem Formulation and Definitions

Consider a dataset with input-output pairs: D = {(X_n, y_n)}_{1≤n≤N}, where y_n ∈ R/Z, X_n = {c_k : x_k}_{k∈J_n} is a set of categorical features (c) with their values (x), J_n ⊆ J, and J is an index set of all features in D. For example, in app recommendation, X_n consists of a user ID, an app ID, and context features (e.g., Cloudy, Monday) with values of 1 (i.e., recorded in this data sample), and y_n is a binary value indicating whether the user will use this app. Our goal is to design a model F(X_n) that detects the most beneficial pairwise¹ feature interactions and utilizes only the detected feature interactions to predict the true output y_n.

¹ We focus on pairwise feature interactions in this paper, and we leave high-order feature interactions for future work.
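As a concrete illustration of this representation, the sketch below (our own, not taken from the paper; the feature names and values are hypothetical) stores one app-recommendation record as a set of feature indices with values plus a binary label, which is exactly the node set of that sample's feature graph:

```python
# A minimal sketch of the data representation described above (our illustration).
# Each sample X_n is a set of categorical features c_k with values x_k, and y_n
# is a binary label; for one-hot context features the value is simply 1.
from dataclasses import dataclass
from typing import Dict

# A global index set J mapping every categorical feature in D to an integer id.
feature_index: Dict[str, int] = {
    "user=349": 0, "app=Gmail": 1, "weekday=Monday": 2, "weather=Cloudy": 3,
}

@dataclass
class Sample:
    features: Dict[int, float]  # {feature id k in J_n : value x_k}
    label: int                  # y_n, e.g., 1 = the user used the app

# One hypothetical Frappe-style log: a user, an app, and two context features.
x_n = Sample(
    features={feature_index[name]: 1.0 for name in
              ["user=349", "app=Gmail", "weekday=Monday", "weather=Cloudy"]},
    label=1,
)
# L0-SIGN treats the four features as the nodes of a graph with no given edges;
# the candidate edge set is every pair of nodes (six pairs here).
```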
Beneficial Pairwise Feature Interactions  Inspired by the definition of relevant features by usefulness (Langley et al. 1994; Blum and Langley 1997), we formally define beneficial feature interactions in Definition 1.

Definition 1. (Beneficial Pairwise Feature Interactions) Given a data sample X = {x_i}_{1≤i≤k} of k features whose corresponding full pairwise feature interaction set is A = {(x_i, x_j)}_{1≤i,j≤k}, a set of pairwise feature interactions I_1 ⊆ A is more beneficial than another set of pairwise feature interactions I_2 ⊆ A to a model with X as input if the accuracy of the predictions that the model produces using I_1 is higher than the accuracy achieved using I_2.
The above definition formalizes our detection goal: find and retain only the part of the feature interactions that together produce the highest prediction accuracy for our model.

Statistical Interaction  Statistical interaction, or non-additive interaction, means that the joint influence of several variables on an output variable is not additive (Tsang et al. 2018). Sorokina et al. (2008) formally define the pairwise statistical interaction:

Definition 2. (Pairwise Statistical Interaction) A function F(X), where X = {x_i}_{1≤i≤k} has k variables, shows no pairwise statistical interaction between variables x_i and x_j if F(X) can be expressed as the sum of two functions f_\i and f_\j, where f_\i does not depend on x_i and f_\j does not depend on x_j:

    F(X) = f_\i(x_1, ..., x_{i-1}, x_{i+1}, ..., x_k) + f_\j(x_1, ..., x_{j-1}, x_{j+1}, ..., x_k).    (1)

More generally, if we use v_i ∈ R^d to describe the i-th variable with d factors (Rendle 2010), e.g., a variable embedding, each variable can be described in vector form as u_i = x_i v_i. Then, we define the pairwise statistical interaction in variable factor form by changing Equation 1 into:

    F(X) = f_\i(u_1, ..., u_{i-1}, u_{i+1}, ..., u_k) + f_\j(u_1, ..., u_{j-1}, u_{j+1}, ..., u_k).    (2)

The definition indicates that the interaction information of variables (features) x_i and x_j will not be captured by F(X) if it can be expressed as the above equations. Therefore, to correctly capture the interaction information, our model should not be expressible in the above form. In this paper, we theoretically prove that the interaction modeling in our model strictly follows the definition of pairwise statistical interaction, which ensures that our model correctly captures interaction information.
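To make Definition 2 concrete, the following toy example (ours, not part of the original paper) contrasts a function without and a function with a pairwise statistical interaction between x_1 and x_2:

```latex
% No statistical interaction between x_1 and x_2: the function splits into a
% summand free of x_1 and a summand free of x_2, matching Equation 1.
F_a(x_1, x_2, x_3) = x_1 x_3 + x_2 x_3
                   = f_{\backslash 2}(x_1, x_3) + f_{\backslash 1}(x_2, x_3)

% Statistical interaction between x_1 and x_2: the term x_1 x_2 depends on both
% variables jointly, so no decomposition of the form of Equation 1 exists.
F_b(x_1, x_2, x_3) = x_1 x_2 + x_3
```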
Our Proposed Model

In this section, we formally describe the overall structure of L0-SIGN and its two components in detail. Then, we provide theoretical analyses of our model.

L0-SIGN

Model Overview  Each input of L0-SIGN is represented as a graph (without edge information), where its features are nodes and their interactions are edges. More specifically, a data sample n is a graph G_n(X_n, E_n), and E_n = {(e_n)_ij}_{i,j∈X_n} is a set of edge/interaction values², where (e_n)_ij ∈ {1, 0}: 1 indicates that there is an edge (beneficial feature interaction) between nodes i and j, and 0 otherwise. Since no edge information is required, E_n = ∅.

² In this paper, nodes and features are used interchangeably, and likewise edges and feature interactions.

While predicting, the L0 edge prediction component, F_ep(X_n; ω), analyzes the existence of an edge for each pair of nodes, where ω are the parameters of F_ep, and outputs the predicted edge set E′_n. Then, the graph classification component, SIGN, performs predictions based on G(X_n, E′_n). Specifically, SIGN first conducts interaction modeling on each pair of initial node representations that are linked by an edge. Then, each node representation is updated by aggregating all of the corresponding modeling results. Finally, all updated node representations are aggregated to obtain the final prediction. The general form of the SIGN prediction function is y′_n = f_S(G_n(X_n, E′_n); θ), where θ are SIGN's parameters and the predicted outcome y′_n is the graph classification result. Therefore, the L0-SIGN prediction function f_LS is:

    f_LS(G_n(X_n, ∅); θ, ω) = f_S(G_n(X_n, F_ep(X_n; ω)); θ).    (3)

Figure 1: An overview of L0-SIGN (edge prediction, node update, and aggregation).

Figure 1 shows the structure of L0-SIGN³. Next, we describe the two components in detail. When describing them, we focus on one input-output pair, so we omit the index "n" for simplicity.

³ Section D of Appendix lists the pseudocode of our model and the training procedures.

L0 Edge Prediction Model  A (neural) matrix factorization (MF) based model is used for edge prediction. MF is effective in modeling relations between node pairs by factorizing the adjacency matrix of a graph into dense node embeddings (Menon and Elkan 2011). In L0-SIGN, since we do not have a ground truth adjacency matrix, the gradients for optimizing this component come from the errors between the outputs of SIGN and the target outcomes.

Specifically, the edge value, e′_ij ∈ E′, is predicted by an edge prediction function f_ep(v^e_i, v^e_j): R^{2×b} → Z_2, which takes a pair of node embeddings of dimension b as input and outputs a binary value indicating whether the two nodes are connected by an edge. Here v^e_i = o_i W^e is the embedding of node i for edge prediction, where W^e ∈ R^{|J|×b} are parameters and o_i is the one-hot embedding of node i. To ensure that the detection results are identical for the same pair of nodes, f_ep should be invariant to the order of its inputs, i.e., f_ep(v^e_i, v^e_j) = f_ep(v^e_j, v^e_i). For example, in our experiments, we use a multilayer neural network (MLP) whose input is the element-wise product of v^e_i and v^e_j (details are in the experiments section). Note that e′_ii = 1 can be regarded as feature i itself being beneficial.

During training, an L0 activation regularization is performed on E′ to minimize the number of detected edges, which will be described later in the section on the empirical risk minimization function of L0-SIGN.
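The following PyTorch-style sketch (ours; the released implementation referenced in the experiments section may be organized differently) illustrates the shape of such an edge prediction component: a b-dimensional embedding per feature, an element-wise product that makes the function symmetric in its two inputs, and an MLP that outputs the probability π_ij from which the binary edge value e′_ij is obtained:

```python
import torch
import torch.nn as nn

class EdgePredictor(nn.Module):
    """Sketch of f_ep: predicts an edge probability pi_ij for a node pair (i, j).

    Because the MLP is applied to the element-wise product v_i * v_j, the
    function is invariant to the order of its inputs, as required in the text.
    """
    def __init__(self, num_features: int, b: int = 8, hidden: int = 32):
        super().__init__()
        self.embed = nn.Embedding(num_features, b)   # rows of W^e, one per feature in J
        self.mlp = nn.Sequential(
            nn.Linear(b, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, i: torch.Tensor, j: torch.Tensor) -> torch.Tensor:
        v_i, v_j = self.embed(i), self.embed(j)       # v^e_i, v^e_j
        logits = self.mlp(v_i * v_j).squeeze(-1)      # symmetric in (i, j)
        return torch.sigmoid(logits)                  # pi_ij, probability of an edge
```

During training, these probabilities are passed through the relaxed L0 gate described in the empirical risk minimization section, so that the sampled edge values remain differentiable.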
SIGN  In SIGN, each node i is first represented as an initial node embedding v_i of d dimensions for interaction modeling; i.e., each node has node embeddings v^e_i and v_i for edge prediction and interaction modeling, respectively, to ensure the best performance on the respective tasks. Then, the interaction modeling is performed on each node pair (i, j) with e′_ij = 1, by a non-additive function h(u_i, u_j): R^{2×d} → R^d (e.g., an MLP), where u_i = x_i v_i. The output of h(u_i, u_j) is denoted as z_ij. Similar to f_ep, h should also be invariant to the order of its inputs. The above procedure can be reformulated as s_ij = e′_ij z_ij, where s_ij ∈ R^d is called the statistical interaction analysis result of (i, j).

Next, each node is updated by aggregating all of the analysis results between the node and its neighbors using a linear aggregation function ψ: v′_i = ψ(ς_i), where v′_i ∈ R^d is the updated embedding of node i and ς_i is the set of statistical interaction analysis results between node i and its neighbors. Note that ψ should be invariant to input permutations and be able to take inputs with a varying number of elements (e.g., element-wise summation/mean).

Finally, each updated node embedding is transformed into a scalar value by a linear function g: R^d → R, and all scalar values are linearly aggregated as the output of SIGN. That is, y′ = φ(ν), where ν = {g(u′_i) | i ∈ X}, u′_i = x_i v′_i, and φ: R^{|ν|×1} → R is an aggregation function with properties similar to ψ. Therefore, the prediction function of SIGN is:

    f_S(G; θ) = φ({g(ψ({e′_ij h(u_i, u_j)}_{j∈X}))}_{i∈X}).    (4)

In summary, we formulate the L0-SIGN prediction function of Equation 3 with the two components in detail⁴:

    f_LS(G; ω, θ) = φ({g(ψ({f_ep(v^e_i, v^e_j) h(u_i, u_j)}_{j∈X}))}_{i∈X}).    (5)

⁴ The time complexity analysis is in Section E of Appendix.
                                                                      ciple can be mathematically represented as:
Empirical Risk Minimization Function                                                  min    (I(X; S) − βI(S; Y )),                 (7)
The empirical risk minimization function of L0 -SIGN min-
imizes a loss function, a reparameterized L0 activation reg-          where I(·) denotes mutual information between two vari-
                   0
ularization5 on En , and an L2 activation regularization on           ables and β > 0 is a weight factor.
zn . Instead of regularizing parameters, activation regular-            The empirical risk minimization function of L0 -SIGN in
ization regularizes the output of models (Merity, McCann,             Equation 6 can be approximately derived from Equation 7
and Socher 2017). We leverage activation regularization to            (Section A of Appendix gives a detailed derivation):
link our model with the IB principle to ensures the success                    min R(θ, ω) ≈ min(I(X; S) − βI(S; Y )).              (8)
of the interaction detection, which will be discussed in the
theoretical analyses. Formally, the function is:                      Intuitively, the L0 regularization in Equation 6 minimizes
                                                                                                                             0
                         N                                            a Kullback–Leibler divergence between every eij and a
                      1 X
          R(θ, ω) =        (L(FLS (Gn ; ω, θ), yn )                   Bernoulli distribution Bern(0), and the L2 regularization
                      N n=1                                           minimizes a Kullback–Leibler divergence between every zij
                                                                      and a multivariate standard distribution N (0, I). As illus-
                            X
                      +λ1      (πn )ij + λ2 kzn k2 ),           (6)
                            i,j∈Xn
                                                                      trated in Figure 2, the statistical interaction analysis results,
                                                                      s, can be approximated as the relevant part S in Equation 7.
               θ ∗ , ω ∗ = arg min R(θ, ω),
                            θ,ω                                          Therefore, through training L0 -SIGN, s is the most com-
                                              0                       pact (beneficial) representation of the interactions for recom-
where (πn )ij is the probability of (en )ij being 1 (i.e.,            mendation prediciton. This provides a theoretical guarantee
  0
(en )ij = Bern((πn )ij )), Gn = Gn (Xn , ∅), λ1 and λ2 are            that the most beneficial feature interactions will be detected.
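The exact form of the approximation is given in the appendix; the sketch below (ours) follows the standard hard concrete gate of Louizos, Welling, and Kingma (2018) to show how such a relaxation is typically implemented:

```python
import math
import torch

def hard_concrete_gate(log_alpha: torch.Tensor, beta: float = 2.0 / 3.0,
                       gamma: float = -0.1, zeta: float = 1.1, training: bool = True):
    """Differentiable surrogate for a Bernoulli edge gate e'_ij (Louizos et al. 2018).

    Returns (gate, expected_l0): `gate` lies in [0, 1] and is used in place of the
    binary edge value; `expected_l0` is the differentiable penalty added to the
    loss with weight lambda_1 (the L0 activation regularization in Equation 6).
    """
    if training:
        u = torch.rand_like(log_alpha).clamp(1e-6, 1.0 - 1e-6)   # u ~ Uniform(0, 1)
        s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + log_alpha) / beta)
    else:
        s = torch.sigmoid(log_alpha)                              # deterministic at test time
    s_bar = s * (zeta - gamma) + gamma                            # stretch to (gamma, zeta)
    gate = s_bar.clamp(0.0, 1.0)                                  # hard-clip into [0, 1]
    # Probability that the gate is non-zero, i.e. the expected L0 count, in closed form.
    expected_l0 = torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta))
    return gate, expected_l0.sum()
```

Here log_alpha would be produced from the edge prediction logits, so minimizing the penalty pushes unbeneficial edges towards an exact zero gate.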
Theoretical Analyses

We conduct theoretical analyses to verify our model's effectiveness, including how L0-SIGN satisfies the IB principle to guarantee the success of beneficial interaction detection, how the statistical interaction analysis results relate to the spike-and-slab distribution (the gold standard in sparsity), and how SIGN and L0-SIGN strictly follow the statistical interaction theory for effective interaction modeling.

Satisfaction of Information Bottleneck (IB) Principle  The IB principle (Tishby, Pereira, and Bialek 2000) aims to extract the most relevant information that input random variables X contain about output variables Y by considering a trade-off between the accuracy and complexity of the process. The relevant part of X over Y is denoted by S. The IB principle can be mathematically represented as:

    min (I(X; S) − βI(S; Y)),    (7)

where I(·) denotes the mutual information between two variables and β > 0 is a weight factor.

The empirical risk minimization function of L0-SIGN in Equation 6 can be approximately derived from Equation 7 (Section A of Appendix gives a detailed derivation):

    min R(θ, ω) ≈ min (I(X; S) − βI(S; Y)).    (8)

Intuitively, the L0 regularization in Equation 6 minimizes a Kullback–Leibler divergence between every e′_ij and a Bernoulli distribution Bern(0), and the L2 regularization minimizes a Kullback–Leibler divergence between every z_ij and a multivariate standard normal distribution N(0, I). As illustrated in Figure 2, the statistical interaction analysis results, s, can be approximated as the relevant part S in Equation 7.

Figure 2: The interaction analysis results s from L0-SIGN are like the relevant part (bottleneck) in the IB principle.

Therefore, through training L0-SIGN, s is the most compact (beneficial) representation of the interactions for recommendation prediction. This provides a theoretical guarantee that the most beneficial feature interactions will be detected.
Relation to Spike-and-slab Distribution  The spike-and-slab distribution (Mitchell and Beauchamp 1988) is the gold standard in sparsity. It is defined as a mixture of a delta spike at zero and a continuous distribution over the real line (e.g., a standard normal):

    p(a) = Bern(π),    p(θ | a = 0) = δ(θ),    p(θ | a = 1) = N(θ | 0, 1).    (9)

We can regard the spike-and-slab distribution as the product of a continuous distribution and a Bernoulli distribution. In L0-SIGN, the predicted edge value vector e′ (the vector form of E′) follows a multivariate Bernoulli distribution and can be regarded as p(a). The interaction modeling result z follows a multivariate normal distribution and can be regarded as p(θ). Therefore, L0-SIGN's statistical interaction analysis results, s, follow a multivariate spike-and-slab distribution that performs edge sparsification by discarding unbeneficial feature interactions. The retained edges in the spike-and-slab distribution are sufficient for L0-SIGN to provide accurate predictions.
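As a quick numerical illustration of this correspondence (ours, with arbitrary toy values), sampling the analysis results as the product of a Bernoulli edge variable and a Gaussian interaction vector reproduces the spike-and-slab form of Equation 9:

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs, d = 6, 4                       # toy sizes: 6 candidate edges, d = 4 factors
pi = np.full(num_pairs, 0.3)              # edge probabilities pi_ij from the edge predictor
e = rng.binomial(1, pi)                   # spike: e'_ij ~ Bern(pi_ij)
z = rng.standard_normal((num_pairs, d))   # slab: z_ij ~ N(0, I) under the L2 regularization
s = e[:, None] * z                        # s_ij = e'_ij * z_ij: exactly zero where e'_ij = 0
print(e)                                  # most candidate interactions are discarded
print(s.round(2))                         # zero rows for discarded edges, Gaussian rows otherwise
```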
Statistical Interaction in SIGN  The feature interaction modeling in SIGN strictly follows the definition of statistical interaction, which is formally described in Theorem 1 (Section B of Appendix gives the proof):

Theorem 1. (Statistical Interaction in SIGN) Consider a graph G(X, E), where X is the node set and E = {e_ij}_{i,j∈X} is the edge set with e_ij ∈ {0, 1} and e_ij = e_ji. Let G(X, E) be the input of the SIGN function f_S(G) in Equation 4; then the function flags a pairwise statistical interaction between node i and node j if and only if they are linked by an edge in G(X, E), i.e., e_ij = 1.

Theorem 1 guarantees that SIGN will capture the interaction information from node pairs iff they are linked by an edge. This ensures that SIGN accurately leverages the detected beneficial feature interactions both for inferring the target outcome and for providing useful feedback to the L0 edge prediction component for better detection.

Statistical Interaction in L0-SIGN  L0-SIGN provides the same feature interaction modeling ability as SIGN, since we can simply extend Theorem 1 to Corollary 1.1 (Section C of Appendix gives the proof):

Corollary 1.1. (Statistical Interaction in L0-SIGN) Consider a graph G whose edge set is unknown. Let G be the input of the L0-SIGN function F_LS(G) in Equation 5; then the function flags a pairwise statistical interaction between node i and node j if and only if they are predicted to be linked by an edge in G by L0-SIGN, i.e., e′_ij = 1.

Experiments

We focus on answering three questions: (i) How does L0-SIGN perform compared to the baselines, and does SIGN help detect more beneficial interactions for better performance? (ii) How good is the detection ability of L0-SIGN? (iii) Can the statistical interaction analysis results provide potential explanations for the recommendation predictions?

Experimental Protocol

Datasets  We study two real-world datasets for recommender systems to evaluate our model:
Frappe (Baltrunas et al. 2015) is a context-aware recommendation dataset that records app usage logs from different users with eight types of contexts (e.g., weather). Each log is a graph (without edges), and the nodes are the user ID, the app ID, or the contexts.
MovieLens-tag (He and Chua 2017) focuses on movie tag recommendation (e.g., "must-see"). Each data instance is a graph, with nodes being the user ID, the movie ID, and a tag that the user gives to the movie.
To evaluate question (ii), we further study two datasets for graph classification, which will be discussed later. The statistics of the datasets are summarized in Table 1.

Dataset      #Features   #Graphs      #Nodes/Graph
Frappe       5,382       288,609      10
MovieLens    90,445      2,006,859    3
Twitter      1,323       144,033      4.03
DBLP         41,324      19,456       10.48

Table 1: Dataset statistics. All datasets are denoted in graph form. The Twitter and DBLP datasets are used for question (ii).

Baselines  We compare our model with recommender system baselines that model all feature interactions:
FM (Koren 2008): It is one of the most popular recommendation algorithms that model every feature interaction.
AFM (Xiao et al. 2017): In addition to FM, it calculates an attention value for each feature interaction.
NFM (He and Chua 2017): It replaces the dot product procedure of FM with an MLP.
DeepFM (Guo et al. 2017): It uses an MLP and FM for interaction analysis, respectively.
xDeepFM (Lian et al. 2018): It is an extension of DeepFM that models feature interactions in both an explicit and an implicit way.
AutoInt (Song et al. 2019): It explicitly models all feature interactions using a multi-head self-attentive neural network.
We use the same MLP settings in all baselines (if used) as in our interaction modeling function h in SIGN for a fair comparison.

Experimental Set-up  In the experiments, we use the element-wise mean as both linear aggregation functions ψ(·) and φ(·). The linear function g(·) is a weighted sum (i.e., g(u′_i) = w_g^T u′_i, where w_g ∈ R^{d×1} are the weight parameters). For the interaction modeling function h(·), we use an MLP with one hidden layer applied after an element-wise product: h(u_i, u_j) = W^h_2 σ(W^h_1 (u_i ⊙ u_j) + b^h_1) + b^h_2, where W^h_1, W^h_2, b^h_1, b^h_2 are the parameters of the MLP and σ(·) is a ReLU activation function. We implement the edge prediction model based on the neural collaborative filtering framework (He et al. 2017), which has a similar form to h(·): f_ep(v^e_i, v^e_j) = W^e_2 σ(W^e_1 (v^e_i ⊙ v^e_j) + b^e_1) + b^e_2. We set the node embedding sizes for both interaction modeling and edge prediction to 8 (i.e., b, d = 8) and the sizes of the hidden layers of both h and f_ep to 32. We choose the weighting factors λ_1 and λ_2 from [1 × 10^-5, 1 × 10^-1] that produce the best performance on each dataset⁶.

⁶ Our implementation of the L0-SIGN model is available at https://github.com/ruizhang-ai/SIGN-Detecting-Beneficial-Feature-Interactions-for-Recommender-Systems.
Each dataset is randomly split into training, validation, and test sets with proportions of 70%, 15%, and 15%. We choose the model parameters that produce the best results on the validation set once the number of predicted edges becomes steady. We use accuracy (ACC) and the area under the curve computed with Riemann sums (AUC) as evaluation metrics.
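A minimal sketch of this evaluation protocol (ours; model training and dataset loading are omitted, and the helper names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

def split_and_evaluate(samples, labels, predict_fn, seed=0):
    """70%/15%/15% random split and the AUC/ACC metrics used in Table 2 (our sketch)."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.30, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, random_state=seed)
    # ... train the model on (x_train, y_train) and select it on (x_val, y_val) ...
    scores = np.asarray(predict_fn(x_test))            # predicted probabilities y'_n
    preds = (scores >= 0.5).astype(int)
    return {"AUC": roc_auc_score(y_test, scores),
            "ACC": accuracy_score(y_test, preds)}
```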
Model Performance

Table 2 shows the results of comparing our model with the recommendation baselines, with the best results for each dataset in bold. The results of SIGN use all feature interactions as input, i.e., the input is a complete graph.

             Frappe              MovieLens
             AUC       ACC       AUC       ACC
FM           0.9263    0.8729    0.9190    0.8694
AFM          0.9361    0.8882    0.9205    0.8711
NFM          0.9413    0.8928    0.9342    0.8903
DeepFM       0.9422    0.8931    0.9339    0.8895
xDeepFM      0.9435    0.8950    0.9347    0.8906
AutoInt      0.9432    0.8947    0.9351    0.8912
SIGN         0.9448    0.8974    0.9354    0.8921
L0-SIGN      0.9580    0.9174    0.9407    0.8970

Table 2: Summary of results in comparison with baselines.

From the table, we observe that: (i) L0-SIGN outperforms all baselines, which shows L0-SIGN's ability to provide accurate recommendations. Meanwhile, SIGN alone achieves results comparable to competitive baselines, which shows the effectiveness of SIGN in modeling feature interactions for recommendation. (ii) L0-SIGN gains a significant improvement over SIGN. This shows that more accurate predictions can be delivered by retaining only the beneficial feature interactions and effectively modeling them. (iii) FM and AFM (which rely purely on the dot product to model interactions) achieve lower accuracy than the other models, which shows the necessity of using more sophisticated methods (e.g., MLPs) to model feature interactions for better predictions. (iv) The models that explicitly model feature interactions (xDeepFM, AutoInt, and L0-SIGN) outperform those that implicitly model feature interactions (NFM, DeepFM). This shows that explicit feature interaction analysis is promising in delivering accurate predictions.

Comparing SIGN with Other GNNs in Our Model

To evaluate whether our L0 edge prediction technique can be used with other GNNs and whether SIGN is more suitable than other GNNs in our model, we replace SIGN with existing GNNs: GCN (Kipf and Welling 2017), Chebyshev filter based GCN (Cheby) (Defferrard, Bresson, and Vandergheynst 2016), and GIN (Xu et al. 2019). We run on two datasets for graph classification since they contain heuristic edges (used to compare with the predicted edges):
Twitter (Pan, Wu, and Zhu 2015) is extracted from Twitter sentiment classification. Each tweet is a graph, with nodes being word tokens and edges being the co-occurrence of two tokens in the tweet.
DBLP (Pan et al. 2013) consists of papers with labels indicating whether they are from the DBDM or the CVPR field. Each paper is a graph, with nodes being the paper ID or keywords and edges being the citation relationships or keyword relations.

             Twitter             DBLP
             AUC       ACC       AUC       ACC
GCN          0.7049    0.6537    0.9719    0.9289
L0-GCN       0.7053    0.6543    0.9731    0.9301
Cheby        0.7076    0.6522    0.9717    0.9291
L0-Cheby     0.7079    0.6519    0.9719    0.9297
GIN          0.7149    0.6559    0.9764    0.9319
L0-GIN       0.7159    0.6572    0.9787    0.9328
SIGN         0.7201    0.6615    0.9761    0.9316
L0-SIGN      0.7231    0.6670    0.9836    0.9427

Table 3: The results in comparison with existing GNNs. The model names without "L0-" use heuristic edges, and those with "L0-" automatically detect edges via our L0 edge prediction technique.

Table 3 shows the accuracy of each GNN when run both on the given (heuristic) edges (without "L0-" in the name) and on edges predicted with our L0 edge prediction model (with "L0-" in the name). We can see that, for all GNNs, the L0-GNNs achieve competitive results compared to the corresponding GNNs. This shows that our model framework lifts the requirement of domain knowledge for defining edges in order to use GNNs in some situations. Also, L0-SIGN outperforms the other GNNs and L0-GNNs, which shows L0-SIGN's ability to detect beneficial feature interactions and leverage them to perform accurate predictions. Figure 3 shows each GNN's improvement from "GNN" to "L0-GNN" in Table 3. Among all the GNN-based models, SIGN gains the largest improvement from the L0 edge prediction technique compared with SIGN without L0 edge prediction. This shows that SIGN can help the L0 edge prediction model better detect beneficial feature interactions for more accurate predictions.

Figure 3: The comparison of the improvement (Imp. %) via the L0 edge prediction technique over using the heuristic edges in Table 3.

Evaluation of Interaction Detection

We then evaluate the effectiveness of the beneficial feature interaction detection in L0-SIGN.

Prediction Accuracy vs. Number of Edges  Figure 4 shows the changes in prediction accuracy and the number of edges included while training. The accuracy first increases while the number of included edges decreases dramatically. Then, the number of edges becomes steady, and the accuracy reaches a peak at similar epochs. This shows that our model can recognize unbeneficial feature interactions and remove them for better prediction accuracy.
Figure 4: The changes in prediction accuracy and number of edges (in percentage) while training, on Frappe, MovieLens, Twitter, and DBLP.

Figure 5: Evaluating different numbers of edges. "Pred" denotes the predicted edges and "Rev" the reversed edges.

Predicted Edges vs. Reversed Edges  We evaluate how the predicted edges and the reversed edges (the excluded edges) influence the performance. We generate 5 edge sets with different numbers of edges by randomly selecting from 20% of the predicted edges (ratio 0.2) to 100% of the predicted edges (ratio 1.0). We generate another 5 edge sets similarly from the reversed edges. Then, we run SIGN on each edge set 5 times and show the averaged results in Figure 5. The figure shows that using predicted edges always yields higher accuracy than using reversed edges. The accuracy stops increasing for reversed edges after ratio 0.6, while it continually improves when more predicted edges are used. According to Definition 1, the detected edges are therefore beneficial, since considering more of them provides further performance gains and the performance is best when all predicted edges are used, whereas considering more reversed edges does not. Note that the improvement for reversed edges from ratio 0.2 to 0.6 may come from covering more nodes, since features alone can provide some useful information for prediction.
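A sketch of how such edge subsets can be drawn (ours; the paper does not specify the sampling procedure beyond the ratios):

```python
import random

def edge_subsets(edges, ratios=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Randomly draw subsets of a given edge list at several ratios (our sketch).

    `edges` is either the list of predicted edges or the list of reversed
    (excluded) edges; SIGN is then run 5 times on each subset and the
    AUC/ACC results are averaged, as in Figure 5.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    return {r: shuffled[: int(r * len(shuffled))] for r in ratios}
```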
Figure 6: (a) The highest interaction values with Gmail; the numbers are city indexes (e.g., 1014). (b) The prediction that the user u349 will use Gmail, as the prediction value is 1.313 > 0. Darker edges indicate higher interaction values.

Case Study

Another advantage of interaction detection is that it automatically discovers potential explanations for recommendation predictions. We conduct case studies on the Frappe dataset to show how L0-SIGN provides such explanations.

We first show the beneficial interactions that have the highest interaction values with Gmail in Figure 6a. We can see that Workday has a high positive value, while Home has a high negative value. This may indicate that Gmail is very likely to be used on workdays, since people need to use Gmail while working, whereas Gmail is not likely to be used at home, since people may not want to deal with email while resting. We then show how L0-SIGN provides potential explanations for the prediction on each data sample. Figure 6b visualizes a prediction result that a user (u349) may use Gmail. The graph shows useful information for explanations. Besides beneficial interactions such as < Gmail, Workday > discussed above, we can also see that Cloudy and Morning have no beneficial interaction, meaning that whether it is a cloudy morning does not contribute to deciding whether the user will use Gmail.

Conclusion and Future Work

We are the first to propose and formulate the problem of detecting beneficial feature interactions for recommender systems. We propose L0-SIGN to detect the beneficial feature interactions via a graph neural network approach and L0 regularization. Theoretical analyses and extensive experiments show the ability of L0-SIGN in detecting and modeling beneficial feature interactions for accurate recommendations. In future work, we will model high-order feature interactions in GNNs with theoretical foundations to effectively capture high-order feature interaction information in graph form.
Acknowledgments                                Lian, J.; Zhou, X.; Zhang, F.; Chen, Z.; Xie, X.; and Sun,
This work is supported by the China Scholarship Council.         G. 2018. xdeepfm: Combining Explicit and Implicit Feature
                                                                 Interactions for Recommender Systems. In SIGKDD, 1754–
                                                                 1763.
References

Alemi, A. A.; Fischer, I.; Dillon, J. V.; and Murphy, K. 2017. Deep Variational Information Bottleneck. In ICLR, 1–16.
Baltrunas, L.; Church, K.; Karatzoglou, A.; and Oliver, N. 2015. Frappe: Understanding the Usage and Perception of Mobile App Recommendations In-the-Wild. arXiv preprint arXiv:1505.03014.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; and Faulkner, R. 2018. Relational Inductive Biases, Deep Learning, and Graph Networks. arXiv preprint arXiv:1806.01261.
Blondel, M.; Ishihata, M.; Fujino, A.; and Ueda, N. 2016. Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms. In ICML, 850–858.
Blum, A. L.; and Langley, P. 1997. Selection of Relevant Features and Examples in Machine Learning. AI 97(1-2): 245–271.
Chang, M. B.; Ullman, T.; Torralba, A.; and Tenenbaum, J. B. 2016. A Compositional Object-Based Approach to Learning Physical Dynamics. In ICLR, 1–14.
Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NeurIPS, 3844–3852.
Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In NeurIPS, 2224–2232.
Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural Message Passing for Quantum Chemistry. In ICML, 1263–1272.
Guo, H.; Tang, R.; Ye, Y.; Li, Z.; and He, X. 2017. DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction. In IJCAI, 1725–1731.
He, X.; and Chua, T.-S. 2017. Neural Factorization Machines for Sparse Predictive Analytics. In SIGIR, 355–364.
He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural Collaborative Filtering. In WWW, 173–182.
Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR, 1–14.
Koren, Y. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In SIGKDD, 426–434.
Langley, P.; et al. 1994. Selection of Relevant Features in Machine Learning. In AAAI, volume 184, 245–271.
Li, Z.; Cui, Z.; Wu, S.; Zhang, X.; and Wang, L. 2019. Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction. In CIKM, 539–548.
Lian, J.; Zhou, X.; Zhang, F.; Chen, Z.; Xie, X.; and Sun, G. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. In SIGKDD, 1754–1763.
Liu, B.; Zhu, C.; Li, G.; Zhang, W.; Lai, J.; Tang, R.; He, X.; Li, Z.; and Yu, Y. 2020. AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. In SIGKDD, 2636–2645.
Louizos, C.; Welling, M.; and Kingma, D. P. 2018. Learning Sparse Neural Networks through L0 Regularization. In ICLR, 1–11.
Lu, J.; Wu, D.; Mao, M.; Wang, W.; and Zhang, G. 2015. Recommender System Application Developments: A Survey. DSS 74: 12–32.
Luo, Y.; Wang, M.; Zhou, H.; Yao, Q.; Tu, W.-W.; Chen, Y.; Dai, W.; and Yang, Q. 2019. AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. In SIGKDD, 1936–1945.
MacKay, D. J. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.
Maddison, C. J.; Mnih, A.; and Teh, Y. W. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In ICLR, 1–12.
Menon, A. K.; and Elkan, C. 2011. Link Prediction via Matrix Factorization. In ECML PKDD, 437–452.
Merity, S.; McCann, B.; and Socher, R. 2017. Revisiting Activation Regularization for Language RNNs. In ICML Workshop.
Mitchell, T. J.; and Beauchamp, J. J. 1988. Bayesian Variable Selection in Linear Regression. ASA 83(404): 1023–1032.
Pan, S.; Wu, J.; and Zhu, X. 2015. CogBoost: Boosting for Fast Cost-Sensitive Graph Classification. In TKDE, 2933–2946.
Pan, S.; Zhu, X.; Zhang, C.; and Philip, S. Y. 2013. Graph Stream Classification Using Labeled and Unlabeled Graphs. In ICDE, 398–409.
Rendle, S. 2010. Factorization Machines. In ICDM, 995–1000.
Santoro, A.; Raposo, D.; Barrett, D. G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lillicrap, T. 2017. A Simple Neural Network Module for Relational Reasoning. In NeurIPS, 4967–4976.
Shi, C.; Glocker, B.; and Castro, D. C. 2019. PVAE: Learning Disentangled Representations with Intrinsic Dimension via Approximated L0 Regularization. In PMLR, 1–6.
Shi, Y.; Larson, M.; and Hanjalic, A. 2014. Collaborative Filtering Beyond the User-Item Matrix: A Survey of the State of the Art and Future Challenges. CSUR 47(1): 1–45.
Siegmund, N.; Kolesnikov, S. S.; Kästner, C.; Apel, S.; Batory, D.; Rosenmüller, M.; and Saake, G. 2012. Predicting Performance via Automated Feature-Interaction Detection. In ICSE, 167–177.
Song, W.; Shi, C.; Xiao, Z.; Duan, Z.; Xu, Y.; Zhang, M.; and Tang, J. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In CIKM, 1161–1170.
Sorokina, D.; Caruana, R.; Riedewald, M.; and Fink, D. 2008. Detecting Statistical Interactions with Additive Groves of Trees. In ICML, 1000–1007.
Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The Information Bottleneck Method. arXiv preprint physics/0004057.
Tsang, M.; Liu, H.; Purushotham, S.; Murali, P.; and Liu, Y. 2018. Neural Interaction Transparency (NIT): Disentangling Learned Interactions for Improved Interpretability. In NeurIPS, 5804–5813.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local Neural Networks. In CVPR, 7794–7803.
Wang, X.; Zhang, R.; Sun, Y.; and Qi, J. 2019. Doubly Robust Joint Learning for Recommendation on Data Missing Not at Random. In ICML, 6638–6647.
Xiao, J.; Ye, H.; He, X.; Zhang, H.; Wu, F.; and Chua, T.-S. 2017. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In IJCAI, 3119–3125.
Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2019. How Powerful are Graph Neural Networks? In ICLR, 1–13.
Yang, C.; Bai, L.; Zhang, C.; Yuan, Q.; and Han, J. 2017. Bridging Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for POI Recommendation. In SIGKDD, 1245–1254.
Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2017. Understanding Deep Learning Requires Rethinking Generalization. In ICLR, 1–11.
Derivation from IB to L0-SIGN

Recall that the empirical risk minimization procedure of L0-SIGN in Equation 6 is:

$$R(\theta, \omega) = \frac{1}{N}\sum_{n=1}^{N}\Big(\mathcal{L}(F_{LS}(G_n; \omega, \theta), y_n) + \lambda_1 \sum_{i,j \in X_n} (\pi_n)_{ij} + \lambda_2 \|z_n\|^2\Big).$$

The deep variational information bottleneck method (Alemi et al. 2017) performs a variational approximation to the Information Bottleneck principle (Equation 7). Specifically, the objective can be approximated by maximizing a lower bound $L$:

$$L \approx \frac{1}{N}\sum_{n=1}^{N}\int d\tilde{s}_n \Big[ p(\tilde{s}_n \mid \tilde{x}_n)\log q(\tilde{y}_n \mid \tilde{s}_n) - \beta\, p(\tilde{s}_n \mid \tilde{x}_n)\log\frac{p(\tilde{s}_n \mid \tilde{x}_n)}{r(\tilde{s}_n)}\Big], \quad (10)$$

where $\tilde{x}_n$, $\tilde{y}_n$, $\tilde{s}_n$ are the input, output, and middle state, respectively, and $r(\tilde{s}_n)$ is the variational approximation to the marginal $p(\tilde{s}_n)$.

Maximizing the lower bound $L$ is then equivalent to minimizing $J_{IB}$:

$$J_{IB} = \frac{1}{N}\sum_{n=1}^{N}\Big(\mathbb{E}_{\tilde{s}_n \sim p(\tilde{s}_n \mid \tilde{x}_n)}[-\log q(\tilde{y}_n \mid \tilde{s}_n)] + \beta\, \mathrm{KL}[p(\tilde{s}_n \mid \tilde{x}_n), r(\tilde{s}_n)]\Big), \quad (11)$$

where $\mathrm{KL}[p(\tilde{s}_n \mid \tilde{x}_n), r(\tilde{s}_n)]$ is the Kullback–Leibler divergence between $p(\tilde{s}_n \mid \tilde{x}_n)$ and $r(\tilde{s}_n)$.

In L0-SIGN, the middle state between input and output is the statistical interaction analysis result $s_n$, which is the multiplication of the predicted edge values $e'_n$ (the vector form of $E'_n$) and the interaction modeling results $z_n$. Each $(e'_n)_{ij}$ follows $\mathrm{Bern}((\pi_n)_{ij})$, so $e'_n$ follows a multivariate Bernoulli distribution, denoted $p(e'_n \mid X_n)$. Similarly, each $(z_n)_{ij}$ follows a multivariate normal distribution $\mathcal{N}((z_n)_{ij}, \Sigma_{ij})$, so $z_n$ follows a multivariate normal distribution, denoted $p(z_n \mid X_n)$. Therefore, the distribution of $s_n$ (denoted $p(s_n \mid X_n)$) is:

$$p(s_n \mid X_n) = p(e'_n \mid X_n)\, p(z_n \mid X_n) = \big\Vert_{i,j \in X_n} \big[\mathrm{Bern}(\pi_{ij})\, \mathcal{N}((z_n)_{ij}, \Sigma_{ij})\big], \quad (12)$$

where $\Sigma_{ij}$ is a covariance matrix and $\Vert$ denotes concatenation.

Meanwhile, we set the variational approximation of $s_n$ to be a concatenated multiplication of normal distributions with mean 0 and variance 1, and Bernoulli distributions whose probability of being 0 is 1. The variational approximation in vector form is:

$$r(s_n) = \mathrm{Bern}(0)\, \mathcal{N}(0, I), \quad (13)$$

where $I$ is an identity matrix.

Combining Equation 12 and Equation 13 into Equation 11, the minimization objective corresponding to L0-SIGN becomes:

$$J_{IB} = \frac{1}{N}\sum_{n=1}^{N}\Big(\mathbb{E}_{s_n \sim p(s_n \mid X_n)}[-\log q(y_n \mid s_n)] + \beta\, \mathrm{KL}[p(s_n \mid X_n), r(s_n)]\Big). \quad (14)$$

Next, we use the forward Kullback–Leibler divergence $\mathrm{KL}[r(s_n), p(s_n \mid X_n)]$ to approximate the reverse Kullback–Leibler divergence in Equation 14, so that the divergence can be properly derived into the $L_0$ and $L_2$ activation regularization terms (illustrated in Equation 17 and Equation 18). This approximation is valid because, when the variational approximation $r(\tilde{z}_n)$ contains only one mode (e.g., a Bernoulli or normal distribution), both the forward and reverse Kullback–Leibler divergences force $p(\tilde{z}_n \mid \tilde{x}_n)$ to cover that single mode and therefore have the same effect (MacKay 2003). Equation 14 then becomes:

$$
\begin{aligned}
J_{IB} &\approx \frac{1}{N}\sum_{n=1}^{N}\Big(\mathbb{E}_{s_n \sim p(s_n \mid X_n)}[-\log q(y_n \mid s_n)] + \beta\, \mathrm{KL}[r(s_n), p(s_n \mid X_n)]\Big) \\
&= \frac{1}{N}\sum_{n=1}^{N}\Big(\mathbb{E}_{s_n \sim p(s_n \mid X_n)}[-\log q(y_n \mid s_n)] + \beta\, \mathrm{KL}[\mathrm{Bern}(0)\,\mathcal{N}(0, I),\; p(e'_n \mid X_n)\, p(z_n \mid X_n)]\Big) \\
&= \frac{1}{N}\sum_{n=1}^{N}\Big(\mathbb{E}_{s_n \sim p(s_n \mid X_n)}[-\log q(y_n \mid s_n)] + \beta\big(d\, \mathrm{KL}[\mathrm{Bern}(0), p(e'_n \mid X_n)] + \mathrm{KL}[\mathcal{N}(0, I), p(z_n \mid X_n)]\big)\Big),
\end{aligned} \quad (15)
$$

where $d$ is the dimension of each vector $(z_n)_{ij}$.

Minimizing $\mathbb{E}_{s_n \sim p(s_n \mid X_n)}[-\log q(y_n \mid s_n)]$ in Equation 15 is equivalent to minimizing $\mathcal{L}(f_{LS}(G_n; \omega, \theta), y_n)$ in Equation 6: it is the negative log likelihood of the prediction, used as the loss function of L0-SIGN, with $p(s_n \mid X_n)$ corresponding to the edge prediction and interaction modeling procedures, and $q(y_n \mid s_n)$ corresponding to the aggregation from the statistical interaction analysis result to the outcome $y_n$.

For the term $\mathrm{KL}[\mathrm{Bern}(0), p(e'_n \mid X_n)]$ in Equation 15, $p(e'_n \mid X_n) = \mathrm{Bern}(\pi_n)$ is a multivariate Bernoulli distribution, so the KL divergence is:

$$
\mathrm{KL}[\mathrm{Bern}(0), p(e'_n \mid X_n)] = \mathrm{KL}[\mathrm{Bern}(0), \mathrm{Bern}(\pi_n)]
= \sum_{i,j \in X_n}\Big(0 \log \frac{0}{(\pi_n)_{ij}} + (1-0)\log \frac{1-0}{1-(\pi_n)_{ij}}\Big)
= \sum_{i,j \in X_n} \log \frac{1}{1-(\pi_n)_{ij}}. \quad (16)
$$
[Figure 7: Linear Approximation (γ = 1) vs. Bernoulli KL Divergence, plotted against π ∈ [0, 1].]

Next, we use a linear function $\gamma(\pi_n)_{ij}$ to approximate $\log \frac{1}{1-(\pi_n)_{ij}}$ in Equation 16, where $\gamma > 0$ is a scalar constant. Figure 7 shows the values (penalization) of the Bernoulli KL divergence and its linear approximation for different $\pi$ values. Both the Bernoulli KL divergence and its approximation are monotonically increasing, so in the empirical risk minimization procedure they have similar effects in penalizing $\pi > 0$. In addition, the approximation is more suitable for our model because: (i) it penalizes more than the Bernoulli KL divergence when $\pi$ approaches 0 (putting more effort into removing unbeneficial feature interactions); and (ii) it gives a reasonable (finite) penalization when $\pi$ approaches 1 (retaining beneficial feature interactions), whereas the Bernoulli KL divergence produces an infinite penalization when $\pi = 1$.

The KL divergence of the multivariate Bernoulli distribution can therefore be approximated as:

$$\mathrm{KL}[\mathrm{Bern}(0), p(e'_n \mid X_n)] = \sum_{i,j \in X_n} \log \frac{1}{1-(\pi_n)_{ij}} \approx \sum_{i,j \in X_n} \gamma (\pi_n)_{ij}. \quad (17)$$
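To make the comparison in Figure 7 concrete, here is a small, illustrative Python sketch (ours, not part of the paper's implementation) that evaluates the exact Bernoulli KL penalty from Equation 16 and the linear surrogate $\gamma\pi$ from Equation 17 on a grid of $\pi$ values; the grid and $\gamma = 1$ are assumptions matching the setting plotted in the figure.

```python
import numpy as np

def bernoulli_kl(pi):
    """Exact KL[Bern(0) || Bern(pi)] = log(1 / (1 - pi)), as in Equation 16."""
    return np.log(1.0 / (1.0 - pi))

def linear_surrogate(pi, gamma=1.0):
    """Linear approximation gamma * pi used as the L0 activation penalty (Equation 17)."""
    return gamma * pi

# Compare the two penalties on a grid of edge probabilities pi in [0, 1).
pi_grid = np.linspace(0.0, 0.99, 100)
exact = bernoulli_kl(pi_grid)
approx = linear_surrogate(pi_grid)

# Both penalties are monotonically increasing in pi, so both discourage pi > 0.
assert np.all(np.diff(exact) > 0) and np.all(np.diff(approx) > 0)
print(f"pi=0.1: exact={bernoulli_kl(0.1):.3f}, linear={linear_surrogate(0.1):.3f}")
print(f"pi=0.9: exact={bernoulli_kl(0.9):.3f}, linear={linear_surrogate(0.9):.3f}")
```

The printed values illustrate point (ii) above: the exact KL grows without bound as $\pi$ approaches 1, while the linear surrogate stays finite.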
For the term $\mathrm{KL}[\mathcal{N}(0, I), p(z_n \mid X_n)]$ in Equation 15, the distribution $p(z_n \mid X_n)$ is a multivariate normal distribution, denoted $\mathcal{N}(z_n, \Sigma)$. If we assume all normal distributions in $p(z_n \mid X_n)$ are i.i.d. with the same variance (i.e., $\Sigma = \mathrm{diag}(\sigma^2, \sigma^2, \ldots, \sigma^2)$ where $\sigma$ is a constant), we can reformulate the KL divergence as:

$$
\begin{aligned}
\mathrm{KL}[\mathcal{N}(0, I), p(z_n \mid X_n)] &= \mathrm{KL}[\mathcal{N}(0, I), \mathcal{N}(z_n, \Sigma_n)] \\
&= \frac{1}{2}\Big(\mathrm{Tr}(\Sigma_n^{-1} I) + (z_n - 0)^{\top}\Sigma_n^{-1}(z_n - 0) + \ln\frac{\det \Sigma_n}{\det I} - d|X_n|\Big) \\
&= \frac{1}{2}\sum_{i,j \in X_n}\sum_{k=1}^{d}\Big(\frac{1}{\sigma^2} + \frac{(z_n)_{ijk}^2}{\sigma^2} + \ln \sigma^2 - 1\Big) \\
&= \frac{1}{2\sigma^2}\sum_{i,j \in X_n}\sum_{k=1}^{d}\big((z_n)_{ijk}^2 + C_2\big) \\
&\propto \frac{1}{2\sigma^2}\sum_{i,j \in X_n}\sum_{k=1}^{d}(z_n)_{ijk}^2,
\end{aligned} \quad (18)
$$

where $C_2 = 1 + \sigma^2 \ln \sigma^2 - \sigma^2$ is a constant and $(z_n)_{ijk}$ is the $k$th dimension of $(z_n)_{ij}$.

Relating these terms to Equation 6, Equation 17 is exactly the $L_0$ activation regularization term and Equation 18 is the $L_2$ activation regularization term in the empirical risk minimization procedure of L0-SIGN, with $\lambda_1 = d\beta\gamma$ and $\lambda_2 = \frac{\beta}{2\sigma^2}$. Therefore, the empirical risk minimization procedure of L0-SIGN is a variational approximation of minimizing the objective function of IB ($J_{IB}$):

$$\min R(\theta, \omega) \approx \min J_{IB}. \quad (19)$$

Proof of Theorem 1

Theorem 1. (Statistical Interaction in SIGN) Consider a graph $G(X, E)$, where $X$ is the node set and $E = \{e_{ij}\}_{i,j \in X}$ is the edge set with $e_{ij} \in \{0, 1\}$ and $e_{ij} = e_{ji}$. Let $G(X, E)$ be the input of the SIGN function $f_S(G)$ in Equation 4. Then the function flags a pairwise statistical interaction between node $i$ and node $j$ if and only if they are linked by an edge in $G(X, E)$, i.e., $e_{ij} = 1$.

Proof. We prove Theorem 1 by proving two lemmas.

Lemma 1.1. Under the conditions of Theorem 1, for a graph $G(X, E)$, if $f_S(G)$ shows a pairwise statistical interaction between node $i$ and node $j$, where $i, j \in X$, then the two nodes are linked by an edge in $G(X, E)$, i.e., $e_{ij} = 1$.

Proof. We prove this lemma by contradiction. Assume that the SIGN function $f_S$ with $G(X, E \setminus e_{ij})$ as input shows a pairwise statistical interaction between node $i$ and node $j$, where $G(X, E \setminus e_{ij})$ is a graph whose edge set $E \setminus e_{ij}$ has $e_{ij} = 0$.

Recall the SIGN function in Equation 4. Without loss of generality, we set both aggregation functions $\phi$ and $\psi$ to be the element-wise average. That is:

$$f_S(G) = \frac{1}{|X|}\sum_{i \in X}\Big(\frac{1}{\rho(i)}\sum_{j \in X}\big(e_{ij}\, h(u_i, u_j)\big)\Big), \quad (20)$$

where $\rho(i)$ is the degree of node $i$.

From Equation 20, the SIGN function can be regarded as a linear aggregation of the non-additive statistical interaction modeling procedures $h(u_k, u_m)$ over all node pairs $(k, m)$ with $k, m \in X$ and $e_{km} = 1$. Since $E \setminus e_{ij}$ does not contain an edge between $i$ and $j$ (i.e., $e_{ij} = 0$), the SIGN function performs no interaction modeling between the two nodes in the final prediction.

According to Definition 2, since $i$ and $j$ have a statistical interaction, we cannot find a replacement form of the SIGN function such as:

$$f_S(G) = q_{\setminus i}(u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_{|X|}) + q_{\setminus j}(u_1, \ldots, u_{j-1}, u_{j+1}, \ldots, u_{|X|}), \quad (21)$$

where $q_{\setminus i}$ and $q_{\setminus j}$ are functions that do not take node $i$ and node $j$ as input, respectively.

However, under our assumption there is no edge between node $i$ and node $j$, so no interaction modeling function operates between them in $f_S(G)$. Therefore, we can easily find such $q_{\setminus i}$ and $q_{\setminus j}$ that satisfy Equation 21. For example:

$$q_{\setminus i}(u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_{|X|}) = \frac{1}{|X|}\sum_{k \in X \setminus \{i\}}\Big(\frac{1}{\rho(k)}\sum_{m \in X \setminus \{i\}}\big(e_{km}\, h(u_k, u_m)\big)\Big), \quad (22)$$

and

$$q_{\setminus j}(u_1, \ldots, u_{j-1}, u_{j+1}, \ldots, u_{|X|}) = \frac{1}{|X|}\sum_{m \in X \setminus \{j\}}\Big(\frac{1}{\rho(i)}\, e_{im}\, h(u_i, u_m)\Big) + \frac{1}{|X|}\sum_{k \in X \setminus \{j\}}\Big(\frac{1}{\rho(k)}\, e_{ki}\, h(u_k, u_i)\Big). \quad (23)$$

This contradicts our assumption, so Lemma 1.1 is proved.

Lemma 1.2. Under the conditions of Theorem 1, for a graph $G(X, E)$, if an edge links node $i$ and node $j$ in $G$ (i.e., $i, j \in X$ and $e_{ij} = 1$), then $f_S(G)$ shows a pairwise statistical interaction between node $i$ and node $j$.

Proof. We also prove this lemma by contradiction. Assume there is a graph $G(X, E)$ with a node pair $(i, j)$ such that $e_{ij} = 1$, but $f_S(G)$ shows no pairwise statistical interaction between this node pair.

Since $e_{ij} = 1$, we can rewrite the SIGN function as:

$$f_S(G) = \frac{1}{|X|}\sum_{k \in X}\Big(\frac{1}{\rho(k)}\sum_{m \in X}\big(e_{km}\, h(u_k, u_m)\big)\Big) + \frac{\rho(i) + \rho(j)}{|X|\,\rho(i)\,\rho(j)}\, h(u_i, u_j), \quad (24)$$

where $(k, m) \notin \{(i, j), (j, i)\}$.

Under our assumption, $f_S(G)$ shows no pairwise statistical interaction between node $i$ and node $j$; that is, by Definition 2, we can write $f_S(G)$ in the form of Equation 21. For the first component on the RHS of Equation 24, we can easily construct functions $q_{\setminus i}$ and $q_{\setminus j}$ in a similar way to Equation 22 and Equation 23, respectively. However, for the second component on the RHS of Equation 24, the non-additive function $h(u_i, u_j)$ operates on node $i$ and node $j$. By the definition of a non-additive function, we cannot represent $h$ in a form such as $h(u_i, u_j) = f_1(u_i) + f_2(u_j)$, where $f_1$ and $f_2$ are functions. That is, we cannot merge the second component of the RHS into either $q_{\setminus i}$ or $q_{\setminus j}$.

Therefore, Equation 24 cannot be represented in the form of Equation 21, and the node pair $(i, j)$ shows a pairwise statistical interaction in $f_S(G)$, which contradicts our assumption. Lemma 1.2 is proved.

Combining Lemma 1.1 and Lemma 1.2, Theorem 1 is proved.

Proof of Corollary 1.1

Corollary 1.1. (Statistical Interaction in L0-SIGN) Consider a graph $G$ whose edge set is unknown. Let $G$ be the input of the L0-SIGN function $F_{LS}(G)$ in Equation 5. The function flags a pairwise statistical interaction between node $i$ and node $j$ if and only if they are predicted to be linked by an edge in $G$ by L0-SIGN, i.e., $e'_{ij} = 1$.

Proof. In Equation 5, we can perform the prediction procedure by first predicting edge values for all potential node pairs, and then performing node pair modeling and aggregating the results to obtain the predictions (as illustrated in Figure 1). Specifically, the edge prediction procedure in L0-SIGN can be regarded as taking place before the subsequent SIGN procedure. The edge values of an input graph $G(X, \emptyset)$ are first predicted by the function $F_{ep}$, yielding the graph $G(X, E')$, where $E'$ is the predicted edge set. The remaining procedure is then identical to the SIGN model with $G(X, E')$ as the input graph, which satisfies Theorem 1.

Algorithms

In this section, we provide the pseudocode of the SIGN and L0-SIGN prediction algorithms in Algorithm 1 and Algorithm 2, respectively, and the pseudocode of the SIGN and L0-SIGN training algorithms in Algorithm 3 and Algorithm 4, respectively.

Algorithm 1 SIGN prediction function fS
  Input: data G(X, E)
  for each pair of features (i, j) do
    if eij = 1 then
      zij = h(xi vi, xj vj)
      sij = zij
    else
      sij = 0
    end if
  end for
  for each feature i do
    v'i = ψ(ςi)
    u'i = xi v'i
    νi = g(u'i)
  end for
  y' = φ(ν)
  Return: y'
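To complement the pseudocode, the following is a minimal PyTorch-style sketch of Algorithm 1 (the SIGN prediction function). It is an illustrative reading of the pseudocode, not the authors' implementation: the interaction model h is assumed to be a small MLP over concatenated embeddings, the aggregations ψ and φ are taken to be element-wise means, and all class and variable names are ours.

```python
import torch
import torch.nn as nn

class SIGNSketch(nn.Module):
    """Illustrative sketch of Algorithm 1: pairwise interaction modeling plus aggregation."""

    def __init__(self, num_features: int, dim: int):
        super().__init__()
        self.embedding = nn.Embedding(num_features, dim)            # feature embeddings v_i
        self.h = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))                  # interaction model h (assumed MLP)
        self.g = nn.Linear(dim, 1)                                   # per-node readout g

    def forward(self, feat_ids: torch.Tensor, feat_vals: torch.Tensor,
                edges: torch.Tensor) -> torch.Tensor:
        # feat_ids: (q,) feature indices of one sample; feat_vals: (q,) feature values
        # edges: (q, q) binary matrix with e_ij = 1 for pairs kept as interactions
        u = feat_vals.unsqueeze(-1) * self.embedding(feat_ids)       # x_i * v_i, shape (q, d)
        q, d = u.shape

        # s_ij = e_ij * h(u_i, u_j) for every pair (first loop of Algorithm 1)
        ui = u.unsqueeze(1).expand(q, q, d)
        uj = u.unsqueeze(0).expand(q, q, d)
        s = self.h(torch.cat([ui, uj], dim=-1)) * edges.unsqueeze(-1)

        # psi: aggregate each node's interaction results (element-wise mean over neighbors)
        deg = edges.sum(dim=1, keepdim=True).clamp(min=1.0)
        v_prime = s.sum(dim=1) / deg                                 # (q, d)

        # g on each node, then phi: mean over nodes gives the prediction y'
        nu = self.g(feat_vals.unsqueeze(-1) * v_prime)               # (q, 1)
        return nu.mean()

# Tiny usage example: 3 active features and one detected interaction between features 0 and 1.
model = SIGNSketch(num_features=10, dim=8)
ids = torch.tensor([2, 5, 7])
vals = torch.ones(3)
e = torch.zeros(3, 3)
e[0, 1] = e[1, 0] = 1.0
print(model(ids, vals, e))
```

Algorithm 2 below differs only in that the binary edge indicator e_ij is replaced by the predicted, hard-concrete-gated edge value e'_ij.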
Algorithm 2 L0-SIGN prediction function fLS
  Input: data G(X, ∅)
  for each pair of features (i, j) do
    e'ij = HardConcrete(fep(vie, vje))
    zij = h(xi vi, xj vj)
    sij = e'ij zij
  end for
  for each feature i do
    v'i = ψ(ςi)
    u'i = xi v'i
    νi = g(u'i)
  end for
  y' = φ(ν)
  Return: y'

Algorithm 3 Training procedure of SIGN
  Randomly initialize θ
  repeat
    for each input-output pair (Gn(Xn, En), yn) do
      get y'n, v'n from fS(Gn; θ)
      vn = v'n
    end for
    R(θ) = (1/N) Σ_{n=1}^{N} L(y'n, yn)
    update θ (excluding v) by minimizing R(θ)
  until the stop conditions are reached

Algorithm 4 Training procedure of L0-SIGN
  Randomly initialize θ, ω
  repeat
    for each input-output pair (Gn(Xn, ∅), yn) do
      get y'n, v'n from fLS(Gn; θ, ω)
      vn = v'n
    end for
    calculate R(θ, ω) through Equation 6
    update ω, θ (excluding v) by minimizing R(θ, ω)
  until the stop conditions are reached

Time Complexity Analysis

Our model performs interaction detection on each pair of features and interaction modeling on each detected feature interaction. The time complexity is $O(q^2(M^e + M^h))$, where $q = |X_n|$, and $O(M^e)$ and $O(M^h)$ are the complexities of interaction detection and interaction modeling for one feature interaction, respectively. The two aggregation procedures $\phi$ and $\psi$ can together be regarded as summing all statistical interaction analysis results twice, with complexity $O(2dq^2)$. Finally, there is an aggregation procedure $g$ on each node, with complexity $O(qd)$. Therefore, the overall complexity of our model is $O(q^2(M^e + M^h + 2d) + qd)$.

L0 Regularization

$L_0$ regularization encourages the regularized parameters $\theta$ to be exactly zero via the $L_0$ term:

$$\|\theta\|_0 = \sum_{j=1}^{|\theta|} \mathbb{I}[\theta_j \neq 0], \quad (25)$$

where $|\theta|$ is the dimensionality of the parameters and $\mathbb{I}[\theta_j \neq 0]$ is 1 if $\theta_j \neq 0$ and 0 otherwise.

For a dataset $D$, an empirical risk minimization procedure is used with $L_0$ regularization on the parameters $\theta$ of a hypothesis $H(\cdot; \theta)$, which can be any objective function involving parameters, such as a neural network. Using a reparameterization of $\theta$, we set $\theta_j = \tilde{\theta}_j z_j$, where $\tilde{\theta}_j \neq 0$ and $z_j$ is a binary gate with Bernoulli distribution $\mathrm{Bern}(\pi_j)$ (Louizos, Welling, and Kingma 2018). The procedure is:

$$R(\tilde{\theta}, \pi) = \mathbb{E}_{p(z \mid \pi)}\Big[\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}\big(H(X_n; \tilde{\theta} \odot z), y_n\big)\Big] + \lambda \sum_{j=1}^{|\theta|}\pi_j, \qquad \tilde{\theta}^{*}, \pi^{*} = \arg\min_{\tilde{\theta}, \pi} R(\tilde{\theta}, \pi), \quad (26)$$

where $p(z_j \mid \pi_j) = \mathrm{Bern}(\pi_j)$, $N$ is the number of samples in $D$, $\odot$ is element-wise multiplication, $\mathcal{L}(\cdot)$ is a loss function, and $\lambda$ is the weighting factor of the $L_0$ regularization.

Approximate L0 Regularization with Hard Concrete Distribution

A practical difficulty of performing $L_0$ regularization is that it is non-differentiable. Inspired by (Louizos, Welling, and Kingma 2018), we smooth the $L_0$ regularization by approximating the binary edge value with a hard concrete distribution. Specifically, let $f_{ep}$ now output continuous values. Then

$$
\begin{aligned}
u &\sim U(0, 1), \\
s &= \mathrm{Sigmoid}\big((\log u - \log(1-u) + \log \alpha_{ij}) / \beta\big), \\
\bar{s} &= s(\delta - \gamma) + \gamma, \\
e'_{ij} &= \min(1, \max(0, \bar{s})),
\end{aligned} \quad (27)
$$

where $u \sim U(0, 1)$ is a uniform random variable, $\mathrm{Sigmoid}$ is the sigmoid function, $\alpha_{ij} \in \mathbb{R}^{+}$ is the output of $f_{ep}(v_i^e, v_j^e)$, $\beta$ is the temperature, and $(\gamma, \delta)$ is an interval with $\gamma < 0$ and $\delta > 0$. The $L_0$ activation regularization therefore becomes $\sum_{i,j \in X_n}(\pi_n)_{ij} = \sum_{i,j \in X_n}\mathrm{Sigmoid}\big(\log \alpha_{ij} - \beta \log \frac{-\gamma}{\delta}\big)$.

As a result, $e'_{ij}$ follows a hard concrete distribution through the above approximation, which is differentiable and approximates a binary distribution. Following the recommendations of (Maddison, Mnih, and Teh 2017), we set $\gamma = -0.1$, $\delta = 1.1$, and $\beta = 2/3$ throughout our experiments. We refer interested readers to (Louizos, Welling, and Kingma 2018) for details about hard concrete distributions.

Additional Experimental Results

We also evaluate our model using the F1-score. We show the evaluation results in this section.
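As a self-contained reference for the hard concrete relaxation in Equation 27 above, here is a short, illustrative Python sketch. It reflects our reading of the equations rather than the released implementation: the function names, the deterministic evaluation branch, and the tensor shapes are assumptions, while the constants match the stated γ = −0.1, δ = 1.1, β = 2/3.

```python
import torch

GAMMA, DELTA, BETA = -0.1, 1.1, 2.0 / 3.0  # stretch interval and temperature from the paper

def hard_concrete_gate(log_alpha: torch.Tensor, training: bool = True) -> torch.Tensor:
    """Sample a differentiable edge gate e' in [0, 1] following Equation 27.

    log_alpha is log(alpha_ij), where alpha_ij is the output of the edge predictor f_ep.
    """
    if training:
        u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)   # u ~ U(0, 1), clamped for stability
        s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / BETA)
    else:
        s = torch.sigmoid(log_alpha)                            # a common deterministic choice at test time (assumption)
    s_bar = s * (DELTA - GAMMA) + GAMMA                         # stretch to (gamma, delta)
    return s_bar.clamp(0.0, 1.0)                                # hard-clip to [0, 1]

def expected_l0_penalty(log_alpha: torch.Tensor) -> torch.Tensor:
    """Probability of each gate being non-zero, summed over pairs (the L0 activation term)."""
    return torch.sigmoid(log_alpha - BETA * torch.log(torch.tensor(-GAMMA / DELTA))).sum()

# Usage: gate the pairwise interaction results of one sample with 5 active features.
log_alpha = torch.randn(5, 5, requires_grad=True)               # edge scores for all pairs
gates = hard_concrete_gate(log_alpha)                           # e'_ij in [0, 1]
penalty = expected_l0_penalty(log_alpha)
print(gates.shape, float(penalty))
```

The expected_l0_penalty term corresponds to the $\sum_{i,j}\mathrm{Sigmoid}(\log \alpha_{ij} - \beta \log \frac{-\gamma}{\delta})$ expression above and would be added to the training loss with weight $\lambda_1$.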