Improving Multimodal Fusion via Mutual Dependency Maximisation

 

Pierre Colombo 1,2, Emile Chapuis 1, Matthieu Labeau 1, Chloe Clavel 1
1 LTCI, Telecom Paris, Institut Polytechnique de Paris
2 IBM GBS France
1 firstname.lastname@telecom-paris.fr, 2 pierre.colombo@ibm.com

Abstract

Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topics. Acknowledging that humans communicate through a variety of channels (i.e. visual, acoustic, linguistic), multimodal systems aim at integrating different unimodal representations into a synthetic one. So far, a considerable effort has been made on developing complex architectures allowing the fusion of these modalities. However, such systems are mainly trained by minimising simple losses such as L1 or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to 4.3 on accuracy) across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: CMU-MOSI and CMU-MOSEI. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our method is a statistical network which can be used to interpret the high-dimensional representations learnt by the model.

1 Introduction

Humans employ three different modalities to communicate in a coordinated manner: the language modality, with the use of words and sentences; the vision modality, with gestures, poses and facial expressions; and the acoustic modality, through changes in vocal tone. Multimodal representation learning has shown great progress in a large variety of tasks including emotion recognition, sentiment analysis (Soleymani et al., 2017), speaker trait analysis (Park et al., 2014) and fine-grained opinion mining (Garcia et al., 2019a). Learning from different modalities is an efficient way to improve performance on the target tasks (Xu et al., 2013). Nevertheless, heterogeneities across modalities increase the difficulty of learning multimodal representations and raise specific challenges. Baltrušaitis et al. (2018) identify fusion as one of the five core challenges in multimodal representation learning, the four others being representation, modality alignment, translation and co-learning. Fusion aims at integrating the different unimodal representations into one common synthetic representation. Effective fusion is still an open problem: the best multimodal models in sentiment analysis (Rahman et al., 2020) improve over their unimodal counterparts, which rely on the text modality only, by less than 1.5% on accuracy. Additionally, fusion should not only improve accuracy but also make representations more robust to missing modalities.

Multimodal fusion can be divided into early and late fusion techniques: early fusion takes place at the feature level (Ye et al., 2017), while late fusion takes place at the decision or scoring level (Khan et al., 2012). Current research in multimodal sentiment analysis mainly focuses on developing new fusion mechanisms relying on deep architectures (e.g. TFN (Zadeh et al., 2017), LFN (Liu et al., 2018), MARN (Zadeh et al., 2018b), MISA (Hazarika et al., 2020), MCTN (Pham et al., 2019), HFNN (Mai et al., 2019), ICCN (Sun et al., 2020)). These models are evaluated on several multimodal sentiment analysis benchmarks such as IEMOCAP (Busso et al., 2008), MOSI (Wöllmer et al., 2013), MOSEI (Zadeh et al., 2018c) and POM (Garcia et al., 2019b; Park et al., 2014). The current state of the art on these datasets uses architectures based on pre-trained transformers (Tsai et al., 2019; Siriwardhana et al., 2020), such as MultiModal BERT (MAGBERT) or MultiModal XLNET (MAGXLNET) (Rahman et al., 2020).

The aforementioned architectures are trained by minimising either an L1 loss or a cross-entropy loss between the predictions and the ground-truth labels. To the best of our knowledge, few efforts have been dedicated to exploring alternative losses.
In this work, we propose a set of new objectives to perform and improve over existing fusion mechanisms. These improvements are inspired by the InfoMax principle (Linsker, 1988), i.e. choosing the representation that maximises the mutual information (MI) between two possibly overlapping views of the input. The MI quantifies the dependence of two random variables; contrarily to correlation, MI also captures non-linear dependencies between the considered variables. Differently from previous work, which mainly focuses on comparing two modalities, our learning problem involves multiple modalities (e.g. text, audio, video). Our proposed method, which induces no architectural changes, relies on jointly optimising the target loss with an additional penalty term measuring the mutual dependency between the different modalities.

1.1 Our Contributions

We study new objectives to build more performant and robust multimodal representations through an enhanced fusion mechanism, and evaluate them on multimodal sentiment analysis. Our method also allows us to explain the learnt high-dimensional multimodal embeddings. The paper contributions can be summarised as follows:
A set of novel objectives using multivariate dependency measures. We introduce three new trainable surrogates to maximise the mutual dependencies between the three modalities (i.e. audio, language and video). We provide a general algorithm inspired by MINE (Belghazi et al., 2018), which was developed in a bi-variate setting for estimating the MI. Our new method enriches MINE by extending the procedure to a multivariate setting, which allows us to maximise different Mutual Dependency Measures: the Total Correlation (Watanabe, 1960), the f-Total Correlation and the Multivariate Wasserstein Dependency Measure (Ozair et al., 2019).
Applications and numerical results. We apply our new set of objectives to five different architectures relying on LSTM cells (Huang et al., 2015) (e.g. EF-LSTM, LFN, MFN) or transformer layers (e.g. MAGBERT, MAG-XLNET). Our proposed method (1) brings a substantial improvement on two different multimodal sentiment analysis datasets (i.e. MOSI and MOSEI, sec. 5.1), (2) makes the encoder more robust to missing modalities (i.e. when predicting without language, audio or video, the observed performance drop is smaller, sec. 5.3), and (3) provides an explanation of the decisions taken by the neural architecture (sec. 5.4).

2 Problem formulation & related work

In this section, we formulate the problem of learning multimodal representations (sec. 2.1) and review both existing measures of mutual dependency (sec. 2.2) and estimation methods (sec. 2.3). In the rest of the paper, we focus on learning from three modalities (i.e. language, audio and video); however, our approach can be generalised to any arbitrary number of modalities.

2.1 Learning multimodal representations

A plethora of neural architectures have been proposed to learn multimodal representations for sentiment classification. Models often rely on a fusion mechanism (e.g. a multi-layer perceptron (Khan et al., 2012), tensor factorisation (Liu et al., 2018; Zadeh et al., 2019) or complex attention mechanisms (Zadeh et al., 2018a)) that is fed with modality-specific representations. The fusion problem boils down to learning a model $M_f : \mathcal{X}_a \times \mathcal{X}_v \times \mathcal{X}_l \rightarrow \mathbb{R}^d$. $M_f$ is fed with unimodal representations of the inputs $X_{a,v,l} = (X_a, X_v, X_l)$, obtained through three embedding networks $f_a$, $f_v$ and $f_l$. $M_f$ has to retain both modality-specific interactions (i.e. interactions that involve only one modality) and cross-view interactions (i.e. more complex interactions that span across several views). Overall, the learning of $M_f$ involves both the minimisation of the downstream task loss and the maximisation of the mutual dependency between the different modalities.

2.2 Mutual dependency maximisation

Mutual information as mutual dependency measure: the core ideas we rely on to better learn cross-view interactions are not new. They consist of mutual information maximisation (Linsker, 1988) and deep representation learning. Thus, one of the most natural choices is to use the MI, which measures the dependence between two random variables, including high-order statistical dependencies (Kinney and Atwal, 2014). Given two random variables $X$ and $Y$, the MI is defined by

$$I(X;Y) \triangleq \mathbb{E}_{XY}\left[\log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)}\right], \quad (1)$$

where $p_{XY}$ is the joint probability density function (pdf) of the random variables $(X, Y)$, and $p_X$, $p_Y$
represent the marginal pdfs. MI can also be defined via the KL divergence:

$$I(X;Y) \triangleq \mathrm{KL}\left[p_{XY}(x,y)\,\|\,p_X(x)\,p_Y(y)\right]. \quad (2)$$

Extension of mutual dependency to different metrics: the KL divergence seems to be limited when used for estimating MI (McAllester and Stratos, 2020). A natural step is to replace the KL divergence in Eq. 2 with different divergences, such as the f-divergences, or with distances, such as the Wasserstein distance. Hence, we introduce new mutual dependency measures (MDM): the f-Mutual Information (Belghazi et al., 2018), denoted $I_f$, and the Wasserstein Measure (Ozair et al., 2019), denoted $I_{\mathcal{W}}$. As previously, $p_{XY}$ denotes the joint pdf, and $p_X$, $p_Y$ denote the marginal pdfs. The new measures are defined as follows:

$$I_f \triangleq D_f\big(p_{XY}(x,y)\,;\,p_X(x)\,p_Y(y)\big), \quad (3)$$

where $D_f$ denotes any f-divergence, and

$$I_{\mathcal{W}} \triangleq \mathcal{W}\big(p_{XY}(x,y)\,;\,p_X(x)\,p_Y(y)\big), \quad (4)$$

where $\mathcal{W}$ denotes the Wasserstein distance (Peyré et al., 2019).
2.3 Estimating mutual dependency measures

The computation of MI and other mutual dependency measures can be difficult without knowing the marginal and joint probability distributions; thus it is popular to maximise lower bounds in order to obtain better representations of different modalities, including image (Tian et al., 2019; Hjelm et al., 2018), audio (Dilpazir et al., 2016) and text (Kong et al., 2019) data. Several estimators have been proposed: MINE (Belghazi et al., 2018) uses the Donsker-Varadhan representation (Donsker and Varadhan, 1985) to derive a parametric lower bound; Nguyen et al. (2017, 2010) use a variational characterisation of the f-divergence and a multi-sample version of the density ratio (also known as noise contrastive estimation (Oord et al., 2018; Ozair et al., 2019)). These methods have mostly been developed and studied in a bi-variate setting.
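To make the bivariate estimation procedure concrete, the following is a minimal PyTorch-style sketch of a MINE-like Donsker-Varadhan estimator; the critic architecture, optimiser settings and the toy correlated-Gaussian data are our own illustrative assumptions, not the configuration used in this paper.

```python
import math
import torch
import torch.nn as nn

# Small statistics network T_theta(x, y) (assumed architecture).
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def dv_bound(x, y):
    """Empirical Donsker-Varadhan bound: E_joint[T] - log E_product[exp(T)]."""
    t_joint = critic(torch.cat([x, y], dim=-1)).mean()
    # Shuffling y breaks the pairing, simulating samples from p_X * p_Y.
    y_shuf = y[torch.randperm(y.size(0))]
    t_prod = critic(torch.cat([x, y_shuf], dim=-1)).squeeze(-1)
    return t_joint - (torch.logsumexp(t_prod, dim=0) - math.log(t_prod.size(0)))

# Toy data: a correlated bivariate Gaussian, as in the setup of Fig. 1 below.
rho = 0.8
for _ in range(500):
    x = torch.randn(256, 1)
    y = rho * x + math.sqrt(1 - rho ** 2) * torch.randn(256, 1)
    loss = -dv_bound(x, y)  # maximising the bound tightens the MI estimate
    opt.zero_grad(); loss.backward(); opt.step()
```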
Illustration of neural dependency measures in a bivariate case. In Fig. 1 we show the aforementioned dependency measures (see Eq. 2, Eq. 3, Eq. 4), estimated with MINE (Belghazi et al., 2018), for multivariate Gaussian random variables $X_a$ and $X_b$. The component-wise correlation for the considered multivariate Gaussian is defined as follows: $\mathrm{corr}(X_i, X_k) = \delta_{i,k}\,\rho$, where $\rho \in (-1, 1)$ and $\delta_{i,k}$ is Kronecker's delta. We observe that the dependency measure based on the Wasserstein distance differs from the ones based on the divergences, and will thus lead to different gradients. Although theoretical studies have been carried out on the use of different metrics for dependency estimation, it remains an open question which one is best suited. In this work, we provide an experimental answer in a specific case.

[Figure 1: curves of the estimated dependency measures (y-axis, 0.0 to 3.0) against the correlation coefficient (x-axis, -1.00 to 1.00); legend: KL, f.]
Figure 1: Estimation of different dependency measures for multivariate Gaussian random variables, for different degrees of correlation.

3 Model and training objective

In this section, we introduce our new set of losses to improve fusion. In sec. 3.1, we first extend widely used bi-variate dependency measures to multivariate dependency measures (MDM) (James and Crutchfield, 2017). We then introduce variational bounds on the MDM, and in sec. 3.2 we describe our method to minimise the proposed variational bounds.
Notations We consider $X_a$, $X_v$, $X_l$ as the multimodal data from the audio, video and language modality respectively, with joint probability distribution $p_{X_a X_v X_l}$. We denote by $p_{X_j}$ the marginal distribution of $X_j$, with $j \in \{a, v, l\}$ corresponding to the $j$-th modality.
General loss As previously mentioned, we rely on the InfoMax principle (Linsker, 1988) and aim at jointly maximising the MDM between the different modalities and minimising the task loss; hence, we are in a multi-task setting (Argyriou et al., 2007; Ruder, 2017) and the objective of interest can be defined as:

$$\mathcal{L} \triangleq \underbrace{\mathcal{L}_{down.}}_{\text{main task}} - \;\lambda \cdot \underbrace{\mathcal{L}_{MDM}}_{\text{mutual dependency term}}. \quad (5)$$
$\mathcal{L}_{down.}$ represents a downstream-specific (target task) loss, i.e. a binary cross-entropy or an L1 loss, $\lambda$ is a meta-parameter and $\mathcal{L}_{MDM}$ is the multivariate dependency measure (see sec. 3.2). Minimisation of our newly defined objectives requires deriving lower bounds on the $\mathcal{L}_{MDM}$ terms, and then obtaining trainable surrogates.
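Concretely, a single training step optimising Eq. 5 could look like the following hedged sketch; all tensors and the value of $\lambda$ are illustrative stand-ins for quantities computed elsewhere in the pipeline.

```python
import torch
import torch.nn.functional as F

lam = 0.1                                             # meta-parameter lambda (assumed value)
predictions = torch.randn(32, 1, requires_grad=True)  # model outputs (stand-in)
labels = torch.randn(32, 1)                           # ground-truth sentiment scores (stand-in)
mdm_estimate = torch.tensor(0.3)                      # placeholder for the critic's bound (Eqs. 6-8)

task_loss = F.l1_loss(predictions, labels)            # L_down: here the L1 regression loss
loss = task_loss - lam * mdm_estimate                 # Eq. 5: minimise task loss, maximise MDM
loss.backward()
```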
3.1   From bivariate to multivariate                                                                            
                                                             IW , sup EpXa Xv Xl [Tθ ] − log E     Q
                                                                                                    pXj     [Tθ ] .   (8)
      dependencies                                               θ:Tθ ∈L                      j∈{a,v,l}

In our setting, we aim at maximising cross-view
                                                        Where L is the set of all 1-Lipschitz functions from
interactions involving three modalities, thus we
                                                        Rd → R
need to generalise bivariate dependency measures
to multivariate dependency measures.                    Sketch of proofs: Eq. 6 is a direct application of
Definition 3.1 (Multivariate Dependencies Mea-          the Donsker-Varadhan representation of the KL
sures). Let Xa , Xv , Xl be a set of random vari-       divergence (we assume that the integrability con-
ables with joint pdf pXa Xv Xl and respective           straints are satisfied). Eq. 7 comes from the work
marginal pdf pXj with j ∈ {a, v, l}. Then we            of Nguyen et al. (2017). Eq. 8 comes from the
defined the multivariate mutual information Ikl         Kantorovich-Rubenstein: we refer the reader to
which is also refered as total correlation (Watan-      (Villani, 2008; Peyré et al., 2019) for a rigorous
abe, 1960) or multi-information (Studenỳ and Vej-      and exhaustive treatment.
narová, 1998):                                          Practical estimate of the variational bounds.
                                      Y                 The empirical estimator that we derive from Th. 1
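As a sanity check of Definition 3.1 (our illustrative example, not taken from the paper), the total correlation admits a closed form in the Gaussian case. For a zero-mean Gaussian vector $(X_a, X_v, X_l)$ with unit variances and correlation matrix $\Sigma$,

$$I_{kl} = -\tfrac{1}{2}\log\det\Sigma,$$

so for equicorrelated modalities with pairwise correlation $\rho$ we have $\det\Sigma = (1+2\rho)(1-\rho)^2$ and $I_{kl} = -\tfrac{1}{2}\log\big((1+2\rho)(1-\rho)^2\big)$, which vanishes at $\rho = 0$ and diverges as $\rho \to 1$: the measure indeed rewards strongly dependent modalities.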
3.2 From theoretical bounds to trainable surrogates

To train our neural architecture we need to estimate the previously defined multivariate dependency measures. We rely on the neural estimators given in Th. 1.
Theorem 1 (Multivariate Neural Dependency Measures). Let $T_\theta : \mathcal{X}_a \times \mathcal{X}_v \times \mathcal{X}_l \rightarrow \mathbb{R}$ be a family of functions parametrised by a deep neural network with learnable parameters $\theta \in \Theta$. The neural multivariate mutual information measure $I_{kl}$ is defined as:

$$I_{kl} \triangleq \sup_{\theta} \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \log \mathbb{E}_{\prod_j p_{X_j}}\big[e^{T_\theta}\big]. \quad (6)$$

The neural multivariate f-mutual information measure $I_f$ is defined as:

$$I_f \triangleq \sup_{\theta} \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \mathbb{E}_{\prod_j p_{X_j}}\big[e^{T_\theta - 1}\big]. \quad (7)$$

The neural multivariate Wasserstein dependency measure $I_{\mathcal{W}}$ is defined as:

$$I_{\mathcal{W}} \triangleq \sup_{\theta:\, T_\theta \in \mathcal{L}} \; \mathbb{E}_{p_{X_a X_v X_l}}[T_\theta] - \log \mathbb{E}_{\prod_j p_{X_j}}[T_\theta], \quad (8)$$

where $\mathcal{L}$ is the set of all 1-Lipschitz functions from $\mathbb{R}^d$ to $\mathbb{R}$.
Sketch of proofs: Eq. 6 is a direct application of the Donsker-Varadhan representation of the KL divergence (we assume that the integrability constraints are satisfied). Eq. 7 comes from the work of Nguyen et al. (2017). Eq. 8 comes from the Kantorovich-Rubinstein duality; we refer the reader to (Villani, 2008; Peyré et al., 2019) for a rigorous and exhaustive treatment.
Practical estimate of the variational bounds. The empirical estimators derived from Th. 1 can be used in a practical way: the expectations in Eq. 6, Eq. 7 and Eq. 8 are estimated using empirical samples from the joint distribution $p_{X_a X_v X_l}$. The empirical samples from $\prod_{j \in \{a,v,l\}} p_{X_j}$ are obtained by shuffling the samples from the joint distribution within a batch. We integrate this into the minimisation of the multi-task objective (5) by using minus the estimator. We refer to the losses obtained with the penalty based on the estimators described in Eq. 6, Eq. 7 and Eq. 8 as $\mathcal{L}_{kl}$, $\mathcal{L}_f$ and $\mathcal{L}_{\mathcal{W}}$ respectively. Details on the practical minimisation of our variational bounds are provided in Algorithm 1.
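The following hedged PyTorch-style sketch shows how the three empirical bounds can be computed from a joint batch and an in-batch shuffled negative batch. The critic architecture and the dimensions are illustrative assumptions; for the Wasserstein variant we implement the plain Kantorovich-Rubinstein difference and only note in a comment where the logarithm of Eq. 8 as printed would enter.

```python
import math
import torch
import torch.nn as nn

def shuffle_modalities(xa, xv, xl):
    """Independent per-modality permutations approximate samples from the
    product of marginals (the negative batch of Algorithm 1)."""
    n = xa.size(0)
    return xa[torch.randperm(n)], xv[torch.randperm(n)], xl[torch.randperm(n)]

def mdm_bound(critic, joint, negative, kind="kl"):
    """Empirical versions of the bounds in Th. 1; `critic` returns one
    scalar score per sample given (xa, xv, xl)."""
    t_joint = critic(*joint).mean()
    t_neg = critic(*negative)
    if kind == "kl":   # Eq. 6 (Donsker-Varadhan): E[T] - log E[e^T]
        return t_joint - (torch.logsumexp(t_neg, 0) - math.log(t_neg.size(0)))
    if kind == "f":    # Eq. 7 (Nguyen et al.): E[T] - E[e^(T-1)]
        return t_joint - torch.exp(t_neg - 1.0).mean()
    if kind == "w":    # Kantorovich-Rubinstein difference; Eq. 8 as printed
        return t_joint - t_neg.mean()   # additionally wraps the 2nd term in a log
    raise ValueError(kind)

# Toy usage with assumed dimensions and a simple concatenation critic.
da, dv, dl = 4, 8, 16
net = nn.Sequential(nn.Linear(da + dv + dl, 32), nn.ReLU(), nn.Linear(32, 1))
critic = lambda a, v, l: net(torch.cat([a, v, l], dim=-1)).squeeze(-1)
batch = (torch.randn(64, da), torch.randn(64, dv), torch.randn(64, dl))
print(mdm_bound(critic, batch, shuffle_modalities(*batch), kind="kl"))
```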
                                                        bounds are provided in Algorithm 1.
where W denotes the Wasserstein distance.               Remark. In this work we choose to generalise
3.2   From theoretical bounds to trainable              MINE to compute multivariate dependencies. Com-
      surrogates                                        paring our proposed algorithm to other alterna-
                                                        tives mentioned in sec. 2 is left for future work.
To train our neural architecture we need to esti-       This choice is driven by two main reasons: (1)
mate the previously defined multivariate depen-         our framework allows the use of various types
dency measures. We rely on neural estimators that       of contrast measures (e.g Wasserstein distance,f -
are given in Th. 1.                                     divergences); (2) the critic network Tθ can be used
Theorem 1. Multivariate Neural Dependency               for interpretability purposes as shown in sec. 5.4.
Measures Let the family of functions T (θ) : Xa ×
Xv × Xl → R parametrized by a deep neural net-
                                                        4     Experimental setting
work with learnable parameters θ ∈ Θ. The multi-
variate mutual information measure Ikl is defined       In this section, we present our experimental settings
as:                                                     including the neural architectures we compare, the
                                                        datasets, the metrics and our methodology, which
 Ikl , sup EpXa Xv Xl [Tθ ] − log E Q pX [eTθ ] . (6)
                                              
        θ                       j∈{a,v,l}
                                            j           includes the hyper-parameter selection.
Algorithm 1 Two-stage procedure to minimise multivariate dependency measures.
INPUT: $D_n = \{(x_a^j, x_v^j, x_l^j), \forall j \in [1, n]\}$ multimodal training dataset; $m$ batch size; $\sigma_a, \sigma_v, \sigma_l : [1, m] \rightarrow [1, m]$ three permutations; $\theta_c$ weights of the deep classifier; $\theta$ weights of the statistical network $T_\theta$.
Initialization: parameters $\theta$ and $\theta_c$.
Build negative dataset:

$$\bar{D}_n = \{(x_a^{\sigma_a(j)}, x_v^{\sigma_v(j)}, x_l^{\sigma_l(j)}), \forall j \in [1, n]\}$$

Optimization:
while $(\theta, \theta_c)$ not converged do
    for $i \in [1, Unroll]$ do
        Sample from $D_n$ a batch $B \sim p_{X_a X_v X_l}$
        Sample from $\bar{D}_n$ a batch $\bar{B} \sim \prod_{j \in \{a,v,l\}} p_{X_j}$
        Update $\theta$ based on the empirical version of Eq. 6, Eq. 7 or Eq. 8.
    end for
    Sample a batch $B$ from $D_n$
    Update $\theta_c$ with $B$ using Eq. 5.
end while
OUTPUT: classifier weights $\theta_c$

4.1 Datasets

We empirically evaluate our methods on two English datasets: CMU-MOSI and CMU-MOSEI. Both datasets have been frequently used to assess model performance in human multimodal sentiment and emotion recognition.
CMU-MOSI: Multimodal Opinion Sentiment Intensity (Wöllmer et al., 2013) is a sentiment-annotated dataset gathering 2,199 short monologue video clips.
CMU-MOSEI: CMU Multimodal Opinion Sentiment and Emotion Intensity (Zadeh et al., 2018c) is an emotion- and sentiment-annotated corpus consisting of 23,454 movie review videos taken from YouTube. Both CMU-MOSI and CMU-MOSEI are labelled by humans with a sentiment score in [-3, 3]. For each dataset, three modalities are available; we follow prior work (Zadeh et al., 2018b, 2017; Rahman et al., 2020), and the features have been obtained as follows:1
Language: Video transcripts are converted to word embeddings using either GloVe (Pennington et al., 2014), BERT or XLNET contextualised embeddings. For GloVe, the embeddings are of dimension 300, whereas for BERT and XLNET this dimension is 768.
Vision: Vision features are extracted using Facet, which outputs facial action units corresponding to facial muscle movements. For CMU-MOSEI, the video vectors are composed of 47 units, and for CMU-MOSI they are composed of 35.
Audio: Audio features are extracted using COVAREP (Degottex et al., 2014). This results in a vector of dimension 74 which includes 12 Mel-frequency cepstral coefficients (MFCCs), as well as pitch tracking and voiced/unvoiced segmenting features, peak slope parameters, maxima dispersion quotients and glottal source parameters.
Video and audio are aligned on the text following the convention introduced in (Chen et al., 2017) and the forced alignment described in (Yuan and Liberman, 2008).

4.2 Evaluation metrics

Multimodal Opinion Sentiment Intensity prediction is treated as a regression problem. Thus, we report both the Mean Absolute Error (MAE) and the correlation of model predictions with the true labels. In the literature, the regression task is also turned into a binary classification task for polarity prediction. We follow standard practices (Rahman et al., 2020) and report the accuracy2 (Acc7 denotes accuracy on 7 classes and Acc2 the binary accuracy) of our best performing models.

1 Data from CMU-MOSI and CMU-MOSEI can be obtained from https://github.com/WasifurRahman/BERT_multimodal_transformer
2 The regression outputs are turned into categorical values to obtain either 2 or 7 categories (see (Rahman et al., 2020; Zadeh et al., 2018a; Liu et al., 2018)).
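To make these conventions concrete, here is a hedged NumPy sketch of the metric computation as we read the standard practice; the exact binning used by the cited implementations may differ.

```python
import numpy as np

def metrics(preds, labels):
    """MAE, Pearson correlation, binary accuracy (sign of the score) and
    7-class accuracy (scores rounded and clipped to [-3, 3]); these binning
    conventions are assumptions, not the cited implementations."""
    mae = float(np.mean(np.abs(preds - labels)))
    corr = float(np.corrcoef(preds, labels)[0, 1])
    acc2 = float(np.mean((preds >= 0) == (labels >= 0)))
    bins = lambda s: np.clip(np.round(s), -3, 3)
    acc7 = float(np.mean(bins(preds) == bins(labels)))
    return {"MAE": mae, "Corr": corr, "Acc2": acc2, "Acc7": acc7}

print(metrics(np.array([1.2, -0.4, 2.9]), np.array([1.0, -1.0, 3.0])))
```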
4.3 Neural architectures

In our experiments, we choose to modify the loss function of different models that have been introduced for multimodal sentiment analysis on both CMU-MOSI and CMU-MOSEI: the Memory Fusion Network (MFN (Zadeh et al., 2018a)), the Low-rank Multimodal Fusion (LFN (Liu et al., 2018)) and two state-of-the-art transformer-based models (Rahman et al., 2020) whose fusion relies on BERT (Devlin et al., 2018) (MAG-BERT) and XLNET (Yang et al., 2019) (MAG-XLNET). To assess the validity of the proposed losses, we also apply our method to a simple early fusion LSTM (EF-LSTM) as a baseline model.
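Before detailing these models, here is a hedged PyTorch-style sketch of how the two-stage procedure of Algorithm 1 could wrap any of them. The encoder, head, critic, data and hyper-parameters ($\lambda$, Unroll) are toy stand-ins, and for brevity the critic here scores raw features rather than learned representations.

```python
import math
import torch
import torch.nn.functional as F

da, dv, dl, d = 4, 8, 16, 32                       # assumed feature dimensions
encoder = torch.nn.Linear(da + dv + dl, d)         # stand-in for the fusion model M_f
head = torch.nn.Linear(d, 1)                       # linear regression head
critic = torch.nn.Sequential(torch.nn.Linear(da + dv + dl, 32),
                             torch.nn.ReLU(), torch.nn.Linear(32, 1))
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_m = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

def dv_bound(a, v, l):
    """Eq. 6 estimate with in-batch shuffling for the negative samples."""
    n = a.size(0)
    t_joint = critic(torch.cat([a, v, l], -1)).mean()
    neg = torch.cat([a[torch.randperm(n)], v[torch.randperm(n)],
                     l[torch.randperm(n)]], -1)
    return t_joint - (torch.logsumexp(critic(neg).squeeze(-1), 0) - math.log(n))

lam, unroll = 0.1, 5                               # assumed hyper-parameters
for step in range(100):                            # random data stands in for MOSI/MOSEI
    a, v, l = torch.randn(64, da), torch.randn(64, dv), torch.randn(64, dl)
    y = torch.randn(64, 1)
    for _ in range(unroll):                        # stage 1: fit the statistical network
        loss_c = -dv_bound(a, v, l)
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    pred = head(encoder(torch.cat([a, v, l], -1))) # stage 2: task loss + MDM penalty (Eq. 5)
    loss_m = F.l1_loss(pred, y) - lam * dv_bound(a, v, l)
    opt_m.zero_grad(); loss_m.backward(); opt_m.step()
```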
Model overview: the aforementioned models can be seen as a multimodal encoder $f_{\theta_e}$ providing a representation $Z_{avl}$ containing information and dependencies between the modalities $X_l$, $X_a$, $X_v$, namely:

$$f_{\theta_e}(X_a, X_v, X_l) = Z_{avl}.$$

As a final step, a linear transformation $A_{\theta_p}$ is applied to $Z_{avl}$ to perform the regression.
EF-LSTM is the most basic architecture used in current multimodal analysis: each sequence view is encoded separately with an LSTM channel, then a fusion function is applied to all representations.
TFN computes a representation of each view, and then applies a fusion operator. Acoustic and visual views are first mean-pooled, then encoded through a two-layer perceptron; linguistic features are computed with an LSTM channel. Here, the fusion function is a cross-modal product capturing unimodal, bimodal and trimodal interactions across modalities.
MFN enriches the previous EF-LSTM architecture with an attention module that computes a cross-view representation at each time step. These are then gathered, and a final representation is computed by a gated multi-view memory (Zadeh et al., 2018a).
MAG-BERT and MAG-XLNET are based on pre-trained transformer architectures (Devlin et al., 2018; Yang et al., 2019) allowing the inputs of each transformer unit to be multimodal, thanks to a special gate inspired by Wang et al. (2018). Here, $Z_{avl}$ is the [CLS] representation provided by the last transformer head. For each architecture, we use the optimal architecture hyper-parameters provided by the associated papers (see sec. 8).
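As a minimal illustration of this encoder/head structure, here is a toy EF-LSTM-style sketch: our illustrative reading (one LSTM per modality, concatenation as the fusion function), with feature dimensions taken from sec. 4.1; it is not the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn

class EarlyFusionLSTM(nn.Module):
    """Toy EF-LSTM-style baseline: one LSTM per modality, concatenation as
    the fusion function, and a linear regression head A_theta_p on Z_avl."""
    def __init__(self, da=74, dv=35, dl=300, hidden=32):
        super().__init__()
        self.lstm_a = nn.LSTM(da, hidden, batch_first=True)
        self.lstm_v = nn.LSTM(dv, hidden, batch_first=True)
        self.lstm_l = nn.LSTM(dl, hidden, batch_first=True)
        self.head = nn.Linear(3 * hidden, 1)

    def forward(self, xa, xv, xl):
        _, (ha, _) = self.lstm_a(xa)   # last hidden state per modality
        _, (hv, _) = self.lstm_v(xv)
        _, (hl, _) = self.lstm_l(xl)
        z_avl = torch.cat([ha[-1], hv[-1], hl[-1]], dim=-1)  # fused representation
        return self.head(z_avl), z_avl

model = EarlyFusionLSTM()
xa, xv, xl = torch.randn(8, 20, 74), torch.randn(8, 20, 35), torch.randn(8, 20, 300)
pred, z = model(xa, xv, xl)  # pred: (8, 1) sentiment score, z: (8, 96)
```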
5 Numerical results

We present and discuss here the results obtained using the experimental setting described in sec. 4. To better understand the impact of our new methods, we propose to investigate the following points:
Efficiency of the $\mathcal{L}_{MDM}$: to gain understanding of the usefulness of our new objectives, we study the impact of adding the mutual dependency term to the basic multimodal neural model EF-LSTM.
Improving model performance and comparing multivariate dependency measures: the choice of the most suitable dependency measure for a given task is still an open problem (see sec. 3). Thus, we compare the performance, on both multimodal sentiment and emotion prediction tasks, of the different dependency measures. The compared measures are combined with different models using various fusion mechanisms.
Improving the robustness to modality drop: a desirable quality of multimodal representations is robustness to a missing modality. We study how the maximisation of mutual dependency measures during training affects the robustness of the representation when a modality becomes missing.
Towards explainable representations: the statistical network $T_\theta$ allows us to compute a dependency measure between the three considered modalities. We carry out a qualitative analysis in order to investigate whether a high dependency can be explained by complementariness across modalities.

5.1 Efficiency of the MDM penalty

For a simple EF-LSTM, we study the improvement induced by the addition of our MDM penalty. The results are presented in Tab. 1, where an EF-LSTM trained with no mutual dependency term is denoted by $\mathcal{L}_\emptyset$. On both studied datasets, we observe that the addition of an MDM penalty leads to stronger performance on all metrics, and that the best performing models are obtained by training with an additional mutual dependency measure term. Keeping in mind the example shown in Fig. 1, we can draw a first comparison between the different dependency measures. Although in a simple case $\mathcal{L}_f$ and $\mathcal{L}_{kl}$ estimate a similar quantity (see Fig. 1), in more complex practical applications they do not achieve the same performance. Even though the Donsker-Varadhan bound used for $\mathcal{L}_{kl}$ is stronger3 than the one used to estimate $\mathcal{L}_f$, for a simple model the stronger bound does not lead to better results. It is possible that most of the differences in performance observed come from the optimisation process during training.4
Takeaways: In the simple case of the EF-LSTM, adding the MDM penalty improves the performance on the downstream tasks.

3 For a fixed $T_\theta$, the right-hand term in Eq. 6 is greater than that of Eq. 7.
4 Similar conclusions have been drawn in metric learning when comparing different estimates of the mutual information (Boudiaf et al., 2020).

           Acc7 (h)  Acc2 (h)  MAE (l)  Corr (h)
CMU-MOSI
  L∅         31.1      76.1     1.00      0.65
  Lkl        31.7      76.4     1.00      0.66
  Lf         33.7      76.2     1.02      0.66
  LW         33.5      76.4     0.98      0.66
CMU-MOSEI
  L∅         44.2      75.0     0.72      0.52
  Lkl        44.5      75.6     0.70      0.53
  Lf         45.5      75.2     0.70      0.52
  LW         45.3      75.9     0.68      0.54

Table 1: Results on sentiment analysis on both CMU-MOSI and CMU-MOSEI for an EF-LSTM. Acc7 denotes accuracy on 7 classes and Acc2 the binary accuracy. MAE denotes the Mean Absolute Error and Corr is the Pearson correlation. h means higher is better and l means lower is better. The choice of evaluation metrics follows standard practices (Rahman et al., 2020). Underlined results demonstrate a significant improvement (p-value below 0.05) against the baseline when performing the Wilcoxon-Mann-Whitney test (Wilcoxon, 1992) on 10 runs with different seeds.

5.2 Improving models and comparing multivariate dependency measures

In this experiment, we apply the different penalties to more advanced architectures, using various fusion mechanisms.

                 CMU-MOSI                     CMU-MOSEI
         Acc7   Acc2   MAE   Corr     Acc7   Acc2   MAE   Corr
MFN
  L∅     31.3   76.6   1.01  0.62     44.4   74.7   0.72  0.53
  Lkl    32.5   76.7   0.96  0.65     44.2   74.7   0.72  0.57
  Lf     35.7   77.4   0.96  0.65     46.1   75.4   0.69  0.56
  LW     35.9   77.6   0.96  0.65     46.2   75.1   0.69  0.56
LFN
  L∅     31.9   76.9   1.00  0.63     45.2   74.2   0.70  0.54
  Lkl    32.6   77.7   0.97  0.63     46.1   75.3   0.68  0.57
  Lf     35.6   77.1   0.97  0.63     45.8   75.4   0.69  0.57
  LW     35.6   77.7   0.96  0.67     46.2   75.4   0.67  0.57
MAGBERT
  L∅     40.2   84.7   0.79  0.80     46.8   84.9   0.59  0.77
  Lkl    42.0   85.6   0.76  0.82     47.1   85.4   0.59  0.79
  Lf     41.7   85.6   0.78  0.82     46.9   85.6   0.59  0.79
  LW     41.8   85.3   0.76  0.82     47.8   85.5   0.59  0.79
MAGXLNET
  L∅     43.0   86.2   0.76  0.82     46.7   84.4   0.59  0.79
  Lkl    44.5   86.1   0.74  0.82     47.5   85.4   0.59  0.81
  Lf     43.9   86.6   0.74  0.82     47.4   85.0   0.59  0.81
  LW     44.4   86.9   0.74  0.82     47.9   85.8   0.59  0.82

Table 2: Results on sentiment and emotion prediction on both CMU-MOSI and CMU-MOSEI for the different neural architectures presented in sec. 4, relying on various fusion mechanisms.
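For the significance test mentioned in the table captions, a hedged SciPy sketch (the seed-level scores below are toy numbers, not the paper's raw results):

```python
from scipy.stats import mannwhitneyu

# Compare 10 seeds of a baseline against 10 seeds of a penalised model.
acc_baseline = [76.1, 75.8, 76.3, 76.0, 75.9, 76.2, 76.1, 75.7, 76.4, 76.0]
acc_penalty  = [76.4, 76.6, 76.3, 76.8, 76.5, 76.7, 76.2, 76.9, 76.5, 76.6]
stat, p = mannwhitneyu(acc_penalty, acc_baseline, alternative="greater")
print(f"U={stat}, p={p:.4f}")  # results are underlined in Tab. 1-2 when p < 0.05
```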
General analysis. Tab. 2 shows the performance of various neural architectures trained with and without the MDM penalty. Results are coherent with the previous experiment: we observe that jointly maximising a mutual dependency measure leads to better results on the downstream task. For example, an MFN trained on CMU-MOSI with $\mathcal{L}_{\mathcal{W}}$ outperforms the model trained without the mutual dependency term by 4.6 points on Acc7. On CMU-MOSEI we also obtain subsequent improvements when training with the MDM. On CMU-MOSI, the TFN also strongly benefits from the mutual dependency term, with an absolute improvement of 3.7% (on Acc7) with $\mathcal{L}_{\mathcal{W}}$ compared to $\mathcal{L}_\emptyset$. Tab. 2 shows that our methods not only perform well on recurrent architectures, but also on pretrained transformer-based models, which achieve higher results due to a superior capacity to model contextual dependencies (see (Rahman et al., 2020)).
Improving state-of-the-art models. MAGBERT and MAGXLNET are state-of-the-art models on both CMU-MOSI and CMU-MOSEI. From Tab. 2, we observe that our methods can improve the performance of both models. It is worth noting that, in both cases, $\mathcal{L}_{\mathcal{W}}$ combined with pre-trained transformers achieves good results. This performance gain suggests that our method is able to capture dependencies that are not learnt during either the pre-training of the language model (i.e. BERT or XLNET) or by the Multimodal Adaptation Gate used to perform the fusion.
Comparing dependency measures. Tab. 2 shows that no dependency measure achieves the best results in all cases. This result tends to confirm that the optimisation process during training plays an important role (see the hypothesis in sec. 5.1). However, we observe that optimising the multivariate Wasserstein dependency measure is usually a good choice, since it achieves state-of-the-art results in many configurations. It is also worth noting that several pieces of research point out the limitations of mutual information estimators (McAllester and Stratos, 2020; Song and Ermon, 2019).
Takeaways: The addition of the MDM penalty not only benefits simple models (e.g. EF-LSTM) but also improves performance when combined with both complex fusion mechanisms and pretrained models. For practical applications, the Wasserstein distance is a good choice of contrast function.

5.3 Improved robustness to modality drop

Although fusion with the visual and acoustic modalities provides a performance improvement (Wang et al., 2018), the performance of multimodal systems on sentiment prediction tasks is mainly carried by the linguistic modality (Zadeh et al., 2018a, 2017). Thus, it is interesting to study how a multimodal system behaves when the text modality is missing, because it gives insights into the robustness of the representations.
Spoken transcript                                                                        Acoustic and visual behaviour                  Tθ
um the story was all right                                                               low-energy monotonous voice + headshake        L
i mean its a Nicholas Sparks book it must be good                                        disappointed tone + neutral facial expression  L
the action is fucking awesome                                                            head nod + excited voice                       H
it was cute you know the actors did a great job bringing the smurfs to
life such as joe george lopez neil patrick harris katy perry and a fourth                multiple smiles                                H

Table 3: Examples from the CMU-MOSI dataset using MAGBERT. The last column is computed using the statistical network Tθ; L stands for low values and H for high values. Green, grey and red highlight positive, neutral and negative expressions/behaviours respectively.

[Figure 2: bar chart; y-axis Acc2^corrupt/Acc2 (0.0 to 0.6); x-axis: present modality (A, V, A+V); legend includes Lf and Lkl.]
Figure 2: Study of the robustness of the representations against dropping the linguistic modality. The studied model is MAGBERT on CMU-MOSI. The ratio between the accuracy achieved with a corrupted linguistic modality, Acc2^corrupt, and the accuracy Acc2 without any corruption is reported on the y-axis. The modalities preserved during inference are reported on the x-axis; A and V respectively stand for the acoustic and visual modality.

Experiment description. In this experiment, we focus on MAGBERT and MAGXLNET since they are the best performing models.5 As before, the considered models are trained using the losses described in sec. 3, and all modalities are kept during training. During inference, we either keep only one modality (audio or video) or both; the text modality is always dropped.
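A hedged sketch of the corruption scheme as we read it (the paper does not specify how a dropped modality is fed to the encoder; zeroing the view is our assumption):

```python
import torch

def drop_modalities(xa, xv, xl, keep=("a", "v")):
    """Zero out the views that are not kept at inference time (assumed
    corruption scheme for the robustness experiment)."""
    xa = xa if "a" in keep else torch.zeros_like(xa)
    xv = xv if "v" in keep else torch.zeros_like(xv)
    xl = xl if "l" in keep else torch.zeros_like(xl)
    return xa, xv, xl

# e.g. evaluate with audio only, text always dropped as in Fig. 2:
# acc2_corrupt = evaluate(model, *drop_modalities(xa, xv, xl, keep=("a",)))
```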
Results. The results of the experiments conducted on CMU-MOSI are shown in Fig. 2, giving values for the ratio $Acc_2^{corrupt}/Acc_2$, where $Acc_2^{corrupt}$ is the binary accuracy in the corrupted configuration and $Acc_2$ the accuracy obtained when all modalities are considered. We observe that models trained with an MDM penalty (either $\mathcal{L}_{kl}$, $\mathcal{L}_f$ or $\mathcal{L}_{\mathcal{W}}$) resist better to missing modalities than those trained with $\mathcal{L}_\emptyset$. For example, when trained with $\mathcal{L}_{kl}$ or $\mathcal{L}_f$, the drop in performance is limited to approximately 25% in any setting. Interestingly, for MAGBERT, $\mathcal{L}_{\mathcal{W}}$ and $\mathcal{L}_{kl}$ achieve comparable results; $\mathcal{L}_{kl}$ is more resistant to dropping the language modality, and thus could be preferred in practical applications.
Takeaway: Maximising the MDM allows an information transfer between modalities.

5 Because of space constraints, results corresponding to MAGXLNET are reported in sec. 8.

5.4 Towards explainable representations

In this section, we propose a qualitative experiment allowing us to interpret the predictions made by the deep neural classifier. During training, $T_\theta$ estimates the mutual dependency measure, using the surrogates introduced in Th. 1. However, the inference process only involves the classifier, and $T_\theta$ is unused. Eq. 6, Eq. 7 and Eq. 8 show that $T_\theta$ is trained to discriminate between valid representations (coming from the joint distribution) and corrupted representations (coming from the product of the marginals). Thus, $T_\theta$ can be used, at inference time, to measure the mutual dependency of the representations used by the neural model. In Tab. 3 we report examples of low and high dependency measures for MAGBERT on CMU-MOSI. We observe that high values correspond to video clips where audio, text and video are complementary (e.g. the use of head nods (McClave, 2000)), while low values correspond to cases where contradictions exist across several modalities. Results on MAGXLNET can be found in sec. 8.3.
Takeaways: $T_\theta$, used to estimate the MDM, provides a means to interpret the representations learnt by the encoder.
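A sketch of how the trained statistical network can be reused at inference to rank clips; the critic below is an untrained toy stand-in, and the H/L thresholds of Tab. 3 are not specified in the paper.

```python
import torch

critic_net = torch.nn.Sequential(torch.nn.Linear(4 + 8 + 16, 32),
                                 torch.nn.ReLU(), torch.nn.Linear(32, 1))

@torch.no_grad()
def dependency_scores(xa, xv, xl):
    """Score each clip with T_theta: high values ~ complementary modalities,
    low values ~ cross-modal contradictions (our reading of Tab. 3)."""
    return critic_net(torch.cat([xa, xv, xl], dim=-1)).squeeze(-1)

scores = dependency_scores(torch.randn(5, 4), torch.randn(5, 8), torch.randn(5, 16))
ranked = scores.argsort()  # inspect lowest / highest scoring clips
print(scores[ranked[0]].item(), scores[ranked[-1]].item())
```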
6 Conclusions

In this paper, we introduced three new losses based on MDM. Through an extensive set of experiments on CMU-MOSI and CMU-MOSEI, we have shown that SOTA architectures can benefit from these innovations with little modification. A by-product of our method is a statistical network that is a useful tool to explain the learnt high-dimensional multimodal representations. This work paves the way for using and developing new alternative methods to improve the learning, e.g. new estimators of
mutual information (Colombo et al., 2021a), Wasserstein barycenters (Colombo et al., 2021b), data depths (Staerman et al., 2021) or extreme value theory (Jalalzai et al., 2020). A future line of research involves using these methods for emotion (Colombo et al., 2019; Witon et al., 2018) and dialog act (Chapuis et al., 2021, 2020a,b) classification with pre-trained models tailored for spoken language (Dinkar et al., 2020).

7 Acknowledgments

The research carried out in this paper has received funding from IBM, the French National Research Agency's grant ANR-17-MAOI and the DSAIDIS chair at Telecom-Paris. This work was also granted access to the HPC resources of IDRIS under the allocation 2021-AP010611665, as well as under the project 2021-101838 made by GENCI.
References                                                Pierre Colombo, Guillaume Staerman, Chloé Clavel,
Abien Fred Agarap. 2018.          Deep learning us-          and Pablo Piantanida. 2021b. Automatic text eval-
  ing rectified linear units (relu). arXiv preprint          uation through the lens of wasserstein barycenters.
  arXiv:1803.08375.                                          CoRR, abs/2108.12463.

Andreas Argyriou, Theodoros Evgeniou, and Massim-         Pierre Colombo, Wojciech Witon, Ashutosh Modi,
  iliano Pontil. 2007. Multi-task feature learning. In       James Kennedy, and Mubbasir Kapadia. 2019.
  Advances in neural information processing systems,         Affect-driven dialog generation. In Proceedings of
  pages 41–48.                                               the 2019 Conference of the North American Chap-
                                                             ter of the Association for Computational Linguistics:
Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-              Human Language Technologies, NAACL-HLT 2019,
  Philippe Morency. 2018. Multimodal machine learn-         Minneapolis, MN, USA, June 2-7, 2019, Volume 1
  ing: A survey and taxonomy.        IEEE transac-          (Long and Short Papers), pages 3734–3743. Associ-
  tions on pattern analysis and machine intelligence,        ation for Computational Linguistics.
  41(2):423–443.
                                                          Gilles Degottex, John Kane, Thomas Drugman, Tuomo
Mohamed Ishmael Belghazi, Aristide Baratin, Sai             Raitio, and Stefan Scherer. 2014. Covarep—a col-
 Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron              laborative voice analysis repository for speech tech-
 Courville, and R Devon Hjelm. 2018. Mine: mu-              nologies. In 2014 ieee international conference
 tual information neural estimation. arXiv preprint         on acoustics, speech and signal processing (icassp),
 arXiv:1801.04062.                                          pages 960–964. IEEE.
Malik Boudiaf, Jérôme Rony, Imtiaz Masud Ziko, Eric       Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
 Granger, Marco Pedersoli, Pablo Piantanida, and Is-         Kristina Toutanova. 2018. Bert: Pre-training of deep
 mail Ben Ayed. 2020. A unifying mutual informa-             bidirectional transformers for language understand-
 tion view of metric learning: cross-entropy vs. pair-       ing. arXiv preprint arXiv:1810.04805.
 wise losses. In European Conference on Computer
 Vision, pages 548–564. Springer.                         Hammad Dilpazir, Zia Muhammad, Qurratulain Min-
                                                            has, Faheem Ahmed, Hafiz Malik, and Hasan Mah-
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe              mood. 2016. Multivariate mutual information for au-
  Kazemzadeh, Emily Mower, Samuel Kim, Jean-                dio video fusion. Signal, Image and Video Process-
  nette N Chang, Sungbok Lee, and Shrikanth S               ing, 10(7):1265–1272.
  Narayanan. 2008. Iemocap: Interactive emotional
  dyadic motion capture database. Language re-            Tanvi Dinkar, Pierre Colombo, Matthieu Labeau, and
  sources and evaluation, 42(4):335.                        Chloé Clavel. 2020. The importance of fillers for
                                                            text representations of speech transcripts. In Pro-
Emile Chapuis, Pierre Colombo, Matthieu Labeau, and         ceedings of the 2020 Conference on Empirical Meth-
  Chloé Clave. 2021. Code-switched inspired losses          ods in Natural Language Processing, EMNLP 2020,
  for generic spoken dialog representations. CoRR,          Online, November 16-20, 2020, pages 7985–7993.
  abs/2108.12465.                                           Association for Computational Linguistics.
MD Donsker and SRS Varadhan. 1985. Large deviations for stationary Gaussian processes. Communications in Mathematical Physics, 97(1-2):187–210.

Alexandre Garcia, Pierre Colombo, Slim Essid, Florence d'Alché Buc, and Chloé Clavel. 2019a. From the token to the review: A hierarchical multimodal approach to opinion mining. arXiv preprint arXiv:1908.11216.

Alexandre Garcia, Slim Essid, Florence d'Alché Buc, and Chloé Clavel. 2019b. A multimodal movie review corpus for fine-grained opinion mining. arXiv preprint arXiv:1902.10102.

Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. arXiv preprint arXiv:2005.03545.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Éric Gaussier, Giovanna Varni, Emmanuel Vignon, and Anne Sabourin. 2020. Heavy-tailed representations, text polarity classification & data augmentation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Ryan G James and James P Crutchfield. 2017. Multivariate dependence beyond Shannon information. Entropy, 19(10):531.

Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost Van De Weijer, Andrew D Bagdanov, Maria Vanrell, and Antonio M Lopez. 2012. Color attributes for object detection. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3306–3313. IEEE.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Justin B Kinney and Gurinder S Atwal. 2014. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111(9):3354–3359.

Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. 2019. A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350.

Ralph Linsker. 1988. Self-organization in a perceptual network. Computer, 21(3):105–117.

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Sijie Mai, Haifeng Hu, and Songlong Xing. 2019. Divide, conquer and combine: Hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 481–492.

David McAllester and Karl Stratos. 2020. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, pages 875–884.

Evelyn Z McClave. 2000. Linguistic functions of head movements in the context of speech. Journal of Pragmatics, 32(7):855–878.

Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. 2017. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2670–2680.

XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. 2010. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron Van den Oord, Sergey Levine, and Pierre Sermanet. 2019. Wasserstein dependency measure for representation learning. In Advances in Neural Information Processing Systems, pages 15604–15614.

Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2014. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 50–57.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Gabriel Peyré, Marco Cuturi, et al. 2019. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607.
Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6892–6899.

Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2359–2369.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, and Suranga Nanayakkara. 2020. Jointly fine-tuning "BERT-like" self-supervised models to improve multimodal speech emotion recognition. arXiv preprint arXiv:2008.06682.

Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic. 2017. A survey of multimodal sentiment analysis. Image and Vision Computing, 65:3–14.

Jiaming Song and Stefano Ermon. 2019. Understanding the limitations of variational mutual information estimators. arXiv preprint arXiv:1910.06222.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Guillaume Staerman, Pavlo Mozharovskyi, and Stéphan Clémençon. 2021. Affine-invariant integrated rank-weighted depth: Definition, properties and finite sample analysis. arXiv preprint arXiv:2106.11068.

Milan Studený and Jirina Vejnarová. 1998. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models, pages 261–297. Springer.

Zhongkai Sun, Prathusha Sarma, William Sethares, and Yingyu Liang. 2020. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8992–8999.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849.

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2019, page 6558. NIH Public Access.

Cédric Villani. 2008. Optimal transport: old and new, volume 338. Springer Science & Business Media.

Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Words can shift: Dynamically adjusting word representations using nonverbal behaviors.

Satosi Watanabe. 1960. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82.

Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics, pages 196–202. Springer.

Wojciech Witon, Pierre Colombo, Ashutosh Modi, and Mubbasir Kapadia. 2018. Disney at IEST 2018: Predicting emotions using an ensemble. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA@EMNLP 2018, Brussels, Belgium, October 31, 2018, pages 248–253. Association for Computational Linguistics.

Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53.

Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical evaluation of rectified activations in convolutional network.

Chang Xu, Dacheng Tao, and Chao Xu. 2013. A survey on multi-view learning. arXiv preprint arXiv:1304.5634.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Jun Ye, Hao Hu, Guo-Jun Qi, and Kien A Hua. 2017. A temporal order modeling approach to human action recognition from multimodal sensor data. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 13(2):1–22.

Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. Journal of the Acoustical Society of America, 123(5):3878.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.
Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927.

Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018b. Multi-attention recurrent network for human communication comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 2018, page 5642. NIH Public Access.

Amir Zadeh, Chengfeng Mao, Kelly Shi, Yiwei Zhang, Paul Pu Liang, Soujanya Poria, and Louis-Philippe Morency. 2019. Factorized multimodal transformer for multimodal sequential learning. arXiv preprint arXiv:1911.09826.

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018c. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.
8  Supplementary

8.1  Training details

In this section, we present a comprehensive illustration of Algorithm 1, detail the selection of the experimental hyperparameters, and describe the architecture used for the statistic network Tθ.

8.1.1  Illustration of Algorithm 1

Fig. 3 illustrates Algorithm 1. As can be seen in the figure, to compute the mutual dependency measure, the statistic network Tθ takes as input the fused embeddings of the two batches B and B̄.
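As a concrete sketch of this step, the snippet below builds B̄ by permuting the acoustic and visual streams across the batch and evaluates a Donsker-Varadhan-style lower bound (Donsker and Varadhan, 1985; Belghazi et al., 2018), one of the estimators covered by Th. 1. The shuffling scheme and the helper names (f_theta, t_net) are our assumptions, not the exact implementation of Algorithm 1.

import torch

def shuffle_modalities(l, a, v):
    # Batch B-bar: permute the acoustic and visual streams across the batch so
    # the modalities no longer come from the same sample, approximating a draw
    # from the product of the marginal distributions (assumed scheme).
    return l, a[torch.randperm(a.size(0))], v[torch.randperm(v.size(0))]

def dv_dependency_estimate(t_net, z_joint, z_marg):
    # Donsker-Varadhan lower bound: E_B[T(z)] - log E_B-bar[exp(T(z))].
    t_joint = t_net(z_joint).squeeze(-1)  # T_theta on fused embeddings of B
    t_marg = t_net(z_marg).squeeze(-1)    # T_theta on fused embeddings of B-bar
    n = torch.tensor(float(t_marg.size(0)))
    return t_joint.mean() - (torch.logsumexp(t_marg, dim=0) - torch.log(n))

# Usage sketch: f_theta fuses (l, a, v) into Z_avl; the resulting estimate is
# then maximised jointly with the target loss.
# z_joint = f_theta(l, a, v)
# z_marg = f_theta(*shuffle_modalities(l, a, v))
# dependency = dv_dependency_estimate(t_net, z_joint, z_marg)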

Figure 3: Illustration of the method described in Algorithm 1 for the different estimators derived from Th. 1. B and B̄ stand for the batches of data sampled from the joint probability distribution and from the product of the marginal distributions, respectively. Zavl denotes the fusion representation of the linguistic, acoustic and visual (resp. l, a and v) modalities provided by a multimodal architecture fθe for the batch B. Z̄avl denotes the same quantity for the batch B̄. Aθp denotes the linear projection before classification or regression.
8.1.2  Hyperparameters selection

We use dropout (Srivastava et al., 2014) and optimise the global loss of Eq. 5 by gradient descent, using the AdamW optimiser (Loshchilov and Hutter, 2017; Kingma and Ba, 2014). The best learning rate is found in the grid {0.002, 0.001, 0.0005, 0.0001}. The best model is selected using the lowest MAE on the validation set. We set Unroll to 10.
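This selection loop can be sketched as follows; build_model, train_epoch and eval_mae are placeholder callables standing in for the model constructor, one training pass, and the validation MAE computation, none of which are specified here.

from torch.optim import AdamW

def select_model(build_model, train_epoch, eval_mae,
                 grid=(0.002, 0.001, 0.0005, 0.0001), epochs=10):
    # Train one model per learning rate with AdamW and keep the checkpoint
    # reaching the lowest MAE on the validation set.
    best_mae, best_state = float("inf"), None
    for lr in grid:
        model = build_model()
        optimiser = AdamW(model.parameters(), lr=lr)
        for _ in range(epochs):
            train_epoch(model, optimiser)  # one pass over the training set
            mae = eval_mae(model)          # MAE on the validation set
            if mae < best_mae:
                best_mae = mae
                best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_state, best_mae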

8.1.3  Architectures of Tθ

Across the different experiments, we use a statistic network with the architecture described in Tab. 4. We follow Belghazi et al. (2018) and use LeakyReLU (Agarap, 2018; Xu et al., 2015) as the activation function.

                Statistic Network
Layer           Number of outputs    Activation function
[Zavl, Z̄avl]    din, din             -
Dense layer     din/2                LeakyReLU
Dropout         0.4                  -
Dense layer     din                  LeakyReLU
Dropout         0.4                  -
Dense layer     din                  LeakyReLU
Dropout         0.4                  -
Dense layer     din/4                LeakyReLU
Dropout         0.4                  -
Dense layer     din/4                LeakyReLU
Dropout         0.4                  -
Dense layer     1                    Sigmoid

Table 4: Statistic network description. din denotes the dimension of Zavl; the value reported for each Dropout row is the dropout rate.
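A minimal PyTorch sketch of Tab. 4 could read as follows. We assume that din is divisible by four and that the network scores one fused embedding at a time (one possible reading of the input row of the table); both points are assumptions rather than details given above.

import torch
import torch.nn as nn

class StatisticNetwork(nn.Module):
    # Statistic network T_theta following Tab. 4: five hidden dense layers
    # with LeakyReLU activations and dropout 0.4, then a sigmoid scalar head.
    def __init__(self, d_in: int, p_drop: float = 0.4):
        super().__init__()
        dims = [d_in, d_in // 2, d_in, d_in, d_in // 4, d_in // 4]
        layers = []
        for in_d, out_d in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(in_d, out_d), nn.LeakyReLU(), nn.Dropout(p_drop)]
        layers += [nn.Linear(dims[-1], 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)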
8.2  Additional experiments for robustness to modality drop

Fig. 4 shows the results of the robustness test on MAGXLNET. Similarly to Fig. 2, we observe representations that are more robust to modality drop when jointly maximising LW and Lkl with the target loss. Fig. 4 shows no improvement when training with Lf. This can also be linked to Tab. 2, which similarly shows no improvement in this very specific configuration.

Figure 4: Study of the robustness of the representations against a drop of the linguistic modality. The studied model is MAGXLNET on CMU-MOSI. The ratio between the accuracy Acc2corrupt achieved with a corrupted linguistic modality and the accuracy Acc2 obtained without any corruption is reported on the y-axis. The modalities preserved during inference (A, V, or A+V) are reported on the x-axis. A and V stand for the acoustic and visual modality, respectively.
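The ratio reported in Fig. 4 can be probed with a sketch along the following lines. We assume the model exposes a (l, a, v) call interface and returns a scalar sentiment score whose sign gives the binary decision, and we corrupt a stream by zeroing it; the exact corruption scheme used for the figure may differ.

import torch

@torch.no_grad()
def acc2_ratio_under_language_drop(model, loader, keep=("a", "v")):
    # Return Acc2_corrupt / Acc2 for a given set of preserved modalities.
    def acc2(corrupt):
        correct = total = 0
        for l, a, v, y in loader:
            if corrupt:
                l = torch.zeros_like(l)       # drop the linguistic stream
                if "a" not in keep:
                    a = torch.zeros_like(a)   # also drop acoustic if not kept
                if "v" not in keep:
                    v = torch.zeros_like(v)   # also drop visual if not kept
            pred = (model(l, a, v).squeeze(-1) >= 0).long()  # binary sentiment
            correct += (pred == y).sum().item()
            total += y.numel()
        return correct / total
    model.eval()
    return acc2(corrupt=True) / acc2(corrupt=False)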
8.3  Additional qualitative examples

Tab. 5 illustrates the use of Tθ to explain the representations learnt by the model. We observe that high values correspond to complementarity across modalities, and low values are related to contradictoriness across modalities.

Spoken transcripts                                                          Acoustic and visual behaviour                    Tθ
but the m the script is corny                                               high energy voice + headshake + (many) smiles    L
as for gi joe was it was just like laughing its the the plot
the the acting is terrible                                                  high energy voice + laughs + smiles              L
but i think this one did beat scream 2 now                                  headshake + long sigh                            L
the xxx sequence is really well done                                        static head + low energy monotonous voice        L
you know of course i was waiting for the princess and the frog              smiles + high energy voice + high pitch          H
dennis quaid i think had a lot of fun                                       smiles + high energy voice                       H
it was very very very boring                                                low energy voice + frown eyebrows                H
i do not wanna see any more of this                                         angry voice + angry facial expression            H

Table 5: Examples from the CMU-MOSI dataset using MAGXLNET trained with LW. The last column is computed using the statistic network Tθ. L stands for low values and H for high values. Green, grey and red highlight positive, neutral and negative expressions/behaviours, respectively.