Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis


Wei Han†, Hui Chen†, Soujanya Poria†
†Singapore University of Technology and Design, Singapore
{wei_han, hui_chen}@mymail.sutd.edu.sg, sporia@sutd.edu.sg

Abstract

In multimodal sentiment analysis (MSA), the performance of a model highly depends on the quality of synthesized embeddings. These embeddings are generated from the upstream process called multimodal fusion, which aims to extract and combine the input unimodal raw data to produce a richer multimodal representation. Previous work either back-propagates the task loss or manipulates the geometric property of feature spaces to produce favorable fusion results, which neglects the preservation of critical task-related information that flows from the input to the fusion results. In this work, we propose a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs (inter-modality) and between the multimodal fusion result and unimodal input, in order to maintain task-related information through multimodal fusion. The framework is jointly trained with the main task (MSA) to improve the performance of the downstream MSA task. To address the intractability of the MI bounds, we further formulate a set of computationally simple parametric and non-parametric methods to approximate their true values. Experimental results on two widely used datasets demonstrate the efficacy of our approach. The implementation of this work is publicly available at https://github.com/declare-lab/Multimodal-Infomax.

1 Introduction

With the unprecedented advances in social media in recent years and the availability of smartphones with high-quality cameras, we witness an explosive boost of multimodal data, such as movies, short-form videos, etc. In real life, multimodal data usually consists of three channels: visual (image), acoustic (voice), and transcribed text. Much of this data expresses some sort of sentiment, which is a long-term disposition evoked when a person encounters a specific topic, person or entity (Deonna and Teroni, 2012; Poria et al., 2020). Mining and understanding these emotional elements from multimodal data, namely multimodal sentiment analysis (MSA), has become a hot research topic because of numerous appealing applications, such as obtaining overall product feedback from customers or gauging polling intentions from potential voters (Melville et al., 2009). Generally, different modalities in the same data segment are often complementary to each other, providing extra cues for semantic and emotional disambiguation (Ngiam et al., 2011). The crucial part of MSA is multimodal fusion, in which a model aims to extract and integrate information from all input modalities to understand the sentiment behind the seen data. Existing methods to learn unified representations fall into two categories: loss back-propagation and geometric manipulation in the feature spaces. The former only tunes the parameters based on gradients back-propagated from the task loss (Zadeh et al., 2017; Tsai et al., 2019a; Ghosal et al., 2019), a reconstruction loss (Mai et al., 2020), or auxiliary task losses (Chen et al., 2017; Yu et al., 2021). The latter additionally rectifies the spatial orientation of unimodal or multimodal representations by matrix decomposition (Liu et al., 2018) or Euclidean measure optimization (Sun et al., 2020; Hazarika et al., 2020). Although these methods have gained excellent results in MSA tasks, they are limited by the lack of control over the information flow that starts from the raw inputs and ends at the fusion embeddings, which may risk losing practical information and introducing unexpected noise carried by each modality (Tsai et al., 2020).

To alleviate this issue, different from previous work, we leverage the functionality of mutual information (MI), a concept from information theory. MI measures the dependencies between paired multi-dimensional variables. Maximizing MI has been demonstrated to be efficacious in removing redundant information irrelevant to the downstream task and in capturing invariant trends or messages across time or different domains (Poole et al., 2019), and it has shown remarkable success in the field of representation learning (Hjelm et al., 2018; Veličković et al., 2018). Based on this experience, we propose MultiModal InfoMax (MMIM), a framework that hierarchically maximizes the mutual information in multimodal fusion. Specifically, we enhance two types of mutual information in representation pairs: between unimodal representations, and between the fusion results and their low-level unimodal representations. Due to the intractability of mutual information (Belghazi et al., 2018), researchers usually boost an MI lower bound instead for this purpose. However, we find that it is still difficult to compute some terms in the expressions of these lower bounds in our formulation. Hence, for convenient and accurate estimation of these terms, we propose a hybrid approach composed of parametric and non-parametric parts based on data and model characteristics. The parametric part refers to neural network-based methods, and in the non-parametric part we exploit a Gaussian Mixture Model (GMM) with learning-free parameter estimation. Our contributions can be summarized as follows:

1. We propose a hierarchical MI maximization framework for multimodal sentiment analysis. MI maximization occurs at the input level and the fusion level to reduce the loss of valuable task-related information. To the best of our knowledge, this is the first attempt to bridge MI and MSA.

2. We formulate the computation details in our framework to solve the intractability problem. The formulation includes parametric learning and a non-parametric GMM with stable and smooth parameter estimation.

3. We conduct comprehensive experiments on two publicly available datasets and gain superior or comparable results to the state-of-the-art models.

2 Related Work

In this section, we briefly overview related work in multimodal sentiment analysis and in mutual information estimation and application.

2.1 Multimodal Sentiment Analysis (MSA)

MSA is an NLP task that collects and tackles data from multiple sources, such as acoustic, visual, and textual information, to comprehend varied human emotions (Morency et al., 2011). Early fusion models adopted simple network architectures, such as RNN-based models (Wöllmer et al., 2013; Chen et al., 2017) that capture temporal dependencies from low-level multimodal inputs, and SAL-CNN (Wang et al., 2017), which designed a select-additive learning procedure to improve the generalizability of trained neural networks. Meanwhile, there were many attempts to incorporate geometric measures as auxiliary learning objectives into deep learning frameworks. For instance, Hazarika et al. (2018); Sun et al. (2020) optimized the deep canonical correlation between modality representations for fusion and then passed the fusion result to downstream tasks. More recently, formulations influenced by novel machine learning topics have emerged constantly: Akhtar et al. (2019) presented a deep multi-task learning framework to jointly learn sentiment polarity and emotional intensity in a multimodal context. Pham et al. (2019) proposed a method that cyclically translates between modalities to learn robust joint representations for sentiment analysis. Tsai et al. (2020) proposed a routing procedure that dynamically adjusts weights among modalities to provide interpretability for multimodal fusion. Motivated by advances in the field of domain separation, Hazarika et al. (2020) projected modality features into private and common feature spaces to capture exclusive and shared characteristics across different modalities. Yu et al. (2021) designed a multi-label training scheme that generates extra unimodal labels for each modality, which are trained concurrently with the main task.

In this work, we build up a hierarchical MI-maximization guided model to improve the fusion outcome as well as the performance on the downstream MSA task, where MI maximization is realized not only between unimodal representations but also between fusion embeddings and unimodal representations.

2.2 Mutual Information in Deep Learning

Mutual information (MI) is a concept from information theory that estimates the relationship between pairs of variables. It is a reparameterization-invariant measure of dependency (Tishby and Zaslavsky, 2015), defined as:
I(X; Y) = E_{p(x,y)} [ log ( p(x, y) / ( p(x) p(y) ) ) ]   (1)

Alemi et al. (2016) first combined MI-related optimization with deep learning models. Since then, numerous works have studied and demonstrated the benefit of the MI-maximization principle (Bachman et al., 2019; He et al., 2020; Amjad and Geiger, 2019). However, since direct MI estimation in high-dimensional spaces is nearly impossible, many works attempted to approximate the true value with variational bounds (Belghazi et al., 2018; Cheng et al., 2020; Poole et al., 2019).

In our work, we apply MI lower bounds at both the input level and the fusion level, and we formulate or reformulate estimation methods for these bounds based on data characteristics and the mathematical properties of the terms to be estimated.

3 Method

3.1 Problem Definition

In MSA tasks, the input to a model is unimodal raw sequences X_m ∈ R^{l_m × d_m} drawn from the same video fragment, where l_m is the sequence length and d_m is the representation vector dimension of modality m, respectively. In particular, in this paper we have m ∈ {t, v, a}, where t, v, a denote the three types of modalities—text, visual and acoustic—that we obtain from the datasets. The goal of the designed model is to extract and integrate task-related information from these input vectors to form a unified representation, and then utilize it to make accurate predictions about a truth value y that reflects the sentiment strength.

3.2 Overall Architecture

As shown in Figure 1, our model first processes the raw input into numerical sequential vectors with feature extractors (fixed, with no parameters to train, for visual and acoustic) and a tokenizer (for text). Then we encode them into individual unit-length representations. The model then works in two collaborative parts—fusion and MI maximization—marked by solid and dashed lines in Figure 1, respectively. In the fusion part, a fusion network F of stacked linear-activation layers transforms the unimodal representations into the fusion result Z, which is then passed through a regression multilayer perceptron (MLP) for final predictions. In the MI part, the MI lower bounds at two levels—the input level and the fusion level—are estimated and boosted. The two parts work concurrently to produce task- and MI-related losses for back-propagation, through which the model learns to infuse task-related information into the fusion results as well as to improve the accuracy of the predictions in the main task.
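The fusion network F and the regression MLP are only characterized above as stacked linear-activation layers, so the following is a minimal PyTorch-style sketch of one plausible realization. The layer counts, hidden sizes and ReLU activation are assumptions for illustration, not the released configuration, and h_t, h_v, h_a are the unimodal representations introduced in Section 3.3.

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Stacked linear-activation layers mapping the concatenated unimodal
    representations (h_t, h_v, h_a) to a fusion vector Z, plus a regression
    MLP that predicts the sentiment score. All sizes are illustrative."""

    def __init__(self, d_t=768, d_v=32, d_a=32, d_fusion=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_t + d_v + d_a, d_fusion), nn.ReLU(),
            nn.Linear(d_fusion, d_fusion), nn.ReLU(),
        )
        self.regress = nn.Sequential(
            nn.Linear(d_fusion, d_fusion // 2), nn.ReLU(),
            nn.Linear(d_fusion // 2, 1),
        )

    def forward(self, h_t, h_v, h_a):
        z = self.fuse(torch.cat([h_t, h_v, h_a], dim=-1))  # fusion result Z
        y_hat = self.regress(z).squeeze(-1)                 # sentiment prediction
        return z, y_hat
```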
3.3 Modality Encoding

We first encode the multimodal sequential input X_m into unit-length representations h_m. Specifically, we use BERT (Devlin et al., 2019) to encode an input sentence and extract the head embedding from the last layer's output as h_t. For visual and acoustic, following previous works (Hazarika et al., 2020; Yu et al., 2021), we employ two modality-specific unidirectional LSTMs (Hochreiter and Schmidhuber, 1997) to capture the temporal features of these modalities:

h_t = BERT(X_t; θ_t^BERT),
h_m = sLSTM(X_m; θ_m^LSTM),  m ∈ {v, a}   (2)
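To make Eq. (2) concrete, here is a hedged sketch of the modality encoders: h_t is taken as the head ([CLS]) embedding of BERT's last layer, and h_v, h_a as the final hidden states of unidirectional LSTMs. The input feature dimensions and hidden sizes below are placeholders; the pooling and sizes in the released code may differ.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ModalityEncoders(nn.Module):
    """Eq. (2): h_t from BERT's head embedding, h_v / h_a from the last
    hidden state of modality-specific unidirectional LSTMs (sLSTM)."""

    def __init__(self, d_v_in=35, d_a_in=74, d_v=32, d_a=32):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.v_lstm = nn.LSTM(d_v_in, d_v, batch_first=True)
        self.a_lstm = nn.LSTM(d_a_in, d_a, batch_first=True)

    def forward(self, text_ids, text_mask, X_v, X_a):
        # h_t: embedding of the first ([CLS]) token from BERT's last layer
        h_t = self.bert(input_ids=text_ids,
                        attention_mask=text_mask).last_hidden_state[:, 0]
        # h_v, h_a: final hidden states of the unidirectional LSTMs
        _, (h_v, _) = self.v_lstm(X_v)
        _, (h_a, _) = self.a_lstm(X_a)
        return h_t, h_v.squeeze(0), h_a.squeeze(0)
```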
[Figure 1: The overall architecture of the MMIM model. Diagram components: acoustic and visual feature extractors followed by A-/V-sLSTM encoders with history memories and entropy estimators, a 12-layer BERT text encoder, the fusion network with an MLP for regression, the lld predictors for MI estimation, and a Contrastive Predictive Coding module producing the CPC score.]

3.4 Inter-modality MI Maximization

For a modality representation pair X, Y that comes from a single video clip, although the two may seem to be independent sequences, there is a certain correlation between them (Arandjelovic and Zisserman, 2017). Formally, suppose we have a collection of videos V and assume that their prior distributions are known. Then the prior distributions of X and Y can be decomposed by the sampling process over V as P(X) = ∫_V P(X|V) P(V) and P(Y) = ∫_V P(Y|V) P(V), as well as their joint distribution P(X, Y) = ∫_V P(X, Y|V) P(V). Unless P(X, Y|V) = P(X|V) P(Y|V), i.e., X and Y are conditionally independent given V, the MI is never trivially 0.

Given this analysis, we hope that by promoting the MI between the multimodal inputs we can filter out modality-specific random noise that is irrelevant to our task and keep as much of the modality-invariant content that spans all modalities as possible. As stated before, we boost a tractable lower bound instead of computing the MI directly for this purpose. We exploit an accurate and straightforward MI lower bound introduced in Barber and Agakov (2004). It approximates the true conditional distribution p(y|x) with a variational counterpart q(y|x):

I(X; Y) = E_{p(x,y)} [ log ( q(y|x) / p(y) ) ] + E_{p(x)} [ KL( p(y|x) || q(y|x) ) ]
        ≥ E_{p(x,y)} [ log q(y|x) ] + H(Y)  =: I_BA   (3)

where H(Y) is the differential entropy of Y. This lower bound is tight, i.e., there is no gap between the bound and the true value, when q(y|x) = p(y|x).

In our implementation, we optimize the bounds for two modality pairs—(text, visual) and (text, acoustic). In each pair, we treat text as X and the other modality as Y in (3). We do so because: 1) since we have to train a predictor q(y|x) to approximate p(y|x), predicting from the higher-dimensional vector h_t ∈ R^{d_t} (d_t = 768) to the lower-dimensional ones h_v ∈ R^{d_v} and h_a ∈ R^{d_a} (d_v, d_a < 50) converges faster and with higher accuracy; 2) many previous works (Tsai et al., 2019a; Hazarika et al., 2020) pointed out that, empirically, the text modality is predominant and can integrate more task-related features than the other modalities at this step. Additionally, we examine the efficacy of this design choice in the ablation study. Following Cheng et al. (2020), we formulate q(y|x) as a multivariate Gaussian distribution q_θ(y|x) = N(y | µ_θ1(x), σ²_θ2(x) I), with two neural networks parameterized by θ1 and θ2 that predict the mean and variance, respectively. The loss function for likelihood maximization is:

L_lld = −(1/N) Σ_{{tv, ta}} Σ_{i=1}^{N} log q(y_i | x_i)   (4)

where N is the batch size in training, and the sum over tv, ta means summing the likelihoods of the two predictors.
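As a concrete illustration of the parametric part, the sketch below implements a Gaussian variational predictor q_θ(y|x) = N(y | µ_θ1(x), σ²_θ2(x) I) and the per-pair log-likelihood term that enters Eq. (4). The hidden sizes, the Tanh bounding of the log-variance, and the dropped additive constant are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class GaussianPredictor(nn.Module):
    """Variational approximation q_theta(y|x) = N(y | mu(x), sigma^2(x) I),
    with two small networks predicting the mean and the log-variance."""

    def __init__(self, d_x=768, d_y=32, d_hidden=128):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(d_x, d_hidden), nn.ReLU(),
                                nn.Linear(d_hidden, d_y))
        self.logvar = nn.Sequential(nn.Linear(d_x, d_hidden), nn.ReLU(),
                                    nn.Linear(d_hidden, d_y), nn.Tanh())

    def loglik(self, x, y):
        """Mean log q(y|x) over the batch (the term inside Eq. (4))."""
        mu, logvar = self.mu(x), self.logvar(x)
        # log N(y; mu, sigma^2 I), dropping the constant log(2*pi) term
        ll = -0.5 * ((y - mu) ** 2 / logvar.exp() + logvar).sum(dim=-1)
        return ll.mean()

# L_lld for the two pairs (text, visual) and (text, acoustic):
# L_lld = -(q_tv.loglik(h_t, h_v) + q_ta.loglik(h_t, h_a))
```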
For the entropy term H(Y), we address its computation with a Gaussian Mixture Model (GMM), a commonly utilized approach for approximating unknown distributions that can facilitate distribution-based estimation (Nilsson et al., 2002; Kerroum et al., 2010). A GMM builds up multiple Gaussian distributions for different property classes. We choose the sentiment polarity (non-negative/negative), which is a natural property in the datasets, as the classification criterion; this also balances the trade-off between estimation accuracy (which requires more classes) and computational cost (which requires fewer classes). We build up two normal distributions, N_pos(µ1, Σ1) and N_neg(µ2, Σ2), one for each class, where µ is the mean vector and Σ is the covariance matrix. The parameters are estimated via the maximum likelihood method on a sufficiently large sampling batch D_s ⊂ D_train:

µ̂_c = (1/N_c) Σ_{i=1}^{N_c} h_c^i
Σ̂_c = (1/N_c) Σ_{i=1}^{N_c} h_c^i ⊙ h_c^i − µ̂_c^T µ̂_c   (5)

where c ∈ {pos, neg} represents the polarity class that the sample belongs to, N_c is the number of samples in class c, and ⊙ is component-wise multiplication. The entropy of a multivariate normal distribution is given by:

H = (1/2) log( (2πe)^k det(Σ) ) = log( det(2πeΣ) ) / 2   (6)

where k is the dimensionality of the vectors in the GMM and det(Σ) is the determinant of Σ. Based on the nearly equal frequencies of the two polarity classes in the dataset, we assume the prior probability that one data point x = (x_1, ..., x_k) belongs to each class is equal, i.e., w_pos = p(x ∈ pos) = w_neg = p(x ∈ neg) = 1/2. Under the assumption that the two sub-distributions are disjoint, from Huber et al. (2008) the lower and upper bounds of a GMM's entropy are:

Σ_c w_c h_c ≤ H(Y) ≤ Σ_c w_c ( −log w_c + h_c )   (7)

where h_c is the entropy of the sub-distribution for class c. Taking the lower bound as an approximation, we obtain the entropy term for the MI lower bound:

H(Y) = (1/4) [ log( det(Σ_1) det(Σ_2) ) ]   (8)

In this formulation, we implicitly assume that the prior probabilities of the two classes are equal. We further notice that H(Y) changes throughout each training epoch, but only at a very slow pace over several consecutive steps, due to the small gradients and the consequently slight fluctuation of the parameters. This fact demands that we update the GMM parameters in a timely manner to ensure estimation accuracy. Besides, according to statistical theory, we should increase the batch size (N∗) to reduce the estimation error, but the maximum batch size is restricted by the GPU's capacity. Considering the situation above, we indirectly enlarge D_s by including data from the nearest history. In implementation, we store such data in a history data memory. The loss function for MI lower bound maximization at this level is given by:

L_BA = −I_BA^{t,v} − I_BA^{t,a}   (9)
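The non-parametric entropy estimate of Eqs. (5)–(8) can be sketched as follows. This version uses the standard full-covariance maximum-likelihood estimate and keeps the (2πe)^k constant of Eq. (6), which only shifts the objective by an additive constant, so it is a hedged approximation of the described procedure rather than the exact released implementation; in practice the embeddings h would be the pooled history-memory and current-batch embeddings.

```python
import math
import torch

def gmm_entropy_lower_bound(h, is_nonneg, eps=1e-6):
    """Eqs. (5)-(8): fit one Gaussian per sentiment-polarity class to the pooled
    embeddings (history memory + current batch) and return the entropy
    approximation 0.5 * (h_pos + h_neg), i.e. the lower bound of Eq. (7) with
    w_pos = w_neg = 1/2. `h` is (N, k); `is_nonneg` is a boolean class mask."""
    k = h.size(1)
    class_entropies = []
    for mask in (is_nonneg, ~is_nonneg):
        h_c = h[mask]
        mu = h_c.mean(dim=0, keepdim=True)
        diff = h_c - mu
        # maximum-likelihood covariance, with a small ridge for numerical stability
        cov = diff.t() @ diff / h_c.size(0) + eps * torch.eye(k, dtype=h.dtype, device=h.device)
        # differential entropy of N(mu, cov): 0.5 * log((2*pi*e)^k * det(cov)), Eq. (6)
        _, logdet = torch.linalg.slogdet(cov)
        class_entropies.append(0.5 * (k * math.log(2 * math.pi * math.e) + logdet))
    return 0.5 * sum(class_entropies)
```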
3.5 MI Maximization in the Fusion Level

To enforce that the intermediate fusion results capture modality-invariant cues among modalities, we repeat MI maximization between the fusion results and the input modalities. The optimization target is the fusion network F that produces the fusion results Z = F(X_t, X_v, X_a). Since we already have a generation path from X_m to Z, we expect an opposite path, i.e., to construct X_m, m ∈ {t, v, a}, from Z. Inspired by, but different from, Oord et al. (2018), we use a score function that acts on the normalized prediction and truth vectors to gauge their correlation:

Ḡ_φ(Z) = G_φ(Z) / ||G_φ(Z)||_2,   h̄_m = h_m / ||h_m||_2,
s(h_m, Z) = exp( h̄_m^T Ḡ_φ(Z) )   (10)

where G_φ is a neural network with parameters φ that generates a prediction of h_m from Z, and ||·||_2 is the Euclidean norm, by which we divide to obtain unit-length vectors. We apply this normalization because we find that, without it, the model tends to stretch both vectors to maximize the score in (10). Then, as in Oord et al. (2018), we incorporate this score function into the Noise-Contrastive Estimation framework (Gutmann and Hyvärinen, 2010) by treating all other representations of that modality in the same batch, H̃_m^i = H_m \ {h_m^i}, as negative samples:

L_N(Z, H_m) = −E_H [ log ( s(Z, h_m^i) / Σ_{h_m^j ∈ H_m} s(Z, h_m^j) ) ]   (11)

Here is a short explanation of the rationale behind this formulation. Contrastive Predictive Coding (CPC) scores the MI between context and future elements "across the time horizon" to keep the portion of "slow features" that span many time steps (Oord et al., 2018). Similarly, in our model, we ask the fusion result Z to reversely predict representations "across modalities" so that more modality-invariant information can be passed to Z. Besides, by aligning the prediction to each modality, we enable the model to decide how much information it should receive from each modality. This insight will be further discussed with experimental evidence in Section 5.2. The loss function for this level is given by:

L_CPC = L_N^{z,v} + L_N^{z,a} + L_N^{z,t}   (12)
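Below is a minimal sketch of the fusion-level module of Eqs. (10)–(12): G_φ predicts h_m back from Z, both vectors are ℓ2-normalized, and the exponential score is folded into an InfoNCE-style loss with in-batch negatives (the log of the score is used directly, so the exp in Eq. (10) never needs to be materialized). The network sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCModule(nn.Module):
    """Eqs. (10)-(11): predict a modality representation back from the fusion
    result Z and score it against the whole batch with an InfoNCE-style loss."""

    def __init__(self, d_z=128, d_m=32):
        super().__init__()
        self.G = nn.Sequential(nn.Linear(d_z, d_z), nn.ReLU(),
                               nn.Linear(d_z, d_m))  # G_phi: Z -> prediction of h_m

    def forward(self, z, h_m):
        pred = F.normalize(self.G(z), dim=-1)   # normalized G_phi(Z)
        h_m = F.normalize(h_m, dim=-1)          # normalized h_m
        logits = pred @ h_m.t()                 # log s(h_m^j, Z_i) for all pairs
        # positives are on the diagonal; all other samples in the batch act as negatives
        pos = logits.diag()
        return -(pos - torch.logsumexp(logits, dim=-1)).mean()  # L_N(Z, H_m)

# L_CPC = cpc_t(z, h_t) + cpc_v(z, h_v) + cpc_a(z, h_a)   (Eq. (12))
```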
3.6 Training

The training process consists of two stages in each iteration. In the first stage, we approximate p(y|x) with q(y|x) by minimizing the negative log-likelihood of the inter-modality predictors with the loss in (4). In the second stage, the hierarchical MI lower bounds from the previous subsections are added to the main loss as auxiliary losses. After obtaining the final prediction ŷ, along with the truth value y, we have the task loss:

L_task = MAE(ŷ, y)   (13)

where MAE stands for the mean absolute error loss, which is a common practice in regression tasks. Finally, we calculate the weighted sum of all these losses to obtain the main loss for this stage:

L_main = L_task + α L_CPC + β L_BA   (14)

where α, β are hyper-parameters that control the impact of MI maximization. We summarize the training algorithm in Algorithm 1.

Algorithm 1: MultiModal Mutual Information Maximization (MM-MIM)
  Input: Dataset D = {(X_t, X_v, X_a), Y}, α, β, learning rates η_lld, η_main, embedding history memory M
  Output: Prediction ŷ
  for each training epoch do
    Stage 1: Conditional Likelihood Maximization
    for minibatch B = {(X_t^i, X_v^i, X_a^i)}_{i=1}^{N} sampled from D_sub ⊆ D do
      Encode X_m^i to h_m^i as in (2)
      Compute L_lld as in (4)
      Update the parameters of the predictor q: θ_q ← θ_q − η_lld ∇_θ L_lld
    end
    Stage 2: MI-maximization Joint Training
    for minibatch B = {(X_t^i, X_v^i, X_a^i)}_{i=1}^{N} sampled from D do
      Encode X_m^i to h_m^i as in (2)
      Estimate the mean vectors and covariance matrices of the GMM with M as in (5)
      Update the history M:
        M ← M \ {oldest hidden batch}
        M ← M ∪ {h_m^i}_{i=1}^{|B|}, m ∈ {v, a}
      Compute L_BA as in (3), (8), (9)
      Produce the fusion results Z_i = F(X_t^i, X_v^i, X_a^i) and predictions ŷ
      Compute L_N, L_CPC as in (10), (11), (12)
      Compute L_main as in (14)
      Update all parameters in the model except q: θ_k ← θ_k − η_k ∇_θ L_main
    end
  end
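Putting the pieces together, here is a hedged sketch of one iteration of the two-stage procedure in Algorithm 1, reusing the hypothetical modules sketched in the previous subsections (the encoders, the fusion network, the q predictors, the CPC modules, and gmm_entropy_lower_bound). The history memory and gradient clipping are omitted for brevity, and the exact parameter grouping in the released code may differ.

```python
import torch
import torch.nn.functional as F

def train_step(batch, encoders, fusion, q_tv, q_ta, cpc_t, cpc_v, cpc_a,
               opt_q, opt_main, alpha, beta):
    """One iteration of the two-stage procedure (Algorithm 1). opt_q holds only
    the q-predictor parameters (learning rate eta_lld); opt_main holds all the
    remaining parameters (eta_main)."""
    text_ids, text_mask, X_v, X_a, y = batch      # raw inputs and sentiment labels
    polarity = y >= 0                             # non-negative vs. negative class

    # Stage 1: conditional likelihood maximization for q(y|x), Eq. (4)
    h_t, h_v, h_a = encoders(text_ids, text_mask, X_v, X_a)
    loss_lld = -(q_tv.loglik(h_t, h_v) + q_ta.loglik(h_t, h_a))
    opt_q.zero_grad(); loss_lld.backward(); opt_q.step()

    # Stage 2: MI-maximization joint training, Eqs. (9) and (12)-(14)
    h_t, h_v, h_a = encoders(text_ids, text_mask, X_v, X_a)
    z, y_hat = fusion(h_t, h_v, h_a)
    loss_ba = -(q_tv.loglik(h_t, h_v) + gmm_entropy_lower_bound(h_v, polarity) +
                q_ta.loglik(h_t, h_a) + gmm_entropy_lower_bound(h_a, polarity))
    loss_cpc = cpc_t(z, h_t) + cpc_v(z, h_v) + cpc_a(z, h_a)
    loss_main = F.l1_loss(y_hat, y) + alpha * loss_cpc + beta * loss_ba
    opt_main.zero_grad(); loss_main.backward(); opt_main.step()
    return loss_main.item()
```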
4 Experiments

In this section, we present the experimental details, including datasets, baselines, feature extraction toolkits, and results.

4.1 Datasets and Metrics

We conduct experiments on two publicly available academic datasets in MSA research: CMU-MOSI (Zadeh et al., 2016) and CMU-MOSEI (Zadeh et al., 2018). CMU-MOSI contains 2199 utterance video segments sliced from 93 videos in which 89 distinct narrators share opinions on interesting topics. Each segment is manually annotated with a sentiment value ranging from -3 to +3, indicating the polarity (by positive/negative) and the relative strength (by absolute value) of the expressed sentiment. The CMU-MOSEI dataset upgrades CMU-MOSI by expanding the size of the dataset. It consists of 23,454 movie review video clips from YouTube. Its labeling style is the same as CMU-MOSI. We provide the split specifications of the two datasets in Table 1.

Split        CMU-MOSI   CMU-MOSEI
Train        1284       16326
Validation   229        1871
Test         686        4659
All          2199       22856

Table 1: Dataset split.

We use the same metric set that has been consistently presented and compared before: mean absolute error (MAE), which is the average absolute difference between the predicted values and the truth values; Pearson correlation (Corr), which measures the degree of prediction skew; seven-class classification accuracy (Acc-7), indicating the proportion of predictions that fall into the same one of the seven intervals between -3 and +3 as the corresponding truths; and binary classification accuracy (Acc-2) and F1 score, computed for both positive/negative and non-negative/negative classification results.
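For reference, the sketch below shows one way to compute the reported metrics from vectors of predictions and labels. The clamping/rounding convention for Acc-7 and the exclusion of zero labels for the positive/negative variant are assumptions about the standard CMU-MOSI/MOSEI evaluation protocol, not a description of the exact scripts used here.

```python
import numpy as np

def msa_metrics(preds, labels):
    """MAE, Pearson correlation, 7-class accuracy, and the two binary
    accuracy / F1 variants (non-negative/negative and positive/negative)."""
    preds, labels = np.asarray(preds, float), np.asarray(labels, float)
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]
    acc7 = np.mean(np.round(np.clip(preds, -3, 3)) == np.round(np.clip(labels, -3, 3)))

    def binary(p, l):
        # p, l are boolean arrays; "True" is treated as the positive class
        acc = np.mean(p == l)
        tp, fp, fn = np.sum(p & l), np.sum(p & ~l), np.sum(~p & l)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        return acc, f1

    acc2_nonneg, f1_nonneg = binary(preds >= 0, labels >= 0)   # non-negative/negative
    nz = labels != 0                                           # positive/negative: drop zero labels
    acc2_pos, f1_pos = binary(preds[nz] > 0, labels[nz] > 0)
    return dict(MAE=mae, Corr=corr, Acc7=acc7,
                Acc2=(acc2_nonneg, acc2_pos), F1=(f1_nonneg, f1_pos))
```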
            |                       CMU-MOSI                        |                       CMU-MOSEI
models♦     | MAE     Corr    Acc-7   Acc-2         F1              | MAE     Corr    Acc-7   Acc-2         F1
TFN†        | 0.901   0.698   34.9    - /80.8       - /80.7         | 0.593   0.700   50.2    - /82.5       - /82.1
LMF†        | 0.917   0.695   33.2    - /82.5       - /82.4         | 0.623   0.677   48.0    - /82.0       - /82.1
MFM†        | 0.877   0.706   35.4    - /81.7       - /81.6         | 0.568   0.717   51.3    - /84.4       - /84.3
ICCN†       | 0.862   0.714   39.0    - /83.0       - /83.0         | 0.565   0.713   51.6    - /84.2       - /84.2
MulT‡       | 0.861   0.711   -       81.5/84.1     80.6/83.9       | 0.580   0.703   -       - /82.5       - /82.3
MISA‡       | 0.804   0.764   -       80.79/82.10   80.77/82.03     | 0.568   0.724   -       82.59/84.23   82.67/83.97
MAG-BERT‡   | 0.731   0.789   -       82.5/84.3     82.6/84.3       | 0.539   0.753   -       83.8/85.2     83.7/85.1
Self-MM‡    | 0.713   0.798   -       84.00/85.98   84.42/85.95     | 0.530   0.765   -       82.81/85.17   82.53/85.30
MAG-BERT∗   | 0.727   0.781   43.62   82.37/84.43   82.50/84.61     | 0.543   0.755   52.67   82.51/84.82   82.77/84.71
Self-MM∗    | 0.712   0.795   45.79   82.54/84.77   82.68/84.91     | 0.529   0.767   53.46   82.68/84.96   82.95/84.93
MMIM        | 0.700\  0.800\  46.65\  84.14\/86.06\ 84.00\/85.98\   | 0.526   0.772   54.24\  82.24/85.97\  82.66/85.94\

Table 2: Results on CMU-MOSI and CMU-MOSEI; ♦: all models use BERT as the text encoder; †: from Hazarika et al. (2020); ‡: from Yu et al. (2021); ∗: reproduced from open-source code with the hyper-parameters provided in the original papers. For Acc-2 and F1, we report two sets of evaluation results: non-negative/negative (left) and positive/negative (right). Best results are marked in bold and \ means the corresponding result is significantly better than the SOTA with p-value < 0.05 based on a paired t-test.

4.2 Baselines

To inspect the relative performance of MMIM, we compare our model with many baselines. We consider purely learning-based models, such as TFN (Zadeh et al., 2017), LMF (Liu et al., 2018), MFM (Tsai et al., 2019b) and MulT (Tsai et al., 2019a), as well as approaches involving feature space manipulation, like ICCN (Sun et al., 2020) and MISA (Hazarika et al., 2020). We also compare our model with more recent and competitive baselines, including the BERT-based model MAG-BERT (Rahman et al., 2020) and Self-MM (Yu et al., 2021), which works with multi-task learning and is the SOTA method. Some of the baselines are available at https://github.com/declare-lab/multimodal-deep-learning.

The baselines are listed below:

TFN (Zadeh et al., 2017): Tensor Fusion Network disentangles unimodal representations into tensors by a three-fold Cartesian product and then computes the outer product of these tensors as the fusion result.

LMF (Liu et al., 2018): Low-rank Multimodal Fusion decomposes stacked high-order tensors into many low-rank factors and then performs efficient fusion based on these factors.

MFM (Tsai et al., 2019b): Multimodal Factorization Model concatenates an inference network and a generative network with intermediate modality-specific factors to facilitate the fusion process with reconstruction and discrimination losses.

MulT (Tsai et al., 2019a): Multimodal Transformer constructs an architecture of unimodal and crossmodal transformer networks and completes the fusion process with attention.

ICCN (Sun et al., 2020): Interaction Canonical Correlation Network minimizes the canonical loss between modality representation pairs to ameliorate the fusion outcome.

MISA (Hazarika et al., 2020): Modality-Invariant and -Specific Representations projects features into two separate spaces with special constraints. Fusion is then accomplished on these features.

MAG-BERT (Rahman et al., 2020): Multimodal Adaptation Gate for BERT designs an alignment gate and inserts it into the vanilla BERT model to refine the fusion process.

SELF-MM (Yu et al., 2021): Self-supervised Multi-Task Learning assigns each modality a unimodal training task with automatically generated labels, which aims to adjust the gradient back-propagation.

4.3 Basic Settings and Results

Experimental Settings. We use unaligned raw data in all experiments, as in Yu et al. (2021). For visual and acoustic, we use COVAREP (Degottex et al., 2014) and P2FA (Yuan and Liberman, 2008), which are both prevalent toolkits for feature extraction and have been regularly employed before. We trained our model on a single RTX 2080Ti GPU and ran a grid search for the best set of hyper-parameters. The details are provided in the supplementary file.

Hyperparameter Setting. We perform a grid search for the best set of hyper-parameters: batch size in {32, 64}, η_lld in {1e-3, 5e-3}, η_main in {5e-4, 1e-3, 5e-3}, α, β in {0.05, 0.1, 0.3}, hidden dimension in {32, 64}, and memory size in {1, 2, 3} batches; the gradient clipping value is fixed at 5.0, the learning rate for BERT fine-tuning is 5e-5, the BERT embedding size is 768, and the fusion vector size is 128. The chosen hyperparameters are given in Table 3.
Item                   CMU-MOSI       CMU-MOSEI
batch size             32             64
learning rate η_lld    5e-3           1e-3
learning rate η_main   1e-3           5e-4
α                      0.3            0.1
β                      0.1            0.05
V-LSTM hidden dim      32             64
A-LSTM hidden dim      32             16
memory size            32 (1 batch)   64 (1 batch)
gradient clip          5.0            5.0

Table 3: Hyperparameters for best performance.

Summary of the Results. In accord with previous work, we ran our model five times under the same hyper-parameter settings and report the average performance in Table 2. We find that MMIM yields better or comparable results to many baseline methods. To elaborate, our model significantly outperforms the SOTA in all metrics on CMU-MOSI and in (non-0) Acc-7, (non-0) Acc-2, and F1 score on CMU-MOSEI. For the other metrics, MMIM achieves very close performance (within 0.5%) to the SOTA. These outcomes preliminarily demonstrate the efficacy of our method in MSA tasks.

4.4 Ablation Study

To show the benefits of the proposed loss functions and the corresponding estimation methods in MMIM, we carried out a series of ablation experiments on CMU-MOSEI. The results under different ablation settings are categorized and listed in Table 4. First, we eliminate one or several MI loss terms, for both the inter-modality MI lower bound (I_BA) and the CPC loss (L_N^{z,m} where m ∈ {v, a, t}), from the total loss. We note the manifest performance degradation after removing part of the MI loss, and the results are even worse when removing all terms in one loss than when removing only a single term, which shows the efficacy of our MI maximization framework. Besides, by replacing the current optimization target pairs in the inter-modality MI with a single pair or other pair combinations, we cannot gain better results, which provides experimental evidence for the choice of candidate pairs at that level. Then we test the components for entropy estimation. We deactivate the history memory and evaluate µ and Σ in (5) using only the current batch. It is surprising to observe that the training process breaks down due to "NaN" gradient values. Therefore, the history-based estimation has the additional advantage of guaranteeing training stability. Finally, we substitute the GMM with a unified Gaussian whose µ and Σ are estimated on all samples regardless of their polarity classes. We spot a clear drop in all metrics, which implies that the GMM built on the natural classes leads to a more accurate estimation of the entropy terms.

Description                               MAE    Corr   Acc-7  Acc-2        F1
MMIM                                      0.526  0.772  54.24  82.24/85.97  82.66/85.94
Inter-modality MI
  I_BA^{t,v}                              0.533  0.763  53.80  80.87/85.08  81.37/85.06
  I_BA^{t,a}                              0.538  0.767  53.31  80.26/82.73  80.81/82.00
  I_BA^{v,a}                              0.545  0.753  53.85  80.40/85.05  80.85/84.95
  I_BA^{t,a} + I_BA^{v,a}                 0.536  0.764  53.53  79.40/85.39  80.12/85.47
  I_BA^{t,v} + I_BA^{v,a}                 0.534  0.770  54.11  80.62/85.61  81.20/85.64
  I_BA^{t,v} + I_BA^{v,a} + I_BA^{t,a}    0.527  0.772  54.53  80.02/85.42  80.64/85.44
  None                                    0.541  0.752  53.57  79.60/84.75  80.21/84.76
L_CPC loss
  w/o L_N^{z,t}                           0.535  0.768  53.66  76.46/83.92  77.38/84.04
  w/o L_N^{z,v}                           0.536  0.766  53.70  82.71/85.86  82.80/85.97
  w/o L_N^{z,a}                           0.530  0.771  53.44  80.68/85.78  81.18/85.72
  w/o L_N^{z,t}, L_N^{z,v}, L_N^{z,a}     0.543  0.759  53.49  78.89/84.37  79.57/84.40
Entropy estimation
  w/o history data                        NaN    NaN    NaN    NaN/NaN      NaN/NaN
  w/o GMM                                 0.533  0.768  53.4   79.57/84.94  80.19/84.95

Table 4: Ablation study of MMIM on CMU-MOSEI. t, v, a, z represent text, visual, acoustic and the fusion results.

5 Further Analysis

In this section, we dive into our model to explore how it functions in the MSA task. We first visualize all types of losses in the training process, then we analyze some representative cases.

5.1 Tracing the Losses

[Figure 2: Visualization of the loss changes as training proceeds on CMU-MOSEI.]

To better understand how the MI losses work, we visualize the variation of all losses during training in Figure 2. The values for plotting are the average losses over a constant interval of every 20 steps. From the figure, we can see that throughout the training process, L_task and L_CPC keep decreasing nearly all the time, while L_BA goes down within an epoch except at its beginning. We also mark the time at which the best epoch ends, i.e., when the task loss on the validation set reaches its minimum.
(A) Text: "We'll pick it up from here in the next video in this series." | Visual: Smile | Acoustic: Slightly rising tone, normal volume | s_t^z/s_v^z/s_a^z: 0.67/0.96/0.43 | Pred: +0.6663 | Truth: +0.6667
(B) Text: "I'd probably only give it a two out of five stars." | Visual: Frown | Acoustic: Peaceful tone, normal volume | s_t^z/s_v^z/s_a^z: 0.85/0.96/0.36 | Pred: −1.6642 | Truth: −1.6667
(C) Text: "So these people are commissioned to hunt the animals" | Visual: Glance | Acoustic: Peaceful & narrative | s_t^z/s_v^z/s_a^z: 0.64/0.93/0.73 | Pred: −0.0009 | Truth: 0.0000
(D) Text: "I'm sorry, on the scale of one to five I would give this a five." | Visual: Turn head | Acoustic: High pitch on "five" | s_t^z/s_v^z/s_a^z: 0.83/0.71/0.54 | Pred: −2.0023 | Truth: +2.6667

Table 5: Representative examples with their predictions and fusion-modality scores in the case study. High scores (≥ 0.8) are highlighted in bold.

It is notable that L_BA and L_CPC reach a relatively low level at this time while the task loss on the training set does not. This scenario reveals the crucial role that L_BA and L_CPC play in the training process—they offer supplemental unsupervised gradient rectification to the parameters on their respective back-propagation paths and counteract the over-fitting of the task loss. Besides, because in the experiment settings α and β are of the same order, and at the end of the best epoch L_BA reaches its lowest value, synchronized with the validation loss, while L_CPC fails to, we can conclude that L_BA, or MI maximization at the input (lower) level, has a more significant impact on the model's performance than L_CPC, or MI maximization at the fusion (higher) level.

5.2 Case Study

We display some predictions and truth values, as well as the corresponding raw inputs (for visual and acoustic we only describe them in words) and the three CPC scores, in Table 5. As described in Section 3.5, these scores imply how much the fusion results depend on each modality. It is noted that the scores are above 0.35 in all cases, which demonstrates that the fusion results seize a certain amount of domain-invariant features. We also observe the different extents to which the fusion results depend on each modality. In case (A), visual provides the only clue to the true sentiment, and correspondingly s_v^z is higher than the other two scores. In case (B), the word "only" is a piece of additional evidence apart from what the visual modality exposes, and we find that s_t^z achieves a higher level than in (A). For (C), acoustic and visual help infer a neutral sentiment, and thus s_v^z and s_a^z are larger than s_t^z. Therefore, we conclude that the model can intelligently adjust the information that flows from the unimodal inputs into the fusion results, consistently with their individual contributions to the final predictions. However, this mechanism may malfunction in cases like (D). The remark "I'm sorry" bewilders the model, and meanwhile visual and acoustic provide no useful cues. In this circumstance, the model casts its attention on text and is misled to a wrong prediction in the opposite direction.

6 Conclusion

In this paper, we present MMIM, which hierarchically maximizes the mutual information (MI) in a multimodal fusion pipeline. The model applies two MI lower bounds, for the unimodal inputs and for the fusion stage, respectively. To address the intractability of some terms in these lower bounds, we specifically design precise, fast and robust estimation methods to ensure that training proceeds normally as well as to improve the test outcome. We then conduct comprehensive experiments on two datasets, followed by an ablation study, the results of which verify the efficacy of our model and the necessity of the MI maximization framework. We further visualize the losses and display some representative examples to provide a deeper insight into our model. We believe this work can inspire creativity in representation learning and multimodal sentiment analysis in the future.

Acknowledgments

This project is supported by the AcRF MoE Tier-2 grant titled "CSK-NLP: Leveraging Commonsense Knowledge for NLP", and the SRG grant id T1SRIS19149 titled "An Affective Multimodal Dialogue System".
References

Md Shad Akhtar, Dushyant Chauhan, Deepanway Ghosal, Soujanya Poria, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Multi-task learning for multi-modal emotion recognition and sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 370–379.
Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. 2016. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
Rana Ali Amjad and Bernhard C Geiger. 2019. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9):2225–2239.
Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pages 609–617.
Philip Bachman, R Devon Hjelm, and William Buchwalter. 2019. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910.
David Barber and Felix Agakov. 2004. The IM algorithm: A variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201.
Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. 2018. Mutual information neural estimation. In International Conference on Machine Learning, pages 531–540. PMLR.
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171.
Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. 2020. CLUB: A contrastive log-ratio upper bound of mutual information. In International Conference on Machine Learning, pages 1779–1788. PMLR.
Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP—a collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964. IEEE.
Julien Deonna and Fabrice Teroni. 2012. The Emotions: A Philosophical Introduction. Routledge.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander Gelbukh. 2019. DialogueGCN: A graph convolutional neural network for emotion recognition in conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 154–164, Hong Kong, China. Association for Computational Linguistics.
Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings.
Devamanyu Hazarika, Soujanya Poria, Sruthi Gorantla, Erik Cambria, Roger Zimmermann, and Rada Mihalcea. 2018. CASCADE: Contextual sarcasm detection in online discussion forums. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1837–1848.
Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1122–1131.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.
R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Marco F Huber, Tim Bailey, Hugh Durrant-Whyte, and Uwe D Hanebeck. 2008. On entropy approximation for Gaussian mixture random vectors. In 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 181–188. IEEE.
Mounir Ait Kerroum, Ahmed Hammouch, and Driss Aboutajdine. 2010. Textural feature selection by joint mutual information based on Gaussian mixture model for multispectral image classification. Pattern Recognition Letters, 31(10):1168–1174.
Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256.
Sijie Mai, Haifeng Hu, and Songlong Xing. 2020. Modality to modality translation: An adversarial representation learning and graph fusion network for multimodal fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 164–172.
Prem Melville, Wojciech Gryc, and Richard D Lawrence. 2009. Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1275–1284.
Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 169–176.
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep learning. In ICML.
Mattias Nilsson, Harald Gustaftson, Søren Vang Andersen, and W Bastiaan Kleijn. 2002. Gaussian mixture model based mutual information estimation between frequency bands in speech. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–525. IEEE.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6892–6899.
Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. 2019. On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180. PMLR.
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, and Rada Mihalcea. 2020. Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research. IEEE Transactions on Affective Computing, pages 1–1.
Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2359–2369, Online. Association for Computational Linguistics.
Zhongkai Sun, Prathusha Sarma, William Sethares, and Yingyu Liang. 2020. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8992–8999.
Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE.
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019a. Multimodal Transformer for unaligned multimodal language sequences. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019b. Learning factorized multimodal representations. In International Conference on Representation Learning.
Yao-Hung Hubert Tsai, Martin Ma, Muqiao Yang, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1823–1833.
Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2018. Deep graph infomax. In International Conference on Learning Representations.
Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing. 2017. Select-additive learning: Improving generalization in multimodal sentiment analysis. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 949–954. IEEE.
Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53.
Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. 2021. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. arXiv preprint arXiv:2102.04830.
Jiahong Yuan and Mark Liberman. 2008. Speaker identification on the SCOTUS corpus. The Journal of the Acoustical Society of America, 123(5):3878–3878.
Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114.
Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 31(6):82–88.
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.

A Appendix

A.1 Implementation details of the history memory

The workflow of the history embedding memory comprises two stages, as shown in Figure 3. In the estimation stage, the parameters of the GMM are estimated using both the history embeddings read out from the history memory and the current batch input, as shown in (a). Then, in the update stage, the oldest batch of data is driven out of the memory to leave space for new data, as described in (b). The memory is implemented as a FIFO queue.

[Figure 3: The workflow of a history embedding memory. (a) Estimation stage: past batches are read out from the history memory (a FIFO queue) and used together with the current batch for GMM and entropy estimation. (b) Update stage: the current batch is enqueued and the oldest batch is dequeued.]
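A minimal sketch of such a FIFO memory, assuming it stores detached embedding batches per modality (a hypothetical class, not the released implementation):

```python
from collections import deque
import torch

class HistoryMemory:
    """FIFO queue of past embedding batches (Figure 3). Read-out concatenates
    the stored batches with the current one for GMM / entropy estimation;
    update enqueues the current batch and drops the oldest one."""

    def __init__(self, max_batches=1):
        self.queue = deque(maxlen=max_batches)   # drops the oldest batch automatically

    def read_out(self, current):
        # Estimation stage: past batches + current batch
        return torch.cat(list(self.queue) + [current], dim=0)

    def update(self, current):
        # Update stage: store detached embeddings so no gradients flow into history
        self.queue.append(current.detach())
```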

A.2 Proof of Eq. (7)

Proof. For a GMM, the marginal probability density function of x can be written as

f(x) = Σ_i f(x | z = i) p(x ∈ C_i)   (15)

where z is an indicator that reflects which class x falls in and C_i is the i-th class. Since Σ_i P(x ∈ C_i) = 1, by Jensen's inequality we have (note that H(X) = ∫ (−p(x) log p(x)) dx = ∫ g(p(x)) dx, and g(t) = −t log t is a concave function)

H(X) = H( Σ_i f(x | z = i) p(x ∈ C_i) ) ≥ Σ_i p(x ∈ C_i) H( f(x | z = i) )   (16)

In our case we have p(x ∈ C_1) = p(x ∈ C_2) = 1/2, and then

H(X) ≥ (1/2) ( H(X; µ_1, Σ_1) + H(X; µ_2, Σ_2) ) = K_L(X)   (17)

Hence we get a lower bound of H(X) as the right-hand side of the inequality. On the other hand, an upper bound, as proposed in Huber et al. (2008), is

K_U(X) = −2 × (1/2) × log(1/2) + (1/2) ( H(X; µ_1, Σ_1) + H(X; µ_2, Σ_2) ) = 0.693 + K_L(X)   (18)

To summarize,

K_L(X) ≤ H(X) ≤ K_U(X) = 0.693 + K_L(X)   (19)

Then, by maximizing the lower bound K_L(X), we can maximize H(X).
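As a quick, self-contained numerical illustration (not from the paper) of Eq. (19), the snippet below compares K_L(X) and K_U(X) = K_L(X) + log 2 ≈ K_L(X) + 0.693 against the true entropy of a one-dimensional, equal-weight, two-component Gaussian mixture, computed by grid integration; the mixture parameters are arbitrary.

```python
import numpy as np

# Sanity check of Eq. (19) for the 1-D mixture 0.5*N(mu1, s1^2) + 0.5*N(mu2, s2^2)
mu1, s1, mu2, s2 = -1.5, 0.8, 2.0, 1.2

def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def normal_entropy(s):
    return 0.5 * np.log(2 * np.pi * np.e * s ** 2)   # Eq. (6) with k = 1

x = np.linspace(-20, 20, 200001)
p = 0.5 * normal_pdf(x, mu1, s1) + 0.5 * normal_pdf(x, mu2, s2)
true_H = -np.trapz(p * np.log(p + 1e-300), x)        # H(X) by grid integration

K_L = 0.5 * (normal_entropy(s1) + normal_entropy(s2))  # lower bound, Eq. (17)
K_U = np.log(2) + K_L                                  # upper bound, Eq. (18)
print(f"K_L = {K_L:.4f} <= H = {true_H:.4f} <= K_U = {K_U:.4f}")
```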