Value Functions Factorization with Latent State Information Sharing in Decentralized Multi-Agent Policy Gradients
arXiv:2201.01247v1 [cs.MA] 4 Jan 2022

Hanhan Zhou, Tian Lan,* and Vaneet Aggarwal†

* Hanhan Zhou and Tian Lan are with the Department of Electrical and Computer Engineering, the George Washington University, Washington, DC, 20052; e-mail: {hanhan, tlan}@gwu.edu.
† Vaneet Aggarwal is with the School of Industrial Engineering and the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47907; e-mail: vaneet@purdue.edu.

Abstract
Value function factorization via centralized training and decentralized execution is promising for solving cooperative multi-agent reinforcement learning tasks. One of the approaches in this area, QMIX, has become state-of-the-art and achieved the best performance on the StarCraft II micromanagement benchmark. However, the monotonic mixing of per-agent estimates in QMIX is known to restrict the joint action Q-values it can represent, and the global state information available for single-agent value function estimation is insufficient, often resulting in suboptimality. To this end, we present LSF-SAC, a novel framework that features a variational inference-based information-sharing mechanism as extra state information to assist individual agents in value function factorization. We demonstrate that such latent individual state information sharing can significantly expand the power of value function factorization, while fully decentralized execution can still be maintained in LSF-SAC through a soft-actor-critic design. We evaluate LSF-SAC on the StarCraft II micromanagement challenge and demonstrate that it outperforms several state-of-the-art methods on challenging collaborative tasks. We further conduct extensive ablation studies to locate the key factors accounting for its performance improvements. We believe this new insight can lead to new local value estimation methods and variational deep learning algorithms. A demo video and code of our implementation can be found at https://sites.google.com/view/sacmm.

 1 Introduction
Reinforcement learning has been shown to match or surpass human performance in multiple domains, including Atari games [24], Go [19], and StarCraft II [42]. Many real-world problems, such as autonomous vehicle coordination [14] and network packet delivery [47], involve multiple agents' decision making, which can be modeled as multi-agent reinforcement learning (MARL). Even though multi-agent cooperative
problems could in principle be solved by single-agent algorithms, the joint state and action space implies limited scalability. Further, partial observability and communication constraints pose additional challenges for MARL. One approach to dealing with these issues is the paradigm of centralized training and decentralized execution (CTDE) [18]. The main CTDE approaches include value function decomposition [37, 32] and multi-agent policy gradients [4].
Value decomposition based approaches like QMIX [32] represent the joint action value using a monotonic mixing function of per-agent estimates and have recorded the best performance on many StarCraft II micromanagement challenge maps [22]. Further, it has been demonstrated [29] that multi-agent policy gradient methods are substantially outperformed by QMIX on both the multi-agent particle world environment (MPE) [25] and the StarCraft multi-agent challenge (SMAC) [33]. Despite recent attempts to combine policy gradient methods and value decomposition, e.g., VDAC [36] and mSAC [30], the achieved improvements over QMIX are limited. One of the fundamental challenges is that the restricted function class permitted by QMIX limits the joint action Q-values it can represent, leading to suboptimal value approximations and inefficient exploration [22]. A number of proposals have been made to refine the value function factorization of QMIX, e.g., QTRAN [35] and weighted QMIX [31]. However, solving tasks that require significant coordination remains a key challenge.
To this end, we propose LSF-SAC, a Latent State information sharing assisted value Function factorization method under the multi-agent Soft-Actor-Critic paradigm. In particular, we introduce a novel peer-assisted information-sharing mechanism that enables effective value function factorization by sharing latent individual states, which can be considered extra state information for more accurate individual Q-value estimation by each agent. While global information sharing or communication in MARL (e.g., TarMAC [2]) typically prevents fully distributed decision making, we show that by leveraging the design of soft-actor-critic, LSF-SAC is able to retain fully decentralized execution while enjoying the benefits of latent individual state sharing. It also incorporates the entropy of the policy into the reward to encourage exploration.
The key insight of LSF-SAC is that existing value function factorization approaches use the joint state information only in the mixing network, which is restricted by the function class it can represent. We believe that accurate individual value function estimation requires not only the state information of one specific agent, but also a proper representation of all individual state information. We propose a way to extract and utilize this extra state information for individual, per-agent value function estimation through a variational inference method, serving as latent individual state information. It is shown to significantly improve the power of value function factorization. Since we utilize such latent state information sharing only in the centralized critic, the CTDE assumptions are preserved without affecting fully decentralized decision making, unlike previous work that introduces global communication [44]. Further, we note that combining an actor-critic framework with value decomposition in LSF-SAC offers a way to decouple the decision making of individual agents (through separate policy networks) from the value function networks, while also allowing entropy maximization to enhance stability and exploration.
 Our key contributions are summarized as follows:

• We propose a novel method, LSF-SAC, the first framework for value function factorization that provides extra individual latent state information to facilitate individual, per-agent value function estimation. We show that latent state information can significantly improve the power of monotonic factorization operators.
• LSF-SAC leverages a soft-actor-critic design to separate individual agents' policy networks from the value function networks and to maintain fully decentralized execution, while enjoying the benefits of peer-assisted value function factorization. It also leads to entropy-maximization MARL for more effective exploration.
• We demonstrate the effectiveness of LSF-SAC and show that it significantly outperforms a number of state-of-the-art baselines on the StarCraft II micromanagement challenge, in terms of both better performance and faster convergence.

2 Background
2.1 Value Function Decomposition
Value function decomposition methods [37, 32, 35, 45] learn a joint Q function $Q_{tot}(\boldsymbol{\tau}, \mathbf{a})$ as a function of individual Q functions, each conditioned on one agent's local observation history; these local Q values are combined by a learnable mixing network to produce the joint Q value:

$$Q_{tot}(\boldsymbol{\tau}, \mathbf{a}) = q^{mix}\big(s,\, q^1(\tau^1, a^1), \ldots, q^n(\tau^n, a^n)\big) \qquad (1)$$

Under the principle of guaranteed consistency between the global optimal joint action and the local optimal actions, a global argmax performed on $Q_{tot}$ yields the same result as a set of individual argmax operations performed on each local $q^i$; this is known as the Individual-Global-Max (IGM) condition.
QMIX generalizes VDN by approximating a broader class of monotonic functions to represent the joint action-value function, rather than the summation of local action values:

$$\frac{\partial Q_{tot}(\boldsymbol{\tau}, \mathbf{u})}{\partial Q_i(\tau_i, u_i)} > 0, \quad \forall i \in N. \qquad (2)$$
QPLEX [43] provides IGM consistency by taking advantage of a duplex dueling architecture,

$$Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{i=1}^{N} Q_i(\boldsymbol{\tau}, u_i) + \sum_{i=1}^{N} \big(\lambda_i(\boldsymbol{\tau}, \mathbf{u}) - 1\big)\, A_i(\boldsymbol{\tau}, u_i) \qquad (3)$$

where

$$A_i(\boldsymbol{\tau}, u_i) = w_i(\boldsymbol{\tau})\,\big[Q_i(\tau_i, u_i) - V_i(\tau_i)\big], \quad V_i(\tau_i) = \max_{u_i} Q_i(\tau_i, u_i), \qquad (4)$$

and $w_i(\boldsymbol{\tau})$ is a positive weight. However, the max operator still limits QPLEX to discrete action spaces [48].

2.2 Maximum Entropy Deep Reinforcement Learning
In the maximum entropy reinforcement learning framework, exemplified by soft actor-critic [10], the objective is to maximize not only the cumulative expected reward but also the expected entropy of the policy:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot|s_t)\big)\big] \qquad (5)$$

where $\rho_\pi(s_t, a_t)$ denotes the state-action marginal distribution of the trajectory induced by the policy $\pi(a_t|s_t)$. Soft actor-critic uses an actor-critic architecture with independent policy and value networks, an off-policy paradigm for efficient data collection, and entropy maximization for effective exploration. It is considered a state-of-the-art baseline for many RL problems with continuous actions due to its stability and capability.
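As a concrete illustration of the objective in Eq. (5), the following minimal PyTorch sketch (our own example, not part of the original paper) computes the entropy-augmented return of a trajectory for a categorical policy; the temperature value and tensor sizes are arbitrary assumptions.

import torch
import torch.nn.functional as F

def entropy_augmented_objective(logits, rewards, alpha=0.2):
    """Monte-Carlo estimate of J(pi) = sum_t E[r_t + alpha * H(pi(.|s_t))].

    logits:  (T, n_actions) policy logits at each visited state
    rewards: (T,) rewards collected along the trajectory
    alpha:   temperature trading off reward and entropy (assumed value)
    """
    log_probs = F.log_softmax(logits, dim=-1)           # log pi(a|s_t)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)          # H(pi(.|s_t)) per step
    return (rewards + alpha * entropy).sum()            # entropy-augmented return

# toy usage: 5 steps, 3 discrete actions
J = entropy_augmented_objective(torch.randn(5, 3), torch.randn(5))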

2.3 Multi-Agent Policy Gradient Methods
Multi-agent policy gradient (MAPG) methods are extensions of policy gradient algorithms in which each agent $a$ has a policy $\pi_{\theta^a}(u^a|o^a)$. Compared with single-agent policy gradient methods, MAPG usually faces the issues of high-variance gradient estimates [21] and credit assignment [5]. A general multi-agent policy gradient can be written as:

$$\nabla_\theta J = \mathbb{E}_\pi\Big[\sum_{a} \nabla_\theta \log \pi_\theta(u^a|o^a)\, Q_\pi(s, \mathbf{u})\Big]$$

Multi-agent policy gradient methods in the current literature often take advantage of CTDE by using a central critic to obtain extra state information s, and avoid the vanilla multi-agent policy gradient above due to its high variance. For instance, (Lowe et al. 2017) utilize a central critic to estimate $Q(s, (a_1, \cdots, a_n))$ and optimize the actor parameters by following a multi-agent DDPG gradient, which is derived from the gradient above:

$$\nabla_{\theta^a} J = \mathbb{E}_\pi\Big[\nabla_{\theta^a} \pi(u^a|o^a)\, \nabla_{u^a} Q(s, \mathbf{u})\big|_{u^a = \pi(o^a)}\Big]$$

Unlike most actor-critic frameworks, (Foerster et al. 2018) claim to solve the credit assignment issue by applying the following counterfactual policy gradient:

$$\nabla_\theta J = \mathbb{E}_\pi\Big[\sum_{a} \nabla_\theta \log \pi_\theta(u^a|\tau^a)\, A^a(s, \mathbf{u})\Big],$$

where $A^a(s, \mathbf{u}) = Q_\pi(s, \mathbf{u}) - \sum_{u'^a} \pi_\theta(u'^a|\tau^a)\, Q_\pi\big(s, (\mathbf{u}^{-a}, u'^a)\big)$ is the counterfactual advantage for agent $a$. Note that (Foerster et al. 2018) argue that the COMA gradient provides agents with tailored gradients, thus achieving credit assignment, and they also prove that COMA is a variance reduction technique.
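As an illustration of the counterfactual advantage above, the sketch below (our own example; tensor shapes and argument names are assumptions) computes $A^a(s, \mathbf{u})$ for one agent from the centralized critic's Q-values over that agent's actions, holding the other agents' actions fixed.

import torch

def counterfactual_advantage(q_agent, pi_agent, chosen_action):
    """COMA-style counterfactual advantage for a single agent.

    q_agent:       (n_actions,) Q(s, (u^{-a}, u'^a)) for every action u'^a of agent a,
                   with the other agents' actions u^{-a} held fixed
    pi_agent:      (n_actions,) the agent's policy pi(.|tau^a)
    chosen_action: index of the action u^a actually taken
    """
    baseline = (pi_agent * q_agent).sum()      # sum_u' pi(u'|tau^a) Q(s, (u^{-a}, u'))
    return q_agent[chosen_action] - baseline   # A^a(s, u)

# toy usage: 3 actions, uniform policy
q = torch.tensor([1.0, 0.5, -0.2])
pi = torch.softmax(torch.zeros(3), dim=0)
adv = counterfactual_advantage(q, pi, chosen_action=0)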

2.4 Variational Autoencoders
Consider variables $x \in \mathcal{X}$ generated from an unknown random variable $z$ according to a generative distribution $p_u(x|z)$ with unknown parameters $u$, and a prior on the latent variables which we assume to be a Gaussian with zero mean and unit variance, $p(z) = \mathcal{N}(z; 0, I)$. To approximate the true posterior $p(z|x)$ with a variational distribution $q_w(z|x) = \mathcal{N}(z; \mu, \Sigma; w)$, [16] proposed Variational Autoencoders (VAE), which learn this distribution by minimizing the Kullback-Leibler (KL) divergence from the approximate to the true posterior, $D_{KL}(q_w(z|x)\|p(z|x))$; the resulting lower bound on the evidence $\log p(x)$ is

$$\log p(x) \geq \mathbb{E}_{z \sim q_w(z|x)}\big[\log p_u(x|z)\big] - D_{KL}\big(q_w(z|x)\|p(z)\big).$$

[12] proposed β-VAE, where a parameter β ≥ 0 is used to control the trade-off between the reconstruction loss and the KL divergence.
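The following minimal PyTorch sketch (our own illustration; the encoder/decoder architectures and the β value are assumptions) shows the β-VAE objective described above: a reconstruction term plus β times the KL divergence between the approximate posterior and the standard normal prior.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    def __init__(self, x_dim=32, z_dim=4, beta=4.0):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)
        self.beta = beta

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization trick
        recon = self.dec(z)
        recon_loss = F.mse_loss(recon, x, reduction="sum")
        # KL( N(mu, sigma^2) || N(0, I) ) in closed form
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum()
        return recon_loss + self.beta * kl

loss = BetaVAE()(torch.randn(8, 32))   # toy batch of 8 samples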

2.5 Information bottleneck Method
The information bottleneck method [40] is a technique in information theory, introduced as a principle for extracting the relevant information that an input random variable $X \in \mathcal{X}$ contains about an output random variable $Y \in \mathcal{Y}$, while finding the proper trade-off between extraction accuracy and complexity. Given the joint distribution $p(x, y)$, their relevant information is defined as the mutual information $I(X; Y)$. The problem can also be seen as a rate-distortion problem [41] with a non-fixed distortion measure conditioned on the optimal map, defined as

$$d_{IB}(x, \hat{x}) = D_{KL}\big(p(y|x)\,\|\,p(y|\hat{x})\big)$$

where $D_{KL}$ is the Kullback-Leibler divergence. The expected IB distortion is then $\mathbb{E}\big[d_{IB}(x, \hat{x})\big] = D_{IB} = I(X; Y|\hat{X})$, with the variational principle

$$\mathcal{L}\big[p(\hat{x}|x)\big] = I(X; \hat{X}) - \beta\, I(X; Y|\hat{X})$$

where β is a positive Lagrange multiplier that acts as a trade-off parameter between accuracy and complexity. [1] further proposed a variational approximation to the information bottleneck using deep neural networks.
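A minimal sketch of the deep variational information bottleneck of [1] is given below (our own illustration, not code from that paper): a stochastic encoder maps x to a Gaussian latent z, a decoder predicts the label y, and the loss trades prediction cross-entropy against the KL divergence to a standard normal prior weighted by β. Layer sizes and the β value are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepVIB(nn.Module):
    def __init__(self, x_dim=32, z_dim=8, n_classes=5, beta=1e-3):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)    # stochastic encoder p(z|x)
        self.dec = nn.Linear(z_dim, n_classes)    # variational decoder q(y|z)
        self.beta = beta

    def forward(self, x, y):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        ce = F.cross_entropy(self.dec(z), y)      # prediction (accuracy) term
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1).mean()
        return ce + self.beta * kl                # VIB objective

loss = DeepVIB()(torch.randn(8, 32), torch.randint(0, 5, (8,)))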

3 Related Works
Cooperative multi-agent decision making often suffers from exponential joint state and
action spaces. Multiple approaches, including independent Q-learning and mean-field games, have been considered in the literature, but they either do not perform well on challenging tasks or require homogeneous agents [36]. Recently, a paradigm of centralized
training and decentralized execution (CTDE) has been proposed for scalable decision
making [18]. Some of the key CTDE approaches include value function decomposition
and multi-agent policy gradient methods.
Policy gradient methods are considered to have more stable convergence than value-based methods [8] and can easily be extended to continuous-action problems. A representative multi-agent policy gradient method is COMA [4], which utilizes a centralized critic module to estimate the counterfactual advantage of an individual agent. However, as pointed out in [29], multi-agent policy gradient methods like MADDPG [21] are significantly outperformed by QMIX on both the multi-agent particle world environment (MPE) [25] and the StarCraft multi-agent challenge (SMAC) [33].

Decomposed actor-critic methods, which combine value function decomposition and policy gradient methods by using decomposed critics rather than centralized critics, have been introduced to guide policy gradients. VDAC [36] combines the structure of actor-critic and QMIX for joint state-value function estimation, while DOP [45] directly uses a network similar to Qatten [46] for policy gradients with off-policy tree backup and on-policy TD. The authors of [45] point out that decomposed critics are limited by their restricted expressive capability and thus cannot guarantee convergence to the global optimum, even though the individual policies may converge to local optima [48]. Extensions of the monotonic mixing function have also been considered, e.g., QTRAN [35] and weighted QMIX [31], but solving tasks that require significant coordination remains a key challenge.
 Another related topic is representational learning in reinforcement learning. A
VAE-based forward model is proposed in [9] to learn the state representations in the
environment. A model to learn Gaussian embedding representations of different tasks
during meta-testing is considered in [7]. The authors in [15] proposed a recurrent VAE
model which encodes the observation and action history and learns a variational dis-
tribution of the task. A method that uses an inference model to represent the decision making of opponents is presented in [28].
The closest work to ours is NDQ [44], which also utilizes latent variables to represent information, but as communication messages during decentralized execution. Although both works consider information extraction as an information bottleneck problem, there are several key differences between our work and NDQ: (I) NDQ is a value-based method, while our work is a policy-based method under the soft-actor-critic framework. (II) NDQ requires communication between agents during decentralized execution, which limits its use cases, while we only utilize the latent extra state information in the centralized critic, so that CTDE is maintained. (III) NDQ requires one-to-one communication during the execution stage, while in this work we introduce a latent information-sharing mechanism that can be considered an all-to-all message sharing method, which potentially requires less training time.
The proposed LSF-SAC method leverages an actor-critic design with latent state information for value function factorization. Inspired by β-VAE [12], we introduce a novel way to utilize the extra state information by using variational inference in the decomposed critic as latent state information for better individual value estimation. Despite the information sharing, CTDE is still maintained due to the actor-critic structure. We also maximize both the entropy and the expected return for better exploration through soft actor-critic with separate actor and critic networks.

4 System Model
Consider a fully cooperative multi-agent task as decentralized partially observable
Markov decision process (DEC-POMDP) [26], given by a tuple G = hI, S, U, P, r, Z, O, n, γi,
where I ≡ {1, 2, · · · , n} is the finite set of agents. The state is given as s ∈ S,
from which each agent draws its own observation from the observation function oi ∈
O(s, i) : S × A → O. At each timestamp t, each agent i choose an action ui ∈ U ,
composing a joint action selection u. A shared reward is then given as r=R(s, a) :

 6
S ×U→R, with the next state of each agent is s0 with transition probability function
P (s0 |s, u) : S × U → [0, 1]. Each agent has an action-observation historyP∞ τi ∈ T ≡
(O×U )∗ . Then a joint action value function Qπtot (τ , u) = Es0:∞ , u0:∞ [ t=0 γ t rt |s0 = s, u0 = u, π]
is proposed with policy π, and γ ∈ [0, 1) is the discount factor. Quantities in bold
denote a joint quantities across all agents, and quantity with super script i denote a
quantity specifically belong to agent i.
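To make the interface concrete, a minimal sketch of the transition data that an algorithm under this model stores is given below (our own illustration; the field names are assumptions, not from the paper).

from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Transition:
    """One Dec-POMDP step as stored in a replay buffer."""
    state: np.ndarray              # global state s (used only during centralized training)
    obs: List[np.ndarray]          # per-agent observations z^i drawn from O(s, i)
    actions: np.ndarray            # joint action u = (u^1, ..., u^n)
    reward: float                  # shared team reward r = R(s, u)
    next_state: np.ndarray         # s' ~ P(.|s, u)
    next_obs: List[np.ndarray]     # per-agent observations at s'
    done: bool                     # episode termination flag

# toy usage with n = 2 agents
t = Transition(np.zeros(10), [np.zeros(4), np.zeros(4)],
               np.array([1, 0]), 1.0, np.zeros(10),
               [np.zeros(4), np.zeros(4)], False)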

[Figure 1: Overview of LSF-SAC Approach. Best viewed in color. The diagram shows the decentralized agent networks (execution), the per-agent critics with the latent information sharing module, and the centralized mixing network (training).]

5 Proposed Approach
In this section, we first introduce the main structure of our proposed method, LSF-SAC,
then we discuss the detailed implementation of the key designs, namely soft actor-critic
framework for multi-agent reinforcement learning and value decomposition with latent
information-sharing mechanism, and their corresponding optimizing strategies.

5.1 Framework Overview
In our learning framework (Fig. 1), each individual actor (green part) outputs $\pi_\theta(a^i|\tau^i)$ conditioned only on its own local observation history. The centralized mixing network (orange part) approximates the joint action-value function from the individual value functions (blue part). A latent information-sharing mechanism (purple part) is proposed to encode the extracted extra state information and assist individual agents in local action-value estimation. Function approximators (neural networks) are used for both the actor and critic networks and are optimized with stochastic gradient descent.
The centralized critic network consists of (i) a local Q-network for each agent, (ii) a mixing network that takes all individual action-values, with its weights and biases generated by a separate hyper-network, and (iii) an extra state information encoder that generates latent state information to facilitate individual Q-value estimation. For each agent $i$, the local Q-network represents its local Q-value function $q^i(\tau_i, a_i, m_i)$, where $m_i$ is the extra state information for agent $i$ drawn from the global information-sharing pool. More precisely, the information for agent $i$ is generated from the messages of all other agents, each following a multivariate Gaussian distribution, denoted as $m_i = \langle m_j^{out} \rangle_{j \neq i}$ with $m_i^{out} \sim \mathcal{N}(f_m(\tau_i; \theta_m), I)$, where $\tau_i$ is the local observation history, $\theta_m$ are the parameters of the encoder $f_m$, and $I$ is an identity matrix.
The mixing network is a feed-forward network, following the approach in QMIX, which mixes all local Q-values to produce an estimate $Q_{tot}$. The weights and biases of the mixing network are generated by a hypernetwork that takes the joint state information $s$ as input. To enforce monotonicity, the weights generated by the hypernetworks are passed through an absolute-value function to produce non-negative values. The decentralized actor network is similar to the individual Q-network, except that it conditions only on its own observation and action history, and a softmax layer is added at the end of the network to convert logits into a categorical distribution. The overall goal is to minimize:

$$L(\theta) = L_{TD}(\theta_{TD}) + \lambda_1 L_m(\theta_m) + \lambda_2 L_\pi(\theta_\pi) \qquad (6)$$

where $L_{TD}(\theta_{TD})$ is the TD loss, which we show can also be used as the central critic loss, $L_m(\theta_m)$ is the message encoding loss, and $L_\pi(\theta_\pi)$ is the joint actor (policy) loss; $\lambda_1$ and $\lambda_2$ are weighting terms. The details of the latent state information generation and the soft-actor-critic framework, along with how to optimize them, are discussed in the following sections.
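For concreteness, the sketch below (our own illustration, not the authors' code; layer sizes are assumptions) shows a QMIX-style monotonic mixing network in PyTorch, where a hypernetwork conditioned on the global state produces the mixing weights and an absolute-value transform keeps them non-negative, enforcing the monotonicity constraint of Eq. (2).

import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent Q-values into Q_tot with state-conditioned, non-negative weights."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)  # first-layer weights
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)             # first-layer bias
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)             # second-layer weights
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))      # state-dependent bias

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(-1)                # Q_tot per batch element

# toy usage: batch of 4, 3 agents, 10-dimensional state
q_tot = MonotonicMixer(n_agents=3, state_dim=10)(torch.randn(4, 3), torch.randn(4, 10))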

5.2 Variational Approach Based Latent State Information
One of the key advantages of multi-agent policy gradients under the CTDE assumption is the effective utilization of extra state information. In our design, the extra state information is accessible not only to the mixing network but also to the individual agents' value networks (through information sharing). Due to the partial observability and uncertainty of multi-agent environments, an individual value estimate conditioned only on the agent's own observation and action history can be volatile and unreliable. Intuitively, introducing extra information from other agents helps remove the ambiguity and uncertainty of the current observation and enables effective individual value estimation.
However, how to efficiently and effectively encode such extra state information remains a crucial problem. We consider this as an information bottleneck problem [40]: specifically, for agent $i$, we maximize the mutual information between other agents' encoded information and their actions, while minimizing the mutual information between its own encoded information and its own observation history, so that only the necessary information is chosen and then efficiently encoded.
Formally, the objective for each agent $i$ can be written as:

$$J_m(\theta_m) = \sum_{j=1}^{n}\big[I_{\theta_m}(A_j; M_i | T_j, M_j) - \beta\, I_{\theta_m}(M_i; T_i)\big] \qquad (7)$$

where $A_j$ is agent $j$'s action selection, $M_i$ is a random variable of $m_i^{out}$, $T_j$ is a random variable of $\tau_j$, and $\beta \geq 0$ is a parameter used to control the trade-off between the two mutual information terms. Yet this does not directly lead to a learnable model, since the mutual information terms are intractable. With the help of variational approximation, specifically the deep variational information bottleneck [1], we are able to parameterize this model using a neural network. We then derive and optimize a variational lower bound of the first term of this objective as follows; detailed derivations and proofs can be found in Appendix A.1.
Lemma 1. A lower bound of the mutual information $I_{\theta_m}(A_j; M_i|T_j, M_j)$ is

$$\mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\big[-H\big[p(A_j|\mathbf{T}),\, q_\psi(A_j|T_j, \mathbf{M})\big]\big]$$

where $q_\psi$ is a variational Gaussian distribution with parameters $\psi$ used to approximate the unknown posterior $p(A_j|T_j, M_j)$, $\mathbf{T} = \{T_1, T_2, \cdots, T_n\}$, and $\mathbf{M} = \{M_1, M_2, \cdots, M_n\}$.
Proof. We provide a proof outline as follows.

$$I_{\theta_c}(A_j; M_i|T_j, M_j) = \int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log \frac{p(a_j|\tau_j, m_j)}{p(a_j|\tau_j, m_j^{out})}$$

where $p(a_j|\tau_j, m_j)$ is fully defined by our decoder $f_m$ and the Markov chain. Since this is intractable in our case, let $q_\psi(a_j|\tau_j, m_j)$ be a variational approximation to $p(a_j|\tau_j, m_j)$. Since the KL divergence is always positive, we have

$$I_{\theta_c}(A_j; M_i|T_j, M_j) \geq \int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log \frac{q_\psi(a_j|\tau_j, m_j)}{p(a_j|\tau_j, m_j^{out})} = \mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\big[-H\big[p(A_j|\mathbf{T}),\, q_\psi(A_j|T_j, \mathbf{M})\big]\big] + H\big(A_j|T_j, M_j^{out}\big)$$

Since $H(A_j|T_j, M_j^{out})$ is a positive term that is independent of our optimization procedure, it can be ignored, and we have

$$I_{\theta_m}(A_j; M_i|T_j, M_j) \geq \mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\big[-H\big[p(A_j|\mathbf{T}),\, q_\psi(A_j|T_j, \mathbf{M})\big]\big] \qquad (8)$$

Similarly, by introducing another variational approximator $q_\phi$, we have

$$I_{\theta_m}(M_i; T_i) = \mathbb{E}_{T_i \sim D,\, M_j \sim f_m}\big[D_{KL}\big(p(M_i|T_i)\,\|\,p(M_i)\big)\big] \leq \mathbb{E}_{T_i \sim D,\, M_j \sim f_m}\big[D_{KL}\big(p(M_i|T_i)\,\|\,q_\phi(M_i)\big)\big] \qquad (9)$$

where $D_{KL}$ denotes the Kullback-Leibler divergence operator and $q_\phi(M_i)$ is a variational posterior estimator of $p(M_i)$ with parameters $\phi$ (see Appendix A.1 for details). Then

with the evidence lower bound derived above, we optimize this bound for the message encoding objective, which is to minimize

$$L_m(\theta_m) = \mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\big[-H\big[p(A_j|\mathbf{T}),\, q_\psi(A_j|T_j, M_j)\big] + \beta\, D_{KL}\big(p(M_i|T_i)\,\|\,q_\phi(M_i)\big)\big]. \qquad (10)$$
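To illustrate Eq. (10), the sketch below is our own reconstruction of the idea, not the authors' code; the network shapes, the discrete-action assumption, the variable names, and the simplification of fixing the variational prior $q_\phi$ to a standard normal are all assumptions. It shows a latent message encoder with a unit-variance Gaussian output, the variational action decoder $q_\psi$, and the resulting cross-entropy plus KL loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMessageLoss(nn.Module):
    """Sketch of the message-encoding loss: action cross-entropy + beta * KL to N(0, I)."""
    def __init__(self, obs_dim=16, msg_dim=3, n_actions=5, beta=1e-3):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, msg_dim)                  # f_m: tau_i -> mean of m_i^out
        self.decoder = nn.Linear(obs_dim + msg_dim, n_actions)      # q_psi(a_j | tau_j, m)
        self.beta = beta

    def forward(self, obs_i, obs_j, actions_j):
        mean = self.encoder(obs_i)
        msg = mean + torch.randn_like(mean)                         # m_i^out ~ N(f_m(tau_i), I)
        logits = self.decoder(torch.cat([obs_j, msg], dim=-1))
        ce = F.cross_entropy(logits, actions_j)                     # cross-entropy term of Eq. (10)
        # KL( N(mean, I) || N(0, I) ) reduces to 0.5 * ||mean||^2 per sample
        kl = 0.5 * mean.pow(2).sum(dim=-1).mean()
        return ce + self.beta * kl

loss = LatentMessageLoss()(torch.randn(8, 16), torch.randn(8, 16),
                           torch.randint(0, 5, (8,)))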

Algorithm 1 LSF-SAC
 1: for k = 0 to max train steps do
 2:   Initialize environment
 3:   for t = 0 to max episode limit do
 4:     For each agent i, take action a^i ~ π_i
 5:     Execute joint action a; observe reward r, state-action history τ, and next state s_{t+1}
 6:     Store (τ, a, r, τ') in replay buffer D
 7:   end for
 8:   for t = 1 to T do
 9:     Sample minibatch B from D
10:     Generate latent state information m_i^out ~ N(f_m(τ_i; θ_m), I) for i = 1 to n
11:     Update critic network: θ_TD ← θ_TD − η ∇L_TD(θ_TD), with L_TD given by Eq. (14)
12:     Update policy network: π ← π − η ∇L(π), with L(π) given by Eq. (12)
13:     Update encoding network: θ_m ← θ_m − η ∇L_m(θ_m), with L_m given by Eq. (10)
14:     Update temperature parameter: α ← α − η ∇J(α), with J(α) given by Eq. (13)
15:     if time to update target network then
16:       θ^- ← θ
17:     end if
18:   end for
19: end for
20: Return π

5.3 Factorizing Multi-Agent Maximum Entropy RL
In this section, we present one possible implementation that extends soft actor-critic to the multi-agent domain with latent state information assisted value function decomposition. Recent work has shown that Boltzmann exploration policy iteration is guaranteed to improve the policy and converge to the optimum [10]; its objective, extended to the multi-agent domain, can be defined as

$$J(\pi) = \sum_{t} \mathbb{E}\big[r(s_t, \mathbf{a}_t) + \alpha \mathcal{H}\big(\boldsymbol{\pi}(\cdot|s_t)\big)\big] \qquad (11)$$
where the temperature α is a hyper-parameter that controls the trade-off between maximizing the expected return and maximizing the entropy for better exploration.
Following previous research on value decomposition, to maximize both the expected return and the entropy, we define the soft policy loss of LSF-SAC as:

$$L(\pi) = \mathbb{E}_D\big[\alpha \log \boldsymbol{\pi}(\mathbf{a}_t|\boldsymbol{\tau}_t) - Q^\pi_{tot}(s_t, \boldsymbol{\tau}_t, \mathbf{a}_t)\big] = q^{mix}\Big(s_t,\, \mathbb{E}_{\pi_i}\big[\alpha \log \pi^i(a^i_t|\tau^i_t) - q^i(\tau^i_t, a^i_t)\big]\Big) \qquad (12)$$

where $Q^\pi_{tot}$ is the soft value decomposition network with $a^i \sim \pi_i(o^i)$, and $D$ is the replay buffer used to sample training data (state-action histories, rewards, etc.).
Then, we can tune the temperature α as proposed in [10] by optimizing the following:

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\big[-\alpha \log \pi_t(a_t|s_t) - \alpha \mathcal{H}_0\big] \qquad (13)$$
Unlike VDAC, which shares the same network for the actor and the local Q-value estimation, we use a separate network for the policies and train them independently from the critic networks. Latent state information is used by the individual critics for joint action-value function factorization. We propose a latent state information assisted soft value decomposition design as

$$Q_{tot}(\boldsymbol{\tau}, \mathbf{a}, \mathbf{m}; \theta) = q^{mix}\Big(s_t,\, \mathbb{E}_{\pi_i}\big[q^i(\tau^i_t, a^i_t, m^i_t); \theta\big]\Big)$$

We then use the TD advantage with the latent information sharing design as the critic loss, i.e.,

$$L_{TD}(\theta) = \Big[r + \gamma \max_{\mathbf{a}'} Q_{tot}\big(\boldsymbol{\tau}', \mathbf{a}', \mathbf{m}'; \theta^-\big) - Q^\pi_{tot}(\boldsymbol{\tau}, \mathbf{a}, \mathbf{m}; \theta)\Big]^2 = \Big[r + \gamma \max_{\mathbf{a}'} q^{mix}\Big(s_{t+1},\, \mathbb{E}_{\pi_i}\big[q^i(\tau^i_{t+1}, a^i_{t+1}, m^i_{t+1}); \theta^-\big]\Big) - q^{mix}\Big(s_t,\, \mathbb{E}_{\pi_i}\big[q^i(\tau^i_t, a^i_t, m^i_t); \theta\big]\Big)\Big]^2 \qquad (14)$$

where $a^i \sim \pi_i(o^i)$ and $\theta^-$ are the parameters of the target network, which are periodically updated. Detailed derivations can be found in Appendix A.2.
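A compact sketch of how Eqs. (12) and (14) could be computed is given below (our own illustration, not the authors' implementation). Here mixer and target_mixer are any callables mapping (per-agent values, state) to Q_tot, e.g., a monotonic mixing network; the expectation over π_i is approximated with sampled actions, and the max over next joint actions is assumed to have already been taken when evaluating next_agent_qs.

import torch

def soft_losses(mixer, target_mixer, agent_qs, next_agent_qs, log_pis,
                state, next_state, reward, alpha=0.2, gamma=0.99):
    """agent_qs / next_agent_qs: (batch, n_agents) per-agent q^i(tau^i, a^i, m^i);
    log_pis: (batch, n_agents) log pi^i(a^i | tau^i) of the sampled actions."""
    # Eq. (12): soft policy loss mixed through the monotonic network
    policy_loss = mixer(alpha * log_pis - agent_qs, state).mean()
    # Eq. (14): TD loss against a periodically updated target mixer
    with torch.no_grad():
        target = reward + gamma * target_mixer(next_agent_qs, next_state)
    td_loss = (target - mixer(agent_qs, state)).pow(2).mean()
    return policy_loss, td_loss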

6 Experiments
In this section, we first empirically study the improvement in the power of value function factorization achieved by LSF-SAC through a non-monotonic matrix game and compare the results with several existing value function factorization methods. Then, on StarCraft II, we compare LSF-SAC with several state-of-the-art baselines. Finally, we perform several ablation studies to analyze the factors that contribute to the performance.

6.1 Single-state Matrix Game
Proposed in QTRAN [35], the non-monotonic matrix game, as illustrated in Table 1(a),
consists of two agents with three available actions and a shared reward. We show the
value function factorization results of QTRAN, LSF-SAC, VDN, QMIX, and DOP [45].

(a) Payoff of matrix game
 u1\u2      A       B       C
   A       8.0   -12.0   -12.0
   B     -12.0     0.0     0.0
   C     -12.0     0.0     0.0

(b) QTRAN
 Q1\Q2   4.2(A)  2.3(B)  2.3(C)
 3.8(A)    8.0    6.13    6.1
-2.1(B)    2.1    0.2     0.2
-2.3(C)    1.9    0.0     0.0

(c) LSF-SAC
 Q1\Q2   1.7(A) -11.5(B) -12.7(C)
 0.4(A)    8.1   -6.2    -6.0
-9.9(B)   -6.0   -5.9    -6.1
-9.5(C)   -5.9   -6.0    -6.0

(d) VDN
 Q1\Q2   3.1(A)  -2.3(B) -2.4(C)
-2.3(A)   -5.4   -4.6    -4.7
-1.2(B)   -4.4   -3.5    -3.6
-0.7(C)   -3.9   -3.0    -3.1

(e) QMIX
 Q1\Q2  -0.9(A)  0.0(B)  0.0(C)
-1.0(A)   -8.1   -8.1    -8.1
 0.1(B)   -8.1    0.0     0.0
 0.1(C)   -8.1    0.0     0.0

(f) DOP
 Q1\Q2  -2.5(A) -1.3(B)  0.0(C)
-1.0(A)   -7.8   -6.0    -4.2
 0.1(B)   -6.1   -4.4    -2.6
 0.1(C)   -4.2   -2.4    -0.7

Table 1: Payoff matrix of the one-step matrix game, the learned Q1 and Q2, and the reconstructed Qtot of selected algorithms. Boldface in the original denotes the optimal/greedy actions under the state-action values. The use of variational information can significantly improve the power of the function factorization operators.

Tables 1(b)-1(f) show the learned values of the selected algorithms. QTRAN and LSF-SAC learn a policy in which the agents jointly take the optimal action while conditioning only on their local observations, meaning successful factorization. DOP falls into the sub-optimum caused by miscoordination penalties, similar to VDN and QMIX, which are limited by their additivity and monotonicity constraints. Although QTRAN manages to address these limitations with a more general value decomposition, later works [22] point out that it imposes computationally intractable constraints that can lead to poor empirical performance on complex MARL domains. It is also worth noting that LSF-SAC finds the optimal joint action under the monotonicity constraint by providing variational information; this indicates that the utilization of latent state information can significantly improve the power of the monotonic factorization operators in a mixing network like QMIX.
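As a quick check of why the monotonic factorizations in Table 1 are suboptimal, the snippet below (our own illustration) takes the per-agent values reported for QMIX in Table 1(e), performs the decentralized argmax that IGM prescribes, and looks up the resulting joint action in the payoff matrix of Table 1(a).

import numpy as np

payoff = np.array([[8.0, -12.0, -12.0],
                   [-12.0, 0.0, 0.0],
                   [-12.0, 0.0, 0.0]])          # Table 1(a)

q1_qmix = np.array([-1.0, 0.1, 0.1])            # Q1 from Table 1(e)
q2_qmix = np.array([-0.9, 0.0, 0.0])            # Q2 from Table 1(e)

u1, u2 = int(q1_qmix.argmax()), int(q2_qmix.argmax())
print(payoff[u1, u2], payoff.max())             # greedy return 0.0 vs. optimal 8.0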

6.2 Decentralised Starcraft II micromanagement benchmark
In this section, we show our experimental results on the decentralized StarCraft II micromanagement benchmark [33], compared to several state-of-the-art algorithms, not limited to multi-agent policy gradient methods but also including decomposed value methods and combined methods, namely COMA [4], MAVEN [22], QMIX [32], and VDAC-vmix from VDAC [36], which the authors report delivers the best performance of their two proposed methods.

Figure 2: Comparisons with baselines on the SMAC benchmark
We then perform several ablation studies to analyze the factors that contribute to the performance. It is worth noting that results on the StarCraft Multi-Agent Challenge (SMAC) are significantly affected by various code-level optimizations, i.e., hyper-parameter tuning; as also found by [13], some works rely on heavy hyper-parameter tuning to achieve results that they otherwise could not. Consistent with previous work, we carry out the tests with the same hyper-parameter settings across all algorithms. More details about the algorithm implementations and settings can be found in Appendix C.
We choose six maps covering both symmetric and asymmetric scenarios for the general test, ranging from easy to hard. For each algorithm, training is paused every 5000 steps for an evaluation phase, in which 32 independent episodes are generated with each agent acting greedily according to its policy or value function. The median winning rate over five independent training runs is used for performance comparison. Specifically, we choose maps ranging from symmetric ones with the same units: 8m (easy); symmetric ones with different units: 1c3s5z, 3s5z; asymmetric ones with different units: 2s vs 1sc, 5m vs 6m; and different units with a large action space: MMM2. Details about the StarCraft Multi-Agent Challenge settings can be found in Appendix B. Note that LSF-SAC performs exceptionally well on maps with challenging tasks that require more state information or substantial cooperation.

6.3 General Results
Following the practice of previous works, as suggested in [33], for each map we compare the winning rates and plot the median, with the shaded area representing the highest and lowest values of the test results, in Figure 2.
In general, we observe that LSF-SAC achieves strong performance on all selected SMAC maps; notably, it outperforms the state-of-the-art algorithms or achieves faster and more stable convergence at a higher win rate.
In easy scenarios like 8m and 1c3s5z, almost all algorithms perform well. Since the built-in AI tends to attack the nearest enemy, pulling back the friendly unit with the lowest health is a simple winning strategy to learn. However, it is worth noting that although QMIX can achieve a relatively high winning rate, its convergence is quite unstable, indicating that its policy might be overfitting to some specific scenarios. On these two maps, LSF-SAC outperforms all the baselines in both convergence speed and final performance with more stable results, demonstrating the potential of its more general policy expressiveness in value decomposition.
On the 2s vs 1sc map, where a specific strategy is required to win (the two units must cooperate and take turns attacking the enemy unit), LSF-SAC is able to achieve a high winning rate, yet it fluctuates in the early stage of training. This is potentially due to the penalty from entropy maximization, which forces the agents to try out additional tactics even after an optimal policy has already been learned.
On more challenging scenarios like 5m vs 6m and 3s5z, LSF-SAC achieves a higher winning rate than the other algorithms listed. On MMM2, a complex environment with more unit types and larger numbers, VDAC quickly falls into a sub-optimum and converges to it, while LSF-SAC keeps exploring for a better policy; this demonstrates LSF-SAC's improved exploration ability. Both COMA and MAVEN fail to learn a consistent policy to defeat the built-in AI, due to the non-stationarity of the environment and their lack of utilization of extra state information.

6.4 Ablation study
In this section, we compare LSF-SAC with several modified algorithms to understand the contribution of the different modules in LSF-SAC. We choose two of the previously tested SMAC maps: 8m and 5m vs 6m. Each experiment is repeated over four independent runs with random seeds, and the median results are presented.

6.4.1 Ablation 1
First, we consider the setting of LSF-SAC without the extra state information encoding (the purple part in Fig. 1), denoted MASAC. This demonstrates how multi-agent soft-actor-critic works alone, and comparing MASAC against the original LSF-SAC highlights the importance of the latent state information.

6.4.2 Ablation 2
We then consider our implementation of multi-agent actor-critic with value decomposition, denoted MAA2C, which can also be considered QMIX under an A2C setting [36]. This is to isolate the contribution of soft-actor-critic to enhancing exploration.

6.4.3 Ablation 3
We also consider a fixed-temperature design, MASAC with fixed α = 1.0 (denoted MASAC α = 1.0); this is to assess the effectiveness of the design that automatically updates the temperature α.

6.4.4 Ablation 4
Finally, we note that the original (single-agent) soft-actor-critic algorithm [10] and several other works use two independently trained soft Q-functions and take the minimum of the two for policy optimization, as [11, 6] point out that policy steps are known to degrade the performance of value-based methods; e.g., [30] train with

$$L(\theta) = \Big[\Big(r_t + \gamma \min_{j \in \{1,2\}} Q_{tot}\big(s'_t, \boldsymbol{\tau}'_t, \mathbf{a}'_t; \theta_j^-\big)\Big) - Q_{tot}\big(s_t, \boldsymbol{\tau}_t, \mathbf{a}_t; \theta\big)\Big]^2.$$

The corresponding performance comparison appears in the ablation studies as MASAC DoubleQ [30]. This is to determine whether the TD advantage with double Q-learning is more stable in MARL when combined with value function decomposition.

 Figure 3: Ablation Results on 8m


6.4.5 Ablation Results
By comparing the results of MASAC and LSF-SAC, we observe an improvement on both maps in the performance of LSF-SAC, which confirms the contribution of the latent state information assisted value decomposition design.
Both MASAC and MASAC with α = 1.0 were able to outperform MAA2C, even though the latter variant has a fixed α, which can be viewed as training with aggressive exploration throughout the entire training session. Note that MAA2C soon converges to a local optimum on 3s5z, while it cannot produce a learnable policy on the 5m vs 6m map.

Figure 4: Ablation Results on 5m vs 6m
Also, MASAC achieves a higher winning rate and faster convergence than MASAC with α = 1.0. On the 5m vs 6m map, MASAC with α = 1.0 initially finds a correct direction of optimization, but the constant penalty from entropy maximization forces it to keep exploring and trying other policies. This illustrates the advantage of the automatic temperature-updating design.
Finally, although MASAC DoubleQ delivers a learnable policy on the 3s5z environment at a plodding pace, it fails to learn a policy on 5m vs 6m within the episode limit; this could potentially be the result of its more complex model and the relatively continuous reward in this specific environment. Also, due to its redundant network size, we find that MASAC DoubleQ, with its double value function design, takes a significantly longer time to train. This suggests that the TD advantage with a single value function is sufficient to optimize multi-agent actor-critics with value decomposition.

Figure 5: Ablation Results on 3s5z

7 Conclusions
In this paper, we propose LSF-SAC, a novel framework that combines latent state information assisted individual value estimation for joint value function factorization with multi-agent entropy maximization, for collaborative multi-agent reinforcement learning under the CTDE paradigm. We introduce an information-theoretic regularization method for optimizing the latent state information generator, so as to efficiently and effectively utilize extra state information in individual value estimation, while CTDE can still be maintained through a soft-actor-critic design. We also propose one possible implementation that extends off-policy maximum entropy deep reinforcement learning to the multi-agent domain with latent state information. We empirically show that latent state information sharing significantly improves the power of value function decomposition operators, and our empirical results show that the framework significantly outperforms the baseline methods on the SMAC environment. We further analyze the key factors contributing to its performance through a set of ablation studies. In future work, we plan to focus on extending the proposed method to a continuous action space with different policy gradient methods.

References
[1] Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2016). Deep variational
 information bottleneck. ArXiv Preprint arXiv:1612.00410.
[2] Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., and Pineau,
 J. (2019). Tarmac: Targeted multi-agent communication. International Conference
 on Machine Learning, 1538–1546.
[3] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Bench-
 marking deep reinforcement learning for continuous control. International Confer-
 ence on Machine Learning, 1329–1338.

[4] Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. (2018). Coun-
 terfactual multi-agent policy gradients. Proceedings of the AAAI Conference on
 Artificial Intelligence, 32.
[5] Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P. H., Kohli, P., and
 Whiteson, S. (2017). Stabilising experience replay for deep multi-agent reinforce-
 ment learning. International Conference on Machine Learning, 1146–1155.

[6] Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approxima-
 tion error in actor-critic methods. International Conference on Machine Learning,
 1587–1596.
[7] Grover, A., Al-Shedivat, M., Gupta, J., Burda, Y., and Edwards, H. (2018). Learn-
 ing policy representations in multiagent systems. International Conference on Ma-
 chine Learning, 1802–1811.
[8] Gupta, J. K., Egorov, M., and Kochenderfer, M. (2017). Cooperative multi-
 agent control using deep reinforcement learning. International Conference on Au-
 tonomous Agents and Multiagent Systems, 66–83.
[9] Ha, D., and Schmidhuber, J. (2018). World models. ArXiv Preprint
 arXiv:1803.10122.
[10] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-
 policy maximum entropy deep reinforcement learning with a stochastic actor. In-
 ternational Conference on Machine Learning, 1861–1870.
[11] Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Process-
 ing Systems, 23, 2613–2621.
[12] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mo-
 hamed, S., and Lerchner, A. (2016). beta-vae: Learning basic visual concepts with
 a constrained variational framework.
[13] Hu, J., Jiang, S., Harding, S. A., Wu, H., and Liao, S. (2021). RIIT: Rethinking
 the Importance of Implementation Tricks in Multi-Agent Reinforcement Learning.
 ArXiv Preprint arXiv:2102.03479.
[14] Hu, Y., Nakhaei, A., Tomizuka, M., and Fujimura, K. (2019). Interaction-aware
 decision making with adaptive strategies under merging scenarios. 2019 IEEE/RSJ
 International Conference on Intelligent Robots and Systems (IROS), 151–158.
[15] Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S. (2018). Deep varia-
 tional reinforcement learning for POMDPs. International Conference on Machine
 Learning, 2117–2126.
[16] Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-
 supervised learning with deep generative models. Advances in Neural Information
 Processing Systems, 3581–3589.
[17] Kingma, D. P., and Welling, M. (2013). Auto-encoding variational bayes. ArXiv
 Preprint arXiv:1312.6114.
[18] Kraemer, L., and Banerjee, B. (2016). Multi-agent reinforcement learning as a
 rehearsal for decentralized planning. Neurocomputing, 190, 82–94.
[19] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and
 Wierstra, D. (2015). Continuous control with deep reinforcement learning. ArXiv
 Preprint arXiv:1509.02971.

[20] Littman, M. L. (1994). Markov games as a framework for multi-agent reinforce-
 ment learning. In Machine learning proceedings 1994 (pp. 157–163). Elsevier.
[21] Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. (2017). Multi-
 Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Neural In-
 formation Processing Systems (NIPS).
[22] Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. (2019). Maven: Multi-
 agent variational exploration. ArXiv Preprint arXiv:1910.07483.
[23] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D.,
 and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. ArXiv
 Preprint arXiv:1312.5602.
[24] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G.,
 Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., and others. (2015).
 Human-level control through deep reinforcement learning. Nature, 518(7540),
 529–533.
[25] Mordatch, I., and Abbeel, P. (2017). Emergence of Grounded Compositional Lan-
 guage in Multi-Agent Populations. ArXiv Preprint arXiv:1703.04908.
[26] Oliehoek, F. A., and Amato, C. (2016). A concise introduction to decentralized
 POMDPs. Springer.
[27] Panait, L., and Luke, S. (2005). Cooperative multi-agent learning: The state of
 the art. Autonomous Agents and Multi-Agent Systems, 11(3), 387–434.
[28] Papoudakis, G., and Albrecht, S. V. (2020). Variational autoencoders for opponent
 modeling in multi-agent systems. ArXiv Preprint arXiv:2001.10829.
[29] Papoudakis, G., Christianos, F., Schäfer, L., and Albrecht, S. V. (2020). Com-
 parative evaluation of multi-agent deep reinforcement learning algorithms. ArXiv
 Preprint arXiv:2006.07869.
[30] Pu, Y., Wang, S., Yang, R., Yao, X., and Li, B. (2021). Decomposed Soft
 Actor-Critic Method for Cooperative Multi-Agent Reinforcement Learning. ArXiv
 Preprint arXiv:2104.06655.
[31] Rashid, T., Farquhar, G., Peng, B., and Whiteson, S. (2020). Weighted QMIX:
 Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Rein-
 forcement Learning.
[32] Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., and White-
 son, S. (2018). Qmix: Monotonic value function factorisation for deep multi-
 agent reinforcement learning. International Conference on Machine Learning,
 4295–4304.
[33] Samvelyan, M., Rashid, T., De Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T.
 G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. (2019). The starcraft
 multi-agent challenge. ArXiv Preprint arXiv:1902.04043.

[34] Shao, J., Zhang, H., Jiang, Y., He, S., and Ji, X. (2021). Credit Assignment with
 Meta-Policy Gradient for Multi-Agent Reinforcement Learning. ArXiv Preprint
 arXiv:2102.12957.
[35] Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. (2019). Qtran:
 Learning to factorize with transformation for cooperative multi-agent reinforce-
 ment learning. International Conference on Machine Learning, 5887–5896.
[36] Su, J., Adams, S., and Beling, P. (2021). Value-Decomposition Multi-Agent
 Actor-Critics. Proceedings of the AAAI Conference on Artificial Intelligence,
 35(13), 11352–11360.

[37] Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg,
 M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and others. (2017). Value-
 decomposition networks for cooperative multi-agent learning. ArXiv Preprint
 arXiv:1706.05296.
[38] Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J.,
 and Vicente, R. (2017). Multiagent cooperation and competition with deep rein-
 forcement learning. PloS One, 12(4), e0172395.
[39] Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative
 agents. Proceedings of the Tenth International Conference on Machine Learning,
 330–337.

[40] Tishby, N., Pereira, F., and Bialek, W. (2000). The information bottleneck
 method. ArXiv Preprint physics/0004057.
[41] Tishby, N., and Zaslavsky, N. (2015). Deep learning and the information bottle-
 neck principle. 2015 IEEE Information Theory Workshop (ITW), 1–5.

[42] Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung,
 J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., and others. (2019). Grand-
 master level in StarCraft II using multi-agent reinforcement learning. Nature,
 575(7782), 350–354.
[43] Wang, J., Ren, Z., Liu, T., Yu, Y., and Zhang, C. (2020). Qplex: Duplex dueling
 multi-agent q-learning. ArXiv Preprint arXiv:2008.01062.

[44] Wang, T., Wang, J., Zheng, C., and Zhang, C. (2019). Learning nearly de-
 composable value functions via communication minimization. ArXiv Preprint
 arXiv:1910.05366.
[45] Wang, Y., Han, B., Wang, T., Dong, H., and Zhang, C. (2020). Off-policy multi-
 agent decomposed policy gradients. ArXiv Preprint arXiv:2007.12322.
[46] Yang, Y., Hao, J., Liao, B., Shao, K., Chen, G., Liu, W., and Tang, H. (2020).
 Qatten: A general framework for cooperative multiagent reinforcement learning.
 ArXiv Preprint arXiv:2002.03939.

[47] Ye, D., Zhang, M., and Yang, Y. (2015). A multi-agent framework for packet
 routing in wireless sensor networks. Sensors, 15(5), 10026–10047.
[48] Zhang, T., Li, Y., Wang, C., Xie, G., and Lu, Z. (2021). FOP: Factorizing Optimal
 Joint Policy of Maximum-Entropy Multi-Agent Reinforcement Learning. Interna-
 tional Conference on Machine Learning, 12491–12500.

A Mathematical Details
A.1 Boundaries for extra state information
To efficiently and effectively encode extra state information for individual value estimation, we formulate the information encoding problem as an information bottleneck problem [40]; the objective for each agent $i$ can be written as:

$$J_m(\theta_m) = \sum_{j=1}^{n}\big[I_{\theta_m}(A_j; M_i|T_j, M_j) - \beta\, I_{\theta_m}(M_i; T_i)\big] \qquad (15)$$

This objective is appealing because it defines what a good representation is in terms of the trade-off between a succinct representation and inference ability. The main shortcoming is that computing the mutual information is computationally challenging. Inspired by recent advances in Bayesian inference and variational auto-encoders [17, 28, 44], we propose a novel way of representing it by utilizing latent vectors from variational inference models with an information-theoretic regularization method, and then derive the evidence lower bound (ELBO) of the objective.
Lemma 2. A lower bound of the mutual information $I_{\theta_m}(A_j; M_i|T_j, M_j)$ is

$$\mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\big[-H\big[p(A_j|\mathbf{T}),\, q_\psi(A_j|T_j, \mathbf{M})\big]\big]$$

where $q_\psi$ is a variational Gaussian distribution with parameters $\psi$ to approximate the unknown posterior $p(A_j|T_j, M_j)$, $\mathbf{T} = \{T_1, T_2, \cdots, T_n\}$, and $\mathbf{M} = \{M_1, M_2, \cdots, M_n\}$.

Proof.

$$I_{\theta_c}(A_j; M_i|T_j, M_j) = \int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log \frac{p\big(a_j, m_i^{out}|\tau_j, m_j^{out}\big)}{p\big(a_j|\tau_j, m_j^{out}\big)\, p\big(m_i^{out}|\tau_j, m_j^{out}\big)} = \int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log \frac{p(a_j|\tau_j, m_j)}{p\big(a_j|\tau_j, m_j^{out}\big)}$$

where $p(a_j|\tau_j, m_j)$ is fully defined by our encoder and the Markov chain. Since this is intractable in our case, let $q_\psi(a_j|\tau_j, m_j)$ be a variational approximation to $p(a_j|\tau_j, m_j)$; this is our decoder, which we take to be another neural network with its own set of parameters $\psi$. Using the fact that the Kullback-Leibler divergence is always positive,

$$KL\big[p(a_j|\tau_j, m_j),\, q_\psi(a_j|\tau_j, m_j)\big] \geq 0,$$

we have

$$\int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log p(a_j|\tau_j, m_j) \geq \int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log q_\psi(a_j|\tau_j, m_j)$$

and hence

$$\begin{aligned}
I_{\theta_c}(A_j; M_i|T_j, M_j)
&\geq \int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log \frac{q_\psi(a_j|\tau_j, m_j)}{p\big(a_j|\tau_j, m_j^{out}\big)} \\
&= \int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log q_\psi(a_j|\tau_j, m_j) - \int da_j\, d\tau_j\, dm_j\; p(a_j, \tau_j, m_j)\, \log p\big(a_j|\tau_j, m_j^{out}\big) \\
&= \int da_j\, d\tau_j\, dm_j\; p(\tau_j)\, p(m_j|\tau_j)\, p(a_j|\tau_j)\, \log q_\psi(a_j|\tau_j, m_j) + H(A_j|T_j, M_j) \\
&= \mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\Big[\int da_j\, p(A_j|\mathbf{T})\, \log q_\psi(a_j|\tau_j, m_j)\Big] + H(A_j|T_j, M_j) \\
&= \mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\big[-H\big[p(A_j|\mathbf{T}),\, q_\psi(A_j|T_j, \mathbf{M})\big]\big] + H(A_j|T_j, M_j)
\end{aligned}$$

Notice that the entropy of the labels, $H(A_j|T_j, M_j)$, is a positive term that is independent of our optimization procedure and can thus be ignored. Then we have

$$I_{\theta_m}(A_j; M_i|T_j, M_j) \geq \mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\big[-H\big[p(A_j|\mathbf{T}),\, q_\psi(A_j|T_j, \mathbf{M})\big]\big]$$

which is the lower bound of the first term in Eq. (15).

Lemma 3. An upper bound of the mutual information $I_{\theta_m}(M_i; T_i)$ is

$$\mathbb{E}_{T_i \sim D,\, M_j \sim f_m}\big[D_{KL}\big(p(M_i|T_i)\,\|\,q_\phi(M_i)\big)\big]$$

where $D_{KL}$ denotes the Kullback-Leibler divergence operator and $q_\phi(M_i)$ is a variational posterior estimator of $p(M_i)$ with parameters $\phi$.

Proof.

$$\begin{aligned}
I_{\theta_m}(M_i; T_i)
&= \int dm_i^{out}\, d\tau_i\; p(m_i^{out}|\tau_i)\, p(\tau_i)\, \log \frac{p(m_i^{out}|\tau_i)}{p(m_i^{out})} \\
&= \int dm_i^{out}\, d\tau_i\; p(m_i^{out}|\tau_i)\, p(\tau_i)\, \log p(m_i^{out}|\tau_i) - \int dm_i^{out}\, d\tau_i\; p(m_i^{out}|\tau_i)\, p(\tau_i)\, \log p(m_i^{out})
\end{aligned}$$

Again, $p(m_i^{out})$ is fully defined by our encoder and the Markov chain, but even when it is fully defined, computing the marginal distribution $\int d\tau_i\, p(m_i^{out}|\tau_i)\, p(\tau_i)$ may be difficult. So we use $q_\phi(m_i^{out})$ as a variational approximation to this marginal. Since $KL\big[p(m_i^{out}),\, q_\phi(m_i^{out})\big] \geq 0$, we have

$$\int dm_i^{out}\; p(m_i^{out})\, \log p(m_i^{out}) \geq \int dm_i^{out}\; p(m_i^{out})\, \log q_\phi(m_i^{out}).$$

Then

$$\begin{aligned}
I_{\theta_m}(M_i; T_i)
&\leq \int dm_i^{out}\, d\tau_i\; p(m_i^{out}|\tau_i)\, p(\tau_i)\, \log p(m_i^{out}|\tau_i) - \int dm_i^{out}\, d\tau_i\; p(m_i^{out}|\tau_i)\, p(\tau_i)\, \log q_\phi(m_i^{out}) \\
&= \int dm_i^{out}\, d\tau_i\; p(m_i^{out}|\tau_i)\, p(\tau_i)\, \log \frac{p(m_i^{out}|\tau_i)}{q_\phi(m_i^{out})} \\
&= \mathbb{E}_{T_i \sim D,\, M_j \sim f_m}\big[D_{KL}\big(p(M_i|T_i)\,\|\,q_\phi(M_i)\big)\big]
\end{aligned}$$

Combining Lemma 2 and Lemma 3, we have the ELBO for the message encoding objective, which is to minimize

$$L_m(\theta_m) = \mathbb{E}_{\mathbf{T} \sim D,\, M_j \sim f_m}\big[-H\big[p(A_j|\mathbf{T}),\, q_\psi(A_j|T_j, M_j)\big] + \beta\, D_{KL}\big(p(M_i|T_i)\,\|\,q_\phi(M_i)\big)\big].$$

A.2 Soft Value Decomposition with Latent State Information
The joint soft action-value estimation using latent-state-information value function decomposition with a monotonic mixing network:

$$\begin{aligned}
\mathbb{E}_{\pi_i}\big[\alpha \log \boldsymbol{\pi}(\mathbf{a}_t|\boldsymbol{\tau}_t) - Q^\pi_{tot}(s_t, \boldsymbol{\tau}_t, \mathbf{a}_t)\big]
&= \sum_i k^i(s)\, \mathbb{E}_\pi\big[\alpha \log \pi^i(a_t|\tau_t)\big] - \mathbb{E}_\pi\big[Q_{tot}(\boldsymbol{\tau}, \mathbf{a}, \mathbf{m}; \theta)\big] \\
&= \sum_i k^i(s)\, \mathbb{E}_\pi\big[\alpha \log \pi^i(a_t|\tau_t)\big] - \sum_{\mathbf{a}} \boldsymbol{\pi}(\mathbf{a}|\boldsymbol{\tau})\, Q_{tot}(\boldsymbol{\tau}, \mathbf{a}, \mathbf{m}; \theta) \\
&= \sum_i k^i(s)\, \mathbb{E}_\pi\big[\alpha \log \pi^i(a_t|\tau_t)\big] - \sum_{\mathbf{a}} \boldsymbol{\pi}(\mathbf{a}|\boldsymbol{\tau}) \Big[\sum_i k^i(s)\, q^i(\tau^i_t, a^i_t, m^i_t) + b(s)\Big] \\
&= \sum_i k^i(s)\, \mathbb{E}_\pi\big[\alpha \log \pi^i(a_t|\tau_t)\big] - \Big[\sum_i k^i(s)\, \mathbb{E}_\pi\big[q^i(\tau^i_t, a^i_t, m^i_t)\big] + b(s)\Big] \\
&= q^{mix}\Big(s_t,\, \mathbb{E}_{\pi_i}\big[\alpha \log \pi^i(a^i_t|\tau^i_t) - q^i(\tau^i_t, a^i_t, m^i_t)\big]\Big)
\end{aligned}$$

Then we have the objective for the soft policy gradient update using latent-state-information value function decomposition with a monotonic mixing network:

$$L(\pi) = \mathbb{E}_D\big[\alpha \log \boldsymbol{\pi}(\mathbf{a}_t|\boldsymbol{\tau}_t) - Q^\pi_{tot}(s_t, \boldsymbol{\tau}_t, \mathbf{a}_t)\big] = q^{mix}\Big(s_t,\, \mathbb{E}_{\pi_i}\big[\alpha \log \pi^i(a^i_t|\tau^i_t) - q^i(\tau^i_t, a^i_t, m^i_t)\big]\Big)$$

B StarCraft Multi-Agent Challenge
For the experiments on StarCraft II micromanagement, we follow the setup of SMAC [33] with the open-source implementations of COMA [4], MAVEN [22], QMIX [32], and VDAC [36]. We consider combat scenarios where the enemy units are controlled by the StarCraft II built-in AI and the friendly units are controlled by the algorithm-trained agents. The possible options for the built-in AI difficulty are Very Easy, Easy, Medium, Hard, Very Hard, and Insane, ranging from 0 to 7. We carry out the experiments with ally units controlled by the learning agent while the built-in AI controls the enemy units with difficulty = 7 (Insane). Depending on the specific scenario (map), the enemy and friendly units can be symmetric or asymmetric. At each time step, each agent chooses one action from a discrete action space consisting of noop, move[direction], attack[enemy id], and stop. Dead units can only choose the noop action. Killing an enemy unit results in a reward of 10, while winning by eliminating all enemy units results in a reward of 200. The global state information is only available to the centralized critic.
For easier maps, we train each baseline algorithm for 1.5 million time steps, while we train for 2 million steps on the other maps.
For the maps used in the experiments:
 • 1c3s5z is a symmetric battle that consists of 1 Colossus, 3 Stalkers, and 5 Zealots on each side.
 • 2s vs 1sc is an asymmetric battle in which the friendly side controls two Stalker units while the opposing side controls one Spine Crawler.
 • 3s5z is a symmetric battle that consists of 3 Stalkers and 5 Zealots on each side.
 • 5m vs 6m is an asymmetric battle where the friendly side controls 5 Marines competing against 6 Marines on the opposing side.
 • 8m is a symmetric battle that consists of 8 Marines on both sides.
 • MMM2 is an asymmetric battle where 1 Medivac, 2 Marauders, and 7 Marines battle against 1 Medivac, 3 Marauders, and 8 Marines. Medivacs are healing units that can heal friendly units with limited healing energy; they cannot attack enemies, and their healing energy regenerates over time.
Readers are encouraged to watch the game replays for a better understanding. More details about the environment can be found in [33].

C Implementation Details
We use PyTorch for all implementations. The pseudocode for optimizing LSF-SAC is summarized in Algorithm 1.
Experiments are run on an Nvidia RTX 2080 Ti GPU. Training uses episode runners, i.e., non-parallel runners, to discourage a very large training batch size. Each independent run takes around 12 hours depending on the scenario, while we carry out 4 training sessions at the same time, bringing the amortized training time to around 3 hours. Each training session runs with a random seed generated at the beginning of the session.
The agent networks of all algorithms resemble a DRQN with a recurrent layer implemented as a GRU with 64-dimensional hidden states. The latent state information encoding network is a feed-forward network that outputs 3 latent vectors per agent. The latent state information encoder is a fully connected network with two 16-dimensional hidden layers, and the posterior estimator is a fully connected network with one 16-dimensional hidden layer. Unless an update policy is mentioned, all parameters introduced in LSF-SAC remain the same throughout the training session. In Eq. (6), λ1 = λ2 = 1.0; in Eq. (7) and Eq. (10), β = 0.001. All algorithms are trained with the same default hyper-parameter settings. RMSprop is used to optimize all algorithms with a learning rate of 5 × 10^-4. The replay buffer stores the latest 5000 episodes, and the batch size is 32. The reward discount factor is γ = 0.99. The target network is updated every 200 training steps.
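For convenience, the settings listed above can be collected into a single configuration, as in the sketch below (our own summary of the stated values; the dictionary keys are assumptions, not the authors' configuration format).

# Hyper-parameters reported in Appendix C, gathered into one place.
LSF_SAC_CONFIG = {
    "rnn_hidden_dim": 64,          # GRU hidden state size of the DRQN-style agent network
    "latent_vectors_per_agent": 3, # outputs of the latent state information encoding network
    "encoder_hidden_dims": [16, 16],
    "posterior_hidden_dims": [16],
    "lambda_1": 1.0,               # weight of the message-encoding loss in Eq. (6)
    "lambda_2": 1.0,               # weight of the policy loss in Eq. (6)
    "beta": 1e-3,                  # information-bottleneck trade-off in Eq. (7) / Eq. (10)
    "optimizer": "RMSprop",
    "lr": 5e-4,
    "buffer_size_episodes": 5000,
    "batch_size": 32,
    "gamma": 0.99,
    "target_update_interval": 200, # training steps between target network updates
}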
