Improving Multi-agent Coordination by Learning to Estimate Contention


Panayiotis Danassis, Florian Wiedemair and Boi Faltings
Artificial Intelligence Laboratory, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
{firstname.lastname}@epfl.ch
arXiv:2105.04027v2 [cs.MA] 20 Jun 2021

Abstract

We present a multi-agent learning algorithm, ALMA-Learning, for efficient and fair allocations in large-scale systems. We circumvent the traditional pitfalls of multi-agent learning (e.g., the moving target problem, the curse of dimensionality, or the need for mutually consistent actions) by relying on the ALMA heuristic as a coordination mechanism for each stage game. ALMA-Learning is decentralized, observes only its own action/reward pairs, requires no inter-agent communication, and achieves near-optimal (< 5% loss) and fair coordination in a variety of synthetic scenarios and a real-world meeting scheduling problem. Its lightweight nature and fast learning make ALMA-Learning ideal for on-device deployment.

1   Introduction

One of the most relevant problems in multi-agent systems is finding an optimal allocation between agents, i.e., computing a maximum-weight matching, where edge weights correspond to the utility of each alternative. Many multi-agent coordination problems can be formulated as such. Example applications include role allocation (e.g., team formation [Gunn and Anderson, 2013]), task assignment (e.g., smart factories, or taxi-passenger matching [Danassis et al., 2019b; Varakantham et al., 2012]), resource allocation (e.g., parking/charging spaces for autonomous vehicles [Geng and Cassandras, 2013]), etc. What follows is applicable to any such scenario, but for concreteness we focus on the assignment problem (bipartite matching), one of the most fundamental combinatorial optimization problems [Munkres, 1957].

A significant challenge for any algorithm for the assignment problem emerges from the nature of real-world applications, which are often distributed and information-restrictive. Sharing plans, utilities, or preferences creates high overhead, and there is often a lack of responsiveness and/or communication between the participants [Stone et al., 2010]. Achieving fast convergence and high efficiency in such information-restrictive settings is extremely challenging.

A recently proposed heuristic (ALMA [Danassis et al., 2019a]) was specifically designed to address the aforementioned challenges. ALMA is decentralized, completely uncoupled (agents are only aware of their own history), and requires no communication between the agents. Instead, agents make decisions locally, based on the contest for the resources they are interested in and on the agents that are interested in the same resources. As a result, in the realistic case where each agent is interested in a subset (of fixed size) of the total resources, ALMA's convergence time is constant in the total problem size. This condition holds by default in many real-world applications (e.g., resource allocation in urban environments), since agents only have local (partial) knowledge of the world, and there is typically a cost associated with acquiring a resource. The lightweight nature of ALMA, coupled with the lack of inter-agent communication and the highly efficient allocations it produces [Danassis et al., 2019b; Danassis et al., 2019a; Danassis et al., 2020], makes it ideal as an on-device solution for large-scale intelligent systems (e.g., IoT devices, smart cities and intelligent infrastructure, industry 4.0, autonomous vehicles, etc.).

Despite ALMA's high performance in a variety of domains, it remains a heuristic, i.e., sub-optimal by nature. In this work, we introduce a learning element (ALMA-Learning) that allows agents to quickly close the gap in social welfare compared to the optimal solution, while simultaneously increasing the fairness of the allocation. Specifically, in ALMA, while contesting for a resource, each agent backs off with a probability that depends on its own utility loss of switching to some alternative. ALMA-Learning improves upon ALMA by allowing agents to learn the chances that they will actually obtain the alternative option they consider when backing off, which helps guide their search.
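As a minimal sketch of the back-off step just described: when an agent collides on a resource, it yields with a probability driven by the utility it would lose by settling for its next-best alternative. The paper specifies only that this probability depends on the loss; the linear form 1 − loss below, and the function name `should_back_off`, are illustrative assumptions rather than ALMA's exact rule.

```python
import random


def should_back_off(loss, rng=random.random):
    """Decide whether to yield in a contest, given the utility loss (in [0, 1])
    of switching to the next-best alternative.

    Assumed monotone form: P(back-off) = 1 - loss.
    """
    return rng() < 1.0 - loss


# An agent whose next-best option is almost as good (loss = 0.1) yields often;
# an agent with no acceptable alternative (loss = 1.0) never yields.
print(should_back_off(0.1), should_back_off(1.0))
```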
ALMA-Learning is applicable in repeated allocation games (e.g., self-organization of intelligent infrastructure, autonomous mobility systems, etc.), but it can also be applied as a negotiation protocol in one-shot interactions, where agents can simulate the learning process offline before making their final decision. A motivating real-world application is presented in Section 3.2, where ALMA-Learning is applied to solve a large-scale meeting scheduling problem.

1.1   Our Contributions

(1) We introduce ALMA-Learning, a distributed algorithm for large-scale multi-agent coordination, focusing on scalability and on-device deployment in real-world applications.
(2) We prove that ALMA-Learning converges.

(3) We provide a thorough evaluation in a variety of synthetic benchmarks and a real-world meeting scheduling problem. In all of them ALMA-Learning is able to quickly (in as little as 64 training steps) reach allocations of high social welfare (less than 5% loss) and fairness (up to almost 10% lower inequality compared to the best performing baseline).

1.2   Discussion and Related Work

Multi-agent coordination can usually be formulated as a matching problem. Finding a maximum-weight matching is one of the best-studied combinatorial optimization problems (see [Su, 2015; Lovász and Plummer, 2009]). There is a plethora of polynomial-time algorithms, with the Hungarian algorithm [Kuhn, 1955] being the most prominent centralized one for the bipartite variant (i.e., the assignment problem). In real-world problems, a centralized coordinator is not always available, and even when it is, it has to know the utilities of all the participants, which is often not feasible. Decentralized algorithms (e.g., [Giordani et al., 2010]) solve this problem, yet they require polynomial computational time and a polynomial number of messages – such as cost matrices [Ismail and Sun, 2017], pricing information [Zavlanos et al., 2008], or a basis of the LP [Bürger et al., 2012], etc. (see also [Kuhn et al., 2016; Elkin, 2004] for general results on distributed approximability under only local information/computation).

While the problem has been 'solved' from an algorithmic perspective – having both centralized and decentralized polynomial algorithms – it is not so from the perspective of multi-agent systems, for two key reasons: (1) complexity, and (2) communication. The proliferation of intelligent systems will give rise to large-scale, multi-agent based technologies. Algorithms for maximum-weight matching, whether centralized or distributed, have runtime that increases with the total problem size, even in the realistic case where agents are interested in only a small number of resources. Thus, they can only handle problems of some bounded size. Moreover, they require a significant amount of inter-agent communication. As the number and diversity of autonomous agents continue to rise, differences in origin, communication protocols, or the existence of legacy agents will bring forth the need to collaborate without any form of explicit communication [Stone et al., 2010]. Most importantly though, communication between participants (sharing utility tables, plans, and preferences) creates high overhead. On the other hand, under reasonable assumptions about the preferences of the agents, ALMA's runtime is constant in the total problem size, while requiring no message exchange (i.e., no communication network) between the participating agents. The proposed approach, ALMA-Learning, preserves both of these properties of ALMA.

From the perspective of Multi-Agent Learning (MAL), the problem at hand falls under the paradigm of multi-agent reinforcement learning, where, for example, it can be modeled as a Multi-Armed Bandit (MAB) problem [Auer et al., 2002], or as a Markov Decision Process (MDP) and solved using a variant of Q-Learning [Busoniu et al., 2008]. In MAB problems an agent is given a number of arms (resources) and at each time-step has to decide which arm to pull to obtain the maximum expected reward. In Q-learning, agents solve Bellman's optimality equation [Bellman, 2013] using an iterative approximation procedure so as to maximize some notion of expected cumulative reward. Both approaches have arguably been designed to operate in a more challenging setting, which makes them susceptible to many pitfalls inherent in MAL. For example, there is no stationary distribution; in fact, rewards depend on the joint action of the agents, and since all agents learn simultaneously, this results in a moving-target problem. Thus, there is an inherent need for coordination in MAL algorithms, stemming from the fact that the effect of an agent's action depends on the actions of the other agents, i.e., actions must be mutually consistent to achieve the desired result. Moreover, the curse of dimensionality makes it difficult to apply such algorithms to large-scale problems. ALMA-Learning solves both of the above challenges by relying on ALMA as a coordination mechanism for each stage of the repeated game. Another fundamental difference is that the aforementioned algorithms are designed to tackle the exploration/exploitation dilemma. A bandit algorithm, for example, will constantly explore, even if an agent has acquired its most preferred alternative. In matching problems, though, agents know (or have an estimate of) their own utilities. ALMA-Learning in particular requires only knowledge of the personal preference ordering and of pairwise differences of utility, which are far easier to estimate than the exact utility table. The latter gives a great advantage to ALMA-Learning, since agents do not need to continue exploring after successfully claiming a resource, which stabilizes the learning process.
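The last point can be made concrete with a short sketch. Assuming only a (possibly noisy) utility estimate per resource of interest, an agent can derive the two quantities ALMA-Learning relies on, namely its preference ordering and the pairwise utility difference to the next-best option; the helper names below are our own and purely illustrative.

```python
def preference_ordering(utilities):
    """Resources of interest, sorted by decreasing (estimated) utility."""
    return sorted(utilities, key=utilities.get, reverse=True)


def pairwise_losses(utilities):
    """loss(i) = u(r_i) - u(r_{i+1}): the utility drop from each resource to
    the next-best one in the agent's own ordering. By convention (an
    assumption of this sketch), the least preferred resource has no
    alternative, so its loss equals its full utility."""
    ranked = preference_ordering(utilities)
    losses = {}
    for i, resource in enumerate(ranked):
        next_best = utilities[ranked[i + 1]] if i + 1 < len(ranked) else 0.0
        losses[resource] = utilities[resource] - next_best
    return losses


# Noisy utility estimates for three resources of interest.
print(preference_ordering({"r1": 0.98, "r2": 0.71, "r3": 0.40}))
print(pairwise_losses({"r1": 0.98, "r2": 0.71, "r3": 0.40}))
```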
2     Proposed Approach: ALMA-Learning

2.1   The Assignment Problem

The assignment problem refers to finding a maximum-weight matching in a weighted bipartite graph, G = {N ∪ R, V}. In the studied scenario, N = {1, . . . , N} agents compete to acquire R = {1, . . . , R} resources. The weight of an edge (n, r) ∈ V represents the utility (un(r) ∈ [0, 1]) agent n receives by acquiring resource r. Each agent can acquire at most one resource, and each resource can be assigned to at most one agent. The goal is to maximize the social welfare (sum of utilities), i.e., $\max_{x \geq 0} \sum_{(n,r) \in V} u_n(r)\, x_{n,r}$, where $x = (x_{1,1}, \ldots, x_{N,R})$, subject to $\sum_{r \mid (n,r) \in V} x_{n,r} = 1,\ \forall n \in N$, and $\sum_{n \mid (n,r) \in V} x_{n,r} = 1,\ \forall r \in R$.
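For reference, the objective above is exactly what a centralized solver (e.g., the Hungarian algorithm mentioned in Section 1.2) computes when given the full utility table. The sketch below uses SciPy's `linear_sum_assignment` on a small utility matrix; it is only a centralized baseline for comparison, not the decentralized approach proposed in this paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Utility matrix u[n][r]: utility agent n receives from resource r.
# These happen to be the utilities of Table 1a in Section 2.2, used here
# purely as an illustration.
u = np.array([
    [1.0, 0.0, 0.5],   # agent n1
    [0.0, 1.0, 0.0],   # agent n2
    [1.0, 0.9, 0.0],   # agent n3
])

# Maximum-weight bipartite matching (the assignment problem).
agents, resources = linear_sum_assignment(u, maximize=True)
for n, r in zip(agents, resources):
    print(f"agent n{n + 1} -> resource r{r + 1} (utility {u[n, r]})")
print("social welfare:", u[agents, resources].sum())
```

With these utilities the optimal matching assigns agents n1, n2, n3 to resources r3, r2, r1, for a social welfare of 2.5, which is the optimum quoted in the adversarial-example discussion of Section 2.2.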
2.2   Learning Rule

We begin by describing (a slightly modified version of) the ALMA heuristic of [Danassis et al., 2019a], which is used as a subroutine by ALMA-Learning. The pseudo-codes for ALMA and ALMA-Learning are presented in Algorithms 1 and 2, respectively. Both ALMA and ALMA-Learning are run independently and in parallel by all the agents (to improve readability, we have omitted the subscript n).

We make the following two assumptions. First, we assume (possibly noisy) knowledge of personal utilities by each agent. Second, we assume that agents can observe feedback from their environment to infer collisions and detect free resources. This could be achieved by the use of sensors, or by a single-bit (0/1) feedback from the resource (note that these messages would be exchanged between the requesting agent and the resource, not between the participating agents themselves).
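The second assumption requires only a minimal interface between an agent and a resource. The toy class below (our own construction, not taken from the paper) models a resource that answers each access attempt with a single bit: 1 for the agent currently holding it, 0 for everyone else, which is enough for an agent to infer a collision or detect a free resource.

```python
class Resource:
    """A resource held by at most one agent at a time; every access attempt
    is answered with a single bit of feedback."""

    def __init__(self):
        self.holder = None

    def request(self, agent_id):
        """Return 1 if agent_id holds the resource after this attempt, else 0."""
        if self.holder is None or self.holder == agent_id:
            self.holder = agent_id
            return 1
        return 0

    def release(self, agent_id):
        if self.holder == agent_id:
            self.holder = None


r = Resource()
print(r.request("n1"))  # 1: n1 acquires the free resource
print(r.request("n2"))  # 0: n2 observes that the resource is taken
```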

Table 1: Adversarial examples. (1a) Inaccurate loss estimate: agent n3 backs-off with high probability when contesting for resource r1, assuming a good alternative, only to find resource r2 occupied. (1b) Inaccurate reward expectation: agents n1 and n3 always start by attempting to acquire resource r1, reasoning that it is the most preferred one, yet each of them only wins r1 half of the time.

(1a)      Resources              (1b)      Resources
          r1    r2    r3                   r1    r2    r3
    n1    1     0     0.5            n1    1     0.9   0
    n2    0     1     0              n2    0     1     0.9
    n3    1     0.9   0              n3    1     0.9   0

Algorithm 2 ALMA-Learning
Require: Sort resources (Rn ⊆ R) in decreasing order of utility r0, . . . , rRn−1 under ≺n
Require: rewardHistory[R][L], reward[R], loss[R]
 1: procedure ALMA-LEARNING
 2:   for all r ∈ R do                                   ▷ Initialization
 3:     rewardHistory[r].add(u(r))
 4:     reward[r] ← rewardHistory[r].getMean()
 5:     loss[r] ← u(r) − u(rnext)
 6:   rstart ← arg maxr reward[r]
 7:   for t ∈ [1, . . . , T] do                          ▷ T: Time horizon
 8:     rwon ← ALMA(rstart, loss[])                      ▷ Run ALMA
 9:     rewardHistory[rstart].add(u(rwon))
10:     reward[rstart] ← rewardHistory[rstart].getMean()
11:     if u(rstart) − u(rwon) > 0 then
12:       loss[rstart] ← (1 − α)loss[rstart] + α (u(rstart) − u(rwon))
13:     if rstart ≠ rwon then
14:       rstart ← arg maxr reward[r]
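For readers who prefer running code, here is a direct Python transcription of Algorithm 2 from the point of view of a single agent. The ALMA subroutine (Algorithm 1) is not reproduced in this excerpt, so it is passed in as a callable, and the bounded reward-history length L of the pseudo-code is left out for brevity; apart from that, the sketch follows the listing above, with `alpha` as the learning rate α.

```python
from statistics import mean


def alma_learning(utilities, alma, T, alpha=0.1):
    """One agent's ALMA-Learning loop (Algorithm 2).

    utilities: dict resource -> u(r) in [0, 1] for the agent's resources.
    alma:      callable (r_start, loss) -> r_won, the ALMA stage game
               (Algorithm 1, not shown in this excerpt).
    T:         time horizon; alpha: learning rate.
    """
    ranked = sorted(utilities, key=utilities.get, reverse=True)

    # Initialization (lines 2-6); the history length is unbounded here.
    reward_history = {r: [utilities[r]] for r in ranked}
    reward = {r: mean(reward_history[r]) for r in ranked}
    loss = {}
    for i, r in enumerate(ranked):
        u_next = utilities[ranked[i + 1]] if i + 1 < len(ranked) else 0.0
        loss[r] = utilities[r] - u_next
    r_start = max(reward, key=reward.get)

    # Repeated stage games (lines 7-14).
    for _ in range(T):
        r_won = alma(r_start, loss)               # run the ALMA subroutine
        u_won = utilities.get(r_won, 0.0)
        reward_history[r_start].append(u_won)
        reward[r_start] = mean(reward_history[r_start])
        if utilities[r_start] - u_won > 0:
            loss[r_start] = ((1 - alpha) * loss[r_start]
                             + alpha * (utilities[r_start] - u_won))
        if r_start != r_won:
            r_start = max(reward, key=reward.get)
    return r_start
```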
For both ALMA and ALMA-Learning, each agent sorts its available resources (possibly Rn ⊆ R) in decreasing utility (r0, . . . , ri, . . . , rRn−1) under its preference ordering ≺n.

ALMA (ALtruistic MAtching heuristic): ALMA converges to a resource through repeated trials. Let A = {Y, Ar1, . . . , ArRn} denote the set of actions, where Y refers to yielding and Ar refers to accessing resource r, and let g denote the agent's strategy. As long as an agent has not acquired a resource yet, at every time-step there are two possible scenarios. If g = Ar (the strategy points to resource r), then agent n attempts to acquire that resource; if there is a collision, the colliding parties back-off (set g ← Y) with a probability that depends on their potential loss, loss(i) = un(ri) − un(ri+1), where ri+1 is the next best resource according to agent n's preferences ≺n.

Table 1 presents two motivating adversarial examples for ALMA. The first example is given in Table 1a. Agent n3 backs-off with high probability (higher than agent n1) when contesting for resource r1, assuming a good alternative, only to find resource r2 occupied. Thus, n3 ends up matched with resource r3. The social welfare of the final allocation is 2, which is 20% worse than the optimal (where agents n1, n2, n3 are matched with resources r3, r2, r1, respectively, achieving a social welfare of 2.5). ALMA-Learning solves this problem by learning an empirical estimate of the loss an agent will incur if it backs-off from a resource. In this case, agent n3 will learn that its loss is not 1 − 0.9 = 0.1, but actually 1 − 0 = 1, and thus will not back-off in subsequent stage games, resulting in an optimal allocation.

In another example (Table 1b), agents n1 and n3 always start by attempting to acquire resource r1, reasoning that it is the most preferred one. Yet, in a repeated game, each of them only wins r1 half of the time (for a social welfare of 2, which is 28.5% worse than the optimal 2.8); thus, in expectation, resource r1 has utility 0.5. ALMA-Learning solves this by learning an empirical estimate of the reward of each resource. In this case, after learning, either agent n1 or n3 (or both) will start from resource r2. Agent n2 will back-off since it has a good alternative, and the result will be the optimal allocation, where agents n1, n2, n3 are matched with resources r2, r3, r1, respectively.

ALMA-Learning: Each agent maintains an empirical estimate of the expected reward of each resource (rewardHistory[] and reward[] in Alg. 2) and of the loss it will incur when backing-off (loss[]). The loss is initialized to loss[r] ← u(r) − u(rnext), where rnext is the next most preferred resource to r according to agent n's preferences ≺n (see line 5 of Alg. 2). Subsequently, for every stage game, agent n starts by selecting resource rstart and ends up winning resource rwon. The loss is then updated according to the following averaging process, where α is the learning rate: loss[rstart] ← (1 − α)loss[rstart] + α (u(rstart) − u(rwon)). If an agent repeatedly fails to secure the starting resource it selected, that resource's expected reward estimate will decrease, thus in the future the agent will switch to an alternative starting resource. Finally, the last condition of Alg. 2 (lines 13-14) ensures that agents who have acquired resources of high preference stop exploring, thus stabilizing the learning process.

2.3   Convergence
                                                                                                                    gr2most       (2)
                                                                                                                                  ) with ifrstart
                                                                                                                                              an agent
                                                                                                                                                     one,andyet  backs-
                                                                                                                                                                 ends   eachup wining
                                                                                                                                                                                  of                             rwon . The
                                                                                                                                                       (or     r   , r    , r   ),  respectively.
mgureonly1:wins      r1 some
              Adversarial        ofprobability.
                         halfexamples:the times.  4:off
                                                  1a        from reward[r]
                                                            Otherwise,
                                                       Inaccurate       contesting   if g ←
                                                                            loss estimate.      = resource, theragent
                                                                                                      YAgent          expecting
                                                                                                                         of rchooses
                                                                                                                               start is
                                                                                                      rewardHistory[r].getMean()          low then loss,updatedonly according
                                                                                                                                                                 1      3to   2      Convergence
                                                                                                                                                                                            to the following   of ALMA-Learning
                                                                                                                                                                                                                            averaging                 does not translate to a fixe
 3 backs-off withanother  high probability
                                       resource        when
                                                  5:findr forthatcontesting
                                                                       all   his high
                                                                   monitoring.
                                                                       loss[r]        for
                                                                                       ← resource
                                                                                               Ifutility
                                                                                              u(r)           alternatives
                                                                                                    the−resource
                                                                                                              r1
                                                                                                             u(r         process,
                                                                                                                          )is     are
                                                                                                                               free,     where
                                                                                                                                          already
                                                                                                                                         he           α     is  the
                                                                                                                                                          occupied,    learning         rate:
                                                                                                                                                                                     allocation           at   each        stage      game.          The     system has converge
                                                                                                                           r (loss[r]) willALMA-Learning
                                                                                                                    next
   suming a good Resources
                         alternative,
                        sets    g ← Aonly            to find resource r2 occupied.
                                             r . then his expected                          loss of resource 1b                                             increase,                when agents no longer switch their starting resource, rstar
                                                  6:           r             ←      arg      max          reward[r]        loss[rstart ] ← ALMA-Learning
                                                                                                                                                       (1 − α)loss[r                     ] + α (u(r        start )as  −au(r             ))
   accurate2 reward
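As an illustration of this back-off rule, the sketch below implements one plausible choice of f^beta: the baseline 1 - loss decay raised to an aggressiveness exponent beta and clipped away from 0 and 1. The clipping constant eps, the exponent form, and the function names are assumptions made for this sketch, not the paper's exact definition.

    import random

    def backoff_probability(loss: float, beta: float = 1.0, eps: float = 0.01) -> float:
        """Illustrative back-off probability P_r(loss) (assumed form).

        Monotonically decreasing in the expected loss: an agent whose best
        alternative is almost as good (small loss) backs off almost surely,
        while an agent with no good fallback (large loss) rarely yields.
        """
        base = min(1.0, max(0.0, 1.0 - loss))   # the simple 1 - loss decay
        p = base ** beta                        # beta > 1: less willing to yield (more aggressive)
        return min(1.0 - eps, max(eps, p))      # keep the probability strictly inside (0, 1)

    def backs_off(loss: float, beta: float = 1.0) -> bool:
        """Sample one colliding agent's back-off decision."""
        return random.random() < backoff_probability(loss, beta)

Any function with this monotone shape preserves the intended behaviour; the exponent merely tunes how quickly the willingness to yield drops as the stakes grow.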
Table 1: Motivating adversarial example: Inaccurate loss estimate. Agent n3 backs-off with high probability when contesting for resource r1, assuming a good alternative, only to find resource r2 occupied.

            r1     r2     r3
     n1     1      0.9    0
     n2     1      0.9    0
     n3     1      0.9    0

Table 2: Motivating adversarial example: Inaccurate reward expectation. Agents n1 and n3 always start by attempting to acquire resource r1, reasoning that it is the most preferred one, yet each of them only wins it half of the times.
Sources of Inefficiency by ALMA: ALMA is a heuristic, i.e., sub-optimal by nature. It is worth understanding the sources of this inefficiency, which in turn motivated ALMA-Learning. To do so, we provide a couple of adversarial examples.

Table 1 presents the first example. All three agents most prefer resource r1, and each of them estimates only a small loss (u(r1) - u(r2) = 0.1) from backing off. Agent n3 therefore backs-off with high probability when contesting for resource r1, assuming a good alternative, only to find resource r2 already occupied and to end up with the worthless r3: his loss estimate was inaccurate.

Table 2 presents the second example. Agents n1 and n3 always start by attempting to acquire resource r1, reasoning that it is the most preferred one, yet each of them only wins it half of the times (achieving social welfare 2, 28.5% worse than the optimal 2.8); thus, in expectation, resource r1 has utility 0.5. ALMA-Learning solves this problem by learning an empirical estimate of the reward of each resource. In this case, after learning, either agent n1 or n3 (or both) will start from resource r2. Agent n2 will back-off, since he has a good alternative, and the result will be the optimal allocation, where agents n1, n2, n3 are matched with resources r2, r3, r1 (or r1, r3, r2), respectively.
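To see the "inaccurate reward expectation" numerically, the toy computation below tracks the reward history that agent n1 would accumulate for r1 in the second example, assuming (purely for illustration) that the losing agent ends up with a worthless resource and that an uncontested alternative is worth 0.9 as in Table 1; the window length L and all helper names are likewise assumptions.

    from collections import deque
    from statistics import mean

    L = 20                                # assumed history length
    history_r1 = deque([1.0], maxlen=L)   # initialized with u(r1), cf. line 3 of Alg. 2

    # Agent n1 wins the contested r1 only every other stage game; on a loss it
    # is assumed to end up with a worthless resource (utility 0).
    for t in range(40):
        history_r1.append(1.0 if t % 2 == 0 else 0.0)

    reward_r1 = mean(history_r1)          # drifts towards 0.5
    reward_r2 = 0.9                       # an uncontested alternative (Table 1 value)
    print(reward_r1 < reward_r2)          # True: lines 17-18 of Alg. 2 switch r_start to r2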
2.1.2 ALMA-Learning

ALMA-Learning (Alg. 2) uses ALMA as a sub-routine, specifically as a coordination mechanism for each stage of the repeated game. Over time, ALMA-Learning learns which resource to select first (r_start) when running ALMA, and an accurate empirical estimate of the loss the agent will incur by backing-off from the contest of a resource (loss[]). By learning these two values, agents can take more informed decisions, specifically: (1) if an agent often loses the contest of his starting resource, the expected reward of that resource will decrease, thus in the future the agent will switch to an alternative starting resource; (2) if an agent backs-off from contesting resource r expecting low loss, only to find that all his high-utility alternatives are already occupied, then his expected loss of resource r (loss[r]) will increase, making him more reluctant to back-off in some future stage game.

Algorithm 2 ALMA-Learning
Require: Sort resources (R_n of R) in decreasing order of utility r_0, . . . , r_{R_n - 1} under the preferences of agent n. T: Time horizon
Require: rewardHistory[R][L], reward[R], loss[R]
 1: procedure ALMA-LEARNING
 2:   for all r in R do                                  # Initialization
 3:     rewardHistory[r].add(u(r))
 4:     reward[r] <- rewardHistory[r].getMean()
 5:     loss[r] <- u(r) - u(r_next)
 6:   r_start <- arg max_r reward[r]
 7:
 8:   for t in [1, . . . , T] do
 9:     r_won <- ALMA(r_start, loss[])                   # Run ALMA
10:
11:    rewardHistory[r_start].add(u(r_won))
12:    reward[r_start] <- rewardHistory[r_start].getMean()
13:    if u(r_start) - u(r_won) > 0 then
14:      loss[r_start] <- (1 - alpha) loss[r_start]
15:                        + alpha (u(r_start) - u(r_won))
16:
17:    if r_won != r_start then
18:      r_start <- arg max_r reward[r]
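For concreteness, here is a minimal, runnable sketch of the per-agent bookkeeping in Alg. 2, assuming Python, a bounded reward history of length L, and an externally supplied alma(r_start, loss) routine that plays one stage game and returns the resource actually acquired. The class name, the default alpha, and the loss fallback for the least-preferred resource are illustrative assumptions.

    from collections import deque
    from statistics import mean

    class ALMALearningAgent:
        """Sketch of one agent's state and updates in Alg. 2 (names are illustrative)."""

        def __init__(self, utility, L=20, alpha=0.1):
            self.u = dict(utility)                                   # r -> u_n(r)
            self.alpha = alpha
            ranked = sorted(self.u, key=self.u.get, reverse=True)    # preference order
            # Lines 2-5: initialize history, expected reward and expected loss.
            self.reward_history = {r: deque([self.u[r]], maxlen=L) for r in ranked}
            self.reward = {r: self.u[r] for r in ranked}
            self.loss = {r: self.u[r] - self.u[nxt] for r, nxt in zip(ranked, ranked[1:])}
            self.loss[ranked[-1]] = self.u[ranked[-1]]               # assumed fallback: no next resource
            # Line 6: start from the resource with the highest expected reward.
            self.r_start = max(self.reward, key=self.reward.get)

        def play_stage_game(self, alma):
            r_won = alma(self.r_start, self.loss)                                 # line 9
            self.reward_history[self.r_start].append(self.u[r_won])               # line 11
            self.reward[self.r_start] = mean(self.reward_history[self.r_start])   # line 12
            realised_loss = self.u[self.r_start] - self.u[r_won]
            if realised_loss > 0:                                                 # lines 13-15
                self.loss[self.r_start] = ((1 - self.alpha) * self.loss[self.r_start]
                                           + self.alpha * realised_loss)
            if r_won != self.r_start:                                             # lines 17-18
                self.r_start = max(self.reward, key=self.reward.get)
            return r_won

All quantities are local to the agent and only u(r_won) is observed after each stage game, which is exactly what keeps the scheme communication-free.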
In more detail, ALMA-Learning learns and maintains the following information[1]:

(i) rewardHistory[R][L]: A 2D array. For each r in R it maintains the L most recent reward values received by agent n, i.e., the L most recent u_n(r_won), where r_won <- ALMA(r, loss[]). See line 11 of Alg. 2. The array is initialized to the utility of each resource (line 3 of Alg. 2).

(ii) reward[R]: A 1D array. For each r in R it maintains an empirical estimate of the expected reward received by starting at resource r and continuing to play according to Alg. 1. It is computed by averaging the reward history of the resource, i.e., for all r in R: reward[r] <- rewardHistory[r].getMean(). See line 12 of Alg. 2.

(iii) loss[R]: A 1D array. For each r in R it maintains an empirical estimate of the loss agent n incurs if he backs-off from the contest of resource r. The loss of each resource r is initialized to loss[r] <- u_n(r) - u_n(r_next), where r_next is the next most preferred resource after r according to agent n's preferences (see line 5 of Alg. 2).

[1] We have omitted the subscript n from all the variables and arrays, but every agent maintains their own estimates.

For every stage game, agent n starts by selecting the resource with the highest expected reward, r_start <- arg max_r reward[r], and runs ALMA (line 9); since the rewards are initialized to the utilities, this is initially just the most preferred resource. At the end of the stage game, the reward history of r_start is updated with the utility of the resource the agent ended up winning, u(r_won) (line 11), and the expected reward of r_start is updated accordingly (line 12). If the agent ends up winning a resource of lower utility than r_start, the expected loss of r_start is updated according to the following averaging process, where alpha is the learning rate:

    loss[r_start] <- (1 - alpha) loss[r_start] + alpha (u(r_start) - u(r_won))

Finally, an agent re-selects his starting resource only after losing it; this last condition (lines 17-18 of Alg. 2) ensures that agents who have acquired resources of high preference stop exploring, thus stabilizing the learning process.
    (iii) loss[R]:       A 1D so, we  175provide
                                    array.      2.1.2
                                                 For each a couple r ∈ of
                                                              ALMA-Learning: R adversarial
                                                                                   it maintains         Aexamples.
                                                                                                             Multi-Agent
                                                                                                             an                      (Meta-)Learning   ceived         by      starting
                                                                                                                                                                           Algorithm
                                                                                                                         proves that ALMA (called at line 9 of Alg. 2) converges              at    resource         r     and     continue          playing      ac-
mpirical
 ardHistory[r
 e 2 presents   estimate
                     thestartInon
                           second the    original
                                 ].getMean()loss in ALMA
                                     theexample.          utility
                                                          Agentsagent     algorithm,
                                                                          n1 and     n nincurs     all agents
                                                                                                  always if  hestart start
                                                                                                                      by in   attempt- totime
                                                                                                                            attempting
                                                                                                                             polynomial                cording
                                                                                                                                                      acquire   (in      to
                                                                                                                                                                      resource
                                                                                                                                                                       fact,   Alg.
                                                                                                                                                                                 under      1.
                                                                                                                                                                                            some    It  is    computed
                                                                                                                                                                                                        assumptions,               by
                                                                                                                                                                                                                                  it  con-averaging          the   re-
                                                                                             3
 acks-off
 >     0  then  from    ing
                         the   to   claim
                                contest
                                      176 of   ALMA-Learning
                                               their     most
                                                     resource        preferred
   easoning that it is the most preferred one. Yet, in a repeated game,r.     The   uses loss   ALMA
                                                                                          resource, of   each  as
                                                                                                              and   a  sub-routine,
                                                                                                                     back-off
                                                                                                                         verges  eachwith
                                                                                                                                       in of    specifically
                                                                                                                                                       ward
                                                                                                                                                  them only
                                                                                                                                            constant                    as
                                                                                                                                                                    history
                                                                                                                                                               time,wins      a  coordination
                                                                                                                                                                                  of
                                                                                                                                                                           i.e.,reach  the     resource,  mechanism
                                                                                                                                                                                                                  i.e.,
                                                                                                                                                                                              stage game converges in      ∀r    for
                                                                                                                                                                                                                                  ∈    Reach  :  reward[r]         ←
                                                                                                                                                                                    1
   source      r  is  initialized
                        probability   177to thatstage
                                               loss[r]    of
                                                        depends the
                                                                ←      repeated
                                                                        u
   of the times (achieving social welfare 2, 28.5% worse than the optimal  on (r) their
                                                                                      −    game.
                                                                                             u loss(r    Over
                                                                                                         of        time,
                                                                                                               switching
                                                                                                              ),            ALMA-Learning
                                                                                                                                  to    the
                                                                                                                                     2.8),                      learns
                                                                                                                                                       rewardHistory[r].getM
                                                                                                                                                thus,IninALMA-Learning
                                                                                                                                                                expectation,which        resource        to    select
                                                                                                                                                                                                          ean().        Seefirst   (r
                                                                                                                                                                                                                                line    12      )
                                                                                                                                                                                                                                               of   Alg.    2.
                                                                           n                    n      next              constant         time).                                               agents switch their start-              start
   t ]1 + rα1 (u(r
 urce                   immediate
               has omitted
                    utility  )−    u(r178won   when
                                             next)) best
                                        ALMA-Learning      running
                                                                 resource.    ALMA,
                                                                          solves        this and
                                                                                        Specifically,    an accurate
                                                                                                   problem        in the
                                                                                                                  by           empirical
                                                                                                                              simplest
                                                                                                                       learning        an empiricalestimate           on theof
                                                                                                                                                                 estimate          loss it will incur by backing-off
       We have
                    start      0.5.
                                the   subscript           from       all   the     variables          and    ar-         ing    resource           only       when        the    expected           reward for the current
                        case,      the         (loss[]).
                                      179probability              By      learning
                                                                  to back-off
                                                                            learning,when     these
                                                                                                 eithertwo        values,
                                                                                                          contesting            agents both),
                                                                                                                             resource          take more      1      informed
                                                                                                                                                                Westarthave     omitted decisions,
                                                                                                                                                                                              the subscript  specifically:
                                                                                                                                                                                                                     n from           (1)theIfvariables and ar-
                                                                                                                                                                                                                             one,alli.e.,
                                                       n
   eward
  ys,   but of   each
             every       resource.
                     agent    maintains   In this
                                               their   case,
                                                        own      after
                                                                 estimates.                                 agent        1 or n3 (or
                                                                                                                      nstarting          resource will      drops            from
                                                                                                                                                                         below        the best         alternative
                        r     would   180 bean        agent
                                                  given        by oftenP      loses
                                                                          (loss(i))        the     =contest
                                                                                                         1   −    of  his
                                                                                                                  loss(i),   starting
                                                                                                                                  where      resource, rays,      the
                                                                                                                                                                 but     expected
                                                                                                                                                                        every     agent    reward
                                                                                                                                                                                            maintains    of    that
                                                                                                                                                                                                              their     resource
                                                                                                                                                                                                                       own      estimates.will
 urce r2 . Agent n2 will back-off since he has a good alternative, and the result will be the optimal
                          i
cation where agents n1 , n2 , n3 are matched with resources r2 , r3 , r1 (or r1 , r3 , r2 ), respectively.
   reward[r]
                           Copyright International Joint Conferences on Artificial Intelligence (IJCAI), 2021. All rights reserved.
2 ALMA-Learning: A Multi-Agent (Meta-)Learning Algorithm                          5
. Agents n1 and n3 always start by attempting to acquire resource
MA-Learning     uses
 ferred one. Yet,  inALMA   as agame,
                     a repeated  sub-routine,
                                       each ofspecifically as a coordination
                                               them only wins    r1             mechanism for each
eelfare
  of the2,repeated game. Over time, ALMA-Learning      learns which
           28.5% worse than the optimal 2.8), thus, in expectation,  resource   to select first (rstart )
n  running   ALMA,   and an accurate empirical  estimate  on  the
Learning solves this problem by learning an empirical estimate of loss it will incur  by backing-off
s[]). By learning these two values, agents take more informed decisions, specifically: (1) If
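For concreteness, the gap quoted above between the expected social welfare of 2 and the optimal 2.8 works out to:

    \[
      \frac{2.8 - 2}{2.8} \;=\; \frac{0.8}{2.8} \;\approx\; 0.286
    \]

i.e., roughly the 28.5% loss in social welfare stated in the example.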
2  ALMA-Learning: A Multi-Agent (Meta-)Learning Algorithm

ALMA-Learning uses ALMA as a sub-routine, specifically as a coordination mechanism for each stage of the repeated game. Over time, ALMA-Learning learns which resource to select first (rstart) when running ALMA, and an accurate empirical estimate on the loss it will incur by backing-off (loss[]). By learning these two values, agents take more informed decisions; for example, if an agent often loses the contest for his starting resource, the expected reward of that resource will drop below the best alternative one, and the agent will switch its starting resource.

Algorithm 1 ALMA: Altruistic Matching Heuristic.
Require: Sort resources (Rn ⊆ R) in decreasing order of utility r0, . . . , r|Rn|−1 under ≺n
 1: procedure ALMA(rstart, loss[R])
 2:     Initialize g ← A_rstart
 3:     Initialize current ← −1
 4:     Initialize converged ← False
 5:     while !converged do
 6:         if g = A_r then
 7:             Agent n attempts to acquire r
 8:             if Collision(r) then
 9:                 Back-off (set g ← Y) with prob. P(loss[r])
10:             else
11:                 converged ← True
12:         else (g = Y)
13:             current ← (current + 1) mod R
14:             Agent n monitors r ← r_current
15:             if Free(r) then set g ← A_r
16:     return r, such that g = A_r

Algorithm 2 ALMA-Learning
Require: Sort resources (Rn ⊆ R) in decreasing order of utility r0, . . . , r|Rn|−1 under ≺n
Require: rewardHistory[R][L], reward[R], loss[R]
 1: procedure ALMA-LEARNING
 2:     for all r ∈ R do                              ▷ Initialization
 3:         rewardHistory[r].add(u(r))
 4:         reward[r] ← rewardHistory[r].getMean()
 5:         loss[r] ← u(r) − u(rnext)
 6:     rstart ← arg max_r reward[r]
 7:
 8:     for t ∈ [1, . . . , T] do                     ▷ T: Time horizon
 9:         rwon ← ALMA(rstart, loss[])               ▷ Run ALMA
10:
11:         rewardHistory[rstart].add(u(rwon))
12:         reward[rstart] ← rewardHistory[rstart].getMean()
13:         if u(rstart) − u(rwon) > 0 then
14:             loss[rstart] ←
15:                 (1 − α)loss[rstart] + α (u(rstart) − u(rwon))
16:
17:         if rstart ≠ rwon then
18:             rstart ← arg max_r reward[r]
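To make Alg. 1 concrete, the following is a minimal, synchronous Python simulation of a single stage game for all agents at once. It is a sketch under stated assumptions, not the authors' implementation: the back-off rule P(loss) = 1 − loss (floored at 0.05 as a safeguard), the helper initial_loss, the round structure, and the handling of the least-preferred resource are illustrative choices.

    import random

    def initial_loss(u):
        """loss[r] = u(r) - u(r_next): immediate utility loss of switching from r to the
        agent's next most preferred resource (0 for the least preferred; an assumption)."""
        order = sorted(range(len(u)), key=lambda r: -u[r])
        loss = [0.0] * len(u)
        for i, r in enumerate(order):
            nxt = u[order[i + 1]] if i + 1 < len(order) else 0.0
            loss[r] = u[r] - nxt
        return loss

    def alma_stage_game(utilities, losses, r_start, rng=random, max_rounds=10000):
        """Synchronous simulation of one ALMA stage game (Alg. 1) for all agents at once.
        utilities[n][r] in [0, 1], losses[n][r] = back-off losses, r_start[n] = starting resource.
        Back-off rule: P(loss) = 1 - loss, floored at 0.05 (an assumption of this sketch)."""
        n_agents, n_res = len(utilities), len(utilities[0])
        prefs = [sorted(range(n_res), key=lambda r: -utilities[n][r]) for n in range(n_agents)]
        target = list(r_start)            # g = A_r  ->  target[n] = r;  g = Y  ->  None
        current = [-1] * n_agents         # monitoring pointer of yielding agents
        won = [None] * n_agents           # resource acquired once an agent has converged

        for _ in range(max_rounds):
            if all(w is not None for w in won):
                break
            held = {w for w in won if w is not None}
            # Agents with g = A_r attempt to acquire their target resource.
            attempts = {}
            for n in range(n_agents):
                if won[n] is None and target[n] is not None:
                    attempts.setdefault(target[n], []).append(n)
            for r, contenders in attempts.items():
                if len(contenders) == 1 and r not in held:
                    won[contenders[0]] = r                      # no collision: converged
                else:
                    for n in contenders:                        # collision: back off w.p. P(loss)
                        if rng.random() < max(0.05, 1.0 - losses[n][r]):
                            target[n] = None                    # g <- Y
            # Yielding agents monitor the next resource in their own preference order.
            for n in range(n_agents):
                if won[n] is None and target[n] is None:
                    current[n] = (current[n] + 1) % n_res
                    r = prefs[n][current[n]]
                    if r not in held and r not in attempts:     # Free(r): start attempting it
                        target[n] = r
        return won

    # Tiny usage example on the 3x3 utilities of Table 2:
    U = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.0], [1.0, 0.75, 0.0]]
    L = [initial_loss(u) for u in U]
    starts = [max(range(3), key=lambda r: U[n][r]) for n in range(3)]
    print(alma_stage_game(U, L, starts, rng=random.Random(0)))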
Each agent maintains the following per-resource estimates¹ (cf. the Require line of Alg. 2):
   (ii) reward[R]: A 1D array. For each r ∈ R it maintains an empirical estimate of the expected reward received by starting at resource r and continuing to play according to Alg. 1. It is computed by averaging the reward history of the resource, i.e., ∀r ∈ R : reward[r] ← rewardHistory[r].getMean(). See line 12 of Alg. 2.
   (iii) loss[R]: A 1D array. For each r ∈ R it maintains an empirical estimate on the loss in utility agent n incurs if he backs-off from the contest of resource r. The loss of each resource r is initialized to loss[r] ← un(r) − un(rnext), where rnext is the next most preferred resource to r, according to agent n's preferences ≺n (see line 5 of Alg. 2). Subsequently, for every stage game, agent n starts by selecting resource rstart, and ends up winning resource rwon. The loss of rstart is then updated according to the following averaging process, where α is the learning rate:

      loss[rstart] ← (1 − α) loss[rstart] + α (u(rstart) − u(rwon))

   Finally, the last condition (lines 17-18 of Alg. 2) ensures that agents who have acquired resources of high preference stop exploring, thus stabilizing the learning process.

¹ We have omitted the subscript n from all the variables and arrays, but every agent maintains their own estimates.
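The bookkeeping of lines 11-18 of Alg. 2, including the averaging rule above, can be sketched per agent as follows. The Agent container, the bounded history length, and the value of the learning rate α are illustrative assumptions of this sketch.

    from collections import deque
    from dataclasses import dataclass
    from typing import Deque, List

    @dataclass
    class Agent:
        u: List[float]                      # utilities u(r)
        r_start: int                        # current starting resource
        reward_history: List[Deque[float]]  # rewardHistory[R][L]: last L observed rewards per resource
        reward: List[float]                 # empirical mean reward per starting resource
        loss: List[float]                   # learned back-off loss per resource

    def learning_update(a: Agent, r_won: int, alpha: float = 0.1) -> None:
        """One ALMA-Learning bookkeeping step after a stage game that won r_won (Alg. 2, lines 11-18)."""
        r = a.r_start
        a.reward_history[r].append(a.u[r_won])                             # line 11
        a.reward[r] = sum(a.reward_history[r]) / len(a.reward_history[r])  # line 12
        gap = a.u[r] - a.u[r_won]
        if gap > 0:                                                        # lines 13-15: averaging of the loss
            a.loss[r] = (1 - alpha) * a.loss[r] + alpha * gap
        if r != r_won:                                                     # lines 17-18: switch starting resource
            a.r_start = max(range(len(a.reward)), key=lambda x: a.reward[x])

    # Example construction for 3 resources with window length 10:
    example = Agent(u=[1.0, 0.5, 0.0], r_start=0,
                    reward_history=[deque([1.0], maxlen=10) for _ in range(3)],
                    reward=[1.0, 0.5, 0.0], loss=[0.5, 0.5, 0.0])
    learning_update(example, r_won=1)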
2.3  Convergence
Convergence of ALMA-Learning does not translate to a fixed allocation at each stage game. The system has converged when agents no longer switch their starting resource, rstart. The final allocation of each stage game is controlled by ALMA, which means that even after convergence there can be contest for a resource, i.e., more than one agent selecting the same starting resource. As we will demonstrate later, this translates to fairer allocations, since agents with similar preferences can alternate between acquiring their most preferred resource.

Theorem 1. There exists a time-step tconv such that ∀t > tconv : r^n_start(t) = r^n_start(tconv), where r^n_start(t) denotes the starting resource rstart of agent n at the stage game of time-step t.

Proof. (Sketch; see Appendix A) Theorem 2.1 of [Danassis et al., 2019a] proves that ALMA (called at line 9 of Alg. 2) converges in polynomial time (in fact, under some assumptions, it converges in constant time, i.e., each stage game converges in constant time). In ALMA-Learning agents switch their starting resource only when the expected reward for the current starting resource drops below the best alternative one, i.e., for an agent to switch from rstart to r′start, it has to be that reward[rstart] < reward[r′start]. Given that utilities are bounded in [0, 1], there is a maximum, finite number of switches until rewardn[r] = 0, ∀r ∈ R, ∀n ∈ N. In that case, the problem is equivalent to having N balls thrown randomly and independently into N bins (since R = N). Since both R, N are finite, the process will result in a distinct allocation in finite steps with probability 1.

3  Evaluation
We evaluate ALMA-Learning in a variety of synthetic benchmarks and a meeting scheduling problem based on real data from [Romano and Nunamaker, 2001]. Error bars represent one standard deviation (SD) of uncertainty.
   For brevity and to improve readability, we only present the most relevant results in the main text. We refer the interested reader to the appendix for additional results for both Sections 3.1 and 3.2, implementation details and hyper-parameters, and a detailed model of the meeting scheduling problem.

Fairness  The usual predicament of efficient allocations is that they assign the resources only to a fixed subset of agents, which leads to an unfair result. Consider the simple example of Table 2. Both ALMA (with higher probability) and any optimal allocation algorithm will assign the coveted resource r1 to agent n1, while n3 will receive utility 0. But, using ALMA-Learning, agents n1 and n3 will update their expected loss for resource r1 to 1, and randomly acquire it between stage games, increasing fairness. Recall that convergence for ALMA-Learning does not translate to a fixed allocation at each stage game. To capture the fairness of this 'mixed' allocation, we report the average fairness on 32 evaluation time-steps that follow the training period.
   To measure fairness, we used the Gini coefficient [Gini, 1912]. An allocation x = (x1, . . . , xN)⊤ is fair iff G(x) = 0, where G(x) = (Σ_{n=1}^{N} Σ_{n′=1}^{N} |x_n − x_{n′}|) / (2N Σ_{n=1}^{N} x_n).
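For reference, a direct implementation of this definition (a small sketch; the function name is ours):

    def gini(x):
        """Gini coefficient G(x) = (sum_n sum_n' |x_n - x_n'|) / (2 N sum_n x_n); 0 means fair."""
        n, total = len(x), sum(x)
        if total == 0:
            return 0.0
        return sum(abs(a - b) for a in x for b in x) / (2 * n * total)

    # e.g., gini([1.0, 1.0, 1.0]) == 0.0, while gini([1.0, 0.0, 0.0]) is roughly 0.67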

                 Resources
               r1     r2     r3
   Agents n1    1     0.5     0
          n2    0      1      0
          n3    1     0.75    0

Table 2: Adversarial example: Unfair allocation. Both ALMA (with higher probability) and any optimal allocation algorithm will assign the coveted resource r1 to agent n1, while n3 will receive utility 0.

3.1  Test Case #1: Synthetic Benchmarks
Setting  We present results on three benchmarks (an illustrative generator for these utility models is sketched at the end of this subsection):
 (a) Map: Consider a Cartesian map on which the agents and resources are randomly distributed. The utility of agent n for acquiring resource r is proportional to the inverse of their distance, i.e., un(r) = 1/dn,r. Let dn,r denote the Manhattan distance. We assume a grid length of size 4 × √N.
 (b) Noisy Common Utilities: This pertains to an anti-coordination scenario, i.e., competition between agents with similar preferences. We model the utilities as: ∀n, n′ ∈ N, |un(r) − un′(r)| ≤ noise, where the noise is sampled from a zero-mean Gaussian distribution, i.e., noise ∼ N(0, σ²).
 (c) Binary Utilities: This corresponds to each agent being indifferent to acquiring any resource amongst his set of desired resources, i.e., un(r) is randomly assigned to 0 or 1.

Baselines  We compare against: (i) the Hungarian algorithm [Kuhn, 1955], which computes a maximum-weight matching in a bipartite graph, (ii) ALMA [Danassis et al., 2019a], and (iii) the Greedy algorithm, which goes through the agents randomly, and assigns them their most preferred, unassigned resource.

Figure 1: Map test-case. Results for increasing number of resources ([2, 1024], x-axis in log scale), and N = R. ALMA-Learning was trained for 512 time-steps. Panels: (a) Relative Difference in SW (%) vs #Resources, (b) Gini Index (lower is better) vs #Resources; baselines: Hungarian, Greedy, ALMA, ALMA-Learning.

Table 3: Range of the average loss (%) in social welfare compared to the (centralized) optimal for the three different benchmarks.

                    Greedy           ALMA            ALMA-Learning
  (a) Map      1.51% − 18.71%   0.00% − 9.57%      0.00% − 0.89%
  (b) Noisy    8.13% − 12.86%   2.96% − 10.58%     1.34% − 2.26%
  (c) Binary   0.10% − 14.70%   0.00% − 16.88%     0.00% − 0.39%

Results  We begin with the loss in social welfare. Figure 1a depicts the results for the Map test-case, while Table 3 aggregates all three test-cases². ALMA-Learning reaches near-optimal allocations (less than 2.5% loss), in most cases in just 32−512 training time-steps. The exception is the Noisy Common Utilities test-case, where the training time was slightly higher. Intuitively, we believe this is because ALMA already starts with a near-optimal allocation (especially for R > 256), and given the high similarity of the agents' utility tables (especially for σ = 0.1), it requires a lot of fine-tuning to improve the result.
   Moving on to fairness, ALMA-Learning achieves the most fair allocations in all of the test-cases. As an example, Figure 1b depicts the Gini coefficient for the Map test-case. ALMA-Learning's Gini coefficient is −18% to −90% lower on average (across problem sizes) than ALMA's, −24% to −93% lower than Greedy's, and −0.2% to −7% lower than Hungarian's.

² For the Noisy Common Utilities test-case, we report results for σ = 0.1, which is the worst performing scenario for ALMA-Learning. Similar results were obtained for σ = 0.2 and σ = 0.4.
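The following sketch shows one way to generate the three utility models above. The sampling details (uniform grid positions, the shared-preference construction for the noisy case, the clipping, and the unit-utility floor for co-located pairs) are assumptions of this sketch, not the paper's data-generation procedure, which is detailed in the appendix.

    import numpy as np

    def map_utilities(n, rng):
        """(a) Map: agents and resources uniformly at random on a grid of side 4*sqrt(n);
        u_n(r) = 1 / (Manhattan distance), with co-located pairs given utility 1 (assumption)."""
        side = max(1, int(4 * np.sqrt(n)))
        agents = rng.integers(0, side, size=(n, 2))
        resources = rng.integers(0, side, size=(n, 2))
        d = np.abs(agents[:, None, :] - resources[None, :, :]).sum(-1)
        return 1.0 / np.maximum(d, 1)

    def noisy_common_utilities(n, sigma, rng):
        """(b) Noisy Common Utilities: one shared preference vector plus zero-mean Gaussian
        noise per agent, clipped to [0, 1] (one way to realize near-identical preferences)."""
        common = rng.random(n)
        return np.clip(common[None, :] + rng.normal(0.0, sigma, size=(n, n)), 0.0, 1.0)

    def binary_utilities(n, rng):
        """(c) Binary Utilities: each u_n(r) is independently 0 or 1."""
        return rng.integers(0, 2, size=(n, n)).astype(float)

    # Usage: rng = np.random.default_rng(0); U = map_utilities(64, rng)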

3.2  Test Case #2: Meeting Scheduling
Motivation  The problem of scheduling a large number of meetings between multiple participants is ubiquitous in everyday life [Nigam and Srivastava, 2020; Ottens et al., 2017; Zunino and Campo, 2009; Maheswaran et al., 2004; BenHassine and Ho, 2007; Hassine et al., 2004; Crawford and Veloso, 2005; Franzin et al., 2002]. The advent of social media brought forth the need to schedule large-scale events, while the era of globalization and the shift to working from home require business meetings to account for participants with diverse preferences (e.g., different timezones).
   Meeting scheduling is an inherently decentralized problem. Traditional approaches (e.g., distributed constraint optimization [Ottens et al., 2017; Maheswaran et al., 2004]) can only handle a bounded, small number of meetings. Interdependences between meetings' participants can drastically increase the complexity. While there are many commercially available electronic calendars (e.g., Doodle, Google calendar, Microsoft Outlook, Apple's Calendar, etc.), none of these products is capable of autonomously scheduling meetings, taking into consideration user preferences and availability.
   While the problem is inherently online, meetings can be aggregated and scheduled in batches, similarly to the approach for tackling matchings in ridesharing platforms [Danassis et al., 2019b]. In this test-case, we map meeting scheduling to an allocation problem and solve it using ALMA and ALMA-Learning. This showcases an application where ALMA-Learning can be used as a negotiation protocol.

Modeling  Let E = {E1, . . . , En} denote the set of events and P = {P1, . . . , Pm} the set of participants. To formulate the participation, let part : E → 2^P, where 2^P denotes the power set of P. We further define the variables days and slots to denote the number of days and time slots per day of our calendar (e.g., days = 7, slots = 24). In order to add length to each event, we define an additional function len : E → N. Participants' utilities are given by pref : E × part(E) × {1, . . . , days} × {1, . . . , slots} → [0, 1].
   Mapping the above to the assignment problem of Section 2.1, we have the set of (day, slot) tuples correspond to R, while each event is represented by one event agent (that aggregates the participant preferences), the set of which would correspond to N.
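A minimal sketch of this mapping follows, assuming the event agent aggregates participant preferences by averaging; the aggregation rule, the container names, and the calendar constants are illustrative, since the paper only states that the event agent aggregates the participant preferences.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class Event:
        name: str
        length: int                      # len(E): duration in consecutive slots
        participants: List[str]          # part(E)

    DAYS, SLOTS = 7, 24                  # calendar size used in the text (days = 7, slots = 24)

    def resources() -> List[Tuple[int, int]]:
        """The resource set R: one resource per (day, slot) tuple."""
        return [(d, s) for d in range(DAYS) for s in range(SLOTS)]

    def event_utility(event: Event,
                      pref: Callable[[Event, str, int, int], float],
                      day: int, slot: int) -> float:
        """Utility of the event agent for resource (day, slot), obtained by averaging the
        pref values of its participants (the averaging is an assumption of this sketch)."""
        values = [pref(event, p, day, slot) for p in event.participants]
        return sum(values) / len(values)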
Baselines  We compare against four baselines: (a) We used the IBM ILOG CP optimizer [Laborie et al., 2018] to formulate and solve the problem as a CSP³. An additional benefit of this solver is that it provides an upper bound for the optimal solution (which is infeasible to compute). (b) A modified version of the MSRAC algorithm [BenHassine and Ho, 2007], and finally, (c) the greedy and (d) ALMA, as before.
³ Computation time limit: 20 minutes.

Designing Large Test-Cases  As the problem size grows, CPLEX's estimate on the upper bound of the optimal solution becomes too loose (see Figure 2a). To get a more accurate estimate of the loss in social welfare for larger test-cases, we designed a large instance by combining smaller problem instances, making it easier for CPLEX to solve, which in turn allowed for tighter upper bounds as well (see Figure 2b). We begin by solving two smaller problem instances with a low number of events. We then combine the two in a calendar of twice the length by duplicating the preferences, resulting in an instance of twice the number of events (agents) and calendar slots (resources). Specifically, in this case we generated seven one-day long sub-instances (with 10, 20, 30 and 40 events each), and combined them into a one-week long instance with 70, 140, 210 and 280 events, respectively. The fact that preferences repeat periodically corresponds to participants being indifferent on the day (yet still having a preference on time). These instances are depicted in Figure 2b and in the last line of Table 4.

Figure 2: Meeting Scheduling. Results for 100 participants (P) and increasing number of events (x-axis in log scale). ALMA-Learning was trained for 512 time-steps. Panels: (a) Relative Difference in SW (%) for the regular instances, (b) Relative Difference in SW (%) for the larger instances, (c) Gini Coefficient (lower is better); baselines: CPLEX, Upper bound, Greedy, MSRAC, ALMA, ALMA-Learning.

Table 4: Range of the average loss (%) in social welfare compared to the IBM ILOG CP optimizer for increasing number of participants, P (|E| ∈ [10, 100]). The final line corresponds to the loss compared to the upper bound for the optimal solution for the large test-case with |P| = 100, |E| = 280 (Figure 2b).

                    Greedy            MSRAC              ALMA           ALMA-Learning
  |P| = 20     6.16% − 18.35%    0.00% − 8.12%     0.59% − 8.69%     0.16% − 4.84%
  |P| = 30     1.72% − 14.92%    1.47% − 10.81%    0.50% − 8.40%     0.47% − 1.94%
  |P| = 50     3.29% − 12.52%    0.00% − 15.74%    0.07% − 7.34%     0.05% − 1.68%
  |P| = 100    0.19% − 9.32%     0.00% − 8.52%     0.15% − 4.10%     0.14% − 1.43%
  |E| = 280    0.00% − 15.31%    0.00% − 22.07%    0.00% − 10.81%    0.00% − 8.84%

Results  Figures 2a and 2b depict the relative difference in social welfare compared to CPLEX for 100 participants (|P| = 100) and increasing number of events for the regular (|E| ∈ [10, 100]) and larger test-cases (|E| up to 280), respectively. Table 4 aggregates the results for various values of P. ALMA-Learning is able to achieve less than 5% loss compared to CPLEX, and this difference diminishes as the problem instance increases (less than 1.5% loss for |P| = 100). Finally, for the largest hand-crafted instance (|P| = 100, |E| = 280, last line of Table 4 and Figure 2b), ALMA-Learning loses less than 9% compared to the possible upper bound of the optimal solution.
   Moving on to fairness, Figure 2c depicts the Gini coefficient for the large, hand-crafted instances (|P| = 100, |E| up to 280). ALMA-Learning exhibits low inequality, up to −9.5% lower than ALMA in certain cases. It is worth noting, though, that the fairness improvement is not as pronounced as in Section 3.1. In the meeting scheduling problem, all of the employed algorithms exhibit high fairness, due to the nature of the problem. Every participant has multiple meetings to schedule (contrary to only being matched to a single resource), all of which are drawn from the same distribution. Thus, as we increase the number of meetings to be scheduled, the fairness naturally improves.

4  Conclusion
The next technological revolution will be interwoven with the proliferation of intelligent systems. To truly allow for scalable solutions, we need to shift from traditional approaches to multi-agent solutions, ideally run on-device. In this paper, we present a novel learning algorithm (ALMA-Learning), which exhibits such properties, to tackle a central challenge in multi-agent systems: finding an optimal allocation between agents, i.e., computing a maximum-weight matching. We prove that ALMA-Learning converges, and provide a thorough empirical evaluation in a variety of synthetic scenarios and a real-world meeting scheduling problem. ALMA-Learning is able to quickly (in as little as 64 training steps) reach allocations of high social welfare (less than 5% loss) and fairness.
Appendix: Contents

In this appendix we include several details that have been omitted from the main text for the sake of brevity and to improve readability. In particular:

- In Section A, we prove Theorem 1.
- In Section B, we describe in detail the modeling of the meeting scheduling problem, including the problem formulation, the data generation, the modeling of the events, the participants, and the utility functions, and finally several implementation-related details.
- In Section C, we provide a thorough account of the simulation results – including but not limited to omitted graphs and tables – both for the synthetic benchmarks and the meeting scheduling problem.

A    Proof of Theorem 1

Proof. Theorem 2.1 of [Danassis et al., 2019a] proves that ALMA (called at line 9 of Algorithm 2) converges in polynomial time.

In fact, under the assumption that each agent is interested in a subset of the total resources (i.e., R_n ⊂ R), and thus at each resource there is a bounded number of competing agents (N^r ⊂ N), Corollary 2.1.1 of [Danassis et al., 2019a] proves that the expected number of steps any individual agent requires to converge is independent of the total problem size (i.e., N and R). In other words, by bounding these two quantities (i.e., we consider R_n and N^r to be constant functions of N, R), the convergence time of ALMA is constant in the total problem size N, R. Thus, under the aforementioned assumptions:

    Each stage game converges in constant time.

Now that we have established that the call to the ALMA procedure will return, the key observation to prove convergence for ALMA-Learning is that agents switch their starting resource only when the expected reward for the current starting resource drops below the best alternative one, i.e., for an agent to switch from r_start to r'_start, it has to be that reward[r_start] < reward[r'_start]. Given that utilities are bounded in [0, 1], there is a maximum, finite number of switches until reward_n[r] = 0, ∀r ∈ R, ∀n ∈ N. In that case, the problem is equivalent to having N balls thrown randomly and independently into N bins (since R = N). Since both R, N are finite, the process will result in a distinct allocation in finite steps with probability 1. In more detail, we can make the following arguments:

(i) Let r_start be the starting resource for agent n, and r'_start ← arg max_{r ∈ R \ {r_start}} reward[r]. There are two possibilities. Either reward[r_start] > reward[r'_start] for all time-steps t > t_converged – i.e., reward[r_start] can oscillate but always stays larger than reward[r'_start] – or there exists a time-step t when reward[r_start] < reward[r'_start], and then agent n switches to the starting resource r'_start.

(ii) Only the reward of the starting resource r_start changes at each stage game. Thus, for the reward of a resource to increase, it has to be the r_start. In other words, at each stage game in which we select r_start as the starting resource, the reward of every other resource remains (1) unchanged and (2) reward[r] < reward[r_start], ∀r ∈ R \ {r_start} (except when an agent switches starting resources).

(iii) There is a finite number of times each agent can switch his starting resource r_start. This is because u_n(r) ∈ [0, 1] and |u_n(r) − u_n(r')| > δ, ∀n ∈ N, r ∈ R, where δ is a small, strictly positive minimum increment value. This means that either the agents will perform the maximum number of switches until reward_n[r] = 0, ∀r ∈ R, ∀n ∈ N (which will happen in a finite number of steps), or the process will have converged before that.

(iv) If reward_n[r] = 0, ∀r ∈ R, ∀n ∈ N, the question of convergence is equivalent to having N balls thrown randomly and independently into R bins and asking whether you can have exactly one ball in each bin – or, in our case, where N = R, have no empty bins. The probability of bin r being empty is ((R − 1)/R)^N, i.e., of it being occupied is 1 − ((R − 1)/R)^N. The probability of all the bins being occupied is (1 − ((R − 1)/R)^N)^R. The expected number of trials until this event occurs is 1/(1 − ((R − 1)/R)^N)^R, which is finite for finite N, R.
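For illustration only (not part of the original proof), the following minimal sketch evaluates the two closed-form expressions above, i.e., the probability that a single random assignment leaves no bin empty and the implied expected number of trials; the function names and the example values of N = R are ours.

```python
def prob_all_bins_occupied(N: int, R: int) -> float:
    """Probability that no bin is empty when N balls are thrown uniformly at
    random into R bins, using the expression from argument (iv):
    (1 - ((R - 1) / R) ** N) ** R."""
    return (1.0 - ((R - 1) / R) ** N) ** R


def expected_trials_until_no_empty_bin(N: int, R: int) -> float:
    """Expected number of independent trials until one with no empty bin,
    i.e., 1 / (1 - ((R - 1) / R) ** N) ** R, which is finite for finite N, R."""
    return 1.0 / prob_all_bins_occupied(N, R)


if __name__ == "__main__":
    for n in (4, 8, 16):  # N = R, as in the setting of the proof
        p = prob_all_bins_occupied(n, n)
        trials = expected_trials_until_no_empty_bin(n, n)
        print(f"N = R = {n:3d}:  P(no empty bin) = {p:.3e},  expected #trials = {trials:.3e}")
```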
A.1    Complexity

ALMA-Learning is an anytime algorithm. At each training time-step, we run ALMA once. Thus, the computational complexity is bounded by T times the bound for ALMA, where T denotes the number of training time-steps (see Equation 2, where N and R denote the number of agents and resources, respectively, p* = f(loss*), and loss* is given by Equation 3).

$$O\left(T\, R\, \frac{2 - p^*}{2(1 - p^*)}\, \frac{1}{p^*}\, (\log N + R)\right) \qquad (2)$$

$$loss^* = \operatorname*{arg\,min}_{loss_n^r} \left( \min_{r \in \mathcal{R},\, n \in \mathcal{N}} (loss_n^r),\; 1 - \max_{r \in \mathcal{R},\, n \in \mathcal{N}} (loss_n^r) \right) \qquad (3)$$
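As a quick way to see how the bound of Equation 2 behaves, the sketch below evaluates the expression inside the O(·), using the grouping shown above and ignoring the hidden constant. Since the mapping p* = f(loss*) is defined elsewhere in the paper, p* is treated here as a given input, and the example parameter values are arbitrary assumptions.

```python
import math


def alma_learning_complexity_bound(T: int, N: int, R: int, p_star: float) -> float:
    """Evaluate the expression inside the O(.) of Equation 2:
    T * R * (2 - p*) / (2 * (1 - p*)) * (1 / p*) * (log N + R)."""
    assert 0.0 < p_star < 1.0, "p* must lie strictly between 0 and 1"
    return T * R * (2 - p_star) / (2 * (1 - p_star)) * (1 / p_star) * (math.log(N) + R)


# Example with arbitrary values: 100 training time-steps, 64 agents and resources.
print(f"{alma_learning_complexity_bound(T=100, N=64, R=64, p_star=0.5):.3e}")
```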
B    Modeling of the Meeting Scheduling Problem

B.1    Problem Formulation

Let E = {E_1, . . . , E_n} denote the set of events we want to schedule and P = {P_1, . . . , P_m} the set of participants. Additionally, we define a function mapping each event to the set of its participants, part : E → 2^P, where 2^P denotes the power set of P. Let days and slots denote the number of days and time slots per day of our calendar (e.g., days = 7, slots = 24 would define a calendar for one week where each slot is 1 hour long). In order to add length to each event, we define an additional function len : E → N, where N denotes the set of natural numbers (excluding 0). We do not limit the length; this allows for events to exceed a single day, and even the entire calendar, if needed. Finally, we assume that each participant has a preference for attending certain events at a given starting time, given by:

    pref : E × part(E) × {1, . . . , days} × {1, . . . , slots} → [0, 1].

For example, pref(E_1, P_1, 2, 6) = 0.7 indicates that participant P_1 has a preference of 0.7 to attend event E_1 starting at day 2 and slot 6. The preference function allows participants to differentiate between different kinds of meetings (personal, business, etc.), or assign priorities. For example, one could be available in the evening for personal events while preferring to schedule business meetings in the morning.

Finding a schedule consists of finding a function that assigns each event to a given starting time, i.e.,

    sched : E → ({1, . . . , days} × {1, . . . , slots}) ∪ ∅

where sched(E) = ∅ means that the event E is not scheduled. For the schedule to be valid, the following hard constraints need to be met:

1. Scheduled events with common participants must not overlap.
2. An event must not be scheduled at a (day, slot) tuple if any of the participants is not available. We represent an unavailable participant as one that has a preference of 0 (as given by the function pref) for that event at the given (day, slot) tuple.

More formally, the hard constraints are:

    ∀E_1 ∈ E, ∀E_2 ∈ E \ {E_1} :
    (sched(E_1) ≠ ∅ ∧ sched(E_2) ≠ ∅ ∧ part(E_1) ∩ part(E_2) ≠ ∅)
        ⇒ (sched(E_1) > end(E_2) ∨ sched(E_2) > end(E_1))

and

    ∀E ∈ E :
    (∃P ∈ P, ∃d ∈ [1, days], ∃s ∈ [1, slots] : pref(E, P, d, s) = 0
        ⇒ sched(E) ≠ (d, s))

where end(E) returns the ending time (last slot) of the event E, as calculated by the starting time sched(E) and the length len(E).

In addition to finding a valid schedule, we focus on maximizing the social welfare, i.e., the sum of the preferences for all scheduled meetings:

$$\sum_{\substack{E \in \mathcal{E}\\ sched(E) \neq \emptyset}} \; \sum_{P \in \mathcal{P}} pref(E, P, sched(E))$$
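To make the formulation concrete, here is a minimal sketch of the two hard constraints and the social-welfare objective. The data structures, the flattening of (day, slot) tuples into a global slot index, and the restriction of the welfare sum to an event's own participants are illustrative choices of ours, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional, Tuple

DaySlot = Tuple[int, int]  # (day, slot), both 1-indexed


@dataclass(frozen=True)
class Event:
    name: str
    participants: FrozenSet[str]  # part(E)
    length: int                   # len(E), in slots


Pref = Callable[[Event, str, int, int], float]  # pref(E, P, day, slot) in [0, 1]


def to_index(t: DaySlot, slots_per_day: int) -> int:
    """Flatten a (day, slot) tuple into a single global slot index."""
    day, slot = t
    return (day - 1) * slots_per_day + (slot - 1)


def is_valid(schedule: Dict[Event, Optional[DaySlot]], pref: Pref,
             slots_per_day: int) -> bool:
    """Check the two hard constraints of Section B.1 (None = not scheduled)."""
    scheduled = [(e, t) for e, t in schedule.items() if t is not None]
    # Constraint 2: no participant may be unavailable (pref = 0) at the start time.
    for e, (day, slot) in scheduled:
        if any(pref(e, p, day, slot) == 0.0 for p in e.participants):
            return False
    # Constraint 1: events that share a participant must not overlap in time.
    for i, (e1, t1) in enumerate(scheduled):
        for e2, t2 in scheduled[i + 1:]:
            if e1.participants & e2.participants:
                s1, s2 = to_index(t1, slots_per_day), to_index(t2, slots_per_day)
                if not (s1 > s2 + e2.length - 1 or s2 > s1 + e1.length - 1):
                    return False
    return True


def social_welfare(schedule: Dict[Event, Optional[DaySlot]], pref: Pref) -> float:
    """Sum of preferences over all scheduled events and their participants."""
    return sum(pref(e, p, *t)
               for e, t in schedule.items() if t is not None
               for p in e.participants)
```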
                                                                   recently created point is exponentially more likely to be cho-
B.2     Modeling Events, Participants, and Utilities               sen as the center than an older one. This ensures the creation
                                                                   of clusters, while the randomness and the preference on re-
Event Length To determine the length of each generated             cently generated points prohibits a single cluster to grow ex-
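The two paragraphs above sample event lengths and participant counts from logistic curves fitted to the data of [Romano and Nunamaker, 2001]. The fitted parameters are not reported here, so the sketch below only illustrates the mechanism under assumed placeholder values for the location and scale, with the stated caps of 11 hours and 90 participants enforced by rejection (our choice, not necessarily the paper's).

```python
import math
import random


def sample_from_logistic(location: float, scale: float, low: int, high: int,
                         rng: random.Random) -> int:
    """Draw an integer by inverse-CDF sampling from a logistic distribution,
    rejecting values outside [low, high]. location/scale are placeholders; the
    paper fits the curve to the data of [Romano and Nunamaker, 2001]."""
    while True:
        u = rng.random()
        if not 0.0 < u < 1.0:            # guard against u == 0.0
            continue
        value = round(location + scale * math.log(u / (1.0 - u)))
        if low <= value <= high:
            return value


rng = random.Random(0)
event_length_hours = sample_from_logistic(1.5, 0.8, low=1, high=11, rng=rng)  # cap: 11 hours
num_participants = sample_from_logistic(5.0, 3.0, low=2, high=90, rng=rng)    # cap: 90 people
```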
In order to simulate the naturally occurring clustering of people (see also B.4), we assigned each participant to a point on a 1 × 1 plane in a way that enabled the emergence of clusters. Participants that are closer on the plane are more likely to attend the same meeting. In more detail, we generated the points in an iterative way. The first participant was assigned a uniformly random point on the plane. For each subsequent participant, there is a 30% probability that they also get assigned a uniformly random point. With a 70% probability, the person would be assigned a point based on a normal distribution centered at one of the previously created points. The selection of the latter point is based on the time of creation; a recently created point is exponentially more likely to be chosen as the center than an older one. This ensures the creation of clusters, while the randomness and the preference for recently generated points prevent a single cluster from growing excessively. Figure 3 displays an example of the aforedescribed process.

[Figure 3: Generated datapoints on the 1 × 1 plane for p number of people. Panels: (a) p = 10, (b) p = 50, (c) p = 100, (d) p = 200; both axes span [0, 1].]
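A compact sketch of the point-generation procedure just described. The paper specifies the 30%/70% split and the exponential preference for recently created centres, but not the spread of the normal distribution, the exact decay, or whether points are clipped to the unit square; sigma, decay, the clipping, and the seed below are assumptions of this illustration.

```python
import random


def generate_points(num_people: int, sigma: float = 0.05, decay: float = 0.8,
                    seed: int = 0) -> list:
    """Iteratively place participants on the 1 x 1 plane so that clusters emerge."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random())]          # first participant: uniform
    for _ in range(1, num_people):
        if rng.random() < 0.30:                      # 30%: uniform on the plane
            points.append((rng.random(), rng.random()))
        else:                                        # 70%: normal around a previous point;
            k = len(points)                          # recent points exponentially more likely
            weights = [decay ** (k - 1 - i) for i in range(k)]
            cx, cy = rng.choices(points, weights=weights, k=1)[0]
            x = min(1.0, max(0.0, rng.gauss(cx, sigma)))
            y = min(1.0, max(0.0, rng.gauss(cy, sigma)))
            points.append((x, y))
    return points


# Example: a point cloud comparable to one panel of Figure 3.
points = generate_points(num_people=50)
```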
Utilities  The utility function has two independent components. The first was designed to roughly reflect availability on an average workday. This function depends only on the time and is independent of the day. For example, a participant might prefer to schedule meetings in the morning rather than the afternoon (or during lunch time). A second function decreases the utility for an event over time. This function is independent of the time slot and only depends on the day