Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Planning in Factored State and Action Spaces with Learned Binarized Neural Network Transition Models

                                       Buser Say, Scott Sanner
             Department of Mechanical & Industrial Engineering, University of Toronto, Canada
                                   {bsay,ssanner}@mie.utoronto.ca

Abstract

In this paper, we leverage the efficiency of Binarized Neural Networks (BNNs) to learn complex state transition models of planning domains with discretized factored state and action spaces. In order to directly exploit this transition structure for planning, we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Boolean Satisfiability (FD-SAT-Plan) as well as Binary Linear Programming (FD-BLP-Plan). Experimentally, we show the effectiveness of learning complex transition models with BNNs, and test the runtime efficiency of both encodings on the learned factored planning problem. After this initial investigation, we present an incremental constraint generation algorithm based on generalized landmark constraints to improve the planning accuracy of our encodings. Finally, we show how to extend the best performing encoding (FD-BLP-Plan+) beyond goals to handle factored planning problems with rewards.

1   Introduction

Deep neural networks have significantly improved the ability of autonomous systems to perform complex tasks, such as image recognition [Krizhevsky et al., 2012], speech recognition [Deng et al., 2013] and natural language processing [Collobert et al., 2011], and can outperform humans and human-designed super-human systems in complex planning tasks such as Go [Silver et al., 2016] and Chess [Silver et al., 2017].

In the area of learning and online planning, recent work on HD-MILP-Plan [Say et al., 2017] has explored a two-stage framework that (i) learns transition models from data with ReLU-based deep networks and (ii) plans optimally with respect to the learned transition models using mixed-integer linear programming, but did not provide encodings that are able to learn and plan with discrete state variables. As an alternative to ReLU-based deep networks, Binarized Neural Networks (BNNs) [Hubara et al., 2016] have been introduced with the specific ability to learn compact models over discrete variables, providing a new formalism for transition learning and planning in factored [Boutilier et al., 1999] discrete state and action spaces that we explore in this paper. However, planning with these BNN transition models poses two non-trivial questions: (i) What is the most efficient compilation of BNNs for planning in domains with factored state and (concurrent) action spaces? (ii) Given that BNNs may learn incorrect domain models, how can a planner repair BNN compilations to improve their planning accuracy?

To answer question (i), we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Boolean Satisfiability (FD-SAT-Plan) and Binary Linear Programming (FD-BLP-Plan). Over three factored planning domains with multiple size and horizon settings, we test the effectiveness of learning complex state transition models with BNNs, and test the runtime efficiency of both encodings on the learned factored planning problems. While there are methods for learning PDDL models from data [Yang et al., 2007; Amir and Chang, 2008] and excellent PDDL planners [Helmert, 2006; Richter and Westphal, 2010], we remark that BNNs are strictly more expressive than PDDL-based learning paradigms for learning concurrent effects in factored action spaces that may depend on the joint execution of one or more actions. Furthermore, while Monte Carlo Tree Search (MCTS) methods [Kocsis and Szepesvári, 2006; Keller and Helmert, 2013] including AlphaGo [Silver et al., 2016] and AlphaGoZero [Silver et al., 2016] could technically plan with a BNN-learned black box model of transition dynamics, unlike this work, they would not be able to exploit the BNN transition structure and they would not be able to provide optimality guarantees with respect to the learned model.

To answer question (ii), we introduce an incremental algorithm based on generalized landmark constraints from the decomposition-based cost-optimal classical planner [Davies et al., 2015], where during online planning we detect and constrain invalid sets of action selections from the decision space of the planners and efficiently improve their planning accuracy. Finally, building on the above answers to (i) and (ii), we extend the best performing encoding to handle factored planning problems with general rewards (FD-BLP-Plan+).

In summary, this work provides the first planner capable of learning complex transition models in domains with mixed (continuous and discrete) factored state and action spaces as BNNs and capable of exploiting their structure in satisfiability (or optimization) encodings for planning purposes. Empirical results demonstrate strong performance in both goal-oriented and reward-oriented planning in both the learned and original domains, and provide a new transition learning and planning formalism to the data-driven model-based planning community.

2     Preliminaries

Before we present the SAT and BLP compilations of the learned planning problem, we review the preliminaries motivating this work.

2.1    Problem Definition

A deterministic factored planning problem is a tuple Π = ⟨S, A, C, T, I, G⟩ where S = {S^d, S^c} is a mixed set of state variables with discrete S^d and continuous S^c domains, A = {A^d, A^c} is a mixed set of action variables with discrete A^d and continuous A^c domains, C : S × A → {true, false} is a function that returns true if the action a ∈ A and state s ∈ S pair satisfies the global constraints on state and action variables, and T : S × A → S denotes the stationary transition function between time steps t and t+1, with T(s_t, a_t) = s_{t+1} if c(s_t, a_t) = true for all global constraints c ∈ C, and undefined otherwise. Finally, I ⊂ C is the initial state and G ⊂ C is the goal state constraints over the set and subset of state variables S, respectively. Given a planning horizon H, a solution (i.e., plan) to Π is a value assignment to the action a_t and state s_t variables such that T(s_t, a_t) = s_{t+1} and c(s_t, a_t) = true for all global constraints c ∈ C and time steps t ∈ {1, ..., H}. The RDDL [Sanner, 2010] formalism is extended to handle goal specifications and is used to describe the problem Π.
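For readers who prefer code, a minimal container for the tuple Π might look as follows; this is only an illustrative sketch (the field names, the representation of states as dictionaries, and the simplification of I and G to a single initial state and a goal test are assumptions, not part of the formalism above).

from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]    # assignment to the state variables in S (discrete and/or continuous)
Action = Dict[str, Any]   # assignment to the action variables in A

@dataclass
class FactoredPlanningProblem:
    """Illustrative container for Pi = <S, A, C, T, I, G> with a horizon H."""
    constraints: Callable[[State, Action], bool]             # C : S x A -> {true, false}
    transition: Callable[[State, Action], Optional[State]]   # T(s_t, a_t) = s_{t+1}; None where undefined
    initial: State                                            # simplification of the initial state constraints I
    goal: Callable[[State], bool]                             # simplification of the goal state constraints G
    horizon: int                                              # planning horizon H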
2.2    Factored Planning with Deep-Net Learned Transition Models

Factored planning with deep-net learned transition models is a two-stage framework for learning and solving nonlinear factored planning problems, as first introduced in HD-MILP-Plan [Say et al., 2017]. Given samples of state transition data, the first stage of the framework learns the transition function T̃ using a deep neural network with Rectified Linear Units (ReLUs) [Nair and Hinton, 2010] and linear activation units. The data is sampled from the RDDL-based domain simulator RDDLsim [Sanner, 2010]. In the second stage, the learned transition function T̃ is used to construct the factored planning problem Π̃ = ⟨S, A, C, T̃, I, G⟩. That is, the trained deep neural network with fixed weights is used to predict the state s_{t+1} at time t+1 for free state s_t and action a_t variables at time t such that T̃(s_t, a_t) = s_{t+1}. As visualized in Figure 1, the learned transition function T̃ is sequentially chained over horizon t ∈ {1, ..., H} and compiled into a Mixed-Integer Linear Program, yielding the planner HD-MILP-Plan. Since HD-MILP-Plan utilizes only ReLUs and linear activation units in its learned transition models, the state variables are restricted to have only continuous domains S^c ⊆ S.

Figure 1: Visualization of HD-MILP-Plan, where blue circles represent state variables s ∈ S, red circles represent action variables a ∈ A, gray circles represent hidden units (i.e., ReLUs and linear activation units) and w represent the weights of the deep neural network. During the learning stage, the weights w are learned from data. In the planning stage, the weights are fixed and HD-MILP-Plan optimizes a given reward function with respect to the free action a ∈ A and state variables s ∈ S.
                                                                                                            ,
Plan [Say et al., 2017]. Given samples of state transition data,
the first stage of the framework learns the transition function          weight, input mean, input variance, numerical stability con-
                                                                         stant (i.e., epsilon), input scaling and input bias respectively,
T̃ using a deep-neural network with Rectified Linear Units
                                                                         where all parameters are computed at training time.
(ReLUs) [Nair and Hinton, 2010] and linear activation units.
The data is sampled from the RDDL-based domain simulator
RDDLsim [Sanner, 2010]. In the second stage, the learned                 2.4    Boolean Satisfiability Problem
transition function T̃ is used to construct the factored plan-
ning problem Π̃ = hS, A, C, T̃ , I, Gi. That is, the trained             The Boolean Satisfiability Problem (SAT) is the problem of
deep-neural network with fixed weights is used to predict the            determining whether there exists a value assignment to the
state st+1 at time t+1 for free state st and action at variables         variables of a Boolean formula such that the formula eval-
at time t such that T̃ (st , at ) = st+1 . As visualized in Fig-         uates to true (i.e., satisfiable) [Davis and Putnam, 1960].
ure 1, the learned transition function T̃ is sequentially chained        While the theoretical worst-case complexity of SAT is NP-
over horizon t ∈ {1, . . . , H}, and compiled into a Mixed-              Complete, state-of-art SAT solvers are shown to scale exper-
Integer Linear Program yielding the planner HD-MILP-Plan.                imentally well for large instances with millions of variables
Since HD-MILP-Plan utilizes only ReLUs and linear acti-                  and constraints [Biere et al., 2009].
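To make the layer stacking above concrete, the following is a small NumPy sketch of one BNN layer at inference time (weighted sum of binary inputs, batch normalization with trained parameters, then deterministic sign binarization); the function name and the toy parameter values are illustrative assumptions.

import numpy as np

def bnn_layer_forward(y_prev, w, mu, sigma2, eps, gamma, beta):
    """One BNN layer at inference time.
    y_prev: outputs of layer l-1 in {-1, +1}; w: binary weights in {-1, +1};
    mu, sigma2, eps, gamma, beta: batch normalization parameters fixed at training time."""
    delta = y_prev @ w                                          # weighted sum Delta_{j,l}
    x = (delta - mu) / np.sqrt(sigma2 + eps) * gamma + beta     # batch normalization to x_{j,l}
    return np.where(x >= 0.0, 1.0, -1.0)                        # binarization: y_{j,l} = 1 if x_{j,l} >= 0, else -1

# Toy example: 4 binary inputs feeding 3 binary units with fixed (trained) parameters.
y0 = np.array([1.0, -1.0, 1.0, 1.0])
w = np.array([[1.0, -1.0, 1.0], [-1.0, 1.0, 1.0], [1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]])
mu, sigma2 = np.zeros(3), np.ones(3)
gamma, beta, eps = np.ones(3), np.zeros(3), 1e-5
y1 = bnn_layer_forward(y0, w, mu, sigma2, eps, gamma, beta)     # array of {-1, +1} outputs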
2.4    Boolean Satisfiability Problem

The Boolean Satisfiability Problem (SAT) is the problem of determining whether there exists a value assignment to the variables of a Boolean formula such that the formula evaluates to true (i.e., is satisfiable) [Davis and Putnam, 1960]. While the theoretical worst-case complexity of SAT is NP-complete, state-of-the-art SAT solvers have been shown to scale experimentally well to large instances with millions of variables and constraints [Biere et al., 2009].

Boolean Cardinality Constraints

Boolean cardinality constraints describe bounds on the number of Boolean variables that are allowed to be true, and are of the form AtMost_k(x_1, ..., x_n) = Σ_{i=1}^{n} x_i ≤ k. An efficient encoding of AtMost_k(x_1, ..., x_n) in conjunctive normal form (CNF) [Sinz, 2005] is presented below:
(⋀_{1<j≤k} ¬s_{1,j}) ∧ (⋀_{1≤i<n} (¬x_i ∨ s_{i,1})) ∧ (⋀_{1<i<n} (¬s_{i−1,1} ∨ s_{i,1})) ∧ (⋀_{1<i<n, 1<j≤k} ((¬x_i ∨ ¬s_{i−1,j−1} ∨ s_{i,j}) ∧ (¬s_{i−1,j} ∨ s_{i,j}))) ∧ (⋀_{1<i≤n} (¬x_i ∨ ¬s_{i−1,k}))

where the auxiliary variables s_{i,j} count, in unary, how many of x_1, ..., x_i are true.
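The encoding above can be generated mechanically. The sketch below is a hypothetical helper (not the planners' code) that emits the sequential-counter clauses for AtMost_k over DIMACS-style integer literals; new_var is assumed to hand out fresh auxiliary variable indices, and 1 ≤ k < n is assumed.

import itertools

def at_most_k(xs, k, new_var):
    """Sinz-style CNF clauses enforcing that at most k of the literals xs are true.
    xs: positive integer literals; new_var(): returns a fresh auxiliary variable index.
    Returns a list of clauses, each a list of integer literals."""
    n = len(xs)
    clauses = []
    # s[i][j] stands for s_{i+1, j+1}: "at least j+1 of xs[0..i] are true"
    s = [[new_var() for _ in range(k)] for _ in range(n - 1)]
    clauses.append([-xs[0], s[0][0]])                           # x_1 -> s_{1,1}
    for j in range(1, k):
        clauses.append([-s[0][j]])                              # not s_{1,j} for 1 < j <= k
    for i in range(1, n - 1):
        clauses.append([-xs[i], s[i][0]])                       # x_i -> s_{i,1}
        clauses.append([-s[i - 1][0], s[i][0]])                 # s_{i-1,1} -> s_{i,1}
        for j in range(1, k):
            clauses.append([-xs[i], -s[i - 1][j - 1], s[i][j]])  # x_i and s_{i-1,j-1} -> s_{i,j}
            clauses.append([-s[i - 1][j], s[i][j]])              # s_{i-1,j} -> s_{i,j}
        clauses.append([-xs[i], -s[i - 1][k - 1]])              # x_i -> not s_{i-1,k}
    clauses.append([-xs[n - 1], -s[n - 2][k - 1]])              # x_n -> not s_{n-1,k}
    return clauses

# Example: at most 2 of the variables 1..4 may be true; auxiliary variables start at 100.
fresh = itertools.count(start=100)
clauses = at_most_k([1, 2, 3, 4], 2, lambda: next(fresh))

With k = 1, the same generator yields mutual-exclusion constraints such as the ones used for action variables in clause (8) below.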

3.2    Parameters

Next we define the additional parameters used in FD-SAT-Plan.

Clauses (2-3) map the binary units at the input layer of the BNN (i.e., l = 1) to a unique state or action variable, respectively. Similarly, clause (4) maps the binary units at the output layer of the BNN (i.e., l = L) to a unique state variable. Clauses (5-6) encode the binary activation of every unit in the BNN.

Global Constraints

⋀_{c∈C} c(Y^i_{s,t} : 1 ≤ t ≤ H+1, s ∈ S, 1 ≤ i ≤ m, X^i_{a,t} : 1 ≤ t ≤ H, a ∈ A, 1 ≤ i ≤ m)    (7)

where clause (7) represents domain-dependent global constraints on state and action variables. Some common examples of global constraints c ∈ C, such as mutual exclusion on Boolean action variables and one-hot encodings for the output of the BNN (i.e., exactly one Boolean state variable must be true), are respectively encoded below by clauses (8-9):

AtMost_1(X_{a,t} : a ∈ A)    (8)

AtMost_1(Y_{s,t} : s ∈ S) ∧ (⋁_{s∈S} Y_{s,t})    (9)

In general, linear global constraints in the form of Σ_i a_i x_i ≤ k, such as bounds on state and action variables, can be encoded in SAT where a_i are positive integer coefficients and x_i are decision variables with non-negative integer domains [Abío and Stuckey, 2014].
4     BLP Compilation of the Learned Factored Planning Problem

Given FD-SAT-Plan, we present the BLP compilation of the learned factored planning problem Π̃ with BNNs, which we denote as Factored Deep BLP Planner (FD-BLP-Plan).

4.1    Binary Variables and Parameters

FD-BLP-Plan uses the same set of decision variables and parameters as FD-SAT-Plan.

4.2    The BLP Compilation

FD-BLP-Plan replaces clauses (1-9) with equivalent linear constraints. Clauses (1-4) are replaced by the following equality constraints:

Y^i_{s,1} = I^i_s                  for all s ∈ S, 1 ≤ i ≤ m    (10)
Y^i_{s,H+1} = G^i_s                for all s ∈ S^G, 1 ≤ i ≤ m    (11)
Y^i_{s,t} = Z_{In(s,i),1,t}        for all 1 ≤ t ≤ H, s ∈ S, 1 ≤ i ≤ m    (12)
X^i_{a,t} = Z_{In(a,i),1,t}        for all 1 ≤ t ≤ H, a ∈ A, 1 ≤ i ≤ m    (13)
Y^i_{s,t+1} = Z_{Out(s,i),L,t}     for all 1 ≤ t ≤ H, s ∈ S, 1 ≤ i ≤ m    (14)

Clause (5) is replaced by the following linear constraint:

k Z_{j,l,t} ≤ Σ_{i∈J(l−1), w_{i,j,l−1}=1} Z_{i,l−1,t} + Σ_{i∈J(l−1), w_{i,j,l−1}=−1} (1 − Z_{i,l−1,t})    (15)

for all 1 ≤ t ≤ H, 2 ≤ l ≤ L. Similarly, clause (6) is replaced by the following linear constraint:

k′ (1 − Z_{j,l,t}) ≤ Σ_{i∈J(l−1), w_{i,j,l−1}=−1} Z_{i,l−1,t} + Σ_{i∈J(l−1), w_{i,j,l−1}=1} (1 − Z_{i,l−1,t})    (16)

for all 1 ≤ t ≤ H, 2 ≤ l ≤ L. Finally, clauses (7-9) are replaced by linear constraints in the form of Σ_i a_i x_i ≤ k.
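As a rough illustration of what constraints (15) and (16) look like in an off-the-shelf modeling layer, the sketch below builds them for a single binary unit using PuLP (the experiments use CPLEX directly); the weights and the activation thresholds k and k′ are illustrative placeholders, with k and k′ assumed to be precomputed from the trained BNN parameters.

from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

# Placeholder data for one unit j at layer l and time step t.
w = {0: 1, 1: -1, 2: 1, 3: 1}        # binary weights w_{i,j,l-1} from layer l-1
k, k_prime = 3, 2                    # activation thresholds (assumed precomputed)

model = LpProblem("bnn_unit_activation", LpMinimize)
z_prev = {i: LpVariable(f"Z_{i}_prev", cat=LpBinary) for i in w}   # Z_{i,l-1,t}
z_unit = LpVariable("Z_unit", cat=LpBinary)                        # Z_{j,l,t}

# Constraint (15): if the unit is active, at least k inputs contribute positively.
model += k * z_unit <= (lpSum(z_prev[i] for i in w if w[i] == 1)
                        + lpSum(1 - z_prev[i] for i in w if w[i] == -1))

# Constraint (16): if the unit is inactive, at least k' inputs contribute negatively.
model += k_prime * (1 - z_unit) <= (lpSum(z_prev[i] for i in w if w[i] == -1)
                                    + lpSum(1 - z_prev[i] for i in w if w[i] == 1))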
5     Incremental Factored Planning Algorithm for FD-SAT-Plan and FD-BLP-Plan

Now we introduce an incremental algorithm for excluding invalid plans from the search space of FD-SAT-Plan and FD-BLP-Plan. Similar to OpSeq [Davies et al., 2015], we update our planners with generalized landmark constraints

Σ_{a∈A} ( Σ_{1≤t≤H, (a,t)∈L_n} (1 − X_{a,t}) + Σ_{1≤t≤H, (a,t)∉L_n} X_{a,t} ) ≥ 1    (17)

where L_n is the set of actions a ∈ A executed at times 1 ≤ t ≤ H in the n-th iteration of the algorithm outlined below.

Algorithm 1 Incremental Factored Planning Algorithm for FD-SAT-Plan and FD-BLP-Plan
1: n = 1, planner = FD-SAT-Plan or FD-BLP-Plan
2: L_n ← Solve planner.
3: if L_n is ∅ or L_n is a plan for Π then
4:     return L_n
5: else planner ← Constraint (17)
6: n ← n + 1, go to line 2.

For a given horizon H, Algorithm 1 iteratively computes a set of actions L_n, or returns infeasibility for the learned factored planning problem Π̃. If the set of actions L_n is non-empty, we evaluate whether L_n is a valid plan for the original factored planning problem Π (i.e., line 3) either in the actual domain or using a high-fidelity domain simulator – in our case, RDDLsim. If the set of actions L_n constitutes a plan for Π, Algorithm 1 returns L_n as a plan. Otherwise, the planner is updated with the new set of generalized landmark constraints to exclude L_n and the loop repeats. Since the original action space is discretized and represented up to m bits of precision, Algorithm 1 can be shown to terminate in no more than n = 2^{|A|×m×H} iterations by constructing an inductive proof similar to the termination criteria of OpSeq, where either a feasible plan for Π is returned or there does not exist a plan for both Π̃ and Π for the given horizon H.
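A compact Python rendering of Algorithm 1 is sketched below; solve, is_valid_plan and add_generalized_landmark are hypothetical stand-ins for solving the encoding, validating the candidate plan against the original problem (e.g., via RDDLsim), and adding constraint (17).

def incremental_factored_planning(planner, validator, max_iterations=10000):
    """Sketch of Algorithm 1: repeatedly solve the learned problem and exclude
    invalid action sets with generalized landmark constraints until a plan for
    the original problem is found or the encoding becomes infeasible."""
    for n in range(1, max_iterations + 1):
        L_n = planner.solve()                  # set of (action, time) pairs, or None if infeasible
        if L_n is None:
            return None                        # no plan exists for the learned problem at this horizon
        if validator.is_valid_plan(L_n):
            return L_n                         # valid plan for the original problem
        planner.add_generalized_landmark(L_n)  # constraint (17): exclude exactly this action set
    return None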


6     Experimental Results

In this section, we evaluate the effectiveness of factored planning with BNNs. First, we present the benchmark domains used to test the efficiency of our learning and factored planning framework with BNNs. Second, we present the accuracy of BNNs in learning complex state transition models for factored planning problems. Third, we test the efficiency and scalability of planning with FD-SAT-Plan and FD-BLP-Plan on the learned factored planning problems Π̃ across multiple problem sizes and horizon settings. Fourth, we demonstrate the effectiveness of Algorithm 1 in finding a plan for the factored planning problem Π. Finally, we test the effectiveness of factored planning with the best performing encoding over the benchmark domains with reward specifications.

6.1    Domain Descriptions

The deterministic RDDL domains used in the experiments, namely Navigation [Sanner and Yoon, 2011], Inventory Control (Inventory) [Mann and Mannor, 2014], and System Administrator (SysAdmin) [Guestrin et al., 2001; Sanner and Yoon, 2011], are described below.

Navigation models an agent in a two-dimensional (m-by-n) maze with obstacles, where the goal of the agent is to move from the initial location to the goal location by the end of horizon H. The transition function T describes the movement of the agent as a function of the topological relation of its current location to the maze, the moving direction, and whether the location the agent tries to move to is an obstacle or not. This domain is a deterministic version of its original from IPPC2011 [Sanner and Yoon, 2011]. Both the action and the state spaces are Boolean. We report the results on instances with two maze sizes m-by-n and three horizon settings H, where m = 3, 4, n = 3 and H = 4, 6, 8.

Inventory describes the inventory management control problem with alternating demands for a product over time, where the management can order a fixed amount of units to increase the number of units in stock at any given time. The transition function T updates the state based on the change in stock as a function of demand, the current order quantity, and whether an order has been made or not. The action space is Boolean (either order a fixed positive integer amount, or do not order) and the state space is non-negative integer. We report the results on instances with two demand cycle lengths d and three horizon settings H, where d = 2, 4 and H = 4, 6, 8.

SysAdmin models the behavior of a computer network P, where the administrator can reboot a limited number of computers to keep the number of computers running above a specified safety threshold over time. The transition function T describes the status of a computer, which depends on its topological relation to other computers, its age and whether it has been rebooted or not, and the age of the computer, which depends on its current age and whether it has been rebooted or not. This domain is a deterministic modified version of its original from IPPC2011 [Sanner and Yoon, 2011]. The action space is Boolean, the state space is non-negative integer, and concurrency between actions is allowed. We report the results on instances with two network sizes |P| and three horizon settings H, where |P| = 3, 4 and H = 2, 3, 4.

6.2    Transition Learning Performance

In Table 1, we present test errors for different configurations of the BNNs on each domain instance, where the sample data was generated using a simple stochastic exploration policy. For each instance of a domain, state transition pairs were collected and the data was treated as independent and identically distributed. After random permutation, the data was split into training and test sets with a 9:1 ratio. The BNNs were trained on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of memory using the code available from [Hubara et al., 2016]. Overall, Navigation instances required the smallest BNN structures for learning due to their purely Boolean state and action spaces, while both Inventory and SysAdmin instances required larger BNN structures for accurate learning, owing to their non-Boolean state and action spaces.

Table 1: Transition learning performance measured by error on test data (in %) for all domains and instances.

Domain            Network Structure    Test Error (%)
Navigation(3,3)   12:64:64:9           2.12
Navigation(4,3)   16:80:80:12          6.59
Inventory(2)      7:96:96:5            5.51
Inventory(4)      8:128:128:5          4.58
SysAdmin(3)       12:128:128:9         3.73
SysAdmin(4)       16:96:96:96:12       8.99

6.3    Planning Performance on the Learned Factored Planning Problems

We compare the effectiveness of using a SAT-based encoding against a BLP-based encoding, namely FD-SAT-Plan and FD-BLP-Plan, to find plans for the learned factored planning problem Π̃. We ran the experiments on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of memory. For FD-SAT-Plan and FD-BLP-Plan, we used the Glucose [Audemard and Simon, 2014] and CPLEX 12.7.1 [IBM, 2017] solvers respectively, with a 30-minute total time limit per domain instance.

Overall Performance Analysis

In Figure 2, we present the runtime performance of FD-SAT-Plan (Red Bar) and FD-BLP-Plan (Blue Bar) over the Navigation (Figure 2a), Inventory (Figure 2b) and SysAdmin (Figure 2c) domains. The analysis of the runtime performances across all three domains shows that an increase in the size of the underlying BNN structure (as presented in Table 1) significantly increases the computational effort required to find a plan. Similarly, for both planners we observed that increasing the size of the horizon, with the exception of instance (Nav,3,8), increases the cost of computing a plan.

Comparative Performance per Encoding

The pairwise comparison of FD-SAT-Plan and FD-BLP-Plan over all problem settings shows a clear dominance of FD-BLP-Plan over FD-SAT-Plan in terms of runtime performance. Overall, FD-SAT-Plan computed plans for 15 out of 18 instances while FD-BLP-Plan successfully found plans for all instances. Moreover, with the exception of instances (Nav,4,6) and (Nav,4,8), FD-BLP-Plan found plans for the learned factored planning problems under 20 seconds. On average, FD-BLP-Plan is two orders of magnitude faster than FD-SAT-Plan.


Figure 2: Timing comparison between FD-SAT-Plan (Red Bar) and FD-BLP-Plan (Blue Bar) on (a) Navigation, (b) Inventory and (c) SysAdmin. Over all problem settings, the BLP encoding of the learned factored planning problem consistently outperformed its SAT equivalent.

Figure 3: Timing comparison between FD-BLP-Plan (Blue Bar) and FD-BLP-Plan+ (Green Bar) on (a) Navigation, (b) Inventory and (c) SysAdmin. The additional computational resources required to solve the factored planning problem with reward specifications varied across domains, where the computational effort for finding a plan increased minimally, moderately and significantly in the Inventory, Navigation and SysAdmin domains, respectively.

6.4    Planning Performance on the Factored Planning Problems

We test the planning efficiency of the incremental factored planning algorithm using the best performing planner, that is, FD-BLP-Plan, for solving the factored planning problem Π. Overall, only three instances, namely (Inv,4,8), (Sys,4,3) and (Sys,4,4), required constraint generation to find a plan, and the maximum number of constraints required was equal to one. For the instances that required the generation of landmark constraints, the runtime performance of FD-BLP-Plan was almost identical to the results presented in Figure 2.

6.5    Planning Performance on the Factored Planning Problems with Reward Specifications

Finally, we extend FD-BLP-Plan to handle domains with reward specifications. Hereby, we extend the definition of the factored planning problem Π to include the reward function Q : S × A → R such that Π+ = ⟨S, A, C, T, I, G, Q⟩. Given a planning horizon H, an optimal solution to Π+ is a plan that maximizes the total reward Σ_{t=1}^{H} Q(s_{t+1}, a_t) over all time steps. Similar to HD-MILP-Plan [Say et al., 2017], we assume knowledge of the reward function Q and add Q as an objective function to our BLP encoding, which we denote as FD-BLP-Plan+.
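As a sketch of this extension, a reward that decomposes linearly over the binary state and action variables can be added as a PuLP objective as below; the horizon, variable names and reward coefficients are illustrative assumptions, and the transition constraints (10)-(16) and global constraints are omitted.

from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

H = 4
states, actions = ["s1", "s2"], ["a1"]

model = LpProblem("fd_blp_plan_plus_sketch", LpMaximize)
Y = {(s, t): LpVariable(f"Y_{s}_{t}", cat=LpBinary) for s in states for t in range(1, H + 2)}
X = {(a, t): LpVariable(f"X_{a}_{t}", cat=LpBinary) for a in actions for t in range(1, H + 1)}

state_reward = {"s1": 1.0, "s2": 2.0}   # illustrative coefficients standing in for Q(s_{t+1}, a_t)
action_cost = {"a1": 0.5}

# Objective: sum over time steps of Q(s_{t+1}, a_t), here a linear reward minus action costs.
model += (lpSum(state_reward[s] * Y[(s, t + 1)] for s in states for t in range(1, H + 1))
          - lpSum(action_cost[a] * X[(a, t)] for a in actions for t in range(1, H + 1)))
# ... the transition constraints (10)-(16) and global constraints would be added here ...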
In Figure 3, we show results for solving the factored planning problems with reward specifications using Algorithm 1. The pairwise comparison of FD-BLP-Plan and FD-BLP-Plan+ over all domain instances shows that the additional computational effort required to solve Π+ is minimal, moderate and significant in the Inventory, Navigation and SysAdmin domains, respectively. Especially in the (Sys,4,3) and (Sys,4,4) instances, FD-BLP-Plan+ ran out of time at its fifth iteration of Algorithm 1, as the solver could not prove the optimality of the incumbent solution found.

7     Conclusion

In this work, we utilized the efficiency and ability of BNNs to learn complex state transition models of factored planning domains with discretized state and action spaces. We introduced two novel compilations, a SAT (FD-SAT-Plan) and a BLP (FD-BLP-Plan) encoding, that directly exploit the structure of BNNs to plan for the learned factored planning problem and that provide optimality guarantees with respect to the learned model if they successfully terminate. We further introduced an incremental factored planning algorithm based on generalized landmark constraints that improves the planning accuracy of both encodings. Finally, we extended the best performing encoding to handle factored planning problems with reward specifications (FD-BLP-Plan+). Empirical results showed that we can accurately learn complex state transition models using BNNs, and demonstrated strong performance in goal-oriented and reward-oriented planning in both the learned and original domains. In sum, this work provides a novel and effective factored state and action transition learning and planning formalism to the data-driven model-based planning community.


References

[Abío and Stuckey, 2014] Ignasi Abío and Peter J. Stuckey. Encoding linear constraints into SAT. In Principles and Practice of Constraint Programming, pages 75–91. Springer International Publishing, 2014.

[Amir and Chang, 2008] Eyal Amir and Allen Chang. Learning partially observable deterministic action models. JAIR, 33:349–402, 2008.

[Audemard and Simon, 2014] Gilles Audemard and Laurent Simon. Lazy clause exchange policy for parallel SAT solvers, pages 197–205. Springer International Publishing, 2014.

[Biere et al., 2009] Armin Biere, Marijn Heule, Hans van Maaren, and Toby Walsh, editors. Handbook of Satisfiability, volume 185 of Frontiers in Artificial Intelligence and Applications. IOS Press, 2009.

[Boutilier et al., 1999] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. JAIR, 11(1):1–94, 1999.

[Collobert et al., 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.

[Davies et al., 2015] Toby O. Davies, Adrian R. Pearce, Peter J. Stuckey, and Nir Lipovetzky. Sequencing operator counts. In 25th ICAPS, pages 61–69, 2015.

[Davis and Putnam, 1960] Martin Davis and Hilary Putnam. A computing procedure for quantification theory. Journal of the ACM, 7(3):201–215, 1960.

[Deng et al., 2013] Li Deng, Geoffrey E. Hinton, and Brian Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603, 2013.

[Guestrin et al., 2001] Carlos Guestrin, Daphne Koller, and Ronald Parr. Max-norm projections for factored MDPs. In 17th IJCAI, pages 673–680, 2001.

[Helmert, 2006] Malte Helmert. The Fast Downward planning system. JAIR, 26(1):191–246, 2006.

[Hubara et al., 2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In 29th NIPS, pages 4107–4115. Curran Associates, Inc., 2016.

[IBM, 2017] IBM. IBM ILOG CPLEX Optimization Studio CPLEX User's Manual, 2017.

[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In 32nd ICML, pages 448–456, 2015.

[Keller and Helmert, 2013] Thomas Keller and Malte Helmert. Trial-based heuristic tree search for finite horizon MDPs. In 23rd ICAPS, pages 135–143, 2013.

[Kocsis and Szepesvári, 2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ECML, pages 282–293, 2006.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In 25th NIPS, pages 1097–1105, 2012.

[Mann and Mannor, 2014] Timothy Mann and Shie Mannor. Scaling up approximate value iteration with options: Better policies with fewer iterations. In 21st ICML, volume 1, 2014.

[Nair and Hinton, 2010] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In 27th ICML, pages 807–814, 2010.

[Richter and Westphal, 2010] Silvia Richter and Matthias Westphal. The LAMA planner: Guiding cost-based anytime planning with landmarks. JAIR, 39(1):127–177, 2010.

[Sanner and Yoon, 2011] Scott Sanner and Sungwook Yoon. International probabilistic planning competition. 2011.

[Sanner, 2010] Scott Sanner. Relational dynamic influence diagram language (RDDL): Language description. 2010.

[Say et al., 2017] Buser Say, Ga Wu, Yu Qing Zhou, and Scott Sanner. Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming. In 26th IJCAI, pages 750–756, 2017.

[Silver et al., 2016] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, pages 484–503, 2016.

[Silver et al., 2017] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 2017.

[Sinz, 2005] Carsten Sinz. Towards an optimal CNF encoding of Boolean cardinality constraints, pages 827–831. Springer Berlin Heidelberg, 2005.

[Yang et al., 2007] Qiang Yang, Kangheng Wu, and Yunfei Jiang. Learning action models from plan examples using weighted MAX-SAT. AIJ, 171(2):107–143, 2007.
