From Statistical Relational to Neural Symbolic Artificial Intelligence: a Survey.

arXiv:2108.11451v1 [cs.AI] 25 Aug 2021

              Giuseppe Marra1, Sebastijan Dumančić1, Robin Manhaeve1, and Luc De Raedt1,2
                               firstname.lastname@kuleuven.be
               1 KU Leuven, Department of Computer Science and Leuven.AI
               2 Örebro University, Center for Applied Autonomous Sensor Systems

                                      August 27, 2021

                                        Abstract

    Neural-symbolic and statistical relational artificial intelligence both integrate
frameworks for learning with logical reasoning. This survey identifies several
parallels across seven different dimensions between these two fields. These can
not only be used to characterize and position neural-symbolic artificial intelligence
approaches but also to identify a number of directions for further research.

1    Introduction

The integration of learning and reasoning is one of the key challenges in artificial
intelligence and machine learning today, and various communities have been
addressing it. That is especially true for the field of neural-symbolic computation
(NeSy) [10, 21], where the goal is to integrate symbolic reasoning and neural
networks. NeSy already has a long tradition, and it has recently attracted a lot
of attention from various communities (cf. the keynotes of Y. Bengio and H.
Kautz on this topic at AAAI 2020, and the AI Debate [9] between Y. Bengio and
G. Marcus).
    Another domain that has a rich tradition in integrating learning and reasoning
is that of statistical relational learning and artificial intelligence (StarAI) [39, 85].
Rather than focusing on integrating logic and neural networks, however, it is
centred around the question of integrating logic with probabilistic reasoning,
more specifically probabilistic graphical models. Despite the common interest in
combining symbolic reasoning with a basic paradigm for learning, i.e., probabilistic
graphical models or neural networks, it is surprising that there are not more
interactions between these two fields.

This discrepancy is the key motivation behind this survey: it aims to point
out the similarities between these two endeavours and thereby to stimulate
cross-fertilization. In doing so, we start from the literature on StarAI,
following the key concepts and techniques outlined in several textbooks and
tutorials such as [92, 85], because it turns out that the same issues and techniques
that arise in StarAI apply to NeSy as well. As the key contribution of this survey,
we identify seven dimensions that these fields have in common and that can be
used to categorize both StarAI and NeSy approaches. These seven dimensions are
concerned with (1) type of logic, (2) model vs proof-based inference, (3) directed
vs undirected models, (4) logical semantics, (5) learning parameters or structure,
(6) representing entities as symbols or sub-symbols, and (7) integrating logic
with probability and/or neural computation. We provide evidence for our claim
by positioning a wide variety of StarAI and NeSy systems along these dimensions
and pointing out analogies between them. This provides not only new insights
into the relationships between StarAI and NeSy, but it also allows one to carry
over and adapt techniques from one field to another. Thus the insights provided
in this paper can be used to create new opportunities for cross-fertilization
between StarAI and NeSy, by focusing on those dimensions that have not been
fully exploited yet. The classification of numerous methods within the same
categories sometimes comes at the cost of oversimplification. Thus, the individual
dimensions are accompanied by examples of specific methods. For each example,
a final discussion frames the specific technique inside the dimension. With this
approach, we present a very broad overview of the research field but we still
provide specific intuitions on how the different features are implemented. Unlike
some other perspectives on neural-symbolic computation [10, 21], the present
survey limits itself to a logical and probabilistic perspective, which it inherits
from StarAI, and to developments in neural-symbolic computation that are
consistent with this perspective. Furthermore, it focuses on representative and
prototypical systems rather than aiming at completeness (which would not be
possible given the fast developments in the field). Several other surveys about
neural symbolic AI have been proposed. An early overview of neural-symbolic
computation is that of [4]. Unlike the present survey, it focuses very much on
a logical and reasoning perspective. Today, the focus has shifted largely towards
learning. More recently, [59] analysed the intersection between NeSy and
graph neural networks (GNN). [105] described neural symbolic systems in terms
of compositions of building blocks, characterized by a few patterns concerning
processes and exchanged data. In contrast, this survey focuses more on the
underlying principles that govern such a composition. [20] instead takes a neural
network viewpoint, investigating into which components (i.e., input, loss or
structure) symbolic knowledge is injected.
    The following sections of the paper each describe one dimension. We summa-
rize various neural-symbolic approaches along these dimensions in Table 1. For
ease of writing, we do not always repeat the references to these approaches in
the paper; the table mentions the key reference for each of them.

Legend. Dimension 1: (P)ropositional, (R)elational, (FOL) first-order logic, (LP) logic programs.
Dimension 2: (M)odel-based, (P)roof-based. Dimension 3: (D)irected, (U)ndirected.
Dimension 4: (L)ogic, (P)robability, (F)uzzy. Dimension 5: (P)arameter, (S)tructure.
Dimension 6: (S)ymbols, (Sub)symbols. Dimension 7: Logic (L/l), Probability (P/p), Neural (N/n).

                        Dim 1   Dim 2   Dim 3   Dim 4   Dim 5   Dim 6   Dim 7
    ∂ILP [31]            R       P       D       F       P+S     S       Ln
    DeepProbLog [64]     LP      P       D       P       P       S+Sub   LpN
    DiffLog [99]         R       P       D       F       P+S     S       Ln
    LRNN [125]           LP      P       D       F       P+S     S+Sub   Ln
    LTN [26]             FOL     M       U       F       P       Sub     lN
    NeuralLP [118]       R       M       D       L       P       S       Ln
    NeurASP [119]        LP      P       D       P       P       S       LpN
    NGS [60]             LP      P       D       L       P       S       Ln
    NLM [27]             R       M       D       L       P+S     S       Ln
    NLog [103]           LP      P       D       L       P       S       Ln
    NLProlog [112]       LP      P       D       P       P+S     S+Sub   LpN
    NMLN [69]            FOL     M       U       P       P+S     S+Sub   lPN
    NTP [90]             R       P       D       L       P+S     S+Sub   Ln
    RNM [67]             FOL     M       U       P       P       S+Sub   lPN
    SL [114]             P       M       U       P       P       S+Sub   lPN
    SBR [25]             FOL     M       U       F       P       Sub     lN
    Tensorlog [13]       R       P       D       P       P       S+Sub   Ln

        Table 1: Taxonomy of a (non-exhaustive) list of NeSy models according to the 7 dimensions outlined in the paper.
2      Logic
Let us start by providing an introduction to clausal logic. We focus on clausal
logic as it is a standard form to which any first order logical formula can be
converted.
    In clausal logic, everything is represented in terms of clauses. More formally,
a clause is an expression of the form h1 ∨ ... ∨ hk ← b1 ∧ ... ∧ bn. The hi are
head literals or conclusions, while the bi are body literals or conditions. Clauses
with no conditions (n = 0) and one conclusion (k = 1) are facts. Clauses with
only one conclusion (k = 1) are definite clauses. Definite clauses are the basic
constructs used in the programming language Prolog.

    Example 1 (Propositional Clausal logic). Consider the famous alarm
    problem expressed as a set of definite clauses.
                     burglary.
                     hears_alarm_mary.

                     earthquake.
                     hears_alarm_john.

                     alarm ← earthquake.
                     alarm ← burglary.
                     calls_mary ← alarm,hears_alarm_mary.
                     calls_john ← alarm,hears_alarm_john.

    In the above example, the literals did not have any internal structure; we
were working in propositional logic. This contrasts with first-order logic, in which
the literals take the form p(t1 , ..., tm ), with p a predicate of arity m and the ti
terms, that is, constants, variables, or structured terms of the form f (t1 , ..., tq ),
where f is a functor and the ti are again terms. Intuitively, constants represent
objects or entities, functors represent functions, variables make abstraction of
specific objects, and predicates specify relationships amongst objects. The subset
of first order logic where there are no functors is called relational logic.

    Example 2 (Clausal logic). In contrast to the previous example, we now
    write the theory in a more compact manner using first order logic. By
    convention, constants start with a lowercase letter, while variables start
    with an uppercase. Essential is the use of the variable X in the rule for the
    calls predicate, which is implicitly universally quantified, and which states
    that X will call when the alarm goes off, and X hears_alarm.
                     burglary.
                     hears_alarm(mary).

earthquake.
                    hears_alarm(john).

                    alarm ← earthquake.
                    alarm ← burglary.
                    calls(X) ← alarm, hears_alarm(X).

    Let us also introduce some basic concepts that will be useful for the rest
of the paper. When an expression (i.e., clause, atom or term) does not contain
any variable it is called ground. A substitution θ is an expression of the form
{X1 /c1 , ..., Xk /ck } with the Xi different variables, the ci terms. Applying a
substitution θ to an expression e (term, atom or clause) yields the instantiated
expression eθ where all variables Xi in e have been simultaneously replaced by
their corresponding terms ci in θ. We can take for instance the atom calls(X)
and apply the substitution {X/mary} to yield calls(mary).
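    To make the notion of substitution concrete, the following Python sketch (ours, not from the paper; the atom representation is an arbitrary choice) applies a substitution to an atom represented as a predicate name with a list of arguments:

        # Atoms are (predicate, arguments); by convention, variables start with an uppercase letter.
        def apply_substitution(atom, theta):
            pred, args = atom
            return (pred, [theta.get(a, a) for a in args])

        print(apply_substitution(("calls", ["X"]), {"X": "mary"}))   # ('calls', ['mary'])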
    Propositional logic is a subset of relational logic, which is itself a subset of first
order logic. Therefore, first order logic is also more expressive than relational
logic, which in turn is more expressive than propositional logic.
    Propositional and first-order logic form the two extremes on the spectrum of
logical reasoning frameworks and are essential for understanding the capabilities
of StarAI and NeSy systems. Propositional logic is the simplest and, consequently,
the least expressive formalism. However, due to the mentioned restrictions,
inference for propositional logic is decidable, whereas it is only semi-decidable for
first order logic. The major weakness of the propositional restriction is that specifying
complex knowledge can be tedious and requires substantial effort. The strengths
and weaknesses of first-order logic are complementary: due to its expressiveness,
complex problems are easy to specify but come with a computational price.
Relational logic is somewhat in the middle and is more in line with a database
perspective.
    Interestingly, any problem expressed in first-order logic can be equivalently
expressed in relational logic; any problem expressed in relational logic can
likewise be expressed in propositional logic by grounding out the clauses [83, 34].
Grounding is the process whereby all possible substitutions that ground the
clause are applied. Notice that grounding a first order logical theory may result in
an infinite set of ground clauses (when there are functors), and an exponentially
larger set of clauses (when working with finite domains). The rules of chess can
fit a single page if written in first-order logic, while they take several hundred
pages if grounded out in propositional logic.
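    As an illustration, the following Python sketch (ours, a toy under the stated representation, not the paper's code) grounds the clause calls(X) ← alarm, hears_alarm(X) over the finite domain {mary, john} by enumerating all substitutions of its variables:

        from itertools import product

        def variables(clause):
            # variables are the arguments starting with an uppercase letter
            return sorted({a for _, args in clause for a in args if a[0].isupper()})

        def ground(clause, domain):
            for values in product(domain, repeat=len(variables(clause))):
                theta = dict(zip(variables(clause), values))
                yield [(p, [theta.get(a, a) for a in args]) for p, args in clause]

        # head first, then body literals
        clause = [("calls", ["X"]), ("alarm", []), ("hears_alarm", ["X"])]
        for instance in ground(clause, ["mary", "john"]):
            print(instance)

    For the two constants, this yields the two ground clauses calls(mary) ← alarm, hears_alarm(mary) and calls(john) ← alarm, hears_alarm(john).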

StarAI and NeSy along Dimension 1 Understanding which type of logic
a StarAI or NeSy system is built on is important for assessing the capabilities
of that system. StarAI approaches typically focus on the most expressive
logics, such as logic programming [22, 94] and first-order logic [88]. For NeSy,
systems based on propositional logic, like Semantic Loss (SL) [114], can do the
simplest logical reasoning, but often do it efficiently. Datalog and relational
logic-based systems are well-suited for problems that require database queries.
Datalog systems are the most predominant ones in NeSy, like DiffLog [99], ∂ILP
[31], Lifted Relational Neural Networks [125] and Neural Theorem Provers [90].
Systems leveraging answer-set programming (see below), like NeurASP [119],
are suited not only for database queries, but also for common-sense reasoning
and reasoning with preferences. Systems based on logic programming and Prolog,
like DeepProbLog [64], NLog [103] and NLProlog [112], are suited for tasks that
require a full-fledged programming language for, e.g., data structure or state
manipulations. Grammars, like CFGs [50] or unification-based grammars [98],
have often been targeted in the logic programming community, cf. Definite
Clause Grammars [34]. The close relationship between the two formalisms has given
rise to grammar-based neural symbolic systems, like NGS [60] and DeepStochLog
[113], that are very close to logic programming based systems. Finally, some
systems are not restricted to definite clauses and allow general first-order-logic
theories, like Logic Tensor Networks [26], Semantic Based Regularization [25],
Relational Neural Machines [67] and Logical Neural Networks [89].

    Dimension 1: Propositional, Relational, First Order Logic

    Propositional logic is a subset of relational logic, which is itself a subset of
    first order logic. Therefore, first order logic is also more expressive than
    relational logic, which in turn is more expressive than propositional logic.
    Propositional and first-order logic form the two extremes on the spectrum
    of logical reasoning frameworks and are essential for understanding the
    capabilities of StarAI and NeSy systems. Propositional logic is the
    simplest and, consequently, the least expressive formalism but inference
    for propositional logic is decidable. First order logic is a more expressive
    and compact formalism but inference is only semi-decidable. Relational
    logic is somewhat in the middle and is more in line with a database
    perspective.

3    About proofs and models, and rules and constraints
So far we have introduced the syntax of clausal logic but have neither discussed
semantics nor inference. The semantics of different approaches is usually defined
in terms of models. For inference, one is usually interested in finding proofs for
certain logical queries or one wants to find assignments to certain variables that
satisfy a given theory.
    In the setting of logic programming, definite clauses (rules) are interpreted
as computational rules (compute h by computing b1 , ... , and bn ) and are
typically used for forward or backward inference to prove that certain atoms hold.
Inference typically proceeds by searching for proofs for queries, as illustrated in
Example 3. This gives rise to a proof-theoretic perspective on logic. Although
proofs and proof-trees in Prolog are built using SLD- or SLDNF-resolution, we
depict the proofs as an AND-OR tree for ease of exposition.

  Example 3 (Logic programs and proofs). Consider the following logic
  program:
           burglary.
           hears_alarm_mary.

           earthquake.
           hears_alarm_john.

           alarm :- earthquake.
           alarm :- burglary.
           calls_mary :- alarm,hears_alarm_mary.
           calls_john :- alarm,hears_alarm_john.
  and the proofs for the query calls_mary as an AND/OR tree.
                                   calls_mary
                                   └─ AND
                                      ├─ alarm
                                      │  └─ OR
                                      │     ├─ burglary
                                      │     └─ earthquake
                                      └─ hears_alarm_mary

     In logic programs, we use Prolog’s ":-" instead of ←, to differentiate its
  semantics from first order logic implications. The rules for alarm state that
  there will be an alarm if there is a burglary or an earthquake.
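    The proof-theoretic use of definite clauses can be sketched in a few lines of Python (ours, for illustration only; a real Prolog engine additionally handles variables via unification). The program of Example 3 is encoded as a map from each atom to the list of bodies of its clauses, and a query is proven by backward chaining:

        program = {
            "burglary": [[]], "earthquake": [[]],
            "hears_alarm_mary": [[]], "hears_alarm_john": [[]],
            "alarm": [["earthquake"], ["burglary"]],
            "calls_mary": [["alarm", "hears_alarm_mary"]],
            "calls_john": [["alarm", "hears_alarm_john"]],
        }

        def prove(query):
            # OR over the clauses defining the query, AND over the body literals of each clause
            return any(all(prove(b) for b in body) for body in program.get(query, []))

        print(prove("calls_mary"))   # True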

    On the other hand, we have the model theoretic perspective on logic that
relies on the notions of interpretations and models. In this paper, we restrict ourselves to
Herbrand models and Herbrand interpretations. The Herbrand base of a set of
clauses is the set of ground atoms that can be constructed using the predicates,
functors and constants occurring in the theory.
Definition 1 (Interpretation and possible world). A Herbrand interpretation,
or a possible world, is a set of truth assignments {a1 = v1 , ..., an = vn }, where
a1 , ..., an are all the ground atoms in the Herbrand base and vi are the corre-
sponding assigned truth values.
   Equivalently, one can define an interpretation as a subset of the Herbrand
base containing only the true atoms, while all the others are false.
     A Herbrand interpretation is a model of a clause h1 ∨ ... ∨ hk ← b1 ∧ ... ∧ bn
if for every substitution θ such that b1θ ∧ ... ∧ bnθ is true in the interpretation,
at least one of the hiθ is true in the interpretation as well. An interpretation I
is a model of a theory T, written I |= T, if it is a model of all clauses in the
theory; we then say that the theory is satisfiable. The satisfiability problem, that
is, deciding whether a theory has a model, is one of the most fundamental ones
in computer science (cf. the SAT problem for propositional logic).
     Differently from proof-based techniques, the model-theoretic ones use logic
as constraints on a set of variables: the constraints state that the variables are
related to one another, without specifying any directed relationship between them.
More details on these connections can be found in [85, 34].

  Example 4 (Model-theoretic). Consider the theory composed of the fol-
  lowing clauses:

            calls_mary ← hears_alarm_mary ∧ alarm
            calls_john ← hears_alarm_john ∧ alarm
            alarm ← burglary
            alarm ← earthquake
      A model of the previous theory is the set:

           M = {burglary, hears_alarm_john, alarm, calls_john}

  By considering all the elements of this set True and all the others False the
  four clauses are satisfied.
    The model theoretic semantics of full clausal logic differs from that of logic
programs, which consist of definite clauses. The model-theoretic semantics of a
clausal theory corresponds to the set of all Herbrand models, while for definite
clausal logic programs it is given by the smallest Herbrand model with respect
to set inclusion, the so-called least Herbrand model (LHM), which is unique. We
say that a logic program T entails an atom denoted T |= e if and only if e is true
in the least Herbrand model of T. This corresponds to making the closed world
assumption: every statement that cannot be proven is assumed to be false. The
least Herbrand semantics allows definite clauses to be used as a programming language,
and they form the basis for "pure" Prolog. It naturally supports data
structures and allows computing, for instance, transitive closures, which is impossible
in standard first order logic. This use of a least Herbrand model is important
because there are models of a definite clause theory that are not minimal, as
shown in Example 5.

  Example 5 (Model-based vs logic program semantics). Let us consider the
  following set of clauses:

edge(1,2) ← True
              path(A,B) ← edge(A,B)
              path(A,B) ← edge(A,C) ∧ path(C,B)
     If we consider them as clauses of a logic program, then the unique least
  Herbrand model (LHM) is:

                              M LHM = {edge(1, 2), path(1, 2)}

      On the other hand, the model-based semantics allows for all the models
  of the theory:

                        M1 = MLHM = {edge(1, 2), path(1, 2)}
                        M2 = {edge(1, 2), path(1, 2), path(1, 1)}
                        M3 = {edge(1, 2), path(1, 2), path(2, 1)}
                        ...
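    The least Herbrand model of a definite clause program can be computed bottom-up with the immediate consequence operator: repeatedly add the head of every ground clause whose body is already true, until nothing changes. A minimal Python sketch (ours; the grounding over the constants {1, 2} is hard-coded for brevity) for the program above:

        from itertools import product

        constants = ["1", "2"]
        clauses = [(("edge", ("1", "2")), [])]                       # edge(1,2).
        for a, b in product(constants, repeat=2):                    # path(A,B) <- edge(A,B)
            clauses.append((("path", (a, b)), [("edge", (a, b))]))
        for a, b, c in product(constants, repeat=3):                 # path(A,B) <- edge(A,C), path(C,B)
            clauses.append((("path", (a, b)), [("edge", (a, c)), ("path", (c, b))]))

        model, changed = set(), True
        while changed:                                               # iterate T_P until a fixpoint
            changed = False
            for head, body in clauses:
                if all(atom in model for atom in body) and head not in model:
                    model.add(head)
                    changed = True

        print(model)   # {('edge', ('1', '2')), ('path', ('1', '2'))}, i.e. the least Herbrand model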

    These differences are also important for StarAI and NeSy. Indeed, StarAI and
NeSy systems based on first-order logic, such as Markov logic networks [88] and
Probabilistic Soft Logic [3], cannot model transitive closure, which can lead to
unintuitive inference results. They view, as we shall show in Example 14, logical
formulae as (soft) constraints. In contrast, systems based on logic programming,
such as ProbLog [22], have no difficulties with transitive closure. They use the
clauses as inference rules to build proofs and derivations.
    It is worth noting that there are various flavours of logic programming.
Datalog is the relational subset of definite clause logic; it is strongly related
to database languages such as SQL. Furthermore, because it prohibits the use
of structured terms, it guarantees termination. Answer-set programming [37]
is a popular logic programming framework that is not restricted to definite
clauses and that takes the constraint perspective. Answer-set programs can have
multiple models and support features such as soft and hard constraints and
preferences. For a detailed introduction to answer-set programs, we refer to [37].
    The difference between the logic programming perspective and that of full
clausal logic can thus be related to the difference between a proof theoretic
and a model theoretic perspective. In the model theoretic perspective, we view
the clauses as constraints that need to be satisfied, while in the proof theoretic
perspective, we view them as rules to answer particular queries. This is clear
when looking at propositional logic: propositional definite clauses can be viewed
as simple IF ... THEN rules that can be chained in the forward or the backward
direction in order to derive new conclusions1 , while propositional clauses in a
SAT theory are disjunctive constraints that need to be satisfied.

   1 General clauses can be used in proofs.

StarAI along Dimension 2 Many StarAI systems use logic as a kind of
template to ground out the relational model in order to obtain a grounded
model and perform inference. This is akin to the model-based perspective
of logic. This grounded model can be a graphical model, or alternatively, a
ground weighted logical theory on which traditional inference methods apply,
such as belief propagation or weighted model counting. This is used in well
known systems such as Markov Logic Networks (MLNs) [88], Probabilistic
Soft Logic (PSL) [3], Bayesian logic programs (BLPs) [54] and probabilistic
relational models (PRMs) [36]. Some systems like PRMs and BLPs additionally
use aggregates, or combining rules, in order to combine multiple conditional
probability distributions into one using, e.g., noisy-or.
    Alternatively, one can follow a proof or trace based approach to define the
probability distribution and perform inference. This is akin to what happens in
probabilistic programming (cf. also [92]), in StarAI frameworks such as proba-
bilistic logic programs (PLPs) [86], probabilistic databases [106] and probabilistic
unification based grammars such as Stochastic Logic Programs (SLPs) [74]. Just
like pure logic supports the model-theoretic and proof-theoretic perspectives,
both perspectives have been explored in parallel for some of the probabilistic
logic programming languages such as ICL [81] and ProbLog [32].

NeSy along Dimension 2 These two perspectives carry over to the neural-
symbolic methods. Approaches like LRNN, LNN, NTPs, DeepProbLog, ∂ILP,
DiffLog, NeuralLP and Neural Logic Machines (NLM) [27] are proof-based.
The probabilities or certainties that these systems output are based on the
enumerated proofs, and they can also learn how to combine them.
    In contrast, approaches such as NeurASP, Logic Tensor Networks (LTNs) [26],
Semantic Based Regularization (SBR) [25], SL, Relational Neural Machines
(RNM) [67] and Neural Markov Logic Networks (NMLN) [69] are all based
on the model-theoretic perspective. Learning in these models is done through
learning the (shared) parameters over the ground model and inference is based
on possible groundings of the model.

    Dimension 2: Rules or Constraints
    In the model theoretic perspective, we view the clauses as constraints
    that need to be satisfied, while in the proof theoretic perspective, we view
    them as rules to answer particular queries.

4    Probabilistic graphical models
Probabilistic graphical models [58] are graphical models that compactly represent
a (joint) probability distribution P (X1 , ..., Xn ) over n discrete or continuous
random variables X1 , ..., Xn . The key idea is that the joint factorizes over some
factors f i specified over subsets X i of the variables {X1 , ..., Xn }.
                     P(X1, ..., Xn) = (1/Z) · f1(X^1) × ... × fk(X^k)

The random variables correspond to the nodes in the graphical structure,
and the factorization is determined by the edges in the graph.
    There are two classes of graphical models: directed, or Bayesian networks,
and undirected, or Markov Networks. In Bayesian networks, the underlying graph
structure is a directed acyclic graph, and the factors f i (Xi |parents(Xi )) cor-
respond to the conditional probabilities P (Xi |parents(Xi )), where parents(Xi )
denotes the set of random variables that are a parent of Xi in the graph. In
Markov networks, the graph is undirected and the factors f i (X i ) correspond to
the set of nodes X i that form (maximal) cliques in the graph. Furthermore, the
factors are non-negative and Z is a normalisation constant.
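    The following Python sketch (ours, with toy factor values invented for illustration) spells out the factorization: an unnormalized score is obtained by multiplying the factors, and Z is the sum of these scores over all assignments.

        from itertools import product

        # two binary factors over (A, B) and (B, C); the numbers are arbitrary
        f1 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
        f2 = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}

        unnorm = {(a, b, c): f1[a, b] * f2[b, c] for a, b, c in product((0, 1), repeat=3)}
        Z = sum(unnorm.values())
        joint = {world: score / Z for world, score in unnorm.items()}

        print(joint[(1, 0, 1)], sum(joint.values()))   # one entry of P(A,B,C), and total mass 1.0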

4.1    StarAI along Dimension 3
The distinction between directed and undirected graphical models [58] has
led to two distinct types of StarAI systems [85]. The first type of StarAI
systems generalizes directed models and resembles Bayesian networks. It includes
well-known representations such as plate notation [58], probabilistic relational
models (PRMs) [36], probabilistic logic programs (PLPs) [86], and Bayesian
logic programs (BLPs) [54]. Today the most typical and popular representatives
of this category are the probabilistic (logic) programs.
    Probabilistic logic programs were introduced by Poole [80] and the first
learning algorithm was given by Sato [93]. Probabilistic logic programs are
essentially definite clause programs where every fact is annotated with the
probability that it is true. This then results in a possible world semantics. The
reason why probabilistic logic programs are viewed as directed models is clear
when looking at the derivations for a query, cf. Example 3. At the top of the
AND-OR tree, there is the query that one wants to prove and the structure of the
tree is that of a directed graph (even though it need not be acyclic). One can also
straightforwardly map directed graphical models, i.e., Bayesian networks, on such
probabilistic logic programs by associating one definite clause to every entry in a
conditional probability table, i.e., a factor of the form P (X|Y1 , ..., Yn ). Assuming
boolean random variables, each entry x, y1 , ..., yn with parameter value v can be
represented using the definite clause X(x) ← Y1 (y1 ) ∧ ... ∧ Yn (yn ) ∧ px,y1 ,...,yn
and probabilistic facts v :: px,y1 ,...,yn . A probabilistic version of Example 3 is
shown in Example 6 using the syntax of ProbLog [22].

  Example 6 (ProbLog). We show a probabilistic extension for the alarm
  program using ProbLog notation.
           0.1::burglary.
           0.3::hears_alarm(mary).
           0.05::earthquake.
           0.6::hears_alarm(john).
           alarm :- earthquake.
           alarm :- burglary.

            calls(X) :- alarm, hears_alarm(X).
      This program can be mapped to the Bayesian network in Figure 1.

Figure 1: The Bayesian network corresponding to the ProbLog program in Example 6: burglary and earthquake are parents of alarm, and alarm together with hears_alarm(X) is a parent of calls(X), for X ∈ {john, mary}.
      This probabilistic logic program defines a distribution p over possible
   worlds ω. Let P be a ProbLog program and F = {p1 :: c1, ..., pn :: cn}
   be the set of ground probabilistic facts ci of the program and pi their
   corresponding probabilities. ProbLog defines a probability distribution over
   worlds ω in the following way:

               p(ω) = 0                                                     if ω ⊭ P
               p(ω) = ∏_{ci ∈ ω: ci = T} pi · ∏_{cj ∈ ω: cj = F} (1 − pj)   if ω ⊨ P

    The second type of StarAI systems generalizes undirected graphical models
like Markov networks or random fields. The prototypical example is Markov
Logic Networks (MLNs) [88], and also Probabilistic Soft Logic (PSL) [3] follows
this idea.
    Undirected StarAI methods define a set of weighted clauses w : h1 ∨ ... ∨ hk ←
b1 ∧ ... ∧ bm , and a domain D. The idea is that once the clauses are grounded over
the domain D, they become soft constraints. The higher the weight of a ground
clause, the less likely possible worlds that violate these constraints are. In the
limit, when the weight is +∞ the constraint must be satisfied and becomes a
pure logical constraint. The weighted clauses specify a more general relationship
between the conclusion and the condition than the definite clauses of directed
models. While clauses of undirected models can still be used in (resolution)
theorem provers, they are usually viewed as constraints that relate these two
sets of atoms as is common in Answer Set Programming [38].
    Undirected models can be mapped to an undirected probabilistic graphical
model in which there is a one-to-one correspondence between grounded weighted
clauses and factors, as we show in Example 7.

Figure 2: The Markov field corresponding to the Markov logic network in Example 7 (the same atoms as in Figure 1, now connected by undirected edges).

  Example 7 (Markov Logic Networks). We show a probabilistic extension
  of the theory in Example 4 using the formalism of Markov Logic Networks.
  We use a first order language with domain D = {john, mary} and weighted
  clauses α1, α2 and α3, i.e.:

               α1 : 1.5 :: calls(X) ← hears_alarm(X) ∧ alarm
               α2 : 2 :: alarm ← burglary
               α3 : 2 :: alarm ← earthquake

     In Figure 2, we show the corresponding Markov field.
      A Markov Logic Network defines a probability distribution over possible
   worlds as follows. Let A = [α1, ..., αn] be a set of logical clauses and let
   B = [β1, ..., βn] be the corresponding positive weights. Let θj be a grounding
   substitution for the clause αi over the domain D of interest and αiθj
   the corresponding ground clause. Finally, let 1(ω, αiθj) be an indicator
   function, evaluating to 1 if the ground clause is true in ω and 0 otherwise.
      The probabilistic semantics of Markov Logic is the distribution

                      p(ω) = (1/Z) exp( Σ_i βi Σ_j 1(ω, αiθj) )

      Intuitively, in MLNs, a world is more probable if it makes many ground
  clauses true.
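    The MLN semantics can be made concrete with a small Python sketch (ours; it enumerates all possible worlds of the example, which is only feasible for such tiny domains): each world is scored by the exponential of the summed weights of the ground clauses it satisfies, and the scores are normalized.

        from itertools import product
        from math import exp

        atoms = ["burglary", "earthquake", "alarm", "hears_alarm(mary)",
                 "hears_alarm(john)", "calls(mary)", "calls(john)"]
        # (weight, head, body) for every grounding of the three weighted clauses
        ground_clauses = [(1.5, f"calls({x})", [f"hears_alarm({x})", "alarm"]) for x in ("mary", "john")]
        ground_clauses += [(2.0, "alarm", ["burglary"]), (2.0, "alarm", ["earthquake"])]

        def score(world):
            w = dict(zip(atoms, world))
            # a ground clause head <- body is satisfied unless its body is true and its head false
            return exp(sum(beta for beta, head, body in ground_clauses
                           if w[head] or not all(w[a] for a in body)))

        worlds = list(product([False, True], repeat=len(atoms)))
        Z = sum(score(w) for w in worlds)
        p_calls_mary = sum(score(w) for w in worlds if dict(zip(atoms, w))["calls(mary)"]) / Z
        print(round(p_calls_mary, 3))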

4.2    NeSy along Dimension 3
Many neural symbolic systems retain the directed nature of logical inference
and can be classified as directed models. The most prominent members of this
category are NeSy systems based on Prolog or Datalog, such as Neural Theorem
Provers (NTPs) [90], NLProlog [112], DeepProbLog [64] and DiffLog [99]. Lifted
Relational Neural Networks (LRNNs) [125] and ∂ILP [31] are other examples of
non-probabilistic directed models, where weighted definite clauses are compiled
into a neural network architecture in a forward chaining fashion. The systems
that imitate logical reasoning with tensor calculus, Neural Logic Programming
(NeuralLP) [118] and Neural Logic Machines (NLM) [27], are likewise instances
of directed models. An example of a directed NeSy model is given in Example 8.

  Example 8 (Knowledge-Based Artificial Neural Networks). Knowledge-
  Based Artificial Neural Networks (KBANN) is one of the first methods to
  use definite clausal logic to template a neural network. They incorporate
  many of the common patterns of directed NeSy models.
  KBANN turns a program into a neural network in several steps:

     1. KBANN starts from a definite clause program.
     2. The program is turned into an AND-OR tree.
     3. The AND-OR tree is turned into a neural network with a similar
        structure. Nodes are divided into layers. The weights and the biases
         are set such that evaluating the network returns the same outcome as
        querying the program.
     4. New hidden units are added. Hidden units play the role of unknown
        rules that need to be learned. They are initialized with zero weights;
        i.e. they are inactive.

     5. New links are added from each layer to the upper one, obtaining the
        final neural network.
  An example of this process is shown in Figure 3. KBANN needs some
  restrictions on the kind of rules. In particular, the rules are assumed to
  be conjunctive, nonrecursive, and variable-free. Many of these restrictions
  are removed by more recent systems.
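    A minimal numerical sketch of the templating idea (ours; KBANN's actual weight-setting scheme differs in its details) encodes each rule as a thresholded unit: an OR unit fires when at least one antecedent is active, an AND unit only when all of them are.

        import numpy as np

        step = lambda x: float(x > 0)

        def or_unit(inputs, w=4.0):     # bias just below "one input active"
            return step(w * np.sum(inputs) - w * 0.5)

        def and_unit(inputs, w=4.0):    # bias just below "all inputs active"
            return step(w * np.sum(inputs) - w * (len(inputs) - 0.5))

        burglary, earthquake, hears_alarm_mary = 1.0, 0.0, 1.0
        alarm = or_unit([burglary, earthquake])            # alarm :- burglary. alarm :- earthquake.
        calls_mary = and_unit([alarm, hears_alarm_mary])   # calls_mary :- alarm, hears_alarm_mary.
        print(calls_mary)   # 1.0, the same answer as querying the logic program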
    Differently from the directed class, the undirected NeSy approaches do not
exploit clauses to perform logical reasoning (e.g. using resolution) but consider
logic as a constraint on the behaviour of a neural model. Rules are then used
as an objective function for the neural model rather than as a template for a
neural architecture. Thus, the undirected use of rules has a very large impact on how
the symbolic part is exploited w.r.t. the directed methods. A large group of
approaches, including Semantic Based Regularization (SBR) [25], Logic Tensor
Networks (LTN) [26], Semantic Loss (SL) [114] and DL2 [33], exploits logical
knowledge as a soft regularization constraint over the hypothesis space in a
way that favours solutions consistent with the encoded knowledge. SBR and
LTN compute atom truth assignments as the output of a neural network and
translate the provided logical formulas into a real valued regularization loss

Figure 3: Knowledge-Based Artificial Neural Network: the network creation process. (1) the initial logic program; (2) the AND-OR tree for the query calls_mary; (3) mapping the tree into a neural network; (4) adding hidden neurons; (5) adding interlayer connections.

term using fuzzy logic. SL uses marginal probabilities of the target atoms to
define the regularization term and relies on arithmetic circuits [18] to evaluate
it efficiently, as detailed in Example 9. DL2 defines a numerical loss providing
no specific semantics (probability or fuzzy), which allows including numerical
variables in the formulas (e.g. by using a logical term x > 1.5). Another
group of approaches, including Neural Markov Logic Networks (NMLN) [69]
and Relational Neural Machines (RNM) [67], extends MLNs, allowing factors of
exponential distributions to be implemented as neural architectures. Finally,
[91, 24] compute ground atom scores as dot products between relation and
entity embeddings; implication rules are then translated into a logical loss
through a continuous relaxation of the implication operator.

  Example 9 (Semantic Loss). The Semantic Loss [114] is an example of an
  undirected model where (probabilistic) logic is exploited as a regularization
  term in training a neural model.

Let p = [p1 , . . . , pn ] be a vector of probabilities for a list of propositional
variables X = [X1 , . . . , Xn ]. In particular, pi denotes the probability of
variable Xi being True and corresponds to a single output of a neural net
having n outputs. Let α be a logic sentence defined over X.
Then, the semantic loss between α and p is:
              Loss(α, p) ∝ − log Σ_{x ⊨ α} ∏_{i: x ⊨ Xi} pi · ∏_{i: x ⊨ ¬Xi} (1 − pi).

The authors provide the intuition behind this loss:

      The semantic loss is proportional to a negative logarithm of the
      probability of generating a state that satisfies the constraint when
      sampling values according to p.

Suppose you want to solve a multi-class classification task, where each
input example must be assigned to a single class. Then, one would like
to enforce mutual exclusivity among the classes. This can be easily done
on supervised examples, by coupling a softmax activation layer with a
cross entropy loss. However, there is no standard way of imposing this
constraint for unlabeled data, which can be useful in a semi-supervised
setting.
The solution provided by the Semantic Loss framework is to encode mutual
exclusivity into the propositional constraint β:

    β = (X1 ∧ ¬X2 ∧ ¬X3 ) ∨ (¬X1 ∧ X2 ∧ ¬X3 ) ∨ (¬X1 ∧ ¬X2 ∧ X3 )

   Consider a neural network classifier with three outputs p = [p1 , p2 , p3 ].
Then, for each input example (whether labeled or unlabeled), we can build the
semantic loss term:

  L(β, p) = p1 (1 − p2 )(1 − p3 ) + (1 − p1 )p2 (1 − p3 ) + (1 − p1 )(1 − p2 )p3

which can be added to the standard cross entropy term for the labeled
examples.
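    A minimal Python sketch (ours) of this computation: the loss is the negative log of the probability mass that the network's predicted distribution puts on the worlds satisfying β.

        from math import log

        def semantic_loss(p):
            p1, p2, p3 = p
            sat = (p1 * (1 - p2) * (1 - p3)        # the exactly-one-true worlds of beta
                   + (1 - p1) * p2 * (1 - p3)
                   + (1 - p1) * (1 - p2) * p3)
            return -log(sat)

        print(semantic_loss([0.9, 0.05, 0.05]))   # small: the output is nearly one-hot
        print(semantic_loss([0.4, 0.4, 0.4]))     # larger: mass is spread over non-models of beta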
    It is worth comparing this method with KBANN (see Example 8). Here,
the logic is turned into a loss function that is used during training. The
function constrains the underlying probabilities, but there are no directed
or causal relationships among them. Moreover, during evaluation, the
probabilities p of the variables are just the outputs of the neural network.
On the contrary, in KBANN, the logic is compiled into the architecture of
the network and so it will be exploited also at evaluation time to compute
the score of the test query. The different focus on the neural or logic part is
further investigated in Section 8.

Dimension 3: Directed and Undirected models
     There are two classes of graphical models: in Bayesian networks, the
      underlying graph structure is a directed acyclic graph, while, in Markov
     networks, the graph is undirected. This distinction is carried over to
     NeSy where logical rules are used either to define the forward structure
     of the neural network or to define a regularization term for the training.

5      Boolean, Probabilistic and Fuzzy logic
One of the most important and complex questions in the neural symbolic
community is how to integrate the discrete nature of Boolean logic with the
continuous nature of neural representations (e.g. embeddings).
    Boolean logic assigns values to atoms in the set {T rue, F alse} (or {F, T }
or {0, 1}), which are interpreted as truth values. Connectives (e.g. ∧, ∨) are
mapped to binary functions of truth values, which are usually described in terms
of truth tables.

    Example 10 (Boolean Logic). Let us consider the following propositions:
    alarm, burglary and earthquake.
        Defining the semantics of this language is about assigning truth values
    to the propositions and truth tables to connectives.
        For example:
            I = {alarm = T, burglary = T, earthquake = F}

                 A B | A ∨ B          A B | B ← A
                 F F |   F            F F |   T
                 F T |   T            F T |   T
                 T F |   T            T F |   F
                 T T |   T            T T |   T
        Once we have defined the semantics of the language, we can evaluate
    logic sentences, e.g.:

                      alarm ← (burglary ∨ earthquake) = T

        This evaluation can be performed automatically by parsing the expression
    in the corresponding expression tree:

                      ←
                      ├─ alarm
                      └─ ∨
                         ├─ burglary
                         └─ earthquake

  The truth value of the sentence is computed by evaluating the tree bottom-
  up.

    Probabilistic logic uses the distribution semantics [93] as the key concept
to integrate Boolean logic and probability2 . The basic idea is that we interpret
each propositional binary variable as a binary random variable. Then, a specific
assignment ω of values to the random variables, also called a possible world, is
just a specific interpretation of the Boolean logic theory. Any joint distribution
p(ω) over the random variables is also a distribution over logic interpretations.
The probability of an atom or formula α is defined as the probability that any
of the possible worlds that are models of α will occur. Since possible worlds are
mutually exclusive, this is just the sum of their probabilities:
                                   p(α) = Σ_{ω ⊨ α} p(ω)                              (1)

This is known as the Weighted Model Counting (WMC) problem.

   Example 11 (Distribution semantics). We can illustrate the distribu-
   tion semantics by describing a distribution over possible worlds in tabular
   form by listing all the worlds and the corresponding probabilities. Let
   B = burglary, E = earthquake, J = hears_alarm_john and M =
   hears_alarm_mary). We omit deterministic atoms for clarity. Table 2 re-
   ports all the possible worlds over these four variables and the corresponding
   probabilities.
       Suppose we want to compute the probability of the formula burglary ∧
   earthquake. This is done by marginalizing over all those worlds (indicated
   by a ∗ in Table 2), where both burglary and earthquake are true.
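    Table 2, shown below, can be reproduced with a short Python sketch (ours) that multiplies, for every world, the probabilities of the true facts and one minus the probabilities of the false ones, and then marginalizes as in Equation 1:

        from itertools import product

        facts = {"B": 0.1, "E": 0.05, "J": 0.6, "M": 0.3}   # probabilities from Example 6

        def p_world(world):
            prob = 1.0
            for f, p in facts.items():
                prob *= p if world[f] else (1 - p)
            return prob

        worlds = [dict(zip(facts, values)) for values in product([False, True], repeat=4)]
        print(round(p_world({"B": False, "E": False, "J": False, "M": False}), 4))   # 0.2394
        print(round(sum(p_world(w) for w in worlds if w["B"] and w["E"]), 4))        # 0.005, the * rows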

    Fuzzy logic, and in particular t-norm fuzzy logic, assigns a truth value to
atoms in the continuous real interval [0, 1]. Logical operators are then turned
into real-valued functions, mathematically grounded in the t-norm theory. A
t-norm t(x, y) is a real function t : [0, 1] × [0, 1] → [0, 1] that models the logical
    2 In this paper, we use the distribution semantics or possible world semantics as representative

of the probabilistic approach to logic. While this is the most common solution in StarAI, many
other solutions exist [74, 46], whose description is out of the scope of the current survey. A
detailed overview of the different flavours of formal reasoning about uncertainty can be found
in [47].

B   E    J   M     p(ω)
                           F   F    F   F     0.2394
                           F   F    F   T     0.1026
                           F   F    T   F     0.3591
                           F   F    T   T     0.1539
                           F   T    F   F     0.0126
                           F   T    F   T     0.0054
                           F   T    T   F     0.0189
                           F   T    T   T     0.0081
                           T   F    F   F     0.0266
                           T   F    F   T     0.0114
                           T   F    T   F     0.0399
                           T   F    T   T     0.0171
                           T   T    F   F     0.0014   *
                           T   T    F   T     0.0006   *
                           T   T    T   F     0.0021   *
                           T   T    T   T     0.0009   *

Table 2: A distribution over possible worlds for the four proposi-
tional variables burglary (B), earthquake (E), hears_alarm_john (J) and
hears_alarm_mary (M). The ∗ indicates those worlds where burglary ∧
earthquake is true.

AND and from which the other operators can be derived. Table 3 shows the
best-known t-norms and the functions corresponding to their connectives. A
fuzzy logic formula is mapped to a real valued function of its input atoms. Fuzzy
logic generalizes Boolean logic to continuous values. All the different t-norms
are coherent with Boolean logic in the endpoints of the interval [0, 1], which
correspond to completely true and completely false values.
    The concept of model in fuzzy logic can be easily recovered from an extension
of the model-theoretic semantics of Boolean logic (see Section 2). Any fuzzy
interpretation is a model of a formula if the formula evaluates to 1.
    Fuzzy logic deals with vagueness, which is different from and orthogonal to un-
certainty (as in probabilistic logic). This difference is clear when one compares,
for example, the fuzzy assignment earthquake = 0.01, which means "very mild
earthquake", with p(earthquake = T rue) = 0.01, which means a low probability
of an earthquake.

  Example 12 (Fuzzy logic). Let us consider the same propositional language
  of Example 10. Defining a fuzzy semantics to this language is about
  assigning truth degrees to each of the propositions and selecting a t-norm
  implementation of the connectives.
      Let us consider the Łukasiewicz t-norm and the following interpretation
  of the language:

Product             Łukasiewicz          Gödel
            x∧y             x·y              max(0, x + y − 1)     min(x, y)
            x∨y           x+y−x·y             min(1, x + y)        max(x, y)
             ¬x             1−x                   1−x               1−x
        x ⇒ y (x > y)        y/x             min(1, 1 − x + y)        y

Table 3: Logical connectives on the inputs x, y when using the fundamental
t-norms.

                            I = {alarm = 0.7,
                                  burglary = 0.6,
                                  earthquake = 0.3}

      Once we have defined the semantics of the language, we can evaluate
  logic sentences, e.g.:

          alarm ← (burglary ∨ earthquake) =
           min(1, 1 − min(1, burglary + earthquake) + alarm) = 0.8

      This evaluation can be performed automatically by parsing the logi-
  cal sentence in the corresponding expression tree and then compiling the
  expression tree using the corresponding t-norm operation:

                      t←  (0.8)
                      ├─ alarm  (0.7)
                      └─ tOR  (0.9)
                         ├─ burglary  (0.6)
                         └─ earthquake  (0.3)

  The resulting circuit represents a differentiable function and the truth degree
  of the sentence is computed by evaluating the circuit bottom-up.
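    A short Python sketch (ours) of the same computation, with the Łukasiewicz connectives written out as real-valued functions:

        def l_or(x, y):                  # Lukasiewicz disjunction
            return min(1.0, x + y)

        def l_implies(x, y):             # Lukasiewicz implication x => y
            return min(1.0, 1.0 - x + y)

        alarm, burglary, earthquake = 0.7, 0.6, 0.3
        # alarm <- (burglary OR earthquake) is (burglary OR earthquake) => alarm
        print(round(l_implies(l_or(burglary, earthquake), alarm), 2))   # 0.8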

5.1    StarAI along Dimension 4
StarAI is deeply linked to probabilistic logic. The StarAI community has provided
several formalisms to define probability distributions over possible worlds
using labeled logic theories. Probabilistic Logic Programs (cf. Example 6) and
Markov logic networks (cf. Example 7) are two prototypical frameworks. For
example, the distribution in Table 2 is the one modeled by the ProbLog program

Figure 4: The d-DNNF (left) and the corresponding arithmetic circuit (right) for the ProbLog program in Example 6; the bottom-up evaluation of the arithmetic circuit yields 0.0435.

in Example 6. Probabilistic inference (i.e. weighted model counting) is generally
intractable. That is why, in StarAI, techniques such as knowledge compilation
(KC) [19] are used. Knowledge compilation transforms logical formulae into a new
representation in an expensive offline step. For this new representation, a certain
set of queries can be answered efficiently (i.e. in poly-time in the size of the new representation).
From a probabilistic viewpoint, this translation solves the disjoint-sum problem,
i.e. it encodes in the resulting formula the probabilistic dependencies in the
theory. After the translation, the probabilities of any conjunction and of any
disjunction can be simply computed by multiplying, resp. summing up, the
probabilities of their operands. Therefore, the formula can be compiled into
an arithmetic circuit ac(α). The weighted model count of the query formula is
computed by simply evaluating bottom up the corresponding arithmetic circuit;
i.e. p(α) = ac(α).

  Example 13 (Knowledge Compilation). Let us consider the ProbLog
  program in Example 6 and the corresponding tabular form in Table 2.
   Let us consider the unary formula α = calls(mary). We can use Equation 1
  to compute the probability that the formula α holds. To do this, we
  can iterate over the table and sum all the rows where calls(mary) = T ,
  which are those where either burglary = T or earthquake = T and where
  hears_alarm(mary) = T . We get that p(α) = 0.0435. This method would
   require iterating over 2^N terms (where N is the number of probabilistic
  facts).
       Knowledge compilation compiles the program and the query into a repre-
   sentation that is logically equivalent. In Figure 4, the target representation
   is a decomposable, deterministic negation normal form (d-DNNF) [17], for
   which weighted model counting is poly-time in the size of the formula.
   Decomposability means that, for every conjunction, the conjuncts do
   not share any variables. Determinism means that, for every disjunction,
   the disjuncts are mutually exclusive. The formula in d-DNNF can then be
   straightforwardly turned into an arithmetic circuit by substituting AND
   nodes with multiplication and OR nodes with summation (i.e. the proba-
   bility semiring). In Figure 4, we show the d-DNNF and the arithmetic
  circuit of the distribution defined by the ProbLog program in Example 6.
  The bottom-up evaluation of this arithmetic circuit computes the correct
  marginal probability p(α) much more efficiently than the naive iterative
  sum that we have shown before.
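    For this example, the bottom-up evaluation of the circuit in Figure 4 boils down to three arithmetic operations; a Python sketch (ours) of exactly that evaluation:

        p_burglary, p_earthquake, p_hears_mary = 0.1, 0.05, 0.3

        # OR node over disjoint branches: burglary, or (not burglary and earthquake)
        p_alarm = p_burglary + (1 - p_burglary) * p_earthquake        # 0.145
        # AND node: hears_alarm(mary) and alarm
        p_calls_mary = p_hears_mary * p_alarm
        print(round(p_calls_mary, 4))                                 # 0.0435, as in Example 13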
    Even though probabilistic Boolean logic is the most common choice in
StarAI, there are approaches using probabilistic fuzzy logic. The most prominent
approach is Probabilistic Soft Logic (PSL) [3], as we show in Example 14.
Similarly to Markov logic networks, Probabilistic Soft Logic (PSL) defines log
linear models where features are represented by ground clauses. However, PSL
uses a fuzzy semantics of the logical theory. Therefore, atoms are mapped to
real valued random variables and ground clauses are now real valued factors.

  Example 14 (Probabilistic Soft Logic). Let us consider the logical rule
  α = calls(X) ← alarm, hears_alarm(X) with weight β .
     As we have seen in Example 7, Markov Logic translates the formula into
  a discrete factor by using the indicator functions 1(ω, αθ):

            φ_MLN(ω, α) = β·1(ω, α{X/mary}) + β·1(ω, α{X/john})
     Instead of discrete indicator functions, PSL translates the formula into
  a continuous t-norm based function:

     t(ω, αθ) = min(1, 1 − max(0, alarm + hears_alarm(X) − 1) + calls(X))

     and the corresponding potential is then translated into the continuous
  and differentiable function:

             φ_PSL(ω, α) = β·t(ω, α{X/mary}) + β·t(ω, α{X/john})

      Another important task in StarAI is MAP inference. In MAP inference,
   given the distribution p(ω), one is interested in finding an interpretation ω*
   where p is maximum, i.e.:

                                ω* = arg max_ω p(ω)                            (2)

When the ω is a boolean interpretation, i.e. ω ∈ {0, 1}n , like in ProbLog
  or MLN, this problem is known to be strictly related with the Weighted
  Model Count problem with which it shares the same complexity class.
  However
       P in PSL, ω is a fuzzy interpretation, i.e. ω ∈ [0, 1] and p(ω) ∝
                                                                   n

  exp     i βi φ(ω, αi ) is a continuous and differentiable function.   The basic
  idea exploited by PSL is to compile the function Φ(ω) =
                                                                   P
                                                                      i βi φ(ω, αi )
  into a parametric circuit (cf. Example 12, where the set of parameters is
  represented by the fuzzy interpretation ω. The MAP inference problem
  can thus be approximated more efficiently than its boolean counterpart (i.e.
  Markov Logic) by gradient-based techniques.
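
  As an illustration of this gradient-based view of MAP inference, the
  following sketch (plain Python with PyTorch, not PSL's actual solver)
  relaxes the two ground rules of Example 14 with Łukasiewicz connectives and
  maximizes the resulting differentiable potential over a fuzzy interpretation
  ω ∈ [0, 1]^n by projected gradient ascent. The weight β = 2.0, the
  initialization and the learning rate are arbitrary choices for the sketch;
  in a real model, observed atoms would additionally be clamped to their
  evidence values.

  import torch

  beta = 2.0   # hypothetical weight shared by the two ground rules

  # Fuzzy interpretation omega: one truth degree per ground atom, in the order
  # alarm, hears_alarm(mary), hears_alarm(john), calls(mary), calls(john).
  omega = torch.full((5,), 0.5, requires_grad=True)

  def luk_and(a, b):            # Lukasiewicz t-norm: max(0, a + b - 1)
      return torch.clamp(a + b - 1.0, min=0.0)

  def luk_implies(body, head):  # Lukasiewicz implication: min(1, 1 - body + head)
      return torch.clamp(1.0 - body + head, max=1.0)

  def potential(w):
      alarm, h_mary, h_john, c_mary, c_john = w
      t_mary = luk_implies(luk_and(alarm, h_mary), c_mary)
      t_john = luk_implies(luk_and(alarm, h_john), c_john)
      return beta * t_mary + beta * t_john

  optimizer = torch.optim.SGD([omega], lr=0.1)
  for _ in range(200):
      optimizer.zero_grad()
      (-potential(omega)).backward()     # maximize the potential
      optimizer.step()
      with torch.no_grad():
          omega.clamp_(0.0, 1.0)         # project back onto [0, 1]^n

  print(omega.detach())                  # an approximate fuzzy MAP state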

5.2    NeSy along Dimension 4
We have seen that in StarAI, one can turn inference tasks into the evaluation (as
in KC) or gradient-based optimization (as in PSL) of a differentiable parametric
circuit. The parameters are scalar values (e.g. probabilities or truth degrees)
which are attached to basic elements of the theory (facts or clauses).
    A natural way of carrying the StarAI approach over to the neural-symbolic
domain is reparameterization: the scalar values assigned to facts or formulas
are substituted with the outputs of a neural network. One can interpret this
substitution as a different parameterization of the original model. Many
probabilistic methods parameterize the underlying distribution in terms of
neural components. In particular, as we show in Example 15, DeepProbLog exploits
neural predicates to compute the probabilities of probabilistic facts as the
output of neural computations over vectorial representations of the constants,
similarly to what the Semantic Loss (SL) approach does in the propositional
case (see Example 9). NeurASP also inherits the concept of neural predicate
from DeepProbLog.

  Example 15 (Probabilistic semantics reparameterization in DeepProbLog).
  DeepProbLog [64] is a neural extension of the probabilistic logic program-
  ming language ProbLog. DeepProbLog allows images or other sub-symbolic
  representations as terms of the program.
      Let us consider a possible neural extension of the program in Exam-
  ple 6. We could extend the predicate calls(X) with two extra inputs, i.e.
  calls(B, E, X). B is supposed to contain an image of a security camera,
  while E is supposed to contain the time-series of a seismic sensor. We would
  like to answer queries like calls( , , mary), i.e. what is the probability
  that mary calls, given that the security camera has captured the image
  and the sensor the data      .
      DeepProbLog allows answering this query by modeling the following
  program:
  nn(nn_burglary, [B]) :: burglary(B).
  nn(nn_earthquake, [E]) :: earthquake(E).
  0.3::hears_alarm(mary).
  0.6::hears_alarm(john).
  alarm(B,_) :- burglary(B).
  alarm(_,E) :- earthquake(E).
  calls(B,E,X) :- alarm(B,E), hears_alarm(X).
       Here, the program has been extended in two ways. First, new arguments
  (i.e. B and E) have been introduced in order to deal with the sub-symbolic
  inputs. Second, the probabilistic facts burglary and earthquake have been
  turned into neural predicates. Neural predicates are special probabilistic
  facts that are annotated by neural networks instead of by scalar probabilities.
       Inference in DeepProbLog mimics exactly that of ProbLog. Given
  the query and the program, knowledge compilation is used to build the
  arithmetic circuit in Figure 5.
       Since the program is structurally identical to the purely symbolic one
  in Example 13, the arithmetic circuit is exactly the same; the only
  difference is that some leaves of the circuit (i.e. the probabilities of the
  facts) are now computed by neural networks.
       Given a set of queries that are true, i.e.:

                         D = {calls(      ,    , mary),
                                calls(   ,     , john),
                                calls(   ,     , mary), ...},

  we can train the parameters θ of the DeepProbLog program (both the neural
  networks and the scalar probabilities) by maximizing the likelihood of the
  training queries with gradient-based optimization:

                                 max_θ ∑_{q∈D} p(q)
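
  The following sketch illustrates this reparameterization; it is not
  DeepProbLog's actual implementation. The leaves of the arithmetic circuit
  (cf. Figure 5) are replaced by the outputs of two small, hypothetical
  PyTorch networks whose 16-dimensional inputs stand in for the camera image
  and the seismic time-series, and the (log-)likelihood of a toy set of
  positive queries is maximized by gradient-based optimization.

  import torch
  import torch.nn as nn

  # Hypothetical leaf networks: map a sub-symbolic input to a probability in [0, 1].
  nn_burglary   = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
  nn_earthquake = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
  p_hears_mary  = torch.tensor(0.3)   # scalar probabilistic fact, kept fixed here

  def query_prob(camera, seismic):
      # Evaluate (an algebraically equivalent form of) the arithmetic circuit with
      # neural leaves: p(calls(B, E, mary)) = p_hears * (p_b + (1 - p_b) * p_e).
      p_b = nn_burglary(camera).squeeze()
      p_e = nn_earthquake(seismic).squeeze()
      return p_hears_mary * (p_b + (1.0 - p_b) * p_e)

  # Toy dataset: (camera, seismic) input pairs for which calls(..., mary) is true.
  data = [(torch.randn(16), torch.randn(16)) for _ in range(32)]

  params = list(nn_burglary.parameters()) + list(nn_earthquake.parameters())
  optimizer = torch.optim.Adam(params, lr=0.01)
  for _ in range(100):
      optimizer.zero_grad()
      loss = -sum(torch.log(query_prob(camera, seismic) + 1e-8)
                  for camera, seismic in data)   # negative log-likelihood of the queries
      loss.backward()
      optimizer.step()

  Because the circuit is differentiable, the gradient of the query probability
  flows through the logical structure into the leaf networks, which is exactly
  the mechanism that lets DeepProbLog train its neural predicates from
  query-level supervision.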

    Similarly to DeepProbLog, NMLN and RNM use neural networks to parameterize
the factors (or the weights) of a Markov Logic Network. [91] computes
marginal probabilities as logistic functions over similarity measures between
embeddings of entities and relations. An alternative solution to exploit a
probabilistic semantics is to use knowledge graphs (see also Section 10) to
define probabilistic priors over neural network predictions, as done in [101].
    SBR and LTN reparameterize fuzzy atoms using neural networks that take as
input the feature representation of the constants and return the corresponding
truth value, as shown in Example 16. Logical rules are then relaxed into soft
constraints using fuzzy logic. Many other systems in other communities exploit
fuzzy logic to inject knowledge into neural models [43, 61]. These methods
often differ only in small implementation details and can be regarded as
variants of a single conceptual framework.

Figure 5: A neural reparameterization of the arithmetic circuit in Example 13
as done by DeepProbLog (cf. Example 15). Dashed arrows indicate a negated
output, i.e. 1 − x.

  Example 16 (Semantic-Based Regularization). Semantic-Based Regular-
  ization (SBR) [25] is an example of an undirected model where fuzzy logic
  is exploited as a regularization term in training a neural model.
      Let us consider a possible grounding for the rule in Example 14:
  calls(mary) ← alarm, hears_alarm(mary)
      For each grounded rule r, SBR builds a regularization term L(r)
  in the following way. First, it maps each constant c (e.g. mary) to a set
  of (perceptual) features x_c (e.g. a tensor of pixel intensities x_mary). Each
  relation r (e.g. calls, hears_alarm) is then mapped to a neural network
  f_r(x), where x is the tensor of features of the input constants and the
  output is a truth degree in [0, 1]. For example, the atom calls(mary)
  is mapped to the function call f_calls(x_mary). Propositional variables (e.g.
  alarm) are mapped to free parameters in [0, 1], e.g. t_alarm (exactly as
  in PSL, Example 14). Then, a fuzzy logic t-norm is selected and the logic
  connectives are mapped to the corresponding real-valued functions. For
  example, when the Łukasiewicz t-norm is selected, the implication is mapped
  to the binary real function f←(x, y) = min(1, 1 − y + x), where x is the truth
  of the head and y that of the body, while the conjunction is
  f∧(x, y) = max(0, x + y − 1).
      For the rule above, the Semantic-Based Regularization term is the
  Łukasiewicz truth degree of the grounded rule:

    L_Ł(r) = min(1, 1 − max(0, t_alarm + f_hears_alarm(x_mary) − 1) + f_calls(x_mary))

  which is then turned into a penalty (e.g. 1 − L_Ł(r)) to be minimized
  together with the supervised loss, so that training pushes the rule towards
  satisfaction.

      The aim of Semantic-Based Regularization is to use the regularization
  term together with a classical supervised learning loss in order to
  learn the functions associated with the relations, e.g. f_calls and f_hears_alarm.
      It is worth comparing this method with Semantic Loss (Example 9). Both
  methods turn a logic formula (either propositional or first-order) into
  a real-valued function that is used as a regularization term. However, because
  of the different semantics, the two methods have different properties. On
  the one hand, SL preserves the original logical semantics by using probabilistic
  logic. However, due to the probabilistic assumption, the input formula
  cannot be compiled directly into a differentiable loss but first needs to be
  translated into an equivalent deterministic and decomposable form. While
  this step is necessary for the probabilistic modeling to be sound,
  the size of the resulting formula is, in the worst case, exponential in the
  size of the grounded theory. On the other hand, in SBR, the formula can be
  compiled directly into a differentiable loss whose size is linear in the size
  of the grounded theory. However, in order to do so, the semantics of the
  logic is altered by turning it into fuzzy logic.
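
  A minimal sketch of this training scheme is given below (again in PyTorch;
  the network shapes, the 16-dimensional features, the regularization weight
  0.1 and the use of 1 − L_Ł(r) as the penalty are illustrative assumptions
  rather than SBR's exact formulation).

  import torch
  import torch.nn as nn

  # Hypothetical relation networks: constant features -> truth degree in [0, 1].
  f_calls       = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
  f_hears_alarm = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
  t_alarm = nn.Parameter(torch.tensor(0.5))   # free truth degree for the proposition alarm

  def rule_penalty(x_mary):
      # 1 minus the Lukasiewicz truth of calls(mary) <- alarm, hears_alarm(mary).
      body  = torch.clamp(t_alarm + f_hears_alarm(x_mary) - 1.0, min=0.0)   # fuzzy AND
      truth = torch.clamp(1.0 - body + f_calls(x_mary), max=1.0)            # fuzzy <-
      return (1.0 - truth).squeeze()

  x_mary = torch.randn(16)            # placeholder perceptual features for mary
  y_calls_mary = torch.tensor([1.0])  # supervised label for the atom calls(mary)
  bce = nn.BCELoss()

  params = list(f_calls.parameters()) + list(f_hears_alarm.parameters()) + [t_alarm]
  optimizer = torch.optim.Adam(params, lr=0.01)
  for _ in range(100):
      optimizer.zero_grad()
      loss = bce(f_calls(x_mary), y_calls_mary) + 0.1 * rule_penalty(x_mary)
      loss.backward()
      optimizer.step()
      with torch.no_grad():
          t_alarm.clamp_(0.0, 1.0)    # keep the free truth degree in [0, 1]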

    Fuzzy logic can also be used to relax rules. For example, in LRNN, ∂ILP,
DiffLog and [110], the scores of proofs are computed by using fuzzy logic
connectives. The great algebraic variety of t-norms has allowed identifying
parameterized (i.e. weighted) classes of t-norms [100, 89] that are very close
to standard neural computation patterns (e.g. ReLU or sigmoidal layers). This
creates an interesting, and still not fully understood, connection between
soft logical inference and inference in neural networks. A large class of
methods [71, 24, 13, 112] relaxes logical statements numerically, without
assigning them any other specific semantics. Here, atoms are assigned scores
in R computed by a neural scoring function over embeddings. Numerical
approximations are then applied either to combine these scores according to
logical formulas or to aggregate proof scores. The resulting neural
architectures are usually differentiable and can thus be trained end-to-end.
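
As a concrete (and purely hypothetical) instance of such a numerical
relaxation, the sketch below scores atoms with a simple embedding-based
bilinear function and combines the scores along a proof with a product of
sigmoids; none of the cited systems uses exactly this construction, but they
all follow the same pattern of differentiable score combination.

import torch
import torch.nn as nn

# Hypothetical embedding-based scorer: score(r, s, o) in R for the atom r(s, o).
n_entities, n_relations, dim = 10, 3, 16
ent = nn.Embedding(n_entities, dim)
rel = nn.Embedding(n_relations, dim)

def atom_score(r, s, o):
    # A DistMult-style bilinear score over the embeddings (one of many choices).
    return (ent(s) * rel(r) * ent(o)).sum(-1)

def proof_score(atoms):
    # Numerically relax the conjunction of the atoms in a proof with a
    # product of sigmoids; the result is differentiable end-to-end.
    return torch.stack([torch.sigmoid(atom_score(r, s, o)) for r, s, o in atoms]).prod()

# Score one proof of grandparent(a, c) via parent(a, b) and parent(b, c)
# (relation and entity indices are placeholders).
parent = torch.tensor(0)
a, b, c = torch.tensor(1), torch.tensor(2), torch.tensor(3)
print(proof_score([(parent, a, b), (parent, b, c)]))

Multiple proofs of the same query can then be aggregated, e.g. with a (soft)
maximum, and the whole pipeline trained from query-level supervision.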
    As in PSL, some NeSy methods mix probabilistic and fuzzy semantics. In
particular, [68] extends PSL by adding neurally parameterized factors to the
Markov random field, while [51] uses fuzzy logic to train posterior regularizers
for standard deep networks using knowledge distillation techniques [49].
    Fuzzy logic in NeSy is mostly used for computational reasons rather than
out of an actual need to deal with vagueness. Indeed, all the fuzzy systems
described in this survey start from a Boolean theory, relax it to a fuzzy one
and, finally, return to Boolean logic to provide answers or to take decisions.
We investigate this issue further in Section 9; here we point out two common
reasons to exploit fuzzy logic. The first is to relax logical reasoning, in
particular weighted satisfiability. This is made explicit in systems like LTNs.
However, as we will show later, it causes such systems to output fuzzy
solutions, which can be inconsistent with the Boolean solution of the problem.
The second reason is to approximate probabilistic inference, either by
providing bounds [89] or by providing an initialization for sampling
techniques [3]. For example, PSL solves a fuzzy weighted SAT problem, similar
to LTN, to find a fuzzy relaxation of the MAP state (see Example 14), which is
then used as a starting point for Markov Chain Monte Carlo (MCMC) inference.

    Dimension 4: Boolean, Probabilistic and Fuzzy logic

    This dimension concerns the values assigned to the atoms and formulas
    of a logical theory. Boolean logic assigns values in the discrete
    set {True, False}, e.g. earthquake = True. Probabilistic logic allows
    computing the probability that an atom or a formula is True,
    e.g. p(earthquake = True) = 0.05. Fuzzy logic assigns soft truth degrees
    in the continuous interval [0, 1], e.g. earthquake = 0.6. While probabilistic
    logic brings the well-known computational challenges of probabilistic
    inference, fuzzy logic introduces semantic issues when used as a relaxation
    of Boolean logic. This trade-off is not yet clearly understood.

6    Learning: Structure versus Parameters
The learning approaches in StarAI and NeSy can be broadly divided into two
categories: structure learning [57] and parameter learning [44, 63]. In
structure learning, we are interested in discovering a logical theory, i.e. a
set of logical clauses and their corresponding probabilities, that reliably
explains the given examples, starting from an empty model. What it means to
explain the examples exactly depends on the learning setting. In discriminative
learning, we are interested in learning a theory that explains, or predicts, a
specific target relation given background knowledge. In generative learning,
there is no specific target relation; instead, we are interested in a theory
that explains the interactions between all relations in a dataset. In contrast
to structure learning, parameter learning starts from a fixed logical theory
and only learns the corresponding probabilities.
    The two modes of learning belong to vastly different complexity classes:
structure learning is an inherently hard combinatorial search over candidate
structures, whereas parameter learning can be solved with standard continuous
optimization techniques, such as gradient descent or least squares. While
parameter learning is, in principle, the easier problem, it depends strongly
on the provided user input: if the provided clauses are of low quality, the
resulting model will also be of low quality. Structure learning, on the other
hand, depends less on the provided input, but faces an inherently harder
problem.
