Trainable Sentence Planning for Complex Information Presentation in Spoken Dialog Systems

Amanda Stent, Stony Brook University, Stony Brook, NY 11794, U.S.A.
Rashmi Prasad, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
Marilyn Walker, University of Sheffield, Sheffield S1 4DP, U.K.

Abstract
A challenging problem for spoken dialog systems is the design of utterance generation modules that are fast, flexible and general, yet produce high quality output in particular domains. A promising approach is trainable generation, which uses general-purpose linguistic knowledge automatically adapted to the application domain. This paper presents a trainable sentence planner for the MATCH dialog system. We show that trainable sentence planning can produce output comparable to that of MATCH’s template-based generator even for quite complex information presentations.

1   Introduction
One very challenging problem for spoken dialog systems is the design of the utterance generation module. This challenge arises partly from the need for the generator to adapt to many features of the dialog domain, user population, and dialog context.

There are three possible approaches to generating system utterances. The first is template-based generation, used in most dialog systems today. Template-based generation enables a programmer without linguistic training to program a generator that can efficiently produce high quality output specific to different dialog situations. Its drawbacks include the need to (1) create templates anew by hand for each application; (2) design and maintain a set of templates that work well together in many dialog contexts; and (3) repeatedly encode linguistic constraints such as subject-verb agreement.

The second approach is natural language generation (NLG), which divides generation into (1) text (or content) planning, (2) sentence planning, and (3) surface realization. NLG promises portability across domains and dialog contexts by using general rules for each generation module. However, the quality of the output for a particular domain, or a particular dialog context, may be inferior to that of a template-based system unless domain-specific rules are developed or general rules are tuned for the particular domain. Furthermore, full NLG may be too slow for use in dialog systems.

A third, more recent, approach is trainable generation: techniques for automatically training NLG modules, or hybrid techniques that adapt NLG modules to particular domains or user groups, e.g. (Langkilde, 2000; Mellish, 1998; Walker, Rambow and Rogati, 2002). Open questions about the trainable approach include (1) whether the output quality is high enough, and (2) whether the techniques work well across domains. For example, the training method used in SPoT (Sentence Planner Trainable), as described in (Walker, Rambow and Rogati, 2002), was only shown to work in the travel domain, for the information gathering phase of the dialog, and with simple content plans involving no rhetorical relations.

This paper describes trainable sentence planning for information presentation in the MATCH (Multimodal Access To City Help) dialog system (Johnston et al., 2002). We provide evidence that the trainable approach is feasible by showing (1) that the training technique used for SPoT can be extended to a new domain (restaurant information); (2) that this technique, previously used for information-gathering utterances, can be used for information presentations, namely recommendations and comparisons; and (3) that the quality of the output is comparable to that of a template-based generator previously developed and experimentally evaluated with MATCH users (Walker et al., 2002; Stent et al., 2002).

Section 2 describes SPaRKy (Sentence Planning with Rhetorical Knowledge), an extension of SPoT that uses rhetorical relations. SPaRKy consists of a randomized sentence plan generator (SPG) and a trainable sentence plan ranker (SPR); these are described in Sections 3
and 4. Section 5 presents the results of two experiments. The first experiment shows that given a content plan such as that in Figure 1, SPaRKy can select sentence plans that communicate the desired rhetorical relations, are significantly better than a randomly selected sentence plan, and are on average less than 10% worse than a sentence plan ranked highest by human judges. The second experiment shows that the quality of SPaRKy’s output is comparable to that of MATCH’s template-based generator. We sum up in Section 6.

2   SPaRKy Architecture

Information presentation in the MATCH system focuses on user-tailored recommendations and comparisons of restaurants (Walker et al., 2002). Following the bottom-up approach to text planning described in (Marcu, 1997; Mellish, 1998), each presentation consists of a set of assertions about a set of restaurants and a specification of the rhetorical relations that hold between them. Example content plans are shown in Figures 1 and 2. The job of the sentence planner is to choose linguistic resources to realize a content plan and then rank the resulting alternative realizations. Figures 3 and 4 show alternative realizations for the content plans in Figures 1 and 2.

 strategy:  recommend
 items:     Chanpen Thai
 relations: justify(nuc:1;sat:2); justify(nuc:1;sat:3); justify(nuc:1;sat:4)
 content:   1. assert(best(Chanpen Thai))
            2. assert(has-att(Chanpen Thai, decor(decent)))
            3. assert(has-att(Chanpen Thai, service(good)))
            4. assert(has-att(Chanpen Thai, cuisine(Thai)))
Figure 1: A content plan for a recommendation for a restaurant in midtown Manhattan

 items:     Above, Carmine’s
 relations: elaboration(1;2); elaboration(1;3); elaboration(1;4); elaboration(1;5); elaboration(1;6); elaboration(1;7); contrast(2;3); contrast(4;5)
 content:   1. assert(exceptional(Above, Carmine’s))
            2. assert(has-att(Above, decor(good)))
            3. assert(has-att(Carmine’s, decor(decent)))
            4. assert(has-att(Above, service(good)))
            5. assert(has-att(Carmine’s, service(good)))
            6. assert(has-att(Above, cuisine(New American)))
            7. assert(has-att(Carmine’s, cuisine(italian)))
Figure 2: A content plan for a comparison between restaurants in midtown Manhattan

 Alt  H    SPR  Realization
 2    3    .28  Chanpen Thai, which is a Thai restaurant, has decent decor. It has good service. It has the best overall quality among the selected restaurants.
 5    2.5  .14  Since Chanpen Thai is a Thai restaurant, with good service, and it has decent decor, it has the best overall quality among the selected restaurants.
 6    4    .70  Chanpen Thai, which is a Thai restaurant, with decent decor and good service, has the best overall quality among the selected restaurants.
Figure 3: Some alternative sentence plan realizations for the recommendation in Figure 1. H = Humans’ score. SPR = SPR’s score.

 Alt  H    SPR  Realization
 11   2    .73  Above and Carmine’s offer exceptional value among the selected restaurants. Above, which is a New American restaurant, with good decor, has good service. Carmine’s, which is an Italian restaurant, with good service, has decent decor.
 12   2.5  .50  Above and Carmine’s offer exceptional value among the selected restaurants. Above has good decor, and Carmine’s has decent decor. Above and Carmine’s have good service. Above is a New American restaurant. On the other hand, Carmine’s is an Italian restaurant.
 13   3    .67  Above and Carmine’s offer exceptional value among the selected restaurants. Above is a New American restaurant. It has good decor. It has good service. Carmine’s, which is an Italian restaurant, has decent decor and good service.
 20   2.5  .49  Above and Carmine’s offer exceptional value among the selected restaurants. Carmine’s has decent decor but Above has good decor, and Carmine’s and Above have good service. Carmine’s is an Italian restaurant. Above, however, is a New American restaurant.
 25   NR   NR   Above and Carmine’s offer exceptional value among the selected restaurants. Above has good decor. Carmine’s is an Italian restaurant. Above has good service. Carmine’s has decent decor. Above is a New American restaurant. Carmine’s has good service.
Figure 4: Some of the alternative sentence plan realizations for the comparison in Figure 2. H = Humans’ score. SPR = SPR’s score. NR = Not generated or ranked.

The architecture of the spoken language generation module in MATCH is shown in Figure 5. The dialog manager sends a high-level communicative goal to the SPUR text planner, which selects the content to be communicated using a user model and brevity constraints (see (Walker
et al., 2002)). The output is a content plan for a recommendation or comparison such as those in Figures 1 and 2.

SPaRKy, the sentence planner, receives the content plan; its sentence plan generator (SPG) generates one or more sentence plans (Figure 7), and its sentence plan ranker (SPR) ranks the generated plans. To keep the SPG from generating sentence plans that are clearly bad, a content-structuring module first finds one or more ways to linearly order the input content plan using principles of entity-based coherence based on rhetorical relations (Knott et al., 2001). It outputs a set of text plan trees (tp-trees), consisting of a set of speech acts to be communicated and the rhetorical relations that hold between them. For example, the two tp-trees in Figure 6 are generated for the content plan in Figure 2. Sentence plans such as alternative 25 in Figure 4 are thereby avoided: it is clearly worse than alternatives 12, 13 and 20, since it combines information based neither on a restaurant entity (e.g. Babbo) nor on an attribute (e.g. decor).

The top-ranked sentence plan output by the SPR is input to the RealPro surface realizer, which produces a surface linguistic utterance (Lavoie and Rambow, 1997). A prosody assignment module uses the prior levels of linguistic representation to determine the appropriate prosody for the utterance, and passes a marked-up string to the text-to-speech module.

Figure 5: A dialog system with a spoken language generator [diagram: DIALOGUE → Goals → Text Planner (“What to Say”) → Sentence Planner → Surface Realizer → Prosody Assigner (“How to Say It”) → Synthesizer → UTTERANCE]

3   Sentence Plan Generation

As in SPoT, the basis of the SPG is a set of clause-combining operations that operate on tp-trees and incrementally transform the elementary predicate-argument lexico-structural representations (called DSyntS (Melcuk, 1988)) associated with the speech acts on the leaves of the tree. The operations are applied in a bottom-up, left-to-right fashion, and the resulting representation may contain one or more sentences. The application of the operations yields two parallel structures: (1) a sentence plan tree (sp-tree), a binary tree with leaves labeled by the assertions from the input tp-tree and interior nodes labeled with clause-combining operations; and (2) one or more DSyntS trees (d-trees), which reflect the parallel operations on the predicate-argument representations.

We generate a random sample of possible sentence plans for each tp-tree, up to a prespecified number of sentence plans, by randomly selecting among the operations according to a probability distribution that favors preferred operations.¹ The choice of operation is further constrained by the rhetorical relation that relates the assertions to be combined, as in other work, e.g. (Scott and de Souza, 1990). In the current work, three RST rhetorical relations (Mann and Thompson, 1987) are used in the content planning phase to express the relations between assertions: the justify relation for recommendations, and the contrast and elaboration relations for comparisons. We added another relation to be used during the content-structuring phase, called infer, which holds for combinations of speech acts for which there is no rhetorical relation expressed in the content plan, as in (Marcu, 1997). By explicitly representing the discourse structure of the information presentation, we can generate information presentations with considerably more internal complexity than those generated in (Walker, Rambow and Rogati, 2002) and eliminate those that violate certain coherence principles, as described in Section 2.

¹ Although the probability distribution here is handcrafted based on assumed preferences for operations such as merge, relative-clause and with-reduction, it might also be possible to learn this probability distribution from the data by training in two phases.

The clause-combining operations are general operations similar to aggregation operations used in other research (Rambow and Korelsky, 1992; Danlos, 2000). The operations and the
constraints on their use are described below.

merge applies to two clauses with identical matrix verbs and identical arguments in all but one position. The clauses are combined and the nonidentical arguments coordinated. For example, merge(Above has good service;Carmine’s has good service) yields Above and Carmine’s have good service. merge applies only for the relations infer and contrast.

with-reduction is treated as a kind of “verbless” participial clause formation in which the participial clause is interpreted with the subject of the unreduced clause. For example, with-reduction(Above is a New American restaurant;Above has good decor) yields Above is a New American restaurant, with good decor. with-reduction uses two syntactic constraints: (a) the subjects of the clauses must be identical, and (b) the clause that undergoes the participial formation must have a have-possession predicate. In the example above, the clause Above is a New American restaurant cannot undergo participial formation, since its predicate is not a have-possession predicate. with-reduction applies only for the relations infer and justify.

relative-clause combines two clauses with identical subjects, using the second clause to relativize the first clause’s subject. For example, relative-clause(Chanpen Thai is a Thai restaurant, with decent decor and good service;Chanpen Thai has the best overall quality among the selected restaurants) yields Chanpen Thai, which is a Thai restaurant, with decent decor and good service, has the best overall quality among the selected restaurants. relative-clause also applies only for the relations infer and justify.

cue-word inserts a discourse connective (one of since, however, while, and, but, and on the other hand) between the two clauses to be combined. cue-word conjunction combines two distinct clauses into a single sentence with a coordinating or subordinating conjunction (e.g. Above has decent decor BUT Carmine’s has good decor), while cue-word insertion inserts a cue word at the start of the second clause, producing two separate sentences (e.g. Carmine’s is an Italian restaurant. HOWEVER, Above is a New American restaurant). The choice of cue word depends on the rhetorical relation holding between the clauses.

Finally, period applies to two clauses to be treated as two independent sentences.

Note that a tp-tree can have very different realizations, depending on the operations of the SPG. For example, the second tp-tree in Figure 6 yields both Alt 11 and Alt 13 in Figure 4. However, the human judges rated Alt 13 more highly than Alt 11. The sp-tree and d-tree produced by the SPG for Alt 13 are shown in Figures 7 and 8. The composite labels on the interior nodes of the sp-tree indicate the clause-combining operation selected to communicate the specified rhetorical relation. The d-tree for Alt 13 in Figure 8 shows that the SPG treats the period operation as part of the lexico-structural representation for the d-tree. After sentence planning, the d-tree is split into multiple d-trees at period nodes; these are sent to the RealPro surface realizer.

 tp-tree 1:
   nucleus:assert-com-list_exceptional
   infer
     contrast: nucleus:assert-com-decor, nucleus:assert-com-decor
     contrast: nucleus:assert-com-service, nucleus:assert-com-service
     contrast: nucleus:assert-com-cuisine, nucleus:assert-com-cuisine
 tp-tree 2:
   nucleus:assert-com-list_exceptional
   contrast
     infer: nucleus:assert-com-decor, nucleus:assert-com-service, nucleus:assert-com-cuisine
     infer: nucleus:assert-com-decor, nucleus:assert-com-service, nucleus:assert-com-cuisine
Figure 6: Two tp-trees for alternative 13 in Figure 4.

 (root)
   assert-com-list_exceptional
   PERIOD_contrast
     PERIOD_infer
       PERIOD_infer
         assert-com-cuisine
         assert-com-decor
       assert-com-service
     RELATIVE_CLAUSE_infer
       assert-com-cuisine
       MERGE_infer
         assert-com-decor
         assert-com-service
Figure 7: Sentence plan tree (sp-tree) for alternative 13 in Figure 4

Figure 8: Dependency tree (d-tree) for alternative 13 in Figure 4 [diagram not reproduced; node labels include offer, Above_and_Carmine’s, value, exceptional, among, restaurant, selected, PERIOD, HAVE1, BE3 and AND2]

Separately, the SPG also handles referring expression generation by converting proper names to pronouns when they appear in the previous utterance. The rules are applied locally, across adjacent sequences of utterances (Brennan et al., 1987). Referring expressions are manipulated in the d-trees, either intrasententially during the creation of the sp-tree, or intersententially, if the full sp-tree contains any period operations. The third and fourth sentences for Alt 13 in Figure 4 show the conversion of a named restaurant (Above) to a pronoun.

4   Training the Sentence Plan Ranker

The SPR takes as input a set of sp-trees generated by the SPG and ranks them. The SPR’s rules for ranking sp-trees are learned from a labeled set of sentence-plan training examples using the RankBoost algorithm (Schapire, 1999).

Examples and Feedback: To apply RankBoost, a set of human-rated sp-trees is encoded in terms of a set of features. We started with a set of 30 representative content plans for each strategy. The SPG produced as many as 20 distinct sp-trees for each content plan. The sentences realized by RealPro from these sp-trees were then rated by two expert judges on a scale from 1 to 5, and the ratings were averaged. Each sp-tree was an example input for RankBoost, with its corresponding rating as feedback.

Features used by RankBoost: RankBoost requires each example to be encoded as a set of real-valued features (binary features have values 0 and 1). A strength of RankBoost is that the set of features can be very large. We used 7024 features for training the SPR. These features count the number of occurrences of certain structural configurations in the sp-trees and the d-trees, in order to capture, declaratively, the decisions made by the randomized SPG, as in (Walker, Rambow and Rogati, 2002). The features were automatically generated using feature templates. For this experiment, we use two classes of features: (1) Rule-features: These features are derived from the sp-trees and represent the ways in which merge, infer and cue-word operations are applied to the tp-trees. These feature names start with “rule”. (2) Sent-features: These features are derived from the DSyntSs and describe the deep-syntactic structure of the utterance, including the chosen lexemes. As a result, some may be domain specific. These feature names are prefixed with “sent”.

We now describe the feature templates used in the discovery process. Three templates were
used for both sp-tree and d-tree features; two         dominated by a node labeled with that op-
were used only for sp-tree features. Local feature     eration in that tree (MIN); (2) the maximal
templates record structural configurations local       number of leaves dominated by a node la-
to a particular node (its ancestors, daughters         beled with that operation (MAX); and (3)
etc.). Global feature templates, which are used        the average number of leaves dominated by
only for sp-tree features, record properties of the    a node labeled with that operation (AVG).
entire sp-tree. We discard features that occur         For example, the sp-tree in Figure 7 has
fewer than 10 times to avoid those specific to         value 3 for “PERIOD infer max”, value 2 for
particular text plans.                                 “PERIOD infer min” and value 2.5 for “PE-
                                                       RIOD infer avg”.
    Strategy     System   Min    Max   Mean   S.D.
 Recommend      SPaRKy    2.0    5.0    3.6    .71
                HUMAN     2.5    5.0    3.9    .55     5   Experimental Results
               RANDOM     1.5    5.0    2.9    .88
                                                       We report two sets of experiments. The first ex-
   Compare2     SPaRKy    2.5    5.0    3.9    .71
                HUMAN     2.5    5.0    4.4    .54
                                                       periment tests the ability of the SPR to select a
               RANDOM     1.0    5.0    2.9    1.3     high quality sentence plan from a population of
   Compare3     SPaRKy     1.5   4.5    3.4    .63     sentence plans randomly generated by the SPG.
                HUMAN      3.0   5.0    4.0    .49     Because the discriminatory power of the SPR is
               RANDOM      1.0   4.5    2.7    1.0
                                                       best tested by the largest possible population of
Table 1: Summary of Recommend, Compare2                sentence plans, we use 2-fold cross validation for
and Compare3 results (N = 180)                         this experiment. The second experiment com-
                                                       pares SPaRKy to template-based generation.
                                                          Cross Validation Experiment: We re-
   There are four types of local feature               peatedly tested SPaRKy on the half of the cor-
template: traversal features, sister features,         pus of 1756 sp-trees held out as test data for
ancestor features and leaf features. Local feature templates are applied to all nodes in an sp-tree or d-tree (except that the leaf feature is not used for d-trees); the value of the resulting feature is the number of occurrences of the described configuration in the tree. For each node in the tree, traversal features record the preorder traversal of the subtree rooted at that node, for all subtrees of all depths. An example is the feature “rule traversal assert-com-list exceptional” (with value 1) of the tree in Figure 7. Sister features record all consecutive sister nodes. An example is the feature “rule sisters PERIOD infer RELATIVE CLAUSE infer” (with value 1) of the tree in Figure 7. For each node in the tree, ancestor features record all the initial subpaths of the path from that node to the root. An example is the feature “rule ancestor PERIOD contrast*PERIOD infer” (with value 1) of the tree in Figure 7. Finally, leaf features record all initial substrings of the frontier of the sp-tree. For example, the sp-tree of Figure 7 has value 1 for the feature “leaf #assert-com-list exceptional#assert-com-cuisine”.
   Global features apply only to the sp-tree. They record, for each sp-tree and for each clause-combining operation labeling a non-frontier node, (1) the minimal number of leaves

each fold. The evaluation metric is the human-assigned score for the variant that was rated highest by SPaRKy for each text plan for each task/user combination. We evaluated SPaRKy on the test sets by comparing three data points for each text plan: HUMAN (the score of the top-ranked sentence plan); SPaRKy (the score of the SPR’s selected sentence plan); and RANDOM (the score of a sentence plan randomly selected from the alternative sentence plans).
   We report results separately for comparisons between two entities and among three or more entities. These two types of comparison are generated using different strategies in the SPG, and can produce text that differs considerably in both length and structure.
   Table 1 summarizes the differences between SPaRKy, HUMAN and RANDOM for recommendations, comparisons between two entities and comparisons among three or more entities. For all three presentation types, paired t-tests comparing SPaRKy to HUMAN and to RANDOM showed that SPaRKy was significantly better than RANDOM (df = 59, p < .001) and significantly worse than HUMAN (df = 59, p < .001). This demonstrates that a trainable sentence planner can produce sentence plans that are significantly better than the baseline (RANDOM), with less human effort than programming templates.
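The three data points and the paired comparison described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the paper's evaluation code: each text plan is represented as a hypothetical list of (ranker score, human score) pairs, one per alternative sentence plan; HUMAN is taken as the best human score, SPaRKy as the human score of the plan the ranker places first, and RANDOM as the human score of a randomly chosen plan. The function names and toy numbers are illustrative only. The sketch computes just the paired t statistic and its degrees of freedom.

```python
import random
import statistics
from math import sqrt


def data_points(candidates, rng):
    """Return (HUMAN, SPaRKy, RANDOM) scores for one text plan.

    candidates: hypothetical list of (ranker_score, human_score) pairs,
    one per alternative sentence plan for the same text plan.
    """
    human = max(h for _, h in candidates)            # best human-rated plan
    sparky = max(candidates, key=lambda c: c[0])[1]  # plan the ranker picks (first on ties)
    rand = rng.choice(candidates)[1]                 # random baseline
    return human, sparky, rand


def paired_t(xs, ys):
    """Paired t statistic and degrees of freedom for matched samples xs, ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    diffs = [x - y for x, y in zip(xs, ys)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Hypothetical scores for three text plans (not data from the paper).
plans = [
    [(0.9, 4.0), (0.2, 2.5), (0.5, 4.5)],
    [(0.7, 3.5), (0.8, 3.0), (0.1, 5.0)],
    [(0.3, 2.0), (0.6, 4.0), (0.4, 3.5)],
]
rng = random.Random(0)
points = [data_points(p, rng) for p in plans]
human, sparky, rand = (list(col) for col in zip(*points))
t, df = paired_t(human, sparky)
```

Pairing by text plan matters because all three scores for a plan come from the same content, so the test is run on per-plan differences rather than on independent samples. Where SciPy is available, `scipy.stats.ttest_rel` computes the same statistic and also returns a p-value.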
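The local feature templates described at the start of this section (traversal, sister, ancestor and leaf features) can be illustrated with a short Python sketch. It assumes a hypothetical `Node` class for sp-trees and d-trees, joins labels with spaces, `*` and `#` only to mimic the example feature strings (omitting their “rule” prefix), and counts every non-empty initial subpath of the node-to-root path as an ancestor feature. The toy tree is invented and is not Figure 7; this is not SPaRKy’s actual feature extractor.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    # Hypothetical node: internal nodes carry an operation plus relation
    # (e.g. "PERIOD infer"); leaves carry a speech act.
    label: str
    children: list[Node] = field(default_factory=list)


def walk(node: Node, above: list[str] | None = None) -> Iterator[tuple[Node, list[str]]]:
    """Yield (node, path) pairs; path lists labels from the node up to the root."""
    path = [node.label] + (above or [])
    yield node, path
    for child in node.children:
        yield from walk(child, path)


def preorder(node: Node) -> list[str]:
    """Labels of the subtree rooted at node, in preorder."""
    labels = [node.label]
    for child in node.children:
        labels.extend(preorder(child))
    return labels


def frontier(node: Node) -> list[str]:
    """Leaf labels from left to right."""
    if not node.children:
        return [node.label]
    return [leaf for child in node.children for leaf in frontier(child)]


def local_features(root: Node, include_leaf: bool = True) -> Counter:
    """Count traversal, sister, ancestor and (optionally) leaf features."""
    feats: Counter = Counter()
    for node, path in walk(root):
        # Traversal: preorder traversal of the subtree rooted at this node.
        feats["traversal " + " ".join(preorder(node))] += 1
        # Sisters: each pair of consecutive children of this node.
        for left, right in zip(node.children, node.children[1:]):
            feats[f"sisters {left.label} {right.label}"] += 1
        # Ancestor: every initial subpath of the node-to-root path.
        for k in range(1, len(path) + 1):
            feats["ancestor " + "*".join(path[:k])] += 1
    if include_leaf:  # set include_leaf=False for d-trees
        leaves = frontier(root)
        for k in range(1, len(leaves) + 1):
            feats["leaf " + "".join("#" + leaf for leaf in leaves[:k])] += 1
    return feats


# Hypothetical toy sp-tree (not Figure 7 from the paper).
tree = Node("PERIOD infer", [
    Node("PERIOD contrast", [Node("assert-com-list exceptional"),
                             Node("assert-com-cuisine")]),
    Node("RELATIVE CLAUSE infer", [Node("assert-com-decor"),
                                   Node("assert-com-service")]),
])
```

Because each feature value is an occurrence count, a `Counter` keyed by the feature string is a natural sparse representation to hand to a ranker.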
   Comparison with template generation: For each content plan input to SPaRKy, the judges also rated the output of a template-based generator for MATCH. This template-based generator performs text planning and sentence planning (the focus of the current paper), including some discourse cue insertion, clause combining and referring expression generation; the templates themselves are described in (Walker et al., 2002). Because the templates are highly tailored to this domain, this generator can be expected to perform well. Example template-based and SPaRKy outputs for a comparison among three or more items are shown in Figure 9.

System     Realization                                                       H
Template   Among the selected restaurants, the following offer              4.5
           exceptional overall value. Uguale’s price is 33 dollars. It
           has good decor and very good service. It’s a French, Italian
           restaurant. Da Andrea’s price is 28 dollars. It has good
           decor and very good service. It’s an Italian restaurant.
           John’s Pizzeria’s price is 20 dollars. It has mediocre decor
           and decent service. It’s an Italian, Pizza restaurant.
SPaRKy     Da Andrea, Uguale, and John’s Pizzeria offer exceptional          4
           value among the selected restaurants. Da Andrea is an
           Italian restaurant, with very good service, it has good
           decor, and its price is 28 dollars. John’s Pizzeria is an
           Italian , Pizza restaurant. It has decent service. It has
           mediocre decor. Its price is 20 dollars. Uguale is a French,
           Italian restaurant, with very good service. It has good
           decor, and its price is 33 dollars.

Figure 9: Comparisons between three or more items; H = humans’ score.

    Strategy    System     Min   Max    Mean    S.D.
    Recommend   Template   2.5   5.0    4.22    0.74
                SPaRKy     2.5   4.5    3.57    0.59
                HUMAN      4.0   5.0    4.37    0.37
    Compare2    Template   2.0   5.0    3.62    0.75
                SPaRKy     2.5   4.75   3.87    0.52
                HUMAN      4.0   5.0    4.62    0.39
    Compare3    Template   1.0   5.0    4.08    1.23
                SPaRKy     2.5   4.25   3.375   0.38
                HUMAN      4.0   5.0    4.63    0.35

Table 2: Summary of template-based generation results. N = 180

   Table 2 summarizes the judges’ scores (minimum, maximum, mean and standard deviation) for template-based sentence planning, alongside those for SPaRKy and HUMAN. A paired t-test comparing HUMAN and template-based scores showed that HUMAN was significantly better than template-based sentence planning only for compare2 (df = 29, t = 6.2, p < .001); the judges evidently did not like the template for comparisons between two items. A paired t-test comparing SPaRKy and template-based sentence planning showed that template-based sentence planning was significantly better than SPaRKy only for recommendations (df = 29, t = 3.55, p < .01). These results demonstrate that trainable sentence planning shows promise for producing output comparable to that of a template-based generator, with less programming effort and more flexibility.
   The standard deviation for all three template-based strategies was larger than for HUMAN or SPaRKy, indicating that there may be content-specific aspects to the sentence planning done by SPaRKy that contribute to output variation. The data bear this out: SPaRKy learned content-specific preferences about clause combining and discourse cue insertion that a template-based generator cannot easily model, but that a trainable sentence planner can. For example, Table 3 shows the nine rules generated on the first test fold that have the largest negative impact on the final RankBoost score (above the double line) and the largest positive impact on the final RankBoost score (below the double line), for comparisons among three or more entities. The rule with the largest positive impact shows that SPaRKy learned to prefer that justifications involving price be merged with other information using a conjunction.
   These rules are also specific to presentation type. Averaging over both folds of the experiment, the number of unique features appearing in rules is 708, of which 66 appear in the rule sets for two presentation types and 9 appear in the rule sets for all three presentation types. There are on average 214 rule features, 428 sentence features and 26 leaf features. Ancestor features are the most common (319), followed by traversal features (264) and sister features (60). The remainder of the features (67) are for specific lexemes.
   To sum up, this experiment shows that the ability to model the interactions between domain content, task and presentation type is a strength of the trainable approach to sentence planning.

6   Conclusions
This paper shows that the training technique used in SPoT can be easily extended to a new
N   Condition                                  αs
