Neural Language Generation: Formulation, Methods, and Evaluation


Cristina Gârbacea1, Qiaozhu Mei1,2
1 Department of EECS, University of Michigan, Ann Arbor, MI, USA
2 School of Information, University of Michigan, Ann Arbor, MI, USA
{garbacea, qmei}@umich.edu

arXiv:2007.15780v1 [cs.CL] 31 Jul 2020

Abstract

Recent advances in neural network-based generative modeling have reignited the hopes of having computer systems capable of seamlessly conversing with humans and able to understand natural language. Neural architectures have been employed to generate text excerpts to various degrees of success, in a multitude of contexts and tasks that fulfil various user needs. Notably, high-capacity deep learning models trained on large scale datasets demonstrate unparalleled abilities to learn patterns in the data even in the absence of explicit supervision signals, opening up a plethora of new possibilities for producing realistic and coherent texts. While the field of natural language generation is evolving rapidly, there are still many open challenges to address. In this survey we formally define and categorize the problem of natural language generation. We review particular application tasks that are instantiations of these general formulations, in which generating natural language is of practical importance. Next we include a comprehensive outline of methods and neural architectures employed for generating diverse texts. Nevertheless, there is no standard way to assess the quality of text produced by these generative models, which constitutes a serious bottleneck towards the progress of the field. To this end, we also review current approaches to evaluating natural language generation systems. We hope this survey will provide an informative overview of formulations, methods, and assessments of neural natural language generation.

1 Introduction

Recent successes in deep generative modeling and representation learning have led to significant advances in natural language generation (NLG), motivated by an increasing need to understand and derive meaning from language. The research field of text generation is fundamental in natural language processing and aims to produce realistic and plausible textual content that is indistinguishable from human-written text (Turing, 1950). Broadly speaking, the goal of predicting a syntactically and semantically correct sequence of consecutive words given some context is achieved in two steps: first estimating a distribution over sentences from a given corpus, and then sampling novel and realistic-looking sentences from the learnt distribution. Ideally, the generated sentences preserve the semantic and syntactic properties of real-world sentences and are different from the training examples used to estimate the model (Zhang et al., 2017b). Language generation is an inherently complex task which requires considerable linguistic and domain knowledge at multiple levels, including syntax, semantics, morphology, phonology, pragmatics, etc. Moreover, texts are generated to fulfill a communicative goal (Reiter, 2019), such as to provide support in decision making, summarize content, translate between languages, converse with humans, make specific texts more accessible, as well as to entertain users or encourage them to change their behaviour. Therefore generated texts should be tailored to their specific audience in terms of appropriateness of content and terminology used (Paris, 2015), as well as for fairness and transparency reasons (Mayfield et al., 2019). For a long time natural language generation models have been rule-based or have relied on training shallow models on sparse, high-dimensional features. With the recent resurgence of neural networks, neural network-based models for text generation trained on dense vector representations have established performance unmatched by prior approaches and reignited the hopes of having machines able to understand language and seamlessly converse with humans. Indeed, generating meaningful and coherent texts is pivotal to many natural
language processing tasks. Nevertheless, designing neural networks that can generate coherent text and model long-term dependencies has long been a challenge for natural language generation due to the discrete nature of text data. Beyond that, the ability of neural network models to understand language and ground textual concepts beyond picking up on shallow patterns in the data still remains limited. Finally, evaluation of generative models for natural language is an equally active and challenging research area of significant importance in driving forward the progress of the field.

In this work we formally define the problem of neural text generation at particular contexts and present the diverse practical applications of text generation in Section 2. In Section 3 we include a comprehensive overview of deep learning methodologies and neural model architectures employed in the literature for neural network-based natural language generation. We review methods [...]

[...]ation presents rich practical opportunities.

2.1 Generic / Free-Text Generation

The problem of generic text generation aims to produce realistic text without placing any external user-defined constraints on the model output. Nevertheless, it does consider the intrinsic history of past words generated by the model as context. We formally define the problem of free-text generation.

Given a discrete sequence of text tokens x = (x_1, x_2, . . . , x_n) as input, where each x_i is drawn from a fixed set of symbols, the goal of language modeling is to learn the unconditional probability distribution p(x) of the sequence x. This distribution can be factorized using the chain rule of probability (Bengio et al., 2003) into a product of conditional probabilities:

p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})    (1)
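To make the factorization in Equation (1) concrete, the following minimal Python sketch estimates the conditionals of a toy bigram language model from a tiny corpus, evaluates p(x) via the chain rule, and samples new sequences left to right. The corpus, the bigram approximation of the full history x_{<i}, and all function names are illustrative assumptions rather than components of any particular neural model.

import random
from collections import defaultdict

# Toy corpus standing in for "a given corpus"; purely illustrative.
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"],
          ["<s>", "the", "cat", "ran", "</s>"]]

# Estimate p(x_i | x_{i-1}) by counting, a bigram stand-in for a neural LM
# that would condition on the full history x_{<i}.
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        counts[prev][cur] += 1

def cond_prob(cur, prev):
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

def sequence_prob(x):
    # Chain-rule factorization of Equation (1): product of conditionals.
    p = 1.0
    for prev, cur in zip(x, x[1:]):
        p *= cond_prob(cur, prev)
    return p

def sample(max_len=10):
    # Left-to-right ancestral sampling from the learnt distribution.
    x = ["<s>"]
    while x[-1] != "</s>" and len(x) < max_len:
        words, weights = zip(*counts[x[-1]].items())
        x.append(random.choices(words, weights=weights)[0])
    return x

print(sequence_prob(["<s>", "the", "cat", "sat", "</s>"]))
print(sample())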
[...] may carry different semantics for different readers, therefore we want to clarify that in this survey the definition of conditional text generation considers as context only attributes external to the model and not any model-intrinsic attributes such as, for example, the history of past generated words, which is already included in the formulation of the generic text generation problem in Section 2.1.

Conditional language models are used to learn the distribution p(x|c) of the data x conditioned on a specific attribute code c. Similar to the formulation of generic text generation, the distribution can still be decomposed using the chain rule of probability as follows:

p(x|c) = ∏_{i=1}^{n} p(x_i | x_{<i}, c)    (2)
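A minimal sketch of the conditional formulation in Equation (2): the same toy bigram estimator as above, except that every conditional is additionally keyed on an attribute code c (here a sentiment label). The labelled examples and the attribute values are purely illustrative assumptions.

import random
from collections import defaultdict

# Toy labelled corpus: each sentence carries an attribute code c.
data = [("POS", ["<s>", "great", "movie", "</s>"]),
        ("NEG", ["<s>", "boring", "movie", "</s>"]),
        ("POS", ["<s>", "great", "acting", "</s>"])]

counts = defaultdict(lambda: defaultdict(int))
for c, sent in data:
    for prev, cur in zip(sent, sent[1:]):
        counts[(c, prev)][cur] += 1   # bigram stand-in for p(x_i | x_{<i}, c)

def sample(c, max_len=10):
    # Every sampling step is conditioned on the attribute code c.
    x = ["<s>"]
    while x[-1] != "</s>" and len(x) < max_len:
        words, weights = zip(*counts[(c, x[-1])].items())
        x.append(random.choices(words, weights=weights)[0])
    return x

print(sample("POS"))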
2.3 Constrained Text Generation

The problem of constrained text generation focuses on generating coherent and logical texts that cover a specific set of concepts (such as pre-defined nouns, verbs, entities, phrases or sentence fragments) desired to be present in the output, and/or abide by user-defined rules which reflect the particular interests of the system user. Lexically constrained text generation (Hokamp and Liu, 2017) places explicit constraints on independent attribute controls and combines these with differentiable approximation to produce discrete text samples. In the literature the distinction between conditional, controlled and constrained text generation is not clearly defined, and these terms are often used interchangeably. In fact, the first work that proposed [...]

• Soft-constrained text generation: [...] first constructed, followed by training a conditional text generation model to capture their co-occurrence and generate text which contains the constrained keywords. Nevertheless, this approach does not guarantee that all desired keywords will be preserved during generation; some of them may get lost and will not be found in the generated output, in particular when there are constraints on simultaneously including multiple keywords.

• Hard-constrained text generation: refers to the mandatory inclusion of certain keywords in the output sentences. The matching function f is in this case a binary indicator, which rules out the possibility of generating infeasible sentences that do not meet the given constraints. Therefore, by placing hard constraints on the generated output, all lexical constraints must be present in the generated output. Unlike soft-constrained models, which are straightforward to design, the problem of hard-constrained text generation requires the design of complex dedicated neural network architectures, as illustrated by the simple constraint check sketched after this list.
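A minimal sketch of how the binary matching function f of hard-constrained generation can be interpreted: candidate outputs that do not contain every required keyword are ruled out. This rejection-style post-filter only illustrates the constraint check itself, not one of the dedicated architectures referred to above; the candidate sentences and keywords are hypothetical.

def satisfies_constraints(tokens, keywords):
    # Binary matching function f: True only if every lexical constraint appears.
    return all(k in tokens for k in keywords)

def filter_candidates(candidates, keywords):
    # Infeasible outputs (missing any required keyword) are ruled out,
    # mirroring the effect of a hard constraint on the generated text.
    return [c for c in candidates if satisfies_constraints(c, keywords)]

candidates = [["the", "cat", "sat", "on", "the", "mat"],
              ["a", "dog", "sat", "on", "a", "rug"]]
print(filter_candidates(candidates, keywords={"cat", "mat"}))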
Constrained text generation is useful in many scenarios, such as incorporating in-domain terminology in machine translation (Post and Vilar, 2018), avoiding generic and meaningless responses in dialogue systems (Mou et al., 2016), and incorporating ground-truth text fragments (such as semantic attributes, object annotations) in image caption generation (Anderson et al., 2017). Typical attributes used to generate constrained natural language are the tense and the length of the summaries in text summarization (Fan et al., 2018a), the sentiment of the generated content in review generation (Mueller et al., 2017), language complexity in text simplification, or the style in text style transfer applications. In addition, constrained text generation is used to overcome limitations of neural text generation models for dialogue such as genericness and repetitiveness of responses (See et al., 2019), (Serban et al., 2016).

Nevertheless, generating text under specific lexical constraints is challenging (Zhang et al., 2020). While for humans it is straightforward to generate sentences that cover a given set of concepts or abide by pre-defined rules by making use of their commonsense reasoning ability, generative commonsense reasoning with a constrained text generation task is not as simple for machine learning models (Lin et al., 2019).

2.4 Natural Language Generation Tasks

In what follows we present natural language generation tasks which are instances of generic, conditional and constrained text generation. All these applications demonstrate the practical value of generating coherent and meaningful texts, and that advances in natural language generation are of immediate applicability and practical importance in many downstream tasks.

2.4.1 Neural Machine Translation

The field of machine translation focuses on the automatic translation of textual content from one language into another language. The field has undergone major changes in recent years, with end-to-end learning approaches for automated translation based on neural networks replacing conventional phrase-based statistical methods (Bahdanau et al., 2014), (Wu et al., 2016a). In contrast to statistical models which consist of several sub-components trained and tuned separately, neural machine translation models build and train a single, large neural network end-to-end by feeding it as input textual content in the source language and retrieving its corresponding translation in the target language. Neural machine translation is a typical example of conditional text generation, where the condition encapsulated by the conditional attribute code c is represented by the input sentence in the source language and the goal is to generate its corresponding translation in the target language. In addition, neural machine translation is also an instance of constrained text generation given that it imposes the constraint to generate text in the target language. Additional constraints can be placed on the inclusion in the target sentence of named entities already present in the source sentence. In what follows we formally define the problem of neural machine translation.

We denote with V_s the vocabulary of the source language and with V_t the vocabulary of the target language, with |V_t| ≈ |V_s| and V_t ∩ V_s = ∅. Let us also denote with V_s* and V_t* all possible sentences under V_s, respectively V_t. Given a source sentence X = (x_1, x_2, . . . , x_l), X ∈ V_s*, x_i ∈ V_s, where x_i is the i-th word in X, ∀i = 1, . . . , l, the goal is to generate the distribution over the possible output sentences Y = (y_1, y_2, . . . , y_l′), Y ∈ V_t*, y_j ∈ V_t, where y_j is the j-th word in Y, ∀j = 1, . . . , l′, by factoring Y into a chain of conditional probabilities with left-to-right causal structure using a neural network with parameters θ:

p(Y|X; θ) = ∏_{t=1}^{l′+1} p(y_t | y_{0:t−1}, x_{1:l}; θ)    (5)

Special sentence delimiters y_0 (<s>) and y_{l′+1} (</s>) are commonly added to the vocabulary to mark the beginning and end of the target sentence Y. Typically in machine translation the source and target vocabularies consist of the most frequent words used in a language (for eg., the top 15,000 words), while the remaining words are replaced with a special <UNK> token. Every source sentence X is usually mapped to exactly one target sentence Y, and there is no sharing of words between the source sentence X and the target sentence Y.

Although neural network based approaches to machine translation have resulted in superior performance compared to statistical models, they are computationally expensive both in training and in translation inference time. The output of machine translation models is evaluated by asking human annotators to rate the generated translations on various dimensions of textual quality, or by comparisons with human-written reference texts using automated evaluation metrics.
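As an illustration of Equation (5), the sketch below scores a candidate translation by summing the per-step log-probabilities log p(y_t | y_{<t}, x; θ). The step distribution is a hard-coded toy lookup table standing in for a trained neural decoder; the tokens and probabilities are assumptions made only so the computation runs end to end.

import math

BOS, EOS = "<s>", "</s>"

def step_distribution(prefix, source):
    # Placeholder for a neural decoder p(y_t | y_{<t}, x_{1:l}; theta).
    # A fixed toy lookup, purely to make Equation (5) executable.
    table = {(): {"le": 0.9, "la": 0.1},
             ("le",): {"chat": 0.8, "chien": 0.2},
             ("le", "chat"): {EOS: 1.0}}
    return table.get(tuple(prefix), {EOS: 1.0})

def log_likelihood(target, source):
    # Sum of log p(y_t | y_{<t}, x; theta) over the target, as in Equation (5).
    score, prefix = 0.0, []
    for tok in target + [EOS]:
        probs = step_distribution(prefix, source)
        score += math.log(probs.get(tok, 1e-12))
        prefix.append(tok)
    return score

print(log_likelihood(["le", "chat"], source=["the", "cat"]))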
2.4.2 Text Summarization

Text summarization is designed to facilitate a quick grasp of the essence of an input document by producing a condensed summary of its content. This can be achieved in two ways, either by means of extractive summarization or through abstractive/generative summarization. While extractive summarization (Nallapati et al., 2017) methods produce summaries by copy-pasting the relevant portions from the input document, abstractive summarization (Rush et al., 2015), (Nallapati et al., 2016), (See et al., 2017) algorithms can generate novel content that is not present in the input document. Hybrid approaches combining extractive summarization techniques with neural abstractive summary generation serve to identify salient information in a document and generate distilled Wikipedia articles (Liu et al., 2018b). Characteristics of a good summary include brevity, fluency, non-redundancy, coverage and logical entailment of the most salient pieces of information from the input document(s).

Text summarization is a conditional text generation task where the condition is represented by the given document(s) to be summarized. Additional control codes are used in remainder summarization, offering flexibility to define which parts of the document(s) are of interest, for eg., remaining paragraphs the user has not read yet, or in source-specific summarization to condition summaries on the source type of input documents, for eg., newspapers, books or news articles. Besides being a conditional text generation task, text summarization is also a typical example of constrained text generation where the condition is set such that the length of the resulting summary is strictly less than the length of the original document. Unlike machine translation where output length varies depending on the source content, in text summarization the length of the output is fixed and pre-determined. Controlling the length of the generated summary allows users to digest information at different levels of granularity and to define the level of detail desired, accounting for particular user needs and time budgets; for eg., a document can be summarized into a headline, a single sentence or a multi-sentence paragraph. In addition, explicit constraints can be placed on specific concepts desired for inclusion in the summary. Most frequently, named entities are used as constraints in text summarization to ensure the generated summary is specifically focused on topics and events describing them. In addition, in the particular case of extractive summarization, there is the additional constraint that sentences need to be picked explicitly from the original document. In what follows we formally define the task of text summarization.

We consider the input consisting of a sequence of M words x = (x_1, x_2, . . . , x_M), x_i ∈ V_X, i = 1, . . . , M, where V_X is a fixed vocabulary of size |V_X|. Each word x_i is represented as an indicator vector x_i ∈ {0, 1}^{V_X}, sentences are represented as sequences of indicators, and X denotes the set of all possible inputs. A summarization model takes x as input and yields a shorter version of it in the form of an output sequence y = (y_1, y_2, . . . , y_N), with N < M and y_j ∈ {0, 1}^{V_Y}, ∀j = 1, . . . , N.

Abstractive / Generative Summarization. We define Y ⊂ ({0, 1}^{V_Y}, . . . , {0, 1}^{V_Y}) as the set of all possible generated summaries of length N, with y ∈ Y. The summarization system is abstractive if it tries to find the optimal sequence y*, y* ⊂ Y, under the scoring function s : X × Y → R, which can be expressed as:

y* = arg max_{y∈Y} s(x, y)    (6)

Extractive Summarization. As opposed to abstractive approaches which generate novel sentences, extractive approaches transfer parts from the input document x to the output y:

y* = arg max_{m∈{1,...,M}^N} s(x, x_[m_1,...,m_N])    (7)

Abstractive summarization is notably more challenging than extractive summarization, and allows incorporating real-world knowledge, paraphrasing and generalization, all crucial components of high-quality summaries (See et al., 2017). In addition, abstractive summarization does not impose any hard constraints on the system output [...]

[...] task. The condition is represented by the input document for which the text compression system needs to output a condensed version. The task is also constrained text generation given that the system needs to produce a compressed version of the input strictly shorter lengthwise. In addition, there can be further constraints specified when the text compression output is desired to be entity-centric.

We denote with C_i = {c_{i1}, c_{i2}, . . . , c_{il}} the set of possible compression spans and with y_{i,c} a binary variable which equals 1 if the c-th token of the i-th sentence ŝ_i in document D is deleted; we are interested in modeling the probability p(y_{i,c} | D, ŝ_i). Following the same definitions from Section 2.4.2, we can formally define the optimal compressed text sequence under scoring function s as:

y* = arg max_{m∈{1,...,M}^N, m_{i−1}<m_i} s(x, x_[m_1,...,m_N])    (8)
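A minimal sketch of the extractive formulation in Equation (7): sentences are selected from the input document under a scoring function s. Both the word-frequency-based score and the greedy approximation of the argmax are illustrative assumptions; real systems learn s with neural models.

from collections import Counter

def score(selected, doc_sentences, word_freq):
    # Toy scoring function s(x, x_[m1..mN]): reward covering frequent
    # document words, counted once each (a crude stand-in for salience).
    covered = {w for i in selected for w in doc_sentences[i]}
    return sum(word_freq[w] for w in covered)

def greedy_extract(doc_sentences, n):
    # Greedy approximation of the argmax in Equation (7): repeatedly pick the
    # sentence that most improves the score until n sentences are selected.
    n = min(n, len(doc_sentences))
    word_freq = Counter(w for s in doc_sentences for w in s)
    selected = []
    while len(selected) < n:
        best = max((i for i in range(len(doc_sentences)) if i not in selected),
                   key=lambda i: score(selected + [i], doc_sentences, word_freq))
        selected.append(best)
    return sorted(selected)

doc = [["neural", "models", "generate", "text"],
       ["they", "need", "large", "datasets"],
       ["text", "generation", "uses", "neural", "models"]]
print(greedy_extract(doc, n=1))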
[...] such as children, people with low education, people who have reading disorders or dyslexia, and non-native speakers of the language. In the literature text simplification has been addressed at multiple levels: i) lexical simplification (Devlin, 1999) is concerned with replacing complex words or phrases with simpler alternatives; ii) syntactic simplification (Siddharthan, 2006) alters the syntactic structure of the sentence; iii) semantic simplification (Kandula et al., 2010), sometimes also known as explanation generation, paraphrases portions of the text into simpler and clearer variants. More recently, end-to-end models for text simplification attempt to address all these steps at once.

Text simplification is an instance of conditional text generation given that we are conditioning on the input text to produce a simpler and more readable version of a complex document, as well as an instance of constrained text generation since there are constraints on generating simplified text that is shorter in length compared to the source document and with a higher readability level. To this end, it is mandatory to use words of lower complexity from a much simpler target vocabulary than the source vocabulary. We formally introduce the text simplification task below.

Let us denote with V_s the vocabulary of the source language and with V_t the vocabulary of the target language, with |V_t| ≪ |V_s| and V_t ⊆ V_s. Let us also denote with V_s* and V_t* all possible sentences under V_s, respectively V_t. Given a source sentence X = (x_1, x_2, . . . , x_l), X ∈ V_s*, x_i ∈ V_s, where x_i is the i-th word in X, ∀i = 1, . . . , l, the goal is to produce the simplified sentence Y = (y_1, y_2, . . . , y_l′), Y ∈ V_t*, y_j ∈ V_t, where y_j is the j-th word in Y, ∀j = 1, . . . , l′, by modeling the conditional probability p(Y|X). In the context of neural text simplification, a neural network with parameters θ is used to maximize the probability p(Y|X; θ).

Next we highlight differences between machine translation and text simplification. Unlike machine translation where the output sentence Y does not share any common terms with the input sentence X, in text simplification some or all of the words in Y might remain identical with the words in X in cases when the terms in X are already simple. In addition, unlike machine translation where the mapping between the source sentence and the target sentence is usually one-to-one, in text simplification the relation between the source sentence and the target sentence can be one-to-many or many-to-one, as simplification involves splitting and merging operations (Surya et al., 2018). Furthermore, infrequent words in the vocabulary cannot be simply dropped out and replaced with an unknown token as is typically done in machine translation, but they need to be simplified appropriately corresponding to their level of complexity (Wang et al., 2016a). Lexical simplification and content reduction are simultaneously approached with neural machine translation models in (Nisioi et al., 2017), (Sulem et al., 2018c). Nevertheless, text simplification presents particular challenges compared to machine translation. First, simplifications need to be adapted to particular user needs, and ideally personalized to the educational background of the target audience (Bingel, 2018), (Mayfield et al., 2019). Second, text simplification has the potential to bridge the communication gap between specialists and laypersons in many scenarios. For example, in the medical domain it can help improve the understandability of clinical records (Shardlow and Nawaz, 2019), address disabilities and inequity in educational environments (Mayfield et al., 2019), and assist with providing accessible and timely information to the affected population in crisis management (Temnikova, 2012).

2.4.4 Text Style Transfer

Style transfer is a newly emerging task designed to preserve the information content of a source sentence while delivering it to meet desired presentation constraints. To this end, it is important to disentangle the content itself from the style in which it is presented and be able to manipulate the style so as to easily change it from one attribute into another attribute of different or opposite polarity. This is often achieved without the need for parallel data for source and target styles, but accounting for the constraint that the transferred sentences should match in style example sentences from the target style. To this end, text style transfer is an instance of constrained text generation. In addition, it is also a typical scenario of conditional text generation where we are conditioning on the given source text. Style transfer has been originally used in computer vision applications for image-to-image translation (Gatys et al., 2016), (Liu and Tuzel, 2016), (Zhu et al., 2017), and more recently has been used in natural language processing applications for machine translation, sentiment modification to change the sentiment of a sentence from positive to negative and vice versa, word substitution decipherment and word order recovery (Hu et al., 2017).

The problem of style transfer in language generation can be formally defined as follows. Given two datasets X_1 = {x_1^(1), x_1^(2), . . . , x_1^(n)} and X_2 = {x_2^(1), x_2^(2), . . . , x_2^(n)} with the same content distribution but different unknown styles y_1 and y_2, where the samples in dataset X_1 are drawn from the distribution p(x_1|y_1) and the samples in dataset X_2 are drawn from the distribution p(x_2|y_2), the goal is to estimate the style transfer functions between them, p(x_1|x_2; y_1, y_2) and p(x_2|x_1; y_1, y_2). According to the formulation of the problem we can only observe the marginal distributions p(x_1|y_1) and p(x_2|y_2), and the goal is to recover the joint distribution p(x_1, x_2|y_1, y_2), which can be expressed as follows assuming the existence of a latent content variable z generated from distribution p(z):

p(x_1, x_2|y_1, y_2) = ∫_z p(z) p(x_1|y_1, z) p(x_2|y_2, z) dz    (9)

Given that x_1 and x_2 are independent from each other given z, the conditional distribution corresponding to the style transfer function is defined:

p(x_1|x_2; y_1, y_2) = ∫_z p(x_1, z|x_2; y_1, y_2) dz
                     = ∫_z p(x_1|y_1, z) p(x_2|y_2, z) dz
                     = E_{z∼p(z|x_2,y_2)} [p(x_1|y_1, z)]    (10)

Models proposed in the literature for style transfer rely on encoder-decoder models. Given an encoder E : X × Y → Z with parameters θ_E which infers the content z and style y for a given sentence x, and a generator G : Y × Z → X with parameters θ_G which, given content z and style y, generates sentence x, the reconstruction loss can be defined as follows:

L_rec = E_{x_1∼X_1} [− log p_G(x_1 | y_1, E(x_1, y_1))] + E_{x_2∼X_2} [− log p_G(x_2 | y_2, E(x_2, y_2))]    (11)

Latent VAE representations are manipulated to generate textual output with specific attributes, for eg. contemporary text written in Shakespeare style or improving the positivity sentiment of a sentence (Mueller et al., 2017). Style-independent content representations are learnt via disentangled latent representations for generating sentences with controllable style attributes (Shen et al., 2017), (Hu et al., 2017). Language models are employed as style discriminators to learn disentangled representations for unsupervised text style transfer tasks such as sentiment modification (Yang et al., 2018d).
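The reconstruction loss of Equation (11) can be sketched as follows, with stub functions standing in for the encoder E and the generator probability p_G. The hash-based "content code" and the constant generator probability are placeholders chosen only so that the loss is computable; they are not part of any published style transfer model.

import math

def encoder(x, y):
    # Stub E(x, y): returns a "content" code z for sentence x with style y.
    # A real system would use a neural encoder; here a hash-based placeholder.
    return hash((tuple(x), "content")) % 97

def generator_prob(x, y, z):
    # Stub p_G(x | y, z): probability the generator reconstructs x from (y, z).
    # Fixed at 0.5 purely so the loss in Equation (11) can be evaluated.
    return 0.5

def reconstruction_loss(X1, X2, y1, y2):
    # Monte-Carlo estimate of L_rec in Equation (11): expected negative
    # log-likelihood of reconstructing each corpus from its own style.
    l1 = sum(-math.log(generator_prob(x, y1, encoder(x, y1))) for x in X1) / len(X1)
    l2 = sum(-math.log(generator_prob(x, y2, encoder(x, y2))) for x in X2) / len(X2)
    return l1 + l2

X1 = [["the", "movie", "was", "great"]]
X2 = [["the", "movie", "was", "awful"]]
print(reconstruction_loss(X1, X2, y1="positive", y2="negative"))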
2.4.5 Dialogue Systems

A dialogue system, also known as a conversational agent, is a computer system designed to converse with humans using natural language. To be able to carry a meaningful conversation with a human user, the system needs to first understand the message of the user, represent it internally, decide how to respond to it and issue the target response using natural language surface utterances (Chen et al., 2017a). Dialogue generation is an instance of conditional text generation where the system response is conditioned on the previous user utterance and frequently on the overall conversational context. Dialogue generation can also be an instance of constrained text generation when the conversation is carried on a topic which explicitly involves entities such as locations, persons, institutions, etc. From an application point of view, dialogue systems can be categorized into (Keselj, 2009):

• task-oriented dialogue agents: are designed to have short conversations with a human user to help him/her complete a particular task. For example, dialogue agents embedded into digital assistants and home controllers assist with finding products, booking accommodations, provide travel directions, make restaurant reservations and phone calls on behalf of their users. Therefore, task-oriented dialogue generation is an instance of both conditional and constrained text generation.

• non-task oriented dialogue agents or chatbots: are designed for carrying extended conversations with their users on a wide range of open domains. They are set up to mimic human-to-human interaction and unstructured human dialogues in an entertaining way. Therefore, non-task oriented dialogue is an instance of conditional text generation.

We formally define the task of dialogue generation. Generative dialogue models take as input a dialogue context c and generate the next response x. The training data consists of a set of samples of the form {c_n, x_n, d_n} ∼ p_source(c, x, d), where d denotes the source domain. At testing time, the model is given the dialog context c and the target domain, and must generate the correct response x. The goal of a generative dialogue model is to learn the function F : C × D → X which performs well on unseen examples from the target domain after seeing the training examples on the source domain. The source domain and the target domain can be identical; when they differ the problem is defined as zero-shot dialogue generation (Zhao and Eskenazi, 2018). The dialogue generation problem can be summarized as:

Training data: {c_n, x_n, d_n} ∼ p_source(c, x, d)
Testing data: {c, x, d} ∼ p_target(c, x, d)
Goal: F : C × D → X    (12)
alogue generation is that they tend to generate               task is to identify the entity referenced in the
safe, universally relevant responses that carry little        given piece of text. This is an instance of
meaning (Serban et al., 2016), (Li et al., 2016a),            both conditional and constrained text gener-
(Mou et al., 2016); for example universal replies             ation, given conditioning on the input ques-
such as “I don’t know” or “something” frequently              tion and constraining the generation task to
occur in the training set are likely to have high             be entity-centric. Factoid question answering
estimated probabilities at decoding time. Addi-               methods combine word and phrase-level rep-
tional factors that impact the conversational flow            resentations across sentences to reason about
in generative models of dialogue are identified as            entities (Iyyer et al., 2014), (Yin et al., 2015).
repetitions and contradictions of previous state-
ments, failing to balance specificity with gener-           • Reasoning-based Question Answering: given
icness of the output, and not taking turns in ask-            a collection of documents and a query,
ing questions (See et al., 2019). Furthermore, it             the task is to reason, gather, and synthe-
is desirable for generated dialogues to incorporate           size disjoint pieces of information spread
explicit personality traits (Zheng et al., 2019) and          within documents and across multiple docu-
control the sentiment (Kong et al., 2019a) of the             ments to generate an answer (De Cao et al.,
generated response to resemble human-to-human                 2019). The task involves multi-step rea-
conversations.                                                soning and understanding of implicit rela-
                                                              tions for which humans typically rely on
2.4.6 Question Answering                                      their background commonsense knowledge
Question answering systems are designed to find               (Bauer et al., 2018). The task is conditional
and integrate information from various sources to             given that the system generates an answer
provide responses to user questions (Fu and Feng,             conditioned on the input question, and may
2018). While traditionally candidate answers con-             be constrained when the information across
sist of words, phrases or sentence snippets re-               documents is focused on entities or specific
concepts that need to be incorporated in the      and end  of a sentence, as well as the un-
     generated answer.                                 known token  used for all words not
                                                       present in the vocabulary V , and V ∗ denotes all
  • Visual Question Answering: given an image          possible sentences over V . Given training set
    and a natural language question about the im-      D = {(I, y ∗ )} containing m pairs of the form
    age, the goal is to provide an accurate natural    (Ij , yj∗ ), ∀j = 1, . . . , m consisting of input im-
    language answer to the question posed about        age Ij and its corresponding ground-truth caption
    the image (Antol et al., 2015). By its nature      yj∗ = (yj∗1 , yj∗2 , . . . , yj∗M ), yj∗ ∈ V ∗ and yj∗k ∈
    the task is conditional, and can be constraint     V, ∀k = 1, . . . , M , we want to maximize the prob-
    when specific objects or entities in the image     abilistic model p(y|I; θ) with respect to model pa-
    need to be included in the generated answer.       rameters θ.
   Question answering systems that meet var-
ious information needs are proposed in the
literature, for eg., for answering mathemat-           2.4.8 Narrative Generation / Story Telling
ical questions (Schubotz et al., 2018), med-
ical information needs (Wiese et al., 2017),           Neural narrative generation aims to produce co-
(Bhandwaldar and Zadrozny, 2018), quiz bowl            herent stories automatically and is regarded as an
questions (Iyyer et al., 2014), cross-lingual and      important step towards computational creativity
multi-lingual questions (Loginova et al., 2018).       (Gervás, 2009). Unlike machine translation which
In practical applications of question answering,       produces a complete transduction of an input sen-
users are typically not only interested in learning    tence which fully defines the target semantics,
the exact answer word, but also in how this is         story telling is a long-form open-ended text gen-
related to other important background information      eration task which simultaneously addresses two
and to previously asked questions and answers          separate challenges: the selection of appropriate
(Fu and Feng, 2018).                                   content (“what to say”) and the surface realization
2.4.7 Image / Video Captioning                         of the generation (“how to say it”)(Wiseman et al.,
                                                       2017). In addition, the most difficult aspect of
Image captioning is designed to generate captions
                                                       neural story generation is producing a a coherent
in the form of textual descriptions for an image.
                                                       and fluent story which is much longer than the
This involves the recognition of the important ob-
                                                       short input specified by the user as the story ti-
jects present in the image, as well as object prop-
                                                       tle. To this end, many neural story generation
erties and interactions between objects to be able
                                                       models assume the existence of a high-level plot
to generate syntactically and semantically correct
                                                       (commonly specified as a one-sentence outline)
natural language sentences (Hossain et al., 2019).
                                                       which serves the role of a bridge between titles and
In the literature the image captioning task has been
                                                       stories (Chen et al., 2019a), (Fan et al., 2018b),
framed from either a natural language generation
                                                       (Xu et al., 2018b), (Drissi et al., 2018), (Yao et al.,
perspective (Kulkarni et al., 2013), (Chen et al.,
                                                       2019). Therefore, narrative generation is a con-
2017b) where each system produces a novel sen-
                                                       strained text generation task since explicit con-
tence, or from a ranking perspective where exist-
                                                       straints are placed on which concepts to include
ing captions are ranked and the top one is selected
                                                       in the narrative so as to steer the generation in par-
(Hodosh et al., 2013). Image/ video captioning is
                                                       ticular topic directions. In addition, another con-
a conditional text generation task where the cap-
                                                       straint is that the output length needs to be strictly
tion is conditioned on the input image or video. In
                                                       greater than the input length. We formally define
addition, it can be a constrained text generation
                                                       the task of narrative generation below.
task when specific concepts describing the input
need to be present in the generated output.               Assuming as input to the neural story generation
   Formally, the task of image/ video captioning       system the title x = x1 , x2 , . . . , xI consisting of
takes as input an image or video I and generates       I words, the goal is to produce a comprehensible
a sequence of words y = (y1 , y2 , . . . , yN ), y ∈   and logical story y = y1 , y2 , . . . , yJ of J words in
V ∗ and yi ∈ V, ∀i = 1, . . . , N , where V de-        length. Assuming the existence of a one sentence
notes the vocabulary of output words and in-           outline z = z1 , z2 , . . . , zK that contains K words
cludes special tokens to mark the beginning         for the entire story, the latent variable model for
neural story generation can be formally expressed as:

P(y|x; θ, γ) = Σ_z P(z|x; θ) P(y|x, z; γ)    (13)

where P(z|x; θ) defines a planning model parameterized by θ and P(y|x, z; γ) defines a generation model parameterized by γ.

The planning model P(z|x; θ) receives as input the title x of the narrative and generates the one-sentence narrative outline z given the title:

P(z|x; θ) = ∏_{k=1}^{K} P(z_k | x, z_{<k}; θ)    (14)
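The latent-variable factorization of Equations (13)-(14) can be made concrete with the toy tables below, where a planning distribution over outlines z and a generation distribution over stories y are enumerated explicitly and marginalized by hand. The titles, outlines, stories and probabilities are all hypothetical.

# Toy planning and generation distributions, purely to make the
# plan-then-write factorization of Equations (13)-(14) executable.
P_plan = {  # P(z | x): outline given title
    "a lost dog": {"dog gets lost ; owner searches ; reunion": 0.7,
                   "dog runs away ; finds new home": 0.3}}
P_gen = {   # P(y | x, z): story given title and outline
    ("a lost dog", "dog gets lost ; owner searches ; reunion"):
        {"rex slipped his leash ... they hugged at the shelter .": 1.0},
    ("a lost dog", "dog runs away ; finds new home"):
        {"rex wandered north ... a farmer took him in .": 1.0}}

def story_prob(y, x):
    # P(y | x) = sum over latent outlines z of P(z | x) * P(y | x, z).
    return sum(pz * P_gen.get((x, z), {}).get(y, 0.0)
               for z, pz in P_plan[x].items())

title = "a lost dog"
story = "rex slipped his leash ... they hugged at the shelter ."
print(story_prob(story, title))   # 0.7 under this toy model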
[...] lines of sentences. The process is interactive and the author can keep modifying terms to reflect his writing intent. Poetry generation is a constrained text generation problem since user-defined concepts need to be included in the generated poem. At the same time, it can also be a conditional text generation problem given explicit conditioning on the stylistic features of the poem. We define the poetry generation task below.

Given as input a set of keywords that summarize an author's writing intent K = {k_1, k_2, . . . , k_{|K|}}, where each k_i ∈ V, i = 1, . . . , |K|, is a keyword term from vocabulary V, [...]

[...] time between the word-level loss function optimized by MLE and humans focusing on whole sequences of poem lines and assessing fine-grained criteria of the generated text such as fluency, coherence, meaningfulness and overall quality. These human evaluation criteria are modeled and incorporated into the reward function of a mutual reinforcement learning framework for poem generation (Yi et al., 2018). For a detailed overview of poetry generation we point the reader to (Oliveira, 2017).

2.4.10 Review Generation

Product reviews allow users to express opinions for different aspects of products or services received, and are popular on many online review websites such as Amazon, Yelp, Ebay, etc. These online reviews encompass a wide variety of writing styles and polarity strengths. The task of review generation is similar in nature to sentiment analysis and a lot of past work has focused on identifying and extracting subjective content in review data (Liu, 2015), (Zhao et al., 2016). Automatically generating reviews given contextual information focused on product attributes, ratings, sentiment, time and location is a meaningful conditional text generation task. Common product attributes used in the literature are the user ID, the product ID, the product rating or the user sentiment for the generated review (Dong et al., 2017), (Tang et al., 2016). The task can also be constrained text generation when topical and syntactic characteristics of natural languages are explicitly specified as constraints to incorporate in the generation process. We formally define the review generation task below.

Given as input a set of product attributes a = (a_1, a_2, . . . , a_{|a|}) of fixed length |a|, the goal is to generate a product review r = (y_1, y_2, . . . , y_{|r|}) of variable length |r| by maximizing the conditional probability p(r|a):

p(r|a) = ∏_{t=1}^{|r|} p(y_t | y_{<t}, a)    (17)

[...]timization problem can therefore be expressed as:

max Σ_{(a,r)∈D} log p(r|a)    (18)

Generating long, well-structured and informative reviews requires considerable effort when written by human users and is a similarly challenging task to do automatically (Li et al., 2019a).

2.4.11 Miscellaneous tasks related to natural language generation

Handwriting synthesis aims to automatically generate data that resembles natural handwriting and is a key component in the development of intelligent systems that can provide personalized experiences to humans (Zong and Zhu, 2014). The task of handwritten text generation is very much analogous to sequence generation. Given as input a user-defined sequence of words x = (x_1, x_2, . . . , x_T), which can be either typed into the computer system or fed as an input image I to capture the user's writing style, the goal of handwriting generation is to train a neural network model which can produce a cursive handwritten version of the input text to display under the form of an output image O (Graves, 2013). Handwriting generation is a conditional generation task when the system is conditioning on the input text. In addition, it is also a constrained text generation task since the task is constrained on generating text in the user's own writing style. While advances in deep learning have given computers the ability to see and recognize printed text from input images, generating cursive handwriting is a considerably more challenging problem (Alonso et al., 2019). Character boundaries are not always well-defined, which makes it hard to segment handwritten text into individual pieces or characters. In addition, handwriting evaluation is ambiguous and not well defined given the multitude of existent human handwriting style profiles (Mohammed et al., 2018).

Other related tasks where natural language generation plays an important role are generating questions, arguments, counter-arguments and [...]
tasks illustrate the widespread importance of hav-          Autoregressive (Fully-observed) generative
ing robust models for natural language generation.       models model the observed data directly without
3 Models

Neural networks are used in a wide range of supervised and unsupervised machine learning tasks due to their ability to learn hierarchical representations from the raw underlying features in the data and to model complex high-dimensional distributions. A wide range of model architectures based on neural networks has been proposed for the task of natural language generation in a wide variety of contexts and applications. In what follows we briefly discuss the main categories of generative models in the literature and then present specific models for neural language generation.

Deep generative models have received a lot of attention recently due to their ability to model complex high-dimensional distributions. These models combine the uncertainty estimates provided by probabilistic models with the flexibility and scalability of deep neural networks to learn, in an unsupervised way, the distribution from which the data is drawn. Generative probabilistic models are useful for two reasons: i) they can perform density estimation and inference of latent variables, and ii) they can sample efficiently from the probability density represented by the input data and generate novel content. Deep generative models can be classified into either explicit or implicit density probabilistic models. On the one hand, explicit density models provide an explicit parametric specification of the data distribution and have tractable likelihood functions. On the other hand, implicit density models do not specify the underlying distribution of the data, but instead define a stochastic process which allows the data distribution to be simulated, after training, by drawing samples from it. Since the data distribution is not explicitly specified, implicit generative models do not have a tractable likelihood function. A mix of both explicit and implicit models has been used in the literature to generate textual content in a variety of settings. Among these, we enumerate explicit density models with tractable density such as autoregressive models (Bahdanau et al., 2014), (Vaswani et al., 2017), explicit density models with approximate density like the Variational Autoencoder (Kingma and Welling, 2013), and implicit direct density generative models such as Generative Adversarial Networks (Goodfellow et al., 2014).

Autoregressive (Fully-observed) generative models model the observed data directly, without introducing dependencies on any new unobserved local variables. Assuming all items in a sequence x = (x_1, x_2, ..., x_N) are fully observed, the probability distribution p(x) of the data is modeled in an autoregressive fashion using the chain rule of probability:

    p(x_1, x_2, ..., x_N) = \prod_{i=1}^{N} p(x_i | x_1, x_2, ..., x_{i-1})    (19)

Training autoregressive models is done by maximizing the data likelihood, which allows these models to be evaluated quickly and exactly. Sampling from autoregressive models is exact, but it is expensive since samples need to be generated in sequential order. Extracting representations from fully-observed models is challenging, but this is currently an active research topic.
functions. On the other hand, implicit density mod-         Implicit density models (among which the most
els do not specify the underlying distribution of        famous models are GANs) introduce a second dis-
the data, but instead define a stochastic process        criminative model able to distinguish model gen-
which allows to simulate the data distribution af-       erated samples from real samples in addition to
ter training by drawing samples from it. Since           the generative model. While sampling from these
the data distribution is not explicitly specified, im-   models is cheap, it is inexact. The evaluation
plicit generative models do not have a tractable         of these models is difficult or even impossible to
likelihood function. A mix of both explicit and          carry, and extracting latent representations from
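The marginal likelihood in Equation 20 can be approximated by simple Monte Carlo, averaging p_\theta(x|z) over samples z ~ p(z), as in the sketch below. The standard normal prior and the Gaussian decoder are illustrative assumptions; practical latent variable models such as the VAE instead optimize a variational lower bound (the ELBO) rather than this naive estimator.

    import numpy as np

    def decoder_likelihood(x, z):
        # Hypothetical p_theta(x|z): a unit-variance Gaussian centred at z.
        return np.exp(-0.5 * np.sum((x - z) ** 2)) / (2 * np.pi) ** (len(x) / 2)

    def marginal_likelihood(x, num_samples=10000, seed=0):
        # p(x) = E_{z ~ p(z)}[p_theta(x|z)], with p(z) a standard normal prior.
        rng = np.random.default_rng(seed)
        z = rng.standard_normal((num_samples, len(x)))
        return np.mean([decoder_likelihood(x, zi) for zi in z])

    x = np.array([0.5, -1.0])
    print(marginal_likelihood(x))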
Implicit density models (among which the most famous are GANs) introduce, in addition to the generative model, a second, discriminative model able to distinguish model-generated samples from real samples. While sampling from these models is cheap, it is inexact. The evaluation of these models is difficult or even impossible to carry out, and extracting latent representations from these models is very challenging. We summarize in Table 1 the characteristics of the three categories of generative models discussed above.

Table 1: Comparison of generative model frameworks.

    Model type        Evaluation            Sampling
    Fully-observed    Exact and cheap       Exact and expensive
    Latent models     Lower bound           Exact and cheap
    Implicit models   Hard or impossible    Inexact and cheap

In what follows we review models for neural language generation from the most general to the most specific, according to the problem definition categorization presented in Section 2; for each model architecture we first list models for generic text generation, then introduce models for conditional text generation, and finally outline models used for constrained text generation. We begin with recurrent neural network models for text generation in Section 3.1, then present sequence-to-sequence models in Section 3.2, generative adversarial networks (GANs) in Section 3.4, variational autoencoders (VAEs) in Section 3.5, and pre-trained models for text generation in Section 3.8. We also provide a comprehensive overview of the text generation tasks associated with each model.
3.1 Recurrent Architectures

3.1.1 Recurrent Models for Generic / Free-Text Generation

Recurrent Neural Networks (RNNs) (Rumelhart et al., 1986), (Mikolov et al., 2010) are able to model long-term dependencies in sequential data and have shown promising results in a variety of natural language processing tasks, from language modeling (Mikolov, 2012) to speech recognition (Graves et al., 2013) and machine translation (Kalchbrenner and Blunsom, 2013). An important property of RNNs is their ability to learn to map an input sequence of variable length into a fixed dimensional vector representation.

At each timestep, the RNN receives an input, updates its hidden state, and makes a prediction. Given an input sequence x = (x_1, x_2, ..., x_T), a standard RNN computes the hidden vector sequence h = (h_1, h_2, ..., h_T) and the output vector sequence y = (y_1, y_2, ..., y_T), where each datapoint x_t, h_t, y_t, ∀ t ∈ {1, ..., T}, is a real valued vector, in the following way:

    h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
    y_t = W_{hy} h_t + b_y    (21)

In Equation 21 the terms W denote weight matrices, in particular W_{xh} is the input-hidden weight matrix and W_{hh} is the hidden-hidden weight matrix. The b terms denote bias vectors, where b_h is the hidden bias vector and b_y is the output bias vector. H is the function that computes the hidden layer representation.
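A direct NumPy transcription of Equation 21 is given below. The dimensionalities and the choice of tanh as the hidden-layer function H are illustrative assumptions; any differentiable nonlinearity could play the role of H.

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
        # h_t = H(W_xh x_t + W_hh h_{t-1} + b_h), with H = tanh here
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
        # y_t = W_hy h_t + b_y
        y_t = W_hy @ h_t + b_y
        return h_t, y_t

    rng = np.random.default_rng(0)
    d_in, d_h, d_out, T = 4, 8, 5, 10
    W_xh, W_hh = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
    W_hy = rng.normal(size=(d_out, d_h))
    b_h, b_y = np.zeros(d_h), np.zeros(d_out)

    h = np.zeros(d_h)
    for t in range(T):                      # unroll over the input sequence
        x_t = rng.normal(size=d_in)
        h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
    print(y.shape)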
Gradients in an RNN are computed via backpropagation through time (Rumelhart et al., 1986), (Werbos, 1989). By definition, RNNs are inherently deep in time, considering that the hidden state at each timestep is computed as a function of all previous timesteps. While in theory RNNs can make use of information in arbitrarily long sequences, in practice they fail to consider context beyond the few previous timesteps due to vanishing and exploding gradients (Bengio et al., 1994), which prevent gradient descent from learning long-range temporal structure in a standard RNN. Moreover, RNN-based models contain millions of parameters and have traditionally been very difficult to train, limiting their widespread use (Sutskever et al., 2011). Improvements in network architectures, optimization techniques and parallel computation have resulted in recurrent models learning better at large scale (Lipton et al., 2015).

Long Short Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) are introduced to overcome the limitations posed by vanishing gradients in RNNs and to allow gradient descent to learn long-term temporal structure. The LSTM architecture largely resembles the standard RNN architecture with one hidden layer, but each hidden layer node is modified to include a memory cell with a self-connected recurrent edge of fixed weight which stores information over long time periods. A memory cell c_t consists of a node with an internal hidden state h_t and a series of gates, namely an input gate i_t which controls how much each LSTM unit is updated, a forget gate f_t which controls the extent to which the previous memory cell is forgotten, and an output gate o_t which controls the exposure of the internal memory state. The LSTM transition equations at timestep t are:

    i_t = σ(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)})
    f_t = σ(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)})
    o_t = σ(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)})
    u_t = tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)})
    c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1}
    h_t = o_t ⊙ tanh(c_t)    (22)

In Equation 22, x_t is the input at the current timestep t, σ denotes the logistic sigmoid function and ⊙ denotes elementwise multiplication. U and W are learned weight matrices. LSTMs can represent information over multiple time steps by adjusting the values of the gating variables for each vector element, therefore allowing the gradient to pass without vanishing or exploding.
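The LSTM transition in Equation 22 translates line by line into the sketch below. Weight shapes are illustrative assumptions; in practice the four gate projections are usually fused into a single matrix multiplication for efficiency.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        # W, U, b are dicts keyed by gate name: "i", "f", "o", "u".
        i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
        f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
        o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
        u_t = np.tanh(W["u"] @ x_t + U["u"] @ h_prev + b["u"])   # candidate cell update
        c_t = i_t * u_t + f_t * c_prev                           # new memory cell
        h_t = o_t * np.tanh(c_t)                                 # exposed hidden state
        return h_t, c_t

    rng = np.random.default_rng(0)
    d_in, d_h = 4, 8
    gates = ["i", "f", "o", "u"]
    W = {g: rng.normal(size=(d_h, d_in)) for g in gates}
    U = {g: rng.normal(size=(d_h, d_h)) for g in gates}
    b = {g: np.zeros(d_h) for g in gates}

    h, c = np.zeros(d_h), np.zeros(d_h)
    for x_t in rng.normal(size=(10, d_in)):
        h, c = lstm_step(x_t, h, c, W, U, b)
    print(h.shape, c.shape)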
In both RNNs and LSTMs the data is modeled via a fully-observed directed graphical model, where the distribution over a discrete output sequence y = (y_1, y_2, ..., y_T) is decomposed into an ordered product of conditional distributions over tokens:

    P(y_1, y_2, ..., y_T) = P(y_1) \prod_{t=2}^{T} P(y_t | y_1, ..., y_{t-1})    (23)

Similar to LSTMs, Gated Recurrent Units (GRUs) (Cho et al., 2014) learn semantically and syntactically meaningful representations of natural language and have gating units to modulate the flow of information. Unlike LSTMs, GRU units do not have a separate memory cell and present a simpler design with fewer gates. The activation h_t^j at timestep t linearly interpolates between the activation at the previous timestep h_{t-1}^j and the candidate activation \tilde{h}_t^j. The update gate z_t^j decides how much the current unit updates its content, while the reset gate r_t^j allows it to forget the previously computed state. The GRU update equations at each timestep t are:

    h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \tilde{h}_t^j
    z_t^j = σ(W_z x_t + U_z h_{t-1})^j
    \tilde{h}_t^j = tanh(W x_t + U (r_t ⊙ h_{t-1}))^j
    r_t^j = σ(W_r x_t + U_r h_{t-1})^j    (24)
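Equation 24 is written per coordinate (superscript j); the sketch below computes all coordinates at once in vectorised form, which is how it would typically be implemented. All shapes are illustrative assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
        z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
        r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
        h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate activation
        return (1.0 - z_t) * h_prev + z_t * h_tilde      # interpolated new state

    rng = np.random.default_rng(0)
    d_in, d_h = 4, 8
    W_z, W_r, W = (rng.normal(size=(d_h, d_in)) for _ in range(3))
    U_z, U_r, U = (rng.normal(size=(d_h, d_h)) for _ in range(3))

    h = np.zeros(d_h)
    for x_t in rng.normal(size=(10, d_in)):
        h = gru_step(x_t, h, W_z, U_z, W_r, U_r, W, U)
    print(h.shape)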
Models with recurrent connections are trained with teacher forcing (Williams and Zipser, 1989), a strategy emerging from the maximum likelihood criterion and designed to keep the recurrent model predictions close to the ground-truth sequence. At each training step the model-generated token ŷ_t is replaced with its ground-truth equivalent token y_t, while at inference time each token is generated by the model itself (i.e. sampled from its conditional distribution over the sequence given the previously generated samples). The discrepancy between the training and inference stages leads to exposure bias, causing errors in the model predictions that accumulate and amplify quickly over the generated sequence (Lamb et al., 2016). As a remedy, Scheduled Sampling (Bengio et al., 2015) mixes inputs from the ground-truth sequence with inputs generated by the model at training time, gradually adjusting the training process from fully guided (i.e. using the true previous token) to less guided (i.e. using mostly the generated token), based on curriculum learning (Bengio et al., 2009). While the model-generated distribution can still diverge from the ground-truth distribution as the model generates several consecutive tokens, possible solutions are: i) make the self-generated sequences short, and ii) anneal the probability of using self-generated vs. ground-truth samples to 0, according to some schedule. Still, models trained with scheduled sampling are shown to memorize the distribution of symbols conditioned on their position in the sequence instead of the actual prefix of preceding symbols (Huszár, 2015).
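The difference between teacher forcing and scheduled sampling comes down to which token is fed back to the decoder at each step. The sketch below illustrates that decision with a linearly decaying probability of using the ground truth; the decay schedule and the toy model_predict stand-in are illustrative assumptions (Bengio et al. (2015) also propose exponential and inverse-sigmoid schedules).

    import numpy as np

    def model_predict(prev_token, rng):
        # Stand-in for sampling y_hat_t from the model's conditional distribution.
        return int(rng.integers(0, 100))

    def training_inputs(ground_truth, epoch, num_epochs, rng, bos=0):
        """Return the tokens fed to the decoder at each step of one sequence."""
        eps = max(0.0, 1.0 - epoch / num_epochs)   # prob. of feeding the true previous token
        fed = [bos]                                # decoding always starts from <bos>
        for t in range(1, len(ground_truth)):
            if rng.random() < eps:
                fed.append(ground_truth[t - 1])          # teacher forcing: true previous token
            else:
                fed.append(model_predict(fed[-1], rng))  # feed back the model's own sample
        return fed

    rng = np.random.default_rng(0)
    gt = list(range(10, 20))
    print(training_inputs(gt, epoch=0, num_epochs=10, rng=rng))   # fully guided
    print(training_inputs(gt, epoch=9, num_epochs=10, rng=rng))   # mostly free-running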
Many extensions of the vanilla RNN and LSTM architectures have been proposed in the literature, aiming to improve generalization and sample quality (Yu et al., 2019). Bidirectional RNNs (Schuster and Paliwal, 1997), (Berglund et al., 2015) augment unidirectional recurrent models by introducing a second hidden layer with connections flowing in the opposite temporal order, to exploit both past and future information in a sequence. Multiplicative RNNs (Sutskever et al., 2011) allow flexible input-dependent transitions, however many complex transition functions are hard to bypass. Gated-feedback RNNs and LSTMs (Chung et al., 2014) rely on gated-feedback connections to enable the flow of control signals from the upper to the lower recurrent layers in the network. Similarly, depth-gated LSTMs (Yao et al., 2015) introduce dependencies between lower and upper recurrent units by using a depth gate which connects memory cells of adjacent layers. Stacked LSTMs stack multiple layers at each time-step to increase the capacity of the network, while nested LSTMs (Moniz and Krueger, 2018) selectively access LSTM memory cells with inner memory. Convolutional LSTMs (Sainath et al., 2015), (Xingjian et al., 2015) are designed for jointly modeling spatio-temporal sequences. Tree-structured LSTMs (Zhu et al., 2015), (Tai et al., 2015) extend the LSTM structure beyond a linear chain to tree-structured network topologies, and are useful for semantic similarity and sentiment classification tasks. Multiplicative LSTMs (Krause et al., 2016) combine vanilla LSTM networks of fixed weights with multiplicative RNNs to allow for flexible input-dependent weight matrices in the network architecture. Multiplicative Integration RNNs (Wu et al., 2016b) achieve better performance than vanilla RNNs by using the Hadamard product in the computational additive building block of RNNs. Mogrifier LSTMs (Melis et al., 2019) capture interactions between inputs and their context by mutually gating the current input and the previous output of the network. For a comprehensive review of RNN and LSTM-based network architectures we point the reader to (Yu et al., 2019).

3.1.2 Recurrent Models for Conditional Text Generation

A recurrent free-text generation model becomes a conditional recurrent text generation model when the distribution over training sentences is conditioned on another modality. For example, in machine translation the distribution is conditioned on another language, in image caption generation the condition is the input image, in video description generation we condition on the input video, while in speech recognition we condition on the input speech.

Content and stylistic properties (such as sentiment, topic, style and length) of generated movie reviews are controlled in a conditional LSTM language model by conditioning on context vectors that reflect the presence of these properties (Ficler and Goldberg, 2017). Affective dialogue responses are generated by conditioning on affect categories in an LSTM language model (Ghosh et al., 2017). An RNN-based language model equipped with dynamic memory outperforms more complex memory-based models for dialogue generation (Mei et al., 2017). Participant roles and conversational topics are represented as context vectors and incorporated into an LSTM-based response generation model (Luan et al., 2016).
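One common way to condition a recurrent generator on another modality or on attribute information, in the spirit of the context-vector approaches above, is to concatenate a context vector to the token input at every step (an alternative is to use it to initialise the hidden state). The sketch below shows the concatenation variant and is an illustrative simplification rather than the exact setup of any cited paper; the feature values in the context vector are made up for the example.

    import numpy as np

    def conditional_rnn_step(x_t, context, h_prev, W_xh, W_hh, b_h):
        # The context vector (e.g. sentiment/topic/length features, an image
        # embedding, participant roles, ...) is appended to the token input.
        inp = np.concatenate([x_t, context])
        return np.tanh(W_xh @ inp + W_hh @ h_prev + b_h)

    rng = np.random.default_rng(0)
    d_tok, d_ctx, d_h = 6, 3, 8
    W_xh = rng.normal(size=(d_h, d_tok + d_ctx))
    W_hh = rng.normal(size=(d_h, d_h))
    b_h = np.zeros(d_h)

    context = np.array([1.0, 0.0, 0.7])   # e.g. hypothetical sentiment/topic/length codes
    h = np.zeros(d_h)
    for x_t in rng.normal(size=(5, d_tok)):
        h = conditional_rnn_step(x_t, context, h, W_xh, W_hh, b_h)
    print(h.shape)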
3.1.3 Recurrent Models for Constrained Text Generation

Metropolis-Hastings sampling (Miao et al., 2019) is proposed for both soft and hard constrained sentence generation from models based on recurrent neural networks. The method is based on Markov Chain Monte Carlo (MCMC) sampling and performs local operations such as insertion, deletion and replacement in the sentence space for any randomly selected word in the sentence.
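A heavily simplified sketch of this kind of Metropolis-Hastings word-level editing is shown below: a random position is chosen, a replacement is proposed, and the proposal is accepted with a probability governed by the ratio of sentence scores. The scoring function, the proposal distribution and the acceptance rule are illustrative stand-ins, not the actual algorithm of Miao et al. (2019), which also handles insertions and deletions and uses the underlying recurrent language model to score candidates.

    import numpy as np

    VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "quickly"]

    def sentence_score(tokens):
        # Stand-in for a language-model probability p(x), possibly multiplied
        # by constraint indicator or penalty terms.
        return np.exp(-abs(len(tokens) - 5)) / (1.0 + len(set(tokens)))

    def mh_replace_step(tokens, rng):
        pos = rng.integers(len(tokens))                   # pick a random word position
        proposal = list(tokens)
        proposal[pos] = VOCAB[rng.integers(len(VOCAB))]   # propose a local replacement
        accept_prob = min(1.0, sentence_score(proposal) / sentence_score(tokens))
        return proposal if rng.random() < accept_prob else tokens

    rng = np.random.default_rng(0)
    sent = ["the", "cat", "sat"]
    for _ in range(100):                                  # run the Markov chain
        sent = mh_replace_step(sent, rng)
    print(sent)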
Hard constraints on the generation of scientific paper titles are imposed by the use of a forward-backward recurrent language model which generates both previous and future words in a sentence conditioned on a given topic word (Mou et al., 2015). While the topic word can occur at any arbitrary position in the sentence, the approach can only generate sentences constrained precisely on one keyword. Multiple constraints are incorporated in sentences generated by a backward-forward LSTM language model by lexically substituting constrained tokens with their closest matching neighbour in the embedding space (Latif et al., 2020). Guiding the conversation towards a designated topic while integrating specific vocabulary words is achieved by combining discourse-level rules with neural next-keyword prediction (Tang et al., 2019). A recurrent network based sequence classifier is used for extractive summarization in (Nallapati et al., 2017). Poetry generation which obeys hard rhythmic, rhyme and topic constraints is proposed in (Ghazvininejad et al., 2016).

3.2 Sequence-to-Sequence Architectures

Although the recurrent models presented in Section 3.1 show good performance whenever large labeled training sets are available, they can only be applied to problems whose inputs and targets are encoded with vectors of fixed dimensionality. Sequences represent a challenge for recurrent models since RNNs require the dimensionality of their inputs and outputs to be known and fixed beforehand. In practice, there are many problems in which the sequence length is not known a priori and it is necessary to map variable-length sequences into fixed-dimensional vector representations. To this end, models that can map sequences to sequences have been proposed. These models make minimal assumptions on the sequence structure and learn to map an input sequence into a vector of fixed dimensionality and then map that vector back into an output sequence, therefore learning to decode the target sequence from the encoded vector representation of the source sequence. We present these models in detail below.
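To preview the encoder-decoder idea before the detailed discussion, the sketch below encodes a variable-length input into a single fixed-dimensional vector with one recurrent network and then unrolls a second recurrent network from that vector to emit an output sequence. The greedy argmax decoding, the one-hot embeddings and all dimensions are illustrative assumptions rather than a specific published architecture.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h, vocab = 4, 8, 12
    We, Ue = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))    # encoder weights
    Wd, Ud = rng.normal(size=(d_h, vocab)), rng.normal(size=(d_h, d_h))   # decoder weights
    Wo = rng.normal(size=(vocab, d_h))                                    # output projection

    def encode(xs):
        h = np.zeros(d_h)
        for x_t in xs:                      # read the source, keep only the final state
            h = np.tanh(We @ x_t + Ue @ h)
        return h

    def decode(h, max_len=6, bos=0):
        tokens, prev = [], bos
        for _ in range(max_len):
            x = np.eye(vocab)[prev]         # one-hot embedding of the previous output token
            h = np.tanh(Wd @ x + Ud @ h)
            prev = int(np.argmax(Wo @ h))   # greedy choice of the next token
            tokens.append(prev)
        return tokens

    source = rng.normal(size=(5, d_in))     # a variable-length input sequence
    print(decode(encode(source)))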