TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method

Zeqian Ju1, Peiling Lu2, Xu Tan*2, Rui Wang2, Chen Zhang3, Songruoyao Wu3, Kejun Zhang3, Xiangyang Li1, Tao Qin2, Tie-Yan Liu2

1 University of Science and Technology of China   2 Microsoft Research Asia   3 Zhejiang University, China

juzeqian@mail.ustc.edu.cn, {peil, xuta, ruiwa, taoqin, tyliu}@microsoft.com, {zc99, 22021296, zhangkejun}@zju.edu.cn, xiangyangli@ustc.edu.cn

* This work was conducted at Microsoft. Corresponding author: Xu Tan, xuta@microsoft.com

arXiv:2109.09617v1 [cs.SD] 20 Sep 2021

Abstract

Lyric-to-melody generation is an important task in automatic songwriting. Previous lyric-to-melody generation systems usually adopt end-to-end models that directly generate melodies from lyrics, which suffer from several issues: 1) lack of paired lyric-melody training data; 2) lack of control over the generated melodies. In this paper, we develop TeleMelody, a two-stage lyric-to-melody generation system with a music template (e.g., tonality, chord progression, rhythm pattern, and cadence) to bridge the gap between lyrics and melodies (i.e., the system consists of a lyric-to-template module and a template-to-melody module). TeleMelody has two advantages. First, it is data efficient. The template-to-melody module is trained in a self-supervised way (i.e., the source template is extracted from the target melody) that does not need any paired lyric-melody data. The lyric-to-template module is made up of some rules and a lyric-to-rhythm model, which is trained with paired lyric-rhythm data that is easier to obtain than paired lyric-melody data. Second, it is controllable. The design of the template ensures that the generated melodies can be controlled by adjusting the musical elements in the template. Both subjective and objective experimental evaluations demonstrate that TeleMelody generates melodies with higher quality, better controllability, and less requirement on paired lyric-melody data than previous generation systems.

1 Introduction

With the rapid development of artificial intelligence, automatic songwriting has drawn much attention from both academia and industry. Automatic songwriting covers many tasks, such as lyric generation (Malmi et al. 2016; Xue et al. 2021), melody generation (Wu et al. 2020; Choi, Fazekas, and Sandler 2016; Zhu et al. 2018), lyric-to-melody generation (Yu, Srivastava, and Canales 2021; Sheng et al. 2020; Bao et al. 2019; Lee, Fang, and Ma 2019), and melody-to-lyric generation (Sheng et al. 2020; Watanabe et al. 2018; Li et al. 2020). In this paper, we focus on lyric-to-melody generation, since it is one of the most important and common tasks in songwriting, and it is still under-explored.

In recent years, deep learning techniques have been widely used to develop end-to-end lyric-to-melody systems (Yu, Srivastava, and Canales 2021; Sheng et al. 2020; Bao et al. 2019; Lee, Fang, and Ma 2019). However, these systems suffer from the following issues: 1) They require a large amount of paired lyric-melody data to learn the correlation between syllables in lyrics and notes in melodies (Sheng et al. 2020), but collecting such paired data is difficult and costly. Sheng et al. (2020) attempted to alleviate this low-resource challenge with unsupervised pre-training of lyric-to-lyric and melody-to-melody models respectively. However, the unpaired data helps the understanding and generation of lyrics and melodies while having little effect on learning the correlation between lyrics and melodies. 2) They generate melodies directly from lyrics, which prevents end users from controlling musical elements (e.g., tonality and chord progression) during the generation process. Without controllability, requirements from users may be ignored and the application scenarios are limited.

In this paper, we propose TeleMelody1, a two-stage lyric-to-melody generation system with a carefully designed template as a bridge to connect lyrics and melodies. The template contains tonality, chord progression, rhythm pattern, and cadence. This template is effective because: 1) it can be conveniently extracted from melodies and predicted from lyrics, and it successfully captures their characteristics; 2) it is easy for users to understand and manipulate. Accordingly, we break down the lyric-to-melody task into a lyric-to-template module and a template-to-melody module. This reduces the task difficulty, improves data efficiency, and enables controllability over the generated melodies. The details are as follows:

1 Tele is from TEmpLatE.

This two-stage framework helps reduce the difficulty of learning the correlation between lyrics and melodies in end-to-end models. In the template-to-melody module, we train a template-to-melody model with templates extracted from melodies by rules. Generating melodies from templates is much easier than from lyrics, since the correlation between templates and melodies is much stronger than that between lyrics and melodies, and the paired template-melody data can easily be obtained in a self-supervised way. In the lyric-to-template module, the rhythm pattern in the template is obtained by a lyric-to-rhythm model, which is trained with paired lyric-rhythm data.
This paired data is obtained by extracting the rhythm pattern from crawled lyric-audio data through audio processing tools, which is much easier to get than paired lyric-melody data. Cadence is inferred based on punctuation mappings. Chord progression and tonality in the template can be acquired with predefined musical knowledge. In this way, the two modules rely either on self-supervised learning or on data mining over external lyric-audio data, neither of which requires any paired lyric-melody data, making them more data-efficient than end-to-end models.

Moreover, benefiting from the template-based framework, end users can control the generated melodies by changing the musical elements in templates. Besides, in the sequence-to-sequence template-to-melody model, we use musical knowledge to guide the learning of attention alignments between the template tokens and the corresponding melody tokens, which leads to better controllability.

Figure 1: The song "Twinkle Twinkle Little Star" in "C major" tonality. (a) The melody, lyric ("Twin-kle twin-kle lit-tle star, how I won-der what you are."), and chord progression "C-F-G-F-G-C". (b) The corresponding template (tonality: C major), note by note:
Chord:          C  C  C  C  F  F  G    F  F  F  F  G  G  C
Rhythm pattern: 0  1  2  3  0  1  2    0  1  2  3  0  1  2
Cadence:        No No No No No No Half No No No No No No Authentic
The main contributions of this work are as follows:

• We propose TeleMelody, a two-stage lyric-to-melody generation system with a carefully designed template as the bridge. It decomposes the lyric-to-melody task into a lyric-to-template module and a template-to-melody module. This framework helps reduce the task difficulty and improve data efficiency.

• The chord progression, tonality, rhythm pattern, and cadence designed in templates help control basic musical elements and high-level music structures. We introduce alignment regularization based on musical knowledge to ensure better controllability of the generated melodies.

• Experimental results demonstrate that TeleMelody significantly outperforms previous end-to-end lyric-to-melody generation models in terms of both objective and subjective evaluations of generation quality, and is capable of better controlling the generated melodies.

2 Background

In this section, we introduce the background of the lyric-to-melody generation task and then briefly introduce each musical element in the template for better understanding.
2.1 Lyric-to-Melody Generation

Lyric-to-melody generation has developed considerably in recent years, from rule-based or statistical methods to deep learning based methods. Traditional rule-based or statistical methods usually need lots of manual design with abundant domain knowledge in music, and also prevent end users from controlling musical elements. Monteith, Martinez, and Ventura (2012) generate rhythm patterns with rule-based modules, and construct an n-gram model to predict note pitch2. Long, Wong, and Sze (2013) and Rabin (1963) learn the lyric-note correlation with statistical methods on limited paired data. In these works, the melody is generated either without control of any musical element or considering only one specific musical element. Fukayama et al. (2010) obtain the optimal pitch sequence by maximizing the conditional probability given chords, tonality, accompaniment bass, rhythm, and pitch accent information, but the generated melodies may suffer from bad musical structure without repetition patterns. Meanwhile, the algorithm cannot be directly applied to lyrics written in "stress accent" languages like English.

Recently, developing lyric-to-melody systems based on machine learning has attracted much attention. Scirea et al. (2015) allocate the same number of notes as the syllables in lyrics using a Markov chain. Ackerman and Loker (2017) leverage random forest models to construct a rhythm model and a melody model separately on paired lyric-audio data. Bao et al. (2019) and Yu, Srivastava, and Canales (2021) use sequence-to-sequence models to generate melodies from lyrics. However, deep learning methods usually require a large amount of paired lyric-melody data to learn the correlation between lyrics and melodies. Sheng et al. (2020) attempt to address the low-resource challenge by pre-training lyric-to-lyric and melody-to-melody modules, and incorporating supervised learning into the pre-training to learn a shared latent space between lyrics and melodies. This is pioneering, but the unpaired data has not been sufficiently utilized for correlation learning between lyrics and melodies. Moreover, these works do not consider controlling specific musical elements of the generated melodies. Inspired by template-based methods in language generation (Fabbri et al. 2020; Yang et al. 2020; Li et al. 2020), we propose TeleMelody, a two-stage template-based system, which consists of a lyric-to-template module and a template-to-melody module. The two-stage framework helps address the issue of limited paired data, and the designed template, together with the alignment regularization in this framework, ensures better controllability over the generated melodies.

2.2 Music Background Knowledge

In this subsection, we use the song "Twinkle Twinkle Little Star" in Figure 1(a) as an example to introduce the musical elements of the template.
• Tonality3 is composed of a scale4 and a root note. For a given tonality, the notes are distributed within a specific order and pitch intervals. For example, the tonality of the melody in Figure 1(a) is "C major", since the notes are ordered within the pitches of the major scale5 and end on the root pitch "C".

• Chord progression6 is an ordered sequence of chords, in which a chord7 is a set of more than two notes with harmonious pitches. As in Figure 1(a), the chord progression is "C-F-G-F-G-C". The chord progression, interacting with the melody, should create a sense of harmony when composing music.

• Rhythm8 generally refers to the pattern of occurrence of notes and rests. In Figure 1(a), each note is aligned with a syllable, and notes in green boxes share the same rhythm patterns.

• Cadence9 occurs at the end of a phrase and gives a sense of ending in the melody, and it is often aligned with punctuation marks in the lyrics. In Figure 1(a), we assign the half cadence and the authentic cadence to the comma and the period respectively.

2 https://en.wikipedia.org/wiki/Pitch_(music)
3 https://en.wikipedia.org/wiki/Tonality
4 https://en.wikipedia.org/wiki/Scale_(music)
5 https://en.wikipedia.org/wiki/Major_scale
6 https://en.wikipedia.org/wiki/Chord_progression
7 https://en.wikipedia.org/wiki/Chord_(music)
8 https://en.wikipedia.org/wiki/Rhythm#Composite_rhythm
9 https://en.wikipedia.org/wiki/Cadence

Figure 2: Architecture of TeleMelody. (a) Pipeline: Lyric → Lyric-to-Template Module → Template → Template-to-Melody Module → Melody. (b) Architecture of the lyric-to-template module: the template (tonality, rhythm pattern, cadence, chord progression) is derived from the lyric via a lyric-to-rhythm model and a punctuation-to-cadence mapping. (c) The representation of the template and its connection to the melody: each note's (chord, rhythm pattern, cadence) triple in the template corresponds to the note's (bar, position, pitch, duration) tokens in the melody.
3 Methodology

The system architecture is shown in Figure 2(a). We introduce a template as a bridge between lyrics and melodies, and decompose lyric-to-melody generation into a lyric-to-template module and a template-to-melody module. In this section, we describe each component (i.e., template, lyric-to-template module, and template-to-melody module) of the system in detail.

3.1 Template

In this subsection, we introduce the template from several aspects: its definition, design principles, contained musical elements and their representation, and its connection with the generated melodies.

The template is a well-designed sequence of musical elements that can capture the characteristics of both lyrics and melodies. With this template as the connection, we decompose lyric-to-melody generation into a lyric-to-template module and a template-to-melody module, as shown in Figure 2(a).

Several high-level principles of template design are as follows: 1) Templates can be extracted directly from melodies, so that a template-to-melody model can be trained in a self-supervised way. 2) Templates obtained from lyrics should be in accordance with those extracted from melodies. 3) Compared with the hidden representations in end-to-end lyric-to-melody models, templates should be easier to manipulate on demand.

Based on these principles, our designed template consists of tonality, chord progression, rhythm pattern, and cadence. The template representation is shown in Figure 2(c): we use one token at the start of the sequence to represent the tonality of the melody, and three consecutive tokens (chord, rhythm pattern, cadence) to represent the musical elements of each note. For better understanding, we provide an example of a template in Figure 1(b), which corresponds to the melody in Figure 1(a).
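To make the representation concrete, here is a minimal Python sketch of how such a template sequence could be serialized. The token spellings and the dict-based note annotations are illustrative assumptions, not the exact vocabulary of the released implementation.

```python
# A minimal sketch of the template serialization: one tonality token at the
# start, then a (chord, rhythm, cadence) token triple per note.

def serialize_template(tonality, note_annotations):
    tokens = [f"Tonality_{tonality}"]
    for note in note_annotations:
        tokens.append(f"Chord_{note['chord']}")
        tokens.append(f"Rhythm_{note['rhythm']}")    # beat-level onset position in a bar (0-3)
        tokens.append(f"Cadence_{note['cadence']}")  # no / half / authentic
    return tokens

# First two notes of "Twinkle Twinkle Little Star" (cf. Figure 1(b)):
print(serialize_template("Cmaj", [
    {"chord": "C", "rhythm": 0, "cadence": "no"},
    {"chord": "C", "rhythm": 1, "cadence": "no"},
]))
# ['Tonality_Cmaj', 'Chord_C', 'Rhythm_0', 'Cadence_no', 'Chord_C', 'Rhythm_1', 'Cadence_no']
```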
The influence of the musical elements in templates on the generated melodies is as follows (shown in Figure 2(c)): 1) Tonality controls the pitch distribution. 2) Chord influences the harmony of the generated melodies. 3) Rhythm pattern constrains note positions and controls high-level musical structure with repetitive patterns. 4) Cadence guarantees the accordance between punctuation in lyrics and rest notes in melodies.

3.2 Template-to-Melody Module

The template-to-melody module generates melodies from given templates. An encoder-attention-decoder Transformer model is adopted in this module, which can learn the alignment between source and target sequences through the attention mechanism. In this subsection, we first introduce how to extract the musical elements of templates from melodies for model training, and then introduce alignment regularization, which leverages musical knowledge to improve model controllability.

Template Extraction Method We introduce the method for extracting templates (i.e., tonality, chord progression, rhythm pattern, cadence) from melodies for model training:

• Tonality can be inferred according to the note pitch distribution of the whole melody, following Liang et al. (2020).

• Chord progression can be inferred from the note pitch distribution with a Viterbi algorithm proposed by Magenta10.
• Rhythm pattern can be inferred based on the position information of notes in melodies.

• Cadence can be inferred based on note pitch, onset, and duration11. Specifically: 1) "No cadence" is assigned to a note when the note duration is short (e.g., less than 1 beat12) and the onset interval between the current note and the next note is small (e.g., less than 1.5 beats). 2) "Authentic cadence" is assigned to a note when the note is the root note13 of the tonic14 chord, or the inferred chord of this note is the tonic chord. There is also a probability p (e.g., 0.3) of labeling a note with "authentic cadence" when the note is another note of the tonic chord rather than the root note and the onset interval is large (e.g., more than 2 beats). 3) "Half cadence" is assigned to notes outside the above two situations.

10 https://github.com/magenta/note-seq/blob/master/note_seq/chord_inference.py
11 https://en.wikipedia.org/wiki/Duration_(music)
12 https://en.wikipedia.org/wiki/Beat_(music)
13 https://en.wikipedia.org/wiki/Root_(chord)
14 https://en.wikipedia.org/wiki/Tonic_(music)
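The cadence-labeling rule above can be summarized in a short Python sketch. The thresholds follow the examples quoted in the text (1 beat, 1.5 beats, 2 beats, p = 0.3), while the note representation and its fields (duration_beats, pitch_class, chord) are assumptions for illustration.

```python
import random

def label_cadence(note, onset_interval_to_next, tonic_chord, p=0.3):
    # 1) Short note followed closely by the next note: no cadence.
    if note["duration_beats"] < 1.0 and onset_interval_to_next < 1.5:
        return "no"
    # 2) Root of the tonic chord, or the note's inferred chord is the tonic
    #    chord: authentic cadence.
    if note["pitch_class"] == tonic_chord["root"] or note["chord"] == tonic_chord["name"]:
        return "authentic"
    #    Non-root tonic-chord tone with a large onset interval: authentic
    #    cadence with probability p.
    if (note["pitch_class"] in tonic_chord["pitch_classes"]
            and onset_interval_to_next > 2.0 and random.random() < p):
        return "authentic"
    # 3) Everything else: half cadence.
    return "half"
```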
 onset interval between ith note and i+1th note is also
Alignment Regularization With the introduction of tem- closely related to the pause, we add alignments between
plates, we are able to control the melody generation by ad- cadence of ith note with the bar index and position of
justing musical elements in templates. To further increase i+1th note. We also add an alignment between cadence
model controllability, we introduce musical knowledge to of ith note and the pitch of ith note, since pitch is an im-
the template-to-melody model through well-designed align- portant factor in distinguishing “half cadence” and “au-
ment regularization during training. Each alignment is de- thentic cadence”.
signed based on musical knowledge and is imposed on the
encoder-decoder attention. 3.3 Lyric-to-Template Module
 Following Garg et al. (2019), alignment regularization is In this subsection, we describe the lyric-to-template mod-
described as follows: We denote mk as the k th note infor- ule, which generates musical elements (i.e., tonality, chord
mation token in melody sequence, and tj as the j th musical progression, rhythm pattern, and cadence) in templates from
element token in template sequence. ŵ denotes a 0-1 matrix lyrics. Tonality and chord progression are weakly correlated
such that ŵk,j = 1 if mk is aligned with tj . We simply nor- with lyrics and can be manipulated on demand. Therefore,
malize the rows of matrix ŵ to get a matrix w. We expect we focus on generating rhythm pattern and cadence in the
encoder-decoder attention weight Ak,j between mk and tj following subsections.
to be closer to wk,j defined as:
 1 Lyric-to-Rhythm Model Previous works (Monteith, Mar-
 if mk is aligned to tj , tinez, and Ventura 2012; Fukayama et al. 2010) usually gen-
 wk,j = T (1)
 0 otherwise, erate a fixed rhythm pattern based on syllable count in a bar,
where T is the number of tokens in the template sequence which need lots of manual design and constrain the diversity
that mk is aligned to. The alignment regularization term is of rhythm pattern. Therefore, we introduce a lyric-to-rhythm
defined as follows: model to predict rhythm pattern in template with given
 lyrics. We utilize an encoder-attention-decoder Transformer
 K
 1 XX
 J model to generate rhythm pattern in an auto-regressive way,
 Lattn = wk,j ∗ logAk,j , (2) which requires large amount of paired lyric-rhythm data for
 K ∗J j=1 training. To collect adequate training data, we extracted the
 k=1
where J and K are the number of tokens in the source and paired data from crawled lyric-singing audio data with audio
target sentence respectively. Finally, we train our model to processing tools. The data pipeline is similar to Xue et al.
minimize Lattn in conjunction with the negative log likeli- (2021) and is described in Appendix A.
hood loss Lnll . The overall loss is defined as follows: Punctuation-Cadence Mapping Since punctuation
 L = Lnll + λattn ∗ Lattn , (3) marks in lyrics are closely related to cadences in melodies,
where λattn is a hyperparameter. we design musical knowledge based mapping as follows:
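A small numpy sketch of Equations (1)-(3), assuming ŵ is the K×J 0-1 matrix built from the musically motivated alignments described next and A is the encoder-decoder attention matrix (rows normalized over template positions); all names here are illustrative.

```python
import numpy as np

def alignment_loss(w_hat, A, eps=1e-9):
    K, J = w_hat.shape
    T = w_hat.sum(axis=1, keepdims=True)                        # tokens aligned to each m_k
    w = np.divide(w_hat, T, out=np.zeros((K, J)), where=T > 0)  # Equation (1)
    return -(w * np.log(A + eps)).sum() / (K * J)               # Equation (2)

def overall_loss(l_nll, w_hat, A, lambda_attn=0.05):
    # Equation (3); lambda_attn = 0.05 in the paper's configuration.
    return l_nll + lambda_attn * alignment_loss(w_hat, A)
```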
As shown in Figure 2(c), we give an example to illustrate the alignment between the consecutive i-th and (i+1)-th notes in the melody and the corresponding musical elements in the template. The note information consists of bar index, position in a bar, pitch, and duration. The alignments mentioned in Equation (1) are designed as follows:

• We add an alignment between the tonality (at the beginning of the template) and every note pitch in the melody, since tonality controls the pitch distribution of the entire melody.

• We add an alignment between the chord of the i-th note in the template and the pitch of the i-th note in the melody, since the chord influences the note pitch.

• We add an alignment between the rhythm pattern of the i-th note in the template and the position of the i-th note in the melody, since the rhythm pattern determines the note position.

• We add an alignment between the cadence of the i-th note in the template and the duration of the i-th note in the melody, since cadence controls the note duration. Besides, since the onset interval between the i-th and (i+1)-th notes is also closely related to the pause, we add alignments between the cadence of the i-th note and the bar index and position of the (i+1)-th note. We also add an alignment between the cadence of the i-th note and the pitch of the i-th note, since pitch is an important factor in distinguishing "half cadence" from "authentic cadence".

3.3 Lyric-to-Template Module

In this subsection, we describe the lyric-to-template module, which generates the musical elements (i.e., tonality, chord progression, rhythm pattern, and cadence) of templates from lyrics. Tonality and chord progression are weakly correlated with lyrics and can be manipulated on demand. Therefore, we focus on generating rhythm pattern and cadence in the following subsections.

Lyric-to-Rhythm Model Previous works (Monteith, Martinez, and Ventura 2012; Fukayama et al. 2010) usually generate a fixed rhythm pattern based on the syllable count in a bar, which needs lots of manual design and constrains the diversity of rhythm patterns. Therefore, we introduce a lyric-to-rhythm model to predict the rhythm pattern in the template from given lyrics. We utilize an encoder-attention-decoder Transformer model to generate the rhythm pattern in an auto-regressive way, which requires a large amount of paired lyric-rhythm data for training. To collect adequate training data, we extract the paired data from crawled lyric-singing audio data with audio processing tools. The data pipeline is similar to Xue et al. (2021) and is described in Appendix A.

Punctuation-Cadence Mapping Since punctuation marks in lyrics are closely related to cadences in melodies, we design a musical-knowledge-based mapping as follows: 1) We align the "authentic cadence" and "half cadence" with the period and comma in lyrics respectively, indicating the end of a sentence. 2) We align the "no cadence" label with each syllable in lyrics, since it is a voiced part.
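This mapping is simple enough to sketch directly; the input format (a list of (syllable, following_punctuation) pairs) is an illustrative assumption.

```python
# A sketch of the inference-time punctuation-to-cadence mapping.
CADENCE_BY_PUNCTUATION = {",": "half", ".": "authentic"}

def lyrics_to_cadences(syllables):
    """Each syllable gets 'no' cadence, except the ones followed by a
    comma or period, which get 'half' or 'authentic' respectively."""
    return [CADENCE_BY_PUNCTUATION.get(punct, "no")
            for _, punct in syllables]

# "Twin-kle twin-kle lit-tle star," -> ['no'] * 6 + ['half']
```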
4 Experimental Settings

In this section, we introduce the experimental settings, including datasets, model configurations, and evaluation metrics. Our code is available via this link15.

15 https://github.com/microsoft/muzic/tree/main/telemelody
4.1 Dataset

We conduct experiments on both English (EN) and Chinese (ZH) lyric-to-melody generation tasks to verify the effectiveness of TeleMelody. For lyric-to-template generation, we construct two lyric-rhythm datasets, in English and Chinese respectively, for training the lyric-to-rhythm model. For template-to-melody generation, we only require one template-melody dataset, since templates are language independent. The processing details of melody and template are provided in Appendix B.

Lyric-Rhythm Dataset Following the data mining pipeline described in Appendix A, we collect adequate paired lyric-rhythm training data. Finally, we obtain 9,761 samples in English and 74,328 samples in Chinese. Statistics of these two datasets are shown in Table 4 of Appendix C.

Template-Melody Dataset Since templates are language independent, we only require one template-melody model for lyrics in both languages. We use the LMD-matched MIDI dataset16 (Raffel 2016), which contains 45,129 MIDI files. We perform pre-processing to get high-quality melody data as follows: First, we extract the melody from the track17 with at least 50 notes that has the highest average pitch among all the tracks, and then delete polyphonic notes. Second, we normalize the tonality to "C major" or "A minor", and normalize the pitches to fit the vocal range. Finally, we filter out the empty bars in each sample. After these steps, we obtain the processed target melody dataset. To construct the paired template-melody dataset, we utilize our proposed knowledge-based rules to extract the corresponding template from each target melody. Statistics of the template-melody dataset are shown in Table 5 of Appendix C.

16 https://colinraffel.com/projects/lmd/
17 https://en.wikipedia.org/wiki/Multitrack_recording
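The track-selection heuristic above can be sketched as follows, using pretty_midi-style objects (an assumption; any MIDI library exposing per-track note lists with onsets and pitches would do).

```python
import pretty_midi

def select_melody_track(midi_path):
    """Among tracks with at least 50 notes, pick the one with the highest
    average pitch, then keep only one note per onset (the highest)."""
    midi = pretty_midi.PrettyMIDI(midi_path)
    candidates = [inst for inst in midi.instruments
                  if not inst.is_drum and len(inst.notes) >= 50]
    melody = max(candidates,
                 key=lambda i: sum(n.pitch for n in i.notes) / len(i.notes))
    monophonic = {}
    for n in sorted(melody.notes, key=lambda n: (n.start, -n.pitch)):
        monophonic.setdefault(n.start, n)   # highest note per onset wins
    return list(monophonic.values())
```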

4.2 System Configurations

We choose the encoder-attention-decoder Transformer for our system. The configurations of the two models in TeleMelody are as follows: 1) The lyric-to-rhythm Transformer model consists of 4 encoder layers and 4 decoder layers. The hidden size and the number of attention heads are set to 256 and 4. During training, we apply dropout with a rate of 0.2. We use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.0005. 2) The template-to-melody Transformer model has a 4-layer encoder and a 4-layer decoder with 4 attention heads in each layer. We set the hidden size to 256 and the dropout to 0.0005. The alignment regularization weight λ_attn is set to 0.05 and the learning rate of the Adam optimizer is set to 0.0005. 3) To ensure the diversity of generated melodies, we use stochastic sampling inference following Huang et al. (2018). The temperature and top-k parameters are set to 0.5 and 2 in lyric-to-rhythm generation, and to 0.5 and 10 in template-to-melody generation.
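For reference, a minimal sketch of temperature + top-k sampling over a model's output logits (the function name and signature are assumptions; only the (temperature, k) settings come from the text):

```python
import numpy as np

def sample_token(logits, temperature=0.5, top_k=10, rng=np.random.default_rng()):
    logits = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(logits)[-top_k:]             # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```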
4.3 Evaluation Metrics

We conduct both objective and subjective evaluations to validate the performance of our proposed system.

Objective Evaluation Following previous work (Sheng et al. 2020), we consider two metrics to measure the similarity of the generated and the ground-truth melodies: 1) similarity of pitch and duration distribution (PD and DD); 2) melody distance (MD).

Besides, we also use the accuracy of tonality, chord, rhythm pattern, and cadence (TA, CA, RA, AA) to measure the consistency between the generated melody and the musical elements in the template. The more consistent the generated melody is with the template, the more controllable the model is. TA, RA, and AA of ground-truth melodies are 100%, while the CA score of ground-truth melodies is less than 100%, since introducing some notes outside the chord is encouraged to avoid monotony. Therefore, a CA score is better if it is closer to that of the ground-truth melodies. Accordingly, we consider that the closer TA, CA, RA, and AA are to those of the ground-truth melodies, the more controllable the model is. The definitions of these metrics are described in Appendix F.

Subjective Evaluation We invite 10 participants with music knowledge as human annotators to evaluate 10 songs in each language. We require each participant to rate properties of the melodies on a five-point scale, from 1 (Poor) to 5 (Perfect). The whole evaluation is conducted in a blind-review mode. Following previous works (Sheng et al. 2020; Zhu et al. 2018; Watanabe et al. 2018), we use the following subjective metrics to evaluate the generated melodies: 1) Harmony: Is the melody itself harmonious? 2) Rhythm: Does the rhythm of the melody sound natural and suitable for the lyrics? 3) Structure: Does the melody consist of repetitive and impressive segments? 4) Quality: What is the overall quality of the melody?

5 Experimental Results

In this section, we first compare TeleMelody with the baselines to demonstrate its effectiveness. Then, we analyze TeleMelody in terms of controllability and data efficiency. Audio samples of the generated melodies are available via this link18.

18 https://speechresearch.github.io/telemelody

5.1 Main Results

In this subsection, we evaluate the performance of TeleMelody. Two baselines are considered: 1) SongMASS (Sheng et al. 2020), an end-to-end Transformer model that also deals with the low-resource scenario through unsupervised lyric-to-lyric and melody-to-melody pre-training; 2) the Transformer baseline, a Transformer model directly trained with paired lyric-melody data, which is similar to SongMASS without pre-training and the alignment constraint. Our TeleMelody (including the lyric-to-rhythm and template-to-melody models) has a similar number of model parameters to SongMASS and the Transformer baseline for fair comparison. For English, the two baselines use 8,000 paired lyric-melody samples (Yu, Srivastava, and Canales 2021), and SongMASS additionally uses 362,237 unpaired lyrics and 176,581 unpaired melodies for pre-training.
Table 1: Objective and subjective evaluation of TeleMelody and the baseline systems.

                            |        Objective       |            Subjective
                            | PD(%)↑  DD(%)↑   MD↓   | Harmony↑  Rhythm↑  Structure↑  Quality↑
(EN) Transformer Baseline   |  24.41   47.04   2.69  |   2.0       2.1       1.8        1.8
(EN) SongMASS               |  34.32   47.88   2.61  |   2.4       2.3       2.3        2.2
(EN) TeleMelody             |  43.86   52.06   2.40  |   3.2       3.4       3.3        3.3
(ZH) Transformer Baseline   |  11.40   38.03   6.75  |   1.5       2.0       1.5        1.5
(ZH) SongMASS               |  26.36   50.09   3.29  |   1.8       2.2       1.8        1.7
(ZH) TeleMelody             |  49.76   51.48   2.80  |   3.0       3.5       3.3        3.2

For Chinese, the two baselines use 18,000 paired lyric-melody samples (Bao et al. 2019), and SongMASS additionally uses 228,000 unpaired lyrics and 283,000 unpaired melodies crawled from the Web.

As shown in Table 1, TeleMelody significantly outperforms the Transformer baseline on all the objective metrics (improvement on EN: 19.45% in PD, 5.02% in DD, and 0.29 in MD; improvement on ZH: 38.36% in PD, 13.45% in DD, and 3.95 in MD). Compared with SongMASS, TeleMelody also performs better on all the objective metrics (improvement on EN: 9.54% in PD, 4.18% in DD, and 0.21 in MD; improvement on ZH: 23.40% in PD, 1.39% in DD, and 0.49 in MD). Meanwhile, for all the subjective metrics, TeleMelody is better than both the Transformer baseline and SongMASS. Specifically, on quality, TeleMelody outperforms the Transformer baseline by 1.49 in English and 1.7 in Chinese, and outperforms SongMASS by 1.10 in English and 1.43 in Chinese.

The results show that an end-to-end Transformer model performs poorly, since the available paired lyric-melody data is limited. In SongMASS, the lyric-to-lyric and melody-to-melody unsupervised pre-training can effectively improve the performance of end-to-end models, but it is still insufficient, since the unlabeled data is not effectively utilized to improve the correlation learning between lyrics and melodies. TeleMelody performs the best, since it successfully reduces the task difficulty and effectively utilizes unpaired melodies in the two-stage framework. Besides, the improvements in Table 1 are consistent with the intuition behind the template design: 1) tonality and chord progression in the template control note pitch and thus help improve PD and harmony; 2) rhythm pattern and cadence in the template control both onset and duration of notes, and thus help improve DD and rhythm; 3) the repetitive patterns in the template help improve structure.

5.2 Method Analyses

In this subsection, we analyze the proposed method from two aspects: 1) verifying the controllability of TeleMelody, as well as the effect of the designed template and the alignment regularization on controllability; 2) verifying data efficiency by testing TeleMelody with varying scales of paired lyric-rhythm training data. Besides, in Appendix D, we also evaluate the effectiveness of the rule used to extract cadence from melodies.

Controllability Study Previous deep-learning-based works in lyric-to-melody generation prevent end users from controlling musical elements in the generated melodies. As shown in Table 2, TeleMelody with template scores quite high in RA and AA, and close to the ground truth in CA, which indicates that the melody generated by TeleMelody is highly consistent with the rhythm, chord, and cadence in the template. Meanwhile, TeleMelody can also control the tonality, with good TA accuracy given the template. Moreover, the proposed alignment regularization (AR) further improves controllability (with TA, CA, RA, and AA closer to the ground truth).

Table 2: Results on model controllability. The closer TA, CA, RA, and AA are to Ground Truth, the better the controllability.

Metric        TA(%)   CA(%)   RA(%)   AA(%)
Ground Truth  100.00  62.10   100.00  100.00
Template      75.64   64.11   99.88   99.94
Template+AR   77.41   62.62   99.99   99.98

To show the effect of the proposed alignment regularization intuitively, we further visualize the average encoder-decoder attention weights of all heads in the last layer. As shown in Figure 3, after adding alignment regularization (Template+AR), the related elements in the template and the generated melody are clearly aligned, according to the alignment rules introduced in Section 3.2.
Figure 3: Alignment visualization for (a) Template and (b) Template+AR, showing the average encoder-decoder attention weights between template tokens (tonality and per-note chord, cadence, rhythm) and melody tokens (per-note bar, position, pitch, duration). We denote the x-coordinate "i-feature" as the feature of the i-th note in the melody sequence, and the y-coordinate "i-element" as the musical element of the i-th note in the template sequence.

We also conduct a case study to illustrate the controllability of TeleMelody and how the elements in the template affect the generated melody. The baseline melody is shown in Figure 4(a), with a "C major" tonality and an "Am-F-C-G" chord progression. We evaluate the control performance from the following aspects:

• Tonality determines all the pitches in the melody. As shown in Figure 4(b), when we change the tonality in the template from "C major" to "A minor", most of the pitches change, and the pitch of the ending note changes from the tonic pitch of "C major" to the tonic pitch of "A minor".

• For each note, a chord is provided in the template, which affects the pitch of the note. As shown in Figure 4(c), when we change the chords of the notes in the third bar from "C" to "Am", the pitches in the third bar change correspondingly.
• Rhythm affects the onset positions of the notes. For example, in Figure 4, when we use the same rhythm for the first and third bars, as labeled by the green or blue boxes in each melody, the onset positions of the notes in the two bars are the same.

• Cadence affects the pauses in the generated melody. Meanwhile, for a note with "authentic cadence" in the template, the pitch should be the tonic pitch. As shown in Figure 4, a pause exists in the melody for each punctuation mark at the end of the second and fourth bars, as labeled by the pink boxes, since a "half cadence" or an "authentic cadence" is added for each punctuation mark. Moreover, the pitch of the ending note of the entire melody is the tonic pitch, as labeled by the orange boxes, since an "authentic cadence" is added for each period.

Figure 4: Case study on template adjustment, on the lyric "days like this I want to drive a-way, days like this I want to drive a-way." (a) Basic melody: tonality "C major", chord progression "Am-F-C-G". (b) Adjusting the tonality to "A minor". (c) Adjusting the chord progression to "Am-F-Am-F". Similar bars are labeled by blue boxes while repetitive bars are labeled by green boxes.

Data Efficiency Study Our proposed TeleMelody is data-efficient: 1) In the template-to-melody module, we train a model with templates extracted from melodies in a self-supervised way. 2) In the lyric-to-template module, only the lyric-to-rhythm model requires paired training data, which is much easier to obtain than paired lyric-melody data. We further study the performance of the lyric-to-rhythm model with varying amounts of paired lyric-rhythm data, to demonstrate that this model is also data-efficient. As shown in Table 3, when the paired lyric-rhythm data is reduced to 50%, the performance of TeleMelody declines only slightly, while still outperforming SongMASS on all the metrics. We further reduce the paired lyric-rhythm data to zero, to demonstrate that TeleMelody can be applied in scenarios without any paired data; in this setting, we replace the lyric-to-rhythm model with hand-craft rules, described in Appendix E. Table 3 shows that when the lyric-to-rhythm model is replaced with hand-craft rules, the performance of TeleMelody significantly degrades, which demonstrates the advantage of the lyric-to-rhythm model in TeleMelody. Moreover, Table 3 also shows that, compared with SongMASS, which relies on a large amount of paired data and large-scale pre-training, TeleMelody achieves comparable performance on all the objective metrics in English without any paired data. This promising result illustrates that TeleMelody is also effective in the zero-resource scenario.

Table 3: Objective evaluations of the lyric-to-rhythm model with varying lyric-rhythm training data.

                             PD(%)↑  DD(%)↑  MD↓
(EN) TeleMelody (100% data)   43.86   52.06  2.40
(EN) TeleMelody (50% data)    41.42   48.59  2.40
(EN) TeleMelody (0% data)     35.60   47.93  2.55
(EN) SongMASS                 34.32   47.88  2.61
(ZH) TeleMelody (100% data)   49.76   51.48  2.80
(ZH) TeleMelody (50% data)    46.25   50.14  3.12
(ZH) TeleMelody (0% data)     30.22   33.84  4.95
(ZH) SongMASS                 26.36   50.09  3.29

6 Conclusion

In this paper, we proposed TeleMelody, a two-stage lyric-to-melody generation system with a music template (e.g., tonality, chord progression, rhythm pattern, and cadence) to bridge the gap between lyrics and melodies. TeleMelody is data efficient and can be controlled by end users by adjusting the musical elements in the template. Both subjective and objective experimental evaluations demonstrate that TeleMelody generates melodies with higher quality than previous lyric-to-melody generation systems. Moreover, experimental studies also verify the data efficiency and controllability of TeleMelody. In future work, we will extend our proposed two-stage framework to other music generation tasks (e.g., melody-to-lyric generation and melody-to-accompaniment generation).
References

Ackerman, M.; and Loker, D. 2017. Algorithmic songwriting with ALYSIA. In International Conference on Evolutionary and Biologically Inspired Music and Art, 1–16. Springer.

Bao, H.; Huang, S.; Wei, F.; Cui, L.; Wu, Y.; Tan, C.; Piao, S.; and Zhou, M. 2019. Neural melody composition from lyrics. In CCF International Conference on Natural Language Processing and Chinese Computing, 499–511. Springer.

Choi, K.; Fazekas, G.; and Sandler, M. 2016. Text-based LSTM networks for automatic music composition. arXiv preprint arXiv:1604.05358.

Fabbri, A. R.; Ng, P.; Wang, Z.; Nallapati, R.; and Xiang, B. 2020. Template-based question generation from retrieved sentences for improved unsupervised question answering. arXiv preprint arXiv:2004.11892.

Fukayama, S.; Nakatsuma, K.; Sako, S.; Nishimoto, T.; and Sagayama, S. 2010. Automatic song composition from the lyrics exploiting prosody of the Japanese language. In Proc. 7th Sound and Music Computing Conference (SMC), 299–302.

Garg, S.; Peitz, S.; Nallasamy, U.; and Paulik, M. 2019. Jointly learning to align and translate with transformer models. arXiv preprint arXiv:1909.02074.

Huang, C.-Z. A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A. M.; Hoffman, M. D.; Dinculescu, M.; and Eck, D. 2018. Music Transformer. arXiv preprint arXiv:1809.04281.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lee, H.-P.; Fang, J.-S.; and Ma, W.-Y. 2019. iComposer: An automatic songwriting system for Chinese popular music. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 84–88.

Li, P.; Zhang, H.; Liu, X.; and Shi, S. 2020. Rigid formats controlled text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 742–751.

Liang, H.; Lei, W.; Chan, P. Y.; Yang, Z.; Sun, M.; and Chua, T.-S. 2020. PiRhDy: Learning pitch-, rhythm-, and dynamics-aware embeddings for symbolic music. In Proceedings of the 28th ACM International Conference on Multimedia, 574–582.

Long, C.; Wong, R. C.-W.; and Sze, R. K. W. 2013. T-Music: A melody composer based on frequent pattern mining. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), 1332–1335. IEEE.

Malmi, E.; Takala, P.; Toivonen, H.; Raiko, T.; and Gionis, A. 2016. DopeLearning: A computational approach to rap lyrics generation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 195–204.

Monteith, K.; Martinez, T. R.; and Ventura, D. 2012. Automatic generation of melodic accompaniments for lyrics. In ICCC, 87–94.

Rabin, M. O. 1963. Probabilistic automata. Information and Control, 6(3): 230–245.

Raffel, C. 2016. Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching. PhD dissertation, Columbia University.

Scirea, M.; Barros, G. A.; Shaker, N.; and Togelius, J. 2015. SMUG: Scientific Music Generator. In ICCC, 204–211.

Sheng, Z.; Song, K.; Tan, X.; Ren, Y.; Ye, W.; Zhang, S.; and Qin, T. 2020. SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint. arXiv preprint arXiv:2012.05168.

Watanabe, K.; Matsubayashi, Y.; Fukayama, S.; Goto, M.; Inui, K.; and Nakano, T. 2018. A melody-conditioned lyrics language model. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 163–172.

Wu, J.; Liu, X.; Hu, X.; and Zhu, J. 2020. PopMNet: Generating structured pop music melodies using neural networks. Artificial Intelligence, 286: 103303.

Xue, L.; Song, K.; Wu, D.; Tan, X.; Zhang, N. L.; Qin, T.; Zhang, W.-Q.; and Liu, T.-Y. 2021. DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling. arXiv preprint arXiv:2107.01875.

Yang, J.; Ma, S.; Zhang, D.; Li, Z.; and Zhou, M. 2020. Improving neural machine translation with soft template prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5979–5989.

Yu, Y.; Srivastava, A.; and Canales, S. 2021. Conditional LSTM-GAN for melody generation from lyrics. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(1): 1–20.

Zhu, H.; Liu, Q.; Yuan, N. J.; Qin, C.; Li, J.; Zhang, K.; Zhou, G.; Wei, F.; Xu, Y.; and Chen, E. 2018. XiaoIce Band: A melody and arrangement generation framework for pop music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2837–2846.

A Pipeline of Crawling Lyric-Rhythm Data

We crawl paired lyric-singing audio data from the Web and then utilize spleeter19, a public music separation tool, to separate the vocal from the accompaniment part. The timestamps of the lyrics and the tempo of the melodies are the two pieces of information necessary for collecting rhythm information. To extract lyric timestamps, we first split the singing audio into sentence-level segments using the crawled start and end timestamps. We convert lyrics into sequences of phonemes via Phonemizer20 and then obtain the phoneme-level vocal-lyric alignment with an audio alignment tool, Montreal Forced Aligner21. Based on these phoneme-level vocal-lyric alignments, we can obtain the corresponding timestamp of each lyric.

19 https://github.com/deezer/spleeter
20 https://github.com/bootphon/phonemizer
21 https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner
To extract tempo information, we perform a direct estimation from the accompaniment audio with an audio information retrieval tool, librosa22. Finally, with the tempo information and the timestamp of each lyric, we can infer the beat-level onset, that is, the corresponding rhythm of each lyric.

22 https://github.com/librosa/librosa
B Processing Details of Melody and Template

B.1 Melody

In this paper, we only consider melodies with a constant tempo and a time signature of 4/423. Each note is represented by four consecutive tokens (bar, position, pitch, and duration). We use 256 tokens to represent different bars and 16 tokens to represent different positions in a bar with a granularity of a 1/16 note. We use 128 tokens to represent pitch values following the MIDI format. We use 16 tokens to represent duration values ranging from a 1/16 note to a whole note.

23 4/4 denotes that each beat is a 1/4 note and each bar has 4 beats.
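A sketch of this four-token note encoding; the token spellings are illustrative, while the vocabulary sizes follow the text (256 bars, 16 positions, 128 MIDI pitches, 16 durations in units of 1/16 notes).

```python
def encode_note(bar, position, pitch, duration_16ths):
    assert 0 <= bar < 256 and 0 <= position < 16
    assert 0 <= pitch < 128 and 1 <= duration_16ths <= 16
    return [f"Bar_{bar}", f"Pos_{position}", f"Pitch_{pitch}", f"Dur_{duration_16ths}"]

# e.g. a C4 quarter note on the downbeat of the first bar:
# encode_note(0, 0, 60, 4) -> ['Bar_0', 'Pos_0', 'Pitch_60', 'Dur_4']
```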
B.2 Template

In this paper, the template contains the musical elements tonality, chord progression, rhythm pattern, and cadence. We only consider the "C major" and "A minor" tonalities for simplicity, since other tonalities can be transposed to these two based on their scales. A chord consists of a root note and a chord quality. We consider 12 chord roots (C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B) and 7 chord qualities (major, minor, diminished, augmented, major7, minor7, half-diminished), resulting in 84 possible chords in total. We use 4 tokens ranging from 0 to 3 to represent rhythm patterns, that is, beat-level onset positions in a bar. For cadence, we consider "half cadence", "authentic cadence", and "no cadence", which are aligned with the comma, the period, and other syllables in lyrics respectively.
C Dataset Statistics

Table 4: Statistics of the lyric-rhythm dataset.

(a) English
# of data samples: 9,761
Average # of words per song: 26.02
Average # of syllables per song: 31.15
Average # of punctuation marks per song: 3.79

(b) Chinese
# of data samples: 74,328
Average # of words per song: 78.77
Average # of punctuation marks per song: 9.30

Table 5: Statistics of the template-melody dataset.

# of data samples: 157,702
Average # of notes per song: 28.76
Average # of bars per song: 5.93

D Effect of Cadence Extraction Rule

In the lyric-to-template module, we directly obtain cadence from lyrics at inference time through the punctuation-to-cadence mapping. In the template-to-melody module, we extract cadence from melodies at training time through the cadence extraction rule. Therefore, a question may arise: is there a gap in cadence between training and inference? To answer this question, we explore the statistics of cadences in the template-melody dataset. The results are shown in Table 6:

Table 6: Statistics of notes labeled with different cadences in the template-melody dataset. Both duration and onset interval are quantized to 1/16 notes.

Cadence     Average # per song   Duration   Onset Interval
No          24.57                 1.74       2.29
Half         2.53                 8.54       9.66
Authentic    1.66                10.34      11.70

• Notes labeled with "no cadence", which are aligned with syllables in lyrics, have short durations and onset intervals. Notes labeled with "half cadence", which are aligned with commas in lyrics, have 4.91× longer durations and 4.22× longer onset intervals than those labeled with "no cadence". Notes labeled with "authentic cadence", which are aligned with periods in lyrics, have 5.94× longer durations and 5.11× longer onset intervals than those labeled with "no cadence". This is consistent with musical knowledge, since punctuation marks are usually aligned with pauses.

• The average number of "no cadence" notes is 5.86× greater than the average number of "half cadence" and "authentic cadence" notes combined. This ratio is similar to the ratio of syllables to punctuation marks in lyrics, as shown in Table 4.

E Lyric-to-Rhythm Hand-Craft Rules

We design several lyric-to-rhythm rules to demonstrate that TeleMelody can be applied in scenarios without any paired data. Specifically, for English lyrics, a note corresponds to a syllable, and we generate the rhythm patterns syllable by syllable, where the onset interval between a note and its previous note is 2 beats if its corresponding syllable is the start of a sentence, 1 beat if its corresponding syllable is the start of a word but not the start of a sentence, and 0.5 beats otherwise. For Chinese lyrics, a note corresponds to a character, and we generate the rhythm patterns character by character, where the onset interval between a note and its previous note is 2 beats if its corresponding character is the start of a sentence, and 1 beat otherwise.
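A sketch of the English rule; the syllable representation (dict with sentence-start and word-start flags) is an assumption for illustration.

```python
def english_rule_onsets(syllables):
    """Derive each note's onset (in beats) from the previous note,
    using the syllable's position flags."""
    onsets, t = [], 0.0
    for i, s in enumerate(syllables):
        if i > 0:
            if s["sentence_start"]:
                t += 2.0      # 2 beats after the previous note
            elif s["word_start"]:
                t += 1.0      # 1 beat at a word boundary
            else:
                t += 0.5      # 0.5 beats within a word
        onsets.append(t)
    return onsets

# The Chinese rule is analogous: +2 beats at sentence starts, +1 otherwise.
```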
F Definitions of TA, RA, CA, and AA

We evaluate model controllability with the accuracy of tonality, rhythm pattern, chord, and cadence (TA, RA, CA, AA), that is, the proportion of notes consistent with the given template. Specifically, a melody is consistent with the tonality if the inferred tonality is the same as the template tonality; a note is consistent with the rhythm pattern if its position is in accord with the given rhythm pattern information; a note is consistent with the chord if its pitch is within the chord; a note is consistent with the cadence if both its duration and the onset interval between this note and the next note comply with the extraction rules described in Section 3.2.
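For instance, the chord-consistency check behind CA can be sketched as follows; the chord spelling table is an illustrative assumption covering only the chords used in the case study.

```python
# A note counts as consistent if its pitch class belongs to the chord's
# pitch classes.
CHORD_PITCH_CLASSES = {"C": {0, 4, 7}, "F": {5, 9, 0}, "G": {7, 11, 2}, "Am": {9, 0, 4}}

def chord_accuracy(pitches, chords):
    """Fraction of notes whose pitch class lies within the paired chord."""
    hits = sum(1 for pitch, chord in zip(pitches, chords)
               if pitch % 12 in CHORD_PITCH_CLASSES[chord])
    return hits / len(pitches)
```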