Music generation using tracker music and machine learning

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

BJÖRN A. LINDQVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

BJÖRN A. LINDQVIST BJOLIN2@KTH.SE

Degree Programme in Computer Science
Date: June 27, 2021
Supervisor: Bobby Lee Townsend Sturm JR
Examiner: Sten Ternström
School of Electrical Engineering and Computer Science
Swedish title: Musikgenerering med trackermusik och
maskininlärning

Abstract
We investigate the modelling of polyphonic “tracker music” using deep
neural networks. Tracker music is a music storage format invented
in the late 1980s for use on that time’s home computers and is often
used for storing synthesized electronic music. Tracker music differs
significantly from other music formats and has properties that make
it both harder and easier to use for training neural networks than other
formats. This makes it interesting to explore what methods are most
suitable for extracting musical information from the format. As far
as we know, we are the first to explore how to use tracker music for
music generation.
    We design a method for turning tracker music into sequential data
usable for training neural networks. The sequential nature of the
data means that musically unaware sequence models can be used for
training. The method is general and can be applied to other kinds of
symbolic music.
    We then compile a dataset of about 20 000 freely available in-
strumental songs in the tracker format MOD, downloaded from the
website The Mod Archive. We use the dataset to train several differ-
ent sequence models, including a Long Short-Term Memory (LSTM)
network and a Transformer model. We evaluate the models using a
sequence completion task and we investigate the statistical properties
of the output. We also conduct a listener study involving some 100
participants to determine how often music generated by the models
is preferred over human-composed music. The listener study’s result
indicates that music generated by the models trained on the dataset is
sometimes competitive with music composed by humans.
    We conclude that neural networks for music generation can be
trained using tracker music using our proposed conversion method,
but that it is cumbersome. Due to how the tracker music format is
constructed it is significantly more difficult to get musical information
out of it than we initially thought.

Sammanfattning

We investigate how best to use tracker music to train neural networks
to generate polyphonic music. Tracker music can be said to be both a
particular genre of instrumental music and a particular music format.
The format, which differs markedly from, for example, MIDI and MP3,
has some properties that make it harder and others that make it easier
to train neural networks with than comparable music formats. It is
therefore interesting to explore which methods are best suited for
extracting musical information from the format. As far as we know,
the subject has not been explored before, and our exploration of it is
this thesis's central contribution to research on music generation with
neural networks.
    In the thesis we propose a method that converts tracker music into
a sequential format suitable for training neural networks. We also
demonstrate that the method works in practice by training a number of
neural networks on a collection of about 20 000 instrumental songs in
the tracker music storage format MOD, which we then evaluate. The
evaluation includes, among other things, a listener study. Its results
show that the music generated by three neural networks trained on the
tracker music collection is sometimes preferred by listeners over music
created by humans.
    We conclude that music-generating neural networks can be trained
using tracker music with our proposed conversion method, but that
doing so is cumbersome. Because of how tracker music is structured
and organized, it is considerably harder than we initially thought to
extract musical information from it.

Acknowledgements
I would like to thank my supervisor Bob L. Sturm without whose
plentiful support and insistence this thesis would not have been
completed.

Contents

1   Introduction

2   Theoretical background
    2.1 Symbolic music
    2.2 Sequence modelling
        2.2.1 Sequence completion
        2.2.2 Decoding
        2.2.3 Evaluation
    2.3 Artificial neural networks
        2.3.1 Feed-forward networks
        2.3.2 Recurrent neural networks
        2.3.3 The Transformer
        2.3.4 Inferencing

3   Related work
    3.1 Sturm, Santos, et al. (2016)
    3.2 Hadjeres and François Pachet (2016)
    3.3 Donahue, Mao, Y. E. Li, et al. (2019)
    3.4 Huang et al. (2019)

4   Tracker music and machine learning
    4.1 The MOD file format
    4.2 Turning MOD files to training data
        4.2.1 Filtering
        4.2.2 Dcode
    4.3 Turning training data to music

5   Dataset and neural network training

6   Evaluation
    6.1 Statistical analysis
        6.1.1 Plagiarism
    6.2 Listener study

7   Discussion
    7.1 Training on modules
    7.2 Sequential encodings
    7.3 LSTM versus pcode GPT-2
    7.4 Ethics and sustainability

Bibliography

8   Appendix
Chapter 1

Introduction

Music is the art of combining sounds with silence in a way that elicits
emotions. The sounds are not merely noise, but have meaning, structure,
and a purpose. To create music is to compose – to organize smaller
atomic elements into a larger whole that is greater than the sum of
its parts. While most music is composed by humans, an intriguing
question is whether some algorithm can excel at the same task. This
leads to deeper questions on what would be required of such an
algorithm and how it should be implemented.
    The recent surge of interest in machine learning has seen many
researchers use neural networks to generate music (Sturm, Santos, et al.
2016; Peracha 2019; Donahue, Mao, and McAuley 2018; Huang et al.
2019). The challenge for them and for us is that music is predictable
and unpredictable, familiar and alien, structured and chaotic, all at the
same time. Good music cannot be too much of either: music that is too
structured is repetitive and dull, while music that is too chaotic is
confusing and without meaning. Music is structured into multiple layers,
such as beats, bars, riffs, voices, etc., that interact with each other. Capturing the
interactions between all these layers makes music generation difficult.
    Tracker music is a unique way of representing synthesized elec-
tronic music invented in the late 1980s and designed for use on that
time’s home computers. Tracker music is sample-based and frequently
used for composing chiptunes, a genre of instrumental music that
sounds reminiscent of vintage arcade machines, computers, and video
game consoles (Driscoll and Diaz 2009). The music is created with
tracker software – a type of music sequencing software. The term
“tracker” derives from the first tracker, Ultimate Soundtracker, written
by Karsten Obarski and released in 1987 for the Commodore Amiga.
Trackers made it possible for amateurs without access to expensive
synthesizers to create music with their own computers. Obarski’s
software became hugely popular with videogame developers, hobbyist
musicians, and on the demoscene. The success of Ultimate Sound-
tracker spawned several clones and lookalikes that expanded on the
tracker concept, including ProTracker, NoiseTracker, Scream Tracker,
and FastTracker II (Cant 2020). Ultimate Soundtracker also introduced
the MOD file format for storing tracker music. MOD is shorthand for
module and tracker music is often called MOD/module music (see
chapter 4 for details on the MOD format).
    The objective of this thesis is to explore how tracker music can be
used for training neural networks to generate music. We fulfill our
objective by training neural networks on a corpus of freely available
tracker music in MOD format. To the best of our knowledge, neither
this dataset nor tracker music in general has been used for music
generation. We also evaluate the trained networks; in a listener study
conducted online involving over 100 participants we find that the net-
works produce music that is preferred over human-composed music
fairly often (see section 6.2 for details). This demonstration of how
to use tracker music for music generation is our main contribution. A
secondary contribution is our analysis of the neural networks’ perfor-
mance. A tertiary contribution is the software we have developed for
training and evaluating neural networks trained on tracker music, which
we have published online at https://github.com/bjourne/musicgen.
    The target audience for this thesis follows from its objective: ma-
chine learning practitioners interested in novel datasets for music
generation. It ought to be an interesting read for them because it
introduces a music format they probably have not encountered. We
consider this thesis a proof-of-concept to inspire others to improve the
methods we present, resulting in better ways to use tracker music. Or,
at the very least, to adapt our proposed methods for other musical
datasets.
    The rest of this thesis is structured as follows. In chapter 2, we
define symbolic music and we discuss sequence modelling and why it
is a good tool for generating symbolic music. We review the theory
underlying artificial neural networks, paying special attention to two
specific state-of-the-art sequence architectures – the LSTM and the
Transformer – which we employ to implement our networks. In
chapter 3, we discuss some recent music modelling research and how
it relates to our work. In chapter 4, we discuss tracker music and
explain our algorithm for encoding it as sequences amenable to
sequence modelling. In chapters 5 and 6, we present the tracker music
dataset, our training work, and the results it yielded. In chapter 7, we
discuss what we have learned and stake out directions for future
research.
Chapter 2

Theoretical background

The theoretical basis of this thesis is symbolic music, sequence mod-
elling, and artificial neural networks. We define symbolic music in
section 2.1, discuss sequence modelling, including decoding and eval-
uation, in section 2.2, and in section 2.3, we discuss artificial neural
networks in general and the LSTM and the Transformer architectures
in particular.

2.1      Symbolic music
Symbolic music is music represented in symbolic notation. Ex-
amples of symbolic music are sheet music, tablature, and MIDI files.
The fundamental elements in symbolic music are notes. They instruct
the performer on how to play the composition. A note has three
central properties: an onset, defining when the note is played relative
to the start of the composition; a duration, defining for how long
the note sounds; and a pitch, defining the frequency of the note.1
    A composition’s duration is usually subdivided into intervals of
non-overlapping measures. The measures’ durations are in turn subdi-
vided into intervals of non-overlapping note durations. For example,
if a composition contains 16 measures and every measure contains
four quarter-notes and the duration of each quarter-note is 500 ms, the
composition’s total length is 32 seconds. Quarter-note durations may
be further subdivided into smaller non-overlapping values, such as four
sixteenth-notes, or into triplets and quintuplets. Other subdivisions of
time exist and it is also possible for durations of measures to vary, but
for simplicity’s sake, in this thesis we assume that all measures are
exactly 16 sixteenth-notes long, that there are always four sixteenth-notes
per quarter-note, and that all notes begin at and cover one or more
sixteenth-notes.
    1 Notes can have additional properties called embellishments. Embellishments
are beyond the scope of our discussion.
    The composition’s tempo is the pace of its rhythm and specifies
the duration of its quarter-notes. It is often notated as the number of
quarter-notes, or beats, per minute (BPM). For example, 125 BPM is
the same as 480 ms per quarter-note.
    A note’s pitch is its frequency relative to some base frequency. In
Western music, notes’ frequencies are fit into octaves – non-overlapping
subdivisions of the frequency spectrum. Octaves contain 12 relative
frequencies known as tone steps or “notes” – terminology that may
invite some confusion since a “note” in a composition also has an onset
and a duration. The 12 tone steps in an octave are, in ascending order
of frequency: C, C#, D, D#, E, F, F#, G, G#, A, Bb, and B. Octaves wrap
around so that the B note of the first octave is followed by the C note
of the second octave and so on.
    Scientific pitch notation specifies pitch by note and octave. For ex-
ample, F#3 denotes the third octave’s F# note. The absolute frequency
f of a note given in scientific pitch notation is given by:

    f = A_0 \times 2^{i/12}.        (2.1)

A0 is a standard frequency, usually set to 440 Hz, and i is the distance
in half-steps between the note and A0 . E.g., if the note is D1 , the
distance is 5 because there are four notes in between: Bb0 , B0 , C1 , and
C#1 .
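    To make equation 2.1 concrete, the short Python sketch below converts
scientific pitch notation to a frequency. It is an illustration only and not
part of the thesis software; the reference is taken to be A4 = 440 Hz in
standard octave numbering, and Bb is used instead of A# as in the list of
tone steps above.

# Minimal sketch: scientific pitch notation to frequency (equation 2.1).
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "Bb", "B"]

def pitch_to_freq(name, ref_freq=440.0):
    # Split e.g. "F#3" into the note name "F#" and the octave 3.
    note, octave = name[:-1], int(name[-1])
    # Half-step distance i between the note and the reference pitch A4.
    i = (octave - 4) * 12 + NOTES.index(note) - NOTES.index("A")
    return ref_freq * 2 ** (i / 12)

print(pitch_to_freq("A4"))   # 440.0
print(pitch_to_freq("F#3"))  # ~185.0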

2.2     Sequence modelling
We choose to model music as token sequences because sequence
modelling is a mature field; we inherit all its theory. A major use
case for sequence modelling is language modelling, wherein text is
modelled as sequences of words, letters, or lexemes. A lot of research
has gone into developing language models and we want to reuse the
results of this research for music modelling.
    One caveat, though: most music is polyphonic, meaning that multi-
ple notes can sound at the same time. This is unlike text in Western
scripts whose letters and words are strictly ordered from first to last.

There are several solutions to this “simultaneity” problem. The one
we use is to interleave simultaneous notes. See section 4.2 for details
on our various music encoding schemes.
    In sequence modelling, the goal is to model a latent probability
distribution over variable-length token sequences drawn from a finite
vocabulary. We denote token sequences x = ( x1 , . . . , xn ), where n is
the length of the token sequence,2 the latent distribution p∗ (x), and
our model of the distribution pθ (x). The model is parametrized by θ
and it should resemble p∗ (x) so that p∗ (x) is approximately equal to
pθ (x) for all x (Welleck, Kulikov, Roller, et al. 2019).
    The chain rule lets us factorize the joint probabilities as the product
of conditional probabilities:

    p_\theta(x) = \prod_{t=1}^{|x|} p_\theta(x_t \mid x_{<t}).        (2.2)

|x| denotes the length of the sequence and x<t = (x1 , . . . , xt−1 ) the
tokens preceding xt .
2.2.1     Sequence completion

In sequence completion, a model is given the beginning of a sequence
and must generate a plausible continuation. The task resembles sentence
completion in language modelling – the model is prompted with
a question and has to come up with a suitable answer.
questions and has to come up with a suitable answers.
   Let x p = ( x1 , . . . , xk ) be the prefix and xc = ( xk+1 , . . . , xn ) the
continuation so that ( x1 , . . . , xk , xk+1 , . . . , xn ) is the completion. Our
goal is to find xc that maximizes the likelihood:
    \prod_{t=k+1}^{n} p_\theta(x_t \mid x_{<t}).        (2.4)

This is known as maximum a posteriori (MAP) decoding since pθ is a
probability model. However, since the search space is exponentially
large, solving the problem exactly is intractable, and one has to resort
to approximate decoding algorithms (Gu, Cho, and V. O. K. Li 2017;
Meister, Vieira, and Cotterell 2021).

2.2.2     Decoding
Generating sequences from a sequence model is called decoding. Several
decoding strategies have been invented, each with its own advantages
and disadvantages. Broadly speaking, they can be categorized as either
deterministic or stochastic. In the following sections, we review some
popular ones.

Deterministic decoding
Greedy decoding is a simple approximate decoding algorithm. It
selects the highest probability token at each time step:

    \operatorname*{argmax}_{x_t} \; p_\theta(x_t \mid x_{<t}).        (2.5)
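    As a concrete illustration, a greedy-decoding loop can be written in a
few lines of Python. The next_token_probs function below is a hypothetical
stand-in for a trained model that returns p_θ(x_t | x_<t) for every token in
the vocabulary.

import numpy as np

def greedy_decode(next_token_probs, prefix, max_len, end_token=None):
    # Repeatedly append the most probable next token (equation 2.5).
    seq = list(prefix)
    while len(seq) < max_len:
        probs = next_token_probs(seq)     # p(x_t | x_<t) over the vocabulary
        token = int(np.argmax(probs))     # pick the most probable token
        seq.append(token)
        if token == end_token:
            break
    return seq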

    Beam search generalizes greedy decoding. It keeps B sequences in
memory and, at each time step and for each sequence, considers the B
tokens with the highest conditional probability. Among the B² possible
continuations, it selects B sequences with the highest likelihood and
repeats the process. When a sufficient number of tokens have been
generated, the sequence with the highest likelihood is returned. With
B = 1 beam search is equivalent to greedy decoding.
    While beam search will always yield sequences with equal or higher
likelihood than greedy decoding, it is also much more computation-
ally expensive. Furthermore, several authors have shown that beam
search suffers from severe excessive repetition just like greedy decoding
(Holtzman et al. 2020).
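    A minimal sketch of beam search over the same hypothetical
next_token_probs interface is shown below. Log-probabilities are summed
instead of multiplying probabilities to avoid numerical underflow, and
end-token handling is omitted for brevity.

import numpy as np

def beam_search(next_token_probs, prefix, max_len, B=4):
    # Each beam is a (sequence, log-likelihood) pair.
    beams = [(list(prefix), 0.0)]
    for _ in range(max_len - len(prefix)):
        candidates = []
        for seq, score in beams:
            log_probs = np.log(next_token_probs(seq))
            # Consider only the B most probable continuations of this beam.
            for token in np.argsort(log_probs)[-B:]:
                candidates.append((seq + [int(token)], score + log_probs[token]))
        # Keep the B highest-scoring sequences among the B*B candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:B]
    return beams[0][0]   # the sequence with the highest likelihood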

Stochastic decoding
Stochastic decoding uses randomness to generate sequences that, ide-
ally, are more “diverse” than deterministically decoded sequences.
In its most basic form, it samples one token from the distribution
pθ ( xt |x<t ) at each time step; this is known as ancestral sampling. Two
common variants restrict the set of tokens that can be sampled: top-k
sampling, which samples only among the k most probable tokens, and
nucleus (top-p) sampling, which samples only from the smallest set of
tokens whose cumulative probability exceeds p. Its authors have
suggested setting the p parameter to values in the range 0.9 to 1.
    A third variant is tempered sampling which samples from q, where
q is derived from pθ ( xt |x<t ) by rescaling the distribution with a
temperature parameter before renormalizing: temperatures below one
make the distribution more peaked, while temperatures above one
flatten it towards a uniform distribution.
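    The sketch below combines tempered and nucleus sampling for a single
decoding step. It operates on a plain probability vector and is only meant
to illustrate the two techniques; the defaults p = 0.95 and temperature 1.0
are assumptions, not values taken from the thesis.

import numpy as np

def sample_top_p(probs, p=0.95, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Tempered sampling: rescale the distribution in log space.
    logits = np.log(probs) / temperature
    q = np.exp(logits - logits.max())
    q /= q.sum()
    # Nucleus (top-p) sampling: keep the smallest set of tokens whose
    # cumulative probability exceeds p, discard the rest, renormalize.
    order = np.argsort(q)[::-1]
    cutoff = np.searchsorted(np.cumsum(q[order]), p) + 1
    trimmed = np.zeros_like(q)
    trimmed[order[:cutoff]] = q[order[:cutoff]]
    trimmed /= trimmed.sum()
    return int(rng.choice(len(q), p=trimmed))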
2.2.3     Evaluation

A common way to evaluate a sequence model is perplexity (PPL), the
exponentiated mean negative log-likelihood per token:

    PPL(x) = \exp\left(-\frac{1}{|x|} \sum_{t=1}^{|x|} \log p_\theta(x_t \mid x_{<t})\right).        (2.9)

The lower the perplexity, the better the model predicts the data.
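    Computing the perplexity of a sequence is straightforward once the model
has assigned a probability to each of its tokens. A minimal Python sketch:

import numpy as np

def perplexity(token_probs):
    # token_probs[t] = p_theta(x_t | x_<t) for every token in the sequence.
    token_probs = np.asarray(token_probs)
    return float(np.exp(-np.mean(np.log(token_probs))))

# A model assigning probability 0.25 to every token has perplexity 4 --
# it is as "confused" as a uniform choice among four tokens.
print(perplexity([0.25] * 10))   # 4.0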
    Another metric, originally developed for evaluating machine trans-
lation, is BLEU. It is based on ngrams: the contiguous length-n subse-
quences of a sequence. For example, the sentence “hello how are you
doing” contains the set of 2-grams or bigrams:

          {(hello, how), (how, are), (are, you), (you, doing)},

the set of 3-grams or trigrams:

         {(hello, how, are), (how, are, you), (are, you, doing)},

and so on.
    BLEU computes the fraction of the ngrams in the candidate sen-
tence that are matched by an identical ngram in a reference sentence.
The metric therefore ranges from 0 to 1 and is 1 only if the candidate
translation has the same ngrams as one of the reference translations
(Papineni et al. 2002).
    Let C be the set of ngrams in the candidate sequence and R =
{ R1 , . . . , Rn } a set of the sets of ngrams in the reference sequences and
let πS ( g) be the number of occurrences of the ngram g in the sequence
S, then BLEU is defined as:

    BLEU(C, \mathcal{R}) = \frac{\sum_{c \in C} \min\left(\max_{R \in \mathcal{R}}\{\pi_R(c)\}, \, \pi_C(c)\right)}{\sum_{c \in C} \pi_C(c)}.        (2.10)

    While BLEU measures how similar sentences are to each other,
Zhu et al. (2018) proposed turning the metric around to measure the
diversity of generated data, which they called Self-BLEU. Self-BLEU
works by sampling, say, 1000 sentences and, for each sentence, calculating
its BLEU score using the others as references. The lower the score,
the higher the diversity.
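    The sketch below implements the clipped ngram precision of equation
2.10 for a single ngram length and uses it to compute Self-BLEU for a set
of token sequences. It omits the brevity penalty and the geometric mean
over several ngram lengths used in full BLEU implementations.

from collections import Counter

def ngrams(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def bleu(candidate, references, n=2):
    # Equation 2.10: clipped ngram matches divided by the candidate's
    # total ngram count.
    cand = ngrams(candidate, n)
    refs = [ngrams(r, n) for r in references]
    matched = sum(min(max(r[g] for r in refs), c) for g, c in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

def self_bleu(sequences, n=2):
    # Average BLEU of each sequence against all the others; lower values
    # indicate a more diverse set of sequences.
    scores = [bleu(s, sequences[:i] + sequences[i + 1:], n)
              for i, s in enumerate(sequences)]
    return sum(scores) / len(scores)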

2.3     Artificial neural networks
Artificial neural networks are a subfield of machine learning centered
around computational graphs, loosely inspired by how neurons com-
municate with each other (Goodfellow, Bengio, and Courville 2016,
Chapter 6). The neurons in artificial neural networks (often called
“neural networks” or just “networks”) form directed graphs along
whose edges they propagate signals. The signals’ strengths depend
on the weights of the edges they are transmitted on. If the strength
exceeds some threshold, the neuron propagates the signal further.
Eventually, the signal reaches a set of output neurons from which
the network’s value is read. Neural networks have been applied for
many tasks for which it is difficult to come up with explicit logical
constraints, such as pattern recognition, predictive analysis, and image
recognition. Thus, they ought to be an excellent choice for modelling
music for which coming up with such constraints is difficult.
   Mathematically speaking, a neural network is a general class of
parametric nonlinear functions from a vector x of input variables to a
vector o of output variables,

                                      o = f (x, θ ).                  (2.11)

θ represents the parameters; i.e. the set of weights for all edges in the
graph.
    The Universal Approximation Theorem states that sufficiently large
neural networks can approximate any function. Thus, we can use
a neural network for solving prediction problems by finding proper
values for θ.
    The method used to fit a neural network is analogous to poly-
nomial curve fitting. Let D = {(x1 , y1 ), . . . , (xn , yn )} be a sequence
of training examples, comprising pairs of input vectors together
with matching target vectors. The parameters that minimize an error
function over the training examples, such as the sum-of-squares, are
sought:

    E(\theta) = \frac{1}{2} \sum_{i=1}^{n} \lVert f(x_i, \theta) - y_i \rVert^2.        (2.12)

If f is smooth and continuous,5 it follows that E is too. Therefore, E
forms a surface over θ. All its minima will occur where its gradient is
zero:
                               ∇ E(θ ) = 0.                      (2.13)
Due to the large number of free parameters, finding the gradient’s
zero points with analytical methods is infeasible. Instead, iterative
numerical methods are used – commonly gradient descent. Gradient
descent can be imagined as a ball dropped on a hilly surface consisting
of peaks and valleys. No matter where the ball is dropped, it will
roll downhill until it comes to rest in one of the valleys. The gradient
at that point must be zero, otherwise the ball would keep rolling.
    5 Any neural network function must have these properties.
Gradient descent simulates the process by picking some initial values
for the ball’s position, θ (0) , and updates them in a stepwise fashion:

    \theta^{(t+1)} = \theta^{(t)} - \eta \nabla E(\theta^{(t)}).        (2.14)

η is a scalar parameter called learning rate. It controls how far the ball
rolls in the gradient’s opposite direction on each update. The process
repeats until E(θ ) stops decreasing, at which point a local minimum has
been found.
    The local minimum gradient descent finds may be different from, and
larger than, the error function’s global minimum. Suppose the hilly surface
contains a valley in a mountainous region. The lowest elevation
of the valley may be higher than the elevation of the flatlands in
the other regions of the surface. Gradient descent could get stuck
in this valley, unable to escape the local minimum. To alleviate this
problem, researchers have proposed variants of the basic gradient
descent update process to let it explore a larger portion of the search
space (Bottou 1991). One such variant is stochastic gradient descent.
    Unlike total gradient descent, which computes ∇ E(θ (t) ) on every
example in the dataset D and is infeasible to compute if the dataset
is large, stochastic gradient descent computes ∇ E(θ (t) ) on a single
randomly selected element of D . This introduces noise and helps the
optimization process escape local minima. But it may also lead to the
variance of the gradient estimates becoming prohibitively large, making
the optimization process inefficient. Furthermore, since only one piece of
data is used at a time, data parallelism hardware cannot be exploited.
    Mini-batch gradient descent offers a compromise between these two
extremes. It computes ∇ E(θ (t) ) based on a random subsequence of D .
The size of the subsequence is called the batch size and is commonly
set to values ranging from 10 to 1 000. Mini-batch gradient descent is
the gradient descent variant most often used in practice.
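    A bare-bones implementation of mini-batch gradient descent looks as
follows. The grad function, the learning rate, and the batch size are
placeholders; X and Y are assumed to be numpy arrays holding the inputs
and targets.

import numpy as np

def minibatch_sgd(grad, theta, X, Y, lr=0.01, batch_size=32, epochs=10,
                  rng=None):
    # grad(theta, X_batch, Y_batch) returns the gradient of the error
    # function E over the batch with respect to theta.
    rng = rng or np.random.default_rng()
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)          # shuffle the training examples
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            theta = theta - lr * grad(theta, X[idx], Y[idx])   # equation 2.14
    return theta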
    So far, we have not discussed how neural networks are imple-
mented. The reason is because there are many different types of
networks and, besides what we stated above, they have little in com-
mon with each other. While it is not a property all neural networks
share, many of them organize their neurons into layers. The network’s
input represents the values of the neurons in the input layer and the
network’s output the values of the neurons in the output layer (x and
o in equation 2.11). In between the input and output layers sit one or
more hidden layers, so called because they are not directly observable
from the inputs and the outputs (Reed and Marks II 1999, Chapter 4).
Organizing neurons into layers makes neural networks very flexible
because the parameters of each layer, like the number of neurons, can
be configured independently. Indeed, one can even think of the layers,
rather than the neurons themselves, as nodes in a computational graph
and the practitioner’s job as being to combine these layers by drawing
the graph’s edges.
    In the following sections, we describe a few different neural net-
work designs: feed-forward networks, recurrent neural networks, and
the LSTM and the Transformer architectures – two architectures de-
signed for sequential data and the ones we use in this
thesis.

2.3.1     Feed-forward networks
The simplest neural network architecture is the feed-forward network.
The information in feed-forward networks flows in a single direction –
there are no cycles in the computational graph. This sets them apart
from recurrent networks wherein cycles exist (Goodfellow, Bengio, and
Courville 2016, Chapter 6). A feed-forward network is a composition
of parametric functions;
    f(x, (\theta_1, \ldots, \theta_n)) = f_n(f_{n-1}(\ldots f_1(x, \theta_1) \ldots, \theta_{n-1}), \theta_n).        (2.15)
The functions define the layers of the network and the number of layers
its depth. “Deep” in “deep learning” comes from this terminology.
Most layers in most feed-forward networks are fully-connected layers
– so called because every neuron in the layer is connected to every
neuron in the preceding layer. A fully-connected layer composes a
parametrized affine transformation with a non-parametrized scalar
activation function. The activation function is often nonlinear which
allows the network to represent nonlinear functions. (Goodfellow,
Bengio, and Courville 2016, Chapter 6). A network with only fully-
connected layers is called a fully-connected network.
    Let θ = ((W1 , b1 ), . . . , (Wn , bn )) specify the parameters of a fully-
connected network’s affine transformations and σ1 , . . . , σn its activation
functions, then the network computes:
    f(x, \theta) = \sigma_n(W_n \sigma_{n-1}(\ldots \sigma_1(x W_1 + b_1) \cdots + b_{n-1}) + b_n).        (2.16)
    While the Universal Approximation Theorem states that a two-layer
fully-connected network can represent any function, it may require a
prohibitively large number of parameters or may generalize poorly
(Goodfellow, Bengio, and Courville 2016, Chapter 6). Therefore, more
sophisticated architectures have been designed to overcome the limita-
tions of fully-connected feed-forward networks.
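    For concreteness, the forward pass of a small fully-connected network
(equation 2.16) can be written in a few lines of numpy. The weights below
are random placeholders; a real network would learn them with gradient
descent.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def feed_forward(x, layers, activations):
    # layers = [(W1, b1), ..., (Wn, bn)]; equation 2.16 evaluated layer by layer.
    h = x
    for (W, b), sigma in zip(layers, activations):
        h = sigma(h @ W + b)
    return h

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 16)), np.zeros(16)),   # hidden layer
          (rng.normal(size=(16, 3)), np.zeros(3))]    # output layer
out = feed_forward(rng.normal(size=8), layers, [relu, lambda z: z])
print(out.shape)   # (3,)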

2.3.2    Recurrent neural networks
A shortcoming of feed-forward networks is that they have no notion
of “previous” and cannot recall what they were doing previously.
For example, suppose a neural network classifies frames in movies.
The current frame is of course important, but information from prior
frames could perhaps improve the network’s predictions. If the last
frame was of an elephant the probability of the next frame also being
of an elephant must be quite high. It is unclear how a feed-forward
neural network could incorporate such information. Recurrent neural
networks (RNNs) were invented to address this issue. They excel on
sequence prediction tasks and are a good fit for symbolic music which,
as described in section 2.2, can be modelled as sequences.
    Unlike feed-forward networks, RNNs contain state, known as hid-
den state, that is affected by the data that passes through them. Let ht
be the network’s hidden state at time t and xt be the t:th element of a
sequence, then the network is defined by the recurrence relation

                            h t = f ( h t −1 , x t , θ ).             (2.17)

Typically, an RNN will use the hidden state to make predictions
(Goodfellow, Bengio, and Courville 2016, Chapter 10):

                                 o t = g ( h t ).                     (2.18)

If the network is employed to predict the next element of the sequence
then xt+1 = ot and we can write the above as

                         xt+1 = g( f (ht−1 , xt , θ )).               (2.19)

   Like feed-forward networks, RNNs can be layered. Suppose we
have three recurrent layers, defined by the hidden states h(1) , h(2) , and
h(3) and the parameters θ (1) , θ (2) , and θ (3) , then the network computes
the following recurrences and prediction function g:
    h_t^{(1)} = f(h_{t-1}^{(1)}, x_t, \theta^{(1)})
    h_t^{(2)} = f(h_{t-1}^{(2)}, h_t^{(1)}, \theta^{(2)})
    h_t^{(3)} = f(h_{t-1}^{(3)}, h_t^{(2)}, \theta^{(3)})        (2.20)
    o_t = g(h_t^{(3)}).

    To train an RNN, the recurrence relation has to be removed by
unfolding it over training sequences of fixed lengths. Unfolding effec-
tively means making a copy of the network for every element in the
training sequence with every copy sharing the same parameters. Let
x = (x1 , . . . , xn ) be a training sequence of length n, then the unfolded
computation is defined by iteratively applying f to the hidden state
and the sequence elements:

                    h n = f ( . . . f ( h0 , x1 , θ ) . . . , x n , θ ).   (2.21)

The initial hidden state, h0 , is set to some default value, commonly
zero (Zimmermann et al. 2005). The unfolded computation is free
of cycles and the RNN can be trained using backpropagation and
gradient descent like a feed-forward network.
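    The unfolded computation is easy to express in code. The sketch below
uses a plain tanh cell in place of f; it is an illustration of equations 2.17,
2.18, and 2.21, not the networks used in this thesis.

import numpy as np

def rnn_forward(xs, Wx, Wh, b, Wo, h0=None):
    # Unfold h_t = tanh(x_t Wx + h_{t-1} Wh + b) over the whole sequence
    # and emit a prediction o_t = h_t Wo at every step.
    h = np.zeros(Wh.shape[0]) if h0 is None else h0
    outputs = []
    for x in xs:
        h = np.tanh(x @ Wx + h @ Wh + b)
        outputs.append(h @ Wo)
    return outputs, h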
   The unfolding of the computation graph causes two major prob-
lems for RNNs. First, the depth is proportional to n, causing the
complexity of the training to also become proportional to n. Due to
the inherently sequential nature of the data flow, the problem cannot
be meaningfully parallelized, making training with longer sequences
very expensive. Furthermore, the activations of the neurons inside
the network are replicated n times which consumes memory pro-
portional to n (Hwang and Sung 2017). The second problem is the
vanishing/exploding gradient problem, caused by long chains of re-
peated multiplication necessary for propagating gradients in deep
computational graphs. Repeated multiplication of numbers greater
than one tends towards infinity and causes gradients to explode, while
repeated multiplication of numbers less than one tends towards zero
and causes gradients to vanish (Goodfellow, Bengio, and Courville
2016, Chapter 10). Vanishing gradients causes training to become very
slow and exploding gradients cause it to be wildly unstable. This
puts a bound on the length of the temporal dependencies RNNs can
learn (McGonagle, Williams, and Khim 2021). Goodfellow, Bengio,
and Courville (2016) estimate that the limit is somewhere around
10 to 20 tokens for traditional RNNs trained with stochastic gradient
descent. To overcome this limitation, gated RNNs were invented. They
are based on the idea of creating paths through time whose derivatives
will neither vanish nor explode (Goodfellow, Bengio, and Courville
2016, Chapter 10).

LSTM
The Long Short-Term Memory (LSTM) network is one of the most
successful gated RNN types (Mauthes 2018). Researchers have shown
that LSTMs can learn long-term dependencies more easily than sim-
ple recurrent architectures (Goodfellow, Bengio, and Courville 2016,
Chapter 10).
    The LSTM has a cell state in addition to its hidden state. The cell
state works as an auxiliary memory onto which the LSTM puts and
removes information that needs to be remembered long-term. Thus,
the LSTM’s recurrence relation is:

                      h t , C t = f ( h t −1 , C t −1 , x t , θ ).   (2.22)

Ct is the cell state at time t and the other variables are as previously
defined. The equations implementing the recurrence are:

    f_t = \sigma(x_t U_f + h_{t-1} W_f + b_f)
    i_t = \sigma(x_t U_i + h_{t-1} W_i + b_i)
    o_t = \sigma(x_t U_o + h_{t-1} W_o + b_o)
    \tilde{C}_t = \tanh(x_t U_g + h_{t-1} W_g + b_C)        (2.23)
    C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t
    h_t = \tanh(C_t) \circ o_t

The subscripted U, W, and b variables are appropriately sized pa-
rameter matrices and biases, ◦ the Hadamard product (element-wise
multiplication), and σ the sigmoid function:

    \sigma(x) = \frac{1}{1 + e^{-x}}.        (2.24)
   Three gates, the forget, input, and output gate, represented by the
vectors ft , it , and ot in the previous equations, control how the LSTM
stores information in the cell state and how it uses it to update its
hidden state (Olah 2015). They all apply a parametric affine transfor-
mation to ht−1 and xt , followed by the sigmoid function. The range
of the sigmoid function is 0 to 1, so the gates’ values are also
constrained to that range. Note that the range of the tanh function is
-1 to 1, meaning that the range of ht is -1 to 1.
    The forget gate determines what fraction of the old cell state to keep
and what to discard. The input gate determines how much of the candidate
cell state, C̃t , to keep: 0 means discard everything and 1 means keep
everything. The update of Ct in equation 2.23 can be thought of as a
weighted sum of the old cell state, Ct−1 , and the candidate cell state, C̃t .
This is what allows the LSTM to solve the vanishing gradient problem
and to better capture long-term dependencies. The final gate is the
output gate and controls how much memory information is passed
through to the predictor.
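    A single LSTM time step is a direct transcription of equation 2.23. The
numpy sketch below assumes the parameter matrices have already been
created with compatible shapes.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, params):
    # params holds the U, W and b parameters for the forget (f), input (i),
    # output (o) and candidate (g) transformations of equation 2.23.
    Uf, Wf, bf, Ui, Wi, bi, Uo, Wo, bo, Ug, Wg, bC = params
    f_t = sigmoid(x_t @ Uf + h_prev @ Wf + bf)    # forget gate
    i_t = sigmoid(x_t @ Ui + h_prev @ Wi + bi)    # input gate
    o_t = sigmoid(x_t @ Uo + h_prev @ Wo + bo)    # output gate
    C_tilde = np.tanh(x_t @ Ug + h_prev @ Wg + bC)
    C_t = f_t * C_prev + i_t * C_tilde            # new cell state
    h_t = np.tanh(C_t) * o_t                      # new hidden state
    return h_t, C_t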

2.3.3    The Transformer
The Transformer is a family of neural network architectures based on
the attention mechanism (Vaswani et al. 2017). It was popularized
by OpenAI’s Transformer-based GPT-2 architecture, which achieved
state-of-the-art results on many natural language processing tasks.
    Like RNNs, the Transformer is designed for sequence prediction.
However, it is wider and shallower than RNNs, lending itself well to
parallelization. This makes it possible to build extremely large Trans-
former models (Alammar 2018). An important feature of the Trans-
former is that it processes all sequence elements in parallel, rather than
serially like RNNs.
    At a high level, the Transformer is a Sequence-to-Sequence architec-
ture that implements an Encoder-Decoder architecture. Sequence-to-
Sequence (or seq2seq) means that it transforms an input sequence to
an output sequence and Encoder-Decoder that it consists of an encoder
and a decoder unit. The encoder unit takes the input sequence and
returns a context vector. The decoder unit takes the context vector and
returns the output sequence (Allard 2019). The encoder and decoder
units consist of multiple identical encoder and decoder layers stacked
on top of each other, each with its own set of learnable parameters.

   The Transformer’s data flow is:6

    h_0 = PE(x)
    h_n = E_n(\ldots E_1(h_0) \ldots)
    h_{m+n} = D_m(\ldots D_2(D_1(h_n, h_n), h_n) \ldots, h_n)        (2.25)
    f(x) = FC(h_{m+n}).

E1 , . . . , En and D1 , . . . , Dm represent the encoder and decoder layers,
h0 to hm+n the context vector as it flows through the layers, PE is a
layer for embedding and positionally encoding sequence elements, and
FC a fully-connected layer that produces predictions. Note that the
decoder layers take two values as input: the output of the previous
decoder layer and the output of the last encoder layer.
     The encoder layers’ data flow is:

    h' = LN(MHA(h, h, h) + h)
    E(h) = LN(FFN(h') + h')        (2.26)

and the decoder layers’ is:

    h' = LN(MHA(h, h, h) + h)
    h'' = LN(MHA(h_e, h_e, h') + h')        (2.27)
    D(h, h_e) = LN(FFN(h'') + h'').

he is the context vector returned by the last encoder layer, MHA is the
multi-head attention mechanism, LN a normalization layer, and FFN
a two-layer feed-forward network with a ReLU activation in between:

                   FFN ( x ) = max(0, xW1 + b1 )W2 + b2 .                 (2.28)

Multi-head attention
The multi-head attention mechanism is the heart of the Transformer
architecture. In a recurrent seq2seq model, the encoder processes
the input sequence one element at a time. Each element updates the
encoder’s context vector or hidden state. The decoder takes the context
vector as input and produces predictions. The problem with this is
that older elements tend to be forgotten or “overwritten” by more
recent elements.
    6 For brevity, in this and the following equations we omit most parameters and
dropout rates. We hope it is clear from context where parameters are missing. For
these and other details about the Transformer, see Vaswani et al. (2017).
    To deal with this problem the attention mechanism was invented.
It creates “shortcut” connections between the context vector and the
entire input sequence. The weights of these shortcut connections are
learnable for each input element, allowing the model to learn which
prior elements are most important for predicting the current element.
In other words, the model learns which elements to pay attention to.
    There are many ways to implement attention. In the Transformer
model it is implemented as

    MHA(q, k, v) = \operatorname{softmax}\left(\frac{FC(q)\,FC(k)^T}{\sqrt{s/n}}\right) FC(v).        (2.29)
s is the dimensionality of the context vector and n the number of
attention heads in the layer (see below). FC is a fully-connected layer
with no activation function and no bias, i.e., a multiplication of a learnable
matrix with a vector.
    The multi-headed nature of the Transformer’s attention mechanism
is not communicated in the equation above. The Transformer splits
FC (q), FC (k), and FC (v) into as many parts as there are heads and
computes the attention for each part separately. Intuitively, multiple
attention heads allows for attending to parts of the sequence differently.

Positional encodings
As the elements in a sequence flow through the Transformer simultane-
ously, it has no inherent sense of the elements’ ordering. While ordering
is implicit in RNNs, it has to be supplied to the Transformer explicitly
through a positional encoding – a piece of information added to every
element so that the model can track its order. The positional
encoding is fixed in the vanilla Transformer but learnable in GPT-2.
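    The fixed encoding of the vanilla Transformer interleaves sines and
cosines of geometrically spaced frequencies (Vaswani et al. 2017). A numpy
sketch, assuming an even model dimension:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...).
    pos = np.arange(seq_len)[:, None]
    dim = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, dim / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added element-wise to the embedded input sequence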

GPT-2
GPT-2 (Generative Pre-trained Transformer 2) is a variant of the Trans-
former architecture. Developed by OpenAI and published in 2019, it
is a refinement of the GPT architecture from 2018 which, in turn, is
a refinement of the Transformer-Decoder architecture (Radford, Wu,
et al. 2019). It differs from the vanilla Transformer in that it uses
only decoder layers and a learnable positional encoding, rather than a
fixed one. The following equations describe the architecture (Radford,
Narasimhan, et al. 2018):

    h_0 = x W_e + W_p
    h_m = D_m(\ldots D_2(D_1(h_0)) \ldots)        (2.30)
    f(x) = \operatorname{softmax}(h_m W_e^T)

We and Wp are matrices containing the embedding vectors and the
positional encoding. Both are learnable parameters. Note that the
embedding matrix is reused for producing predictions.
   The GPT-2’s decoder layers are similar to the encoder layers in the
Transformer:
    d' = LN(d)
    d'' = MHA(d', d', d') + d        (2.31)
    D(d) = FORW(LN(d'')) + d''.

FORW is a feed-forward network with Gaussian Error Linear Unit as
the activation function and LN and MHA are as previously defined.

2.3.4   Inferencing
To get a neural network to perform inferencing over categorical dis-
tributions, its last layer needs to be a computation whose result is a
probability distribution. A probability distribution is a vector whose
element-wise sum is 1, and whose elements are all in the range 0 to
1. Very often, this is accomplished by using softmax as the activation
function for the last layer:

    \operatorname{softmax}_i(x) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}.        (2.32)

K is the number of classes in the categorical distribution. For classi-
fication problems cross-entropy loss is almost always preferred over
other types of error functions (Janocha and Czarnecki 2017):

    E(\theta) = -\sum_{(x, y) \in \mathcal{D}} y \log(f(x, \theta)).        (2.33)
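    Both operations are one-liners in numpy. The sketch below evaluates
equations 2.32 and 2.33 for a single training example with a one-hot target
vector y.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtract the maximum for numerical stability
    return e / e.sum()          # equation 2.32

def cross_entropy(probs, y):
    # Equation 2.33 restricted to one example; y is a one-hot target vector.
    return -float(np.sum(y * np.log(probs)))

logits = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])             # the correct class is the first one
print(cross_entropy(softmax(logits), y))  # ~0.24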
Chapter 3

Related work

A common approach to generating symbolic music is to model it as
token sequences and generate it using autoregressive models. This is
the sequence modelling approach we discussed in section 2.2 applied
to music. The model starts with a possibly empty sequence and adds
one token at a time to the sequence until the sequence reaches a
sufficient length or until an end token is seen.
   The autoregressive method’s major drawback is that it often fails to
consider the overall structure of the music it is generating. Melodies
can wander and, while pleasant-sounding notes may be generated,
they often lack coherence. The opposite, that the generated music is
too repetitive, is also a problem. A musical phrase or a motif should
perhaps be repeated four to eight times, depending on the musical
genre, but repeating it more might make the music sound monotonous.
   The cause of these problems may be the models’ sizes. To capture
long-term structure, models need to operate on very long sequences.

 Name                   Year   Representation   Architecture     Length
 folk-rnn               2015   ABC notation     3-layer LSTM     200 tokens
 vgm-rnn                2018   MIDI             RNN              4-8 measures
 LakhNES                2019   Event-based      Transformer-XL   5-10s
 DeepBach               2016   Pitch matrix     LSTMs + CNNs     12s
 Relative Transformer   2019   Mixed            Transformer      10-15s

Table 3.1: Overview of some music generation models. The length column
indicates the duration of clips used in the model authors’ listener survey or,
in the case of folk-rnn, the average number of tokens per transcription (all
models are able to generate clips of any length).


But the longer the sequences, the more computationally expensive the
models become. With today’s hardware, autoregressive models are
limited to operating on sequences a few hundred tokens long.
     Many models reviewed in this chapter use novel techniques to
get around the autoregressive paradigm’s limitations. Sturm, Santos,
et al. (2016) modeled complete musical pieces that required, for the
most part, no more than 200 tokens to describe – short enough to
fit into their model’s memory. Donahue, Mao, Y. E. Li, et al. (2019)
limited the model’s output to about 500 tokens, equivalent to nine
seconds of audio. Huang et al. (2019) implemented a memory-efficient
Transformer able to operate on 2,000-token-long sequences. Coming
up with space-efficient music encodings that minimize the number of
tokens required to represent music appears to be very important for
creating successful models.
     Deep neural networks’ impressive performances stem from having
massive datasets to train on. Such datasets are readily available for
modeling, for example, English text, but not for modelling music.
Often, data augmentation techniques, including transposing songs and
adjusting the tempo by some small percentage, are used to expand the
training data available.

3.1     Sturm, Santos, et al. (2016)
Sturm, Santos, et al. (2016) used a dataset consisting of some 23
thousand folk music songs in ABC notation to train an RNN. ABC
notation is a simple text-based format created for notating folk music.
The notation is very compact – a few hundred characters is enough
to represent an entire song – and can be read by humans. While
there are extensions to ABC notation to make it usable for polyphonic
multi-instrumental music, it is more suitable for monophonic music.
Figure 3.1 shows a reel (a type of folk dance) notated in ABC and sheet
notation.
    The authors trained two networks on the dataset, char-rnn and
folk-rnn, with different tokenization strategies. In char-rnn, individual
characters were tokens, while in folk-rnn syntactic units were tokens.
For example, “:|” (right repeat) and “M:4/4” (meter) would be single
tokens to folk-rnn but sequences of tokens to char-rnn. Both networks
were three-layer LSTM-networks with 512 units in each layer, totaling

     X:262
     T:ReelDuGin
     M:4/4
     L:1/8
     K:Gmaj
     |:gdgbagab|gdgbagef
     |gdgbagbg|gedBABG2:|
     |:gedBABGA|BGABcdef
     |gedBABGA|BGEBA2G2:|

Figure 3.1: The same reel in both ABC notation and sheet notation. Note the
repetition markers on every other line.

Figure 3.2: Screenshot of folk-rnn’s web interface at https://folkrnn.org

about 5.6 million parameters. Figure 3.2 shows a screenshot of folk-rnn
in action. Creating a full transcription takes only a few seconds.
    In their evaluation of folk-rnn, they found that the probability of
each token in the generated data roughly matched that of the training
data, but that the generated transcriptions’ lengths differed from the
ones in the training data (Sturm, Santos, et al. 2016). folk-rnn tended
to generate transcriptions whose lengths were closer to the peaks of the
length distribution. In other words, the variance of the distribution of
the number of tokens was lower than expected.
    The compact notation and the homogeneity of the dataset probably
contributed to the good performance of the networks they trained.
Folk music is a very regular type of music with similar themes and
structures recurring in many songs. 54% of all transcriptions in the
dataset had an AABB structure with each section being eight bars
long, according to the authors. I.e., eight bars repeated once, followed
by another set of eight bars also repeated once.

   A drawback of their model was that it only generated monophonic
music. However, the authors argued that “harmony is implicit in the
melody” and that richer compositions could be constructed based
on the generated transcriptions. One interesting issue they did not
cover is the effect of various sampling techniques on the quality of the
generated transcriptions.

3.2     Hadjeres and François Pachet (2016)
Hadjeres and François Pachet (2016) created a program called Deep-
Bach for generating four-voice chorales in the style of Johann Sebastian
Bach. DeepBach first generates a random chorale and then iteratively
refines it using random resampling (Gibbs sampling). This, they
argued, is a more flexible approach than autoregressive generation
because a part of the chorale can be held constant while the remainder
is resampled. For example, the user can specify one voice and let
DeepBach reharmonize the chorale, i.e., generate the three remaining
voices based on the user-specified voice.
    DeepBach represents each chorale as a pitch matrix. The pitch
matrix has four rows for the four voices: soprano (S), alto (A), tenor
(T), and bass (B). The columns are time in the chorale quantized
as sixteenth notes. Thus, each element in the matrix describes a
voice’s pitch at a given time step. Special hold values (–––) are used
to distinguish between repeating notes and notes longer than one
sixteenth note. For example, a C# eighth-note in the third octave is
notated as C#3 followed by –––.
    In addition to the pitch matrix, the representation includes two
sequences whose elements correspond to the chorale’s time steps;
the subdivision list and the fermata list. The subdivision list cycles
four integers, 1, 2, 3, 4, 1, 2, 3, 4, 1, . . ., and represents the indexes of
the sixteenth notes within their beats. The elements of the fermata
list are 1 where notes annotated by the fermata symbol occur and 0
elsewhere. Fermatas are used in sheet music to introduce slight pauses
in performances and to demarcate musical phrases. The subdivision list
helps DeepBach keep the rhythm steady and the fermata list helps it
understand the overall structure of the chorale it generates. Figure 3.3
shows the same measure represented in three ways; in DeepBach’s
grid representation, in piano roll format, and in sheet notation.

     qn1                     qn2                 qn3                   qn4
S:   D-5   ---   E-5   F-5   D-5   --- --- --- C-5 --- ---       ---   E-5   ---   ---   ---
A:   A-4   ---   ---   ---   G-4   --- F-4 --- E-4 --- E-4       ---   E-4   ---   ---   ---
T:   C-4   ---   ---   ---   B-3   --- --- --- G-3 --- ---       ---   A-3   ---   ---   ---
B:   F-3   ---   D-3   ---   G-3   --- --- --- C-2 --- ---       ---   C#2   ---   ---   ---
s:     1     2     3     4     1     2    3   4    1   2     3     4     1     2     3     4
f:     0     0     0     0     0     0    0   0    1   1     1     1     0     0     0     0
                                     (a) Grid representation

                                          (b) Piano roll

                                        (c) Sheet notation

Figure 3.3: A measure in the grid representation Hadjeres and François
Pachet (2016) created and the equivalent piano roll and sheet music notation.
The top four lines show the voices (soprano, alto, tenor, and bass) and the
bottom two show the subdivision (s) and fermata (f) lists. The four ones in
the fermata list denote one fermata over one of the notes in the third quarter
note. qn1 to qn4 are not part of the representation – they clarify at which time
steps quarter notes can begin.

   The authors formalized DeepBach as a dependency network over
the conditional probability distributions for each pitch in the chorale
as
             pi,t (Vi,t |V\i,t , M, θi,t ), for i ∈ [4] and t ∈ [ T ]. (3.1)
Vi,t is the pitch voice i plays at time step t, V\i,t the pitches for all
voices at all time steps except for Vi,t , M the subdivision and fermata
lists, and θi,t the distributions’ parametrizations. They dropped the t
indexes to make DeepBach time invariant, resulting in distributions of
the form
                             pi (Vi,t |V\i,t , M, θi ).             (3.2)
I.e., each voice is modeled separately but depends on every other voice.
Note that this formalization, unlike traditional step-by-step-generation
methods, incorporates information from both the past and the future.
     To implement this sophisticated scheme, DeepBach uses four net-
works for the four voices. Each of the four networks consists of several
subnetworks; two LSTMs for processing the voice’s past and future
pitches, and a feed-forward network for processing simultaneous
pitches. The three networks can be thought of as looking left, right,
and vertically (up and down). A fourth network merges the underlying
networks’ outputs, yielding an approximation for pi (Vi,t |V\i,t , M, θi ).
To keep DeepBach fast, the LSTMs only consider 16 future and past time
steps, rather than summing over the whole of V\i,t . The authors
stated that “[t]his approximation appears to be accurate since musi-
cal analysis reveals that Bach chorales do not exhibit clear long-term
dependencies.”
     DeepBach generates chorales by first selecting a subdivision and
fermata list from an existing chorale and then initializing the pitch
matrix with random values. It then repeatedly updates the pitch
matrix a set number of times by resampling a randomly chosen Vi,t .
To improve efficiency, it batches updates in groups of 16 or 32. Figure
3.4 shows the piano roll of two measures as DeepBach incrementally
improves the measures using Gibbs sampling.
     The authors trained DeepBach on the Johann Sebastian Bach Chorales
dataset, a collection of 382 roughly one-minute long, four-voice chorales.
To this collection they added all chorale transpositions fitting within
the vocal ranges defined by the initial dataset, growing the dataset to
2,503 chorales.
    In a test of whether humans could tell DeepBach’s harmonizations
apart from real harmonizations written by Bach, the authors found that
DeepBach outperformed two baselines: a Maximum Entropy model as
in (Hadjeres, Sakellariou, and François Pachet 2016) and a multilayer
perceptron network. The models were given random soprano voices
from the chorales in the validation dataset and had to generate the
alto, tenor, and bass voices. Around 50% of the time, the survey takers
would judge a reharmonization generated by DeepBach as composed
by Bach, which they considered to be a good score.

Figure 3.4: Piano rolls of two measures (32 sixteenth-notes) iteratively up-
dated by Gibbs sampling. The top left figure shows the random initialization,
the figure to its right the state after 50 batched iterations, the next the state
after 100, and so on. The bottom right figure shows the state after 250
iterations.

3.3      Donahue, Mao, Y. E. Li, et al. (2019)
Donahue, Mao, Y. E. Li, et al. (2019) trained a Transformer-XL network
to generate video game music in the style of the 1980s video game
console the Nintendo Entertainment System (NES). The NES sound
system has four channels;1 P1, P2, and TR for playing melodic notes,
and NO for generating noise serving as percussion. The channels
playing melodic notes cover six to seven octaves and the noise channel
can play 16 types of noise. Each channel can only play one note at a
time.
    The authors created a simple event-based format for NES-compatible
music, with NOTEON and NOTEOFF events for turning pitches and per-
cussive sounds on and off, and DT (delta tick) events for advancing
time. Since they subdivided time into 44,100 ticks per second, they
quantized time advancements to keep the number of events manage-
able. For example, instead of having one time advance event for every
conceivable tick delta they represented a time advancement of 1,840
ticks as the three-event sequence DT_1000, DT_800, and DT_40. On
average, nine seconds of audio required about 500 events to represent.
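    The idea behind the quantization can be illustrated with a small Python
sketch. The set of allowed delta sizes below is a guess for illustration
purposes – the paper's actual grid is not given in the text – but it
reproduces the 1,840-tick example above.

# Hypothetical set of allowed DT sizes; not taken from the paper.
ALLOWED_DELTAS = [10000, 1000, 800, 100, 50, 40, 10, 1]

def encode_delta(ticks):
    # Greedily decompose a tick delta into a sequence of DT events.
    events = []
    for d in ALLOWED_DELTAS:
        while ticks >= d:
            events.append(f"DT_{d}")
            ticks -= d
        if ticks == 0:
            break
    return events

print(encode_delta(1840))   # ['DT_1000', 'DT_800', 'DT_40']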
    They used the Lakh MIDI dataset for pre-training and NES-MDB
for fine-tuning. Lakh MIDI is a collection of about 175 thousand songs
in MIDI format and NES-MDB a collection of about five thousand
songs from the soundtracks to about three hundred NES games (Don-
ahue, Mao, and McAuley 2018). Since the MIDI format is much richer
than the authors’ event-based format, they invented a protocol for
converting MIDI files to their format by randomly mapping melodic
MIDI instruments to one of the NES’s three melodic channels and per-
cussive instruments to one of the 16 noise types. This conversion grew
their pre-training dataset to some 775,000 songs as there were multiple
ways to convert each song. They also augmented their datasets by
transposing songs and randomly adjusting the speed of songs by some
small percentage.
    1 A fifth channel exists for playing low-quality samples, but it wasn’t used in the
authors’ work.

P2_NOTEON_87, DT_30, TR_NOTEON_20, DT_50, P2_NOTEOFF,
NO_NOTEON_12, DT_50, P2_NOTEON_90, DT_50, P1_NOTEON_76,
TR_NOTEON_22, ...

Figure 3.5: Example of Donahue, Mao, Y. E. Li, et al. (2019)’s event-based
representation and resulting piano roll. The event sequence should be read
as follows: play pitch 87 on P2, wait 30 ticks, play pitch 20 on TR, wait 50
ticks, silence P2, play noise 12 on NO, wait 50 ticks, play pitch 90 on P2, wait
50 ticks, play pitch 76 on P1, change to pitch 22 on TR, and so on.
    They compared their Transformer-XL with other neural networks
of similar sizes and found that it performed very well both when
comparing perplexities, and in a listener survey. In the listener survey
respondents indicated which of two clips they preferred and which
they thought was composed by a human. However, they only used
five- and nine-second-long clips, perhaps because their network failed
to stay coherent over longer durations.
    They theorized that a beat-based format, rather than their tick-
based one, would fare better. Likely, it would reduce the number
of tokens required to encode the music (500 tokens is a lot for nine
seconds of audio) and also make it easier for the network to learn
rhythms.

         qn1               qn2               qn3               qn4
    S:   65 65   65   65   72 72   70   70   69 69   67   67   65 65   65   65
    A:   60 60   60   60   60 60   60   60   60 60   60   60   62 62   64   64
    T:   57 57   57   57   55 55   55   55   53 53   55   55   57 57   58   58
    B:   53 53   53   53   52 52   52   52   53 53   52   52   50 50   50   50

Figure 3.6: The opening measure of Chorale #305 encoded in grid representa-
tion and the resulting piano roll. The tenor and bass voices overlap in the
third quarter note.

3.4      Huang et al. (2019)
Like previously mentioned authors, Huang et al. (2019) used sequence-
modeling for generating symbolic music. They developed a memory-
efficient implementation of the Relative Transformer model introduced
by Shaw, Uszkoreit, and Vaswani (2018), allowing them to train on
2000-token-long sequences, far longer than the training sequences
used in other works. The authors reported state-of-the-art results for
their model which outperformed two baselines – an LSTM model
augmented with attention and a vanilla Transformer model – both in
listener tests and when measuring negative log-likelihood.
    The authors used two datasets for evaluating their model: the Johann
Sebastian Bach Chorales dataset (see section 3.2 for a description of the
dataset and the structure of chorales), and the Piano-e-Competition
dataset, consisting of 1,100 classical piano performances in MIDI for-
mat. Due to differences in the structure of the songs in each dataset,
they encoded them differently.
    The authors used a pitch matrix with time quantized as sixteenth
notes for representing the chorales, as shown in figure 3.6. Thus, one
measure is represented using a 4 × 16-grid of integers.2 The matrix is
    2 Figure 6 on page 11 in Huang et al. (2019) indicates that time was actually
quantized as quarter-notes and not sixteenth-notes, so that one measure would be
represented by a 4 × 4 grid.