Digital Music Input Rendering for Graphical Presentations in SoundStroll (MaxMSP)

Justin Kerobo
School of Computer Science and Music
Earlham College
Richmond, Indiana 47374
Email: jakerobo15@earlham.edu

Abstract—A graphical presentation is produced at the display of a host computer such that a scene description is rendered and updated by a received digital music input. The digital music input is matched to trigger events of the scene description, and the actions of each matched trigger event are executed in accordance with action processes of the scene description, thereby updating the scene description with respect to the objects on which the actions are executed. The updated scene description is then rendered. The system provides a means for connecting a graphics API to a musical instrument digital interface (e.g., MIDI) data stream, possibly from Ableton or Reason, to MaxMSP, producing a graphical presentation in SoundStroll in MaxMSP.

Keywords—MIDI, LaTeX, OpenGL, Graphics, MaxMSP, Music Technology, Ableton, FFT, IFFT, Computer Music.

I. INTRODUCTION

A variety of computer software programs are available for defining and manipulating objects in a virtual three-dimensional (3D) world; "3DS Max" from Autodesk, Inc. and SolidWorks are two examples. They provide an assortment of tools in a convenient graphical user interface (GUI) for manipulating and editing 3D virtual objects. Programs for computer display screensavers also permit manipulation of moving images. Also popular are computer software programs for manipulating video clips, multimedia clips, and the like, such as Max, Aperture, and ArKaos. Another popular medium that supports creativity with computers is the family of software applications built around the musical instrument digital interface (MIDI) standard.

The MIDI standard permits connection of musical instruments with digital output to related digital sound processing devices, including computers with sound cards and sound editing applications, soundboards, broadcast equipment, and the like. Music has commonly been performed with instruments that send digital MIDI data since the introduction of MIDI in 1983. MIDI provides a flexible set of instructions that are sent via a serial data link from a controller to a receiver, which processes those commands in a variety of ways that pertain to the output functions of the receiving device. The data and instructions most commonly concern sounds and music, but can also carry instructions for machine control and lighting control devices.

A separate branch of technology is computer video graphics: the digital electronic representation and manipulation of virtual worlds comprised of three-dimensional objects in a 3D space, with applications in many fields, from microscopic imaging to galactic modeling and, notably, computer graphics for films and gaming environments. There have been a few attempts to associate the direct performance of music with computer video and graphics to create new art forms. One program, Bliss Paint for the Macintosh, used MIDI input to change colors on an evolving rendering of a fractal image. Another program, ArKaos, uses MIDI commands to play video clips in a DJ-like process. A third, MaxMSP, uses MIDI commands in a flexible environment to drive video clips and audio clips and to trigger external events.

There are many computer programs that control sound in various ways in response to a MIDI command stream. The "3DMIDI" program appears to be unsupported, and it is not clear whether the software works or ever worked. The available documentation describes a set of separate programs, each of which performs a prescribed set of transformations on an embedded set of objects in response to MIDI.

Each different performance is loaded and executed separately, and has its own unique tool set for making specific adjustments to the objects in that scene. An API invites others to develop their own performances, each with its own unique set of objects and tools, which cannot be edited at that point. Unfortunately, there is no convenient user interface available for connecting computer graphics with musical instrument digital data. Conventional methods generally require cumbersome specification of input sources, scene description parameters and data objects, and linking of input sources to scene description objects. As a result, a relatively high level of computer skill is necessary for creating graphical presentations in conjunction with music input. Creative output would improve if users could create scenes with objects and change both the objects and the nature of the interaction between the video graphics and the MIDI music data.

Because of these difficulties and this complexity, there is a need for a graphical user interface that supports integration with digital musical instruments and vocal recognition; this is possible through the Fourier transform (FT). It should be possible to create such an interface as a three-dimensional audio sequencer and spatializer in MaxMSP, together with a speech processing application, using Fast Fourier Transform (FFT), Inverse Fast Fourier Transform (IFFT), and Discrete Fourier Transform (DFT) analyses to filter for keywords in order to find objects and create a scene that you can traverse.
II. HISTORY OF MIDI

The MIDI (Musical Instrument Digital Interface) protocol has become the dominant method of connecting pieces of electronic musical equipment, and when you consider the previous standard you have to say that MIDI arrived at just the right time.

The control voltage (CV) and gate trigger system used on early analogue synths was severely limited in its scope and flexibility. Analogue synths tended to have very few features that could be controlled remotely, relying as they did on physical knobs and sliders, patch cables and manual programming.

Furthermore, there was no universal standard for the way CV control should work, complicating the process of interfacing between products from different manufacturers. The majority of vintage CV-controlled synths can now be adapted with a CV-to-MIDI converter, so you can use MIDI to control them.

Dave Smith, founder of Californian synth legend Sequential Circuits and now head of Dave Smith Instruments, anticipated the demand for a more powerful universal protocol and developed the first version of the MIDI standard, which was released in 1983. With the increasing complexity of synths, and as the music industry shifted towards digital technology and computer-based studios, the MIDI setup took off and became the standard for connecting equipment.

A. How It Works

Absolutely no sound is sent via MIDI, just digital signals known as event messages, which instruct pieces of equipment. The most basic example can be illustrated by considering a controller keyboard and a sound module. When you push a key on the keyboard, the controller sends an event message which corresponds to that pitch and tells the sound module to start playing the note. When you let go of the key, the controller sends a message to stop playing the note.

Of course, the MIDI protocol allows for control over more than just when a note should be played. Essentially, a message is sent each time some variable changes, whether it be note-on/off (including, of course, exactly which note it is), velocity (determined by how hard you hit the key), after-touch (how hard the key is held down), pitch-bend, pan, modulation, volume or any other MIDI-controllable function.

The protocol supports a total of 128 notes (from the C five octaves below middle C up to the G roughly ten octaves higher), 16 channels (so that 16 separate devices can be controlled per signal chain, or multiple devices can be assigned the same channel so they respond to the same input) and 128 programs (corresponding to patches or voice/effect setting changes). MIDI signals also include built-in clock pulses, which define the tempo of the track and allow basic timing synchronization between equipment.
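As a concrete illustration of these event messages (an added sketch, not something taken from SoundStroll itself), a note-on is a three-byte message: a status byte whose upper four bits are 0x9 and whose lower four bits select one of the 16 channels, followed by a note number (0-127) and a velocity (0-127); a note-off uses status nibble 0x8 (or a note-on with velocity 0). The Python helper names below are our own, but the byte layout is the standard MIDI 1.0 channel-voice format.

# Illustrative sketch: building raw MIDI channel-voice messages as bytes.
# The 0x9/0x8 status nibbles and the 0-127 data ranges follow the MIDI 1.0
# specification; the helper function names are ours, not a library API.

def note_on(channel: int, note: int, velocity: int) -> bytes:
    """Return the 3-byte note-on message for a channel index 0-15."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])

def note_off(channel: int, note: int, velocity: int = 0) -> bytes:
    """Return the 3-byte note-off message."""
    return bytes([0x80 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])

# Middle C (note 60) struck fairly hard on channel 1 (index 0), then released.
press = note_on(0, 60, 100)      # bytes 0x90 0x3C 0x64
release = note_off(0, 60)        # bytes 0x80 0x3C 0x00
print(press.hex(" "), release.hex(" "))

Velocity, after-touch, pitch-bend and controller messages follow the same pattern with different status nibbles.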
The other major piece of the jigsaw is the SysEx (System Exclusive) message, designed so that manufacturers could utilize MIDI to control features specific to their own equipment. In order to control a SysEx function, a manufacturer-specific ID code is sent. Equipment which isn't set up to recognize that particular code will ignore the rest of the message, while devices that do recognize it will continue to listen.

SysEx messages are usually used for tasks such as loading custom patches and are typically recorded into a sequencer using a 'SysEx Dump' feature on the equipment.

MIDI information was originally sent over a screened twisted-pair cable (two signal wires plus an earthed shield to protect them from interference) terminated with 5-pin DIN plugs. However, this format has been superseded to some extent by USB connections, as we'll discuss later. No waves or varying voltages are transmitted, since MIDI data is sent digitally: the signal pins either carry a voltage or none at all, corresponding to the binary logical values 1 and 0.

These binary digits (bits) are combined into 8-bit messages. The protocol supports data rates of up to 31,250 bits per second. Each MIDI connection sends information in one direction only, meaning two cables are needed if a device is used both to send and receive data (unless you're working over USB, that is).
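To get a feel for what 31,250 bits per second means in practice, note that each byte on the wire is framed by a start bit and a stop bit (10 bits in total, the standard serial-framing assumption), so a three-byte note message occupies the link for roughly a millisecond. A quick back-of-the-envelope calculation:

# Back-of-the-envelope MIDI wire timing, assuming the usual framing of
# 1 start bit + 8 data bits + 1 stop bit per byte at 31,250 baud.
BAUD = 31_250
BITS_PER_BYTE = 10

byte_ms = 1000 * BITS_PER_BYTE / BAUD     # ~0.32 ms per byte
note_on_ms = 3 * byte_ms                  # a 3-byte channel message: ~0.96 ms

print(f"one byte: {byte_ms:.2f} ms, one note-on: {note_on_ms:.2f} ms")
# A dense burst of events (a thick chord plus controller sweeps) therefore
# serializes into several milliseconds, one of the timing issues discussed
# under "Right On Time" below.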
In addition to the expected IN and OUT connections, most MIDI devices also have a THRU port. This simply repeats the signal received at the IN port so it can be sent on to other devices further down the chain. Devices may be connected in series and, for the largest MIDI setups, an interface with multiple output ports may be used to control more than 16 separate chained devices.

B. Becoming a Standard

The key feature of MIDI when it was launched was its efficiency: it allowed a relatively significant amount of information to be transmitted using only a small amount of data. Given the limitations of early '80s digital data transmission methods, this was essential to ensure that the reproduction of musical timing was sufficiently accurate.

Manufacturers quickly adopted MIDI, and its popularity was cemented by the arrival of MIDI-compatible computer hardware (most notably the built-in MIDI ports of the Atari ST, which was released in 1985). As weaknesses or potential extra features were identified, the MIDI Manufacturers Association updated the standard regularly following its first publication.

The most notable updates, Roland MT-32 (1987), General MIDI (1991) and GM2 (1999), Roland GS (1991) and Yamaha XG (1997-99), added further features or standards, generally without making previous ones obsolete. It is questionable just how relevant the majority of these standards are to digital musicians and producers, since most of them relate in large part to standardizing the playback of music distributed in MIDI format. Unless you intend to distribute your music as MIDI files, most of them probably won't affect you.

C. Right On Time

The most common criticisms of the MIDI protocol relate to timing issues. Although MIDI was efficient by the standards of the early '80s, it is still undeniably flawed to some extent. There is some degree of jitter (variation in timing) present in MIDI, resulting in discernible sloppiness in recording and playback.

Perhaps even more obvious to most of us is latency, the delay between triggering a function (such as a sound) via MIDI and the function being carried out (in this case, the sound being reproduced). The more information sent via MIDI, the more latency is created. It may only be in the order of milliseconds, but it is enough to become noticeable to the listener.

Even more problematic is the fact that most of us use MIDI in a computer-based studio, and each link in the MIDI and audio chain can potentially add to the latency. This could be due either to software (drivers, DAWs, soft synths) or to hardware (RAM, hard drives, processors), but the end result is sloppy timing. The blame cannot be laid entirely at the door of MIDI, but the weaknesses of multiple pieces of MIDI equipment, combined with all the other sources of timing error, can have a significant detrimental effect on the end result.

Most new MIDI equipment is supplied not only with traditional 5-pin DIN connections but with standard Type A or B USB ports that allow direct connection to your computer. However, USB is not the solution to all your MIDI timing problems. Despite the higher data transfer rates possible over USB, latency is actually higher than over a standard DIN-based MIDI connection. Furthermore, jitter is significantly higher when using MIDI over USB, leading to unpredictable inaccuracies in timing.

D. Beyond MIDI

It is clear that while MIDI has been massively important to the development of music technology over the last 25 years, it does come with a few major weaknesses. One heavily researched alternative, the Zeta Instrument Processor Interface protocol proposed in the mid-'90s, failed to gain support from manufacturers and never saw commercial release. However, the same development team helped to develop the OpenSound Control (OSC) protocol used by the likes of Native Instruments' Reaktor and Traktor and the Max/MSP and SuperCollider development environments.

OSC is a much higher bandwidth system which overcomes many of the timing issues of MIDI, most notably by transmitting information with built-in timing messages as quickly as possible through high-bandwidth connections, rather than relying on the real-time event messages used by MIDI devices, which simply assume that timing is correct and respond to each message as soon as it is received.

One significant barrier to the development of a universal protocol for contemporary music equipment is that there is so much variation between equipment. With so many different synthesis methods, programming systems, levels of user control and forms of sound manipulation available on different pieces of gear, it is unlikely that any universal system for their control is possible.

However, as computer processing and interfacing technologies have developed so rapidly since the early '80s, perhaps the solution lies not with updating or replacing MIDI, but rather with placing greater onus on manufacturers and software developers to come up with their own powerful proprietary DAW-based control systems operating via existing USB, FireWire or even Ethernet connections, or wirelessly.

III. HISTORY OF CAD

Modern engineering design and drafting can be traced back to the development of descriptive geometry in the 16th and 17th centuries. Drafting methods improved with the introduction of drafting machines, but the creation of engineering drawings changed very little until after World War II.

During the war, considerable work was done in the development of real-time computing, particularly at MIT, and by the 1950s there were dozens of people working on numerical control of machine tools and automating engineering design. But it is the work of two people in particular, Patrick Hanratty and Ivan Sutherland, that is largely credited with setting the stage for what we know today as CAD, or Computer Aided Design.

A. The Fathers of CAD

Hanratty is widely credited as "the Father of CADD/CAM." In 1957, while working at GE, he developed PRONTO (Program for Numerical Tooling Operations), the first commercial CNC programming system. Five years later, Sutherland presented his Ph.D. thesis at MIT, titled "Sketchpad, A Man-Machine Graphical Communication System." Among its features was the first graphical user interface, using a light pen to manipulate objects displayed on a CRT.

The 1960s brought other developments, including the first digitizer (from Auto-trol) and DAC-1, the first production interactive graphics manufacturing system. By the end of the decade, a number of companies had been founded to commercialize their fledgling CAD programs, including SDRC, Evans and Sutherland, Applicon, Computervision, and M and S Computing.

By the 1970s, research had moved from 2D to 3D. Major milestones included the work of Ken Versprille, whose invention of NURBS for his Ph.D. thesis formed the basis of modern 3D curve and surface modeling, and the development by Alan Grayer, Charles Lang, and Ian Braid of the PADL (Part and Assembly Description Language) solid modeler.

With the emergence of UNIX workstations in the early '80s, commercial CAD systems like CATIA and others began showing up in aerospace, automotive, and other industries. But it was the introduction of the first IBM PC in 1981 that set the stage for the large-scale adoption of CAD. The following year, a group of programmers formed Autodesk, and in 1983 released AutoCAD, the first significant CAD program for the IBM PC.

B. The CAD Revolution

AutoCAD marked a huge milestone in the evolution of CAD. Its developers set out to deliver 80 percent of the functionality of the other CAD programs of the day, for 20 percent of their cost. From then on, increasingly advanced drafting and engineering functionality became more affordable. But it was still largely 2D.
That changed in 1987 with the release of Pro/ENGINEER, a CAD program based on solid geometry and feature-based parametric techniques for defining parts and assemblies. It ran on UNIX workstations (PCs of the time were simply not powerful enough), but it was a game changer. The later years of the decade saw the release of several 3D modeling kernels, most notably ACIS and Parasolid, which would form the basis for other history-based parametric CAD programs.

C. CAD Today, CAD Tomorrow

The modern CAD era has been marked by improvements in modeling, incorporation of analysis, and management of the products we create, from conception and engineering to manufacturing, sales, and maintenance (what has become known as PLM, product lifecycle management).

"Engineers and designers are being asked to create more, faster, and with higher quality," says Bill McClure, vice president of product development at Siemens PLM. With all of this pressure on engineers and designers, the question becomes: what is the next big evolution in CAD?

IV. MAXMSP AND ITS USES

A. Introduction

Max, also known as Max/MSP/Jitter, is a visual programming language for music and multimedia developed and maintained by the San Francisco-based software company Cycling '74. Over its more than thirty-year history, it has been used by composers, performers, software designers, researchers, and artists to create recordings, performances, and installations. The Max program is modular, with most routines existing as shared libraries. An application programming interface (API) allows third-party development of new routines (named external objects). Thus, Max has a large user base of programmers unaffiliated with Cycling '74 who enhance the software with commercial and non-commercial extensions to the program. Because of this ingenious extensible design, which simultaneously represents both the program's structure and its graphical user interface (GUI), Max has been described as the lingua franca for developing interactive music performance software.

B. History

1) 1980s: Miller Puckette began work on Max in 1985, at the Institut de Recherche et Coordination Acoustique/Musique (IRCAM) in Paris. Originally called The Patcher, this first version provided composers with a graphical interface for creating interactive computer music scores on the Macintosh. At this point in its development, Max could not perform its own real-time sound synthesis in software, but instead sent control messages to external hardware synthesizers and samplers using MIDI or a similar protocol. Its earliest widely recognized use in composition was for Pluton, a 1988 piano and computer piece by Philippe Manoury; the software synchronized a computer to a piano and controlled a Sogitec 4X for audio processing.

In 1989, IRCAM developed Max/FTS ("Faster Than Sound"), a version of Max ported to the IRCAM Signal Processing Workstation (ISPW) for the NeXT. Also known as "Audio Max", it would prove a forerunner to Max's MSP audio extensions, adding the ability to do real-time synthesis using an internal hardware digital signal processor (DSP) board. The same year, IRCAM licensed the software to Opcode Systems.

2) 1990s: Opcode launched a commercial version named Max in 1990, developed and extended by David Zicarelli. However, by 1997, Opcode was considering cancelling it. Instead, Zicarelli acquired the publishing rights and founded a new company, Cycling '74, to continue commercial development. The timing was fortunate, as Opcode was acquired by Gibson Guitar in 1998 and ended operations in 1999.

IRCAM's in-house Max development was also winding down; the last version produced there was jMax, a direct descendant of Max/FTS developed in 1998 for Silicon Graphics (SGI) and later for Linux systems. It used Java for its graphical interface and C for its real-time backend and was eventually released as open-source software. Meanwhile, Puckette had independently released a fully redesigned open-source composition tool named Pure Data (Pd) in 1996, which, despite some underlying engineering differences from the IRCAM versions, continued in the same tradition. Cycling '74's first Max release, in 1997, was derived partly from Puckette's work on Pure Data. Called Max/MSP ("Max Signal Processing", or the initials of Miller Smith Puckette), it remains the most notable of Max's many extensions and incarnations: it made Max capable of manipulating real-time digital audio signals without dedicated DSP hardware. This meant that composers could now create their own complex synthesizers and effects processors using only a general-purpose computer like the Macintosh PowerBook G3.

In 1999, the Netochka Nezvanova collective released nato.0+55, a suite of externals that added extensive real-time video control to Max.

3) 2000s: Though nato became increasingly popular among multimedia artists, its development stopped in 2001. SoftVNS, another set of extensions for visual processing in Max, was released in 2002 by Canadian media artist David Rokeby. Cycling '74 released their own set of video extensions, Jitter, alongside Max 4 in 2003, adding real-time video, OpenGL graphics, and matrix processing capabilities. Max 4 was also the first version to run on Windows. Max 5, released in 2008, redesigned the patching GUI for the first time in Max's commercial history.

4) 2010s: In 2011, Max 6 added a new audio engine compatible with 64-bit operating systems, integration with the Ableton Live sequencer software, and an extension called Gen, which can compile optimized Max patches for higher performance. Max 7, the most recent major version, was released in 2014 and focused on 3D rendering improvements. On June 6, 2017, Ableton announced its purchase of Cycling '74, with Max continuing to be published by Cycling '74 and David Zicarelli remaining with the company. Programs sharing Max's visual programming concepts are now commonly used for real-time audio and video synthesis and processing.
C. Language

Max is named after composer Max Mathews, and can be considered a descendant of his MUSIC language, though its graphical nature disguises that fact. Like most MUSIC-N languages, Max distinguishes between two levels of time: that of an event scheduler, and that of the DSP (this corresponds to the distinction between k-rate and a-rate processes in Csound, and between control rate and audio rate in SuperCollider).

The basic language of Max and its sibling programs is that of a data-flow system: Max programs (named patches) are made by arranging and connecting building blocks of objects within a patcher, or visual canvas. These objects act as self-contained programs (in reality, they are dynamically linked libraries), each of which may receive input (through one or more visual inlets), generate output (through visual outlets), or both. Objects pass messages from their outlets to the inlets of connected objects.

Max supports six basic atomic data types that can be transmitted as messages from object to object: int, float, list, symbol, bang, and signal (for MSP audio connections). Several more complex data structures exist within the program for handling numeric arrays (table data), hash tables (coll data), XML information (pattr data), and JSON-based dictionaries (dict data). An MSP data structure (buffer~) can hold digital audio information within program memory. In addition, the Jitter package adds a scalable, multi-dimensional data structure for handling large sets of numbers for storing video and other datasets (matrix data).

Max is typically learned by acquiring a vocabulary of objects and how they function within a patcher; for example, the metro object functions as a simple metronome, and the random object generates random integers. Most objects are non-graphical, consisting only of an object's name and several arguments/attributes (in essence, class properties) typed into an object box. Other objects are graphical, including sliders, number boxes, dials, table editors, pull-down menus, buttons, and other objects for running the program interactively. Max/MSP/Jitter comes with about 600 of these objects as the standard package; extensions to the program can be written by third-party developers as Max patchers (e.g., by encapsulating some of the functionality of a patcher into a sub-program that is itself a Max patch), or as objects written in C, C++, Java, or JavaScript.

The order of execution for messages traversing the graph of objects is defined by the visual organization of the objects in the patcher itself. As a result of this organizing principle, Max is unusual in that the program logic and the interface as presented to the user are typically related, though newer versions of Max provide several technologies for more standard GUI design.

Max documents (named patchers) can be bundled into stand-alone applications and distributed free or sold commercially. In addition, Max can be used to author audio and MIDI plugin software for Ableton Live through the Max for Live extension.

With the increased integration of laptop computers into live music performance (in electronic music and elsewhere), Max/MSP and Max/Jitter have received attention as a development environment available to those serious about laptop music and video performance.

V. SOUNDSTROLL

SoundStroll is a kind of 3D audio sequencer: a tool for placing sounds in an open virtual landscape (a very bare landscape) constructed with Jitter OpenGL, and for triggering and spatializing them around you as you take a stroll through this virtual sound world. SoundStroll exists as a MaxMSP project, which is to say a set of MaxMSP patches; as such it should be compatible with Windows and Mac OS X (it was made with Max 6.1.9 on OS X 10.6.8). It is intended to be free and open source (though one still needs a MaxMSP 6 licence to modify the sources) under the terms of the Creative Commons Attribution-NonCommercial licence (CC BY-NC).

VI. HISTORY OF SPEECH RECOGNITION SOFTWARE

Speech recognition software (or speech recognition technology) enables phones, computers, tablets, and other machines to receive, recognize and understand human utterances. It uses natural language as input to trigger an action, enabling our devices to respond to our spoken commands. The technology is being used to replace other, more "overused" methods of input like typing, texting, and clicking. This turns out to be slightly ironic, given that texting has become the norm over voice.

A. 1950s and 60s

In this day and age, speech recognition can be found in anything and everything, from cars with Bluetooth connections, to asking Google to search for "spaghetti", to processing speech over connections with Microsoft Skype, and much more. The ability to talk to your devices has expanded to encompass the vast majority of the technology that we use in our daily lives.

The first speech recognition systems were focused on numbers, not words. In 1952, Bell Laboratories designed the "Audrey" system, which could recognize a single voice speaking digits aloud. Ten years later, IBM introduced "Shoebox", which understood and responded to 16 words in English.

Across the globe, other nations developed hardware that could recognize sound and speech. And by the end of the '60s, the technology could support words with four vowels and nine consonants.

B. 1970s

Speech recognition made several meaningful advancements in this decade, mostly due to the US Department of Defense and DARPA. The Speech Understanding Research (SUR) program they ran was one of the largest of its kind in the history of speech recognition. Carnegie Mellon's "Harpy" speech system came from this program and was capable of understanding over 1,000 words, which is about the same as a three-year-old's vocabulary.

Also significant in the '70s was Bell Laboratories' introduction of a system that could interpret multiple voices.
C. 1980s

The '80s saw speech recognition vocabulary grow from a few hundred words to several thousand words. One of the breakthroughs came from a statistical method known as the Hidden Markov Model (HMM). Instead of just using words and looking for sound patterns, the HMM estimated the probability of the unknown sounds actually being words.

D. 1990s

Speech recognition was propelled forward in the '90s in large part because of the personal computer. Faster processors made it possible for software like Dragon Dictate to become more widely used.

BellSouth introduced the voice portal (VAL), a dial-in interactive voice recognition system. This system gave birth to the myriad phone tree systems that are still in existence today.

E. 2000s

By the year 2001, speech recognition technology had achieved close to 80 percent accuracy. For most of the decade there were not many advancements, until Google arrived with the launch of Google Voice Search. Because it was an app, this put speech recognition into the hands of millions of people. It was also significant because the processing power could be offloaded to Google's data centers. Not only that, Google was collecting data from billions of searches, which could help it predict what a person is actually saying. At the time, Google's English Voice Search system included 230 billion words from user searches.

F. 2010s

In 2011 Apple launched Siri, which was similar to Google's Voice Search. The early part of this decade saw an explosion of other voice recognition apps, and with Amazon's Alexa and Google Home we have seen consumers becoming more and more comfortable talking to machines.

Today, some of the largest tech companies are competing to claim the speech accuracy title. In 2016, IBM achieved a word error rate of 6.9 percent. In 2017, Microsoft beat IBM with a claimed 5.9 percent. Shortly after that, IBM improved its rate to 5.5 percent. However, it is Google that claims the lowest rate, at 4.9 percent.

VII. DIAGRAMS AND EXPLANATIONS

Fig. 1. Overall Description

The processing is illustrated by the flow diagram of FIG. 1, which shows that a host computer system processes a scene as specified by a scene description, thereby beginning execution of the scene description. The scene processing comprises loading a scene description into working memory of the computer. This processing is represented by the flow diagram box numbered 102. The scene description defines one or more objects located in the three-dimensional space of the rendered scene, as specified by the scene description. The system then monitors, or listens, for input from two sources at the host computer: user interface (UI) events from the user at the host computer (box 104) and digital music input received from a port of the host computer (box 106). The UI events may comprise events such as display mouse or computer keyboard activation by the user, graphics tablet input, and the like. The musical instrument digital interface input can comprise signals received from a digital musical instrument connected to a suitable port of the host computer or from a stored digital music file.

The processed UI events from the user (at box 104 of FIG. 1) can comprise a variety of user input at the host computer that relates to the scene description and modifies the rendered scene in accordance with the user input. Examples of UI events from the user include playback controls, by which the user can halt operation of the rendering and close the scene description. The user also can launch a scene description editor application, which provides a graphical user interface (GUI) through which the user can manipulate and change values in the scene description to be rendered, thereby affecting the scene that will be rendered. The user-editable scene description parameters are described in more detail below.

The digital music input (at box 106 of FIG. 1) may comprise, for example, input received over an interface that is compatible with the MMA interface, wherein MMA is the MIDI (Musical Instrument Digital Interface) Manufacturers Association protocol specification. Those skilled in the art will appreciate that a variety of musical instrument digital interfaces may be used, although the MIDI standard of the MMA is the most well-known and widely used for digital music representation. A wide variety of electronic musical instruments can be supported, including synthesizers that produce MIDI command streams for electronic piano, drum, guitar, and the like. Thus, the digital music input at box 106 can comprise a MIDI command stream that is delivered live (that is, in response to activation in real time) or delivered serially from a conventional MIDI file.
Those skilled in the art will appreciate that a MIDI command stream can produce sounds that are triggered from a MIDI-enabled sound engine that receives MIDI commands as control inputs and that can produce corresponding sounds and musical notes. Such sounds and musical notes can be stored as *.wav, *.aiff, or *.mp3 files, and the like. Other digitally encoded audio files can be used for input as well. Such audio files can easily be played through digital media players. Moreover, musical interfaces such as the MMA MIDI interface can interact with graphical interfaces in real time as a digital instrument is played.

For example, the illustrated embodiment utilizes graphics control through the DirectX interface, but OpenGL or any other graphics API could also be supported. Those skilled in the art will understand the integration details for such interaction, in view of the description herein. In the description herein, a MIDI input stream will be assumed for the digital music input, unless otherwise indicated. That is, references to "MIDI" input will be understood to include all varieties of digital music input described herein, unless otherwise indicated.

After the user UI events and MIDI port events are processed, the system updates the scene (box 108). Next, at box 109, the scene is rendered, meaning that the applicable video and audio output is generated. Lastly, if no halt instruction or the like is received at box 110, execution continues by returning to listening for, and processing, input from the user (box 104) and the musical instrument (box 106).
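The overall loop of FIG. 1 can be summarized in a short sketch. The Python below is purely structural and hypothetical: the class, the polling functions, and the render call are stand-ins for the host computer's actual facilities (in this project, MaxMSP patches), not an existing API.

# Structural sketch of the FIG. 1 processing loop (boxes 102-110).
# Every name here is a hypothetical stand-in, not a SoundStroll or MaxMSP API.

from dataclasses import dataclass, field

@dataclass
class SceneDescription:
    objects: list = field(default_factory=list)   # objects in 3D space
    halted: bool = False

    def apply_ui_event(self, event) -> None:      # box 104: user input
        if event == "quit":                       # e.g. a playback control
            self.halted = True

    def apply_midi_event(self, event) -> None:    # box 106: digital music input
        pass                                      # match triggers, run actions

    def update(self) -> None:                     # box 108: update the scene
        pass

def poll_ui_events():   return []                 # placeholders for real input
def poll_midi_events(): return []
def render(scene):      pass                      # box 109: video/audio output

def run_presentation(scene: SceneDescription) -> None:
    # box 102: the scene description has already been loaded into memory
    while not scene.halted:                       # box 110: loop until halted
        for e in poll_ui_events():
            scene.apply_ui_event(e)
        for e in poll_midi_events():
            scene.apply_midi_event(e)
        scene.update()
        render(scene)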
                                                                      actions that generate music note events, such as piano keys
Fig. 2.   User Input                                                  that are struck, guitar strings that are plucked, drum surfaces
                                                                      that are hit, and the like. The trigger events comprise MIDI
                                                                      commands such as the output from a synthesizer or other
                                                                      electronic music instrument.
                                                                           As noted above, a MIDI command stream can be played
                                                                      through a MIDI-enabled sound engine and can be stored
                                                                      as audio data in such common formats as Windows Media
                                                                      Player or Real Player or the like, including music files such
                                                                      as a *.WAV file or *.AIFF file or the like. Each trigger
                                                                      event is associated with process functions that are specified
                                                                      in the scene description. At box 306, the process functions are
                                                                      executed, thereby producing changes to the defined objects in
                                                                      the rendered scene. As noted previously, the scene description
                                                                      is updated per the digital music events and process functions,
                                                                      and the updated scene is rendered, while digital music input
                                                                      listening continues. This processing is indicated by box 310.
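To make the trigger-matching step concrete, the sketch below maps incoming note-on messages to process functions named in a toy scene description. The table layout and the actions are invented for illustration; in SoundStroll itself this kind of binding would live in MaxMSP patches rather than Python.

# Hypothetical sketch of matching MIDI input to scene-description triggers
# (boxes 304 and 306). The trigger table and actions are illustrative only.

NOTE_ON = 0x90

# A toy "scene description": (channel, note) pairs bound to process functions.
TRIGGERS = {
    (0, 36): lambda vel: print(f"kick drum -> pulse the floor object, strength {vel}"),
    (0, 60): lambda vel: print(f"middle C  -> spawn a sphere, size {vel / 127:.2f}"),
    (0, 64): lambda vel: print(f"E above C -> shift the sky colour, hue {vel * 2}"),
}

def handle_midi_message(msg: bytes) -> None:
    """Box 304: match a raw 3-byte message against the trigger table;
    box 306: execute the associated process function."""
    status, note, velocity = msg[0], msg[1], msg[2]
    if status & 0xF0 == NOTE_ON and velocity > 0:      # ignore note-offs
        action = TRIGGERS.get((status & 0x0F, note))
        if action is not None:
            action(velocity)                           # updates the scene objects

# Example: a kick drum hit and a middle C played on channel 1 (index 0).
handle_midi_message(bytes([0x90, 36, 110]))
handle_midi_message(bytes([0x90, 60, 90]))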
A variety of actions associated with the process functions may be carried out. For example, the actions may specify collisions between two or more objects of the scene description, and can include the explosion of one or more objects of the scene description, or other movements of the objects in the scene description. The actions can be specified by user input so as to permit changes in the speed, size, movement, color, and behavior of the scene objects.

Fig. 4. Exemplary Host Computer

FIG. 4 is a block diagram of an exemplary host computer 1200 that performs the processing described herein. The computer includes a processor 1202, such as a general-purpose computer chip and ancillary components, as provided in conventional personal computers, workstations, and the like that are generally available. Through the processor 1202, the computer executes program instructions to carry out the operations described herein. The processor communicates with other components of the computer over a system bus 1203 for data exchange and operations. The processor can operate with a sound card 1204 that processes digital music data, such as a digital music input data stream received from a digital music input device, including a music synthesizer and the like, and can produce audio (sound) output 1205.
The processor 1202 also responds to input devices 1206 that receive user input, including such input devices as a computer keyboard, a mouse, and other similar devices. The computer includes memory 1208, typically provided as volatile (dynamic) memory, for storing program instructions, operating data, and so forth. The datastore 1210 is typically non-volatile memory, such as data disks or disk arrays.

The computer can also include a program product reader 1212 that accepts externally accessible media 1214, such as flash drives, optical media discs, and the like. Such media 1214 can include program instructions, comprising program products, that can be read by the reader 1212 and executed by the processor 1202 to provide the operation described herein. The processor uses a graphics or video card 1216 to visually render the objects in a scene description according to the digital music input received through the sound card 1204. The visually rendered graphics output can be viewed at a display device 1218, such as visual display devices and the like. The sound output 1205 and the rendered graphics output 1218 together comprise the rendered scene output, providing a multimedia presentation.

VIII. THE FREQUENCY DOMAIN

A. The DFT, FFT, and IFFT

The most common tools used to perform Fourier analysis and synthesis are called the Fast Fourier Transform (FFT) and the Inverse Fast Fourier Transform (IFFT). The FFT and IFFT are optimized (very fast) computer-based algorithms that perform a generalized mathematical process called the Discrete Fourier Transform (DFT). The DFT is the actual mathematical transformation that the data go through when converted from one domain to another (time to frequency). Put more simply, the DFT is just a slower version of the FFT.
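For reference, the transform pair itself can be written down explicitly (these equations are added here for completeness; they are the textbook definition rather than anything specific to this project). For an N-sample frame x[n]:

    X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-2\pi i k n / N}, \qquad k = 0, 1, \ldots, N-1,

and the inverse transform reconstructs the samples from the bins:

    x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, e^{2\pi i k n / N}.

The FFT computes exactly these sums, but in O(N log N) operations instead of the O(N^2) required by evaluating the DFT directly, which is what "slower version" means above.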
FFTs, IFFTs, and DFTs became really important to a lot of disciplines when engineers figured out how to take samples quickly enough to generate enough data to re-create sound and other analog phenomena digitally. But they do not just work on sounds; they work on any continuous signal (images, radio waves, seismographic data, etc.).

An FFT of a time-domain signal takes the samples and gives us a new set of numbers representing the frequencies, amplitudes, and phases of the sine waves that make up the sound we have analyzed. It is these data that are displayed in sonograms.
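A minimal sketch of that analysis step, using NumPy as a stand-in tool (the paper's own processing happens inside MaxMSP): take one frame of samples, compute the FFT, and convert the complex bins into amplitude/phase pairs.

# Minimal FFT analysis of one frame, with NumPy standing in for the analysis tool.
import numpy as np

SR = 44_100                                  # sample rate (Hz)
N = 1_024                                    # frame size (samples)

t = np.arange(N) / SR
frame = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # a 440 Hz test tone

spectrum = np.fft.rfft(frame)                # N/2 + 1 complex bins
amplitude = np.abs(spectrum) / (N / 2)       # scale so a full-scale sine is ~1
phase = np.angle(spectrum)                   # radians

bin_width = SR / N                           # ~43 Hz between bin centres
peak = int(np.argmax(amplitude))
print(f"peak bin {peak} (~{peak * bin_width:.0f} Hz), "
      f"amplitude {amplitude[peak]:.2f}, phase {phase[peak]:.2f} rad")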
Fig. 5. First 16 Bins of an FFT Frame (Amplitudes)

Figure 5 shows the first 16 bins of a typical FFT analysis after the conversion is made from real and imaginary numbers to amplitude/phase pairs. The phases are left out, because it is hard to make up a bunch of arbitrary phases between 0 and 2π. In a lot of cases you might not need them (and in a lot of cases you would!). In this case, the sample rate is 44.1 kHz and the FFT size is 1,024, so the bin width (in frequency) is the Nyquist frequency (44,100/2 = 22,050) divided by the number of bins (1,024/2 = 512), or about 43 Hz.

Amplitude values are assumed to be between 0 and 1; notice that they are quite small, because they all must sum to 1, and there are a lot of bins!

The numbers are not from a real analysis; they are made up to represent a sound that has a simple, more or less harmonic structure, with a fundamental somewhere in the 66 Hz to 88 Hz range (you can see its harmonics at around 2, 3, 4, 5, and 6 times its frequency, and note that the harmonics decrease in amplitude more or less as they would in a sawtooth wave).
1) How the FFT Works: The Fast Fourier Transform in a Nutshell: Computing Fourier Coefficients. Here is a little three-step procedure for digital sound processing:

1) Window.
2) Periodicize.
3) Fourier transform (this also requires sampling, at a rate equal to twice the highest frequency required); you do this with the FFT.

The following is an illustration of steps 1 and 2. Here is the graph of a (periodic) function, f(t). (Note that f(t) need not be a periodic function.)

Fig. 6. Graph of a (periodic) function, f(t)

Look at the portion of the graph between 0 ≤ t ≤ 1. The following is a graph of the window function we need to use. The function is called w(t). Note that w(t) equals 1 only in the interval 0 ≤ t ≤ 1 and is 0 everywhere else.

Fig. 7. w(t)

In step 1, you window the function. In Figure 7 you plot both the window function, w(t) (which is nonzero in the region of interest), and the function f(t) in the same picture.

Fig. 8. f(t)*w(t)

In Figure 8 you plot f(t)*w(t), which is the original function multiplied by the windowing function. From this figure, it is obvious which part of f(t) is the area of interest.

In step 2, you periodically extend the windowed function, f(t)*w(t), all along the t-axis.

Fig. 9. f(t)*w(t)

Fig. 10. f(t)*w(t)

You now have a periodic function, and the Fourier theorem says that this function can be represented as a sum of sines and cosines. This is step 3. You can also use other, non-square windows; this is done to ameliorate the effect of the square window on the frequency content of the original signal.
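The three-step procedure above can also be shown in a short sketch (again using NumPy purely for illustration): multiply the frame by a window (the square window of Figure 7, or a smoother one), treat the windowed frame as one period of a periodic signal, and take the transform. The smoother window reduces the leakage that the square window introduces.

# Sketch of window -> periodicize -> transform with NumPy (illustrative only).
import numpy as np

SR, N = 44_100, 1_024
t = np.arange(N) / SR
f = np.sin(2 * np.pi * 300.0 * t)            # f(t): the signal being analysed

windows = {
    "square": np.ones(N),                    # w(t): 1 inside the frame, 0 outside
    "hann": np.hanning(N),                   # a smoother, non-square window
}

for name, w in windows.items():
    fw = f * w                               # step 1: window the function
    # step 2 is implicit: the DFT treats fw as one period of a periodic signal
    mag = np.abs(np.fft.rfft(fw))            # step 3: Fourier transform
    peak = int(np.argmax(mag))
    spill = mag[peak + 5] / mag[peak]        # energy well away from the peak
    print(f"{name:6s} window: peak bin {peak}, relative leakage {spill:.4f}")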
                                                                     B. The DFT, FFT, and IFFT
                                                                         Now, once you have a periodic function, all you need to do
                                                                     is figure out, using the FFT, what the component sine waves
                                                                     of that waveform are.
                                                                        It is possible to represent any periodic waveform as a
                                                                     sum of phase-shifted sine waves. In theory, the number of
                                                                     component sine waves is infinite—there is no limit to how
                                                                     many frequency components a sound might have. In practice,
                                                                     you need to limit it to some predetermined number. This limit
    In step 1, you window the function. In Figure 7 you plot         has a serious effect on the accuracy of our analysis.
both the window function, w(t) (which is nonzero in the region           Here’s how that works: rather than looking for the fre-
of interest) and function f(t) in the same picture.                  quency content of the sound at all possible frequencies (an
                                                                     infinitely large number - 100.000000001 Hz, 100.000000002
Fig. 8.   f(t)*w(t)                                                  Hz, 100.000000003 Hz, etc.), next, divide up the frequency
B. The DFT, FFT, and IFFT

    Now, once you have a periodic function, all you need to do is figure out, using the FFT, what the component sine waves of that waveform are.

    It is possible to represent any periodic waveform as a sum of phase-shifted sine waves. In theory, the number of component sine waves is infinite; there is no limit to how many frequency components a sound might have. In practice, you need to limit it to some predetermined number. This limit has a serious effect on the accuracy of our analysis.
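    As a quick illustration of that representation, the following sketch builds a periodic waveform from a handful of phase-shifted sine waves; the fundamental, harmonic numbers, amplitudes, and phases are arbitrary values chosen only for this example:

    import numpy as np

    sample_rate = 44100
    t = np.arange(sample_rate) / sample_rate
    fundamental = 110.0   # Hz; arbitrary choice

    # (harmonic number, amplitude, phase) triples; truncating this list is the
    # practical limit on the number of components discussed above.
    partials = [(1, 1.00, 0.0), (2, 0.50, np.pi / 4),
                (3, 0.25, np.pi / 2), (5, 0.10, np.pi)]

    waveform = sum(a * np.sin(2 * np.pi * fundamental * k * t + ph)
                   for k, a, ph in partials)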
    Here’s how that works: rather than looking for the frequency content of the sound at all possible frequencies (an infinitely large number: 100.000000001 Hz, 100.000000002 Hz, 100.000000003 Hz, etc.), you divide the frequency spectrum into a number of frequency bands and call them bins. The size of these bins is determined by the number of samples in our analysis frame (the chunk of time mentioned above). The number of bins is given by the formula:

    number of bins = frame size / 2

    1) Frame Size: For example, decide on a frame size of 1,024 samples. This is a common choice because most FFT algorithms in use for sound processing require a number of samples that is a power of two, and it’s important not to take too much or too little of the sound.

    A frame size of 1,024 samples gives us 512 frequency bands. Assuming a sample rate of 44.1 kHz, we know that we have a frequency range (remember the Nyquist theorem) of 0 kHz to 22.05 kHz. To find out how wide each of the frequency bins is, use the following formula:

    bin width = frequency range / number of bins

    This formula gives us a bin width of about 43 Hz. Remember that frequency perception is logarithmic, so 43 Hz gives us worse resolution at the low frequencies and better resolution at higher frequencies.
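    The numbers in the last two paragraphs are easy to verify directly; here is a small Python check using the same assumed frame size and sample rate:

    sample_rate = 44100              # Hz
    frame_size = 1024                # samples, a power of two

    num_bins = frame_size // 2       # 512 bins
    freq_range = sample_rate / 2     # Nyquist limit: 22050 Hz
    bin_width = freq_range / num_bins

    print(num_bins, round(bin_width, 2))   # 512 43.07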
    By selecting a certain frame size and its corresponding bandwidth, you avoid the problem of having to compute an infinite number of frequency components in a sound. Instead, you just compute one component for each frequency band.

Fig. 11.   Example of a commonly used FFT-based program: the phase vocoder menu from Tom Erbe’s SoundHack. Note that the user is allowed to select (among several other parameters) the number of bands in the analysis. This means that the user can customize what is called the time/frequency resolution trade-off of the FFT.
                                                                                 of the scene description and action of each matched trigger
                                                                                 event are executed in accordance with action processes of the
                                                                                 scene description, thereby updating the scene description with
                                                                                 respect to objects depicted in the scene on which the actions
                                                                                 are executed. The updated scene description is then rendered.
                                                                                 Thus, the software provides a patcher in MaxMSP that can
                                                                                 link to a musical instrument digital interface (e.g., MIDI) data
                                                                                 stream and through a sound and word processing application
                                                                                 and producing a scene based on that, that interacts as you
                                                                                 traverse it. In this way, this software can also take keywords
                                                                                 from vocal recognition and use FFT to be able to put it all
                                                                                 through a spectral filter, which you can edit to change the
                                                                                 sound of what you listen to as you traverse the world, which
                                                                                 will also place sounds in an open three-dimensional virtual
                                                                                 landscape and trigger them as you take a stroll through your
                                                                                 soundscape, and finally, it will allow you to edit the objects
                                                                                 in the soundscape based on the spatialisation tools, which can
                                                                                 be anything, including CAD.
    2) Software That Uses the FFT: There are many software
packages available that will do FFTs and IFFTs of your data                                            ACKNOWLEDGMENT
for you and then let you mess around with the frequency                            The author would like to thank Charlie Peck, Marc Ben-
content of a sound. The y-axis tells us the amplitude of                         amou, Forrest Tobey, Xunfei Jiang, and David Barbella.

Fig. 12. Another way to look at the frequency spectrum is to remove time as                                  R EFERENCES
an axis and just consider a sound as a histogram of frequencies. Think of this
as averaging the frequencies over a long time interval. This kind of picture      [1] Cycling ’74. Max/MSP History - Where did Max/MSP
(where there’s no time axis) is useful for looking at a short-term snapshot           come from? June 2009. URL: https://web.archive.org/
of a sound (often just one frame), or perhaps even for trying to examine the
spectral features of a sound that doesn’t change much over time (because all
                                                                                      web/20090609205550/http://www.cycling74.com/twiki/
we see are the ”averages”).                                                           bin/view/FAQs/MaxMSPHistory.
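    As a rough sketch of what such FFT/IFFT packages do internally (not SoundHack’s or SoundStroll’s actual code), the following numpy example analyzes one windowed frame, reads off the per-bin magnitudes that a display like Figure 12 would plot, applies a crude spectral filter, and resynthesizes the frame with the IFFT; the test signal and the 1 kHz cutoff are arbitrary assumptions:

    import numpy as np

    sample_rate = 44100
    frame_size = 1024
    t = np.arange(frame_size) / sample_rate
    frame = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 5000 * t)

    spectrum = np.fft.rfft(frame * np.hanning(frame_size))    # analysis (FFT)
    freqs = np.fft.rfftfreq(frame_size, d=1 / sample_rate)    # center frequency of each bin
    magnitudes = np.abs(spectrum)                              # per-bin amplitudes (the y-axis of Fig. 12)

    # A crude spectral filter: zero every bin above 1 kHz, then resynthesize (IFFT).
    spectrum[freqs > 1000.0] = 0.0
    filtered_frame = np.fft.irfft(spectrum, n=frame_size)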
                              IX. TIMELINE

    1) Sept 10 2018 - Oct 01 2018: Research and Editing
    2) Oct 01 2018 - Nov 08 2018: SDK/Implementation
    3) Nov 08 2018 - Jan 15 2019: Data Collection
    4) Jan 15 2019 - Feb 12 2019: Implementation/Testing
    5) Feb 12 2019 - Mar 12 2019: Debugging
    6) Mar 12 2019 - Apr 02 2019: Review/Final Debugging

                              X. CONCLUSION

    In accordance with embodiments of the software, a graphical scene representation is produced at a display of a host computer: a scene description is rendered and updated by a received digital music input, which can include keywords gathered from a specific song or from vocal recognition. The digital music input is matched to trigger events of the scene description, and the actions of each matched trigger event are executed in accordance with the action processes of the scene description, thereby updating the scene description with respect to the objects depicted in the scene on which the actions are executed. The updated scene description is then rendered. Thus, the software provides a patcher in MaxMSP that can link a musical instrument digital interface (e.g., MIDI) data stream, through a sound and word processing application, to a scene produced from that input, which reacts as you traverse it. The software can also take keywords from vocal recognition and use the FFT to pass the audio through a spectral filter, which you can edit to change the sound of what you listen to as you traverse the world. It places sounds in an open three-dimensional virtual landscape and triggers them as you take a stroll through your soundscape, and finally, it allows you to edit the objects in the soundscape with the spatialisation tools; these objects can be anything, including CAD models.

                             ACKNOWLEDGMENT

    The author would like to thank Charlie Peck, Marc Benamou, Forrest Tobey, Xunfei Jiang, and David Barbella.

                               REFERENCES

 [1]   Cycling ’74. Max/MSP History - Where did Max/MSP come from? June 2009. URL: https://web.archive.org/web/20090609205550/http://www.cycling74.com/twiki/bin/view/FAQs/MaxMSPHistory.
 [2]   Phil Burk. The Frequency Domain. May 2011. URL: http://sites.music.columbia.edu/cmc/MusicAndComputers/chapter3/03_04.php.
 [3]   David Cohn. Evolution of Computer-Aided Design. May 2014. URL: http://www.digitaleng.news/de/evolution-of-computer-aided-design/.
 [4]   3D Innovations. The History of Computer-Aided Design (CAD). Nov. 2014. URL: https://3d-innovations.com/blog/the-history-of-computer-aided-design-cad/.
 [5]   IRCAM. A brief history of MAX. June 2009. URL: https://web.archive.org/web/20090603230029/http://freesoftware.ircam.fr/article.php3?id_article=5.
 [6]   Peter Kirn. A conversation with David Zicarelli and Gerhard Behles. June 2017. URL: http://cdm.link/2017/06/conversation-david-zicarelli-gerhard-behles/.
 [7]   Future Music. 30 years of MIDI: a brief history. Dec. 2012. URL: http://www.musicradar.com/news/tech/30-years-of-midi-a-brief-history-568009.
 [8]   Tim Place. A modular standard for structuring patches in Max. URL: http://jamoma.org/publications/attachments/jamoma-icmc2006.pdf.
 [9]   Miller Puckette. Synthetic Rehearsal: Training the Synthetic Performer. URL: https://quod.lib.umich.edu/cgi/p/pod/dod-idx/synthetic-rehearsal-training-the-synthetic-performer.pdf?c=icmc;idno=bbp2372.1985.043;format=pdf.
[10]   Miller Puckette. The Patcher. URL: http://msp.ucsd.edu/Publications/icmc88.pdf.
[11]   Mike Sheffield. Max/MSP for average music junkies. Jan. 2018. URL: http://www.hopesandfears.com/hopes/culture/music/168579-max-msp-primer.
[12]   Harvey W. Starr and Timothy M. Doyle. Patent US20090015583 - Digital music input rendering for graphical presentations. Jan. 2009. URL: https://www.google.com/patents/US20090015583.
[13]   Naomi van der Velde. Speech Recognition Software: Past, Present & Future. Sept. 2017. URL: https://www.globalme.net/blog/speech-recognition-software-history-future.
   [2] [3] [1] [5] [4] [6] [7] [8] [10] [9] [11] [12] [13]