DIGITAL MUSIC INPUT RENDERING FOR GRAPHICAL PRESENTATIONS IN SOUNDSTROLL(MAXMSP)

Page created by Chris Gray
 
CONTINUE READING
DIGITAL MUSIC INPUT RENDERING FOR GRAPHICAL PRESENTATIONS IN SOUNDSTROLL(MAXMSP)
Digital Music Input Rendering for Graphical
            Presentations in SoundStroll(MaxMSP)

                                                             Justin Kerobo
                                               School of Computer Science and Music
                                                          Earlham College
                                                     Richmond, Indiana 47374
                                                  Email: jakerobo15@earlham.edu

    Abstract—A graphical presentation is produced at a display        and manipulation of virtual worlds comprised of three dimen-
of a host computer such that a scene description is rendered and      sional objects in a 3D space with applications in many fields,
updated by a received digital music input. The digital music input    from microscopic imaging to galactic modeling and notably
is matched to trigger events of the scene description and actions     computer graphics for films and gaming environments. There
of each matched trigger event are executed in accordance with         have been a few attempts to associate the direct performance
action processes of the scene description, thereby updating the
                                                                      of music with computer video and graphics to create new art
scene description with respect to objects depicted in the scene on
Which the actions are executed. The updated scene description         forms. One program, Bliss Paint for the Macintosh, used MIDI
is then rendered. The system provides a means for connecting a        input to change colors on an evolving rendering of a fractal
graphics API to a musical instrument digital interface (e.g., MIDI)   image. Another program, ArKaos, uses MIDI commands to
data stream, possibly from Ableton, Reason, to MaxMSP and             play video clips in a DJ-like process other program, MaxMSP,
producing a graphical presentation in SoundStroll in MaxMSP.          uses MIDI commands in a flexible environment to drive video
                                                                      clips, audio clips, and drive external events.
   Keywords—MIDI, LATEX, OpenGL, Graphics, MaxMSP, Music
Technology, Ableton, FFT, IFFT, Computer Music.                           There are many computer programs that control sound in
                                                                      various ways in response to a MIDI command stream. The
                                                                      “3DMIDI” program appears to be un-supported and it is not
                      I.   I NTRODUCTION                              clear if the software works or ever worked. The available
                                                                      documentation describes set of separate programs that each
    A variety of computer software programs are available             performs a prescribed set of transformations to an embedded
for defining and manipulating objects in a virtual three-             set of objects in response to MIDI.
dimensional (3D) world. For example, “3DSMax” from Au-
todesk, Inc. and SolidWorks are available. They provide an as-            Each different performance is loaded and executed sep-
sortment of tools in a convenient graphical user interface (GUI)      arately, and has its own unique tool set to make specific
for manipulation and editing of 3D virtual objects. Programs          adjustments to the objects in that scene. There is an API
for computer display screensavers also permit manipulation            shown that invites others to develop their own performances,
of moving images. Also having great popularity are computer           each with their own unique sets of objects and tools which
software programs for manipulation of video clips, multimedia         cannot be edited at that point. Unfortunately, there is no
clips, and the like. Such programs include Max, Aperture,             convenient user interface available for interacting computer
and ArKaos. Another popular medium that supports creativity           graphics with musical instrument digital data. Conventional
with computers are the various computer software applications         methods generally require cumbersome specification of input
that involve the musical instrument digital interface (MIDI)          sources, scene description parameters and data objects, and
standard.                                                             linking of input sources and scene description objects. As a
                                                                      result, a relatively high level of computer skills is necessary
    The MIDI standard permits connection of musical instru-           for creating graphical presentations in conjunction with music
ments with digital output to related digital sound processing         input. It would be improving creative output if users could
devices, including computers with sound cards and sound               create scenes with objects and change both the objects and the
editing applications, soundboards, broadcast equipment, and           nature of the interaction between the video graphics and MIDI
the like. Music has become commonly performed with instru-            music data.
ments that send digital MIDI data since the introduction of
MIDI in approximately 1985. MIDI provides a flexible set of                Because of these difficulties and increased complexity,
instructions that are sent via a serial data link from a controller   there is need for a graphical user interface that supports inte-
to a receiver that processes those commands in a variety of           gration with digital musical instruments, vocal recognition, and
ways that pertain to the output functions of the receiving            it is possible to do this through and with a Fourier Transform
device. The data and instructions involve most commonly that          (FT). It is possible and likely to be able to create this interface
of sounds and music, but can also involve instructions for            as a three-dimensional audio sequencer and spatializer in
machine control and lighting control devices.                         MaxMSP, also using a speech processing application and using
                                                                      the Fast Fourier Transform (FFT), the Inverse Fast Fourier
   A separate branch of technology is the development of              Transform (IFFT), and the Discrete Fourier Transform (DFT)
computer video graphics, the digital electronic representation        analyses to filter in keywords to find objects and create a scene
that you can traverse.                                              In order to control a SysEx function, a manufacturer-specific
                                                                    ID code is sent. Equipment which isn’t set up to recognize
                   II.   H ISTORY OF MIDI                           that particular code will ignore the rest of the message, while
                                                                    devices that do recognize it will continue to listen.
    MIDI (Musical Instrument Digital Interface) protocol has
become the dominant method of connecting pieces of elec-               ”The MIDI protocol allows for control over more than just
tronic musical equipment. And when you consider the previous        when a note should be played.”
standard you have to say that MIDI arrived at just the right            SysEx messages are usually used for tasks such as loading
time.                                                               custom patches and are typically recorded into a sequencer
    The control voltage (CV) and gate trigger system used           using a ’SysEx Dump’ feature on the equipment.
on early analogue synths was severely limited in its scope              MIDI information was originally sent over a screened
and flexibility. Analogue synths tended to have very few            twisted pair cable (two signal wires plus an earthed shield
features that could be controlled remotely, relying as they         to protect them from interference) terminated with 5-pin DIN
did on physical knobs and sliders, patch cables and manual          plugs. However, this format has been superseded to some
programming.                                                        extent by USB connections, as we’ll discuss later. No waves
                                                                    or varying voltages are transmitted since MIDI data is sent
    Furthermore, there was no universal standard for the way
                                                                    digitally, meaning that the signal pins either carry a voltage or
CV control should work, complicating the process when in-
                                                                    none at all, corresponding to the binary logical values 1 and
terfacing between products from different manufacturers. The
                                                                    0.
majority of vintage CV-controlled synths can now be adapted
with a CV-to-MIDI converter, so you can use MIDI to control             These binary digits (bits) are combined into 8-bit messages.
them.                                                               The protocol supports data rates of up to 31,250 bits per
                                                                    second. Each MIDI connection sends information in one
     Dave Smith, founder of Californian synth legend Sequen-        direction only, meaning two cables are needed if a device is
tial Circuits and now head of Dave Smith Instruments, antici-       used both to send and receive data (unless you’re working over
pated the demand for a more powerful universal protocol and         USB that is).
developed the first version of the MIDI standard, which was
released in 1983. With the increasing complexity of synths, and         In addition to the expected IN and OUT connections, most
as the music industry shifted towards digital technology and        MIDI devices also have a THRU port. This simply repeats the
computer-based studios, the MIDI setup took off and became          signal received at the IN port so it can be sent on to other
the standard for connecting equipment.                              devices further down the chain. Devices may be connected
                                                                    in series and, for the largest MIDI setups, an interface with
A. How It Works                                                     multiple output ports may be used to control more than 16
                                                                    separate chained devices.
    Absolutely no sound is sent via MIDI, just digital signals
known as event messages, which instruct pieces of equipment.        B. Becoming a Standard
The most basic example of this can be illustrated by consider-          The key feature of MIDI when it was launched was its
ing a controller keyboard and a sound module. When you push         efficiency: it allowed a relatively significant amount of infor-
a key on the keyboard, the controller sends an event message        mation to be transmitted using only a small amount of data.
which corresponds to that pitch and tells the sound module          Given the limitations of early ’80s digital data transmission
to start playing the note. When you let go of the key, the          methods, this was essential to ensure that the reproduction of
controller sends a message to stop playing the note.                musical timing was sufficiently accurate.
    Of course, the MIDI protocol allows for control over more           Manufacturers quickly adopted MIDI and its popularity
than just when a note should be played. Essentially, a message      was cemented by the arrival of MIDI-compatible computer
is sent each time some variable changes, whether it be note-        hardware (most notably the built-in MIDI ports of the Atari ST,
on/off (including, of course, exactly which note it is), velocity   which was released in 1985). As weaknesses or potential extra
(determined by how hard you hit the key), after-touch (how          features were identified, the MIDI Manufacturers Association
hard the key is held down), pitch-bend, pan, modulation,            updated the standard regularly following its first publication.
volume or any other MIDI-controllable function.
                                                                        The most notable updates - Roland MT-32 (1987), General
    The protocol supports a total of 128 notes (from C five         MIDI (1991) and GM2 (1999), Roland GS (1991) and Yamaha
octaves below middle C through to G ten octaves higher), 16         XG (1997-99) - added further features or standards, generally
channels (so that 16 separate devices can be controlled per         without making previous ones obsolete. It’s questionable just
signal chain, or multiple devices assigned the same channel         how relevant the majority of these standards are to digital
so they respond to the same input) and 128 programs (corre-         musicians and producers, since most of them relate in large
sponding to patches or voice/ effect setting changes). MIDI         part to standardizing the playback of music distributed in MIDI
signals also include built-in clock pulses, which define the        format. Unless you intend to distribute your music as MIDI
tempo of the track and allow basic timing synchronization           files, most of them probably won’t affect you.
between equipment.
                                                                    C. Right On Time
    The other major piece of the jigsaw is the SysEx (System
Exclusive) message, designed so that manufacturers could uti-           The most common criticisms of the MIDI protocol relate
lize MIDI to control features specific to their own equipment.      to timing issues. Although MIDI was efficient by the standards
of the early ’80s, it is still undeniably flawed to some extent.    with placing greater onus on manufacturers and software de-
There is some degree of jitter (variation in timing) present        velopers to come up with their own powerful proprietary DAW-
in MIDI, resulting in discernible sloppiness in recording and       based control systems operating via existing USB, FireWire or
playback.                                                           even over Ethernet connections or wirelessly.
    Perhaps even more obvious to most of us is latency, the
                                                                                      III.   H ISTORY OF CAD
delay between triggering a function (such as a sound) via MIDI
and the function being carried out (in this case the sound being        Modern engineering design and drafting can be traced back
reproduced). The more information sent via MIDI, the more           to the development of descriptive geometry in the 16th and
latency is created. It may only be in the order of milliseconds,    17th centuries. Drafting methods improved with the intro-
but it’s enough to become noticeable to the listener.               duction of drafting machines, but the creation of engineering
                                                                    drawings changed very little until after World War II.
    Even more problematic is the fact that most of us use MIDI
in a computer-based studio and each link in the MIDI and                During the war, considerable work was done in the devel-
audio chain could potentially add to the latency. This could        opment of real-time computing, particularly at MIT, and by
either be due to software (drivers, DAWs, soft synths) or           the 1950s there were dozens of people working on numerical
hardware (RAM, hard drives, processors) but the end result          control of machine tools and automating engineering design.
is sloppy timing. The blame cannot be laid entirely at the          But it’s the work of two people in particular—Patrick Hanratty
door of MIDI, but the weaknesses of multiple pieces of MIDI         and Ivan Sutherland—who are largely credited with setting the
equipment combined with all the other sources of timing error       stage for what we know today as CAD or Computer Aided
can have a significant detrimental effect on the end result.        Design.
    Most new MIDI equipment is supplied not only with               A. The Fathers of CAD
traditional 5-pin DIN connections but with standard Type A
or B USB ports that allow direct connection to your computer.           Hanratty is widely credited as “the Father of CADD/CAM.”
However, USB is not the solution to all your MIDI timing            In 1957, while working at GE, he developed PRONTO (Pro-
problems. Despite the higher data transfer rates possible over      gram for Numerical Tooling Operations), the first commer-
USB, latency is actually higher than over a standard DIN-           cial CNC programming system. Five years later, Sutherland
based MIDI connection. Furthermore, jitter is significantly         presented his Ph.D. thesis at MIT titled “Sketchpad, A Man-
higher when using MIDI over USB, leading to unpredictable           Machine Graphical Communication System.” Among its fea-
inaccuracies in timing.                                             tures, the first graphical user interface, using a light pen to
                                                                    manipulate objects displayed on a CRT.
D. Beyond MIDI                                                          The 1960s brought other developments, including the first
                                                                    digitizer (from Auto-trol) and DAC-1, the first production
    It’s clear that while MIDI has been massively important         interactive graphics manufacturing system. By the end of
to the development of music technology over the last 25             the decade, a number of companies were founded to com-
years, it does come with a few major weaknesses. One                mercialize their fledgling CAD programs, including SDRC,
heavily researched alternative, the Zeta Instrument Processor       Evans and Sutherland, Applicon, Computervision, and M and
Interface protocol proposed in the mid-’90s, failed to gain         S Computing.
support from manufacturers and never saw commercial release.
However, the same development team helped to develop the                By the 1970s, research had moved from 2D to 3D. Major
OpenSound Control (OSC) protocol used by the likes of               milestones included the work of Ken Versprille, whose inven-
Native Instruments’ Reaktor and Traktor and the Max/MSP             tion of NURBS for his Ph.D. thesis formed the basis of modern
and SuperCollider development environments.                         3D curve and surface modeling, and the development by Alan
                                                                    Grayer, Charles Lang, and Ian Braid of the PADL (Part and
    This is a much higher bandwidth system which overcomes          Assembly Description Language) solid modeler.
many of the timing issues of MIDI, most notably by trans-
mitting information with built-in timing messages as quickly            With the emergence of UNIX workstations in the early
as possible through high-bandwidth connections rather than          ’80s, commercial CAD systems like CATIA and others began
relying on the real-time, event messages used by MIDI devices,      showing up in aerospace, automotive, and other industries. But
which just assume that timing is correct and respond to each        it was the introduction of the first IBM PC in 1981 that set
message as soon as it’s received.                                   the stage for the large-scale adoption of CAD. The following
                                                                    year, a group of programmers formed Autodesk, and in 1983
    One significant barrier to the development of a universal       released AutoCAD, the first significant CAD program for the
protocol for contemporary music equipment is that there is so       IBM PC.
much variation between equipment. With so many different
synthesis methods, programming systems, levels of user con-         B. The CAD Revolution
trol and forms of sound manipulation available on different
pieces of gear, it’s unlikely that any universal system for their       AutoCAD marked a huge milestone in the evolution of
control is possible.                                                CAD. Its developers set out to deliver 80 percent of the
                                                                    functionality of the other CAD programs of the day, for 20
    However, as computer processing and interfacing technolo-       percent of their cost. From then on, increasingly advanced
gies have developed so rapidly since the early ’80s, perhaps        drafting and engineering functionality became more affordable.
the solution lies not with updating or replacing MIDI but rather    But it was still largely 2D.
That changed in 1987 with the release of Pro/ENGINEER,         extensions, adding the ability to do real-time synthesis using
a CAD program based on solid geometry and feature-based            an internal hardware digital signal processor (DSP) board. The
parametric techniques for defining parts and assemblies. It        same year, IRCAM licensed the software to Opcode Systems.
ran on UNIX workstations—PCs of the time were simply not
powerful enough—but it was a game changer. The later years            2) 1990s:: Opcode launched a commercial version named
of the decade saw the release of several 3D modeling kernels,      Max in 1990, developed and extended by David Zicarelli. How-
most notably ACIS and Parasolids, which would form the basis       ever, by 1997, Opcode was considering cancelling it. Instead,
for other history-based parametric CAD programs.                   Zicarelli acquired the publishing rights and founded a new
                                                                   company, Cycling ’74, to continue commercial development.
                                                                   The timing was fortunate, as Opcode was acquired by Gibson
C. CAD Today, CAD Tomorrow                                         Guitar in 1998 and ended operations in 1999.
   The modern CAD era has been marked by improvements                  IRCAM’s in-house Max development was also winding
in modeling, incorporation of analysis, and management of          down; the last version produced there was jMax, a direct de-
the products we create, from conception and engineering            scendant of Max/FTS developed in 1998 for Silicon Graphics
to manufacturing, sales, and maintenance (what has become          (SGI) and later for Linux systems. It used Java for its graphical
known as PLM, product lifecycle management).                       interface and C for its real-time backend and was eventually
    “Engineers and designers are being asked to create more,       released as open-source software. Meanwhile, Puckette had
faster, and with higher quality,” says Bill McClure, vice          independently released a fully redesigned open-source com-
president of product development at Siemens PLM. With all          position tool named Pure Data (Pd) in 1996, which, despite
of this pressure on engineers and designers, what do you see       some underlying engineering differences from the IRCAM
as the next big evolution in CAD?                                  versions, continued in the same tradition. Cycling ’74’s first
                                                                   Max release, in 1997, was derived partly from Puckette’s work
               IV.   M AX MSP AND I TS U SES                       on Pure Data. Called Max/MSP (”Max Signal Processing”,
                                                                   or the initials Miller Smith Puckette), it remains the most
A. Introduction                                                    notable of Max’s many extensions and incarnations: it made
                                                                   Max capable of manipulating real-time digital audio signals
    Max, also known as Max/MSP/Jitter, is a visual pro-
                                                                   without dedicated DSP hardware. This meant that composers
gramming language for music and multimedia developed and
                                                                   could now create their own complex synthesizers and effects
maintained by San Francisco-based software company Cycling
                                                                   processors using only a general-purpose computer like the
’74. Over its more than thirty-year history, it has been used by
                                                                   Macintosh PowerBook G3.
composers, performers, software designers, researchers, and
artists to create recordings, performances, and installations.        In 1999, the Netochka Nezvanova collective released
The Max program is modular, with most routines existing as         nato.0+55, a suite of externals that added extensive real-time
shared libraries. An application programming interface (API)       video control to Max.
allows third-party development of new routines (named exter-
nal objects). Thus, Max has a large user base of programmers           3) 2000s:: Though nato became increasingly popular
unaffiliated with Cycling ’74 who enhance the software with        among multimedia artists, its development stopped in 2001.
commercial and non-commercial extensions to the program.           SoftVNS, another set of extensions for visual processing in
Because of this ingenious extensible design, which simultane-      Max, was released in 2002 by Canadian media artist David
ously represents both the program’s structure and its graphical    Rokeby. Cycling ’74 released their own set of video extensions,
user interface (GUI), Max has been described as the lingua         Jitter, alongside Max 4 in 2003, adding real-time video,
franca for developing interactive music performance software.      OpenGL graphics, and matrix processing capabilities. Max 4
                                                                   was also the first version to run on Windows. Max 5, released
                                                                   in 2008, redesigned the patching GUI for the first time in
B. History
                                                                   Max’s commercial history.
    1) 1980s:: Miller Puckette began work on Max in 1985, at
the Institut de Recherche et Coordination Acoustique/Musique           4) 2010s:: In 2011, Max 6 added a new audio engine
(IRCAM) in Paris. Originally called The Patcher, this first        compatible with 64-bit operating systems, integration with
version provided composers with a graphical interface for          Ableton Live sequencer software, and an extension called
creating interactive computer music scores on the Macintosh.       Gen, which can compile optimized Max patches for higher
At this point in its development Max couldn’t perform its own      performance. Max 7, the most recent major version, was
real-time sound synthesis in software, but instead sent control    released in 2014 and focused on 3D rendering improvements.
messages to external hardware synthesizers and samplers using      On June 6, 2017, Ableton announced its purchase of Cycling
MIDI or a similar protocol. Its earliest widely recognized use     ’74, with Max continuing to be published by Cycling ’74 and
in composition was for Pluton, a 1988 piano and computer           David Zicarelli remaining with the company Programs sharing
piece by Philippe Manoury; the software synchronized a             Max’s visual programming concepts are now commonly used
computer to a piano and controlled a Sogitec 4X for audio          for real-time audio and video synthesis and processing.
processing.
                                                                   C. Language
   In 1989, IRCAM developed Max/FTS (”Faster Than
Sound”), a version of Max ported to the IRCAM Signal                   Max is named after composer Max Mathews, and can
Processing Workstation (ISPW) for the NeXT. Also known as          be considered a descendant of his MUSIC language, though
”Audio Max”, it would prove a forerunner to Max’s MSP audio        its graphical nature disguises that fact. Like most MUSIC-N
languages, Max distinguishes between two levels of time: that                            V.   S OUND S TROLL
of an event scheduler, and that of the DSP (this corresponds to
the distinction between k-rate and a-rate processes in Csound,          SoundStroll is a kind of 3D audio sequencer. It is a
and control rate vs. audio rate in SuperCollider).                  tool for placing sounds in a virtual landscape (a very bare
                                                                    landscape) constructed with jitter OpenGL, and triggering them
    The basic language of Max and its sibling programs is           and spatializing them around you as you take a stroll in this
that of a data-flow system: Max programs (named patches) are        virtual sound world. SoundStroll is a tool for placing sounds
made by arranging and connecting building-blocks of objects         in an open 3D virtual landscape and triggering them as you
within a patcher, or visual canvas. These objects act as self-      take a stroll through your soundscape. SoundStroll exists as a
contained programs (in reality, they are dynamically linked         MaxMSP project, which is to say a set of MaxMSP patches;
libraries), each of which may receive input (through one or         as such it should be compatible with Windows and Mac OSX
more visual inlets), generate output (through visual outlets),      (made with Max 6.1.9 on OSX 10.6.8). It is intended to be
or both. Objects pass messages from their outlets to the inlets     free and open source (though one still needs to possess a
of connected objects.                                               MaxMSP 6 licence to modify the sources) under the terms
                                                                    of CreativeCommons Attribution-Non Commercial licence (cc
     Max supports six basic atomic data types that can be           by nc).
transmitted as messages from object to object: int, float, list,
symbol, bang, and signal (for MSP audio connections). Several          VI.   H ISTORY OF S PEECH R ECOGNITION S OFTWARE
more complex data structures exist within the program for
handling numeric arrays (table data), hash tables (coll data),          Speech recognition software (or speech recognition tech-
XML information (pattr data), and JSON-based dictionaries           nology) enables phones, computers, tablets, and other ma-
(dict data). An MSP data structure (buffer ) can hold digital       chines to receive, recognize and understand human utterances.
audio information within program memory. In addition, the           It uses natural language as input to trigger an action; enabling
Jitter package adds a scalable, multi-dimensional data structure    our devices to also respond to our spoken commands. The
for handling large sets of numbers for storing video and other      technology is being used to replace other, more ‘overused’
datasets (matrix data).                                             methods of input like typing, texting, and clicking. This turns
                                                                    out to be slightly ironic because of the fact that texting has
    Max is typically learned through acquiring a vocabulary of      become the norm, over voice.
objects and how they function within a patcher; for example,
the metro object functions as a simple metronome, and the           A. 1950s and 60s
random object generates random integers. Most objects are
non-graphical, consisting only of an object’s name and several          In this day and age, speech recognition can be found in any-
arguments-attributes (in essence class properties) typed into       thing and everything, from cars with Bluetooth connections,
an object box. Other objects are graphical, including sliders,      to asking Google to search up ’spaghetti’, to just processing
number boxes, dials, table editors, pull-down menus, but-           speech over connections with Microsoft Skype, and many more
tons, and other objects for running the program interactively.      things. The ability to talk to your devices have expanded to
Max/MSP/Jitter comes with about 600 of these objects as the         encompass the vast majority of technology that we use in our
standard package; extensions to the program can be written by       daily lives.
third-party developers as Max patchers (e.g. by encapsulating
some of the functionality of a patcher into a sub-program that         The first speech recognition systems were focused on
is itself a Max patch), or as objects written in C, C++, Java,      numbers, not words. In 1952, Bell Laboratories designed the
or JavaScript.                                                      “Audrey” system which could recognize a single voice speak-
                                                                    ing digits aloud. Ten years later, IBM introduced “Shoebox”
    The order of execution for messages traversing through          which understood and responded to 16 words in English.
the graph of objects is defined by the visual organization of           Across the globe other nations developed hardware that
the objects in the patcher itself. As a result of this organizing   could recognize sound and speech. And by the end of the ‘60s,
principle, Max is unusual in that the program logic and the         the technology could support words with four vowels and nine
interface as presented to the user are typically related, though    consonants.
newer versions of Max provide several technologies for more
standard GUI design.
                                                                    B. 1970s
    Max documents (named patchers) can be bundled into                  Speech recognition made several meaningful advancements
stand-alone applications and distributed free or sold commer-       in this decade. This was mostly due to the US Department
cially. In addition, Max can be used to author audio and MIDI       of Defense and DARPA. The Speech Understanding Research
plugin software for Ableton Live through the Max for Live           (SUR) program they ran was one of the largest of its kind in
extension.                                                          the history of speech recognition. Carnegie Mellon’s “Harpy’
                                                                    speech system came from this program and was capable of
    With the increased integration of laptop computers into         understanding over 1,000 words which is about the same as a
live music performance (in electronic music and elsewhere),         three-year-old’s vocabulary.
Max/MSP and Max/Jitter have received attention as a devel-
opment environment available to those serious about laptop             Also significant in the ‘70s was Bell Laboratories’ intro-
music/video performance.                                            duction of a system that could interpret multiple voices.
C. 1980s                                                          Fig. 1.   Overall Description
   The ‘80s saw speech recognition vocabulary go from a
few hundred words to several thousand words. One of the
breakthroughs came from a statistical method known as the
“Hidden Markov Model (HMM)”. Instead of just using words
and looking for sound patterns, the HMM estimated the
probability of the unknown sounds actually being words.

D. 1990s
    Speech recognition was propelled forward in the 90s in
large part because of the personal computer. Faster processors
made it possible for software like Dragon Dictate to become
more widely used.
    Bell South introduced the voice portal (VAL) which was a
dial-in interactive voice recognition system. This system gave
birth to the myriad of phone tree systems that are still in
existence today.

E. 2000s
    By the year 2001, speech recognition technology had
achieved close to 80 percent accuracy. For most of the decade     a port of the host computer (box 106). The UI events may
there was not a lot of advancements until Google arrived with     comprise events such as display mouse or computer keyboard
the launch of Google Voice Search. Because it was an app, this    activation by the user, or graphics table input, and the like.
put speech recognition into the hands of millions of people.      The musical instrument digital interface input can comprise
It was also significant because the processing power could        signals received from a digital musical instrument connected
be offloaded to its data centers. Not only that, Google was       to a suitable port of the host computer or from a stored digital
collecting data from billions of searches which could help it     music file.
predict what a person is actually saying. At the time Google’s
English Voice Search System included 230 billion words from           The processed UI events from the user (at box 104 of FIG.
user searches.                                                    1) can comprise a variety of user input at the host computer
                                                                  that relate to the scene description and modify the rendered
F. 2010s                                                          scene in accordance with the user input. Examples of UI
                                                                  events from the user include playback controls, by which the
   In 2011 Apple launched Siri which was similar to Google’s      user can halt operation of the rendering and close the scene
Voice Search. The early part of this decade saw an explosion      description. The user also can launch a scene description editor
of other voice recognition apps. And with Amazon’s Alexa,         application, which provides a graphical user interface (GUI)
Google Home we have seen consumers becoming more and              through which the user can manipulate and change values in
more comfortable talking to machines.                             the scene description to be rendered, thereby effecting the
                                                                  scene that will rendered. The user-editable scene description
    Today, some of the largest tech companies are competing
                                                                  parameters are described in more detail below.
to herald the speech accuracy title. In 2016, IBM achieved a
word error rate of 6.9 percent. In 2017 Microsoft beat out IBM        The digital music input (at box 106 of FIG. 1) may
with a 5.9 percent claim. Shortly after that IBM improved their   comprise, for example, input received over an interface that
rate to 5.5 percent. However, it is Google that is claiming the   is compatible with the MMA interface, wherein MMA is the
lowest rate at 4.9 percent.                                       MIDI (Musical Instrument Digital Interface) Manufacturer’s
                                                                  Association protocol specification. Those skilled in the art will
           VII.   D IAGRAMS AND E XPLANATIONS                     appreciate that a variety of musical instrument digital interfaces
                                                                  may be used, although the MIDI standard of the MMA is
   The processing is illustrated by the flow diagram of FIG. 1,   the most well-known and widely used for digital music rep-
which shows that a host computer system processes a scene as      resentation. A wide variety of electronic musical instruments
specified by a scene description, thereby beginning execution     can be supported, including synthesizers that produce MIDI
of the scene description. The scene processing comprises          command streams for electronic piano, drum, guitar, and the
loading a scene description into working memory of the            like. Thus, the digital music input at box 106 can comprise
computer. This processing is represented by the flow diagram      a MIDI command stream that is delivered live (that is, in
box numbered 102. The scene description defines one or more       response to activation in real time) or delivered serially from
objects located in three-dimensional space of the rendered        a conventional MIDI file.
scene, as specified by the scene description. The system then
monitors, or listens, for input from two sources at the host        Those skilled in the art will appreciate that a MIDI com-
computer: user interface (UI) events from the user at the host    mand stream can produce sounds that are triggered from a
computer (box 104) and digital music input received from          MIDI-enabled sound engine that receives MIDI commands
as control inputs and that can produce corresponding sounds           Fig. 3.   Digital Music Input
and musical notes. Such sounds and musical notes can be
stored as *.wav, *.aiff, *.mp3 files, and the like. Other digitally
encoded audio files can be used for input, as well. Such
audio files can be easily played through digital media players.
Moreover, musical interfaces such as the MMA MIDI interface
can interact with graphical interfaces in real time as a digital
instrument is played.
    For example, the illustrated embodiment utilizes graphics
control through the DirectX interface, but OpenGL or any other
graphics API could also be supported. Those skilled in the
art will understand the integration details for such interaction,
in view of the description herein. In the description herein, a
MIDI input stream will be assumed for the digital music input,
unless otherwise indicated. That is, references to “MIDI” input
will be understood to include all varieties of digital music input
described herein, unless otherwise indicated.
    After the user UI events and MIDI port events are pro-
cessed, the system updates the scene (box 108). Next, at box
109, the scene is rendered, meaning that applicable video and
audio output is generated. Lastly, if no halt instruction or the      host computer receives the MIDI input at a sound card through
like is received at box 110 execution is continued by returning       a MIDI port. The system then matches the digital music input
to listening for, and processing, input from the user (box 104)       to trigger events of the scene description, as indicated by
and the musical instrument (box 106).                                 box 304. The trigger events correspond to musical instrument
                                                                      actions that generate music note events, such as piano keys
Fig. 2.   User Input                                                  that are struck, guitar strings that are plucked, drum surfaces
                                                                      that are hit, and the like. The trigger events comprise MIDI
                                                                      commands such as the output from a synthesizer or other
                                                                      electronic music instrument.
                                                                           As noted above, a MIDI command stream can be played
                                                                      through a MIDI-enabled sound engine and can be stored
                                                                      as audio data in such common formats as Windows Media
                                                                      Player or Real Player or the like, including music files such
                                                                      as a *.WAV file or *.AIFF file or the like. Each trigger
                                                                      event is associated with process functions that are specified
                                                                      in the scene description. At box 306, the process functions are
                                                                      executed, thereby producing changes to the defined objects in
                                                                      the rendered scene. As noted previously, the scene description
                                                                      is updated per the digital music events and process functions,
                                                                      and the updated scene is rendered, while digital music input
                                                                      listening continues. This processing is indicated by box 310.
                                                                          A variety of actions associated with the process functions
                                                                      may be carried out. For example, the actions may specify col-
    FIG. 2 shows additional details of processing the user            lisions between two or more objects of the scene description,
input. First, at box 202, conventional computer operating             and can include explosion of one or more objects of the scene
system listening is performed, to await received input from           description, or other movements of the objects in the scene
the user interface. When user input is received that changes the      description. The actions can be specified by user input so as to
scene description, such as object manipulation commands or            permit changes in speed, size, movement, color, and behavior
changes in MIDI processing parameters, the scene description          of the scene objects.
in memory that is being executed (rendered) is changed in
accordance with that input.                                               FIG. 4 is a block diagram of an exemplary host com-
                                                                      puter 1200 that performs the processing described herein.
    This is represented in FIG. 2 by the box 204. The system
                                                                      The computer includes a processor 1202, such as a general
continues to listen for additional user events, as indicated at
                                                                      purpose computer chip and ancillary components, as provided
box 206. As noted above, execution of the scene description
                                                                      in conventional personal computers, workstations, and the
and listening for further user input continues, as indicated by
                                                                      like that are generally available. Through the processor 1202,
the return from box 210 to box 202, until execution is halted
                                                                      the computer executes program instructions to carry out the
by a user input.
                                                                      operations described herein. The processor communicates with
    FIG. 3 shows additional details of processing the musical         other components of the computer over a system bus 1203 for
instrument digital interface (MIDI) input. First, at box 302, the     data exchange and operations. The processor can operate with
Fig. 4.   Exemplary Host Computer                                  Fig. 5.   First 16 Bins of an FFT Frame(Amplitudes)

                                                                   and the Inverse Fast Fourier Transform (IFFT). The FFT and
                                                                   IFFT are optimized (very fast) computer-based algorithms that
                                                                   perform a generalized mathematical process called the Discrete
                                                                   Fourier Transform (DFT). The DFT is the actual mathematical
                                                                   transformation that the data go through when converted from
                                                                   one domain to another (time to frequency). In a more basic
a sound card 1204 that processes digital music data, such as a     explanation, the DFT is just a slow/slower version of the FFT.
digital music input data stream received from a digital music
input device including a music synthesizer and the like, and           FFTs, IFFTs, and DFTs became really important to a lot
can produce audio (sound) output 1205.                             of disciplines when engineers figured out how to take samples
                                                                   quickly enough to generate enough data to re-create sound and
    The processor 1202 also responds to input devices 1206         other analog phenomena digitally. But, they do not just work
that receive user input, including such input devices as a         on sounds; they work on any continuous signal (images, radio
computer keyboard, mouse, and other similar devices. The           waves, seismographic data, etc.).
computer includes memory 1208, typically provided as volatile
(dynamic) memory for storing program instructions, operating           An FFT of a time domain signal takes the samples and
data, and so forth. The datastore 1210 is typically non-volatile   gives us a new set of numbers representing the frequencies,
memory, such as data disks or disk arrays.                         amplitudes, and phases of the sine waves that make up the
                                                                   sound we’ve analyzed. It is these data that are displayed in
    The computer can also include a program product reader         the sonograms.
1212 that receives externally accessible media 1214 such as
flash drives and optical media discs, and the like. Such media         Chart 5 shows the first 16 bins of a typical FFT analysis
1214 can include program instructions, comprising program          after the conversion is made from real and imaginary numbers
products, that can be read by the reader 1212 and executed by      to amplitude/phase pairs. The phases are left out because it is
the processor 1202 to provide the operation described herein.      hard to make up a bunch of arbitrary phases between 0 and
The processor uses a graphics or video card 1216 to visually       2. In a lot of cases, you might not need them (and in a lot of
render the objects in a scene description according to the         cases, you would!). In this case, the sample rate is 44.1 kHz
digital music input received through the sound card 1204. The      and the FFT size is 1,024, so the bin width (in frequency) is
visually rendered graphics output can be viewed at a display       the Nyquist frequency (44,100/2 = 22,050) divided by the FFT
device 1218, such as visual display devices and the like. The      size, or about 22 Hz.
sound output 1205 and rendered graphics output 1218 together           Amplitude values are assumed to be between 0 and 1;
comprise the rendered scene output, providing a multimedia         notice that they are quite small because they all must sum
presentation.                                                      to 1, and there are a lot of bins!.
                                                                       The numbers are not real, but notice that they are made
                VIII.   T HE F REQUENCY D OMAIN                    them up to represent a sound that has a simple, more or less
A. The DFT, FFT, and IFFT                                          harmonic structure with a fundamental somewhere in the 66 Hz
                                                                   to 88 Hz range (you can see its harmonics at around 2, 3, 4, 5,
   The most common tools used to perform Fourier analysis          and 6 times its frequency, and note that the harmonics decrease
and synthesis are called the Fast Fourier Transform (FFT)          in amplitude more or less like they would in a sawtooth wave).
1) How the FFT Works: The Fast Fourier Transform in a            Fig. 9.    f(t)*w(t)
Nutshell: Computing Fourier Coefficients Here’s a little three-
step procedure for digital sound processing.
    1)      Window
    2)      Periodicize
    3)      Fourier transform (this also requires sampling, at a
            rate equal to 2 times the highest frequency required).
            you do this with FFT. Following is an illustration of
            steps 1 and 2.
Here’s the graph of a (periodic) function, f(t). (Note that f(t)
need not be a periodic function.)
                                                                     Fig. 10.    f(t)*w(t)
Fig. 6.   Graph of a (periodic) function, f(t)

                                                                         You have a periodic function, and the Fourier theorem says
    Look at the portion of the graph between 0 ≤ t ≤ 1.              it can represent this function as a sum of sines and cosines.
Following is a graph of the window function we need to use.          This is step 3. You can also use other, non-square windows.
The function is called w(t). Note that w(t) equals 1 only in the     This is done to ameliorate the effect of the square windows
interval 0 ≤ t ≤ 1 and it’s 0 everywhere else.                       on the frequency content of the original signal.

Fig. 7.   w(t)
                                                                     B. The DFT, FFT, and IFFT
                                                                         Now, once you have a periodic function, all you need to do
                                                                     is figure out, using the FFT, what the component sine waves
                                                                     of that waveform are.
                                                                        It is possible to represent any periodic waveform as a
                                                                     sum of phase-shifted sine waves. In theory, the number of
                                                                     component sine waves is infinite—there is no limit to how
                                                                     many frequency components a sound might have. In practice,
                                                                     you need to limit it to some predetermined number. This limit
    In step 1, you window the function. In Figure 7 you plot         has a serious effect on the accuracy of our analysis.
both the window function, w(t) (which is nonzero in the region           Here’s how that works: rather than looking for the fre-
of interest) and function f(t) in the same picture.                  quency content of the sound at all possible frequencies (an
                                                                     infinitely large number - 100.000000001 Hz, 100.000000002
Fig. 8.   f(t)*w(t)                                                  Hz, 100.000000003 Hz, etc.), next, divide up the frequency
                                                                     spectrum into a number of frequency bands and call them bins.
                                                                     The size of these bins is determined by the number of samples
                                                                     in our analysis frame (the chunk of time mentioned above).
                                                                     The number of bins is given by the formula:
                                                                         number of bins = frame size/2
                                                                        1) Frame Size: Per example, decide on a frame size of
                                                                     1,024 samples. This is a common choice because most FFT
                                                                     algorithms in use for sound processing require a number of
    In Figure 8 you plot f(t)*w(t), which is the periodic            samples that is a power of two, and it’s important not to get
function multiplied by the windowing function. From this             too much or too little of the sound.
figure, it’s obvious what part of f(t) is the area of interest.
                                                                        A frame size of 1,024 samples gives us 512 frequency
   In step 2, you need to periodically extend the windowed           bands. Assume that we’re using a sample rate of 44.1 kHz, we
function, f(t)*w(t), all along the t-axis.                           know that we have a frequency range (remember the Nyquist
theorem) of 0 kHz to 22.05 kHz. To find out how wide each                                                IX.    T IMELINE
of the frequency bins is, use the following formula:
                                                                                    1)    Sept 10 2018 - Oct 01 2018 : Research and Editing
    bin width = frequency/number of bins                                            2)    Oct 01 2018 - Nov 08 2018 : SDK/Implementation
    This formula gives us a bin width of about 43 Hz. Remem-                        3)    Nov 08 2018 - Jan 15 2019 : Data Collection
ber that frequency perception is logarithmic, so 43 Hz gives us                     4)    Jan 15 2019 - Feb 12 2019 : Implementation/Testing
worse resolution at the low frequencies and better resolution                       5)    Feb 12 2019 - Mar 12 2019 : Debugging
at higher frequencies.                                                              6)    Mar 12 2019 - Apr 02 2019 : Review/Final Debug-
                                                                                          ging
    By selecting a certain frame size and its corresponding
bandwidth, you avoid the problem of having to compute an                                                X.     C ONCLUSION
infinite number of frequency components in a sound. Instead,
you just compute one component for each frequency band.                              In accordance with embodiments of the software, a graph-
                                                                                 ical scene representation is produced at a display of a host
Fig. 11. Example of a commonly used FFT-based program: the phase vocoder         computer such that a scene description is rendered and updated
menu from Tom Erbe’s SoundHack. Note that the user is allowed to select          by a received digital music input, which can be keywords
(among several other parameters) the number of bands in the analysis. This       that are gathered from a specific song or vocal recognition
means that the user can customize what is called the time/frequency resolution   wherein the digital music input is matched to trigger events
trade-off of the FFT.
                                                                                 of the scene description and action of each matched trigger
                                                                                 event are executed in accordance with action processes of the
                                                                                 scene description, thereby updating the scene description with
                                                                                 respect to objects depicted in the scene on which the actions
                                                                                 are executed. The updated scene description is then rendered.
                                                                                 Thus, the software provides a patcher in MaxMSP that can
                                                                                 link to a musical instrument digital interface (e.g., MIDI) data
                                                                                 stream and through a sound and word processing application
                                                                                 and producing a scene based on that, that interacts as you
                                                                                 traverse it. In this way, this software can also take keywords
                                                                                 from vocal recognition and use FFT to be able to put it all
                                                                                 through a spectral filter, which you can edit to change the
                                                                                 sound of what you listen to as you traverse the world, which
                                                                                 will also place sounds in an open three-dimensional virtual
                                                                                 landscape and trigger them as you take a stroll through your
                                                                                 soundscape, and finally, it will allow you to edit the objects
                                                                                 in the soundscape based on the spatialisation tools, which can
                                                                                 be anything, including CAD.
    2) Software That Uses the FFT: There are many software
packages available that will do FFTs and IFFTs of your data                                            ACKNOWLEDGMENT
for you and then let you mess around with the frequency                            The author would like to thank Charlie Peck, Marc Ben-
content of a sound. The y-axis tells us the amplitude of                         amou, Forrest Tobey, Xunfei Jiang, and David Barbella.

Fig. 12. Another way to look at the frequency spectrum is to remove time as                                  R EFERENCES
an axis and just consider a sound as a histogram of frequencies. Think of this
as averaging the frequencies over a long time interval. This kind of picture      [1] Cycling ’74. Max/MSP History - Where did Max/MSP
(where there’s no time axis) is useful for looking at a short-term snapshot           come from? June 2009. URL: https://web.archive.org/
of a sound (often just one frame), or perhaps even for trying to examine the
spectral features of a sound that doesn’t change much over time (because all
                                                                                      web/20090609205550/http://www.cycling74.com/twiki/
we see are the ”averages”).                                                           bin/view/FAQs/MaxMSPHistory.
                                                                                  [2] Phil Burk. The Frequency Domain. May 2011. URL:
                                                                                      http : / / sites . music . columbia . edu / cmc /
                                                                                      MusicAndComputers/chapter3/03 04.php.
                                                                                  [3] David Cohn. Evolution of Computer-Aided Design. May
                                                                                      2014. URL: http://www.digitaleng.news/de/evolution-
                                                                                      of-computer-aided-design/.
                                                                                  [4] 3D Innovations. The History of Computer-Aided Design
                                                                                      (CAD). Nov. 2014. URL: https://3d- innovations.com/
                                                                                      blog/the-history-of-computer-aided-design-cad/.
                                                                                  [5] IRCAM. A brief history of MAX. June 2009. URL:
each component frequency. Looking at just one frame of an                             https : / / web . archive . org / web / 20090603230029 / http :
FFT, you usually assume a periodic, unchanging signal. A                              //freesoftware.ircam.fr/article.php3?id article=5.
histogram is generally most useful for investigating the steady-                  [6] Peter Kirn. A conversation with David Zicarelli and
state portion of a sound. (Figure 12 is a screen shot from                            Gerhard Behles. June 2017. URL: http://cdm.link/2017/
SoundHack.)                                                                           06/conversation-david-zicarelli-gerhard-behles/.
[7]   Future Music. 30 years of MIDI: a brief history. Dec.
       2012. URL: http://www.musicradar.com/news/tech/30-
       years-of-midi-a-brief-history-568009.
 [8]   Tim Place. A modular standard for structuring patches
       in Max. URL: http : / / jamoma . org / publications /
       attachments/jamoma-icmc2006.pdf.
 [9]   Miller Puckette. Synthetic Rehersal: Training the Syn-
       thetic Performer. URL: https://quod.lib.umich.edu/cgi/p/
       pod/dod-idx/synthetic-rehearsal-training-the-synthetic-
       performer.pdf?c=icmc;idno=bbp2372.1985.043;format=
       pdf.
[10]   Miller Puckette. The Patcher. URL: http://msp.ucsd.edu/
       Publications/icmc88.pdf.
[11]   Mike Sheffield. Max/MSP for average music junkies.
       Jan. 2018. URL: http://www.hopesandfears.com/hopes/
       culture/music/168579-max-msp-primer.
[12]   Harvey W Starr and Timonthy M Doyle. Patent
       US20090015583 - Digital music input rendering for
       graphical presentations. Jan. 2009. URL: https://www.
       google.com/patents/US20090015583.
[13]   Naomi van der Velde. Speech Recognition Software:
       Past, Present & Future. Sept. 2017. URL: https://www.
       globalme.net/blog/speech-recognition-software-history-
       future.
   [2] [3] [1] [5] [4] [6] [7] [8] [10] [9] [11] [12] [13]
You can also read