VOICE QUALITIES IN AUDIO SUBTITLES: OPPORTUNITIES AND CHALLENGES IN VOICE DESIGN FOR ACCESSIBILITY AND BEYOND - DIVA PORTAL

DEGREE PROJECT IN MEDIA TECHNOLOGY,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Voice Qualities in Audio Subtitles:
Opportunities and Challenges in
Voice Design for accessibility and
beyond

ANNE-CHARLOT SCHOLZ

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract
This paper explores novel experiential qualities of the voice in Audio Subtitles through research through design. Audio subtitles is an accessibility service for users who have trouble comprehending subtitles in audiovisual content, and it has recently been developed for video on demand platforms such as SVT Play. In order to explore possibilities in its voice design, short video clips of films and TV series with different types of audio subtitles were produced, presented to and discussed with a small number of potential users of audio subtitles, including people with dyslexia, cognitive difficulties and autism. Results indicated that applied voices that did not support the user’s expectations, low and high pitches, and low-quality speech synthesis made for uncomfortable experiences, which could prove useful for provoking reflection and challenging norms. The paper also discusses how voice design for this service has the potential to match the filmmakers’ intentions by translating more than semantic information, as well as how audio subtitles could potentially be produced by professional sound designers and filmmakers instead of video on demand services. Finally, challenges such as misgendering and insensitive choices of voice in voice design for audio subtitles are considered, underscoring how ethics cannot be avoided when working with the voice modality.
Summary (Sammanfattning)
This thesis explores new qualities of the voice in audio subtitles through research through design, a method in which knowledge is created through the design process and reactions to designs. Audio subtitles is an accessibility service for users who have trouble reading and following subtitles in audiovisual content, and it has recently been developed for video on demand platforms such as SVT Play. To explore possibilities in its voice design, short video clips of films and TV series with different types of audio subtitles were produced. They were presented to and discussed with a small number of potential users of the service, among them people with dyslexia, cognitive difficulties and autism. The results indicated that voices that did not support the user’s expectations, low and high pitches, and low-quality speech synthesis gave uncomfortable experiences, which may prove useful for provoking reflection and challenging norms. The thesis also discusses how voice design for audio subtitles has the potential to match the filmmakers’ intentions by translating more than semantic information, and how audio subtitles could be produced by professional sound designers and filmmakers instead of video on demand services. Finally, challenges such as misgendering and insensitive choices of voice in voice design for audio subtitles are considered, which underscores how ethics cannot be avoided when working with the voice modality.
Voice Qualities in Audio Subtitles for Film and TV: Opportunities and Challenges in Voice Design for accessibility and beyond

Anne-Charlot Scholz
KTH Royal Institute of Technology
Stockholm, Sweden
acscholz@kth.se

INTRODUCTION
The subject of voice design has in recent years become increasingly relevant and discussed in the research field of Human-Computer Interaction (HCI). This is in part due to the development of voice assistants and other forms of voice-user-interfaces, which has led to an interest in examining the voice as a design material and exploring its many qualities [8, 50, 39, 28]. In this paper, voice qualities are defined as all conceivable factors that contribute to the perception and interpretation of the voice, such as perceived age, gender, dialect and expression, which makes for an infinite number of qualities. The voice is an especially important factor in the context of accessibility services such as Audio Subtitles (AST), which is intended to ensure access to audiovisual content for users who have difficulties reading and comprehending the subtitles [26]. With the continuous development of an online media environment and video on demand streaming as well as newly introduced legislation regarding accessibility, the demand for audio subtitles in Sweden has been shown to be growing [1, 36, 44]. Improved quality of speech synthesis has simultaneously opened up possibilities to implement text-to-speech AST in a cost-effective manner [51], which has encouraged video on demand platforms to build such a tool for their service. A team at the video on demand service of Swedish Television (SVT Play) [2] has recently built a text-to-speech AST service and offered this paper an inside look into the choices made for the feature. Here they implemented a so-called voice-over effect in audio subtitles, meaning that a voice speaks over the original performers seen on screen, which could make for an especially interesting voice-user-interaction to explore [7, 51].

The aural dimension of films - i.e. the sound and voice quality - has been said to be immensely relevant for the user experience, interpretation and emotional reception of films [43, 20]. The voice is arguably the most prominent factor in audio subtitles and is itself an intimate and intricate modality for humans, making it all the more relevant to explore [28]. Some efforts have previously been made to examine the user experience of voice qualities in speech synthesis, but there is space for more research in the context of voice design in audio subtitles [26, 16, 14].

This paper therefore aims to explore novel qualities of the voice modality in audio subtitles for fictional film and TV. A qualitative research through design methodology was employed, gaining insights on the user experience of the voice by designing prototypes and examining reactions to them [53]. For this, 28 final prototypes of AST with different applied voice qualities were designed, created and presented to a small group of potential users of the service during individual interviews. Participants included people with dyslexia, cognitive difficulties and autism, but it is believed that results will be relevant for all users in need of AST. Results were analyzed and discussed in order to identify possible implications for voice design in audio subtitles and beyond, which could prove to be beneficial for AST as well as HCI developers, users and researchers.

RELATED WORK
This section presents previous research on audio subtitles, voice qualities and sociophonetics, wherein authors discussed concerns within voice design.

Voice qualities in Film and HCI
The objective of audio subtitles has been said to be to fully immerse the viewer in the story told on screen and provide access to enjoyable high-quality entertainment through a single cohesive experience [15, 52]. The researchers stated that this should be achieved by ensuring that the service does not disrupt or minimize the emotional journey the source text of the film has to offer. Many factors are to be considered to achieve a high-quality user experience, but in the case of AST one of the most important factors for the experience and perceived quality is said to be the aural medium and, more specifically, the voice [43, 20]. Previous research has highlighted the relevance of sound and voice quality in relation to engagement with the content and emotional reception of a film, which is why aural and prosodic properties of the voice such as speech style, intensity, delivery, intonation and accents should be examined more closely [20, 52, 43, 28]. These properties have also been shown to be crucial for the user’s quality assessment and interpretation of emotion and intent of the source text [28, 15]. Other characteristics of the voice and how they are perceived, such as gender and age, have also been suggested to play a role in the immersion in the content.

Vocal expression studies have shown the complicated nature in which a voice is perceived and analyzed by humans [28], making the development of speech synthesis a challenging ordeal. Many countries have instead resorted to dubbing international films and TV series, allowing national performers to perform vocally to the facial performance of the original actor in the film [10]. In audio subtitles, however, the common practice is to instead employ speech synthesis and the voice-over effect [7, 51]. The user experience of aural and voice qualities in synthesized speech has previously been examined, but in the context of AST and its voice design, there is more left to explore [26, 16, 14].

What follows are some categories of voice qualities that have previously been examined and are of interest in relation to audio subtitles and this thesis.

Gender of the voice
Studies that have examined the user experience of synthetic voices have shown that a synthetic female voice was preferred over a masculine one in that it was perceived to be more natural and to require less effort from the user to follow [16]. In a study by Remael [45] the actors performing in the film also provided their voices for the audio subtitles. This meant that the user heard several different voices in AST of the same gender (and in this case, exactly the same voice) as the actor portraying the character, which was well received.

Emotional Speech Synthesis
This also allowed the actors to infuse the audio subtitles with emotion and a similar performance as seen in the film [45]. Emotional text-to-speech AST is not commonly used, but it has been shown to be preferred over expressionless synthetic speech [47]. A mood-congruent vocal quality in films was preferred because it fit into the emotional landscape better than mood-incongruent qualities, allowing for greater immersion in the content [15].

Rendering
In Remael’s study [45], the rendering was additionally implemented to emulate how the original audio could and was intended to be heard, such as through a telephone or a radio. Results suggested that the rendering was an important factor for the narrative cohesion of the product.

Dialects
Pucher and colleagues [42] examined text-to-speech voices with regional dialects, which users perceived as fun and personal, but not appropriate for formal settings. It was further claimed that dialect and sociolect speech synthesis will evoke users’ emotions because of their association of the dialect with a specific social group.

Whispering
In a recent research through design study, Parviainen and colleagues [39] examined interaction with voice assistants through whispering. The authors criticized the fact that voice modalities were usually discarded in these interactions, although they are vital for human-to-human interaction. Eliminating them would ignore humans’ understanding and interpretation of the vocalised voice and its emotional impact. If implementations of audio subtitles similarly discard those factors - as they usually focus on the interpretation of semantic information - intended emotional information and experience has been said to potentially get lost to the viewer [15]. Similarly, other voice qualities contain information that could be consequential for the perception of the film and alter the quality of the user experience.

Sociophonetics in voice user interfaces
The topic of voice as a design material has previously been researched and discussed in HCI in relation to voice assistants, since it has been concluded that voices have the potential to shape user perceptions and experiences and should be well considered [8, 50]. Previous studies have usually focused on aspects of intelligibility and naturalness of the voice as well as the technological development of voice and speech interactions with voice-user-interfaces [50]. Research on sociophonetics - which is the study of social factors that influence production and perception of speech - has found that speech could indicate many factors such as gender, age, social class, nationality and sexuality, with one project even creating synthetic voices that were perceived to be introverted and extroverted [35, 50]. Sociophonetics has previously been identified as often omitted from research concerning voice design for smart devices, in spite of the potential social impact the choice of voice could have for voice-user-interfaces [8, 50]. Voice-user-interface designers usually employ “one voice fits all” for their devices and decide which voice qualities are deemed standard or what a neutral, non-dialect and appropriate voice in the given national context is, which is grounded in the designer’s existing speech biases. Designing the voice for voice-user-interfaces thus comes with responsibility and social consequences.

It has been argued that choosing a standard voice reinforces speech ideologies, stereotypes and prejudice against certain ways of speaking, as well as an ignorance of the wide variety and diversity of voices and dialects that exist in each language [50, 8]. Users have previously shown an apparent preference for
voices in voice-user-interfaces that speak with their own native accent, a phenomenon the authors have called the similarity-attraction effect. This, in addition to the finding that different generations of users speak and interpret language differently, is one of the arguments for the concept of individualization that Sutton and colleagues [50] propose could be employed in voice design, allowing the user to choose a voice of their preference for voice-user-interfaces. To address the issue of filter bubbles apparent in voice design today, the authors simultaneously propose reverse individualisation, where the user is instead faced with voice qualities unfamiliar to them in order to normalize diversity in voice and challenge what is considered a “standard” voice for voice-user-interfaces. Context awareness is another concept proposed for voice design, in that multiple voices can be heard depending on the use and context of the voice-user-interface [50].

Overall, researchers call for more expression in voice-user-interfaces to embody the social and cultural identity in speech, and for more of all speech variations in the world to be available for those interfaces [50, 8]. This research also showed that the user experience is going to be influenced by each user’s individual sociocultural knowledge and experience of voices [50].

METHOD
This section gives a brief explanation of the methodology used in the paper, research through design, and describes how the study was conducted in the following steps: (i) the design process, i.e. how and which voice and video material was chosen for the design artefacts, (ii) how the artefacts were produced and (iii) the recruitment and data collection process in the form of interviews.

Research through design is a research methodology wherein knowledge is garnered through the process of designing and exploring reactions to designs [53]. It is described as creating artifacts that provide concrete embodiments of theory and technical opportunities [53, 17]. It is further stated that the intent should be to produce knowledge for the research and practice communities. Research through design has previously been agreed to be about research on the future [54], akin to design futuring, a term which has been used to describe approaches for exploring futures with design, often in an effort to change the present [33]. Design allows the researchers to envision the futures in more experiential detail through presenting and putting artifacts or pieces of fiction up for discussion.

This paper aims to follow the principles of research through design and design futuring by designing prototypes of voices for audio subtitles and engaging with the user group’s reactions. The paper will further analyze the resulting experiential qualities of the voice and identify implications for voice design in HCI. Audio subtitles is first and foremost an accessibility service, and since the designer is not part of the target group, emphasis will be put on data and knowledge created by the participants instead of the design process of the prototypes.

Choosing voice qualities
Previous research on phonetic qualities of synthetic voices was studied to gain inspiration for what should be a preferred state for AST on video on demand. Possible voice and sound qualities were consequently laid out and compared to the current state of text-to-speech audio subtitles. The topic was additionally reviewed in informal discussion with people working with the voice in different ways, such as a professor in voice acoustics, an opera singer and developers of AST. In hindsight the chosen qualities were identified to fit into three categories, namely assumptions about the speaker, expression of the speaker and context of the speaker.

Assumptions about the speaker
Here the sociophonetics of the voice such as dialects, pitch, age, sex and possible speech impediments were considered. Inspiration was drawn from the synthetic voices kindly made available by Acapela Group, a company that develops text-to-speech software and services [21]. The voices included an Indian English dialect, Swedish regional dialects, Swedish children’s voices and an elderly English voice. The team at SVT Play sourced the voices used for AST from Acapela Group as well, which was further reason to include the attributes of dialects and age that the company had available.

Expression of the speaker
This category grouped the voice qualities that expressed emotion and to an extent circumstance, such as whispering and singing. In choosing voice qualities for audio subtitles, the topic was approached by first reviewing previous research on voice modalities, Audio Description and AST. Something that was often examined was the experience of an emotional tone and expression in Audio Description as opposed to a neutral one, especially with text-to-speech [47, 15]. This led to an interest in designing some prototypes with emotional AST, which were intended to go with the overall emotional vocal performance by the original performer in the clip. There was a limited number of options available for emotional synthetic speech, namely a sad and a happy male American English voice, both of which were chosen to be included. After further thought on how emotion can be expressed, the attributes of shouting, whispering and singing were considered, which are often ways to express circumstance and feeling. Whispering could among other things indicate intimacy or fear. Shouting could suggest that the speaker is happy, angry, agitated or distressed. Singing could also be an expression of happiness, a performative act or even be used in a therapeutic manner (as seen in “The King’s Speech” (2010) [25]).

Context of the speaker
This category represented the way the voice would be heard differently depending on the context seen on screen, for instance if the voice was heard in a large hall, a small room or through a telephone. It was reflected upon how voices could be heard in movies and TV, where audio technicians put careful thought into how the voice should be perceived in a certain setting and scene. This called for an inclusion of a particular rendering and distortion of voices when they were for example heard through a telephone or a radio. Additional thought was given to the proximity attribute of the voice, which could say a lot about the setting of a scene. Is the speaker far away or near, left or right from the audience and what are the acoustics of the setting? Designs with different reverb sizes and filter
effects were thus included to emulate different settings, and voices were also played from different mono outputs, i.e. sounds coming from only one side of the speakers.

Omitted voice qualities
An effort was made to include a variety of voice qualities that were relevant in the context of AST in film and TV. Some ideas that came up were not included in the end, due to the topic being broad enough as it was or due to a lack of resources. These ideas included opera song in AST, nasal voices signaling a cold or sadness and speech impediments. The latter topic in particular would have been interesting to examine, challenging the norms in voice design. Unfortunately, no appropriate voice performer was available to produce AST with a speech impediment.

Choosing video material
A lot of the video material was chosen from prior personal experience and by mapping out what kind of scenes would fit with certain voice qualities, although this requirement was later discarded. TV and film as mediums of scripted fiction were deemed to be similar enough to both be included in the study. Documentaries, operas or news programs were not included, as they were assumed to have a different relation to the use and design of audio subtitles.

Several international TV series and movies were reviewed in order to find appropriate scenes for the voice modalities. The final prototypes consisted of video material sourced from several different streaming services when a certain TV series or film was in mind. “The King’s Speech” (2010) [25] was for example identified as suitable potential source material, since it had a lot of situations useful to the voice qualities that had been laid out. This included several scenes with a voice being heard over a radio, with an echo or in a cathedral and even speak-singing, which was otherwise thought hard to find. Singing scenes could oftentimes be argued to not need translation in movies and TV, since the important take-away from those scenes could be said to be the atmosphere, melody and vocal performance instead of the text. In “The King’s Speech” however, the singing is part of the therapy the protagonist undergoes to overcome his stammer. In an arguably pivotal scene for the relationship between two characters the protagonist has trouble speaking about his childhood and instead sings the words out, which would be important to translate and include in audio subtitles.

Supporting VS Contradicting Expectations
Part of the design process was deciding whether the prototypes should support expectations of the vocal performance of people seen on screen or if they instead should be provocations, meant to contradict and challenge expectations. For a while voice qualities were matched to appropriate scenes and thus supported expectations, but soon it was deemed interesting and engaging to discuss provocations that might not be accepted by the target audience and why that is, a concept akin to what Sutton and colleagues discussed in their study [50]. This approach was assumed to result in more fodder for discussions and reflections. Prototypes thus began to include scenes where a child’s voice was used for adult speakers, singing AST was used for non-singing scenes, whispering AST was likewise used for non-whispering scenes, the proximity of the voice did not match the screen and dialects, gender and age did not correspond to the original speaker.

Design conditions and tools
Available free online text-to-speech resources were reviewed to find suitable material and software for prototyping. Acapela Group kindly granted access to their online Editor Acapela Cloud with a wide range of synthetic voices - including Swedish ones - which was thus used for text-to-speech material [21]. Qualities that were not available, such as singing and to an extent whispering AST, were instead created with a natural human voice from an amateur female voice actor. For the rendering of the voice, the audio editing software Audacity and the available audio effects in iMovie were used, for instance to emulate a radio station.

Video material was chosen from video on demand services SVT Play [2], ComhemPlay [3] and Netflix [13]. Short sections of the video material were chosen and the translated subtitles were transcribed. Audio files of the synthetic and human voice reading the subtitles aloud were created. SVT Play had four synthetic voices available and provided a short film ("Index" (2020) [32]) with separate audio files for the AST. All video and audio files were combined in the video editing software iMovie in a way to mimic audio subtitles.

Final Prototypes
A total of 28 final prototypes were created, each about a minute long. They are referred to as V1-V28 (see Table 1).

Voice qualities
The varying qualities implemented were as follows: Gender (male, female); Age (old, middle-aged, child); Dialect (Finland Swedish, Scanian Swedish, Indian English); Pitch (low pitch, high pitch); Expression (happy, sad, whispering, shouting, singing); Output effect (radio, telephone, echo, muffled, mono); Reverb size (small room, large room, cathedral).

Video material
TV series and films used were: "Call my Agent!" (French production) [24, 23, 22]; "Index" (Swedish production) [32]; "Kuch Kuch Hota Hai" (Indian production) [27]; "Narcos" (American production) [38]; "Perfume" (German production) [29]; "The Bureau" (French production) [46, 34, 9]; "The King’s Speech" (British production) [25]; "Weissensee" (German production) [18].

Text-to-speech material
The following synthetic voices were provided by Acapela Group and used for the prototypes: Deepa (f, Indian English); Elin (f, Swedish); Emil (m, Swedish); Emma (f, Swedish); Erik (m, Swedish); Filip (m, Swedish child); Freja (f, Swedish child); Mia (f, Scanian Swedish); Samuel (m, Finland Swedish); Will (m, American English); Will from Afar (shouting); Will Happy; Will Old Man (elderly); Will Sad; Will Up Close (whispering) [21].

Table 1. Prototypes V1 - V28 of audio subtitles (AST)
V    Video Material        Description
1    The King’s Speech     A voice speaks over the radio, the AST (Samuel) has a
                           Finland Swedish dialect.
2    The King’s Speech     A voice speaks over the radio, the AST (Mia) has a
                           Scanian Swedish dialect.
3    The King’s Speech     A voice speaks over the radio, the AST (Erik) has a
                           radio effect.
4    The King’s Speech     A conversation between two men with a high pitched AST
                           (Erik).
5    The King’s Speech     A conversation between two men with a low pitched AST
                           (Erik).
6    The King’s Speech     A conversation in a cathedral with a large reverb size
                           on the AST (Erik), having it sound like it is heard in
                           a cathedral.
7    The King’s Speech     A conversation in a room with a small reverb size on
                           the AST (Erik), having it sound like it is heard in a
                           small room.
8    The King’s Speech     A conversation in a room with a larger reverb size on
                           the AST (Erik), having it sound like it is heard in a
                           larger room.
9    The King’s Speech     The speaker holds a speech in an arena, the AST (Erik)
                           has an echo effect on it.
10   The Bureau            A conversation between a man and a woman in a car. The
                           AST alternates between female (Elin) and male (Erik),
                           one character having a left or right mono output.
11   The Bureau            A conversation heard over telephone recordings, the
                           AST (Elin) has a telephone effect.
12   The Bureau            A conversation between two men, the AST is a human
                           voice whispering.
13   The Bureau            A conversation between two men, the AST is a human
                           voice singing.
14   Call my Agent!        Two women are whispering to each other, the AST is a
                           natural voice also whispering.
15   Call my Agent!        Two women are whispering to each other, the AST (Will
                           Up Close) is also whispering.
16   Call my Agent!        Two men are shouting at each other, the AST (Will From
                           Afar) is also shouting.
17   Call my Agent!        Two men are shouting at each other, the AST (Erik) is
                           muffled.
18   Call my Agent!        Two men are shouting at each other, the AST (Filip) is
                           a child.
19   Call my Agent!        An elderly woman holds a speech, the AST (Will Old
                           Man) is an elderly man.
20   Perfume               A woman speaks with a child, the AST alternates
                           between a female adult voice (Elin) and a child’s
                           voice (Freja).
21   Weissensee            A woman sings a song, the AST is a human voice
                           singing.
22   Kuch Kuch Hota Hai    Two women have a conversation, the AST (Deepa) has an
                           Indian English dialect.
23   Narcos                A man holds a speech, the AST (Will Happy) has a glad
                           expression.
24   Narcos                A man holds a speech, the AST (Will Sad) has a sad
                           expression.
25   Index                 AST (Elin) speaks over several people.
26   Index                 AST (Erik) speaks over several people.
27   Index                 AST (Emil) speaks over several people.
28   Index                 AST (Emma) speaks over several people.

Recruitment and Interview Process
Participants
With the help of two organizations working with people with dyslexia and
visual or cognitive impairments, Dyslexiförbundet [12] and Begripsam [5], 10
participants were recruited for virtual interviews via the video conference
system Zoom Video Communications [55]. Participants are referred to as
P1-P10. All of them were potential users of synthetic speech and
accessibility services such as audio subtitles or worked with accessibility
questions in some form or other. Several of the participants were for
instance members of the board or consultants at one of the organizations and
had a lot of experience with this and similar subjects. Others simply had
themselves or a relative with a need for an accessibility service such as
AST. Participants were told they would get to experience and react to some
conceptual audio subtitles. They were asked to interact with the prototypes
and perform a “think-aloud” evaluation to discuss their experience in
individual semi-structured interviews.

Interviews
The interview consisted of two parts. During the first, participants watched
V25-V28 in a previously determined order and answered questions about their
spontaneous reaction directly after each viewing. This made for a gentle
introduction to the topic of audio subtitles, since it showcased a real-life
scenario of how AST are implemented today, namely at SVT Play. This part was
skipped over with two participants due to time constraints (P6, P9). During
the second part, participants viewed a selection of five prototypes one at a
time and subsequently answered questions on the experience of them. The
interview concluded with synoptic and broad questions on their expectations
and experience with AST and the previously viewed prototypes.

Interviews were recorded and later transcribed with the permission of the
participants and reactions to experiential qualities of the voice for audio
subtitles were analyzed.

RESULTS
Results showed how differently, as well as how similarly, the prototypes
could be experienced. The reactions were grouped by the voice qualities
examined in the prototypes, accompanied by relevant quotes from participants.

Gender of the voice
The question of what gender the voice had was prominent throughout the study
and commented on no matter if it was the intended focal point of the
prototypes seen by the participants. It was stated that it was common to use
only one voice for accessibility services such as audio books or screen
readers due to there often being no option to combine several voices.
Participants seemed to therefore be more accepting of the gender-coded AST
not fitting the original speaker and it was seldom argued to be the most
important factor in a clip, but a general wish for the gender to match the
speaker was expressed more than once and described to improve the experience.
This became apparent with prototype V10, where the voice of a male- and a
female-coded AST was used for a conversation between a man and a woman. Two
of three
participants (P1, P2) had positive reactions to the fact that the voices
matched the speakers. All but one participant suggested that there should
always be an option to at least be able to choose between a male and female
voice for each film and TV series. Seven of the participants (P1, P2, P5, P6,
P7, P9, P10) described usually choosing which voice to use - specifically the
gender of the voice - based on a certain category and genre or simply by how
they feel that day. This indicates that the gender of the voice used for AST
is indeed a matter of interest to the user, who would like to control that
choice for each instance of use.

Age of the voice
One of the examined voice qualities was the question of age in a voice used
for audio subtitles. The majority of prototypes used voices that were
explicitly not child-like or elderly, but rather middle-aged. Still it seemed
possible to discern differences in ages among them, as was expressed by P1
when comparing the voices Erik and Emil (V26, V27): "I feel like Erik is a
man in his 50’s and this (Emil) was maybe a man in his 30’s". This also
impacted the perceived appropriateness of the AST voice used for a certain
speaker, as P1 later commented on in a scene in prototype V1.
  Here I would have liked to have a (...) neutral Swedish voice like Erik or
  maybe Emil. Possibly Erik, since this was an older man (speaking). - P1 on
  V1

The prototypes where the AST voice included that of a child garnered strong
positive reactions to the child’s voice, especially in V20, where
participants thought the voices matched with the speakers. As for V18 where a
child’s voice was matched to adults, participants pointed out that the voice
was well done but was not acceptable in this kind of situation. The clip
instead garnered confused reactions over who was speaking, followed by strong
opinions on when a child’s voice would be appropriate to use in audio
subtitles. All participants who discussed this topic (P2, P4, P5, P6)
expressed that it would fit either a children’s movie, a movie from the
viewpoint of a child or when used solely when a child was speaking. Prototype
V19 provoked a similar initial confusion over who was speaking in the clip,
as the voice was an elderly man and the original speaker an elderly woman.
Both participants (P3, P7) who saw this clip felt that this voice was not
appropriate at all to use in this case, although the perceived
age-appropriateness was mentioned to be a nice touch. They felt that the
mismatched gender and the particular deep and croaky nature of the audio
subtitles voice made for too big a contrast to the speaker appearing on the
screen in this case, but that the voice would fit well for elderly men in
movies. P7 even likened the experience to watching the character cross-dress,
which was not the case.

Dialogue
Another quality most would comment on unprovoked was the use and combination
of several voices in one clip. This was often already discussed in the first
part of the interviews during clips V25-V28, where a woman and several men
were speaking with one male or female audio subtitles voice covering them
all. Prototypes V10 and V20 used different voices for different characters
seen on screen and three out of four participants (P1, P2, P4) that viewed
these prototypes had strong, positive, almost yearning reactions to it. They
said it made it easier for them to follow the contents of the clip as well as
improving the experience overall. P4 described it as bringing the events
closer to them as a viewer.

In some other prototypes a dialogue option was discussed as a must-have,
because of the otherwise unintelligible nature of the clip (for instance
V16-V18). It was speculated that different voices for different characters
would be extremely helpful in following and keeping up with the conversation,
especially when the exchange was quick and heated such as in an argument. The
exception was P5, who said they would prefer one voice for everyone in the
clips in order to be able to focus. This participant (who had epilepsy and
autism syndrome) found the dialogue prototype V10 especially hard to bear
because of the mono effect that was implemented, which became extremely
uncomfortable in their ears. This led to the participant not realizing or
identifying that there was both a male- and female-coded AST voice used in
the clip in question.

Mono effect
This mono effect in V10 was generally barely noticed in practice and only
somewhat approved of in theory. P1 stated that the mono output made the
voices sound more "flat", although they further said that the effect made it
clearer who was speaking and where the voices were coming from, as well as it
being a fun idea, which was seconded by P2. In practice it seemed that the
dialogue quality was enough in itself and that a mono effect did not improve
the experience any further.

Dialects
The use of dialects garnered mixed reactions from participants and was often
an engaging topic of discussion. Six participants (P1, P4, P5, P6, P9, P10)
agreed that it would be a nice feature to have in theory, but often had
mellow reactions in practice. Everyone agreed that it should in any case be
optional to have dialects, albeit for different reasons. P5 stated that for
people with autism, dialects can provide a sort of "safe place", the
sentiment of which was echoed by P9, a participant who viewed a prototype
with the dialect from their home-region (V2). They described that "something
special happens when you hear your own dialect" and that it in a way brings
the contents closer to them. Several participants (P3, P4, P10) expressed
that they would not want certain extreme dialects in their audio subtitles,
worrying about the intelligibility of the dialect. They suggested a mild
version of the dialect instead. At the same time, participants pointed out
that they did not want "dialect neutral" AST to imply a Stockholm-specific
dialect, which they felt is often mistaken for dialect-neutral Standard
Swedish. In prototypes V1 and V2 two participants (P1, P5) commented on how
mismatched and "weird" the choice of AST voice was, because they identified
the context to be a speech made by a British person and expected a Standard
Swedish AST voice to go with it. Three participants (P1, P3, P10) posed the
requirement of the dialect having to fit the context of the film. For Swedish
regional dialects, participants expressed that the character on screen would
have to originate
from a given region in order for the dialect to be appropriate in audio
subtitles. This would entail the product to already be produced in Swedish
and would not be applicable in the case of SVT Play, who implement AST
exclusively for translated subtitles. P9 was inclined to use an AST voice
with dialect anyhow because of the familiarity of the dialect. This could
imply that dialects would indeed make a viable and tempting option for users
from regions in Sweden with a striking dialect that is not often heard or
used for accessibility services such as audio subtitles or even in dubbing
otherwise. The choice for Swedish regional dialects seems therefore only
relevant for users who want to hear their own or another comfortable dialect
of their choice.

The subject was perceived and discussed a little differently in regard to
international dialects. P4 who experienced prototype V22 reacted positively
to the Indian English dialect, stating that it gave a context and was
"culture specific". P9 who discussed the topic theoretically also agreed that
international dialects would "add something to the experience" and be an
entertaining feature, but P3 instead found it unnecessary. This participant
reacted negatively to the Indian English prototype V22 and found that it
"adds a dimension to their conversation which doesn’t exist". They argued
that there probably would not exist a dialect in the mother tongue of the
speakers and that it was inappropriate to add a dialect in the AST. P5
discussed that some people with autism could easily discern if a dialect was
authentic or not and that it could therefore be risky to have artificial
dialects that may offend or irritate certain viewers. Both P3 and P5 demanded
this assurance of authenticity, as they found it wrong for dialects to be
imitated, at least by human speakers for accessibility services.

Pitch
The prototypes where the pitch of the voice was altered garnered arguably the
strongest negative reactions. P8 called the experience "horrible" (V5) and
said they did not understand the purpose of having an awful and uncomfortable
voice that was additionally hard to hear and decipher for audio subtitles. P7
and P10 gave suggestions for scenarios where the low-pitched voice could fit,
for instance a ghost movie or for a villain’s voice. Still they seemed
apprehensive as to whether they would actually like to hear this type of AST
in those scenarios.

Emotional text-to-speech in audio subtitles
Only two participants (P4, P7) viewed the prototypes with coded emotions
(V23, V24), but many others (P2, P3, P5, P6, P10) still discussed the topic
of tone and emphasis in audio subtitles and the importance of it fitting the
scene. It was referred to as the "dramatic part" by one participant, stating
which voice had a better fit for a dramatic scene such as the one in V25-V28.
This was also linked to how natural a text-to-speech AST sounded in a given
scene, how well the voice adapted to the flow and tone of a sentence and
would frequently be commented on if it was not done well enough. This was
common in V25-V28, a dramatic scene with refugees shouting, where many
participants (P1, P4, P5, P6, P7) commented that some of the AST performed
flat, "technical" or "robot-like". At other times, the speech synthesis was
commended for well-done emphasis. Emotion, tone and emphasis seem to
therefore always play a very large role in the perception of the AST. P6 and
P10 even stated that they had turned off a film or TV series because of the
disengaging audio subtitles used.

P7 felt that the "glad" AST (V23) fit well with the scene and stated that
there is a perceived difference with emotional voices and the right tone, the
experience otherwise becoming disengaging and boring. P4 thought that the
"sad" AST (V24) did not fit the scene, since the original speaker had a
strong almost aggressive tone and the AST was too soft-spoken and calm. They
stated they would have preferred the voice to adapt more to the tone of the
scene and again underlined the importance of tone and emotions in audio
subtitles.

Whispering and shouting
Two participants (P2, P10) found the whispering prototype V14 to be the best,
although this coincided with the fact that the prototype in question used a
human voice. P2 expressed that it felt natural to adapt the tone this way and
that it "intensified the mood", P10 stating something similar: "It fits very
well. And it felt more alive actually". Others (P4, P5, P6, P8) did not like
the whispering prototypes (V12, V14, V15) at all, either not liking the sharp
"S"-sounds made by the human speaker, the strong nature of the whispering,
the low sound of it or the degree of intelligibility. They theoretically
preferred a less extreme version of a whispering (and shouting) AST, and
would instead opt for the service to simply lower the voice or adapt the
tone. P6 and P8, who viewed prototype V12 where the AST whispered over
non-whispering speakers, argued that they did not need this sort of
dramatisation, since they would understand the kind of situation the original
speakers were in or when they were whispering regardless. Both participants
seemed however confused about whether the original speakers in the clip were
actually whispering or not. P5 had trouble dealing with the sharp sounds made
by the human AST speaker in V12 and stated that they did not realize it was
meant to represent whispering at all.

Participants who theoretically discussed corresponding shouting audio
subtitles (P2, P5, P10) saw many problems with it. They speculated that for
shouting AST to work, the background sound would have to be lowered (P2) or
that shouting AST could be too shocking or distracting to some users and that
it would be inappropriate (P10). The participant who viewed the shouting AST
prototype V16 (P3) seemed reluctant to state that the voice fit the original
scene in question. They described how it on the other hand could destroy the
context of the scene if the audio subtitles would not shout. P3 thus seemed
to appreciate the effort of the service to honor the original tone as much as
possible, but also found the scene hard to follow in general.

Singing
The singing prototype V21 was generally well-received (P1, P5, P9), in part
due to the audio subtitles having a human performer. It was said that a
singing AST enhances the experience, although opinions on whether they would
want to use this feature differed. P1 for instance liked the singing, but
stated that
they would still prefer to have the service read out the subtitles instead of
singing them simultaneously, especially given that the AST would have a
synthetic voice. Several people (P1, P5, P10) stated that it would be of even
greater importance that the voice appear very natural or even said that it
would work exclusively with human performers. The voice would also have to
fit the tone of the movie and performers. P5 and P9 speculated that they
would not have understood that the original performer sang (V21) and that
only with the singing AST it became clear that that was the case.

P7 and P10 viewed the prototype with a singing AST for a non-singing scene
(V13) and felt it was very inappropriate. Both said they would find it
appropriate for musical films and P7 speculated that a children’s film such
as Disney’s "Frozen" (2013) [11] might make use of such an implementation.

Dramatising effects
There were almost exclusively negative reactions to the prototypes with
different effects put on the AST voice, such as effects for reverb sizes and
distortions. It was only when further discussed in theory that participants
were much more inclined to find the effects a suitable feature for AST,
contradicting their initial reaction. Several participants (P1, P2, P3, P6,
P7, P8) found effects to be unnecessary for the experience of the AST and
that it could even disturb the experience. This was especially apparent with
some of the reverb size effects, where participants stated they had a harder
time hearing what was being said. Here some stated their priorities for AST,
which should first and foremost be to be able to follow the conversations and
plot of the medium with help of audio subtitles.

  You have to put a lot of energy into trying to differentiate, who says what
  here now. - P9 on V17

  If you can not see the text or can read then it is most important that you
  can hear what they say, maybe not that you describe the space in the actual
  sound of the text. - P8

P8 argued that you would not emulate effects in your head when you read the
text. Some others (P4, P9, P10) were accepting of this sort of dramatising
effects, finding the idea entertaining, although the effects often initially
went unnoticed or unidentified. This may have been affected by participants
seldom finding the effects to be appropriate to the scene, such as seen in V7
and V8 where large and small room effects were applied. Participants seemed
to have an easier time identifying the echo effect in V9, which was arguably
more fitting and appropriate for the scene, where the original voice also had
an echo. Still, participants (P1, P2, P8, P9, P10) thought many of the
effects originated not from the AST but from the original audio and were
confused as a result. P9 even mistook the radio effect to simply be an older,
technologically inferior, low-quality AST and not a deliberate effect applied
to emulate the radio station. Similarly, the muffled effect was also thought
to be a technological issue. The topic of dramatisation and dramatising audio
subtitles proved to be quite divisive, seemingly grounded in personal
experience and preference.

DISCUSSION
The discussion focuses on overarching themes found in the results, such as
the fact that some prototypes elicited comfortable and other uncomfortable
experiences. The section consequently discusses possible implications for
voice design in audio subtitles and beyond, as well as ethical considerations
in AST.

Design implications for audio subtitles
Results indicated that users want the voice design of audio subtitles to add
to the experience of the medium and it became clear that a gender- and
age-appropriate AST voice would intensify and improve the experience of a
movie or TV series greatly. Especially the gender of the voice was frequently
discussed, although there seemed to be no overarching preference in what
gender a voice should have in a given scene. Participants rather demanded
that the gender should fit the original speaker in the scene, the AST voice
being customized to each character seen and heard. This is something which is
commonly implemented in dubbed movies and was shown to be well received in
audio subtitles before, albeit with the original actors performing the AST
[45, 10]. This kind of dialogue implementation was suggested to be very
helpful for users in order to better follow the content of a given scene, as
well as improving the experience overall. The quality of experience seemed to
be a priority for everyone, although opinions on how far the design of AST
should go differed quite a lot.

Participants had strong opinions on the dramatisation of audio subtitles one
way or the other. Some participants felt that dramatised AST (i.e. with
different effects or an adapted expression such as whispering) were
absolutely unnecessary and that AST should first and foremost translate and
deliver semantic information. Others however expressed a wish for a
high-quality emotional experience and accurate translation of the filmmakers’
cinematic experiential intentions, as described by P9.

  It’s about enhancing the experience or ensuring that the experience that
  the filmmakers want to convey actually reaches me as a consumer. I do not
  want a custom variant, I want a variant that conveys the original in a way
  that I can assimilate. - P9

This feeling of AST as an accessibility service representing a "custom
variant" was echoed previously by P5, who described her disappointment in the
quality of product an accessibility service usually implies. It is thus worth
discussing how users who are in need of such services can feel disregarded
and that they are not a priority in the entertainment and UX industry.

Opportunities in the production of accessibility services
In order to ensure inclusion and representation of users in need of
accessibility services such as AST, they could be treated as an ongoing
consideration in the production of audiovisual content. If audio subtitles
would be considered and produced early on during entertainment production,
creators could ensure that their vision of product is correctly portrayed and
translated for people in need of accessibility services. This could lead to
the emergence of an industry similar to the dubbing industry. Human
performers have previously been deemed too costly
for accessibility services [16, 51], but with text-to-speech AST, this cost
could be eliminated and instead be invested in for instance sound designers
or professional voice designers. This way, video on demand companies - who
may have other priorities - would not have to be made responsible for
producing accessibility services, as is the case today [44]. Future research
could thus investigate how sound designers, film-makers and voice designers
would approach producing audio subtitles to give as similar an experience to
the original as possible. This would also contribute to the discussion on the
importance of inclusion, representation and accessibility in HCI [50, 8, 4,
40, 31, 49, 30].

Many participants expressed a wish for autonomy in suggesting multiple choice
options for audio subtitles, which indicates that the user group recognizes
they have different requirements and opinions among themselves. P9 suggested
a choice between an AST with one voice without effects or any additions and
one "elevated experience"-AST that would include reverb size- and context
effects, adapted dialects and expressions and several voices for speakers.
Even though several of the participants stated they found the translation and
interpretation of emotional information to be unnecessary in contrast to
semantic information, previous research has described how the two of them
combined are highly important to the subsequent quality of the user
experience [15]. This study additionally showed that further voice qualities
in AST do have the potential to present many benefits to users. Simple
effects such as distortions of the voice to emulate context like a radio,
telephone or other could - when done properly - make it easier

the voice itself, the soundscape of the clip or the mismatch between choice
of voice and original speaker. What follows are insights garnered on what
made for an uncomfortable experience.

A common theme in the uncomfortable experiences with the prototypes was the
negative reaction when the voice did not support participants’ expectations
and at times instead actively contradicted them. This discomfort was
expressed several times by participants, experiencing that the dialect did
not fit the context (V1, V2), the gender or age was inappropriate (V18, V19)
or that the expressed emotion and tone did not match. When expressions such
as singing and whispering were applied to visuals and a context that
originally did not include those expressions, participants appeared confused.
Contextual or dramatising effects on the voice such as reverb size and
distortions seemed to have a similar effect, especially when the intention
was not clear to the participant. Another aspect that made participants
uncomfortable was the altered pitch in V4 and V5. This seemed to inspire more
of a physical pain, since the voices were described as "unpleasant" and
"horrible". The additional connotation to ghosts and villains, and P7
describing the experience as "scary", suggest that altering the pitch could
be an effective way of creating discomfort. There was even a slight general
disregard for text-to-speech voices palpable during the interviews. Some
participants had stronger reactions to speech synthesis than others, but most
expressed that they would still prefer a human speaker AST. It was sure to be
commented upon if the text-to-speech was deemed inadequate or "flat", which
could lead to frustration
to identify speakers and follow a story and become an im-            on the user’s part. For one participant (P5) it could even be-
portant feature to some users, as previously found in [45].          come physically straining due to their medical preconditions.
Dialects could similarly clarify the cultural and geographical       Choosing particularly low-quality speech synthesis could thus
context of a scene, but also exist to make users of a certain        be another way to create uncomfortable experiences.
region more comfortable with the AST voice, arguably due to
their association to a specific or familiar social group and the     These instances suggest that they would make an inappro-
similarity-attraction effect discussed by Sutton and colleagues      priate choice for conventional audio subtitles, but they si-
[50, 8, 42]. Expressions such as seen in highly emotional or         multaneously raise the question of whether this suggests any
                                                                     implications for how to design with the voice beyond AST.
singing scenes can generally be supported by emulating them
                                                                     Researchers have previously discussed the concept of uncom-
to at least a small degree, since some thought that it intensified
                                                                     fortable interactions and ambiguity in HCI, highlighting the
the experience. This could possibly be due to these expres-
                                                                     benefits of what would conventionally be considered a shift
sions containing emotional information that could otherwise
get lost to the viewer, as discussed in [39, 15].                    away from traditional UX values [19, 6]. This concept often
                                                                     referred to examples of physical interactions, art installations
The difference in opinion on whether accessibility services          or performances, but could possibly be applied to voice design
should translate emotional in addition to semantic information       in voice-user-interfaces as well.
and whether an "elevated experience" of audio subtitles is
                                                                     The voice has previously been described to have a big impact
desired could be further examined and discussed as the impor-
                                                                     on experience and contain and express a lot of information
tance of accessibility and inclusion gains more attention in the
industry [31, 43, 52, 37, 41].                                       between humans, who therefore are very fine-tuned to ana-
                                                                     lyzing and interpreting voice [28]. This intimate relationship
                                                                     between the voice and humans leads to them having precon-
Creating discomfort with voice design                                ceived notions on how and what the voice should be in dif-
What became interesting to observe during the interviews was         ferent contexts, which makes the voice a very intimate and
at times the level of discomfort and confusion experienced by        efficient modality to create uncomfortable experiences with.
participants, as seen in the especially strong disapproving re-      It also explains why participants had such strong reactions to
actions to the provocative prototypes (V1, V2, V12, V13, V18,        voices that were contradicting their expectations, due to them
V19). It became very clear which prototypes did not "work"           not being able to intuitively match the voice to the face seen
in the sense that the participants did not enjoy the experience      on screen. Contradicting expectations thus elicited discomfort
or would have liked to use that particular version of AST. This      but could simultaneously inspire reflection on the user’s part.
was the case due to different reasons, often either related to