Performance of a Deep Neural Network at Detecting North Atlantic Right Whale Upcalls

This paper is part of a special issue on The Effects of Noise on Aquatic Life.

Oliver S. Kirsebom,1,a) Fabio Frazao,1 Yvan Simard,2,3 Nathalie Roy,3 Stan Matwin,1,4 and Samuel Giard3

1) Institute for Big Data Analytics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
2) Fisheries and Oceans Canada Chair in underwater acoustics applied to ecosystem and marine mammals, Marine Sciences Institute, University of Québec at Rimouski, Rimouski, Québec, Canada
3) Maurice Lamontagne Institute, Fisheries and Oceans Canada, Mont-Joli, Québec, Canada
4) Institute of Computer Sciences, Polish Academy of Sciences, Warsaw, Poland

(Dated: 3 March 2020)

arXiv:2001.09127v2 [eess.AS] 29 Feb 2020

Passive acoustics provides a powerful tool for monitoring the endangered North Atlantic right whale (Eubalaena glacialis), but robust detection algorithms are needed to handle diverse and variable acoustic conditions and differences in recording techniques and equipment. Here, we investigate the potential of deep neural networks for addressing this need. ResNet, an architecture commonly used for image recognition, is trained to recognize the time-frequency representation of the characteristic North Atlantic right whale upcall. The network is trained on several thousand examples recorded at various locations in the Gulf of St. Lawrence in 2018 and 2019, using different equipment and deployment techniques. Used as a detection algorithm on fifty 30-minute recordings from the years 2015–2017 containing over one thousand upcalls, the network achieves recalls up to 80%, while maintaining a precision of 90%. Importantly, the performance of the network improves as more variance is introduced into the training dataset, whereas the opposite trend is observed using a conventional linear discriminant analysis approach. Our work demonstrates that deep neural networks can be trained to identify North Atlantic right whale upcalls under diverse and variable conditions with a performance that compares favorably to that of existing algorithms.

© 2020 Acoustical Society of America. [http://dx.doi.org(DOI number)]

Pages: 1–12

a) Corresponding author: oliver.kirsebom@dal.ca

J. Acoust. Soc. Am. / 3 March 2020

I. INTRODUCTION

The North Atlantic right whale (NARW, Eubalaena glacialis) comprises a small cetacean population that counted ∼400 individuals in 2017 (Hayes et al., 2018; Pace et al., 2017; Pettis et al., 2019). Listed as endangered in Canada (COSEWIC, 2013), this population has been declining since 2010 (Pettis et al., 2019). NARW used to be distributed mainly along the US continental shelf, up to the Bay of Fundy and the Western Scotian shelf in Canada. This distribution, however, has changed in the last decade (Davis et al., 2017). The occasional yearly occurrence of a few individuals in the Gulf of St. Lawrence in summer and fall increased markedly in 2015, and a high seasonal occurrence has continued since (Simard et al., 2019). This area is a site of intensive fixed-gear fishing and is crossed by the main continental seaway that connects the Atlantic and the Great Lakes (Simard et al., 2014). In 2017, 12 individuals died in the Gulf of St. Lawrence; the mortalities involved collisions with ships and entanglement in fixed fishing gear. Protection measures, which include vessel speed reduction and fishing closures, were then put in place by the management authorities in an effort to prevent the recurrence of such events (DFO, 2018).

The key information required to trigger vessel speed reduction and fishing closure is the presence of the animals in the highest-risk areas. This information can be acquired over large areas for short time windows from systematic or opportunistic sightings from aircraft or vessels. However, to obtain continuous round-the-clock information over the season, NARW detection with passive acoustic monitoring (PAM) systems is needed (Simard et al., 2019). Various configurations of PAM systems are possible for small- to large-scale coverage (Gervaise et al., 2019b), including some supporting detection in quasi real-time, such as Viking-WOW buoys (https://ogsl.ca/viking/), Slocum gliders, and fixed buoys (Baumgartner et al., 2019, 2013).

NARW upcall detection and classification (DC) algorithms are a key component of such PAM systems. Several algorithms, exploiting the time-frequency structure of the call, have been used so far (Baumgartner et al., 2011; Gillespie, 2004; Mellinger, 2004; Simard et al., 2019; Urazghildiiev and Clark, 2006, 2007; Urazghildiiev et al.,
2008). Their performance is dependent on the signal-to-noise ratio (SNR), which varies with the range of the calling whale, the noise levels, and other biological and instrumentation factors (Gervaise et al., 2019b; Simard et al., 2019). The DC performance of these classical signal-processing methods under actual in situ recording conditions tends to plateau around a detection probability of about 50% (i.e., recall) when the false-detection probability is kept below about 10% (DCLDE 2013; Simard et al., 2019). The objective of the present study is to test whether modern machine-learning approaches can break this apparent DC performance ceiling.

Within the last decade, artificial neural networks have become the preferred machine-learning approach for solving a wide range of tasks, outperforming existing computational methods and achieving human-level accuracy in domains such as image analysis (He et al., 2015) and natural speech processing (Hinton et al., 2012). Originally inspired by the human brain, neural networks consist of a large number of interconnected "neurons", each typically performing a simple linear operation on input data, specified by a set of weights and a bias, followed by an activation function. In a supervised training approach, the network is given examples of labeled data, and the weights and biases are adjusted to produce the desired output using an optimization algorithm. Modern neural networks exhibit multi-layer architectures, which enable them to build complex concepts out of simpler concepts and hence learn a non-linear representation of the data conducive to solving a given task. Therefore, modern neural networks are often referred to as deep neural networks (DNNs), and the strategy of representing complex data as a nested hierarchy of concepts is referred to as deep learning. Two of the most commonly encountered basic architectures are convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which are particularly well suited to analyzing image data and sequential data, respectively. The availability of large labeled datasets, containing millions of labeled examples, has been a key factor in the success of DNNs in domains such as image analysis and natural speech processing. Therefore, much of the current research in deep learning focuses on how to train DNNs more efficiently on smaller datasets.

Shallow neural networks have been employed for the purpose of sound classification in marine bioacoustics since the 1990s, usually combined with a method of feature extraction, e.g. (Bahoura and Simard, 2010), but also acting directly on the spectrogram (Halkias et al., 2013; Potter et al., 1994). In the last few years, the first studies employing modern DNNs have been reported. Examples include classification of fish sounds (Malfante et al., 2018), detection and classification of orca vocalizations (Bergler et al., 2019), classification of multiple whale species (McQuay et al., 2017; Thomas et al., 2019), and detection and classification of sperm whale vocalizations (Bermant et al., 2019). In all cases, CNNs have been leveraged to analyze the information encoded in spectrograms, which is also the strategy adopted in the present work.

The paper is structured as follows. In Sec. II we first describe how the acoustic data were collected, then discuss the generation of training datasets, the neural network design, and the training protocol. In Sec. III we present the results of the detection and classification tasks, which are then discussed in Sec. IV. Finally, in Sec. V we summarize and conclude.

II. MATERIALS AND METHODS

A. Acoustic Data

The PAM data were collected between 2015 and 2019 at 6 stations in the southern Gulf of St. Lawrence (Fig. 1). Two different deployment configurations were employed, producing distinct datasets, A, B, and B∗ (Table I). In addition to these datasets, we have also considered a third dataset, C, which is a subset of the DCLDE 2013 dataset generated from PAM data collected in the Gulf of Maine in 2009.^1

In the case of dataset A, the PAM system was deployed from the surface, with the hydrophone tethered to a real-time ocean-observing Viking buoy (Multi-Electronique Inc., Rimouski, Qc, Canada, http://www.multi-electronique.com/buoy.html) with a 60-m long cable floating at the surface for half of its length. The recording digital hydrophone, at a depth of ∼30 m, was an ic-Listen HF (Ocean Sonics, Truro Heights, N.S., Canada, https://oceansonics.com/product-types/iclisten-smart-hydrophones/). It continuously sampled the raw (0-gain) acoustic signal with 24-bit resolution. The receiving sensitivity of the hydrophone was −170 dB re 1 V µPa−1.

In the case of datasets B and B∗, the PAM system used short (< 10 m) "I"-type oceanographic moorings, with an anchor, an acoustic release, and streamlined underwater floats, for bottom deployment at depths varying from 75 m to 125 m, with the autonomous hydrophone ∼5 m above the seafloor. The recording equipment consisted of an AURAL M2 (Multi-Electronique Inc., Rimouski, Qc, Canada, http://www.multi-electronique.com/aural.html) sampling the 16-dB pre-amplified acoustic signal with 16-bit resolution at duty cycles of 15 min or 30 min every hour. The receiving sensitivity of the HTI 96-MIN (High Tech Inc., Gulfport, MS) hydrophone equipping the AURAL is −164 ± 1 dB re 1 V µPa−1 over the < 0.5-kHz bandwidth used here. Further details can be found in (Simard et al., 2019).

FIG. 1. (Color online) Map of the Gulf of St. Lawrence with bathymetry and seaways, showing the location of the PAM stations. Circles indicate surface deployments (dataset A) while stars indicate bottom deployments (datasets B and B∗).

Because of the different acoustic equipment and deployment types, the recordings from the two datasets differed significantly in terms of signal amplitude and noise background from different sources, including flow noise, strum, and knocks resulting from the effects of tidal currents and of the surface motion due to waves on the hydrophone and deployment apparatus. Additional SNR variability of the recordings is introduced by the different locations and depths at which the hydrophones were deployed in the southern Gulf of St. Lawrence, providing different exposures to the above environmental conditions and to the shipping-noise field from the main seaway (Aulanier et al., 2016; Simard et al., 2019). The datasets used here therefore represent a large range of conditions that can be encountered in realizing the DC task for the low-frequency NARW upcall from acoustic data collected using different PAM systems. To develop a deep learning model that is robust to such a realistic range of variability, no effort was made to enhance the SNR before feeding the data to the neural network.

B. Training and Test Datasets

Datasets A and B were first analyzed with a classical time-frequency-based detector (TFBD) following (Mellinger, 2004) and (Mouy et al., 2009). This algorithm looks for the typical time-frequency image of a NARW upcall in the SNR-enhanced (i.e., noise-subtracted), high-resolution (32 ms × 3.9 Hz) spectrogram of the recordings, and a detection is triggered by the degree of cross-coincidence. The NARW upcall template used was a 1-s, 100–200 Hz chirp with a ±10 Hz bandwidth. For further details, see (Simard et al., 2019). The resulting detections were then manually validated by an expert, who examined the spectrogram and labelled each detection as "true" or "false" using the longer call-pattern context in a ∼1-min window. In NARW call-occurrence studies, the false detections are then eliminated. For the present work, however, both true and false detections were extracted from the recordings and used as "positives" and "negatives", respectively, for building the training datasets A and B (Table II). For dataset C, we used the existing annotations from the DCLDE 2013 Challenge. Finally, we built the composite datasets AB and ABC by combining all the samples from the individual datasets.

To examine the accuracy of the validation protocol used for producing datasets A and B, a second expert was subsequently tasked with reviewing a subset of the annotations. The review differed from the validation in several ways: the second expert had knowledge of both the labels assigned by the first expert and the classification proposed by the DNN; the second expert used raw spectrograms, while the first expert used SNR-enhanced spectrograms^2 and a larger temporal context, including considerations of occurrence probability over the seasons. Finally, the first expert was instructed to adopt a more conservative annotation strategy, always assigning a negative label in cases of substantial doubt, whereas no such

TABLE I. Datasets used in this work.

Dataset   Deployment type   Location               Year(s)      Analysis method
A         Surface buoy      Gulf of St. Lawrence   2019         Expert validation of detections reported by TFBD algorithm
B         Bottom mooring    Gulf of St. Lawrence   2018         Expert validation of detections reported by TFBD algorithm
B∗        Bottom mooring    Gulf of St. Lawrence   2015–2017    Full manual analysis of fifty 30-min recordings
C         Bottom mooring    Gulf of Maine          2009         Full manual analysis of seven days of recordings

TABLE II. Number of samples in the datasets used for training and testing the classifiers. The composite datasets AB and ABC were produced by combining all the samples from the individual datasets.

          Training and validation                  Testing
Dataset   No. samples   Positives   Negatives     No. samples   Positives   Negatives
A         1,767         42%         58%           307           18%         82%
B         3,309         61%         39%           579           59%         41%
C         3,000         50%         50%           −             −           −
AB        5,076         55%         45%           886           45%         55%
ABC       8,076         53%         47%           −             −           −
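As a quick worked check of Table II, the composite rows can be reproduced from the individual datasets. The sketch below is ours, not the authors' code; small discrepancies (±1%) arise because the table's percentages are rounded.

```python
# Counts from the training+validation columns of Table II; the positive
# fractions are the table's (rounded) percentages.
datasets = {
    "A": {"n": 1767, "pos_frac": 0.42},
    "B": {"n": 3309, "pos_frac": 0.61},
    "C": {"n": 3000, "pos_frac": 0.50},
}

def combine(names):
    """Return (total sample count, positive fraction) for a composite dataset."""
    n = sum(datasets[k]["n"] for k in names)
    pos = sum(datasets[k]["n"] * datasets[k]["pos_frac"] for k in names)
    return n, pos / n

n_ab, f_ab = combine("AB")
n_abc, f_abc = combine("ABC")
print(n_ab, f"{100 * f_ab:.1f}%")    # 5076 54.4% (Table II rounds to 55%)
print(n_abc, f"{100 * f_abc:.1f}%")  # 8076 52.8% (Table II: 53%)
```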

instruction was given to the second expert. The results of the annotation review will be discussed in Sec. IV.

The extracted segments are 3 s long and centered on the midpoint of the upcall, as determined by the TFBD algorithm. However, the midpoint determined by the algorithm rarely coincides with the actual midpoint of the upcall, producing segments that are misaligned by up to 0.5 s in either direction. We note that such quasi-random time shifts are desirable for training a DNN classifier because they encourage the network to learn a more general, time-translation-invariant representation of the upcall.

For the purpose of testing the classification performance of the trained models, including their capacity for generalizing, we split datasets A and B as follows: samples obtained at times t < t0 were used for training and validation, while samples obtained at times t > t0 were retained for testing. Here, t0 was chosen to produce an 85:15 split ratio between the number of samples used for training and validation and the number of samples used for testing (Fig. 2). This split implies temporal separations of 52 min and 33 min between the latest sample in the training dataset and the earliest sample in the test dataset for A and B, respectively.

FIG. 2. (Color online) Time-split used to produce training and validation sets and test sets for Datasets A (a) and B (b).

From the bottom deployments we also produced dataset B∗, consisting of fifty 30-min segments from two years between 2015 and 2017 (Fig. 3). These data were used for testing the detection performance of the neural network on continuous data. The data cover all seasons and times of the day, hence providing a representative picture of the acoustic conditions found in the Gulf of St. Lawrence. Moreover, the data have no temporal overlap with the training datasets A and B, which originate from 2019 and 2018, respectively. Dataset B∗ was manually analyzed in its entire length by a third expert, who identified 1,157 NARW upcall occurrences.

C. Spectrogram and SNR Computation

First, the 3-s acoustic segments were downsampled to 1,000 samples s−1 using MATLAB's resample function, which employs a polyphase anti-aliasing filter. The spectrogram representation was then computed on a dB scale using a window size of 0.256 s, a step size of 0.032 s (88% overlap), and a Hamming window. These parameters
have been shown to be optimal for identifying NARW upcalls (Gervaise et al., 2019a) and produce a spectrogram with (time, frequency) dimensions of 94 × 129. We note that the spectrograms were fed to the network in their raw form. In particular, no effort was made to normalize the spectrograms to correct for systematic differences in signal amplitude in the three datasets. This approach was adopted to produce the most general model possible.

FIG. 3. (Color online) Temporal distribution of the training datasets A and B and the continuous test dataset B∗.

FIG. 4. (Color online) Positive spectrogram sample from dataset A (sample A1796, SNR = 6.6). Superimposed is a 1-s window centered on the upcall (dashed line) and a trace drawn along the upcall (dotted curve), as computed by the heuristic SNR algorithm described in the text.
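The preprocessing was done in MATLAB; as an illustrative sketch only (Python/scipy is our substitution, and the input below is a synthetic chirp rather than real data), the same downsampling and spectrogram parameters can be reproduced as follows:

```python
import numpy as np
from scipy.signal import chirp, resample_poly, spectrogram

# Synthetic 3-s segment containing a 100-200 Hz upsweep (stand-in for an upcall).
fs_in = 8000  # hypothetical original sampling rate; the actual recorders differ
t = np.arange(0, 3.0, 1 / fs_in)
x = chirp(t, f0=100, t1=3.0, f1=200)

# Downsample to 1,000 samples/s with a polyphase anti-aliasing filter
# (the analogue of MATLAB's resample).
fs = 1000
y = resample_poly(x, up=1, down=fs_in // fs)

# 0.256-s Hamming window with a 0.032-s step (88% overlap), on a dB scale.
nperseg = int(0.256 * fs)             # 256 samples
noverlap = nperseg - int(0.032 * fs)  # 224 samples
f, tt, Sxx = spectrogram(y, fs=fs, window="hamming",
                         nperseg=nperseg, noverlap=noverlap)
Sxx_db = 10 * np.log10(Sxx + 1e-12)
print(Sxx_db.shape)  # (129, 86): 129 frequency bins, 86 time bins
```

With this framing convention a 3-s segment yields 86 time bins; the 94 × 129 dimensions quoted in the text imply a slightly different segment length or padding convention in the original pipeline.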
     For the estimation of the SNR value of each sample,           in the text.
positive or negative, the following heuristic algorithm was
implemented: (1) A denoised spectrogram, Xd , was cre-
ated by subtracting first the median value of each time            blocks with skip connections (He et al., 2016). CNNs
slice (to reduce broadband, impulsive noise) and then              consist of a stack of convolutional layers followed by a few
subtracting the median value of each frequency slice (to           fully connected layers. During the training process, the
reduce tonal noise). (2) The mid-point of the upcall was           convolutional layers learn to extract patterns from the in-
determined by sliding a 1-s wide window across the de-             put images, which are passed to the fully connected lay-
noised spectrogram while seeking to maximize,                      ers for classification (Goodfellow et al., 2016, Chapter 9).
                                                                   The residual blocks in a ResNet are composed of convo-
         sumt (maxf (Xd )) + sumf (maxt (Xd )) ,                   lutional layers, but allow some connections between lay-
                                                                   ers to be skipped, thereby avoiding “vanishing” and “ex-
in the frequency interval 80–200 Hz, where the subscripts          ploding” gradients during training (He et al., 2016). We
indicate the axis (t:time, f :frequency) along which the           used blocks with batch normalization (Ioffe and Szegedy,
mathematical operation is applied. (3) A trace was                 2015) and rectified linear units (ReLU) (Nair and Hin-
drawn by connecting the pixels with the maximum value              ton, 2010). The architecture was composed of eight such
in each time slice of Xd . (4) The median value of the             blocks preceded by one convolutional layer and followed
original spectrogram, X, was computed along this trace,            by a batch normalization layer, global average pooling
including also the three pixels immediately above and              (Lin et al., 2013), and a fully connected layer with a soft-
below to account for the finite “width” of the upcall.             max function, which is responsible for the classification.
(5) Finally, the median values of X in the 0.5-s adja-             Finally, the output layer gives a score in the range 0–1
cent windows were computed for the frequency interval              for each of the two classes (positive and negative), which
80–200 Hz and subtracted. Fig. 4 shows the result ob-              add up to 1.
tained on a typical positive sample from dataset A. We
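The eight-block ResNet of Sec. II.D might be sketched as follows in PyTorch. This is our minimal rendition, not the authors' code: the channel width (32), 3×3 kernels, and the absence of downsampling between blocks are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: two 3x3 convolutions with batch norm and a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(h + x)  # skip connection

class UpcallResNet(nn.Module):
    """One conv layer, eight residual blocks, batch norm, global average
    pooling, and a fully connected layer with softmax (two classes)."""
    def __init__(self, channels=32, n_blocks=8, n_classes=2):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1, bias=False)
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])
        self.bn = nn.BatchNorm2d(channels)
        self.fc = nn.Linear(channels, n_classes)

    def forward(self, x):
        h = self.blocks(self.stem(x))
        h = torch.relu(self.bn(h))
        h = h.mean(dim=(2, 3))                    # global average pooling
        return torch.softmax(self.fc(h), dim=1)   # two class scores that sum to 1

model = UpcallResNet()
scores = model(torch.randn(4, 1, 94, 129))  # batch of four 94 x 129 spectrograms
print(scores.shape)  # torch.Size([4, 2])
```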
stress that SNR estimation is highly challenging for the           E. Training Protocol
datasets considered in this work because of non-uniform
stationary noise, transient noise, and distortion of the               We trained the network on two NVIDIA RTX 2080
upcalls due to propagation effects.                                Ti GPUs with 11GB of memory. Training was performed
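The five-step SNR heuristic can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the time step dt and frequency bin width df are taken from the Sec. II.C spectrogram parameters, and the band-index arithmetic is ours.

```python
import numpy as np

def upcall_snr(X, dt=0.032, df=1000 / 256, fmin=80.0, fmax=200.0):
    """Heuristic SNR estimate for a spectrogram X of shape (n_times, n_freqs).

    dt and df assume the 0.256-s window / 0.032-s step spectrogram of Sec. II.C.
    """
    # (1) Denoise: subtract the median of each time slice (broadband noise),
    #     then the median of each frequency slice (tonal noise).
    Xd = X - np.median(X, axis=1, keepdims=True)
    Xd = Xd - np.median(Xd, axis=0, keepdims=True)

    band = slice(int(fmin / df), int(fmax / df) + 1)  # 80-200 Hz bins
    Xb = Xd[:, band]

    # (2) Slide a 1-s window to locate the upcall midpoint, maximizing
    #     sum_t(max_f(Xd)) + sum_f(max_t(Xd)) within the band.
    w = int(round(1.0 / dt))
    def score(i):
        win = Xb[i:i + w]
        return win.max(axis=1).sum() + win.max(axis=0).sum()
    i0 = max(range(Xb.shape[0] - w + 1), key=score)

    # (3) Trace: the maximum-value pixel in each time slice of the window.
    trace = Xb[i0:i0 + w].argmax(axis=1) + band.start

    # (4) Median of the *original* spectrogram along the trace,
    #     including the three pixels above and below.
    vals = [X[r, max(c - 3, 0):c + 4] for r, c in zip(range(i0, i0 + w), trace)]
    signal = np.median(np.concatenate(vals))

    # (5) Subtract the median of the adjacent 0.5-s windows in the same band.
    h = w // 2
    noise = np.median(np.concatenate([X[max(i0 - h, 0):i0, band].ravel(),
                                      X[i0 + w:i0 + w + h, band].ravel()]))
    return signal - noise

X = np.random.default_rng(0).normal(size=(94, 129))  # stand-in spectrogram
print(upcall_snr(X))  # small for pure noise
```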
                                                                   with a batch size of 128 and terminated after a pre-set
                                                                   number of epochs, N . That is, 128 samples were passed
D. Neural Network Architecture
                                                                   through the network between successive optimizations of
     The problem was set up as a binary classification             the weights and biases, and every sample in the train-
task: A neural network was trained to classify the 3-s             ing dataset was passed through the network N times.
spectrograms according to the criterion, contains (pos-            Weights and biases were optimized with the ADAM op-
itive class, 1) or does not contain (negative class, 0) a          timizer (Kingma and Ba, 2014) using the recommended
NARW upcall. We used a residual network (ResNet),                  parameters: an initial learning rate of 0.001, decay of
which is a CNN architecture mainly built of residual               0.01, β1 of 0.9, and β2 of 0.999. No effort was made to

J. Acoust. Soc. Am. / 3 March 2020   JASA/Performance of a Deep Neural Network at Detecting North Atlantic Right Whale Upcalls                  5
explore the effects of these parameters on the training outcome. The network was trained to maximize the F1 score, defined as the harmonic mean of precision and recall, F1 = 2PR/(P + R), where R is the recall, i.e., the fraction of the upcalls that were detected, and P is the precision, i.e., the fraction of the detected upcalls that were in fact upcalls. Thus, the F1 score considers both recall and precision and attaches equal importance to the two.
     Initially, the network was trained using 5-fold cross-validation with an 85:15 random split between the training and validation sets, allowing us to confirm that the optimization had converged without overfitting. Based on these initial training sessions, N = 100 was found to provide an optimal choice for all the training datasets. The network was then trained on the full training datasets without cross-validation for N = 100 epochs. This was repeated nine times with different random number generator seeds to assess the sensitivity of the training outcome to the initial conditions.

F. Linear Discriminant Analysis

     To establish a baseline against which to compare the performance of the neural network, we implemented a linear discriminant analysis (LDA) model following the approach of Martinez and Kak (2001), noting that such models have traditionally been adopted for solving sound detection and classification tasks in marine bioacoustics. First, the 94 × 129 spectrogram matrix was flattened to a vector of length 12,126. Second, the dimensionality was reduced by means of principal component analysis (PCA). Third, we trained the LDA classifier using a least-squares solver combined with automatic shrinkage following the Ledoit-Wolf lemma (Ledoit and Wolf, 2004). The training was repeated for several choices of PCA dimensionality using an 85:15 random split between training and validation sets, and the dimensionality yielding the best performance on the validation set was selected.

G. Detection Algorithm

     For the purpose of detecting NARW upcalls in continuous acoustic data, the following simple algorithm was implemented: First, the data were segmented using a window of 3 s and a step size of ∆t = 0.5 s. Each 3-s segment was then fed to the DNN classifier, producing a sequence of classification scores between 0–1, which we interpret as a time series of upcall occurrence probabilities. Empirically, we found it useful to smooth the classification scores using a five-bin (2.5-s) wide averaging window. This greatly reduced the number of false positives (factor of ∼5) at the cost of a modest increase in the number of false negatives (factor of ∼2). Finally, we applied a uniform detection threshold, setting the bin value to 1 ("positive") when the score was greater than or equal to the threshold and 0 ("negative") when it was below.
     For the computation of recall, precision, and false-positive rate, we merged adjacent positive (1) bins into "detection events", which extend from the lower edge of the first bin, t, to the upper edge of the last bin, t + N∆t, N being the number of bins in the event. To allow for minor temporal misalignments between annotations and detections, we adopted a temporal buffer of 1.0 s, effectively expanding every detection event to [t − 2∆t, t + (N + 2)∆t]. Considering the primary intended application of the detection algorithm, namely, to quantify upcall occurrences in PAM data and provide occurrence times of sufficient accuracy to aid validation by a human analyst, such a small temporal buffer is fully justifiable. The recall was then computed as the fraction of annotated upcalls that exhibit at least 50% overlap with a detection event, while the precision was computed as the fraction of the detection events that exhibit at least 50% overlap with an annotated upcall. Any detection event that did not overlap with an annotated upcall or exhibited less than 50% overlap was counted as one false positive for the computation of the false-positive rate.

III. RESULTS

A. Classification Performance

     The classification performance of the DNN and LDA classifiers on the test datasets is summarized in terms of the average F1 score, recall, and precision obtained in the nine independent training sessions (Fig. 5). The DNN model trained on the ABC dataset exhibits the best overall performance, achieving a recall of 87.5% and a precision of 90.2% on the AB test set (with standard deviations of 1.1% and 1.2%) and outperforming the baseline LDA model by a statistically significant margin (as evident from Fig. 6 below).

(a) DNN classification performance (each cell: F1 / recall / precision, in %):

Trained on    Tested on A           Tested on B           Tested on AB
A             43.3 / 41.2 / 62.0    10.4 /  9.2 / 60.0    16.8 / 13.7 / 65.4
B             48.0 / 40.3 / 80.0    88.3 / 89.4 / 88.1    83.7 / 82.4 / 86.7
AB            69.1 / 66.1 / 79.7    87.8 / 89.9 / 86.7    85.0 / 86.5 / 85.2
ABC           80.5 / 71.3 / 93.0    90.1 / 90.1 / 90.2    88.8 / 87.5 / 90.2

(b) LDA classifier performance (each cell: F1 / recall / precision, in %):

A             66.4 / 58.2 / 77.7    64.4 / 65.5 / 63.7    64.7 / 64.5 / 65.2
B             46.5 / 75.0 / 33.9    79.7 / 89.7 / 71.8    73.4 / 87.6 / 63.1
AB            60.0 / 55.4 / 65.6    76.6 / 83.2 / 71.0    74.6 / 79.3 / 70.5
ABC           60.6 / 55.2 / 67.4    71.9 / 74.9 / 69.1    70.5 / 72.1 / 68.9

FIG. 5. (a) DNN classification performance in terms of F1 score, recall, and precision. Rows: training dataset; columns: test dataset. (b) LDA classifier performance.

     We have investigated the effect of increasing the size of dataset C by up to a factor of 10 (15,000 upcalls), but found only a negligible improvement in the performance of the DNN model. (We note that the DCLDE 2013 dataset contains 6,916 logged calls. To produce a dataset with 15,000 upcalls we added time-shifted copies of the logged calls to the dataset.)
     We have also investigated the effect of discarding samples with SNR below a certain minimum value, SNRmin, from the AB test set (Fig. 6). For the DNN model, we observe a gradual increase in performance as we restrict our attention to upcalls with increasingly larger SNR values, with the recall improving from 89% at SNRmin ≃ 0 to 98% at SNRmin ≃ 12 and the precision improving from 91% to 100% across the same range of SNR. The performance of the LDA model also improves with increasing SNR. This is especially true for the precision, which improves from 70% at SNRmin ≃ 0 to 95% at SNRmin ≃ 12, whereas the recall reaches a maximum of 82% before worsening at the largest SNR values.
     In the following, we examine a small set of representative spectrogram samples, which have been either correctly classified or misclassified by the DNN model

(Fig. 7). We divide the samples into true positives, true negatives, false positives, and false negatives, and for each category we give three examples reflecting different levels of certainty and difficulty as perceived by the second expert: (a) certain and easy, (b) certain, but difficult, and (c) uncertain. Here, it must be remembered that the experts had access to a larger temporal context of ∼1 min to inform their decision. Notably, this may have helped the expert to correctly identify calls with low SNR in cases where the calls form part of a call series.
     A few observations can be made: the model is able to correctly identify upcalls with very different SNR (B2225, B3121); the model is able to correctly classify negatives containing potentially confusing patterns (A12), but not always (A111); the model struggles in cases with low SNR (A134, A1858); the model can be confused by tonal noises and multipath echoes (B205). These deficiencies could potentially be resolved by enlarging the temporal window, thereby giving the model access to the same contextual information that is available to the human analyst, notably the appearance of an upcall series.
     Finally, we highlight a limitation of the classification results reported in this section. Since datasets A and B only contain samples flagged by the TFBD algorithm, the performance demonstrated on these datasets is not necessarily representative of the performance on a random selection of samples from a continuous recording, and the reported metrics (Fig. 5) cannot be readily applied to continuous data. In the next section, we address this limitation by testing the performance of the classifiers on dataset B∗, which has been subject to full manual analysis.

B. Detection Performance on Continuous Data

     The detection algorithm introduced in Sec. II G was tested on dataset B∗, which consists of fifty 30-minute segments and has a total of 1,157 upcalls. The number of calls per file exhibits significant variation, ranging from none to 100 with a median value of 15.
     The performance of the detection algorithm is summarized in terms of recall, precision, and false-positive rate (Fig. 8). The detection threshold is seen to provide a convenient tunable parameter to adjust the detection performance, depending on whether high precision or high recall is desired. One also notes that the nine independent training sessions produced detectors with very similar recall, but varying levels of precision. In particular, the best-performing detector achieves a recall of 80%, while maintaining a precision above 90%, corresponding to a false-positive rate of 5 occurrences per hour for this particular test dataset, while the "average" detector achieves a recall of 60% for the same level of precision.
     Finally, we have considered the effect of discarding samples with SNR below a certain minimum value, SNRmin, from the test dataset (Fig. 9). We observe a gradual improvement in performance with increasing SNR. For example, by considering only samples with SNR > 4.0, the false-positive rate is reduced from 35 to 6 occurrences per hour, while maintaining a recall of 90% and retaining more than 95% of the upcalls in the test dataset.

FIG. 6. (Color online) (a) Effect of discarding samples with SNR < SNRmin from the AB test dataset on the recall of the DNN and LDA models trained on ABC. The lines show the average recall obtained in the nine training sessions, while the shaded bands show the 10% and 90% percentiles. (b) Same, for the precision. (c) Number of positive and negative samples in the AB test dataset with SNR ≥ SNRmin.

IV. DISCUSSION

     The DNN classifier has been found to outperform the baseline LDA model, achieving recall and precision of 87.5% and 90.2% on the AB test dataset. Additionally, the DNN models trained on the combined datasets generally performed better than the models trained on the individual datasets, also when tested on the individual datasets. For example, models trained on AB consistently outperformed models trained exclusively on A, even when tested solely on A. In contrast, the baseline LDA model achieved worse performance when trained on combined datasets. This is an important observation, because it suggests that DNNs have the capacity to handle larger variance in the data, and indeed benefit from being trained on data with greater variance, producing models that are more robust to inter-dataset variability.
     On the other hand, we found that the DNN models generally performed poorly when trained on one dataset, but tested on another (e.g., trained on A, but tested on B). This behavior was not observed with the LDA models, whose less performant solutions appear to be less sensitive to the training dataset.
     The quality and accuracy of the training datasets built as part of this work are limited both by the use of a classical time-frequency-based detector to select candidate upcalls for expert validation and by human subjectivity in the validation step. Any bias in the selection or validation step will be reflected in the training dataset and hence affect the learning of the DNN. To explore the bias in the validation step, a second expert was tasked with reviewing all the incorrectly classified segments (false positives as well as false negatives) and an equally large number of correctly classified segments randomly sampled from the AB test dataset (cf. Sec. II B). The second expert flagged about half of the incorrectly classified segments as "borderline", implying that the expert considered these classifications as being highly uncertain. On the other hand, the second expert only flagged 9% of the correctly classified segments as borderline. Removing the borderline cases from the test data improves the recall and precision by 2% and 5%, respectively. However, the second expert also changed some of the labels not considered to be borderline. Adopting the second expert's revised labels for the test data, the recall decreases by 6% while the precision increases by 2%. These changes in performance metrics testify to the difficulty of obtaining accurate annotations on PAM data. It would be interesting to investigate the inter-annotator variability in a more systematic and controlled manner than done here, but this is beyond the scope of the present study. (For example, it can be argued that the second expert may have been biased by prior knowledge of the labels proposed by the first expert and the DNN.)
     In order to obtain a realistic assessment of the performance that can be expected of the DNN model in a practical setting, we have tested the model's ability to identify upcalls in continuous acoustic data representative of the actual conditions required for a NARW upcall PAM DC system. In order to obtain an unbiased estimate of the detection performance, these data were subject to a complete manual validation not resorting to the use of a classical detection algorithm to select candidate upcalls. The best-performing model achieved a recall of 80% while maintaining a precision above 90%, corresponding to a false-positive rate of 5 occurrences per hour for the chosen test dataset, while the "average" model only achieved a recall of 60% for the same level of precision. However, by restricting our attention to upcalls with SNR ≳ 4.0, the recall of the average model was increased to 85% for the same level of precision while retaining over 95% of the upcalls. Existing algorithms are capable of achieving similar levels of recall, but at the cost of a significantly higher false-positive rate (DCLDE 2013; Simard et al., 2019).
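The segment-scoring and event-building procedure evaluated above (Secs. II G and III B) can be sketched in a few lines of Python. This is a minimal illustration under our own naming conventions, not the authors' implementation; the default parameter values are those quoted in Sec. II G (0.5-s step, five-bin smoothing window, 1.0-s buffer).

```python
import numpy as np

def detect_events(scores, threshold=0.5, dt=0.5, smooth_bins=5, buffer_s=1.0):
    """Sketch of the Sec. II G detection logic: smooth the per-segment
    classification scores, apply a uniform threshold, and merge adjacent
    positive bins into buffered detection events (start, end) in seconds."""
    # Five-bin (2.5-s) moving-average smoothing of the score time series.
    kernel = np.ones(smooth_bins) / smooth_bins
    smoothed = np.convolve(scores, kernel, mode="same")
    # Uniform detection threshold: bin value 1 ("positive") or 0 ("negative").
    bins = (smoothed >= threshold).astype(int)
    # Merge runs of adjacent positive bins into events [t, t + N*dt],
    # then expand each event by the temporal buffer on both sides.
    events = []
    i, n = 0, len(bins)
    while i < n:
        if bins[i] == 1:
            j = i
            while j < n and bins[j] == 1:
                j += 1
            events.append((i * dt - buffer_s, j * dt + buffer_s))
            i = j
        else:
            i += 1
    return events
```

With the 0.5-s step, a run of N positive bins starting at bin i yields an event [i∆t − 1.0 s, (i + N)∆t + 1.0 s], i.e., the buffered interval [t − 2∆t, t + (N + 2)∆t] of Sec. II G.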
FIG. 7. (Color online) Representative 3-s spectrogram samples (frequency range 0–450 Hz), labeled by their ID and SNR (in dB). First column, true positives: B3121 (SNR=21.2), B2225 (SNR=5.9), B1307 (SNR=7.5); second column, true negatives: A1461 (SNR=2.4), A12 (SNR=11.8), B1484 (SNR=5.2); third column, false positives: A111 (SNR=−1.8), A134 (SNR=4.5), B1090 (SNR=3.9); fourth column, false negatives: B205 (SNR=4.3), A1858 (SNR=7.4), B2582 (SNR=2.9). For each category, three examples are given reflecting different levels of certainty and difficulty as perceived by the second expert: (a) certain and easy; (b) certain, but difficult; (c) uncertain.
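The SNR values quoted for the samples in Fig. 7 follow the estimate of Sec. II C: the median spectrogram level along the detected call trace (widened by three frequency bins above and below) minus the median level over the 80–200 Hz band in the adjacent 0.5-s windows. A rough sketch, assuming the trace and window indices have already been computed; all names and the [time, frequency] orientation of the spectrogram are our assumptions, not taken from the authors' code.

```python
import numpy as np

def estimate_snr(X, trace, call_cols, noise_cols, band_rows):
    """Sketch of the Sec. II C SNR estimate (steps 4-5).

    X          : 2-D spectrogram in dB, indexed as [time, frequency]
    trace      : frequency-bin index of the call trace in each call time slice
    call_cols  : time-slice indices spanning the call
    noise_cols : time-slice indices of the adjacent 0.5-s windows
    band_rows  : frequency-bin indices covering the 80-200 Hz band
    """
    # Step 4: median of X along the trace, including the three bins
    # immediately above and below (finite "width" of the upcall).
    signal_vals = []
    for t, f in zip(call_cols, trace):
        lo, hi = max(f - 3, 0), min(f + 4, X.shape[1])
        signal_vals.extend(X[t, lo:hi])
    signal_level = np.median(signal_vals)
    # Step 5: median of X in the adjacent windows over the 80-200 Hz band.
    noise_level = np.median(X[np.ix_(noise_cols, band_rows)])
    return signal_level - noise_level
```

Because both terms are medians of dB levels, the estimate is fairly robust to transient noise, but, as stressed in Sec. II C, it remains sensitive to non-uniform stationary noise and propagation-distorted calls.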

    Finally, we note that a related work entitled “Deep                               siders acoustic data from several locations off the east
neural networks for automated detection of marine mam-                                coast of the US, whereas our work considers data from
mal species” (Shiu et al., 2020) has been published during                            the Gulf of St. Lawrence. Where (Shiu et al., 2020) pro-
the review of our paper. We would like to highlight the                               vides a comparison of several network architectures, our
complementary nature of the two studies. While similar                                work provides insights into the importance of dataset size
deployment techniques (surface buoys, bottom moorings)                                and variance. Moreover, our work provides insights into
and acoustic recorders were used, (Shiu et al., 2020) con-                            recall variability with SNR. Although a direct compar-

J. Acoust. Soc. Am. / 3 March 2020                      JASA/Performance of a Deep Neural Network at Detecting North Atlantic Right Whale Upcalls   9
1.0                                               1.0
                                                                                                                      (a)

                                                                                               FPR (hour−1 )
                     0.8                                               0.8
                                                                                                               101

                                                                              Precision, P
     Recall, R

                     0.6                                               0.6

                     0.4                                               0.4                                     100
                                 R
                     0.2                                               0.2
                                 P                             (a)
FIG. 8. (Color online) Detection performance on the continuous test data (dataset B∗) in terms of recall (R), precision (P), and false-positive rate (FPR). The lines show the average performance while the shaded bands show the 10th and 90th percentiles. (a) R and P versus the adopted detection threshold. (b) P-R and FPR-R curves.
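The three quantities in this caption can be obtained directly from detection counts and the recording duration. The sketch below is illustrative only; the counts and the `detection_metrics` helper are hypothetical and not part of the paper's evaluation code.

```python
def detection_metrics(tp, fp, fn, hours):
    """Compute recall, precision, and false-positive rate.

    tp: true positives (upcalls correctly detected)
    fp: false positives (detections not matching any annotated upcall)
    fn: false negatives (upcalls missed by the detector)
    hours: duration of the continuous test data, in hours
    """
    recall = tp / (tp + fn)      # fraction of true upcalls detected
    precision = tp / (tp + fp)   # fraction of detections that are real
    fpr = fp / hours             # false positives per hour of recording
    return recall, precision, fpr

# Toy example: 87 of 100 upcalls found, 10 false alarms in 5 hours
r, p, fpr = detection_metrics(tp=87, fp=10, fn=13, hours=5.0)
# r = 0.87, p = 87/97 ≈ 0.90, fpr = 2.0 per hour
```

Sweeping the detection threshold and recomputing these quantities at each setting traces out the P-R and FPR-R curves shown in panel (b).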

ison of the detection performances achieved in the two studies cannot be made since different test datasets were used, we note that the recall obtained by Shiu et al. (2020) on continuous test data is somewhat higher than the recall obtained in our study (95% vs. 87% for a false-positive rate of 20 h⁻¹). However, the difference is within the range of variability that could be explained by differences in SNR in the test data.

FIG. 9. (Color online) Detection performance of the “average” model on the continuous test data (dataset B∗) for the upcalls meeting the criterion SNR > SNRmin. The performance is shown in terms of recall (R), precision (P), and false-positive rate (FPR) for the five cut-off values SNRmin = 0.0, 2.0, 4.0, 8.0, 12.0. (a) FPR vs. R; (b) P vs. R; (c) Number of upcalls vs. SNRmin.
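The SNR-stratified evaluation in Fig. 9 amounts to restricting the recall computation to the annotated upcalls whose SNR exceeds the chosen cut-off. A minimal sketch of this selection, using a hypothetical list of (SNR, detected) pairs rather than the study's actual data structures:

```python
def recall_above_snr(upcalls, snr_min):
    """Recall over the subset of upcalls with SNR > snr_min.

    upcalls: list of (snr, detected) pairs, one per annotated upcall,
             where detected is True if the detector found the call.
    """
    subset = [detected for snr, detected in upcalls if snr > snr_min]
    if not subset:
        return None  # no upcalls meet the criterion
    return sum(subset) / len(subset)

# Toy example: low-SNR calls are missed more often than loud ones
calls = [(1.0, False), (3.0, True), (6.0, True), (10.0, False), (15.0, True)]
recall_above_snr(calls, snr_min=0.0)  # 3/5 = 0.6
recall_above_snr(calls, snr_min=4.0)  # 2/3; recall rises as quiet calls drop out
```

Raising the cut-off shrinks the evaluation set, which is why the number of retained upcalls is reported alongside the metrics in panel (c).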

V. CONCLUSION

    In summary, we have demonstrated that DNNs can be trained to recognize NARW upcalls in acoustic recordings made with different acoustic equipment and deployment types, and hence differing significantly in terms of signal amplitude and noise background. By training a DNN on a dataset comprising about 4,000 samples of NARW upcalls and an approximately equal number of negative samples, we achieved recall and precision of 90% on a test dataset containing about 700 upcalls and a similar number of negatives. The DNN was observed to benefit from being trained on data with increased variance, suggesting that improved performance could be achieved by further expanding the variance of the training dataset. Using the DNN classifier, we implemented a simple detection algorithm, which exhibited good performance on continuous test data, achieving a recall of 80% while maintaining a precision above 90%. It would be interesting to explore still more sophisticated machine-learning approaches, most notably approaches that consider a wider temporal context, as done by the human experts, but this is beyond the scope of the present study and is left for future work. These results highlight the potential of DNNs for solving sound detection and classification tasks in underwater acoustics and motivate a community effort towards building larger and improved training datasets, especially for deployments with interfering noise events, which present more challenging acoustic conditions.

ACKNOWLEDGMENTS

    We would like to thank C. Hilliard for proofreading the manuscript. We are grateful to G. E. Davis, H. Johnson, C. T. Taggart, C. Evers, and J. Theriault for numerous discussions on NARW bioacoustics that supported and guided the development of the neural-network classifier. We thank Fisheries and Oceans Canada and the Natural Sciences and Engineering Research Council (Discovery Grant to YS) for the collection and preparation of the datasets. The NARW detection algorithm is a contribution of the Canadian Foundation for Innovation MERIDIAN cyberinfrastructure (https://meridian.cs.dal.ca/). This work was presented at the fifth International Meeting on The Effects of Noise on Aquatic Life held in Den Haag, July 2019.

APPENDIX: SUPPLEMENTARY MATERIAL

    Upon manuscript acceptance we will provide the code necessary to reproduce the results presented in the paper, including initialization and training of the neural network, prediction on test data, and computation of test metrics, along with all necessary training and test data. We will also provide more spectrogram samples (including expert annotation, model output, and SNR) and a complete diagram of the neural network architecture.

1 https://soi.st-andrews.ac.uk/static/soi/dclde2013/documents/WorkshopDataset2013.pdf
2 SNR enhancement can have two effects: it may allow extracting signals deeply embedded in noise, which cannot be seen in the raw spectrograms, but it may also generate artifacts that mimic real signals.

Aulanier, F., Simard, Y., Roy, N., Gervaise, C., and Bandet, M. (2016). "Spatio-temporal exposure of blue whale habitats to shipping noise in St. Lawrence system," DFO Can. Sci. Advis. Sec. Res. Doc. 2016/090, vi + 26 pp.
Bahoura, M., and Simard, Y. (2010). "Blue whale calls classification using short-time Fourier and wavelet packet transforms and artificial neural network," Digit. Signal Process. 20(4), 1256–1263, doi: 10.1016/j.dsp.2009.10.024.
Baumgartner, M. F., and Mussoline, S. E. (2011). "A generalized baleen whale call detection and classification system," J. Acoust. Soc. Am. 129, 2889–2902.
Baumgartner, M. F., Bonnell, J., Van Parijs, S. M., Corkeron, P. J., Hotchkin, C., Ball, K., Pelletier, L.-P., Partan, J., Peters, D., Kemp, J., Pietro, J., Newhall, K., Stokes, A., Cole, T. V. N., Quintana, E., and Kraus, S. D. (2019). "Persistent near real-time passive acoustic monitoring for baleen whales from a moored buoy: System description and evaluation," Meth. Ecol. Evol. 10, 1476–1489.
Baumgartner, M. F., Fratantoni, D. M., Hurst, T. P., Brown, M. W., Cole, T. V. N., Van Parijs, S. M., and Johnson, M. (2013). "Real-time reporting of baleen whale passive acoustic detections from ocean gliders," J. Acoust. Soc. Am. 134, 1814–1823.
Bergler, C., Schröter, H., Cheng, R. X., Barth, V., Weber, M., Nöth, E., Hofer, H., and Maier, A. (2019). "ORCA-SPOT: An automatic killer whale sound detection toolkit using deep learning," Sci. Rep. 9(1), 10997, doi: 10.1038/s41598-019-47335-w.
Bermant, P. C., Bronstein, M. M., Wood, R. J., Gero, S., and Gruber, D. F. (2019). "Deep machine learning techniques for the detection and classification of sperm whale bioacoustics," Sci. Rep. 9(1), 12588, doi: 10.1038/s41598-019-48909-4.
COSEWIC (2013). "COSEWIC assessment and status report on the North Atlantic right whale Eubalaena glacialis in Canada," Committee on the Status of Endangered Wildlife in Canada, Ottawa, xi + 58 pp.
Davis, G. E., Baumgartner, M. F., Bonnell, J. M., Bell, J., Berchok, C., Bort Thornton, J., Brault, S., Buchanan, G., Charif, R. A., Cholewiak, D., Clark, C. W., Corkeron, P., Delarue, J., Dudzinski, K., Hatch, L., Hildebrand, J., Hodge, L., Klinck, H., Kraus, S., Martin, B., Mellinger, D. K., Moors-Murphy, H., Nieukirk, S., Nowacek, D. P., Parks, S., Read, A. J., Rice, A. N., Risch, D., Širović, A., Soldevilla, M., Stafford, K., Stanistreet, J. E., Summers, E., Todd, S., Warde, A., and Van Parijs, S. M. (2017). "Long-term passive acoustic recordings track the changing distribution of North Atlantic right whales (Eubalaena glacialis) from 2004 to 2014," Sci. Rep. 7, 13460.
DCLDE (2013). Unpublished results, https://soi.st-andrews.ac.uk/dclde2013/.
DFO (2018). "Science advice on timing of the mandatory slowdown zone for shipping traffic in the Gulf of St. Lawrence to protect the North Atlantic right whale," DFO Can. Sci. Advis. Sec. Sci. Resp. 2017/042, 16 pp.
Gervaise, C., Simard, Y., Aulanier, F., and Roy, N. (2019a). "Optimal passive acoustics systems for real-time detection and localization of North Atlantic right whales in their feeding ground off Gaspé in the Gulf of St. Lawrence," Can. Tech. Rep. Fish. Aquat. Sci. 3346, ix + 57 pp.
Gervaise, C., Simard, Y., Aulanier, F., and Roy, N. (2019b). "Performance study of passive acoustic systems for detecting North Atlantic right whales in seaways: The Honguedo Strait in the Gulf of St. Lawrence," Can. Tech. Rep. Fish. Aquat. Sci. 3346, ix + 53 pp.
Gillespie, D. (2004). "Detection and classification of right whale calls using an 'edge' detector operating on a smoothed spectrogram," Can. Acoust. 32, 39–47.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning (MIT Press).
Halkias, X. C., Paris, S., and Glotin, H. (2013). "Classification of mysticete sounds using machine learning techniques," J. Acoust. Soc. Am. 134(5), 3496–3505, doi: 10.1121/1.4821203.
Hayes, S. A., Josephson, E., Maze-Foley, K., Rosel, P., Byrd, B., Chavez-Rosales, S., Cole, T., Engleby, L., Garrison, L., Hatch, J. et al. (2018). "US Atlantic and Gulf of Mexico marine mammal stock assessments — 2017 (second edition)," NOAA Tech. Memo. NMFS NE-245.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, doi: 10.1109/ICCV.2015.123.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag. 29(6), 82–97, doi: 10.1109/MSP.2012.2205597.
Ioffe, S., and Szegedy, C. (2015). "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167.
Kingma, D. P., and Ba, J. (2014). "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980.
Ledoit, O., and Wolf, M. (2004). "Honey, I shrunk the sample covariance matrix," J. Portf. Manag. 30(4), 110–119, doi: 10.3905/jpm.2004.110.
Lin, M., Chen, Q., and Yan, S. (2013). "Network in network," arXiv preprint arXiv:1312.4400.
Malfante, M., Mohammed, O., Gervaise, C., Dalla Mura, M., and Mars, J. I. (2018). "Use of deep features for the automatic classification of fish sounds," in OCEANS'18 MTS/IEEE, Kobe, Japan.
Martinez, A. M., and Kak, A. C. (2001). "PCA versus LDA," IEEE T. Pattern Anal. 23(2), 228–233, doi: 10.1109/34.908974.
McQuay, C., Sattar, F., and Driessen, P. F. (2017). "Deep learning for hydrophone big data," in 2017 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pp. 1–6, doi: 10.1109/PACRIM.2017.8121894.
Mellinger, D. K. (2004). "A comparison of methods for detecting right whale calls," Can. Acoust. 32, 55–65.
Mouy, X., Bahoura, M., and Simard, Y. (2009). "Automatic recognition of fin and blue whale calls for real-time monitoring in the St. Lawrence," J. Acoust. Soc. Am. 126.
Nair, V., and Hinton, G. E. (2010). "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
Pace, R. M., Corkeron, P. J., and Kraus, S. D. (2017). "State-space mark-recapture estimates reveal a recent decline in abundance of North Atlantic right whales," Ecol. Evol. 7, 8730–8741.
Pettis, H. M., Pace, R. M. I., and Hamilton, P. K. (2019). "North Atlantic Right Whale Consortium 2019 annual report card," Report to the North Atlantic Right Whale Consortium, 19 pp.
Potter, J. R., Mellinger, D. K., and Clark, C. W. (1994). "Marine mammal call discrimination using artificial neural networks," J. Acoust. Soc. Am. 96, 1255–1262, doi: 10.1121/1.410274.
Shiu, Y., Palmer, K. J., Roch, M. A., Fleishman, E., Liu, X., Nosal, E.-M., Helble, T., Cholewiak, D., Gillespie, D., and Klinck, H. (2020). "Deep neural networks for automated detection of marine mammal species," Sci. Rep. 10, 607.
Simard, Y., Roy, N., Giard, S., and Aulanier, F. (2019). "North Atlantic right whale shift to the Gulf of St. Lawrence in 2015 as monitored by long-term passive acoustics," Endang. Sp. Res. 40, 271–284, doi: 10.3354/esr01005.
Simard, Y., Roy, N., Giard, S., and Yayla, M. (2014). "Canadian year-round shipping traffic atlas for 2013: Volume 1, East Coast marine waters," Can. Tech. Rep. Fish. Aquat. Sci. 3091(Vol. 1)E, xviii + 327 pp.
Thomas, M., Martin, B., Kowarski, K., Gaudet, B., and Matwin, S. (2019). "Marine mammal species classification using convolutional neural networks and a novel acoustic representation," arXiv preprint arXiv:1907.13188.
Urazghildiiev, I. R., and Clark, C. W. (2006). "Acoustic detection of North Atlantic right whale contact calls using the generalized likelihood ratio test," J. Acoust. Soc. Am. 120, 1956–1963.
Urazghildiiev, I. R., and Clark, C. W. (2007). "Acoustic detection of North Atlantic right whale contact calls using spectrogram-based statistics," J. Acoust. Soc. Am. 122, 769–776.
Urazghildiiev, I. R., Clark, C. W., and Krein, T. P. (2008). "Detection and recognition of North Atlantic right whale contact calls in the presence of ambient noise," Can. Acoust. 36, 111–117.

J. Acoust. Soc. Am. / 3 March 2020 JASA/Performance of a Deep Neural Network at Detecting North Atlantic Right Whale Upcalls