SLIMIPL: LANGUAGE-MODEL-FREE ITERATIVE PSEUDO-LABELING


A Preprint

Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert
Facebook AI Research, Menlo Park, USA & Paris, France

                                                                   {antares,qiantong,jacobkahn,gab,locronan}@fb.com

                                                                                              October 23, 2020

Abstract
Recent results in end-to-end ASR have demonstrated the efficacy of simple pseudo-labeling for semi-supervised models trained with both Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (seq2seq) losses. Iterative Pseudo-Labeling (IPL), which continuously trains a single model using pseudo-labels iteratively re-generated as the model learns, has been shown to further increase performance in ASR. We improve upon the IPL algorithm: as the model learns, we propose to iteratively re-generate transcriptions with hard label assignments (the most probable tokens), that is, without a language model. We call this approach Language-Model-Free IPL (slimIPL) and give a resultant training setup for CTC and seq2seq models. At inference, our experiments show that decoding with a strong language model is more beneficial with slimIPL than with IPL, as IPL exhibits some language model over-fitting issues. Compared to prior work on semi-supervised and unsupervised approaches, slimIPL not only simplifies the training process, but also achieves competitive and state-of-the-art results on LibriSpeech test sets in both standard and low-resource settings.

                                         Index Terms: deep learning, semi-supervised learning, pseudo-labeling, self-training, speech recognition

                                         1    Introduction
Recent work in deep learning has shifted towards methods which can efficiently learn from large amounts of unlabeled data to improve performance and decrease the cost of acquiring labels. Semi-supervised learning [1] combines information from both labeled and unlabeled data; the amount of unlabeled data typically exceeds the amount of labeled data. In automatic speech recognition (ASR), while many of the recent semi-supervised methods outperform a supervised baseline in a low-resource setting, a gap between semi- and fully-supervised training remains. Further, not all of the approaches are equally scalable as the amount of labeled and unlabeled data increases, as is the case in recent setups such as the LibriLight benchmark [2].
Some of the earliest and simplest semi-supervised approaches use self-training [3]. Self-training employs a base model trained with labeled data, which acts as a "teacher" and is used to label unlabeled data (the resulting labels are referred to as "pseudo-labels", PL). A "student" model is then trained (typically from scratch) with both labeled and pseudo-labeled data to yield a final model. For competitive results in ASR, a language model has been a key component of pseudo-labeling: it is usually combined with the acoustic model via beam-search decoding [4, 5] or through shallow fusion [6, 7, 8] to generate pseudo-labels. However, it has been observed that acoustic models then tend to over-fit to the text training set of the language model used for pseudo-labeling [5, 8].
In this work, we show that competitive pseudo-labeling approaches need to rely neither on beam-search decoding nor on a language model. Instead, pseudo-labels are generated by picking hard labels, the tokens with the highest acoustic model probability. Our approach is based on the recently-proposed iterative pseudo-labeling algorithm (IPL) [5]: we continuously train a single model using iteratively re-generated pseudo-labels as the model learns. We call our algorithm language-model-free IPL (slimIPL) and give its overview in Section 4. We demonstrate in Section 5 that this approach
is effective across different loss functions and token sets in both standard- and low-resource settings. Using the LibriLight benchmark, we also show that slimIPL is easily scaled to a large amount of unlabeled audio. Ablation experiments in Section 5.6 show that slimIPL overcomes the language model over-fitting issue inherent to the IPL algorithm, and also demonstrate that slimIPL is more stable when training seq2seq models.

2   Related Work

Self-training methods [3] still attract researchers: extensions to self-training are numerous and include (a) selecting particular subsets of pseudo-labeled data for student training, (b) reiterating the PL procedure several times to progressively improve the teacher model, (c) introducing different types of noise for student model training, and (d) sampling techniques and schedules over labeled and pseudo-labeled datasets. Many recent works on self-training propose and validate these extensions, including those in computer vision [9, 10], natural language processing [11, 12, 13, 14, 15, 16, 17], ASR [18, 19, 20, 7, 8], and speech translation [21].
An extension to the simple pseudo-labeling method consists of continuously training a single model [22]. At the beginning of training, the model is trained only on labeled data, after which training continues with data selected jointly from both labeled and unlabeled datasets. Pseudo-label re-generation occurs after some number of iterations, and a supervised loss is computed on both labeled and pseudo-labeled data for each batch. An additional parameter determines the contribution of pseudo-labeled data to the overall loss. The effectiveness of this iterative training of a single model has been validated on tasks in vision [23], natural language processing [15], and ASR [24, 5].
In addition to self-training, many other semi-supervised algorithms have been proposed in a variety of domains:

       • computer vision: graph-based methods [25, 26], generative modelling [27, 28], consistency-based meth-
         ods [29, 30, 31, 32], and contrastive methods [33, 34, 35, 36];
       • machine translation (MT): integration of a language model trained on monolingual data [37, 38, 39], back-
         translation [40, 41, 42], synthetic data usage [43], and web-scale bitext mining [44];
       • automatic speech recognition (ASR): representation learning [45, 2, 46, 47], local prior matching [4],
         adversarial training [48], back-translation [49] and others [50, 51, 52, 53, 54].

Below, we give an overview of the approaches in ASR that are most recent and relevant to our work.
IPL The iterative pseudo-labeling (IPL) algorithm [5] follows prior work [22]: it uses data augmentation on both labeled and unlabeled data, and continuously trains a single model with pseudo-labels iteratively re-generated by beam-search decoding with a language model (LM) as the model learns. Compared to self-training where a student network is trained from scratch each time [19], the IPL algorithm improves efficiency and performance. Prior work on IPL was applied only to models trained with word-pieces and Connectionist Temporal Classification (CTC) [55].
Noisy self-training Another recent work on self-training [8] performs five iterations of student network training, each time from scratch, with pseudo-labels generated by a teacher network. It uses a Listen, Attend and Spell (LAS) [56]-style acoustic model (AM). In this approach, as in ours, data augmentation is used for both labeled and unlabeled data. As is the case with IPL, shallow fusion with a language model is used in the decoding procedure to generate pseudo-labels, while slimIPL does not use a language model. Further, this approach filters teacher network predictions based on a transcription score, whereas slimIPL's filtering criterion is based only on data statistics.
Self-training The work in [24] is the closest to ours: the authors also continuously train a model, re-generating pseudo-labels with hard labels after each iteration. That work focuses on studying the impact of noise, and considers only CTC-trained models on the Wall Street Journal dataset. Both SpecAugment [57] and speed perturbation are applied to labeled and unlabeled data during training in [24], whereas slimIPL uses only SpecAugment. That work also lacks a study of over-fitting to the LM and a comparison between hard labels and beam-search decoding.
Wav2vec 2.0 Recent work on unsupervised pre-training [58] shows a significant boost in performance for low-resource settings. wav2vec 2.0 training has two steps: first, pre-training on unlabeled data by masking the input audio in the latent space and solving a contrastive learning task [59]; second, fine-tuning the model using labeled audio only. wav2vec 2.0 learns from the raw waveform, whereas slimIPL uses log-mel filterbanks.

3   Pseudo-Labeling

Let L = {x_i, y_i} be a labeled dataset and U = {x_j} a large unlabeled dataset. We consider a semi-supervised pseudo-labeling approach as outlined in Section 1, where the acoustic model is continuously trained on a combination of a labeled set and an iteratively re-generated pseudo-labeled set. Training minimizes the following loss function:
$$\mathcal{L}(\theta) = \mathcal{L}_L(\theta) + \lambda \mathcal{L}_U(\theta), \qquad \lambda \in \mathbb{R}^{+}, \tag{1}$$
where θ are the parameters of the acoustic model, and λ is a tunable parameter controlling the importance of unlabeled data. In Eq. (1) the losses for labeled data L_L and for unlabeled data L_U are defined as:
$$\mathcal{L}_L(\theta) = -\mathbb{E}_{x,y \sim p(x,y)} \log p_\theta(y \mid x), \qquad (x, y) \in L, \tag{2}$$
where p(x, y) is the empirical data distribution of samples from L and p_θ(y|x) is the conditional distribution defined by the acoustic model, and
$$\mathcal{L}_U(\theta) = -\mathbb{E}_{x \sim p(x)} \log p_\theta(\hat{y} \mid x), \qquad x \in U, \tag{3}$$
where p(x) is the empirical data distribution of samples from U, and ŷ are the pseudo-labels for utterance x ∈ U.
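To make Eq. (1)-(3) concrete, the sketch below evaluates the combined loss for a CTC-trained model on one labeled and one pseudo-labeled batch. The batch layout, the model call signature, and the default value of λ are our own placeholders, not the paper's training code.

```python
import torch

def semi_supervised_loss(model, ctc_loss, labeled_batch, pseudo_batch,
                         lam: float = 1.0) -> torch.Tensor:
    """Eq. (1): L = L_L + lambda * L_U, sketched with a CTC criterion."""
    x_l, y_l, x_l_lens, y_l_lens = labeled_batch   # ground-truth transcriptions
    x_u, y_u, x_u_lens, y_u_lens = pseudo_batch    # pseudo-labels generated beforehand

    # model(x) is assumed to return (B, T, C) logits; CTCLoss expects (T, B, C) log-probs.
    log_probs_l = model(x_l).log_softmax(-1).transpose(0, 1)
    log_probs_u = model(x_u).log_softmax(-1).transpose(0, 1)

    loss_l = ctc_loss(log_probs_l, y_l, x_l_lens, y_l_lens)   # L_L, Eq. (2)
    loss_u = ctc_loss(log_probs_u, y_u, x_u_lens, y_u_lens)   # L_U, Eq. (3)
    return loss_l + lam * loss_u                              # Eq. (1)
```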
One key difference among existing pseudo-labeling approaches is how the label assignments ŷ are obtained for unlabeled data x ∈ U. In the general literature, pseudo-labeling refers to the hard-label generation
$$\hat{y} = \operatorname*{argmax}_{y} \log p_\theta(y \mid x). \tag{4}$$

In the machine translation and automatic speech recognition domains, the model p_θ(y|x) is often a sequence-to-sequence model, and the solution of Eq. (4) may be approximated with a beam-search decoding algorithm [40, 15, 7, 6, 5, 8, 21, 24]. In fact, most recent work on speech recognition relies on a language model p_lm(y) to generate the pseudo-labels, and attempts to find instead:
$$\hat{y} = \operatorname*{argmax}_{y} \left[ \log p_\theta(y \mid x) + \alpha \log p_{lm}(y) \right], \qquad x \in U, \tag{5}$$

where α is a hyper-parameter controlling the amount of language model regularization. More details on decoding can be found in Section 5.3.
Pseudo-labeling is also popular in computer vision [22, 60]. Variants exist, such as "soft labels" ŷ = p_θ(y|x) and variations on soft labeling [31, 61]. Sampling [62, 41] is another way to generate pseudo-labels ŷ.
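For contrast with the hard labels of Eq. (4), a soft-label variant would train the student against the teacher's full output distribution. The sketch below is our illustration of that variant only; slimIPL itself uses hard labels.

```python
import torch

def soft_label_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against soft targets y_hat = p_teacher(y|x) instead of argmax labels."""
    soft_targets = teacher_logits.detach().softmax(dim=-1)          # teacher distribution
    log_student = student_logits.log_softmax(dim=-1)                # student log-probs
    return -(soft_targets * log_student).sum(dim=-1).mean()
```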

4     Language-Model-Free Iterative Pseudo-Labeling
In the original IPL training approach [5], pseudo-labels are generated with a beam-search decoder leveraging a language model, approximating the solution suggested by Eq. (5). While the main motivation is to transfer the knowledge of the language model into the acoustic model, the two main drawbacks of this approach are that (i) generating pseudo-labels is computationally intensive, and (ii) it is easy to over-fit to the language model knowledge. Regularization tricks are proposed in [5] to overcome (ii), such that one can still benefit from the language model when decoding at evaluation time.

Algorithm 1: slimIPL
  Data: Labeled data L = {x_i, y_i}, unlabeled data U = {x_j}
  Result: Acoustic model p_θ
  Initialize p_θ by training on the labeled data L only;
  repeat
      1. Draw a subset of unpaired data S ⊆ U
      2. Apply p_θ to the subset S and generate L̂ = {(x, ŷ) | x ∈ S, ŷ = argmax_y p_θ(y|x)}
      3. (For seq2seq only) Filter the subset L̂ by removing
            a. samples with n-gram repetitions
            b. outliers for the "beak band" in the x-ŷ sizes plane, (x, ŷ) ∈ L̂
      4. Fine-tune p_θ on L ∪ L̂ with data augmentation
  until convergence or maximum iterations are reached;
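The loop in Algorithm 1 can be summarized in a few lines of code. In this sketch, `fine_tune`, `pseudo_label`, and `filter_fn` are caller-supplied placeholders for the training, hard-label generation, and seq2seq filtering steps; the subset fraction and round count are arbitrary illustrative values, not the paper's settings.

```python
import random

def slimipl(model, labeled, unlabeled, fine_tune, pseudo_label, filter_fn,
            rounds=25, subset_frac=0.25):
    """Sketch of Algorithm 1: iterative pseudo-labeling without a language model."""
    fine_tune(model, labeled)                                  # bootstrap on labeled data only
    for _ in range(rounds):
        subset = random.sample(unlabeled, int(subset_frac * len(unlabeled)))
        pseudo = [(x, pseudo_label(model, x)) for x in subset]  # hard labels, no LM
        pseudo = filter_fn(pseudo)                              # beak-band + n-gram filtering (seq2seq)
        fine_tune(model, list(labeled) + pseudo)                # data augmentation applied inside
    return model
```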

We demonstrate in this paper that pseudo-labels do not need to rely on any language model information. Our approach (as shown in Algorithm 1) follows the IPL algorithm, but pseudo-labels are simply generated by considering the top prediction of the acoustic model (see Eq. (4)). For CTC-based acoustic models this corresponds exactly to choosing the most likely token at each time step. For seq2seq models, we approximate Eq. (4) with the greedy solution of choosing the most likely token at each time step of the seq2seq decoder. In addition, a regularization scheme is implemented via data augmentation over the input (acoustic) data, both for labeled and unlabeled samples. In our experiments, we only considered SpecAugment [57] for data augmentation.
In our study, we show that this approach (dubbed slimIPL) works for CTC and sequence-to-sequence (seq2seq) models, targeting either letters or word-pieces.
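For a CTC model, the hard-label generation of Eq. (4) reduces to a frame-wise argmax followed by the usual collapse of repeated tokens and removal of blanks. A minimal sketch follows; the blank index and tensor layout are our assumptions.

```python
import torch

def ctc_greedy_labels(log_probs: torch.Tensor, blank: int = 0) -> list:
    """Hard-label CTC decoding: argmax per frame, collapse repeats, remove blanks.
    log_probs: (T, C) tensor of per-frame log-probabilities."""
    best = log_probs.argmax(dim=-1).tolist()    # most likely token at each time step
    out, prev = [], None
    for tok in best:
        if tok != prev and tok != blank:        # collapse repeats, drop the blank symbol
            out.append(tok)
        prev = tok
    return out
```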

4.1   Seq2seq Filtering

In text generation tasks, seq2seq model decoders tend to generate short transcriptions and also suffer from looping
issues (generating repeated n-grams) [63, 7, 6]. Compared to the regular IPL approach, the problem is less pronounced
with slimIPL. We speculate that language model-based decoding for seq2seq models is a rather fragile process.
We nevertheless found in our experiments that filtering generated transcriptions was still valuable for slimIPL (see
Section 5.6.2).
Recent works [6, 8] in ASR introduce scoring functions to evaluate model confidence for generated transcriptions.
These scoring functions estimate the dependence between the acoustic model scores and token transcription length over
a validation set, and filter out samples which are too far from the expected behavior.
Instead of relying on model predictions, we propose a filtering technique based only on input data statistics, assuming a strong correlation between audio duration and the length of the corresponding transcription. Figure 1 exhibits this correlation for the LibriSpeech train and validation sets (details on data are in Section 5.1). As most of the labeled data falls into a "beak band" region in the (audio duration, transcription length) plane, we filter out generated samples falling outside this estimated region:

Figure 1: Dependence between audio duration (ms) and token transcription length (including spaces between words) for LS-100 with validation sets (left) and LS-960 with validation sets (right).

Figure 2: Beak band regions for LS-100 with validation sets (left) and LS-960 with validation sets (right). Samples with pseudo-labels outside the red lines will be filtered during training, while ones between the red lines (grey zone) will be used.


    1. Consider the ratio r_i = l_{x_i} / l_{y_i}, where l_{x_i} is the i-th sample duration and l_{y_i} is the i-th sample token transcription length (including spaces between words).
    2. Take the 1% (r_down) and 99% (r_up) percentiles of the empirical distribution of {r_i}. These percentile values define the beak band in the l_x-l_y plane (see Figure 2): all samples with either r_i < r_down or r_i > r_up will be filtered out from further training.

As some n-gram looping issues in generated transcriptions are still observed after this filtering step, we also filter out a sample if its generated transcription contains a 5-gram occurring more than once or a 3-gram occurring more than twice. This filtering criterion was empirically tuned on validation set performance, and follows previous work [6]; a sketch of both filters is given below.
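The sketch below implements both filters. The percentile thresholds (1% and 99%) and the repetition limits (a 5-gram more than once, a 3-gram more than twice) are taken from the text above; the function names and data layout are our own.

```python
from collections import Counter
import numpy as np

def beak_band_thresholds(durations_ms, transcript_lens):
    """Estimate the beak band from labeled + validation data: r_i = duration / token length."""
    r = np.asarray(durations_ms, dtype=float) / np.asarray(transcript_lens, dtype=float)
    return np.percentile(r, 1), np.percentile(r, 99)

def has_ngram_loop(tokens, n, max_count):
    """True if some n-gram occurs more than max_count times in the transcription."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(c > max_count for c in grams.values())

def keep_pseudo_label(duration_ms, tokens, r_down, r_up):
    """Keep a pseudo-labeled sample only if it lies inside the beak band and has no looping."""
    if len(tokens) == 0:
        return False
    r = duration_ms / len(tokens)
    if r < r_down or r > r_up:
        return False
    return not (has_ngram_loop(tokens, 5, 1) or has_ngram_loop(tokens, 3, 2))
```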

5         Experiments
5.1        Data

All experiments are performed on the LibriSpeech dataset [64] (which contains 960 hours of training audio with paired transcriptions: the train-clean-100, train-clean-360, and train-other-500 parts), and on audio from LibriVox (54K hours of unlabeled audio) extracted as described in [2]. We consider three scenarios with different amounts of labeled / unlabeled data: LS-100 / LS-860, LS-100 / LV, and LS-960 / LV, which are defined in Table 1. The standard LibriSpeech validation sets (dev-clean and dev-other) were used to tune all hyper-parameters, as well as to select the best models. Test sets (test-clean and test-other) were used only to report final WER performance. We keep the original 16kHz sampling rate and compute log-mel filterbanks with 80 coefficients for a 25ms sliding window, strided by 10ms. All features are normalized to have zero mean and unit variance per input sequence before being fed into the acoustic model.
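The feature pipeline described above can be reproduced with standard tooling. The sketch below uses torchaudio; the FFT size and the exact normalization are our choices, since the paper only specifies the 80 mel bins, the 25ms window, and the 10ms stride.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,          # 25 ms window at 16 kHz
    win_length=400,
    hop_length=160,     # 10 ms stride
    n_mels=80,
)

def log_mel_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) at 16 kHz -> (num_frames, 80) normalized log-mel features."""
    feats = torch.log(mel(waveform) + 1e-6).squeeze(0).transpose(0, 1)  # (T, 80)
    return (feats - feats.mean()) / (feats.std() + 1e-6)                # per-sequence normalization
```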
Table 1: Different semi-supervised training scenarios.

Setting         | Labeled Data                                       | Unlabeled Data
LS-100 / LS-860 | train-clean-100                                    | train-clean-360, train-other-500
LS-100 / LV     | train-clean-100                                    | train-clean-360, train-other-500, LibriVox
LS-960 / LV     | train-clean-100, train-clean-360, train-other-500  | LibriVox

5.2        Acoustic Models

We consider both CTC [55] and seq2seq-based [65] models. Architectures follow exactly [7, 5, 66], where more details can be found. The encoder of our acoustic models is composed of a convolutional frontend (several 1-D convolutions with kernel width 3; kernel size 7 is used for models with stride 3) followed by 36 4-head Transformer blocks [67]. The self-attention dimension is D_tr = 768 and the feed-forward network (FFN) dimension is 3072 in each Transformer block. Depending on the experiment, our models have different strides, implemented in the convolution layers.
For CTC-trained models, the output of the encoder H_L^e is followed by a linear layer mapping to the output classes. For seq2seq models, we have an additional decoder, which is a stack of 6 Transformers with encoding dimension 256 and 4 attention heads. The probability distribution of the transcription is factorized as:
$$p(y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i \mid y_0, \dots, y_{i-1}, H_L^e), \tag{6}$$

where y0 is a special symbol indicating the beginning of the transcription. We use dropout after the convolutions. For
all Transformer layers (encoder and decoder – when present), we use dropout on the self-attention and on the FFN. We
also use layer drop [68], dropping entire layers at the FFN level.
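A sketch of the encoder configuration described above with standard PyTorch modules is given below. The convolutional frontend is collapsed into a single strided convolution and the layer-drop logic is simplified, so this is an illustration of the dimensions, not the wav2letter++ implementation (it also assumes a PyTorch version with batch-first Transformer layers).

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Sketch: 1-D conv frontend + 36 Transformer blocks (4 heads, dim 768, FFN 3072)."""
    def __init__(self, n_feats=80, d_model=768, n_layers=36, stride=3,
                 dropout=0.3, layerdrop=0.3):
        super().__init__()
        # Single strided convolution standing in for the paper's convolutional frontend.
        self.frontend = nn.Conv1d(n_feats, d_model, kernel_size=7, stride=stride, padding=3)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=3072,
                                       dropout=dropout, batch_first=True)
            for _ in range(n_layers)])
        self.layerdrop = layerdrop

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, T, n_feats)
        h = self.frontend(x.transpose(1, 2)).transpose(1, 2)    # (B, T', d_model)
        for layer in self.layers:
            # Layer drop: randomly skip entire Transformer blocks during training.
            if self.training and torch.rand(1).item() < self.layerdrop:
                continue
            h = layer(h)
        return h   # encoder output H_L^e, fed to a linear layer (CTC) or to the decoder (seq2seq)
```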

Tokens Two families of token sets are investigated: word-pieces and letters. We use 5k word-pieces [69, 70] generated with the SentencePiece toolkit (https://github.com/google/sentencepiece): for the LS-100 / LS-860 and LS-100 / LV scenarios, word-pieces are constructed from
the LS-100 transcriptions; for the LS-960 / LV scenario the entire LibriSpeech training transcriptions are used to generate the word-piece set. The letter set consists of the 26 English alphabet letters, augmented with the apostrophe and a word boundary token.
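A word-piece model of this size can be trained with the SentencePiece Python API. The sketch below assumes the LS-100 transcriptions have been dumped to a text file; the path and the unigram model type are our choices, not specified in the paper.

```python
import sentencepiece as spm

# Train a 5k word-piece model on the LS-100 transcriptions (one utterance per line).
spm.SentencePieceTrainer.train(
    input="ls-100_transcriptions.txt",   # hypothetical path
    model_prefix="ls100_wp5k",
    vocab_size=5000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="ls100_wp5k.model")
print(sp.encode("the quick brown fox", out_type=str))   # tokenize a sample sentence
```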

Data augmentation during training is performed with SpecAugment [57]. We use two frequency masks with frequency mask parameter F = 30 and ten time masks with maximum time-mask ratio p = 0.1 and time mask parameter T = 50; time warping is not used.
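This masking policy maps onto standard SpecAugment operations. A sketch with torchaudio transforms follows; the per-mask loop and the way the p = 0.1 cap on total masked frames is enforced are our own simplifications.

```python
import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=30)  # F = 30
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=50)       # T = 50

def spec_augment(feats: torch.Tensor, n_freq=2, n_time=10, max_time_ratio=0.1) -> torch.Tensor:
    """feats: (..., n_mels, T). Apply 2 frequency masks and up to 10 time masks."""
    out = feats.clone()
    for _ in range(n_freq):
        out = freq_mask(out)
    max_total = int(max_time_ratio * feats.shape[-1])   # cap on total masked frames (p = 0.1)
    masked = 0
    for _ in range(n_time):
        if masked + 50 > max_total:                     # conservative worst-case accounting
            break
        out = time_mask(out)
        masked += 50
    return out
```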

Training For all experiments we use the Adagrad optimizer [71] and divide the learning rate by 2 each time the word error rate plateaus on the validation sets.
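A sketch of this optimization setup with standard PyTorch utilities is shown below; halving the learning rate corresponds to a decay factor of 0.5, while the initial learning rate and patience are placeholders.

```python
import torch

def make_optimizer(model: torch.nn.Module):
    """Adagrad with learning-rate halving when the validation WER plateaus (sketch)."""
    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.03)   # initial lr is a placeholder
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3)             # divide lr by 2 on plateau
    return optimizer, scheduler

# usage after each validation pass:
#   scheduler.step(dev_wer)
```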

Implementation All model architectures, as well as slimIPL, are implemented within the wav2letter++ framework (https://github.com/facebookresearch/wav2letter) [72].

5.3        Beam-search Decoding and Rescoring

In all our experimental results, we report word error rate (WER) without a language model (LM), as well as WER obtained with a one-pass beam-search decoder leveraging an LM. Following the notation introduced in Section 3, the beam-search decoder aims at maximizing:
$$\log p_\theta(\hat{y} \mid x) + \alpha \log p_{lm}(\hat{y}) + \beta |\hat{y}|,$$
where α and β are hyper-parameters to tune. We rely on the beam-search decoder from the wav2letter++ framework following [73, 74, 7]: the lexicon-based beam-search decoder with a word-level LM for CTC models and the lexicon-free beam-search decoder with a token-level LM for seq2seq models. The seq2seq beam-search decoder is stabilized by introducing an EOS penalty γ for hypotheses that have finished with an end-of-sentence token [7]. γ is tuned together with the other hyper-parameters and tends to prevent the decoder from stopping early. The LibriSpeech validation sets, dev-clean and dev-other, are used to optimize the beam-search decoder hyper-parameters through random search.
We also report WER obtained by rescoring the beam of hypotheses produced by our one-pass decoder. Rescoring is performed with a strong word-based Transformer LM, following the procedure described in [7].
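Conceptually, rescoring simply re-ranks the n-best list under the same combined score used by the decoder, but with the stronger LM. A minimal sketch follows; the n-best data layout and the external LM scoring are our assumptions.

```python
def rescore_nbest(nbest, alpha: float, beta: float):
    """Re-rank beam hypotheses: score = log p_am(y|x) + alpha * log p_lm(y) + beta * |y|.
    `nbest` is a list of (words, am_logprob, lm_logprob) triples; alpha and beta are
    tuned on the dev sets, and the Transformer LM scores are computed externally."""
    return max(nbest, key=lambda h: h[1] + alpha * h[2] + beta * len(h[0]))[0]
```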
We use open-sourced language models trained on the LibriSpeech LM corpus from [74, 7] to perform the beam-search decoding and rescoring: a 4-gram word-level LM, a 20-gram letter-level LM, and a word-level Transformer LM. Additionally, we train 6-gram word-piece-level LMs on the LibriSpeech LM corpus with the KenLM toolkit [75]. We apply pruning by removing 5-grams appearing once and 6-grams appearing once or twice. As word-pieces, we use either the set constructed from the LS-100 or from the LS-960 training transcriptions. Word-level perplexities of all language models used for the beam-search decoding and rescoring are listed in Table 2.

Table 2: Word-level perplexities of language models (for token-level language models, an upper bound on the word-level perplexity is computed). For the 6-gram word-piece-level language model marked with "*", tokens are constructed on the LS-100 training set.

Data      | word 4-gram | char 20-gram | wp 6-gram* | wp 6-gram | Transf.
dev-clean | 148.0       | 177          | 156.7      | 155.8     | 48.2
dev-other | 136.6       | 161          | 150.8      | 149.6     | 50.2

5.4        Supervised Baselines

We considered different strides for acoustic modeling, looking for the best configuration among strides 1, 2, 3, 4 for letter-based models and 2, 4, 8 for word-piece models. For both dropout and layer drop we use a value of 0.3 for models trained on LS-100 and 0.2 for models trained on LS-960. Performance in WER as well as the best stride configurations are reported in Table 3 for LS-100, and Table 4 for LS-960.
Our supervised baseline models trained on LS-100 clearly achieve new state-of-the-art results for both the seq2seq and CTC criteria (Table 3). The seq2seq model reaches a new state of the art at 16.78% WER on test-other without a language model. The CTC model achieves new state-of-the-art results on both test-clean, 4.21% WER, and test-other, 12.15% WER, with beam-search decoding and further rescoring.

Table 3: WER comparison of our supervised baselines on train-clean-100 with prior work.

Method           | Stride | Tokens   | Criterion | LM                      | dev-clean | dev-other | test-clean | test-other
RWTH [76]        | -      | -        | hybrid    | word 4-gram             | 5.0       | 19.5      | 5.8        | 18.6
RWTH [76]        | -      | -        | S2S       | -                       | 14.7      | 38.5      | 14.7       | 40.8
DeCoAR [47]      | -      | phonemes | CTC       | -                       | -         | -         | 6.10       | 17.43
Word-level [77]  | 80ms   | 5k wp    | CTC       | -                       | 12.4      | 27.7      | 12.8       | 28.7
Word-level [77]  | 80ms   | 5k wp    | CTC       | wp 6-gram               | 9.7       | 22.9      | 10.3       | 24.0
Word-level [77]  | 80ms   | 5k wp    | S2S       | -                       | 9.0       | 22.8      | 9.5        | 23.3
Word-level [77]  | 80ms   | 5k wp    | S2S       | wp 6-gram               | 8.3       | 21.2      | 9.2        | 22.0
Word-level [77]  | 80ms   | words    | CTC       | -                       | 8.0       | 21.0      | 7.7        | 21.4
Word-level [77]  | 80ms   | words    | CTC       | word 4-gram             | 6.3       | 19.1      | 6.8        | 19.4
Word-level [77]  | 80ms   | words    | S2S       | -                       | 7.2       | 21.2      | 8.6        | 21.9
Word-level [77]  | 80ms   | words    | S2S       | word 4-gram             | 7.3       | 19.5      | 8.0        | 20.4
Improved T/S [8] | -      | 16k wp   | S2S       | -                       | 5.3       | 16.5      | 5.5        | 16.9
Our              | 40ms   | 5k wp    | S2S       | -                       | 9.60      | 21.40     | 10.38      | 21.67
Our              | 40ms   | 5k wp    | S2S       | wp 6-gram               | 8.51      | 18.71     | 8.86       | 18.94
Our              | 40ms   | 5k wp    | S2S       | wp 6-gram + rescoring   | 7.63      | 16.91     | 8.27       | 17.08
Our              | 20ms   | letter   | S2S       | -                       | 6.22      | 16.56     | 6.43       | 16.78
Our              | 30ms   | letter   | CTC       | -                       | 6.32      | 17.24     | 6.57       | 17.75
Our              | 30ms   | letter   | CTC       | word 4-gram             | 4.35      | 12.78     | 4.68       | 13.42
Our              | 30ms   | letter   | CTC       | word 4-gram + rescoring | 3.32      | 10.76     | 3.74       | 11.31

Table 4: WER comparison of our supervised baselines on LibriSpeech with prior work.

Method            | Stride | AM tokens | Criterion   | LM                      | dev-clean | dev-other | test-clean | test-other
FullAttn T-T [78] | 30ms   | letters   | RNN-T       | -                       | -         | -         | 2.4        | 5.6
FullAttn T-T [78] | 30ms   | letters   | RNN-T       | word Transf.            | -         | -         | 2.0        | 4.6
ContexNet [79]    | 80ms   | 1k wp     | CNN-RNN-T   | -                       | -         | -         | 2.1        | 4.6
ContexNet [79]    | 80ms   | 1k wp     | CNN-RNN-T   | wp LSTM                 | -         | -         | 1.9        | 4.1
Conformer [80]    | 40ms   | 1k wp     | Conformer-T | -                       | -         | -         | 2.1        | 4.3
Conformer [80]    | 40ms   | 1k wp     | Conformer-T | wp LSTM                 | -         | -         | 1.9        | 3.9
wav2vec 2.0 [58]  | 20ms   | letters   | CTC         | -                       | 2.8       | 7.6       | 3.0        | 7.7
wav2vec 2.0 [58]  | 20ms   | letters   | CTC         | word Transf.            | 1.7       | 4.3       | 2.1        | 4.6
Our               | 40ms   | 5k wp     | S2S         | -                       | 2.31      | 5.52      | 2.74       | 5.79
Our               | 40ms   | 5k wp     | S2S         | wp 6-gram               | 2.08      | 4.84      | 2.39       | 4.98
Our               | 40ms   | 5k wp     | S2S         | wp 6-gram + rescoring   | 1.98      | 4.32      | 2.24       | 4.60
Our               | 20ms   | letters   | S2S         | -                       | 2.84      | 5.56      | 2.92       | 5.79
Our               | 20ms   | letters   | CTC         | -                       | 2.58      | 6.71      | 2.71       | 6.77
Our               | 20ms   | letters   | CTC         | word 4-gram             | 1.95      | 5.04      | 2.46       | 5.44
Our               | 20ms   | letters   | CTC         | word 4-gram + rescoring | 1.49      | 4.09      | 2.11       | 4.52



Table 5: Comparison of WER with other semi-supervised methods on the LS-100 / LS-860 setting.

Method           | Stride | AM tokens | Criterion | LM                      | dev-clean | dev-other | test-clean | test-other
IPL [5]          | 80ms   | 5k wp     | CTC       | -                       | 5.48      | 9.32      | 5.95       | 10.31
IPL [5]          | 80ms   | 5k wp     | CTC       | word 4-gram + rescoring | 4.98      | 7.97      | 5.59       | 8.95
Improved T/S [8] | -      | 16k wp    | S2S       | -                       | 4.3       | 9.7       | 4.5        | 9.5
Improved T/S [8] | -      | 16k wp    | S2S       | LSTM                    | 3.9       | 8.8       | 4.2        | 8.6
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | -                       | 4.6       | 9.3       | 4.7        | 9.0
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | word 4-gram             | 2.3       | 5.7       | 2.8        | 6.0
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | word Transf.            | 2.1       | 4.8       | 2.3        | 5.0
slimIPL, our     | 40ms   | 5k wp     | S2S       | -                       | 4.66      | 7.78      | 5.07       | 8.52
slimIPL, our     | 40ms   | 5k wp     | S2S       | wp 6-gram               | 4.55      | 7.45      | 5.13       | 8.08
slimIPL, our     | 40ms   | 5k wp     | S2S       | wp 6-gram + rescoring   | 4.34      | 6.8       | 4.76       | 7.6
slimIPL, our     | 80ms   | 5k wp     | CTC       | -                       | 13.69     | 19.25     | 13.52      | 20.37
slimIPL, our     | 80ms   | 5k wp     | CTC       | word 4-gram             | 12.52     | 17.18     | 12.39      | 18.39
slimIPL, our     | 20ms   | letters   | S2S       | -                       | 4.31      | 7.05      | 4.32       | 7.85
slimIPL, our     | 30ms   | letters   | CTC       | -                       | 4.3       | 8.09      | 4.3        | 8.4
slimIPL, our     | 30ms   | letters   | CTC       | word 4-gram             | 2.74      | 5.94      | 3.31       | 6.63
slimIPL, our     | 30ms   | letters   | CTC       | word 4-gram + rescoring | 2.04      | 4.76      | 2.56       | 5.37

Table 6: Comparison of WER with other semi-supervised methods on the LS-100 / LV setting.

Method           | Stride | AM tokens | Criterion | LM                      | dev-clean | dev-other | test-clean | test-other
IPL [5]          | 80ms   | 5k wp     | CTC       | -                       | 4.35      | 7.90      | 5.07       | 8.84
IPL [5]          | 80ms   | 5k wp     | CTC       | word 4-gram + rescoring | 3.19      | 6.14      | 3.72       | 7.11
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | -                       | 3.3       | 6.5       | 3.1        | 6.3
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | word 4-gram             | 1.8       | 4.5       | 2.3        | 4.6
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | word Transf.            | 1.9       | 4.0       | 2.0        | 4.0
slimIPL, our     | 40ms   | 5k wp     | S2S       | -                       | 4.27      | 7.53      | 4.38       | 7.84
slimIPL, our     | 40ms   | 5k wp     | S2S       | wp 6-gram               | 4.03      | 6.79      | 4.15       | 7.16
slimIPL, our     | 40ms   | 5k wp     | S2S       | wp 6-gram + rescoring   | 3.51      | 5.89      | 3.63       | 6.42

Our models trained on LS-960 are in the same WER ballpark as recently reported state-of-the-art results, as shown in Table 4, reaching 2.24% WER on test-clean and 4.6% WER on test-other with a language model.

5.5     Semi-Supervised Experiments with slimIPL

slimIPL architectures are identical to their supervised counterparts, except for their dropout and layer drop values. These regularization parameters were decreased to "increase" model capacity, as more data is involved in the semi-supervised training process. Acoustic models are bootstrapped on supervised data only, until they reach a reasonable WER (within 100% relative WER of the supervised baseline), after which rounds of pseudo-labeling are performed on a regular basis during training. Following [5], we generate pseudo-labels at each round for all utterances when using LS-860 as unlabeled data. When LV is used as unlabeled data, around 25% of the unlabeled dataset is sampled at each round and used for pseudo-labeling.


Table 7: Comparison of WER with other semi-supervised methods on the LS-960 / LV setting.

Method           | Stride | AM tokens | Criterion | LM                      | dev-clean | dev-other | test-clean | test-other
IPL [5]          | 80ms   | 5k wp     | CTC       | -                       | 2.05      | 4.12      | 2.21       | 4.71
IPL [5]          | 80ms   | 5k wp     | CTC       | word 4-gram + rescoring | 1.85      | 3.26      | 2.10       | 4.01
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | -                       | 2.1       | 4.5       | 2.2        | 4.5
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | word 4-gram             | 1.4       | 3.5       | 2.0        | 3.7
wav2vec 2.0 [58] | 20ms   | letters   | CTC       | word Transf.            | 1.6       | 3.0       | 1.8        | 3.3
Improved T/S [8] | -      | 1k wp     | S2S       | -                       | 1.6       | 3.7       | 1.7        | 3.7
Improved T/S [8] | -      | 1k wp     | S2S       | LSTM                    | 1.6       | 3.4       | 1.7        | 3.4
slimIPL, our     | 40ms   | 5k wp     | S2S       | -                       | 1.91      | 3.97      | 2.08       | 4.21
slimIPL, our     | 40ms   | 5k wp     | S2S       | wp 6-gram               | 1.83      | 3.59      | 2.01       | 3.90
slimIPL, our     | 40ms   | 5k wp     | S2S       | wp 6-gram + rescoring   | 1.71      | 3.25      | 1.86       | 3.63

5.5.1        Experiments on LS-100 / LS-860 and LS-100 / LV
After reaching 25-35% word error rate on dev-other with supervised data, pseudo-label generation is enabled, and
pseudo-labels are re-generated after 20 epochs for CTC models (10 epochs for seq2seq models), then we double the
number of epochs before each pseudo-label generation. In total we perform re-labeling around 25-30 times.
slimIPL results can be found in Table 5. Both letter and word-piece-based seq2seq models outperform (in WER) their supervised baseline counterparts on test-other, as well as previous work on IPL and the noisy teacher-student approach [8] (a word-piece seq2seq model which performs 5 rounds of PL). These experiments confirm that we can benefit from bootstrapping seq2seq models, effectively performing more pseudo-labeling rounds without training a model from scratch at every PL round (for CTC models this was shown for the IPL algorithm in [5]).
slimIPL with a CTC model and the letter set reaches state-of-the-art results with no LM on both test-clean and test-other, and is in the same ballpark as the state-of-the-art results obtained with a language model in the wav2vec 2.0 work [58].
Results for the LS-100 / LV setting are shown in Table 6. slimIPL outperforms the original IPL algorithm in WER on this setting, on both test-clean and test-other.

5.5.2        Experiments on LS-960 / LV
As in the LS-100 setups, slimIPL models are first trained on supervised data only until they reach reasonable WER performance (10-15% word error rate on dev-other). For both CTC and seq2seq models we generate pseudo-labels for the unlabeled data every 5 epochs. In total we perform re-labeling around 20 times.
So far, we have performed slimIPL experiments only with word-piece-based seq2seq models. As shown in Table 7, slimIPL outperforms the regular IPL algorithm in WER, and approaches recent state-of-the-art results for semi- and unsupervised settings on test-clean and test-other.

5.6        Ablation Experiments

slimIPL differs from the original IPL [5] in two ways: (i) it does not rely on a language model at training time, and (ii) we introduce a filtering approach for seq2seq models. In the following, we investigate how critical those differences are.

5.6.1        Language Model at Training
When the IPL algorithm was introduced [5], over-fitting to the language model used during training was observed. A
proposed work-around was to limit the language model weight in the beam-search decoding procedure, when generating
the pseudo-labels. In contrast to IPL, no language model is used at training with slimIPL, so we verify that slimIPL can
take better advantage of a strong language model at inference time. Table 8 shows a larger WER boost with rescoring
when using slimIPL instead of IPL.

Table 8: Ablation study for the LS-100 / LS-860 setting: comparison of WER between IPL (with n-gram beam-search decoding and Transformer LM rescoring for PL generation) and slimIPL.

Method  | Stride | AM tokens | Criterion | LM                      | dev-other | test-other
IPL     | 40ms   | 5k wp     | S2S       | -                       | 8.52      | 9.14
IPL     | 40ms   | 5k wp     | S2S       | wp 6-gram               | 8.46      | 9.24
IPL     | 40ms   | 5k wp     | S2S       | wp 6-gram + rescoring   | 7.77      | 8.39
slimIPL | 40ms   | 5k wp     | S2S       | -                       | 7.78      | 8.52
slimIPL | 40ms   | 5k wp     | S2S       | wp 6-gram               | 7.45      | 8.08
slimIPL | 40ms   | 5k wp     | S2S       | wp 6-gram + rescoring   | 6.8       | 7.6

5.6.2   Filtering
All our slimIPL results with seq2seq models in Section 5 were obtained with pseudo-label filtering. In our experience,
IPL with seq2seq models diverges in the absence of filtering: during training, the number of pseudo-labels with
either too short transcriptions or n-gram repetitions grows quickly. In fact, after 2-3 re-generations more than 50%
of pseudo-labels have these issues. In contrast, slimIPL is more robust, and does not blow up without pseudo-label
filtering. However, a small WER performance boost is observed with filtering, as shown in Figure 3.

Figure 3: Word error rate on dev-other for two slimIPL seq2seq models with the word-piece set: trained without beak band filtering (grey) and with beak band filtering (orange), for the LS-100 / LS-860 setting.

6   Conclusion
We revisited one of the key components of recent pseudo-labeling successes in ASR, beam-search decoding with a language model, and proposed the slimIPL algorithm, where we iteratively re-generate pseudo-labels with hard labels as a single model learns. We demonstrate that slimIPL performs well across different token sets and loss functions. It substantially simplifies training compared to other semi-supervised and unsupervised approaches, while delivering competitive and state-of-the-art results in both standard and low-resource settings on the LibriSpeech test sets. At inference, slimIPL is less prone to language model over-fitting, and decoding with a strong language model is more beneficial than it is for IPL.


References
 [1] O. Chapelle, B. Schölkopf, and A. Zien, Eds., Semi-Supervised Learning. MIT Press, 2006.
 [2] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert,
     C. Fuegen et al., “Libri-light: A benchmark for asr with limited or no supervision,” in ICASSP 2020-2020 IEEE
     International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7669–7673.
 [3] H. Scudder, “Probability of error of some adaptive pattern-recognition machines,” IEEE Transactions on Informa-
     tion Theory, vol. 11, no. 3, pp. 363–371, 1965.
 [4] W.-N. Hsu, A. Lee, G. Synnaeve, and A. Hannun, “Semi-supervised speech recognition via local prior matching,”
     arXiv preprint arXiv:2002.10336, 2020.
 [5] Q. Xu, T. Likhomanenko, J. Kahn, A. Hannun, G. Synnaeve, and R. Collobert, “Iterative pseudo-labeling for
     speech recognition,” arXiv preprint arXiv:2005.09267, 2020.
 [6] J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” in ICASSP 2020-2020 IEEE
     International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7084–7088.
 [7] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Col-
     lobert, “End-to-end asr: from supervised to semi-supervised learning with modern architectures,” arXiv preprint
     arXiv:1911.08460, 2019.
 [8] D. S. Park, Y. Zhang, Y. Jia, W. Han, C.-C. Chiu, B. Li, Y. Wu, and Q. V. Le, “Improved noisy student training for
     automatic speech recognition,” arXiv preprint arXiv:2005.09629, 2020.
 [9] I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan, “Billion-scale semi-supervised learning for image
     classification,” arXiv preprint arXiv:1905.00546, 2019.
[10] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” in
     Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 687–10 698.
[11] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” in 33rd annual meeting of
     the association for computational linguistics, 1995, pp. 189–196.
[12] D. McClosky, E. Charniak, and M. Johnson, “Effective self-training for parsing,” in Proceedings of the Human
     Language Technology Conference of the NAACL, Main Conference, 2006, pp. 152–159.
[13] R. Reichart and A. Rappoport, “Self-training for enhancement and domain adaptation of statistical parsers trained
     on small datasets,” in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics,
     2007, pp. 616–623.
[14] Z. Huang and M. Harper, “Self-training pcfg grammars with latent annotations across languages,” in Proceedings
     of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 832–841.
[15] J. He, J. Gu, J. Shen, and M. Ranzato, “Revisiting self-training for neural sequence generation,” in
     International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?
     id=SJgdnAVKDH
[16] N. Ueffing, “Using monolingual source-language data to improve mt performance,” in International Workshop on
     Spoken Language Translation (IWSLT) 2006, 2006.
[17] J. Zhang and C. Zong, “Exploiting source-side monolingual data in neural machine translation,” in Proceedings of
     the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1535–1545.
[18] S. Novotney and R. Schwartz, “Analysis of low-resource acoustic model self-training,” in Tenth Annual Conference
     of the International Speech Communication Association, 2009.
[19] J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” in ICASSP 2020-2020 IEEE
     International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7084–7088.
[20] S. H. K. Parthasarathi and N. Strom, “Lessons from building acoustic models with a million hours of speech,”
     in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
     IEEE, 2019, pp. 6670–6674.
[21] J. Pino, Q. Xu, X. Ma, M. J. Dousti, and Y. Tang, “Self-training for end-to-end speech translation,” arXiv preprint
     arXiv:2006.02490, 2020.
[22] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in
     Workshop on challenges in representation learning, ICML, vol. 3, no. 2, 2013.
[23] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness, “Pseudo-labeling and confirmation bias in
     deep semi-supervised learning,” arXiv preprint arXiv:1908.02983, 2019.


[24] Y. Chen, W. Wang, and C. Wang, “Semi-supervised asr by end-to-end self-training,” arXiv preprint
     arXiv:2001.09128, 2020.
[25] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic
     functions,” in Proceedings of the 20th International conference on Machine learning (ICML-03), 2003, pp.
     912–919.
[26] B. Liu, Z. Wu, H. Hu, and S. Lin, “Deep metric transfer for label propagation with limited annotated data,” in
     Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[27] A. Odena, “Semi-supervised learning with generative adversarial networks,” arXiv preprint arXiv:1606.01583,
     2016.
[28] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of
     images, labels and captions,” in Advances in neural information processing systems, 2016, pp. 2352–2360.
[29] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer, “S4l: Self-supervised semi-supervised learning,” in Proceedings
     of the IEEE international conference on computer vision, 2019, pp. 1476–1485.
[30] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, “Unsupervised data augmentation for consistency training,”
     arXiv preprint arXiv:1904.12848, 2019.
[31] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, “Mixmatch: A holistic approach
     to semi-supervised learning,” in Advances in Neural Information Processing Systems, 2019, pp. 5049–5059.
[32] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch:
     Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.
[33] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation
     learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp.
     9729–9738.
[34] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo,
     M. G. Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” arXiv preprint
     arXiv:2006.07733, 2020.
[35] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features
     by contrasting cluster assignments,” arXiv preprint arXiv:2006.09882, 2020.
[36] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-
     supervised learners,” arXiv preprint arXiv:2006.10029, 2020.
[37] T. Brants, A. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in Proceedings
     of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural
     Language Learning (EMNLP-CoNLL), 2007, pp. 858–867.
[38] W. He, Z. He, H. Wu, and H. Wang, “Improved neural machine translation with smt features,” in Thirtieth AAAI
     conference on artificial intelligence, 2016.
[39] C. Gulcehre, O. Firat, K. Xu, K. Cho, and Y. Bengio, “On integrating a language model into neural machine
     translation,” Computer Speech & Language, vol. 45, pp. 137–148, 2017.
[40] R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,”
     in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
     Papers), 2016, pp. 86–96.
[41] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding back-translation at scale,” in Proceedings of the
     2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 489–500.
[42] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato, “Phrase-based & neural unsupervised machine
     translation,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018,
     pp. 5039–5049.
[43] A. Currey, A. V. Miceli-Barone, and K. Heafield, “Copied monolingual data improves low-resource neural machine
     translation,” in Proceedings of the Second Conference on Machine Translation, 2017, pp. 148–156.
[44] H. Schwenk, G. Wenzek, S. Edunov, E. Grave, and A. Joulin, “Ccmatrix: Mining billions of high-quality parallel
     sentences on the web,” 2020.
[45] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised
     training for low resource speech recognition,” in 2013 IEEE international conference on acoustics, speech and
     signal processing. IEEE, 2013, pp. 6704–6708.


[46] A. Baevski, M. Auli, and A. Mohamed, “Effectiveness of self-supervised pre-training for speech recognition,”
     arXiv preprint arXiv:1911.03912, 2019.
[47] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, “Deep contextualized acoustic representations for semi-supervised
     speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
     Processing (ICASSP). IEEE, 2020, pp. 6429–6433.
[48] A. H. Liu, H.-y. Lee, and L.-s. Lee, “Adversarial training of end-to-end speech recognition using a criticizing lan-
     guage model,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing
     (ICASSP). IEEE, 2019, pp. 6176–6180.
[49] M. K. Baskar, S. Watanabe, R. Astudillo, T. Hori, L. Burget, and J. Černockỳ, “Self-supervised sequence-to-
     sequence asr using unpaired speech and text,” arXiv preprint arXiv:1905.01152, 2019.
[50] L. Lamel, J.-L. Gauvain, and G. Adda, “Lightly supervised and unsupervised acoustic model training,” Computer
     Speech & Language, vol. 16, no. 1, pp. 115–129, 2002.
[51] F. Wessel and H. Ney, “Unsupervised training of acoustic models for large vocabulary continuous speech
     recognition,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 1, pp. 23–31, 2004.
[52] J. Ma and R. Schwartz, “Unsupervised versus supervised training of acoustic models,” in Ninth Annual Conference
     of the International Speech Communication Association, 2008.
[53] H. Liao, E. McDermott, and A. Senior, “Large scale deep neural network acoustic modeling with semi-supervised
     training data for youtube video transcription,” in 2013 IEEE Workshop on Automatic Speech Recognition and
     Understanding. IEEE, 2013, pp. 368–373.
[54] V. Manohar, D. Povey, and S. Khudanpur, “Semi-supervised maximum mutual information training of deep
     neural network acoustic models,” in Sixteenth Annual Conference of the International Speech Communication
     Association, 2015.
[55] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling
     unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference
     on Machine learning, 2006, pp. 369–376.
[56] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary
     conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal
     Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
[57] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data
     augmentation method for automatic speech recognition,” Proc. Interspeech 2019, pp. 2613–2617, 2019.
[58] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of
     speech representations,” arXiv preprint arXiv:2006.11477, 2020.
[59] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint
     arXiv:1807.03748, 2018.
[60] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch:
     Simplifying semi-supervised learning with consistency and confidence,” CoRR, vol. abs/2001.07685, 2020.
     [Online]. Available: https://arxiv.org/abs/2001.07685
[61] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel, “Remixmatch: Semi-
     supervised learning with distribution alignment and augmentation anchoring,” arXiv preprint arXiv:1911.09785,
     2019.
[62] K. Imamura, A. Fujita, and E. Sumita, “Enhancement of encoder and attention using target monolingual corpora in
     neural machine translation,” in Proceedings of the 2nd Workshop on Neural Machine Translation and Generation,
     2018, pp. 55–63.
[63] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence
     models,” Proc. Interspeech 2017, pp. 523–527, 2017.
[64] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio
     books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
     2015, pp. 5206–5210.
[65] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in
     neural information processing systems, 2014, pp. 3104–3112.
[66] R. Collobert, A. Hannun, and G. Synnaeve, “Word-level speech recognition with a letter to word encoder,” arXiv
     preprint arXiv:1906.04323, 2019.


[67] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention
     is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[68] A. Fan, E. Grave, and A. Joulin, “Reducing transformer depth on demand with structured dropout,” in International
     Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=SylO2yStDr
[69] M. Schuster and K. Nakajima, “Japanese and korean voice search,” in International Conference on Acoustics,
     Speech and Signal Processing, 2012, pp. 5149–5152.
[70] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer
     for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language
     Processing: System Demonstrations, 2018, pp. 66–71.
[71] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,”
     Journal of machine learning research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[72] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, “wav2letter++: The
     fastest open-source speech recognition system,” arXiv preprint arXiv:1812.07625, 2018.
[73] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,”
     arXiv preprint arXiv:1609.03193, 2016.
[74] T. Likhomanenko, G. Synnaeve, and R. Collobert, “Who needs words? lexicon-free speech recognition,” Proc.
     Interspeech 2019, pp. 3915–3919, 2019.
[75] K. Heafield, “Kenlm: Faster and smaller language model queries,” in Proceedings of the sixth workshop on
     statistical machine translation. Association for Computational Linguistics, 2011, pp. 187–197.
[76] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “Rwth asr systems for
     librispeech: Hybrid vs attention,” Proc. Interspeech 2019, pp. 231–235, 2019.
[77] R. Collobert, A. Hannun, and G. Synnaeve, “Word-level speech recognition with a letter to word encoder,” arXiv preprint arXiv:1906.04323, 2019.
[78] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, “Transformer transducer: A
     streamable speech recognition model with transformer encoders and rnn-t loss,” in ICASSP 2020-2020 IEEE
     International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7829–7833.
[79] W. Han, Z. Zhang, Y. Zhang, J. Yu, C.-C. Chiu, J. Qin, A. Gulati, R. Pang, and Y. Wu, “Contextnet: Im-
     proving convolutional neural networks for automatic speech recognition with global context,” arXiv preprint
     arXiv:2005.03191, 2020.
[80] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer:
     Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
