DOMAIN ROBUST, FAST, AND COMPACT NEURAL LANGUAGE MODELS
Alexander Gerstenberger1, Kazuki Irie1,*, Pavel Golik2, Eugen Beck1,2, Hermann Ney1,2

1 Human Language Technology and Pattern Recognition Group, Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany
2 AppTek GmbH, 52062 Aachen, Germany

alexander.gerstenberger@rwth-aachen.de, {irie, beck, ney}@cs.rwth-aachen.de, pgolik@apptek.com
ABSTRACT

Despite advances in neural language modeling, obtaining a good model on a large-scale multi-domain dataset remains a difficult task. We propose training methods for building neural language models for such a task which are not only domain robust, but also reasonable in model size and fast to evaluate. We combine knowledge distillation from pre-trained domain expert language models with the noise contrastive estimation (NCE) loss. Knowledge distillation allows us to train a single student model which is both compact and domain robust, while the use of the NCE loss makes the model self-normalized, which enables fast evaluation. We conduct experiments on a large English multi-domain speech recognition dataset provided by AppTek. The resulting student model has the size of a single domain expert, while it gives perplexities similar to those of the various teacher models on their expert domains; the model is self-normalized, allowing for 30% faster first-pass decoding than naive models which require the full softmax computation; and it gives improvements of more than 8% relative in word error rate over a large multi-domain 4-gram count model trained on more than 10 B words.

Index Terms: language modeling, domain robustness, teacher-student learning, ASR

1. INTRODUCTION

Neural network language models [1], such as long short-term memory (LSTM) [2] recurrent neural networks (RNNs) [3, 4] or Transformers [5-7], have been shown to consistently outperform n-gram language models and to give large improvements for automatic speech recognition. However, such improvements are not obtained for free; they are the result of careful tuning of the model hyper-parameters. In practice, for large-scale tasks (with more than a few billion words of training text) containing sub-corpora from multiple domains, it is not straightforward to obtain a good neural language model [8]. First, the model size must be increased to match the large amount of data. This slows down the training and tuning process, which is crucial for obtaining a good neural language model. Second, the diversity in the data requires extra modeling effort [9, 10] to build a robust model, in contrast to n-gram language models, for which such diversity can simply be leveraged by static or Bayesian interpolation [11, 12].

In this work, we aim at building neural language models (LMs) on a large-scale multi-domain dataset which are not only domain robust (no domain label is needed at test time), but also reasonable in terms of model size and fast to evaluate. To achieve this goal, we combine knowledge distillation (KD) [13-15] using pre-trained domain expert models with the noise contrastive estimation [16-19] loss. We conduct our experiments on a multi-domain speech recognition dataset provided by AppTek, which offers about 10 B words for language model training (from which we selected 1.2 B for neural model training). We demonstrate that our method effectively yields a language model which has the size of one expert, gives perplexities similar to those of the teacher models on their expert domains, and is self-normalized. We implement our models using the TensorFlow [20] based open-source toolkit RETURNN [21]¹.

* Work conducted while the author was at RWTH Aachen. Now with the Swiss AI Lab, IDSIA, USI & SUPSI, 6928 Manno-Lugano, Switzerland.
¹ Example config files are available at https://github.com/rwth-i6/returnn-experiments/tree/master/2020-lm-domain-robust

2. RELATED WORK

In [9], a domain robust neural language model is constructed as a large mixture of domain experts. An obvious downside of such an approach is the large size of the final model. In this work, instead of copying all domain experts' parameters, we make use of knowledge distillation [13-15] to obtain a single student model which is both compact and domain robust. Distillation from domain experts for robust acoustic modeling has been investigated in [22]. In the case of language modeling, distillation must be combined with an efficient softmax computation method: in previous work, [23] uses a word-class factorized output, while [24] uses the NCE. Here, our primary goal is to use the NCE, since it makes the model self-normalized and therefore fast to evaluate; we also compare it with the sampled softmax [25] variant. Our experiments also include knowledge transfer from powerful Transformer teacher models to a single domain robust LSTM student.
3. TRAINING METHOD

3.1. Knowledge distillation for large vocabulary LM

For knowledge distillation from a teacher p_T(w|h) to a student language model p_θ(w|h) with parameters θ and vocabulary V, we optimize θ to minimize the distillation loss, which is computed for each history h in the data:

    L_{\mathrm{KD}}(h; \theta) = -\sum_{w \in V} p_T(w|h) \log p_\theta(w|h)    (1)

In practice, this term is interpolated with the standard cross-entropy loss using an interpolation weight.

When large vocabulary word-level language models are trained using some method that avoids the full softmax, the distillation loss must also be adapted accordingly. We consider both the NCE and the sampled softmax method.

Knowledge distillation using sampled softmax: In the sampled softmax loss [25], the normalization term of the softmax is computed based on a subset of words sampled for each batch from a noise distribution. Thus, we can directly obtain the distillation loss by replacing p_T(w|h) and p_θ(w|h) in Eq. (1) with the corresponding sampled softmax probabilities, making sure to use the same samples for teacher and student.
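To make the sampled-softmax distillation concrete, the following is a minimal NumPy sketch of Eq. (1) restricted to the target word plus a shared set of negative samples. The function and variable names, the toy dimensions, and the omission of the usual -log q(w) logit correction are our own simplifications, not the RETURNN implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sampled_softmax_kd_loss(student_logits, teacher_logits, target, samples):
    """Distillation loss of Eq. (1), with the sum over V replaced by the target
    word plus negative samples shared between teacher and student."""
    idx = np.concatenate(([target], samples))       # restricted output set
    p_teacher = softmax(teacher_logits[idx])        # teacher soft targets
    log_p_student = np.log(softmax(student_logits[idx]))
    return -np.sum(p_teacher * log_p_student)       # cross-entropy against teacher

# Toy usage: 10-word vocabulary, 4 shared negative samples.
rng = np.random.default_rng(0)
loss = sampled_softmax_kd_loss(rng.normal(size=10), rng.normal(size=10),
                               target=3, samples=rng.choice(10, 4, replace=False))
print(loss)
```

In training, such a term would be interpolated with the standard (sampled softmax) cross-entropy loss against the true target word, as described above.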
Knowledge distillation using NCE: While sampled softmax makes training faster than the full softmax, the use of the NCE [16-18] loss allows us both to train faster and to obtain self-normalized models (so that at test time we only have to compute the exponential of the logits). The NCE loss trains the model to discriminate noise samples drawn from a noise distribution q from true data by means of logistic regression. For knowledge distillation, we use the loss which is computed for each data point (h, w):

    L_{\mathrm{KD\text{-}NCE}}(h, w; \theta) = -\sum_{\tilde{w} \in D_q \cup \{w\}} \Big[ g_T(\tilde{w}, h) \log g_\theta(\tilde{w}, h) + \big(1 - g_T(\tilde{w}, h)\big) \log\big(1 - g_\theta(\tilde{w}, h)\big) \Big]    (2)

where g_θ(w, h) := σ(s_θ(w, h) - log q(w|h)), with s_θ(w, h) the logits of the student (and similarly g_T(w, h) for the teacher), and D_q is the set of words sampled from the noise distribution q. In order to obtain a self-normalized student model, the teacher models are also pre-trained using the NCE loss.
3.2. Knowledge distillation for domain robust modeling

We make use of the distillation methods above to build a single, compact, domain robust model. The teacher model in our experiments is an interpolation of multiple neural language models trained on different sub-corpora of the dataset.

First of all, the target domain labels are specified by the development set. We can assign each training subset to a domain by training a 4-gram count model on the subset, interpolating the models optimized for each domain, and then choosing the domain label with the highest weight (as shown in Sec. 4.1).

We consider two methods for building the interpolated teacher model. First, we simply estimate a single set of interpolation weights for the teacher models based on their perplexity on the entire development set and use a static interpolated teacher model (static teacher approach). Alternatively, we can estimate target-domain-specific interpolation weights based on each development subset and use these domain conditional weights to dynamically define the teacher depending on the domain of the training sequence (domain conditional approach), as in [22]. This results in a better teacher ensemble.
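Both teacher variants need interpolation weights that minimize development-set perplexity. A standard way to obtain such weights is the EM-style re-estimation sketched below (a generic implementation of our own, not the authors' tooling); for the static teacher it is run once on the full dev set, for the domain conditional teacher once per domain-specific dev subset.

```python
import numpy as np

def estimate_interpolation_weights(token_probs, n_iter=50):
    """EM-style estimation of linear interpolation weights.

    token_probs: array [num_models, num_tokens] holding p_m(w_t | h_t) of each
    expert LM on every token of the chosen dev (sub)set. Returns weights that
    locally minimize the perplexity of the mixture sum_m lambda_m * p_m."""
    num_models, _ = token_probs.shape
    lam = np.full(num_models, 1.0 / num_models)
    for _ in range(n_iter):
        mix = lam[:, None] * token_probs             # [models, tokens]
        post = mix / mix.sum(axis=0, keepdims=True)  # responsibility of each expert
        lam = post.mean(axis=1)                      # re-estimated weights
    return lam

# Toy usage: two experts, five dev tokens.
probs = np.array([[0.10, 0.02, 0.30, 0.05, 0.20],
                  [0.01, 0.20, 0.10, 0.15, 0.02]])
lam = estimate_interpolation_weights(probs)
teacher = (lam[:, None] * probs).sum(axis=0)         # interpolated teacher p_T
print(lam, np.exp(-np.log(teacher).mean()))          # weights and resulting PPL
```

The same kind of weights, estimated with 4-gram models per training subset, is also what Table 1 reports for judging domain relevance.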
4. EXPERIMENTAL SET-UPS

4.1. AppTek English multi-domain dataset

We conduct experiments on a large English multi-domain dataset provided by AppTek. The LM training data consists of 33 subsets with domains including news, movie subtitles (entertainment), user generated content and sport, and comprises 10.2 B words in total with a vocabulary of 250 K words. Our domain labels are movies, news, social media, user generated content (UGC) and voice messages (MSG). These target domains are defined by the dev and eval datasets.

We first check which subsets of the training dataset are relevant to our target domains. We train 4-gram Kneser-Ney LMs [26] on each subset of the training data. Then we linearly interpolate the models using interpolation weights optimized on every domain-specific subset of the dev data. The resulting interpolation weights for each domain indicate the domain relevance of each training subset. Table 1 shows the 8 most relevant subsets.

Table 1. Interpolation weights (scaled by a factor of 100) for 4-gram LMs. Values smaller than 10⁻² are removed. We show the 8 most relevant subsets out of 33. #Running words in millions.

  Train      # Run.           Interpolation weights on dev set
  subset      words     All   Movies    News   Social     UGC     MSG
  news-01        93     2.0        -    10.8      0.2     0.1       -
  ent-01         94     5.2      3.7     3.1     13.7    13.5     6.9
  ent-02        174     7.3     12.7     1.2      2.1     3.7    11.4
  news-02        18     2.7        -     6.2      1.9     2.0     0.4
  news-03     2,960     3.7      1.0     6.3      0.9     4.6     3.0
  ent-03        651    15.9     23.3     3.1     21.5    20.0    14.8
  ent-04        469    22.8     48.1     1.1     28.2    27.0    20.7
  news-04       730    27.6      4.2    56.7      4.4    12.4     9.9

Based on this analysis, we assign news-04 as the news expert News, because it has the highest weight on the news domain, and ent-04 as the movies expert Movie, in total 1.2 B words². We pre-train separate models on each of the two datasets³ as experts for the corresponding domain. We use an interpolation of these models as the teacher for distillation.

² We could also merge news-01 and -04, or ent-02, -03, and -04, to train the corresponding experts, if we had more computational resources.
³ In our preliminary experiments with LSTMs, we found that the interpolation of models trained on subsets outperforms a single model trained on the whole dataset. We could potentially obtain better expert models by first training a single background model on the whole data and then fine-tuning it on the domain subsets separately, as has been done in [9]. However, in practice, pre-training a single model on the whole data would require the model to be very large; distributed training of "reasonable size" models separately on different subsets is more convenient and potentially scales better.

4.2. Model architectures

We pre-train both LSTM and Transformer based teacher models. Deep Transformer models have recently shown good performance on a variety of LM datasets, outperforming LSTM models. However, Transformer LMs require more memory for evaluation due to the self-attention. For distillation, a student LSTM model can benefit from Transformer teachers, which would potentially allow us to make use of the good performance of Transformer models while obtaining more memory efficient LSTM models for evaluation.

Our LSTM language models use two LSTM layers with 2048 hidden units each and a 128-dimensional input embedding, which amounts to 600 M parameters given the vocabulary size of 250 K. For the Transformer models we use a 128-dimensional input embedding, 32 layers, a feed-forward dimension of 2048, a residual dimension of 768 (tied with the key/query and value dimensions) and 16 attention heads, which amounts to 400 M parameters. Following [7], no positional encoding is used.
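As a rough sanity check of these sizes, the dominant weight matrices can be counted as below; the exact layer internals (e.g. an input projection from the 128-dimensional embedding to the 768-dimensional Transformer residual stream, and the handling of biases) are our assumptions, so the totals are only approximate.

```python
def lstm_lm_params(vocab=250_000, emb=128, hidden=2048, layers=2):
    """Dominant weight matrices of the LSTM LM described above."""
    p = vocab * emb                                        # input embedding
    in_dim = emb
    for _ in range(layers):                                # 4-gate LSTM layers
        p += 4 * (in_dim * hidden + hidden * hidden + hidden)
        in_dim = hidden
    return p + hidden * vocab + vocab                      # output softmax dominates

def trafo_lm_params(vocab=250_000, emb=128, d_model=768, d_ff=2048, layers=32):
    """Dominant weight matrices of the Transformer LM described above."""
    p = vocab * emb + emb * d_model                        # embedding + input projection
    p += layers * (4 * d_model * d_model + 2 * d_model * d_ff)  # attention + feed-forward
    return p + d_model * vocab + vocab                     # output softmax

print(f"LSTM: {lstm_lm_params() / 1e6:.0f} M, Transformer: {trafo_lm_params() / 1e6:.0f} M")
# prints roughly 596 M and 401 M, consistent with the quoted 600 M and 400 M
```

The count also shows that the 2048 x 250 K output softmax (about 512 M parameters) dominates the LSTM student, which is what the bottleneck experiments of Sec. 5.2 exploit.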
We use the frequency-sorted log-uniform distribution to sample 8192 and 1024 negative samples for sampled softmax [25] and NCE [16], respectively. We share the noise samples within the same batch. For NCE, we set the constant normalization term to one and initialize the bias of the softmax layer to -log(V), following [27]⁴, which makes the model initially self-normalized. We found this to be crucial for models trained with the NCE loss to match the performance of models trained with the full softmax. All models are trained on a single NVIDIA Tesla V100 GPU (16 GB) at the RWTH IT Center.

⁴ We scaled this value by 1.5, which we found to improve model performance and convergence speed in our experiments on this dataset.
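The two NCE details above can be sketched as follows; the log-uniform (Zipf-like) sampling formula is the one commonly used for frequency-sorted vocabularies (e.g. in TensorFlow's candidate samplers), and reading the footnote's factor of 1.5 as a scaling of the -log(V) initialization is our interpretation.

```python
import numpy as np

def log_uniform_probs(vocab_size):
    """P(k) proportional to log((k + 2) / (k + 1)) for word ids sorted by
    frequency (id 0 = most frequent word); a common log-uniform sampler."""
    k = np.arange(vocab_size)
    p = np.log((k + 2.0) / (k + 1.0))
    return p / p.sum()

def sample_shared_noise(vocab_size, num_samples, rng):
    """Noise word ids shared by all positions of a batch."""
    return rng.choice(vocab_size, size=num_samples, p=log_uniform_probs(vocab_size))

def init_softmax_bias(vocab_size, scale=1.5):
    """Bias initialized near -log(V) (scaled by 1.5 per the footnote), so the
    logits start close to a normalized log-uniform output distribution."""
    return np.full(vocab_size, -scale * np.log(vocab_size))

rng = np.random.default_rng(0)
print(sample_shared_noise(250_000, num_samples=8, rng=rng))
print(init_softmax_bias(250_000)[0])      # about -18.6 for V = 250 K
```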
5. TEXT BASED EXPERIMENTS

5.1. Results for sampled softmax based distillation

The results for the sampled softmax case are shown in Table 2. All models are trained for 6 epochs until convergence. Our distillation loss scale is set to 0.5, for which we achieved the best results. The top part of the table compares the expert LSTM models with the 4-gram models trained only on the corresponding subset.

The middle part of Table 2 shows the results for distillation with the static teacher. The teacher is obtained by interpolation between the News and Movie LSTM models, using interpolation weights estimated on the whole dev set. We note that this teacher does not outperform the individual expert models on their expert domains. By distillation, we obtain a single student model with better perplexities than the teacher ensemble: on news, a 7.4% relative improvement is obtained.

The last part of Table 2 shows the results for distillation experiments using domain conditional interpolation of the expert models as the teacher. The resulting student model gives performance comparable to the previous case with the static teacher. The improvement from domain conditional weights does not seem to carry over to the student performance⁵.

⁵ We note that the word frequency used in the sampler is computed on the whole training set. Using domain-specific sampling distributions might lead to different conclusions.

Table 2. Perplexities for the sampled softmax case. "Dom." indicates domain conditional weights for interpolating expert models to obtain the teacher.

  Model     Model    Dom.           Development perplexity
  Role      Type              All    Movie    News   Social     UGC     MSG
  News      4-gram          155.5    186.7   103.0    158.9   174.4   187.2
  News      LSTM            100.0    123.0    65.7    103.1    96.5   131.5
  Movie     4-gram          150.4     99.1   246.2    110.5   134.6   154.5
  Movie     LSTM            104.4     79.2   134.9    149.7    83.8   118.4
  Teacher   LSTM             78.7     79.4    69.0     75.3    74.7    95.4
  Student   LSTM             75.0     76.5    63.9     72.6    71.3    92.5
  Teacher   LSTM      x      78.7     75.7    63.0     74.4    74.2    94.9
  Student   LSTM      x      75.2     77.5    62.3     73.9    71.7    94.3

5.2. Results for NCE based distillation

The (normalized) perplexities for the NCE experiments are shown in Table 3. While we obtain slightly better perplexities compared with the sampled softmax variants (Sec. 5.1), the overall observations are similar: the student model outperforms the expert models, but the domain conditional distillation approach does not give an extra gain in performance.

Table 3. Perplexities for LSTM models trained with NCE. Again, "Dom." indicates domain conditional weights for interpolating expert models to obtain the teacher.

  Model     Dom.           Development perplexity
  Role               All    Movie    News   Social     UGC     MSG
  News             100.7    126.5    65.3    103.9    96.9   131.6
  Movie            103.7     77.6   149.0     82.5    80.4   117.5
  Teacher           77.1     77.5    68.0     73.8    72.1    94.0
  Student           75.0     76.2    65.0     72.5    70.4    91.6
  Teacher     x      77.1     74.0    62.1     72.6    71.6    93.7
  Student     x      75.1     76.6    63.3     72.7    71.4    93.7

We will therefore use the student trained with the static teacher (from Table 3) for the ASR experiments later. The variance of the log normalization term for this model is 0.023 and its mean is -0.034, i.e. the model is acceptably self-normalized⁶.

⁶ Following [28], we can also correct the logits by subtracting the mean. For unnormalized LM scores we then get 75.0 on dev and 92.1 on eval with correction, compared with 77.6 and 95.3, respectively, without correction.
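The mean and variance of the log normalization term quoted above can be measured with a simple check like the following over a set of dev histories; the mean-subtraction correction from footnote 6 [28] is indicated at the end. The code is a generic sketch, not the evaluation pipeline used in the paper.

```python
import numpy as np

def log_partition(logits):
    """log Z(h) = log sum_w exp(s(w, h)), computed stably per history."""
    m = logits.max(axis=-1, keepdims=True)
    return np.squeeze(m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True)), -1)

def self_normalization_stats(logits_per_history):
    """Mean and variance of log Z over dev histories; a well self-normalized
    NCE model has both close to zero."""
    log_z = log_partition(logits_per_history)
    return log_z.mean(), log_z.var()

# Toy usage: 100 histories over a 50-word vocabulary (random logits here).
rng = np.random.default_rng(0)
mean, var = self_normalization_stats(rng.normal(size=(100, 50)))
print(mean, var)
# Footnote 6 correction: use s(w, h) - mean as the unnormalized LM score.
```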
Can we reduce the student model size? In addition, we investigate the possibility of reducing the student model size. Table 4 shows the results for student models with an LSTM size of 1024 and 512 instead of 2048, as well as a model with a bottleneck layer before the softmax [29]. The bottleneck approach works best in our experiments, achieving a compression rate of 5.7 while showing only up to 3.5% degradation compared to the full size LSTM student.

Table 4. Perplexities for small students in the NCE case. "hidden" denotes the LSTM dimension and "bn-dim" is the dimension of an additional linear bottleneck layer before the softmax.

  hidden   bn-dim   #Param. [M]           Development perplexity
                                    All    Movie    News   Social     UGC     MSG
  2048        -          600       75.0     76.2    65.0     72.5    70.4    91.6
  2048       512         212       76.7     78.1    67.4     73.4    72.6    91.5
  1024        -          300       81.2     81.3    72.0     76.0    77.6    97.2
  512         -          163       90.5     88.0    86.9     83.1    86.6   102.6
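Since the output softmax is the largest block of the model (cf. the count after Sec. 4.2), the bottleneck is essentially a low-rank factorization of that matrix [29]. A small accounting sketch (our own, dominant terms only) shows where the savings in the #Param. column come from.

```python
def output_block_params(hidden=2048, vocab=250_000, bn_dim=None):
    """Output block size: full softmax vs. linear bottleneck + smaller softmax."""
    if bn_dim is None:
        return hidden * vocab + vocab                         # 2048 x 250 K matrix
    return hidden * bn_dim + bn_dim + bn_dim * vocab + vocab  # 2048 x 512 + 512 x 250 K

full = output_block_params()                # ~512 M parameters
small = output_block_params(bn_dim=512)     # ~129 M parameters
print(f"saved: {(full - small) / 1e6:.0f} M")  # ~383 M: the 600 M student drops to ~212 M
```

Measured against the two-expert teacher ensemble (2 x 600 M parameters), the 212 M bottleneck student is roughly 5.7 times smaller, which is one way to read the compression rate quoted above.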
5.3. Transformer teachers for an LSTM student

Table 5 shows the results for distillation with Transformer teachers. The Transformer teacher ensemble gives a perplexity of 69.4, a 12% relative improvement over the LSTM teacher (Table 2). While the resulting LSTM student outperforms the student trained with LSTM teachers (Table 2), the improvement is only marginal.

Finally, we also use a Transformer as the student with sampled softmax distillation. Our observation is similar to the LSTM teacher case (Sec. 5.1): the student outperforms both the teacher and each domain expert. We obtain up to 7% relative improvement over the teacher model.

Table 5. Sampled softmax distillation using Transformer teacher. Again, "Dom." indicates domain conditional weights for interpolating expert models to obtain the teacher.

  Model     Model    Dom.           Development perplexity
  Role      Type              All    Movie    News   Social     UGC     MSG
  News      Trafo            91.1    118.8    55.3     92.1    86.0   124.7
  Movie     Trafo            95.2     74.2   131.6     74.5    73.2   106.0
  Teacher   Trafo            69.4     73.1    56.8     65.8    63.8    88.2
  Student   LSTM             73.6     75.5    62.9     69.0    67.0    90.7
  Student   Trafo            65.5     68.5    54.1     62.8    59.7    83.6
  Teacher   Trafo      x     69.4     70.1    52.0     65.1    63.5    87.8
  Student   LSTM       x     73.7     76.3    60.2     70.4    70.1    93.6

6. ASR EXPERIMENTS

We carry out ASR first-pass decoding [30] experiments using the obtained LSTM and Transformer student models. Our system is based on the hybrid approach [31]. The HMM state tying scheme was estimated following a phonetic decision tree approach [32], resulting in 5 K tied states. The acoustic model is a compact bi-directional neural network with four layers of 512 LSTM cells [2] per layer and per direction. We trained it on 80-dimensional MFCC features extracted from a very large collection of various recordings from the broadcast news and media as well as entertainment domains. We also use RETURNN for the acoustic model training.

The decoding is carried out with the RASR toolkit [33, 34]. The recognition lexicon contains 250 K words. In contrast to popular benchmark sets, such as Hub5 2000 or LibriSpeech, no manual segmentation is available, and we apply automatic speech activity detection to break the long audio recordings into utterances. Each utterance is then processed in isolation.

Table 6 summarizes the results. For the LSTM model trained with NCE, we evaluate decoding with both normalized and unnormalized language model scores. In both cases, we obtain improvements of up to 7-8% relative in WER over our 4-gram baseline LM trained on all text data. This confirms that the full softmax computation is not needed for the NCE-trained models. The system which uses the unnormalized scores runs 30% faster; looking only at the time spent on the language model score computation, the speedup is 40%.

In addition, we also evaluate the Transformer student trained using sampled softmax in Sec. 5.3. We obtain a 9.5% relative improvement in WER over the 4-gram baseline.

Table 6. WERs (%) for first-pass recognition experiments. "Normalized" refers to the use of the full softmax for evaluation.

  LM            Train data   Normalized   Dev PPL   Dev WER   Eval PPL   Eval WER
  4-gram            10.2 B      Yes         108.7      19.0      119.7       21.8
  NCE-LSTM           1.2 B      Yes          75.0      17.5       91.8       20.5
  NCE-LSTM           1.2 B      No           77.6      17.6       95.3       20.5
  Transformer        1.2 B      Yes          65.5      17.2       81.9       20.2
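The speedup from unnormalized scores comes from skipping the softmax normalization over the 250 K vocabulary during decoding: since the NCE-trained model is self-normalized, the logit s(w|h) is used directly as the log LM score. The sketch below illustrates the two scoring modes; it is our own illustration of the idea, not RASR or RETURNN code.

```python
import numpy as np

def lm_score_normalized(logits, word):
    """Full softmax: log p(w|h) = s(w, h) - log sum_w' exp(s(w', h)).
    Requires touching all vocabulary logits for every scored history."""
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    return logits[word] - log_z

def lm_score_unnormalized(logits, word):
    """Self-normalized NCE model: log Z(h) is close to zero, so the logit
    itself serves as the LM score; only hypothesized words need logits."""
    return logits[word]

rng = np.random.default_rng(0)
logits = rng.normal(size=250_000)    # stand-in for one history's output layer
print(lm_score_normalized(logits, 42), lm_score_unnormalized(logits, 42))
```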
7. CONCLUSION

We presented a training method for LSTM language models on a difficult large-scale multi-domain task, which successfully combines knowledge distillation from pre-trained domain expert LMs with the NCE loss. We compressed the large ensemble of domain experts into a single, compact model, while maintaining perplexities similar to those of the teacher and keeping the model self-normalized. We achieved up to 8-9% improvement in WER over a strong 4-gram model trained on much more data. In future work, we will explore how to extend this method to new target domains or additional training data, aiming for an incremental lifelong learning algorithm.

8. ACKNOWLEDGEMENT

We thank Tobias Menne for help with the baseline ASR system, and Ralf Schlüter and Volker Steinbiss for helpful comments on the paper. This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project SEQCLAS). The work reflects only the authors' views and none of the funding parties is responsible for any use that may be made of the information it contains. Experiments were partially performed with computing resources granted by RWTH Aachen University under project nova0003.
9. REFERENCES

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model," The Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
[2] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[3] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech, Makuhari, Japan, Sept. 2010, pp. 1045-1048.
[4] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in Proc. Interspeech, Portland, OR, USA, Sept. 2012, pp. 194-197.
[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 5998-6008.
[6] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde, "Jasper: An end-to-end convolutional neural acoustic model," in Proc. Interspeech, 2019, pp. 71-75.
[7] Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, "Language modeling with deep Transformers," in Proc. Interspeech, Graz, Austria, Sept. 2019, pp. 3905-3909.
[8] Anirudh Raju, Denis Filimonov, Gautam Tiwari, Guitang Lan, and Ariya Rastrow, "Scalable multi corpora neural language models for ASR," in Proc. Interspeech, 2019, pp. 3910-3914.
[9] Kazuki Irie, Shankar Kumar, Michael Nirschl, and Hank Liao, "RADMM: Recurrent adaptive mixture model with applications to domain robust language modeling," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, Apr. 2018, pp. 6079-6083.
[10] Michael Hentschel, Marc Delcroix, Atsunori Ogawa, Tomoharu Iwata, and Tomohiro Nakatani, "A unified framework for feature-based domain adaptation of neural network language models," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 7250-7254.
[11] Cyril Allauzen and Michael Riley, "Bayesian language model interpolation for mobile speech input," in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 1429-1432.
[12] Ernest Pusateri, Christophe Van Gysel, Rami Botros, Sameer Badaskar, Mirko Hannemann, Youssef Oualil, and Ilya Oparin, "Connecting and comparing language model interpolation techniques," in Proc. Interspeech, 2019, pp. 3500-3504.
[13] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning and Representation Learning Workshop, Montreal, Canada, Dec. 2014.
[14] Jimmy Ba and Rich Caruana, "Do deep nets really need to be deep?," in Proc. Advances in Neural Information Processing Systems (NIPS), Quebec, Canada, Dec. 2014, vol. 27, pp. 2654-2662.
[15] Cristian Buciluă, Rich Caruana, and Alexandru Niculescu-Mizil, "Model compression," in Proc. ACM SIGKDD Int. Conf. on Knowledge Disc. and Data Mining, Philadelphia, PA, USA, Aug. 2006, pp. 535-541.
[16] Michael Gutmann and Aapo Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in Proc. Int. Conf. on AI and Statistics, 2010, pp. 297-304.
[17] Andriy Mnih and Yee Whye Teh, "A fast and simple algorithm for training neural probabilistic language models," in Proc. Int. Conf. on Machine Learning (ICML), Edinburgh, Scotland, 2012, pp. 419-426.
[18] Zhuang Ma and Michael Collins, "Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency," in Proc. Conf. on Empirical Methods in Nat. Lang. Processing (EMNLP), Brussels, Belgium, Oct.-Nov. 2018, pp. 3698-3707.
[19] Xie Chen, Xunying Liu, Mark J. F. Gales, and Philip C. Woodland, "Recurrent neural network language model training with noise contrastive estimation for speech recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015, pp. 5411-5415.
[20] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, et al., "TensorFlow: A system for large-scale machine learning," in Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, Nov. 2016, pp. 265-283.
[21] Albert Zeyer, Tamer Alkhouli, and Hermann Ney, "RETURNN as a generic flexible neural toolkit with application to translation and speech recognition," in Proc. Assoc. for Computational Linguistics (ACL), Melbourne, Australia, July 2018.
[22] Zhao You, Dan Su, and Dong Yu, "Teach an all-rounder with experts in different domains," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 6425-6429.
[23] Kazuki Irie, Zhihong Lei, Ralf Schlüter, and Hermann Ney, "Prediction of LSTM-RNN full context states as a subtask for N-gram feedforward language models," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, Apr. 2018, pp. 6104-6108.
[24] Jesús Andrés-Ferrer, Nathan Bodenstab, and Paul Vozila, "Efficient language model adaptation with noise contrastive estimation and Kullback-Leibler regularization," in Proc. Interspeech, Hyderabad, India, Sept. 2018, pp. 3368-3372.
[25] Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio, "On using very large target vocabulary for neural machine translation," in Proc. ACL, Beijing, China, July 2015, pp. 1-10.
[26] Reinhard Kneser and Hermann Ney, "Improved backing-off for m-gram language modeling," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Detroit, MI, USA, May 1995, pp. 181-184.
[27] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul, "Fast and robust neural network joint models for statistical machine translation," in Proc. Assoc. for Computational Linguistics (ACL), Baltimore, MD, USA, June 2014, pp. 1370-1380.
[28] Jacob Goldberger and Oren Melamud, "Self-normalization properties of language modeling," in Proc. Assoc. for Computational Linguistics (ACL), Santa Fe, USA, Aug. 2018, pp. 764-773.
[29] Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran, "Low-rank matrix factorization for deep neural network training with high-dimensional output targets," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 6655-6659.
[30] Eugen Beck, Wei Zhou, Ralf Schlüter, and Hermann Ney, "LSTM language models for LVCSR in first-pass decoding and lattice-rescoring," arXiv preprint arXiv:1907.01030, July 2019.
[31] Hervé Bourlard and Christian J. Wellekens, "Links between Markov models and multilayer perceptrons," in Advances in Neural Information Processing Systems I, D. S. Touretzky, Ed., pp. 502-510. Morgan Kaufmann, San Mateo, CA, USA, 1989.
[32] Steve Young, Julian Odell, and Philip C. Woodland, "Tree-based state tying for high accuracy acoustic modelling," in Proc. Workshop on Human Language Technology, Plainsboro, NJ, USA, Mar. 1994, pp. 307-312.
[33] David Rybach, Stefan Hahn, Patrick Lehnen, David Nolden, Martin Sundermeyer, Zoltán Tüske, Simon Wiesler, Ralf Schlüter, and Hermann Ney, "RASR - the RWTH Aachen University open source speech recognition toolkit," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Honolulu, HI, USA, Dec. 2011.
[34] Simon Wiesler, Alexander Richard, Pavel Golik, Ralf Schlüter, and Hermann Ney, "RASR/NN: The RWTH neural network toolkit for speech recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014, pp. 3313-3317.