Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches

Juan Cruz-Benito1*, Sanjay Vishwakarma2†, Francisco Martin-Fernandez1, Ismael Faro1
1 IBM Quantum, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
2 Electrical and Computer Engineering, Carnegie Mellon University, Mountain View, USA
juan.cruz@ibm.com, svishwak@andrew.cmu.edu, paco@ibm.com, ismael.faro1@ibm.com

* Contact author
† Intern at IBM Quantum at the time of writing this paper

arXiv:2009.07740v4 [cs.CL] 12 Jan 2021

Abstract

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the Machine Learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the Deep-Learning-enabled language models approach, we detected a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like AWD-LSTMs, AWD-QRNNs, and Transformer while using transfer learning and different tokenizations to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach's different strengths and weaknesses and what gaps we find to evaluate the language models or apply them in a real programming context.

Introduction

We are digitally surrounded by computational Language Models (LMs) that guide us while writing to reduce the user effort, suggest different options for words/sentences to enhance our style, or fix our grammatical/correctness errors accurately (Kannan et al. 2016; Bryant and Briscoe 2018; Ghosh and Kristensson 2017). Many of the keys we press while writing on a keyboard act as part of the inputs to compose new datasets for those models that shape how we communicate with others. Nevertheless, does it happen in the same way when we write code? Succinctly, yes. According to some recent surveys found in the literature (Allamanis et al. 2018; Chen, Le, and Babar 2020), the Natural Language Processing (NLP) subfield related to programming languages includes examples of LMs used in several tasks and contexts. For example, the authors of (Nguyen and Nguyen 2015; Bielik, Raychev, and Vechev 2016; Cruz-Benito et al. 2018) used different techniques such as graph-based statistical LMs, probabilistic LMs, or Deep Learning (DL) LMs to suggest code to programmers, similarly to auto-complete features in IDEs. LMs have been used to generate automated source code based on sample code inputs or pseudo-code and to evaluate how this generated code performs (Oda et al. 2015; Tiwang, Oladunni, and Xu 2019; Fedus, Goodfellow, and Dai 2018). Another exciting application of NLP to source code languages is the automatic translation between different languages. The work reported in (Nguyen, Nguyen, and Nguyen 2013; Roziere et al. 2020) explores different supervised and unsupervised approaches to migrate code between different programming languages to improve interoperability or port codebases written in obsolete or deprecated languages (such as COBOL or Python 2). Another example found is the use of Bayesian networks, attention mechanisms, and pointer networks (Proksch, Lerch, and Mezini 2015; Li et al. 2017; Donahue, Lee, and Liang 2020) to fill a given code portion with missing pieces.

There is a more general understanding of natural languages' different characteristics in the broad NLP field. Since there exist many research fields related to human languages, there is a richer background on existing language characteristics. For example, there is much knowledge on aspects like the minimal representation units of a word in a specific language, the most used words of a language, or whether a word is a neologism or not. Programming languages share some syntax similarities with spoken languages. However, they do not have the same restrictions in the sense of common words or neologisms (Allamanis et al. 2015; Karampatsis and Sutton 2019), or other syntax restrictions and features such as punctuation, format, or style. Every programming language has indeed reserved words and symbols to denote different actions, resources, or syntax. However, there is an essential part of the source code that is only limited by the programmer's imagination, the existing conventions, or the guides for good practices. As (Karampatsis and Sutton 2019) claims,

    [...] traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance [...]
In that paper, Karampatsis and Sutton (2019) present how segmenting words into subword units can improve source code modeling. Similarly, other researchers (Ganin et al. 2016; Kim et al. 2016; Karpathy 2016) dug into representing source code vocabulary with a similar emphasis on modeling words using sub-word units and envisioning their importance when using neural networks (NNs). Nevertheless, how does that word segmentation affect the accuracy or the appropriateness of the code generated or auto-completed in modern LMs using deep learning approaches? That kind of question raises the main goal for this paper: to discover what kinds of associations between different modern neural network architectures and tokenization models produce the best results when creating LMs to generate and auto-complete source code.

To pursue that goal, this research aims to conduct experiments combining different deep neural network (DNN) architectures with different tokenization and pre-trained models over an existing Python dataset. Using that experimentation, we want to investigate the combinations that improve code generation and auto-completion tasks (for example, filling the blanks) while checking the outcomes from those tasks using metrics like accuracy and human assessment.

The rest of the paper is as follows: Section 2 presents the different approaches followed during the research, the DNNs used, and the software methods and data employed. Section 3 describes the results achieved during the research according to different metrics and tests, while Section 4 discusses these findings and the implications of the results as appropriate. Finally, Section 5 presents some conclusions.

Materials and methods

We have trained a set of deep neural networks using different architectures, tokenization techniques, and software libraries to develop this research work. In the following, we introduce the different materials and methods employed for that purpose.

Deep Neural Networks and tokenization models used

Regarding the DNN architectures employed, we chose to use the following ones: ASGD Weight-Dropped LSTM (AWD-LSTM) (Merity, Keskar, and Socher 2018b), Quasi-Recurrent Neural Networks (QRNNs) (Bradbury et al. 2017), and Transformer (Vaswani et al. 2017). These DNN architectures have reportedly been getting state-of-the-art results (Merity, Keskar, and Socher 2018a,b; Wang, Gong, and Liu 2019; Gong et al. 2018; Takase, Suzuki, and Nagata 2018; Yang et al. 2018; Krause et al. 2018; Rae et al. 2019; Dai et al. 2019; Baevski and Auli 2018; Howard and Ruder 2018; Eisenschlos et al. 2019; Brown et al. 2020; Devlin et al. 2019) recently in the NLP field, in several groundbreaking digital products [1][2] and in some of the best-known datasets like the Penn Tree Bank (Mikolov et al. 2011), WikiText-2 and WikiText-103 (Merity et al. 2017), the One-Billion Word benchmark (Chelba et al. 2014), or The Hutter Prize Wikipedia dataset [3].

[1] https://openai.com/blog/openai-api/
[2] https://blog.google/products/search/search-language-understanding-bert/
[3] http://prize.hutter1.net/

The AWD-LSTM is a variation of the famous Long Short-Term Memory (LSTM) architecture (Hochreiter and Schmidhuber 1997). The LSTM is a type of Recurrent Neural Network (RNN) especially capable of processing and predicting sequences. That ability with sequences is the reason why LSTMs have been employed largely in LMs (Kim et al. 2016). The AWD-LSTM includes several optimizations compared to the regular LSTM. Two of the most important ones are the use of Average Stochastic Gradient Descent (ASGD) and WeightDropout. The ASGD is used as the NN's optimizer to consider the previous weights (not only the current one) during each training iteration. WeightDropout introduces the dropout technique (Srivastava et al. 2014) to avoid overfitting, but with the characteristic of returning zero not for a subset of activations between layers, like in traditional dropout, but for a random subset of weights.

The QRNN is another type of RNN that includes alternating convolutional and pooling layers in the architecture. This design makes the QRNN able to capture long-term sequences better and train much faster, since the convolutional layers allow the computation of intermediate representations from the input in parallel. They can be up to 16 times faster at train and test time than LSTMs while having better predictive accuracy than stacked LSTMs of the same hidden size. We use a modified QRNN (AWD-QRNN) that includes the same ASGD and WeightDropout modifications to improve its stability and optimize its capabilities, as for the AWD-LSTM.

We utilize AWD-LSTM and AWD-QRNN to produce LMs capable of solving the task of generating source code based on an input, as in the literature (Merity, Keskar, and Socher 2018a,b; Wang, Gong, and Liu 2019; Gong et al. 2018; Takase, Suzuki, and Nagata 2018; Yang et al. 2018; Krause et al. 2018; Howard and Ruder 2018; Eisenschlos et al. 2019).

The Transformer is probably the most popular current DNN architecture in NLP due to its performance and recent state-of-the-art results in many tasks. It is an encoder-decoder architecture in which each layer uses attention mechanisms. This use of (self-)attention mechanisms makes the Transformer able to model the relationships between all words in a sentence regardless of their respective positions. That implies a significant improvement over RNNs, enabling much more parallelization of data processing and unblocking the training over more massive datasets. The excellent results of the Transformer architecture empowered the NLP community to create new state-of-the-art transformer-based models (Young et al. 2018) like those used in the current research: GPT-2 (Radford et al. 2019), BERT (Devlin et al. 2019), and RoBERTa (Liu et al. 2019).

We choose to use GPT-2 since it is a causal (unidirectional) transformer that can predict the next token in a sequence. So, it can generate source code based on an input, allowing us to compare the results with the AWD-LSTM and AWD-QRNN experiments. Regarding BERT and RoBERTa, we used them to study how a masked modeling approach can perform source code auto-completion. In that case, we did not use them for text generation, as in the other experiments, since BERT and RoBERTa are not designed for text generation. However, they can generate text (more diverse, but slightly worse in quality) (Wang and Cho 2019).
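To make the two usage modes just described concrete, the following is a minimal sketch (not taken from the paper's repository) using HuggingFace's pipeline API with the base checkpoints named above; the fine-tuned weights used in the experiments are the ones published in the Zenodo record cited later.

    # Sketch only: contrasts causal next-token generation (GPT-2) with
    # masked-token filling (RoBERTa) using the base pre-trained checkpoints.
    from transformers import pipeline

    # Causal LM: extends a code prompt by predicting the next tokens.
    generator = pipeline("text-generation", model="gpt2")
    print(generator("from pandas import", max_length=10,
                    num_return_sequences=2, do_sample=True))

    # Masked LM: fills a masked position, which suits auto-completion.
    filler = pipeline("fill-mask", model="roberta-base")
    print(filler("from pandas import <mask>"))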
Considering the tokenization techniques, for every AWD-LSTM and AWD-QRNN we chose the following types of tokens: word, unigram, char, and byte-pair encoding (BPE) (Sennrich, Haddow, and Birch 2016), albeit some studies show that BPE is suboptimal for pre-training (Bostrom and Durrett 2020). For the Transformer models, we used the default ones from the pre-defined models: the WordPiece method (Schuster and Nakajima 2012) for BERT, and BPE over raw bytes instead of Unicode characters for GPT-2 and RoBERTa. The different techniques were selected because they produce different token granularities that can enrich our experimentation: full words, sub-words of specific sizes, character-sized tokens, or byte pairs. Also, they enable us to compare the tokenization between the different types of models and tasks to solve.

Experimentation details

The dataset used for the experimentation is the Python dataset included in the "GitHub CodeSearchNet Challenge dataset" (Husain et al. 2019). It includes 2 million (comment, Python code) pairs from open-source libraries. As observed during the dataset preparation for our experiments, there are about 11 million Python code sentences. The reason to choose this dataset is that it has already been used in previous research related to NLP and source code. The full dataset includes several languages: Go, Java, JavaScript, PHP, Python, and Ruby. We chose to use only the Python part of the dataset because it enables us to compare with the existing literature, which uses the Python language more than other programming languages. The software libraries and packages used primarily during the research were the following: FastAI (Howard and Gugger 2020), Google SentencePiece (Kudo and Richardson 2018), and HuggingFace's Transformers (Wolf et al. 2020). The preprocessing applied to the dataset included removing most of the code comments and autoformatting the code according to the PEP 8 Python style guide using the autopep8 [4] package. Regarding the AWD-LSTM networks, we have been using the FastAI-provided base models pre-trained on the Wikitext-103 dataset (Merity et al. 2017). There are no default pre-trained models in FastAI's AWD-QRNN version of those networks, so we trained them from scratch. As we introduced, regarding the Transformer architectures, we have been using three standard pre-trained models as a basis: GPT-2, BERT, and RoBERTa. In each case, the exact pre-trained models used were gpt2, bert-base-cased, and roberta-base. These pre-trained models are available from HuggingFace's models [5].

[4] https://pypi.org/project/autopep8/
[5] https://huggingface.co/models

As the reader could infer from the previous explanations about using pre-trained versions, we followed a transfer learning approach similar to other researchers in the existing literature (Howard and Ruder 2018; Ruder et al. 2019; Chronopoulou, Baziotis, and Potamianos 2019; Eisenschlos et al. 2019). We employed the models pre-trained on English texts to later fine-tune them for the selected tasks using the GitHub CodeSearchNet dataset. The deep-neural-networks-related source code was written using the FastAI library (versions 1.0.61 and 2 dev 0.0.21). To apply the different tokenization techniques to the AWD-LSTMs and AWD-QRNNs, we replaced the default spaCy tokenizer (Honnibal and Montani 2017) with Google SentencePiece (Kudo and Richardson 2018), following an approach similar to (Czapla, Howard, and Kardas 2018). In the development of the Transformer architectures, to see how they perform filling the blanks and generating texts, we used HuggingFace's Transformers library combined with FastAI v2 (following FastAI's example [6]), as included in the code repository that supports this paper. To train the neural networks, we have used some techniques worth mentioning (all the details are in the code repository). To find the most appropriate learning rate automatically, we used the function lr_find provided by FastAI, following the proposal of (Smith 2017). This function trains the DNN over the dataset for a few iterations while varying the learning rate from very low to very high at the beginning of each mini-batch of data, until the DNN diverges, to find which value is optimal regarding the error (loss) metrics. To pursue a faster convergence, we schedule the learning rate as described in (Smith and Topin 2019) using the one-cycle policy (fit_one_cycle) in FastAI. Considering the transfer learning technique used, we trained the first "one cycle" on top of the existing pre-trained model to later unfreeze all the model layers and do a more extended training (10-30 epochs) to improve the results. Regarding other training details, in general, we used the default parameters from FastAI, except for a fixed multiplier to control all the dropouts (drop_mult) in the AWD-LSTMs and AWD-QRNNs, set to 0.3 because of some heuristics discovered while testing during this research. Also, we decided to train similar architectures using a fixed number of epochs to make the models comparable. For the AWD-LSTM and AWD-QRNN, we used 30 epochs for fine-tuning because we found during the experimentation that the most remarkable improvement for every model occurs during that range of iterations. Similarly, we fine-tuned the transformers for ten epochs since we did not find a significant improvement after that. For more information about the training setup and software details, please refer to the repository that supports this paper and the FastAI documentation.

[6] https://docs.fast.ai/tutorial.transformers

Finally, the hardware used to run the different software and neural network training was a computer running Ubuntu Linux 18.04 LTS Bionic Beaver (64 bits). It has two Nvidia Tesla V100 GPUs with 16 gigabytes of RAM each (Nvidia CUDA version 10.1), a CPU with 16 cores (Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz), 120 gigabytes of RAM, and 120 gigabytes for the primary disk (HDD).

All the supporting materials and software details related to this paper are publicly available in a GitHub repository (Cruz-Benito and Vishwakarma 2020a). The NN models produced are under the Zenodo record (Cruz-Benito and Vishwakarma 2020b).
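For concreteness, the fine-tuning procedure described above can be sketched with the fastai v2 API roughly as follows. This is a minimal sketch, not the exact experiment code: `dls` stands for a fastai DataLoaders built from the tokenized Python corpus, and the learning-rate values are placeholders to be chosen after inspecting the lr_find plot.

    # Minimal sketch of the training loop described in the text (fastai v2 API).
    from fastai.text.all import *

    learn = language_model_learner(
        dls,
        AWD_LSTM,              # swap in AWD_QRNN for the non-pre-trained variant
        drop_mult=0.3,         # fixed dropout multiplier used in the experiments
        pretrained=True,       # Wikitext-103 weights shipped with fastai
        metrics=[accuracy],
    )

    learn.lr_find()                # learning-rate range test (Smith 2017)
    learn.fit_one_cycle(1, 1e-2)   # one cycle on top of the frozen pre-trained model
    learn.unfreeze()               # then unfreeze all layers
    learn.fit_one_cycle(30, 1e-3)  # extended fine-tuning (30 epochs for AWD-* models)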

Results

This section presents the results achieved after the full training of the selected DNN architectures with the different tokenization models.

As outlined in the previous section, we trained the AWD-LSTM and AWD-QRNN DNN architectures using different tokenization models (word, unigram, BPE, and char), and the Transformer using three different base models (GPT-2, BERT, and RoBERTa). We trained every AWD-LSTM and AWD-QRNN for one epoch to fit the model's head and fine-tuned them for 30 epochs. Meanwhile, the Transformer networks were trained equally for one epoch to fit the head and fine-tuned for ten epochs.

Figure 1: Evolution of the accuracy of neural networks devoted to source code generation during the training epochs

Figure 2: Evolution of the accuracy of neural networks devoted to filling the blanks during the training epochs

We followed a two-way strategy to evaluate the NNs trained: using the NN training metrics and human evaluation of the models' output. The metrics used are some of the most common in the literature: accuracy for the validation set and loss for the training and validation sets. They help the researchers understand how the NN acts over time, how the model is fitted to the dataset, and the performance and error scores while using the training and validation datasets. In this case, the accuracy is the score concerning the LM's ability to predict the next word or fill in the missing tokens accurately given a sequence of words from the validation set. The loss metrics report the error after applying the DNN to the training or validation set, respectively. Every implementation detail related to the DNNs and the metrics is available in the GitHub repository (Cruz-Benito and Vishwakarma 2020a). Apart from those metrics, we assessed the models' quality by applying them to the proposed tasks (generate text and auto-complete) and observing how they perform.
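To make the accuracy metric concrete, a generic top-1 next-token accuracy can be sketched as follows; this is an illustration only, not fastai's exact implementation.

    import torch

    def next_token_accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
        """Fraction of positions where the most probable token equals the target.

        logits:  (batch, seq_len, vocab_size) raw model outputs
        targets: (batch, seq_len) ids of the actual next tokens
        """
        preds = logits.argmax(dim=-1)            # pick the most probable token
        return (preds == targets).float().mean().item()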

Training results

Table 1 displays the final metrics for the different NNs at the end of the training. Similarly, figures 1 and 2 show the evolution of each model's accuracy during the training. Figures 3, 4, 5, and 6 show the training loss and validation loss evolution along the training epochs.

DNN architecture   | Epochs | Accuracy | Train loss | Validation loss | Pre-trained?
AWD-LSTM word      | 31     | 0.494893 | 2.308937   | 2.341698        | Yes
AWD-LSTM unigram   | 31     | 0.557226 | 1.639875   | 1.826841        | Yes
AWD-LSTM BPE       | 31     | 0.580373 | 1.561393   | 1.703536        | Yes
AWD-LSTM char      | 31     | 0.779633 | 0.757956   | 0.742808        | Yes
AWD-QRNN word      | 31     | 0.515747 | 1.972508   | 2.144126        | No
AWD-QRNN unigram   | 31     | 0.539951 | 1.790150   | 1.894901        | No
AWD-QRNN BPE       | 31     | 0.538290 | 1.824709   | 1.896698        | No
AWD-QRNN char      | 31     | 0.736358 | 0.944526   | 0.897850        | No
GPT-2              | 11     | 0.743738 | 1.407818   | 1.268246        | Yes
BERT               | 11     | 0.999238 | 0.014755   | 0.004155        | Yes
RoBERTa            | 11     | 0.999468 | 0.010569   | 0.002920        | Yes

Table 1: Results after full training of each NN architecture

On the one hand, according to the results displayed in Table 1 and Figure 1, for the neural networks intended for automated source code generation (AWD-LSTM, AWD-QRNN, and Transformer GPT-2), the NN-tokenization combination that performed best in terms of accuracy was the AWD-LSTM with char tokenization (accuracy 0.779633). The second one was the GPT-2 transformer model, with BPE over raw bytes tokenization (0.743738), and the third one the AWD-QRNN with char tokenization (0.736358). Related to the combination of the AWD-LSTM and AWD-QRNN architectures with the other tokenization techniques, we obtained poor results on accuracy: between 0.494893 and 0.580373. On the other hand, according to the results shown in Table 1 and Figure 2, both transformer models intended for auto-completion (BERT and RoBERTa) had excellent accuracy results: 0.999238 and 0.999468, respectively.

About how the pre-training and transfer learning affect the results, the two top results regarding accuracy come from models pre-trained on the English language (0.779633 and 0.743738), yet the third-best result was from a non-pre-trained network (0.736358). Comparing the similar networks, the average (mean) accuracy of the AWD-LSTM pre-trained versions is 0.603031 (standard deviation -std- of 0.123144), while the average accuracy of the AWD-QRNN non-pre-trained versions is 0.582587 (std of 0.103107). The only NN-tokenization combination that worked worse when it was pre-trained was the one with the word tokenization.
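These summary statistics can be reproduced directly from the accuracy column of Table 1, using the sample standard deviation, for example:

    import numpy as np

    awd_lstm = np.array([0.494893, 0.557226, 0.580373, 0.779633])   # pre-trained
    awd_qrnn = np.array([0.515747, 0.539951, 0.538290, 0.736358])   # from scratch

    print(awd_lstm.mean(), awd_lstm.std(ddof=1))   # ~0.603031, ~0.123144
    print(awd_qrnn.mean(), awd_qrnn.std(ddof=1))   # ~0.582587, ~0.103107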
Figure 3: Evolution of the training loss of DNNs devoted to generating source code during the training epochs

Figure 4: Evolution of the training loss of neural networks devoted to filling the blanks during the training epochs

Figure 5: Evolution of the validation loss of DNNs devoted to generating source code during the training epochs

Figure 6: Evolution of the validation loss of neural networks devoted to filling the blanks during the training epochs

Regarding the observed losses, it is worth commenting that the AWD-LSTM char, AWD-QRNN char, and the three transformer models (GPT-2, BERT, RoBERTa) could be trained for more epochs or with a higher learning rate. These models may be underfitting, since the training loss is higher than the validation loss (Table 1, figures 3, 4, 5, and 6).

To put in context the accuracy achieved during the experimentation, we compare the results with the existing literature. The papers (Tiwang, Oladunni, and Xu 2019; Li et al. 2017; Raychev, Bielik, and Vechev 2016) present models trained to generate Python code that were considered state of the art when published. Our three best models trained to generate source code outperform the approaches based on the following architectures: Vanilla LSTM (accuracy: 67.3%) (Li et al. 2017), Attention LSTM (69.8%) (Li et al. 2017), Pointer Mixture Network (70.1%) (Li et al. 2017), or Probabilistic Model (69.2%) (Raychev, Bielik, and Vechev 2016). Our models perform worse than other approaches based on the abstract syntax tree (AST) instead of the textual code: AST-LSTM (accuracy 90.3%) (Tiwang, Oladunni, and Xu 2019) or AST-MLP (90.1%) (Tiwang, Oladunni, and Xu 2019). So, considering this, our results are good in terms of accuracy compared to the existing peer-reviewed literature. As a side note, we did not find models available publicly to compare with, neither in these nor in other peer-reviewed papers. For the models devoted to auto-completing code (BERT and RoBERTa), as of now, we did not find peer-reviewed papers trying to solve this task using Transformer architectures.

Results of using the trained DNN models

In the following, we introduce some examples of how the different neural networks generated source code or auto-completed it. These examples are used to assess the models' output quality, avoiding relying only on the NN metrics. This validation using more methods than the simple metrics is common in the AI and NLP fields. Many papers employ human evaluation or specialized metrics to grasp the model's performance better and capture any other particular detail or quality indicator on the related task proposed. In this case, we show only the models with the best accuracy results.
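As a rough sketch of how such outputs can be obtained from the fastai learners (the exact scripts live in the companion repository; the prompt, number of words, and temperature below are illustrative values), next-word generation with an AWD-LSTM or AWD-QRNN language model looks like this, while the transformer outputs shown next were obtained through HuggingFace pipelines analogous to the earlier sketch:

    # Sketch only: `learn` is a fine-tuned fastai language-model learner
    # like the one in the training snippet above.
    prompt = "from pandas import"
    for _ in range(5):                       # five sampled continuations
        print(learn.predict(prompt, n_words=8, temperature=0.75))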
To generate source code, we show output from the AWD-LSTM char, AWD-QRNN char, and GPT-2 models. For auto-completion, we show results from BERT and RoBERTa. For the source code generation, every model faced the same challenges (Cruz-Benito and Vishwakarma 2020a). These are the results when each of them tries to generate five different sentences with the input from pandas import:

AWD-LSTM char

    "from pandas import defaults as _rp pandas ="
    "from pandas import DataFrameEncoder if self"
    "from pandas import OrderedDict else: passed"
    "from pandas import sequence if type(seq1) i"
    "from pandas import conditional_insertion as"

AWD-QRNN char

    "from pandas import webbrowser self.login fr"
    "from pandas import __http_args or [ 'Pipeli"
    "from pandas import iterations is not None p"
    "from pandas import service params to servic"
    "from pandas import difference if not isinst"

GPT-2

    "from pandas import time, np\n
        "
    "from pandas import pandas from time.time.datetime import Date\n
        with n"
    "from pandas import gtk, os\n
        from pandas_utils import pandas_utils\n
        import pylint"
    "from pandas import wcpy\n
        import cpy_context as cpy_context\n
        "
    "from pandas import gkpy\n
        """\n
        ... pass\n
        kwargs = cg"

To assess the generation, we do not focus on the semantics of the imports used or whether they are part of the Pandas library or not, but on the language's correctness. In general, from a Python language perspective, the outputs from GPT-2 are better. The outputs include line breaks, indentation, and fair use of multiple inputs in one sentence (except in one of the outputs). The AWD-LSTM and AWD-QRNN failed to auto-generate an import sentence appropriately, or at least, they fail regarding the usual manner employed by regular users. As a final comment on this, the other models trained failed on similar issues, plus they do not get enough semantic context related to the Pandas library.

Similarly, concerning the source code auto-completion, both BERT and RoBERTa tried to autofill the mask token in the sentence from pandas import [MASK]. These are the results:

BERT

    [{'sequence': '[CLS] from pandas import [SEP] [SEP]',
      'score': 0.9969683289527893,
      'token': 102,
      'token_str': '[SEP]'},
     {'sequence': '[CLS] from pandas import [CLS] [SEP]',
      'score': 0.0010887219104915857,
      'token': 101,
      'token_str': '[CLS]'},
     {'sequence': '[CLS] from pandas import. [SEP]',
      'score': 0.0004200416151434183,
      'token': 119,
      'token_str': '.'},
     {'sequence': '[CLS] from pandas import ; [SEP]',
      'score': 0.00027348980074748397,
      'token': 132,
      'token_str': ';'},
     {'sequence': '[CLS] from pandas import def [SEP]',
      'score': 8.858884393703192e-05,
      'token': 19353,
      'token_str': 'def'}]

RoBERTa

    [{'sequence': 'from pandas import\n',
      'score': 0.6224209666252136,
      'token': 50118,
      'token_str': 'Ċ'},
     {'sequence': 'from pandas import.',
      'score': 0.22222988307476044,
      'token': 4,
      'token_str': '.'},
     {'sequence': 'from pandas import ',
      'score': 0.038354743272066116,
      'token': 1437,
      'token_str': 'Ġ'},
     {'sequence': 'from pandas import\n\n',
      'score': 0.028566861525177956,
      'token': 50140,
      'token_str': 'ĊĊ'},
     {'sequence': 'from pandas import.',
      'score': 0.021909384056925774,
      'token': 479,
      'token_str': 'Ġ.'}]

For auto-completion, we observe that almost no auto-completion is right. The more accurate sentences from the Python language point of view are the ones in which the mask token has been replaced by white space or by a dot. Nevertheless, they are not correct, just closer to being right compared to the other ones. One interesting thing is that BERT assigns a very high score to a predicted mask which is not correct, and shallow scores to the other possible solutions (also incorrect). In the case of RoBERTa, it gives lower scores to all the solutions, yet it also fails on correctness: the second sentence predicted (score 0.222) can be closer to being right compared to the first one (score 0.622).

Discussion

Considering the results obtained, one could convincingly assert that the tokenization model used profoundly affects the results when generating automated source code. Although that may be accurate, we must discuss it carefully.

Discussing the outcomes from the resulting models

First, our overall results are consistent with the existing literature (Karampatsis and Sutton 2019; Ganin et al. 2016; Kim et al. 2016; Karpathy 2016). Sub-word tokenization works better in the case of modeling source code, as stated in (Karampatsis and Sutton 2019). Every result obtained is consistent in that sense. Even more, as envisioned in (Kim et al. 2016; Karpathy 2016), char tokenization should probably be the best option to try by default when dealing with LMs and source code. Furthermore, according to the results achieved, models such as GPT-2 (using a tokenization model based on BPE over raw bytes) can outperform LSTM/QRNN models like those we tested in grasping the internals of a programming language. As showcased in the results, even if GPT-2 was not the best model in terms of accuracy, it gave better code outputs than the other ones selected for the comparison.

As future work, it would be great to check whether the better textual output in the case of GPT-2 is because a) it is a much bigger and better pre-trained model (163,037,184 parameters against 36,491 for the AWD-LSTM and AWD-QRNN models), b) it is related to the dataset's size or quality, c) it is related to both causes, or d) it is related to other issues.
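As a side note on option a), parameter counts like the ones quoted above can be read directly from the model objects; a generic sketch follows, where `model` stands for any of the trained PyTorch modules (e.g., `learn.model` for the fastai learners or the HuggingFace model instance):

    # Generic way to count the parameters of a trained PyTorch module.
    def count_parameters(model) -> int:
        return sum(p.numel() for p in model.parameters())

    # e.g. count_parameters(learn.model) for the fastai learners,
    # or count_parameters(gpt2_model) for the HuggingFace models.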
Continuing with the comments about the accuracy, one may note that the textual outputs generated by AWD-LSTM char, AWD-QRNN char, and GPT-2 could be polished to be more accurate. The final training loss is higher than the validation loss for the three selected DNN architectures, which can be a sign of underfitting. We find the same issue (train loss > validation loss) for the BERT- and RoBERTa-based models. While the purpose of this paper is not to produce state-of-the-art results per se, we continued the training for over five more epochs to verify it. The improvement obtained from extending the training for the best approaches was residual in general, so we decided to report the results for 1+30 epochs in the AWD-LSTMs and AWD-QRNNs, and 1+10 epochs for the Transformers.

Examining the effect of pre-training and transfer learning

Regarding the pre-training and transfer learning, every pre-trained model (on English-based datasets) got better accuracy than its non-pre-trained counterpart, except in the word tokenization models. It seems to be strongly related to the statements we introduced at the beginning of this paper, citing (Allamanis et al. 2015; Karampatsis and Sutton 2019), about how source code does not have the same restrictions in the sense of common words or neologisms. In this sense, the conclusion comes to mind rapidly: if we consider source code words in "word" units, they probably will not fit in the fixed set of words used in a human language like English. So, the LM's knowledge acquired during the pre-training is not entirely valid when we get out of that fixed set of words that compose a language. Most words in the programming language are neologisms for the LMs pre-trained in English, and thus, the LM needs to incorporate them and their relationships into the learned knowledge. For the sub-word units, the LM can be less sensitive to the neologisms. Potentially, the more divided a word is, the more robust the LM could be, since the set of bytes or chars is more straightforward than the chunks present in richer constructions or information units.

Going deeper into this research, concerning the pre-training effect over LMs modeling source code, it could be worth researching the relationship between the pre-training in different human-spoken languages and the LM's ability to work with specific existing programming languages.

Reviewing the textual assessment of the resulting LMs

About the tests made generating source code or filling in the blanks using the trained LMs, we think that, in general, the textual results obtained are not so good, yet they are informative of how the LMs are working and how they can be improved. One of the things that can explain these results is the dataset used. In this case, we used a public dataset that other researchers can use to make results and experiments comparable and replicable. In the literature, we do not find a standard dataset for these tasks against which we can compare easily. Other papers (Tiwang, Oladunni, and Xu 2019; Li et al. 2017; Raychev, Bielik, and Vechev 2016) use custom datasets, but we find a lack of well-recognized code datasets in the literature. Comparing with other recent papers in the NLP field used as the basis for this research (Radford et al. 2019; Devlin et al. 2019; Liu et al. 2019; Brown et al. 2020), the dataset may be relatively small to train a big LM to accomplish appropriately challenging tasks like generating source code or auto-completing it. Future work may be testing these or new approaches on bigger datasets to train big LMs focused on modeling the Python language and checking whether the results are better. Recent examples of LMs, such as GPT-3 (Brown et al. 2020), claim to produce accurate textual outputs even in contexts in which they were not trained. Part of the explanation given for that ability is the use of gargantuan datasets combined with Transformer and other attention-based architectures. So, those approaches can also be relevant to other contexts like ours. Another line for future research can be using datasets focused on specific libraries or Python aspects and verifying whether the DNN models used in this paper specialize positively for those contexts.

Related to evaluating the code generated or filled in, we observed different approaches in the literature (Celikyilmaz, Clark, and Gao 2020; Roziere et al. 2020). In the context of LMs modeling source code, many papers and software libraries devoted to translating between programming languages typically evaluate text generation using methods and metrics like BLEU (Papineni et al. 2002) or variants like SacreBLEU (Post 2018). Other papers like (Tiwang, Oladunni, and Xu 2019) rely on the accuracy to assess the performance of a deep-learning-based LM. Some models can even solve different tasks that are part of existing benchmarks (Weston et al. 2015) or are evaluated by checking their perplexity (similarly to those that evaluate the model using the accuracy). The current tendency in large models is to use human intervention to judge the output's quality (Radford et al. 2019; Brown et al. 2020). We assessed the models using accuracy during our experiments and evaluated the models' textual outputs based on our prior human knowledge. It would be interesting for the future to plan new evaluation processes involving larger cohorts of source code experts to evaluate the models, as (Radford et al. 2019; Brown et al. 2020) do. One of the potential new assessments can be usability tests conducted with programmers. They could compare the code they would write against the code proposed by any of the DNNs presented here and the results from other common code auto-completion tools included in integrated development environments. As we outlined in the results section, relying only on metrics like accuracy should not be enough. As in our case, accuracy and the other metrics can be a good indicator of the model's performance, yet we need to verify the LMs' behavior and quality using complementary methods like specialized metrics or human evaluation. For tasks like auto-completion or source code generation, there are no existing specialized metrics (like BLEU in translation, for example), so one of the future research lines is improving the evaluation of LMs for source code. Based on some existing ideas in broad NLP, there are many opportunities to explore in that sense: from new test suites for language models used in source code contexts (Roziere et al. 2020) to behavioral testing (Ribeiro et al. 2020) or human-centric evaluation of the models (Celikyilmaz, Clark, and Gao 2020), with particular emphasis on reproducible and unbiased assessments, or combinations of automatic testing and human-centric assessments.
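For illustration, a BLEU-style score like the ones mentioned above could be applied to generated code against reference snippets as in the following sketch, using the sacrebleu package; the strings are toy examples, and the present paper did not report BLEU scores:

    import sacrebleu

    # Toy example: a model's hypothesis vs. one stream of reference completions.
    hypotheses = ["from pandas import DataFrame"]
    references = [["from pandas import DataFrame, Series"]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(bleu.score)   # corpus-level BLEU on a 0-100 scale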
Conclusions

This paper compares how different approaches to tokenization models, deep neural network architectures, pre-trained models, and transfer learning affect the results of language models used to generate source code or auto-complete software pieces. We studied different DNN architectures like AWD-LSTM, AWD-QRNN, and Transformer to seek which of them work better with different tokenization models (word, unigram, BPE, and char). Also, we compared the effect of pre-training on the results given by the LMs after training and fine-tuning them via transfer learning to work with other languages (from the English language to the Python programming language). As a result of this work, we find that in small LMs (like our AWD-LSTM and AWD-QRNN models), the tokenization using char-sized chunks works better than any other tokenization model. In larger models like the Transformer GPT-2, the accuracy was slightly worse than that of the other architectures; however, GPT-2 produced better results in the source code generation tests (even using another tokenization approach like BPE over raw bytes). For source code auto-completion, we tested transformer models like BERT and RoBERTa. While their accuracy was above that of any other models, they did not perform very well on the tasks proposed in our tests. In general, we find that pre-trained models work better, even if they were not trained initially for a programming language like Python (our models were pre-trained using the English language). Finally, related to evaluating tasks like automating source code generation and source code auto-completion, we raise concerns about the literature gaps and propose some research lines to work on in the future.

Acknowledgments

We thank the IBM Quantum team and the IBM Research ETX team for the insightful discussions about this research and the support received during the development of this research.

References

Allamanis, M.; Barr, E. T.; Bird, C.; and Sutton, C. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 38–49.

Allamanis, M.; Barr, E. T.; Devanbu, P.; and Sutton, C. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51(4): 1–37.

Baevski, A.; and Auli, M. 2018. Adaptive Input Representations for Neural Language Modeling. In International Conference on Learning Representations.

Bielik, P.; Raychev, V.; and Vechev, M. 2016. PHOG: probabilistic model for code. In International Conference on Machine Learning, 2933–2942.

Bostrom, K.; and Durrett, G. 2020. Byte Pair Encoding is Suboptimal for Language Model Pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, 4617–4624. Online: Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.findings-emnlp.414.

Bradbury, J.; Merity, S.; Xiong, C.; and Socher, R. 2017. Quasi-recurrent neural networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017). URL https://arxiv.org/pdf/1611.01576.pdf.

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33. URL https://arxiv.org/pdf/2005.14165.pdf.

Bryant, C.; and Briscoe, T. 2018. Language model based grammatical error correction without annotated training data. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 247–253.

Celikyilmaz, A.; Clark, E.; and Gao, J. 2020. Evaluation of Text Generation: A Survey. arXiv preprint arXiv:2006.14799.

Chelba, C.; Mikolov, T.; Schuster, M.; Ge, Q.; Brants, T.; Koehn, P.; and Robinson, T. 2014. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In Fifteenth Annual Conference of the International Speech Communication Association.

Chen, H.; Le, T. H. M.; and Babar, M. A. 2020. Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges. ACM Computing Surveys (CSUR).

Chronopoulou, A.; Baziotis, C.; and Potamianos, A. 2019. An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2089–2095.

Cruz-Benito, J.; Faro, I.; Martín-Fernández, F.; Therón, R.; and García-Peñalvo, F. J. 2018. A Deep-Learning-based proposal to aid users in Quantum Computing programming. In International Conference on Learning and Collaboration Technologies, 421–430. Springer.

Cruz-Benito, J.; and Vishwakarma, S. 2020a. cbjuan/tokenizers-neural-nets-2020-paper: v1.0. doi:10.5281/zenodo.4011767. URL https://github.com/cbjuan/tokenizers-neural-nets-2020-paper/.

Cruz-Benito, J.; and Vishwakarma, S. 2020b. NN models produced by cbjuan/tokenizers-neural-nets-2020-paper: v1.0. doi:10.5281/zenodo.4293857.
Czapla, P.; Howard, J.; and Kardas, M. 2018. Universal language model fine-tuning with subword tokenization for Polish. arXiv preprint arXiv:1810.10222.

Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J. G.; Le, Q.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2978–2988.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1).

Donahue, C.; Lee, M.; and Liang, P. 2020. Enabling Language Models to Fill in the Blanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2492–2501. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.225. URL https://www.aclweb.org/anthology/2020.acl-main.225.

Eisenschlos, J.; Ruder, S.; Czapla, P.; Kadras, M.; Gugger, S.; and Howard, J. 2019. MultiFiT: Efficient Multi-lingual Language Model Fine-tuning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5702–5707. Hong Kong, China: Association for Computational Linguistics. doi:10.18653/v1/D19-1572. URL https://www.aclweb.org/anthology/D19-1572.

Fedus, W.; Goodfellow, I.; and Dai, A. M. 2018. MaskGAN: Better Text Generation via Filling in the ______. In International Conference on Learning Representations. Online. URL https://arxiv.org/pdf/1801.07736.pdf.

Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1): 2096–2030.

Ghosh, S.; and Kristensson, P. O. 2017. Neural networks for text correction and completion in keyboard decoding. arXiv

Husain, H.; Wu, H.-H.; Gazit, T.; Allamanis, M.; and Brockschmidt, M. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.

Kannan, A.; Kurach, K.; Ravi, S.; Kaufmann, T.; Tomkins, A.; Miklos, B.; Corrado, G.; Lukacs, L.; Ganea, M.; Young, P.; et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 955–964.

Karampatsis, R.-M.; and Sutton, C. 2019. Maybe deep neural networks are the best choice for modeling source code. arXiv preprint arXiv:1903.05734.

Karpathy, A. 2016. The unreasonable effectiveness of recurrent neural networks. URL http://karpathy.github.io/2015/05/21/rnn-effectiveness.

Kim, Y.; Jernite, Y.; Sontag, D.; and Rush, A. M. 2016. Character-Aware Neural Language Models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, 2741–2749. AAAI Press.

Krause, B.; Kahembwe, E.; Murray, I.; and Renals, S. 2018. Dynamic Evaluation of Neural Sequence Models. In Dy, J.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 2766–2775. Stockholmsmässan, Stockholm Sweden: PMLR. URL http://proceedings.mlr.press/v80/krause18a.html.

Kudo, T.; and Richardson, J. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 66–71.

Li, J.; Wang, Y.; Lyu, M. R.; and King, I. 2017. Code completion with neural attention and pointer networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), 4159–4165.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.;
preprint arXiv:1709.06429 .                                     Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V.
                                                                2019. Roberta: A robustly optimized bert pretraining ap-
Gong, C.; He, D.; Tan, X.; Qin, T.; Wang, L.; and Liu, T.-Y.    proach. arXiv preprint arXiv:1907.11692 .
2018. Frage: Frequency-agnostic word representation. Ad-
vances in neural information processing systems 31: 1334–       Merity, S.; Keskar, N. S.; and Socher, R. 2018a. An anal-
1345.                                                           ysis of neural language modeling at multiple scales. arXiv
                                                                preprint arXiv:1803.08240 .
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term
                                                                Merity, S.; Keskar, N. S.; and Socher, R. 2018b. Regular-
memory. Neural computation 9(8): 1735–1780.
                                                                izing and Optimizing LSTM Language Models. In Interna-
Honnibal, M.; and Montani, I. 2017. spacy 2: Natural            tional Conference on Learning Representations.
language understanding with bloom embeddings, convolu-          Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2017.
tional neural networks and incremental parsing URL https:       Pointer sentinel mixture models. In Proceedings of the
//spacy.io/.                                                    5th International Conference on Learning Representations
Howard, J.; and Gugger, S. 2020. Fastai: A layered API for      (ICLR 2017). URL https://arxiv.org/pdf/1609.07843.pdf.
deep learning. Information 11(2): 108.                          Mikolov, T.; Deoras, A.; Kombrink, S.; Burget, L.; and
Howard, J.; and Ruder, S. 2018. Universal Language Model        Černockỳ, J. 2011. Empirical evaluation and combination
Fine-tuning for Text Classification. In Proceedings of the      of advanced language modeling techniques. In Twelfth an-
56th Annual Meeting of the Association for Computational        nual conference of the international speech communication
Linguistics (Volume 1: Long Papers), 328–339.                   association.
Nguyen, A. T.; and Nguyen, T. N. 2015. Graph-based sta-          Acoustics, Speech and Signal Processing (ICASSP), 5149–
tistical language model for code. In 2015 IEEE/ACM 37th          5152. IEEE.
IEEE International Conference on Software Engineering,           Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Ma-
volume 1, 858–868. IEEE.                                         chine Translation of Rare Words with Subword Units. In
Nguyen, A. T.; Nguyen, T. T.; and Nguyen, T. N. 2013. Lexi-      Proceedings of the 54th Annual Meeting of the Associa-
cal statistical machine translation for language migration. In   tion for Computational Linguistics (Volume 1: Long Papers),
Proceedings of the 2013 9th Joint Meeting on Foundations         1715–1725.
of Software Engineering, 651–654.                                Smith, L. N. 2017. Cyclical learning rates for training neural
Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda,      networks. In 2017 IEEE Winter Conference on Applications
T.; and Nakamura, S. 2015. Learning to generate pseudo-          of Computer Vision (WACV), 464–472. IEEE.
code from source code using statistical machine translation      Smith, L. N.; and Topin, N. 2019. Super-convergence: Very
(t). In 2015 30th IEEE/ACM International Conference on           fast training of neural networks using large learning rates.
Automated Software Engineering (ASE), 574–584. IEEE.             In Artificial Intelligence and Machine Learning for Multi-
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002.         Domain Operations Applications, volume 11006, 1100612.
BLEU: a method for automatic evaluation of machine trans-        International Society for Optics and Photonics.
lation. In Proceedings of the 40th annual meeting of the         Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and
Association for Computational Linguistics, 311–318.              Salakhutdinov, R. 2014. Dropout: a simple way to prevent
Post, M. 2018. A Call for Clarity in Reporting BLEU              neural networks from overfitting. The journal of machine
Scores. In Proceedings of the Third Conference on Machine        learning research 15(1): 1929–1958.
Translation: Research Papers, 186–191. Brussels, Belgium:        Takase, S.; Suzuki, J.; and Nagata, M. 2018. Direct Out-
Association for Computational Linguistics. doi:10.18653/         put Connection for a High-Rank Language Model. In Pro-
v1/W18-6319. URL https://www.aclweb.org/anthology/               ceedings of the 2018 Conference on Empirical Methods in
W18-6319.                                                        Natural Language Processing, 4599–4609.
Proksch, S.; Lerch, J.; and Mezini, M. 2015. Intelligent code    Tiwang, R.; Oladunni, T.; and Xu, W. 2019. A Deep Learn-
completion with Bayesian networks. ACM Transactions on           ing Model for Source Code Generation. In 2019 Southeast-
Software Engineering and Methodology (TOSEM) 25(1): 1–           Con, 1–7. IEEE.
31.                                                              Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and        L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At-
Sutskever, I. 2019. Language Models are Unsupervised             tention is all you need. In Advances in neural information
Multitask Learners .                                             processing systems, 5998–6008.
Rae, J. W.; Potapenko, A.; Jayakumar, S. M.; Hillier, C.; and    Wang, A.; and Cho, K. 2019. BERT has a Mouth, and It
Lillicrap, T. P. 2019. Compressive Transformers for Long-        Must Speak: BERT as a Markov Random Field Language
Range Sequence Modelling. In International Conference on         Model. In Proceedings of the Workshop on Methods for
Learning Representations.                                        Optimizing and Evaluating Neural Language Generation
                                                                 (NAACL HLT 2019), 30–36.
Raychev, V.; Bielik, P.; and Vechev, M. 2016. Probabilistic
model for code with decision trees. ACM SIGPLAN Notices          Wang, D.; Gong, C.; and Liu, Q. 2019. Improving Neural
51(10): 731–747.                                                 Language Modeling via Adversarial Training. In Chaud-
                                                                 huri, K.; and Salakhutdinov, R., eds., Proceedings of the
Ribeiro, M. T.; Wu, T.; Guestrin, C.; and Singh, S. 2020.
                                                                 36th International Conference on Machine Learning, vol-
Beyond Accuracy: Behavioral Testing of NLP Models with
                                                                 ume 97 of Proceedings of Machine Learning Research,
CheckList. In Proceedings of the 58th Annual Meeting of
                                                                 6555–6565. Long Beach, California, USA: PMLR. URL
the Association for Computational Linguistics, 4902–4912.
                                                                 http://proceedings.mlr.press/v97/wang19f.html.
Online: Association for Computational Linguistics. doi:
10.18653/v1/2020.acl-main.442. URL https://www.aclweb.           Weston, J.; Bordes, A.; Chopra, S.; Rush, A. M.; van
org/anthology/2020.acl-main.442.                                 Merriënboer, B.; Joulin, A.; and Mikolov, T. 2015. Towards
                                                                 ai-complete question answering: A set of prerequisite toy
Roziere, B.; Lachaux, M.-A.; Chanussot, L.; and Lample,          tasks. arXiv preprint arXiv:1502.05698 .
G. 2020. Unsupervised Translation of Programming Lan-
guages. Advances in Neural Information Processing Systems        Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.;
33.                                                              Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davi-
                                                                 son, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu,
Ruder, S.; Peters, M. E.; Swayamdipta, S.; and Wolf, T.          J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest,
2019. Transfer learning in natural language processing. In       Q.; and Rush, A. 2020. HuggingFace’s ransformers: State-
Proceedings of the 2019 Conference of the North American         of-the-Art Natural Language Processing. In Proceedings
Chapter of the Association for Computational Linguistics:        of the 2020 Conference on Empirical Methods in Natu-
Tutorials, 15–18.                                                ral Language Processing: System Demonstrations, 38–45.
Schuster, M.; and Nakajima, K. 2012. Japanese and korean         Online: Association for Computational Linguistics. URL
voice search. In 2012 IEEE International Conference on           https://www.aclweb.org/anthology/2020.emnlp-demos.6.
Yang, Z.; Dai, Z.; Salakhutdinov, R.; and Cohen, W. W.
2018. Breaking the Softmax Bottleneck: A High-Rank RNN
Language Model. In International Conference on Learning
Representations.
Young, T.; Hazarika, D.; Poria, S.; and Cambria, E. 2018.
Recent trends in deep learning based natural language pro-
cessing. IEEE Computational Intelligence magazine 13(3):
55–75.