A Neural Machine Translation Approach for Translating Malay Parliament Hansard to English Text

IALP 2020, Kuala Lumpur, Dec 4-6, 2020

Yu-Zane Low, Lay-Ki Soon, Shageenderan Sapai
School of Information Technology, Monash University, Malaysia
ylow0012@student.monash.edu, soon.layki@monash.edu, ssap0002@student.monash.edu

Abstract—Parliament Hansard is one of the most precious texts made available to the public. In Malaysia, the parliament Hansard records the debates and discussions of parliament in the Malay language. Topic modelling, sentiment analysis, relation extraction, trend prediction and temporal analyses are frequently applied to parliament Hansard to discover interesting patterns. However, most of the mature tools for such processing tasks work on English text. As such, before the Malaysian parliament Hansard can be further processed, it is essential to translate the Malay text into English. Several machine translation approaches are surveyed in this paper. From the literature review, neural machine translation, particularly the Transformer model, has been proven to provide promising results in translating different languages. In this paper, we present our implementation of neural machine translation for Malay to English text. The experiments show that with a good parallel corpus and minimal fine-tuning, neural MT can achieve a BLEU score as high as 35.42.

Keywords—machine translation, neural machine translation, parliament hansard

I. INTRODUCTION

Machine translation (MT) is a sub-field of computational linguistics that involves the use of computers to translate text or speech from one natural language to another [16], and it is said to be as old as the modern digital computer [6]. The Globalization and Localization Association (GALA), a global, non-profit trade organization for the translation and localization industry, states that there are three approaches to MT: rule-based systems (RbMT), statistical systems (SMT) and neural MT (NMT). In this study, we examine the feasibility, efficiency and accuracy of the three approaches and determine which is most suitable to be implemented for the project.

This paper is organised as follows: Section 2 details our literature review on machine translation. Section 3 outlines the adopted Transformer model for our Malay to English translation. We then present the experimental setup in Section 4. Section 5 discusses the experimental results. Lastly, the paper is concluded in Section 6.

II. LITERATURE REVIEW

Rule-based Machine Translation (RbMT) is one of the earliest translation techniques created [1]. According to Aasha and Ganesh, RbMT exploits linguistic rules and language arrangement to perform translation into the destination language. The strength of RbMT systems is their ability to deeply analyse both the syntactic and semantic levels. However, the disadvantage of this approach is that it requires a lot of linguistic knowledge, and it is not possible to generate all the rules in a language [7].

In 2015, Aasha and Ganesh [1] analysed English to Malayalam RbMT systems, finding a system made in 2009 that “works up to 6-word simple sentences” and a system made in 2011 that achieved 53.63% accuracy. In the conclusion of their study, they mentioned that their paramount objective was to push the limitations of RbMT systems by including as many rules as possible to produce an error-free system. This parallels what Charoenpornsawat and colleagues observed about the weaknesses of RbMT systems: the requirement of linguistic knowledge and the importance of covering all the rules in a language (the more rules, the better). Charoenpornsawat and colleagues carried out a case study on ParSit, an RbMT system using an interlingua-based approach, and discovered that ParSit generates many errors by “choosing incorrect meaning”, at a rate of 81.74%. They further improved ParSit by adding the C4.5 and RIPPER rule-learning algorithms [7]. The accuracy was boosted to above 70% with 2.5k training sentences. They concluded that their method is an adaptive model: it can be applied to other languages and does not require linguistic knowledge.

In 2007, Simard et al. [18] and Terumasa [20] were able to improve RbMT systems by combining the RbMT with a statistical phrase-based post-editing system and a statistical post-editor respectively. Terumasa’s system provided a significant improvement in translation accuracy over the plain RbMT, at significance level 0.01, in both open and closed tests. As for Simard et al., their RbMT plus statistical phrase-based post-editing system (SYSTRAN + PORTAGE) was able to outperform both the RbMT (SYSTRAN) and the statistical machine translation system (PORTAGE), especially on the Europarl domain. Note that PORTAGE is a statistical phrase-based MT system, but it was configured and trained for post-editing.

In 1949, Warren Weaver suggested that the problems of MT could be “attacked” with statistical methods, but the approach was quickly abandoned by researchers due to the lack of processing power and the scarcity of machine-readable texts [6]. That changed as time passed, and Brown et al. proposed a statistical approach to MT in 1990. Before NMT was introduced to the field, statistical phrase-based (PbMT) approaches were the long-standing dominant method in MT [4, 12]. Because of this dominance and popularity, phrase-based statistical machine translation approaches are the focus of this study, at the expense of the word-based approach.

The phrase-based statistical MT approach, or the alignment template approach, allows for general many-to-many relations between words [15]. According to Och and Ney (2004), the term “phrase” refers to a consecutive sequence of words in a text and is distinct from the usage of the term in the linguistic sense [15]. The advantages of SMT include the flexibility of the system, as it is not built specially for one language [16], and the fact that translation is learned automatically from a bilingual training corpus [15]. Because of its nature, SMT requires large bilingual corpora for training MT models [21]; corpus creation can be expensive, especially for users with limited resources, and the results of SMT can be unexpected, as superficial fluency is deceiving [16].

In low-resource settings, i.e. with small bilingual corpora, SMT has been shown to achieve better results and performance than NMT models [12, 21]. Trieu et al. (2017) showed that with 800k parallel sentences for training, they were able to produce an NMT with better BLEU scores (28.93 and 26.81) than their SMT (27.28 and 26.36). On the other hand, with 456k and 130k parallel sentences for training, the SMT was able to outperform the NMT, by BLEU margins of +0.48 and +3.11 for 456k parallel sentences and +2.95 and +7.15 for 130k parallel sentences. Trieu et al. (2017) concluded that for Asian language pairs (Japanese-English, Indonesian-Vietnamese, and English-Vietnamese), SMT performs better than NMT, but NMT’s performance grows faster with more resources.

In a study analysing SMT output for six different language-English translation pairs, conducted by Wu et al. in 2011, a BLEU score of 50.84 was attained for English-to-Spanish translation, with scores ranging from 31.44 to 48.63 for seven further translations [23]. For comparison, Wu et al. accomplished a BLEU score of 47.24 for English-to-French translation, while Vaswani et al. obtained a score of 41.8 with their NMT system based on self-attention, considered “the current state-of-the-art” of its time by Artetxe et al. [2]. However, Wu et al. did not specify the number of training sentences used to achieve their results, and the size of their corpus is reported only as a “corpus size”. It is important to note that Wu et al. trained their SMT on the medical domain while Vaswani et al. trained their NMT on the WMT 2014 English-French dataset. This is significant because Koehn and Knowles (2017) [10] showed that both SMT and NMT perform better when trained and tested on the medical domain (BLEU scores of 43.5 and 39.4 respectively) than on other domains. All in all, Wu et al.’s achievement is very impressive and shows that SMT is a solid option for MT, but its high BLEU score does not necessarily mean that SMT performs better than Vaswani et al.’s NMT [22].

In recent years, NMT has emerged as the rising-star MT approach: it displays formidable performance on public benchmarks [5], possesses the potential to address the weaknesses of older MT systems [24], and is currently a commonly used method for machine translation [9]. NMT strives to build and train a single, large neural network that takes a sentence as input and outputs its translation, unlike the long-established PbMT, which is composed of many sub-components that are tuned separately [3]. The strengths of NMT include the avoidance of fragile design choices found in PbMT [24] and its compactness compared to PbMT; e.g. the Lua OpenNMT system consists of 4K lines of code while the Moses SMT system has over 100K lines with language modelling included [9]. NMT has also been reported to produce better results than SMT in both automatic metrics (BLEU) and human evaluation [24]. According to the paper Six Challenges for Neural Machine Translation by Koehn and Knowles in 2017 [10], the weaknesses of NMT include lower quality for out-of-domain translation and poor performance under low-resource settings, among others.

In a case study comparing the translation quality of NMT and PbMT by Bentivogli and others in 2016 [4], they pitted three state-of-the-art PbMT systems against an NMT system on the same data and utilized high-quality post-edits of the outputs to identify and measure post-editing effort and translation error types such as morphological errors, lexical errors and word-order errors. The results showed that NMT was superior to PbMT in terms of all the error types that were measured. Their analysis also highlighted some weaknesses in the NMT that can be improved. Based on our research, we found three approaches/architectures in NMT, namely RNN, Transformer and DNN. Nonetheless, we will not explore the DNN architecture in NMT, as it is very far from perfection [19] and there is not a lot of research utilizing DNN for MT.

Two state-of-the-art NMT systems were recently developed by Vaswani et al. in 2017 [22] and Wu et al. in 2016 [24]. Vaswani et al.’s architecture is a newer form of NMT, as it does not exploit recurrent neural networks [9, 24] or deep neural networks [19]. They claimed that their Transformer architecture can be trained significantly faster than architectures based on recurrent or convolutional layers and stated that they reached a new “state-of-the-art”. As mentioned earlier, they were able to acquire a BLEU score of 41.8, while Wu et al.’s Google Neural Machine Translation (GNMT) achieved 41.16 on the English-to-French newstest2014 test, and they did so with only a fraction of the training cost [24]. Beyond that, the Transformer model was able to outperform other state-of-the-art NMTs.

Even though Vaswani et al. proposed a superior system, the GNMT remains a formidable force in the field of NMT. The GNMT is implemented using Long Short-Term Memory (LSTM) RNNs and was reported to achieve levels of accuracy on par with the average bilingual human translator on some test sets. Compared to previous PbMT systems, the GNMT was able to deliver an approximately 60% reduction in translation errors on some language pairs [24].

III. TRANSFORMER MODEL FOR MALAY TO ENGLISH TRANSLATION

As we wanted the best MT system to perform the Malay-to-English translation, we opted to use Vaswani and colleagues’ solution for NMT, namely the Transformer model [22]. Fortunately, OpenNMT, an open-source toolkit for NMT, supports training of the Transformer model as proposed by Vaswani et al. in 2017. The Transformer model renounces recurrence and depends entirely on the attention mechanism to extract global dependencies between input and output. This contradicts the popular approach of using recurrent models such as the GNMT, which contains two recurrent neural networks in its architecture. The other beauty of the Transformer model is the cost of training.

For the WMT 2014 English-to-German benchmark dataset (4.5M sentence pairs), the Transformer model took only 3.5 days to train on 8 P100 GPUs. This factor sealed our decision. Figures 1 and 2 outline the architecture of the model proposed by Vaswani et al. [22]. The left and right sides of Figure 1 are the encoder and the decoder of the neural network respectively. Given an input, the encoder maps a sequence of symbol representations to a sequence of continuous representations. The continuous representations are then passed to the decoder, which produces a sequence of outputs one element at a time. The main aspect of the model is the attention function. As described in Vaswani et al.’s paper, the attention function takes a query and a set of key-value pairs and maps them to an output. The attention function in this architecture is applied as the “Scaled Dot-Product Attention” shown in Figure 2, where Q represents the Query; K, the Key; and V, the Value. The function is so named because the dot-product attention in the algorithm is applied with a scaling factor of $1/\sqrt{d_k}$, where $d_k$ is the dimension of the keys. The model computes the attention function on a set of queries, keys and values packed together into the matrices Q, K and V, so that all queries are processed simultaneously. The computation of the output matrix can be described by the equation below:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$$

Fig. 2. (left) Architecture of Scaled Dot-Product Attention. (right) Multi-Head Attention in the Transformer Model embodies multiple attention layers running in parallel [22]

As illustrated in Figure 2, the “Scaled Dot-Product Attention” is run in h parallel layers, or heads. Hence, there are h sets of $d_v$-dimensional output values. These output values are then concatenated (Concat) and projected (Linear) to produce the final output.
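To make Eq. (1) and the multi-head mechanism of Figure 2 concrete, below is a minimal PyTorch sketch of scaled dot-product and multi-head attention. It is an illustration of the equations only, not the OpenNMT implementation; the widths (d_model = 512, h = 8 heads) follow the base Transformer, while the batch and sequence sizes are arbitrary.

    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        # Eq. (1): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ v

    h, d_model, seq_len = 8, 512, 10      # 8 heads, base-Transformer width
    d_k = d_model // h                    # 64 dimensions per head
    x = torch.randn(1, seq_len, d_model)  # one "sentence" of 10 token embeddings

    # Learned projections for Q, K, V and the final output (the Linear blocks in Fig. 2)
    w_q, w_k, w_v, w_o = (torch.nn.Linear(d_model, d_model) for _ in range(4))

    def split_heads(t):
        # (batch, seq, d_model) -> (batch, h, seq, d_k): h attention layers in parallel
        return t.view(1, seq_len, h, d_k).transpose(1, 2)

    heads = scaled_dot_product_attention(
        split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x)))
    concat = heads.transpose(1, 2).reshape(1, seq_len, d_model)  # Concat in Fig. 2
    out = w_o(concat)                                            # Linear in Fig. 2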
Due to the architecture and input format of the Transformer model available in the OpenNMT toolkit, the words in the source and target training datasets, as well as in any input to be translated, need to be preprocessed and tokenized. Otherwise, words with capital letters, or words followed by a punctuation mark without a space, might be treated as unique words, i.e. “So,” versus “So ,” (note the space between “So” and the comma). Thus, the overall program must preprocess the data that will be fed into the NMT.
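The tokenization itself can be done with OpenNMT’s own tools; purely as an illustration of the problem above, a naive preprocessing step that separates punctuation from words and normalises case might look like this:

    import re

    def naive_tokenize(line):
        # Split punctuation away from words so "So," and "So ," yield the same tokens
        line = re.sub(r'([.,!?;:()"])', r' \1 ', line)
        # Collapse repeated whitespace and lowercase so surface variants
        # share one vocabulary entry
        return re.sub(r'\s+', ' ', line).strip().lower()

    print(naive_tokenize('Jadi, saya pasti.'))  # -> 'jadi , saya pasti .'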

                                                                                             IV. EXPERIMENTS
                                                                    A. Experimental Design
                                                                       The specification of the system is detailed in Table I.

TABLE I. SYSTEM SPECIFICATIONS

Component      Specification
GPU            NVIDIA GeForce RTX 2060
CPU            AMD Ryzen 5 3600 6-Core Processor
RAM            16GB
Python         3.7.8
PyTorch        1.4.0
TorchVision    0.5.0

As the training was executed on the GPU, a high-end GPU is recommended for faster training. Due to the limitations of our system, we had to include the setting “-valid_batch_size 16” when calling the training function for the Transformer model. The default maximum validation batch size in OpenNMT-py is 32, and with it the training consistently stopped abruptly because the GPU was unable to accommodate the NMT’s memory needs.
          Fig. 1. The Transformer – model architecture [22]         to accommodate the NMT’s needs.

Fig. 3. OpenNMT recommended settings for training Transformer Model

To compensate for the relatively low amount of memory on the GPU, the maximum batch size used for training was also halved from the recommended value (from 4096 to 2048) that is used to replicate the model produced in the Google setup. According to the OpenNMT website, the Transformer model is said to be very sensitive to hyperparameters; hence, the settings that we changed may very well affect our trained model. Apart from the hyperparameter changes made to adapt to the GPU, the other relevant parameters were not changed. Figure 3 shows the recommended settings for training the Transformer model.
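For reference, a training invocation of the kind summarised in Figure 3 would look roughly as follows. This sketch follows the Transformer recipe documented for OpenNMT-py 1.x, with our two adjustments (token batch size halved to 2048, validation batch size lowered to 16); the data and model paths are placeholders, and the exact entry point and flags may differ between OpenNMT-py versions.

    onmt_train -data data/ms-en -save_model models/ms-en \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
        -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
        -batch_size 2048 -batch_type tokens -normalization tokens -accum_count 2 \
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 \
        -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
        -valid_batch_size 16 -world_size 1 -gpu_ranks 0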

B. Parallel Corpus

The parallel corpus was obtained from https://github.com/huseinzol05/Malay-Dataset. Under its Translation section, the author Zolkepli (2018) [25] amassed a large parallel corpus consisting of 3 million sentences in the “malay-english” directory. The nature of the dataset is not stated explicitly by the author, but based on observation, it appears to be extracted from Wikipedia, news websites, etc. The website http://opus.nlpl.eu/ also provided us access to OpenSubtitles, a database containing more than 3 million subtitles in over 60 languages, including Malay [13]. We were able to obtain 2 million sentence pairs from OpenSubtitles. Before the model was trained, 40,000 sentences were extracted from Zolkepli’s dataset to be used as validation data for the NMT, and 20% of the remaining sentences were partitioned as our test dataset for BLEU evaluation once the training was complete. A total of 4,382,200 sentences was used to train the NMT.
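A sketch of this split, with hypothetical file names for the parallel corpus (one Malay and one English sentence per line, aligned by line number); note that in our setup the validation sentences were drawn from Zolkepli’s portion specifically, which this sketch simplifies:

    import random

    with open('corpus.ms', encoding='utf-8') as f_ms, \
         open('corpus.en', encoding='utf-8') as f_en:
        pairs = list(zip(f_ms, f_en))  # the full corpus is held in memory

    random.seed(13)  # arbitrary seed for a reproducible shuffle
    random.shuffle(pairs)

    valid, rest = pairs[:40_000], pairs[40_000:]  # 40,000 validation sentences
    n_test = len(rest) // 5                       # 20% of the remainder for BLEU testing
    test, train = rest[:n_test], rest[n_test:]

    for name, subset in (('train', train), ('valid', valid), ('test', test)):
        with open(name + '.ms', 'w', encoding='utf-8') as f_ms, \
             open(name + '.en', 'w', encoding='utf-8') as f_en:
            f_ms.writelines(ms for ms, en in subset)
            f_en.writelines(en for ms, en in subset)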
V. RESULTS AND DISCUSSION

After 5 days of training, the model was ready to translate. The model can translate short sentences relatively well, such as “Aku pasti merasa sangat malu” in Figure 4, which roughly translates to “I will feel very shy” depending on the context.

Fig. 4. Resultant model translation of an example of a short sentence

Its translation is arguably correct. The model also performed very well for some longer sentences, such as the one in Figure 5. However, without the additional parameters “-replace_unk” and “-report_align”, and in the presence of “rarer” words, the translated sentences were inaccurate or unsatisfactory. Without the “-replace_unk” parameter, some translated sentences contain the string “<unk>”, which marks words for which the model is not sure of the target word. If “-report_align” is not set but “-replace_unk” is, some words would simply translate to “.”. These occurrences are especially common with names such as “Ocasio-Cortez” and “Zane”. Only with both parameters set can we obtain properly translated sentences.
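For reference, a translation invocation with both parameters enabled would look roughly like the following; the checkpoint and file names are placeholders, and “-report_align” requires a compatible OpenNMT-py version:

    onmt_translate -model models/ms-en_step_200000.pt \
        -src input.ms -output pred.en \
        -replace_unk -report_align -gpu 0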
Fig. 5. Resultant model translation of an example of a long sentence

Despite that, “-report_align” is not a miracle cure. In one particular instance, the model failed to translate the term “Arab-Israeli War”; instead, it came out as “War War”. This is most likely due to the functionality of “-report_align”, as the word “Arab-Israeli” would have been translated to “.” without the setting.

Apart from the occasional “bad” selection of target words for some names and terms, and the incorrect translation of informal sentences, the model performed much better than we expected. As the Malay language consists of words that share the same substring but have different prefixes, suffixes and meanings (e.g. guna, mengguna, menggunakan), we were worried that the MT model would be unable to accommodate the many variants of a word. The worry was for naught, as the model managed to capture most of the common variants of most words, such as “jelas”, “guna” and “diskriminasi”. Even a complex sentence like “Cuma saya rasa kebimbangan, takut-takut istilah sebegini ia bercanggah dengan undang-undang ataupun tidak bercanggah dengan undang-undang.” was properly translated into “I just feel worried, this term is either against the law or not in conflict with the law.”. Whether or not the model simplified the sentence coincidentally, it returned a very satisfactory translation of an unnecessarily complex sentence.

By running the BLEU evaluation on the test dataset, we were able to obtain a BLEU score of 35.42, which is very promising. With Zolkepli’s publicly available dataset, anyone with a GPU with at least 6GB of RAM can achieve machine translation of this level. However, as we do not know the exact nature of the training dataset, the NMT could be challenged by domain mismatch, a known challenge in machine translation [10]. Differences in meaning, expression and style could be misinterpreted by the NMT. In other words, we do not know the true and effective BLEU score of the NMT on Malaysian parliament Hansard. The lack, or even absence, of such a dataset/parallel corpus in this domain means that there is no way for us to measure the BLEU score of the NMT on a Malaysian parliament Hansard dataset.
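Corpus-level BLEU can be computed with a standard scorer. The sketch below uses the sacrebleu Python package and hypothetical file names; it is one common way to compute the score, not necessarily the exact scorer used here.

    import sacrebleu

    # pred.en holds the model's translations and test.en the references,
    # one sentence per line and in the same order
    with open('pred.en', encoding='utf-8') as f:
        hypotheses = [line.strip() for line in f]
    with open('test.en', encoding='utf-8') as f:
        references = [[line.strip() for line in f]]  # a single reference set

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print('BLEU = {:.2f}'.format(bleu.score))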

For comparison, we trained another NMT using the Transformer model but with only the dataset from OpenSubtitles. Even though the OpenSubtitles dataset has 2M sentence pairs, the BLEU score it obtained was an abysmal 0.22 when tested with the same test dataset. This finding demonstrates the importance of the nature and content of the training dataset. Apart from that, the model was not practical for translating parliament Hansard, as it was only able to translate short sentences (5 to 7 words) with mediocre results at best. It consistently failed to translate longer sentences, producing meaningless target sentences that usually contain repetitions of a certain word. The extremely poor performance is expected, as the OpenSubtitles dataset does not contain many sentence pairs with more than 10 words.

In other words, Zolkepli’s publicly available dataset is a game-changer for training Malay-to-English neural machine translation; the results are degrees apart. Even though there is the possibility of domain mismatch, the model trained with Zolkepli’s dataset and OpenSubtitles is still capable of producing logical and proper translations, with some errors when facing the problems stated earlier in this section (rare words and informal sentences).

VI. CONCLUSION

A neural machine translation approach has been proposed to translate Malay Parliament Hansard to English text. With the publicly available NMT toolkit OpenNMT and Zolkepli’s Malay dataset, we managed to train a Transformer model neural machine translation system as proposed by Vaswani et al. [22]. This paper shows that the tools for automatically translating Malay Parliament Hansard to English are readily available, and their strength is both formidable and astonishing. Hopefully, the Transformer model can soon be improved further to adapt to more uncommon words, names and terms. For our future work, we intend to experiment with Moses [11] and compare its effectiveness and efficiency against NMT.
REFERENCES

[1] Aasha, V. C., & Ganesh, A. (2015). Rule Based Machine Translation: English to Malayalam: A Survey. https://www.researchgate.net/publication/291947299_Rule_Based_Machine_Translation_English_t
[2] Artetxe, M., Labaka, G., & Agirre, E. (2018). Unsupervised Statistical Machine Translation. https://arxiv.org/pdf/1809.01272.pdf
[3] Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/pdf/1409.0473.pdf
[4] Bentivogli, L., Bisazza, A., Cettolo, M., & Federico, M. (2016). Neural versus Phrase-Based Machine Translation Quality: a Case Study. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 257–267. https://www.aclweb.org/anthology/D16-1025.pdf
[5] Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., Yepes, A. J., Koehn, P., Logacheva, V., Monz, C., Negri, M., Névéol, A., Neves, M., Popel, M., Post, M., Rubino, R., Scarton, C., Specia, L., Turchi, M., … Zampieri, M. (2016). Findings of the 2016 Conference on Machine Translation (WMT16). Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, 131–198. http://statmt.org/wmt16/results.html
[6] Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L., & Roossin, P. S. (1990). A Statistical Approach To Machine Translation. Computational Linguistics, 16(2), 79–85. https://www.aclweb.org/anthology/J90-2002.pdf
[7] Charoenpornsawat, P., Sornlertlamvanich, V., & Charoenporn, T. (2002). Improving Translation Quality of Rule-based Machine Translation. http://www.mt-archive.info/Coling-2002-Charoenpornsawat.pdf
[8] Eisele, A., Federmann, C., Saint-Amand, H., Jellinghaus, M., Herrmann, T., & Chen, Y. (2008). Using Moses to Integrate Multiple Rule-Based Machine Translation Engines into a Hybrid System. http://www.statmt.org/moses/
[9] Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). OpenNMT: Open-Source Toolkit for Neural Machine Translation. https://arxiv.org/pdf/1701.02810.pdf
[10] Koehn, P., & Knowles, R. (2017). Six Challenges for Neural Machine Translation. https://arxiv.org/pdf/1706.03872.pdf
[11] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. http://www.statmt.org/moses/
[12] Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical Phrase-Based Translation. Proceedings of HLT-NAACL 2003, 48–54. https://www.aclweb.org/anthology/N03-1017.pdf
[13] Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
[14] Och, F. J. (2003). Minimum Error Rate Training in Statistical Machine Translation.
[15] Och, F. J., & Ney, H. (2004). The Alignment Template Approach to Statistical Machine Translation. https://dl.acm.org/doi/pdf/10.1162/0891201042544884
[16] Saini, S., & Sahula, V. (2015). A Survey of Machine Translation Techniques and Systems for Indian Languages. Proceedings of the 2015 IEEE International Conference on Computational Intelligence and Communication Technology (CICT 2015), 676–681. https://doi.org/10.1109/CICT.2015.123
[17] Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Barone, A. V. M., Mokry, J., & Nădejde, M. (2017). Nematus: a Toolkit for Neural Machine Translation. https://arxiv.org/pdf/1703.04357.pdf
[18] Simard, M., Ueffing, N., Isabelle, P., & Kuhn, R. (2007). Rule-based Translation With Statistical Phrase-based Post-editing. https://dl.acm.org/doi/pdf/10.5555/1626355.1626383
[19] Singh, S. P., Kumar, A., Darbari, H., Singh, L., Rastogi, A., & Jain, S. (2017). Machine Translation Using Deep Learning: An Overview. 2017 International Conference on Computer, Communications and Electronics (COMPTELIX 2017), 162–167. https://doi.org/10.1109/COMPTELIX.2017.8003957
[20] Terumasa, E. (2007). Rule Based Machine Translation Combined with Statistical Post Editor for Japanese to English Patent Translation. http://www.ipdl.inpit.go.jp/homepg_e.ipdl
[21] Trieu, H.-L., Tran, D.-V., & Nguyen, L.-M. (2017). Investigating Phrase-Based and Neural-Based Machine Translation on Low-Resource Settings. 31st Pacific Asia Conference on Language, Information and Computation (PACLIC 31), 384–391.
[22] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. https://arxiv.org/pdf/1706.03762.pdf
[23] Wu, C., Xia, F., Deleger, L., & Solti, I. (2011). Statistical Machine Translation for Biomedical Text: Are We There Yet? AMIA Annual Symposium Proceedings, 2011, 1290–1299.
[24] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. http://arxiv.org/abs/1609.08144
[25] Zolkepli, H. (2018). Malay-Dataset: We Gather Bahasa Malaysia Corpus! https://github.com/huseinzol05/Malay-Dataset
