Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

Page created by Edward Benson
 
CONTINUE READING
1

                                               Survey on Publicly Available Sinhala Natural
                                                Language Processing Tools and Research
                                                                                                       Nisansa de Silva

                                              Abstract—

                                              Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs
                                              to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the
                                              perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic
                                              drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from
                                              Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing.
arXiv:1906.02358v10 [cs.CL] 3 Sep 2021

                                              However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper
                                              is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the
                                              researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and
                                              perpetually update it periodically to reflect the advances made in the field.

                                              Index Terms—Sinhala, Natural Language Processing, Resource Poor Language

                                                                                                                    F

                                         1    I NTRODUCTION                                                             one hand, according to Liddy [24], these systems can be
                                                 1                                                                      categorized into the following layers; phonological, morpho-
                                         Sinhala language, being the native language of the Sin-
                                                                                                                        logical, lexical, syntactic, semantic, discourse, and pragmatic.
                                         halese people [2–4], who make up the largest ethnic group
                                                                                                                        The phonological layer deals with the interpretation of lan-
                                         of the island country of Sri Lanka, enjoys being reported as
                                                                                                                        guage sounds. As such, it consists of mainly speech-to-text
                                         the mother tongue of approximately 16 million people [5, 6].
                                                                                                                        and text-to-speech systems. In cases where one is working
                                         To give a brief linguistic background for the purpose of
                                                                                                                        with written text of the language rather than speech, it is
                                         aligning the Sinhala language with the baseline of English,
                                                                                                                        possible to replace this layer with tools which handle Op-
                                         primarily it should be noted that Sinhala language belongs
                                                                                                                        tical Character Recognition (OCR) and language rendering
                                         to the same the Indo-European language tree [7, 8]. How-
                                                                                                                        standards (such as Unicode [26]). The morphological layer
                                         ever, unlike English, which is part of the Germanic branch,
                                                                                                                        analyses words at their smallest units of meaning. As such,
                                         Sinhala belongs to the Indo-Aryan branch. Further, Sinhala,
                                                                                                                        analysis on word lemmas and prefix-suffix-based inflection
                                         unlike English, which borrowed the Latin alphabet, has its
                                                                                                                        are handled in this layer. Lexical layer handles individual
                                         own writing system, which is a descendant of the Indian
                                                                                                                        words. Therefore tasks such as Part of Speech (PoS) tagging
                                         Brahmi script [9–15]. By extension, this makes Sinhala Script
                                                                                                                        happens here. The next layer, syntactic, takes place at the
                                         a member of the Aramaic family of scripts [16, 17]. Thus
                                                                                                                        phrase and sentence level where grammatical structures
                                         by inheritance, Sinhala writing system is abugida (alphasyl-
                                                                                                                        are utilized to obtain meaning. Semantic layer attempts to
                                         labary), which to say that consonant-vowel sequences are
                                                                                                                        derive the meanings from the word level to the sentence
                                         written as single units [18]. It should be noted that the mod-
                                                                                                                        level. Starting with Named Entity Recognition (NER) at
                                         ern Sinhala language have loanwords from languages such
                                                                                                                        the word level and working its way up by identifying the
                                         as Tamil, English, Portuguese, and Dutch due to various
                                                                                                                        contexts they are set in until arriving at overall meaning. The
                                         historical reasons [19]. Regardless of the rich historical array
                                                                                                                        discourse layer handles meaning in textual units larger than a
                                         of literature spanning several millennia (starting between
                                                                                                                        sentence. In this, the function of a particular sentence maybe
                                         3rd to 2nd century BCE [20, 21]), modern natural language
                                                                                                                        contextualized within the document it is set in. Finally, the
                                         processing tools for the Sinhala language are scarce [22].
                                                                                                                        pragmatic layer handles contexts read into contents without
                                             Natural Language Processing (NLP) is a broad area cov-
                                                                                                                        having to be explicitly mentioned [23, 24]. Some forms
                                         ering all computational processing and analysis of human
                                                                                                                        of anaphora (coreference) resolution [27–31] fall into this
                                         languages. To achieve this end, NLP systems operate at
                                                                                                                        application.
                                         different levels [23–25]. A graphical representation of NLP
                                                                                                                            On the other hand, Wimalasuriya and Dou [25] catego-
                                         layers and application domains are shown in Figure 1. On
                                                                                                                        rize NLP tools and research by utility. They introduce three
                                                                                                                        categories with increasing complexity; Information Retrieval
                                         •   Nisansa de Silva is with the Department of Computer Science & Engi-
                                                                                                                        (IR), Information Extraction (IE), and Natural Language Un-
                                             neering, University of Moratuwa.
                                             E-mail: nisansa@cse.mrt.ac.lk                                              derstanding (NLU). Information Retrieval covers applications,
                                         Manuscript revised: September 7, 2021.
                                                                                                                        which search and retrieve information which are relevant
                                           1. Englebretson and Genetti [1] observe that in some contexts the            to a given query. For pure IR, tools and methods up-to
                                         Sinhala language is also referred as Sinhalese, Singhala, and Singhalese       and including the syntactic layer in the above analysis are
2

 Natural Language Understanding

                                                                                  Domain
                                                                                 knowledge                                                    Pragmatic Layer

                                                                          Application reasoning and
                                                                                  execution

                                                             Contextual                               Utterance
                                                             Reasoning        Discourse context       planning
                                                                                                                                              Discourse Layer
         Information Extraction

                                                   NER and Semantic
                                                       tagging                 Semantic model
                                                                                                      Semantic realisation                    Semantic Layer

               Information Retrieval                 Parsing                                              Syntactic realisation               Syntactic Layer
                                                                                  Grammar

                                               PoS tagging
                                                                                  Lexicon
                                                                                                               Lexical realisation              Lexical Layer

                                        Morphological Analysis
                                                                            Morphological Rules
                                                                                                              Morphological realisation   Morphological Layer

                                         Speech Analysis
                                                                            Pronunciation model
                                                                                                                   Speech synthesis        Phonological Layer

                                     Input text                                                                        Output text

Fig. 1: NLP layers and tasks [23]

used. Information Extraction, on the other hand, extracts                                   this survey, we shall also consider the Sri Lankan NLP tools
structured information. The difference between IR and IE                                    repository, lknlp3 . This manuscript is at version 4.0.0. The
is the fact that IR does not change the structure of the                                    latest version of the manuscript can be obtained from arXiv4
documents in question. Be them structured, semi-structured,                                 or ResearchGate5 .
or unstructured, all IR does is fetching them as they are. In                                   Figure 3 in Appendix A shows the most prolific re-
comparison, IE, takes semi-structured or unstructured text                                  searchers in the domain of Sinhala NLP. The nodes contain
and puts them in a machine readable structure. For this, IE                                 the name of the researcher along with the total number of
utilizes all the layers used by IR and the semantic layer. Nat-                             Sinhala NLP papers that researcher has authored. The edges
ural Language Understanding is purely the idea of cognition.                                between two researchers are labeled with the number of
Most NLU tasks fall under AI-hard category and remain                                       Sinhala NLP papers the relevant pair of researchers have co-
unsolved [23]. However, with varying accuracy, some NLU                                     authored. Given that the objective of the visualization is to
tasks such as machine translation2 are being attempted.                                     portray the co-operation between researchers, the threshold
The pragmatic layer of the above analysis belongs to the                                    used here is to have at least 3 papers on Sinhala NLP
NLU tasks while the discourse layer straddles information                                   with on another researcher. We have also added labels to
extraction and natural language understanding [23].                                         clusters in cases where all or the majority of researchers
    The objective of this paper is to serve as a comprehensive                              in those clusters have the same affiliation. It is observable
survey on the state of natural language processing resources                                that the cluster from the Department of Computer Science &
for the Sinhala language. The initial structure and content                                 Engineering, University of Moratuwa is the most prolific in
of this survey are heavily influenced by the preliminary                                    Sinhala NLP research.
surveys carried out by de Silva [22] and Wijeratne et al.                                       The remainder of this survey is organized as follows;
[23]. However, our hope is to host this survey at arXiv                                     Section 2 introduces some important properties and con-
as a perpetually evolving work which continuously gets                                      ventions of the Sinhala language which are important for
updated as new research and tools for Sinhala language                                      the development and understanding of Sinhala NLP. Sec-
are created and made publicly available. Hence, it is our                                   tion 3 discusses the various tools and research available for
hope that this work will help future researchers who are                                    Sinhala NLP. In this section we would discuss both pure
engaged in Sinhala NLP research to conduct their literature                                 Sinhala NLP tools and research as well as hybrid Sinhala-
surveys efficiently and comprehensively. For the success of
                                                                                                3. https://github.com/lknlp/lknlp.github.io
  2. This is, however, not without the criticism of being nothing more                          4. https://arxiv.org/abs/1906.02358
than a Chinese room [32] rather than true NLU.                                                  5. http://bit.ly/31AhvvR
3

English work. We will also discuss research and tools which           Herath et al. [49] categorize Sinhala suffixes along the
contributes to Sinhala NLP either along with or by the help       attributes of: gender, number, definiteness, case, and con-
of Tamil, the other official language of Sri Lanka. Section 4     junctive. They further claim that there are 3 types of suffixes:
gives a brief introduction to the primary language sources        Suf1 adds gender, number, and definiteness; Suf2 adds case;
used by the studies discussed in this work. Finally, Section 5,   and Suf3 adds conjunctive. Conjunctive is claimed to be
concludes the survey.                                             equivalent to too and and in English. We show an extension
                                                                  of the suffix structure proposed by Herath et al. [49] in
2     P ROPERTIES OF THE S INHALA L ANGUAGE                       Table 4.
Before moving on to discussing Sinhala NLP resources, we
shall give a brief introduction to some of the important          3     S INHALA NLP RESOURCES
properties of Sinhala language, which impact the develop-
ment of Sinhala NLP resources. Sinhala grammar has two            In this section we generally follow the structure shown in
forms: written (literary) and spoken. These forms differ from     Figure 1 for sectioning. However, in addition to that, we
each-other in their core grammatical structures [1, 33, 34].      also discuss topics such as available corpora, other data
The written form strictly adheres to the SOV (Subject,            sets, dictionaries, and WordNets. We focus on NLP tools
Object, and Verb) configuration [35, 36]. Further, in the         and research rather than the mechanics of language script
written form, subject-verb agreement is enforced [37] such        handling [50–55]. One of the earliest attempts on Sinhala
that, in order to be grammatically correct, the subject and the   NLP was done by Herath et al. [56]. However, progress
verb must agree in terms of: gender (male/female), num-           on that project has been minimal due to the limitations of
ber (singular/plural) and person (1st/2nd/3rd). However,          their time. The later work by Nandasara [57] has not caught
in spoken Sinhala, the SOV order can be neglected [38]            much of the advances done up to the time of its publication.
and male singular 3rd person verb can be used for all             Given that it was a decade old by the time the first edition
nouns [37]. Sinhala is also a head-final language, where          of this survey was compiled, we observe the existence of
the complements and modifiers would appear before their           many new discoveries in Sinhala NLP which have not been
heads [39] this is similar to that of English and dissimilar      taken into account by it. A review on some challenges and
to that of French. In total, according to Abhayasinghe [40],      opportunities of using Sinhala in computer science was
there are 25 types of simple sentence structures in Sinhala.      done by Nandasara and Mikami [58]. At this point, it is
Similar to many Indo Aryan languages, animacy plays a             worthy to note that the largest number of studies in Sinhala
major role in Sinhala grammar in syntactic and semantic           NLP has been on optical character recognition (OCR) rather
roles [41–43]. Comparative studies done by Noguchi [44]           than on higher levels of the hierarchy shown in Figure 1. On
and by Miyagishi [34, 45] have found that animacy extends         the other hand, the most prolific single project of Sinhala
its influence from phrase level to sentence level in Sinhala      NLP we have observed so far is an attempt to create an
(e.g., Usage of post-positions [35, 46]). On this matter, Ta-     end-to-end Sinhala-to-English translator [18, 59–76]. Tamil,
ble 1 explains grammatical cases and inflections of animate       the other official language of Sri Lanka is also a resource
common nouns while Table 2 explains grammatical cases             poor language. However, due to the existence of larger
and inflections of inanimate common nouns. We provide             populations of Tamil speakers worldwide, including but not
a comparative analysis of parsing the very simple English         limited to economic powerhouses such as India, there are
sentence “I eat a red apple” and its Sinhala, Hindi, and French   more research and tools available for Tamil NLP tasks [23].
translations in Fig 2. English and French parsing was done        Therefore, it is rational to notice that Sinhala and Tamil NLP
using the Stanford Parser6 . Hindi parsing was done using the     endeavours can help each other. Especially, given the above
IIIT-Hyderabad Parser7 and the study by Singh et al. [47].        fact, that these are official languages of Sri Lanka, results
    Herath et al. [48, 49] argue that pure Sinhala words did      in the generation of parallel data sets in the form of official
not have suffixes and that adding suffixes was incorporated       government documents and local news items. A number
to Sinhala after 12th century BC with the influx of Sanskrit      of researchers make use of this opportunity. We shall be
words. With this, they declare Sinhala to have to following       discussing those applications in this paper as well. Further,
types of words:                                                   there have been some fringe implementations, which bridge
                                                                  Sinhala with other languages such as Japanese [8, 21, 77–79].
    1) Suffixes
    2) Nouns
    3) Cases                                                      3.1    Corpora
    4) Verbs                                                      For any language, the key for NLP applications and im-
    5) Conjunctions and articles                                  plementations is the existence of adequate corpora. On this
    6) Adjectives                                                 matter, a relatively substantial Sinhala text corpus8 was
    7) Demonstratives, Interrogatives, and negatives              created by Upeksha et al. [80, 81] by web crawling. Later a
    8) Particles and prefixes                                     smaller Sinhala newes corpus9 was created by de Silva [22].
They further divide nouns into five groups: material, agen-       Both of the above corpora are publicly available. However,
tive, common, abstract, and proper. In addition to these,         none of these come close to the massive capacity and range
they also introduce compound nouns.We show the noun               of the existing English corpora. A word corpus of ap-
categorization proposed by Herath et al. [48] in Table 3.         proximately 35,000 entries was developed by Weerasinghe

    6. http://nlp.stanford.edu:8080/parser/                           8. https://osf.io/a5quv/
    7. http://ltrc.iiit.ac.in/analyzer/                               9. https://osf.io/tdb84/
4

TABLE 1: Examples for grammatical cases and inflection of animate common nouns

                                                                 Singular                                                Plural
 Form      Case                        Masculine                                Feminine
                                                                                                           Masculine           Feminine
                                Definite    Indefinite                 Definite        Indefinite
                                                                                      ගැහැ ය
    1      Nominative                සා               ෙස               ගැහැ ය                                   ස්                ගැහැ
                                                                                      ගැහැ ෙය
    2      Accusative
                                     සා               ෙස               ගැහැ ය           ගැහැ යක                                ගැහැ
    3      Auxiliary
                                                                                        ගැහැ ය ට
    4      Dative                    සාට              ෙස ට             ගැහැ යට                                       ට         ගැහැ       ට
                                                                                        ගැහැ යකට
    5      Genitive
                                   සාෙ              ෙස ෙ              ගැහැ යෙ          ගැහැ යකෙ                      ෙ       ගැහැ        ෙ
    6      Locative
    7      Instrumental
                                  සාෙග             ෙස ෙග            ගැහැ යෙග          ගැහැ යකෙග                  ෙග         ගැහැ         ෙග
    8      Ablative
    9      Vocative                   ස                ස               ගැහැ ය            ගැහැ ය                                ගැහැ

TABLE 2: Examples for grammatical cases and inflection                      released16 a massive corpus of text and stop words taken
of inanimate common nouns
Note that the grammatical cases of Auxiliary and Vocative do not exist      from a decade of Sinhala Facebook posts. A parallel corpus
for inanimate nouns in Sinhala.                                             of Sinhala and English was collected by Banón et al. [86]
                                                                            containing 217,407 sentences and available to download
                                   Singular                                 from their website17 . However the later audit by Caswell
                                                           Plural
  Form    Case               Definite Indefinite                            et al. [87] raised issues on the quality of that data set.
    1     Nominative                                                        A parallel corpus18 of aligned Sinhala-English documents
                              ෙප ත         ෙප ත             ෙප
    2     Accusative                                                        and sentences obtained from crawling the web was released
    4     Dative             ෙප තට         ෙප තකට      ෙප        වලට        by Sachintha et al. [88].
    5     Genitive           ෙප ෙ                                               As for Sinhala-Tamil corpora, Hameed et al. [89] claim
                                           ෙප තක           ෙප    වල
    6     Locative          ෙප ෙත                                           to have built a sentence aligned Sinhala-Tamil parallel cor-
    7     Instrumental      ෙප ෙත                                           pus and Mohamed et al. [90] claim to have built a word
                                           ෙප ත        ෙප       ව
    8     Ablative          ෙප                                              aligned Sinhala-Tamil parallel corpus. However, at the time
                                                                            of writing this paper, neither of them was publicly available.
TABLE 3: Noun categorization by Herath et al. [48]                          A very small Sinhala-Tamil aligned parallel corpus created
                                                                            by Farhath et al. [91] using order papers of government
  Type               Examples                                               of Sri Lanka is available to download19 . A Sinhala and
                                                                            Tamil annotated corpus for emotion analysis was created
  Material              , ෙගය                                               by Jenarthanan et al. [92].
  Agentive            ව නා, ම
  Common             ෙග යා,    සා                                           3.2   Data Sets
  Abstract             , උස                                                 Specific data sets for Sinhala, as expected, is scarce. How-
  Proper             ෙක ළඹ, අමර                                             ever, a Sinhala PoS tagged data set [93–95] is available to
  Compound              බ ( +බ ),                 ම    (        +ම )        download from github20 . Further, a Sinhala NER data set
                                                                            created by Manamini et al. [96] is also available to download
                                                                            from github21 .
et al. [82]. But it does not seem to be online anymore. A                       Facebook has released FastText [97–99] models for the
number of Sinhala-English parallel corpora were introduced                  Sinhala language trained using the Wikipedia corpus. They
by Guzmán et al. [83]. This includes a 600k+ Sinhala-                      are available as both text models22 and binary files23 . Using
English subtitle pairs10 initially collected by [84], 45k+                  the above models by Facebook, Lakmal et al. [100] have
Sinhala-English sentence pairs from GNOME11 , KDE12 , and                   created an extended FastText model trained on Wikipedia,
Ubuntu13 . Guzmán et al. [83] further provided two mono-                   News, and official government documents. The binary file24
lingual corpora for Sinhala. Those were a 155k+ sentences of                of the trained model is available to be downloaded. Herath
filtered Sinhala Wikipedia14 and 5178k+ sentences of Sinhala                 16. https://bit.ly/2GEI4d6
common crawl15 . Wijeratne and de Silva [85] have publicly                   17. https://www.paracrawl.eu/
                                                                             18. https://github.com/kdissa/comparable-corpus
  10. http://bit.ly/2KsFQxm                                                  19. http://bit.ly/2HTMEme
  11. http://bit.ly/2Z8q0fo                                                  20. http://bit.ly/2Krhrbv
  12. http://bit.ly/2WLY6bI                                                  21. http://bit.ly/2XrwCoK
  13. http://bit.ly/2wLVZGt                                                  22. http://bit.ly/2JXAyL8
  14. http://bit.ly/2EQZ7oM                                                  23. http://bit.ly/2JY5J9c
  15. http://bit.ly/2ZaQFZo                                                  24. http://bit.ly/2WowH0h
5

                                                                                                   S

                       S
                                                                                NP                                     VP

              NP                    VP

                                                                                                       NP                          V

                     VBP                       NP

                                                                                             JJ                  NP

              PRP                   DT         JJ           NN

                                                                                PRP                         JJ         NN
               I     eat            a          red      apple

                                                                                මම           ර          ඇප            ෙග ය        ක

(a) English                                                      (b) Sinhala

                           S
                                                                                                   S

              NP                          VP
                                                                                      VN                         NP

                               NP                     VGF
                                                                               CLS          V          DET                MWN

              PRP     JJ            NN         VM           VP
                                                                                                                      N         ADJ

               म      लाल           सेब        खाता          ँ
                                                                               Je          mange       une        pomme         rouge

(c) Hindi                                                        (d) French

Fig. 2: Parse trees for the sentence “I eat a red apple” in four languages.

et al. [101] has compiled a report on the Sinhala lexicon for     earliest and most extensive Sinhala-English dictionary avail-
the purpose of establishing a basis for NLP applications. A       able for consumption was by Malalasekera [103]. However,
comparative analysis of Sinhala word embedding has been           this dictionary is locked behind copyright laws and is not
conducted by Lakmal et al. [100]. A dataset25 consisting          available for public research and development. This copy-
of 3576 Sinhala documents drawn from Sri Lankan news              right issue is shared with other printed dictionaries [104–
websites and tagged (CREDIBLE, FALSE, PARTIAL or UN-              109] as well. The dictionary by Kulatunga [110] is publicly
CERTAIN) was published by Jayawickrama et al. [102]               available for usage through an online web interface but does
                                                                  not provide API access or means to directly access the data
                                                                  set. The largest publicly available English-Sinhala dictionary
3.3   Dictionaries                                                data set is from a discontinued FireFox plug-in EnSiTip [111]
A necessary component for the purpose of bridging Sinhala         which bears a more than passing resemblance to the above
and English resources are English-Sinhala dictionaries. The       dictionary by Kulatunga [110]. Hettige and Karunananda
                                                                  [62] claim to to have created a lexicon to help in their
 25. https://github.com/LIRNEasia/MisinformationCorpusSinhala     attempt to create a system capable of English-to-Sinhala
6

TABLE 4: Extension of the suffix structure proposed by Herath et al. [49]
Number = Singular (S) / Plural (P)
Definite = Definite (D) / Indefinite (I) / Undecided (U)
Case = Nominative (N) / Accusative (A) / Dative (Da) / Genitive (G) / Instrumentive (In) / Auxiliary (Au) / Locative (L) / Ablative (Ab)
Conjun = with suffix th (Y) / without suffix th (N)

                                                                                               Attributes
 Suf1       Suf2        Suf3         Noun                 -tree-
                                                                           Number            Definite  Case                   Conjun
    -          -          -           ගස්                  (-s)              P                  U     N, A, V                   N
    අ          -          -           ගස                  (the-)             S                  D     N, A, V                   N
   අ           -          -          ගස                    (a-)              S                  I      N, A                     N
   අක          -          -          ගසක                 (of a-)             S                  I        G                      N
    අ          ට          -           ගසට              (to the-)             S                  D       Da                      N
    -          ඒ          -           ගෙස              (in the-)             S                  D       Au                      N
    -         වල          -          ගස්වල               (on -s)             P                  U        L                      N
    -        වලට          -         ගස්වලට               (to -s)             P                  U       Da                      N
    -         එ           -          ගෙස               (of the-)             S                  D        G                      N
   අ         ඉ            -         ගස                 (from a-)             S                  I       Au                      N
    -          -         උ           ග                  (-s and)             P                  U      N, A                     Y
    අ          -                     ගස                  (-and)              S                  D      N, A                     Y
   අ           -         උ          ගස                  (a- too)             S                  I      N, A                     Y
   අක          -                    ගසක              (of a -, too)           S                  U      N, A                     Y
    අ          ට                    ගසට             (to the- too)            S                  D       Da                      Y
    -          ඒ                    ගෙස               (in -s too)            S                  D       Au                      Y
    -         වල                   ගස්වල              (on -s too)            P                  U        L                      Y
    -        වලට                   ගස්වලට             (to -s too)            P                  U       Da                      Y
    -         එ                     ගෙස             (of the- too)            S                  D        G                      Y
   අ         ඉ           උ         ගස               (from a- too)            S                  I       Au                      Y

machine translation. A review on the requirements for                   what Arukgoda et al. [121] have cloned to use in their appli-
English-Sinhala smart bilingual dictionary was conducted                cation uploaded on github26 . However, even at its peak, due
by Samarawickrama and Hettige [112].                                    to the lack of volunteers for the crowd-soured methodology
    There exists the government sponsored trilingual dic-               of populating the WordNet, it was at best an incomplete
tionary [113], which matches Sinhala, English, and Tamil.               product. Another effort to build a Sinhala Wordnet was
However, other than a crude web interface on the ministry               initiated by Welgama et al. [122] independently from above;
website, there is no efficient API or any other way for a               but it too have stopped progression even before achieving
researcher to access the data on this dictionary. Weerasinghe           the completion level of Wijesiri et al. [119].
and Dias [114] have created a multilingual place name
database for Sri Lanka which may function both as a dic-
                                                                        3.5   Morphological Analyzers
tionary and a resource for certain NER tasks.
                                                                        As shown in Fig 1, morphological analysis is a ground
                                                                        level necessary component for natural language processing.
3.4   WordNets
                                                                        Given that Sinhala is a highly inflected language [22, 37, 38],
WordNets [115] are extremely powerful and act as a versatile            a proper morphological analysis process is vital. The ear-
component of many NLP applications. They encompass a                    liest attempt on Sinhala morphological analysis we have
number of linguistic properties which exist between the                 observed are the studies by Herath et al. [48, 49]. They are
words in the lexicon of the language including but not                  more of an analysis of Sinhala morphology rather than a
limited to: hyponymy, hypernymy, synonymy, and meronymy.                working tool. As such we discussed the observations and
Their uses range from simple gazetteer listing applica-                 conclusions of these works at Section 2. It is also worth to
tions [25] to information extraction based on semantic simi-            note that these works predates the introduction of Sinhala
larity [116, 117] or semantic oppositeness [118]. An attempt            unicode and thus use a transliteration of Sinhala in the Latin
has been made to build a Sinhala Wordnet by Wijesiri et al.             alphabet.
[119]. For a time it was hosted on [120] but it too is now
defunct and all the data and applications are lost other than             26. https://github.com/jseanm1/aruthSWSD
7

    The next attempt by Herath et al. [123] creates a modular     and Silva [140] proposed a stochastic POS tagger based on a
unit structure for morphological analysis of Sinhala. Much        small 10,000 word corpus drawn from Facebook and Twitter.
later, as a step on their efforts to create a system with the
ability to do English-to-Sinhala machine translation, Hettige     3.7       Parsers
and Karunananda [59] claim to have created a morphologi-
                                                                  The PoS tagged data then needs to be handed over to
cal analyzer (void of any public data or code), which links
                                                                  a parser. This is an area which is not completely solved
to their studies of a Sinhala parser [60] and computational
                                                                  even in English due to various inherent ambiguities in
grammar [18]. Hettige et al. [71] further propose a multi-
                                                                  natural languages. However, in the case of English, there
agent System for morphological analysis. Welgama et al.
                                                                  are systems which provide adequate results [141] even if not
[124] attempted to evaluate machine learning approaches
                                                                  perfect yet. The Sinhala state of affairs, is that, the first parser
for Sinhala morphological analysis. Yet another indepen-
                                                                  for the Sinhala language was proposed by Hettige and
dent attempt to create a morphological parser for Sinhala
                                                                  Karunananda [60] with a model for grammar [18]. The study
verbs was carried out by Fernando and Weerasinghe [125].
                                                                  by Liyanage et al. [38] is concentrated on the same given that
Later, another study, which was restricted to morphological
                                                                  they have worked on formalizing a computational grammar
analysis of Sinhala verbs was conducted by Dilshani and
                                                                  for Sinhala. While they do report reasonable results, yet
Dias [126]. There was no indication on whether this work
                                                                  again, do not provide any means for the public to access the
was continued to cover other types of words. Further, other
                                                                  data or the tools that they have developed. Kanduboda [37]
than this singular publication, no data or tools were made
                                                                  have worked on Sinhala differential object markers relevant
publicly accessible. Nandathilaka et al. [127] proposed a rule
                                                                  for parsing.
based approach for Sinhala lemmatizing. The work by Wel-
                                                                      The first attempt at a Sinhala parser, as mentioned above,
gama et al. [128] claim to have set a set of gold standard
                                                                  was by Hettige and Karunananda [60] where they created
definitions for the morphology of Sinhala Words; but given
                                                                  prototype Sinhala morphological analyzer and a parser as
that their results are not publicly available, further usage or
                                                                  part of their larger project to build an end-to-end transla-
confirmation of these claims cannot not be done. The table 5
                                                                  tor system. The function of the parser is based on three
provides a comparative summery of the discussion above.
                                                                  dictionaries: Base Dictionary, Rule Dictionary, and Concept
                                                                  Dictionary. They are built as follows:
3.6   Part of Speech Taggers
The next step after morphological analysis is Part of Speech            •    The Base Dictionary: prakurthi (base words), nipatha
(PoS) tagging. The PoS tags differ in number and function-                   (prepositions), upasarga (prefixes), and vibakthi (Irreg-
ality from language to language. Therefore, the first step in                ular Verbs).
creating an effective PoS tagger is to identifying the PoS              •    The Rule Dictionary: inflection rules used to gener-
tag set for the language. This work has been accomplished                    ate various forms of verbs and nouns from the base
by Fernando et al. [93] and Dilshani et al. [94]. Expanding                  words.
on that, Fernando et al. [93] has introduced an SVM Based               •    The Concept Dictionary: synonyms and antonyms
PoS Tagger for Sinhala and then Fernando and Ranathunga                      for the words found in the base dictionary.
[95] give an evaluation of different classifiers for the task         Parsers are, in essence, a computational representation
of Sinhala PoS tagging. While here it is obvious that there       of the grammar of a natural language. As such, in building
has been some follow up work after the initial foundation,        Sinhala parsers, it is crucial to create a computational model
it seems, all of that has been internal to one research group     for Sinhala grammar. The first such attempt was taken
at one institution as neither the data nor the tools of any       by Hettige and Karunananda [18] with special consideration
of these findings have been made available for the use of         given to Morphology and the Syntax of the Sinhala language
external researchers. Several attempts to create a stochastic     as an extension to their earlier work [60]. Here, it is worthy
PoS tagger for Sinhala has been done with the studies             to note that, unlike in their earlier attempt [60], where they
by Herath and Weerasinghe [129], Jayaweera and Dias [130],        explicitly mentioned that they are building a parser, in this
and Jayasuriya and Weerasinghe [131] being the most no-           study [18], they use the much conservative claim of building
table. Within a single group which did one of the above           a computational grammar. Under Morphology, they again
stochastic studies [130], yet another set of studies was car-     handled Sinhala inflection. Their system is based on a Finite
ried out to create a Sinhala PoS tagger starting with the foun-   State Transducer (FST) and Context-Free Grammar (CFG)
dation of Jayaweera and Dias [132] which then extended            where they they modeled 85 rules for nouns and 18 rules
to a Hidden Markov Model (HMM) based approach [133]               for verbs. The specific implementation is more partial to a
and an analysis of unknown words [134, 135]. Further, this        rule-based composer rather than parser. It is also worthy to
group presented a comparison of few Sinhala PoS taggers           note that this system could only handle simple sentences
that are available to them [136]. A RESTFul PoS tagging           which only contained the following 8 constituents: Attribu-
web service created by Jayaweera and Dias [137] using the         tive Adjunct of Subject, Subject, Attributive Adjunct of Object,
above research can still be accessed27 via POST and GET.          Object, Attributive Adjunct of Predicate, Attributive Adjunct
A hybrid PoS tagger for Sinhala language was proposed             of the Complement of Predicate, Complement of Predicate, and
by Gunasekara et al. [138]. The study by Kothalawala et al.       Predicate. With these, they propose the following grammar
[139] discussed the data availability problem in NLP with a       rules for Sinhala:
Sinhala POS tagging experiment among others. Withanage
                                                                  S = Subject Akkyanaya
  27. http://bit.ly/2F0jKid                                       Subject = SimpleSubject | ComplexSubject
8

TABLE 5: Morphological Analyzers comparison
Base: Rule-based (RB) / Machine Learning (ML)
Able to Handle Part of Speech (Handles): Yes (Y) / No (N)
Outputs: Yes (Y) / No (N) / No Information (O)
Abbreviations: Nouns (Nu), Verbs (Ve), Adjectives (Aj), Adverbs (Av), Function Words (Fn), Root (R), Person (P), Number (Nb), Gender (G),
Article (A), Case (C)

                                                                                     Handles                         Output
                                        Base       Modus Operandi
                                                                            Nu     Ve Aj Av          Fn    R    P    Nb G        A    C
  Hettige and Karunananda [59]           RB     Finite State Automata       Y      Y   N     N       N     Y    Y    Y    Y      Y    Y
  Hettige et al. [71]                    RB           Agent-based           Y      Y   Y     Y       Y     N    N    N   N       N    N
  Nandathilaka et al. [127]              RB               N/A               Y      N   N     N       N     Y    N    N   N       N    N
  Welgama et al. [124]                   ML      Morfessor algorithm        Y      Y   Y     Y       Y     Y    N    N   N       N    N
  Fernando and Weerasinghe [125]         RB     Finite State Transducer     N      Y   N     N       N     Y    Y    Y    Y      N    Y
  Dilshani and Dias [126]                RB               N/A               N      Y   N     N       N     O    O    O   O       O    O

ComplexSubject = SimpleSubject ConSub                                     Similar to Hettige and Karunananda [18], the work
SimpleSubject = Noun | Adjective Noun                                 by Liyanage et al. [38] also builds a CFG for Sinhala covering
ConSub = Conjunction SimpleSubject                                    10 out of the 25 types of simple sentence structures in Sin-
Akkyanaya = VerbP | Object VerbP                                      hala reported by Abhayasinghe [40]. This parser is unable
Object = SimpleObject | ComplexObject                                 to parse sentences where inanimate subjects do not consider
ComplexObject = Conjunction SimpleObject                              the number. Further, sentences which contain, compound
SimpleObject = Noun | Adjective Noun                                  verbs, auxiliary verbs, present participles, or past participles
VerbP = Verb | Adverb Verb                                            cannot be handled by this parser. If the verbs have imperative
                                                                      mood or negation those too cannot be handled by this. Non-
    The later work by Liyanage et al. [38] also involves
                                                                      verbal sentences which end with adjectives, oblique nominals,
formalizing a computational grammar for Sinhala. They
                                                                      locative predicates, adverbials, or any other language entity
claim that Sinhala can have any order of words in practice.
                                                                      which is not a verb cannot be handled by this parser.
However, they do not note that this is happening because
                                                                          The study by Kanduboda [37] covers not the whole of
practices of the spoken language, which does not share
                                                                      Sinhala parsing but analyzes a very specific property of
the strong SOV conventions of the written language, are
                                                                      Sinhala observed by Aissen [143] which states that it is pos-
slowly seeping into written text. However, they do make
                                                                      sible to notice Differential Object Marking (DOM) in Sinhala
note of how Sinhala grammar is modeled as a head-final
                                                                      active sentences. Kanduboda [37] define this as the choice
language [39]. They propose the Sinhala Noun Phrase (N P )
                                                                      of /wa/ and /ta/ object markers. They further observe three
to be defined as shown in equation 1 where N N is a noun
                                                                      unique aspects of DOM in Sinhala: (a) it is only observed
which can be of types: common noun (N ), pronoun (P rN )
                                                                      in active sentences which contain transitive verbs, (b) it can
or proper noun (P ropN ). The adjectival phrase (ADJP )
                                                                      occur with accusative marked nouns but not with any other
is then defined as as shown in equation 2 where: Det is
                                                                      cases, (c) it exists only if the sentence has placed an animate
a Determiner, Adj is the adjective, and Deg is an optional
                                                                      noun in the accusative position. They do a statistical analysis
operator Degrees which can be used to intensify the meaning
                                                                      and provide a number of short gazetteer lists as appendixes.
of the adjective in cases where the adjective is qualitative.
                                                                      However, they observe that further work has to be done
While they note that according to Gunasekara [142], there
                                                                      for this particular language rule in Sinhala given that they
has to three classes of adjectives (qualitative, quantitative,
                                                                      found some examples which proved to be exceptions to the
and demonstrative), they do not implement this distinction
                                                                      general model which they proposed.
in their system. Similarly, they propose Sinhala Verb Phrase
(V P ) to be defined as shown in equation 3 where V is a
single verb. They here note that they are ignoring compound
                                                                      3.8   Named Entity Recognition Tools
verbs and auxiliary verbs in their grammar. The adverbial
phrases (ADV P ) are then recursively defined as as shown             As shown in Fig 1, once the text is properly parsed, it
in equation 4.                                                        has to be processed using a Named Entity Recognition
                                                                      (NER) system. The first attempt of Sinahla NER was done
                     N P = [ADJP ][N N ]                       (1)    by Dahanayaka and Weerasinghe [144]. Given that they
                                                                      were conducting the first study for Sinhala NER, they based
                                                                     their approach on NER research done for other languages. In
                                        i
                                                                      this, they gave prominent notice to that of Indic languages.
                            h
                ADJP = [Det] [Deg][Adj]                        (2)
                                                                      On that matter, they were the first to make the interesting
                                                                      observation that NER for Indic languages (including, but
                        V P = [ADV P ][V ]                     (3)    not limited to Sinhala) is more difficult than that of English
                                                                      by the virtue of the absence of a capitalization mechanic.
                                                                      Following prior work done on other languages, they used
                    "                        #                       Conditional Random Fields (CRF) as their main model and
                                            i
                                                                      compared it against a baseline of a Maximum Entropy (ME)
                               h
        ADV P = [N P ] [ADV P ] [Deg][ADV ]                    (4)
                                                                      model. However, they only use the candidate word, Context
9

TABLE 6: NER system comparison
* Denotes a baseline.
F1 to F11 denotes Context Words, Word Prefixes and Suffixes, Length of the Word, Frequency of the Word, First Word/ Last Word of a
Sentence, (POS) Tags, Gazetteer Lists, Clue Words, Outcome Prior, Previous Map, and Cutoff Value

                                                                                          Features
                                           CRF     ME
                                                           F1    F2     F3    F4    F5      F6     F7   F8     F9    F10   F11
    Dahanayaka and Weerasinghe [144]        Yes    Yes*    Yes   Yes    No    No    No      No No       No     No    No    No
    Senevirathne et al. [145]               Yes    No      Yes   Yes    Yes   No    Yes     No No       Yes    No    Yes   No
    Manamini et al. [96]                    Yes    Yes     Yes   Yes    Yes   Yes   Yes     Yes Yes     Yes    Yes   Yes   Yes

Words around the candidate word, and a simple analysis of         work on NER [96] and PoS tagging [95].
Sinhala suffixes as their features.
    The follow up work by Senevirathne et al. [145] kept the      3.9   Semantic Tools
CRF model with all the previous features but did not report       Applications of the semantic layer are more advanced than
comparative analysis with an ME model. The innovation in-         the ones below it in Figure 1. But even with the obvious
troduced by this work is a richer set of features. In addition    lack of resources and tools, a number of attempts have
to the features used by Dahanayaka and Weerasinghe [144],         been made on semantic level applications for the Sinhala
they introduced, Length of the Word as a threshold feature.       Language. The earliest attempt on semantic analysis was
They also introduced First Word feature after observing           done by Herath et al. [147] using their earlier work which
certain rigid grammatical rules of Sinhala. A feature of clue     dealt with Sinhala morphological analysis [48]. A Sinhala
Words in the form of a subset of Context Words feature            semantic similarity measure has been developed for short
was first proposed by this work. Finally, they introduced         sentences by Kadupitiya et al. [148]. This work has been
a feature for Previous Map which is essentially the NE value      then extended by Kadupitiya et al. [149] for the application
of the preceding word. Some of these feature extractions          use case of short answer grading. Data and tools for these
are done with the help of a rule-based post-processor which       projects are not publicly available. A deterministic process
utilizes context-based word lists.                                flow for automatic Sinhala text summarizing was proposed
    The third attempt at Sinhala NER was by Manamini et al.       by Welgama [150]. The study by Wimalasuriya [151], which
[96] who dubbed their system Ananya. They inherit the CRF         has the same name as the above work by Welgama [150],
model and ME baseline from the work of Dahanayaka and             uses graph based TextRank algorithm for automatic Sinhala
Weerasinghe [144]. In addition to that, they take the en-         text summarizing.
hanced feature list of Senevirathne et al. [145] and enrich it        There have been multiple attempts to do word sense
further more. They introduce a Frequency of the Word feature      disambiguation (WSD) [152–156] for Sinhala. For this, Aruk-
based on the assumption that most commonly occurring              goda et al. [121] have proposed a system named Aruth
words are not NEs. Thus, they model this as a Boolean             based on the Lesk Algorithm[157]. An online tool29 , an API30
value with a threshold applied on the word frequency. They        of the algorithm, and code along with data on github31
extend the First Word feature proposed by Senevirathne            are available. For the same task, Marasinghe et al. [158]
et al. [145] to a First Word/ Last Word of a Sentence feature     have proposed a system based on probabilistic modeling.
noting that Sinhala grammar is of SOV configuration. They         A dialogue act recognition system which utilizes simple
introduce a (PoS) Tag feature and a gazetteer lists based         classification algorithms has been proposed by Palihakkara
feature keeping in line with research done on NER in              et al. [159].
other languages. They formally introduce clue Words, which            Text classification is a popular application on the seman-
was initially proposed as a sub-feature by Dahanayaka             tic layer of the NLP stack. A very basic Sinhala text classi-
and Weerasinghe [144], as an independent feature. Utilizing       fication using Naı̈ve Bayes Classifier, Zipf’s Law Behavior,
the fact that they have the ME model unlike Dahanayaka            and SVMs was attempted by Gallege [160]. A smaller imple-
and Weerasinghe [144], they introduce a complementary             mentation of Sinhala news classification has been attempted
feature to Previous Map named Outcome Prior, which uses the       by de Silva [22]. As mentioned in Section 3.2, their news
underlying distribution of the outcomes of the ME model.          corpus is publicly available32 . Another attempt on Sinhala
Finally, they introduce a Cutoff Value feature to handle the      text classification using six popular rule based algorithms
over-fitting problem.                                             was done by Lakmali and Haddela [161]. Even-though they
    The table 6 provides comparative summery of the dis-          talk about building a corpus named SinNG5, they do not
cussion above. It should be noted that all three of these mod-    indicate of means for others to obtain the said corpus.
els only tag NEs of types: person names, location names and       Another study by Kumari and Haddela [162] utilizes the
organization names. The Ananya system by Manamini et al.          SinNG5 corpus as the data set for their attempt to use
[96] is available to download at GitHub 28 . The data and         LIME [163] for human interpretability of Sinhala document
code for the approaches by Dahanayaka and Weerasinghe             classification. However, they too do not provide access cor-
[144] and by Senevirathne et al. [145] are not accessible to      pus. Nanayakkara and Ranathunga [164] have implemented
the public. Azeez and Ranathunga [146] proposed a fine-
                                                                    29. http://aruth.herokuapp.com/
grained NER model for Sinhala building on their earlier
                                                                    30. https://bit.ly/3sJEYbS
                                                                    31. https://github.com/jseanm1/aruthSWSD
  28. http://bit.ly/2XrwCoK                                         32. https://osf.io/tdb84/
10

a system which uses corpus-based similarity measures for        created Sinhala document readers for visually impaired per-
Sinhala text classification. Gunasekara and Haddela [165]       sons to be used on Android devices. An OCR and Text-to-
claim to have created a context aware stop word extraction      Speech system for Sinhala named Bhashitha was proposed
method for Sinhala text classification based on simple TF-      by De Zoysa et al. [194]. Anuradha and Thelijjagoda [195]
IDF. An LSTM based textual entailment system for Sinhala        proposed a machine translation system to convert Sinhala
was proposed by Jayasinghe and Sirts [166].                     and English Braille documents into voice. A separate group
    A simple MLP based method to classify sentiments            has done work on Sinhala text-to-speech systems indepen-
in Sinhala text was initially proposed by Medagoda [167]        dent to above [196].
based on their prior work [168]. A word2vec based tool33            On the converse, Nadungodage et al. [197] have done a
for sentiment analysis of Sinhala news comments is avail-       series of work on Sinhala speech recognition with special
able. A methodology for constructing a sentiment lexicon        notice given to Sinhala being a resource poor language. This
for Sinhala Language in a semi-automated manner based           project divides its focus on: continuity [198], active learn-
on a given corpus was proposed by Chathuranga et al.            ing [199], and speaker adaptation [200]. A Sinhala speech
[169]. Demotte et al. [170] proposed a sentiment analysis       recognition for voice dialing which is speaker independent
system based on sentence-state LSTM Networks for Sinhala        was proposed by Amarasingha and Gamini [201] and on
news comments. In the subsequent work [171], they used          the other end, a Sinhala speech recognition methodology
word similarity to generate a Sinhala semantic lexicon. They    for interactive voice response systems, which are accessed
followed this up with a further study [172] which discussed     through mobile phones was proposed by Manamperi et al.
a number of other deep learning models such as RNN and          [202]. Priyadarshani [203] proposes a method for speaker
Bi-LSTM in the domain of Sinhala sentiment analysis. Jaya-      dependant speech recognition based on their previous work
suriya et al. [173] proposed a method to classify Sinhala       on: dynamic time warping for recognizing isolated Sin-
posts in the domain of sports into positive and negative        hala words [204], genetic algorithms [205], and syllable
class sentiments. Ranathunga and Liyanage [174] claimed         segmentation method utilizing acoustic envelopes [206].
that using word embedding models as semantic features can       The method proposed by Gunasekara and Meegama [207]
compensate for the lack of well developed language-specific     utilizes an HMM model for Sinhala speech-to-text. A Sin-
linguistic or language resources in the case of analysing       hala speech recognizer supporting bi-directional conver-
sentiment of Sinhala news comments.                             sion between Unicode Sinhala and phonetic English was
    A model for detecting grammatical mistakes in Sinhala       proposed by Punchimudiyanse and Meegama [208]. The
was developed by Pabasara and Jayalal [175]. They followed      work by Karunanayake et al. [209] transfer learns CNNs for
this up with an grammatical error detection and correc-         transcribing free-form Sinhala and Tamil speech data sets
tion model [176]. Gunasekara et al. [177] used annotation       for the purpose of classification. Dilshan [210] conducted
projection for semantic role labeling for Sinhala. A cross-     a study for the specific use case of transcribing number
lingual document similarity measurement using the use-          sequences in continuous Sinhala speech. The work by Di-
case of Sinhala and English was developed by Isuranga et al.    nushika et al. [211] uses automatic speech recognition of
[178].                                                          Sinhala for speech command classification. Gamage et al.
                                                                [212] explored the use of combinational acoustic models
3.10   Phonological Tools                                       such as Deep Neural Network - Hidden Markov Model
On the case of phonological layer, a report on Sinhala pho-     (DNN-HMM) [213] and Subspace Gaussian Mixture Model
netics and phonology was published by Wasala and Gamage         (SGMM) [213] in Sinhala speech recognition. Karunathilaka
[179]. Wickramasinghe et al. [180] discussed the practical      et al. [214] explore Sinhala speech recognition using deep
issues in developing Sinhala Text-to-Speech and Speech          learning models such as: pre-trained DNN, DNN, TDNN,
Recognition systems. Based on the earlier work by Weeras-       TDNN+LSTM.
inghe et al. [181], Wasala et al. [182] have developed meth-        The Sinhala speech classification system proposed
ods for Sinhala grapheme-to-phoneme conversion along            by Buddhika et al. [215] does so without converting the
with a set of rules for schwa epenthesis. This work was then    speech-to-text. However, they report that this approach
extended by Nadungodage et al. [183]. Weerasinghe et al.        only works for specific domains with well-defined limited
[184] developed a Sinhala text-to-speech system. However,       vocabularies. Kavmini et al. [216] presented a Sinhala speech
it is not publicly accessible. They internally extended it to   command classification system which can be used for down-
create a system capable of helping a mute person achieve        stream applications.
synthesized real-time interactive voice communication in
Sinhala [185]. A rule based approach for automatic segmen-      3.11   Optical Character Recognition Applications
tation of a small set of Sinhala text into syllables was pro-
posed by Kumara et al. [186]. An ew prosodic phrasing method    While it is not necessarily a component of the NLP stack
to help with Sinhala Text-to-Speech process was proposed        shown in Fig 1, which follows the definition by Liddy [24],
by Bandara et al. [187, 188, 189]. Sodimana et al. [190]        it is possible to swap out the bottom most phonological layer
proposed a text normalization methodology for Sinhala text-     of the stack in favour of an Optical Character Recognition
to-speech systems. Further, Sodimana et al. [191] formalized    (OCR) and text rendering layer.
a step-by-step process for building text-to-speech voices for        An attempt for Sinhala OCR system has been taken
Sinhala. Both Jayamanna [192] and Mishangi [193] have           by Rajapakse et al. [217] before any other work has been
                                                                done on the topic. Much later, a linear symmetry based
 33. http://bit.ly/2QKI9Np                                      approach was proposed by Premaratne and Bigun [218, 219].
11

They then used hidden Markov model-based optimization               has been proposed by Dharmapala et al. [248]. The study
on the recognized Sinhala script [220]. Similarly, Hewavitha-       by Walawage and Ranathunga [249] specifically focuses on
rana et al. [221] used hidden Markov models for off-line            segmentation of overlapping and touching Sinhala hand-
Sinhala character recognition. Statistical approaches with          written characters. Silva and Jayasundere [250] focused on
histogram projections for Sinhala character recognition is          recognizing character modifiers in Sinhala handwriting.
proposed by Hewavitharana and Kodikara [222], by Ajward                 Summarizing on optically recognized old Sinhala text
et al. [223], and by Madushanka et al. [224]. Karunanayaka          for the purpose of archival search and preservation was
et al. [225] also did off-line Sinhala character recognition        explored by Rathnasena et al. [251]. The work of Peiris
with a use case for postal city name recognition. A separate        [252] also focused on OCR for ancient Sinhala inscriptions.
group had attempted Sinhala OCR [226] mainly involving              A neural network based method for recognizing ancient
the nearest-neighbor method [227]. A study by Ediriweera            Sinhala inscriptions was proposed by Karunarathne et al.
[228] uses dictionaries to correct errors in Sinhala OCR.           [253]. Chanda et al. [254] proposed a Gaussian kernel SVM
An early attempt for Sinhala OCR by Dias et al. [229] has           based method for word-wise Sinhala, Tamil, and English
been extended to be online and made available to use via            script identification.
desktops [230] and hand-held devices [231] with the ability
to recognize handwriting. A simple neural network based             3.12   Translators
approach for Sinhala OCR was utilized by Rimas et al. [232].        A series of work has been done by a group towards English
A fuzzy-based model for identifying printed Sinhala charac-         to Sinhala translation as mentioned in some of the above
ters was proposed by Gunarathna et al. [233]. Premachandra          subsections. This work includes; building a morphological
et al. [234] proposes a simple back-propagation artificial          analyzer [59], lexicon databases [62], a transliteration sys-
neural network with hand crafted features for Sinhala char-         tem [63], an evaluation model [68], a computational model
acter recognition. Another neural network with specialized          of grammar [18], and a multi-agent solution [74]. After
feature extraction for Sinhala character recognition was            working on human-assisted machine translation [64], Het-
proposed by Jayamaha and Naleer [235]. On the matter of             tige and Karunananda [67, 69] have attempted to establish a
neural networks and feature extraction, a feature selection         theoretical basics for English to Sinhala machine translation.
process for Sinhala OCR was proposed by Kumara and                  A very simplistic web based translator was proposed [65,
Ragel [236]. Jayawickrama et al. [237] worked on Sinhala            66]. The same group have worked on a Sinhala ontology
printed characters with special focus on handling diacritic         generator for the purpose of machine translation [73] and
vowels. However, they opted to refer to diacritic vowels            a phrase level translator [75] based on the previous work
as modifiers in their work. Gunawardhana and Ranathunga             on a multi-agent system for translation [72]. Further, an
[238] proposed a limited approach to recognize Sinhala              application of the English to Sinhala translator on the use
letters on Facebook images. A CNN-based methodology                 case of selected text for reading was implemented [70]. They
to improve printed Sinhala character OCR was proposed               later continued their work on multi agent English to Sinhala
by Liyanage [239]. A meta-study on the effects of text              translation with the AGR organizational model [76].
genre, image resolution, and algorithmic complexity needed              Another group independently attempted English-to-
for Sinhala OCR from books and newspapers was con-                  Sinhala machine translation [255] with a statistical ap-
ducted by Anuradha et al. [240]. An OCR and Text-to-                proach [256]. Wijerathna et al. [257] and De Silva et al.
Speech system for Sinhala named Bhashitha was proposed              [258] have proposed simple rule based translators. An
by De Zoysa et al. [194]. Anuradha and Thelijjagoda [195]           example-based method applied on the English-Sinhala sen-
uses OCR on Sinhala and English Braille documents on                tence aligned government domain corpus was proposed
which they then run a machine translation system in order           by Silva and Weerasinghe [259]. A translator based on a
to convert them into voice.                                         look-up system was proposed by Vidanaralage et al. [260].
     Fernando et al. [241] claim to have created a database         In a preprint, Joseph et al. [261] proposes an evolutionary
for handwriting recognition research in Sinhala language            algorithm for Sinhala to English translation with a basis of
and further claims that the data set is available at National       Point-wise Mutual Information (PMI) and claims that the
Science Foundation (NSF) of Sri Lanka. However, the paper           code will be shared once the paper is accepted. However,
provides no URLs and we were not able to find the data              they do not report any quantitative results to be compared
set on the NSF website either. The work by Karunanayaka             and the reported qualitative results are also superficial. Fer-
et al. [242] is focused on noise reduction and skew correction      nando et al. [262] tries to solve the Out of vocabulary (OOV)
of Sinhala handwritten words. A genetic algorithm-based             problem for Sinhala in the context of Sinhala-English-Tamil
approach for non-cursive Sinhala handwritten script recog-          statistical machine translation. Fonseka et al. [263] used Byte
nition was proposed by Jayasekara and Udawatta [243]. Ni-           Pair Encoding (BPE) for English to Sinhala neural machine
laweera et al. [244] compare projection and wavelet-based           translation. As an another solution to the OOV problem,
techniques for recognizing handwritten Sinhala script. Silva        an analysis of subword techniques to improve English to
and Kariyawasam [245] worked on segmenting Sinhala                  Sinhala Neural Machine Translation (NMT) was conducted
handwritten characters with special focus on handling dia-          by Naranpanawa et al. [264]. A data augmentation method
critic vowels. A comparative study of few available Sinhala         to expand bilingual lexicon terms based on case-markers
handwriting recognition methods was done by Silva et al.            for the purpose of solving the OOV problem in the domain
[246]. Silva et al. [247] uses contour tracing for isolated char-   of NMT was proposed by Fernando et al. [265]. A Singlish
acters in handwritten Sinhala text. A Sinhala handwriting           to Sinhala converter which uses an LSTM was proposed
OCR system which utilizes zone-based feature extraction             by De Silva [266].
12

    Most of the cross Sinhala and Tamil work has been done         ten Sinhala [294] and animation of finger-spelled words and
in the domain of machine translation. A neural machine             number signs [295]. A simple Sinhala chat bot which utilizes
translation for Sinhala and Tamil languages was initiated          a small knowledge base has been proposed by Hettige and
by Tennage et al. [267, 268]. Then they further enhanced           Karunananda [61]. A study on the effect of word embed-
it with transliteration and byte pair encoding [269] and           dings on a Sinhala chat bot was conducted by Gamage
used synthetic training data to handle the rare word prob-         et al. [296] where they used, the fasttext model trained by
lem [270]. This project produced Si-Ta [271] a machine             Facebook [97–99], on a RASA36 chat bot. Dissanayake and
translation system of Sinhala and Tamil official documents.        Hettige [297] implemented a question and answer generator
In the statistical machine translation front, Farhath et al.       for Sinhala with the limited PoS: pronouns, adjectives, verbs,
[272] worked on integrating bilingual lists. The attempts          and adverbs.
by Weerasinghe [273] and Sripirakas et al. [274] were also fo-          Fernando [298] proposed a method for inexact matching
cused on statistical machine translation while Jeyakaran and       of Sinhala proper names. A study on determining canonical
Weerasinghe [275] attempted a kernel regression method.            word order of colloquial Sinhala sentences using priority
A yet another attempt was made by Pushpananda et al.               information was conducted by Kanduboda and Tamaoka
[276] which they later extended with some quality im-              [299] which they later extended [300–302]. Jayakody et al.
provements [277]. An attempt on real-time direct translation       [303] uses simple KNN and SVM methods on a PoS tagged
between Sinhala and Tamil was done by Rajpirathap et al.           Sinhala corpus to create a question-answering system which
[278]. Dilshani et al. [279] have done a study on the linguistic   they name Mahoshadha. Sandaruwan et al. [304] have at-
divergence of Sinhala and Tamil languages in respect to            tempted to identify abusive Sinhala comments in social
machine translation. Mokanarangan [280] claims to have             media using text mining and machine learning techniques.
built a named entity translator between Sinhala and Tamil          An extremely simple plagiarism detection tool which only
for official government documents. But this work is locked         uses n-grams of simply tokenized text was proposed by Bas-
behind an institutional repository wall and thus is not ac-        nayake et al. [305]. Another simple plagiarism detection tool
cessible by other researchers. Arukgoda et al. [281] studied       that uses synonymy and Hyponymy-Hypernymy (which
the possibility of using deep learning techniques to improve       they call Generalization in the paper) was attempted by Ra-
Sinhala-Tamil translation. Pramodya et al. [282] compared          jamanthri and Thelijjagoda [306].
Transformers, Recurrent Neural Networks, and Statistical               A dataset consisting of Sinhala documents drawn from
Machine Translation (SMT) in the context of Tamil to Sinhala       Sri Lankan news websites was published by Jayawickrama
machine translation. The work by Nissanka et al. [283] used        et al. [102] along with the benchmark misinformation clas-
monolingual word embedding to improve NMT between                  sification models. A trending topic detection model for
Sinhala and Tamil. While not related to Tamil, there have          Sinhala tweets using simple clustering and ranking algo-
been attempts to link Sinhala NLP with Japanese by Herath          rithms was proposed by Jayasekara and Ahangama [307]. A
et al. [21, 77, 78], Thelijjagoda [79], and Kanduboda [8].         machine learning approach to detect hate speech in Sinhala
There has been an attempt to use dictionary-based machine          was proposed by De Silva [308]. The work by Liyanage
translation [284] between Sinhala and the liturgical language      and Ranathunga [309, 310] attempt to use LSTMs for math-
of Buddhism, Pali [285–287]. The study by Anuradha and             ematical word problem generation in Sinhala and other
Thelijjagoda [195] uses machine translation on the unique          languages. The problem of recognizing Sinhala and English
application of converting Sinhala and English Braille docu-        code-mixed data where the Sinhala text is written in Singlish
ments, which they have run OCR on, into voice.                     was explored by Smith and Thayasivam [311] using an
                                                                   XGB classifier and a CRF model building on their previous
3.13   Miscellaneous Applications                                  work [312], which analysed such data. Sandathara et al.
In this section, we discuss NLP tools and research which           [313] proposed a system which they named Arunalu that
are either hard to categorize under above sections or are          they claimed to use Voice recognition, Natural Language
reasonably involving multiples of them. The first miscel-          Processing, Machine Learning, and Deep Learning concepts
laneous application of Sinhala NLP is spell checking. The          to help individuals with dyslexia overcome problems of
open-source data driven approach proposed by Wasala et al.         reading Sinhala. Rajitha et al. [314] used the data set that
[288, 289] claims to be able to check and correct spelling         they introduced in their previous work [88] to prove that
errors in Sinhala. The approach by Jayalatharachchi et al.         task specific supervised distance learning metrics outper-
[290] attempts to obtain synergy between two algorithms            form their unsupervised counterparts, for document align-
for the same purpose. These efforts [288, 290] were then           ment.
extended by Subhagya et al. [291]. A rule-based Sinhala spell
checker named SinSpell based on Hunspell34 was introduced          4     P RIMARY S OURCES
by Liyanapathirana et al. [292]. They have also made the tool
available35 online for use. The study by Sithamparanathan          Even though the main objective of this survey is to cover
and Uthayasanker [293] extended the Generic Environment            NLP tools and research, we noticed that much of these NLP
for context-aware spell correction to handle Sinhala and Tamil.    tools and research depend on primary sources of Sinhala
    On the matter of Sinhala sign language, strides have           language such as printed books in the role of knowledge
been made in the domains of computer interpreting for writ-        sources and ground truth. Therefore, for the benefit of other
                                                                   researchers who venture into Sinhala NLP, we decided to
  34. http://hunspell.github.io/
  35. http://nlp-tools.uom.lk/sinspell/                                36. https://rasa.com/
You can also read