Large scale annotated dataset for code-mix abusive short noisy text

Paras Tiwari (  parastiwari.rs.cse19@iitbhu.ac.in )
 Indian Institute of Technology (BHU)
Sawan Rai
 Indian Institute of Information Technology, Design & Manufacturing
C. Ravindranath Chowdary
 Indian Institute of Technology (BHU)

Research Article

Keywords: Code-mix dataset, Abusive text, Noisy text

Posted Date: April 25th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-2826989/v1

License:   This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.
Springer Nature 2021 LATEX template

     Large scale annotated dataset for code-mix
              abusive short noisy text

Paras Tiwari¹*, Sawan Rai² and C. Ravindranath Chowdary¹
¹* Department of Computer Science & Engineering, Indian Institute
   of Technology (BHU), Varanasi, 221005, Uttar Pradesh, India.
²  Department of Computer Science & Engineering, Indian Institute
   of Information Technology, Design & Manufacturing, Jabalpur,
   482005, Madhya Pradesh, India.

*Corresponding author(s). E-mail(s): parastiwari.rs.cse19@iitbhu.ac.in;
Contributing authors: sawanrai@iiitdmj.ac.in; rchowdary.cse@iitbhu.ac.in;

                                   Abstract
     With globalization and cultural exchange around the globe, most of
     the population has gained knowledge of at least two languages. The
     bilingual user base on Social Media Platforms (SMP) has significantly
     contributed to the popularity of code-mixing. However, apart from
     their many vital uses, SMP also suffer from abusive text content.
     Identifying abusive instances in a single language is a challenging
     task, and it is even more challenging for code-mix. The abusive post
     detection problem is more complicated than it seems due to unseemly,
     noisy data and uncertain context. To analyze such content, the
     research community needs an appropriate dataset; a small dataset is
     not a suitable sample for research work. In this paper, we analyze
     the dimensions of Devanagari-Roman code-mix in short noisy text. We
     also discuss the challenges of abusive instances. We propose a
     cost-effective methodology, with a 20.38% relevancy score, to collect
     and annotate code-mix abusive text instances. Our dataset is eight
     times the size of the related state-of-the-art dataset. It is
     balanced, with 55.81% of instances in the abusive class and 44.19% in
     the non-abusive class. We also conduct experiments to verify the
     usefulness of the dataset and propose a baseline architecture with an
     MCC score of 0.5194. From our experiments, we observe that the
     dataset is suitable for further scientific work.

      Keywords: Code-mix dataset, Abusive text, Noisy text

1 Introduction
Language has been an essential part of human evolution. Various sociologists
have studied the evolution of conversational linguistics with respect to
cultural, sociological, geographical and economic factors [1][2][3]. Multiple
rules have also been proposed to maintain uniformity of language across
society. The author of [4] discusses conversational linguistic anthropology,
noting the absence of a strict grammatical structure in conversation and the
differences between the linguistic features of dialogue and those of formal
communication. The fusion and diffusion of languages in bilingual communities
led to code-mix conversations [5].
    Code-mix is a widespread phenomenon across various domains like sentiment
classification, polarity identification, dialect identification, question
answering, part-of-speech tagging, named entity recognition, speech
technologies, etc. [6]. As per an independent article¹, even in 2020 less
than 8% of the world’s languages and dialects were present on the internet.
However, most users prefer to surf the web in their native language [7]. The
use of code-mix language fills the gap between the quest for information and
its availability. Code-mix languages are a popular trend on SMP [8]. Multiple
factors are responsible for the popularity of code-mix, like freedom of
expression, satisfaction, ease of understanding, etc. [9]. The popularity of
code-mix language has even helped Indian politicians to broaden their reach
[10].
    Among the various SMP, Twitter®² is one of the most widely used
platforms. Its popularity and the character limit on each post make Twitter
suitable for collecting noisy code-mix instances. As per the official Twitter
blog³, in 2013 an average of 5700 tweets per second were posted on Twitter.
The number of tweets per second has increased rapidly in recent years due to
the deeper penetration of the internet and the active engagement of various
stakeholders in India. Understanding the reach of this platform, Indian
stakeholders started using it as a tool to receive direct feedback and
grievances. On the other hand, studies also show an increase in offensive
tweets following more active participation by political entities [11].
    People tend to have a short temper over disagreements [12]. Anger makes
people anxious and reactive to the subject. SMP let users express themselves
at any time. However, this becomes a flaw, as users do not wait to calm down
before responding to a sensitive subject. Users do not want to process their
thoughts or translate from their native language to English (or the
platform’s officially supported language). So they express themselves either
in a mix of languages or in complete transliteration. Hence, the majority of
heated conversations are code-mix. Users readily abuse on SMP because they
usually do not feel empathy for the victim [13]. For bullies, a victim is
just a random profile with a random username, not an actual person.

  ¹ https://www.bbc.com/future/article/20200414-the-many-lanuages-still-missing-from-the-internet
  ² https://www.twitter.com
  ³ https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how.htm

    The rest of the paper is organized as follows: Section 2 discusses the
objective and contribution of the proposed work. Section 3 discusses
code-mix, its features, and its dimensions. Section 4 explains various
dimensions of abusive instances. Section 5 elaborates on the data collection
methodology, with its factors and challenges. Section 6 describes the
proposed dataset. Section 7 discusses the architecture for detecting code-mix
abusive text, with its scope and limitations. Section 8 discusses the dataset
quality and the factors affecting performance. Section 9 reviews the related
datasets proposed so far, with their strengths and weaknesses. Section 10
presents the conclusion and future work.

2 Motivation
There has been adequate contribution to abusive tweet detection, but only
limited work on the code-mix abusive text detection task. Our work is
inspired by [14][15][16], which point to the requirement for large-scale data
dedicated to the task. The authors of [16][17][15] discussed various
challenges and methods for creating a dataset for such tasks. There is a
scarcity of quality code-mix datasets for such tasks [18]. The scarcity is
due to the heavy cost of filtering the relevant data instances. An improper
data collection methodology would waste enormous time and effort. Even after
the code-mix data has been filtered, annotators need to deal with
multi-dimensional complexity caused by heavy noise in the dataset. There are
various criteria for labelling an instance abusive or non-abusive, and there
are only minor differences among abusive, offensive, obscene and hate speech
content. Considering such challenges and requirements, in this work we make
the following contributions:
• Discussed multiple variants of the code-mix instances consisting of Devana-
  gari and Roman script characters.
• Discussed relationship among the abusive, offensive, hate speech and
  obscene textual content.
• Proposed an efficient methodology to collect, filter and annotate code-mix
  dataset.
• Proposed a significant, balanced, multi-domain Devanagari-Roman code-mix
  abusive short noisy text dataset.

Figure 1 Indian schedule-language percentage.

3 Code-mix in noisy text
The essence and impact of conversational code-mix features are prominently
visible in a diverse country like India. India is the second most populated
country globally⁴, has the second-largest digital population⁵ and is among
the top five nations by active Twitter users⁶. In such a linguistically
diverse country, Hindi is the most prominent language, used by around 45.63%
of the total Indian population⁷, as shown in Figure 1. After a long period of
British rule in India, the English language became comfortably mixed with the
native languages. Most native Hindi users on Twitter are bilingual, i.e. they
use both Hindi and English.
    Words in verbal or written communication need not be confined to Hindi
and English. In written conversation, users tend to use Roman characters for
Hindi words. Words that phonetically belong to Hindi (Devanagari script) but
are written in the English alphabet are popularly known as Hinglish. Hinglish
was popularised through its use in various advertisements. There are also
cases where users write English words in Devanagari characters, like स्कूल⁸.
We refer to such tokens as Enghind. In short written conversations, users
explore creativity beyond these constraints. A user has the liberty to use a
mix of characters in a conversation. There are also tokens carrying
characters from both Hindi and English. A few such tokens are created
deliberately; others arise when users miss the space between words. Such
tokens increase the complexity of determining the primitive language to which
a token belongs.
    The many options and dimensions of conversational language use make it
harder to settle on a definition of code-mix. Generally, if the tokens in a
sentence belong to more than one language, the sentence is considered
code-mix. Modern linguistic trends have also given rise to various informal
forms of languages. Considering the various dimensions of the code-mix
dataset, we treat the transliteration of Hindi words into English, and vice
versa, as independent languages.

  ⁴ https://www.census.gov/popclock/print.php?component=counter
  ⁵ https://www.statista.com/statistics/309866/india-digital-population/
  ⁶ https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/
  ⁷ https://censusindia.gov.in/nada/index.php/catalog/42561
  ⁸ English transliteration is school

    Along with the standard words of the languages, we include a criterion
at the script (character) level. If both Devanagari and Roman characters are
present in an instance, we consider it code-mix. We exclude digits from
consideration, as digits need not be assigned to a language. A user can write
multiple sentences in a tweet, where one sentence is code-mix and another is
code-switch. However, users generally avoid using both code-mix and
code-switch in a tweet [19]. Since our work is dedicated to code-mix
instances, we determine code-mix at the tweet level. We collected tweets
whose tokens belong to at least two of the following: Hindi, English, Hindi
words in the Roman alphabet (Hinglish), or English words in Devanagari
characters (Enghind). The authors of [20] analyzed the grammatical structure
of code-mix languages based on generative models of code-mix.
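
    The script-level criterion can be illustrated with a minimal Python
sketch. It is not the program used in this work: the function names are
hypothetical, Devanagari is approximated by the Unicode block U+0900 to
U+097F, and Roman is approximated by the ASCII letters.

```python
import unicodedata


def is_devanagari_letter(ch: str) -> bool:
    """True for letters and vowel signs in the Devanagari block (assumed range)."""
    in_block = "\u0900" <= ch <= "\u097f"
    # Keep only letters (L*) and combining marks (M*); this skips Devanagari
    # digits and punctuation such as the danda.
    return in_block and unicodedata.category(ch)[0] in ("L", "M")


def is_roman_letter(ch: str) -> bool:
    """True for the ASCII letters a-z and A-Z (assumed definition of Roman)."""
    return ch.isascii() and ch.isalpha()


def is_script_level_code_mix(text: str) -> bool:
    """True when the text contains both Devanagari and Roman letters.

    Digits never count towards either script, mirroring the exclusion of
    digits described above.
    """
    has_devanagari = any(is_devanagari_letter(ch) for ch in text)
    has_roman = any(is_roman_letter(ch) for ch in text)
    return has_devanagari and has_roman


mixed = is_script_level_code_mix("मैं school जा रहा हूँ")          # both scripts
roman_only = is_script_level_code_mix("main school ja raha hoon")  # one script
```

    Under these assumptions a Roman-only Hinglish tweet is not flagged by the
script check alone; it would have to qualify through the token-level
criterion (tokens from at least two of Hindi, English, Hinglish and Enghind).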
    Judging the language of a token is a non-trivial task. Some words do not
have an exact translation from the parent language to another. For example,
कन्यादान⁹, राखी¹⁰, जूठा¹¹, etc. are Hindi words that do not have an exact
translation in English. Similarly, certain English words do not have an exact
translation in Hindi, for example online, cyber, bank, etc. The presence of
such words in an instance creates a dilemma over whether to consider it
code-mix. With globalization, new circumstances and various trends, the
presence of such words has increased. There are open discussions about such
words in the core linguistic committees. The authors of [6] discuss various
phonetic features to differentiate between borrowing and mixing.
    Apart from such borrowed words, users readily adopt other words. SMP
users are the least concerned about morphological and syntactic structure.
Some English words have Hindi translations but, due to ease of use or
popularity, the English form is used more often in Hindi conversation, and
vice versa. Rather than using the translation, users confidently adopt the
word. For example, railway, camera, mobile phone, etc. are English words that
have multi-token Hindi translations. Even where a single-unit translation
exists, some words are so much in practice that people use them in
conversation assuming they are Hindi words. For example, doctor is frequently
used by native Hindi speakers instead of चिकित्सक¹². This dense linguistic
mixture has also led to the origination of various new words that do not
exist in either language, like familiyan (meaning families), doctron (meaning
doctors’), kitabs (meaning books), etc. Such practice has been adopted in
most Indian languages.
    To feel special, people try to choose unique and meaningful names, and
trends have an impact on names. Names or proper nouns are considered
language-neutral tokens. However, named-entity recognition is a stand-alone
problem. In tweets, users usually use @username¹³ to refer to a person rather
than their official name. Users use official names when that person is not on
Twitter. Various organizations also have a widespread presence on SMP. The
names of some organizations are creatively code-mixed for the sake of
marketing. If a user is unaware of an organization, such tokens may cause
ambiguity. For example, in the tweet ohh there is no mochi in jabalpur...,
mochi could refer to the footwear brand or to a cobbler; in @user passion se
chlega?, passion could refer to a motorbike brand or to devotion. The
language of such words depends upon the context in which they are used. The
decision to treat a token as English, Hinglish, Hindi or Enghind contributes
to the labelling of code-mix data. In our work, we opted for language
neutrality and categorized known noun phrases based on the characters used.

  ⁹ A Hindu wedding ritual
  ¹⁰ closest English translation is wrist band for brother
  ¹¹ closest English translation is leftover food
  ¹² English transliteration is chikitsak

    Deciding whether a word is a proper noun is even more complex in a
linguistically rich, diverse country like India. Multiple names have been
adopted for ‘India’. Even the Constitution of India mentions India, i.e.
Bharat. ‘India’ and ‘Bharat’ are proper nouns in English. However, the word
‘India’ is popularly treated as an English word, while ‘Bharat’ is treated as
a Hinglish token. In our work, we have not categorized India/Bharat into
different languages.
    The ambiguity in deciding the primitive language of a word is not limited
to named entities. In the code-mix domain, identifying the sense of a token
is an important step. Written conversations contain words that carry a
certain meaning in English but whose transliteration has another meaning in
Hindi, and vice versa. For example, tokens like dam, jab, door, to, etc. have
a meaning in English; read as transliterated Hindi, they have a different
meaning.
    Transliteration expands the complexity in various dimensions. There are
also cases where users write correct English words but intend the
transliteration, for example get lost you chore¹⁴, you cute a¹⁵, etc. This
dimension is one of the most challenging in noisy abusive text instances.
Such ambiguous instances are challenging even for human annotators. We
discuss this in detail in Section 5 with respect to code-mix abusive
instances. Such instances are contextually code-mix; they are rare and found
mostly in noisy abusive text. Incorporating such instances extends the
potential of our dataset, enriching it with contextually code-mix instances
rather than just transliterated code-mix.
    The complexity increases further in informal written conversation on
SMP. Since platforms like Twitter limit the number of characters per post,
users tend to abbreviate words to express more within the limit. The English
words ‘prime minister’ and the Hinglish words ‘pradhan mantri’ have the same
abbreviation, i.e. pm. Since one phrase belongs to English and the other to
Hinglish, it is tough to decide which language the abbreviation belongs to. A
more complicated scenario arises when the abbreviation itself has a meaning.
For example, UP is a word in English and is also the abbreviation of Uttar
Pradesh¹⁶.

    ¹³ Twitter handle of the person
    ¹⁴ Transliteration of ‘thief’ in Hindi
    ¹⁵ The combination of these two tokens is transliteration of ‘dog’ in Hindi

Figure 2 Code-mix tweet examples.

     Apart from expressing themselves through meaningful words, users also
use expressive phonetic words like hahahaha to express laughter, ohhh for
surprise, hmmm to express engagement with the topic, etc. Such expressions do
not belong to a language, but when written as text they need to be
categorized. For such tokens, we opted for the character-based strategy.
Figure 2 shows some cases of code-mix tweets.

4 Abusive instances
In the Oxford dictionary¹⁷, ‘abusive’ is defined as an adjective meaning
extremely offensive and insulting. However, no precise formal definition of
“abuse” has yet been given. Even legal authorities do not have a fully
accepted, workable definition of abusive content. The Protection of Women
from Domestic Violence Act, 2005 of the Parliament of India¹⁸ explains
‘verbal and emotional abuse’ as:
• “insults, ridicule, humiliation, name calling and insults or ridicule specially
  with regard to not having a child or a male child; and
• repeated threats to cause physical pain to any person in whom the aggrieved
  person is interested;”

 ¹⁶ Name of a state in India.
 ¹⁷ https://en.oxforddictionaries.com/definition/abusive
 ¹⁸ https://www.indiacode.nic.in/handle/123456789/2021

Figure 3 Difference and similarity of abusive.

    Abusive instances are usually misinterpreted as offensive, hate speech or
obscene. We recognize the overlap between abusive and offensive, but not
every offensive instance is abusive. For example, certain politicians are
accused of crimes they did not commit, and users attack their intentions
without always abusing them. Similarly, hate speech usually involves the
misinterpretation of various statements. Moreover, professional pornography
is legal in some regions. Since the internet is not bound by regional limits,
various pornographic artists post sample content on SMP. Such obscene
instances cannot be labelled as abusive. Figure 3 represents the relationship
among abusive, offensive, hate speech and obscene content.
    The code-mix abusive instance detection problem is more complex than it
seems due to inept, unstructured, noisy data and unpredictable context. There
is a general assumption that an instance is abusive only if it contains
abusive tokens (either text or emoticons). This assumption is only partially
correct. Some instances contain no direct abusive token but are still
abusive. There are abusive instances that, instead of exact abusive tokens,
use special characters or a combination of letters and special characters,
and are still contextually abusive. The authors of [21] propose a dataset of
4640 offensive tweets categorized by the presence of a profanity token, by
whether they target individuals or groups, and by whether they are directly
or indirectly abusive.
    Tweets are not only unstructured; they are also unordered.
Token-meaning-based models need a rich dataset of abusive words. The authors
of [22] manually created a set of abusive tokens to generate a set of highly
offensive words in Hinglish. Creating a dictionary of all highly offensive
words is not feasible. However, the authors of [23] created a directory of
3000 multi-lingual words, of which 2400 belong to English, 400 to Japanese,
and slightly over 200 to Bulgarian, Polish, and Swedish. Still, no one can
claim that a directory includes all abusive words and that no new abusive
word will appear in the future. There are no pre-defined criteria for a term
to be abusive; it depends solely on the creativity of users, which evolves
with time. The complexity increases when mixed words are added to the task.
There are multiple words whose transliteration is very similar to abusive
tokens in other languages.
    There are cases where the meaning of an individual token is insufficient
to detect whether the token is used in an abusive context. For example, if
multiple abusive-labelled instances in the dataset contain the token ‘sex’,
the probability of an instance being labelled abusive increases whenever the
word ‘sex’ appears. However, a user may post to raise awareness of ‘sex
education’. A model that depends on the meaning of that particular token is
likely to mislabel such an instance about sex education as abusive. There are
also sets of tokens that seem normal individually but whose combination, when
read as a transliteration, makes a different sense. Highly creative users
come up with combinations of symbols, digits and words rather than just
words.
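
    The limitation of token-level matching can be seen in a toy Python
sketch. The lexicon and the function name below are hypothetical and serve
only as an illustration; they are not part of the proposed method.

```python
import string

# Hypothetical toy lexicon; a real lexicon would be far larger.
TOY_LEXICON = {"sex", "idiot"}


def naive_token_flag(text: str, lexicon=TOY_LEXICON) -> bool:
    """Flag a text as abusive if any single token appears in the lexicon."""
    for raw in text.lower().split():
        # Strip surrounding ASCII punctuation so that "idiot!" matches "idiot".
        if raw.strip(string.punctuation) in lexicon:
            return True
    return False


# An awareness post is flagged because of one token (false positive).
awareness_post = naive_token_flag("We need better sex education in schools")

# A contextually abusive text with no lexicon token is missed (false negative).
contextual_abuse = naive_token_flag("no one could see you even in daylight")
```

    Here `awareness_post` evaluates to True and `contextual_abuse` to False,
so both errors arise from looking at tokens in isolation.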
    So we need to take care of multi-unit tokens and their interdependency.
Also, for tweets like ‘does your mother remember your fathers’ name’ or ‘no
one could see you even in daylight’, relying on token dependency with abusive
tokens is ineffective. As there is no limit on users’ creativity or on the
number of highly offensive words, we must also consider contextual meanings.
The authors of [24] discussed the relevance of contextual dependency for
abusive tweets.
    The complexity increases when we carefully consider the relationship
between an offensive instance and an abusive instance. There are instances
not intended to offend that some readers might find abusive and others might
not [25]. For example, ‘you look like a pathetic idiot tonight, how could you
@username’; this instance is offensive but not abusive. We should not ignore
the fact that openness to content varies from person to person. Some people
have thin skin for sarcasm, while others have thick skin. This leads to
another open debate over labelling a statement as offensive: the same
statement could be offensive to person A, sarcastic to person B, and normal
to person C. The authors of [26] discussed various factors that may lead to
aggression in users.

5 Creating a large scale dataset
5.1 Data collection
We understand the quantity and quality required for a dataset to be a
suitable representation of actual data. Our data collection uses a greedy
strategy inspired by the honeypot technique [27][28]. The authors of [29]
illustrated the challenges of maintaining dataset quality. In Section 1, we
illustrated the prominent contribution of Devanagari-Roman code-mix tweets on
Twitter. Since we aim to propose a balanced code-mix abusive tweet dataset,
using only specific profiles, keywords, hashtags, or trending topics would
not cover the broad, diverse range of actual tweets. We have designed an
efficient five-step procedure to collect the relevant data.
    We understand that the diversity of Indian society brings various topics
of interest. Collecting data for each domain would lead to unnecessary
overlap. Also, our work is limited to the collection of abusive code-mix
tweets. So we shortlisted the five most popular and sensitive domains for our
target users.
    For the domains of most common interest, it is next to impossible for
such a diverse community to hold a uniform opinion on each event. The author
of [30] studies the division of people into small groups based on sensitivity
and the contextual range of the flow of information. On SMP, grouping people
with similar opinions leads to the formation of echo chambers, which drag
followers’ beliefs to extreme positions [31][11]. These extreme divisions of
opinion sometimes lead to controversies. Controversies happen on their own or
are sometimes created intentionally [32]. The authors of [33] analyzed the
twisted use of Twitter for creating and propagating controversies. Twitter is
intended to be a platform where every user can post her opinion
independently. Instead, people tend to have heated debates over opposing
narratives about various events on SMP. In these heated conversations, users
occasionally do not hesitate to abuse. We listed several highly controversial
events in each selected domain that are deeply relevant to most of the
population.
    After listing the events, we needed to collect the triggering tweets on
which users prominently have bipolar opinions. One option was to use hashtags
to collect the tweets, as in [34], but as mentioned in [32], SMP are now
treated as two-way communication. Users tend to abuse at the individual level
while countering or defending the stance presented by a celebrity they hate
or admire. Verified users have a higher tendency to set the stance and
receive hateful replies [35]. We listed personalities with a remarkable
number of followers on SMP and grouped them by selected domain. From this
list, we selected user profiles depending on their sensitivity towards the
event, the engagement of users with extremely polarised opinions, and the
number of followers.
    After selecting the user profiles, we knew that not every tweet by the
selected personalities about an event would provoke strong reactions. So we
chose the most triggering tweets of the selected users with respect to the
sensitivity of the event. These tweets have the highest probability of heated
arguments from both extremes. We collected all the tweets in each such
conversation using the Twitter developer API¹⁹. We removed the noisy tweets
from this collection and kept only the targeted code-mix tweets. We
intentionally collected tweets from conversation threads, as this gives
greater scope for the inclusion of contextually abusive tweets. A search
query based on keywords or a time constraint would bias the dataset.

5.2 Noise removal
A tweet has a limit of 280 characters to express anything. The gap between
the desire to express and the space available gives rise to several creative
forms like emojis, unconventional abbreviations, numbers, URLs, slang,
acronyms, etc. These special tokens have contextual meanings and are
individual challenges for text processing. In contrast to the character limit
of a tweet, users’ creativity has no limit. The trend-driven evolution of
these tokens increases the complexity of processing tweets, so they are
considered noise. Unprocessed noisy tweets cost unnecessary effort and
degrade the performance of experiments. To reduce wasted effort, we
pre-processed the data and minimized the noise with respect to the human
annotator, as shown in Section 5.2.1. We avoided rigorous pre-processing
steps to maintain the essence of actual data. At the same time, we processed
the data to the extent that human annotators have maximum ease in deciding
the label. Machines are more sensitive to noise than human annotators, so we
applied different pre-processing steps before inputting the data to the
algorithm, as shown in Section 5.2.2.

 ¹⁹ https://developer.twitter.com/en/products/twitter-api

Domain
  • Presence of various domains to maintain diversity and inclusion of all
    important aspects.
  • Among the various domains, we shortlist the most popular ones, i.e.
    politics, activist, sports, news, and entertainment. As people are more
    sensitive to these domains, they tend to be more expressive.
Controversial Events
  • People tend to have polarized views and dedicatedly contribute in support
    of their assertions.
  • We opted for major events in the major domains that trigger sensitivity
    among the population, like farm laws and the cricket world cup.
Celebrity
  • Responses of followers are highly affected by the opinions of influential
    celebrities.
  • We selected celebrities that resemble the domain and event.
Tweets
  • People defend their assertions without worrying about the words they use.
  • We selected tweets of verified celebrities about the most controversial
    events belonging to the diverse, most popular domains.
Data collection
  • Twitter lets users reply to any publicly posted tweet. The replies to
    controversial tweets usually tend to lead to high-voltage conversations.
  • We collected all the replies to the selected tweets.

Figure 4 Data collection strategy.

    Apart from these token noises, the biggest challenge was to separate
code-mix instances from spam. In Section 1, we discussed the availability of
the data. Even though data is available in enormous amounts through various
freely available data collection tools, annotating such a dataset is still a
cost-intensive task. Most of the cost is wasted on manually filtering the
relevant data instances. We therefore created a dictionary of relevant tokens.
We collected various Hinglish tokens proposed in [36] and removed those that
overlap with the English language. We further updated the dictionary by adding
the most frequent Hinglish (including profane and obscene) and Enghind tokens.
We used a simple Python program to keep
only instances that have at least one token belonging to our dictionary. We
iteratively improved the dictionary, which helped to omit the majority of spam
instances. We kept the threshold at a single token to avoid biasing the
dataset towards the limited set of tokens in the dictionary. We are well aware
that some relevant tweets are lost in this step. However, the step was aimed
at making the annotation cost-effective. Even after losing some relevant
instances, we had a sufficient number of instances.
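
The dictionary filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the dataset: the function names, the lower-casing, the whitespace split and the three sample dictionary tokens are all hypothetical.

```python
def build_dictionary(tokens):
    """Normalise dictionary entries to lower case (an assumption of this sketch)."""
    return {token.lower() for token in tokens}


def is_relevant(tweet, dictionary, threshold=1):
    """Keep a tweet if at least `threshold` of its tokens occur in the dictionary."""
    hits = sum(1 for token in tweet.lower().split() if token in dictionary)
    return hits >= threshold


def filter_tweets(tweets, dictionary, threshold=1):
    """Return only the tweets that pass the dictionary check."""
    return [tweet for tweet in tweets if is_relevant(tweet, dictionary, threshold)]


# Hypothetical dictionary entries, for illustration only.
dictionary = build_dictionary(["kya", "nahi", "hai"])
tweets = ["Tu kya kar rha hai", "Buy cheap followers now"]
kept = filter_tweets(tweets, dictionary)
```

With the threshold left at one token, the first sample tweet is kept and the second, spam-like tweet is dropped. Raising `threshold` would filter more aggressively, at the cost of the dictionary bias discussed above.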

5.2.1 Noise removal before annotation
Before submitting the batches of tweets to the annotators, we minimally
process them. Our primary intention is to reduce ambiguity while maintaining
the characteristics of actual tweets. The following steps were taken before
annotation:
• We replaced hyperlink text in the tweet with “⟨url⟩”. We are well aware
  that a user can use spam URLs rather than tokens to abuse, for example,
  “you are a ⟨spam-url⟩”. However, spam URL detection is out of the scope of
  this article. Including URLs might also confuse the annotators, as a URL
  represents incomplete information in the tweet.
• We replaced emojis in the tweet with the respective text using an open
  tool20. Emojis do not have standard interpretations, and annotators could
  have different opinions about an emoji [37]. Replacing the emojis with text
  removes ambiguity among annotators and maintains uniformity.
• We removed tweets that carry characters belonging to a script other than
  Devanagari or Roman. We are aware of the complexity and creativity used
  in tweets, so even translating the word to a known language would not
  fully justify the label.
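
The three steps above can be sketched as follows. This is a minimal sketch using only the Python standard library, not the pipeline used for the dataset. In particular, the emoji step here builds a name from the Unicode character database as a stand-in for the open tool cited in the text, and the URL pattern, the function names and the plain `<url>` placeholder are assumptions of the sketch.

```python
import re
import unicodedata

# Assumed URL pattern: anything starting with http(s):// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def replace_urls(text):
    """Replace every hyperlink with the placeholder <url>."""
    return URL_RE.sub("<url>", text)


def textualise_emojis(text):
    """Replace pictographic symbols (Unicode category So) with a :name: tag."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                ch = ":" + name.lower().replace(" ", "_") + ":"
        out.append(ch)
    return "".join(out)


def uses_allowed_scripts(text):
    """True if every letter belongs to the Roman (Latin) or Devanagari script."""
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if not name.startswith(("LATIN", "DEVANAGARI")):
                return False
    return True


def prepare_for_annotation(tweets):
    """Drop tweets in other scripts, then replace URLs and emojis in the rest."""
    kept = []
    for tweet in tweets:
        if not uses_allowed_scripts(tweet):
            continue
        kept.append(textualise_emojis(replace_urls(tweet)))
    return kept
```

The script check runs before the replacements so that the inserted placeholders never influence which tweets are kept.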

5.2.2 Noise removal before architecture input
Due to the processing limitations of machines for text, algorithms are the
most sensitive to these noises [38]. The following pre-processing steps were
applied to the data:
• We recognise the importance of smileys to the context of a statement. We
  substituted these tokens with meaningful tags such as ⟨face⟩, ⟨smiley⟩ and
  ⟨eye⟩. This helps our model learn to distinguish between abusive and
  sarcastic tweets.
• We assume that a token consisting of a combination of letters and special
  characters is an abusive word, since users commonly mask abusive words in
  this way, e.g., ‘motherf%%%’ or ‘f##k’. These tokens are symbolically
  treated as abusive.
• Digits are rarely used in tweets (compared to letters) to express
  feelings, e.g., ‘143’ for ‘I love you’ or ‘153’ for ‘I adore you’. We
  generalised every set of digits as ⟨number⟩.

  20 https://github.com/carpedm20/emoji/
• Several tweets tag users as @user_handle. We need not worry about who is
  tagged in a tweet, so we replaced each user tag with ⟨user⟩.
• If a token starts with a hashtag (#) followed by a string, we removed the
  hashtag symbol and retained the string.
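
A token-level sketch of some of the steps above is given below. It is an illustration under stated assumptions rather than the code used in the experiments: the `<abuse>` tag, the set of masking symbols and the function names are hypothetical, and the smiley tags are left out for brevity.

```python
import re

# Letters followed by masking symbols, e.g. f##k. The symbol set is an assumption.
MASKED_RE = re.compile(r"[A-Za-z]+[#%*$@]+[A-Za-z#%*$@]*")


def normalise_token(token):
    """Map one whitespace-delimited token to its generalised form."""
    if token.startswith("@") and len(token) > 1:
        return "<user>"        # user tagging
    if token.startswith("#") and len(token) > 1:
        return token[1:]       # drop the hashtag symbol, keep the string
    if token.isdigit():
        return "<number>"      # any run of digits
    if MASKED_RE.fullmatch(token):
        return "<abuse>"       # hypothetical tag for symbol-masked words
    return token


def normalise(tweet):
    """Normalise every token of a tweet and join the result with single spaces."""
    return " ".join(normalise_token(token) for token in tweet.split())


example = normalise("@rahul tu f##k hai #IndvsPak 143")
```

Working token by token avoids a pitfall of applying regular expressions to the whole string: a hashtag pattern would otherwise also fire inside a masked word such as ‘f##k’.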

5.3 Annotators’ profile
We needed annotators who know both languages (Hindi and English) and
have experience reading code-mix sentences. To ensure this experience,
we selected annotators with an active SMP account older than three
years. A three-year-old active SMP account ensures their experience with
trending features and noisy data. Similarly, to ensure knowledge of both
scripts, we selected annotators who belong to the northern part of
India, where both scripts are popular.
    We selected six annotators and divided them into two groups, i.e.,
NLTP experts and conventional users. To be in the NLTP expert group, an
annotator must either have research experience in text processing or have
completed a course related to text processing in their academic record. We
also took care of gender representation: each group has at least one female
and at least one male annotator.

5.4 Challenges in annotation
Annotating abusive instances is challenging because such instances are
unstructured and ambiguous, and sensitivity to them is diverse. We discussed
the various dimensions along which an instance can be abusive in Section 4.
Authors in [24] discussed lexical and contextual dependency for classifying
tweets as normal or abusive. Among the various challenges, the following are
the major ones:

5.4.1 Ambiguity
Annotating a tweet correctly is possible only after a correct assessment of
the tweet. A tweet may be assessed to have multiple meanings due to the
ambiguous mapping of Hinglish tokens to the respective Hindi/English words
[39][40]. For example, “how gud u r” could be translated either to “how good
you are” or to “how jaggery21 you are”. We understand the challenges due to
transliteration of Hindi to English and vice versa. The ambiguity is mainly
due to phonetic similarity. It increases when the user intended to write a
Hinglish word but misspelled it as a valid English word. For example, pura is
a Hinglish word that means complete, but it is often misspelled as pure, which
is a valid English word. A similar challenge for annotators arises from the
many possible transliterations of a word. For example, the Hinglish word for
‘ear’ can be transliterated as kaan, can, etc.
Such words led to an extra challenge in deciding the intended language of the

  21 It is a coarse dark brown sugar made in India by evaporation of the sap of palm trees. It is
contextually used for a person, physically in bad shape but sweet in nature.
word. These challenges are more troublesome for machines than for humans. To
minimize them, we selected annotators who are experienced in each chosen
language and its transliteration.

5.4.2 Sensitivity
The diversity of cultural backgrounds, dialects, and other geo-socio-economic
factors results in very diverse sensitivity. For example, the English
statement “Are you coming from a picnic?” can be translated to more than
one code-mix statement, i.e., “Tu kya picnic se aa rha h”, “Aap kya picnic
krr k aa rhe hai” or “Tum kya picnic kr k aa rhe ho”. There is a fair
chance that a person from the eastern part of Uttar Pradesh22 will treat the
first translation as offensive, whereas a person from the western part of
Uttar Pradesh will consider it normal. A detailed analysis of sensitivity is
given in Section 4. In case of a conflict of votes between the two groups, we
preferred the NLTP experts, since such an annotator must have visited
various locations to gain suitable academic experience. That increases the
probability of broader real-life experience and awareness of such code-mix
challenges.

5.4.3 Trends
The life span of trends is very short on SMP. However, they leave a mark
on future trends. That leads to very unstructured, time-variant acronyms,
smileys, slang, etc. The annotator should be up to date with these changes.
There is no official source for these emerging trends. Users usually get
to know about them only when they come up in their conversations. We
addressed this challenge by requiring an annotator to have an SMP
account older than three years.

5.5 Annotator’s training
Labelling a large-scale dataset is much more complicated than collecting
large-scale tweets. To sustain quality, each group was given unambiguous
definitions of code-mix abusive and normal instances with five different
examples of each, as shown in Table 1. We also conducted a small test of 50
code-mix tweets covering the most probable dimensions to ensure that the
annotators were clear about the definition of abusive tweets.
    We also considered the interest decay phenomenon. Annotators may lose
interest while labelling a large chunk of data, which may affect work quality,
so we gave each annotator only one-third of the total tweets in a group.
However, we understand that even one-third of the total tweets is a large
number and annotators may get bored, so we gave each annotator a reasonable
time frame of four weeks. The author in [41] studied the human
psychology of work procrastination due to mismanagement of work and time.

 22 A state of India
Table 1 Pre-processed tweets with label.
                                  Tweet                                       Label
 ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ Kaloo Teri photo taangni chahiye thi      Abusive
 taaki jet ko najar na lagti kaluwe
 ⟨user⟩ Chutiya admi... Jab log train me Jada chalte hai.. To Ye to hona      Abusive
 tha..
 ⟨user⟩⟨user⟩ Uska malik tere jaisi soch wala hai k___y darbari jo tha        Abusive
 zuthan khane wale desh ki chita hum karlenge tu apne ghar ke liye soch
 medam teri d____ bhagi thi na
 ⟨user⟩ muh me le le mera..                                                   Abusive
 ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ ⟨user⟩ Koi    Normal
 business luta de to arthvyavastha achhi ho kar usaka hath to nhi rok degi
 ladki chakkar mein bhi dhandha kharab hota hai, offer se bhi dhandha         Normal
 kharab hota hai. depend karta hai dhanda karne wala kaisa dhanda karta
 hai. yahan ka profit wwhan, aut kahin aur ka loss airline mein ghussa
 diya hoga
 ⟨user⟩ thode concept clear kro... ise dekho ⟨url⟩                            Normal
 Maa behan ki izzat krna seekho... ⟨user⟩ rape case ko itna hlke me kaise     Normal
 le sskte h
 ⟨user⟩ tum kisi ko b aise hi ch__iya thode kh doge... respect the opinions   Normal

Inspired by [41], we also considered the situation in which an annotator might
postpone the task until the last day and label all tweets in a single
day, which may impact the quality of annotations. So, we set a limit of
labelling at most 1/28 of the given tweets in a day. The next batch
of 1/28 of the tweets was given for annotation only after receiving the
previous batch. These practices kept the annotators disciplined. We
also designed a simple click-based GUI using a Python library23 to keep the
annotation work interactive. We are very thankful to all the annotators for
their patient contributions. The dictionary, code and dataset have been made
publicly available24.
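
The batching scheme above (one-third of a group's tweets per annotator, released in daily batches of at most 1/28 of that share) can be sketched as follows. The function names and the use of ceiling division are assumptions of this sketch, which is not the released annotation tool.

```python
from math import ceil


def split_into_chunks(items, parts):
    """Split a list into at most `parts` consecutive chunks of near-equal size."""
    if not items:
        return []
    size = ceil(len(items) / parts)
    return [items[i:i + size] for i in range(0, len(items), size)]


def annotation_schedule(group_tweets, annotators=3, days=28):
    """For each annotator, return the list of daily batches of their share."""
    shares = split_into_chunks(group_tweets, annotators)
    return [split_into_chunks(share, days) for share in shares]


# Toy input: 84 tweet ids, i.e. 28 per annotator and one per day.
schedule = annotation_schedule(list(range(84)))
```

In practice the next batch would be handed out only once the previous one has been returned, which the sketch leaves to the caller.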

6 Large scale annotated dataset
The proposed dataset covers diverse domains. The diversity ensures the
inclusion of a wide variety of opinions, the jargon used in each domain,
and an appropriate quantity. We limited our work to the selected domains
to maintain the relevancy score. The instances collected from the
selected domains, i.e., entertainment, politics, activist, sports and news,
number 26011, 41708, 15623, 48375 and 46944, respectively. We maintained
a balance among the selected domains to avoid bias, as
demonstrated in Figure 5. After filtration and manual annotation,
among a total of 36,423 relevant code-mix instances, 4562, 6701, 2312,
12076, and 10772 belong to entertainment, politics, activist, sports and news,
respectively. Justifying the assumptions in the proposed strategy, the sports
domain had the highest relevancy score, followed by news. The cause for the

  23 https://pysimplegui.readthedocs.io/en/latest/
  24 https://github.com/sawan16/code_mix
[Bar chart: number of tweets collected (y-axis, 0 to 50000) for each domain (x-axis: Entertainment, Politics, Activist, Sports, News)]

Figure 5 Number of tweets from each domain.

relevancy score of the news domain also lies in the selection criteria. Indian
news organizations are divided distinctly by language. For the scope
of this paper, we selected events and personalities of the news domain that
have a dominant Hindi-speaking audience.
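
As a quick check of the domain figures quoted in this section, the sketch below computes a per-domain relevancy score. It assumes that the score is the ratio of relevant instances to collected instances, which is our reading of the term here; the variable and function names are illustrative.

```python
# Counts quoted in this section.
collected = {"entertainment": 26011, "politics": 41708, "activist": 15623,
             "sports": 48375, "news": 46944}
relevant = {"entertainment": 4562, "politics": 6701, "activist": 2312,
            "sports": 12076, "news": 10772}


def relevancy_scores(collected, relevant):
    """Relevant / collected for each domain (assumed definition of the score)."""
    return {domain: relevant[domain] / collected[domain] for domain in collected}


scores = relevancy_scores(collected, relevant)
ranking = sorted(scores, key=scores.get, reverse=True)
total_relevant = sum(relevant.values())
```

Under this assumption the ranking starts with sports and then news, in line with the text, and the per-domain counts add up to the 36,423 relevant instances.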
    Our proposed dataset is enriched with 90695 tokens. In our work, we
tokenize on spaces; the reasons are discussed in Section 7. The word cloud
in Figure 7 represents relevant tokens present in the proposed dataset. The
length of a tweet in our dataset ranges from 2 tokens to 78 tokens, as shown
in Figure 6. Tweets of 13 tokens are the most frequent, with 1271 instances.
The mean length of tweets in the proposed dataset is around 38 tokens. As
Twitter supports a limited number of characters, users tend to use short
tokens. However, for hashtags, users tend to use the exact trending hashtag
irrespective of its length, since hashtags give a post its objective and
reach. In our proposed dataset, the number of hashtags in an instance ranges
from 0 to 12. The proposed dataset is enriched with various features for
exploring user behaviour with respect to the domain, event and celebrity
behaviour. Due to the limited scope of this paper, we have experimented only
with abusive instance detection.

7 Learning based abusive detection
The BERT (Bidirectional Encoder Representations from Transformers)
model has gained popularity for various natural language text processing tasks
and has outperformed classical neural networks on many of them. Various
fundamental factors, parameters and reasons for BERT’s performance are yet
to be explored; for many tasks, BERT has acted as a black box. In the
code-mix domain, because of the unavailability of sufficient relevant datasets,
authors in [42] created a synthetic dataset by replacing tokens for training
[Histogram: number of instances (y-axis, 0 to 1200) against instance size in tokens (x-axis, 2 to 78)]
Figure 6 Size of instances.

Figure 7 Abusive code-mix dataset word cloud.

the BERT model. The token replacement strategy keeps the syntactic structure
intact. However, actual instances lack syntactic structure and abound in
non-traditional tokens. So we intentionally opted for a classical neural
network over pre-trained BERT models to discuss the features of the dataset
and the parameters essential for the performance of such a task. For
cross-domain (English, Italian, Spanish, and German) abusive text detection,
authors in [43] used the multilingual lexicon HurtLex [44]. To the best of our
knowledge, no existing work addresses exactly the same task. So, we compare
our work with the dataset proposed for a similar task, i.e., Hindi-English
code-mix hate speech detection [34].

7.1 Tokenizer
For noisy code-mix instances, every elementary step has its importance. In
our work, we aim to maintain a balance between cutting-edge contributions
and the most basic way of dealing with the task. Hence, even for the
preliminary step, i.e., choosing the tokenizer, we considered various tokenizers
that could tokenize the instances without losing their essence.

Figure 8 Tokenization of sample instance.

We used pre-processing steps to omit the irrelevant tokens that do not
influence whether an instance is considered abusive. Most tokenizers have been
trained on monolingual data. There are multiple tokenizers available for
both Hindi and English. However, every tokenizer has a different
strategy for dealing with tokens. In our work, we used the popular
WhitespaceTokenizer of the nltk package25. We also tested a tokenizer trained
for Indian languages, i.e., the inltk tokenizer26. The inltk tokenizer for
Hindi (inltk (hi)) tokenized English tokens imprecisely. Also, the inltk
tokenizer trained on synthetic code-mix data (inltk (en-hi)) did not tokenize
as per the requirements. We opted for WhitespaceTokenizer as it is
language-independent. Its major limitation concerns tokens in which two or
more words have been written without spaces. However, the majority of
tokenizers face a similar limitation. In Figure 8, we show the tokens for a
sample tweet that carries the major noisy tokens.
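
Whitespace tokenization is simple enough to illustrate with the standard library. The sketch below uses `str.split`, which on ordinary input should give the same tokens as a whitespace tokenizer; it is a stand-in for illustration, not the nltk class itself. The helper names and the second sample string are hypothetical, and the first sample is a pre-processed tweet from Table 1 with plain angle brackets.

```python
from collections import Counter


def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation and tags stay attached to tokens."""
    return text.split()


def length_histogram(tweets):
    """Count how many tweets have each token length."""
    return Counter(len(whitespace_tokenize(tweet)) for tweet in tweets)


sample = "<user> thode concept clear kro... ise dekho <url>"
tokens = whitespace_tokenize(sample)

# Words written without spaces remain a single token (the limitation noted above).
joined = whitespace_tokenize("kyakarrahehoaap")
```

Because no language-specific rules are involved, the same function treats Hindi, English and code-mix tokens identically.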

7.2 Embedding matrix
To the best of our knowledge, there is no relevant embedding trained on noisy
code-mix data. For abusive tweet detection, authors in [24] used GloVe
embedding [45]. We compare our model with and without a 200-dimensional GloVe
embedding matrix. Keeping other parameters constant, we observed that
training the embedding matrix on our own data performed slightly better than
the pre-trained GloVe embedding. The performance difference is mainly due to
the presence of noisy code-mix tokens, most of which the pre-trained
embeddings treat as out of vocabulary. Although we have implemented various
pre-processing steps to minimize the noise, too much pre-processing would
deteriorate the actual

  25 https://www.nltk.org/api/nltk.tokenize.html
  26 https://inltk.readthedocs.io/en/latest/api_docs.html
data. In Sections 3 and 4, we have discussed the evolving nature of the data.
Too much pre-processing would overfit the model for the specific noisy data.

7.3 Evaluation benchmark
In our work, rather than using the F1 score, we have used the Matthews
Correlation Coefficient (MCC), also known as the phi coefficient. Authors in
[46] studied the reliability of MCC over the F1 score. MCC gives a balanced
score considering all four entries of the confusion matrix, i.e., true
positive (TP), false positive (FP), true negative (TN), and false negative
(FN). As a result, the MCC score remains constant even if we exchange the
target labels. The F1 score uses precision and recall, which makes it biased
towards the true positive instances. Precision and recall, and hence the F1
score, can differ drastically when the target labels are exchanged. The value
of MCC ranges from -1 to 1; it is undefined when its denominator vanishes,
e.g., when TP and FP, or TN and FN, are both zero.
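
The label-swap property can be checked numerically with the standard definitions of MCC and F1. The sketch below is illustrative: the confusion-matrix counts are made up for the example and are not results from our experiments, and returning `None` for an undefined score is a choice of this sketch.

```python
from math import sqrt


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; None when the denominator is zero."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return None
    return (tp * tn - fp * fn) / denominator


def f1(tp, fp, fn):
    """F1 score, the harmonic mean of precision and recall."""
    if 2 * tp + fp + fn == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)


def swap_labels(tp, fp, tn, fn):
    """Exchange the positive and negative classes of a confusion matrix."""
    return tn, fn, tp, fp  # new (tp, fp, tn, fn)


# Made-up counts for illustration.
original = (90, 10, 5, 5)
swapped = swap_labels(*original)

mcc_original, mcc_swapped = mcc(*original), mcc(*swapped)
f1_original = f1(original[0], original[1], original[3])
f1_swapped = f1(swapped[0], swapped[1], swapped[3])
```

For these counts the two MCC values coincide, while the F1 score changes substantially once the labels are exchanged.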

7.4 Architecture
While designing the architecture, we considered the various dependencies
discussed in [24]. We needed an architecture capable of capturing both
contextual and lexical dependency. As discussed in [24], the popular network
models LSTM and CNN are well suited to capturing such dependencies.
However, since we limited our architecture to classical neural networks, we
started with 2048 neurons in the first layer (L1) to capture the maximum
dependency. We halved the number of neurons in each subsequent layer
until we reached a layer with 256 neurons (L4). After L4, we halved the
number of neurons every second layer until L16. L17, the last layer, has 1
neuron. We split our dataset 8:1:1 into training, validation and test sets. We
trained our model for 100 epochs with batch size 64 using the Adam [47]
optimizer. We also used early stopping with best-weights restoration to save
computational cost. To maintain benchmark uniformity with the pre-trained
embedding matrix, we kept our embedding matrix’s dimension at 256.
We understand the importance of activation functions for neural network
performance. Inspired by the study of activation functions in [48],
we used the ReLU activation function for hidden layers and sigmoid for the
output layer.
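
The layer schedule described above can be generated programmatically. The sketch below only produces the list of layer widths and activations; the function names are illustrative, and building the actual network in a deep learning framework is left out because the framework is not specified here.

```python
def layer_widths(first=2048, single_until=256, last_hidden=4):
    """Widths of the 17 dense layers: halve every layer, then every second layer."""
    widths = []
    n = first
    while n >= single_until:       # L1..L4: 2048, 1024, 512, 256
        widths.append(n)
        n //= 2
    while n >= last_hidden:        # L5..L16: each width is used for two layers
        widths.extend([n, n])
        n //= 2
    widths.append(1)               # L17: single output neuron
    return widths


def layer_specs():
    """Pair each width with its activation: relu for hidden layers, sigmoid for output."""
    widths = layer_widths()
    return [(w, "relu") for w in widths[:-1]] + [(widths[-1], "sigmoid")]


specs = layer_specs()
```

Each `(units, activation)` pair could then be mapped onto one fully connected layer of whichever framework is used.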
    We understand the impact of hyperparameters and resource allotment on
the performance of a neural network model. For our experiments, we
used a freely available cloud Jupyter notebook environment, i.e., Colaboratory27.
Since the allocation of resources is dynamic on this platform, it may raise
the issue of reproducibility. To address this, we ran
the experiments 10 times and present the top-5 results. We have also
included two popular machine learning techniques, i.e., Naive Bayes (NB)
and Random Forest (RF). RF is a popular bagging technique, and NB has been
preferred for noisy data. We experimented with the number of estimators ranging

 27 https://colab.research.google.com/

Figure 9 Proposed neural network architecture.

Table 2 Architecture layers.
 Layer       L1    L2    L3    L4    L5:L6  L7:L8  L9:L10  L11:L12  L13:L14  L15:L16  L17
 Neurons     2048  1024  512   256   128    64     32      16       8        4        1
 Activation  relu  relu  relu  relu  relu   relu   relu    relu     relu     relu     sigmoid

from 100 to 1000 (in steps of 100) for RF. We found the highest MCC score
at 200 estimators for our dataset and at 700 for [34].

8 Results
Our proposed dataset is nearly eight times larger than the latest related
work [34], and its number of tokens is more than twelve times larger. Our
dataset consists of balanced classes, with 55.81% of instances in the abusive
class and the remaining 44.19% in the non-abusive class. The higher number of
abusive instances ensures the inclusion of the diverse cases discussed in
Section 4. We have also compared the performance with the related work [34]
using the proposed neural network architecture and machine learning
techniques. The performance of the proposed classical neural network reflects
the dataset’s quantity and quality. Our proposed dataset significantly
outperformed the nearest related work [34] for both neural network and machine
learning techniques. The details are presented in Table 4.
    We performed experiments both with and without the pre-trained
embedding. The pre-trained embedding acts as a filtration step that ensures
the dataset’s quality, since it considers only tokens that exist in its
vocabulary. As a result, noisy tokens are filtered out and only the tokens
matching the GloVe vocabulary remain. Omitting noisy tokens reduces the impact
of noise in both datasets. Even so, our proposed dataset outperformed [34].
    The model without pre-trained embedding performed better than the model
with pre-trained embedding. This difference in performance is as
expected. Both datasets carry many tokens that are out of the vocabulary of
the pre-trained embedding matrix. Omitting a large chunk of out-of-vocabulary
tokens discards information required for the model’s training. However,
in some cases for [34], the model with pre-trained embedding may outperform
Table 3 Cohen’s Kappa inter-annotation similarity.
                                    NLTP Expert Group
 Conventional User Group      Abusive       Non-Abusive
 Abusive                      19923         1052
 Non-Abusive                  405           15043

the model without pre-trained embedding. This reflects the insufficient
quantity of noisy instances in [34]. For the proposed work, the model without
pre-trained embedding outperformed the model with pre-trained embedding.
    The performance of the machine learning techniques on the proposed dataset
is close to the performance of the architecture with embedding. We have not
experimented with multiple diverse hyperparameters for the NB and RF
classifiers. However, RF outperformed the other models on the smaller
dataset [34].
    In our experiments, we also considered the reproducibility issue.
Keeping the parameters constant, we found variations among the results for
different training-testing sets. The variation in the score for the model
without pre-trained embedding is slightly higher than the variation for the
model with pre-trained embedding. Since the pre-trained embedding reduces the
noisy tokens, the classical neural network gets similar input even for diverse
training-testing sets. The variation reflects the inclusion of significant
noise in the dataset and the limitation of neural networks in learning from
noisy data.
    The Cohen’s kappa score of our proposed work (0.9185) is lower than that
of [34] (0.982). We have discussed various challenges in code-mix noisy
instances, including examples where even human annotators were unsure about
the label. However, due to precise instructions and the training of
annotators, only about 4% of instances (1,457 of 36,423, i.e., a fraction
below 0.05) had a conflict of label, as shown in Table 3. The factors
responsible for the conflicts are discussed in Sections 4 and 5. In our
dataset, we used the label given by the NLTP expert group in case of
differing opinions on an instance.
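
The agreement figures can be recomputed from the counts in Table 3 with the standard formula for Cohen's kappa. The sketch below is a sanity check rather than the original evaluation script; the function and variable names are illustrative.

```python
def cohens_kappa(both_pos, row_pos_col_neg, row_neg_col_pos, both_neg):
    """Cohen's kappa for a 2x2 agreement table between two raters."""
    total = both_pos + row_pos_col_neg + row_neg_col_pos + both_neg
    observed = (both_pos + both_neg) / total
    row_pos = both_pos + row_pos_col_neg      # rater 1 says positive
    col_pos = both_pos + row_neg_col_pos      # rater 2 says positive
    row_neg, col_neg = total - row_pos, total - col_pos
    expected = (row_pos * col_pos + row_neg * col_neg) / (total * total)
    return (observed - expected) / (1 - expected)


# Table 3: rows are the conventional user group, columns the NLTP expert group.
kappa = cohens_kappa(19923, 1052, 405, 15043)
conflicts = 1052 + 405
conflict_rate = conflicts / (19923 + 1052 + 405 + 15043)
```

The resulting kappa is within 0.001 of the 0.9185 reported in Table 4, and the conflict rate comes out at about 4% of the instances.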

9 Related work
Due to the popularity and growing use of code-mix languages, several
datasets are available for different tasks. Authors in [49] collected
restaurant reservation dialogues for a code-mix goal-oriented conversation
dataset in four languages, i.e., Hindi, Tamil, Gujarati and Bengali. Authors
in [50] proposed a Hindi-English code-mix dataset collected from Twitter for
the irony detection task. Authors in [34] collected a code-mix SMP dataset for
hate speech detection using hashtags and topics from politics. Authors in [51]
proposed a dataset of 1460 Hindi-English code-mix tweets for semantic role
labelling by mapping proposition bank labels from Paninian dependency
labels. Authors in [52] used an innovative way to create a code-mixed dataset
for language inference by utilizing movies with Hindi dialogues as premises
and hypotheses generated through crowd-sourcing. Authors in [53] proposed
Table 4 Dataset and benchmark performance.
                                                 [34]     Proposed work
 Properties     Size                             4575     36423
                Tokens                           7553     90695
                Abusive / Hate                   2584     20328
                Non-Abusive / Normal             1991     16095
                Cohen’s kappa score              0.982    0.9185

 Performance    With embedding                   0.0992   0.3591
 (MCC)                                           0.0974   0.3550
                                                 0.0898   0.3513
                                                 0.0769   0.3442
                                                 0.0632   0.3436

                Without embedding                0.1663   0.5194
                                                 0.1558   0.4953
                                                 0.1189   0.4821
                                                 0.1107   0.4797
                                                 0.0899   0.4796

                NB                               0.1453   0.3380
                RF                               0.1999   0.3874

a dataset of 5062 instances filtered from more than 90,000 instances, collected
using particular keywords, for bullying and non-bullying classification. Apart
from Hindi-English code-mix datasets, authors in [54] proposed 6,739
Malayalam-English code-mix comments from YouTube®28 for sentiment analysis.
Authors in [55] collected hateful posts on Facebook®29 and Twitter about
elections. Authors in [56] analyzed code-mix data in English-Dravidian
languages for sentiment analysis and offensive text detection with 4,851
instances and 191 instances, respectively. Authors in [56] also mentioned the
challenges due to the limited size of the available datasets.
    The research community has made considerable contributions to dealing
with the challenges in code-mix data. Authors in [57] designed a framework for
identifying the language in Hindi-English code-mix transcripts of Bollywood
songs. However, there is a scarcity of datasets containing actual noisy
code-mix conversational instances. The difference between actual data and
synthetic data hinders relevant contributions to the domain [58]. In our work,
we have tried to fill this gap.
    There is a difference between hate speech and abuse. The various features
of abusive instances are discussed in Section 4. A dedicated dataset would
help the community comprehensively analyze and design the most feasible
solutions. The author in [59] attributed the limitations in the performance of
various architectures to the unavailability of a suitable dataset. A small
dataset is insufficient for designing and testing complex architectures
for abusive instance detection [24]. Authors in [17] discussed the need for a

 28 https://www.youtube.com
 29 https://www.facebook.com