SemEval-2021 Task 1: Lexical Complexity Prediction
Matthew Shardlow (Manchester Metropolitan University, UK)
Richard Evans (University of Wolverhampton, UK)
Gustavo Henrique Paetzold (Universidade Tecnológica Federal do Paraná, Brazil)
Marcos Zampieri (Rochester Institute of Technology, USA)

m.shardlow@mmu.ac.uk

Abstract

This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al., 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.

1   Introduction

The occurrence of an unknown word in a sentence can adversely affect its comprehension by readers. Either they give up, misinterpret, or plough on without understanding. A committed reader may take the time to look up a word and expand their vocabulary, but even in this case they must leave the text, undermining their concentration. The natural language processing solution is to identify candidate words in a text that may be too difficult for a reader (Shardlow, 2013; Paetzold and Specia, 2016a). Each potential word is assigned a judgment by a system to determine whether it is deemed 'complex' or not. These scores indicate which words are likely to cause problems for a reader. The words that are identified as problematic can be the subject of numerous types of intervention, such as direct replacement in the setting of lexical simplification (Gooding and Kochmar, 2019), or extra information being given in the context of explanation generation (Rello et al., 2015).

Whereas previous solutions to this task have typically considered the Complex Word Identification (CWI) task (Paetzold and Specia, 2016a; Yimam et al., 2018), in which a binary judgment of a word's complexity is given (i.e., is a word complex or not?), we instead focus on the Lexical Complexity Prediction (LCP) task (Shardlow et al., 2020), in which a value is assigned from a continuous scale to identify a word's complexity (i.e., how complex is this word?). We ask multiple annotators to give a judgment on each instance in our corpus and take the average prediction as our complexity label. The former task (CWI) forces each user to make a subjective judgment about the nature of the word that models their personal vocabulary. Many factors may affect the annotator's judgment, including their education level, first language, specialism or familiarity with the text at hand. The annotators may also disagree on the level of difficulty at which to label a word as complex. One annotator may label every word they feel is above average difficulty, another may label words that they feel unfamiliar with but understand from the context, whereas another annotator may only label those words that they find totally incomprehensible, even in context. Our introduction of the LCP task seeks to address this annotator confusion by giving annotators a Likert scale to provide their judgments. Whilst annotators must still give a subjective judgment depending on their own understanding, familiarity and vocabulary, they do so in a way that better captures the meaning behind each judgment they have given. By aggregating these judgments we have developed a dataset that contains continuous labels in the range of 0–1 for each instance. This means that rather than predicting whether a word is complex or not (0 or 1), a system must now predict where, on our continuous scale, a word falls (0–1).

Consider the following sentence taken from a biomedical source, where the target word 'observation' has been highlighted:

(1) The observation of unequal expression leads to a number of questions.

In the binary annotation setting of CWI some annotators may rightly consider this term non-complex, whereas others may rightly consider it to be complex. Whilst the meaning of the word is reasonably clear to someone with scientific training, the context in which it is used is unfamiliar for a lay reader and will likely lead to them considering it complex. In our new LCP setting, we are able to ask annotators to mark the word on a scale from very easy to very difficult. Each user can give their subjective interpretation on this scale indicating how difficult they found the word. Whilst annotators will inevitably disagree (some finding it more or less difficult), this is captured and quantified as part of our annotations, with a word of this type likely to lead to a medium complexity value.

LCP is useful as part of the wider task of lexical simplification (Devlin and Tait, 1998), where it can be used both to identify candidate words for simplification (Shardlow, 2013) and to rank potential words as replacements (Paetzold and Specia, 2017). LCP is also relevant to the field of readability assessment, where knowing the proportion of complex words in a text helps to identify the overall complexity of the text (Dale and Chall, 1948).

This paper presents SemEval-2021 Task 1: Lexical Complexity Prediction. In this task we developed a new dataset for complexity prediction based on the previously published CompLex dataset. Our dataset covers 10,800 instances spanning 3 genres and containing unigrams and bigrams as targets for complexity prediction. We solicited participants in our task and released a trial, training and test split in accordance with the SemEval schedule. We accepted submissions in two separate Sub-tasks, the first being single words only and the second taking single words and multi-word expressions (modelled by our bigrams). In total 55 teams participated across the two Sub-tasks.

The rest of this paper is structured as follows: In Section 2 we discuss the previous two iterations of the CWI task. In Section 3, we present the CompLex 2.0 dataset that we have used for our task, including the methodology we used to produce trial, test and training splits. In Section 5, we show the results of the participating systems and compare the features that were used by each system. We finally discuss the nature of LCP in Section 7 and give concluding remarks in Section 8.

2   Related Tasks

CWI 2016 at SemEval   The CWI shared task was organized at SemEval 2016 (Paetzold and Specia, 2016a). The CWI 2016 organizers introduced a new CWI dataset and reported the results of 42 CWI systems developed by 21 teams. Words in their dataset were considered complex if they were difficult to understand for non-native English speakers according to a binary labelling protocol. A word was considered complex if at least one of the annotators found it to be difficult. The training dataset consisted of 2,237 instances, each labelled by 20 annotators, and the test dataset had 88,221 instances, each labelled by 1 annotator (Paetzold and Specia, 2016a).

The participating systems leveraged lexical features (Choubey and Pateria, 2016; Bingel et al., 2016; Quijada and Medero, 2016) and word embeddings (Kuru, 2016; S.P et al., 2016; Gillin, 2016), as well as finding that frequency features, such as those taken from Wikipedia (Konkol, 2016; Wróbel, 2016), were useful. Systems used binary classifiers such as SVMs (Kuru, 2016; S.P et al., 2016; Choubey and Pateria, 2016), Decision Trees (Choubey and Pateria, 2016; Quijada and Medero, 2016; Malmasi et al., 2016), Random Forests (Ronzano et al., 2016; Brooke et al., 2016; Zampieri et al., 2016; Mukherjee et al., 2016) and threshold-based metrics (Kauchak, 2016; Wróbel, 2016) to predict the complexity labels. The winning system made use of threshold-based methods and features extracted from Simple Wikipedia (Paetzold and Specia, 2016b).

A post-competition analysis (Zampieri et al., 2017) with oracle and ensemble methods showed that most systems performed poorly, due mostly to the way in which the data was annotated and the small size of the training dataset.

CWI 2018 at BEA   The second CWI Shared Task was organized at the BEA workshop 2018 (Yimam et al., 2018). Unlike the first task, this second task had two objectives. The first objective was the binary complex or non-complex classification of target words. The second objective was regression or probabilistic classification, in which 13 teams were asked to assign the probability of a target word being considered complex by a set of language learners. A major difference in this second task was that datasets of differing genres (TEXT GENRES), as well as English, German and Spanish datasets for monolingual speakers and a French dataset for multilingual speakers, were provided (Yimam et al., 2018).

Similar to 2016, systems made use of a variety of lexical features including word length (Wani et al., 2018; De Hertog and Tack, 2018; AbuRa'ed and Saggion, 2018; Hartmann and dos Santos, 2018; Alfter and Pilán, 2018; Kajiwara and Komachi, 2018), frequency (De Hertog and Tack, 2018; Aroyehun et al., 2018; Alfter and Pilán, 2018; Kajiwara and Komachi, 2018), N-gram features (Gooding and Kochmar, 2018; Popović, 2018; Hartmann and dos Santos, 2018; Alfter and Pilán, 2018; Butnaru and Ionescu, 2018) and word embeddings (De Hertog and Tack, 2018; AbuRa'ed and Saggion, 2018; Aroyehun et al., 2018; Butnaru and Ionescu, 2018). A variety of classifiers were used, ranging from traditional machine learning classifiers (Gooding and Kochmar, 2018; Popović, 2018; AbuRa'ed and Saggion, 2018) to Neural Networks (De Hertog and Tack, 2018; Aroyehun et al., 2018). The winning system made use of Adaboost with WordNet features, POS tags, dependency parsing relations and psycholinguistic features (Gooding and Kochmar, 2018).

3   Data

We previously reported on the annotation of the CompLex dataset (Shardlow et al., 2020) (hereafter referred to as CompLex 1.0), in which we annotated around 10,000 instances for lexical complexity using the Figure Eight platform. The instances spanned three genres: Europarl, taken from the proceedings of the European Parliament (Koehn, 2005); The Bible, taken from an electronic distribution of the World English Bible translation (Christodouloupoulos and Steedman, 2015); and Biomedical literature, taken from the CRAFT corpus (Bada et al., 2012). We limited our annotations to focus only on nouns and multi-word expressions following a Noun-Noun or Adjective-Noun pattern, using the POS tagger from Stanford CoreNLP (Manning et al., 2014) to identify these patterns.

Whilst these annotations allowed us to report on the dataset and to show some trends, the overall quality of the annotations we received was poor and we ended up discarding a large number of them. For CompLex 1.0 we retained only instances with four or more annotations, and the low number of annotations (average number of annotators = 7) led to the overall dataset being less reliable than initially expected.

For the Shared Task we chose to boost the number of annotations on the same data as used for CompLex 1.0 using Amazon's Mechanical Turk platform. We requested a further 10 annotations on each data instance, bringing up the average number of annotators per instance. Annotators were presented with the same task layout as in the annotation of CompLex 1.0 and we defined the Likert Scale points as previously:

Very Easy: Words which were very familiar to an annotator.

Easy: Words whose meaning an annotator was aware of.

Neutral: A word which was neither difficult nor easy.

Difficult: Words whose meaning an annotator was unclear about, but may have been able to infer from the sentence.

Very Difficult: Words that an annotator had never seen before, or were very unclear.

These annotations were aggregated with the retained annotations of CompLex 1.0 to give our new dataset, CompLex 2.0, covering 10,800 instances across single and multi-words and across 3 genres.
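To make the aggregation step concrete, here is a minimal sketch (not the organisers' actual pipeline) of how a set of categorical Likert judgments can be mapped onto the continuous scale described under Continuous Annotations below and averaged into a single label; the example judgments and the dictionary name are illustrative.

```python
# Minimal sketch of the aggregation step (illustrative; not the organisers' pipeline).
from statistics import mean

# Likert points mapped to the continuous scale used for CompLex 2.0.
LIKERT_TO_SCORE = {
    "very easy": 0.0,
    "easy": 0.25,
    "neutral": 0.5,
    "difficult": 0.75,
    "very difficult": 1.0,
}

def aggregate_complexity(judgments):
    """Average a list of categorical judgments into a single 0-1 complexity label."""
    return mean(LIKERT_TO_SCORE[j.lower()] for j in judgments)

# Example: seven hypothetical annotators judging one instance.
judgments = ["easy", "neutral", "easy", "very easy", "neutral", "easy", "difficult"]
print(round(aggregate_complexity(judgments), 3))  # -> 0.357
```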

The features that make our corpus distinct from other corpora which focus on the CWI and LCP tasks are described below:

Continuous Annotations: We have annotated our data using a 5-point Likert Scale. Each instance has been annotated multiple times and we have taken the mean average of these annotations as the label for each data instance. To calculate this average we converted the Likert Scale points to a continuous scale as follows: Very Easy → 0, Easy → 0.25, Neutral → 0.5, Difficult → 0.75, Very Difficult → 1.0.

Contextual Annotations: Each instance in the corpus is presented with its enclosing sentence as context. This ensures that the sense of a word can be identified when assigning it a complexity value. Whereas previous work has reannotated the data from the CWI-2018 shared task with word senses (Strohmaier et al., 2020), we do not make explicit sense distinctions between our tokens, instead leaving this task up to participants.

Repeated Token Instances: We provide more than one context for each token (up to a maximum of five contexts per genre). These words were annotated separately during annotation, with the expectation that tokens in different contexts would receive differing complexity values. This deliberately penalises systems that do not take the context of a word into account.

Multi-word Expressions: In our corpus we have provided 1,800 instances of multi-word expressions (split across our 3 sub-corpora). Each MWE is modelled as a Noun-Noun or Adjective-Noun pattern followed by any POS tag which is not a noun. This avoids selecting the first portion of complex noun phrases. There is no guarantee that these will correspond to true MWEs that take on a meaning beyond the sum of their parts, and further investigation into the types of MWEs present in the corpus would be informative.

Aggregated Annotations: By aggregating the Likert scale labels we have generated crowdsourced complexity labels for each instance in our corpus. We are assuming that, although there is inevitably some noise in any large annotation project (and especially so in crowdsourcing), this will even out in the averaging process to give a mean value reflecting the appropriate complexity for each instance. By taking the mean average we are assuming unimodal distributions in our annotations.

Varied Genres: We have selected for diverse genres as mentioned above. Previous CWI datasets have focused on informal text such as Wikipedia and multi-genre text, such as news. By focusing on specific texts we force systems to learn generalised complexity annotations that are appropriate in a cross-genre setting.

We have presented summary statistics for CompLex 2.0 in Table 1. In total, 5,617 unique words are split across 10,800 contexts, with an average complexity across our entire dataset of 0.321. Each genre has 3,600 contexts, each split between 3,000 single words and 600 multi-word expressions. Whereas single words are slightly below the average complexity of the dataset at 0.302, multi-word expressions are much more complex at 0.419, indicating that annotators found these more difficult to understand. Similarly, Europarl and the Bible were less complex than the corpus average, whereas the Biomedical articles were more complex. The number of unique tokens varies from one genre to another, as the tokens were selected at random and discarded if there were already more than 5 occurrences of the given token in the dataset. This stochastic selection process led to a varied dataset with some tokens only having one context, whereas others have as many as five in a given genre. On average each token has around 2 contexts.

4   Data Splits

In order to run the shared task we partitioned our dataset into Trial, Train and Test splits and distributed these according to the SemEval schedule. A criticism of previous CWI shared tasks is that the training data did not accurately reflect the distribution of instances in the testing data. We sought to avoid this by stratifying our selection process for a number of factors. The first factor we considered was genre. We ensured that an even number of instances from each genre was present in each split. We also stratified for complexity, ensuring that each split had a similar distribution of complexities. Finally, we also stratified the splits by token, ensuring that multiple instances containing the same token occurred in only one split. This last criterion ensures that systems do not overfit to the test data by learning the complexities of specific tokens in the training data.

Performing a robust stratification of a dataset according to multiple features is a non-trivial optimisation problem. We solved this by first grouping all instances in a genre by token and sorting these groups by the complexity of the least complex instance in the group. For each genre, we passed through this sorted list and for each set of 20 groups we put the first group in the trial set, the next two groups in the test set and the remaining 17 groups in the training data. This allowed us to get a rough 5-85-10 split between trial, training and test data. The trial and training data were released in this ordered format; however, to prevent systems from guessing the labels based on the data ordering, we randomised the order of the instances in the test data prior to release. The splits that we used for the Shared Task are available via GitHub: https://github.com/MMU-TDMLab/CompLex
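As an illustration of this procedure, the sketch below implements the 1/2/17 allocation over token groups for one genre. It assumes each instance is represented as a dict with 'token' and 'complexity' keys, which is an assumption about the data layout rather than a reproduction of the organisers' code.

```python
# Sketch of the token-grouped stratified split described above (illustrative only;
# assumes instances are dicts with 'token' and 'complexity' keys).
from collections import defaultdict

def split_genre(instances):
    """Split one genre into trial/train/test using the 1-2-17 allocation per 20 token groups."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["token"]].append(inst)

    # Sort token groups by the complexity of their least complex instance.
    ordered = sorted(groups.values(), key=lambda g: min(i["complexity"] for i in g))

    trial, train, test = [], [], []
    for idx, group in enumerate(ordered):
        position = idx % 20           # walk the sorted list in blocks of 20 groups
        if position == 0:
            trial.extend(group)       # 1 group in 20  -> trial (~5%)
        elif position in (1, 2):
            test.extend(group)        # 2 groups in 20 -> test (~10%)
        else:
            train.extend(group)       # 17 groups in 20 -> train (~85%)
    return trial, train, test
```

Because whole token groups are assigned to a single split, no token can appear in more than one of trial, train or test, which is the anti-overfitting property described above.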

Subset   Genre      Contexts   Unique Tokens   Average Complexity
All      Total        10,800           5,617                0.321
All      Europarl      3,600           2,227                0.303
All      Biomed        3,600           1,904                0.353
All      Bible         3,600           1,934                0.307
Single   Total         9,000           4,129                0.302
Single   Europarl      3,000           1,725                0.286
Single   Biomed        3,000           1,388                0.325
Single   Bible         3,000           1,462                0.293
MWE      Total         1,800           1,488                0.419
MWE      Europarl        600             502                0.388
MWE      Biomed          600             516                0.491
MWE      Bible           600             472                0.377

Table 1: The statistics for CompLex 2.0.

Table 2 presents statistics on each split in our data, where it can be seen that we were able to achieve a roughly even split between genres across the trial, train and test data.

Subset   Genre      Trial   Train   Test
All      Total        520    9179   1101
All      Europarl     180    3010    410
All      Biomed       168    3090    342
All      Bible        172    3079    349
Single   Total        421    7662    917
Single   Europarl     143    2512    345
Single   Biomed       135    2576    289
Single   Bible        143    2574    283
MWE      Total         99    1517    184
MWE      Europarl      37     498     65
MWE      Biomed        33     514     53
MWE      Bible         29     505     66

Table 2: The Trial, Train and Test splits that were used as part of the shared task.

5   Results

The full results of our task can be seen in Appendix A. We had 55 teams participate in our 2 Sub-tasks, with 19 participating in Sub-task 1 only, 1 participating in Sub-task 2 only and 36 participating in both Sub-tasks. We have used Pearson's correlation for our final ranking of participants, but we have also included other metrics that are appropriate for evaluating continuous and ranked data and provided secondary rankings of these.

Sub-task 1 asked participants to assign complexity values to each of the single-word instances in our corpus. For Sub-task 2, we asked participants to submit results on both single words and MWEs. We did not rank participants on MWE-only submissions due to the relatively small number of MWEs in our corpus (184 in the test set).

The metrics we chose for ranking were as follows:

Pearson's Correlation: We chose this metric as our primary method of ranking as it is well known and understood, especially in the context of evaluating systems with continuous outputs. Pearson's correlation is robust to changes in scale and measures how the input variables change with each other.

Spearman's Rank: This metric does not consider the values output by a system, or the values of the test labels, only the order of those labels. It was chosen as a secondary metric as it is more robust to outliers than Pearson's correlation.

Mean Absolute Error (MAE): Typically used for the evaluation of regression tasks, we included MAE as it gives an indication of how close the predicted labels were to the gold labels for our task.

Mean Squared Error (MSE): There is little difference in the calculation of MSE vs. MAE; however, we also include this metric for completeness.

R2: This measures the proportion of variance of the original labels captured by the predicted labels. It is possible to do well on all the other metrics, yet do poorly on R2 if a system produces annotations with a different distribution than those in the original labels.
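All five metrics are available in standard Python libraries; the snippet below shows one way a submission might be scored locally. The gold and predicted values are made-up examples, and this is not the official scoring script.

```python
# Example computation of the five ranking metrics (made-up values; not the official scorer).
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

gold = np.array([0.10, 0.25, 0.40, 0.55, 0.70])  # hypothetical gold complexities
pred = np.array([0.15, 0.20, 0.45, 0.50, 0.80])  # hypothetical system outputs

print("Pearson's r   :", pearsonr(gold, pred)[0])
print("Spearman's rho:", spearmanr(gold, pred)[0])
print("MAE           :", mean_absolute_error(gold, pred))
print("MSE           :", mean_squared_error(gold, pred))
print("R2            :", r2_score(gold, pred))
```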

In Table 3 we show the scores of the top 10 systems across our 2 Sub-tasks according to Pearson's Correlation. We have only reported on Pearson's correlation and R2 in these tables, but the full results with all metrics are available in Appendix A. We have included a Frequency Baseline produced using log-frequency from the Google Web1T and linear regression, which was beaten by the majority of our systems. From these results we can see that systems were able to attain reasonably high scores on our dataset, with the winning systems reporting Pearson's Correlation of 0.7886 for Sub-task 1 and 0.8612 for Sub-task 2, as well as high R2 scores of 0.6210 for Sub-task 1 and 0.7389 for Sub-task 2. The rankings remained stable across Spearman's rank, MAE and MSE, with some small variations. Scores were generally higher on Sub-task 2 than on Sub-task 1, and this is likely to be because of the different groups of token-types (single words and MWEs). MWEs are known to be more complex than single words and so this fact may have implicitly helped systems to better model the variance of complexities between the two groups.

Task 1
Team                   Pearson        R2
JUST BLUE              0.7886 (1)     0.6172 (2)
DeepBlueAI             0.7882 (2)     0.6210 (1)
Alejandro Mosquera     0.7790 (3)     0.6062 (3)
Andi                   0.7782 (4)     0.6036 (4)
CS-UM6P                0.7779 (5)     0.3813 (47)
tuqa                   0.7772 (6)     0.5771 (12)
OCHADAI-KYOTO          0.7772 (7)     0.6015 (5)
BigGreen               0.7749 (8)     0.5983 (6)
CSECU-DSC              0.7716 (9)     0.5909 (8)
IA PUCP                0.7704 (10)    0.5929 (7)
Frequency Baseline     0.5287         0.2779

Task 2
Team                   Pearson        R2
DeepBlueAI             0.8612 (1)     0.7389 (1)
rg pa                  0.8575 (2)     0.7035 (5)
xiang wen tian         0.8571 (3)     0.7012 (7)
andi gpu               0.8543 (4)     0.7055 (4)
ren wo xing            0.8541 (5)     0.6967 (8)
Andi                   0.8506 (6)     0.7107 (2)
CS-UM6P                0.8489 (7)     0.6380 (17)
OCHADAI-KYOTO          0.8438 (8)     0.7103 (3)
LAST                   0.8417 (9)     0.7030 (6)
KFU                    0.8406 (10)    0.6967 (9)
Frequency Baseline     0.6571         0.4030

Table 3: The top 10 systems for each task according to Pearson's correlation. We have also included the R2 score to help interpret the former. For full rankings, see Appendix A.

6   Participating Systems

In this section we have analysed the participating systems in our task. System Description papers were submitted by 32 teams. In the subsections below, we first give brief summaries of some of the top systems according to Pearson's correlation for each task for which we had a description. We then discuss the features used across different systems, as well as the approaches to the task that different teams chose to take. We have prepared a comprehensive table comparing the features and approaches of all systems for which we have the relevant information in Appendix B.

6.1   System Summaries

DeepBlueAI: This system attained the highest Pearson's Correlation on Sub-task 2 and the second highest Pearson's Correlation on Sub-task 1. It also attained the highest R2 score across both tasks. The system used an ensemble of pre-trained language models fine-tuned for the task with Pseudo Labelling, Data Augmentation, Stacked Training Models and Multi-Sample Dropout. The data was encoded for the transformer models using the genre and token as a query string and the given context as a supplementary input.

JUST BLUE: This system attained the highest Pearson's Correlation for Sub-task 1. The system did not participate in Sub-task 2. This system makes use of an ensemble of BERT and RoBERTa. Separate models are fine-tuned for context and token prediction and these are weighted 20-80 respectively. The average of the BERT models and RoBERTa models is taken to give a final score.

RG PA: This system attained the second highest Pearson's Correlation for Sub-task 2. The system uses a fine-tuned RoBERTa model and boosts the training data for the second task by identifying similar examples from the single-word portion of the dataset to train the multi-word classifier. They use an ensemble of RoBERTa models in their final classification, averaging the outputs to enhance performance.

Alejandro Mosquera: This system attained the third highest Pearson's Correlation for Sub-task 1. The system used a feature-based approach, incorporating length, frequency, semantic features from WordNet and sentence-level readability features. These were passed through a Gradient Boosted Regression.

Andi: This system attained the fourth highest Pearson's Correlation for Sub-task 1. They combine a traditional feature-based approach with features from pre-trained language models. They use psycholinguistic features, as well as GloVe and Word2Vec embeddings. They also take features from an ensemble of language models: BERT, RoBERTa, ELECTRA, ALBERT and DeBERTa. All features are passed through Gradient Boosted Regression to give the final output score.

CS-UM6P: This system attained the fifth highest Pearson's Correlation for Sub-task 1 and the seventh highest Pearson's Correlation for Sub-task 2. The system uses BERT and RoBERTa and encodes the context and token for the language models to learn from. Interestingly, whilst this system scored highly for Pearson's correlation, the R2 metric is much lower on both Sub-tasks. This may indicate the presence of significant outliers in the system's output.

OCHADAI-KYOTO: This system attained the seventh highest Pearson's Correlation on Sub-task 1 and the eighth highest Pearson's Correlation on Sub-task 2. The system used fine-tuned BERT and RoBERTa models with the token and context encoded. They employed multiple training strategies to boost performance.

6.2   Approaches

There are three main types of systems that were submitted to our task. In line with the state of the art in modern NLP, these can be categorised as: feature-based systems, deep learning systems and systems which use a hybrid of the former two approaches. Although deep learning based systems attained the highest Pearson's Correlation on both Sub-tasks, occupying the first two places in each task, feature-based systems are not far behind, attaining the third and fourth spots on Sub-task 1 with a similar score to the top systems. We have described each approach as applied to our task below.

Feature-based systems use a variety of features known to be useful for lexical complexity. In particular, lexical frequency and word length feature heavily, with many different ways of calculating these metrics, such as looking at various corpora and investigating syllable or morpheme length. Psycholinguistic features, which model people's perception of words, are understandably popular for this task as complexity is a perceived phenomenon. Semantic features taken from WordNet, modelling the sense of the word and its ambiguity or abstractness, have been used widely, as well as sentence-level features aiming to model the context around the target words. Some systems chose to identify named entities, as these may be innately more difficult for a reader. Word inclusion lists were also a popular feature, denoting whether a word was found on a given list of easy-to-read vocabulary. Finally, word embeddings are a popular feature, coming from static resources such as GloVe or Word2Vec, but also being derived through the use of Transformer models such as BERT, RoBERTa, XLNet or GPT-2, which provide context-dependent embeddings suitable for our task.

These features are passed through a regression system, with Gradient Boosted Regression and Random Forest Regression being two popular approaches amongst participants for this task. Both apply scale invariance, meaning that less pre-processing of inputs is necessary.

Deep learning based systems invariably rely on a pre-trained language model and fine-tune this using transfer learning to attain strong scores on the task. BERT and RoBERTa were used widely in our task, with some participants also opting for ALBERT, ERNIE, or other such language models. To prepare data for these language models, most participants following this approach concatenated the token with the context, separated by a special token (⟨SEP⟩). The language model was then trained and the embedding of the ⟨CLS⟩ token extracted and passed through a further fine-tuned network for complexity prediction. Adaptations to this methodology include applying training strategies such as adversarial training, multi-task learning, dummy annotation generation and capsule networks.

Finally, hybrid approaches use a mixture of deep learning, by fine-tuning a neural network, alongside feature-based approaches. The features may be concatenated to the input embeddings, or may be concatenated at the output prior to further training. Whilst this strategy appears to be the best of both worlds, uniting linguistic knowledge with the power of pre-trained language models, the hybrid systems do not tend to perform as well as either feature-based or deep learning systems.
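The sketch below illustrates the general deep learning recipe described above using the Hugging Face transformers library: the target token is paired with its context, and a regression head is placed over the ⟨CLS⟩ representation. It is a minimal outline under stated assumptions, not a reproduction of any particular team's system; the model name and the sigmoid regression head are assumptions.

```python
# Minimal sketch of the common fine-tuning recipe described above
# (illustrative; not a reproduction of any specific team's system).
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

class ComplexityRegressor(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):  # model choice is an assumption
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, **encoded):
        outputs = self.encoder(**encoded)
        cls = outputs.last_hidden_state[:, 0]             # embedding of the [CLS] token
        return torch.sigmoid(self.head(cls)).squeeze(-1)  # complexity score in (0, 1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ComplexityRegressor()

# Target token paired with its enclosing sentence; the tokenizer inserts [SEP] between them.
batch = tokenizer(
    ["observation"],
    ["The observation of unequal expression leads to a number of questions."],
    padding=True, truncation=True, return_tensors="pt",
)
score = model(**batch)  # during training, regress this against the 0-1 gold label (e.g. MSE loss)
```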

6.3   MWEs

For Sub-task 2 we asked participants to submit predictions for both single words and multi-words from our corpus. We hoped this would encourage participants to consider models that adapted single-word lexical complexity to multi-word lexical complexity. We observed a number of strategies that participants employed to create the annotations for this secondary portion of our data.

For systems that employed a deep learning approach, it was relatively simple to incorporate MWEs as part of their training procedure. These systems encoded the input using a query and context, separated by a ⟨SEP⟩ token. The number of tokens prior to the ⟨SEP⟩ token did not matter and either one or two tokens could be placed there to handle single and multi-word instances simultaneously.

However, feature-based systems could not employ this trick and needed to devise more imaginative strategies for handling MWEs. Some systems handled them by averaging the features of both tokens in the MWE, or by predicting scores for each token and then averaging these scores. Other systems doubled their feature space for MWEs and trained a new model which took the features of both words into account.
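As a concrete example of the score-averaging strategy, a feature-based pipeline might fall back on per-token predictions for bigram targets; the sketch below shows one such hypothetical fallback, with predict_single standing in for whatever single-word model a team trained.

```python
# Hypothetical fallback for MWE targets in a feature-based system: score each
# token independently with the single-word model and average the predictions.
def predict_mwe_complexity(mwe, predict_single):
    """mwe: the target bigram, e.g. 'unequal expression';
    predict_single: any function mapping one word to a 0-1 complexity score."""
    scores = [predict_single(token) for token in mwe.split()]
    return sum(scores) / len(scores)
```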

7   Discussion

In this paper we have posited the new task of Lexical Complexity Prediction. This builds on previous work on Complex Word Identification, specifically by providing annotations which are continuous rather than binary or probabilistic as in previous tasks. Additionally, we provided a dataset with annotations in context, covering three diverse genres and incorporating MWEs as well as single tokens. We have moved towards this task, rather than rerunning another CWI task, as the outputs of the models are more useful for a diverse range of follow-on tasks. For example, whereas CWI is particularly useful as a preprocessing step for lexical simplification (identifying which words should be transformed), LCP may also be useful for readability assessment or as a rich feature in other downstream NLP tasks. A continuous annotation allows a ranking to be given over words, rather than binary categories, meaning that we can not only tell whether a word is likely to be difficult for a reader, but also how difficult that word is likely to be. If a system requires binary complexity (as in the case of lexical simplification), it is easy to transform our continuous complexity values into a binary value by placing a threshold on the complexity scale. The value of the threshold to be selected will likely depend on the target audience, with more competent speakers requiring a higher threshold. When selecting a threshold, the categories we used for annotation should be taken into account; for example, a threshold of 0.5 would indicate all words that were rated as neutral or above.

To create our annotated dataset, we employed crowdsourcing with a Likert scale and aggregated the categorical judgments on this scale to give a continuous annotation. It should be noted that this is not the same as giving a truly continuous judgment (i.e., asking each annotator to give a value between 0 and 1). We selected this protocol as the Likert Scale is familiar to annotators and allows them to select according to defined points (we provided the definitions given earlier at annotation time). The annotation points that we gave were not intended to give an even distribution of annotations and it was our expectation that most words would be familiar to some degree, falling in the very easy or easy categories. We pre-selected for harder words to ensure that there were also words in the difficult and very difficult categories. As such, the corpus we have presented is not designed to be representative of the distribution of words across the English language. To create such a corpus, one would need to annotate all words according to our scale with no filtering. The general distribution of annotations in our corpus is towards the easier end of the Likert scale.

A criticism of the approach we have employed is that it allows for subjectivity in the annotation process. Certainly one annotator's perception of complexity will be different to another's. Giving fixed values of complexity for every word will not reflect the specific difficulties that one reader, or one reader group, will face. The annotations we have provided are averaged values of the annotations given by our annotators; we chose to keep all instances, rather than filtering out those where annotators gave a wide spread of complexity annotations. Further work may be undertaken to give interesting insights into the nature of subjectivity in annotations. For example, some words may be rated as easy or difficult by all annotators, whereas others may receive both easy and difficult annotations, indicating that the perceived complexity of the instance is more subjective. We did not make the individual annotations available as part of the shared task data, to encourage systems to focus primarily on the prediction of complexity.

An issue with the previous shared tasks is that scores were typically low and that systems tended to struggle to beat reasonable baselines, such as those based on lexical frequency. We were pleased to see that systems participating in our task returned scores that indicated that they had learnt to model the problem well (Pearson's Correlation of 0.7886 on Task 1 and 0.8612 on Task 2). MWEs are typically more complex than single words and it may be the case that these exhibited a lower variance, and were thus easier to predict for the systems. The strong Pearson's Correlation is backed up by a high R2 score (0.6172 for Task 1 and 0.7389 for Task 2), which indicates that the variance in the data is captured accurately by the models' predictions. These models strongly outperformed a reasonable baseline based on word frequency, as shown in Table 3.

Whilst we have chosen in this report to rank systems based on their score on Pearson's correlation, giving a final ranking over all systems, it should be noted that there is very little variation in score between the top systems and all other systems. For Task 1 there are 0.0182 points of Pearson's Correlation separating the systems at ranks 1 and 10. For Task 2 a similar difference of 0.021 points of Pearson's Correlation separates the systems at ranks 1 and 10. These are small differences and it may be the case that, had we selected a different random split in our dataset, this would have led to a different ordering in our results (Gorman and Bedrick, 2019; Søgaard et al., 2020). This is not unique to our task and is something for the SemEval community to ruminate on as the focus of NLP tasks continues to move towards better evaluation rather than better systems.

An analysis of the systems that participated in our task showed that there was little variation between deep learning approaches and feature-based approaches, although deep learning approaches ultimately attained the highest scores on our data. Generally the deep learning and feature-based approaches are interleaved in our results table, showing that both approaches are still relevant for LCP. One factor that did appear to affect system output was the inclusion of context, whether that was in a deep learning setting or a feature-based setting. Systems which reported using no context appeared to perform worse in the overall rankings. Another feature that may have helped performance is the inclusion of previous CWI datasets (Yimam et al., 2017; Maddela and Xu, 2018). We were aware of these when developing the corpus and attempted to make our data sufficiently distinct in style to prevent direct reuse of these resources.

A limitation of our task is that it focuses solely on LCP for the English language. Previous CWI shared tasks (Yimam et al., 2018) and simplification efforts (Saggion et al., 2015; Aluísio and Gasperin, 2010) have focused on languages other than English and we hope to extend this task in the future to other languages.

8   Conclusion

We have presented SemEval-2021 Task 1 on Lexical Complexity Prediction. We developed a new dataset focusing on continuous annotations in context across three genres. We solicited participants via SemEval and 55 teams submitted results across our two Sub-tasks. We have shown the results of these systems and discussed the factors that helped systems to perform well. We have analysed all the systems that participated and categorised their findings to help future researchers understand which approaches are suitable for LCP.

References

Ahmed AbuRa'ed and Horacio Saggion. 2018. LaSTUS/TALN at Complex Word Identification (CWI) 2018 Shared Task. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

David Alfter and Ildikó Pilán. 2018. SB@GU at the Complex Word Identification 2018 Shared Task. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Sandra Aluísio and Caroline Gasperin. 2010. Fostering digital inclusion and accessibility: The PorSimples project for simplification of Portuguese texts. In Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas, pages 46–53, Los Angeles, California. Association for Computational Linguistics.

Segun Taofeek Aroyehun, Jason Angel, Daniel Alejandro Pérez Alvarez, and Alexander Gelbukh. 2018. Complex Word Identification: Convolutional Neural Network vs. Feature Engineering. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Abdul Aziz, MD. Akram Hossain, and Abu Nowshed Chy. 2021. CSECU-DSG at SemEval-2021 Task 1: Fusion of Transformer Models for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):1–20.

Yves Bestgen. 2021. LAST at SemEval-2021 Task 1: Improving Multi-Word Complexity Prediction Using Bigram Association Measures. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Joachim Bingel, Natalie Schluter, and Héctor Martínez Alonso. 2016. CoastalCPH at SemEval-2016 Task 11: The importance of designing your Neural Networks right. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1028–1033, San Diego, California. Association for Computational Linguistics.

Julian Brooke, Alexandra Uitdenbogerd, and Timothy Baldwin. 2016. Melbourne at SemEval 2016 Task 11: Classifying Type-level Word Complexity using Random Forests with Corpus and Word List Features. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 975–981, San Diego, California. Association for Computational Linguistics.

Andrei Butnaru and Radu Tudor Ionescu. 2018. UnibucKernel: A kernel-based learning method for complex word identification. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Prafulla Choubey and Shubham Pateria. 2016. Garuda & Bhasha at SemEval-2016 Task 11: Complex Word Identification Using Aggregated Learning Models. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1006–1010, San Diego, California. Association for Computational Linguistics.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.

E. Dale and J. S. Chall. 1948. A formula for predicting readability. Educational Research Bulletin, 27.

Dirk De Hertog and Anaïs Tack. 2018. Deep Learning Architecture for Complex Word Identification. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Abhinandan Desai, Kai North, Marcos Zampieri, and Christopher Homan. 2021. LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

S. Devlin and J. Tait. 1998. The use of a psycholinguistic database in the simplification of text for aphasic readers. In Linguistic Databases. Stanford, USA: CSLI Publications.

Robert Flynn and Matthew Shardlow. 2021. Manchester Metropolitan at SemEval-2021 Task 1: Convolutional Networks for Complex Word Identification. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Nat Gillin. 2016. Sensible at SemEval-2016 Task 11: Neural Nonsense Mangled in Ensemble Mess. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 963–968, San Diego, California. Association for Computational Linguistics.

Sebastian Gombert and Sabine Bartsch. 2021. TUDA-CCL at SemEval-2021 Task 1: Using Gradient-boosted Regression Tree Ensembles Trained on a Heterogeneous Feature Set for Predicting Lexical Complexity. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Sian Gooding and Ekaterina Kochmar. 2018. CAMB at CWI Shared Task 2018: Complex Word Identification with Ensemble-Based Voting. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Sian Gooding and Ekaterina Kochmar. 2019. Recursive context-aware lexical simplification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4855–4865.

Kyle Gorman and Steven Bedrick. 2019. We need to talk about standard splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2786–2791, Florence, Italy. Association for Computational Linguistics.

Nathan Hartmann and Leandro Borges dos Santos. 2018. NILC at CWI 2018: Exploring Feature Engineering and Feature Learning. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.
Bo Huang, Yang Bai, and Xiaobing Zhou. 2021. hub at SemEval-2021 Task 1: Fusion of Sentence and Word Frequency to Predict Lexical Complexity. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Aadil Islam, Weicheng Ma, and Soroush Vosoughi. 2021. BigGreen at SemEval-2021 Task 1: Lexical Complexity Prediction with Assembly Models. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Tomoyuki Kajiwara and Mamoru Komachi. 2018. Complex Word Identification Based on Frequency in a Learner Corpus. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

David Kauchak. 2016. Pomona at SemEval-2016 Task 11: Predicting Word Complexity Based on Corpus Frequency. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1047–1051, San Diego, California. Association for Computational Linguistics.

Milton King, Ali Hakimi Parizi, Samin Fakharian, and Paul Cook. 2021. UNBNLP at SemEval-2021 Task 1: Predicting lexical complexity with masked language models and character-level encoders. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86. Citeseer.

Michal Konkol. 2016. UWB at SemEval-2016 Task 11: Exploring Features for Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1038–1041, San Diego, California. Association for Computational Linguistics.

Onur Kuru. 2016. AI-KU at SemEval-2016 Task 11: Word Embeddings and Substring Features for Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1042–1046, San Diego, California. Association for Computational Linguistics.

Chaya Liebeskind, Otniel Elkayam, and Shmuel Liebeskind. 2021. JCT at SemEval-2021 Task 1: Context-aware Representation for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Mounica Maddela and Wei Xu. 2018. A word-complexity lexicon and a neural readability ranking model for lexical simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3749–3760.

Shervin Malmasi, Mark Dras, and Marcos Zampieri. 2016. LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 996–1000, San Diego, California. Association for Computational Linguistics.

Nabil El Mamoun, Abdelkader El Mahdaouy, Abdellah El Mekki, Kabil Essefar, and Ismail Berrada. 2021. CS-UM6P at SemEval-2021 Task 1: A Deep Learning Model-based Pre-trained Transformer Encoder for Lexical Complexity. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.

Alejandro Mosquera. 2021. Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Niloy Mukherjee, Braja Gopal Patra, Dipankar Das, and Sivaji Bandyopadhyay. 2016. JU NLP at SemEval-2016 Task 11: Identifying Complex Words in a Sentence. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 986–990, San Diego, California. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2016a. SemEval 2016 Task 11: Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 560–569, San Diego, California. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2016b. SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 969–974, San Diego, California. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2017. Lexical simplification with neural ranking. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 34–40, Valencia, Spain. Association for Computational Linguistics.
Gustavo Henrique Paetzold. 2021. UTFPR at SemEval-2021 Task 1: Complexity Prediction by Combining BERT Vectors and Classic Features. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Chunguang Pan, Bingyan Song, Shengguang Wang, and Zhipeng Luo. 2021. DeepBlueAI at SemEval-2021 Task 1: Lexical Complexity Prediction with A Deep Ensemble Approach. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Maja Popović. 2018. Complex Word Identification using Character n-grams. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Maury Quijada and Julie Medero. 2016. HMC at SemEval-2016 Task 11: Identifying Complex Words Using Depth-limited Decision Trees. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1034–1037, San Diego, California. Association for Computational Linguistics.

Gang Rao, Maochang Li, Xiaolong Hou, Lianxin Jiang, Yang Mo, and Jianping Shen. 2021. RG PA at SemEval-2021 Task 1: A Contextual Attention-based Model with RoBERTa for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Luz Rello, Roberto Carlini, Ricardo Baeza-Yates, and Jeffrey P Bigham. 2015. A plug-in to aid online reading in Spanish. In Proceedings of the 12th Web for All Conference, pages 1–4.

Kervy Rivas Rojas and Fernando Alva-Manchego. 2021. IAPUCP at SemEval-2021 Task 1: Stacking Fine-Tuned Transformers is Almost All You Need for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Francesco Ronzano, Ahmed Abura'ed, Luis Espinosa Anke, and Horacio Saggion. 2016. TALN at SemEval-2016 Task 11: Modelling Complex Words by Contextual, Lexical and Semantic Features. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1011–1016, San Diego, California. Association for Computational Linguistics.

Armand Rotaru. 2021. ANDI at SemEval-2021 Task 1: Predicting complexity in context using distributional models, behavioural norms, and lexical resources. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Erik Rozi, Niveditha Iyer, Gordon Chi, Enok Choe, Kathy Lee, Kevin Liu, Patrick Liu, Zander Lack, Jillian Tang, and Ethan A. Chi. 2021. Stanford MLab at SemEval-2021 Task 1: Tree-Based Modelling of Lexical Complexity using Word Embeddings. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Irene Russo. 2021. archer at SemEval-2021 Task 1: Contextualising Lexical Complexity. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Horacio Saggion, Sanja Štajner, Stefan Bott, Simon Mille, Luz Rello, and Biljana Drndarevic. 2015. Making it Simplext: Implementation and evaluation of a text simplification system for Spanish. ACM Transactions on Accessible Computing (TACCESS), 6(4):1–36.

Matthew Shardlow. 2013. A comparison of techniques to automatically identify complex words. In 51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop, pages 103–109, Sofia, Bulgaria. Association for Computational Linguistics.

Matthew Shardlow, Michael Cooper, and Marcos Zampieri. 2020. CompLex — a new corpus for lexical complexity prediction from Likert Scale data. In Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), pages 57–62, Marseille, France. European Language Resources Association.

Neil Shirude, Sagnik Mukherjee, Tushar Shandhilya, Ananta Mukherjee, and Ashutosh Modi. 2021. IITK@LCP at SemEval-2021 Task 1: Classification for Lexical Complexity Regression Task. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Greta Smolenska, Peter Kolb, Sinan Tang, Mironas Bitinis, Héctor Hernández, and Elin Asklöv. 2021. CLULEX at SemEval-2021 Task 1: A Simple System Goes a Long Way. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Anders Søgaard, Sebastian Ebert, Joost Bastings, and Katja Filippova. 2020. We need to talk about random splits. arXiv preprint arXiv:2005.00636.

Sanjay S.P, Anand Kumar M, and Soman K P. 2016. AmritaCEN at SemEval-2016 Task 11: Complex Word Identification using Word Embedding. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1022–1027, San Diego, California. Association for Computational Linguistics.
Regina Stodden and Gayatri Venugopal. 2021. RS GV at SemEval-2021 Task 1: Sense Relative Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

David Strohmaier, Sian Gooding, Shiva Taslimipoor, and Ekaterina Kochmar. 2020. SeCoDa: Sense complexity dataset. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5962–5967, Marseille, France. European Language Resources Association.

Yuki Taya, Lis Kanashiro Pereira, Fei Cheng, and Ichiro Kobayashi. 2021. OCHADAI-KYODAI at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Giuseppe Vettigli and Antonio Sorgente. 2021. CompNA at SemEval-2021 Task 1: Prediction of lexical complexity analyzing heterogeneous features. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Katja Voskoboinik. 2021. katildakat at SemEval-2021 Task 1: Lexical Complexity Prediction of Single Words and Multi-Word Expressions in English. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Nikhil Wani, Sandeep Mathias, Jayashree Aanand Gajjam, and Pushpak Bhattacharyya. 2018. The Whole is Greater than the Sum of its Parts: Towards the Effectiveness of Voting Ensemble Classifiers for Complex Word Identification. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Krzysztof Wróbel. 2016. PLUJAGH at SemEval-2016 Task 11: Simple System for Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 953–957, San Diego, California. Association for Computational Linguistics.

Rong Xiang, Jinghang Gu, Emmanuele Chersoni, Wenjie Li, Qin Lu, and Chu-Ren Huang. 2021. PolyU CBS-Comp at SemEval-2021 Task 1: Lexical Complexity Prediction (LCP). In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Tuqa Bani Yaseen, Qusai Ismail, Sarah Al-Omari, Eslam Al-Sobh, and Malak Abdullah. 2021. JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, and Marcos Zampieri. 2018. A Report on the Complex Word Identification Shared Task 2018. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, United States. Association for Computational Linguistics.

Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann. 2017. CWIG3G2 - complex word identification task across three text genres and two user groups. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 401–407, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Zheng Yuan, Gladys Tyen, and David Strohmaier. 2021. Cambridge at SemEval-2021 Task 1: An Ensemble of Feature-Based and Neural Models for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

George-Eduard Zaharia, Dumitru-Clementin Cercel, and Mihai Dascalu. 2021. UPB at SemEval-2021 Task 1: Combining Deep Learning and Hand-Crafted Features for Lexical Complexity Prediction. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok, Thailand (online). Association for Computational Linguistics.

Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, and Lucia Specia. 2017. Complex Word Identification: Challenges in Data Annotation and System Performance. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pages 59–63, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Marcos Zampieri, Liling Tan, and Josef van Genabith. 2016. MacSaar at SemEval-2016 Task 11: Zipfian and Character Features for ComplexWord Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1001–1005, San Diego, California. Association for Computational Linguistics.