Improved Evaluation Framework for Complex Plagiarism Detection


Anton Belyy¹, Marina Dubova², Dmitry Nekrasov¹
¹ International laboratory “Computer Technologies”, ITMO University, Russia
² Saint Petersburg State University, Russia
{anton.belyy, marina.dubova.97, dpokrasko}@gmail.com

Abstract

Plagiarism is a major issue in science and education. It appears in various forms, from simple copying to intelligent paraphrasing and summarization. Complex plagiarism, such as plagiarism of ideas, is hard to detect, and it is therefore especially important to track the improvement of methods correctly and not to overfit to the structure of particular datasets. In this paper, we study the performance of plagdet, the main measure for Plagiarism Detection Systems evaluation, on manually paraphrased plagiarism datasets (such as PAN Summary). We reveal its fallibility under certain conditions and propose an evaluation framework with normalization of inner terms, which is resilient to dataset imbalance. We conclude with an experimental justification of the proposed measure. The implementation of the new framework is made publicly available as a Github repository.

1 Introduction

Plagiarism is a problem of primary concern among publishers, scientists, and teachers (Maurer et al., 2006). It is not only about text copying with minor revisions but also about the borrowing of ideas. Plagiarism appears in substantially paraphrased forms and covers both conscious and unconscious appropriation of others’ thoughts (Gingerich and Sullivan, 2013). This kind of borrowing has very serious consequences and cannot be detected with common Plagiarism Detection Systems (PDS). That is why the detection of complex plagiarism cases comes to the fore and becomes a central challenge in the field.

2 Plagiarism Detection

Most of the contributions to plagiarism text alignment were made during the annual PAN track for plagiarism detection, held from 2009 to 2015. The latest winning approach (Sanchez-Perez et al., 2014) achieved good performance on all plagiarism types except the Summary part. Moreover, this type of plagiarism turned out to be the hardest for all the competitors.

In a brief review, Kraus (2016) emphasized that the main weakness of modern PDS is imprecision on manually paraphrased plagiarism and, as a consequence, a weak ability to deal with real-world problems. Thus, the detection of manually paraphrased plagiarism cases is a focus of recently proposed methods for plagiarism text alignment. In the most successful contributions, scientists applied genetic algorithms (Sanchez-Perez et al., 2018; Vani and Gupta, 2017), topic modeling methods (Le et al., 2016), and word embedding models (Brlek et al., 2016) to manually paraphrased plagiarism text alignment. In all of these works, the authors used PAN Summary datasets to develop and evaluate their methods.

3 Task, Dataset, and Evaluation Metrics

3.1 Text Alignment

In this work we deal with an extrinsic text alignment problem. Thus, we are given pairs of suspicious documents and source candidates and try to detect all contiguous passages of borrowed information. For a review of plagiarism detection tasks, see Alzahrani et al. (2012).

3.2 Datasets

The PAN corpora for plagiarism text alignment are the main resource for PDS evaluation. This collection consists of slightly or substantially different datasets used at the PAN competitions

from 2009 to 2015. We used the most recent 2013 (Potthast et al., 2013) and 2014 (Potthast et al., 2014) English datasets to develop and evaluate our models and metrics. They consist of copy&paste, random, translation, and summary plagiarism types. We consider only the last part, as it exhibits the problems of the plagdet framework to the greatest extent.

3.3 Evaluation Metrics

The standard evaluation framework for the text alignment task is plagdet (Potthast et al., 2010), which consists of macro- and micro-averaged precision, recall, granularity, and the overall plagdet score. In this work, we consider only the macro-averaged metrics, where recall is defined as follows:

    rec_macro(S, R) = (1/|S|) Σ_{s∈S} rec^single_macro(s, R_s),    (1)

and precision is defined through recall as follows:

    prec_macro(S, R) = rec_macro(R, S),    (2)
where S and R are the sets of true plagiarism cases and of the system’s detections, respectively.

Single case recall rec^single_macro(s, R_s) is defined as follows:

    rec^single_macro(s, R_s) = (|s_plg ∩ (R_s)_plg| + |s_src ∩ (R_s)_src|) / (|s_plg| + |s_src|),

where s_plg and s_src denote the plagiarism and source parts of a true case s, and (R_s)_plg and (R_s)_src denote the corresponding parts of the detections matching s.

[Figure 1: Single case recall computation for the text alignment task. Note the imbalance in this case: the plagiarism part s_plg is much shorter than the source part s_src.]
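To make these definitions concrete, here is a minimal Python sketch that models each side of a passage as a set of character offsets. The function names and data layout are our own illustration, not the official PAN plagdet implementation:

    # Sketch of macro-averaged plagdet recall (eqs. 1 and 2).
    # A true case s and its matching detections R_s are modeled as pairs of
    # character-offset sets, one for the plagiarism side, one for the source side.

    def rec_single_macro(s_plg, s_src, r_plg, r_src):
        # Single case recall: matched characters over true case length.
        matched = len(s_plg & r_plg) + len(s_src & r_src)
        return matched / (len(s_plg) + len(s_src))

    def rec_macro(cases):
        # Eq. (1): average single case recall over all true cases S.
        # `cases` is a list of (s_plg, s_src, r_plg, r_src) tuples, where
        # r_plg/r_src are the unions of all detections overlapping the case.
        return sum(rec_single_macro(*case) for case in cases) / len(cases)

    # Eq. (2): precision is obtained by swapping the roles of true cases and
    # detections, i.e. prec_macro(S, R) = rec_macro(R, S).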
4 Problems of the plagdet Framework

4.1 Dataset Imbalance

PAN Summary datasets are heavily imbalanced (cf. Table 3):

  • For any given case, its plagiarism part is much shorter than its source part, which in turn spans almost the whole source document:

        ∀s ∈ S: |s_plg| ≪ |s_src| ≈ |d_src|.    (3)
Other datasets for manual plagiarism detection display similar properties, however not to the PAN Summary extent. Examples include Palkovskii15, Mashhadirajab et al. (2016), and Sochenkov et al. (2017).

Discussion. The important question is whether such dataset imbalance reflects real-world plagiarizers’ behavior. There is evidence that, when performing a length-unrestricted plagiarism task, people tend to make texts shorter, however not to the PAN Summary extent (Barrón-Cedeño et al., 2013). Moreover, we can find some supporting theoretical reasons. Firstly, summarization and paraphrasing are the only techniques students are taught to use for the transformation of texts. Hence, they can use summarization to plagiarize intelligently. Secondly, in cases of inadvertent plagiarism and plagiarism of ideas, details of the source texts are usually omitted or forgotten. This should also lead to shorter plagiarized texts. Though we can find some reasons, such a huge imbalance does not seem to be sufficiently supported and may be considered a bias.
4.2 Degenerate Intersection

Lemma 4.1. For any sets e1 ⊆ d and e2 ⊆ d, their intersection length |e1 ∩ e2| is bounded by

    a(e1, e2, d) ≤ |e1 ∩ e2| ≤ b(e1, e2),

where

    a(e1, e2, d) = max(0, |e1| + |e2| − |d|),
    b(e1, e2) = min(|e1|, |e2|).

[Figure 2: Degenerate intersection lemma. Intuitively, the lower bound a is achieved when e1 and e2 are “farthest” away from each other in d, and the upper bound b is achieved when e1 ⊆ e2 (or e2 ⊆ e1).]
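The bounds are easy to verify numerically. The following sketch (our own illustration, stated directly on passage lengths) also reproduces the extreme case analyzed below, where a case’s source part spans the whole source document:

    # Bounds from Lemma 4.1 on the intersection length |e1 ∩ e2|.

    def a(len_e1, len_e2, len_d):
        # Lower bound: if e1 and e2 together exceed |d|, they must overlap
        # by at least |e1| + |e2| - |d| characters, however they are placed.
        return max(0, len_e1 + len_e2 - len_d)

    def b(len_e1, len_e2):
        # Upper bound: the smaller passage lies entirely inside the larger.
        return min(len_e1, len_e2)

    # PAN-Summary-like degenerate case: the true source passage and a trivial
    # whole-document detection both cover the entire source document.
    print(a(5000, 5000, 5000))  # 5000
    print(b(5000, 5000))        # 5000  ->  a_src = b_src = |d_src|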
Let us take a fresh look at the source part of rec^single_macro. We assume that |s_src ∩ (R_s)_src| / |s_src| ∈ [0, 1], and this is actually the case if:

    0 ≤ |s_src ∩ (R_s)_src| ≤ |s_src|.

But, according to Lemma 4.1, we see that:

    0 ≤ a_src ≤ |s_src ∩ (R_s)_src| ≤ b_src ≤ |s_src|,

where

    a_src = a(s_src, (R_s)_src, d_src),
    b_src = b(s_src, (R_s)_src).

This results in a smaller possible value range of the intersection length and, therefore, of the precision and recall values. Because of (3), on PAN Summary this leads to the extreme case of a_src = b_src = |d_src|, which causes precision and recall to take constant values on the source part of the dataset.

5 Proposed Metrics

5.1 Normalized Single Case Recall

To address the issues of dataset imbalance (section 4.1) and degenerate intersection (section 4.2), we propose the following normalized version of single case recall, nrec^single_macro(s, R_s), for the macro-averaged case:

    nrec^single_macro(s, R_s) = (w_plg(|s_plg ∩ (R_s)_plg|) + w_src(|s_src ∩ (R_s)_src|)) / (w_plg(|s_plg|) + w_src(|s_src|)),

where:

    w_i(x) = (x − a_i)(b_i − a_i) / |d_i|,
    a_i = a(s_i, (R_s)_i, d_i),
    b_i = b(s_i, (R_s)_i),
    i ∈ {plg, src}.
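A direct transcription of this definition, continuing the offset-set representation of the earlier sketches (names are ours, not the reference implementation):

    # Normalized single case recall nrec^single_macro(s, R_s).

    def bounds(len_s, len_r, len_d):
        # Lemma 4.1 bounds a_i, b_i for one side (plg or src) of a case.
        return max(0, len_s + len_r - len_d), min(len_s, len_r)

    def nrec_single_macro(s, r, d_len):
        # `s` and `r` map the side "plg"/"src" to the offset sets of the true
        # case and of the matching detections; `d_len` maps the side to |d_i|.
        num = den = 0.0
        for i in ("plg", "src"):
            a_i, b_i = bounds(len(s[i]), len(r[i]), d_len[i])
            scale = (b_i - a_i) / d_len[i]
            num += (len(s[i] & r[i]) - a_i) * scale  # w_i(|s_i ∩ (R_s)_i|)
            den += (len(s[i]) - a_i) * scale         # w_i(|s_i|)
        return num / den if den > 0 else 0.0

    # When a_i = b_i (fully determined intersection), w_i vanishes, so the
    # uninformative side contributes nothing to the score.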

5.2 Normalized Recall, Precision, and plagdet

Normalized recall nrec_macro(S, R) is the result of (1) with every rec^single_macro(s, R_s) term replaced by nrec^single_macro(s, R_s). Normalized precision nprec_macro(S, R) can then be obtained from normalized recall using eq. (2).

Normalized macro-averaged plagdet, or normplagdet, is defined as follows:

    normplagdet(S, R) = F_α(S, R) / log₂(1 + gran(S, R)),

where F_α is the weighted harmonic mean of nprec_macro(S, R) and nrec_macro(S, R), i.e. the F_α-measure, and gran(S, R) is defined as in Potthast et al. (2010).
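Assembled into the final score, assuming the granularity gran(S, R) ≥ 1 is computed exactly as in Potthast et al. (2010); a sketch, not the reference implementation:

    import math

    def f_alpha(nprec, nrec, alpha=0.5):
        # Weighted harmonic mean of normalized precision and recall;
        # alpha = 0.5 gives the usual F1.
        if nprec == 0.0 or nrec == 0.0:
            return 0.0
        return 1.0 / (alpha / nprec + (1.0 - alpha) / nrec)

    def normplagdet(nprec, nrec, gran):
        # gran = 1 (one detection per true case) leaves F unchanged;
        # larger values penalize fragmented detections.
        return f_alpha(nprec, nrec) / math.log2(1.0 + gran)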
                                                                results, while tested state-of-the-art systems
6 Adversarial Models

To justify the proposed evaluation metrics, we construct two models, M1 and M2, which achieve inadequately high macro-averaged precision and recall.

6.1 Preprocessing

We represent each plagiarism document d_plg as a sequence of sentences, where each sentence sent_{d_plg, i} ∈ d_plg is a set of tokens. Each source document d_src is represented as a set of its tokens.

For each sentence sent_{d_plg, i} we also define a measure of its similarity sim_{d_plg, d_src, i} with respect to the source document as:

    sim_{d_plg, d_src, i} = |sent_{d_plg, i} ∩ d_src| / |sent_{d_plg, i}|.
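A sketch of this preprocessing and similarity score. The paper does not fix sentence-splitting or tokenization details, so the regular expressions below are our assumption:

    import re

    def split_sentences(text):
        # Naive sentence splitter (assumption; the paper leaves this open).
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

    def token_set(text):
        return set(re.findall(r"\w+", text.lower()))

    def sim(sentence, d_src_tokens):
        # sim_{d_plg, d_src, i}: share of the sentence's tokens found in d_src.
        sent = token_set(sentence)
        return len(sent & d_src_tokens) / len(sent) if sent else 0.0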
                                                                results for detecting plagiarism of ideas by
6.2 Models

Our models are rule-based classifiers, which proceed in three steps for each pair of documents d_plg, d_src (see the sketch after the list):

  1. Form a candidate set according to the similarity score: cand = {i : sim_{d_plg, d_src, i} > 3/4}.

  2. Find the candidate with the highest similarity score (if it exists): best = argmax_{i ∈ cand} sim_{d_plg, d_src, i}.

  3. (M1) Output sentence best as a detection (if it exists).
     (M2) Output sentences {i : i ≠ best} as a detection (or all sentences if best does not exist).
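The three steps translate directly into code, reusing the helpers from the preprocessing sketch above; the 3/4 threshold comes from step 1:

    def detect(d_plg, d_src, model="M1"):
        # Adversarial models M1/M2: return indices of detected sentences.
        sents = split_sentences(d_plg)
        src_tokens = token_set(d_src)
        scores = [sim(s, src_tokens) for s in sents]

        # Step 1: candidate set with similarity above 3/4.
        cand = [i for i, score in enumerate(scores) if score > 0.75]
        # Step 2: best candidate, if any.
        best = max(cand, key=lambda i: scores[i]) if cand else None

        if model == "M1":
            # Step 3 (M1): the single best sentence, if it exists.
            return [best] if best is not None else []
        # Step 3 (M2): everything except best (all sentences if no best).
        return [i for i in range(len(sents)) if i != best]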
7 Results and Discussion

We evaluated our adversarial models, as well as several state-of-the-art algorithms whose source code was available to us, using plagdet and normplagdet scores on all PAN Summary datasets available to date.

In the plagdet score comparison (Table 1) we included additional state-of-the-art algorithms’ results (marked by ∗), borrowed from the respective papers. The proposed models M1 and M2 outperform all algorithms by macro-averaged plagdet and recall measures on almost every dataset. Despite their simplicity, they show rather good results.

On the contrary, when measured by the normplagdet score (Table 2), M1 and M2 exhibit poor results, while the tested state-of-the-art systems consistently achieve better recall and normplagdet scores. These experimental results back up our claim that normplagdet is more resilient to dataset imbalance and degenerate intersection attacks, and show that the tested state-of-the-art algorithms do not exploit these properties of PAN Summary datasets.

The code for calculating the normplagdet metrics, both macro- and micro-averaged, is made available as a Github repository at https://github.com/AVBelyy/normplagdet. We preserved the command line interface of the plagdet framework to allow easy adaptation for existing systems.

8 Conclusion

Our paper shows that the standard evaluation framework with the plagdet measure can be misused to achieve high scores on datasets for manual plagiarism detection. We constructed two primitive models that achieve state-of-the-art results for detecting plagiarism of ideas by exploiting flaws of standard plagdet. Finally, we proposed a new framework, normplagdet, that normalizes single case scores to prevent misuse of datasets such as PAN Summary, and justified its correctness experimentally. The proposed evaluation framework seems beneficial not only for plagiarism detection but for any other text alignment task with imbalance or degenerate intersection dataset properties.

Acknowledgments

This work was financially supported by the Government of the Russian Federation (Grant 08-08).

Table 1: Results of Summary Plagiarism Detection using plagdet

Dataset           Model                  Year   Precision   Recall   Plagdet
PAN 2013 Train    Sanchez-Perez et al.   2014   0.9942      0.4235   0.5761
                  Brlek et al.           2016   0.9154      0.6033   0.7046
                  Le et al.∗             2016   0.8015      0.7722   0.7866
                  Sanchez-Perez et al.   2018   0.9662      0.7407   0.8386
                  Adversarial M1         2018   0.9676      0.7892   0.8693
                  Adversarial M2         2018   0.5247      0.8704   0.4816
PAN 2013 Test-1   Sanchez-Perez et al.   2014   1.0000      0.5317   0.6703
                  Brlek et al.           2016   0.9832      0.7003   0.8180
                  Vani and Gupta∗        2017   0.9998      0.7622   0.8149
                  Sanchez-Perez et al.   2018   0.9742      0.8093   0.8841
                  Adversarial M1         2018   0.9130      0.7641   0.8320
                  Adversarial M2         2018   0.4678      0.8925   0.4739
PAN 2013 Test-2   Sanchez-Perez et al.   2014   0.9991      0.4158   0.5638
                  Brlek et al.           2016   0.9055      0.6144   0.7072
                  Le et al.∗             2016   0.8344      0.7701   0.8010
                  Vani and Gupta∗        2017   0.9987      0.7212   0.8081
                  Sanchez-Perez et al.   2018   0.9417      0.7226   0.8125
                  Adversarial M1         2018   0.9594      0.8109   0.8789
                  Adversarial M2         2018   0.5184      0.8938   0.4848

(∗ results borrowed from the respective papers)

Table 2: Results of Summary Plagiarism Detection using normplagdet

Dataset           Model                  Year   Precision   Recall   NormPlagdet
PAN 2013 Train    Sanchez-Perez et al.   2014   0.9917      0.6408   0.7551
                  Brlek et al.           2016   0.8807      0.7889   0.8064
                  Sanchez-Perez et al.   2018   0.8929      0.9238   0.9081
                  Adversarial M1         2018   0.9673      0.1617   0.2770
                  Adversarial M2         2018   0.1769      0.2984   0.1634
PAN 2013 Test-1   Sanchez-Perez et al.   2014   0.9997      0.7020   0.7965
                  Brlek et al.           2016   0.9384      0.8254   0.8783
                  Sanchez-Perez et al.   2018   0.9180      0.9463   0.9319
                  Adversarial M1         2018   0.9130      0.1525   0.2614
                  Adversarial M2         2018   0.1488      0.4237   0.1700
PAN 2013 Test-2   Sanchez-Perez et al.   2014   0.9977      0.6377   0.7470
                  Brlek et al.           2016   0.8701      0.8104   0.8107
                  Sanchez-Perez et al.   2018   0.8771      0.9067   0.8859
                  Adversarial M1         2018   0.9585      0.1687   0.2869
                  Adversarial M2         2018   0.1552      0.3299   0.1559

    Table 3: Average Length of Plagiarism and Source Cases in Summary Datasets
  Dataset                    Plagiarism (plg)           Source (src)
  PAN 2013 Train             626 ± 45                   5109 ± 2431
  PAN 2013 Test-1            639 ± 40                   3874 ± 1427
  PAN 2013 Test-2            627 ± 42                   5318 ± 3310

References

Salha M. Alzahrani, Naomie Salim, and Ajith Abraham. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2):133–149.

Alberto Barrón-Cedeño, Marta Vila, M. Antònia Martí, and Paolo Rosso. 2013. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, 39(4):917–947.

Arijana Brlek, Petra Franjic, and Nino Uzelac. 2016. Plagiarism detection using word2vec model. Text Analysis and Retrieval 2016 Course Project Reports, pages 4–7.

Amanda C. Gingerich and Meaghan C. Sullivan. 2013. Claiming hidden memories as one’s own: A review of inadvertent plagiarism. Journal of Cognitive Psychology, 25(8):903–916.

Christina Kraus. 2016. Plagiarism detection: State-of-the-art systems (2016) and evaluation methods. arXiv preprint arXiv:1603.03014.

Huong T. Le, Lam N. Pham, Duy D. Nguyen, Son V. Nguyen, and An N. Nguyen. 2016. Semantic text alignment based on topic modeling. In 2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), pages 67–72. IEEE.

Fatemeh Mashhadirajab, Mehrnoush Shamsfard, Razieh Adelkhah, Fatemeh Shafiee, and Chakaveh Saedi. 2016. A text alignment corpus for Persian plagiarism detection. In FIRE (Working Notes), pages 184–189.

Hermann A. Maurer, Frank Kappe, and Bilal Zaka. 2006. Plagiarism – a survey. Journal of Universal Computer Science, 12(8):1050–1084.

Martin Potthast, Matthias Hagen, Anna Beyer, Matthias Busse, Martin Tippmann, Paolo Rosso, and Benno Stein. 2014. Overview of the 6th international competition on plagiarism detection. In CEUR Workshop Proceedings, volume 1180, pages 845–876.

Martin Potthast, Matthias Hagen, Tim Gollub, Martin Tippmann, Johannes Kiesel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. 2013. Overview of the 5th international competition on plagiarism detection. In CLEF Conference on Multilingual and Multimodal Information Access Evaluation, pages 301–331. CELCT.

Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 997–1005. Association for Computational Linguistics.

Miguel A. Sanchez-Perez, Alexander Gelbukh, Grigori Sidorov, and Helena Gómez-Adorno. 2018. Plagiarism detection with genetic-based parameter tuning. International Journal of Pattern Recognition and Artificial Intelligence, 32(01):1860006.

Miguel A. Sanchez-Perez, Grigori Sidorov, and Alexander F. Gelbukh. 2014. A winning approach to text alignment for text reuse detection at PAN 2014. In CLEF (Working Notes), pages 1004–1011.

Ilya Sochenkov, Denis Zubarev, and Ivan Smirnov. 2017. The ParaPlag: Russian dataset for paraphrased plagiarism detection. In Proceedings of the Annual International Conference “Dialogue” 2017 (1), pages 284–297.

K. Vani and Deepa Gupta. 2017. Detection of idea plagiarism using syntax–semantic concept extractions with genetic algorithm. Expert Systems with Applications, 73:11–26.
