ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark

Haoran Wu† Wenxuan Wang† Yuxuan Wan† Wenxiang Jiao‡ Michael R. Lyu†
†Department of Computer Science and Engineering, The Chinese University of Hong Kong
‡Tencent AI Lab

arXiv:2303.13648v1 [cs.CL] 15 Mar 2023

Abstract

ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability in answering follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction (GEC) task, and compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT does not perform as well as those baselines in terms of automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections. Specifically, it prefers to change the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely under-estimated by the automatic evaluation metrics and could be a promising tool for GEC.

1 Introduction

ChatGPT, the current "super-star" in the artificial intelligence (AI) area, has attracted millions of registered users within just a week since its launch by OpenAI. One of the reasons for ChatGPT being so popular is its surprisingly strong performance on various natural language processing (NLP) tasks (Bang et al., 2023), including question answering (Omar et al., 2023), text summarization (Yang et al., 2023), machine translation (Jiao et al., 2023), logic reasoning (Frieder et al., 2023), code debugging (Xia and Zhang, 2023), etc. There is also a trend of using ChatGPT as a writing assistant for text polishing.

Despite the widespread use of ChatGPT, it remains unclear to the NLP community to what extent ChatGPT is capable of revising text and correcting grammatical errors. To fill this research gap, we empirically study the Grammatical Error Correction (GEC) ability of ChatGPT by evaluating it on the CoNLL2014 benchmark dataset (Ng et al., 2014), and comparing its performance to Grammarly, a prevalent cloud-based English typing assistant with 30 million daily users (Grammarly, 2023), and GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model. With this study, we aim to answer a research question:

    Is ChatGPT a good tool for GEC?

To the best of our knowledge, this is the first study on ChatGPT's ability in GEC.

We present the major insights gained from this evaluation as below:

• ChatGPT performs worse than the baseline systems in terms of the automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences.

• ChatGPT goes beyond one-by-one corrections by introducing more changes to the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness.

• Human evaluation quantitatively demonstrates that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections.

Our evaluation indicates the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggests that ChatGPT is a promising tool for GEC.
Type           Error                          Correction
Preposition    I sat in the talk              I sat in on the talk
Morphology     dreamed                        dreamt
Determiner     I like the ice cream           I like ice cream
Tense/Aspect   I like play basketball         I like playing basketball
Syntax         I have not the book            I do not have the book
Punctuation    We met they talked and left    We met, they talked and left

Table 1: Different types of error in GEC.
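The error types in Table 1 can also be viewed as token-level edit operations between a source sentence and its correction. The following sketch is our own illustration, not the paper's tooling: it uses Python's difflib and whitespace tokenization (both assumptions) to label each edit as an omission, insertion, or replacement error.

```python
import difflib

def classify_edits(source, corrected):
    """Label token-level differences between a source sentence and its
    correction as omission, insertion, or replacement errors."""
    src, cor = source.split(), corrected.split()
    edits = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=cor).get_opcodes():
        if tag == "insert":    # correction adds tokens -> the source omitted them
            edits.append(("omission", " ".join(cor[j1:j2])))
        elif tag == "delete":  # correction drops tokens -> the source inserted them
            edits.append(("insertion", " ".join(src[i1:i2])))
        elif tag == "replace":
            edits.append(("replacement",
                          f"{' '.join(src[i1:i2])} -> {' '.join(cor[j1:j2])}"))
    return edits
```

Applied to the first row of Table 1, this labels "on" as an omission error, matching the categorization described in Section 2.2.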

2 Background

2.1 ChatGPT

ChatGPT is an intelligent chatbot powered by large language models developed by OpenAI. It has attracted great attention from industry, academia, and the general public due to its strong ability in answering various follow-up questions, correcting inappropriate questions (Zhong et al., 2023), and even refusing illegal questions. While the technical details of ChatGPT have not been released systematically, it is known to be built upon InstructGPT (Ouyang et al., 2022), which is trained using instruction tuning (Wei et al., 2022a) and reinforcement learning from human feedback (RLHF, Christiano et al., 2017).

2.2 Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text, such as spelling, punctuation, grammatical, and word choice errors (Ruder, 2022). It is in high demand as writing plays an important role in academics, work, and daily life. Table 1 illustrates different grammatical errors, borrowed from Bryant et al. (2022), a comprehensive survey on grammatical error correction. In general, grammatical errors can be roughly classified into three categories: omission errors, such as "on" in the first example; replacement errors, such as "dreamed" for "dreamt" in the second example; and insertion errors, such as "the" in the third example.

To evaluate the performance of GEC, researchers have built various benchmark datasets, which include but are not limited to:

• CoNLL-2014: Given short English texts written by non-native speakers, the task requires a participating system to correct all errors present in each text.

• BEA-2019: It is similar to CoNLL-2014 but introduces a new dataset, namely, the Write&Improve+LOCNESS corpus, which represents a wider range of native and learner English levels and abilities (Bryant et al., 2019).

• JFLEG: It represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native-sounding (Tetreault et al., 2017).

3 ChatGPT for GEC

3.1 Experimental Setup

Dataset. We evaluate the ability of ChatGPT in grammatical error correction on the CoNLL2014 (Ng et al., 2014) dataset. The dataset is composed of short paragraphs written by non-native speakers of English, accompanied by the corresponding annotations of grammatical errors. We sequentially pulled 100 sentences from the official-combined test set in the alternate folder of the dataset.

System       Precision   Recall   F0.5
GECToR         71.2       38.4    60.8
Grammarly      67.3       51.1    63.3
ChatGPT        51.2       62.8    53.1

Table 2: GEC performance of GECToR, Grammarly, and ChatGPT.

Evaluation Metric. To evaluate the performance of GEC, we adopt three metrics that are widely used in the literature, namely, Precision, Recall, and F0.5 score. Among them, the F0.5 score combines Precision and Recall, with Precision assigned a higher weight (Wikipedia contributors, 2023a).
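This precision-weighted combination can be sketched in a few lines of Python. The raw counts of true positives (TP), false positives (FP), and false negatives (FN) are assumed to come from an edit-level scorer such as the official CoNLL2014 one:

```python
def gec_scores(tp, fp, fn):
    """Precision, Recall, and F0.5 from edit-level counts.

    F0.5 uses beta = 0.5, i.e. (1 + 0.5^2) * P * R / (0.5^2 * P + R),
    which weights Precision more heavily than Recall.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f05 = 1.25 * precision * recall / (0.25 * precision + recall)
    return precision, recall, f05
```

For instance, a system with 50 true positives, 25 false positives, and 50 false negatives scores P ≈ 0.67, R = 0.50, and F0.5 = 0.625 — closer to its precision than its recall, which is exactly the bias the metric is designed to have.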
                 Short                    Medium                   Long
            Precision Recall  F0.5   Precision Recall  F0.5   Precision Recall  F0.5
GECToR         76.9    38.5   64.1     68.8    37.5   58.9      71.8    38.9   61.5
Grammarly      62.5    60.6   62.1     68.9    56.0   65.9      67.3    45.3   61.4
ChatGPT        58.5    66.7   60.0     48.7    60.7   50.7      51.0    62.8   53.0

Table 3: GEC performance with respect to sentence length.

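The Short/Medium/Long split behind Table 3 divides the 100 test sentences into three equally sized buckets. The exact criterion is not stated in the paper; one plausible reconstruction (sorting by whitespace token count — our assumption) is:

```python
def length_buckets(sentences):
    """Split sentences into three equally sized Short/Medium/Long groups.

    Sentences are ranked by whitespace token count; the paper does not
    specify its criterion, so this is only one plausible reconstruction.
    """
    ranked = sorted(sentences, key=lambda s: len(s.split()))
    third = len(ranked) // 3
    return ranked[:third], ranked[third:2 * third], ranked[2 * third:]
```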
Specifically, the three metrics are expressed as:

    Precision = TP / (TP + FP),                                          (1)
    Recall    = TP / (TP + FN),                                          (2)
    F0.5 = (1.25 × Precision × Recall) / (0.25 × Precision + Recall),    (3)

where TP, FP and FN represent the true positives, false positives and false negatives of the predictions, respectively. We use the scoring program provided by the CoNLL2014 organizers, but adapt it to be compatible with the latest Python environment.

Baselines. In this report, we perform the GEC task on three systems, including:

• ChatGPT: We query ChatGPT manually rather than through an API due to the instability of ChatGPT. For example, when a query sentence resembles a question or demand, ChatGPT may stop the process of GEC and respond to the "demand" instead. After a few trials, we find a prompt that works well for ChatGPT:

      Do grammatical error correction
      on all the following sentences
      I type in the conversation.

  We query ChatGPT with this prompt for each test sample.

• Grammarly: Grammarly is a prevalent cloud-based English typing assistant. It reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes in English texts, detects plagiarism, and suggests replacements for the identified errors (Wikipedia contributors, 2023b). As stated by Grammarly, every day, 30 million people and 50,000 teams around the world use Grammarly with their writing (Grammarly, 2023). When querying Grammarly, we open a text file and paste all the test samples as separate paragraphs. We enable all the grammar correction options in the settings and only ask it to correct items with correctness problems (red underline), while leaving the clarity (blue underline), engagement (green underline) and delivery (purple underline) suggestions unchanged. We iterate this process several times until no error is detected by Grammarly.

• GECToR: Besides Grammarly, we also compare ChatGPT with GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model in research, which also exhibits good performance on the CoNLL2014 task. We adopt the implementation based on the pre-trained RoBERTa model.

3.2 Results and Analysis

Overall Performance. Table 2 presents the overall performance of the three systems. As seen, ChatGPT obtains the highest recall, GECToR obtains the highest precision, while Grammarly achieves a better balance between the two metrics and yields the highest F0.5 score. These results suggest that ChatGPT tends to correct as many errors as possible, which may lead to more over-corrections. In contrast, GECToR corrects only the errors it is confident about, which leaves many errors uncorrected. Grammarly combines the advantages of both, so it performs more stably.

ChatGPT Performs Worse on Long Sentences? To understand which kinds of sentences ChatGPT is good at, we divide the 100 test sentences into three equally sized categories, namely, Short, Medium and Long. Table 3 shows the results with respect to sentence length. As seen, the gap between ChatGPT and Grammarly narrows significantly on short sentences. In contrast, ChatGPT performs much worse on longer sentences, at least in terms of the existing evaluation metrics.

ChatGPT Goes Beyond One-by-One Corrections. We inspect the output of the three systems, especially those for long sentences, and find that
System                                              Sentence
      Source         For an example , if exercising is helpful for family potential disease , we can
                     always look for more chances for the family to go exercise .
      Reference      For example , if exercising (OR exercise) is helpful for a potential family disease
                     , we can always look for more chances for the family to do exercise .
      GECToR         For example , if exercising is helpful for family potential disease , we can
                     always look for more chances for the family to go exercise .
      Grammarly      For example , if exercising is helpful for a family ’s potential disease , we can
                     always look for more chances for the family to go exercise .
      ChatGPT        For example , if exercise is helpful in preventing potential family diseases , we
                     can always look for more opportunities for the family to exercise .

                        Table 4: Comparison of the outputs from different GEC systems.

System           Precision   Recall   F0.5
GECToR             71.2       38.4    60.8
  + Grammarly      -5.9      +16.5    +2.1
ChatGPT            51.2       62.8    53.1
  + Grammarly      +0.4       +0.8    +0.5

Table 5: GEC performance with Grammarly for further correction.

System       #Under   #Mis   #Over
GECToR         13       4      0
Grammarly      14       0      1
ChatGPT         3       3     30

Table 6: Number of under-correction (Under), mis-correction (Mis) and over-correction (Over) produced by different GEC systems.

ChatGPT is not limited to correcting errors in a one-by-one fashion. Instead, it is more willing to change the surface expression of some phrases or the sentence structure. For example, in Table 4, GECToR and Grammarly make minor changes to the source sentence (i.e., "an example" to "example", "family potential disease" to "a family 's potential disease"), while ChatGPT modifies the sentence structure (i.e., "for family potential disease" to "in preventing potential family diseases") and word choice (i.e., "chances" to "opportunities"). This indicates that the outputs of ChatGPT maintain grammatical correctness, although they do not follow the original expression of the source sentences.

To validate our hypothesis, we let Grammarly further correct the grammatical errors in the outputs of GECToR and ChatGPT. Table 5 lists the results. We observe that Grammarly introduces a negligible improvement to the output of ChatGPT, demonstrating that ChatGPT indeed generates correct sentences. On the contrary, Grammarly noticeably improves the performance of GECToR (i.e., +2.1 F0.5, +16.5 Recall), suggesting that there are still many errors in the output of GECToR.

Human Evaluation. We conduct a human evaluation to further demonstrate the potential of ChatGPT for the GEC task. Specifically, we follow Wang et al. (2022) to manually annotate the issues in the outputs of the three systems, including 1) Under-correction: grammatical errors that are not found; 2) Mis-correction: grammatical errors that are found but modified incorrectly, either grammatically or semantically; 3) Over-correction: other modifications beyond the changes in the reference. We sample 20 sentences out of the 100 test sentences and ask two annotators to identify the issues. Table 6 shows the results. Clearly, ChatGPT has the fewest under-corrections among the three systems and fewer mis-corrections than GECToR, which suggests its great potential in grammatical error correction. Meanwhile, ChatGPT produces more over-corrections, which may come from its diverse generation ability as a large language model. While this usually leads to a lower F0.5 score, it also allows more flexible language expression in GEC.

Discussions. We have checked the outputs corresponding to the results of Table 5, and observed
different behaviors of ChatGPT and Grammarly. The slight improvement (i.e., +0.5 F0.5) by Grammarly mainly comes from punctuation problems. ChatGPT is not sensitive to punctuation problems but Grammarly is, though its modifications are not always correct. For example, when we manually undo the corrections on punctuation, the F0.5 score increases by +0.0015. Other than punctuation problems, Grammarly also corrects a few grammatical errors on articles, prepositions, and plurals. However, these corrections usually require Grammarly to repeat the process twice. Take the following sentence as an example:

    ... constructs of the family and
    kinship are a social construct,
    ...

Grammarly first changes it to

    ... constructs of the family and
    kinship are a social constructs,
    ...

and then changes it to

    ... constructs of the family and
    kinship are social constructs,

Nonetheless, it does correct some errors that ChatGPT fails to correct.

4 Conclusion

This paper evaluates ChatGPT on the task of Grammatical Error Correction (GEC). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT performs worse than the commercial product Grammarly and the state-of-the-art model GECToR in terms of automatic evaluation metrics. By examining the outputs, we find that ChatGPT displays a unique ability to go beyond one-by-one corrections by changing surface expressions and sentence structure while maintaining grammatical correctness. Human evaluation results confirm this finding and reveal that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggest that ChatGPT has the potential to be a valuable tool for GEC.

Limitations and Future Works

There are several limitations in this version, which we leave for future work:

• More Datasets: In this version, we only use the CoNLL-2014 test set and only select 100 sentences to conduct the evaluation. In our future work, we will conduct experiments on more datasets.

• More Prompts and In-context Learning: In this version, we only use one prompt to query ChatGPT and do not utilize advanced techniques from the in-context learning field, such as providing demonstration examples (Brown et al., 2020) or chain-of-thought prompting (Wei et al., 2022b), which may under-estimate the full potential of ChatGPT. In our future work, we will explore in-context learning methods for GEC to improve its performance.

• More Evaluation Metrics: In this version, we only adopt Precision, Recall and F0.5 as evaluation metrics. In our future work, we will utilize more metrics, such as pretraining-based metrics (Gong et al., 2022), to evaluate the performance comprehensively.

References

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. ArXiv.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS.

Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The bea-2019 shared task on grammatical error correction. In BEA@ACL.

Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, and Ted Briscoe. 2022. Grammatical error correction: A survey of the state of the art. ArXiv.

Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. NeurIPS.
Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and J J Berner. 2023. Mathematical capabilities of chatgpt. ArXiv.

Peiyuan Gong, Xuebo Liu, Heyan Huang, and Min Zhang. 2022. Revisiting grammatical error correction evaluation and beyond. EMNLP.

Grammarly. 2023. Grammarly website about us page.

Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? a preliminary study. ArXiv.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The conll-2014 shared task on grammatical error correction. In CoNLL.

Reham Omar, Omij Mangukiya, Panos Kalnis, and Essam Mansour. 2023. Chatgpt versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots. ArXiv.

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem N. Chernodub, and Oleksandr Skurzhanskyi. 2020. Gector – grammatical error correction: Tag, not rewrite. In Workshop on Innovative Use of NLP for Building Educational Applications.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. ArXiv.

Sebastian Ruder. 2022. NLP-progress.

Joel R. Tetreault, Keisuke Sakaguchi, and Courtney Napoles. 2017. Jfleg: A fluency corpus and benchmark for grammatical error correction. In EACL.

Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, and Michael Lyu. 2022. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In ACL.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. ICLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. NeurIPS.

Wikipedia contributors. 2023a. F-score — Wikipedia, the free encyclopedia. [Online; accessed 5-March-2023].

Wikipedia contributors. 2023b. Grammarly — Wikipedia, the free encyclopedia. [Online; accessed 2-March-2023].

Chun Xia and Lingming Zhang. 2023. Conversational automated program repair. ArXiv.

Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, and Wei Cheng. 2023. Exploring the limits of chatgpt for query or aspect-based text summarization. ArXiv.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert.