ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark

Haoran Wu†  Wenxuan Wang†  Yuxuan Wan†  Wenxiang Jiao‡  Michael R. Lyu†
† Department of Computer Science and Engineering, The Chinese University of Hong Kong
1155157061@link.cuhk.edu.hk  {wxwang,yxwan9,lyu}@cse.cuhk.edu.hk
‡ Tencent AI Lab  joelwxjiao@tencent.com

arXiv:2303.13648v1 [cs.CL] 15 Mar 2023

Abstract

ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability to answer follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction (GEC) task, and compare it with a commercial GEC product (e.g., Grammarly) and a state-of-the-art model (e.g., GECToR). Testing on the CoNLL2014 benchmark dataset, we find that ChatGPT does not perform as well as those baselines in terms of automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections: it prefers to change the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely under-estimated by the automatic evaluation metrics and could be a promising tool for GEC.

1 Introduction

ChatGPT (https://chat.openai.com/chat), the current "super-star" in the artificial intelligence (AI) area, has attracted millions of registered users within just a week since its launch by OpenAI. One of the reasons ChatGPT is so popular is its surprisingly strong performance on various natural language processing (NLP) tasks (Bang et al., 2023), including question answering (Omar et al., 2023), text summarization (Yang et al., 2023), machine translation (Jiao et al., 2023), logic reasoning (Frieder et al., 2023), code debugging (Xia and Zhang, 2023), etc. There is also a trend of using ChatGPT as a writing assistant for text polishing.

Despite the widespread use of ChatGPT, it remains unclear to the NLP community to what extent ChatGPT is capable of revising text and correcting grammatical errors. To fill this research gap, we empirically study the Grammatical Error Correction (GEC) ability of ChatGPT by evaluating it on the CoNLL2014 benchmark dataset (Ng et al., 2014) and comparing its performance to Grammarly, a prevalent cloud-based English typing assistant with 30 million daily users (Grammarly, 2023), and GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model. With this study, we aim to answer a research question:

Is ChatGPT a good tool for GEC?

To the best of our knowledge, this is the first study on ChatGPT's ability in GEC. We present the major insights gained from this evaluation below:

• ChatGPT performs worse than the baseline systems in terms of automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences.

• ChatGPT goes beyond one-by-one corrections by introducing more changes to the surface expression of certain phrases or the sentence structure while maintaining grammatical correctness.

• Human evaluation quantitatively demonstrates that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections.

Our evaluation indicates the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggests that ChatGPT is a promising tool for GEC.
2 Background

2.1 ChatGPT

ChatGPT is an intelligent chatbot powered by large language models developed by OpenAI. It has attracted great attention from industry, academia, and the general public due to its strong ability to answer various follow-up questions, correct inappropriate questions (Zhong et al., 2023), and even refuse illegal questions. While the technical details of ChatGPT have not been released systematically, it is known to be built upon InstructGPT (Ouyang et al., 2022), which is trained using instruction tuning (Wei et al., 2022a) and reinforcement learning from human feedback (RLHF, Christiano et al., 2017).

2.2 Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text, such as spelling, punctuation, grammatical, and word choice errors (Ruder, 2022). It is in high demand, as writing plays an important role in academics, work, and daily life. Table 1 presents an illustration of different grammatical errors, borrowed from Bryant et al. (2022), a comprehensive survey on grammatical error correction. In general, grammatical errors can be roughly classified into three categories: omission errors, such as "on" in the first example; replacement errors, such as "dreamed" for "dreamt" in the second example; and insertion errors, such as "the" in the third example.

Type         | Error                       | Correction
Preposition  | I sat in the talk           | I sat in on the talk
Morphology   | dreamed                     | dreamt
Determiner   | I like the ice cream        | I like ice cream
Tense/Aspect | I like play basketball      | I like playing basketball
Syntax       | I have not the book         | I do not have the book
Punctuation  | We met they talked and left | We met, they talked and left

Table 1: Different types of error in GEC.

To evaluate the performance of GEC, researchers have built various benchmark datasets, which include but are not limited to:

• CoNLL-2014: Given short English texts written by non-native speakers, the task requires a participating system to correct all errors present in each text.

• BEA-2019: It is similar to CoNLL-2014 but introduces a new dataset, namely the Write&Improve+LOCNESS corpus, which represents a wider range of native and learner English levels and abilities (Bryant et al., 2019).

• JFLEG: It represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding (Tetreault et al., 2017).

3 ChatGPT for GEC

3.1 Experimental Setup

Dataset. We evaluate the ability of ChatGPT in grammatical error correction on the CoNLL2014 task (Ng et al., 2014) dataset. The dataset is composed of short paragraphs written by non-native speakers of English, accompanied by the corresponding annotations of grammatical errors. We pulled 100 sentences sequentially from the official-combined test set in the alternate folder of the dataset.

Evaluation Metric. To evaluate the performance of GEC, we adopt three metrics that are widely used in the literature, namely Precision, Recall, and F0.5 score. Among them, the F0.5 score combines both Precision and Recall, with Precision assigned a higher weight (Wikipedia contributors, 2023a).

System    | Precision | Recall | F0.5
GECToR    | 71.2      | 38.4   | 60.8
Grammarly | 67.3      | 51.1   | 63.3
ChatGPT   | 51.2      | 62.8   | 53.1

Table 2: GEC performance of GECToR, Grammarly, and ChatGPT.
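The three rough error categories above (omission, replacement, insertion) can be illustrated with a small word-level diff. The sketch below uses Python's standard difflib; it is only an illustration of the categories, not the official scorer used for evaluation, and the function name is ours:

```python
import difflib

def classify_edits(source: str, corrected: str):
    """Label word-level edits between a source sentence and its correction
    as omission, replacement, or insertion errors (the three rough
    categories described above). A simplified illustration only."""
    src, tgt = source.split(), corrected.split()
    edits = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=tgt).get_opcodes():
        if op == "insert":       # word missing from the source: omission error
            edits.append(("omission", "", " ".join(tgt[j1:j2])))
        elif op == "replace":    # wrong word form or choice: replacement error
            edits.append(("replacement", " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
        elif op == "delete":     # superfluous word in the source: insertion error
            edits.append(("insertion", " ".join(src[i1:i2]), ""))
    return edits
```

For instance, running it on the first row of Table 1 ("I sat in the talk" vs. "I sat in on the talk") yields a single omission edit for "on".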
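The precision-weighted combination described above can be sketched in code as follows. This is a minimal illustration of the metric from edit-level counts, not the official CoNLL2014 scoring program, and the function name is ours:

```python
def gec_scores(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Precision, Recall and F_beta from edit-level true positives,
    false positives and false negatives. beta = 0.5 weights Precision
    more heavily than Recall, as in the F0.5 score used here."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta * beta
    f = ((1 + b2) * precision * recall / (b2 * precision + recall)
         if (precision + recall) else 0.0)
    return precision, recall, f
```

With beta = 0.5, the coefficients (1 + b2) = 1.25 and b2 = 0.25 match the F0.5 formula given in Section 3.1.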
Specifically, the three metrics are expressed as:

    Precision = TP / (TP + FP),                                          (1)
    Recall    = TP / (TP + FN),                                          (2)
    F0.5 = (1.25 × Precision × Recall) / (0.25 × Precision + Recall),    (3)

where TP, FP and FN represent the true positives, false positives and false negatives of the predictions, respectively. We use the scoring program provided by the CoNLL2014 organizers, adapted to be compatible with the latest Python environment.

Baselines. In this report, we perform the GEC task with three systems:

• ChatGPT: We query ChatGPT manually rather than through an API due to the instability of ChatGPT. For example, when a query sentence resembles a question or demand, ChatGPT may stop performing GEC and respond to the "demand" instead. After a few trials, we found a prompt that works well for ChatGPT:

    Do grammatical error correction on all the following sentences I type in the conversation.

We query ChatGPT with this prompt for each test sample.

• Grammarly: Grammarly is a prevalent cloud-based English typing assistant. It reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes in English texts, detects plagiarism, and suggests replacements for the identified errors (Wikipedia contributors, 2023b). As stated by Grammarly, every day, 30 million people and 50,000 teams around the world use Grammarly with their writing (Grammarly, 2023). When querying Grammarly, we open a text file and paste all the test samples as separate paragraphs. We enable all grammar correction options in the settings and only ask it to correct items flagged as correctness problems (red underline), while leaving clarity (blue underline), engagement (green underline) and delivery (purple underline) suggestions unchanged. We iterate this process several times until no error is detected by Grammarly.

• GECToR: Besides Grammarly, we also compare ChatGPT with GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model in research, which also exhibits good performance on the CoNLL2014 task. We adopt the implementation based on the pre-trained RoBERTa model.

3.2 Results and Analysis

Overall Performance. Table 2 presents the overall performance of the three systems. As seen, ChatGPT obtains the highest recall, GECToR obtains the highest precision, while Grammarly achieves a better balance between the two metrics and obtains the highest F0.5 score. These results suggest that ChatGPT tends to correct as many errors as possible, which may lead to more over-corrections. In contrast, GECToR corrects only the errors it is confident about, which leaves many errors uncorrected. Grammarly combines the advantages of both, so it performs more stably.

ChatGPT Performs Worse on Long Sentences? To understand which kinds of sentences ChatGPT is good at, we divide the 100 test sentences into three equally sized categories, namely Short, Medium and Long. Table 3 shows the results with respect to sentence length.

          |        Short           |        Medium          |        Long
System    | Precision Recall F0.5 | Precision Recall F0.5 | Precision Recall F0.5
GECToR    | 76.9      38.5   64.1 | 68.8      37.5   58.9 | 71.8      38.9   61.5
Grammarly | 62.5      60.6   62.1 | 68.9      56.0   65.9 | 67.3      45.3   61.4
ChatGPT   | 58.5      66.7   60.0 | 48.7      60.7   50.7 | 51.0      62.8   53.0

Table 3: GEC performance with respect to sentence length.

As seen, the gap between ChatGPT and Grammarly is significantly bridged on short sentences. In contrast, ChatGPT performs much worse on longer sentences, at least in terms of the existing evaluation metrics.

ChatGPT Goes Beyond One-by-One Corrections. We inspect the outputs of the three systems, especially those for long sentences, and find that
ChatGPT is not limited to correcting errors in a one-by-one fashion. Instead, it is more willing to change the surface expression of some phrases or the sentence structure. For example, in Table 4, GECToR and Grammarly make minor changes to the source sentence (i.e., "an example" to "example", "family potential disease" to "a family 's potential disease"), while ChatGPT modifies the sentence structure (i.e., "for family potential disease" to "in preventing potential family diseases") and the word choice (i.e., "chances" to "opportunities"). This indicates that the outputs of ChatGPT maintain grammatical correctness, although they do not follow the original expression of the source sentences.

System    | Sentence
Source    | For an example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise .
Reference | For example , if exercising (OR exercise) is helpful for a potential family disease , we can always look for more chances for the family to do exercise .
GECToR    | For example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise .
Grammarly | For example , if exercising is helpful for a family 's potential disease , we can always look for more chances for the family to go exercise .
ChatGPT   | For example , if exercise is helpful in preventing potential family diseases , we can always look for more opportunities for the family to exercise .

Table 4: Comparison of the outputs from different GEC systems.

To validate our hypothesis, we let Grammarly further correct the grammatical errors in the outputs of GECToR and ChatGPT. Table 5 lists the results. We observe that Grammarly introduces a negligible improvement to the output of ChatGPT, demonstrating that ChatGPT indeed generates correct sentences. On the contrary, Grammarly noticeably improves the performance of GECToR (i.e., +2.1 F0.5, +16.5 Recall), suggesting that there are still many errors in the output of GECToR.

System      | Precision | Recall | F0.5
GECToR      | 71.2      | 38.4   | 60.8
+ Grammarly | -5.9      | +16.5  | +2.1
ChatGPT     | 51.2      | 62.8   | 53.1
+ Grammarly | +0.4      | +0.8   | +0.5

Table 5: GEC performance with Grammarly for further correction.

Human Evaluation. We conduct a human evaluation to further demonstrate the potential of ChatGPT for the GEC task. Specifically, we follow Wang et al. (2022) to manually annotate the issues in the outputs of the three systems, including 1) under-correction, i.e., grammatical errors that are not found; 2) mis-correction, i.e., grammatical errors that are found but modified incorrectly, either grammatically or semantically; and 3) over-correction, i.e., modifications beyond the changes in the reference. We sample 20 sentences out of the 100 test sentences and ask two annotators to identify the issues. Table 6 shows the results. Clearly, ChatGPT has the fewest under-corrections among the three systems and fewer mis-corrections than GECToR, which suggests its great potential in grammatical error correction. Meanwhile, ChatGPT produces more over-corrections, which may come from its diverse generation ability as a large language model. While this usually leads to a lower F0.5 score, it also allows more flexible language expression in GEC.

System    | #Under | #Mis | #Over
GECToR    | 13     | 4    | 0
Grammarly | 14     | 0    | 1
ChatGPT   | 3      | 3    | 30

Table 6: Number of under-correction (Under), mis-correction (Mis) and over-correction (Over) issues produced by different GEC systems.

Discussions.
We have checked the outputs corresponding to the results of Table 5, and observed
different behaviors of ChatGPT and Grammarly. The slight improvement (i.e., +0.5 F0.5) by Grammarly mainly comes from punctuation problems. ChatGPT is not sensitive to punctuation problems, but Grammarly is, though its modifications are not always correct. For example, when we manually undo the corrections on punctuation, the F0.5 score increases by +0.0015. Beyond punctuation problems, Grammarly also corrects a few grammatical errors involving articles, prepositions, and plurals. However, these corrections usually require Grammarly to repeat the process twice. Take the following sentence as an example:

    ... constructs of the family and kinship are a social construct, ...

Grammarly first changes it to

    ... constructs of the family and kinship are a social constructs, ...

Then, it changes it to

    ... constructs of the family and kinship are social constructs, ...

Nonetheless, Grammarly does correct some errors that ChatGPT fails to correct.

Limitations and Future Works

There are several limitations in this version, which we leave for future work:

• More Datasets: In this version, we only use the CoNLL-2014 test set and only randomly select 100 sentences to conduct the evaluation. In future work, we will conduct experiments on more datasets.

• More Prompts and In-context Learning: In this version, we only use one prompt to query ChatGPT and do not utilize advanced techniques from the in-context learning field, such as providing demonstration examples (Brown et al., 2020) or chain-of-thought prompting (Wei et al., 2022b), which may under-estimate the full potential of ChatGPT. In future work, we will explore in-context learning methods for GEC to improve its performance.

• More Evaluation Metrics: In this version, we only adopt Precision, Recall and F0.5 as evaluation metrics. In future work, we will utilize more metrics, such as pretraining-based metrics (Gong et al., 2022), to evaluate the performance comprehensively.
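The in-context learning direction mentioned in the limitations could reuse the zero-shot instruction from Section 3.1 with demonstration pairs prepended. The sketch below is hypothetical (the helper and the Input/Output formatting are our assumptions; the paper itself queried ChatGPT manually with the zero-shot prompt only):

```python
# The zero-shot instruction used in Section 3.1.
ZERO_SHOT_PROMPT = ("Do grammatical error correction on all the following "
                    "sentences I type in the conversation.")

def build_few_shot_prompt(sentence, demonstrations):
    """Prepend (erroneous, corrected) demonstration pairs to the zero-shot
    instruction. A hypothetical in-context-learning variant, not the
    prompt actually used in this report."""
    lines = [ZERO_SHOT_PROMPT, ""]
    for erroneous, corrected in demonstrations:
        lines.append(f"Input: {erroneous}")
        lines.append(f"Output: {corrected}")
    lines.append(f"Input: {sentence}")
    lines.append("Output:")
    return "\n".join(lines)
```

A demonstration pair could, for instance, be taken from Table 1 (e.g., "I sat in the talk" corrected to "I sat in on the talk").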
4 Conclusion

This paper evaluates ChatGPT on the task of Grammatical Error Correction (GEC). Testing on the CoNLL2014 benchmark dataset, we find that ChatGPT performs worse than the commercial product Grammarly and the state-of-the-art model GECToR in terms of automatic evaluation metrics. By examining the outputs, we find that ChatGPT displays a unique ability to go beyond one-by-one corrections by changing surface expressions and sentence structure while maintaining grammatical correctness. Human evaluation results confirm this finding and reveal that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggest that ChatGPT has the potential to be a valuable tool for GEC.

References

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. ArXiv.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS.

Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In BEA@ACL.

Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, and Ted Briscoe. 2022. Grammatical error correction: A survey of the state of the art. ArXiv.

Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. NeurIPS.

Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and J J Berner. 2023. Mathematical capabilities of ChatGPT. ArXiv.

Peiyuan Gong, Xuebo Liu, Heyan Huang, and Min Zhang. 2022. Revisiting grammatical error correction evaluation and beyond. EMNLP.

Grammarly. 2023. Grammarly website about us page.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. ArXiv.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL.

Reham Omar, Omij Mangukiya, Panos Kalnis, and Essam Mansour. 2023. ChatGPT versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots. ArXiv.

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem N. Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Workshop on Innovative Use of NLP for Building Educational Applications.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. ArXiv.

Sebastian Ruder. 2022. NLP-progress.

Joel R. Tetreault, Keisuke Sakaguchi, and Courtney Napoles. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In EACL.

Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, and Michael Lyu. 2022. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In ACL.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. ICLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai-hsin Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. NeurIPS.

Wikipedia contributors. 2023a. F-score — Wikipedia, the free encyclopedia. [Online; accessed 5-March-2023].

Wikipedia contributors. 2023b. Grammarly — Wikipedia, the free encyclopedia. [Online; accessed 2-March-2023].

Chun Xia and Lingming Zhang. 2023. Conversational automated program repair. ArXiv.

Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, and Wei Cheng. 2023. Exploring the limits of ChatGPT for query or aspect-based text summarization. ArXiv.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. ArXiv.