Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers

Catherine A. Gao*, Frederick M. Howard, Nikolay S. Markov, Emma C. Dyer, Siddhi Ramesh, Yuan Luo, Alexander T. Pearson

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

334 Scopus citations

Abstract

Large language models such as ChatGPT can produce increasingly realistic text, and little is known about the accuracy and integrity of using these models in scientific writing. We gathered fifty research abstracts from five high-impact-factor medical journals and asked ChatGPT to generate research abstracts based on their titles and journals. Most generated abstracts were detected by an AI output detector, ‘GPT-2 Output Detector’: their % ‘fake’ scores (higher meaning more likely to be generated) had a median [interquartile range] of 99.98% [12.73%, 99.98%], compared with 0.02% [0.02%, 0.09%] for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored lower than original abstracts when run through a plagiarism detector website and iThenticate (higher scores meaning more matching text found). When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT but incorrectly flagged 14% of original abstracts as generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, though abstracts they suspected were generated were vaguer and more formulaic. ChatGPT writes believable scientific abstracts, though with completely generated data. Depending on publisher-specific guidelines, AI output detectors may serve as an editorial tool to help maintain scientific standards. The boundaries of ethical and acceptable use of large language models to help scientific writing are still being discussed, and different journals and conferences are adopting varying policies.
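The AUROC reported above summarizes how well the detector's % ‘fake’ scores separate generated from original abstracts: it is the probability that a randomly chosen generated abstract receives a higher score than a randomly chosen original one. A minimal sketch of that pairwise computation is below; the scores are illustrative placeholders, not the study's data.

```python
# Sketch: computing an AUROC from AI-detector '% fake' scores by the
# pairwise definition (probability a positive outscores a negative,
# counting ties as 0.5). Scores below are hypothetical examples only.

def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive wins."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical detector scores: generated abstracts (positives) cluster
# near 100% 'fake', originals (negatives) near 0%, with some overlap.
generated = [99.98, 99.9, 12.73, 99.5, 0.5]
original = [0.02, 0.09, 0.02, 5.0, 0.03]

print(auroc(generated, original))  # → 0.96
```

An equivalent result comes from rank-based routines such as `sklearn.metrics.roc_auc_score`; the brute-force pairwise form is used here only because it mirrors the definition directly.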

Original language: English (US)
Article number: 75
Journal: npj Digital Medicine
Volume: 6
Issue number: 1
DOIs
State: Published - Dec 2023

Funding

C.A.G. is supported by NIH/NHLBI F32HL162377. F.M.H. is supported by ASCO/Conquer Cancer Foundation and Breast Cancer Research Foundation Young Investigator Award 2022YIA-6675470300 and NIH/NCI K12CA139160. S.R. is supported by the Burroughs Wellcome Fund Early Scientific Training to Prepare for Research Excellence Post-Graduation (BEST-PREP). Y.L. reports effort support from the National Institutes of Health/NCATS U01TR003528 and NLM R01LM013337. A.T.P. reports effort support from the National Institutes of Health/National Cancer Institute (NIH/NCI) U01-CA243075, National Institutes of Health/National Institute of Dental and Craniofacial Research (NIH/NIDCR) R56-DE030958, grants from the Cancer Research Foundation, grants from the Stand Up to Cancer (SU2C) Fanconi Anemia Research Fund–Farrah Fawcett Foundation Head and Neck Cancer Research Team Grant, and the Horizon 2021-SC1-BHC I3LUNG grant.

ASJC Scopus subject areas

  • Medicine (miscellaneous)
  • Health Informatics
  • Computer Science Applications
  • Health Information Management
