Published April 26, 2023 | Version v1
Journal article Open

Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers

Description

Large language models such as ChatGPT can produce increasingly realistic text, but little is known about the accuracy and integrity of using these models in scientific writing. We gathered fifty research abstracts from five high-impact-factor medical journals and asked ChatGPT to generate research abstracts based on their titles and journals. Most generated abstracts were detected by an AI output detector, 'GPT-2 Output Detector', which assigns a % 'fake' score (higher meaning more likely to be generated): the median score was 99.98% 'fake' [interquartile range 12.73%, 99.98%] for generated abstracts compared with a median of 0.02% [IQR 0.02%, 0.09%] for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored lower than original abstracts when run through a plagiarism detector website and iThenticate (higher scores meaning more matching text found). When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, though abstracts they suspected were generated were vaguer and more formulaic. ChatGPT writes believable scientific abstracts, though with completely generated data. Depending on publisher-specific guidelines, AI output detectors may serve as an editorial tool to help maintain scientific standards. The boundaries of ethical and acceptable use of large language models to help scientific writing are still being discussed, and different journals and conferences are adopting varying policies.
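The summary statistics quoted above (median with interquartile range, AUROC, and the reviewers' hit and false-alarm rates) can be illustrated with a minimal sketch. This is not the authors' code, which is available only on request. The function names are hypothetical and the detector scores are made-up toy values, not study data. The AUROC here is computed as the fraction of (generated, original) pairs in which the generated abstract receives the higher 'fake' score, one standard way to define it.

```python
from statistics import median, quantiles


def median_iqr(scores):
    """Return (median, Q1, Q3) of a list of detector scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3


def auroc(generated_scores, original_scores):
    """AUROC as the probability that a randomly chosen generated abstract
    scores higher than a randomly chosen original one (ties count half)."""
    wins = 0.0
    for g in generated_scores:
        for o in original_scores:
            if g > o:
                wins += 1.0
            elif g == o:
                wins += 0.5
    return wins / (len(generated_scores) * len(original_scores))


# Toy % 'fake' scores, invented for illustration only (NOT the study's data).
toy_generated = [99.9, 99.5, 80.0, 10.0]
toy_original = [0.02, 0.05, 0.10, 50.0]

print(median_iqr(toy_generated))
print(auroc(toy_generated, toy_original))  # 15 of 16 pairs ranked correctly -> 0.9375

# Reviewer percentages as reported in the description:
sensitivity = 0.68          # generated abstracts correctly flagged
false_positive_rate = 0.14  # original abstracts wrongly flagged
specificity = 1 - false_positive_rate
print(round(specificity, 2))
```

Read this way, the reported reviewer figures correspond to a sensitivity of 68% and a specificity of 86% (100% minus the 14% of original abstracts wrongly flagged).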

Data availability

The data used in the manuscript are available upon reasonable request to the corresponding author.

The code used in the manuscript is available upon reasonable request to the corresponding author.

Files

Comparing-scientific-abstracts-generated-by-ChatGPT-to-real-abstracts-with-detectors-and-blinded-human-reviewers.pdf

Files (2.2 MB)

Name (size), checksum
Supplementary information (45.4 kB), md5:a728bf409718b5682ee157f1bcb8ef59
Reporting summary (1.5 MB), md5:2fb338b7fce878a7718fa48d1adef13b
Article (596.8 kB), md5:b49a371c6c204296a5db71612afa6738

Additional details

Identifiers

DOI
10.1038/s41746-023-00819-6
Other
oai:uchicago.tind.io:5820

Funding

NIH/NHLBI
F32HL162377
ASCO/Conquer Cancer Foundation
Breast Cancer Research Foundation
Young Investigator Award
National Cancer Institute
K12CA139160
Burroughs Wellcome Fund
Early Scientific Training to Prepare for Research Excellence Post-Graduation (BEST-PREP)
National Institutes of Health/NCATS
U01TR003528
NLM
R01LM013337
National Cancer Institute
U01-CA243075
National Institute of Dental and Craniofacial Research
R56-DE030958
Cancer Research Foundation
Stand Up to Cancer (SU2C) Fanconi Anemia Research Fund
Farrah Fawcett Foundation Head and Neck Cancer Research Team Grant
Horizon
2021-SC1-BHC I3LUNG

UChicago Information

Division(s)
Biological Sciences Division
Department(s)
Medicine