In a recent study published in Radiology, researchers found that Generative Pre-trained Transformer 4 (GPT-4), the large language model developed by OpenAI, holds promise for detecting errors in radiology reports, potentially improving reporting accuracy. The study assessed its error detection performance, time efficiency, and cost-effectiveness compared with those of human radiologists.
Radiology reports play a critical role in medical diagnosis but often suffer from errors and inconsistencies. These reports, typically drafted by residents and reviewed by certified radiologists, require significant resources to ensure accuracy. Heavy workloads, high-pressure environments, and unreliable speech recognition software contribute to frequent errors, including incorrect laterality and misplaced descriptors. Automated checking with a model such as GPT-4 could ease this burden, but further research is needed to ensure its reliability and effective integration into radiological practice.
The study, conducted by Gertz et al. at University Hospital Cologne, used 200 radiology reports from radiography and cross-sectional imaging. The reports were randomized into two groups: 100 correct reports and 100 reports into which a radiology resident had intentionally introduced errors. Errors were categorized as omissions, insertions, spelling mistakes, side confusion, and other inaccuracies. Six radiologists with varying levels of experience, alongside GPT-4, then evaluated the reports for errors.
GPT-4 was instructed, using zero-shot prompting, to check each report's findings and impression sections for consistency and errors. The time GPT-4 took to process the reports was also recorded. Costs were calculated from German national labor agreements for the radiologists and from per-token API usage for GPT-4. Statistical analysis of error detection rates and processing times was conducted with SPSS and Python. GPT-4's performance was compared with that of the human radiologists using chi-square tests, with significance set at P < .05 and effect sizes measured by Cohen's d.
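To illustrate what a zero-shot error-checking query could look like, the sketch below sends a report's findings and impression sections to the model in a single prompt with no worked examples. The prompt wording, model identifier, and client setup are assumptions for illustration, not the exact configuration used in the study.

```python
# Hedged sketch of a zero-shot error-checking call; not the study's exact prompt or settings.
# Assumes the openai Python client (>=1.x) and an API key available in the environment.
from openai import OpenAI

client = OpenAI()

def check_report(findings: str, impression: str) -> str:
    """Ask the model whether the findings and impression sections are consistent and error-free."""
    prompt = (
        "You are checking a radiology report for errors such as omissions, insertions, "
        "spelling mistakes, and side (left/right) confusion.\n\n"
        f"FINDINGS:\n{findings}\n\nIMPRESSION:\n{impression}\n\n"
        "State whether the report contains an error and, if so, describe it."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # model name assumed for illustration
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep output as deterministic as possible for a checking task
    )
    return response.choices[0].message.content
```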
The study found that GPT-4's performance varied relative to the human radiologists. GPT-4 detected 82.7% of errors, compared with 94.7% for the senior radiologist, but its performance was generally comparable to that of the other radiologists in the study. There were no statistically significant differences in average error detection rates between GPT-4 and the radiologists for general radiology, radiography, or computed tomography (CT)/magnetic resonance imaging (MRI) reports, except in specific cases such as side confusion, where GPT-4 performed worse.
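The kind of comparison reported above can be illustrated with a chi-square test on detected-versus-missed error counts; the counts below are hypothetical placeholders rather than figures from the study, and the Cohen's d helper shows one common way effect sizes are computed for continuous measures such as reading times.

```python
# Illustrative sketch of a detection-rate comparison; the counts are hypothetical, not study data.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical detected/missed error counts for GPT-4 and one radiologist.
table = np.array([
    [124, 26],   # GPT-4: detected, missed (hypothetical)
    [142,  8],   # radiologist: detected, missed (hypothetical)
])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # study threshold for significance: P < .05

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with a pooled standard deviation, e.g. for per-report reading times."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd
```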
GPT-4 was notably less effective at detecting side confusion than the top-performing radiologist, with a detection rate of 78% versus 100%. Across the other error categories, however, GPT-4 showed accuracy similar to the radiologists, with no significant shortfall in identifying errors. Both GPT-4 and the radiologists occasionally flagged error-free reports as erroneous, although this was infrequent and did not differ significantly between the groups.
The study also evaluated time efficiency: GPT-4 required significantly less time to review all 200 reports than the human radiologists, completing the task in 0.19 hours, whereas the radiologists took between 1.4 and 5.74 hours. The cost analysis showed that GPT-4 was also far more cost-effective, completing the task for $5.78 compared with an average of $190.17 for the six human readers.
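A simple back-of-the-envelope calculation, using only the figures reported above, shows how these headline numbers translate into per-report cost and speed-up factors; the per-report breakdown is derived here for illustration and is not quoted from the study.

```python
# Back-of-the-envelope comparison using the figures reported in the article.
n_reports = 200

gpt4_hours, gpt4_cost = 0.19, 5.78
human_hours_range = (1.40, 5.74)   # fastest and slowest of the six human readers
human_avg_cost = 190.17            # average cost for the human readers, as reported above

print(f"GPT-4 cost per report: ${gpt4_cost / n_reports:.3f}")
print(f"Cost ratio (human average / GPT-4): {human_avg_cost / gpt4_cost:.0f}x")
print(f"Speed-up versus the fastest reader: {human_hours_range[0] / gpt4_hours:.1f}x")
```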
In conclusion, the study showed that GPT-4's error detection in radiology reports was comparable to that of human radiologists while being far more cost-effective and time-efficient. However, human oversight remains necessary because of legal and accuracy concerns.
Reference
Gertz RJ, Dratsch T, Bunck AC, Lennartz S, Iuga AI, Hellmich MG, et al. Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy. Radiology. 2024 Apr;311(1):e232714.