Diagnostic Accuracy and Clinical Value of a Domain-specific Multimodal Generative AI Model for Chest Radiograph Report Generation

Eun Kyoung Hong, Jiyeon Ham, Byungseok Roh, Jawook Gu, Beomhee Park, Sunghun Kang, Kihyun You, Jihwan Eom, Byeonguk Bae, Jae Bock Jo, Ok Kyu Song, Woong Bae, Ro Woon Lee, Chong Hyun Suh, Chan Ho Park, Seong Jun Choi, Jai Soung Park, Jae Hyeong Park, Hyun Jeong Jeon, Jeong Ho Hong, Dosang Cho, Han Seok Choi, Tae Hee Kim

Research output: Contribution to journal › Article › peer-review


Abstract

Background: Generative artificial intelligence (AI) is anticipated to alter radiology workflows, making an assessment of its clinical value necessary for frequent examinations such as chest radiograph interpretation.

Purpose: To develop a domain-specific multimodal generative AI model for providing preliminary interpretations of chest radiographs and to evaluate its diagnostic accuracy and clinical value.

Materials and Methods: For training, consecutive radiograph-report pairs from frontal chest radiography were retrospectively collected from 42 hospitals (2005-2023). The trained domain-specific AI model generated radiology reports for the radiographs. The test set included public datasets (PadChest, Open-i, VinDr-CXR, and MIMIC-CXR-JPG) and radiographs excluded from training. The sensitivity and specificity of the model-generated reports for 13 radiographic findings were calculated (with 95% CIs) against radiologist annotations as the reference standard. Four radiologists evaluated the subjective quality of the reports in terms of acceptability, agreement score, quality score, and comparative ranking of reports from (a) the domain-specific AI model, (b) radiologists, and (c) a general-purpose large language model (GPT-4Vision). Acceptability was defined as whether the radiologist would endorse the report as their own without changes. Agreement scores, from 1 (clinically significant discrepancy) to 5 (complete agreement), were assigned using RADPEER; quality scores were on a 5-point Likert scale from 1 (very poor) to 5 (excellent).

Results: A total of 8,838,719 radiograph-report pairs (training) and 2145 radiographs (testing) were included (data were anonymized with respect to sex and gender). Reports generated by the domain-specific AI model demonstrated high sensitivity for detecting two critical radiographic findings: 95.3% (181 of 190) for pneumothorax and 92.6% (138 of 149) for subcutaneous emphysema. The acceptance rate, evaluated by four radiologists, was 70.5% (6047 of 8580) for model-generated reports, 73.3% (6288 of 8580) for radiologist reports, and 29.6% (2536 of 8580) for GPT-4Vision reports. Agreement scores were highest for the model-generated reports (median, 4 [IQR, 3-5]) and lowest for the GPT-4Vision reports (median, 1 [IQR, 1-3]; P < .001). Quality scores were likewise highest for the model-generated reports (median, 4 [IQR, 3-5]) and lowest for the GPT-4Vision reports (median, 2 [IQR, 1-3]; P < .001). In the ranking analysis, model-generated reports were most frequently ranked highest (60.0%; 5146 of 8580), and GPT-4Vision reports were most frequently ranked lowest (73.6%; 6312 of 8580).

Conclusion: A domain-specific multimodal generative AI model demonstrated potential for high diagnostic accuracy and clinical value in providing preliminary interpretations of chest radiographs for radiologists.
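The abstract reports per-finding sensitivity with 95% CIs but does not state which interval method was used. The following minimal Python sketch, assuming a Wilson score interval, shows how such a point estimate and CI can be computed from the published counts (pneumothorax: 181 true positives of 190 positive cases); the function name and the choice of interval are illustrative assumptions, not the study's actual code.

```python
import math

def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% CI for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half

# Counts taken from the abstract: 181 pneumothorax cases detected out of 190.
tp, positives = 181, 190
sensitivity = tp / positives
lo, hi = wilson_ci(tp, positives)
print(f"Sensitivity: {sensitivity:.1%} (95% CI: {lo:.1%}, {hi:.1%})")
```

Run as-is, this prints a sensitivity of 95.3% with a Wilson CI of roughly 91.2% to 97.5%; the paper's published CIs may differ if a different method (e.g., Clopper-Pearson) was used.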

Original language: English
Article number: e241476
Journal: Radiology
Volume: 314
Issue number: 3
DOIs
State: Published - Mar 2025

