TY - JOUR
T1 - Evaluation and Analysis of Large Language Models for Clinical Text Augmentation and Generation
AU - Latif, Atif
AU - Kim, Jihie
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - A major challenge in deep learning (DL) model training is data scarcity. Data scarcity is common in specific domains, such as clinical text or low-resource languages, that are not widely explored in AI research. In this paper, we investigate the generation capability of large language models such as the Text-To-Text Transfer Transformer (T5) and Bidirectional and Auto-Regressive Transformers (BART) on the Clinical Health-Aware Reasoning across Dimensions (CHARDAT) dataset by applying a ChatGPT-based augmentation technique. We employed ChatGPT to rephrase each instance of the training set into conceptually similar but semantically different samples and added these samples to the dataset. This study aims to investigate the use of large language models, ChatGPT in particular, for data augmentation to overcome limited data availability in the clinical domain. In addition to the ChatGPT augmentation, we applied other augmentation techniques, such as easy data augmentation (EDA) and An Easier Data Augmentation (AEDA), to the clinical data. ChatGPT comprehended the contextual significance of sentences within the dataset and successfully modified general English terms but not clinical terms. The original CHARDAT dataset covers 52 health conditions across three clinical dimensions: Treatments, Risk Factors, and Preventions. We compared the outputs of the different augmentation techniques and evaluated their relative performance. Additionally, we examined how these techniques perform with different pre-trained language models, assessing their sensitivity in various contexts. Despite the relatively small size of the CHARDAT dataset, our results demonstrated that ChatGPT-based augmentation outperformed the previously employed back-translation augmentation. Specifically, our findings revealed that the BART model delivered the best performance, achieving ROUGE scores of 52.35 (ROUGE-1), 41.59 (ROUGE-2), and 50.71 (ROUGE-L).
KW - BART
KW - large language models
KW - T5
KW - text augmentation
KW - text mining
UR - http://www.scopus.com/inward/record.url?scp=85189643086&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3384496
DO - 10.1109/ACCESS.2024.3384496
M3 - Article
AN - SCOPUS:85189643086
SN - 2169-3536
VL - 12
SP - 48987
EP - 48996
JO - IEEE Access
JF - IEEE Access
ER -