TY - JOUR
T1 - Expansive data, extensive model
T2 - Investigating discussion topics around LLM through unsupervised machine learning in academic papers and news
AU - Jung, Hae Sun
AU - Lee, Haein
AU - Woo, Young Seok
AU - Baek, Seo Yeon
AU - Kim, Jang Hyun
N1 - Publisher Copyright:
© 2024 Jung et al.
PY - 2024/5
Y1 - 2024/5
N2 - This study presents a comprehensive exploration of topic modeling methods tailored for large language models (LLMs), using data obtained from Web of Science and LexisNexis from June 1, 2020, to December 31, 2023. The data collection process involved queries focusing on LLMs, including “Large language model,” “LLM,” and “ChatGPT.” Various topic modeling approaches were evaluated based on performance metrics, including diversity and coherence. Latent Dirichlet allocation (LDA), nonnegative matrix factorization (NMF), combined topic models (CTM), and bidirectional encoder representations from Transformers topic (BERTopic) were employed for performance evaluation. Evaluation metrics were computed across platforms, with BERTopic demonstrating superior diversity and coherence on both LexisNexis and Web of Science. The experimental results reveal that news articles maintain balanced coverage across various topics and mainly focus on efforts to utilize LLMs in specialized domains. Conversely, research papers are more concise and concentrated on the technology itself, emphasizing technical aspects. The insights gained in this study make it possible to investigate the future path of LLMs and the challenges they should tackle. Additionally, these insights could offer considerable value to enterprises that utilize LLMs to deliver services.
AB - This study presents a comprehensive exploration of topic modeling methods tailored for large language models (LLMs), using data obtained from Web of Science and LexisNexis from June 1, 2020, to December 31, 2023. The data collection process involved queries focusing on LLMs, including “Large language model,” “LLM,” and “ChatGPT.” Various topic modeling approaches were evaluated based on performance metrics, including diversity and coherence. Latent Dirichlet allocation (LDA), nonnegative matrix factorization (NMF), combined topic models (CTM), and bidirectional encoder representations from Transformers topic (BERTopic) were employed for performance evaluation. Evaluation metrics were computed across platforms, with BERTopic demonstrating superior diversity and coherence on both LexisNexis and Web of Science. The experimental results reveal that news articles maintain balanced coverage across various topics and mainly focus on efforts to utilize LLMs in specialized domains. Conversely, research papers are more concise and concentrated on the technology itself, emphasizing technical aspects. The insights gained in this study make it possible to investigate the future path of LLMs and the challenges they should tackle. Additionally, these insights could offer considerable value to enterprises that utilize LLMs to deliver services.
UR - https://www.scopus.com/pages/publications/85194940148
U2 - 10.1371/journal.pone.0304680
DO - 10.1371/journal.pone.0304680
M3 - Article
C2 - 38820285
AN - SCOPUS:85194940148
SN - 1932-6203
VL - 19
JO - PLoS ONE
JF - PLoS ONE
IS - 5 May
M1 - e0304680
ER -