TY - JOUR
T1 - An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models
AU - Shin, Changyong
AU - Go, Younghun
AU - Yoo, Yeonho
AU - Yang, Gyeongsik
AU - Yoo, Chuck
N1 - Publisher Copyright:
© 2024, Korean Institute of Communications and Information Sciences. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Recently, large language models, such as GPT, LLaMA, and PaLM, have been actively applied in various fields such as medicine, education, finance, law, and marketing. These models have a vast number of parameters and require multiple GPUs to perform inference. For system administrators of inference services in clusters or clouds, it is critical to utilize the given GPU and network resources as efficiently as possible to respond quickly to numerous user requests. To achieve this, existing inference systems employ various parallelization and optimization strategies. This paper profiles and analyzes inference time, prediction accuracy, GPU communication volume, and GPU memory usage across different parallelization strategies, optimization techniques, and batch sizes. Notably, we develop a new profiler for precise measurement of GPU resources. Our profiling results reveal that increasing the batch size can lead to inefficiencies due to increased GPU communication. In terms of GPU memory, larger batch sizes make more aggressive use of memory, but beyond a certain threshold, out-of-memory errors occur on the limited GPU memory. These observations are expected to serve as a baseline for designing efficient inference systems for large language models.
AB - Recently, large language models, such as GPT, LLaMA, and PaLM, have been actively applied in various fields such as medicine, education, finance, law, and marketing. These models have a vast number of parameters and require multiple GPUs to perform inference. For system administrators of inference services in clusters or clouds, it is critical to utilize the given GPU and network resources as efficiently as possible to respond quickly to numerous user requests. To achieve this, existing inference systems employ various parallelization and optimization strategies. This paper profiles and analyzes inference time, prediction accuracy, GPU communication volume, and GPU memory usage across different parallelization strategies, optimization techniques, and batch sizes. Notably, we develop a new profiler for precise measurement of GPU resources. Our profiling results reveal that increasing the batch size can lead to inefficiencies due to increased GPU communication. In terms of GPU memory, larger batch sizes make more aggressive use of memory, but beyond a certain threshold, out-of-memory errors occur on the limited GPU memory. These observations are expected to serve as a baseline for designing efficient inference systems for large language models.
KW - Batch size
KW - Communication overhead
KW - GPU utilization
KW - Kernel fusion
KW - Large language model
KW - Model parallelism
KW - Tensor parallelism
UR - https://www.scopus.com/pages/publications/85211235106
U2 - 10.7840/kics.2024.49.10.1377
DO - 10.7840/kics.2024.49.10.1377
M3 - Article
AN - SCOPUS:85211235106
SN - 1226-4717
VL - 49
SP - 1377
EP - 1385
JO - Journal of Korean Institute of Communications and Information Sciences
JF - Journal of Korean Institute of Communications and Information Sciences
IS - 10
ER -