An Analysis on Inference Time, Accuracy, Communication, and GPU Memory Usage for Inference Batch of Large Language Models

  • Changyong Shin
  • Younghun Go
  • Yeonho Yoo
  • Gyeongsik Yang
  • Chuck Yoo

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Recently, large language models such as GPT, LLaMA, and PaLM have been actively applied in fields as diverse as medicine, education, finance, law, and marketing. These models have so many parameters that multiple GPUs are required to perform inference. For system administrators of inference services in clusters or clouds, it is critical to utilize the given GPU and network resources as efficiently as possible in order to respond quickly to numerous user requests. To achieve this, existing inference systems employ various parallelization and optimization strategies. This paper profiles and analyzes inference time, prediction accuracy, GPU communication volume, and GPU memory usage across different parallelization strategies, optimization techniques, and batch sizes. Notably, we develop a new resource profiler for precise measurement of GPU resources. Our profiling results reveal that increasing the batch size can lead to inefficiencies due to increased GPU communication. In terms of GPU memory, larger batch sizes utilize memory more aggressively, but beyond a certain threshold, out-of-memory errors occur on the limited GPU memory. These observations are expected to serve as a baseline for designing efficient inference systems for large language models.
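The abstract's observation that GPU communication grows with batch size can be illustrated with a back-of-the-envelope sketch. The sketch below assumes Megatron-style tensor parallelism (two ring all-reduces of the layer activations per transformer layer in the forward pass); the model shapes, function names, and the two-all-reduce count are illustrative assumptions, not details taken from the paper.

```python
def allreduce_bytes_per_gpu(batch, seq_len, hidden, tp_degree, dtype_bytes=2):
    """Approximate bytes each GPU transfers in one ring all-reduce of the
    activations; the message holds batch * seq_len * hidden elements."""
    message = batch * seq_len * hidden * dtype_bytes
    # Ring all-reduce moves roughly 2 * (p - 1) / p times the message size.
    return 2 * (tp_degree - 1) / tp_degree * message

def forward_comm_bytes(batch, seq_len, hidden, tp_degree, n_layers, dtype_bytes=2):
    """Total forward-pass all-reduce traffic per GPU, assuming two
    all-reduces per transformer layer (Megatron-style tensor parallelism)."""
    per_layer = 2 * allreduce_bytes_per_gpu(batch, seq_len, hidden,
                                            tp_degree, dtype_bytes)
    return n_layers * per_layer

# Illustrative shapes: 32 layers, hidden 4096, sequence 512, fp16, TP degree 4.
# → 402653184.0 bytes (~384 MiB) at batch 1; traffic scales linearly in batch.
print(forward_comm_bytes(1, 512, 4096, 4, 32))
```

Because the all-reduce message size is proportional to the batch dimension, doubling the batch doubles the per-GPU communication volume, which is consistent with the paper's finding that larger batches can become communication-bound.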

Original language: English
Pages (from-to): 1377-1385
Number of pages: 9
Journal: Journal of Korean Institute of Communications and Information Sciences
Volume: 49
Issue number: 10
DOIs
State: Published - 2024

Keywords

  • Batch size
  • Communication overhead
  • GPU utilization
  • Kernel fusion
  • Large language model
  • Model parallelism
  • Tensor parallelism
