Selective Preemption of Distributed Deep Learning Training

  • Younghun Go
  • Changyong Shin
  • Jeunghwan Lee
  • Yeonho Yoo
  • Gyeongsik Yang
  • Chuck Yoo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

As more distributed deep learning (DDL) jobs run in public clouds, scheduling them effectively becomes a major challenge. Current studies prioritize jobs with the shortest remaining time, a policy known to be the best at reducing average job completion time (JCT). However, we observe that this approach breaks down once the overhead of preemption, that is, pausing and loading jobs, is taken into account; for DDL jobs, this overhead can reach hundreds of seconds. The resulting schedules are so inefficient that, in some cases, a simple first-in-first-out policy performs much better. This paper proposes Xion, a new scheduling framework that accounts for preemption overheads and preempts DDL jobs only when doing so is beneficial. Our evaluation shows that Xion reduces the average JCT by 19% and improves the waiting time by 1.64×.
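
To make the trade-off described in the abstract concrete, the following is a minimal sketch of a preemption decision that weighs overhead against benefit. It is not Xion's algorithm, which this record does not describe. It assumes a toy model with one running job and one waiting candidate on a single shared slot. The function names and the parameters `pause_overhead` and `resume_overhead` are hypothetical, introduced only for illustration.

```python
def completion_time_sum(running_remaining, candidate_remaining,
                        pause_overhead, resume_overhead, preempt):
    """Sum of both jobs' completion times (seconds from now) on one shared slot.

    Toy model only (an assumption, not Xion's cost model): one running job
    and one waiting candidate compete for the same resources.
    """
    if preempt:
        # Pause the running job, run the candidate to completion,
        # then reload the paused job and finish it.
        candidate_done = pause_overhead + candidate_remaining
        running_done = candidate_done + resume_overhead + running_remaining
    else:
        # Let the running job finish first; the candidate waits for it.
        running_done = running_remaining
        candidate_done = running_done + candidate_remaining
    return running_done + candidate_done


def should_preempt(running_remaining, candidate_remaining,
                   pause_overhead, resume_overhead):
    """Preempt only if doing so lowers the summed completion time."""
    with_preemption = completion_time_sum(
        running_remaining, candidate_remaining,
        pause_overhead, resume_overhead, preempt=True)
    without_preemption = completion_time_sum(
        running_remaining, candidate_remaining,
        pause_overhead, resume_overhead, preempt=False)
    return with_preemption < without_preemption


if __name__ == "__main__":
    # With zero overhead, the shorter candidate is preferred (SRTF-like choice).
    print(should_preempt(1000, 900, 0, 0))      # True
    # With 100 s to pause and 100 s to reload, the same preemption no longer pays off.
    print(should_preempt(1000, 900, 100, 100))  # False
    # A much shorter candidate still justifies paying the overhead.
    print(should_preempt(1000, 100, 100, 100))  # True
```

In this toy model the comparison reduces to preempting only when `candidate_remaining + 2 * pause_overhead + resume_overhead < running_remaining`. With zero overhead this is the shortest-remaining-time rule; as the overhead grows, fewer preemptions remain worthwhile.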

Original language: English
Title of host publication: Proceedings - 2023 IEEE 16th International Conference on Cloud Computing, CLOUD 2023
Editors: Claudio Ardagna, Nimanthi Atukorala, Pete Beckman, Carl K. Chang, Rong N. Chang, Constantinos Evangelinos, Jing Fan, Geoffrey C. Fox, Judy Fox, Christoph Hagleitner, Zhi Jin, Tevfik Kosar, Manish Parashar
Publisher: IEEE Computer Society
Pages: 175-177
Number of pages: 3
ISBN (Electronic): 9798350304817
DOIs
State: Published - 2023
Event: 16th IEEE International Conference on Cloud Computing, CLOUD 2023 - Hybrid, Chicago, United States
Duration: 2 Jul 2023 to 8 Jul 2023

Publication series

Name: IEEE International Conference on Cloud Computing, CLOUD
Volume: 2023-July
ISSN (Print): 2159-6182
ISSN (Electronic): 2159-6190

Conference

Conference: 16th IEEE International Conference on Cloud Computing, CLOUD 2023
Country/Territory: United States
City: Hybrid, Chicago
Period: 2/07/23 to 8/07/23

Keywords

  • Distributed deep learning
  • GPU cloud
  • GPU scheduling
  • Job scheduling
  • Preemption
  • SRTF
