Computer Science Speaking Skills Talk

— 2:00pm

Location:
In Person - McWilliams Classroom, Gates Hillman 4303

Speaker:
SUHAS JAYARAM SUBRAMANYA, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://suhasjs.github.io/

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

Large GPU clusters are becoming increasingly heterogeneous due to advances in GPU design and the incremental deployment of a mix of GPU types over time. Deep learning (DL) training jobs running on these clusters can see varying job completion times depending on the resources allocated by the cluster scheduler and the job hyper-parameters configured by users at submission time. Sia is a cluster scheduler that (1) efficiently assigns heterogeneous GPU resources to elastic, resource-adaptive DL training jobs, and (2) configures job hyper-parameters to maintain high training efficiency for all running jobs without sacrificing the quality of trained models.

We will discuss the challenges of optimizing resource adaptivity for deep learning training (DLT) jobs on large clusters with many GPU types, and introduce a new scheduling formulation that efficiently matches DLT jobs and their configurations to GPU types and counts while adapting to changes in cluster load and job mix over time. On job traces derived from real datacenters, Sia improves job completion times by 30-93% while using 12-60% fewer GPU hours. Furthermore, its scheduling policy is quick to evaluate and scales easily to clusters with many GPU types and thousands of GPUs.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

