Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor

Zhen Xie, Murali Emani, Xiaodong Yu, Dingwen Tao, Xin He, Pengfei Su, Keren Zhou, Venkatram Vishwanath

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

For an extended period, graphics processing units (GPUs) have stood as the exclusive choice for training deep neural network (DNN) models. Over time, to serve the growing demands in a more targeted manner, various artificial intelligence-specific hardware, referred to as AI accelerators, have been vigorously developed, aiming to provide more efficient DNN acceleration solutions. However, these numerous solutions are also heterogeneous and thus complicate accelerator selection. Given a DNN model and a training objective, such as throughput or price-performance ratio, it remains challenging to arrive at the optimal decision among many options because of high reimplementation costs and unexpected performance. To tackle this challenge, we propose Centimani, a performance predictor that accurately and rapidly predicts DNN training throughput on various AI accelerators, thereby facilitating the accelerator selection process. To achieve this goal, we first analyze typical AI accelerators and draw observations that abstract AI accelerator designs and guide our performance modeling approach. In particular, we construct a memory estimation model and decoupled performance models to select the most appropriate batch size and predict the execution time of DNN training. We validate our approach by applying Centimani to six common DNN models on four typical AI accelerators. Results show that Centimani predicts the throughput with an average accuracy of 93.1% on single-device training and 90.4% on multiple-device training, so the optimal accelerator for the user’s training objective can be identified.
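The selection step described in the abstract, choosing an accelerator for a given training objective once throughput has been predicted, can be illustrated with a minimal Python sketch. This is not Centimani's implementation: the `Candidate` type, the `select_accelerator` function, the objective names, and all numbers are hypothetical, and the predicted throughput is assumed to come from some external predictor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One accelerator option (all fields are hypothetical inputs)."""
    name: str
    predicted_throughput: float  # samples per second, from some predictor
    price_per_hour: float        # rental price in USD per hour


def select_accelerator(candidates, objective="throughput"):
    """Return the candidate that maximizes the chosen training objective.

    objective="throughput" ranks by predicted samples per second;
    objective="price_performance" ranks by samples per second per USD/hour.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    if objective == "throughput":
        def score(c):
            return c.predicted_throughput
    elif objective == "price_performance":
        def score(c):
            return c.predicted_throughput / c.price_per_hour
    else:
        raise ValueError(f"unknown objective: {objective!r}")
    return max(candidates, key=score)


# Made-up numbers for illustration only.
options = [
    Candidate("accel_a", predicted_throughput=1000.0, price_per_hour=4.0),
    Candidate("accel_b", predicted_throughput=800.0, price_per_hour=2.0),
]
fastest = select_accelerator(options, "throughput")
best_value = select_accelerator(options, "price_performance")
```

With these made-up inputs the two objectives disagree: `accel_a` has the higher raw throughput, while `accel_b` delivers more samples per dollar, which is why the objective has to be stated before a choice is made.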

Original language: English
Title of host publication: Proceedings of the 2024 USENIX Annual Technical Conference, ATC 2024
Pages: 1203-1221
Number of pages: 19
ISBN (Electronic): 9781939133410
State: Published - 2024
Event: 2024 USENIX Annual Technical Conference, ATC 2024 - Santa Clara, United States
Duration: 10 Jul 2024 to 12 Jul 2024

Publication series

Name: Proceedings of the 2024 USENIX Annual Technical Conference, ATC 2024

Conference

Conference: 2024 USENIX Annual Technical Conference, ATC 2024
Country/Territory: United States
City: Santa Clara
Period: 10/07/24 to 12/07/24
