CRII: OAC: A Compressor-Assisted Collective Communication Framework for GPU-Based Large-Scale Deep Learning

Project: Research project

Project Details

Description

The scale of modern deep learning is expanding rapidly due to larger training datasets, larger neural network models, and new algorithms and techniques. This presents significant challenges to current distributed high-performance computing (HPC) infrastructures, since larger-scale training incurs more expensive collective communication costs for passing larger gradient messages among nodes. A more powerful hardware platform does not necessarily overcome this performance bottleneck, because optimized middleware support is required to fully unleash the platform's computing capacity. This project aims to close the gap between the training scale and the infrastructure's capability by providing gradient-specific lossy compression techniques and an optimized GPU-aware, compressor-assisted collective communication framework that systematically reduces gradient message sizes and improves communication performance. The deliverables can help end users achieve significantly faster training while preserving training accuracy. The success of this research can advance both traditional AI research, such as computer vision and natural language processing, and emerging AI-for-Science research in domain sciences including cosmology, X-ray imaging, and drug discovery. The project also contributes to educational and engagement activities by leveraging the research outcomes to develop new curricula and teaching tools for mentoring college students and training K-12 students in HPC and AI.

Using current collective communication libraries for large-scale distributed deep learning can incur significant communication overhead because gradient messages are large. Applying lossy compression to gradient messages can reduce this overhead, but several important open research questions must be investigated to ensure a performance gain: 1) Are current lossy compressors efficient enough for gradient data? 2) How can lossy compressors be efficiently integrated into a GPU-aware collective communication framework? 3) How can GPU resources be efficiently shared among different tasks? This project addresses these questions and delivers a novel compressor-assisted, GPU-aware collective communication framework for large-scale deep learning. Specifically, the team 1) investigates the efficiency of error-bounded scientific data lossy compressors on gradient data and develops a new gradient compressor that combines the advantages of different existing compressors to achieve a better compression ratio and training accuracy; 2) designs a GPU implementation of the new compressor, integrates it into GPU-aware MPI, and optimizes the workflow so that the compressor's cost is ultimately hidden within the communication cost; 3) profiles the GPU resource utilization of both deep learning training and compressor-assisted collective communication, and designs a new communication framework that schedules training, compression, and collective computations (e.g., reduction) on the same GPU to achieve optimal resource sharing for end-to-end deep learning training.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
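To make the compressor-assisted collective idea concrete, the following is a minimal, illustrative sketch and not the project's actual design: each rank compresses its local gradient with simple top-k sparsification (a stand-in for the error-bounded gradient compressor developed in the project), exchanges only the small compressed messages via an allgather, then decompresses and averages. It assumes mpi4py and NumPy on the host and omits GPU awareness, pipelining, and error-bound control.

```python
"""Illustrative compressor-assisted allreduce (sketch only, not the project's compressor).
Run with e.g.: mpirun -n 4 python compressed_allreduce_sketch.py"""
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD


def compress_topk(grad, ratio=0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx.astype(np.int64), grad[idx]


def decompress_topk(idx, vals, size):
    """Scatter the retained values back into a dense, zero-filled gradient."""
    dense = np.zeros(size, dtype=vals.dtype)
    dense[idx] = vals
    return dense


def compressed_allreduce(grad):
    """Average gradients across ranks while communicating only compressed messages."""
    idx, vals = compress_topk(grad)
    # Exchange the small (index, value) pairs instead of the full dense gradient.
    gathered = comm.allgather((idx, vals))
    total = np.zeros_like(grad)
    for r_idx, r_vals in gathered:
        total += decompress_topk(r_idx, r_vals, grad.size)
    return total / comm.Get_size()


if __name__ == "__main__":
    rng = np.random.default_rng(seed=comm.Get_rank())
    local_grad = rng.standard_normal(1_000_000).astype(np.float32)
    avg_grad = compressed_allreduce(local_grad)
    if comm.Get_rank() == 0:
        print("averaged gradient norm:", np.linalg.norm(avg_grad))
```

In the framework described above, by contrast, the compressor runs on the GPU inside GPU-aware MPI collectives and its cost is hidden by overlapping compression with communication; this sketch only illustrates the message-size reduction that motivates the design.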
Status: Active
Effective start/end date: 1/06/24 – 31/05/26

Funding

  • National Science Foundation
