
PFault: A general framework for analyzing the reliability of high-performance parallel file systems

  • Jinrui Cao
  • Om Rameshwar Gatla
  • Mai Zheng
  • Dong Dai
  • Vidya Eswarappa
  • Yan Mu
  • Yong Chen
  • New Mexico State University
  • Texas Tech University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

25 Scopus citations

Abstract

High-performance parallel file systems (PFSes) are of prime importance today. However, despite this importance, their reliability is much less studied than that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFault, a general framework for analyzing the failure handling of PFSes. PFault automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables systematic analysis of the recoverability of the PFS under faults. To demonstrate its practicality, we apply PFault to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repairing utility, LFSCK, fails with unexpected symptoms (e.g., I/O error, hang, reboot). Moreover, with the help of PFault, we are able to identify a resource leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvements in failure handling compared with a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.

Original language: English
Title of host publication: ICS 2018
Subtitle of host publication: International Conference on Supercomputing
Pages: 1-11
Number of pages: 11
ISBN (Electronic): 9781450357838
State: Published - 12 Jun 2018
Event: 32nd International Conference on Supercomputing, ICS 2018 - Beijing, China
Duration: 12 Jun 2018 to 15 Jun 2018

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 32nd International Conference on Supercomputing, ICS 2018
Country/Territory: China
City: Beijing
Period: 12/06/18 to 15/06/18

Keywords

  • High performance computing
  • Parallel file systems
  • Reliability
