TY - GEN
T1 - PFault
T2 - 32nd International Conference on Supercomputing, ICS 2018
AU - Cao, Jinrui
AU - Gatla, Om Rameshwar
AU - Zheng, Mai
AU - Dai, Dong
AU - Eswarappa, Vidya
AU - Mu, Yan
AU - Chen, Yong
N1 - Publisher Copyright: © 2018 Association for Computing Machinery.
PY - 2018/6/12
Y1 - 2018/6/12
AB - High-performance parallel file systems (PFSes) are of prime importance today. However, despite this importance, their reliability is much less studied than that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFault, a general framework for analyzing the failure handling of PFSes. PFault automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables systematic analysis of the recoverability of the PFS under faults. To demonstrate its practicality, we apply PFault to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repairing utility LFSCK fails with unexpected symptoms (e.g., I/O error, hang, reboot). Moreover, with the help of PFault, we are able to identify a resource leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvement in terms of failure handling compared to a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.
KW - High performance computing
KW - Parallel file systems
KW - Reliability
UR - https://www.scopus.com/pages/publications/85055818808
UR - https://www.scopus.com/pages/publications/85055818808#tab=citedBy
U2 - 10.1145/3205289.3205302
DO - 10.1145/3205289.3205302
M3 - Conference contribution
AN - SCOPUS:85055818808
T3 - Proceedings of the International Conference on Supercomputing
SP - 1
EP - 11
BT - ICS 2018
Y2 - 12 June 2018 through 15 June 2018
ER -