TY - JOUR
T1 - Hadoop Perfect File
T2 - A fast and memory-efficient metadata access archive file to face small files problem in HDFS
AU - Zhai, Yanlong
AU - Tchaye-Kondi, Jude
AU - Lin, Kwei-Jay
AU - Zhu, Liehuang
AU - Tao, Wenjun
AU - Du, Xiaojiang
AU - Guizani, Mohsen
N1 - Publisher Copyright:
© 2021 Elsevier Inc.
PY - 2021/10
Y1 - 2021/10
N2 - HDFS faces several issues when handling a large number of small files. These issues are well addressed by archive systems, which combine small files into larger ones. They use index files to hold the information needed to retrieve a small file's content from the big archive file. However, existing archive-based solutions incur significant overhead when retrieving a file's content, since additional processing and I/O are needed to acquire the retrieval information before the actual content can be accessed, thereby degrading access efficiency. This paper presents a new archive file named Hadoop Perfect File (HPF). HPF minimizes access overhead by reading metadata directly from the part of the index file that contains it. It consequently reduces the additional processing and I/O needed and improves access efficiency for archive files. Our index system uses two hash functions. Metadata records are distributed across index files using a dynamic hash function. We further build an order-preserving perfect hash function that memorizes the position of a small file's metadata record within the index file.
AB - HDFS faces several issues when handling a large number of small files. These issues are well addressed by archive systems, which combine small files into larger ones. They use index files to hold the information needed to retrieve a small file's content from the big archive file. However, existing archive-based solutions incur significant overhead when retrieving a file's content, since additional processing and I/O are needed to acquire the retrieval information before the actual content can be accessed, thereby degrading access efficiency. This paper presents a new archive file named Hadoop Perfect File (HPF). HPF minimizes access overhead by reading metadata directly from the part of the index file that contains it. It consequently reduces the additional processing and I/O needed and improves access efficiency for archive files. Our index system uses two hash functions. Metadata records are distributed across index files using a dynamic hash function. We further build an order-preserving perfect hash function that memorizes the position of a small file's metadata record within the index file.
KW - Distributed file system
KW - Fast access
KW - HDFS
KW - Massive small files
UR - http://www.scopus.com/inward/record.url?scp=85108089031&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108089031&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2021.05.011
DO - 10.1016/j.jpdc.2021.05.011
M3 - Article
AN - SCOPUS:85108089031
SN - 0743-7315
VL - 156
SP - 119
EP - 130
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
ER -