TY - JOUR
T1 - Improved similarity assessment and spectral clustering for unsupervised linking of data extracted from bridge inspection reports
AU - Liu, Kaijian
AU - El-Gohary, Nora
N1 - Publisher Copyright: © 2021
PY - 2022/1
Y1 - 2022/1
N2 - Textual bridge inspection reports are important data sources for supporting data-driven bridge deterioration prediction and maintenance decision making. Information extraction methods are available to extract data/information from these reports to support data-driven analytics. However, directly using the extracted data/information in data analytics is still challenging because, even within the same report, there exist multiple data records that describe the same entity, which increases the dimensionality of the data and adversely affects the performance of the analytics. The first step to address this problem is to link the multiple records that describe the same entity and same type of instances (e.g., all cracks on a specific bridge deck), so that they can be subsequently fused into a single unified representation for dimensionality reduction without information loss. To address this need, this paper proposes a spectral clustering-based method for unsupervised data linking. The method includes: (1) a concept similarity assessment method, which allows for assessing concept similarity even when corpus or semantic information is not available for the application at hand; (2) a record similarity assessment method, which captures and uses similarity assessment dependencies to reduce the number of falsely-linked records; and (3) an improved spectral clustering method, which uses iterative bi-partitioning to better link records in an unsupervised way and to address the transitive closure problem. The proposed data linking method was evaluated in linking records extracted from ten bridge inspection reports. It achieved an average precision, recall, and F-1 measure of 96.2%, 88.3%, and 92.1%, respectively.
KW - Bridges
KW - Data linking/linkage
KW - Deterioration prediction
KW - Maintenance decision making
KW - Similarity assessment
KW - Spectral clustering
KW - Unsupervised machine learning
UR - http://www.scopus.com/inward/record.url?scp=85123825365&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123825365&partnerID=8YFLogxK
U2 - 10.1016/j.aei.2021.101496
DO - 10.1016/j.aei.2021.101496
M3 - Article
AN - SCOPUS:85123825365
SN - 1474-0346
VL - 51
JO - Advanced Engineering Informatics
JF - Advanced Engineering Informatics
M1 - 101496
ER -