TY - GEN
T1 - Unsupervised named entity normalization for supporting information fusion for big bridge data analytics
AU - Liu, Kaijian
AU - El-Gohary, Nora
N1 - Publisher Copyright: © Springer International Publishing AG, part of Springer Nature 2018.
PY - 2018
Y1 - 2018
N2 - The large amount of multi-type and multi-source bridge data opens unprecedented opportunities for big data analytics to better predict bridge deterioration. Information fusion is needed prior to the analytics to transform the heterogeneous data from different sources into a unified representation. Resolving the ambiguities in the named entities extracted from bridge inspection reports is one of the most important fusion tasks. The ambiguity stems from the use of different and ambiguous surface forms for the same target named entity. There is, thus, a need for named entity normalization (NEN) methods that can map these ambiguous surface forms to their canonical form – an identifier concept. However, existing NEN methods are limited in this regard, because they mostly require pre-established knowledge (e.g., dictionaries or Wikipedia) and/or training data, and mostly ignore the impact of the normalization on data analytics. To address this need, this paper proposes an unsupervised NEN method. It includes two main components: candidate identifier concept generation based on multi-grams of each named entity set, and candidate identifier concept ranking based on a proposed ranking function. The function uses the TF-IDF (term frequency–inverse document frequency) weight and is further improved by considering the impacts of gram lengths and positions on the ranking. It aims to balance the abstractness and detailedness of the identifier concepts, so as to ensure that the resulting data are neither too dense nor too sparse for the analytics. A set of experiments was conducted to evaluate the performance of the proposed method, which achieved an accuracy of 84.5%.
KW - Big data analytics
KW - Bridge deterioration prediction
KW - Named entity normalization
UR - http://www.scopus.com/inward/record.url?scp=85048955417&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85048955417&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-91638-5_7
DO - 10.1007/978-3-319-91638-5_7
M3 - Conference contribution
AN - SCOPUS:85048955417
SN - 9783319916378
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 130
EP - 149
BT - Advanced Computing Strategies for Engineering - 25th EG-ICE International Workshop 2018, Proceedings
A2 - Smith, Ian F.
A2 - Domer, Bernd
T2 - 25th Workshop of the European Group for Intelligent Computing in Engineering, EG-ICE 2018
Y2 - 10 June 2018 through 13 June 2018
ER -