TY - JOUR
T1 - RPT
T2 - Toward Transferable Model on Heterogeneous Researcher Data via Pre-Training
AU - Qiao, Ziyue
AU - Fu, Yanjie
AU - Wang, Pengyang
AU - Xiao, Meng
AU - Ning, Zhiyuan
AU - Zhang, Denghui
AU - Du, Yi
AU - Zhou, Yuanchun
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2023/2/1
Y1 - 2023/2/1
N2 - With the growth of academic search engines, the mining and analysis of massive researcher data, such as collaborator recommendation and researcher retrieval, have become indispensable for improving the quality and intelligence of services. However, most existing studies of researcher data mining focus on a single task for a particular application scenario and learn a task-specific model, which usually cannot transfer to out-of-scope tasks. In this paper, we propose a multi-task self-supervised learning-based researcher data pre-training model named RPT, which can efficiently accomplish multiple researcher data mining tasks. Specifically, we divide researcher data into semantic document sets and a community graph. We design a hierarchical Transformer and a local community encoder to capture information from these two categories of data, respectively. Then, we propose three self-supervised learning objectives to train the whole model. For RPT's main task, we leverage contrastive learning to discriminate whether the two kinds of captured information belong to the same researcher. In addition, two auxiliary tasks, hierarchical masked language modeling and community relation prediction, are integrated to extract semantic and community information and improve pre-training. Finally, we propose two transfer modes of RPT for fine-tuning in different scenarios. We conduct extensive experiments to evaluate RPT; results on three downstream tasks verify the effectiveness of pre-training for researcher data mining.
AB - With the growth of academic search engines, the mining and analysis of massive researcher data, such as collaborator recommendation and researcher retrieval, have become indispensable for improving the quality and intelligence of services. However, most existing studies of researcher data mining focus on a single task for a particular application scenario and learn a task-specific model, which usually cannot transfer to out-of-scope tasks. In this paper, we propose a multi-task self-supervised learning-based researcher data pre-training model named RPT, which can efficiently accomplish multiple researcher data mining tasks. Specifically, we divide researcher data into semantic document sets and a community graph. We design a hierarchical Transformer and a local community encoder to capture information from these two categories of data, respectively. Then, we propose three self-supervised learning objectives to train the whole model. For RPT's main task, we leverage contrastive learning to discriminate whether the two kinds of captured information belong to the same researcher. In addition, two auxiliary tasks, hierarchical masked language modeling and community relation prediction, are integrated to extract semantic and community information and improve pre-training. Finally, we propose two transfer modes of RPT for fine-tuning in different scenarios. We conduct extensive experiments to evaluate RPT; results on three downstream tasks verify the effectiveness of pre-training for researcher data mining.
KW - Pre-training
KW - contrastive learning
KW - graph representation learning
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85125335148&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125335148&partnerID=8YFLogxK
U2 - 10.1109/TBDATA.2022.3152386
DO - 10.1109/TBDATA.2022.3152386
M3 - Article
AN - SCOPUS:85125335148
VL - 9
SP - 186
EP - 199
JO - IEEE Transactions on Big Data
JF - IEEE Transactions on Big Data
IS - 1
ER -