TY - GEN
T1 - Financial Semantic Textual Similarity
T2 - 2024 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics, CIFEr 2024
AU - Yang, Shanshan
AU - Yang, Steve
AU - Mai, Feng
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - We introduce FinSTS, a novel dataset for financial semantic textual similarity (STS), comprising 4,000 sentence pairs from earnings calls and SEC filings. To improve models for the Financial STS task, we propose an active learning (AL) algorithm that efficiently selects informative sentence pairs for annotation by GPT-4 and creates high-quality training data. Using this approach, we train FinSentenceBERT, a model that generates semantic embeddings specifically for financial text. FinSentenceBERT establishes a new performance benchmark on FinSTS, outperforming models that use basic pooling strategies or are fine-tuned on general datasets. Surprisingly, a general SBERT model trained using our AL approach surpasses even models based on FinBERT, a language model pre-trained on financial text. Our research contributes a specialized dataset, model, and methodology that advance semantic understanding in the financial domain, with potential applications to other specialized domains.
KW - Active learning
KW - BERT
KW - Representation learning
KW - Text similarity
UR - http://www.scopus.com/inward/record.url?scp=85215012004&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85215012004&partnerID=8YFLogxK
U2 - 10.1109/CIFER62890.2024.10772793
DO - 10.1109/CIFER62890.2024.10772793
M3 - Conference contribution
AN - SCOPUS:85215012004
T3 - 2024 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics, CIFEr 2024
BT - 2024 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics, CIFEr 2024
Y2 - 22 October 2024 through 23 October 2024
ER -