Financial Semantic Textual Similarity: A New Dataset and Model

Shanshan Yang, Steve Yang, Feng Mai

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

We introduce FinSTS, a novel dataset for financial semantic textual similarity (STS), comprising 4,000 sentence pairs from earnings calls and SEC filings. To improve models for the Financial STS task, we propose an active learning (AL) algorithm that efficiently selects informative sentence pairs for annotation by GPT-4 and creates high-quality training data. Using this approach, we train FinSentenceBERT, a model that generates semantic embeddings specifically for financial text. FinSentenceBERT establishes a new performance benchmark on FinSTS, outperforming models that use basic pooling strategies or are fine-tuned on general datasets. Surprisingly, a general SBERT model trained using our AL approach surpasses even models based on FinBERT, a language model pre-trained on financial text. Our research contributes a specialized dataset, model, and methodology that advance semantic understanding in the financial domain, with potential applications to other specialized domains.
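As a rough illustration of the ideas named in the abstract, the sketch below shows a basic mean-pooling strategy, a cosine-similarity STS score, and one simple way to pick "informative" pairs for annotation. The abstract does not describe the paper's actual selection criterion, so the uncertainty heuristic here (choosing pairs whose score is closest to a midpoint) is an assumption. The function names and the toy vectors, which stand in for real model outputs, are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors element-wise into a single sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_uncertain_pairs(scored_pairs, k, midpoint=0.5):
    """Assumed heuristic: return the k (pair_id, score) items whose score
    lies closest to `midpoint`, i.e. where the current model is least decided."""
    return sorted(scored_pairs, key=lambda item: abs(item[1] - midpoint))[:k]


# Toy token embeddings for two sentences (placeholders, not real model output).
sentence_a = mean_pool([[1.0, 0.0], [0.0, 1.0]])
sentence_b = mean_pool([[1.0, 1.0], [1.0, 1.0]])
score = cosine_similarity(sentence_a, sentence_b)

# Candidate pairs with current similarity scores; pick two to send for labeling.
candidates = [("a", 0.9), ("b", 0.52), ("c", 0.1), ("d", 0.45)]
to_annotate = select_uncertain_pairs(candidates, k=2)
```

In a pipeline like the one the abstract describes, the selected pairs would be sent to the annotator (GPT-4 in the paper) and the labeled results added to the training set before the next round.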

Original language: English
Title of host publication: 2024 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics, CIFEr 2024
ISBN (Electronic): 9798350354836
State: Published - 2024
Event: 2024 IEEE Symposium on Computational Intelligence for Financial Engineering and Economics, CIFEr 2024 - Hoboken, United States
Duration: 22 Oct 2024 to 23 Oct 2024

Keywords

  • Active learning
  • BERT
  • Representation learning
  • Text similarity
