AuditBench: A Benchmark for Large Language Models in Financial Statement Auditing

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Financial statement auditing is essential for stakeholders to understand a company’s financial health, yet current manual processes are inefficient and error-prone. Even with extensive verification procedures, auditors frequently miss errors, leading to inaccurate financial statements that fail to meet stakeholder expectations for transparency and reliability. To this end, we harness large language models (LLMs) to automate financial statement auditing and rigorously assess their capabilities, providing insights into their performance boundaries in automated auditing scenarios. Our work introduces a comprehensive benchmark built on a curated dataset that combines real-world financial tables with synthesized transaction data. Within the benchmark, we develop a rigorous five-stage evaluation framework to assess LLMs’ auditing capabilities. The benchmark also challenges models to map specific financial statement errors to corresponding violations of accounting standards, simulating real-world auditing scenarios through test cases. Our testing reveals that current state-of-the-art LLMs successfully identify financial statement errors when given historical transaction data. However, these models demonstrate significant limitations in explaining detected errors and citing the relevant accounting standards. Furthermore, LLMs struggle to execute complete audits and to make the necessary financial statement revisions. These findings highlight a critical gap in LLMs’ domain-specific accounting knowledge, and future research must focus on enhancing LLMs’ understanding of auditing principles and procedures. Our benchmark and evaluation framework establish a foundation for developing more effective automated auditing tools that can substantially improve the accuracy and efficiency of real-world financial statement auditing.

Original language: English
Title of host publication: AI for Research and Scalable, Efficient Systems - Second International Workshop, AI4Research 2025, and First International Workshop, SEAS 2025, Held in Conjunction with AAAI 2025, Proceedings
Editors: Qingyun Wang, Wenpeng Yin, Abhishek Aich, Yumin Suh, Kuan-Chuan Peng
Pages: 59-81
Number of pages: 23
DOIs
State: Published - 2025
Event: 2nd AI4Research Workshop: Towards a Knowledge-Grounded Scientific Research Lifecycle, AI4Research 2025, and 1st Workshop on Scalable and Efficient Artificial Intelligence Systems, SEAS 2025, held in conjunction with the 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 - 4 Mar 2025

Publication series

Name: Communications in Computer and Information Science
Volume: 2533 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 2nd AI4Research Workshop: Towards a Knowledge-Grounded Scientific Research Lifecycle, AI4Research 2025, and 1st Workshop on Scalable and Efficient Artificial Intelligence Systems, SEAS 2025, held in conjunction with the 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Country/Territory: United States
City: Philadelphia
Period: 25/02/25 - 4/03/25

Keywords

  • Automated Auditing
  • Error Detection
  • Financial Statement Auditing
  • Large Language Models (LLMs)
