TY - GEN
T1 - Natural Language Querying on Domain-Specific NoSQL Database with Large Language Models
AU - Zhang, Wenlong
AU - He, Chengyang
AU - Yang, Guanqun
AU - Bandyopadhyay, Dipankar
AU - Shi, Tian
AU - Wang, Ping
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Efficiently and accurately retrieving specific information from healthcare datasets, such as the Vaccine Adverse Event Reporting System (VAERS), presents significant challenges. A promising solution to this problem is the Text-to-ESQ approach, which is akin to Text-to-SQL tasks but leverages the NoSQL database Elasticsearch to thoroughly explore VAERS data. Non-relational databases are particularly adept at managing complex and dynamic data formats, thereby enabling the extraction of more valuable insights. However, generating executable NoSQL queries is still challenging due to the limited availability of NoSQL query datasets, which constrains model training. One potential remedy involves the use of large language models (LLMs), which can be applied in few-shot and even zero-shot learning scenarios. Nonetheless, the lack of prior evaluation for this novel task, coupled with the absence of a comprehensive, unbiased assessment of existing LLMs and prompting strategies, impedes the development of a robust architecture. Motivated by these challenges, we introduce a new Instruction-Enhanced Explainable (InstructEx) Chain-of-Thought (CoT) prompting strategy that integrates existing CoT prompts, and we conduct a comprehensive investigation of LLMs and CoT prompting. The extensive experimental analysis demonstrates the effectiveness of using LLMs for Text-to-ESQ when combined with InstructExCoT prompting. It also sheds light on the strengths and weaknesses of these methods from multiple perspectives.
KW - Elasticsearch query
KW - Natural language querying
KW - NoSQL
KW - Text-to-ESQ
KW - VAERS
UR - http://www.scopus.com/inward/record.url?scp=85217275696&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85217275696&partnerID=8YFLogxK
U2 - 10.1109/BIBM62325.2024.10822485
DO - 10.1109/BIBM62325.2024.10822485
M3 - Conference contribution
AN - SCOPUS:85217275696
T3 - Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
SP - 5174
EP - 5181
BT - Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
A2 - Cannataro, Mario
A2 - Zheng, Huiru
A2 - Gao, Lin
A2 - Cheng, Jianlin
A2 - de Miranda, Joao Luis
A2 - Zumpano, Ester
A2 - Hu, Xiaohua
A2 - Cho, Young-Rae
A2 - Park, Taesung
T2 - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
Y2 - 3 December 2024 through 6 December 2024
ER -