Project Details
Description
In the era of big data, it is increasingly challenging for domain experts, such as physicians and pharmaceutical scientists, to efficiently retrieve relevant information from different databases to support their decision-making. Knowledge screening, including searching and filtering, based on traditional relational databases may suffer from several problems. For example, keyword-based searching and rule-based filtering can only handle limited types of questions and lack flexibility. Over the past few years, modern deep learning methods have alleviated this problem by automatically translating natural language questions into structured query languages. However, if different types of data such as tabular, textual, and heterogeneous graph data are consolidated into a relational database with many tables, structured queries can become very complex, which poses new challenges to deep learning models. This project will tackle these challenges by designing a new paradigm based on non-relational databases that store data in a more flexible non-tabular form. The new paradigm can easily incorporate reasoning into question-to-query translation, enabling deep learning models to handle more complex questions, which will benefit many domain-specific applications. The project will also promote teaching and mentoring activities, such as developing new courses and training of next generation experts in machine learning, natural language processing, data management, and health informatics. The project outcomes and observations will be open for public use.
The project will forge a new research direction for natural language-driven knowledge screening on non-relational databases. Although there are many well-known, efficient, and scalable non-relational databases and search engines, little effort has been devoted to developing natural language querying methods for them and exploiting their potential. This project aims to fill this gap by designing new underlying frameworks for natural language-based searching and querying, including data consolidation in non-relational databases, reasoning integration in both databases and query templates, and human-in-the-loop model development and evaluations. Two primary research activities will be undertaken based on a popular search engine known as ElasticSearch: (1) The investigator will develop new deep learning models for translating natural language questions into ElasticSearch queries and create new datasets for training and evaluating the models. (2) The investigator will propose a unified approach, standard format, and extensible way to create knowledge “nuggets” to store multi-modal data and develop new question-generation models to automatically generate questions from nested knowledge. The project will produce a variety of outcomes, such as data used for model development, algorithms for model training and inference, and annotation tools used for creating training data. These products will benefit data management and screening, and support decision-making in healthcare, bioinformatics, and scientific research.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
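As a rough illustration of the question-to-query translation described in this abstract, the sketch below pairs one natural language question with an ElasticSearch query body built from a fixed template. It is not the project's method or data: the nugget layout, the field names (`drug_name`, `interactions`, `partner`, `severity`), the placeholder values, and the `build_query` helper are all hypothetical, and the query assumes an index mapping in which `interactions` is a `nested` field and `severity` is a `keyword`. A learned model would presumably produce the slot values, or the whole query, instead of a hand-written template.

```python
import json

# Hypothetical "knowledge nugget": a single document that keeps nested
# records alongside free text. All field names and values are placeholders.
nugget = {
    "drug_name": "drug_a",
    "description": "Free-text description of drug_a.",
    "interactions": [
        {"partner": "drug_b", "severity": "major"},
        {"partner": "drug_c", "severity": "minor"},
    ],
}


def build_query(drug: str, severity: str) -> dict:
    """Fill a query template for questions of the form
    'Which drugs have <severity> interactions with <drug>?'.

    Assumes `interactions` is mapped as a nested field and
    `interactions.severity` as a keyword field (both are assumptions).
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Full-text match on the drug named in the question.
                    {"match": {"drug_name": drug}},
                    # Nested query: the severity filter is evaluated
                    # against each nested interaction object separately.
                    {
                        "nested": {
                            "path": "interactions",
                            "query": {
                                "term": {"interactions.severity": severity}
                            },
                            # Ask for the matching nested objects themselves.
                            "inner_hits": {},
                        }
                    },
                ]
            }
        }
    }


question = "Which drugs have major interactions with drug_a?"
query = build_query(drug="drug_a", severity="major")
print(json.dumps(query, indent=2))
```

The resulting dictionary is plain JSON, so it could be sent as the body of a search request to an index whose mapping matches the assumptions above.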
Status | Active |
---|---|
Effective start/end date | 1/06/23 → 31/05/25 |
Funding
- National Science Foundation