CAREER: Learning from Observational Data with Knowledge

Project: Research project

Project Details

Description

Large observational datasets from social networks, climatology, finance, and other areas have made it possible for researchers to test complex hypotheses that previous studies would have been under-powered to tackle. This is especially true in biology and health, with the proliferation of new methods for gathering long-term population data, such as from electronic medical records, and real-world health data from body-worn sensors. However, the number of complex hypotheses that can be tested in datasets with hundreds or thousands of variables far surpasses what humans can propose and reason about. Exhaustively testing all possible relationships is not computationally feasible, and after this testing a researcher must still examine a non-trivial number of seemingly significant findings to determine which still need to be validated experimentally. This project aims specifically to infer causal relationships, as these provide insight into not only how a system behaves, but also why it behaves as it does, enabling the development of successful interventions. Results from this work will be incorporated into education at three levels (high school, undergraduate, and graduate) through university courses and summer programs for high school students. In addition to communicating the core concepts of causal inference, the summer programs will also introduce potential computer scientists to key areas of computer science research. Applications of the methods developed to data from stroke and diabetes may lead to new knowledge about the physiologic processes underlying recovery in stroke, and the complex interaction of factors affecting glucose in people with diabetes.

This work will lead to more robust and efficient inference of causal relationships from large-scale datasets, through a feedback loop between experiments and prior knowledge. Current approaches require users to specify the set of variables and hypotheses to be tested, which limits findings to the set a user chose to explore. Instead, this work will develop methods that can use prior knowledge in the form of causal relationships as well as prior experimental results to constrain what will be tested and generate new hypotheses. Causes provide information about their effects that is not contained in other variables, so this work will develop measures of how explanatory a cause is and how much information it yields, and use changes in these measures to guide generation of complex relationships in the constrained hypothesis space. The proposed approach differs from stochastic heuristics in that the new method will be deterministic, and will evaluate relationships individually, thus addressing the computational challenge and reducing the impact of incorrect inference. In addition, the work will lead to algorithms that can automatically evaluate how findings relate to prior knowledge, whether they are, for example, consistent, novel, or contradictory. This will allow researchers to focus in more depth on findings likely to be significant or interesting, rather than those that simply confirm prior knowledge. It also provides a feedback loop between knowledge and inference.
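As a rough illustration of relating findings to prior knowledge, the sketch below labels an inferred relationship as consistent, novel, or contradictory. It is not the project's algorithm: the signed-edge representation, the function name classify_finding, and the example variables are all assumptions made for illustration only.

```python
from typing import Dict, Tuple

# Hypothetical representation (an assumption, not from the project):
# prior knowledge maps a (cause, effect) pair to the sign of the effect,
# +1 for "increases" and -1 for "decreases".
Edge = Tuple[str, str]


def classify_finding(prior: Dict[Edge, int], cause: str, effect: str, sign: int) -> str:
    """Label an inferred causal relationship relative to prior knowledge."""
    known_sign = prior.get((cause, effect))
    if known_sign is None:
        # Nothing is recorded for this cause/effect pair.
        return "novel"
    # Same pair is known: compare the direction of the effect.
    return "consistent" if known_sign == sign else "contradictory"


# Made-up toy entries, used only to exercise the function.
prior_knowledge = {
    ("insulin_dose", "glucose"): -1,
    ("meal_carbs", "glucose"): +1,
}

for finding in [
    ("meal_carbs", "glucose", +1),
    ("insulin_dose", "glucose", +1),
    ("exercise", "glucose", -1),
]:
    print(finding, classify_finding(prior_knowledge, *finding))
```

Under this toy representation, only the novel and contradictory findings would be flagged for closer review, and a validated finding could be added back to the dictionary, which is one simple way to picture the feedback loop between knowledge and inference.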

Status: Finished
Effective start/end date: 1/05/14 to 30/04/20

Funding

  • National Science Foundation
