Hierarchical Information Criterion for Variable Abstraction

  • Stevens Institute of Technology

Research output: Contribution to journal > Conference article > peer-review

3 Scopus citations

Abstract

Large biomedical datasets can contain thousands of variables, creating challenges for machine learning tasks such as causal inference and prediction. Feature selection and ranking methods have been developed to reduce the number of variables and determine which are most important. However, in many cases, such as classification from diagnosis codes, ontologies, and controlled vocabularies, we must choose not only which variables to include but also at what level of granularity. ICD-9 codes, for example, are arranged in a hierarchy, and a user must decide at what level the codes should be analyzed. Thus, it is currently up to the researcher to decide whether to use any diagnosis of diabetes or to distinguish between specific forms, such as Type 2 diabetes with renal complications versus without mention of complications. No existing method makes this determination automatically, and feature selection methods do not exploit this hierarchical information, which also arises in other areas, including nutrition (hierarchies of foods) and bioinformatics (hierarchical relationships among genes). To address this, we propose a novel Hierarchical Information Criterion (HIC) that builds on mutual information and allows fully automated abstraction of variables. HIC lets us rank hierarchical features and select those with the highest scores. We show that this significantly improves performance, with an average AUROC gain of 0.053 over traditional feature selection methods and hand-crafted features on two mortality prediction tasks using MIMIC-III ICU data. Our method also improves on the state of the art (Fu et al., 2019), with an AUROC increase from 0.819 to 0.887.
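
The granularity problem described in the abstract can be illustrated with a small sketch. This is not the HIC itself, whose definition the abstract does not give; it only scores candidate abstraction levels with plain empirical mutual information, the quantity HIC is said to build on. The helper names (`mutual_information`, `has_prefix`, `rank_prefixes`), the prefix-based view of the code hierarchy, and the toy patient records are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def has_prefix(codes, prefix):
    """1 if any of the patient's codes falls under the given hierarchy prefix."""
    return int(any(code.startswith(prefix) for code in codes))


def rank_prefixes(patients, prefixes):
    """Rank candidate prefixes (abstraction levels) by MI with the outcome."""
    labels = [label for _, label in patients]
    scores = {
        p: mutual_information([has_prefix(codes, p) for codes, _ in patients], labels)
        for p in prefixes
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up records: (set of ICD-9-style codes, binary outcome).
patients = [
    ({"250.40"}, 1), ({"250.40"}, 1), ({"250.40"}, 1),
    ({"250.00"}, 0), ({"250.00"}, 0), ({"250.00"}, 0),
    ({"401.9"}, 0), ({"401.9"}, 1),
]

# "250" is the coarse level (any diabetes code); "250.4" and "250.0" are finer.
ranking = rank_prefixes(patients, ["250", "250.4", "250.0"])
for prefix, score in ranking:
    print(f"{prefix}: {score:.3f} bits")
```

In this toy data the coarse "any diabetes" indicator carries no information about the outcome, while the finer indicators do, which is the kind of trade-off a hierarchy-aware criterion has to resolve automatically.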

Original language: English
Pages (from-to): 440-460
Number of pages: 21
Journal: Proceedings of Machine Learning Research
Volume: 149
State: Published - 2021
Event: 6th Machine Learning for Healthcare Conference, MLHC 2021 - Virtual, Online
Duration: 6 Aug 2021 to 7 Aug 2021

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being
