Columbia University Department of Biomedical Informatics


Discovering and Applying Knowledge in Clinical Databases


The long-term goal of our ongoing project, “Discovering and applying knowledge in clinical databases,” is to learn from data in the electronic health record (EHR) and to apply that knowledge to relevant problems. The increasing adoption of the EHR promises to provide data for clinical research and informatics research, but secondary use of the data has been limited. Challenges include the complexity, incompleteness, and inaccuracy of the record. Our current focus is to study the EHR from an information-theoretic point of view, treating the EHR as a natural object worthy of study, and applying methods from non-linear time series analysis. Armed with a better understanding of the record, we hope to measure and account for data completeness and to improve interpretation and use of the data. We hypothesize that we can characterize an electronic health record using a formal information-theoretic framework, and that the measured properties can help answer informatics and clinical questions.


The Team


George Hripcsak, MD, MS

Vivian Beaumont Allen Professor and Chair, Biomedical Informatics

David Madigan, PhD

Professor and Chair, Department of Statistics

Carol Friedman, PhD

Professor of Biomedical Informatics

Colin Walsh, MD

Postdoctoral fellow

David Albers, PhD

Associate Research Scientist of Biomedical Informatics




Funded by a grant from the National Library of Medicine, “Discovering and applying knowledge in clinical databases” (R01 LM006910).


Project publications


Project software


Current directions


1. Development of an information-theoretic approach to understanding electronic health record data



One of the challenges of electronic health record data is that the data are sampled irregularly, and usually when patients are ill, producing biased retrospective experiments. We find that, after some lag, patients are sampled more frequently when they are ill, and that the sampling rate then drops off as the disease resolves.


In a study of serum glucose, predictability (how well past values predict a future value) is quantified as mutual information. Tau is the number of glucose measurements between any pair of values, and delta-t is the actual time between measurements. The analysis reveals the diurnal variation of glucose, as evidenced by ridges at 24-hour intervals.
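As a rough illustration of quantifying predictability as mutual information on irregularly sampled data, the following minimal sketch pairs up measurements separated by approximately a chosen delta-t and estimates their mutual information from a two-dimensional histogram. The function names, the tolerance window, and the fixed-bin histogram estimator are assumptions made for this example; they are not taken from the project's publications or software, and the sketch indexes pairs by delta-t only, not by tau.

```python
import numpy as np


def delayed_pairs(times, values, delay, tolerance):
    """Return (earlier, later) values whose time gap is within
    `tolerance` of `delay`. Works on irregularly sampled series.
    Hypothetical helper for illustration only."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    # gaps[i, j] = t_j - t_i for every ordered pair of measurements
    gaps = times[None, :] - times[:, None]
    i, j = np.nonzero(np.abs(gaps - delay) <= tolerance)
    return values[i], values[j]


def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of mutual information in bits.
    The bin count is an arbitrary choice; the estimate is biased
    upward when there are few pairs per bin."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size == 0:
        raise ValueError("no pairs available at this delay")
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()            # joint distribution
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nz = pxy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

Calling `delayed_pairs` over a range of delays and passing each result to `mutual_information` would give one predictability estimate per delay. The pairwise gap matrix grows with the square of the number of measurements, so this form is only suitable for short series.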




We have extended non-linear time series analysis using mutual information to quantify predictability, applying it to sparse and irregularly sampled time series. In Chaos, Solitons & Fractals, we study its properties.



Electronic health records comprise the data of many patients. Aggregating across patients brings its own challenges, which are presented in Chaos.




2. Properties of electronic health records and the effects of health care processes



In JAMIA, we studied pairwise correlations in the electronic health record using lagged linear correlation on a 3.7-million-patient, 24-year database. We found several types of associations. Definitional associations included low blood potassium preceding “hypokalemia.” Low potassium preceding the drug spironolactone exemplified an intentional association, and high potassium following spironolactone exemplified a physiologic one. Counterintuitive results, such as diseases appearing to follow their effects, may be due to the workflow of health care, in which clinical findings precede the clinician’s diagnosis of a disease even though the disease actually preceded the findings.
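The idea of lagged linear correlation can be sketched in a few lines. The example below assumes that two variables have already been aligned on a regular time grid (for example, one value per day), which is a simplification chosen for illustration and not a description of how the study prepared its data. The function names are hypothetical.

```python
import numpy as np


def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag].
    A positive lag asks whether y follows x by `lag` steps;
    a negative lag asks whether y precedes x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if lag > 0:
        a, b = x[:-lag], y[lag:]
    elif lag < 0:
        a, b = x[-lag:], y[:lag]
    else:
        a, b = x, y
    return float(np.corrcoef(a, b)[0, 1])


def correlation_profile(x, y, max_lag):
    """Correlation at every integer lag from -max_lag to +max_lag,
    returned as a list of (lag, r) tuples."""
    return [(lag, lagged_correlation(x, y, lag))
            for lag in range(-max_lag, max_lag + 1)]
```

In a profile like this, a peak at a positive lag would suggest that the second variable tends to follow the first, and a peak at a negative lag that it tends to precede it. As the paragraph above notes, the direction seen in the record may reflect health care workflow instead of the underlying physiology.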



In JAMIA, we discuss the effects of health care processes on electronic health record data and their likely effects on studies that use such data. We put forward a framework for addressing these biases by studying the electronic health record as an object of interest in itself and by creating a model of health care processes.




3. Population physiology



In PLOS ONE, we studied glucose control in populations of patients using electronic health record data. We showed that we could validate a physiologic glucose model using time-delayed mutual information even though the raw data values had too much noise for the same task.



In Annals of Neurology, we study a population of patients with subarachnoid hemorrhage to better understand whether secondary seizures cause further damage or are simply harmless passengers. Seizures do in fact appear to be associated with outcomes.



In Physics Letters A, we demonstrate our statistical dynamics approach to study physiology at the population scale. We confirm diurnal variation in creatinine data.