Data Acquisition: Natural Language Processing
(Introduction to Medical Informatics)
(http://www.cpmc.columbia.edu/edu/textbook)
LAST REVIEWED: 1 October 1997

automated decision-support systems and research databases require coded data
    get data that are already on computers (lab data, ...)
    try to get data from other sources, but unreliable
        agreement between claims data and clinical data (manual extraction)
            0.09 (angina) to 0.83 (diabetes) chance-corrected agreement (poor)
            claims failed to identify 1/2 of important conditions
    methods: NLP, sensors, speech, pen, (see also user interfaces)

NATURAL LANGUAGE PROCESSING (NLP)
    a wealth of data lies locked up in departmental word processor files (CIS)
        untold more is deleted after printing
        might gather even more if it could be coded
    could use manual coding, but expensive and slow
    could collect coded data from users, but resistance exists and this does not address old data
    to convert text reports to coded form is difficult
        synonyms (not just between words but phrases)
            heart is enlarged
            cardiomegaly
            heart is above average size
        qualifiers
            possible cardiomegaly
            heart may be enlarged
            heart is slightly enlarged
            heart is probably not enlarged
        negation
            pneumonia is present
            pneumonia is absent
            no pneumonia
            ""
            lung fields are unremarkable
            pneumonia is the furthest thing from my mind
        ambiguities
            possible worsening infiltrate
                possibility of an infiltrate which is worsening
                possible worsening of a definite infiltrate
        conjunction (and other grouping)
            "evidence of opacities and bullae in the right lung"
                evidence of opacities and (bullae in the right lung)
                evidence of (opacities and bullae) in the right lung
        people do not follow the grammatical rules in textbooks
            they do follow a grammar, however
    let radiologist interpret image and dictate result
        use NLP to obtain coded data from narrative text
        radiologists can enter data naturally
        can still use data for research, automated decisions
    this is more difficult than mere querying (bibliographic
    search)
        more than just finding relevant reports
            yes / no output
            humans can filter result
        but must represent the knowledge in the report
            complex data
            no human filter
    linguistic competence: knowledge of language and domain embedded in a system
        levels of knowledge
            phonetic: constructing words from basic sounds
            morphologic: constructing words from subunits (e.g., friend + ly = friendly)
            syntactic: constructing sentences from words
            semantic: deriving meaning from sentences
            pragmatic: adding meaning from a sentence's context
            world: background, cultural, common sense information that adds additional meaning to a sentence
        examples (ambiguity can occur at each level)
            "Time flies": Syntactic ambiguity (Is "flies" a verb or noun?)
            "Green frogs have large noses": Syntactically and semantically well-formed, but pragmatically ill-formed.
            "Green ideas have large noses": Syntactically well-formed but semantically ill-formed.
            "Have noses green ideas large": Syntactically ill-formed.
    overall NLP transitions:
        tokenization (identify words)
        parsing (identify syntax)
        semantic interpretation (identify logical form or LF)
        contextual interpretation (add pragmatic and world knowledge to LF)
    NLP methods (vary in linguistic competence)
        components
            vocabulary
            grammar (analogous to schema for database)
            parser
        "Mild elevation of the left hemidiaphragm"
        simple keyword lookup
            poor negation, ...
        statistical / corpus-based approach
            match key tokens (single words, bigrams or trigrams) to determine meaning
        semantic (meaning)
            pattern matching
                of
                look for linear combinations of words (like keyword)
                right word class in right order
                extract words into coding form
                specific to domain, does not handle many ways to say one thing
            script-based
                find "hemidiaphragm", tells you what to look for
                select script by looking for keywords
                script describes what patterns to look for
                in effect, like many specific parsers
                specific to domain, does not handle many ways to say one thing
            semantic grammar
                := elevation, ...
                := left, right, upper, lower, ...
                ...
                :=
                :=
                := of
                assign words to relevant semantic classes (disease, bacterium, ...)
                define how these can be combined in a flexible grammar
                specific to domain, large grammar, some sensitivity to the many ways to say one thing
        syntactic (form) parser
                :=
                :=
                ...
            assign words to grammatical classes (noun, verb)
            assemble words -> phrases (np) -> sentences
            often ambiguous without semantics, difficult to code
            "the dog bit the balloon, and it popped" (disambiguation of anaphora: To what does "it" refer?)
        semantic + syntactic systems
            syntactic system to handle complex grammar (conjunctions)
            semantic grammar to code content
        semantic + syntactic + knowledge
            add pragmatic, commonsense, domain knowledge
    today, NLP works best in limited domains (e.g., CXR)
        less vocab, grammar, ambiguity
    input:
        user typing = direct, but training, computer access
        dictate + transcription = easy, but cost, time lag, no immediate feedback
        speech input = not yet feasible for continuous speech, open domain, no training
    research examples: history and physical, echo, discharge summary, case histories on liver disease, radiology reports, coroner's reports, diagnoses, ...

NLP EXAMPLES
    original form:
        Probable mild pulmonary vascular congestion with new left pleural effusion, question mild congestive changes. Elevated left hemidiaphragm, not changed from prior film. Possible pneumothorax.
    coded form:
        pulmonary vascular congestion
            certainty: high
            degree: low
        pleural effusion
            region: left
            status: new
        congestive changes
            certainty: moderate
            degree: low
        elevated hemidiaphragm
            bodyloc: hemidiaphragm (region: left)
            change: no change
            previous_exam: available
        pneumothorax
            certainty: moderate
    pertinent decision-support rule:
        if finding is in ("pneumothorax", "hydropneumothorax")
            and certainty_modifier is not in ("no", "rule out", "cannot evaluate")
            and status_modifier is not in ("resolved")
        then conclude true;
        endif;
    example 2:
        Mild elevation of the left hemidiaphragm. Mild copd. No infiltrate. Cardiomegaly.
        finding: elevated hemidiaphragm|degree: low degree|bodyloc: hemidiaphragm[[region,left]]|
        finding: chronic obstructive pulmonary disease|degree: low degree|
        finding: infiltrate|certainty: no|
        finding: cardiomegaly|
    example 3:
        Pulmonary edema with no significant change since 10/6.
        finding: edema|bodyloc: lung|change: no change|previous_exam: [date,10,6]|
    example 4:
        Clear lungs. Normal sized cardiac silhouette. Possible mediastinal and hilar adenopathy.
        finding: clear lungs|
        finding: normal size of heart|bodyloc: heart|
        finding: adenopathy|bodyloc: mediastinum|certainty: moderate|bodyloc: hilum|
    example 5:
        Rapid interval increase in cardiac silhouette suggesting pericardial effusion. Echocardiogram is suggested to further evaluate this finding. Increased interstitial markings suggestive of either cardiac failure vs lymphangitic spread. Paratracheal widening, right greater than left at the right side widening representing either adenopathy or a distended superior vena cava. Decreased visualization of multiple lung nodules as compared with 6/19/90.
        finding: increase|change_rate: fast|bodyloc: heart|
        finding: pericardial effusion|certainty: moderate|
        finding: echocardiogram|
        finding: unspecified|certainty: moderate|
        finding: interstitial markings|change: increase|
        finding: congestive heart failure|bodyloc: heart|
        finding: lymphangitic spread|bodyloc: heart|certainty: no|
        finding: paratracheal widening|region: right[[region,left]] ||region: right|descriptor: widening|
        finding: adenopathy|certainty: moderate|
        finding: distended superior vena cava|certainty: moderate|

related reading:
    Friedman C, Johnson SB. Medical text processing: past achievements, future directions. In: Ball MJ, Collen MF, editors. Aspects of the computer-based patient record. New York: Springer-Verlag, 1992: 212-28.
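
worked sketch (Python):
    The pertinent decision-support rule from the NLP examples can be tried against the pipe-delimited coded form shown in examples 2 to 5. What follows is only a minimal sketch under stated assumptions, not the system's actual implementation. The function names parse_finding and pneumothorax_rule are hypothetical. Mapping the rule's certainty_modifier and status_modifier onto the "certainty" and "status" attributes is an assumption. The two pneumothorax records are written by hand in the pipe-delimited style rather than copied from the notes.

```python
def parse_finding(line):
    """Parse one pipe-delimited coded finding into {attribute: [values]}.

    Values are kept in lists because an attribute can repeat
    (example 4 has two bodyloc fields on one finding).
    """
    record = {}
    for field in line.split("|"):
        field = field.strip()
        if not field:
            continue  # skip empty fields such as the one after the trailing "|"
        key, _, value = field.partition(":")
        record.setdefault(key.strip(), []).append(value.strip())
    return record


def pneumothorax_rule(record):
    """Hypothetical Python rendering of the pertinent decision-support rule."""
    findings = record.get("finding", [])
    certainty = record.get("certainty", [])  # assumed to hold certainty_modifier
    status = record.get("status", [])        # assumed to hold status_modifier
    return (
        any(f in ("pneumothorax", "hydropneumothorax") for f in findings)
        and not any(c in ("no", "rule out", "cannot evaluate") for c in certainty)
        and not any(s in ("resolved",) for s in status)
    )


# Hand-written records in the pipe-delimited style (not taken from the notes).
possible = parse_finding("finding: pneumothorax|certainty: moderate|")
negated = parse_finding("finding: pneumothorax|certainty: no|")

print(pneumothorax_rule(possible))  # True
print(pneumothorax_rule(negated))   # False
```

    The point of the sketch is that the rule tests coded modifiers rather than raw text, so "Possible pneumothorax" fires the rule while a negated finding does not, which simple keyword lookup cannot distinguish.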