Vocabularies, Data Coding and Classification
(Introduction to Medical Informatics)
(http://www.cpmc.columbia.edu/edu/textbook)
LAST REVIEWED: 22 September 1997

VOCABULARY

def: a finite, enumerated set of terms intended to convey information unambiguously
  (go through criteria; English vocabulary breaks all of them)

"vocabulary" vs "terminology" (ASTM):
  vocabulary: dictionary containing the terminology of a subject or related subjects
  terminology: a set of terms representing the system of concepts of a particular field

where would you use vocabularies?
  lab tests, diagnosis codes, ...
  for storage and retrieval; common storage for all users
    clinical database
    epidemiologic database
    bibliographic searches
  dictionary of terms for human use
  natural language processing
  automated decision support
  maintenance of the vocabulary itself

unambiguous retrieval = sensitive, specific, reliable
  sensitive = all the data I want
  specific = only the data I want
  reliable = same data for every query

term: symbolic representation of a single concept (and optionally, some of the concept's attributes)
  components:
  - code (required): a symbol (eg, number, string) that uniquely identifies the concept in the vocabulary
  - name: word or phrase that makes sense to a person
      names vs codes:
        names are easier to write and recognize as distinct
        codes are unambiguous
        similar names can have different meanings (e.g., Cushing disease vs syndrome)
        different names can have the same meaning (e.g., heart attack vs MI)
  - definition: narrative text, name itself, relations & attributes
      often defined by what it is NOT (ICD-9)
      ex: ICD-9 codes 190-199 = MALIGNANT NEOPLASM OF OTHER AND UNSPECIFIED SITES
  - other attributes: can be literal (arbitrary string); categorical (controlled possibilities); semantic (another code = relation)
      intensional knowledge: describe a term itself
      extensional knowledge: describe relations between terms
  - relations to other terms (this is structure), including hierarchy
  - implementation information
    (how to store & query)

CPMC example
  name = urine potassium ion measurement
  code = 1533
  definition = name + relations
  relations:
    specimen = urine specimen
    substance measured = potassium
    part_of = urine electrolytes
  attributes:
    units = mEq/l
  (distinguish relations from attributes: relations have a MED code as a value)

why organize (why add structure)?
  increase power of the vocabulary
  - map terms among sub-vocabularies
  - help a user choose the correct term
  - facilitate its own maintenance
  - be used as a knowledge base for other systems

how to organize the collection of terms
  "flat list" of terms
    okay for small vocabularies
    human being keeps track of relations, ...
    even in alphabetical order, user cannot find term: is it "lung disease" or "disease of the lung"?
    therefore impose structure
  "classification hierarchy"
    organize large number of terms into classes
    go from most general to most specific
    end up with a tree structure, much like an outline
    eg, heart attack is a heart disease, which is a disease
    define parent, child, ancestor, descendant, root, leaf
    facilitates finding terms and adding new ones (maintenance)

EXISTING VOCABULARY EXAMPLES

ICD-9 - International Classification of Diseases, #9
  World Health Org
  for collecting health statistics
  originally for epidemiology
  Clinical Modifications added for clinical coding
  now also for billing
  (strict hierarchy with extensive synonym indexing)
  has name, code, definition, implementation, hierarchy

SNOMED - Systematized Nomenclature of Medicine
  ICD-9 was inadequate for the College of Amer. Pathologists, which wrote SNOP and then SNOMED
  7 axes: topography, morphology, etiology, function, disease, procedure, occupation
  can assemble complex terms from simple ones

MeSH - Medical Subject Headings
  National Library of Medicine
  indexing the medical literature
  loose hierarchy

MED - Medical Entities Dictionary
  CPMC
  made of other vocabularies (ICD-9,MeSH,local,...)
  (directed acyclic graph)

UMLS - Unified Medical Language System
  NLM 1987
  goal: facilitate the use of disparate medical terminologies for accessing a variety of information sources
  components:
    information sources map
    metathesaurus
    semantic network: stores both intensional and extensional knowledge
  note the many different ways of saying the same thing

EXERCISE: CREATE A VOCABULARY

  aspirin ASA tylenol morphine warfarin coumadin anticoagulant
  antipyretic analgesic pain-killer fever-medication
  antibiotic oxacillin penicillin
  respiratory diseases pneumonia emphysema
  lung left lung right lung LLL RLL RML RUL LUL lingula
  heart atrium ventricle
  brain infectious diseases meningitis meninges ventricle
  chem-7 Presbyterian chem-7 Allen chem-7
  laboratory test laboratory chemistry laboratory hematology laboratory battery test
  sodium ion sodium test urine electrolytes

  making semantic relation knowledge explicit
    identify relation
    add concept to vocabulary
    add attributes to concepts
    put value in attributes, linking concepts

PROPERTIES OF A CONTROLLED MEDICAL VOCABULARY

domain completeness
  anticipate all terms in a domain
  not really possible to foresee all needed terms
  result of incompleteness is that you cannot record all the information you have
  practical solutions
    SNOMED permits assembling terms into complex ones
    MeSH uses modifiers to add meanings
    ICD-9 uses "other" (not elsewhere classified)
  but at least the vocabulary's structure should not limit completeness
    ICD-9: 4 levels of depth with 10 nodes per level
    SNOMED: 5 levels with 12 nodes per level

unambiguous
  same term must not have more than one meaning
  ambiguity often occurs because a distinction is not worth making in the vocabulary's domain (but would be in others)
  result of ambiguity is that you may retrieve unwanted data from a database based upon the vocabulary
  "ventricle" as part of heart and part of brain
  MeSH: "cardiac output" as parameter and test
  ICD-9: "other" = "not elsewhere classified"

non-redundant
  each concept must have exactly
  one term (one way to say the same thing)
  difficult with multiple vocabulary authors
  result of redundancy is that you may find only one of two terms for the same concept (and therefore only some of the data in a database)
  eg, if you find MI, myocardial infarction, and heart attack as separate terms
  SNOMED: many ways to assemble terms
    pulmonary tuberculosis D0188 vs lung+granuloma+M TB+fever

synonymy
  allow more than one name for a single term
  easier for users to find a term
  not redundant since still only one code
  ex: MED uses a literal attribute to do this
  eg, let MI, myocardial infarction, and heart attack be synonyms of same term
  lexical variants: synonymous terms that vary only in word order, punctuation, pluralization and the like
    ex: Tetralogy of Fallot vs Fallot's Tetralogy

hierarchical classification (explained above)
  actual "requirement" is ease of use and maintenance
  classification may be seen as one approach
  associative network could be an alternative (like human beings)
  can also add inheritance
    children inherit relations and attributes from parent
    diseases have etiologies, so heart diseases have etiologies

multiple classification
  strict hierarchy (tree) - each term belongs to one class
  but some terms really belong to two classes
    bacterial pneumonia is bacterial AND respiratory
  therefore use directed acyclic graph (DAG)
    directed = parent and child (not equal)
    acyclic = no term is an ancestor of itself
    but terms can have more than one parent (unlike strict)
  multiple hierarchy is DAG with one root
    so same term can have several parents
  SNOMED:
    Pneumococcal pneumonia under bacterial disease
    Clinical pneumonia under respiratory disease
    Staphylococcal pneumonia under morphology

consistency of views
  one way to implement multiple classification is to put the same term in two places (in a strict hierarchy) with a pointer between them
  is the term exactly the same (eg, same children) in each location?
  MeSH: salicylates has aspirin only in some contexts
  could argue that "contexts" (inconsistency of views) are a benefit
  standard terminologies are *replete* with such inconsistencies
  inconsistency is likely when vocabularies are written by committees

explicit relationships (semantic network)
  what does the parent-child relationship signify?
    is_a = "is a type of" (ulnar nerve -> nerve)
    part_of = "is a part of" (ulnar nerve -> arm)
    etiologic = causes/caused_by
  even part_of can be division_of (lobe of lung) vs component_of (nerve in lung)
  SNOMED: is_a, is_part_of, is_made_of, causes, is_in
  MeSH: is_a, part_of, associated_with, equivalent_to
  query: types_of vs causes_of pulmonary disease
  inheritance: lung gets emphysema; so lobe of lung gets emphysema; but nerve of lung does not get emphysema
  therefore want to define each relation explicitly (can then choose which relations use inheritance)

ISSUES

should the codes contain semantic information?
  the code is the unique symbol for a concept
  usually safest to let codes contain no information
    can change the term's name, relations, location without changing the code (as long as the concept has not changed)
    eg, if pneumonia X is first thought to be infectious, then found to be autoimmune, need to change the hierarchy, but not the concept
  if code is term name, then obviously must contain info
  many put the hierarchical path into code
    3:45:3 = disease:heart disease:MI
    better efficiency for queries
    eg, ICD-9 is hierarchy of digits
    but DAG will have several paths to one term
  also, ICD-9 and SNOMED put info in last digit
  problem = maintenance: changing hierarchy requires changing code

structure and content of vocabulary depends on use
  ICD-9 to classify diseases, not findings
    complete coverage of TB, important in epidemiology
  MeSH to index literature, not treat patients

what does a non-leaf node mean?
  class of terms, but not a term itself
  "not otherwise specified"
  "not elsewhere classified" = other
  "all children"

maintenance functions
  add a new term (must fulfill above requirements)
  change a term
  merge two vocabularies
  delete a term (is it possible if data are stored?)

use vocabulary to help maintain itself
  redundancy: look for redundant relations and attributes
  classification: pick logical classes from attributes
  ambiguity: force term to fill in a set of attributes

related reading:
  Cimino JJ, Clayton PD, Hripcsak G, Johnson SB. Knowledge-based approaches to the maintenance of a large controlled medical terminology. J Am Med Informatics Assoc 1994;1:35-50.
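
illustrative sketch (not part of the MED or any real vocabulary):
  The notes above describe codes that carry no meaning, synonyms stored as literal attributes, multiple classification in a DAG, and using relations to look for redundant terms. A minimal Python sketch of how these might fit together follows. The class name Vocabulary, its method names, the integer codes 1-9, and the term "urinary K+ test" are all hypothetical, chosen only for illustration. The redundancy check is a simple assumption: two terms whose non-hierarchical relations are identical are flagged for human review.

```python
from collections import defaultdict


class Vocabulary:
    """Toy controlled vocabulary (hypothetical design, for illustration only)."""

    def __init__(self):
        self.names = {}                    # code -> preferred name
        self.synonyms = defaultdict(set)   # code -> other names (literal attribute)
        self.relations = defaultdict(set)  # code -> {(relation, target code)}

    def add_term(self, code, name, synonyms=()):
        # Codes are arbitrary symbols; one code per concept.
        if code in self.names:
            raise ValueError(f"code {code} already in use")
        self.names[code] = name
        self.synonyms[code].update(synonyms)

    def relate(self, source, relation, target):
        # Keep the is_a graph acyclic: no term may become its own ancestor.
        if relation == "is_a" and (source == target or source in self.ancestors(target)):
            raise ValueError("is_a link would create a cycle")
        self.relations[source].add((relation, target))

    def parents(self, code):
        return {t for r, t in self.relations[code] if r == "is_a"}

    def ancestors(self, code):
        # Walk every is_a path upward; a DAG may offer several paths.
        seen, stack = set(), list(self.parents(code))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents(current))
        return seen

    def lookup(self, text):
        # Names and synonyms all resolve to the same single code.
        key = text.lower()
        return {c for c, n in self.names.items()
                if n.lower() == key or key in {s.lower() for s in self.synonyms[c]}}

    def possible_duplicates(self):
        # Group terms with identical non-hierarchical relations for review.
        groups = defaultdict(list)
        for code in self.names:
            rels = frozenset((r, t) for r, t in self.relations[code] if r != "is_a")
            if rels:
                groups[rels].append(code)
        return [sorted(g) for g in groups.values() if len(g) > 1]


med = Vocabulary()
med.add_term(1, "disease")
med.add_term(2, "bacterial disease")
med.add_term(3, "respiratory disease")
med.add_term(4, "bacterial pneumonia")
med.add_term(5, "myocardial infarction", synonyms=["MI", "heart attack"])
med.add_term(6, "urine specimen")
med.add_term(7, "potassium")
med.add_term(8, "urine potassium ion measurement")
med.add_term(9, "urinary K+ test")  # hypothetical accidental duplicate of 8

for child, parent in [(2, 1), (3, 1), (4, 2), (4, 3)]:
    med.relate(child, "is_a", parent)  # term 4 gets two parents
for code in (8, 9):
    med.relate(code, "specimen", 6)
    med.relate(code, "substance_measured", 7)

print(sorted(med.ancestors(4)))    # [1, 2, 3]
print(med.lookup("heart attack"))  # {5}
print(med.possible_duplicates())   # [[8, 9]]
```

  Because the codes carry no hierarchical path, reclassifying a term only changes its is_a links and never its code, which is the maintenance argument made under ISSUES.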