Information Retrieval
(Introduction to Medical Informatics)
(http://www.cpmc.columbia.edu/edu/textbook)
LAST REVIEWED: 12 November 1996
narrative text - huge collection of text, want to find the
relevant parts. Usually generated by many persons, many
departments, .... No one really knows what is there,
what is relevant, how to get it.
INFORMATION RETRIEVAL (IR)
the name is usually applied to narrative text, but could
refer to coded data
goal is to retrieve information from:
information sources like bibliographic DB
electronic medical record for clinical care
clinical research database
eg biomedical literature
20,000 journals published; 17,000 new books per year
size doubles every 10-15 years for 300 years
1879 Index Medicus: subject and author listings
1964 MEDLARS = Medical Literature Analysis and Retrieval System
1966 MEDLINE: computerized database of journals
each paper categorized by subject
subject vocabulary: MeSH (now UMLS)
1975: began to include abstracts
Others: CINAHL, PsycINFO, HealthSTAR
--Measuring IR
1. recall
am I getting the information I want?
also known as sensitivity
let TP (true positive) = number of items I retrieved
correctly
recall = TP / (total # items I wanted)
2. precision
am I getting extra information I do not want?
also known as predictive value positive
let FP (false positive) = number of items I retrieved
that I did not really want
precision = TP / (total # items retrieved)
= TP / (TP + FP)
(a) must define "items" for each query, eg:
bibliographic: number of literature references
clinical: number of lab results retrieved
research: number of patients found to satisfy criteria
(b) example:
I want articles on aspirin causing renal failure
there are 100 relevant papers in the literature
bibliographic search for "aspirin" and "renal failure"
I retrieve:
80 papers on aspirin causing renal failure
(some are missed due to synonyms: "kidney")
120 papers on other topics
(eg, aspirin dosage in renal failure)
TP = 80
FP = 120
recall = 80 / 100 = 0.8
precision = 80 / (80+120) = 0.4
the goal is to get the highest recall and precision possible
there is a tradeoff between the two
broader search
higher recall
lower precision
narrower search
lower recall
higher precision
METHODS FOR NARRATIVE TEXT
assume there is a large collection of documents
you want to select the ones that are relevant to you
"document" (show example)
a document contains a passage of narrative text
each document counts as one "item" as defined above
examples of a documents in a collection:
a paper published in the literature
an article in a newspaper
a radiology report in a medical record
a paragraph in a larger document
1. KEYWORD SEARCH
manually assign to each doc set of keywords that identify it
("indexing" or "coding")
keywords are chosen from a standard vocabulary
perform Boolean search on the keywords
and, or, not
MEDLINE
6 million articles from 3500 journals, hand indexed
keyword vocabulary is MeSH, a hierarchy of terms
can explode = query by class
can add modifiers, roles, relations (diagnosis of)
eg, in MEDLINE (using MeSH vocabulary) search for
"aspirin" and "kidney-failure" (show example)
assessment
very efficient to do the query (only code once)
but laborious to assign keywords in first place (15
minutes, $4/ art.)
must phrase query in terms of vocabulary
ie, must find the correct synonym to use
vocabulary may have different focus than the searcher
coders may not be not consistent (61% if critical, 34%
if main/sub heading)
as vocabulary expands, old documents are not re-coded
2. FREE TEXT SEARCH
look for the occurence of words or phrases anywhere in a
document
instead of coder figuring out ahead of time, user codes
it
query techniques =>
1. Boolean: and, or, not
("renal insufficiency" or "kidney failure")
and ("aspirin" or "ASA")
2. vicinity: two phrases must occur one after the other
(Adj), in same sentence (With), in same paragraph
(Same), or just in same document (And)
"renal" with "insufficiency"
3. stemming: often suffix does not matter (or even prefix)
bacteria* = bacteria, bacterial, ...
4. sections: ask about particular sections of a document
if it is formatted (title, abstract)
if word found in title, probably more pertinent
ex: ("renal failure" or "aspirin").
5. index file:
inverted index of words for speed
made of list of words with pointers to documents
eg, BRS free text search engine
can use MEDLINE database, but allows free text
searching on abstracts, rather than just keyword search
assessment of free-text search
can tailor query to the searcher's needs
must state synonyms explicitly (usually miss documents)
ie, harder than keyword method
must state ALL possible synonyms
never know if miss one
overall: precise but poor recall
3. SCORING ALGORITHMS
a. Basic Method
create a matrix showing how many times a given word appears
in each document ("frequency")
drop unhelpful common words like (the, and, or)
words are pre-stemmed
doc# kidney renal failure insuffic* heart aspirin mg...
-------------------------------------------------------
1 0 6 0 2 0 2 3
2 6 0 0 1 0 4 20
3 0 0 2 0 20 7 20
...
where
doc 1 is the relevant example listed below
doc 2 is another relevant document
doc 3 is about heart failure and aspirin therapy
calculate a score
algorithm based upon matching words in query to words
in the document, weighted by the frequency for that
document
then rank documents in descending order
helps when 100s of documents are retrieved
eg, (asprin, kidney, failure)
for each word in the query that matches a word in the
document, add its frequency to the score
doc 2 = 10
doc 3 = 9
doc 1 = 2
b. Vicinity - include word order, sentence # in matrix for
vicinity checks
"phrase" is a subset of this
eg, could reduce score of doc 3 by specifying that "failure"
must come after "kidney", so "heart failure" will not
be counted
c. Discrimination
best words are those that appear frequently in desired
reports but not in most others
words that occur commonly in all reports ("medicine") do not
discriminate among documents, but they make up a large
part of the score
(more important, they increase the variance of the
score, which washes out the effect of the better words)
low frequency words not that useful, since they do not
occur in documents
medium frequency words are the most helpful
cannot eliminate all common words (continuum of commonness)
therefore weight words by discriminating ability rather than
by raw frequency
weight proportional to frequency in this document
weight inversely proportional to frequency in others
eg, "mg" (milligram) appears in many documents, and does not
add much information; therefore its affect on the
overall score should be small
variation: inverse document frequency
d. Thesaurus
rather than using words found in document (thus suffering
from the synonym problem), match words to a preferred
synonym found in a thesaurus (vocabulary)
eg, UMLS, which contains many vocabularies (MeSH, ICD-9,
...)
e. Clustering
study the collection of documents and create word clusters
statistically, rather than using a thesaurus
thus synonyms are based on the relevant body of
knowledge, not an outside thesaurus
find where a word commonly occurs, and then look for other
words that occur in the same places
renal "insufficiency"
renal "failure"
"insufficiency" =? "failure"
eg, if "kidney" and "failure" are the names of clusters
doc# kidney failure heart aspirin mg...
---------------------------------------
1 6 4 0 2 3
2 6 1 0 4 20
3 0 2 20 7 20
...
doc 1 = 12
doc 2 = 11
doc 3 = 9
f. Natural Language Processing (NLP)
rather than a matrix of words, use higher level concepts
use NLP to parse the syntactic structure, in order to find
groups of related words, such as noun phrases (this
sort of NLP is relatively easy)
eg, want to find "kidney failure"
document about heart failure that mentions kidney
excretion of a heart medication would match
inappropriately
the concept "kidney failure" would not have matched
have found that syntax alone is not as good as simple
frequency analysis,
but with semantic information might be okay
g. N-grams - co-occurence of words
4. OTHER APPROACHES
a. Factor Analysis
create a small set of factors that categorize the reports
(can be done by hand or automatically)
query consists of specifying weights or values for factors
b. Neural Network
associative network - given query match words to documents
but large number of documents, how to train
c. Feedback
show the searcher the initial results, asking which
documents are relevant, then re-do the query
automatically taking answer into account
existing system hybrid of scoring algorithm + neural network
USER SYSTEMS
MEDLINE and BRS and others exist for some time but:
still need intermediary to make a reasonable search
until recently, 85% searches done by intermediary
usually the librarian
need to know syntax; Boolean searches are an art
but then cannot get information in real time
eg, drug interaction info at bedside
GRATEFUL MED: PC-based program to guide user through form
screens
to perform a structured search on MEDLINE
uses both keyword and free text searches on MEDLINE
helps user exploit structure of MeSH (parent-child)
feedback: mark relevance of each retrieved article; it
suggests a better search
assembles search, then dials in to NLM
MEDLINE survey
4.5 million searches/year
users: .5 MD, .25 PhD
.5 academic, .4 clinical
not very experienced
relatively young
purpose: .46 research, .38 clinical
per user: 4 search /month
.7 using it less than one year
satisfaction: .75 satisfied
.5 good #, .25 too many, .15 too few articles
.7 use MeSH
WAIS (Wide Area Information Servers)
protocol for coordinating disparate information sources
first find the proper source (set of servers)
then find relevant part within the source
uses heuristic scoring algorithm
try it: logon with id "wais" to "quake.think.com"
similar ideas
gopher: (run "gopher" on "cunixf")
anonymous ftp
Medline DB with BRS search engine
"aspirin" and "kidney failure"
BI PaperChase
RETRIEVAL EXAMPLE
Free Text Area:
TI Renovascular effects of nonprescription ibuprofen in elderly
hypertensive patients with mild renal impairment.
AB To determine the renovascular effects of nonprescription ibuprofen in
the maximum labeled over-the-counter (OTC) dosage for 7 days, and to
compare these effects with those of two other available OTC
analgesics, aspirin and acetaminophen, we evaluated 25 elderly
patients with mild thiazide-treated hypertension and mild renal
insufficiency. Under double-blind conditions, patients were randomly
allocated to one of three treatment groups: ibuprofen 400 mg 3
times/day, aspirin 650 mg 3 times/day, or acetaminophen 650 mg 3
times/day. Blood pressure and indexes of renal function (blood urea
nitrogen, creatinine clearance, serum electrolytes) were measured
over 7 days in a clinical research center. None of the treatments
had a clinically significant effect on blood pressure. Renal
function indexes also remained unchanged during all three treatments.
We conclude that elderly patients with mild thiazide-treated
hypertension and mild renal insufficiency seem not to be at risk of
developing additional renal compromise or of having their
hypertension control diminished by treatment with these OTC
analgesics for 7 days. Author-abstract.
Keyword Area:
MJ HYPERTENSION-RENAL: complications (co). IBUPROFEN: pharmacology
(pd). KIDNEY: drug-effects (de). KIDNEY-FAILURE: complications
(co).
MN ACETAMINOPHEN: pharmacology (pd). AGED. AGED-80-AND-OVER. ASPIRIN:
pharmacology (pd). BLOOD-PRESSURE: drug-effects (de). BODY-WEIGHT.
COMPARATIVE-STUDY. DOUBLE-BLIND-METHOD. DRUGS-NON-PRESCRIPTION:
pharmacology (pd). FEMALE. HUMAN. HYPERTENSION-RENAL: drug-therapy
(dt), physiopathology (pp). KIDNEY-FAILURE: physiopathology (pp).
MALE. MIDDLE-AGE. RENAL-CIRCULATION: drug-effects (de).
SUPPORT-NON-U-S-GOVT.
Where:
ASPIRIN (in MJ, MN) = keyword search match
aspirin (in TI, AB) = free text search match
renal impairment (in TI, AB) = free text search failure due
to synonyms
related reading:
Hersh WR, Greenes RA. Information retrieval in
medicine: state of the art. M.D. Computing
1990;7(5):302-11.
Wallingford KT, Humphreys BL, Selinger NE, Siegel ER.
Bibliographic retrieval: a survey of individual users
of Medline. M.D. Computing 1990;7(3):166-71.