Programs to Pre-Process Notes.
EHR notes that are entered directly by health care providers, rather than by transcriptionists, generally require special preprocessing before NLP systems can handle them well. The problems stem from irregular sentence endings and irregularly formatted sections. Providers frequently use line breaks and itemized lists in place of the usual sentence-end markers (e.g., '.', ';'). In addition, section headers are often abbreviated and irregularly demarcated.
The preprocessing task is carried out in three steps.
- Collect existing section headers from a large corpus of notes to create a list of known section headers.
- Run the TestUnit program, which determines whether each end of line is a sentence boundary and recognizes sections.
Usage: java TestUnit "section list" clinicalnote.txt
Download: TestUnit.tar
- Remove unnecessary tags to prepare notes for MedLEE processing.
Usage: perl fix_sections.pl "section list" "output from step2"
Download the tag-removal program: fix_sections.pl
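Step 1 (collecting known section headers from a corpus) is not covered by the downloadable programs, so the following is only a minimal sketch of how such a list might be gathered. It assumes that a candidate header is a short run of text at the start of a line followed by a colon, and that a header is worth keeping once it appears in several notes. The function name `collect_section_headers`, the pattern `HEADER_RE`, and the `min_count` threshold are hypothetical and are not part of TestUnit or fix_sections.pl.

```python
import re
from collections import Counter

# Assumed shape of a header: up to ~40 letters/spaces at line start, then a colon.
HEADER_RE = re.compile(r"^\s*([A-Za-z][A-Za-z /&-]{0,40}?)\s*:")


def collect_section_headers(notes, min_count=2):
    """Return candidate headers that occur in at least `min_count` notes."""
    counts = Counter()
    for note in notes:
        seen = set()  # count each header once per note
        for line in note.splitlines():
            match = HEADER_RE.match(line)
            if match:
                seen.add(match.group(1).strip().lower())
        counts.update(seen)
    return sorted(h for h, c in counts.items() if c >= min_count)


if __name__ == "__main__":
    corpus = [
        "problem list:\nhypertension\nMeds:\nHCTZ 25 mg po qday",
        "Meds:\nsenna 1-2 tabs po qhs\nPlan: follow up",
    ]
    print(collect_section_headers(corpus))
```

The candidates produced this way would still need manual review before being used as the "section list".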
Rules to Identify Sections
- If the characters before a colon exactly match a section name in the "section list", a blank line precedes them, and they begin a sentence, the program outputs that section name.
- If the characters before a colon do not exactly match any section name in the "section list", but a blank line precedes them and they begin a sentence, the program outputs the section name "unknown".
- If no colon appears in the first sentence of a paragraph, in other words, the program cannot find a candidate section name, the program outputs "no_header".
The "no_header" and "unknown" tags can be used to find additional section names to add to the "section list".
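The three section rules can be sketched as follows. This is an illustration, not the TestUnit source. It assumes that paragraphs are separated by blank lines, that the first line of a paragraph stands in for its first sentence, and that the "section list" can be represented as a mapping from header text to the output section name (the example output suggests such a mapping, but the real file format is not described here). The names `split_paragraphs` and `identify_section` are hypothetical.

```python
import re


def split_paragraphs(note):
    """Split a note into paragraphs separated by one or more blank lines."""
    return [p for p in re.split(r"\n\s*\n", note.strip()) if p.strip()]


def identify_section(paragraph, section_list):
    """Apply the three rules to the first line of a paragraph."""
    first_line = paragraph.strip().splitlines()[0]
    if ":" not in first_line:
        return "no_header"  # no candidate section name found
    candidate = first_line.split(":", 1)[0].strip()
    # Exact match gives the listed name; anything else is "unknown".
    return section_list.get(candidate, "unknown")


if __name__ == "__main__":
    # Hypothetical section list; the real list comes from step 1.
    sections = {
        "problem list": "report history of present illness item",
        "Meds": "report medication item",
    }
    note = (
        "problem list:\nhypertension\n\n"
        "today presents complaining of pain.\n\n"
        "Meds:\nHCTZ 25 mg po qday\n\n"
        "Plan: follow up"
    )
    for para in split_paragraphs(note):
        print(identify_section(para, sections))
```

Whether matching should ignore case is not stated in the rules; this sketch matches the header text exactly after trimming whitespace.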
Rules to Recognize Sentence Boundaries
- If a period appears at the EOL (end of line), then the EOL is an EOS (end of sentence).
- If a line with an EOL is shorter than 120 characters, then the EOL is an EOS.
- If a colon appears within the first 10 positions of a line with an EOL, then the EOL is an EOS.
- If a line with an EOL begins with '-', then the EOL is an EOS.
- If a line with an EOL begins with a number, then the EOL is an EOS.
- If one or more blank lines follow a line with an EOL, then the EOL is an EOS.
- If a '-' follows a line with an EOL, then the EOL is an EOS.
- If a number follows a line with an EOL, then the EOL is an EOS.
- If a colon appears within the first 10 positions of the line that follows a line with an EOL, then the EOL is an EOS.
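The sentence-boundary rules above can be written as a single predicate over a line and the line that follows it. The sketch below is an interpretation, not the TestUnit implementation. It assumes that "begins with" ignores leading whitespace, that a '-' or number "after" a line means the next line begins with it, and that the last line of a note always ends a sentence. The helper `mark_sentence_ends`, which appends a period to each detected sentence end, is likewise a guess based on the example output.

```python
def is_end_of_sentence(line, next_line=None, max_short_len=120, colon_window=10):
    """Return True if the end of `line` should be treated as an end of sentence."""
    text = line.rstrip()
    start = text.lstrip()
    if text.endswith("."):                      # period at end of line
        return True
    if len(text) < max_short_len:               # short line
        return True
    if ":" in text[:colon_window]:              # colon near the start of the line
        return True
    if start.startswith("-") or start[:1].isdigit():   # list item
        return True
    if next_line is None or not next_line.strip():     # blank line (or end of note)
        return True
    nxt = next_line.lstrip()
    if nxt.startswith("-") or nxt[:1].isdigit():       # next line is a list item
        return True
    return ":" in next_line[:colon_window]      # next line looks like a header


def mark_sentence_ends(lines):
    """Append '.' to every non-blank line whose end is a sentence end."""
    marked = []
    for i, line in enumerate(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        text = line.rstrip()
        if text and is_end_of_sentence(text, nxt) and not text.endswith("."):
            text += "."
        marked.append(text)
    return marked


if __name__ == "__main__":
    for out in mark_sentence_ends(["Meds:", "HCTZ 25 mg po qday", "senna 1-2 tabs po qhs"]):
        print(out)
```

Because any line shorter than 120 characters ends a sentence, the remaining rules only matter for long lines, which is where a wrapped sentence is most likely to continue onto the next line.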
Examples:
Original Note:
problem list:
hypertension
hypercholesterolemia
h/o cataracts
b/l LE myalgias which continue off statin (-) neuro exam , (-) CK, (-) ESR
today presents complaining of persistent b/l LE muscle pain.
NKDA
Meds:
HCTZ 25 mg po qday
senna 1-2 tabs po qhs
Output:
<sec value="report history of present illness item">
problem list:.
hypertension.
hypercholesterolemia.
h/o cataracts.
b/l LE myalgias which continue off statin (-) neuro exam , (-) CK, (-) ESR.
</sec>
<sec value="no_header">
today presents complaining of persistent b/l LE muscle pain.
</sec>
<sec value="no_header">
NKDA.
</sec>
<sec value="report medication item">
Meds:.
HCTZ 25 mg po qday.
senna 1-2 tabs po qhs.
</sec>
Rules defined by Carol Friedman.
Programs developed by Ying Li and Lyudmila Ena.