Abbreviation parser - atagger

The atagger is a program that handles certain types of parenthetical expressions that occur in the literature so that subsequent text processing programs can handle them appropriately. One type consists of abbreviations, which are typically defined in biomedical text by a parenthetical expression in which the abbreviation follows the full form and is enclosed in parentheses. The program ensures that the full form is associated with the abbreviation throughout the article so that the abbreviation can be interpreted correctly later. The second type consists of parenthetical expressions that should be ignored while processing the article because they contain numerical data or references to figures and tables, such as "(n=10)".

Download

The atagger is written in pure C. The source code can be downloaded here.

Usage

Command line: atagger -i(REQUIRED) -a(OPTIONAL) -o(OPTIONAL)

-i: The file name (with path) of the article from which to collect abbreviations. This option is required.

e.g. -i D:\Semantic\Files\ArticlesOffset\K_sarcoma\txt\Classic_Kaposi_sarcoma_10092829.txt

-a: The name of the output file for the abbreviation list

e.g. -a D:\Semantic\Files\list\Classic_Kaposi_sarcoma_10092829.dlex
e.g. -aY or -ay (name the file automatically); -aN, -an, or omitting the option (do not output the abbreviation list)

NOTE: The rule for creating the file name automatically:

The extension of the input file (.txt or other) is changed to .dlex.
For example:
Based on the above example, it will be D:\Semantic\Files\ArticlesOffset\K_sarcoma\txt\Classic_Kaposi_sarcoma_10092829.dlex
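The automatic naming rule for the abbreviation list can be sketched as below. This is a minimal illustration, not code from the atagger source: the function name make_dlex_name and its return convention are assumptions, and the sketch assumes the extension is everything after the last '.' in the file name part of the path.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the atagger source): write into `out` a copy
   of `input` whose final extension, if any, is replaced with ".dlex".
   Returns 0 on success, -1 if `out` is too small. */
int make_dlex_name(const char *input, char *out, size_t out_size)
{
    assert(input != NULL && out != NULL);

    /* Find where the file name starts, accepting both '\' and '/'. */
    const char *base = input;
    for (const char *p = input; *p != '\0'; ++p) {
        if (*p == '\\' || *p == '/')
            base = p + 1;
    }

    /* Only a '.' inside the file name counts, so dots in folder names
       are left alone. */
    const char *dot = strrchr(base, '.');
    size_t stem_len = (dot != NULL) ? (size_t)(dot - input) : strlen(input);

    int n = snprintf(out, out_size, "%.*s.dlex", (int)stem_len, input);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

A file without an extension simply gets ".dlex" appended in this sketch; whether atagger does the same is not stated above.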

-o: The name of the output file for the article with abbreviation tags

e.g. -o D:\Semantic\Files\ArticlesOffset\tagged\Classic_Kaposi_sarcoma.tags.txt
e.g. -oY or -oy (name the file automatically); -oN, -on, or omitting the option (do not write the article with abbreviation and ignore tags to a file)
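The three behaviours shared by -a and -o (automatic name, no output, explicit file name) can be modelled as below. This is only a sketch: the enum, the function name classify_output_option, and the idea that the caller has already separated the option value from the flag are all assumptions, not details of the real command-line parser.

```c
#include <stddef.h>

/* Hypothetical classification of the value given to -a or -o. */
typedef enum { OUT_NONE, OUT_AUTO, OUT_EXPLICIT } out_mode;

/* `value` is the text that follows the flag, or NULL when the option is
   absent. A single Y/y selects the automatic name, a single N/n (or
   nothing) disables the output, anything else is taken as a file name. */
out_mode classify_output_option(const char *value)
{
    if (value == NULL || value[0] == '\0')
        return OUT_NONE;

    if (value[1] == '\0') {
        if (value[0] == 'Y' || value[0] == 'y')
            return OUT_AUTO;
        if (value[0] == 'N' || value[0] == 'n')
            return OUT_NONE;
    }
    return OUT_EXPLICIT;
}
```

One consequence of this reading is that an output file literally named "Y" or "N" cannot be requested; a path such as .\Y would be needed instead.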

NOTE: The rule for creating the file name automatically:

The output file keeps the name of the input file and is placed in the subfolder "tagged".
For example:
Based on the above example, it will be D:\Semantic\Files\ArticlesOffset\K_sarcoma\txt\tagged\Classic_Kaposi_sarcoma_10092829.txt
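The "tagged" subfolder rule can be sketched in the same way. The function name make_tagged_name is hypothetical, and the sketch only builds the path string; it does not create the folder, and it is not taken from the atagger source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: insert a "tagged" folder between the directory part
   of `input` and its file name. Returns 0 on success, -1 if `out` is too
   small. */
int make_tagged_name(const char *input, char *out, size_t out_size)
{
    assert(input != NULL && out != NULL);

    size_t dir_len = 0;  /* length of the directory part, separator included */
    char sep = '\\';     /* assumed default when the path has no separator */
    for (size_t i = 0; input[i] != '\0'; ++i) {
        if (input[i] == '\\' || input[i] == '/') {
            dir_len = i + 1;
            sep = input[i];  /* reuse the separator style of the input */
        }
    }

    int n = snprintf(out, out_size, "%.*stagged%c%s",
                     (int)dir_len, input, sep, input + dir_len);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```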

Rules to Recognize abbreviations

The abbreviation parser uses a simplified adaptation of the Schwartz and Hearst algorithm.
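For orientation, the core matching step of the published Schwartz and Hearst algorithm can be sketched in C as follows. It walks the short form and the preceding text from right to left, and requires the first character of the short form to start a word. This follows the published algorithm, not atagger's simplified version, whose exact rules are not reproduced here; the function name find_long_form_start is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Sketch of the Schwartz and Hearst matching step. `text` is the text that
   precedes the parenthesis. Returns the offset in `text` where the long form
   for `short_form` starts, or -1 if the characters cannot be matched.
   Callers are expected to pass a non-empty short form. */
int find_long_form_start(const char *short_form, const char *text)
{
    assert(short_form != NULL && text != NULL);

    int s = (int)strlen(short_form) - 1;
    int l = (int)strlen(text) - 1;

    for (; s >= 0; --s) {
        int c = tolower((unsigned char)short_form[s]);
        if (!isalnum(c))
            continue;  /* punctuation in the short form is skipped */

        /* Move left until the character matches. The first character of
           the short form must also sit at the beginning of a word. */
        while ((l >= 0 && tolower((unsigned char)text[l]) != c) ||
               (s == 0 && l > 0 && isalnum((unsigned char)text[l - 1])))
            --l;

        if (l < 0)
            return -1;  /* ran out of text before matching everything */
        --l;
    }

    /* Back up to the space before the matched word. */
    while (l >= 0 && text[l] != ' ')
        --l;
    return l + 1;
}
```

With the short form "PTNB" and the text "to evaluate percutaneous transthoracic needle biopsy", the returned offset points at "percutaneous".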

Term definition:
The rules are as follows:
  1. Assumptions
  2. Identifying short form and long form candidates
  3. Finding a definition candidate if the expression is a short form candidate
  4. Finding a possible abbreviation if the expression is a long form candidate
  5. Identifying the short form
  6. Identifying the correct definition
  7. Determining parenthetical expressions that are not abbreviations but should be ignored, such as (n=20) or (p=.0005)
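The last rule can be illustrated with a rough heuristic. The test below is an assumption made for illustration only and is not the rule set used by atagger: it treats a parenthetical as ignorable when it starts with "fig" or "table" (case-insensitive), or when it contains both a digit and one of '=', '<', '>'. The function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Case-insensitive test that `s` begins with `prefix`. */
static int has_prefix_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        ++s;
        ++prefix;
    }
    return 1;
}

/* Hypothetical heuristic: returns 1 if the parenthetical looks like
   numerical data or a figure/table reference, 0 otherwise. */
int is_ignorable_parenthetical(const char *expr)
{
    assert(expr != NULL);

    /* Skip the opening parenthesis and leading spaces, if present. */
    while (*expr == '(' || *expr == ' ')
        ++expr;

    if (has_prefix_ci(expr, "fig") || has_prefix_ci(expr, "table"))
        return 1;

    int has_digit = 0;
    int has_relation = 0;
    for (const char *p = expr; *p != '\0'; ++p) {
        if (isdigit((unsigned char)*p))
            has_digit = 1;
        if (*p == '=' || *p == '<' || *p == '>')
            has_relation = 1;
    }
    return has_digit && has_relation;
}
```

A prefix test this loose also accepts words such as "figurative", so a real implementation would need stricter patterns.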

Examples

Command line:
        atagger -i D:\Semantic\Files\ArticlesOffset\K_sarcoma\txt\Classic_Kaposi_sarcoma_10559088.txt
    
Output: The text with tags
        Title:
            Primary pulmonary AIDS-related lymphoma: radiographic and CT findings.
        Abstract:
            STUDY OBJECTIVES: To describe the radiographic and CT findings of primary <term>AIDS-related lymphoma of the lung</term><ign>(ARLL)</ign>, and to evaluate <term>percutaneous transthoracic needle biopsy</term><ign></ign> in the diagnosis of primary <ign>ARLL</ign><term>AIDS-related lymphoma of the lung</term>. MATERIALS AND METHODS: Seven chest radiographs and seven CT scans of HIV-infected patients with histologically proved <term>primary pulmonary non-Hodgkin's lymphoma</term><ign></ign> were reviewed at our institution. All of the patients had fibroscopy with BAL. The diagnosis of <ign>PPL</ign><term>primary pulmonary non-Hodgkin's lymphoma</term> was established histologically by means of <ign>PTNB</ign><term>percutaneous transthoracic needle biopsy</term> <ign>(n = 4)</ign>, open-lung biopsy <ign>(n = 2)</ign>, or autopsy <ign>(n = 1)</ign>. RESULTS: All but one patient had multiple peripheral well-defined nodules of various sizes on the chest X-ray film and CT scan. One patient had a subpleural parenchymal infiltrate and another had a main peripheral mass with spontaneous cavitation. Hilar/mediastinal adenopathies and pericardial/pleural effusion were never associated with the parenchymal abnormalities. Fibroscopy with BAL was always negative. <ign>PTNB</ign><term>percutaneous transthoracic needle biopsy</term>, done in six cases, was diagnostic in four cases and suggested primary <ign>ARLL</ign><term>AIDS-related lymphoma of the lung</term> in two cases. No complications occurred during these procedures. CONCLUSION: After excluding infectious causes, multiple peripheral nodules and/or masses without hilar or mediastinal adenopathies and without pleural effusion are suggestive of primary pulmonary <ign>ARL</ign><term>AIDS-related lymphoma</term>. A specific diagnosis can be obtained by means of <ign>PTNB</ign><term>percutaneous transthoracic needle biopsy</term>.
    
Output: The abbreviation list
        ARL|AIDS-related lymphoma
        ARLL|AIDS-related lymphoma (ARL) of the lung
        PTNB|percutaneous transthoracic needle biopsy
        PPL|primary pulmonary non-Hodgkin's lymphoma
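A downstream program could read the abbreviation list with a sketch like the one below, which assumes each line holds exactly one short form and one long form separated by the first '|', as in the output above. The function name split_dlex_line is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Split "SHORT|long form" in place. On success, `*short_form` and
   `*long_form` point into `line` and 0 is returned; -1 is returned when
   the separator is missing or either side is empty. */
int split_dlex_line(char *line, char **short_form, char **long_form)
{
    assert(line != NULL && short_form != NULL && long_form != NULL);

    /* Drop a trailing line ending and skip leading indentation. */
    line[strcspn(line, "\r\n")] = '\0';
    while (*line == ' ' || *line == '\t')
        ++line;

    char *bar = strchr(line, '|');
    if (bar == NULL || bar == line || bar[1] == '\0')
        return -1;

    *bar = '\0';  /* terminate the short form */
    *short_form = line;
    *long_form = bar + 1;
    return 0;
}
```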
    

External link:

Schwartz and Hearst algorithm

Rules defined by Carol Friedman.
Programs developed by Feng Lui with funding from grants R01LM008635 and R01LM010140 from the National Library of Medicine.