Columbia Unsupervised Biomedical Named Entity Recognizer
Authors: Shaodian Zhang, Noemie Elhadad
Department of Biomedical Informatics, Columbia University
contact: shaodian@dbmi.columbia.edu

------------------------------------------------------------------------

This Java code implements the unsupervised biomedical named entity recognition
algorithm presented in the paper "Unsupervised Biomedical Named Entity Recognition: Experiments with clinical and biological texts" by Shaodian Zhang and Noemie Elhadad, to appear in JBI.

Files provided:
* CubNER.jar: The runnable NER jar file.
* domain_rep_sample.xml: An example domain representation configuration file for the sample data set.
* sample: Example test data.
* sample_tagged: An example tagged file.
* stopwords.txt: An example stopword list.
* readme.txt: This file.
* src.tar.gz: Java source files

Files not provided in this package but MUST be included at runtime:
* MRCONSO.RRF from UMLS
* MRREL.RRF from UMLS
* MRSTY.RRF from UMLS
* en-pos-maxent.bin from OpenNLP
* en-chunker.bin from OpenNLP (see http://opennlp.sourceforge.net/models-1.5/)
The package was tested with UMLS 2012AB and OpenNLP 1.5.2. Any UMLS release that uses the same file format as 2012AB should also work.
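
Because a missing UMLS or OpenNLP file only shows up once the tool is running, it can help to check for the files first. The following is a minimal sketch, not part of CubNER: the class name RuntimeFileCheck is hypothetical, and it assumes the five files are expected in a single directory (the working directory by default), which this readme does not state.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not shipped with CubNER.
class RuntimeFileCheck {
    // File names taken from the list of required runtime files above.
    static final List<String> REQUIRED = Arrays.asList(
            "MRCONSO.RRF", "MRREL.RRF", "MRSTY.RRF",
            "en-pos-maxent.bin", "en-chunker.bin");

    // Returns the required file names that are not regular files in dir.
    static List<String> missingFiles(Path dir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the files sit in one directory, by default the working directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFiles(dir);
        if (missing.isEmpty()) {
            System.out.println("All required runtime files found in " + dir.toAbsolutePath());
        } else {
            System.out.println("Missing runtime files: " + missing);
        }
    }
}
```

If the files live elsewhere, pass that directory as the first argument.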

------------------------------------------------------------------------

Usage:
First, collect seed terms from UMLS according to a domain representation file (domain_rep_sample.xml is an example for the sample data):
java -jar CubNER.jar seed -c domain_rep_sample.xml

Seed terms are generated automatically in the folder "seedterm". Remove this folder before collecting seed terms for a different set of entity classes.

Then run the following command to perform the recognition:
java -jar CubNER.jar recognize -b sample -f sample

Parameters:
-b (required) the corpus used for signature generation, usually the same as the test data. Must be raw text with one sentence/paragraph/document per line.
-f (required) the test corpus. Must be raw text with one sentence/paragraph/document per line.
-c (optional) the number of context words to consider before and after. Default value 2.
-t (optional) the threshold value used in entity classification. See the paper for details. Default value 0.002.
-i (optional) the weight of internal words. See the paper for details. Default value 20.
-r (optional) whether to regenerate signatures. 1: regenerate signatures (use on the first run or after changing parameters); 0: reuse existing signatures, which saves time. Default value 1.
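
When the recognizer is driven from another Java program, the options above can be assembled into a command line programmatically. The sketch below is only an illustration: the class RecognizeCommand and its method are hypothetical and not part of CubNER. It uses only the flags documented above and omits any optional flag whose value is null, so the documented default applies. The threshold is passed as a string so that small values are not rewritten in scientific notation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not shipped with CubNER.
class RecognizeCommand {
    // Builds the argument list for "java -jar CubNER.jar recognize ...".
    // A null optional value means the flag is left out entirely.
    static List<String> build(String signatureCorpus, String testCorpus,
                              Integer contextWords, String threshold,
                              Integer internalWeight, Boolean regenerate) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add("CubNER.jar");
        cmd.add("recognize");
        cmd.add("-b");
        cmd.add(signatureCorpus);
        cmd.add("-f");
        cmd.add(testCorpus);
        if (contextWords != null) {
            cmd.add("-c");
            cmd.add(contextWords.toString());
        }
        if (threshold != null) {
            cmd.add("-t");
            cmd.add(threshold);
        }
        if (internalWeight != null) {
            cmd.add("-i");
            cmd.add(internalWeight.toString());
        }
        if (regenerate != null) {
            // The readme documents 1 (regenerate) and 0 (reuse) as the values of -r.
            cmd.add("-r");
            cmd.add(regenerate ? "1" : "0");
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("sample", "sample", 3, "0.002", 20, true);
        // Prints: java -jar CubNER.jar recognize -b sample -f sample -c 3 -t 0.002 -i 20 -r 1
        System.out.println(String.join(" ", cmd));
        // To launch it (requires CubNER.jar and the runtime files listed earlier):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The launch line is left commented out so that the sketch runs without CubNER.jar present.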