TY - JOUR
T1 - Towards a semantic lexicon for clinical natural language processing
AU - Liu, Hongfang
AU - Wu, Stephen T.
AU - Li, Dingcheng
AU - Jonnalagadda, Siddhartha
AU - Sohn, Sunghwan
AU - Wagholikar, Kavishwar
AU - Haug, Peter J.
AU - Huff, Stanley M.
AU - Chute, Christopher G.
PY - 2012
Y1 - 2012
N2 - A semantic lexicon which associates words and phrases in text to concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases in a large corpus distribute and how well the UMLS captures the semantics. A corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS assisted with variants mined and usage information gathered from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation in capturing clinical information comprehensively. The study also yields some insights in developing practical NLP systems.
UR - http://www.scopus.com/inward/record.url?scp=84880802623&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84880802623&partnerID=8YFLogxK
M3 - Article
C2 - 23304329
AN - SCOPUS:84880802623
SN - 1559-4076
VL - 2012
SP - 568
EP - 576
JO - AMIA Annual Symposium Proceedings
JF - AMIA Annual Symposium Proceedings
ER -