Automatic lymphoma classification with sentence subgraph mining from pathology reports

Yuan Luo*, Aliyah R. Sohani, Ephraim P. Hochberg, Peter Szolovits

*Corresponding author for this work

Research output: Contribution to journal (Article, peer-reviewed)

47 Scopus citations


Objective: Pathology reports are rich in narrative statements that encode a complex web of relations among medical concepts. These relations are routinely used by doctors to reason on diagnoses, but often require hand-crafted rules or supervised learning to extract into prespecified forms for computational disease modeling. We aim to automatically capture relations from narrative text without supervision.

Methods: We design a novel framework that translates sentences into graph representations, automatically mines sentence subgraphs, reduces redundancy in mined subgraphs, and automatically generates subgraph features for subsequent classification tasks. To ensure meaningful interpretations over the sentence graphs, we use the Unified Medical Language System Metathesaurus to map token subsequences to concepts, and in turn sentence graph nodes. We test our system with multiple lymphoma classification tasks that together mimic the differential diagnosis by a pathologist. To this end, we prevent our classifiers from looking at explicit mentions or synonyms of lymphomas in the text.

Results and Conclusions: We compare our system with three baseline classifiers using standard n-grams, full MetaMap concepts, and filtered MetaMap concepts. Our system achieves high F-measures on multiple binary classifications of lymphoma (Burkitt lymphoma, 0.8; diffuse large B-cell lymphoma, 0.909; follicular lymphoma, 0.84; Hodgkin lymphoma, 0.912). Significance tests show that our system outperforms all three baselines. Moreover, feature analysis identifies subgraph features that contribute to improved performance; these features agree with the state-of-the-art knowledge about lymphoma classification. We also highlight how these unsupervised relation features may provide meaningful insights into lymphoma classification.
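The Methods pipeline (sentence graphs, subgraph mining, subgraph features) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a sentence graph is a set of labeled edges, restricts mining to connected subgraphs of one or two edges, and omits the redundancy-reduction and UMLS concept-mapping steps. The function names (`small_subgraphs`, `mine_frequent`, `featurize`), the `min_support` parameter, and all node and relation labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Assumption: a sentence graph is a set of labeled edges (head, relation, dependent).
# The labels below are invented examples, not taken from the paper's data.
Edge = tuple[str, str, str]


def small_subgraphs(edges: set[Edge]) -> set[frozenset[Edge]]:
    """Enumerate connected subgraphs made of one or two edges."""
    subs = {frozenset([e]) for e in edges}
    for a, b in combinations(sorted(edges), 2):
        # Two edges form a connected subgraph when they share a node.
        if {a[0], a[2]} & {b[0], b[2]}:
            subs.add(frozenset([a, b]))
    return subs


def mine_frequent(graphs: list[set[Edge]], min_support: int) -> list[frozenset[Edge]]:
    """Keep subgraphs that occur in at least min_support sentence graphs."""
    support: Counter = Counter()
    for g in graphs:
        support.update(small_subgraphs(g))  # each subgraph counted once per sentence
    frequent = [s for s, n in support.items() if n >= min_support]
    return sorted(frequent, key=lambda s: sorted(s))  # deterministic feature order


def featurize(graph: set[Edge], vocabulary: list[frozenset[Edge]]) -> list[int]:
    """Binary feature vector: 1 if the mined subgraph occurs in this sentence graph."""
    present = small_subgraphs(graph)
    return [1 if s in present else 0 for s in vocabulary]


# Three toy sentence graphs (hypothetical content).
graphs = [
    {("cells", "mod", "large"), ("cells", "express", "CD20")},
    {("cells", "mod", "large"), ("cells", "express", "CD20"), ("cells", "lack", "CD10")},
    {("cells", "mod", "small")},
]
vocabulary = mine_frequent(graphs, min_support=2)
features = [featurize(g, vocabulary) for g in graphs]
```

The resulting binary vectors could then be passed to any standard classifier. A practical system would need a general frequent-subgraph miner to handle larger patterns; the two-edge limit here only keeps the sketch short.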

Original language: English (US)
Pages (from-to): 824-832
Number of pages: 9
Journal: Journal of the American Medical Informatics Association
Issue number: 5
State: Published - 2014

ASJC Scopus subject areas

  • Health Informatics

