Traditional Chinese medicine clinical records classification with BERT and domain specific corpora

Liang Yao, Zhe Jin, Chengsheng Mao, Yin Zhang, Yuan Luo*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

19 Scopus citations


Traditional Chinese Medicine (TCM) has been developed over several thousand years and plays a significant role in health care for Chinese people. This paper studies the problem of classifying TCM clinical records into 5 main TCM disease categories. We explored a number of state-of-the-art deep learning models and found that the recent Bidirectional Encoder Representations from Transformers (BERT) model achieves better results than other deep learning models and other state-of-the-art methods. We further used an unlabeled clinical corpus to fine-tune the BERT language model before training the text classifier. The method uses only the Chinese characters in clinical text as input, without preprocessing or feature engineering. We evaluated deep learning models and traditional text classifiers on a benchmark data set. Our method achieves a state-of-the-art accuracy of 89.39% ± 0.35%, a Macro F1 score of 88.64% ± 0.40%, and a Micro F1 score of 89.39% ± 0.35%. We also visualized attention weights in our method, which can reveal indicative characters in clinical text.
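The abstract reports the same value for accuracy and Micro F1. This is expected when each record receives exactly one label, because every misclassification then counts as one false positive and one false negative. The sketch below is not the authors' evaluation code. It is a minimal illustration of the three metrics, assuming non-empty inputs. The function name `classification_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    # Macro F1: unweighted mean of per-class F1 scores.
    macro_f1 = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro F1: F1 computed from counts pooled over all classes.
    micro_f1 = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return accuracy, macro_f1, micro_f1


# Hypothetical labels for five categories (0-4); not data from the paper.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 2, 3, 0, 4, 4]
accuracy, macro_f1, micro_f1 = classification_scores(y_true, y_pred)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} micro_f1={micro_f1:.4f}")
```

Macro F1 weights every category equally, so it can differ from accuracy when some categories are harder or rarer than others.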

Original language: English (US)
Pages (from-to): 1632-1636
Number of pages: 5
Journal: Journal of the American Medical Informatics Association
Issue number: 12
State: Published - Nov 15 2019


  • BERT
  • TCM
  • clinical records classification
  • domain knowledge
  • natural language processing

ASJC Scopus subject areas

  • Health Informatics

