A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

Yikuan Li, Hanyin Wang, Yuan Luo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Joint image-text embeddings extracted from medical images and their associated contextual reports are the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, and clinical report auto-generation. In this study, we adopt four pre-trained V+L models (LXMERT, VisualBERT, UNITER, and PixelBERT) to learn multimodal representations from MIMIC-CXR images and associated reports. External evaluation on the OpenI dataset shows that the joint embeddings learned by the pre-trained V+L models yield a 1.4% performance improvement on thoracic finding classification over a pioneering CNN + RNN model. Ablation studies further analyze the contribution of individual model components and validate the advantage of joint embeddings over text-only embeddings. Attention maps are also visualized to illustrate the attention mechanism of V+L models.
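The single-stream fusion used by models such as VisualBERT and UNITER can be summarized in a few lines: image-region features and report-token embeddings are concatenated into one sequence, and self-attention then mixes information across both modalities. The toy numpy sketch below is illustrative only; the dimensions, random features, and single attention head are assumptions for the example and do not reflect the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration): 36 image-region features and
# 20 report tokens, all projected into a shared 64-d embedding space.
n_regions, n_tokens, d = 36, 20, 64

image_feats = rng.standard_normal((n_regions, d))  # stand-in for CNN region features
text_feats = rng.standard_normal((n_tokens, d))    # stand-in for word-piece embeddings

# Single-stream fusion: concatenate both modalities into one token sequence.
tokens = np.concatenate([text_feats, image_feats], axis=0)  # shape (56, 64)

def self_attention(x, w_q, w_k, w_v):
    """One head of scaled dot-product self-attention over the joint sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Row-wise softmax over all tokens, text and image alike.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

w_q, w_k, w_v = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
joint, attn = self_attention(tokens, w_q, w_k, w_v)

# Every output embedding now attends over BOTH modalities: the text rows of
# `attn` carry weight on image columns and vice versa, which is what makes
# the resulting representation "joint" rather than text-only.
print(joint.shape)  # (56, 64)
```

A text-only baseline corresponds to running the same attention over `text_feats` alone; the ablation in the paper compares exactly this kind of joint embedding against a text-only one.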

Original language: English (US)
Title of host publication: Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020
Editors: Taesung Park, Young-Rae Cho, Xiaohua Tony Hu, Illhoi Yoo, Hyun Goo Woo, Jianxin Wang, Julio Facelli, Seungyoon Nam, Mingon Kang
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1999-2004
Number of pages: 6
ISBN (Electronic): 9781728162157
DOIs
State: Published - Dec 16 2020
Event: 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020 - Virtual, Seoul, Korea, Republic of
Duration: Dec 16 2020 - Dec 19 2020

Publication series

Name: Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020

Conference

Conference: 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020
Country: Korea, Republic of
City: Virtual, Seoul
Period: 12/16/20 - 12/19/20

Keywords

  • Multi-modal Representation Learning
  • Thoracic Findings Classification
  • Vision-and-Language

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems and Management
  • Medicine (miscellaneous)
  • Health Informatics
