pathCLIP: Detection of Genes and Gene Relations from Biological Pathway Figures through Image-Text Contrastive Learning

IEEE J Biomed Health Inform. 2024 Apr 3:PP. doi: 10.1109/JBHI.2024.3383610. Online ahead of print.

Abstract

In biomedical literature, biological pathways are commonly described through a combination of images and text. These pathways contain valuable information, including genes and their relationships, which provide insight into biological mechanisms and precision medicine. Curating pathway information across the literature enables the integration of this information to build a comprehensive knowledge base. While some studies have extracted pathway information from images and text independently, they often overlook the correspondence between the two modalities. In this paper, we present a pathway figure curation system named pathCLIP for identifying genes and gene relations from pathway figures. Our key innovation is the use of an image-text contrastive learning model to learn coordinated embeddings of image snippets and text descriptions of genes and gene relations, thereby improving curation. Our validation results, using pathway figures from PubMed, showed that our multimodal model outperforms models using only a single modality. Additionally, our system effectively curates genes and gene relations from multiple literature sources. Two case studies on extracting pathway information from literature of non-small cell lung cancer and Alzheimer's disease further demonstrate the usefulness of our curated pathway information in enhancing related pathways in the KEGG database.