PubLabeler: Enhancing automatic classification of publications in UniProtKB using protein textual description and PubMedBERT

Shaojun Wang, Junyi Bian, Xiaodi Huang, Hong Zhou, Shanfeng Zhu

Research output: Contribution to journal › Article › peer-review

Abstract

In UniProtKB, each protein is linked to numerous publications covering topics such as sequence, function, and structure, and these links are annotated either manually or by automated methods. Given the vast number of proteins and publications, manual annotation is time-consuming and labour-intensive. Although UniProtKB offers automated annotations, their quality often falls short. An accurate automated classifier that identifies the topics of the publications associated with each protein is therefore needed to advance biomedical knowledge discovery. The task operates on protein-publication pairs and is complicated by its multi-label nature, label co-occurrence, and class imbalance. This paper proposes a novel method called PubLabeler, which takes both the protein description and the scientific literature text as input. PubLabeler employs the PubMedBERT model to encode the input texts and integrates label co-occurrence information into the model parameters. Additionally, it is trained with focal loss, which lets the model focus more on classes with few instances. Using newly annotated literature from Swiss-Prot in 2023 as a test set, PubLabeler achieved superior results in both micro and macro metrics, showing a 28.5% improvement in macro-F1 compared to UniProtKB's automated annotation method, UPCLASS. Furthermore, we validated PubLabeler's effectiveness in TrEMBL annotation, where its predictions were more comprehensive than TrEMBL's automated annotations. These findings highlight PubLabeler's reliability and its potential to advance protein-related information extraction and knowledge discovery.
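
The abstract names focal loss as the way PubLabeler copes with class imbalance but does not give the formulation. As a rough illustration only, the sketch below implements the standard binary focal loss applied independently to each label, which is a common choice for multi-label problems. The function name `binary_focal_loss` and the values `gamma=2.0` and `alpha=0.25` are assumptions made for this example, not settings reported by the authors.

```python
import math


def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean focal loss over independent binary labels.

    Hypothetical helper for illustration; gamma and alpha are assumed
    defaults, not values taken from the PubLabeler paper.

    probs   -- predicted probabilities, one per label, each in [0, 1]
    targets -- ground-truth labels, one per label, each 0 or 1
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, targets):
        # p_t is the probability the model assigns to the true outcome.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        p_t = min(max(p_t, eps), 1.0)
        # (1 - p_t) ** gamma shrinks the loss of well-classified labels,
        # so hard (often rare-class) labels dominate the average.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confidently correct label versus a confidently wrong one.
easy = binary_focal_loss([0.9], [1])
hard = binary_focal_loss([0.1], [1])
print(f"easy={easy:.6f} hard={hard:.6f}")
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted binary cross-entropy; larger `gamma` values widen the gap between easy and hard examples.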

Original language: English
Pages (from-to): 1-9
Number of pages: 9
Journal: IEEE Journal of Biomedical and Health Informatics
Publication status: Published - 20 Dec 2024
