TY - JOUR
T1 - PubLabeler
T2 - Enhancing automatic classification of publications in UniProtKB using protein textual description and PubMedBERT
AU - Wang, Shaojun
AU - Bian, Junyi
AU - Huang, Xiaodi
AU - Zhou, Hong
AU - Zhu, Shanfeng
PY - 2024/12/20
Y1 - 2024/12/20
N2 - In UniProtKB, each protein is linked to numerous publications covering topics such as sequence, function, and structure, which are annotated manually or through automated methods. Given the vast number of proteins and literature, manual annotation is time-consuming and labour-intensive. Although UniProtKB offers automated annotations, their quality often falls short. Therefore, developing an accurate automated classifier to identify the topics of publications associated with each protein is imperative for advancing biomedical knowledge discovery. Classifying publications in UniProtKB involves protein-publication pairs characterized by multi-label, label co-occurrence, and class imbalance, which increases complexity. This paper proposes a novel method called PubLabeler, which simultaneously considers protein description and scientific literature texts as input. PubLabeler employs the PubMedBERT model to encode input texts and integrates label co-occurrence information into the model parameters. Additionally, it uses focal loss to update parameters, allowing the model to focus more on classes with a few instances. Using newly annotated literature from Swiss-Prot in 2023 as a test set, PubLabeler achieved superior results in both micro and macro metrics, showing a 28.5% improvement in macro-F1 compared to UniProtKB's automated annotation method, UPCLASS. Furthermore, we validated PubLabeler's effectiveness in TrEMBL annotation, showcasing its comprehensive prediction results compared to TrEMBL's automated annotations. These findings highlight PubLabeler's reliability and potential to advance protein-related information extraction and knowledge discovery.
AB - In UniProtKB, each protein is linked to numerous publications covering topics such as sequence, function, and structure, which are annotated manually or through automated methods. Given the vast number of proteins and literature, manual annotation is time-consuming and labour-intensive. Although UniProtKB offers automated annotations, their quality often falls short. Therefore, developing an accurate automated classifier to identify the topics of publications associated with each protein is imperative for advancing biomedical knowledge discovery. Classifying publications in UniProtKB involves protein-publication pairs characterized by multi-label, label co-occurrence, and class imbalance, which increases complexity. This paper proposes a novel method called PubLabeler, which simultaneously considers protein description and scientific literature texts as input. PubLabeler employs the PubMedBERT model to encode input texts and integrates label co-occurrence information into the model parameters. Additionally, it uses focal loss to update parameters, allowing the model to focus more on classes with a few instances. Using newly annotated literature from Swiss-Prot in 2023 as a test set, PubLabeler achieved superior results in both micro and macro metrics, showing a 28.5% improvement in macro-F1 compared to UniProtKB's automated annotation method, UPCLASS. Furthermore, we validated PubLabeler's effectiveness in TrEMBL annotation, showcasing its comprehensive prediction results compared to TrEMBL's automated annotations. These findings highlight PubLabeler's reliability and potential to advance protein-related information extraction and knowledge discovery.
KW - Proteins
KW - Protein engineering
KW - Annotations
KW - Bioinformatics
KW - Training
KW - Manuals
KW - Biological system modeling
KW - Text categorization
KW - Predictive models
KW - Pipelines
KW - Deep learning
UR - http://www.scopus.com/inward/record.url?scp=85213017055&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85213017055&partnerID=8YFLogxK
U2 - 10.1109/JBHI.2024.3520579
DO - 10.1109/JBHI.2024.3520579
M3 - Article
C2 - 40030652
SN - 2168-2194
SP - 1
EP - 9
JO - IEEE Journal of Biomedical and Health Informatics
JF - IEEE Journal of Biomedical and Health Informatics
ER -