NetGO 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information

Shuwei Yao, Ronghui You, Shaojun Wang, Yi Xiong, Xiaodi Huang, Shanfeng Zhu

Research output: Contribution to journal › Article › peer-review


Abstract

With the explosive growth of protein sequences, large-scale automated protein function prediction (AFP) is becoming increasingly challenging. A protein is usually associated with dozens of Gene Ontology (GO) terms, so AFP is regarded as a large-scale multi-label classification problem. Under the learning to rank (LTR) framework, our previous NetGO tool integrated massive networks and multi-type information about protein sequences to achieve good performance while handling all possible GO terms (>44 000). In this work, we propose an updated version, NetGO 2.0, which further improves the performance of large-scale AFP. NetGO 2.0 additionally incorporates literature information via logistic regression and deep sequence information via a recurrent neural network (RNN). We generate datasets following the Critical Assessment of Functional Annotation (CAFA) protocol. Experimental results show that NetGO 2.0 significantly outperformed NetGO in the biological process ontology (BPO) and cellular component ontology (CCO). In particular, NetGO 2.0 achieved a 12.6% improvement over NetGO in terms of area under the precision-recall curve (AUPR) in BPO and around 2.6% in terms of Fmax in CCO. These results demonstrate the benefits of incorporating text and deep sequence information for the functional annotation of BPO and CCO. The NetGO 2.0 web server is freely available at http://issubmission.sjtu.edu.cn/ng2/.
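
The abstract reports Fmax without defining it. As a rough illustration, the sketch below computes a protein-centric Fmax in the way it is commonly described for CAFA-style evaluation: for each score threshold, precision is averaged over proteins that have at least one prediction at that threshold, recall is averaged over all benchmark proteins, and Fmax is the best resulting F-measure. This is not the NetGO 2.0 evaluation code; the function name `fmax`, the data layout, and the toy GO identifiers are all assumptions made for the example.

```python
from typing import Dict, Set


def fmax(
    preds: Dict[str, Dict[str, float]],  # protein -> {GO term: score in [0, 1]} (assumed layout)
    truth: Dict[str, Set[str]],          # protein -> non-empty set of true GO terms
    steps: int = 100,                    # number of thresholds: 0.01, 0.02, ..., 1.00
) -> float:
    """Protein-centric Fmax over evenly spaced score thresholds (illustrative sketch)."""
    best = 0.0
    for k in range(1, steps + 1):
        t = k / steps
        prec_sum, covered, rec_sum = 0.0, 0, 0.0
        for protein, true_terms in truth.items():
            # Terms predicted for this protein at threshold t.
            predicted = {go for go, s in preds.get(protein, {}).items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with at least one prediction.
                prec_sum += hits / len(predicted)
                covered += 1
            # Recall counts every benchmark protein.
            rec_sum += hits / len(true_terms)
        if covered == 0:
            continue
        precision = prec_sum / covered
        recall = rec_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with made-up GO identifiers.
toy_truth = {"P1": {"GO:1", "GO:2"}, "P2": {"GO:3"}}
toy_preds = {
    "P1": {"GO:1": 0.9, "GO:2": 0.4, "GO:4": 0.6},
    "P2": {"GO:3": 0.8, "GO:1": 0.3},
}
print(round(fmax(toy_preds, toy_truth), 3))  # prints 0.909
```

In the toy example the best trade-off occurs for thresholds just above 0.3, where one false positive remains for P1 and both proteins are fully recalled (precision 5/6, recall 1).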
Original language: English
Pages (from-to): W469-W475
Number of pages: 7
Journal: Nucleic Acids Research
Volume: 49
Issue number: W1
Early online date: 26 May 2021
Publication status: Published - 02 Jul 2021
