Abstract
With the explosive growth of protein sequences, large-scale automated
protein function prediction (AFP) is becoming challenging. A protein is
usually associated with dozens of gene ontology (GO) terms. Therefore,
AFP is regarded as a problem of large-scale multi-label classification.
Under the learning to rank (LTR) framework, our previous NetGO tool
integrated massive networks and multi-type information about protein
sequences to achieve good performance by dealing with all possible GO
terms (>44 000). In this work, we present an updated version,
NetGO 2.0, which further improves the performance of large-scale AFP.
NetGO 2.0 also incorporates literature information via logistic
regression and deep sequence information via a recurrent neural network
(RNN) into the framework. We generated datasets following the Critical
Assessment of Functional Annotation (CAFA) protocol. Experimental results
show that NetGO 2.0 significantly outperformed NetGO in the biological
process ontology (BPO) and the cellular component ontology (CCO). In
particular, NetGO 2.0 achieved a 12.6% improvement over NetGO in terms
of area under the precision-recall curve (AUPR) in BPO and around 2.6% in
terms of Fmax in CCO. These results demonstrate the benefits of
incorporating text and deep sequence information for functional
annotation in BPO and CCO. The NetGO 2.0 web server is freely available
at http://issubmission.sjtu.edu.cn/ng2/.
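
For readers unfamiliar with the Fmax metric mentioned above, the following is a minimal sketch of the protein-centric Fmax as it is commonly defined in CAFA-style evaluations: the maximum, over score thresholds, of the harmonic mean of average precision and average recall. It is not taken from the NetGO 2.0 code base. The function name `fmax`, the input layout, the threshold grid, and the toy term labels (`GO:A`, `GO:B`, `GO:C`) are all assumptions made for illustration.

```python
def fmax(predictions, truths, thresholds=None):
    """Protein-centric Fmax (illustrative sketch, not the authors' code).

    predictions: dict mapping protein ID -> dict of term -> score in [0, 1]
    truths:      dict mapping protein ID -> set of true terms
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    # Only proteins with at least one true annotation are evaluated.
    evaluated = {p: t for p, t in truths.items() if t}
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in evaluated.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over all evaluated proteins.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(evaluated)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy data with hypothetical term labels, for illustration only.
toy_predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4},
    "P2": {"GO:A": 0.6, "GO:C": 0.2},
}
toy_truths = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:C"},
}
toy_fmax = fmax(toy_predictions, toy_truths)
print(round(toy_fmax, 4))
```

In this toy case the best trade-off occurs at low thresholds, where every true term is recovered at the cost of one false positive for `P2`.
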
Original language | English |
---|---|
Pages (from-to) | W469-W475 |
Number of pages | 7 |
Journal | Nucleic Acids Research |
Volume | 49 |
Issue number | W1 |
Early online date | 26 May 2021 |
Publication status | Published - 02 Jul 2021 |