TY - JOUR
T1 - DeepText2GO
T2 - Improving large-scale protein function prediction with deep semantic text representation
AU - You, Ronghui
AU - Huang, Xiaodi
AU - Zhu, Shanfeng
N1 - Includes bibliographical references.
PY - 2018/8/1
Y1 - 2018/8/1
N2 - As of April 2018, UniProtKB has collected more than 115 million protein sequences. Fewer than 0.15% of these proteins, however, have been associated with experimental GO annotations. As such, the use of automatic protein function prediction (AFP) to reduce this huge gap becomes increasingly important. Previous studies have concluded that sequence-homology-based methods are highly effective in AFP. In addition, mining motif, domain, and functional information from protein sequences has been found very helpful for AFP. Beyond sequences, however, alternative information sources such as text may be useful for AFP as well. Instead of using the bag-of-words (BOW) representation of traditional text-based AFP, we propose a new method called DeepText2GO that relies on deep semantic text representation, together with different kinds of available protein information such as sequence homology, families, domains, and motifs, to improve large-scale AFP. Furthermore, DeepText2GO integrates text-based methods with sequence-based ones by means of a consensus approach. Extensive experiments on a benchmark dataset extracted from UniProt/SwissProt demonstrate that DeepText2GO significantly outperforms both text-based and sequence-based methods.
AB - As of April 2018, UniProtKB has collected more than 115 million protein sequences. Fewer than 0.15% of these proteins, however, have been associated with experimental GO annotations. As such, the use of automatic protein function prediction (AFP) to reduce this huge gap becomes increasingly important. Previous studies have concluded that sequence-homology-based methods are highly effective in AFP. In addition, mining motif, domain, and functional information from protein sequences has been found very helpful for AFP. Beyond sequences, however, alternative information sources such as text may be useful for AFP as well. Instead of using the bag-of-words (BOW) representation of traditional text-based AFP, we propose a new method called DeepText2GO that relies on deep semantic text representation, together with different kinds of available protein information such as sequence homology, families, domains, and motifs, to improve large-scale AFP. Furthermore, DeepText2GO integrates text-based methods with sequence-based ones by means of a consensus approach. Extensive experiments on a benchmark dataset extracted from UniProt/SwissProt demonstrate that DeepText2GO significantly outperforms both text-based and sequence-based methods.
KW - Large-scale protein function prediction
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=85048880771&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85048880771&partnerID=8YFLogxK
U2 - 10.1016/j.ymeth.2018.05.026
DO - 10.1016/j.ymeth.2018.05.026
M3 - Article
C2 - 29883746
AN - SCOPUS:85048880771
SN - 1046-2023
VL - 145
SP - 82
EP - 90
JO - Methods
JF - Methods
ER -