TY - JOUR
T1 - SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins
AU - Charoenkwan, Phasit
AU - Schaduangrat, Nalini
AU - Moni, Mohammad Ali
AU - Liò, Pietro
AU - Manavalan, Balachandran
AU - Shoombuatong, Watshara
N1 - Publisher Copyright: © 2022
PY - 2022/7
Y1 - 2022/7
AB - Thermophilic proteins (TPPs) are important in protein biochemistry and in the development of new enzymes, so computational methods that identify TPPs accurately and rapidly are needed. Several computational methods have been developed for TPP identification; however, a few limitations in terms of performance and utility remain. In this study, we present a novel computational method, SAPPHIRE, that achieves more accurate identification of TPPs using only sequence information, without any need for structural information. We combined twelve feature encodings representing different perspectives with six popular machine learning algorithms to train 72 baseline models and extract the key information of TPPs. Subsequently, the informative predicted probabilities from the baseline models were mined and selected using a genetic algorithm in conjunction with a self-assessment-report approach. Finally, the meta-predictor, SAPPHIRE, was built and optimized using the resulting optimal feature set. In the 10-fold cross-validation test, SAPPHIRE achieved superior predictive performance compared with several baseline models. Moreover, in the independent test, SAPPHIRE yielded an accuracy of 0.942 and a Matthews correlation coefficient of 0.884, which were 7.68% and 5.12% higher, respectively, than those of existing methods. The proposed computational approach is anticipated to facilitate large-scale identification of TPPs and accelerate their applications in the food industry. The code and datasets are available at https://github.com/plenoi/SAPPHIRE.
KW - Bioinformatics
KW - Feature selection
KW - Machine learning
KW - Sequence analysis
KW - Stacking strategy
KW - Thermophilic protein
UR - http://www.scopus.com/inward/record.url?scp=85131801539&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131801539&partnerID=8YFLogxK
U2 - 10.1016/j.compbiomed.2022.105704
DO - 10.1016/j.compbiomed.2022.105704
M3 - Article
C2 - 35690478
AN - SCOPUS:85131801539
SN - 0010-4825
VL - 146
SP - 1
EP - 9
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 105704
ER -