Big data analytics is a fast-growing research domain that combines computational (i.e. computer-intensive) and inferential (i.e. statistics-oriented) thinking. Information is increasingly gathered in big data environments, for example distinct protein-coding data used to identify various critical diseases and their cures. Data pre-processing techniques are used to make data clean, noise-free and consistent so that they can be modelled for various real-life purposes. This paper examines a range of statistics-based data pre-processing methods and machine learning algorithms to assess their performance in a big data analysis setting. Amino acid sequence data for tuberculosis-affected proteins, taken from the National Center for Biotechnology Information (NCBI) database, are used to obtain empirical results. Findings reveal that statistics-based pre-processing methods are effective in making big data usable for meaningful modelling and analysis with novel machine learning algorithms such as the hidden Markov chain model, Box-Cox and linear transformation, and that they also maintain the performance of those algorithms. Although significant differences are observed between the predictive outcomes and performance of the algorithms, the results further demonstrate that the hidden Markov chain model produced more accurate, more exact and faster analysis with reliable estimates.
|Number of pages||22|
|Journal||International Journal of Artificial Intelligence|
|Publication status||Published - Oct 2019|