23 Citations (Scopus)


Big data analytics is a fast-growing research domain that combines computational (i.e. computer-intensive) and inferential (i.e. statistics-oriented) thinking. Information is increasingly gathered in big data environments, for example distinct protein-coding data used to identify various critical diseases and their cures. Data pre-processing techniques are used to make data clean, noise-free and consistent so that it can be modelled for various real-life purposes. This paper examines a range of statistics-based data pre-processing methods and machine learning algorithms to assess their performance in the big data analysis setting. Amino acid sequence data for tuberculosis-affected proteins from the National Center for Biotechnology Information (NCBI) database are used for the empirical results. Findings reveal that statistics-based pre-processing methods are effective in making big data usable for meaningful modelling and analysis with novel machine learning algorithms such as the hidden Markov chain model, Box-Cox and linear transformation, and that they also maintain the performance of those algorithms. Although significant differences are observed between the predictive outcomes and performances of the algorithms, the results further demonstrate that the hidden Markov chain model produced more accurate, exact and faster analysis with reliable estimates.
Original language: English
Article number: 3
Pages (from-to): 44-65
Number of pages: 22
Journal: International Journal of Artificial Intelligence
Issue number: 2
Publication status: Published - Oct 2019


Title: Statistics-based data preprocessing methods and machine learning algorithms for big data analysis
