Novel algorithms for cost-sensitive classification and knowledge discovery in class imbalanced datasets with an application to NASA software defects

Michael J. Siers, Md Zahidul Islam

Research output: Contribution to journalArticlepeer-review

32 Citations (Scopus)

Abstract

Software defect prediction (SDP) involves using machine learning to locate bugs in source code. Datasets used for SDP are typically affected by an issue called class imbalance. Traditional learning algorithms do not perform well on class imbalanced datasets. Cost-sensitive learning has been used in SDP to minimise the monetary costs incurred by predictions. We propose a framework which produces cost-sensitive predictions and also mitigates class imbalance. Since our algorithm builds a decision forest classifier, knowledge can be extracted by manual inspection of the individual decision trees. To enhance this knowledge discovery process, we propose an algorithm for extracting the most interesting patterns from a decision forest. Our algorithm calculates interestingness as the potential financial gain of knowing the pattern. We then present a process which combines the above-mentioned techniques into an end-to-end cost-sensitive knowledge discovery process. This process is demonstrated by extracting knowledge from four software projects undertaken by the National Aeronautics and Space Administration (NASA).

Original languageEnglish
Pages (from-to)53-70
Number of pages18
JournalInformation Sciences
Volume459
Early online dateMay 2018
DOIs
Publication statusPublished - Aug 2018

Fingerprint

Dive into the research topics of 'Novel algorithms for cost-sensitive classification and knowledge discovery in class imbalanced datasets with an application to NASA software defects'. Together they form a unique fingerprint.

Cite this