Data science for class imbalanced and cost-sensitive data and its application to software defect prediction

Michael Siers

Research output: Thesis (Doctoral Thesis)

Abstract

Class imbalance and cost-sensitivity are two prominent challenges in classification. The overwhelming majority of techniques that address these issues focus only on predictive performance rather than suitability for knowledge discovery. This thesis addresses both issues by proposing the design of an approach with four important characteristics. Firstly, it generates a cost-sensitive decision forest that avoids the negative effects of class imbalance. Secondly, the forest is generated from the entire original training dataset, so the knowledge it contains directly matches the original data. Thirdly, a clear process is proposed that automatically extracts, ranks, and values the forest’s discovered knowledge. Lastly, the resulting classifier achieves competitive performance compared with several existing techniques. The knowledge discovery approach is demonstrated by discovering patterns in software bugs present in several National Aeronautics and Space Administration (NASA) programs. The conceptual design of a tool for real-time integration of the proposed techniques into the software development process is presented at the end of the thesis.
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Charles Sturt University
Supervisors/Advisors
  • Islam, Zahid, Principal Supervisor
  • Bossomaier, Terry, Co-Supervisor
Award date: 21 Apr 2019
Place of Publication: Australia
Publisher: Charles Sturt University
Publication status: Published - 2019

Fingerprint

Data mining
NASA
Defects
Conceptual design
Costs
Software engineering
Classifiers

Cite this

@phdthesis{1f751acff61c40e38a28afd1f147e3cc,
title = "Data science for class imbalanced and cost-sensitive data and its application to software defect prediction",
abstract = "Class imbalance and cost-sensitivity are two prominent challenges in classification. The overwhelming majority of techniques which address these issues only focus on predictive performance rather than suitability for knowledge discovery. This thesis focuses on addressing both issues. This thesis proposes the design for an approach with four important characteristics. Firstly, a cost-sensitive decision forest is generated which avoids the negative effects of class imbalance. Secondly, the forest is generated using the entirety of the original training dataset which means that the knowledge it contains directly matches the original data. Thirdly, a clear process is proposed which automatically extracts, ranks, and values the forest’s discovered knowledge. Lastly, the resulting classifier achieves competitive performance compared to several existing techniques. The knowledge discovery approach is demonstrated by discovering patterns in software bugs present in several NASA programs (National Aeronautics and Space Administration). The conceptual design of a tool for real-time integration of the proposed techniques into the software development process is also presented at the end of this thesis.",
keywords = "Class Imbalance, Data Science, Software Defect Prediction, Cost-Sensitive, Decision Forest, Decision Tree",
author = "Michael Siers",
year = "2019",
language = "English",
publisher = "Charles Sturt University",
address = "Australia",
school = "Charles Sturt University",
}

Siers, M 2019, 'Data science for class imbalanced and cost-sensitive data and its application to software defect prediction', Doctor of Philosophy, Charles Sturt University, Australia.

Data science for class imbalanced and cost-sensitive data and its application to software defect prediction. / Siers, Michael.

Australia : Charles Sturt University, 2019. 204 p.

TY - THES

T1 - Data science for class imbalanced and cost-sensitive data and its application to software defect prediction

AU - Siers, Michael

PY - 2019

Y1 - 2019

N2 - Class imbalance and cost-sensitivity are two prominent challenges in classification. The overwhelming majority of techniques which address these issues only focus on predictive performance rather than suitability for knowledge discovery. This thesis focuses on addressing both issues. This thesis proposes the design for an approach with four important characteristics. Firstly, a cost-sensitive decision forest is generated which avoids the negative effects of class imbalance. Secondly, the forest is generated using the entirety of the original training dataset which means that the knowledge it contains directly matches the original data. Thirdly, a clear process is proposed which automatically extracts, ranks, and values the forest’s discovered knowledge. Lastly, the resulting classifier achieves competitive performance compared to several existing techniques. The knowledge discovery approach is demonstrated by discovering patterns in software bugs present in several NASA programs (National Aeronautics and Space Administration). The conceptual design of a tool for real-time integration of the proposed techniques into the software development process is also presented at the end of this thesis.

KW - Class Imbalance

KW - Data Science

KW - Software Defect Prediction

KW - Cost-Sensitive

KW - Decision Forest

KW - Decision Tree

M3 - Doctoral Thesis

PB - Charles Sturt University

CY - Australia

ER -