Combining K-MEANS and a genetic algorithm through a novel arrangement of genetic operators for high quality clustering

Research output: Research - peer-review › Article

Abstract

Knowledge discovery from data can be broadly categorized into two types: supervised and unsupervised. A supervised knowledge discovery process, such as classification by decision trees, typically requires class labels, which are sometimes unavailable in datasets. Unsupervised techniques, such as clustering, can handle datasets without class labels: they aim to let the data reveal the groups (i.e. the data elements in each group) and the number of groups. For the ubiquitous task of clustering, K-MEANS is the most widely used algorithm, applied in a broad range of areas to identify groups where intra-group distances are much smaller than inter-group distances. As a representative-based clustering approach, K-MEANS offers an extremely efficient gradient-descent-style minimization of the total squared error of representation; however, it not only demands the parameter k, it also assumes that clusters have similar densities and is consequently profoundly affected by noise. Perhaps more seriously, it is often attracted to local optima even when embedded in a multi-start scheme. We present an effective genetic algorithm that combines the capacity of genetic operators to recombine different solutions from the search space with the exploitation power of a hill-climber. We advance a previous genetic-search approach, GENCLUST, by interleaving fast hill-climbing cycles of K-MEANS, and obtain an algorithm that is faster than its predecessor and achieves clustering results of higher quality. We demonstrate this across 18 commonly researched datasets.
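
The abstract describes K-MEANS as an efficient hill-climber on the total squared error of representation (the sum of squared distances from each data point to its nearest cluster representative) and proposes interleaving such hill-climbing cycles with genetic operators. The sketch below illustrates that general hybrid pattern only; it is not the authors' GENCLUST-based algorithm, and the population size, selection and crossover scheme, mutation scale, and helper names (hybrid_ga_kmeans, kmeans_refine, total_squared_error) are illustrative assumptions.

# A minimal sketch of a hybrid genetic algorithm over candidate centroid sets,
# with short K-MEANS hill-climbing cycles refining every individual.
# NOT the authors' GENCLUST-based method; all choices here are assumptions.
import numpy as np

def total_squared_error(X, centroids):
    """Total squared error of representation: sum of squared distances from
    each point to its nearest centroid (the objective K-MEANS hill-climbs)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def kmeans_refine(X, centroids, iters=3):
    """A few fast K-MEANS cycles: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def hybrid_ga_kmeans(X, k, pop_size=10, generations=20, seed=0):
    """Evolve centroid sets with selection, crossover and mutation, using
    K-MEANS refinement as the hill-climbing (exploitation) step."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Each chromosome is a set of k candidate centroids sampled from the data.
    population = [X[rng.choice(len(X), size=k, replace=False)].copy()
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Exploitation: refine every chromosome with short K-MEANS cycles.
        population = [kmeans_refine(X, c, iters=3) for c in population]
        # Selection: keep the fitter half (lower squared error is better).
        population.sort(key=lambda c: total_squared_error(X, c))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            i, j = rng.choice(len(survivors), size=2, replace=False)
            # Crossover: mix centroids from two parent solutions.
            mask = rng.random(k) < 0.5
            child = np.where(mask[:, None], survivors[i], survivors[j]).copy()
            # Mutation: jitter one centroid (the scale is an arbitrary choice).
            child[rng.integers(k)] += rng.normal(scale=0.1, size=X.shape[1])
            children.append(child)
        population = survivors + children
    return min(population, key=lambda c: total_squared_error(X, c))

For a numeric data matrix X, hybrid_ga_kmeans(X, k=3) returns the centroid set with the lowest total squared error found. Note that this sketch fixes k, whereas the approach described in the abstract aims to let the data reveal the number of groups as well.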

Language: English
Pages: 402-417
Number of pages: 16
Journal: Expert Systems with Applications
Volume: 91
Early online date: 14 Sep 2017
DOI: 10.1016/j.eswa.2017.09.005
State: Published - 01 Jan 2018

Cite this

@article{b5525d2790c544e19db0cc5e3916c20c,
title = "Combining K-MEANS and a genetic algorithm through a novel arrangement of genetic operators for high quality clustering",
abstract = "Knowledge discovery from data can be broadly categorized into two types: supervised and unsupervised. A supervised knowledge discovery process such as classification by decision trees typically requires class labels which are sometimes unavailable in datasets. Unsupervised knowledge discovery techniques such as an unsupervised clustering technique can handle datasets without class labels. They aim to let data reveal the groups (i.e. the data elements in each group) and the number of groups. For the ubiquitous task of clustering, K-MEANS is the most used algorithm applied in a broad range of areas to identify groups where intra-group distances are much smaller than inter-group distances. As a representative-based clustering approach, K-MEANS offers an extremely efficient gradient descent approach to the total squared error of representation; however, it not only demands the parameter k, but it also makes assumptions about the similarity of density among the clusters. Therefore, it is profoundly affected by noise. Perhaps more seriously, it can often be attracted to local optima despite its immersion in a multi-start scheme. We present an effective genetic algorithm that combines the capacity of genetic operators to conglomerate different solutions of the search space with the exploitation of the hill-climber. We advance a previous genetic-searching approach called GENCLUST, with the intervention of fast hill-climbing cycles of K-MEANS and obtain an algorithm that is faster than its predecessor and achieves clustering results of higher quality. We demonstrate this across a series of 18 commonly researched datasets.",
keywords = "Cluster evaluation, Clustering, Data mining, Genetic algorithm, K-MEANS",
author = "Islam, {Md Zahidul} and Vladimir Estivill-Castro and Rahman, {Md Anisur} and Terry Bossomaier",
year = "2018",
month = "1",
doi = "10.1016/j.eswa.2017.09.005",
volume = "91",
pages = "402--417",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier",

}
