Data Convexity and Parameter Independent Clustering for Biomedical Datasets

Md Anisur Rahman, Li-Minn Ang, Kah Phooi Seng

Research output: Contribution to journal (Article)

Abstract

In machine learning, properties of the dataset itself, such as the convexity of its data point sets, affect which clustering algorithm performs well. This brief paper first examines how data convexity influences clustering performance on biomedical datasets. It then addresses the main challenges of two well-known families of clustering techniques: centroid-based and density-based clustering. These techniques typically require the user to supply a set of parameters before they can produce good clusterings and find the optimal number of clusters. Two parameter-independent clustering techniques that use unique neighborhood sets (UNSs) are introduced: Parameter Independent Convex Centroid-based Clustering (ConvexClust) for convex-dominated datasets and Parameter Independent Non-Convex Density-based Clustering (NonConvexClust) for nonconvex-dominated datasets. The ConvexClust and NonConvexClust algorithms are extensively evaluated on real-world biomedical datasets, and their performance is compared with that of other clustering algorithms using evaluation criteria such as sum of squared errors (SSE), entropy, and purity. The results show that the proposed parameter-independent techniques perform well and that most of the biomedical datasets in the experiments tend towards convex-dominated data point sets.
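The three evaluation criteria named in the abstract can be illustrated with a minimal sketch. It uses the common textbook definitions (SSE as squared distance to each cluster's mean, purity as the fraction of points in their cluster's majority class, entropy as the size-weighted class entropy per cluster); the paper's exact formulations may differ, and the function names `sse`, `purity`, and `entropy` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def sse(points, labels):
    """Sum of squared distances from each point to its cluster's mean."""
    groups = defaultdict(list)
    for point, cluster in zip(points, labels):
        groups[cluster].append(point)
    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


def _class_counts(labels, classes):
    """Count ground-truth classes inside each predicted cluster."""
    counts = defaultdict(Counter)
    for cluster, true_class in zip(labels, classes):
        counts[cluster][true_class] += 1
    return counts


def purity(labels, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    counts = _class_counts(labels, classes)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def entropy(labels, classes):
    """Size-weighted average of the class entropy (in bits) of each cluster."""
    counts = _class_counts(labels, classes)
    n = len(labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((k / size) * math.log2(k / size) for k in c.values())
        total += (size / n) * h
    return total


# Toy data (made up for illustration, not taken from the paper).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]            # predicted cluster per point
classes = ["a", "a", "b", "a"]   # ground-truth class per point

print(sse(points, labels), purity(labels, classes), entropy(labels, classes))
```

Lower SSE and entropy and higher purity indicate a better clustering under these definitions. SSE needs only the points and cluster labels, whereas purity and entropy also need ground-truth class labels.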
Original language: English
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Publication status: Accepted/In press - 07 Feb 2020

Fingerprint

Clustering algorithms
Cluster Analysis
Convexity
Clustering
Learning systems
Entropy
Centroid
Point Sets
Clustering Algorithm
Experiments
Datasets
Number of Clusters
Two Parameters
Machine Learning
Evaluation
Experiment

Cite this

@article{7f2ca6deb8fd49a8be58ea6ade51dc07,
title = "Data Convexity and Parameter Independent Clustering for Biomedical Datasets",
abstract = "In machine learning, properties of the dataset itself, such as the convexity of its data point sets, affect which clustering algorithm performs well. This brief paper first examines how data convexity influences clustering performance on biomedical datasets. It then addresses the main challenges of two well-known families of clustering techniques: centroid-based and density-based clustering. These techniques typically require the user to supply a set of parameters before they can produce good clusterings and find the optimal number of clusters. Two parameter-independent clustering techniques that use unique neighborhood sets (UNSs) are introduced: Parameter Independent Convex Centroid-based Clustering (ConvexClust) for convex-dominated datasets and Parameter Independent Non-Convex Density-based Clustering (NonConvexClust) for nonconvex-dominated datasets. The ConvexClust and NonConvexClust algorithms are extensively evaluated on real-world biomedical datasets, and their performance is compared with that of other clustering algorithms using evaluation criteria such as sum of squared errors (SSE), entropy, and purity. The results show that the proposed parameter-independent techniques perform well and that most of the biomedical datasets in the experiments tend towards convex-dominated data point sets.",
author = "Rahman, {Md Anisur} and Ang, Li-Minn and Seng, {Kah Phooi}",
year = "2020",
month = "2",
day = "7",
language = "English",
journal = "IEEE/ACM Transactions on Computational Biology and Bioinformatics",
issn = "1545-5963",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",
}

TY - JOUR

T1 - Data Convexity and Parameter Independent Clustering for Biomedical Datasets

AU - Rahman, Md Anisur

AU - Ang, Li-Minn

AU - Seng, Kah Phooi

PY - 2020/2/7

Y1 - 2020/2/7

AB - In machine learning, properties of the dataset itself, such as the convexity of its data point sets, affect which clustering algorithm performs well. This brief paper first examines how data convexity influences clustering performance on biomedical datasets. It then addresses the main challenges of two well-known families of clustering techniques: centroid-based and density-based clustering. These techniques typically require the user to supply a set of parameters before they can produce good clusterings and find the optimal number of clusters. Two parameter-independent clustering techniques that use unique neighborhood sets (UNSs) are introduced: Parameter Independent Convex Centroid-based Clustering (ConvexClust) for convex-dominated datasets and Parameter Independent Non-Convex Density-based Clustering (NonConvexClust) for nonconvex-dominated datasets. The ConvexClust and NonConvexClust algorithms are extensively evaluated on real-world biomedical datasets, and their performance is compared with that of other clustering algorithms using evaluation criteria such as sum of squared errors (SSE), entropy, and purity. The results show that the proposed parameter-independent techniques perform well and that most of the biomedical datasets in the experiments tend towards convex-dominated data point sets.

M3 - Article

JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics

JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics

SN - 1545-5963

ER -