Partial mixture model for tight clustering of gene expression time-course

Yinyin Yuan, Chang-Tsun Li, Roland Wilson

Research output: Contribution to journal (Article)

12 Citations (Scopus)

Abstract

BACKGROUND: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, little work in the literature is dedicated to this area of research. Meanwhile, maximum likelihood techniques have been used extensively for model parameter estimation, whereas the minimum distance estimator has been largely ignored.

RESULTS: In this paper we show that the inherent robustness of the minimum distance estimator makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, we formulate a partial mixture model that can naturally incorporate replicate information and allow for scattered genes. We provide experimental results of simulated data fitting, in which the minimum distance estimator outperforms the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores highest in Gene Ontology-driven evaluation, in comparison with four other popular clustering algorithms.

CONCLUSION: For the first time, the partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm demonstrates the suitability of combining the partial mixture model and the minimum distance estimator in this field. We show that tight clustering is not only capable of generating a more profound understanding of the dataset under study, in accordance with established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidence that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion.
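
The robustness claim in the abstract can be illustrated with a small sketch. The abstract does not say which distance is minimized, so this example assumes the integrated squared error (L2E) criterion, one common minimum distance estimator, and fits a single weighted Gaussian to scalar data rather than the paper's regression model over time courses. The function names, the grid search, and the synthetic data are all hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_partial_gaussian(data, mus, sigmas):
    """Grid-search the L2E criterion for one weighted Gaussian w * N(mu, sigma^2).

    Criterion: w^2 * A - 2 * w * B, where
      A = integral of N(x; mu, sigma)^2 dx = 1 / (2 * sigma * sqrt(pi))
      B = average of N(x_i; mu, sigma) over the data.
    For fixed (mu, sigma) the best weight is w = B / A, capped at 1.
    """
    best = None
    for mu in mus:
        for sigma in sigmas:
            a = 1.0 / (2.0 * sigma * math.sqrt(math.pi))
            b = sum(normal_pdf(x, mu, sigma) for x in data) / len(data)
            w = min(1.0, b / a)
            crit = w * w * a - 2.0 * w * b
            if best is None or crit < best[0]:
                best = (crit, w, mu, sigma)
    return best[1], best[2], best[3]


# Synthetic data (an assumption for illustration): 80 "clustered" points
# around 0 and 20 "scattered" points far away.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(80)]
data += [rng.uniform(8.0, 12.0) for _ in range(20)]

mus = [i / 10.0 for i in range(-30, 121)]   # candidate means, -3.0 to 12.0
sigmas = [i / 10.0 for i in range(3, 51)]   # candidate std devs, 0.3 to 5.0

w_hat, mu_hat, sigma_hat = fit_partial_gaussian(data, mus, sigmas)
mle_mean = sum(data) / len(data)  # ML estimate of the mean of a single Gaussian

print("L2E weight, mean, std:", w_hat, mu_hat, sigma_hat)
print("ML mean:", mle_mean)
```

Because the weight `w` is allowed to be less than 1, the fitted component only has to account for the clustered points, and the scattered points are left unexplained instead of dragging the estimate toward them. The maximum likelihood mean of a single Gaussian has no such freedom and is pulled toward the scattered points.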

Original language: English
Pages (from-to): 287
Journal: BMC Bioinformatics
Volume: 9
DOI: 10.1186/1471-2105-9-287
Publication status: Published, 18 Jun 2008

Cite this

Yuan, Yinyin; Li, Chang-Tsun; Wilson, Roland. Partial mixture model for tight clustering of gene expression time-course. In: BMC Bioinformatics. 2008; Vol. 9, p. 287.
@article{e57cd0c197084957855e2f932d3944be,
title = "Partial mixture model for tight clustering of gene expression time-course",
keywords = "Cell Cycle, Cluster Analysis, Computational Biology, Databases, Genetic, Gene Expression, Gene Expression Profiling, Genes, Fungal, Likelihood Functions, Models, Genetic, Multigene Family, Neural Networks (Computer), Regression Analysis, Saccharomyces cerevisiae, Time Factors, Journal Article",
author = "Yinyin Yuan and Chang-Tsun Li and Roland Wilson",
year = "2008",
month = "6",
day = "18",
doi = "10.1186/1471-2105-9-287",
language = "English",
volume = "9",
pages = "287",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Partial mixture model for tight clustering of gene expression time-course

AU - Yuan, Yinyin

AU - Li, Chang-Tsun

AU - Wilson, Roland

PY - 2008/6/18

Y1 - 2008/6/18

KW - Cell Cycle

KW - Cluster Analysis

KW - Computational Biology

KW - Databases, Genetic

KW - Gene Expression

KW - Gene Expression Profiling

KW - Genes, Fungal

KW - Likelihood Functions

KW - Models, Genetic

KW - Multigene Family

KW - Neural Networks (Computer)

KW - Regression Analysis

KW - Saccharomyces cerevisiae

KW - Time Factors

KW - Journal Article

U2 - 10.1186/1471-2105-9-287

DO - 10.1186/1471-2105-9-287

M3 - Article

C2 - 18564420

VL - 9

SP - 287

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

ER -