Abstract
Data pre-processing and cleansing play a vital role in data mining by ensuring good data quality. Data cleansing tasks include imputation of missing values, identification of outliers, and identification and correction of noisy data. In this paper, we present a novel technique called \textit{A \textbf{F}uzzy \textbf{E}xpectation \textbf{M}aximisation and Fuzzy Clustering based Missing Value \textbf{I}mputation Framework for Data Pre-processing (FEMI)}. It imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record having a missing value. To identify a group of similar records and make a guess based on that group, it applies a fuzzy clustering approach and our novel fuzzy expectation maximisation algorithm. We evaluate FEMI on eight publicly available natural data sets by comparing its performance with that of two high-quality existing techniques, namely EMI and IBLLS. We use thirty-two types (patterns) of missing values for each data set. Several evaluation criteria, namely the coefficient of determination ($R^2$), index of agreement ($d_2$), root mean squared error ($RMSE$), and mean absolute error ($MAE$), are used. Our experimental results indicate (according to a confidence interval and t-test analysis) that FEMI performs significantly better than EMI and IBLLS.
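The four evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function name `imputation_metrics` and its arguments are hypothetical, and the abstract does not state the formulas, so two are assumed here: $R^2$ is taken as the squared Pearson correlation between actual and imputed values, and $d_2$ as Willmott's index of agreement of order 2.

```python
import math


def imputation_metrics(actual, imputed):
    """Compare imputed values against the held-out actual values.

    Assumed definitions (not taken from the abstract):
      R^2 : squared Pearson correlation between actual and imputed values
      d_2 : Willmott's index of agreement of order 2
    Both inputs are equal-length sequences of numbers; the actual values
    must not all be identical, otherwise the denominators are zero.
    """
    n = len(actual)
    mean_o = sum(actual) / n    # mean of the actual (observed) values
    mean_p = sum(imputed) / n   # mean of the imputed (predicted) values

    sq_err = sum((p - o) ** 2 for o, p in zip(actual, imputed))
    abs_err = sum(abs(p - o) for o, p in zip(actual, imputed))

    rmse = math.sqrt(sq_err / n)
    mae = abs_err / n

    # Index of agreement: 1 minus squared error over "potential" error,
    # where potential error is measured around the mean of the actual values.
    potential = sum(
        (abs(p - mean_o) + abs(o - mean_o)) ** 2
        for o, p in zip(actual, imputed)
    )
    d2 = 1.0 - sq_err / potential

    # Squared Pearson correlation between actual and imputed values.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(actual, imputed))
    var_o = sum((o - mean_o) ** 2 for o in actual)
    var_p = sum((p - mean_p) ** 2 for p in imputed)
    r2 = cov ** 2 / (var_o * var_p)

    return {"R2": r2, "d2": d2, "RMSE": rmse, "MAE": mae}
```

Under these assumed definitions, higher $R^2$ and $d_2$ and lower $RMSE$ and $MAE$ indicate better imputation. An imputation that is off by a constant still scores $R^2 = 1$ while its $RMSE$ and $MAE$ are non-zero, which is one reason to report several criteria together.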
Original language | English |
---|---|
Pages (from-to) | 389-422 |
Number of pages | 34 |
Journal | Knowledge and Information Systems |
Volume | 46 |
Issue number | 2 |
Early online date | 2015 |
Publication status | Published - 2016 |