Abstract
Discretization is the process of converting numerical values into categorical values. There are many existing techniques for discretization. However, the existing techniques have various limitations such as the requirement of a user input on the number of categories and number of records in each category. Therefore, we propose a new discretization technique called low frequency discretizer (LFD) that does not require any user input. There are some existing techniques that do not require user input, but they rely on various assumptions such as the number of records in each interval is same, and the number of intervals is equal to the number of records in each interval. These assumptions are often difficult to justify. LFD does not require any assumptions. In LFD the number of categories and frequency of each category are not pre-defined, rather data driven. Other contributions of LFD are as follows. LFD uses low frequency values as cut points and thus reduces the information loss due to discretization. It uses all other categorical attributes and any numerical attribute that has already been categorized. It considers that the influence of an attribute in discretization of another attribute depends on the strength of their relationship. We evaluate LFD by comparing it with six (6) existing techniques on eight (8) datasets for three different types of evaluation, namely the classification accuracy, imputation accuracy and noise detection accuracy. Our experimental results indicate a significant improvement based on the sign test analysis.
Original language | English |
---|---|
Pages (from-to) | 410-423 |
Number of pages | 14 |
Journal | Expert Systems with Applications |
Volume | 45 |
DOIs | |
Publication status | Published - 2016 |