Discretization of Continuous Attributes Through Low Frequency Numerical Values and Attribute Interdependency

Research output: Contribution to journal (Article)

Abstract

Discretization is the process of converting numerical values into categorical values. Many discretization techniques exist, but they have various limitations, such as requiring user input on the number of categories and the number of records in each category. We therefore propose a new discretization technique, called low frequency discretizer (LFD), that does not require any user input. Some existing techniques also need no user input, but they rely on assumptions such as that the number of records in each interval is the same, or that the number of intervals is equal to the number of records in each interval. These assumptions are often difficult to justify. LFD does not require any such assumptions: the number of categories and the frequency of each category are not predefined but data driven. LFD makes the following further contributions. It uses low frequency values as cut points and thus reduces the information loss due to discretization. When discretizing an attribute, it uses all other categorical attributes and any numerical attribute that has already been categorized. It considers that the influence of one attribute on the discretization of another depends on the strength of their relationship. We evaluate LFD by comparing it with six existing techniques on eight datasets using three types of evaluation: classification accuracy, imputation accuracy and noise detection accuracy. Our experimental results indicate a significant improvement based on sign test analysis.
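
The idea of using low frequency values as cut points can be illustrated with a small sketch. This is not the LFD algorithm from the paper: the rule used here (a distinct value is a cut point when it occurs less often than both neighbouring distinct values) and the function names `low_frequency_cut_points` and `discretize` are assumptions made for illustration only, and the sketch ignores the attribute interdependency that LFD also takes into account.

```python
from bisect import bisect_left
from collections import Counter


def low_frequency_cut_points(values):
    """Return interior values whose frequency is a strict local minimum.

    Illustrative assumption, not the LFD rule from the paper: a distinct
    value is a cut point when it occurs less often than both of its
    neighbouring distinct values.
    """
    counts = Counter(values)
    distinct = sorted(counts)
    cuts = []
    for i in range(1, len(distinct) - 1):
        prev_v, v, next_v = distinct[i - 1], distinct[i], distinct[i + 1]
        if counts[v] < counts[prev_v] and counts[v] < counts[next_v]:
            cuts.append(v)
    return cuts


def discretize(values, cuts):
    """Map each value to the index of its interval.

    Interval k holds values v with cuts[k-1] < v <= cuts[k], so a cut
    value itself falls in the interval that ends at it.
    """
    return [bisect_left(cuts, v) for v in values]


values = [1, 1, 1, 2, 3, 3, 3, 3, 4, 5, 5, 5]
cuts = low_frequency_cut_points(values)
categories = discretize(values, cuts)
print(cuts)        # [2, 4]
print(categories)  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
```

In this toy data only two of the twelve records (the values 2 and 4) sit exactly on a boundary, which gives some intuition for why cutting at rarely occurring values might disturb few records. Neither the number of categories nor their sizes is passed in; both follow from the data.
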
Original language: English
Pages (from-to): 410-423
Number of pages: 14
Journal: Expert Systems with Applications
Volume: 45
DOI: 10.1016/j.eswa.2015.10.005
Publication status: Published - 2016

Cite this

@article{2ffb5eea83314f96a5ce3d3c91c80f12,
title = "Discretization of Continuous Attributes Through Low Frequency Numerical Values and Attribute Interdependency",
keywords = "Corrupt data detection, Data cleansing, Data discretization, Data mining, Data pre-processing, Missing value imputation",
author = "Rahman, {Md Geaur} and Islam, {Md Zahidul}",
note = "Imported on 12 Apr 2017 - DigiTool details were: Journal title (773t) = Expert Systems with Applications. ISSNs: 1873-6793;",
year = "2016",
doi = "10.1016/j.eswa.2015.10.005",
language = "English",
volume = "45",
pages = "410--423",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier",

}

TY - JOUR

T1 - Discretization of Continuous Attributes Through Low Frequency Numerical Values and Attribute Interdependency

AU - Rahman, Md Geaur

AU - Islam, Md Zahidul

N1 - Imported on 12 Apr 2017 - DigiTool details were: Journal title (773t) = Expert Systems with Applications. ISSNs: 1873-6793;

PY - 2016

Y1 - 2016

AB - Discretization is the process of converting numerical values into categorical values. There are many existing techniques for discretization. However, the existing techniques have various limitations such as the requirement of a user input on the number of categories and number of records in each category. Therefore, we propose a new discretization technique called low frequency discretizer (LFD) that does not require any user input. There are some existing techniques that do not require user input, but they rely on various assumptions such as the number of records in each interval is same, and the number of intervals is equal to the number of records in each interval. These assumptions are often difficult to justify. LFD does not require any assumptions. In LFD the number of categories and frequency of each category are not pre-defined, rather data driven. Other contributions of LFD are as follows. LFD uses low frequency values as cut points and thus reduces the information loss due to discretization. It uses all other categorical attributes and any numerical attribute that has already been categorized. It considers that the influence of an attribute in discretization of another attribute depends on the strength of their relationship. We evaluate LFD by comparing it with six (6) existing techniques on eight (8) datasets for three different types of evaluation, namely the classification accuracy, imputation accuracy and noise detection accuracy. Our experimental results indicate a significant improvement based on the sign test analysis.

KW - Corrupt data detection

KW - Data cleansing

KW - Data discretization

KW - Data mining

KW - Data pre-processing

KW - Missing value imputation

U2 - 10.1016/j.eswa.2015.10.005

DO - 10.1016/j.eswa.2015.10.005

M3 - Article

VL - 45

SP - 410

EP - 423

JO - Expert Systems with Applications

JF - Expert Systems with Applications

SN - 0957-4174

ER -