TY - GEN
T1 - Approximate record matching using hash grams
AU - Gollapalli, Mohammed
AU - Li, Xue
AU - Wood, Ian
AU - Governatori, Guido
PY - 2011
Y1 - 2011
N2 - Accurately identifying duplicate records between multiple data sources is a persistent problem that continues to plague organizations and researchers alike. Small inconsistencies between records can prevent two otherwise identical records from being detected as duplicates. In this paper, we present a new probabilistic h-gram (hash gram) record matching technique that extends traditional n-grams and uses scale-based hashing for equality testing. By transforming data into its equivalent numerical realities, h-gram matching greatly reduces the number of comparisons needed for duplicate record detection and is applicable to a variety of data types and data sizes. A key feature of h-gram matching is that it is highly extensible, providing more intuitive and flexible results. With the sampling technique in place, our method can be applied to databases of varying size to perform data linkage, and probabilistic results can be obtained quickly. We have extensively evaluated h-gram matching on large samples of real-world data, and the results show a higher level of accuracy as well as a reduction in required time compared with existing techniques.
AB - Accurately identifying duplicate records between multiple data sources is a persistent problem that continues to plague organizations and researchers alike. Small inconsistencies between records can prevent two otherwise identical records from being detected as duplicates. In this paper, we present a new probabilistic h-gram (hash gram) record matching technique that extends traditional n-grams and uses scale-based hashing for equality testing. By transforming data into its equivalent numerical realities, h-gram matching greatly reduces the number of comparisons needed for duplicate record detection and is applicable to a variety of data types and data sizes. A key feature of h-gram matching is that it is highly extensible, providing more intuitive and flexible results. With the sampling technique in place, our method can be applied to databases of varying size to perform data linkage, and probabilistic results can be obtained quickly. We have extensively evaluated h-gram matching on large samples of real-world data, and the results show a higher level of accuracy as well as a reduction in required time compared with existing techniques.
KW - Approximate matching
KW - Data linkage
KW - Record matching
KW - Structure matching
UR - http://www.scopus.com/inward/record.url?scp=84863151535&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84863151535&partnerID=8YFLogxK
U2 - 10.1109/ICDMW.2011.33
DO - 10.1109/ICDMW.2011.33
M3 - Conference paper
AN - SCOPUS:84863151535
SN - 9780769544090
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 504
EP - 511
BT - Proceedings - 11th IEEE International Conference on Data Mining Workshops, ICDMW 2011
T2 - 11th IEEE International Conference on Data Mining Workshops, ICDMW 2011
Y2 - 11 December 2011 through 11 December 2011
ER -