TY - JOUR
T1 - Certus
T2 - An effective entity resolution approach with graph differential dependencies (GDDs)
AU - Kwashie, Selasi
AU - Liu, Lin
AU - Liu, Jixue
AU - Stumptner, Markus
AU - Li, Jiuyong
AU - Yang, Lujing
PY - 2019/2
Y1 - 2019/2
AB - Entity resolution (ER) is the problem of accurately identifying multiple, differing, and possibly contradicting representations of unique real-world entities in data. It is a challenging and fundamental task in data cleansing and data integration. In this work, we propose graph differential dependencies (GDDs) as an extension of the recently developed graph entity dependencies (which are formal constraints for graph data) to enable approximate matching of values. Furthermore, we investigate a special discovery of GDDs for ER by designing an algorithm for generating a non-redundant set of GDDs in labelled data. Then, we develop an effective ER technique, Certus, that employs the learned GDDs for improving the accuracy of ER results. We perform extensive empirical evaluation of our proposals on five real-world ER benchmark datasets and a proprietary database to test their effectiveness and efficiency. The results from the experiments show the discovery algorithm and Certus are efficient; and more importantly, GDDs significantly improve the precision of ER without considerable trade-off of recall.
UR - https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=woscharlessturt_pure&SrcAuth=WosAPI&KeyUT=WOS:000497518400004&DestLinkType=FullRecord&DestApp=WOS
U2 - 10.14778/3311880.3311883
DO - 10.14778/3311880.3311883
M3 - Article
SN - 2150-8097
VL - 12
SP - 653
EP - 666
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 6
ER -