TY - JOUR
T1 - A novel approach for entity resolution in scientific documents using context graphs
AU - Huang, Changqin
AU - Zhu, Jia
AU - Huang, Xiaodi
AU - Yang, Min
AU - Fung, Gabriel
AU - Hu, Qintai
N1 - Includes bibliographical references.
PY - 2018/3
Y1 - 2018/3
N2 - Entity resolution refers to disambiguating and resolving entities in structured and unstructured data. Developing effective resolution algorithms is important for processing scientific documents, particularly biomedical literature. Specifically, resolving name ambiguity among biomedical entities is a primary task in the knowledge extraction process. In this paper, we present a novel approach to disambiguating gene/protein names by using context graphs. A set of document abstracts is used to build the context graphs by uncovering the indirect co-occurrence relationships among words. Feature vectors of the graphs are constructed according to the information gain (IG) on the word set. To evaluate the IG values, we propose a new metric that integrates word frequency (WF), dispersion degree (DD) and concentration degree (CD). Finally, entity resolution is performed by applying a support vector machine (SVM). Compared to existing approaches, the proposed method is capable of discovering latent information from the context of entity names, rather than relying on statistical information such as the number of occurrences of words. Based on the results of comprehensive experiments on two benchmark datasets, we conclude that the proposed method is promising for resolving ambiguous entities compared with several existing solutions.
AB - Entity resolution refers to disambiguating and resolving entities in structured and unstructured data. Developing effective resolution algorithms is important for processing scientific documents, particularly biomedical literature. Specifically, resolving name ambiguity among biomedical entities is a primary task in the knowledge extraction process. In this paper, we present a novel approach to disambiguating gene/protein names by using context graphs. A set of document abstracts is used to build the context graphs by uncovering the indirect co-occurrence relationships among words. Feature vectors of the graphs are constructed according to the information gain (IG) on the word set. To evaluate the IG values, we propose a new metric that integrates word frequency (WF), dispersion degree (DD) and concentration degree (CD). Finally, entity resolution is performed by applying a support vector machine (SVM). Compared to existing approaches, the proposed method is capable of discovering latent information from the context of entity names, rather than relying on statistical information such as the number of occurrences of words. Based on the results of comprehensive experiments on two benchmark datasets, we conclude that the proposed method is promising for resolving ambiguous entities compared with several existing solutions.
KW - Context-based graphs
KW - Entity resolution
KW - Feature selection
KW - Support vector machines
UR - http://www.scopus.com/inward/record.url?scp=85039172688&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85039172688&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2017.12.024
DO - 10.1016/j.ins.2017.12.024
M3 - Article
AN - SCOPUS:85039172688
SN - 0020-0255
VL - 432
SP - 431
EP - 441
JO - Information Sciences
JF - Information Sciences
ER -