HDSKG: Harvesting domain specific knowledge graph from content of webpages

Xuejiao Zhao, Zhenchang Xing, Muhammad Ashad Kabir, Naoya Sawada, Jing Li, Shang Wei Lin

Research output: Book chapter/Published conference paper › Conference paper › peer-review

61 Citations (Scopus)

Abstract

Knowledge graphs are useful in many domains, such as search result ranking, recommendation, and exploratory search. A knowledge graph integrates structural information about concepts across multiple information sources and links these concepts together. The extraction of domain-specific relation triples (subject, verb phrase, object) is one of the important techniques for constructing a domain-specific knowledge graph. In this research, an automatic method named HDSKG is proposed to discover domain-specific concepts and their relation triples from the content of webpages. We combine a dependency parser with a rule-based method to chunk candidate relation triples, then extract advanced features of these candidates so that a machine learning algorithm can estimate their domain relevance. To evaluate our method, we apply HDSKG to Stack Overflow (a Q&A website about computer programming). As a result, we construct a knowledge graph of the software engineering domain with 35279 relation triples, 44800 concepts, and 9660 unique verb phrases. The experimental results show that both the precision and recall of HDSKG (0.78 and 0.7 respectively) are much higher than those of OpenIE (0.11 and 0.6 respectively). HDSKG performs particularly well on complex sentences. Furthermore, because the classifier uses a self-training technique, HDSKG can be applied to other domains easily with less training data.
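As a rough illustration of the data structure the abstract describes, the sketch below stores (subject, verb phrase, object) triples and counts the three quantities it reports: triples, concepts, and unique verb phrases. It is not the authors' implementation. The names `Triple`, `build_graph`, and `summarize` are hypothetical, and the example triples are invented. Treating every subject and every object as a concept is also an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One relation triple: (subject, verb phrase, object)."""
    subject: str
    verb_phrase: str
    obj: str


def build_graph(triples):
    """Index triples as adjacency lists: subject -> [(verb phrase, object), ...]."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.verb_phrase, t.obj))
    return dict(graph)


def summarize(triples):
    """Count distinct triples, concepts, and verb phrases.

    Assumption: a concept is any string that appears as a subject or an object.
    """
    concepts = {t.subject for t in triples} | {t.obj for t in triples}
    verb_phrases = {t.verb_phrase for t in triples}
    return {
        "triples": len(set(triples)),
        "concepts": len(concepts),
        "verb_phrases": len(verb_phrases),
    }


# Hypothetical example triples, invented for illustration only.
EXAMPLE_TRIPLES = [
    Triple("ArrayList", "is a", "collection"),
    Triple("HashMap", "is a", "collection"),
    Triple("ArrayList", "is implemented in", "Java"),
]

if __name__ == "__main__":
    print(build_graph(EXAMPLE_TRIPLES))
    print(summarize(EXAMPLE_TRIPLES))
```

In a pipeline like the one described, the triples would come from the dependency-parser and rule-based chunking stage after the domain-relevance classifier has filtered them. That stage is not shown here.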
Original language: English
Title of host publication: Proceedings of SANER 2017
Subtitle of host publication: 24th IEEE international conference on software analysis, evolution, and reengineering
Editors: Martin Pinzger, Gabriele Bavota, Andrian Marcus
Place of Publication: United States
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 56-67
Number of pages: 12
ISBN (Electronic): 9781509055012
ISBN (Print): 9781509055029 (Print on demand)
Publication status: Published - 23 Mar 2017
Event: 24th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2017 - Alpen-Adria University of Klagenfurt, Klagenfurt, Austria
Duration: 21 Feb 2017 to 24 Feb 2017
https://saner.aau.at/ (Conference website)

Conference

Conference: 24th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2017
Country/Territory: Austria
City: Klagenfurt
Period: 21/02/17 to 24/02/17
Other: SANER is the premier research conference on the theory and practice of recovering information from existing software and systems. It explores innovative methods of extracting the many kinds of information that can be recovered from software, software engineering documents, and systems artifacts, and examines innovative ways of using this information in system renovation and program understanding.
SANER promotes discussion and interaction among researchers and practitioners about the development of maintainable systems, and the improvement, evolution, migration, and reengineering of existing systems. It also explores innovative methods of extracting the many kinds of information of interest to software developers and examines innovative ways of using this information in system renovation and program understanding.