Rule based approach to extract metadata from scientific PDF documents

Ahmer Maqsood Hashmi, Muhammad Tannvir Afzal, Sabih Rehman

Research output: Book chapter/Published conference paper, Conference paper, peer-review

Abstract

The number of scientific PDF documents is increasing at a very rapid pace. Searching these documents is becoming a time-consuming task because of their sheer number. To make search and storage more efficient, we need a mechanism to extract metadata from these documents and store this metadata according to its semantics. Extracting this metadata and storing it manually is very time-consuming and requires a lot of human effort, owing to the large number of documents and their varying formats. In this paper, we present a rule-based approach to extract metadata from research articles. This approach was developed and evaluated on a diverse dataset provided by ESWC (2016) that covers a number of different formats and features. Evaluation results show that our proposed approach performs 22% better than CERMINE and 9% better than GROBID.
Original language: English
Title of host publication: 2020 5th International Conference on Innovative Technologies in Intelligent Systems and Industrial Applications (CITISIA)
Place of Publication: United States
Publisher: IEEE
Pages: 1-4
Number of pages: 4
ISBN (Electronic): 9781728194363
ISBN (Print): 9781728194370
Publication status: Published - 25 Nov 2020
Event: 2020 IEEE Conference on Innovative Technologies in Intelligent Systems & Industrial Applications: CITISIA 2020 - Charles Sturt University Sydney campus, Sydney, Australia
Duration: 25 Nov 2020 to 27 Nov 2020
https://web.archive.org/web/20201022124657/http://www.ieee-citisia.org/ (conference website)
file:///D:/Users/bmt175/Downloads/Proceeding.pdf (proceedings)

Conference

Conference: 2020 IEEE Conference on Innovative Technologies in Intelligent Systems & Industrial Applications
Country/Territory: Australia
City: Sydney
Period: 25/11/20 to 27/11/20
Other: The “Conference on Innovative Technologies in Intelligent Systems & Industrial Applications” (CITISIA) is a student conference that aims to provide students of higher learning institutions with a platform for presenting their own projects. It is also a measure of recognition of students’ professional and technical achievements by industries and international organizations such as IEEE. The conference is designed to facilitate the exchange of ideas through communication, networking and learning from others, and to foster greater collaboration among students and IEEE Chapters.
