Rule based approach to extract metadata from scientific PDF documents

Ahmer Maqsood Hashmi, Muhammad Tannvir Afzal, Sabih Rehman

Research output: Book chapter/Published conference paper > Conference paper > peer-review

1 Citation (Scopus)

Abstract

The number of scientific PDF documents is increasing at a very rapid pace. Because of this volume, searching for these documents is becoming a time-consuming task. To make search and storage more efficient, we need a mechanism to extract metadata from these documents and store it according to its semantics. Extracting this metadata and storing it is a very time-consuming task and requires a lot of human effort if performed manually, owing to the large number of documents and their varying formats. In this paper, we present a rule-based approach to extracting metadata from research articles. The approach was developed and evaluated on a diverse dataset provided by ESWC (2016) that covers a number of different formats and features. Evaluation results show that our proposed approach performs 22% better than CERMINE and 9% better than GROBID.
Original language: English
Title of host publication: CITISIA 2020 IEEE Conference on Innovative Technologies in Intelligent System and Industrial Application
Subtitle of host publication: Conference Proceedings 25-27 November 2020, Sydney, Australia
Place of Publication: United States
Publisher: IEEE
Pages: 1-4
Number of pages: 4
ISBN (Electronic): 9781728194363
ISBN (Print): 9781728194370
Publication status: Published - 25 Nov 2020
Event: 5th IEEE International Conference on Innovative Technologies in Intelligent Systems and Industrial Applications, CITISIA 2020: CITISIA 2020 - Charles Sturt University Sydney campus, Sydney, Australia
Duration: 25 Nov 2020 to 27 Nov 2020
https://web.archive.org/web/20201128085551/https://ieee-citisia.org/ (Conference website)
https://web.archive.org/web/20210124015105/https://ieee-citisia.org/wp-content/uploads/2020/11/Conference-Program-new1.pdf (Conference program)
https://ieeexplore.ieee.org/xpl/conhome/9371766/proceeding?pageNumber=4 (Full paper proceedings)

Publication series

Name: CITISIA 2020 - IEEE Conference on Innovative Technologies in Intelligent Systems and Industrial Applications, Proceedings

Conference

Conference: 5th IEEE International Conference on Innovative Technologies in Intelligent Systems and Industrial Applications, CITISIA 2020
Country/Territory: Australia
City: Sydney
Period: 25/11/20 to 27/11/20
Other: The “Conference on Innovative Technologies in Intelligent Systems & Industrial Applications” (CITISIA) is a student conference that aims to provide students of higher learning institutions with a platform for presenting their own projects. It is also a measure of recognition of students’ professional and technical achievements by industries and international organizations such as IEEE. This conference is designed to facilitate exchanges of ideas through communication, networking and learning from others, for students and IEEE Chapters in terms of greater collaboration.