TY - JOUR
T1 - GrantExtractor
T2 - Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF
AU - Dai, Suyang
AU - Ding, Yuxia
AU - Zhang, Zihan
AU - Zuo, Wenxuan
AU - Huang, Xiaodi
AU - Zhu, Shanfeng
N1 - Publisher Copyright: © 2004-2012 IEEE. Copyright 2021 Elsevier B.V., All rights reserved.
PY - 2021
Y1 - 2021
N2 - Grant support (GS) in the MEDLINE database refers to funding agencies and contract numbers. It is important for funding organizations to track their funding outcomes from the GS information. Accurately and automatically extracting funding information from biomedical literature, however, is challenging. In this paper, we present a pipeline system called GrantExtractor that is able to accurately extract GS information from fulltext biomedical literature. GrantExtractor integrates several advanced machine learning techniques. In particular, we first use a sentence classifier to identify funding sentences in articles. A bi-directional LSTM with a CRF layer (BiLSTM-CRF) and pattern matching are then used to extract grant number and agency entities from the identified funding sentences. After removing noisy numbers with a multi-class model, we finally match each grant number with its corresponding agency. Experimental results on benchmark datasets demonstrate that GrantExtractor clearly outperforms all baseline methods. GrantExtractor also won first place in Task 5C of the 2017 BioASQ challenge, achieving a Micro-recall of 0.9526 on 22,610 articles. Moreover, GrantExtractor achieved a Micro F-measure as high as 0.90 in extracting grant pairs.
AB - Grant support (GS) in the MEDLINE database refers to funding agencies and contract numbers. It is important for funding organizations to track their funding outcomes from the GS information. Accurately and automatically extracting funding information from biomedical literature, however, is challenging. In this paper, we present a pipeline system called GrantExtractor that is able to accurately extract GS information from fulltext biomedical literature. GrantExtractor integrates several advanced machine learning techniques. In particular, we first use a sentence classifier to identify funding sentences in articles. A bi-directional LSTM with a CRF layer (BiLSTM-CRF) and pattern matching are then used to extract grant number and agency entities from the identified funding sentences. After removing noisy numbers with a multi-class model, we finally match each grant number with its corresponding agency. Experimental results on benchmark datasets demonstrate that GrantExtractor clearly outperforms all baseline methods. GrantExtractor also won first place in Task 5C of the 2017 BioASQ challenge, achieving a Micro-recall of 0.9526 on 22,610 articles. Moreover, GrantExtractor achieved a Micro F-measure as high as 0.90 in extracting grant pairs.
KW - BiLSTM-CRF
KW - biomedical fulltext
KW - biomedical text mining
KW - Grant support
KW - information extraction
UR - http://www.scopus.com/inward/record.url?scp=85100680683&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85100680683&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2019.2939128
DO - 10.1109/TCBB.2019.2939128
M3 - Article
C2 - 31494556
AN - SCOPUS:85100680683
SN - 1545-5963
VL - 18
SP - 205
EP - 215
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 1
ER -