TY - JOUR
T1 - GrantExtractor
T2 - Accurate grant support information extraction from biomedical fulltext based on Bi-LSTM-CRF
AU - Dai, Suyang
AU - Ding, Yuxia
AU - Zhang, Zihan
AU - Zuo, Wenxuan
AU - Huang, Xiaodi
AU - Zhu, Shanfeng
N1 - Publisher Copyright:
© 2004-2012 IEEE.
Copyright:
Copyright 2021 Elsevier B.V., All rights reserved.
PY - 2021
Y1 - 2021
N2 - Grant support (GS) in the MEDLINE database refers to funding agencies and contract numbers. It is important for funding organizations to track their funding outcomes from the GS information. As such, accurately and automatically extracting funding information from biomedical literature is a challenging task. In this paper, we present a pipeline system called GrantExtractor that is able to accurately extract GS information from fulltext biomedical literature. GrantExtractor effectively integrates several advanced machine learning techniques. In particular, we first use a sentence classifier to identify funding sentences in articles. A bi-directional LSTM with a CRF layer (BiLSTM-CRF) and pattern matching are then used to extract grant number and agency entities from these identified funding sentences. After removing noisy numbers with a multi-class model, we finally match each grant number with its corresponding agency. Experimental results on benchmark datasets demonstrate that GrantExtractor clearly outperforms all baseline methods. Notably, GrantExtractor won first place in Task 5C of the 2017 BioASQ challenge, achieving a micro-recall of 0.9526 on 22,610 articles. Moreover, GrantExtractor achieved a micro F-measure as high as 0.90 in extracting grant pairs.
AB - Grant support (GS) in the MEDLINE database refers to funding agencies and contract numbers. It is important for funding organizations to track their funding outcomes from the GS information. As such, accurately and automatically extracting funding information from biomedical literature is a challenging task. In this paper, we present a pipeline system called GrantExtractor that is able to accurately extract GS information from fulltext biomedical literature. GrantExtractor effectively integrates several advanced machine learning techniques. In particular, we first use a sentence classifier to identify funding sentences in articles. A bi-directional LSTM with a CRF layer (BiLSTM-CRF) and pattern matching are then used to extract grant number and agency entities from these identified funding sentences. After removing noisy numbers with a multi-class model, we finally match each grant number with its corresponding agency. Experimental results on benchmark datasets demonstrate that GrantExtractor clearly outperforms all baseline methods. Notably, GrantExtractor won first place in Task 5C of the 2017 BioASQ challenge, achieving a micro-recall of 0.9526 on 22,610 articles. Moreover, GrantExtractor achieved a micro F-measure as high as 0.90 in extracting grant pairs.
KW - BiLSTM-CRF
KW - biomedical fulltext
KW - biomedical text mining
KW - Grant support
KW - information extraction
UR - http://www.scopus.com/inward/record.url?scp=85100680683&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85100680683&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2019.2939128
DO - 10.1109/TCBB.2019.2939128
M3 - Article
C2 - 31494556
AN - SCOPUS:85100680683
SN - 1545-5963
VL - 18
SP - 205
EP - 215
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 1
ER -