TY - JOUR
T1 - PDF text classification to leverage information extraction from publication reports
AU - Bui, Duy Duc An
AU - Del Fiol, Guilherme
AU - Jonnalagadda, Siddhartha
N1 - Funding Information: This work was made possible by funding from National Library of Medicine Grants (R00LM011389 and R01LM011416-01). We are also thankful for the guidance provided by the US Satellite of the Cochrane Heart Group led by Dr. Mark Huffman.
PY - 2016/6/1
Y1 - 2016/6/1
N2 - Objectives: Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges for the underlying natural language processing algorithms. Our goal is to categorize PDF texts for strategic use by IE systems. Methods: We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, and compared it with machine learning classifiers, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. Results: The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p < 0.001) higher than the best performing machine learning classifier that used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved performance of the outcome extraction system (F-measure +4.1%, p = 0.002). It also reduced the number of sentences to be processed by 44.9% (p < 0.001), which corresponds to a processing time reduction of 50% (p = 0.005). Conclusions: The rule-based multi-pass sieve framework can be used effectively in categorizing texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents.
AB - Objectives: Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges for the underlying natural language processing algorithms. Our goal is to categorize PDF texts for strategic use by IE systems. Methods: We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, and compared it with machine learning classifiers, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions. Results: The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p < 0.001) higher than the best performing machine learning classifier that used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved performance of the outcome extraction system (F-measure +4.1%, p = 0.002). It also reduced the number of sentences to be processed by 44.9% (p < 0.001), which corresponds to a processing time reduction of 50% (p = 0.005). Conclusions: The rule-based multi-pass sieve framework can be used effectively in categorizing texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents.
KW - Document analysis
KW - Machine learning
KW - Natural language processing
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=84962759821&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84962759821&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2016.03.026
DO - 10.1016/j.jbi.2016.03.026
M3 - Article
C2 - 27044929
AN - SCOPUS:84962759821
VL - 61
SP - 141
EP - 148
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
SN - 1532-0464
ER -