Software that facilitates accurate extraction of text from PDF files of research articles for use in text mining applications. This open source system extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles. The current version of LA-PDFText is a baseline system that extracts text using a three-stage process:
* identification of blocks of contiguous text
* classification of these blocks into rhetorical categories
* extraction of the text from blocks grouped section-wise
It is intended for use by both scientists and NLP engineers who need access to the text within specific sections of research articles.
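The three-stage process described above can be illustrated with a minimal Python sketch. This is not LA-PDFText's own code or API: the function names, the `(y, text)` line representation, the `max_gap` threshold, and the heading patterns in `SECTION_RULES` are all assumptions made for illustration, and the sketch starts from lines that have already been read out of a PDF with their vertical positions.

```python
import re

# Hypothetical heading rules (an assumption for this sketch); LA-PDFText's
# actual rule files are not reproduced here.
SECTION_RULES = [
    ("abstract", re.compile(r"^abstract\b", re.I)),
    ("methods", re.compile(r"^(materials and )?methods\b", re.I)),
    ("results", re.compile(r"^results\b", re.I)),
    ("references", re.compile(r"^references\b", re.I)),
]


def identify_blocks(lines, max_gap=5.0):
    """Stage 1: merge (y, text) lines into blocks of contiguous text.

    A new block starts whenever the vertical gap to the previous line
    exceeds max_gap (an illustrative threshold, not a tuned value).
    """
    blocks, current, prev_y = [], [], None
    for y, text in lines:
        if prev_y is not None and y - prev_y > max_gap and current:
            blocks.append(" ".join(current))
            current = []
        current.append(text)
        prev_y = y
    if current:
        blocks.append(" ".join(current))
    return blocks


def classify_blocks(blocks):
    """Stage 2: assign each block to a rhetorical category.

    A block matching a heading rule switches the current section and is
    not kept as body text; every other block inherits the current section.
    """
    labeled, section = [], "front_matter"
    for block in blocks:
        heading = next(
            (name for name, pat in SECTION_RULES if pat.match(block.strip())),
            None,
        )
        if heading is not None:
            section = heading
            continue
        labeled.append((section, block))
    return labeled


def extract_by_section(labeled):
    """Stage 3: join the text of all blocks that share a section."""
    sections = {}
    for section, block in labeled:
        sections.setdefault(section, []).append(block)
    return {name: "\n".join(parts) for name, parts in sections.items()}


if __name__ == "__main__":
    page = [
        (0, "A Title"),
        (20, "Abstract"),
        (40, "We study X."),
        (44, "It works."),
        (70, "Methods"),
        (90, "We did Y."),
    ]
    print(extract_by_section(classify_blocks(identify_blocks(page))))
```

The heading test here is deliberately crude: a body paragraph that happens to begin with a word such as "Results" would be mistaken for a heading. A rule-based system of the kind described would presumably combine such textual cues with layout features like font size and position.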

Resource Type: 
Parent organization: 
University of Southern California; California; USA
Supporting agency: 
NIGMS (BioScholar project); NIH (NeuArt project); Biomedical Informatics Research Network; NSF (SciKnowMine project)