Project Details
Description
Allen Institute for AI Project Proposal
Extracting metadata from scientific documents involves identifying, for a given corpus of scientific publications, information about each paper (i.e. the title, authors, venue, and other details) and the citation graph (i.e. which papers cite which other papers). Extracting this metadata is a critical and challenging subtask in providing search over scholarly papers. Scientific search engines like the Allen Institute’s semanticscholar.org need to extract metadata automatically in order to provide accurate title and author information for their search results, to optimize the relevance of those results, and to compute bibliometrics that identify the most salient and significant publications. However, extracting metadata is difficult, and accuracy rates of only 0.6 to 0.8 (in F1 measure) are not uncommon.
This proposal aims to provide improved methods for metadata extraction. The PI collaborated with the Allen Institute for AI (AI2) on their current version of metadata extraction; this project aims to substantially extend that work to use state-of-the-art methods (neural network language models). The initial codebase for the work has been developed within AI2, and this proposal will extend that codebase.
We plan to answer several research questions, from among the following:
• How can large-scale neural networks be used to perform metadata extraction? In particular, iterating many times over the full corpus of documents is intractable. Methods that filter the corpus or home in on key data in advance will be required, and developing them entails new research.
• How can we use external data sources like the Open Access Citations to bootstrap large-scale training data required by the neural methods?
Our experiments will evaluate against existing baselines, including AI2’s existing metadata extractor and the Grobid open-source baseline extractor. Subtasks will include title and author extraction, and/or reference extraction. Success metrics will focus on F1 against gold (labeled) test sets. We will work closely with AI2 personnel to ensure that our solutions can transfer to the organization.
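As an illustration of the F1 success metric mentioned above, the following is a minimal sketch of how extracted values for one field might be scored against a gold set. The function name `f1_score`, the set-based exact-match comparison, and the sample author names are assumptions made for this example; the proposal does not specify how matches are counted.

```python
def f1_score(predicted, gold):
    """Return (precision, recall, F1) for one field, using exact set matching.

    Assumption: a predicted value counts as correct only if it appears
    verbatim in the gold set. Real evaluations may normalize or fuzzy-match.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical author lists for a single paper.
gold_authors = {"a. author", "b. writer", "c. scholar"}
predicted_authors = {"a. author", "b. writer", "d. wrong"}

precision, recall, f1 = f1_score(predicted_authors, gold_authors)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the three predicted authors match the gold set, so precision, recall, and F1 all equal 2/3, which falls inside the 0.6 to 0.8 range cited above.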
One PhD student for six months is requested for this work.
Status | Finished |
---|---|
Effective start/end date | 9/1/17 → 3/31/18 |
Funding
- Allen Institute for Artificial Intelligence (Agmt 10/13/2017)