Facilitating information extraction without annotated data using unsupervised and positive-unlabeled learning

AMIA Annu Symp Proc. 2021 Jan 25;2020:658-667. eCollection 2020.

Abstract

Information extraction (IE), the distillation of specific information from unstructured data, is a core task in natural language processing. For rare entities (<1% prevalence), collecting the positive examples required to train a model may require an infeasibly large sample of mostly negative ones. We combined unsupervised learning with biased positive-unlabeled (PU) learning methods to: 1) facilitate positive example collection while maintaining the assumptions needed to 2) learn a binary classifier from the biased positive-unlabeled data alone. We tested the methods on a real-life use case of rare (<0.42%) entity extraction from medical malpractice documents. When tested on a manually reviewed random sample of documents, the PU model achieved an area under the precision-recall curve of 0.283 and F1 of 0.410, outperforming fully supervised learning (0.022 and 0.096, respectively). The results demonstrate our method's potential to reduce the manual effort required for extracting rare entities from narrative texts.
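The abstract does not specify the authors' implementation of biased PU learning. As a point of reference only, the sketch below illustrates one standard PU approach (the Elkan-Noto probability adjustment) for learning a binary classifier from positive and unlabeled examples alone; all data, parameters, and the choice of logistic regression are assumptions for illustration, not the paper's method.

```python
# Minimal sketch of positive-unlabeled (PU) learning via the Elkan-Noto
# adjustment. NOT the authors' implementation -- a generic illustration of
# learning a binary classifier from positive and unlabeled examples alone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for document feature vectors; the true positive class
# is rare (~1%), mimicking the rare-entity setting.
n, d = 5000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 2.8).astype(int)

# Only a fraction of positives are labeled (s = 1); the rest are "unlabeled".
label_frac = 0.5  # assumed labeling fraction
s = np.zeros(n, dtype=int)
pos_idx = np.flatnonzero(y == 1)
s[rng.choice(pos_idx, size=int(label_frac * len(pos_idx)), replace=False)] = 1

# Step 1: train a "non-traditional" classifier g(x) ~ P(s=1 | x)
# that separates labeled from unlabeled examples.
X_tr, X_ho, s_tr, s_ho = train_test_split(
    X, s, test_size=0.3, random_state=0, stratify=s
)
g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

# Step 2: estimate c = P(s=1 | y=1) as the mean score of held-out
# labeled positives (Elkan & Noto's e1 estimator).
c = g.predict_proba(X_ho[s_ho == 1])[:, 1].mean()

# Step 3: recover P(y=1 | x) = P(s=1 | x) / c, clipped to [0, 1].
p_y = np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)
print(f"estimated c = {c:.3f}, predicted prevalence = {(p_y > 0.5).mean():.4f}")
```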

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Curation
  • Data Mining / methods*
  • Humans
  • Natural Language Processing*