Facilitating information extraction without annotated data using unsupervised and positive-unlabeled learning

AMIA Annu Symp Proc. 2021 Jan 25;2020:658-667. eCollection 2020.

Abstract

Information extraction (IE), the distillation of specific information from unstructured data, is a core task in natural language processing. For rare entities (<1% prevalence), collecting the positive examples required to train a model may require an infeasibly large sample of mostly negative ones. We combined unsupervised learning with biased positive-unlabeled (PU) learning methods to: 1) facilitate positive example collection while maintaining the assumptions needed to 2) learn a binary classifier from the biased positive-unlabeled data alone. We tested the methods on a real-life use case of rare (<0.42%) entity extraction from medical malpractice documents. When tested on a manually reviewed random sample of documents, the PU model achieved an area under the precision-recall curve of 0.283 and F1 of 0.410, outperforming fully supervised learning (0.022 and 0.096, respectively). The results demonstrate our method's potential to reduce the manual effort required for extracting rare entities from narrative texts.
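The abstract does not specify the authors' implementation of biased PU learning. As a point of reference only, the sketch below illustrates one standard PU approach (the Elkan-Noto probability adjustment) for learning a binary classifier from positive and unlabeled examples alone; all data, parameters, and the choice of logistic regression are assumptions for illustration, not the paper's method.

```python
# Minimal sketch of positive-unlabeled (PU) learning via the Elkan-Noto
# adjustment. NOT the authors' implementation -- a generic illustration of
# learning a binary classifier from positive and unlabeled examples alone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for document feature vectors; the true positive class
# is rare (~1%), mimicking the rare-entity setting.
n, d = 5000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 2.8).astype(int)

# Only a fraction of positives are labeled (s = 1); the rest are "unlabeled".
label_frac = 0.5  # assumed labeling fraction
s = np.zeros(n, dtype=int)
pos_idx = np.flatnonzero(y == 1)
s[rng.choice(pos_idx, size=int(label_frac * len(pos_idx)), replace=False)] = 1

# Step 1: train a "non-traditional" classifier g(x) ~ P(s=1 | x)
# that separates labeled from unlabeled examples.
X_tr, X_ho, s_tr, s_ho = train_test_split(
    X, s, test_size=0.3, random_state=0, stratify=s
)
g = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)

# Step 2: estimate c = P(s=1 | y=1) as the mean score of held-out
# labeled positives (Elkan & Noto's e1 estimator).
c = g.predict_proba(X_ho[s_ho == 1])[:, 1].mean()

# Step 3: recover P(y=1 | x) = P(s=1 | x) / c, clipped to [0, 1].
p_y = np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)
print(f"estimated c = {c:.3f}, predicted prevalence = {(p_y > 0.5).mean():.4f}")
```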

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Curation
  • Data Mining / methods*
  • Humans
  • Natural Language Processing*