Gain more with less: Extracting information from business documents with small data

MT Nguyen, NH Son - Expert Systems with Applications, 2023 - Elsevier
Abstract
Information extraction (IE) is a vital step of digitization that reduces paperwork in offices. However, adapting common IE systems to actual business cases faces two issues. First, the number of training samples is small (i.e., 100–200 examples). Second, span extraction models based on a question answering formulation require a long time for training and inference. To overcome these issues, we introduce a new query-based model for extracting information from business documents. To address the data limitation, the model employs transfer learning, which adapts the knowledge of pre-trained language models (i.e., BERT) to specific domains; to do so, we design a new CNN layer for domain adaptation. To improve speed, unlike the encoding used by normal span extraction methods (BERT-QA), the proposed model encodes short tags and context documents in two parallel channels, which speeds up training and inference. Information from the short tags is fused, using attention, with the context document representations learned by the CNN to predict the start and end positions of extracted spans. Promising results on five domain-specific datasets in English and Japanese indicate that the proposed model produces high-quality outputs and can be applied to business scenarios.
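The data flow described in the abstract (two separately encoded channels, a convolution over the context, attention-based fusion, then start and end prediction) can be illustrated with a minimal sketch. This is not the authors' implementation: random vectors stand in for BERT encodings, the convolution is a toy depthwise filter, the fusion is a simple additive dot-product attention, and every name here (`conv1d_same`, `fuse_with_attention`, `predict_span`, `DIM`) is hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conv1d_same(vectors, kernel):
    """Toy depthwise 1D convolution with zero padding (odd kernel length).

    The output has the same number of positions as the input.
    """
    half = len(kernel) // 2
    dim = len(vectors[0])
    out = []
    for i in range(len(vectors)):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(vectors):
                for d in range(dim):
                    acc[d] += w * vectors[j][d]
        out.append(acc)
    return out


def fuse_with_attention(context, tag):
    """For each context position, attend over tag vectors and add the summary."""
    scale = math.sqrt(len(tag[0]))
    fused = []
    for c in context:
        weights = softmax([dot(c, t) / scale for t in tag])
        summary = [sum(w * t[d] for w, t in zip(weights, tag))
                   for d in range(len(c))]
        fused.append([a + b for a, b in zip(c, summary)])
    return fused


def predict_span(fused, w_start, w_end):
    """Pick the best start position, then the best end at or after it."""
    start_scores = [dot(f, w_start) for f in fused]
    end_scores = [dot(f, w_end) for f in fused]
    start = max(range(len(fused)), key=lambda i: start_scores[i])
    end = max(range(start, len(fused)), key=lambda i: end_scores[i])
    return start, end


def random_vectors(n, dim, rng):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 8  # assumed toy hidden size, not taken from the paper

# Channel 1: short tag (query). Channel 2: context document.
# Random vectors are placeholders for pre-trained encoder outputs.
tag_vectors = random_vectors(3, DIM, rng)
context_vectors = random_vectors(20, DIM, rng)

context_features = conv1d_same(context_vectors, kernel=[0.25, 0.5, 0.25])
fused = fuse_with_attention(context_features, tag_vectors)

# Untrained, randomly initialized scoring vectors for the two span boundaries.
w_start = [rng.uniform(-1, 1) for _ in range(DIM)]
w_end = [rng.uniform(-1, 1) for _ in range(DIM)]
start, end = predict_span(fused, w_start, w_end)
print("predicted span token range:", start, end)
```

Because the tag and the context are encoded independently, the tag channel in such a design could be computed once and reused across documents, which is one plausible reading of why a two-channel layout is cheaper than encoding each question together with the full document.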