Authors
Eli Cortez, Altigran S da Silva, Marcos André Gonçalves, Edleno S de Moura
Publication date
2010/6/6
Book
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Pages
807-818
Description
Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g., postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, we rely on very effective matching strategies instead of explicit learning strategies. The effectiveness of this matching strategy is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores sequencing and …
Total citations
[Chart: citations per year, 2009-2021]
Scholar articles
E Cortez, AS da Silva, MA Gonçalves, ES de Moura - Proceedings of the 2010 ACM SIGMOD International …, 2010