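事例に基づく HTML 文書から XML 文書への半自動変換—シリーズ型 HTML 文書における類似性の利用— (Case-based semi-automatic transformation from HTML documents to XML documents: using similarity among series-type HTML documents)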

Authors

梅原雅之，岩沼宏治，永井宏和

Publication date

2001

Journal

人工知能学会論文誌

Volume

Issue

Pages

408-416

Publisher

一般社団法人人工知能学会

Description

To make use of the large quantity of information on the Internet, machine processing of HTML documents has become tremendously important. HTML, however, is designed mainly for reading in browsers and is therefore not well suited to machine processing. XML was proposed as a solution to this problem. Unfortunately, fully automatic transformation from HTML to XML is extremely difficult, because it requires understanding the meaning of HTML documents. On the other hand, actual Web sites contain many series of HTML pages, and the pages in a series usually have quite similar structures. A case-based transformation is therefore a promising method in practice. In this paper, we give a case-based method for transforming HTML documents into XML documents. Given a series of HTML documents and a sample transformation of one selected HTML document into an XML document, we first analyze both the semantic and the syntactic information appearing in the sample pair. Next, the remaining HTML pages of the series are automatically transformed into XML documents using the information extracted from the sample. We adopt a vector model of weighted term frequency to approximate the meaning of HTML documents, and we use both headlines and a parse tree as syntactic information. Through experimental evaluation, we show that this case-based method achieves highly accurate transformation: 80% of 80 actual pages were transformed correctly.
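The semantic step in the abstract (a weighted term-frequency vector model used to carry labels from the sample pair over to the other pages of a series) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the weighting scheme or the similarity measure, so smoothed TF-IDF and cosine similarity are swapped in here, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `label_blocks` are hypothetical rather than taken from the paper. The syntactic information (headlines and the parse tree) is not modeled.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Assumption: smoothed IDF, so terms present in every text keep a
        # positive weight. The paper's actual weighting may differ.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_blocks(sample_blocks, new_blocks):
    """Assign each new text block the tag of its most similar sample block.

    sample_blocks: dict mapping an XML tag name to the text it wrapped in
                   the hand-made sample transformation.
    new_blocks:    list of text blocks taken from another page of the series.
    Returns a list of (tag, text) pairs in the order of new_blocks.
    """
    tags = list(sample_blocks)
    texts = [sample_blocks[t] for t in tags] + list(new_blocks)
    vectors = tfidf_vectors(texts)
    sample_vecs = vectors[:len(tags)]
    new_vecs = vectors[len(tags):]
    labelled = []
    for text, vec in zip(new_blocks, new_vecs):
        scores = [cosine(vec, s) for s in sample_vecs]
        best = max(range(len(tags)), key=scores.__getitem__)
        labelled.append((tags[best], text))
    return labelled
```

For instance, if the sample pair tagged a lecture heading as `title`, a date line as `schedule`, and a teacher line as `instructor`, calling `label_blocks` on the text blocks of the next lecture page would carry those tags over to the blocks whose vocabulary overlaps most with each sample block.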

Total citations

Cited by 27

[Chart: citations per year, 2002 to 2009]
