Authors
梅原雅之, 岩沼宏治, 永井宏和
Publication date
2001
Journal
人工知能学会論文誌
Volume
16
Issue
5
Pages
408-416
Publisher
一般社団法人 人工知能学会
Description
In order to utilize the large quantity of information on the Internet, machine processing of HTML documents has become tremendously important. HTML, however, is designed mainly for reading in browsers and is thus not suitable for machine processing. XML was proposed as a solution to this problem. Unfortunately, fully automatic transformation from HTML to XML is extremely difficult, because it requires understanding the meaning of HTML documents. On the other hand, actual Web sites contain many series of HTML pages, and the pages of a series usually have quite similar structures. A case-based transformation is therefore a promising method in practice. In this paper, we give a case-based method for transforming HTML documents into XML ones. Given a series of HTML documents and a sample transformation of one selected HTML document into an XML document, we first analyze both the semantic and the syntactic information appearing in the sample pair. Next, the remaining HTML pages of the series are automatically transformed into XML documents by using the information extracted from the sample. We adopt a vector model of weighted term frequency to approximate the meaning of HTML documents, and use both headlines and a parse tree as syntactic information. Through experimental evaluation, we show that this case-based method achieves highly accurate transformation, i.e., 80% of 80 actual pages were transformed correctly.
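The vector-model step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weighting scheme, so raw term counts stand in for the weighted frequencies, and the tokenizer, the function names, and the tag names `title` and `schedule` are all hypothetical. The sketch assigns each text block of a new page the XML tag of the most similar block in the sample page, using cosine similarity.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Return a term-frequency vector (raw counts; an assumed weighting)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def label_blocks(sample, blocks):
    """Give each new block the tag of the most similar sample block.

    sample: dict mapping a hypothetical XML tag name to the text that the
            sample transformation placed under that tag.
    blocks: text blocks taken from another page of the same series.
    """
    vectors = {tag: term_vector(text) for tag, text in sample.items()}
    labeled = []
    for block in blocks:
        v = term_vector(block)
        best = max(vectors, key=lambda tag: cosine(vectors[tag], v))
        labeled.append((best, block))
    return labeled


# Hypothetical sample pair and a new page from the same series.
sample = {
    "title": "Lecture on logic programming",
    "schedule": "Monday 10:00 room 201",
}
new_page = ["Lecture on machine learning", "Friday 13:00 room 105"]

for tag, text in label_blocks(sample, new_page):
    print(f"<{tag}>{text}</{tag}>")
```

A fuller treatment would also use the syntactic cues the abstract names (headlines and the parse tree), which this sketch leaves out.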
Total citations
[Bar chart: citations per year, 2002 to 2009]