An improvised algorithm for relevant content extraction from web pages

Authors

Aanshi Bhardwaj, Veenu Mangat

Publication date

2014/5/1

Journal

Journal of Emerging Technologies in Web Intelligence

Volume

Issue

Pages

226-230

Publisher

Academy Publisher

Description

The World Wide Web (WWW) is now a popular medium through which people all around the world can spread and gather information of all kinds. However, web pages also contain a large amount of irrelevant and redundant information. Such information complicates various web mining tasks, such as web page crawling, web page classification, link-based ranking, and topic distillation. Previously, relevant content was extracted only from the textual part of web pages. Nowadays, however, the content of a web page appears not only as text but also as images, video, or audio. This paper proposes an improved algorithm for extracting informative content from web pages, i.e., it extracts the relevant content not only as text but also as images, videos, audio, Adobe Flash files, and online games. Experiments conducted on different real websites show that the precision and recall values of our approach are superior to those of the previous Word to Leaf Ratio approach.

Total citations

Cited by 10

[Chart: citations per year, 2017 to 2024; bar values 2, 1, 3, 1, 1, 1, 1]

Scholar articles

An improvised algorithm for relevant content extraction from web pages

A Bhardwaj, V Mangat - Journal of Emerging Technologies in Web Intelligence, 2014
