A novel approach for content extraction from web pages

A Bhardwaj, V Mangat - 2014 Recent Advances in Engineering …, 2014 - ieeexplore.ieee.org
2014 Recent Advances in Engineering and Computational Sciences (RAECS), 2014. ieeexplore.ieee.org
The rapid development of the internet and of web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also contain a lot of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, etc. on web pages are considered redundant and irrelevant content. Such information complicates various web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation. This paper discusses various approaches for extracting informative content from web pages and presents a new approach to content extraction that uses the word-to-leaf ratio and the density of links.
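
The abstract names the two measures but does not define them, so the following is only a minimal sketch under assumed definitions, not the authors' method. It assumes that the word-to-leaf ratio of a block is the number of words in its subtree divided by the number of leaf elements in that subtree, and that link density is the share of those words that sit inside `<a>` elements. The names `Node`, `TreeBuilder`, `block_scores`, and `SAMPLE` are hypothetical, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become the open node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text chunks that are direct children of this element


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML; tolerant of stray closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore it if none.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data)


def count_words(node, links_only=False, inside_link=False):
    """Words in the subtree; with links_only, count only words inside <a>."""
    inside = inside_link or node.tag == "a"
    own = sum(len(chunk.split()) for chunk in node.text)
    if links_only and not inside:
        own = 0
    return own + sum(count_words(c, links_only, inside) for c in node.children)


def count_leaves(node):
    """Leaf elements in the subtree (elements with no child elements)."""
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children)


def block_scores(html):
    """Return (tag, metrics) for each top-level block, under the assumed definitions."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    results = []
    for block in builder.root.children:
        words = count_words(block)
        results.append((block.tag, {
            "word_to_leaf": words / count_leaves(block),
            "link_density": count_words(block, links_only=True) / words if words else 0.0,
        }))
    return results


SAMPLE = """<div id="nav"><a href="/">Home</a><a href="/a">About</a><a href="/c">Contact</a></div>
<div id="main"><p>Content extraction removes navigation panels and advertisements from a page.</p><p>The remaining text is the informative content.</p></div>"""

for tag, metrics in block_scores(SAMPLE):
    print(tag, metrics)
```

Under these assumed definitions, the navigation block in `SAMPLE` gets a low word-to-leaf ratio and a high link density, while the main block gets the opposite. Any threshold for separating informative from non-informative blocks would have to be chosen empirically and is not given in the abstract.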