Trafilatura: A web scraping library and command-line tool for text discovery and extraction

A Barbaresi - Proceedings of the 59th Annual Meeting of the …, 2021 - aclanthology.org
An essential operation in web corpus construction consists in retaining the desired content
while discarding the rest. Another challenge is finding one's way through websites. This article …

Cyber-attack features for detecting cyber threat incidents from online news

MS Abdullah, A Zainal, MA Maarof… - 2018 cyber resilience …, 2018 - ieeexplore.ieee.org
There is a large volume of data from the online news sources that are freely available which
might contain valuable information. Data such as cyber-attacks news keep growing bigger …

An efficient regular expression inference approach for relevant image extraction

HV Agun, E Uzun - Applied Soft Computing, 2023 - Elsevier
Traditional approaches for extracting relevant images automatically from web pages are
error-prone and time-consuming. To improve this task, operations such as preparing a larger …

Out-of-the-box and into the ditch? Multilingual evaluation of generic text extraction tools

A Barbaresi, G Lejeune - Language Resources and Evaluation …, 2020 - hal.science
This article examines extraction methods designed to retain the main text content of web
pages and discusses how the extraction could be oriented and evaluated: can and should it …

Web content information extraction based on DOM tree and statistical information

X Yu, Z Jin - 2017 IEEE 17th International Conference on …, 2017 - ieeexplore.ieee.org
Booming web pages contain a lot of information, while they contain little content and much
unrelated noise information, such as script code, links, advertising and so on. These …

Survey Paper on Web Content Extraction & Classification

D Shete, S Bojewar, A Sanghvi - 2021 6th International …, 2021 - ieeexplore.ieee.org
Over the last few years, web data extraction has gained popularity. Product information on
the Ecommerce website floods the internet with big data. Web-based business sites these …

Social media and web sensing on interior and urban design

EA Stathopoulos, A Shvets, R Carlini… - … IEEE Symposium on …, 2022 - ieeexplore.ieee.org
Social media and web sites provide an access to public opinions on certain aspects and
therefore play an important role in getting insights on targeted audiences. Designers have …

Automatic news-roundup generation using clustering, extraction, and presentation

V Utomo, JS Leu - Multimedia Systems, 2020 - Springer
Along with the growth of the internet, the number of information published increased
exponentially. This huge flow of information causes a problem called “information overload” …

An optimal data entry method, using web scraping and text recognition

N Roopesh, MS Akarsh… - … Conference on Information …, 2021 - ieeexplore.ieee.org
Data entry is one of the most tedious jobs which consumes huge manpower in creating
structured data from the given inputs. A large amount of data entered in the system can be …

Web Page Content Extraction Based on Multi-feature Fusion

B Yu, J Du, Y Shao - arXiv preprint arXiv:2203.12591, 2022 - arxiv.org
With the rapid development of Internet technology, people have more and more access to a
variety of web page resources. At the same time, the current rapid development of deep …