The influence of preprocessing on text classification using a bag-of-words representation

Y HaCohen-Kerner, D Miller, Y Yigal - PLoS ONE, 2020 - journals.plos.org
Text classification (TC) is the task of automatically assigning documents to a fixed number of
categories. TC is an important component in many text applications. Many of these …

[PDF] Text classification by augmenting the bag-of-words representation with redundancy-compensated bigrams

C Boulis, M Ostendorf - Proc. of the International Workshop in …, 2005 - researchgate.net
The most prevalent representation for text classification is the bag-of-words vector. A number
of approaches have sought to replace or augment the bag-of-words representation with …

Word co-occurrence features for text classification

F Figueiredo, L Rocha, T Couto, T Salles… - Information Systems, 2011 - Elsevier
In this article we propose a data treatment strategy to generate new discriminative features,
called compound-features (or c-features), for the sake of text classification. These c-features …

The impact of preprocessing on text classification

AK Uysal, S Gunal - Information processing & management, 2014 - Elsevier
Preprocessing is one of the key components in a typical text classification framework. This
paper aims to extensively examine the impact of preprocessing on text classification in terms …

[HTML] Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers

M Siino, I Tinnirello, M La Cascia - Information Systems, 2024 - Elsevier
With the advent of the modern pre-trained Transformers, the text preprocessing has started
to be neglected and not specifically addressed in recent NLP literature. However, both from …

Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling

W Cunha, S Canuto, F Viegas, T Salles… - Information Processing …, 2020 - Elsevier
Text Classification pipelines are a sequence of tasks needed to be performed to classify
documents into a set of predefined categories. The pre-processing phase (before training) of …

[PDF] A loss function analysis for classification methods in text categorization

F Li, Y Yang - Proceedings of the 20th international conference on …, 2003 - cdn.aaai.org
This paper presents a formal analysis of popular text classification methods, focusing on
their loss functions whose minimization is essential to the optimization of those methods …

[PDF] A case study in using linguistic phrases for text categorization on the WWW

J Fürnkranz, T Mitchell, E Riloff - Working Notes of the AAAI/ICML …, 1998 - cdn.aaai.org
Most learning algorithms that are applied to text categorization problems rely on a
bag-of-words document representation, i.e., each word occurring in the document is considered as a …

[PDF] Text classification by labeling words

B Liu, X Li, WS Lee, PS Yu - AAAI, 2004 - cdn.aaai.org
Traditionally, text classifiers are built from labeled training examples. Labeling is usually
done manually by human experts (or the users), which is a labor intensive and time …

Network-based bag-of-words model for text classification

D Yan, K Li, S Gu, L Yang - IEEE Access, 2020 - ieeexplore.ieee.org
The rapidly developing internet and other media have produced a tremendous amount of
text data, making it a challenging and valuable task to find a more effective way to analyze …