Automatic genre identification: a survey

T Kuzman, N Ljubešić - Language Resources and Evaluation, 2023 - Springer
Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text
categories defined by the author's purpose, common function of the text, and the text's …

[BOOK] Register variation online

D Biber, J Egbert - 2018 - books.google.com
While other books focus on special internet registers, like tweets or texting, no previous
study describes the full range of everyday registers found on the searchable web. These are …

[PDF] The PAISA corpus of Italian web texts

V Lyding, E Stemle, C Borghetti, M Brunello… - Proceedings of the 9th …, 2014 - arpi.unipi.it
The PAISA Corpus of Italian Web Texts. Felix Bildhauer & Roland Schäfer (eds.),
Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ EACL 2014, pages 36–43 …

Register variation on the searchable web: A multi-dimensional analysis

D Biber, J Egbert - Journal of English Linguistics, 2016 - journals.sagepub.com
Most previous linguistic investigations of the web have focused on special linguistic features
associated with Internet language (e.g., the use of emoticons, abbreviations, contractions …

Exploring the composition of the searchable web: A corpus-based taxonomy of web registers

D Biber, J Egbert, M Davies - Corpora, 2015 - euppublishing.com
One major challenge for Web-As-Corpus research is that a typical Web search provides little
information about the register of the documents that are searched. Previous research has …

Automatic genre identification for robust enrichment of massive text collections: Investigation of classification methods in the era of large language models

T Kuzman, I Mozetič, N Ljubešić - Machine Learning and Knowledge …, 2023 - mdpi.com
Massive text collections are the backbone of large language models, the main ingredient of
the current significant progress in artificial intelligence. However, as these collections are …

Developing a bottom‐up, user‐based method of web register classification

J Egbert, D Biber, M Davies - Journal of the Association for …, 2015 - Wiley Online Library
This paper introduces a project to develop a reliable, cost‐effective method for classifying
Internet texts into register categories, and apply that approach to the analysis of a large …

It's how you say it: Identifying appropriate register for chatbot language design

AP Chaves, E Doerry, J Egbert, M Gerosa - Proceedings of the 7th …, 2019 - dl.acm.org
Designing chatbots that produce language that is natural and appropriate to a given context
is critical in satisfying user expectations. Currently, little is known about how a chatbot's …

The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres

T Chanier, C Poudat, B Sagot, G Antoniadis… - Journal for language …, 2014 - shs.hal.science
The CoMeRe project aims to build a kernel corpus of different Computer-Mediated
Communication (CMC) genres with interactions in French as the main language, by assembling …

The GINCO training dataset for web genre identification of documents out in the wild

T Kuzman, P Rupnik, N Ljubešić - arXiv preprint arXiv:2201.03857, 2022 - arxiv.org
This paper presents a new training dataset for automatic genre identification GINCO, which
is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words …