Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

M Sanguinetti, C Bosco, L Cassidy, Ö Çetinoğlu… - Language Resources …, 2023 - Springer
This article presents a discussion on the main linguistic phenomena which cause difficulties
in the analysis of user-generated texts found on the web and in social media, and proposes …

Evaluating Off-the-Shelf NLP Tools for German.

K Ortmann, A Roussel, S Dipper - KONVENS, 2019 - sfb1102.uni-saarland.de
It is not always easy to keep track of what tools are currently available for a particular
annotation task, nor is it obvious how the provided models will perform on a given data set …

Iterative named entity recognition with conditional random fields

A Alves-Pinto, C Demus, M Spranger, D Labudde… - Applied Sciences, 2021 - mdpi.com
Named entity recognition (NER) constitutes an important step in the processing of
unstructured text content for the extraction of information as well as for the computer …

Treebanking user-generated content: A proposal for a unified representation in Universal Dependencies

M Sanguinetti, C Bosco, L Cassidy, Ö Çetinoğlu… - Proceedings of the 12th …, 2020 - iris.unica.it
The paper presents a discussion on the main linguistic phenomena of user-generated texts
found in web and social media, and proposes a set of annotation guidelines for their …

A corpus of German political speeches from the 21st century

A Barbaresi - 11th Language Resources and Evaluation Conference …, 2018 - hal.science
The present German political speeches corpus follows from an initial release which has been
used in various research contexts. This article documents an updated and extended version …

Assessing emoji use in modern text processing tools

AAM Shoeb, G De Melo - arXiv preprint arXiv:2101.00430, 2021 - arxiv.org
Emojis have become ubiquitous in digital communication, due to their visual appeal as well
as their ability to vividly convey human emotion, among other factors. The growing …

An annotated social media corpus for German

E Bick - 12th Language Resources and Evaluation …, 2020 - portal.findresearcher.sdu.dk
This paper presents the German Twitter section of a large (2 billion word) bilingual Social
Media corpus for Hate Speech research, discussing the compilation, pseudonymization and …

Etiquetagem morfossintática multigênero para o português do Brasil segundo o modelo "Universal Dependencies"

EH Silva, TAS Pardo, NT Roman - Anais, 2023 - repositorio.usp.br
Part of speech tagging is a process that seeks to identify the grammatical classes of words
and symbols (tokens) in a sentence. For Brazilian Portuguese, there is a variety of …

Building type classification from social media texts via geo-spatial textmining

M Häberle, M Werner, XX Zhu - IGARSS 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
In this work, we present a model for building type classification from Twitter text messages
(tweets) by employing geo-spatial textmining methods. First, we apply standard text pre …

A corpus of German Reddit exchanges (GeRedE)

A Blombach, N Dykes, P Heinrich… - Proceedings of the …, 2020 - aclanthology.org
GeRedE is a 270 million token German CMC corpus containing approximately 380,000
submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is …