Methods for quantifying dataset similarity: a review, taxonomy and comparison

M Stolte, F Kappenberg, J Rahnenführer… - Statistic …, 2024 - projecteuclid.org
Quantifying the similarity between datasets has widespread applications in statistics and
machine learning. The performance of a predictive model on novel datasets, referred to as …

Data augmentation is a hyperparameter: Cherry-picked self-supervision for unsupervised anomaly detection is creating the illusion of success

J Yoo, T Zhao, L Akoglu - arXiv preprint arXiv:2208.07734, 2022 - arxiv.org
Self-supervised learning (SSL) has emerged as a promising alternative to create
supervisory signals to real-world problems, avoiding the extensive cost of manual labeling …

A Review and Taxonomy of Methods for Quantifying Dataset Similarity

M Stolte, A Bommert, J Rahnenführer - arXiv preprint arXiv:2312.04078, 2023 - arxiv.org
In statistics and machine learning, measuring the similarity between two or more datasets is
important for several purposes. The performance of a predictive model on novel datasets …

Using GPT-3 as a Text Data Augmentator for a Complex Text Detector

M Romero-Sandoval… - 2023 IEEE 5th …, 2023 - ieeexplore.ieee.org
In this work, we explore the problem of complex text detection. This problem is a frequent
challenge when implementing text simplification pipelines. Identifying complex text …

Active Labeling Aided Semi-Supervised Safety Assessment With Task-Related Unknown Scenarios

C Liu, X He, M Li, Y Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The open environment presents a challenging issue for the online safety assessment of
dynamic systems, which means that unknown scenarios may arise unexpectedly. These …

A Novel Dataset for Financial Education Text Simplification in Spanish

N Perez-Rojas, S Calderon-Ramirez… - arXiv preprint arXiv …, 2023 - arxiv.org
Text simplification, crucial in natural language processing, aims to make texts more
comprehensible, particularly for specific groups like visually impaired Spanish speakers, a …

DDD: Discriminative Difficulty Distance for plant disease diagnosis

Y Arima, S Kagiwada, H Iyatomi - arXiv preprint arXiv:2501.00734, 2025 - arxiv.org
Recent studies on plant disease diagnosis using machine learning (ML) have highlighted
concerns about the overestimated diagnostic performance due to inappropriate data …

Uncertainty Estimation for Complex Text Detection in Spanish

M Abreu-Cardenas… - 2023 IEEE 5th …, 2023 - ieeexplore.ieee.org
Text simplification refers to the transformation of a source text aiming to increase its readability
and understandability for a specific target population. This task is an important step towards …

FG-AI4H DEL5.4 Training and test data specification

S Sector - itu.int
Summary: ITU-T FG-AI4H Deliverable DEL5.4 provides guidelines on the systematic way of
preparing technical requirements specifications for datasets used in the training and testing …