A survey of data partitioning and sampling methods to support big data analysis

MS Mahmud, JZ Huang, S Salloum… - Big Data Mining and …, 2020 - ieeexplore.ieee.org
Computer clusters with the shared-nothing architecture are the major computing platforms
for big data processing and analysis. In cluster computing, data partitioning and sampling …

An evidential analytics for buried information in big data samples: Case study of semiconductor manufacturing

YC Ko, H Fujita - Information Sciences, 2019 - Elsevier
Big data samples are an important source for analytics. However, their relevant/irrelevant
information, unspecified variables/scales, noise/null, and so forth impose huge challenges …

Evaluation of sampling methods for scatterplots

J Yuan, S Xiang, J Xia, L Yu… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Given a scatterplot with tens of thousands of points or even more, a natural question is which
sampling method should be used to create a small but “good” scatterplot for a better …

Approximate clustering ensemble method for big data

MS Mahmud, JZ Huang, R Ruby… - … Transactions on Big …, 2023 - ieeexplore.ieee.org
Clustering a big distributed dataset of a hundred gigabytes or more is a challenging task in
distributed computing. A popular method to tackle this problem is to use a random sample of …

Automatic scatterplot design optimization for clustering identification

GJ Quadri, JA Nieves, BM Wiernik… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Scatterplots are among the most widely used visualization techniques. Compelling
scatterplot visualizations improve understanding of data by leveraging visual perception to …

Clustering approximation via a fusion of multiple random samples

MS Mahmud, JZ Huang, S García - Information Fusion, 2024 - Elsevier
In big data clustering exploration, the situation is paradoxical because there is no prior or
insufficient domain knowledge. Moreover, clustering a big dataset is a challenging task in …

Sampling for big data profiling: A survey

Z Liu, A Zhang - IEEE Access, 2020 - ieeexplore.ieee.org
Due to the development of internet technology and computer science, data is exploding at
an exponential rate. Big data brings us new opportunities and challenges. On the one hand …

An ensemble method for estimating the number of clusters in a big data set using multiple random samples

MS Mahmud, JZ Huang, R Ruby, K Wu - Journal of Big Data, 2023 - Springer
Clustering a big dataset without knowing the number of clusters presents a big challenge to
many existing clustering algorithms. In this paper, we propose a Random Sample Partition …

Automatic generation of comparison notebooks for interactive data exploration

A Chanson, N Labroche, P Marcel, S Rizzi, V t'Kindt - EDBT, 2022 - openproceedings.org
We consider the problem of generating SQL notebooks of comparison queries for
Exploratory Data Analysis (EDA). A comparison query makes it possible to find insights in a dataset by …

Exploring and cleaning big data with random sample data blocks

S Salloum, JZ Huang, Y He - Journal of Big Data, 2019 - Springer
Data scientists need scalable methods to explore and clean big data before applying
advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore …