The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large

S Biswas, M Wardat, H Rajan - … of the 44th International Conference on …, 2022 - dl.acm.org
An increasingly large number of software systems today include data science
components for descriptive, predictive, and prescriptive analytics. The collection of data …

BiasFinder: Metamorphic test generation to uncover bias for sentiment analysis systems

MH Asyrofi, Z Yang, INB Yusuf, HJ Kang… - IEEE Transactions …, 2021 - ieeexplore.ieee.org
Artificial intelligence systems, such as Sentiment Analysis (SA) systems, typically learn from
large amounts of data that may reflect human bias. Consequently, such systems may exhibit …

KGTorrent: A dataset of Python Jupyter notebooks from Kaggle

L Quaranta, F Calefato… - 2021 IEEE/ACM 18th …, 2021 - ieeexplore.ieee.org
Computational notebooks have become the tool of choice for many data scientists and
practitioners for performing analyses and disseminating results. Despite their increasing …

Computational reproducibility of Jupyter notebooks from biomedical publications

S Samuel, D Mietchen - GigaScience, 2024 - academic.oup.com
Background: Jupyter notebooks facilitate the bundling of executable code with its
documentation and output in one interactive environment, and they represent a popular …

A large-scale comparison of Python code in Jupyter notebooks and scripts

K Grotov, S Titov, V Sotnikov, Y Golubev… - Proceedings of the 19th …, 2022 - dl.acm.org
In recent years, Jupyter notebooks have grown in popularity in several domains of software
engineering, such as data science, machine learning, and computer science education …

Towards better dependency management: A first look at dependency smells in Python projects

Y Cao, L Chen, W Ma, Y Li, Y Zhou… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Managing cross-project dependencies is tricky in modern software development. A primary
way to manage dependencies is using dependency configuration files, which brings …

A dataset and analysis of open-source machine learning products

N Nahar, H Zhang, G Lewis, S Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
Machine learning (ML) components are increasingly incorporated into software products, yet
developers face challenges in transitioning from ML prototypes to products. Academic …

smartPip: A smart approach to resolving Python dependency conflict issues

C Wang, R Wu, H Song, J Shu, G Li - Proceedings of the 37th IEEE/ACM …, 2022 - dl.acm.org
As one of the representative software ecosystems, PyPI, together with the Python package
management tool pip, greatly facilitates Python developers to automatically manage the …

LExecutor: Learning-guided execution

B Souza, M Pradel - Proceedings of the 31st ACM Joint European …, 2023 - dl.acm.org
Executing code is essential for various program analysis tasks, e.g., to detect bugs that
manifest through exceptions or to obtain execution traces for further dynamic analysis. How …

Workflow analysis of data science code in public GitHub repositories

D Ramasamy, C Sarasua, A Bacchelli… - Empirical Software …, 2023 - Springer
Despite the ubiquity of data science, we are far from rigorously understanding how coding in
data science is performed. Even though the scientific literature has hinted at the iterative and …