[BOOK][B] The data matching process

P Christen - 2012 - Springer
This chapter provides an overview of the data matching process, and describes the five
major steps involved in this process: data pre-processing (cleaning and standardisation) …

[BOOK][B] An introduction to Kolmogorov complexity and its applications

M Li, P Vitányi - 2008 - Springer
Ming Li, Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications,
Fourth Edition. Texts in Computer Science …

Data-Centric Systems and Applications

MJ Carey, S Ceri, P Bernstein, U Dayal, C Faloutsos… - Italy: Springer, 2006 - Springer
The rapid growth of the Web in the past two decades has made it the largest publicly
accessible data source in the world. Web mining aims to discover useful information or …

An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance

E Raff, C Nicholas - Proceedings of the 23rd ACM SIGKDD international …, 2017 - dl.acm.org
The Normalized Compression Distance (NCD) has been used in a number of domains to
compare objects with varying feature types. This flexibility comes from the use of general …

A survey of machine learning methods and challenges for Windows malware classification

E Raff, C Nicholas - arXiv preprint arXiv:2006.09271, 2020 - arxiv.org
Malware classification is a difficult problem, to which machine learning methods have been
applied for decades. Yet progress has often been slow, in part due to a number of unique …

Detecting visually similar web pages: Application to phishing detection

TC Chen, S Dick, J Miller - ACM Transactions on Internet Technology …, 2010 - dl.acm.org
We propose a novel approach for detecting visual similarity between two Web pages. The
proposed approach applies Gestalt theory and considers a Web page as a single indivisible …

Normalized information distance

PMB Vitányi, FJ Balbach, RL Cilibrasi, M Li - Information theory and …, 2009 - Springer
The normalized information distance is a universal distance measure for objects of all kinds.
It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it …

A novel hyperparameter-free approach to decision tree construction that avoids overfitting by design

RG Leiva, AF Anta, V Mancuso, P Casari - IEEE Access, 2019 - ieeexplore.ieee.org
Decision trees are an extremely popular machine learning technique. Unfortunately,
overfitting in decision trees still remains an open issue that sometimes prevents achieving …

[HTML][HTML] Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study

JA Lees, M Kendall, J Parkhill, C Colijn… - Wellcome open …, 2018 - ncbi.nlm.nih.gov
Background: Phylogenetic reconstruction is a necessary first step in many analyses which
use whole genome sequence data from bacterial populations. There are many available …

Lempel-Ziv Jaccard Distance, an effective alternative to ssdeep and sdhash

E Raff, C Nicholas - Digital Investigation, 2018 - Elsevier
Recent work has proposed the Lempel-Ziv Jaccard Distance (LZJD) as a method to
measure the similarity between binary byte sequences for malware classification. We …