Distributed data management using MapReduce

F Li, BC Ooi, MT Özsu, S Wu - ACM Computing Surveys (CSUR), 2014 - dl.acm.org
MapReduce is a framework for processing and managing large-scale datasets in a
distributed cluster, which has been used for applications such as generating search indexes …

Decoding billions of integers per second through vectorization

D Lemire, L Boytsov - Software: Practice and Experience, 2015 - Wiley Online Library
In many important applications—such as search engines and relational database systems—
data are stored in the form of arrays of integers. Encoding and, most importantly, decoding of …

RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems

Y He, R Lee, Y Huai, Z Shao, N Jain… - 2011 IEEE 27th …, 2011 - ieeexplore.ieee.org
MapReduce-based data warehouse systems are playing important roles of supporting big
data analytics to understand quickly the dynamics of user behavior trends and their needs in …

Modern B-tree techniques

G Graefe - Foundations and Trends® in Databases, 2011 - nowpublishers.com
Invented about 40 years ago and called ubiquitous less than 10 years later, B-tree indexes
have been used in a wide variety of computing systems from handheld devices to …

Architecture of a database system

JM Hellerstein, M Stonebraker… - … and Trends® in …, 2007 - nowpublishers.com
Database Management Systems (DBMSs) are a ubiquitous and critical component
of modern computing, and the result of decades of research and development in both …

Cheetah: a high performance, custom data warehouse on top of MapReduce

S Chen - Proceedings of the VLDB Endowment, 2010 - dl.acm.org
Large-scale data analysis has become increasingly important for many enterprises.
Recently, a new distributed computing paradigm, called MapReduce, and its open source …

SEAL: Storage-efficient causality analysis on enterprise logs with query-friendly compression

P Fei, Z Li, Z Wang, X Yu, D Li, K Jee - 30th USENIX Security …, 2021 - usenix.org
Causality analysis automates attack forensic and facilitates behavioral detection by
associating causally related but temporally distant system events. Despite its proven …

The FastLanes compression layout: Decoding >100 billion integers per second with scalar code

A Afroozeh, P Boncz - Proceedings of the VLDB Endowment, 2023 - dl.acm.org
The open-source FastLanes project aims to improve big data formats, such as Parquet, ORC
and columnar database formats, in multiple ways. In this paper, we significantly accelerate …

Compressed linear algebra for large-scale machine learning

A Elgohary, M Boehm, PJ Haas, FR Reiss… - Proceedings of the …, 2016 - dl.acm.org
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only
data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It …

Constant-time query processing

V Raman, G Swart, L Qiao, F Reiss… - 2008 IEEE 24th …, 2008 - ieeexplore.ieee.org
Query performance in current systems depends significantly on tuning: how well the query
matches the available indexes, materialized views etc. Even in a well tuned system, there …