CAREER: Efficient Query Processing for Private Data Federations

Project: Research project

Description

Summary: Data sharing is crucial for enabling users to get the most out of their information.
Scientists, business professionals, and public policy makers are starting to use data sharing platforms
to pool their information for querying for more comprehensive insights. Despite the clear
benefits of this opportunity, many data owners are hesitant to release their data to others owing
to privacy concerns and regulatory requirements. Many such opportunities pertain to sensitive
information held by multiple parties who might be willing to pool their data with that of others
for analysis, but are unable to disclose their individual records owing to regulatory requirements
or privacy concerns. We call a data sharing platform that queries the union of sensitive data from
multiple sources a private data federation.
We propose VaultDB, a private data federation that queries the sensitive records of two or
more private data providers using secure multi-party computation. In prior work we demonstrated
how to seamlessly translate database queries from SQL into secure computation and
executed them over a database federation. The queries ran obliviously, hence their instruction
traces, I/Os, and network traffic did not reveal information about their inputs. Here, only the
federation coordinator and client who submitted the query may view its output. This capability
comes at an astoundingly high performance cost, with query runtimes often multiple orders
of magnitude slower than their plaintext equivalents. To address this challenge, we propose a
N-pronged approach:
• Optimization for Private Data Federation Queries: We will generalize the principles of
query optimization in relational databases to private data federations that support oblivious
query processing. This setting presents novel research challenges because we must design
our query plans to avoid secure computation. We will also design and implement novel cost
models to capture the performance impact of secure computation and guide the optimizer
to automatically select low-cost protocols. We will achieve high-performance queries over
secure computation by building analytical cost models to identify efficient query execution
plans. These plans will determine how to order the database operators in a query tree and
how to partition their input data. We will also propose and evaluate efficient semi-oblivious
algorithms for joins and aggregation and add the selection thereof to our query optimizer.
• Bounded Intermediate Result Cardinalities: The main property of relational queries that causes
their runtimes to explode when run obliviously is the cumulative effect of padding intermediate
query result to not reveal information about their input tuples. For example, a join
of two relations of length n must unconditionally produce n2 output tuples. Naturally, as
we scale to 3, 4, or more joins, our operator output cardinalities jump to n3, n4 and so on.
This is not scalable for large datasets. We will devise an algebra to take advantage of properties
of the relational model and user-provided hints to bound the growth of intermediate
result sizes during query execution. In addition, we will propose novel database operator
algorithms–e.g., for joins and aggregates–that exploit these cardinality bounds to further
improve our query runtimes.
To be clear, we will not investigate novel cryptographic primitives or protocols in the scope
of this proposed work. Rather, we will focus on finding efficient execution plans for private
data federation queries. Second, we
StatusActive
Effective start/end date3/1/192/29/24

Funding

  • National Science Foundation (CNS-1846447)

Fingerprint

Query processing
Costs
Algebra
Agglomeration
Processing
Industry