Summary: Data sharing is crucial for enabling users to get the most out of their information. Scientists, business professionals, and public policy makers are starting to use data sharing platforms to pool their information for querying for more comprehensive insights. Despite the clear benefits of this opportunity, many data owners are hesitant to release their data to others owing to privacy concerns and regulatory requirements. Many such opportunities pertain to sensitive information held by multiple parties who might be willing to pool their data with that of others for analysis, but are unable to disclose their individual records owing to regulatory requirements or privacy concerns. We call a data sharing platform that queries the union of sensitive data from multiple sources a private data federation. We propose VaultDB, a private data federation that queries the sensitive records of two or more private data providers using secure multi-party computation. In prior work we demonstrated how to seamlessly translate database queries from SQL into secure computation and executed them over a database federation. The queries ran obliviously, hence their instruction traces, I/Os, and network traffic did not reveal information about their inputs. Here, only the federation coordinator and client who submitted the query may view its output. This capability comes at an astoundingly high performance cost, with query runtimes often multiple orders of magnitude slower than their plaintext equivalents. To address this challenge, we propose a N-pronged approach: • Optimization for Private Data Federation Queries: We will generalize the principles of query optimization in relational databases to private data federations that support oblivious query processing. This setting presents novel research challenges because we must design our query plans to avoid secure computation. We will also design and implement novel cost models to capture the performance impact of secure computation and guide the optimizer to automatically select low-cost protocols. We will achieve high-performance queries over secure computation by building analytical cost models to identify efficient query execution plans. These plans will determine how to order the database operators in a query tree and how to partition their input data. We will also propose and evaluate efficient semi-oblivious algorithms for joins and aggregation and add the selection thereof to our query optimizer. • Bounded Intermediate Result Cardinalities: The main property of relational queries that causes their runtimes to explode when run obliviously is the cumulative effect of padding intermediate query result to not reveal information about their input tuples. For example, a join of two relations of length n must unconditionally produce n2 output tuples. Naturally, as we scale to 3, 4, or more joins, our operator output cardinalities jump to n3, n4 and so on. This is not scalable for large datasets. We will devise an algebra to take advantage of properties of the relational model and user-provided hints to bound the growth of intermediate result sizes during query execution. In addition, we will propose novel database operator algorithms–e.g., for joins and aggregates–that exploit these cardinality bounds to further improve our query runtimes. To be clear, we will not investigate novel cryptographic primitives or protocols in the scope of this proposed work. Rather, we will focus on finding efficient execution plans for private data federation queries. Second, we
|Effective start/end date||3/1/19 → 2/29/24|
- National Science Foundation (CNS-1846447)
Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.