Skew-aware join optimization for array databases

Jennie M Rogers, Olga Papaemmanouil, Leilani Battle, Michael Stonebraker

Research output: Chapter in Book/Report/Conference proceedingConference contribution

10 Citations (Scopus)

Abstract

Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query processing of these systems. This paper focuses on efficient query processing of join operations within an array database. These engines invariably "chunk" their data into multidimensional tiles that they use to efficiently process spatial queries. As such, traditional relational algorithms need to be substantially modified to take advantage of array tiles. Moreover, most n-dimensional science data is unevenly distributed in array space because its underlying observations rarely follow a uniform pattern. It is crucial that the optimization of array joins be skew-aware. In addition, owing to the scale of science applications, their query processing usually spans multiple nodes. This further complicates the planning of array joins. In this paper, we introduce a join optimization framework that is skew-aware for distributed joins. This optimization consists of two phases. In the first, a logical planner selects the query's algorithm (e.g., merge join), the granularity of the its tiles, and the reorganization operations needed to align the data. The second phase implements this logical plan by assigning tiles to cluster nodes using an analytical cost model. Our experimental results, on both synthetic and real-world data, demonstrate that this optimization framework speeds up array joins by up to 2.5X in comparison to the baseline.

Original languageEnglish (US)
Title of host publicationSIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages123-135
Number of pages13
Volume2015-May
ISBN (Electronic)9781450327589
DOIs
StatePublished - May 27 2015
EventACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: May 31 2015Jun 4 2015

Other

OtherACM SIGMOD International Conference on Management of Data, SIGMOD 2015
CountryAustralia
CityMelbourne
Period5/31/156/4/15

Fingerprint

Tile
Query processing
Engines
Planning
Costs

ASJC Scopus subject areas

  • Software
  • Information Systems

Cite this

Rogers, J. M., Papaemmanouil, O., Battle, L., & Stonebraker, M. (2015). Skew-aware join optimization for array databases. In SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Vol. 2015-May, pp. 123-135). Association for Computing Machinery. https://doi.org/10.1145/2723372.2723709
Rogers, Jennie M ; Papaemmanouil, Olga ; Battle, Leilani ; Stonebraker, Michael. / Skew-aware join optimization for array databases. SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Vol. 2015-May Association for Computing Machinery, 2015. pp. 123-135
@inproceedings{b74e89cb287f4cc6be5f9930ed928106,
title = "Skew-aware join optimization for array databases",
abstract = "Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query processing of these systems. This paper focuses on efficient query processing of join operations within an array database. These engines invariably {"}chunk{"} their data into multidimensional tiles that they use to efficiently process spatial queries. As such, traditional relational algorithms need to be substantially modified to take advantage of array tiles. Moreover, most n-dimensional science data is unevenly distributed in array space because its underlying observations rarely follow a uniform pattern. It is crucial that the optimization of array joins be skew-aware. In addition, owing to the scale of science applications, their query processing usually spans multiple nodes. This further complicates the planning of array joins. In this paper, we introduce a join optimization framework that is skew-aware for distributed joins. This optimization consists of two phases. In the first, a logical planner selects the query's algorithm (e.g., merge join), the granularity of the its tiles, and the reorganization operations needed to align the data. The second phase implements this logical plan by assigning tiles to cluster nodes using an analytical cost model. Our experimental results, on both synthetic and real-world data, demonstrate that this optimization framework speeds up array joins by up to 2.5X in comparison to the baseline.",
author = "Rogers, {Jennie M} and Olga Papaemmanouil and Leilani Battle and Michael Stonebraker",
year = "2015",
month = "5",
day = "27",
doi = "10.1145/2723372.2723709",
language = "English (US)",
volume = "2015-May",
pages = "123--135",
booktitle = "SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data",
publisher = "Association for Computing Machinery",

}

Rogers, JM, Papaemmanouil, O, Battle, L & Stonebraker, M 2015, Skew-aware join optimization for array databases. in SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. vol. 2015-May, Association for Computing Machinery, pp. 123-135, ACM SIGMOD International Conference on Management of Data, SIGMOD 2015, Melbourne, Australia, 5/31/15. https://doi.org/10.1145/2723372.2723709

Skew-aware join optimization for array databases. / Rogers, Jennie M; Papaemmanouil, Olga; Battle, Leilani; Stonebraker, Michael.

SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Vol. 2015-May Association for Computing Machinery, 2015. p. 123-135.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - Skew-aware join optimization for array databases

AU - Rogers, Jennie M

AU - Papaemmanouil, Olga

AU - Battle, Leilani

AU - Stonebraker, Michael

PY - 2015/5/27

Y1 - 2015/5/27

N2 - Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query processing of these systems. This paper focuses on efficient query processing of join operations within an array database. These engines invariably "chunk" their data into multidimensional tiles that they use to efficiently process spatial queries. As such, traditional relational algorithms need to be substantially modified to take advantage of array tiles. Moreover, most n-dimensional science data is unevenly distributed in array space because its underlying observations rarely follow a uniform pattern. It is crucial that the optimization of array joins be skew-aware. In addition, owing to the scale of science applications, their query processing usually spans multiple nodes. This further complicates the planning of array joins. In this paper, we introduce a join optimization framework that is skew-aware for distributed joins. This optimization consists of two phases. In the first, a logical planner selects the query's algorithm (e.g., merge join), the granularity of the its tiles, and the reorganization operations needed to align the data. The second phase implements this logical plan by assigning tiles to cluster nodes using an analytical cost model. Our experimental results, on both synthetic and real-world data, demonstrate that this optimization framework speeds up array joins by up to 2.5X in comparison to the baseline.

AB - Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query processing of these systems. This paper focuses on efficient query processing of join operations within an array database. These engines invariably "chunk" their data into multidimensional tiles that they use to efficiently process spatial queries. As such, traditional relational algorithms need to be substantially modified to take advantage of array tiles. Moreover, most n-dimensional science data is unevenly distributed in array space because its underlying observations rarely follow a uniform pattern. It is crucial that the optimization of array joins be skew-aware. In addition, owing to the scale of science applications, their query processing usually spans multiple nodes. This further complicates the planning of array joins. In this paper, we introduce a join optimization framework that is skew-aware for distributed joins. This optimization consists of two phases. In the first, a logical planner selects the query's algorithm (e.g., merge join), the granularity of the its tiles, and the reorganization operations needed to align the data. The second phase implements this logical plan by assigning tiles to cluster nodes using an analytical cost model. Our experimental results, on both synthetic and real-world data, demonstrate that this optimization framework speeds up array joins by up to 2.5X in comparison to the baseline.

UR - http://www.scopus.com/inward/record.url?scp=84957604356&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84957604356&partnerID=8YFLogxK

U2 - 10.1145/2723372.2723709

DO - 10.1145/2723372.2723709

M3 - Conference contribution

VL - 2015-May

SP - 123

EP - 135

BT - SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

PB - Association for Computing Machinery

ER -

Rogers JM, Papaemmanouil O, Battle L, Stonebraker M. Skew-aware join optimization for array databases. In SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Vol. 2015-May. Association for Computing Machinery. 2015. p. 123-135 https://doi.org/10.1145/2723372.2723709