Scalable, In-situ Data Clustering Data Analysis for Extreme Scale Scientific Computing

Project: Research project

Description

Abstract
The challenges of extreme-scale computing systems span multiple dimensions, including architecture, energy constraints, memory scaling, limited I/O, and the scalability of software and applications. These constraints, together with the need for faster scientific discovery, have created a demand for scalable, in-situ analysis. The larger the simulations run on extreme-scale systems, the greater the need for effective data analysis and derivation of insights at a faster pace, within the constraints of limited storage space, deeper and more complex memory hierarchies, and the need to minimize data movement because of energy and I/O constraints. The traditional model of storing raw and/or derived data and analyzing it later will become cost-prohibitive in the exascale computing realm. Furthermore, continuously involving a human in the loop to analyze data will become less effective because of the sheer size and complexity of the data. For in-situ analysis, designing analytics algorithms and software by simply extending the assumptions of the off-line model may not work; the analysis algorithms, runtime, and software therefore need to be rethought and redesigned. To keep pace with the ever-increasing computational parallelism demanded by large-scale simulations, the analysis algorithms must be customizable to the needs of the simulation and the data it produces for deriving insights.

The objective of this proposal is to address challenges in the design and development of scalable in-situ analytics algorithms and software based on “Scalable Thinking”. The proposed research and development includes scalable algorithms and software for spatio-temporal data clustering, anomaly detection, and learning data distributions, all designed for in-situ implementation and execution. These capabilities are important for large-scale analysis and have wide applicability. Our design approach is driven by rethinking and reformulation within the constraints posed by in-situ analysis requirements at multiple levels. This is particularly important for data analytics in the extreme-scale computing environment, because most existing techniques were developed, validated, and optimized on small data sets and may be neither scalable nor suitable for in-situ analytics on the emerging extreme-scale computing systems.

Another key component of our approach will be co-design: developing scalable algorithms and software that take full advantage of new architectures, rather than simply taking existing techniques and scaling them.

Methodology – Our design principles for developing in-situ data analysis software are to: (1) identify parts of the computation that can be done close to the data, while it is still in memory; (2) extract the analysis components that perform best in-situ or post-hoc; (3) determine which derived distributions and statistics of a given spatio-temporal data set can be kept locally, in order to both accelerate computation and meet energy constraints in subsequent iterations and phases; (4) make use of self-describing formats so that data is consistent and understood at staging and analysis nodes, thereby providing portability and flexibility. Our algorithms and software will be scalable, reusable, extensible, and generic across applications in different disciplines. The algorithms and software will be developed to run in-situ with the simulations, as well as during post-hoc analysis.
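As an illustration of principles (1) and (3) only, the following is a minimal sketch of how a compact summary (running mean, variance, and a fixed-range histogram) might be updated while each timestep's values are still in memory, and later merged across ranks or phases without revisiting the raw data. It is not the project's software: the class `RunningSummary`, its fields, and the fixed histogram range `lo`/`hi` are hypothetical names and assumptions chosen for this example. The update uses Welford's online algorithm and the merge uses the standard pairwise combination formula for means and variances.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class RunningSummary:
    """Compact local summary of a scalar field (hypothetical helper)."""
    lo: float            # assumed lower bound of the histogram range
    hi: float            # assumed upper bound of the histogram range
    bins: int = 8
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0      # running sum of squared deviations from the mean
    hist: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.hist:
            self.hist = [0] * self.bins

    def update(self, values: Iterable[float]) -> None:
        """Fold one timestep's in-memory values into the summary (Welford)."""
        width = (self.hi - self.lo) / self.bins
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((x - self.lo) / width), 0), self.bins - 1)
            self.hist[idx] += 1

    def merge(self, other: "RunningSummary") -> None:
        """Combine with a summary built elsewhere (same lo, hi, bins assumed)."""
        if other.count == 0:
            return
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        self.hist = [a + b for a, b in zip(self.hist, other.hist)]

    @property
    def variance(self) -> float:
        """Population variance of everything seen so far."""
        return self.m2 / self.count if self.count else 0.0
```

In this sketch, each simulation process would call `update` on the values it already holds for the current timestep and keep only the resulting summary; summaries from different processes or phases can then be combined with `merge`, so that only a few numbers per field, rather than the raw data, need to be moved or stored.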

Potential Impact – We believe that this approach will satisfy many synergistic requirements for data
intensive appl

Status: Active
Effective start/end date: 8/1/15 to 7/31/19

Funding

  • Department of Energy (DE-SC0014330)

Fingerprint

Natural sciences computing
Data storage equipment
Scalability
Statistics