A Science Platform for Accessing and Analyzing Large Datasets

About

The next decade of astronomy will be marked by automated sky surveys delivering large (multi-TB to PB-sized), rich, spatio-temporal data sets. Research with such data volumes presents a challenge: downloads and local (especially single-computer) analyses are becoming impractically slow, and setting up remote, distributed data analytics tools requires technical expertise not readily available to typical astronomy groups. A possible solution is to offer remote, next-to-the-data analysis; however, such services, known as science platforms, still face the challenges of scaling sufficiently and remaining easy to use.

We build and demonstrate a solution based on (adapted) industry-standard tools and made accessible through web gateways. Our motivation is to enable the analysis of data from the Zwicky Transient Facility (ZTF), a precursor to LSST, within the ZTF Partnership. The platform is built on Amazon Web Services (with Kubernetes and S3 as the orchestration and storage layers, respectively), uses Apache Spark (with the Astronomy eXtensions for Spark, or AXS, framework) for parallel data analytics, and provides JupyterHub as the web-accessible front end.

With this setup, we show that processing can be scaled transparently, with no end-user interaction. We outline the architecture of the analysis platform, provide implementation details and the rationale for (and against) technology choices, verify scalability through strong and weak scaling tests, and demonstrate usability through an example science analysis of data from the ZTF. The code is available at https://github.com/astronomy-commons/science-platform. To our knowledge, this is a first application of cloud-based scalable analytics to astronomical datasets approaching LSST-scale.
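For readers unfamiliar with the terms, the sketch below shows how strong and weak scaling results are commonly summarized. It is a generic illustration, not the analysis code used in the paper: the function names are hypothetical and the timings in the usage example are made-up placeholders, not measurements from the platform.

```python
def strong_scaling(times_by_workers):
    """Summarize a strong scaling test (fixed total problem size).

    times_by_workers maps worker count -> wall-clock time.
    Returns worker count -> (speedup, efficiency), both measured
    relative to the run with the fewest workers.
    """
    base_n = min(times_by_workers)
    base_t = times_by_workers[base_n]
    results = {}
    for n, t in sorted(times_by_workers.items()):
        speedup = base_t / t                  # ideal: n / base_n
        efficiency = speedup / (n / base_n)   # ideal: 1.0
        results[n] = (speedup, efficiency)
    return results


def weak_scaling(times_by_workers):
    """Summarize a weak scaling test (problem size grows with workers).

    Ideal behaviour is constant wall-clock time, so efficiency is the
    baseline time divided by the time at each worker count.
    """
    base_n = min(times_by_workers)
    base_t = times_by_workers[base_n]
    return {n: base_t / t for n, t in sorted(times_by_workers.items())}


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    print(strong_scaling({1: 100.0, 2: 55.0, 4: 30.0}))
    print(weak_scaling({1: 10.0, 2: 11.0, 4: 12.5}))
```

An efficiency close to 1.0 in both tests indicates that adding workers is being converted into proportionally faster processing (strong scaling) or proportionally larger data volumes at constant run time (weak scaling).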

Talks and Publications

ApJ

This work has been published in the Astrophysical Journal (ApJ). This publication provides more detail than the Gateways paper.

Read

Citation: Stetzler, S. et al. (in prep.)


Gateways 2020

We presented this work virtually at the Gateways 2020 Conference, producing both a talk and a paper for the conference proceedings.

Watch Read

Citation: Stetzler, S. et al. 2020. “A Scalable Cloud-Based Analysis Platform for Survey Astronomy.” Paper presented at Gateways 2020, Online, USA, October 12-23, 2020. https://osf.io/e2zwf/.

Cost Calculator

How much would it cost you to deploy this? Here we provide a cost calculator that reproduces the variable costs in Table 2 of the main paper. You can play around with the inputs to the calculator to estimate costs for setting up an identical system for your own purposes.

Parameters

System
Interactive Usage
Queries

Variable Cost

Total:
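As a rough offline stand-in for the calculator, the sketch below estimates a variable compute cost from interactive usage and query workloads. It is an assumption-laden illustration, not the formula behind Table 2: the function name, the parameter names, and the breakdown into node-hours are all hypothetical, and the price in the usage example is a placeholder rather than a real AWS rate.

```python
def variable_cost(node_price_per_hour,
                  users, hours_per_user,
                  n_queries, workers_per_query, hours_per_query):
    """Estimate variable compute cost for one billing period.

    Assumes (hypothetically) that every interactive user occupies one
    node while active, that each query occupies `workers_per_query`
    nodes for `hours_per_query` hours, and that all nodes share a
    single hourly price.
    """
    interactive_node_hours = users * hours_per_user
    query_node_hours = n_queries * workers_per_query * hours_per_query

    interactive = node_price_per_hour * interactive_node_hours
    queries = node_price_per_hour * query_node_hours
    return {
        "interactive": interactive,
        "queries": queries,
        "total": interactive + queries,
    }


if __name__ == "__main__":
    # Placeholder inputs; substitute current prices for your region
    # and instance type before relying on the result.
    estimate = variable_cost(
        node_price_per_hour=0.5,
        users=4, hours_per_user=10,
        n_queries=3, workers_per_query=8, hours_per_query=0.25,
    )
    for item, cost in estimate.items():
        print(f"{item}: ${cost:.2f}")
```

Costs that do not scale with usage, such as long-term storage of the data set, are deliberately left out of this sketch; consult the paper for how those are accounted for.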

Contact

You can reach out to the authors via email: