2nd Workshop on Reproducible Workflows, Data Management, and Security

Conference website https://sites.google.com/view/rewords22/home

Submission link https://easychair.org/conferences/?conf=rewords22

Submission deadline July 8, 2022

Topics: reproducibility for big data/AI workflows; data management tools and techniques; provenance management techniques and tools; reproducibility efforts and techniques

This workshop explores innovations and experiences in developing portable, general, reproducible workflows, with attention to providing open data of verifiable authenticity and protecting privacy where needed. We are looking for community discussion and participation on the above topics plus the following. First, component packaging via containers and virtual machines, automation scripting, deployment, portability builds, and system support for these and other relevant activities. Second, provenance collection, exploration, and tracking, which are key for well-documented scientific output. Third, issues with managing large data sets and workflow intermediate data, particularly where the data is publicly accessed for use and reuse. Finally, new techniques and technologies that address portability and reproducibility requirements, such as those required for peer-reviewed publication.

Submission Guidelines

All papers must be original and not simultaneously submitted to another journal or conference. The following paper categories are welcome:

Full papers describing emerging and future computational workloads that combine traditional HPC applications with tools and techniques from the scale-out data analytics and machine learning communities. Getting these technologies to co-exist and interoperate to advance scientific discovery is a daunting task with few known good solutions. In general, constructing these workflows can create pitfalls and incompatibilities that limit adoption.

Formalizing the steps necessary for an application or data processing pipeline is increasingly popular and necessary. Requirements for reproducibility artifacts at publishing venues are also driving this formalization. The processes and infrastructure used to meet these requirements are frequently bespoke to a particular research area. All of these formalization activities can be described as workflow systems. Existing off-the-shelf tools address a distributed environment fairly well, but are not complete solutions and do not address the scale-up community much, if at all.

Workflow management is further complicated by the need to manage data both during and after workflow execution and to offer authentication and data security for shared data sets. With some data, such as climate simulation output, subject to intense scrutiny, it becomes crucial to offer open data that can be verified as authentic by means of encrypted creator identities and that is accessible only to people with a need to know. Being able to share and analyze data knowing it is authentic, while protecting the identity and privacy of the scientists who created it, is essential for reliable open science.

This workshop seeks to explore ideas and experiences on what kinds of infrastructure developments can improve upon the state of the art. Component packaging via containers and virtual machines, automation scripting, deployment, portability builds, and system support for these and other relevant activities are key infrastructure. Provenance collection, exploration, and tracking are key for well-documented scientific output. Experiences using existing systems to achieve these goals are important for developing best practices that span application domains. Data privacy techniques such as multi-party encryption and differential privacy are important as well. Submissions on issues with managing large data sets and workflow intermediate data, particularly where the data is publicly accessed for use and reuse, are encouraged. New techniques and technologies that address reproducibility requirements are also requested. We seek work on all of these and related topics, as well as position and experience papers looking to drive conversation among practitioners and researchers in these spaces.

This workshop contributes by sharing experiences and exploring the various technological infrastructure needs to support effective, convenient workflow systems and application composition structures and approaches across a broad spectrum of HPC environments from clusters to supercomputers to cloud systems.

Paper submission guidelines:

Papers should be formatted in IEEE format following eScience formatting rules and can be 5 pages, not including references. Papers should use double-column, single-spaced text in 10-point type on 8.5 x 11 inch pages, as per the IEEE 8.5 x 11 manuscript guidelines. Templates are available from this link.

List of Topics

Position and Experience papers related to scientific applications and platforms on related topics (particularly the topics listed below)

Big Data or AI workflow systems like Spark, Hadoop, and TensorFlow in conjunction with data management and reproducibility efforts and techniques

Front end systems for configuring or controlling workflows

Data management tools and techniques

Privacy preserving methods to enable data sharing

Workflow engines designed to simplify workflow construction for end users

Provenance management and collection techniques and tools

Application-specific workflow implementations

Mechanisms to support combining multiple application components into a composite application or workflow

Software engineering tools and techniques to support workflow creation, execution, and use

Reproducibility supporting infrastructure

In situ analytics or visualization support for workflows

System software/OS features to enable workflow tools and application composition

Programming support for assembling workflows or connecting application components

Reusable components intended as either "glue" between workflow components or for analysis or other processing

Storage (both disk and in compute area) support for buffering between components

Programming support for addressing data format/contents mismatch

Programming support for resource management

Program Committee

Claire Bowen (Urban Institute)

Juliana Freire (ReproZip)

Loïc Pottier (ISI)

Rafael Ferreira Da Silva (ISI)

Jakob Lüttgau (UTK)

Tom Peterka (ANL)

Jay Jay Billings (Amazon)

Margo Seltzer (Harvard)

Hariharan Devarajan (LLNL)

Dmitry Duplyakin (NREL)

Organizing Committee

Jay Lofstead (Sandia)

Jai Dayal (Intel)

Anthony Kougkas (IIT)

Paula Olaya (UTK)

Contact

All questions about submissions should be emailed to gflofst@sandia.gov