The Department of Energy’s Exascale Computing Project (ECP) Software Technology focus area is tasked with enhancing the software stack on which DOE applications rely in order to meet the needs of exascale applications. Specifically, these projects conduct research and development on tools and methods that will allow the efficient utilization of exascale systems, enhance productivity, and facilitate portability. LLNL researchers contribute to a broad range of ECP Software Technology projects:
LLNL Led
Principal Investigator: David Richards, LLNL
The Advanced Architecture and Portability Specialists (AAPS) team seeks to identify best practices for exascale code development through flexible programming abstractions, proxy apps, and hands-on engagements with code teams and vendors. The AAPS team gathers and shares best practices from a range of applications and proxy apps, accelerating progress across projects and reducing duplication of effort. The team engages vendors and the broader community to improve programming models and compilers and to help identify and overcome potential obstacles for specific applications.
Principal Investigator: Carol Woodward, LLNL
SUNDIALS provides adaptive time integrators (with sensitivity capabilities) for ODEs and DAEs along with an efficient nonlinear solver package. In collaboration with Southern Methodist University, this project will add a multivector capability to better support multiphysics simulations and heterogeneous architectures, encapsulate nonlinear solvers under the time integrators, redesign the linear solver interfaces, improve the vector API, and enhance our test suite while deploying a continuous integration system. These enhancements will allow SUNDIALS to better leverage solvers from other packages, support multirate simulations, and make better use of on-node data structure optimizations.
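For a sense of what these adaptive integrators look like from an application's point of view, the following minimal sketch drives CVODE (one of the SUNDIALS integrators) on a single decay ODE. It assumes a SUNDIALS 5.x-style API (later releases add a SUNContext argument to the constructors); the problem, tolerances, and output time are purely illustrative, and error checking is omitted.

```cpp
#include <cstdio>
#include <cvode/cvode.h>                 // CVODE integrator
#include <nvector/nvector_serial.h>      // serial N_Vector
#include <sunmatrix/sunmatrix_dense.h>   // dense SUNMatrix
#include <sunlinsol/sunlinsol_dense.h>   // dense linear solver

// Right-hand side of dy/dt = -y (simple exponential decay).
static int rhs(realtype t, N_Vector y, N_Vector ydot, void*)
{
  NV_Ith_S(ydot, 0) = -NV_Ith_S(y, 0);
  return 0;
}

int main()
{
  N_Vector y = N_VNew_Serial(1);
  NV_Ith_S(y, 0) = 1.0;                          // initial condition

  void* cvode_mem = CVodeCreate(CV_BDF);         // BDF method for stiff problems
  CVodeInit(cvode_mem, rhs, 0.0, y);
  CVodeSStolerances(cvode_mem, 1e-6, 1e-10);     // drives adaptive step control

  SUNMatrix A = SUNDenseMatrix(1, 1);
  SUNLinearSolver LS = SUNLinSol_Dense(y, A);
  CVodeSetLinearSolver(cvode_mem, LS, A);

  realtype t = 0.0;
  CVode(cvode_mem, 1.0, y, &t, CV_NORMAL);       // integrate to t = 1
  std::printf("y(%g) = %g\n", t, NV_Ith_S(y, 0));

  SUNLinSolFree(LS);
  SUNMatDestroy(A);
  N_VDestroy(y);
  CVodeFree(&cvode_mem);
  return 0;
}
```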
Principal Investigator: Dan Quinlan, LLNL
The goal of the Exascale Code Generation Toolkit is to support the automated generation of code for current and future compute architectures. The toolkit will enable performance portability across a wide range of vendor architectures and compilers. The project will deliver the source code generation tools required to automate the targeting of algorithm implementations to a wide range of architectures and vendor compilers (and tools). A collaboration with PNNL and with Ohio State and Colorado State universities, this project will help ECP applications achieve performance portability across multiple platforms with varying hardware and software support.
Principal Investigator: Becky Springmeyer, LLNL
This project centers on creating production-quality open source tools and libraries for users of next-generation exascale systems. This includes a project for hardening existing tools (ProTools) and a project for developing new tools for debugging at scale (AID). HPC tool researchers have created a wealth of ideas and prototypes, but they do not always have a path for turning them into hardened production-quality tools and maintaining them. ProTools works with both researchers and application teams to identify and harden the tool software needed at exascale. The team provides software engineering (such as porting, testing, and feature development), develops standardized tool interfaces, and assists application teams with tool integration. The AID team develops new tools and expertise for debugging and code correctness at exascale, with a focus on race condition detection and reproducibility. The prioritization of these software efforts relies on input from exascale application teams in the ASC program. Major efforts include:
• Caliper—Development and productization of performance analysis infrastructure (see the annotation sketch following this list).
• SCR—Supporting researchers and code teams adopting the Scalable Checkpoint/Restart tool.
• SPOT—Supporting application teams with performance history tracking.
• OMPD—Standardization of a debugging tools interface for OpenMP.
• AID—Development of a next-generation debugging and correctness tool suite.
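As a brief illustration of the source-code annotations Caliper builds on, the sketch below instruments a hypothetical solver loop; the region names and the solve routine are invented for the example, and only the CALI_* macros are Caliper's.

```cpp
#include <caliper/cali.h>   // Caliper annotation macros

// A hypothetical solver loop instrumented with Caliper regions.
void solve(int nsteps)
{
  CALI_CXX_MARK_FUNCTION;              // time the whole function

  for (int step = 0; step < nsteps; ++step) {
    CALI_MARK_BEGIN("assembly");
    // ... assemble the system ...
    CALI_MARK_END("assembly");

    CALI_MARK_BEGIN("linear_solve");
    // ... solve the system ...
    CALI_MARK_END("linear_solve");
  }
}
```

In recent Caliper releases, a performance report over such regions can typically be requested at run time through Caliper's configuration mechanisms, without recompiling the application.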
Principal Investigator: Becky Springmeyer, LLNL
This project focuses on the MFEM (Modular Finite Element Methods) library, which provides high-performance finite element (FE) discretizations to high-order ATDM applications in the ECP. A main component of this effort is the development of ATDM-specific physics enhancements in MFEM's finite element algorithms and in the MFEM-based BLAST Arbitrary Lagrangian-Eulerian (ALE) radiation-hydrodynamics code, providing efficient discretization components for Lawrence Livermore National Laboratory's ATDM efforts, including the MARBL application (ECP's LLNLApp).
A second main task in the project is the development of unique unstructured adaptive mesh refinement (AMR) algorithms in MFEM that focus on generality, parallel scalability, and ease of integration into unstructured mesh applications. The new AMR capabilities can benefit a variety of ECP applications that use unstructured meshes (complementing the work in AMReX), as well as many other applications in industry and the SciDAC program.
Another aspect of the work is the preparation of the MFEM finite element library and related codes for exascale platforms by using mathematical algorithms and software implementations that exploit increasing on-node concurrency, targeting multiple complex architectures (e.g., GPUs). This part of the project is synergistic with and leverages efforts from the CEED co-design center.
Principal Investigator: Kathryn Mohror, LLNL
This project covers two main thrusts in programming model standards and runtimes for exascale supercomputing systems. The first thrust is programming model standards work in MPI and OpenMP. For MPI, the focus is on interfaces for supporting tools, including MPI_T and MPIR, and for fault tolerance, including Reinit. In OpenMP, the focus is on the tools interfaces OMPT and OMPD. We will also monitor and participate in developments across all parts of both standards beyond these focus areas. Additionally, we will participate in community programming model efforts to ensure that they support ASC needs.
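As a small illustration of the tools-interface work described above, the sketch below uses the MPI_T control-variable routines from the MPI standard to list the tuning knobs an MPI implementation exposes; the buffer sizes are arbitrary, and error checking is omitted for brevity.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  // The MPI tools information interface has its own init/finalize pair.
  int provided = 0;
  MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

  int num_cvars = 0;
  MPI_T_cvar_get_num(&num_cvars);   // control variables exposed by this MPI
  std::printf("MPI implementation exposes %d control variables\n", num_cvars);

  for (int i = 0; i < num_cvars; ++i) {
    char name[256], desc[256];
    int name_len = (int)sizeof(name), desc_len = (int)sizeof(desc);
    int verbosity, bind, scope;
    MPI_Datatype datatype;
    MPI_T_enum enumtype;
    MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &datatype,
                        &enumtype, desc, &desc_len, &bind, &scope);
    std::printf("  cvar %d: %s\n", i, name);
  }

  MPI_T_finalize();
  MPI_Finalize();
  return 0;
}
```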
The other main thrust area for Lawrence Livermore National Laboratory (LLNL) is the ROSE project’s work in support of ATDM exascale application efforts. We will develop advanced program transformation and analysis to improve correctness and performance of RAJA codes. The research results will be communicated to RAJA maintainers to improve the RAJA portable programming layer. Both of these LLNL efforts in this area are crucial for ECP and ASC applications to achieve portable performance on upcoming exascale systems.
Principal Investigator: Becky Springmeyer, LLNL
Providing a full-featured, integrated, and maintainable exascale software stack is essential to support the ASC Program’s computational mission. Lawrence Livermore National Laboratory (LLNL) is achieving this goal through a broad range of integrated efforts that span the software ecosystem, including internal developments, vendor support, collaborations with other laboratories through ECP, and targeted external contracts. Integrating these efforts provides coordination as well as a clear path from R&D to delivery and deployment. Next Generation Computing Enablement (NGCE) addresses the needs of HPC developers on LLNL’s current and future computing systems on the path to exascale. NGCE project areas include system-level software, power-aware computing, and resource management.
Principal Investigator: Tzanio Kolev, LLNL
Livermore's open-source MFEM library enables application scientists to quickly prototype parallel physics application codes based on partial differential equations (PDEs) discretized with high-order finite elements. The finite element method is a powerful discretization technique that can utilize general unstructured grids to approximate the solutions of many PDEs. High-order finite elements in particular are ideally suited to take advantage of the changing computational landscape, because their order can be used to adjust the algorithm for different hardware or to fine-tune the performance by increasing the flops/bytes ratio.
For more information, see the MFEM website on GitHub: mfem.org.
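For a flavor of how applications use the library, the condensed sketch below, modeled on MFEM's introductory Laplace example, assembles and solves a simple Poisson problem with order-3 H1 elements. The mesh size, polynomial order, and solver settings are illustrative, and exact signatures can differ slightly between MFEM releases.

```cpp
#include "mfem.hpp"
using namespace mfem;

int main()
{
  // Small structured quad mesh and an order-3 H1 space.
  Mesh mesh = Mesh::MakeCartesian2D(16, 16, Element::QUADRILATERAL);
  H1_FECollection fec(3, mesh.Dimension());
  FiniteElementSpace fespace(&mesh, &fec);

  // Homogeneous Dirichlet conditions on all boundary attributes.
  Array<int> ess_tdof_list, ess_bdr(mesh.bdr_attributes.Max());
  ess_bdr = 1;
  fespace.GetEssentialTrueDofs(ess_bdr, ess_tdof_list);

  // -Delta u = 1: diffusion bilinear form and constant load.
  ConstantCoefficient one(1.0);
  LinearForm b(&fespace);
  b.AddDomainIntegrator(new DomainLFIntegrator(one));
  b.Assemble();

  BilinearForm a(&fespace);
  a.AddDomainIntegrator(new DiffusionIntegrator(one));
  a.Assemble();

  GridFunction x(&fespace);
  x = 0.0;

  OperatorPtr A;
  Vector B, X;
  a.FormLinearSystem(ess_tdof_list, x, b, A, X, B);

  // Conjugate gradients with a Gauss-Seidel smoother as preconditioner.
  GSSmoother M((SparseMatrix&)(*A));
  PCG(*A, M, B, X, 1, 400, 1e-12, 0.0);

  a.RecoverFEMSolution(X, b, x);
  x.Save("sol.gf");   // solution in MFEM's GridFunction format
  return 0;
}
```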
Principal Investigator: Tapasya Patki, LLNL
The Power Steering project seeks to provide a job-level power management system that can optimize performance under power and/or energy constraints. The management system will be configurable and cross-platform and will be mostly transparent to applications. The project will integrate existing research efforts, including Conductor and Adagio, into the GEO runtime system, an open source effort led by Intel. This integration will enhance GEO and expand its capabilities while providing a stable, industry-supported open source solution. The project will further extend GEO with new platform plugins, making its capabilities available on upcoming ECP target platforms beyond Intel systems.
Principal Investigator: Richard Hornung, LLNL
HPC systems along the path to exascale require applications to expose fine-grained, on-node parallelism in various forms and manage data for different memory spaces in complex heterogeneous memory systems. This presents daunting challenges, especially for large, multi-physics applications that couple multiple diverse physics packages employing different mesh structures and many libraries (solvers, equation of state, material models, I/O, etc.). To achieve high performance and make effective use of these platforms, application developers need robust, productivity-enhancing software infrastructure that helps them map data and parallelism to hardware resources and share data between packages and other tools.
This project focuses on three software components under development in the ATDM and ASC programs at LLNL: RAJA, CHAI, and Sidre.
- RAJA is an execution abstraction layer that encapsulates tasks, parallelism, and data placement for loop-based algorithms in C++ applications.
- CHAI is a runtime layer that transparently moves data between memory spaces as needed for RAJA execution contexts.
- Sidre is an in-memory data repository that supports mesh-aware, application-level semantics to describe, allocate, store, and access data within and across physics packages in an integrated application code.
Each of these tools is currently used in production ASC and ATDM applications at LLNL. The ECP-funded portion of the project will be directed at making these products more robust, expanding their applicability and integration potential, and making them widely available to ECP software and application efforts.
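As a minimal illustration of the execution abstraction RAJA provides, the sketch below expresses a vector sum as a RAJA::forall over an index range. The policy alias is the only illustrative choice here; swapping it for an OpenMP or GPU policy retargets the same loop body. In a code that also uses CHAI, a chai::ManagedArray would typically stand in for the raw pointers so the data follows the loop to the memory space the policy implies.

```cpp
#include "RAJA/RAJA.hpp"
#include <vector>
#include <cstdio>

int main()
{
  const int N = 1000;
  std::vector<double> a(N), b(N, 2.0), c(N, 3.0);
  double* pa = a.data();
  const double* pb = b.data();
  const double* pc = c.data();

  // Execution policy chosen at compile time; replacing this alias with an
  // OpenMP or GPU policy retargets the loop without touching its body.
  using policy = RAJA::seq_exec;

  RAJA::forall<policy>(RAJA::RangeSegment(0, N), [=](int i) {
    pa[i] = pb[i] + pc[i];
  });

  std::printf("a[0] = %g\n", a[0]);
  return 0;
}
```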
Principal Investigator: Rob Neely, LLNL
A Software Development Kit (SDK) in the Exascale Computing Project Software Technology (ST) focus area is a community effort to:
- Define complementarity and interoperability across a collection of software capabilities in a given functionality domain.
- Create a set of community policies that govern community behavior and quality expectations.
- Grow awareness of best practices among SDK community members.
- Increase common look-and-feel across independently developed capabilities.
- Provide an intermediate build-integration-testing target to reduce the complexity of managing the ECP software stack.
The Extreme-scale Scientific Software Development Kit (xSDK) is an existing project within ECP ST that brings together the math libraries (hypre, PETSc, SuperLU, Trilinos, and more) as an SDK. The xSDK provides a tangible case study for how an SDK can be established.
The Ecosystem and Delivery SDK project will identify and establish one or more SDKs within the Ecosystem and Delivery communities, with a goal of satisfying the above objectives.
Principal Investigator: Kathryn Mohror, LLNL
UNIFYCR will be a user-level file system, highly specialized for shared-file access on HPC systems with distributed burst buffers. It will be integrated with resource managers to transparently intercept I/O calls, and it will integrate cleanly with other software, including I/O and checkpoint/restart libraries. A collaboration between LLNL and ORNL, the UNIFYCR project will enable transparent use of the distributed burst buffers of current and future HPC systems for the checkpoint/restart needs of applications that use shared files.
Principal Investigator: Peter Lindstrom, LLNL
This project will deliver high-speed compressed floating-point arrays that reduce data movement and storage in exascale simulations by up to 100 times while providing read and write random access, user-specified storage size, and bounded error. ECP activities include software hardening, addition of essential functionality, performance optimization, and application and tools integration.
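This effort is built around the zfp library's compressed-array classes, which provide exactly this kind of random-access, size-bounded storage. The sketch below shows the basic usage pattern with a fixed-rate 3D array accessed like an ordinary dense array; the header path and the 8-bits-per-value rate are illustrative choices (newer zfp releases install the header as zfp/array3.hpp).

```cpp
#include "zfparray3.h"   // zfp's compressed 3D array class
#include <cstdio>

int main()
{
  const size_t nx = 64, ny = 64, nz = 64;
  const double rate = 8.0;   // compressed storage budget: 8 bits per value

  // Fixed-rate compressed array: behaves like a dense array, stores roughly
  // 8x less than IEEE double while supporting random read/write access.
  zfp::array3d field(nx, ny, nz, rate);

  for (size_t k = 0; k < nz; ++k)
    for (size_t j = 0; j < ny; ++j)
      for (size_t i = 0; i < nx; ++i)
        field(i, j, k) = 1.0 * i + 2.0 * j + 3.0 * k;

  std::printf("field(1,2,3) = %g\n", double(field(1, 2, 3)));
  return 0;
}
```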
Partner Led
Principal Investigator: Pete Beckman, ANL; Maya Gokhale, LLNL Lead
The goal of Argo is to improve or augment existing OS/R components for use in production HPC systems, providing portable, open source software that improves performance and scalability and provides increased functionality to exascale applications. In this project, we will develop capabilities that address:
- enclaves: application-facing, recursive, dynamic collections of nodes;
- power: global-, enclave-, and node-level tracking and management;
- containers: performance isolation by partitioning node resources; and
- hierarchical memory: a DRAM cache for nonvolatile storage (DI-MMAP) and a software-managed scratchpad for the DRAM hierarchy (DeepRAM).
Principal Investigator: James Ahrens, LANL; Eric Brugger, LLNL Lead
The ECP ALPINE project, a collaboration between Los Alamos, Lawrence Livermore, and Lawrence Berkeley Laboratories, the University of Oregon, and Kitware, will develop algorithms and infrastructure for in situ visualization and analysis. We will leverage the ParaView and VisIt projects to deliver visualization and analysis algorithms suitable for exascale and in situ processing and to deliver in situ infrastructure that is exascale-capable and can be used for deployment of existing applications, libraries, and tools. We will engage with ECP Applications to integrate our algorithms and infrastructure into their software and engage with ECP Software Technologies to integrate their exascale software into our infrastructure.
Principal Investigator: Mike Heroux, SNL; Todd Gamblin, LLNL
The IDEAS-ECP project aims to improve developer productivity (product quality, development time, and staffing resources) and software sustainability (reducing the cost of maintaining, sustaining, and evolving software capabilities in the future) for ECP application code teams.
To accomplish these goals, the project focuses on 1) engaging ECP application teams to understand their needs and to help them achieve better productivity; 2) engaging with the broader software community to adapt their approaches to productivity and sustainability for the Computational Science and Engineering (CSE) community; 3) developing and disseminating better scientific software practices through web and training outreach; 4) identifying and promoting leaders in the community who can promote productivity and sustainability through an IDEAS-ECP fellowship or similar program; and 5) engaging with leadership computing facilities on training, delivery, and software development challenges.
Principal Investigator: David Bernholdt, ORNL; Martin Schulz, LLNL Lead
A collaboration between Oak Ridge, Los Alamos, Lawrence Livermore, and Sandia National Laboratories and the University of Tennessee, this project focuses on preparing the MPI standard and its implementation in Open MPI for exascale through improvements in scalability, capability, and resilience. Our work will address a broad spectrum of issues in both the standard and the implementation: (1) runtime interoperability for MPI+X and beyond, (2) extending the MPI standard to better support coming exascale architectures, (3) improvements to Open MPI scalability and performance, (4) support for more dynamic execution environments, (5) resilience in MPI and Open MPI, (6) MPI tools interfaces, and (7) quality assurance.
Principal Investigator: Mike Lang, LANL; Maya Gokhale, LLNL Lead
The goal of SICM, a collaboration between Los Alamos, Lawrence Livermore, Oak Ridge, and Sandia National Laboratories and the Georgia Institute of Technology, is to deliver a simple, unified interface to the complex memory hierarchies emerging on exascale nodes, reducing the non-portable code required to leverage new memory technologies in both runtimes and applications. This capability will be implemented as a two-level interface: a low-level interface focused on supporting runtimes and advanced users, followed by a high-level, application-focused interface. The low-level interface will be developed and deployed quickly; the high-level interface is more complex and will require a longer requirements-gathering and development cycle.
Principal Investigator: Barbara Chapman, BNL; Bronis de Supinski, LLNL Lead
The SOLLVE project will develop enhancements to OpenMP that meet critical needs of ECP applications. A high-quality implementation of the enhancements will be created using LLVM and lightweight threading. Other implementations will be encouraged. SOLLVE will enable applications to use directives to create productive, performant, and performance-portable intra-node code. It will ensure that OpenMP evolves to support the needs of exascale applications and will facilitate OpenMP-MPI interoperability. Further, SOLLVE will produce an enhanced LLVM compiler and will contribute to an ECP solution for lightweight threading. This project is a collaboration between Brookhaven, Lawrence Livermore, Argonne, and Oak Ridge National Laboratories, Rice University, and the University of Illinois Urbana-Champaign.
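As a small sketch of the directive-based, performance-portable intra-node code SOLLVE targets, the example below offloads a simple axpy loop with OpenMP target directives; the problem size is arbitrary, and without a device (or compiler offload support) the loop simply runs on the host.

```cpp
#include <cstdio>
#include <vector>

int main()
{
  const int n = 1 << 20;
  std::vector<double> x(n, 1.0), y(n, 2.0);
  double* px = x.data();
  double* py = y.data();
  const double a = 0.5;

  // Offload the loop to an accelerator if one is available; the map clauses
  // describe which data moves to and from the device.
  #pragma omp target teams distribute parallel for map(to: px[0:n]) map(tofrom: py[0:n])
  for (int i = 0; i < n; ++i) {
    py[i] += a * px[i];
  }

  std::printf("y[0] = %g\n", y[0]);
  return 0;
}
```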
Principal Investigator: Franck Cappello, ANL; Kathryn Mohror, LLNL Lead
The VeloC framework will transparently provide optimized checkpoint/restart performance for applications and workflows through a simple interface. It will increase programmer productivity by dramatically reducing the difficulty of handling varied and complex storage architectures. VeloC will cover a large spectrum of DOE applications that use application-specific code or I/O libraries for checkpoint/restart by refactoring the FTI and SCR multilevel checkpoint libraries to provide asynchronous, nonblocking checkpoint movement within the storage hierarchy; combined data structure and file interfaces; support for applications that use I/O libraries (HDF5, PnetCDF, ADIOS); support for vendor services that perform file movement between levels of the storage hierarchy; and optimized support for the DOE CORAL systems and future exascale systems.
Principal Investigator: Lois Curfman McInnes, ANL; Ulrike Yang, LLNL Lead
The Extreme-scale Scientific Software Development Kit for the ECP (xSDK4ECP) is a collaboration between Argonne, Lawrence Berkeley, Lawrence Livermore, and Sandia National Laboratories, the University of Tennessee, and the University of California, Berkeley, organized to develop compliance standards and interoperability layers among the numerical libraries hypre, PETSc, SUNDIALS, SuperLU, and Trilinos and the dense linear algebra packages MAGMA, PLASMA, DPLASMA, ScaLAPACK, and LAPACK; to promote sustainability strategy development; and to ensure ECP user engagement. Within the ECP, xSDK4ECP will enable the seamless combined use of diverse, independently developed software packages as needed by ECP applications, including coordinated use of on-node resources; integrated execution (inversion of control and adaptive execution strategies); and coordinated and sustainable documentation, testing, packaging, and deployment.