M. Berzins. Symplectic Time Integration Methods for the Material Point Method, Experiments, Analysis and Order Reduction, In WCCM-ECCOMAS2020 virtual Conference, January, 2021.
The provision of appropriate time integration methods for the Material Point Method (MPM) involves considering stability, accuracy and energy conservation. A class of methods that addresses many of these issues are the widely-used symplectic time integration methods. Such methods have good conservation properties and have the potential to achieve high accuracy. In this work we build on the work in  and consider high order methods for the time integration of the Material Point Method. The results of practical experiments show that while high order methods in both space and time have good accuracy initially, unless the problem has relatively little particle movement then the accuracy of the methods for later time is closer to that of low order methods. A theoretical analysis explains these results as being similar to the stage error found in Runge Kutta methods, though in this case the stage error arises from the MPM differentiations and interpolations from particles to grid and back again, particularly in cases in which there are many grid crossings.
A. Dubey, M. Berzins, C. Burstedde, M.l L. Norman, D. Unat, M. Wahib.
Structured Adaptive Mesh Refinement Adaptations to Retain Performance Portability With Increasing Heterogeneity, In Computing in Science & Engineering, Vol. 23, No. 5, pp. 62-66. 2021.
Adaptive mesh refinement (AMR) is an important method that enables many mesh-based applications to run at effectively higher resolution within limited computing resources by allowing high resolution only where really needed. This advantage comes at a cost, however: greater complexity in the mesh management machinery and challenges with load distribution. With the current trend of increasing heterogeneity in hardware architecture, AMR presents an orthogonal axis of complexity. The usual techniques, such as asynchronous communication and hierarchy management for parallelism and memory that are necessary to obtain reasonable performance are very challenging to reason about with AMR. Different groups working with AMR are bringing different approaches to this challenge. Here, we examine the design choices of several AMR codes and also the degree to which demands placed on them by their users influence these choices.
J. K. Holmen, D. Sahasrabudhe, M. Berzins. A Heterogeneous MPI+PPL Task Scheduling Approach for Asynchronous Many-Task Runtime Systems, In Proceedings of the Practice and Experience in Advanced Research Computing 2021 on Sustainability, Success and Impact (PEARC21), ACM, 2021.
Asynchronous many-task runtime systems and MPI+X hybrid parallelism approaches have shown promise for helping manage theincreasing complexity of nodes in current and emerging high performance computing (HPC) systems, including those for exascale. Theincreasing architectural diversity, however, poses challenges for large legacy runtime systems emphasizing broad support for majorHPC systems. Performance portability layers (PPL) have shown promise for helping manage this diversity. This paper describes aheterogeneous MPI+PPL task scheduling approach for combining these promising solutions with additional consideration for parallelthird party libraries facing similar challenges to help prepare such a runtime for the diverse heterogeneous systems accompanyingexascale computing. This approach is demonstrated using a heterogeneous MPI+Kokkos task scheduler and the accompanyingportable abstractions  implemented in the Uintah Computational Framework, an asynchronous many-task runtime system, withadditional consideration for hypre, a parallel third party library. Results are shown for two challenging problems executing workloadsrepresentative of typical Uintah applications. These results show performance improvements up to 4.4x when using this schedulerand the accompanying portable abstractions  to port a previously MPI-Only problem to Kokkos::OpenMP and Kokkos::CUDA toimprove multi-socket, multi-device node use. Good strong-scaling to 1,024 NVIDIA V100 GPUs and 512 IBM POWER9 processor arealso shown using MPI+Kokkos::OpenMP+Kokkos::CUDA at scale
R. Zambre, D. Sahasrabudhe, H. Zhou, M. Berzins, A. Chandramowlishwaran, P. Balaji. Logically Parallel Communication for Fast MPI+Threads Communication, In Proceedings of the Transactions on Parallel and Distributed Computing, IEEE, April, 2021.
Supercomputing applications are increasingly adopting the MPI+threads programming model over the traditional “MPI everywhere” approach to better handle the disproportionate increase in the number of cores compared with other on-node resources. In practice, however, most applications observe a slower performance with MPI+threads primarily because of poor communication performance. Recent research efforts on MPI libraries address this bottleneck by mapping logically parallel communication, that is, operations that are not subject to MPI’s ordering constraints to the underlying network parallelism. Domain scientists, however, typically do not expose such communication independence information because the existing MPI-3.1 standard’s semantics can be limiting. Researchers had initially proposed user-visible endpoints to combat this issue, but such a solution requires intrusive changes to the standard (new APIs). The upcoming MPI-4.0 standard, on the other hand, allows applications to relax unneeded semantics and provides them with many opportunities to express logical communication parallelism. In this paper, we show how MPI+threads applications can achieve high performance with logically parallel communication. Through application case studies, we compare the capabilities of the new MPI-4.0 standard with those of the existing one and user-visible endpoints (upper bound). Logical communication parallelism can boost the overall performance of an application by over 2x.
T. A. J. Ouermi, R. M. Kirby, M. Berzins. Numerical Testing of a New Positivity-Preserving Interpolation Algorithm, Subtitled arXiv, 2020.
An important component of a number of computational modeling algorithms is an interpolation method that preserves the positivity of the function being interpolated. This report describes the numerical testing of a new positivity-preserving algorithm that is designed to be used when interpolating from a solution defined on one grid to different spatial grid. The motivating application is a numerical weather prediction (NWP) code that uses spectral elements as the discretization choice for its dynamics core and Cartesian product meshes for the evaluation of its physics routines. This combination of spectral elements, which use nonuniformly spaced quadrature/collocation points, and uniformly-spaced Cartesian meshes combined with the desire to maintain positivity when moving between these necessitates our work. This new approach is evaluated against several typical algorithms in use on a range of test problems in one or more space dimensions. The results obtained show that the new method is competitive in terms of observed accuracy while at the same time preserving the underlying positivity of the functions being interpolated.
D. Sahasrabudhe, M. Berzins.
Improving Performance of the Hypre Iterative Solver for Uintah Combustion Codes on Manycore Architectures Using MPI Endpoints and Kernel Consolidation, In Computational Science -- ICCS 2020, 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I, Springer International Publishing, pp. 175--190. 2020.
The solution of large-scale combustion problems with codes such as the Arches component of Uintah on next generation computer architectures requires the use of a many and multi-core threaded approach and/or GPUs to achieve performance. Such codes often use a low-Mach number approximation, that require the iterative solution of a large system of linear equations at every time step. While the discretization routines in such a code can be improved by the use of, say, OpenMP or Cuda Approaches, it is important that the linear solver be able to perform well too. For Uintah the Hypre iterative solver has proved to solve such systems in a scalable way. The use of Hypre with OpenMP leads to at least 2x slowdowns due to OpenMP overheads, however. This behavior is analyzed and a solution proposed by using the MPI Endpoints approach is implemented within Hypre, where each team of threads acts as a different MPI rank. This approach minimized OpenMP synchronization overhead, avoided slowdowns, performed as fast or (up to 1.5x) faster than Hypre’s MPI only version, and allowed the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre showed the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro kernels and was further optimized by using Cuda-aware MPI. The overall speedup of 1.26x to 1.44x was observed compared to the baseline GPU implementation.
The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but the use of OpenMP with Hypre leads to at least 2x slowdown due to OpenMP overheads. The proposed solution uses the MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead and performs as fast or (up to 1.44x) faster than Hypre’s MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. The profiling of the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using Cuda-aware MPI, resulting in an overall speedup of 1.16–1.44x compared to the baseline GPU implementation.
The above optimization strategies were published in the International Conference on Computational Science 2020. This work extends the previously published research by carrying out the second phase of communication-centered optimizations in Hypre to improve its scalability on large-scale supercomputers. This includes an efficient non-blocking inter-thread communication scheme, communication-reducing patch assignment, and expression of logical communication parallelism to a new version of the MPICH library that utilizes the underlying network parallelism. The above optimizations avoid communication bottlenecks previously observed during strong scaling and improve performance by up to 2x on 256 nodes of Intel Knight’s Landing processor.
Proceedings of VI International Conference on Particle-based Methods – Fundamentals and Applications, Barcelona, Edited by E. O ̃ nate, M. Bischoff, D.R.J. Owen, P. Wriggers & T. Zohdi, PARTICLES 2019, pp. 555-566. October, 2019.
The success of the Material Point Method (MPM) in solving many challenging problems nevertheless raises some open questions regarding the fundamental properties of the method such as the energy conservation since being addressed by Bardenhagen and by Love and Sulsky. Similarly while low order symplectic time integration techniques are used with MPM, higher order methods have not been used. For this reason the Stormer Verlet method, a popular and widely-used symplectic method is applied to MPM. Both the time integration error and the energy conservation properties of this method applied to MPM are considered. The method is shown to have locally third order accuracy of energy conservation in time. This is in contrast to the locally second order accuracy in energy conservation of the methods that are used in many MPM calculations. This third accuracy accuracy is demonstrated both locally and globally on a standard MPM test example.
J. K. Holmen, B. Peterson, A. Humphrey, D. Sunderland, O. H. Diaz-Ibarra, J. N. Thornock, M. Berzins. Portably Improving Uintah's Readiness for Exascale Systems Through the Use of Kokkos, SCI Institute, 2019.
Uncertainty and diversity in future HPC systems, including those for exascale, makes portable codebases desirable. To ease future ports, the Uintah Computational Framework has adopted the Kokkos C++ Performance Portability Library. This paper describes infrastructure advancements and performance improvements using partitioning functionality recently added to Kokkos within Uintah's MPI+Kokkos hybrid parallelism approach. Results are presented for two challenging calculations that have been refactored to support Kokkos::OpenMP and Kokkos::Cuda back-ends. These results demonstrate performance improvements up to (i) 2.66x when refactoring for portability, (ii) 81.59x when adding loop-level parallelism via Kokkos back-ends, and (iii) 2.63x when more eciently using a node. Good strong-scaling characteristics to 442,368 threads across 1728 Knights Landing processors are also shown. These improvements have been achieved with little added overhead (sub-millisecond, consuming up to 0.18% of per-timestep time). Kokkos adoption and refactoring lessons are also discussed.
J. K. Holmen, B. Peterson, M. Berzins. An Approach for Indirectly Adopting a Performance Portability Layer in Large Legacy Codes, In 2nd International Workshop on Performance, Portability, and Productivity in HPC (P3HPC), In conjunction with SC19, 2019.
Diversity among supported architectures in current and emerging high performance computing systems, including those for exascale, makes portable codebases desirable. Portability of a codebase can be improved using a performance portability layer to provide access to multiple underlying programming models through a single interface. Direct adoption of a performance portability layer, however, poses challenges for large pre-existing software frameworks that may need to preserve legacy code and/or adopt other programming models in the future. This paper describes an approach for indirect adoption that introduces a framework-specific portability layer between the application developer and the adopted performance portability layer to help improve legacy code support and long-term portability for future architectures and programming models. This intermediate layer uses loop-level, application-level, and build-level components to ease adoption of a performance portability layer in large legacy codebases. Results are shown for two challenging case studies using this approach to make portable use of OpenMP and CUDA via Kokkos in an asynchronous many-task runtime system, Uintah. These results show performance improvements up to 2.7x when refactoring for portability and 2.6x when more efficiently using a node. Good strong-scaling to 442,368 threads across 1,728 Knights Landing processors are also shown using MPI+Kokkos at scale.
A. Humphrey, M. Berzins.
An Evaluation of An Asynchronous Task Based Dataflow Approach For Uintah, In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Vol. 2, pp. 652-657. July, 2019.
The challenge of running complex physics code on the largest computers available has led to dataflow paradigms being explored. While such approaches are often applied at smaller scales, the challenge of extreme-scale data flow computing remains. The Uintah dataflow framework has consistently used dataflow computing at the largest scales on complex physics applications. At present Uintah contains two main dataflow models. Both are based upon asynchronous communication. One uses a static graph-based approach with asynchronous communication and the other uses a more dynamic approach that was introduced almost a decade ago. Subsequent changes within the Uintah runtime system combined with many more large scale experiments, has necessitated a reevaluation of these two approaches, comparing them in the context of large scale problems. While the static approach has worked well for some large-scale simulations, the dynamic approach is seen to offer performance improvements over the static case for a challenging fluid-structure interaction problem at large scale that involves fluid flow and a moving solid represented using particle method on an adaptive mesh.
D. Sahasrabudhe, M. Berzins, J. Schmidt.
Node failure resiliency for Uintah without checkpointing, In Concurrency and Computation: Practice and Experience, pp. e5340. 2019.
The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency coupled with I/O challenges at exascale may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-Based Fault Tolerance (ABFT) using Adaptive Mesh Refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables such as positivity or boundedness may be violated during interpolation. These challenges can be addressed by the combination of two techniques: 1. a fault-tolerant MPI implementation to recover from runtime node failures, and 2. high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "Limited Essentially Non-Oscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault-tolerant MPI - ULFM to recover from runtime failure, and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10x faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.
D. Sahasrabudhe, E. T. Phipps, S. Rajamanickam, M. Berzins. A Portable SIMD Primitive using Kokkos for Heterogeneous Architectures, In Sixth Workshop on Accelerator Programming Using Directives (WACCPD), 2019.
As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (simd) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (knl), and also facilitates Single Instruction Multiple Threads (simt) based execution on gpus. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new simd primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectroized baseline, and to evaluate the new feature called the \logical vector length" (lvl). The simd primitive provides portability across cpus and gpus without any performance degradation being observed experimentally.
Q. Tran, M. Berzins, W. Solowski. An improved moving least squares method for the Material Point Method, In Proceedings of the 2nd International Conference on the Material Point Method for Modelling Soil-Water-Structure Interaction (MPM 2019), 2019.
The paper presents an improved moving least squares reconstruction technique for the Material Point Method. The moving least squares reconstruction(MLS)can improve spatial accuracy in simulations involving large deformations. However, the MLS algorithm relies on computing the inverse of the moment matrix.This is both expensive and potentially unstable when there are not enough material points to reconstruct the high-order least squares function, which leads to a singular or an ill-conditioned matrix. The shown formulation can overcome this limitation while retain the same order of accuracy compared with the conventional moving least squares reconstruction.Numerical experiments demonstrate the improvements in the accuracy and comparison with the original Material Point Method and the Convected Particles Domain Interpolation method.
Q. A. Tran, W. Sołowski, M. Berzins, J. Guilkey. A convected particle least square interpolation material point method, In International Journal for Numerical Methods in Engineering, Wiley, October, 2019.
Applying the convected particle domain interpolation (CPDI) to the material point method has many advantages over the original material point method, including significantly improved accuracy. However, in the large deformation regime, the CPDI still may not retain the expected convergence rate. The paper proposes an enhanced CPDI formulation based on least square reconstruction technique. The convected particle least square interpolation (CPLS) material point method assumes the velocity field inside the material point domain as nonconstant. This velocity field in the material point domain is mapped to the background grid nodes with a moving least squares reconstruction. In this paper, we apply the improved moving least squares method to avoid the instability of the conventional moving least squares method due to a singular matrix. The proposed algorithm can improve convergence rate, as illustrated by numerical examples using the method of manufactured solutions.
Nonlinear stability and time step selection for the MPM method, In Computational Particle Mechanics, Jan, 2018.
The Material Point Method (MPM) has been developed from the Particle in Cell (PIC) method over the last 25 years and has proved its worth in solving many challenging problems involving large deformations. Nevertheless there are many open questions regarding the theoretical properties of MPM. For example in while Fourier methods, as applied to PIC may provide useful insight, the non-linear nature of MPM makes it necessary to use a full non-linear stability analysis to determine a stable time step for MPM. In order to begin to address this the stability analysis of Spigler and Vianello is adapted to MPM and used to derive a stable time step bound for a model problem. This bound is contrasted against traditional Speed of sound and CFL bounds and shown to be a realistic stability bound for a model problem.
The need to scale next-generation industrial engineering problems to the largest computational platforms presents unique challenges. This paper focuses on data management related problems faced by the Uintah simulation framework at a production scale of 260K processes. Uintah provides a highly scalable asynchronous many-task runtime system, which in this work is used for the modeling of a 1000 megawatt electric (MWe) ultra-supercritical (USC) coal boiler. At 260K processes, we faced both parallel I/O and visualization related challenges, e.g., the default file-per-process I/O approach of Uintah did not scale on Mira. In this paper we present a simple to implement, restructuring based parallel I/O technique. We impose a restructuring step that alters the distribution of data among processes. The goal is to distribute the dataset such that each process holds a larger chunk of data, which is then written to a file independently. This approach finds a middle ground between two of the most common parallel I/O schemes--file per process I/O and shared file I/O--in terms of both the total number of generated files, and the extent of communication involved during the data aggregation phase. To address scalability issues when visualizing the simulation data, we developed a lightweight renderer using OSPRay, which allows scientists to visualize the data interactively at high quality and make production movies. Finally, this work presents a highly efficient and scalable radiation model based on the sweeping method, which significantly outperforms previous approaches in Uintah, like discrete ordinates. The integrated approach allowed the USC boiler problem to run on 260K CPU cores on Mira.
Performance optimization in the petascale era and beyond in the exascale era has and will require modifications of legacy codes to take advantage of new architectures with large core counts and SIMD units. The Numerical Weather Prediction (NWP) physics codes considered here are optimized using thread-local structures of arrays (SOA). High-level and low-level optimization strategies are applied to the WRF Single-Moment 6-Class Microphysics Scheme (WSM6) and Global Forecast System (GFS) physics codes used in the NEPTUNE forecast code. By building on previous work optimizing WSM6 on the Intel Knights Landing (KNL), it is shown how to further optimize WMS6 and GFS physics, and GFS radiation on Intel KNL, Haswell, and potentially on future micro-architectures with many cores and SIMD vector units. The optimization techniques used herein employ thread-local structures of arrays (SOA), an OpenMP directive, OMP SIMD, and minor code transformations to enable better utilization of SIMD units, increase parallelism, improve locality, and reduce memory traffic. The optimized versions of WSM6, GFS physics, GFS radiation run 70, 27, and 23 faster (respectively) on KNL and 26, 18 and 30 faster (respectively) on Haswell than their respective original serial versions. Although this work targets WRF physics schemes, the findings are transferable to other performance optimization contexts and provide insight into the optimization of codes with complex physical models for present and near-future architectures with many core and vector units.
B. Peterson, A. Humphrey, J. Holmen T. Harman, M. Berzins, D. Sunderland, H.C. Edwards.
Demonstrating GPU Code Portability and Scalability for Radiative Heat Transfer Computations, In Journal of Computational Science, Elsevier BV, June, 2018.
High performance computing frameworks utilizing CPUs, Nvidia GPUs, and/or Intel Xeon Phis necessitate portable and scalable solutions for application developers. Nvidia GPUs in particular present numerous portability challenges with a different programming model, additional memory hierarchies, and partitioned execution units among streaming multiprocessors. This work presents modifications to the Uintah asynchronous many-task runtime and the Kokkos portability library to enable one single codebase for complex multiphysics applications to run across different architectures. Scalability and performance results are shown on multiple architectures for a globally coupled radiation heat transfer simulation, ranging from a single node to 16384 Titan compute nodes.
B. Peterson, A. Humphrey, D. Sunderland, J. Sutherland, T. Saad, H. Dasari, M. Berzins.
Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime, In International Journal of Parallel Programming, Dec, 2018.
The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive mesh refinement grids using modern supercomputers. Uintah is structured with an application layer and a separate runtime system. Uintah is based on a distributed directed acyclic graph (DAG) of computational tasks, with a task scheduler that efficiently schedules and executes these tasks on both CPU cores and on-node accelerators. The runtime system identifies task dependencies, creates a task graph prior to the execution of these tasks, automatically generates MPI message tags, and automatically performs halo transfers for simulation variables. Automating halo transfers in a heterogeneous environment poses significant challenges when tasks compute within a few milliseconds, as runtime overhead affects wall time execution, or when simulation variables require large halos spanning most or all of the computational domain, as task dependencies become expensive to process. These challenges are magnified at production scale when application developers require each compute node perform thousands of different halo transfers among thousands simulation variables. The principal contribution of this work is to (1) identify and address inefficiencies that arise when mapping tasks onto the GPU in the presence of automated halo transfers, (2) implement new schemes to reduce runtime system overhead, (3) minimize application developer involvement with the runtime, and (4) show overhead reduction results from these improvements.