Parallel Breadth First Search on GPU Clusters|
SCI Technical Report, Z. Fu, H.K. Dasari, M. Berzins, B. Thompson. No. UUSCI-2014-002, SCI Institute, University of Utah, 2014.
Fast, scalable, low-cost, and low-power execution of parallel graph algorithms is important for a wide variety of commercial and public sector applications. Breadth First Search (BFS) imposes an extreme burden on memory bandwidth and network communications and has been proposed as a benchmark that may be used to evaluate current and future parallel computers. Hardware trends and manufacturing limits strongly imply that many core devices, such as NVIDIA® GPUs and the Intel® Xeon Phi®, will become central components of such future systems. GPUs are well known to deliver the highest FLOPS/watt and enjoy a very significant memory bandwidth advantage over CPU architectures. Recent work has demonstrated that GPUs can deliver high performance for parallel graph algorithms and, further, that it is possible to encapsulate that capability in a manner that hides the low level details of the GPU architecture and the CUDA language but preserves the high throughput of the GPU. We extend previous research on GPUs and on scalable graph processing on super-computers and demonstrate that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.
Keywords: GPU cluster, MPI, BFS, graph, parallel graph algorithm
Fast Multi-Resolution Reads of Massive Simulation Datasets|
S. Kumar, C. Christensen, P.-T. Bremer, E. Brugger, V. Pascucci, J. Schmidt, M. Berzins, H. Kolla, J. Chen, V. Vishwanath, P. Carns, R. Grout. In Proceedings of the International Supercomputing Conference ISC'14, Leipzig, Germany, June, 2014.
Today's massively parallel simulation code can produce output ranging up to many terabytes of data. Utilizing this data to support scientific inquiry requires analysis and visualization, yet the sheer size of the data makes it cumbersome or impossible to read without computational resources similar to the original simulation. We identify two broad classes of problems for reading data and present effective solutions for both. The first class of data reads depends on user requirements and available resources. Tasks such as visualization and user-guided analysis may be accomplished using only a subset of variables with restricted spatial extents at a reduced resolution. The other class of reads require full resolution multi-variate data to be loaded, for example to restart a simulation. We show that utilizing the hierarchical multi-resolution IDX data format enables scalable and efficient serial and parallel read access on a variety of hardware from supercomputers down to portable devices. We demonstrate interactive view-dependent visualization and analysis of massive scientific datasets using low-power commodity hardware, and we compare read performance with other parallel file formats for both full and partial resolution data.
Scalable large-scale fluid-structure interaction solvers in the Uintah framework via hybrid task-based parallelism algorithms|
Q. Meng, M. Berzins. In Concurrency and Computation: Practice and Experience, Vol. 26, No. 7, pp. 1388--1407. May, 2014.
Uintah is a software framework that provides an environment for solving fluid–structure interaction problems on structured adaptive grids for large-scale science and engineering problems involving the solution of partial differential equations. Uintah uses a combination of fluid flow solvers and particle-based methods for solids, together with adaptive meshing and a novel asynchronous task-based approach with fully automated load balancing. When applying Uintah to fluid–structure interaction problems, the combination of adaptive mesh- ing and the movement of structures through space present a formidable challenge in terms of achieving scalability on large-scale parallel computers. The Uintah approach to the growth of the number of core counts per socket together with the prospect of less memory per core is to adopt a model that uses MPI to communicate between nodes and a shared memory model on-node so as to achieve scalability on large-scale systems. For this approach to be successful, it is necessary to design data structures that large numbers of cores can simultaneously access without contention. This scalability challenge is addressed here for Uintah, by the development of new hybrid runtime and scheduling algorithms combined with novel lock-free data structures, making it possible for Uintah to achieve excellent scalability for a challenging fluid–structure problem with mesh refinement on as many as 260K cores.
Keywords: MPI, threads, Uintah, many core, lock free, fluid-structure interaction, c-safe
An Alternative Formulation of Lyapunov Exponents for Computing Lagrangian Coherent Structures|
A.R. Sanderson. In Proceedings of the 2014 IEEE Pacific Visualization Symposium (PacificVis), Yokahama Japan, 2014.
Lagrangian coherent structures are time-evolving surfaces that highlight areas in flow fields where neighboring advected particles diverge or converge. The detection and understanding of such structures is an important part of many applications such as in oceanography where there is a need to predict the dispersion of oil and other materials in the ocean. One of the most widely used tools for revealing Lagrangian coherent structures has been to calculate the finite-time Lyapunov exponents, whose maximal values appear as ridgelines to reveal Lagrangian coherent structures. In this paper we explore an alternative formulation of Lyapunov exponents for computing Lagrangian coherent structures.
Geometric constraints on quadratic Bézier curves using minimal length and energy|
Y. Joon Ahn, C. Hoffmann, P. Rosen. In Journal of Computational and Applied Mathematics, Vol. 255, pp. 887--897. 2014.
This paper derives expressions for the arc length and the bending energy of quadratic Bézier curves. The formulas are in terms of the control point coordinates. For fixed start and end points of the Bézier curve, the locus of the middle control point is analyzed for curves of fixed arc length or bending energy. In the case of arc length this locus is convex. For bending energy it is not. Given a line or a circle and fixed end points, the locus of the middle control point is determined for those curves that are tangent to a given line or circle. For line tangency, this locus is a parallel line. In the case of the circle, the locus can be classified into one of six major types. In some of these cases, the locus contains circular arcs. These results are then used to implement fast algorithms that construct quadratic Bézier curves tangent to a given line or circle, with given end points, that minimize bending energy or arc length.
2D Vector Field Simplification Based on Robustness|
P. Skraba, Bei Wang, G. Chen, P. Rosen. In Proceedings of the 2014 IEEE Pacific Visualization Symposium, PacificVis, Note: Awarded Best Paper!, 2014.
Vector field simplification aims to reduce the complexity of the flow by removing features in order of their relevance and importance, to reveal prominent behavior and obtain a compact representation for interpretation. Most existing simplification techniques based on the topological skeleton successively remove pairs of critical points connected by separatrices, using distance or area-based relevance measures. These methods rely on the stable extraction of the topological skeleton, which can be difficult due to instability in numerical integration, especially when processing highly rotational flows. These geometric metrics do not consider the flow magnitude, an important physical property of the flow. In this paper, we propose a novel simplification scheme derived from the recently introduced topological notion of robustness, which provides a complementary view on flow structure compared to the traditional topological-skeleton-based approaches. Robustness enables the pruning of sets of critical points according to a quantitative measure of their stability, that is, the minimum amount of vector field perturbation required to remove them. This leads to a hierarchical simplification scheme that encodes flow magnitude in its perturbation metric. Our novel simplification algorithm is based on degree theory, has fewer boundary restrictions, and so can handle more general cases. Finally, we provide an implementation under the piecewise-linear setting and apply it to both synthetic and real-world datasets.
Keywords: vector field, topology-based techniques, flow visualization
|International Journal for Uncertainty Quantification,
Subtitled Special Issue on Working with Uncertainty: Representation, Quantification, Propagation, Visualization, and Communication of Uncertainty, C.R. Johnson, A. Pang (Eds.). In Int. J. Uncertainty Quantification, Vol. 3, No. 3, Begell House, Inc., 2013.
|International Journal for Uncertainty Quantification,
Subtitled Special Issue on Working with Uncertainty: Representation, Quantification, Propagation, Visualization, and Communication of Uncertainty, C.R. Johnson, A. Pang (Eds.). In Int. J. Uncertainty Quantification, Vol. 3, No. 2, Begell House, Inc., pp. vii--viii. 2013.
A Fast Iterative Method for Solving the Eikonal Equation on Tetrahedral Domains|
Z. Fu, R.M. Kirby, R.T. Whitaker. In SIAM Journal on Scientific Computing, Vol. 35, No. 5, pp. C473--C494. 2013.
Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous 2D strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers.
Scalable Visualization and Interactive Analysis Using Massive Data Streams|
V. Pascucci, P.-T. Bremer, A. Gyulassy, G. Scorzelli, C. Christensen, B. Summa, S. Kumar. In Cloud Computing and Big Data, Advances in Parallel Computing, Vol. 23, IOS Press, pp. 212--230. 2013.
Historically, data creation and storage has always outpaced the infrastructure for its movement and utilization. This trend is increasing now more than ever, with the ever growing size of scientific simulations, increased resolution of sensors, and large mosaic images. Effective exploration of massive scientific models demands the combination of data management, analysis, and visualization techniques, working together in an interactive setting. The ViSUS application framework has been designed as an environment that allows the interactive exploration and analysis of massive scientific models in a cache-oblivious, hardware-agnostic manner, enabling processing and visualization of possibly geographically distributed data using many kinds of devices and platforms.
For general purpose feature segmentation and exploration we discuss a new paradigm based on topological analysis. This approach enables the extraction of summaries of features present in the data through abstract models that are orders of magnitude smaller than the raw data, providing enough information to support general queries and perform a wide range of analyses without access to the original data.
Keywords: Visualization, data analysis, topological data analysis, Parallel I/O
The influence of an applied heat flux on the violence of reaction of an explosive device|
M. Hall, J.C. Beckvermit, C.A. Wight, T. Harman, M. Berzins. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, San Diego, California, XSEDE '13, pp. 11:1--11:8. 2013.
It is well known that the violence of slow cook-off explosions can greatly exceed the comparatively mild case burst events typically observed for rapid heating. However, there have been few studies that examine the reaction violence as a function of applied heat flux that explore the dependence on heating geometry and device size. Here we report progress on a study using the Uintah Computation Framework, a high-performance computer model capable of modeling deflagration, material damage, deflagration to detonation transition and detonation for PBX9501 and similar explosives. Our results suggests the existence of a sharp threshold for increased reaction violence with decreasing heat flux. The critical heat flux was seen to increase with increasing device size and decrease with the heating of multiple surfaces, suggesting that the temperature gradient in the heated energetic material plays an important role the violence of reactions.
Keywords: DDT, cook-off, deflagration, detonation, violence of reaction, c-safe
Characterization and modeling of PIDX parallel I/O for performance optimization|
S. Kumar, A. Saha, V. Vishwanath, P. Carns, J.A. Schmidt, G. Scorzelli, H. Kolla, R. Grout, R. Latham, R. Ross, M.E. Papka, J. Chen, V. Pascucci. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 67. 2013.
Parallel I/O library performance can vary greatly in response to user-tunable parameter values such as aggregator count, file count, and aggregation strategy. Unfortunately, manual selection of these values is time consuming and dependent on characteristics of the target machine, the underlying file system, and the dataset itself. Some characteristics, such as the amount of memory per core, can also impose hard constraints on the range of viable parameter values. In this work we address these problems by using machine learning techniques to model the performance of the PIDX parallel I/O library and select appropriate tunable parameter values. We characterize both the network and I/O phases of PIDX on a Cray XE6 as well as an IBM Blue Gene/P system. We use the results of this study to develop a machine learning model for parameter space exploration and performance prediction.
Keywords: I/O, Network Characterization, Performance Modeling
Large Scale Parallel Solution of Incompressible Flow Problems using Uintah and hypre|
J. Schmidt, M. Berzins, J. Thornock, T. Saad, J. Sutherland. In 2013 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 458--465. 2013.
The Uintah Software framework was developed to provide an environment for solving fluid-structure interaction problems on structured adaptive grids on large-scale, longrunning, data-intensive problems. Uintah uses a combination of fluid-flow solvers and particle-based methods for solids together with a novel asynchronous task-based approach with fully automated load balancing. As Uintah is often used to solve incompressible flow problems in combustion applications it is important to have a scalable linear solver. While there are many such solvers available, the scalability of those codes varies greatly. The hypre software offers a range of solvers and preconditioners for different types of grids. The weak scalability of Uintah and hypre is addressed for particular examples of both packages when applied to a number of incompressible flow problems. After careful software engineering to reduce startup costs, much better than expected weak scalability is seen for up to 100K cores on NSFs Kraken architecture and up to 260K cpu cores, on DOEs new Titan machine. The scalability is found to depend in a crtitical way on the choice of algorithm used by hypre for a realistic application problem.
Keywords: Uintah, hypre, parallelism, scalability, linear equations
Uncertainty Visualization in Forward and Inverse Cardiac Models|
B. Burton, B. Erem, K. Potter, P. Rosen, C.R. Johnson, D. Brooks, R.S. Macleod. In Computing in Cardiology CinC, pp. 57--60. 2013.
Quantification and visualization of uncertainty in cardiac forward and inverse problems with complex geometries is subject to various challenges. Specific to visualization is the observation that occlusion and clutter obscure important regions of interest, making visual assessment difficult. In order to overcome these limitations in uncertainty visualization, we have developed and implemented a collection of novel approaches. To highlight the utility of these techniques, we evaluated the uncertainty associated with two examples of modeling myocardial activity. In one case we studied cardiac potentials during the repolarization phase as a function of variability in tissue conductivities of the ischemic heart (forward case). In a second case, we evaluated uncertainty in reconstructed activation times on the epicardium resulting from variation in the control parameter of Tikhonov regularization (inverse case). To overcome difficulties associated with uncertainty visualization, we implemented linked-view windows and interactive animation to the two respective cases. Through dimensionality reduction and superimposed mean and standard deviation measures over time, we were able to display key features in large ensembles of data and highlight regions of interest where larger uncertainties exist.
Specimen-specific predictions of contact stress under physiological loading in the human hip: validation and sensitivity studies|
C.R. Henak, A.K. Kapron, B.J. Ellis, S.A. Maas, A.E. Anderson, J.A. Weiss. In Biomechanics and Modeling in Mechanobiology, pp. 1-14. 2013.
Hip osteoarthritis may be initiated and advanced by abnormal cartilage contact mechanics, and finite element (FE) modeling provides an approach with the potential to allow the study of this process. Previous FE models of the human hip have been limited by single specimen validation and the use of quasi-linear or linear elastic constitutive models of articular cartilage. The effects of the latter assumptions on model predictions are unknown, partially because data for the instantaneous behavior of healthy human hip cartilage are unavailable. The aims of this study were to develop and validate a series of specimen-specific FE models, to characterize the regional instantaneous response of healthy human hip cartilage in compression, and to assess the effects of material nonlinearity, inhomogeneity and specimen-specific material coefficients on FE predictions of cartilage contact stress and contact area. Five cadaveric specimens underwent experimental loading, cartilage material characterization and specimen-specific FE modeling. Cartilage in the FE models was represented by average neo-Hookean, average Veronda Westmann and specimen- and region-specific Veronda Westmann hyperelastic constitutive models. Experimental measurements and FE predictions compared well for all three cartilage representations, which was reflected in average RMS errors in contact stress of less than 25 %. The instantaneous material behavior of healthy human hip cartilage varied spatially, with stiffer acetabular cartilage than femoral cartilage and stiffer cartilage in lateral regions than in medial regions. The Veronda Westmann constitutive model with average material coefficients accurately predicted peak contact stress, average contact stress, contact area and contact patterns. The use of subject- and region-specific material coefficients did not increase the accuracy of FE model predictions. The neo-Hookean constitutive model underpredicted peak contact stress in areas of high stress. The results of this study support the use of average cartilage material coefficients in predictions of cartilage contact stress and contact area in the normal hip. The regional characterization of cartilage material behavior provides the necessary inputs for future computational studies, to investigate other mechanical parameters that may be correlated with OA and cartilage damage in the human hip. In the future, the results of this study can be applied to subject-specific models to better understand how abnormal hip contact stress and contact area contribute to OA.
Relationship of the intercondylar roof and the tibial footprint of the ACL: implications for ACL reconstruction|
P.T. Scheffel, H.B. Henninger, R.T. Burks. In American Journal of Sports Medicine, Vol. 41, No. 2, pp. 396--401. 2013.
Background: Debate exists on the proper relation of the anterior cruciate ligament (ACL) footprint with the intercondylar notch in anatomic ACL reconstructions. Patient-specific graft placement based on the inclination of the intercondylar roof has been proposed. The relationship between the intercondylar roof and native ACL footprint on the tibia has not previously been quantified.
Hypothesis: No statistical relationship exists between the intercondylar roof angle and the location of the native footprint of the ACL on the tibia.
Study Design: Case series; Level of evidence, 4.
Methods: Knees from 138 patients with both lateral radiographs and MRI, without a history of ligamentous injury or fracture, were reviewed to measure the intercondylar roof angle of the femur. Roof angles were measured on lateral radiographs. The MRI data of the same knees were analyzed to measure the position of the central tibial footprint of the ACL (cACL). The roof angle and tibial footprint were evaluated to determine if statistical relationships existed.
Results: Patients had a mean ± SD age of 40 ± 16 years. Average roof angle was 34.7° ± 5.2° (range, 23°-48°; 95% CI, 33.9°-35.5°), and it differed by sex but not by side (right/left). The cACL was 44.1% ± 3.4% (range, 36.1%-51.9%; 95% CI, 43.2%-45.0%) of the anteroposterior length of the tibia. There was only a weak correlation between the intercondylar roof angle and the cACL (R = 0.106). No significant differences arose between subpopulations of sex or side.
Conclusion: The tibial footprint of the ACL is located in a position on the tibia that is consistent and does not vary according to intercondylar roof angle. The cACL is consistently located between 43.2% and 45.0% of the anteroposterior length of the tibia. Intercondylar roof–based guidance may not predictably place a tibial tunnel in the native ACL footprint. Use of a generic ACL footprint to place a tibial tunnel during ACL reconstruction may be reliable in up to 95% of patients.
Evaluation of a post-processing approach for multiscale analysis of biphasic mechanics of chondrocytes|
S.C. Sibole, S.A. Maas, J.P. Halloran, J.A. Weiss, A. Erdemir. In Computer Methods in Biomechanical and Biomedical Engineering, Vol. 16, No. 10, pp. 1112--1126. 2013.
PubMed ID: 23809004
Understanding the mechanical behaviour of chondrocytes as a result of cartilage tissue mechanics has significant implications for both evaluation of mechanobiological function and to elaborate on damage mechanisms. A common procedure for prediction of chondrocyte mechanics (and of cell mechanics in general) relies on a computational post-processing approach where tissue-level deformations drive cell-level models. Potential loss of information in this numerical coupling approach may cause erroneous cellular-scale results, particularly during multiphysics analysis of cartilage. The goal of this study was to evaluate the capacity of first- and second-order data passing to predict chondrocyte mechanics by analysing cartilage deformations obtained for varying complexity of loading scenarios. A tissue-scale model with a sub-region incorporating representation of chondron size and distribution served as control. The post-processing approach first required solution of a homogeneous tissue-level model, results of which were used to drive a separate cell-level model (same characteristics as the sub-region of control model). The first-order data passing appeared to be adequate for simplified loading of the cartilage and for a subset of cell deformation metrics, for example, change in aspect ratio. The second-order data passing scheme was more accurate, particularly when asymmetric permeability of the tissue boundaries was considered. Yet, the method exhibited limitations for predictions of instantaneous metrics related to the fluid phase, for example, mass exchange rate. Nonetheless, employing higher order data exchange schemes may be necessary to understand the biphasic mechanics of cells under lifelike tissue loading states for the whole time history of the simulation.
Evaluation of Current Algorithms for Segmentation of Scar Tissue from Late Gadolinium Enhancement Cardiovascular Magnetic Resonance of the Left Atrium: An Open-Access Grand Challenge|
R. Karim, R.J. Housden, M. Balasubramaniam, Z. Chen, D. Perry, A. Uddin, Y. Al-Beyatti, E. Palkhi, P. Acheampong, S. Obom, A. Hennemuth, Y. Lu, W. Bai, W. Shi, Y. Gao, H.-O. Peitgen, P. Radau, R. Razavi, A. Tannenbaum, D. Rueckert, J. Cates, T. Schaeffter, D. Peters, R.S. MacLeod, K. Rhode. In Journal of Cardiovascular Magnetic Resonance, Vol. 15, No. 105, 2013.
Background: Late Gadolinium enhancement (LGE) cardiovascular magnetic resonance (CMR) imaging can be used to visualise regions of fibrosis and scarring in the left atrium (LA) myocardium. This can be important for treatment stratification of patients with atrial fibrillation (AF) and for assessment of treatment after radio frequency catheter ablation (RFCA). In this paper we present a standardised evaluation benchmarking framework for algorithms segmenting fibrosis and scar from LGE CMR images. The algorithms reported are the response to an open challenge that was put to the medical imaging community through an ISBI (IEEE International Symposium on Biomedical Imaging) workshop.
Methods: The image database consisted of 60 multicenter, multivendor LGE CMR image datasets from patients with AF, with 30 images taken before and 30 after RFCA for the treatment of AF. A reference standard for scar and fibrosis was established by merging manual segmentations from three observers. Furthermore, scar was also quantified using 2, 3 and 4 standard deviations (SD) and full-width-at-half-maximum (FWHM) methods. Seven institutions responded to the challenge: Imperial College (IC), Mevis Fraunhofer (MV), Sunnybrook Health Sciences (SY), Harvard/Boston University (HB), Yale School of Medicine (YL), King’s College London (KCL) and Utah CARMA (UTA, UTB). There were 8 different algorithms evaluated in this study.
Results: Some algorithms were able to perform significantly better than SD and FWHM methods in both pre- and post-ablation imaging. Segmentation in pre-ablation images was challenging and good correlation with the reference standard was found in post-ablation images. Overlap scores (out of 100) with the reference standard were as follows: Pre: IC = 37, MV = 22, SY = 17, YL = 48, KCL = 30, UTA = 42, UTB = 45; Post: IC = 76, MV = 85, SY = 73, HB = 76, YL = 84, KCL = 78, UTA = 78, UTB = 72.
Conclusions: The study concludes that currently no algorithm is deemed clearly better than others. There is scope for further algorithmic developments in LA fibrosis and scar quantification from LGE CMR images. Benchmarking of future scar segmentation algorithms is thus important. The proposed benchmarking framework is made available as open-source and new participants can evaluate their algorithms via a web-based interface.
Keywords: Late gadolinium enhancement, Cardiovascular magnetic resonance, Atrial fibrillation, Segmentation, Algorithm benchmarking
Investigating Applications Portability with the Uintah DAG-based Runtime System on PetaScale Supercomputers|
Q. Meng, A. Humphrey, J. Schmidt, M. Berzins. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 96:1--96:12. 2013.
Present trends in high performance computing present formidable challenges for applications code using multicore nodes possibly with accelerators and/or co-processors and reduced memory while still attaining scalability. Software frameworks that execute machine-independent applications code using a runtime system that shields users from architectural complexities offer a possible solution. The Uintah framework for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. Uintah executes directed acyclic graphs of computational tasks with a scalable asynchronous and dynamic runtime system for CPU cores and/or accelerators/co-processors on a node. Uintah's clear separation between application and runtime code has led to scalability increases of 1000x without significant changes to application code. This methodology is tested on three leading Top500 machines; OLCF Titan, TACC Stampede and ALCF Mira using three diverse and challenging applications problems. This investigation of scalability with regard to the different processors and communications performance leads to the overall conclusion that the adaptive DAG-based approach provides a very powerful abstraction for solving challenging multi-scale multi-physics engineering problems on some of the largest and most powerful computers available today.
Keywords: Blue Gene/Q, GPU, Xeon Phi, adaptive, application, co-processor, heterogeneous systems, hybrid parallelism, parallel, scalability, software, uintah, NETL
Preliminary Experiences with the Uintah Framework on Intel Xeon Phi and Stampede|
Q. Meng, A. Humphrey, J. Schmidt, M. Berzins. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery (XSEDE 2013), San Diego, California, pp. 48:1--48:8. 2013.
In this work, we describe our preliminary experiences on the Stampede system in the context of the Uintah Computational Framework. Uintah was developed to provide an environment for solving a broad class of fluid-structure interaction problems on structured adaptive grids. Uintah uses a combination of fluid-flow solvers and particle-based methods, together with a novel asynchronous task-based approach and fully automated load balancing. While we have designed scalable Uintah runtime systems for large CPU core counts, the emergence of heterogeneous systems presents considerable challenges in terms of effectively utilizing additional on-node accelerators and co-processors, deep memory hierarchies, as well as managing multiple levels of parallelism. Our recent work has addressed the emergence of heterogeneous CPU/GPU systems with the design of a Unified heterogeneous runtime system, enabling Uintah to fully exploit these architectures with support for asynchronous, out-of-order scheduling of both CPU and GPU computational tasks. Using this design, Uintah has run at full scale on the Keeneland System and TitanDev. With the release of the Intel Xeon Phi co-processor and the recent availability of the Stampede system, we show that Uintah may be modified to utilize such a co-processor based system. We also explore the different usage models provided by the Xeon Phi with the aim of understanding portability of a general purpose framework like Uintah to this architecture. These usage models range from the pragma based offload model to the more complex symmetric model, utilizing all co-processor and host CPU cores simultaneously. We provide preliminary results of the various usage models for a challenging adaptive mesh refinement problem, as well as a detailed account of our experience adapting Uintah to run on the Stampede system. Our conclusion is that while the Stampede system is easy to use, obtaining high performance from the Xeon Phi co-processors requires a substantial but different investment to that needed for GPU-based systems.
Keywords: MIC, Xeon Phi, adaptive, co-processor, heterogeneous systems, hybrid parallelism, parallel, scalability, stampede, uintah, c-safe