Performance optimizations for scalable implicit RANS calculations with SU2
Published in: | Computers & fluids 2016-04, Vol.129, p.146-158 |
---|---|
Main authors: | , , , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | • Presents the performance optimization of an unstructured, implicit CFD application. • Single-node optimizations show a 2.6X speedup over a single rank per core baseline. • Strong scaling limit improves by 3.8X with a hybrid OpenMP+MPI model. • Linear multigrid solver increases scaling efficiency by 2X over Krylov methods. • Treatment of open-source code makes this work extensible to the larger CFD community.
In this paper, we present single- and multi-node optimizations of SU2, a widely-used, open-source Computational Fluid Dynamics application, aimed at improving performance and scalability for implicit Reynolds-averaged Navier–Stokes calculations on unstructured grids. Typical industry-standard implementations are currently limited by unstructured accesses, variable degrees of parallelism, as well as the global synchronizations inherent in traditionally used Krylov linear solvers. Therefore, we rely on aggressive single-node optimizations, such as hierarchical parallelism, dynamic threading, compacted memory layout, and vectorization, along with a communication-friendly agglomeration (geometric) linear multigrid solver. Based on results with the well-known ONERA M6 geometry, our single core and shared memory optimizations result in a speedup of 2.6X on the latest 14-core Intel® Xeon™ E5-2697v3 processor¹ when compared to the baseline SU2 implementation with 14 MPI ranks. In multi-node settings, the hybrid OpenMP+MPI multigrid implementation achieves 2X higher parallel efficiency on 256 nodes over conventional Krylov-based (GMRES) methods.
¹ Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance |
---|---|
ISSN: | 0045-7930 1879-0747 |
DOI: | 10.1016/j.compfluid.2016.02.003 |