Effects of Easy Hybrid Parallelization with CUDA for OpenMX

Jae-Hyeon Parq; Erik Sevre; Sang-Mook Lee

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 98 - Issue 13

Published: July 2014

DOI: 10.5120/17244-7580

Jae-Hyeon Parq, Erik Sevre, Sang-Mook Lee. Effects of Easy Hybrid Parallelization with CUDA for OpenMX. International Journal of Computer Applications. 98, 13 (July 2014), 20-27. DOI=10.5120/17244-7580

@article{10.5120/17244-7580,
  author    = {Jae-Hyeon Parq and Erik Sevre and Sang-Mook Lee},
  title     = {Effects of Easy Hybrid Parallelization with CUDA for OpenMX},
  journal   = {International Journal of Computer Applications},
  year      = {2014},
  volume    = {98},
  number    = {13},
  pages     = {20-27},
  doi       = {10.5120/17244-7580},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2014
%A Jae-Hyeon Parq
%A Erik Sevre
%A Sang-Mook Lee
%T Effects of Easy Hybrid Parallelization with CUDA for OpenMX
%J International Journal of Computer Applications
%V 98
%N 13
%P 20-27
%R 10.5120/17244-7580
%I Foundation of Computer Science (FCS), NY, USA

Abstract

An MPI-friendly density functional theory (DFT) source code was modified for hybrid parallelization that includes CUDA. The objective is to find out how simple conversions within hybrid parallelization on mid-range GPUs affect a DFT code not originally suited to CUDA. Several rules of hybrid parallelization for numerical-atomic-orbital (NAO) DFT codes were established. The test was performed on a magnetite material system with the OpenMX code, using hardware containing 2 Xeon E5606 CPUs and 2 Quadro 4000 GPUs. The 3-way hybrid routines obtained a speedup of 7.55, while the 2-way hybrid achieved a speedup of 10.94. GPUs with CUDA complement the efficiency of OpenMP and compensate for the CPUs' excessive competition within MPI.
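
As a reading aid for the two speedup figures, the relation below uses the conventional definition of speedup as a ratio of run times. The symbols $T_{\mathrm{ref}}$ and $T_{\mathrm{hybrid}}$ are assumptions introduced here; the abstract does not state the baseline or whether both figures refer to the same routine. If they do share a baseline and routine, the 3-way run time would be about 1.45 times the 2-way run time.

```latex
S = \frac{T_{\mathrm{ref}}}{T_{\mathrm{hybrid}}},
\qquad
\frac{T_{\mathrm{3way}}}{T_{\mathrm{2way}}}
  = \frac{S_{\mathrm{2way}}}{S_{\mathrm{3way}}}
  = \frac{10.94}{7.55} \approx 1.45
```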

Index Terms

Computer Science

Information Sciences

Keywords

MPI, CUDA, OpenMP, electronic structure, graphical processing unit, pseudo-atomic-orbital