Hybrid-parallel algorithms for 2D Green’s functions

Alejandro Álvarez-Melcón, Domingo Giménez, Fernando D. Quesada and Tomás Ramírez
[email protected];
[email protected]
Universidad Politécnica de Cartagena/ Universidad de Murcia
ETSI. Telecomunicación/ Facultad de Informática
Dpto. Tecnologías de la Información y las Comunicaciones/ Dpto. de Informática y Sistemas
International Conference on Computational Science
June 5-7, 2013
Outline
1. Introduction and motivation
2. Computation of Green’s functions on hybrid systems
3. Experimental results
4. Autotuning
5. Conclusions and perspectives
Introduction and motivation
Motivation of the work
1. High interest in the development of full-wave techniques for the analysis of microwave components and antennas.
2. Need for efficient software tools that allow optimization of complex devices in real time.
Calculation of Green’s functions inside waveguides
The execution time increases due to:
1. Low convergence rate of the series (images, modes).
2. Large number of pairs of points.
Introduction and motivation
Objectives of the work
1. Increase efficiency using parallel computing.
2. Several hybrid-heterogeneous parallelism strategies are proposed in this context.
Strategies explored
1. Parameterized hybrid parallelism (MPI+OpenMP+CUDA) for the computation of Green’s functions in rectangular waveguides.
2. Autotuning strategies based on the parameterized code.
Computation of Green’s functions on hybrid systems
Hybrid parallelism
MPI+OpenMP, OpenMP+CUDA and MPI+OpenMP+CUDA routines are developed to accelerate the calculation of 2D waveguide Green’s functions.
For each MPI process P_k, 0 ≤ k < p:
    omp_set_num_threads(h + g)
    for i = k·m/p to (k + 1)·m/p − 1 do
        node = omp_get_thread_num()
        if node < h then
            Compute with OpenMP thread
        else
            Call to CUDA kernel
        end if
    end for
p MPI processes are started.
h + g threads run inside each process.
Threads 0 to h − 1 work on the CPU (OpenMP, OMP).
The remaining threads, h to h + g − 1, work on the GPU by calling CUDA kernels.
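The distribution described in the pseudocode above can be sketched in plain Python, without MPI, OpenMP or CUDA. This is only an illustration under stated assumptions: the function names are hypothetical, integer division is assumed when p does not divide m, and the round-robin assignment of iterations to threads is an assumption, since the slides do not state the loop schedule.

```python
def block_range(k, p, m):
    """Half-open range [start, stop) of the iterations owned by MPI process k.

    Mirrors the loop bounds k*m/p .. (k+1)*m/p - 1 of the pseudocode;
    integer division is an assumption for the case where p does not divide m.
    """
    return (k * m) // p, ((k + 1) * m) // p


def thread_role(node, h, g):
    """Threads 0..h-1 compute on the CPU, threads h..h+g-1 launch CUDA kernels."""
    if not 0 <= node < h + g:
        raise ValueError("thread id out of range")
    return "cpu" if node < h else "gpu"


def schedule(p, h, g, m):
    """Map each iteration i to (process, thread, role).

    Round-robin assignment inside a process is an assumption made for
    illustration; the real routine relies on the OpenMP runtime.
    """
    threads = h + g
    if threads < 1:
        raise ValueError("at least one thread per process is required")
    plan = {}
    for k in range(p):
        start, stop = block_range(k, p, m)
        for offset, i in enumerate(range(start, stop)):
            node = offset % threads
            plan[i] = (k, node, thread_role(node, h, g))
    return plan
```

For example, `schedule(2, 2, 1, 12)` gives each of the two processes six iterations, with thread 2 of every process acting as the GPU thread.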
Computation of Green’s functions on hybrid systems
Routines developed
h+g \ p    1           p
1+0        SEQ         MPI
h+0        OMP         MPI+OMP
0+g        CUDA        MPI+CUDA
h+g        OMP+CUDA    MPI+OMP+CUDA
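The table of routines can be read as a mapping from the parameters (p, h, g) to the paradigm combination in use. A minimal sketch of that mapping follows; `routine_name` is a hypothetical helper, and treating h = 1 with g ≥ 1 as OMP+CUDA is an assumption based on the generic h+g row.

```python
def routine_name(p, h, g):
    """Return the routine label from the table for p processes,
    h CPU threads and g GPU threads per process."""
    if p < 1 or h < 0 or g < 0 or h + g < 1:
        raise ValueError("need p >= 1 and at least one thread per process")
    parts = []
    if p > 1:
        parts.append("MPI")
    # OpenMP is involved when several threads run: more than one CPU
    # thread, or CPU threads next to GPU threads (assumption for h = 1).
    if h > 1 or (h >= 1 and g >= 1):
        parts.append("OMP")
    if g >= 1:
        parts.append("CUDA")
    return "+".join(parts) if parts else "SEQ"
```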
Experimental results
Computational systems
Saturno is a NUMA system with 24 Intel Xeon cores at 1.87 GHz and 32 GB of shared memory, plus an NVIDIA Tesla C2050 with a total of 448 CUDA cores, 2.8 GB and 1.15 GHz.
Marte and Mercurio are AMD Phenom II X6 1075T (hexa-core) systems at 3 GHz, with 15 GB (Marte) and 8 GB (Mercurio), plus an NVIDIA GeForce GTX 590 with two devices of 512 CUDA cores each. They are connected in a homogeneous cluster.
Luna is an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB, plus an NVIDIA GeForce 9800 GT with a total of 112 CUDA cores.
All of them are connected in a heterogeneous cluster.
Experimental results
Use of GPU
Comparison of one kernel versus several kernels:
The speed-up is plotted as a function of the problem size (#images, #points).
S = T(#kernels=1) / T(#kernels=X).
Three kernels give satisfactory results.
Experimental results
Comparison between use of CPU versus use of GPU
Test of computational speed when CPUs or GPUs are used.
The CPU version uses as many threads as cores.
S = T(#threads=#cores) / T(#kernels=3).
S > 1 means the GPU is preferred over the CPU.
Experimental results
Improvement with MPI+GPU
Test of computational speed: several kernels in one process versus several processes with one kernel each.
S = T(#proc=2; #kernels=X) / T(#proc=2*X; #kernels=1).
S > 1 means it is preferable to start the kernels in separate MPI processes (one kernel each).
Experimental results
Comparison between GPU and optimum parameters
Selecting the optimum values of p, h and g produces lower execution times than blind GPU use.
S = T(#kernels=3) / T(lowest).
S > 1 means GPU-only execution is slower than the lowest-time configuration.
A speed-up of two is obtained for large problems using the optimum parameters.
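The comparison above amounts to picking the configuration with the lowest measured time and dividing the GPU-only time by it. A minimal sketch, assuming the measurements are stored in a dictionary keyed by (p, h, g); the helper names and the timings in `example_times` are hypothetical placeholders, not values from the experiments.

```python
def lowest_configuration(times):
    """Return ((p, h, g), time) for the fastest measured configuration."""
    return min(times.items(), key=lambda item: item[1])


def speedup(reference_time, other_time):
    """S = T(reference) / T(other); S > 1 means the reference is slower."""
    return reference_time / other_time


# Hypothetical timings, for illustration only.
example_times = {
    (1, 0, 3): 8.0,   # GPU only, three kernels
    (1, 6, 0): 12.0,  # CPU only, six threads
    (2, 3, 3): 4.0,   # hybrid configuration
}
```

With these placeholder values, `lowest_configuration(example_times)` selects the hybrid entry and the speed-up of the optimum over GPU-only execution is `speedup(8.0, 4.0)`.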
Experimental results
Comparison homogeneous - heterogeneous cluster
Combining nodes with different computational speeds, different numbers of cores and different GPUs produces an additional reduction of the execution time. Different values of p, h and g are used for different nodes.
S = T(#kernels=3*#nodes) / T(lowest).
Important reduction of the execution time with the heterogeneous cluster.
Execution time closer to the lowest experimental time.
Experimental results
Experiments give satisfactory results with some heuristics:
Three CUDA kernels per GPU.
Kernel calls inside MPI processes.
Luna not included in the heterogeneous cluster.
Work to be done
Further improvement.
What about a different computational system?
What about a user who is not an expert in parallelism?
Autotuning parallel codes
Autotuning strategies
The high complexity of today’s hybrid, heterogeneous and hierarchical parallel systems makes it difficult to estimate the optimum parameters leading to the lowest execution times.
The solution is to develop codes with autotuning engines.
Autotuning tries to ensure execution times close to the optimum, independently of the particular problem and of the characteristics of the computing system.
Types of autotuning techniques
Empirical autotuning.
Modeling of the execution time.
Autotuning parallel codes
Model based autotuning
Execution time when the computation is distributed among c OpenMP threads or MPI processes:
Fine grained:
$m n \left( \frac{n_{mod}}{c} S_1 + \frac{2 m_{imag} + 1}{c} (2 n_{imag} + 1) S_2 + R(c) + M(c) \right)$
Coarse grained:
$\left\lceil \frac{m}{c} \right\rceil n \left( n_{mod} S_1 + (2 m_{imag} + 1)(2 n_{imag} + 1) S_2 + R(c) + M(c) \right)$
R(c) is the cost of the reduction and M(c) the management cost; both depend on the number of threads or processes.
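The two cost expressions above can be evaluated for several values of c to choose the number of threads or processes. The sketch below assumes one particular reading of the slide: in the fine-grained case the mode and image sums are each divided by c, in the coarse-grained case the m points are split in blocks of size ⌈m/c⌉, and R(c) and M(c) sit inside the per-pair term in both. The function names are hypothetical, S1 and S2 are taken as given constants, and R and M are passed as callables because their form is not given.

```python
import math


def fine_grained_time(c, m, n, n_mod, m_imag, n_imag, s1, s2,
                      reduction, management):
    """Model with the series of every pair of points split among c workers."""
    per_pair = ((n_mod / c) * s1
                + ((2 * m_imag + 1) / c) * (2 * n_imag + 1) * s2
                + reduction(c) + management(c))
    return m * n * per_pair


def coarse_grained_time(c, m, n, n_mod, m_imag, n_imag, s1, s2,
                        reduction, management):
    """Model with the m points split among c workers, ceil(m / c) each."""
    per_pair = (n_mod * s1
                + (2 * m_imag + 1) * (2 * n_imag + 1) * s2
                + reduction(c) + management(c))
    return math.ceil(m / c) * n * per_pair


def best_c(model, candidates, **params):
    """Return the candidate c with the lowest modeled execution time."""
    return min(candidates, key=lambda c: model(c, **params))
```

A call such as `best_c(coarse_grained_time, [1, 2, 4], **params)` returns the value of c with the lowest predicted time; in practice S1, S2, R and M would have to be estimated for each system.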
Autotuning parallel codes
Model based autotuning
Satisfactory predictions in Marte and in Marte+Mercurio, from which a satisfactory selection of parameters can be made.
But how to model hybrid systems and GPUs?
Autotuning parallel codes
Empirical autotuning
Run some test executions during the initial installation phase of
the routine (installation set; keep installation set small).
This information is used at running time when a particular problem
is being solved (validation set).
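The two phases described above can be sketched as follows. This is an illustration under assumptions, not the authors' engine: `install` and `select` are hypothetical names, the timing function `measure` is supplied by the caller, and reusing the parameters of the nearest installation-set size (distance on a logarithmic scale) is an assumed selection rule.

```python
import math


def install(installation_set, candidates, measure):
    """Installation phase: for each (images, points) size, keep the
    (p, h, g) candidate with the lowest measured time."""
    table = {}
    for size in installation_set:
        table[size] = min(candidates, key=lambda cfg: measure(size, cfg))
    return table


def select(table, size):
    """Run time: reuse the parameters stored for the closest installed size.

    Log-scale distance is an assumption, chosen because the problem
    sizes span several orders of magnitude.
    """
    def distance(known):
        return sum((math.log(a) - math.log(b)) ** 2
                   for a, b in zip(known, size))
    return table[min(table, key=distance)]
```

Keeping the installation set small limits the installation cost, at the price of a coarser lookup at run time.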
images-points    AUTO-TUNING    LOWEST    DEVIATION
1000-25          0.155          0.114     35.96%
100000-25        5.012          5.012     0%
1000-100         1.706          1.656     3.02%
100000-100       87.814         79.453    10.52%
Waveguide GF: different problem sizes (images, number of points). Execution times with the autotuning technique and with the optimum parameters (lowest).
Autotuning routine performs well for the problem sizes investigated.
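The deviation values in the table are consistent with the relative excess of the autotuned time over the lowest time. The formula below is inferred from the table values, not stated on the slides; `deviation_percent` is a hypothetical helper and the rows are copied from the table.

```python
def deviation_percent(autotuned, lowest):
    """Relative excess of the autotuned time over the lowest time, in %."""
    return 100.0 * (autotuned - lowest) / lowest


# (images-points, autotuning time, lowest time), as listed in the table.
rows = [
    ("1000-25", 0.155, 0.114),
    ("100000-25", 5.012, 5.012),
    ("1000-100", 1.706, 1.656),
    ("100000-100", 87.814, 79.453),
]

deviations = {label: round(deviation_percent(auto, low), 2)
              for label, auto, low in rows}
```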
Conclusions
Combining several parallelism paradigms allows the efficient solution of electromagnetic problems on today’s computational systems, which are hybrid, heterogeneous and hierarchical.
The calculation of Green’s functions inside waveguides has been adapted for heterogeneous clusters with CPUs and GPUs of different speeds.
Parameterized algorithms make it easier to adapt the code to the characteristics of the computational system.
Autotuning techniques can be incorporated so that users who are not experts in parallelism can use the routines efficiently in complex computational systems.
Perspectives
More optimized versions of the codes can be developed, especially for GPU.
Empirical autotuning techniques for large heterogeneous systems must be studied in more depth.
Models of the execution time of the hybrid routines need to be developed.
Inclusion of the routines in higher-level electromagnetism codes, for example the analysis of finite microstrip structures using the Volume Integral Equation solved by the Method of Moments.