Thrust and dynamic parallelism

Announcement
More Thrust 1.8,
intro to “dynamic parallelism”
Lecture 12, 18/05/2015
http://fisica.cab.cnea.gov.ar/gpgpu/index.php/en/icnpg/clases
Code for the lecture
Source files: async_thrust.cu, async_reduce_thrust.cu
● cp -r /share/apps/codigos/alumnos_icnpg2015/thrust18 . ; cd thrust18
Thrust 1.8:
streams, new execution policies, dynamic parallelism, etc.
Thrust
https://github.com/thrust/thrust/wiki/Device-Backends
Here we demonstrate how to use Thrust's "backend systems" which control how Thrust algorithms get mapped to and executed
on the parallel processors available to the application. There are two basic ways to access Thrust's systems: by specifying the
global "device" system associated with types like thrust::device_vector, or by selecting a specific container associated with a
particular system, such as thrust::cuda::vector. These two approaches are complementary and may be used together within the
same program.
Selecting a global device system
● wget http://thrust.googlecode.com/hg/examples/monte_carlo.cu
Same source file, two ways to compile it → two executables. The “device” is
interpreted in two different ways; the “host” stays the same (although it
could also be interpreted differently):
CUDA
● nvcc -O2 -o monte_carlo monte_carlo.cu
OPENMP
● nvcc -O2 -o monte_carlo monte_carlo.cu -Xcompiler -fopenmp \
  -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp
Direct System Access:
Using a system-specific vector
#include <thrust/system/omp/vector.h>
#include <thrust/sort.h>
#include <cstdlib>
#include <algorithm>
#include <iostream>
int main(void)
{
    // serially generate 1M random numbers
    thrust::omp::vector<int> vec(1 << 20);
    std::generate(vec.begin(), vec.end(), rand);

    // sort data in parallel with OpenMP
    thrust::sort(vec.begin(), vec.end());

    // no need to transfer data back to host
    // report the largest number
    std::cout << "Largest number is " << vec.back() << std::endl;
    return 0;
}
The compiler knows that the algorithm must call a parallel OpenMP implementation because
the iterators passed as arguments say so... If we had used a thrust::cuda::vector<int> or a
thrust::device_vector<int>, the sort would have been implemented in CUDA.
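For contrast, a minimal sketch (not in the slides) of the same sort dispatched to the CUDA
backend, simply because the iterators come from a thrust::device_vector:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>
#include <algorithm>
#include <iostream>

int main(void)
{
    // serially generate 1M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 20);
    std::generate(h_vec.begin(), h_vec.end(), rand);

    // copy them to the device
    thrust::device_vector<int> d_vec = h_vec;

    // the device_vector iterators carry the CUDA system tag, so the sort runs on the GPU
    thrust::sort(d_vec.begin(), d_vec.end());

    // report the largest number
    std::cout << "Largest number is " << d_vec.back() << std::endl;
    return 0;
}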
We cannot pass raw pointers to sort() unless we add execution policies...
Direct System Access:
Execution Policies
#include <thrust/system/omp/execution_policy.h>
#include <thrust/sort.h>
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <vector>
int main(void)
{
    // serially generate 1M random numbers
    std::vector<int> vec(1 << 20);
    std::generate(vec.begin(), vec.end(), rand);

    // sort data in parallel with OpenMP by specifying its execution policy
    thrust::sort(thrust::omp::par, vec.begin(), vec.end());

    // report the largest number
    std::cout << "Largest number is " << vec.back() << std::endl;
    return 0;
}
A raw C pointer allocated with malloc() could also have been used.
In that case, vec.begin() → vec and vec.end() → vec + N.
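A minimal sketch of that variant (assumed, not shown in the slides): sorting a
malloc()-allocated array through the OpenMP execution policy.

#include <thrust/system/omp/execution_policy.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    const size_t N = 1 << 20;
    int *vec = (int *) malloc(N * sizeof(int));
    for (size_t i = 0; i < N; ++i) vec[i] = rand();

    // raw host pointers are fine here: omp::par works on host memory
    thrust::sort(thrust::omp::par, vec, vec + N);

    free(vec);
    return 0;
}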
Direct System Access:
Execution Policies
WRONG
Careful! It is also an error to pass a raw C pointer that points to host memory:
#include <vector>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
int main()
{
    std::vector<int> vec = ...
    // error -- CUDA kernels can't access std::vector!
    thrust::sort(thrust::cuda::par, vec.begin(), vec.end());
}
CORRECT
#include <thrust/tabulate.h>
#include <thrust/sort.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <iostream>

int main()
{
    int n = 13;
    int *raw_ptr = 0;
    cudaMalloc(&raw_ptr, n * sizeof(int));

    // it's OK to pass raw pointers allocated by cudaMalloc to an algorithm invoked with cuda::par
    thrust::tabulate(thrust::cuda::par, raw_ptr, raw_ptr + n, thrust::identity<int>());
    std::cout << "data is sorted: "
              << thrust::is_sorted(thrust::cuda::par, raw_ptr, raw_ptr + n) << std::endl;

    cudaFree(raw_ptr);
    return 0;
}
Streams with Thrust
Streams
See async.cu from the streams lecture...
Streams with Thrust
thrust::cuda::par is the parallel execution policy associated with Thrust's CUDA backend system.
It can now take a stream as an argument!
Notation: thrust::cuda::par.on(stream[i])
On the other hand, it's safe to use thrust::cuda::par with raw pointers allocated
by cudaMalloc, even when the pointer isn't wrapped by thrust::device_ptr:
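A minimal sketch combining both points (the transform and the sizes are assumptions, not from
the slides): a Thrust algorithm on a raw cudaMalloc'd buffer, enqueued on a user-created
stream with thrust::cuda::par.on(stream).

#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    int *d_data = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // the algorithm is enqueued on 'stream' instead of the default stream
    thrust::transform(thrust::cuda::par.on(stream),
                      d_data, d_data + n, d_data,
                      thrust::negate<int>());

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}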
Streams with Thrust
Example recycled from the streams lecture, plus new demo material...
● cp -r /share/apps/codigos/alumnos_icnpg2015/thrust18 . ; cd thrust18
We include the Thrust 1.8 headers (not needed with the CUDA 7 toolkit) and the THRUST18
macro (omit it to switch back to the async.cu version).
● nvcc async_thrust.cu -I /share/apps/codigos/thrust-master -DTHRUST18
We run nvprof -o a.prof ./a.out
● qsub submit_gpu.sh
Check for stream concurrency (copy-kernel and kernel-kernel overlaps...)
● nvvp → import a.prof
Streams with Thrust
// baseline case - sequential transfer and execute
// asynchronous version 1: loop over {copy, kernel, copy}
Streams with Thrust
Now optimize it! ...
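A hedged sketch of what "asynchronous version 1" could look like with Thrust 1.8 (the chunk
size, stream count, and add_one functor are assumptions; the real code is async_thrust.cu):
each chunk does its H2D copy, Thrust transform, and D2H copy on its own stream, so work on
different chunks can overlap.

#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>
#include <cuda_runtime.h>
#include <vector>

struct add_one
{
    __host__ __device__ int operator()(int x) const { return x + 1; }
};

int main()
{
    const int nStreams = 4, chunk = 1 << 20, n = nStreams * chunk;

    int *h_data, *d_data;
    cudaMallocHost(&h_data, n * sizeof(int));   // pinned host memory, needed for async copies
    cudaMalloc(&d_data, n * sizeof(int));
    for (int i = 0; i < n; ++i) h_data[i] = i;

    std::vector<cudaStream_t> stream(nStreams);
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&stream[i]);

    // asynchronous version 1: loop over {copy, kernel, copy}, one stream per chunk
    for (int i = 0; i < nStreams; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(int),
                        cudaMemcpyHostToDevice, stream[i]);
        thrust::transform(thrust::cuda::par.on(stream[i]),
                          d_data + off, d_data + off + chunk, d_data + off, add_one());
        cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(stream[i]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}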
Algorithm invocation in CUDA __device__ code
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cassert>

template<typename Iterator, typename T, typename BinaryOperation, typename Pointer>
__global__ void reduce_kernel(Iterator first, Iterator last, T init, BinaryOperation binary_op, Pointer result)
{
    *result = thrust::reduce(thrust::cuda::par, first, last, init, binary_op);
}
int main()
{
    size_t n = 1 << 20;
    thrust::device_vector<unsigned int> data(n, 1);
    thrust::device_vector<unsigned int> result(1, 0);

    // method 1: call thrust::reduce from an asynchronous CUDA kernel launch
    // create a CUDA stream
    cudaStream_t s;
    cudaStreamCreate(&s);

    // launch a CUDA kernel with only 1 thread on our stream
    reduce_kernel<<<1,1,0,s>>>(data.begin(), data.end(), 0, thrust::plus<int>(), result.data());

    // wait for the stream to finish
    cudaStreamSynchronize(s);

    // our result should be ready
    assert(result[0] == n);

    cudaStreamDestroy(s);
    return 0;
}
Example from the library:
/share/apps/codigos/thrust-master/examples/cuda/async_reduce.cu
template<typename Iterator, typename T, typename BinaryOperation, typename Pointer>
__global__ void reduce_kernel(Iterator first, Iterator last, T init, BinaryOperation binary_op, Pointer result)
{
    *result = thrust::reduce(thrust::cuda::par, first, last, init, binary_op);
}

int main()
{
    size_t n = 1 << 18;
    thrust::device_vector<unsigned int> data(n, 1);
    thrust::device_vector<unsigned int> result(1, 0);

    // method 1: call thrust::reduce from an asynchronous CUDA kernel launch
    // create a CUDA stream
    cudaStream_t s;
    cudaStreamCreate(&s);

    // launch a CUDA kernel with only 1 thread on our stream
    reduce_kernel<<<1,1,0,s>>>(data.begin(), data.end(), 0, thrust::plus<int>(), result.data());

    // meanwhile, the CPU does its own reduction... who wins?
    nvtxRangePushA("la CPU hace su propia reduccion... quien gana?");
    unsigned res = 0;
    unsigned inc = 1;
    for (int i = 0; i < n; i++) res += inc;
    nvtxRangePop();

    // wait for the stream to finish
    cudaStreamSynchronize(s);
    // ...
Example from the library, modified for the lecture:
/share/apps/codigos/alumnos_icnpg2015/thrust18/async_reduce_thrust.cu
Algorithm invocation in CUDA __device__ code
If you haven't already done so...
● cp -r /share/apps/codigos/alumnos_icnpg2015/thrust18 . ; cd thrust18
We include the Thrust 1.8 headers, NVTX, and the flags for dynamic parallelism...
● nvcc -arch=sm_35 -rdc=true async_reduce_thrust.cu -lcudadevrt -lnvToolsExt \
  -I /share/apps/codigos/thrust-master -o a.out
We run nvprof -o a.prof ./a.out
● qsub submit_gpu.sh
Algorithm invocation in CUDA __device__ code
Example of asynchronous reduction using the GPU
size_t n = 1 << 18;
Try increasing it...
Dynamic parallelism
Dynamic Parallelism in CUDA is supported via an extension to the CUDA programming model that enables a
CUDA kernel to create and synchronize new nested work. Basically, a child CUDA Kernel can be called from within
a parent CUDA kernel and then optionally synchronize on the completion of that child CUDA Kernel. The parent
CUDA kernel can consume the output produced from the child CUDA Kernel, all without CPU involvement.
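A minimal sketch of the idea (assumed names, not code from the blog post): a parent kernel
launches a child kernel from device code, waits for it with a device-side
cudaDeviceSynchronize(), and then consumes the child's output.

#include <cstdio>

__global__ void child_kernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

__global__ void parent_kernel(int *data, int n)
{
    // launched from device code: this creates a nested (child) grid
    child_kernel<<<(n + 255) / 256, 256>>>(data, n);

    // wait for the child so the parent can consume its output
    cudaDeviceSynchronize();
    printf("parent sees data[n-1] = %d\n", data[n - 1]);
}

int main()
{
    const int n = 1024;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));

    // compile with: nvcc -arch=sm_35 -rdc=true ... -lcudadevrt
    parent_kernel<<<1, 1>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}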
Dynamic Parallelism
http://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/
Early CUDA programs had to conform to a flat, bulk parallel programming model. Programs had to
perform a sequence of kernel launches, and for best performance each kernel had to expose enough
parallelism to efficiently use the GPU. For applications consisting of “parallel for” loops the bulk
parallel model is not too limiting, but some parallel patterns—such as nested parallelism—cannot be
expressed so easily. Nested parallelism arises naturally in many applications, such as those using
adaptive grids, which are often used in real-world applications to reduce computational complexity
while capturing the relevant level of detail. Flat, bulk parallel applications have to use either a fine
grid, and do unwanted computations, or use a coarse grid and lose finer details.
Dynamic Parallelism
http://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/
Dynamic parallelism is generally useful for problems where nested parallelism cannot be avoided. This
includes, but is not limited to, the following classes of algorithms:
● algorithms using hierarchical data structures, such as adaptive grids;
● algorithms using recursion, where each level of recursion has parallelism, such as quicksort;
● algorithms where work is naturally split into independent batches, where each batch involves
  complex parallel processing but cannot fully use a single GPU.
Mandelbrot Set
(Figure: the set, shown in black.)
NVIDIA_CUDA-6.5_Samples/2_Graphics/Mandelbrot/
Mandelbrot Set
The Escape Time Algorithm
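A minimal sketch of the escape-time ("dwell") computation (the window and iteration cap are
assumptions): each pixel iterates z ← z*z + c until |z| > 2 or MAX_DWELL is reached; pixels
that never escape are the ones drawn in black.

#define MAX_DWELL 512

__host__ __device__ int pixel_dwell(float cx, float cy)
{
    float x = 0.0f, y = 0.0f;
    int dwell = 0;
    // iterate z <- z^2 + c until |z| > 2 (escape) or the iteration cap is hit
    while (dwell < MAX_DWELL && x * x + y * y < 4.0f) {
        float xt = x * x - y * y + cx;
        y = 2.0f * x * y + cy;
        x = xt;
        ++dwell;
    }
    return dwell;
}

__global__ void dwell_kernel(int *dwells, int w, int h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < w && j < h) {
        // map the pixel to the complex plane, roughly [-2,1] x [-1.5,1.5]
        float cx = -2.0f + 3.0f * i / w;
        float cy = -1.5f + 3.0f * j / h;
        dwells[j * w + i] = pixel_dwell(cx, cy);
    }
}

int main()
{
    const int w = 1024, h = 1024;
    int *dwells;
    cudaMalloc(&dwells, w * h * sizeof(int));

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    dwell_kernel<<<grid, block>>>(dwells, w, h);
    cudaDeviceSynchronize();

    cudaFree(dwells);
    return 0;
}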
Mandelbrot Set
Recursion:
subdivision into squares
Parent grid
Child grid
Grandchild grid
Mariani-Silver Algorithm
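A hedged skeleton of the recursive subdivision behind Mariani-Silver (not the NVIDIA sample:
the common-dwell border test is replaced by a simple depth cutoff, and the per-pixel work just
records the depth): the parent grid launches a child grid per quadrant, each child launches
grandchildren, and so on.

#define MAX_DEPTH 3

// stand-in for the per-pixel dwell evaluation of a leaf sub-square
__global__ void process_block(int *img, int w, int x0, int y0, int size, int depth)
{
    int x = x0 + blockIdx.x * blockDim.x + threadIdx.x;
    int y = y0 + blockIdx.y * blockDim.y + threadIdx.y;
    if (x < x0 + size && y < y0 + size)
        img[y * w + x] = depth;
}

__global__ void subdivide(int *img, int w, int x0, int y0, int size, int depth)
{
    if (depth >= MAX_DEPTH || size <= 32) {
        // leaf: evaluate the square pixel by pixel with a child grid
        dim3 block(16, 16), grid((size + 15) / 16, (size + 15) / 16);
        process_block<<<grid, block>>>(img, w, x0, y0, size, depth);
    } else {
        // recurse: one child launch per quadrant (parent -> child -> grandchild grids)
        int half = size / 2;
        subdivide<<<1, 1>>>(img, w, x0,        y0,        half, depth + 1);
        subdivide<<<1, 1>>>(img, w, x0 + half, y0,        half, depth + 1);
        subdivide<<<1, 1>>>(img, w, x0,        y0 + half, half, depth + 1);
        subdivide<<<1, 1>>>(img, w, x0 + half, y0 + half, half, depth + 1);
    }
    // no device-side synchronization needed here: all child grids are guaranteed
    // to complete before this parent grid is considered finished
}

int main()
{
    const int w = 256;
    int *img;
    cudaMalloc(&img, w * w * sizeof(int));

    // parent grid; compile with: nvcc -arch=sm_35 -rdc=true ... -lcudadevrt
    subdivide<<<1, 1>>>(img, w, 0, 0, w, 0);
    cudaDeviceSynchronize();

    cudaFree(img);
    return 0;
}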
Dynamic Parallelism
http://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/
Done in a single step, it would look like this:
nvcc -arch=sm_35 -rdc=true myprog.cu -lcudadevrt -o myprog.o
Dynamic Parallelism
http://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/
Grid Nesting and Synchronization
Memory Consistency
Child grids always complete before the parent grids that launch them, even if there is no
explicit synchronization.
Passing Pointers to Child Grids
Device Streams and Events
Recursion Depth and Device Limits
Etc...
Dynamic Parallelism
Cool Thing You Could Do with Dynamic Parallelism - Intro to Parallel Programming
https://www.youtube.com/watch?v=QVvHbsMIQzY
Problems Suitable for Dynamic Parallelism - Intro to Parallel Programming
https://www.youtube.com/watch?v=8towMTm82DM
http://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/
http://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/
Intro to Parallel Programming UDACITY, Lesson 7.2
https://www.udacity.com/course/cs344.
Molecular dynamics with long-range interactions, for example gravitational ones...
● PP scheme: O(N*N)
● PM scheme: O(M*log M)
● PPPM = P3M scheme
● Tree codes
Algorithm invocation in CUDA __device__ code
farber1-11_sync.cu: the CPU sits idle
farber1-11_async.cu: 1 CPU thread sits idle
Think about how to improve it...
now that you know how to do an asynchronous reduction