Streams
Learning CUDA to Solve Scientific Problems.
Miguel Cárdenas Montes
Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain
[email protected]
2010

Table of Contents
1 Objectives
2 Pinned-Memory
3 Streams
Objectives
To use streams in order to improve performance.
Technical issues:
Stream.
Pinned-memory.

Pinned-Memory
Pinned-Memory I
Until now, the instruction malloc() has been used to allocate host memory.
However, the CUDA runtime offers its own mechanism for allocating
host memory: cudaHostAlloc().
There is a significant difference between the memory that malloc()
allocates and the memory that cudaHostAlloc() allocates.
Pinned-Memory II
The instruction malloc() allocates standard and pageable host
memory.
The instruction cudaHostAlloc() allocates a buffer of page-locked
host memory.
This can be called pinned-memory.
Page-locked memory guarantees that the operating system will never
page this memory out to disk, which ensures its residency in physical
memory.
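As an illustration (not part of the original listings; the buffer size is arbitrary), the same buffer allocated first as pageable memory and then as pinned memory:

#include <stdlib.h>
#include <cuda_runtime.h>

#define SIZE (10*1024*1024)

int main( void ) {
    // pageable host memory, allocated by the C library
    int *pageable = (int*)malloc( SIZE * sizeof(int) );
    free( pageable );

    // page-locked (pinned) host memory, allocated by the CUDA runtime
    int *pinned;
    cudaHostAlloc( (void**)&pinned, SIZE * sizeof(int), cudaHostAllocDefault );
    cudaFreeHost( pinned );
    return 0;
}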
Pinned-Memory III
Knowing the physical address of a buffer, the GPU can then use
direct memory access (DMA) to copy data to or from the host.
Since DMA copies proceed without intervention from the CPU, it also
means that the CPU could be simultaneously paging these buffers out
to the disk or relocating their physical addresses by updating the
operating system’s pagetables.
The possibility of the CPU moving pageable data means that using pinned memory for a DMA copy is essential.
Pinned-Memory III
On the warning side, the computer running the application needs to
have available physical memory for every page-locked buffer, since
these buffers can never be swapped out to disk.
The use of pinned memory should be restricted to buffers that will be used as a source or destination in calls to cudaMemcpy(), and these buffers should be freed as soon as they are no longer needed.
In fact, even when you attempt to perform a memory copy with
pageable memory, the CUDA driver still uses DMA to transfer the
buffer to the GPU.
Therefore, the copy happens twice: first from the pageable system buffer to a page-locked "staging" buffer, and then from the page-locked staging buffer to the GPU.
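The cost of this double copy can be observed with a small timing sketch (not from the slides; the error-checking macro is redefined locally so the fragment is self-contained):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// same role as the HANDLE_ERROR macro used in the other listings
#define HANDLE_ERROR( call ) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    printf( "%s\n", cudaGetErrorString( e ) ); exit( 1 ); } } while (0)

#define SIZE (64*1024*1024)

// time one host-to-device copy of SIZE integers with CUDA events
static float timeCopy( int *src, int *dev ) {
    cudaEvent_t start, stop;
    float ms;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );
    HANDLE_ERROR( cudaMemcpy( dev, src, SIZE*sizeof(int), cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &ms, start, stop ) );
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    return ms;
}

int main( void ) {
    int *dev, *pageable, *pinned;
    HANDLE_ERROR( cudaMalloc( (void**)&dev, SIZE*sizeof(int) ) );
    pageable = (int*)malloc( SIZE*sizeof(int) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&pinned, SIZE*sizeof(int), cudaHostAllocDefault ) );

    printf( "pageable copy: %3.1f ms\n", timeCopy( pageable, dev ) );
    printf( "pinned copy:   %3.1f ms\n", timeCopy( pinned, dev ) );

    free( pageable );
    HANDLE_ERROR( cudaFreeHost( pinned ) );
    HANDLE_ERROR( cudaFree( dev ) );
    return 0;
}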
Pinned-Memory IV
To allocate host memory as pinned memory, the instruction cudaHostAlloc() has to be used.
For freeing the allocated memory, the instruction should be cudaFreeHost().

cudaHostAlloc( (void**)&a, size * sizeof( *a ), cudaHostAllocDefault );
cudaFreeHost( a );

Streams
Single Stream I
Streams can play an important role in accelerating applications.
A CUDA stream represents a queue of GPU operations that get
executed in a specific order.
Operations such as kernel launches, memory copies, and event starts and stops can be placed and ordered into a stream.
The order in which operations are added to the stream specifies the
order in which they will be executed.
Single Stream II
First of all, the device chosen must support the capability termed device overlap.
A GPU supporting this feature possesses the capacity to
simultaneously execute a CUDA kernel while performing a copy
between device and host memory.
int main( void ) {
cudaDeviceProp prop;
int whichDevice;
HANDLE_ERROR( cudaGetDevice( &whichDevice ) );
HANDLE_ERROR( cudaGetDeviceProperties( &prop, whichDevice ) );
if (!prop.deviceOverlap) {
printf( "Device will not handle overlaps");
return 0;
}
Single Stream III
If the device supports overlapping, then ...
The stream should be created with the instruction cudaStreamCreate().

cudaEvent_t start, stop;
float elapsedTime;
// start the timers
HANDLE_ERROR( cudaEventCreate( &start ) );
HANDLE_ERROR( cudaEventCreate( &stop ) );
HANDLE_ERROR( cudaEventRecord( start, 0 ) );
// initialize the stream
cudaStream_t stream;
HANDLE_ERROR( cudaStreamCreate( &stream ) );

Single Stream IV
Then the memory is allocated and the arrays are filled with random integers.

int *host_a, *host_b, *host_c;
int *dev_a, *dev_b, *dev_c;
// allocate the memory on the GPU
HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N*sizeof(int) ) );
HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N*sizeof(int) ) );
HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N*sizeof(int) ) );
// allocate page-locked memory
HANDLE_ERROR( cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
HANDLE_ERROR( cudaHostAlloc( (void**)&host_b, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
HANDLE_ERROR( cudaHostAlloc( (void**)&host_c, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
for (int i=0; i < FULL_DATA_SIZE; i++) {
    host_a[i] = rand();
    host_b[i] = rand();
}
Single Stream V
The call cudaMemcpyAsync() places a request to perform a memory copy into the stream specified by the argument stream.
When the call returns, there is no guarantee that the copy has been performed yet; it is only guaranteed to execute before the next operation placed into the same stream.
The use of cudaMemcpyAsync() requires the use of cudaHostAlloc().
Also, the kernel invocation takes the argument stream.
// now loop over full data
for (int i=0; i<FULL_DATA_SIZE; i+=N) {
    // copy the locked memory to the device, asynchronously
    HANDLE_ERROR( cudaMemcpyAsync( dev_a, host_a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream ) );
    HANDLE_ERROR( cudaMemcpyAsync( dev_b, host_b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream ) );
    kernel <<< N/256, 256, 0, stream >>> ( dev_a, dev_b, dev_c );
    // copy back data from device to locked memory
    HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost, stream ) );
}

Single Stream VI
When the loop has terminated, there could still be work queued up for the GPU to finish.
It is necessary to synchronize with the host, in order to guarantee that the tasks have been completed.

// wait until the stream has emptied its queue of work
HANDLE_ERROR( cudaStreamSynchronize( stream ) );
Single Stream VII
After the synchronization the timer can be stopped.
Finally, a dummy kernel is used, together with some mandatory data definitions.

HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
HANDLE_ERROR( cudaEventSynchronize( stop ) );
HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );
printf( "Time taken: %3.1f ms \n", elapsedTime );

#define N (1024*1024)
#define FULL_DATA_SIZE (N*20)

__global__ void kernel( int *a, int *b, int *c ) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        int idx1 = (idx + 1) % 256;
        int idx2 = (idx + 2) % 256;
        float as = (a[idx] + a[idx1] + a[idx2]) / 3.0f;
        float bs = (b[idx] + b[idx1] + b[idx2]) / 3.0f;
        c[idx] = (as + bs) / 2;
    }
}

Single Stream VIII
The memory can be cleaned.
Before exiting the application, the stream has to be destroyed.
// cleanup the streams and memory
HANDLE_ERROR( cudaFreeHost ( host_a ) );
HANDLE_ERROR( cudaFreeHost ( host_b ) );
HANDLE_ERROR( cudaFreeHost ( host_c ) );
HANDLE_ERROR( cudaFree ( dev_a ) );
HANDLE_ERROR( cudaFree ( dev_b ) );
HANDLE_ERROR( cudaFree ( dev_c ) );
HANDLE_ERROR( cudaStreamDestroy( stream ) );
return 0;
}
Kernel invocation, parameters
Any call to a __global__ function must specify the execution configuration for that call.
The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device, as well as the associated stream.
When using the runtime API (Section 3.2), the execution configuration is specified by inserting an expression of the form <<<Dg, Db, Ns, S>>> between the function name and the parenthesized argument list.
Dg is of type dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched; Dg.z must be equal to 1;
Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array; Ns is an optional argument which defaults to 0;
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
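As an illustration (the kernel name and the sizes are invented for this sketch), an execution configuration that sets all four parameters:

// dev_data is assumed to hold 64*256 floats already resident on the device
__global__ void scale( float *data, float factor ) {
    // dynamically allocated shared memory, sized by Ns at launch time
    extern __shared__ float tile[];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    tile[threadIdx.x] = data[idx] * factor;
    __syncthreads();
    data[idx] = tile[threadIdx.x];
}

void launch( float *dev_data, cudaStream_t stream ) {
    dim3 Dg( 64, 1, 1 );              // Dg: 64 blocks (Dg.z equal to 1)
    dim3 Db( 256, 1, 1 );             // Db: 256 threads per block
    size_t Ns = 256 * sizeof(float);  // Ns: dynamic shared memory per block
    scale <<< Dg, Db, Ns, stream >>> ( dev_data, 2.0f );  // S: the stream
}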
Multiple Streams I
At the beginning of the previous example, the feature of supporting
overlap was checked; and the computation was broken into chunks.
The underlying idea to improve performance is to divide the computational tasks and overlap them.
Newer NVIDIA GPUs can simultaneously execute a kernel and perform two memory copies (one to the device and one from the device).
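In later CUDA versions the number of copy engines can also be queried directly; a small sketch (assuming the HANDLE_ERROR macro of the other listings):

cudaDeviceProp prop;
int dev;
HANDLE_ERROR( cudaGetDevice( &dev ) );
HANDLE_ERROR( cudaGetDeviceProperties( &prop, dev ) );
// asyncEngineCount: 1 means one copy can overlap kernel execution,
// 2 means copies in both directions can overlap kernel execution
printf( "copy engines: %d\n", prop.asyncEngineCount );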
Multiple Streams II
For multiple streams, each of them must be created.
// initialize the streams
cudaStream_t stream0, stream1;
HANDLE_ERROR( cudaStreamCreate( &stream0 ) );
HANDLE_ERROR( cudaStreamCreate( &stream1 ) );
Multiple Streams III
All actions must be duplicated: buffer allocations, kernel invocations, synchronization, clean-up.
int *host_a, *host_b, *host_c;
int *dev_a0, *dev_b0, *dev_c0; // for stream 0
int *dev_a1, *dev_b1, *dev_c1; // for stream 1
// allocate the memory on the GPU
HANDLE_ERROR( cudaMalloc( (void**)&dev_a0, N*sizeof(int) ) );
HANDLE_ERROR( cudaMalloc( (void**)&dev_b0, N*sizeof(int) ) );
HANDLE_ERROR( cudaMalloc( (void**)&dev_c0, N*sizeof(int) ) );
HANDLE_ERROR( cudaMalloc( (void**)&dev_a1, N*sizeof(int) ) );
HANDLE_ERROR( cudaMalloc( (void**)&dev_b1, N*sizeof(int) ) );
HANDLE_ERROR( cudaMalloc( (void**)&dev_c1, N*sizeof(int) ) );
// allocate page-locked memory
HANDLE_ERROR( cudaHostAlloc( (void**)&host_a, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
HANDLE_ERROR( cudaHostAlloc( (void**)&host_b, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
HANDLE_ERROR( cudaHostAlloc( (void**)&host_c, FULL_DATA_SIZE*sizeof(int), cudaHostAllocDefault ) );
for (int i=0; i < FULL_DATA_SIZE; i++) { host_a[i] = rand(); host_b[i] = rand(); }

Multiple Streams IV
Each iteration of the loop now processes two chunks of the data, one per stream.

// now loop over full data, two chunks at a time
for (int i=0; i<FULL_DATA_SIZE; i+=N*2) {
    // copy the locked memory to the device, asynchronously
    HANDLE_ERROR( cudaMemcpyAsync( dev_a0, host_a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream0 ) );
    HANDLE_ERROR( cudaMemcpyAsync( dev_b0, host_b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream0 ) );
    kernel <<< N/256, 256, 0, stream0 >>> ( dev_a0, dev_b0, dev_c0 );
    // copy back data from device to locked memory
    HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c0, N*sizeof(int), cudaMemcpyDeviceToHost, stream0 ) );

    // copy the locked memory to the device, asynchronously
    HANDLE_ERROR( cudaMemcpyAsync( dev_a1, host_a+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream1 ) );
    HANDLE_ERROR( cudaMemcpyAsync( dev_b1, host_b+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream1 ) );
    kernel <<< N/256, 256, 0, stream1 >>> ( dev_a1, dev_b1, dev_c1 );
    // copy back data from device to locked memory
    HANDLE_ERROR( cudaMemcpyAsync( host_c+i+N, dev_c1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1 ) );
}
// synchronization
// cleanup
// streams destruction
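The last three comments are placeholders in the slides; a possible expansion, mirroring the single-stream clean-up (an assumption, not the original code):

// wait until both streams have finished the queued work
HANDLE_ERROR( cudaStreamSynchronize( stream0 ) );
HANDLE_ERROR( cudaStreamSynchronize( stream1 ) );
// free the page-locked and device memory
HANDLE_ERROR( cudaFreeHost( host_a ) );
HANDLE_ERROR( cudaFreeHost( host_b ) );
HANDLE_ERROR( cudaFreeHost( host_c ) );
HANDLE_ERROR( cudaFree( dev_a0 ) );
HANDLE_ERROR( cudaFree( dev_b0 ) );
HANDLE_ERROR( cudaFree( dev_c0 ) );
HANDLE_ERROR( cudaFree( dev_a1 ) );
HANDLE_ERROR( cudaFree( dev_b1 ) );
HANDLE_ERROR( cudaFree( dev_c1 ) );
// destroy the streams before exiting
HANDLE_ERROR( cudaStreamDestroy( stream0 ) );
HANDLE_ERROR( cudaStreamDestroy( stream1 ) );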
Performance Considerations I
Users should take care with the sequence of actions queued in the streams: it is very easy to inadvertently block the copies or kernel executions of another stream.
To alleviate this problem, it suffices to enqueue the operations breadth-first across streams rather than depth-first.
// now loop over full data, two chunks at a time
for (int i=0; i<FULL_DATA_SIZE; i+=N*2) {
    // enqueue copies of a in stream0 and stream1
    HANDLE_ERROR( cudaMemcpyAsync( dev_a0, host_a+i, N*sizeof(int), cudaMemcpyHostToDevice, stream0 ) );
    HANDLE_ERROR( cudaMemcpyAsync( dev_a1, host_a+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream1 ) );
    // enqueue copies of b in stream0 and stream1
    HANDLE_ERROR( cudaMemcpyAsync( dev_b0, host_b+i, N*sizeof(int), cudaMemcpyHostToDevice, stream0 ) );
    HANDLE_ERROR( cudaMemcpyAsync( dev_b1, host_b+i+N, N*sizeof(int), cudaMemcpyHostToDevice, stream1 ) );
    // enqueue kernel in stream0 and stream1
    kernel <<< N/256, 256, 0, stream0 >>> ( dev_a0, dev_b0, dev_c0 );
    kernel <<< N/256, 256, 0, stream1 >>> ( dev_a1, dev_b1, dev_c1 );
    // copy back data
    HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c0, N*sizeof(int), cudaMemcpyDeviceToHost, stream0 ) );
    HANDLE_ERROR( cudaMemcpyAsync( host_c+i+N, dev_c1, N*sizeof(int), cudaMemcpyDeviceToHost, stream1 ) );
}
// synchronization
// cleanup
// streams destruction
Thanks
Questions?
More questions?