To safely update shared data from many threads, the correct approach is to use the atomicAdd function, which serializes access:
<code C>
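// Illustrative sketch (hypothetical names, not the lab's original
// listing): each thread adds its element into a single shared
// accumulator. atomicAdd performs the read-modify-write as one
// indivisible operation, so concurrent updates are serialized and
// no increment is lost.
__global__ void sum_atomic(const int *data, int *sum) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(sum, data[idx]);
}
</code>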

===== Asynchronous Host and Device Execution =====

Among others, the following device operations are asynchronous with respect to the host:
- Host -> device memory copies of blocks smaller than 64 KB
- Host -> device memory copies performed with Async-suffixed functions
- Memory set functions (setting / initializing memory to a given value)
To debug asynchronous execution scenarios, asynchronous execution can be disabled completely by setting the CUDA_LAUNCH_BLOCKING environment variable to 1 (e.g. ''CUDA_LAUNCH_BLOCKING=1 ./app''). Kernel execution is also synchronous when running under a profiler (Nsight, Visual Profiler).
+ | |||
+ | ==== Fluxuri nonimplicite ==== | ||
+ | |||
+ | Pentru a folosi cudaMemcpyAsync, este necesar lucrul cu fluxuri nonimplictie (non-default streams), care, in C/C++ pot fi declarate, create si distruse in partea de cod de pe host (CPU) in urmatorul fel: | ||
+ | |||
<code C>
cudaStream_t stream1;
cudaError_t result;

// create the stream before issuing work into it...
result = cudaStreamCreate(&stream1);

// ...and destroy it once all work submitted to it has finished
result = cudaStreamDestroy(stream1);
</code>
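Each of these calls returns a cudaError_t, which is worth checking; a minimal sketch, assuming <stdio.h> is included:

<code C>
if (result != cudaSuccess) {
    // cudaGetErrorString converts the error code to a readable message
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(result));
}
</code>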
+ | |||
+ | Odata creat un astfel de flux, el poate fi utilizat in procesul de copiere a memoriei host -> device astfel: | ||
<code C>
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
</code>

To launch a kernel into a non-default stream, the stream identifier is specified as the fourth execution-configuration parameter. Note that there is also a third execution-configuration parameter, which specifies the amount of (dynamic) shared memory to allocate on the device (GPU) per block; pass 0 if none is needed.
<code C>
increment<<<1,N,0,stream1>>>(d_a);
</code>
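Putting these pieces together, here is a minimal end-to-end sketch (the kernel and sizes are illustrative, not the lab's original code). Note that for cudaMemcpyAsync to actually overlap with host work, the host buffer must be page-locked (pinned), e.g. allocated with cudaMallocHost:

<code C>
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: increment every element of the array.
__global__ void increment(int *a) {
    a[threadIdx.x] += 1;
}

int main(void) {
    const int N = 256;
    int *a, *d_a;

    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost((void **)&a, N * sizeof(int));
    cudaMalloc((void **)&d_a, N * sizeof(int));
    for (int i = 0; i < N; i++) a[i] = i;

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);

    // Copy, kernel and copy-back are all issued into stream1: they run
    // in order relative to each other, but asynchronously to the host.
    cudaMemcpyAsync(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice, stream1);
    increment<<<1, N, 0, stream1>>>(d_a);
    cudaMemcpyAsync(a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost, stream1);

    // Block the host until all work queued in stream1 has finished.
    cudaStreamSynchronize(stream1);

    printf("a[0] = %d, a[%d] = %d\n", a[0], N - 1, a[N - 1]);

    cudaStreamDestroy(stream1);
    cudaFree(d_a);
    cudaFreeHost(a);
    return 0;
}
</code>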
==== Asynchronous execution and data transfer ====
Some devices can perform an asynchronous memory transfer alongside a kernel execution. This depends on the compute capability and can be checked through the asyncEngineCount device property.

{{:asc:lab9:cuda_async.png?900|}}
Intra-device memory transfers can also be performed simultaneously with kernel execution when both the concurrentKernels and the asyncEngineCount device properties are 1.

{{:asc:lab9:cuda_async_2.png?900|}}
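These capabilities can be queried at runtime with cudaGetDeviceProperties; a small sketch (device 0 is assumed):

<code C>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // Number of copy engines able to run concurrently with kernels
    // (0 means copy/compute overlap is not possible).
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);

    // 1 if kernels from different streams can execute concurrently.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    return 0;
}
</code>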
===== Dynamic Parallelism =====
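In short, dynamic parallelism allows a kernel to launch child kernels directly from the device, with no involvement from the host; it requires compute capability 3.5+ and compilation with relocatable device code (-rdc=true). A minimal illustrative sketch (hypothetical names):

<code C>
// Child kernel: trivial per-thread update.
__global__ void child(int *data) {
    data[threadIdx.x] += 1;
}

// Parent kernel: launches the child from device code. The guard makes
// sure only one thread per block issues the child launch.
__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        child<<<1, 32>>>(data);
    }
}
</code>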
<hidden>
{{:asc:lab9:sol:lab9_sol.tar.gz|Lab 6 solution}}
</hidden>
/* {{:asc:lab9:sol:lab9_sol.zip|Lab 6 solution}} */
/* {{:asc:lab6:asc_lab9.pdf|Lab 6 handout}} */
* Lab coordinators: Matei Barbu, Alexandru Bala
* [[https://www.nvidia.com/en-us/data-center/tesla-p100/|NVIDIA Pascal P100]]
* Advanced CUDA
* [[http://www-personal.umich.edu/~smeyer/cuda/grid.pdf|CUDA Thread Basics]]
* [[https://devblogs.nvidia.com/even-easier-introduction-cuda/|An Even Easier Introduction to CUDA]]
* [[https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf|CUDA Streams 1]]
* [[https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/|CUDA Streams 2]]
* [[https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism/|CUDA Dynamic Parallelism]]
* [[https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/|How to Overlap Data Transfers in CUDA C/C++]]