asc:laboratoare:06 [2022/04/04 05:31]
stefan_dan.ciocirlan [3. Analysis of the Tachyon raytracing engine]
asc:laboratoare:06 [2024/04/10 12:02] (current)
emil.slusanschi [Executie asincrona Host si Device]
====== Lab 06 - Advanced CUDA ======
  
===== Unified Memory =====
  
Starting with [[http://developer.download.nvidia.com/compute/cuda/6_0/rel/docs/CUDA_Toolkit_Release_Notes.pdf|CUDA 6.0]], NVIDIA changed the programming model significantly by letting the CPU (host) and the GPU (device) communicate transparently through the same set of virtual memory addresses. The same memory pointer can thus be used to write data both from the CPU and from the GPU. The transfers themselves still take place between different address spaces (e.g. RAM vs. VRAM), but this happens transparently to the CUDA application / the programmer.
  
{{:asc:lab9:nv-unified.png?640|NVIDIA Unified Memory}}
  
Below is an example of using unified memory. The only difference from a regular CPU/host allocation is that the memory must be allocated with cudaMallocManaged and freed with cudaFree.
  
<code C>
#include <iostream>
#include <math.h>

// CUDA kernel to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory -- accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
</code>
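The program above can be compiled with nvcc and run directly; the file and binary names below are only illustrative:

<code sh>
# save the example as unified_add.cu (name is illustrative)
nvcc -o unified_add unified_add.cu
./unified_add
</code>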
  
===== CUDA Atomic Operations =====
  
CUDA provides access to multiple atomic read-modify-write operations. These serialize access when several threads operate on the same data. The functions are limited to certain data types:
  - int
  - unsigned int
  - unsigned long long int
  - float
  - double

Examples of atomic functions:
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd|atomicAdd]]
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicsub|atomicSub]]
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicexch|atomicExch]]
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicmin|atomicMin]]
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicmax|atomicMax]]
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicinc|atomicInc]]
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicdec|atomicDec]]
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicand|atomicAnd]]
  - [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicor|atomicOr]]

<note important>
Read the CUDA documentation carefully before using an atomic operation (the context in which it applies, how it operates, its limitations, etc.).
</note>
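For type/operation combinations that are not covered natively (for instance, atomicAdd on double before compute capability 6.0), the CUDA C Programming Guide shows how to build them on top of atomicCAS; a sketch of that pattern:

<code C>
// Read-modify-write retry loop built on atomicCAS, following the pattern
// from the CUDA C Programming Guide (double atomicAdd for devices with
// compute capability < 6.0).
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // retry if another thread modified the value in the meantime
    } while (assumed != old);
    return __longlong_as_double(old);
}
</code>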
  
The code below launches the concurrentRW kernel in a configuration of numBlocks=8 blocks with 10 threads each.
  
<code C>
#include <iostream>

#define NUM_ELEM        8
#define NUM_THREADS     10

using namespace std;

__global__ void concurrentRW(int *data) {
...
}

int main(int argc, char *argv[]) {
    int* data = NULL;
    bool errorsDetected = false;

    // allocate NUM_ELEM ints in unified memory
    cudaMallocManaged(&data, NUM_ELEM * sizeof(int));
    if (data == 0) {
        cout << "[HOST] Couldn't allocate memory\n";
        return 1;
    }

    // init all elements to 0
    cudaMemset(data, 0, NUM_ELEM * sizeof(int));

    // launch kernel writes
    concurrentRW<<<NUM_ELEM, NUM_THREADS>>>(data);
    cudaDeviceSynchronize();
    if (cudaSuccess != cudaGetLastError()) {
        return 1;
    }

    // each element should hold 0 + 1 + ... + (NUM_THREADS - 1)
    for(int i = 0; i < NUM_ELEM; i++) {
        cout << i << ". " << data[i] << endl;
        if(data[i] != (NUM_THREADS * (NUM_THREADS - 1) / 2)) {
            errorsDetected = true;
        }
    }

    if(errorsDetected) {
        cout << "Errors detected" << endl;
    } else {
        cout << "OK" << endl;
    }

    return 0;
}
</code>
  
The concurrentRW function reads the value at address data[blockIdx.x], increments it by threadIdx.x, and then writes it back.
In this case we have 10 threads performing read/write operations on the same address, hence undefined behavior (a data race).

<code C>
__global__ void concurrentRW(int *data) {
    // NUM_THREADS threads read and write the same location -- data race
    data[blockIdx.x] = data[blockIdx.x] + threadIdx.x;
}
</code>
  
Example output:
<code sh>
0. 9
1. 9
2. 9
3. 9
4. 9
5. 9
6. 9
7. 9
Errors detected
</code>
  
The correct approach is to use the atomicAdd function to serialize the accesses.
  
<code C>
__global__ void concurrentRW(int *data) {
    // NUM_THREADS threads add to the same location atomically
    atomicAdd(&data[blockIdx.x], threadIdx.x);
}
</code>
  
The result of the run is:
<code sh>
0. 45
1. 45
2. 45
3. 45
4. 45
5. 45
6. 45
7. 45
OK
</code>
  
==== System-wide Atomic Operations ====
  
GPUs with compute capability 6.x allow the scope of atomic operations to be widened. For example, atomicAdd_system guarantees that the operation is atomic even when both GPU threads and CPU threads try to access the data. Below is an example of using atomicAdd_system.
  
<code C>
__global__ void mykernel(int *addr) {
  atomicAdd_system(addr, 10);       // only available on devices with compute capability 6.x
}

void foo() {
  int *addr;
  cudaMallocManaged(&addr, 4);
  *addr = 0;

  mykernel<<<...>>>(addr);
  __sync_fetch_and_add(addr, 10);   // CPU atomic operation
}
</code>
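Support can be checked at runtime before using the _system variants; a minimal sketch:

<code C>
// Sketch: query the compute capability of device 0 and only use
// system-wide atomics when it is 6.x or newer.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.major >= 6) {
    // atomicAdd_system and friends are available here
}
</code>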
  
===== Asynchronous CUDA Operations =====
  
In CUDA, the following operations are defined as independent and can be executed concurrently:
  - Computation on the host
  - Computation on the device
  - Host -> device memory transfers
  - Device -> host memory transfers
  - Device -> device memory transfers
The achievable level of concurrency also depends on the GPU's compute capability.
Below we explore several scenarios of concurrent execution of the operations above.
  
==== Asynchronous Host and Device Execution ====
  
With asynchronous calls, device operations are queued and control returns to the host immediately, so the host can continue executing without blocking while the device works.
The following operations are asynchronous with respect to the host:
  - Kernel launches
  - Memory copies within a single device's address space
  - Host -> device memory copies of at most 64 KB
  - Memory copies performed by functions with the Async suffix
  - Memory set functions (initializing memory to a given value)
  
To debug asynchronous execution scenarios, asynchronous execution can be disabled completely by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. Kernel execution is also synchronous when running under a profiler (Nsight, Visual Profiler).
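As a sketch, assuming a CUDA binary named ./app, a debugging run with asynchronous launches disabled looks like this:

<code sh>
# ./app is a placeholder for your CUDA binary; every kernel launch
# now blocks until the kernel finishes, which makes crashes and
# error codes easier to attribute
CUDA_LAUNCH_BLOCKING=1 ./app
</code>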
  
==== Non-default Streams ====
  
To use cudaMemcpyAsync you must work with non-default streams, which in C/C++ can be declared, created, and destroyed in host (CPU) code as follows:
  
<code C>
cudaStream_t stream1;
cudaError_t result;
result = cudaStreamCreate(&stream1);
result = cudaStreamDestroy(stream1);
</code>
  
Once created, such a stream can be used for host -> device memory copies as follows:
  
<code C>
result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
</code>
To issue a kernel to a non-default stream, we specify the stream identifier as the fourth execution-configuration parameter. Note that there is also a third execution-configuration parameter, which is used to allocate device (GPU) shared memory; pass 0 if none is needed.

<code C>
increment<<<1, N, 0, stream1>>>(d_a);
</code>
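Putting the pieces together, a typical pattern queues the copy, the kernel, and the copy back on the same stream. The sketch below assumes d_a is a device array of N floats allocated with cudaMalloc and a is host memory allocated with cudaMallocHost (pinned memory, needed for truly asynchronous copies):

<code C>
cudaStream_t stream1;
cudaStreamCreate(&stream1);

// all three operations are queued on stream1 and execute in order,
// while the host continues immediately
cudaMemcpyAsync(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
increment<<<1, N, 0, stream1>>>(d_a);
cudaMemcpyAsync(a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost, stream1);

cudaStreamSynchronize(stream1);   // block only when the results are needed
cudaStreamDestroy(stream1);
</code>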
  
==== Asynchronous Kernel Execution ====
  
Architectures with compute capability 2.x or newer can execute different kernel instances in parallel. Such devices have the concurrentKernels property set to 1 (query the device properties first). The maximum number of concurrent launches of different kernels depends on the architecture (check it against the compute capability). The only restriction is that the kernels must belong to the same context.
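A sketch of how two different kernels (kernelA and kernelB are placeholders, as are their launch parameters) could be launched so that they may overlap on such a device:

<code C>
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// launches in different non-default streams may run concurrently
// on devices with concurrentKernels == 1
kernelA<<<blocks, threads, 0, s1>>>(dataA);
kernelB<<<blocks, threads, 0, s2>>>(dataB);

cudaDeviceSynchronize();          // wait for both kernels
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
</code>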
  
==== Asynchronous Execution and Data Transfer ====
  
Some devices can perform an asynchronous memory transfer alongside a kernel execution. This depends on the compute capability and can be checked via the asyncEngineCount device property.

{{:asc:lab9:cuda_async.png?900|}}
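A minimal sketch for querying this property at runtime:

<code C>
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // device 0
// 0: no overlap; 1: one copy direction can overlap kernel execution;
// 2: copies in both directions can overlap kernel execution
printf("asyncEngineCount = %d\n", prop.asyncEngineCount);
</code>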
  
Intra-device memory transfers can also be performed simultaneously with kernel execution when both the concurrentKernels and asyncEngineCount device properties are 1.
  
{{:asc:lab9:cuda_async_2.png?900|}}
  
===== Dynamic Parallelism =====
  
Dynamic parallelism is the ability to launch kernels from threads already running on the device/GPU. In other words, the GPU can assign itself new tasks/threads without any intervention from the host/CPU. This is useful for problems where the thread<->data mapping is not simple/trivial, for example when some threads would have too little work and others too much (the fluid simulation in the image below), i.e. a computationally unbalanced situation.
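A minimal sketch of the idea (childKernel and parentKernel are illustrative names; compiling device-side launches requires nvcc with relocatable device code, -rdc=true, and an sm_35 or newer target):

<code C>
__global__ void childKernel(int *data) {
    data[threadIdx.x] += 1;
}

__global__ void parentKernel(int *data) {
    // one thread per block decides, at runtime, to spawn extra work
    if (threadIdx.x == 0) {
        childKernel<<<1, 32>>>(data);   // kernel launched from the device
    }
}
</code>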
  
{{:asc:lab9:dynamic-paralellism.png?560|Fluid simulation}}
  
Dynamic parallelism requires at least the CUDA 5.0 Toolkit and Compute Capability 3.5. A list of NVIDIA GPUs and their Compute Capability can be found [[https://developer.nvidia.com/cuda-gpus|here]].

===== Exercises =====
  
  
Follow the instructions on [[https://gitlab.cs.pub.ro/asc/asc-public/-/tree/master/labs/cuda/advanced|GitLab]].
  
<note important>
We recommend that you always log out from the cluster servers once your session is over, using the ''exit'' command.

Alternatively, if you have left sessions open, you can check for them from fep8.grid.pub.ro with the ''squeue'' command. If you find such "hanging" sessions, please remove them with ''scancel ID'', where the ID is taken from the previous ''squeue'' output. You can use ''squeue -u username'' (your fep8.grid.pub.ro username), or ''squeue --me'', to see only your own sessions.

If you do not log out, you may end up unable to log in to the cluster nodes at all.
</note>
  
  
===== Resources =====

<hidden>
{{:asc:lab9:sol:lab9_sol.tar.gz|Lab 6 solution}}
</hidden>

/* {{:asc:lab9:sol:lab9_sol.zip|Lab 6 solution}} */

/* {{:asc:lab6:asc_lab9.pdf|Lab 6 statement}} */

  * Lab supervisors: Matei Barbu, Alexandru Bala
  
==== References ====
  * Bibliography
    * [[https://booksite.elsevier.com/9780124077263/downloads/advance_contents_and_appendices/appendix_C.pdf|Graphics and Computing GPUs]]
  * CUDA documentation
    * [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html|CUDA C Programming]]
    * [[https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf|CUDA NVCC compiler]]
    * [[https://docs.nvidia.com/cuda/profiler-users-guide/index.html|CUDA Visual Profiler]]
    * [[https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html|CUDA Dev Toolkit]]
    * [[https://developer.nvidia.com/cuda-gpus|CUDA GPUs]]
  * xl accelerators (NVIDIA P100)
    * [[https://www.nvidia.com/en-us/data-center/tesla-p100/|NVIDIA Pascal P100]]
  * Advanced CUDA
    * [[http://www-personal.umich.edu/~smeyer/cuda/grid.pdf|CUDA Thread Basics]]
    * [[https://devblogs.nvidia.com/even-easier-introduction-cuda|An Even Easier Introduction to CUDA]]
    * [[https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf|CUDA Streams 1]]
    * [[https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/|CUDA Streams 2]]
    * [[https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism/|CUDA Dynamic Parallelism]]
    * [[https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/|How to Overlap Data Transfers in CUDA C/C++]]
  
  
asc/laboratoare/06.1649039495.txt.gz · Last modified: 2022/04/04 05:31 by stefan_dan.ciocirlan
CC Attribution-Share Alike 3.0 Unported