Revisions of asc:laboratoare:06 compared: 2022/04/04 05:33 (stefan_dan.ciocirlan) and 2026/02/23 18:47, current (giorgiana.vlasceanu).
====== Lab 06 - NVIDIA CUDA GPU Architecture ======
  
For an in-depth understanding of the CUDA architecture we recommend reading "Graphics and Computing GPUs" from the bibliography. In what follows we discuss, in a more hands-on manner, implementations of this architecture.
  
The NVIDIA FERMI architecture [[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf|here]], Tesla M2070, execution queue fep8.grid.pub.ro -> dp
  
The NVIDIA KEPLER architecture [[https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf|here]], Tesla K40m, execution queue fep8.grid.pub.ro -> hpsl
  
The first fully programmable NVIDIA architecture was G80 (e.g. [[http://www.nvidia.com/page/8800_tech_briefs.html|Geforce 8800]], launched in 2006). With this architecture, the fixed vertex/pixel hardware units were replaced by unified shaders, which could process vertex and pixel as well as geometry workloads. The evolution of NVIDIA GPU architectures is detailed [[http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf|here]].
  
NVIDIA's GPGPU implementation is called CUDA (Compute Unified Device Architecture) and allows the C language to be used for programming NVIDIA GPUs. The list of GPUs that support the CUDA or OpenCL APIs can be found on the official site [[https://www.geforce.com/hardware/technology/cuda/supported-gpus|here]] or on the wiki [[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units|here]]. Each new architecture has a codename (e.g. Fermi, Pascal) and is identified by a "compute capability" (list [[https://developer.nvidia.com/cuda-gpus|here]]). The newer the architecture, the more features of the CUDA and OpenCL APIs are supported.
  
The GPU is well suited to SIMD (Single Instruction Multiple Data) data parallelism: the same instructions are executed in parallel on many processing units. Because the same program is executed for every data element, less hardware is needed for flow control, and because the computations are arithmetically intensive, memory access latency can be hidden with computation instead of large data caches.
  
The reason for the gap in parallel performance between CPUs and GPUs is that GPUs are specialized for massively parallel, compute-intensive processing (a perfect description of graphics rendering tasks) and are built so that most of the transistors on the chip are dedicated to data processing rather than data caching and execution flow control.
  
In NVIDIA GPUs, a Streaming Processor (SP) is a sequential-execution microprocessor containing a pipeline, arithmetic-logic units (ALU) and floating-point units (FPU). It has no cache, so it is only good at executing many mathematical operations. A single SP has unremarkable performance, but by scaling up the number of units, algorithms that lend themselves to massive parallelization can be run.
  
SPs, together with Special Function Units (SFU), are grouped into a Streaming Multiprocessor (SM/SMX). Each SFU contains floating-point multiply units, used for transcendental operations (sin, cos) and interpolation. The MT (multithreaded instruction fetch and issue) unit dispatches instructions for execution to the SPs and SFUs.
  
Besides these, there is also a (small) instruction cache, a data cache, and shared memory, shared among the SPs. The next level of encapsulation is the Texture / Processor Cluster (TPC). It contains several SMs, control logic and a texture-handling block, which takes care of texture addressing, texture filtering logic, and a texture cache.
  
{{:asc:lab11:cuda-arch.png?direct&720|}}
  
The philosophy behind the architecture is to allow a very large number of threads to run, made possible by the parallelism available at the hardware level.
  
The NVIDIA documentation recommends running as many threads as possible to execute a task. The CUDA architecture, for example, supports tens of thousands of threads, far more than the physical execution units on the chip. This is because a large number of threads can mask memory access latency.
  
Following the same modular model, threads are grouped into blocks (thread blocks, which the hardware executes as 32-thread warps), and blocks into a grid (thread grid). Each thread is identified by its index within the block, the index of its block within the grid, and the grid index. Thread and block indexes can be one-, two- or three-dimensional, while the grid index can be one- or two-dimensional (three-dimensional on newer architectures). This kind of partitioning makes it easier to program problems that use multi-dimensional data structures. Note that the threads within a thread block should execute as many identical instructions as possible, so as not to waste resources.
  
{{:asc:lab11:thread.blocks.jpg?direct&360|Thread structure within blocks}}
  
Threads within a block can cooperate by sharing data through shared memory and by synchronizing their execution. The barrier function only works for the threads within one block; synchronization is not possible at any other level (between blocks, at grid level, etc.). More explanations can be found in [[https://ocw.cs.pub.ro/courses/asc/laboratoare/06|Lab 6]].
  
===== Memory Hierarchy =====
  
Understanding the memory hierarchy is essential for programming the GPU efficiently. The GPU's large parallel execution capacity requires hiding the latency of accesses to main memory (VRAM for a discrete GPU, RAM for an integrated GPU).
  
{{:asc:lab11:mem.hierarchy.png?direct|The CUDA memory hierarchy}}
  
**Register File**
<code c>
/* mark regValPi for the compiler as a register-file value */
__private__ float regValPi = 3.14f;
/* the compiler will most likely place regVal2Pi in a register anyway */
float regVal2Pi = 2 * 3.14f;
</code>
  *The fastest form of memory on the GPU
  *Accessible only by its thread, with the same lifetime as the thread
  *A complex kernel may end up using a large number of registers, causing:
    * fewer threads being able to execute simultaneously
    * register spills, where register values are saved to global memory
  
**Local Memory**
<code c>
/* each work item stores one element (lid/gid: local/global work-item id,
 * N: illustrative array size) */
__local__ float lArray[N];
lArray[lid] = data[gid];
</code>
  *Depending on the hardware implementation, 100GB/sec -> 2TB/sec
  *On a GPU it is fast memory, acting as an L1 cache/extra register file; on a CPU it is usually just a portion of RAM
  *As with registers, it is accessible only by its thread, with the same lifetime as the thread
  
**Shared Memory**
<code c>
/* the elements are stored per block */
__shared__ int s[64];
</code>
  *Accessible by all the threads in a block, with the same lifetime as the block
  *Access conflicts (bank conflicts) must be avoided
  
**Constant Memory**
<code c>
__constant__ float pi = 3.14f;
</code>
  * Depending on the hardware implementation, 100GB/sec -> 1TB/sec
  * Generally very good performance (L1/L2 cached, dedicated area)
  * Has the lifetime of the application
  
**Global Memory**
<code c>
__global__ void process(float* data){ ... } // OpenCL: __kernel void process(__global float* data)
</code>
  * Depending on the hardware implementation, 30GB/sec -> 500GB/sec
  * Video RAM (VRAM), usually between 1GB and 12GB depending on the video card
  * Dedicated specialized memory, only on discrete graphics cards (GPUs integrated into the CPU use RAM)
  * Generally a wide bus (256-512 bits) and high-speed memory chips (GDDR7)
  
**Host Memory (RAM)**
  * In general, 4GB/sec -> 30GB/sec
  * Access from a kernel requires an explicit RAM->VRAM transfer/mapping on the host/CPU side
  * RAM is accessible directly by the CPU and indirectly by the GPU via DMA and the PCIe bus
  * Transfer speed (throughput/latency) is limited by both the PCIe bus and the RAM itself
  
Tesla K40m GPU characteristics (hpsl queue), via a CUDA device-properties query:
  
<code sh>
Device 0: "Tesla K40m"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11441 MBytes (11996954624 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 8 / 0
</code>
  
Tesla M2070 GPU characteristics (dp queue), via a CUDA device-properties query:
  
<code sh>
Device 0: "Tesla M2070"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 5302 MBytes (5559156736 bytes)
  (14) Multiprocessors, ( 32) CUDA Cores/MP:     448 CUDA Cores
  GPU Max Clock rate:                            1147 MHz (1.15 GHz)
  Memory Clock rate:                             1566 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 20 / 0
</code>
  
Tesla P100 GPU characteristics (xl queue), via a CUDA device-properties query:
  
<code sh>
Device 1: "Tesla P100-PCIE-16GB"
  CUDA Driver Version / Runtime Version          12.1 / 11.4
  CUDA Capability Major/Minor version number:    6.0
  Total amount of global memory:                 16276 MBytes (17066885120 bytes)
  (056) Multiprocessors, (064) CUDA Cores/MP:    3584 CUDA Cores
  GPU Max Clock rate:                            1329 MHz (1.33 GHz)
  Memory Clock rate:                             715 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 142 / 0
</code>
  
NVIDIA A100 GPU characteristics (ucsx queue), via a CUDA device-properties query:
  
<code sh>
Device 0: "NVIDIA A100-PCIE-40GB"
  CUDA Driver Version / Runtime Version          12.1 / 11.4
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40326 MBytes (42285268992 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 49 / 0
</code>
  
===== Memory Access Optimization =====
  
The way we access memory strongly influences system performance. Since architectures can differ greatly in their memory hierarchies, it is important to understand that we cannot write one program that runs optimally in all cases. A CUDA program is portable, in that it can easily run on different NVIDIA CUDA architectures, but most of the time it must be tuned per architecture for optimal performance.
  
In general, for GPU architectures, shared memory is divided into identical SRAM modules, called memory banks. Each bank holds one successive 32-bit value (for example, an int or a float), so that consecutive accesses into an array by consecutive threads are very fast. Bank conflicts occur when multiple requests target data located in the same memory bank.
  
<note important>
Memory bank (cache) access conflicts can significantly reduce performance.
</note>
  
When a bank conflict occurs, the hardware serializes the memory operations (warp/wavefront serialization), forcing all threads to wait until the memory operations complete. In some cases, if all threads read the same shared-memory address, a broadcast mechanism is automatically invoked and serialization is avoided. The broadcast mechanism is very efficient, and its use is recommended whenever possible.

For example, if a cache line consists of 16 memory banks, the following situations impact cache access performance (the OpenCL equivalent appears in the comments, since CUDA shared memory corresponds to local memory in OpenCL):
  
<code c>
__global__ void func(...) {       // __kernel void func(...)
...
   __shared__ int *array;         // __local int *array;
   x = array[threadIdx.x];        // x = array[get_local_id(0)];       => 100% performance,  0 bank conflicts
   x = array[threadIdx.x + 1];    // x = array[get_local_id(0) + 1];   => 100% performance,  0 bank conflicts
   x = array[threadIdx.x * 4];    // x = array[get_local_id(0) * 4];   =>  25% performance,  4 bank conflicts
   x = array[threadIdx.x * 16];   // x = array[get_local_id(0) * 16];  =>   6% performance, 16 bank conflicts
...
}
</code>
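To see where the percentages in the listing come from, here is a small host-side C sketch (purely illustrative, not CUDA API code; the helper name ''conflict_degree'' is ours, and the model assumes 16 banks with one 32-bit word per bank, as on early CUDA hardware) that counts how many threads of a 16-thread half-warp land in the same bank for a given access stride:

<code c>
#include <stdio.h>

/* Illustrative model: for the access pattern array[tid * stride]
 * executed by num_banks threads, the bank of word i is i % num_banks.
 * The "conflict degree" is the largest number of threads that hit the
 * same bank: 1 means conflict-free, N means the access is serialized
 * into N separate transactions. */
static int conflict_degree(int stride, int num_banks)
{
    int hits[64] = {0};   /* enough for up to 64 banks */
    int worst = 1;

    for (int tid = 0; tid < num_banks; tid++) {
        int bank = (tid * stride) % num_banks;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}

int main(void)
{
    /* The strides used in the listing above, for 16 banks. */
    int strides[] = {1, 4, 16};

    for (int i = 0; i < 3; i++)
        printf("stride %2d -> %2d-way bank conflict\n",
               strides[i], conflict_degree(strides[i], 16));
    return 0;
}
</code>

A conflict degree of N serializes the access into N transactions, so effective shared-memory bandwidth drops to roughly 1/N of the peak; that is where the 25% (stride 4) and ~6% (stride 16) figures come from. Note that ''threadIdx.x + 1'' only shifts the bank mapping, so it stays conflict-free.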
  
On CPU architectures, shared memory is just a region of RAM, so optimizations that keep critical data in GPU shared memory would not bring the same performance improvements there.
  
====== Applications ======

Follow the instructions on [[https://gitlab.cs.pub.ro/asc/asc-public/-/tree/master/labs/cuda/arch|GitLab]].
  
<note important>
We recommend that you always log out from the cluster servers after finishing your session, using the ''exit'' command.

Alternatively, if you forgot about open sessions, you can check for them from fep8.grid.pub.ro using the ''squeue'' command. If you find such "hanging" sessions, you can cancel them (and please do so) with ''scancel ID'', where you get the ID from the output of the previous ''squeue'' command. More precisely, you can use ''squeue -u username'' (your fep8.grid.pub.ro username) to see only the sessions that concern you.

If you do not log out, you may end up unable to log in to the cluster nodes at all.
</note>
  
===== Resources =====

<hidden>
{{:asc:lab8:sol:lab8_sol.tar.gz|Lab 8 solution}}
</hidden>

  * Lab supervisors: Matei Barbu, Alexandru Bala
  
==== References ====
  * Bibliography
    * [[https://booksite.elsevier.com/9780124077263/downloads/advance_contents_and_appendices/appendix_C.pdf|Graphics and Computing GPUs]]
  * CUDA documentation:
    * [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html|CUDA C Programming]]
    * [[https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf|CUDA NVCC Compiler]]
    * [[https://docs.nvidia.com/cuda/profiler-users-guide/index.html|CUDA Visual Profiler]]
    * [[https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html|CUDA Dev Toolkit]]
    * [[https://developer.nvidia.com/cuda-gpus|CUDA GPUs]]
  * xl accelerators (NVIDIA P100)
    * [[https://www.nvidia.com/en-us/data-center/tesla-p100/|NVIDIA Pascal P100]]
  * Advanced CUDA
    * [[https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf|CUDA Streams 1]]
    * [[https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/|CUDA Streams 2]]
    * [[https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism/|CUDA Dynamic Parallelism]]
    * [[http://www-personal.umich.edu/~smeyer/cuda/grid.pdf|CUDA Thread Basics]]
    * [[https://devblogs.nvidia.com/even-easier-introduction-cuda/|An Even Easier Introduction to CUDA]]
  
asc/laboratoare/06.1649039600.txt.gz · Last modified: 2022/04/04 05:33 by stefan_dan.ciocirlan