The purpose of this lab is to become familiar with application profiling and debugging, using dedicated tools for spotting performance bottlenecks.
We will focus on the following open source tools:
- valgrind (callgrind, cachegrind), with callgrind_annotate and cg_annotate for visualizing the results (on the haswell partition)
- perf (on your own systems only, as hardware counters are not available inside Apptainer containers)
Valgrind is a tool used for memory debugging, memory leak detection and profiling. It is also a generic framework for creating dynamic analysis tools, such as memory checkers [1].
Valgrind is in essence a virtual machine using just-in-time compilation techniques, including dynamic recompilation. It is important to keep in mind that nothing from the original program ever gets run directly on the host processor. Instead, Valgrind translates the input program into a simpler, processor-neutral form called Intermediate Representation (IR). A tool [2] then applies whatever transformations it needs to the IR, and the resulting IR is translated back into machine code and run on the host processor.
The tools available in Valgrind are:
In this section, we will focus on analyzing a software application. We will analyze both a serial and a parallel implementation. The application is called “tachyon” and you can find the source code attached to this lab.
On your own system, before compilation, you must install the X11 dev tools and create a set of symlinks. For Ubuntu 64 bit, we must do the following:
sudo apt-get install libx11-dev
sudo mkdir -p /usr/lib64
sudo ln -s /usr/lib/x86_64-linux-gnu/libX11.so /usr/lib64/libX11.so
sudo ln -s /usr/lib/x86_64-linux-gnu/libXext.so /usr/lib64/libXext.so
To compile it, extract the Tachyon archive to a local disk and run make. You can test the build by running the following in the same directory:
./tachyon_find_hotspots dat/balls.dat
You should see a window like the one below:
To run the tool and explore its features, follow the steps below:
wget -O tachyon_vtune_amp_xe.tgz http://ocw.cs.pub.ro/courses/_media/asc/lab6/tachyon_vtune_amp_xe.tgz
gunzip tachyon_vtune_amp_xe.tgz
tar -xvf tachyon_vtune_amp_xe.tar
cd tachyon
make
1. Make sure Valgrind and KCachegrind are installed on your system (or log in to the hp-sl.q queue) and that the application is in its initial state, without any modifications.
sudo apt-get update
sudo apt-get install valgrind kcachegrind
2. We will use the tool callgrind to get information from the running application. Run the following command line:
valgrind --tool=callgrind --collect-jumps=yes --dump-instr=yes --collect-systime=yes -- ./tachyon_find_hotspots dat/balls.dat
3. Open the profile in KCachegrind and click on the Callee Map tab. Also, make sure that the buttons % Relative, Cycle detection and Relative to parent are selected. You should see something like this:
From this image, we can see that Callgrind attributes about 78% of the total cost to the initialize_2D_buffer function. Double-click the square containing the function name, then select the “Source code” tab to see the problematic code.
Perf is a performance analysis tool, available in the Linux kernel since version 2.6.31 [5]. Its userspace control application is run from the command line and provides a number of subcommands. Unlike Valgrind, perf is capable of statistical profiling both of the entire system (kernel and userspace) and on a per-process (PID) basis. It supports hardware performance counters, tracepoints, software performance counters (e.g. hrtimer), and dynamic probes (for example, kprobes or uprobes).
Perf is used with several subcommands:
1. Make sure perf is installed on your system and that the application is in its initial state, without any modifications. You can only run perf as root, so you can only do this on your own system.
2. Run the following command line:
perf record -a -g -- ./tachyon_find_hotspots dat/balls.dat
For other perf parameters, consult the perf documentation (for example, man perf-record).
3. Run the following command line to view the collected results:
perf report
You should see a screen like the following:
From this image you can see that perf displays in red the symbol of the function that consumes the most CPU time. In our case it is _Z20initialize_2D_bufferPjS_, the mangled name of the same function that VTune and Valgrind identified. You can recover the source-level name with c++filt:
c++filt _Z20initialize_2D_bufferPjS_
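The mangled name follows the Itanium C++ ABI scheme used by GCC and Clang: the _Z prefix, then the length of the identifier (20), the identifier itself, and finally the encoded parameter types (Pj is a pointer to unsigned int, and S_ repeats that type), so c++filt prints initialize_2D_buffer(unsigned int*, unsigned int*). As a minimal sketch, assuming only this simple, non-nested form, the hypothetical helper simple_function_name below extracts the identifier in Python. It is an illustration of the encoding, not a replacement for c++filt.

```python
import re


def simple_function_name(mangled: str) -> str:
    """Extract the identifier from a simple Itanium-ABI mangled symbol.

    Handles only the plain form _Z<length><identifier><parameter types>.
    Namespaces, templates and operators are out of scope for this sketch.
    """
    match = re.match(r"_Z(\d+)", mangled)
    if match is None:
        raise ValueError(f"not a simple mangled name: {mangled!r}")
    length = int(match.group(1))   # number of characters in the identifier
    start = match.end()            # the identifier begins right after the digits
    name = mangled[start:start + length]
    if len(name) != length:
        raise ValueError(f"truncated mangled name: {mangled!r}")
    return name


print(simple_function_name("_Z20initialize_2D_bufferPjS_"))
```

For anything beyond this simple case (namespaces, templates, overloaded operators), rely on c++filt itself.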
For this lab you can use the cluster systems through fep.grid.pub.ro:
wget https://ocw.cs.pub.ro/courses/_media/asc/laboratoare/lab4_skl.tar.gz -O lab4_skl.tar.gz
tar -xzvf lab4_skl.tar.gz
srun --pty bash
apptainer run docker://gitlab.cs.pub.ro:5050/asc/asc-public/c-labs:1.3.1 /bin/bash
Task 0 - Use Callgrind on task0.c, following the TODOs in the code.
# Serial version
Apptainer> make task0
Apptainer> valgrind --tool=callgrind -v --dump-every-bb=10000000 ./task0
Apptainer> callgrind_annotate callgrind.out.<pid>
# Parallel version
Apptainer> make clean
Apptainer> make openmp_task0
Apptainer> valgrind --tool=callgrind -v --dump-every-bb=10000000 ./task0
Apptainer> callgrind_annotate callgrind.out.<pid>
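Because the <pid> part of callgrind.out.<pid> changes on every run, it can be convenient to locate the most recent profile automatically. The following is a minimal sketch in Python, assuming the profiles are written to the directory you pass in and that Python is available in your environment. The helper name newest_callgrind_profile is hypothetical and not part of Valgrind.

```python
from pathlib import Path


def newest_callgrind_profile(directory="."):
    """Return the most recently modified callgrind.out.* file, or None.

    The glob also matches the numbered dumps produced by periodic dumping
    (for example callgrind.out.<pid>.1).
    """
    profiles = list(Path(directory).glob("callgrind.out.*"))
    if not profiles:
        return None
    # Pick the file with the latest modification time.
    return max(profiles, key=lambda path: path.stat().st_mtime)
```

You can then pass the returned path to callgrind_annotate instead of typing the PID by hand.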
Note and explain the following observations:
... Ir)?
Task 1 - Analyze the Tachyon application (ray tracer).
Apptainer> ./task1.sh
# Serial version
Apptainer> valgrind --tool=callgrind --collect-jumps=yes --dump-instr=yes --collect-systime=yes -- ./tachyon_find_hotspots dat/balls.dat
Apptainer> callgrind_annotate callgrind.out.<pid>
# Parallel version
Apptainer> valgrind --tool=callgrind --collect-jumps=yes --dump-instr=yes --collect-systime=yes -- ./tachyon_analyze_locks dat/balls.dat
Apptainer> callgrind_annotate callgrind.out.<pid>
# Perf (software counters only inside the container)
Apptainer> perf stat ./tachyon_find_hotspots dat/balls.dat
Apptainer> perf stat ./tachyon_analyze_locks dat/balls.dat
Note and explain the following observations:
... initialize_2D_buffer) ... analyze_locks? Is it necessary there?
Task 2 - Analyze matrix multiplication with different loop orderings using Cachegrind.
... task2.c
Apptainer> make task2
Apptainer> valgrind --tool=cachegrind --cache-sim=yes ./task2 1
Apptainer> valgrind --tool=cachegrind --cache-sim=yes ./task2 2
Apptainer> valgrind --tool=cachegrind --cache-sim=yes ./task2 3
Fill in the table with the results you obtain:
| Metric | Mode 1 (ijk) | Mode 2 (ikj) | Mode 3 (jki) |
|---|---|---|---|
| I refs | | | |
| D refs | | | |
| D1 miss rate | | | |
| LLd miss rate | | | |
Note and explain the following observations:
- I refs and D refs across the 3 orderings. What do you observe?
- D1 miss rate across the 3 orderings. Which mode is the most cache-efficient, and why?
- LLd miss rate across the 3 orderings. What do you observe?
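One way to build intuition for the D1 miss rate differences is a toy cache model. The sketch below, in Python, replays the memory accesses of C[i][j] += A[i][k] * B[k][j] for the three loop orderings through a small fully associative LRU cache and counts the misses. Everything in it is an assumption made for illustration: the matrix size, line size, cache capacity and row-major layout are hypothetical and not taken from task2.c. Cachegrind itself simulates set-associative caches, so the absolute numbers will not match its output.

```python
import itertools
from collections import OrderedDict

N = 32            # matrix dimension (hypothetical, chosen to keep the run short)
LINE_ELEMS = 8    # elements per cache line (a 64-byte line of 8-byte doubles)
CACHE_LINES = 16  # capacity of the toy cache, in lines


def count_misses(order):
    """Count cache misses for C += A * B with the given loop order.

    `order` lists the loop variables from outermost to innermost,
    for example "ikj". Matrices are stored row-major, back to back.
    """
    cache = OrderedDict()  # keys are line numbers, oldest entry first
    misses = 0

    def touch(address):
        nonlocal misses
        line = address // LINE_ELEMS
        if line in cache:
            cache.move_to_end(line)          # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > CACHE_LINES:
                cache.popitem(last=False)    # evict the least recently used line

    a_base, b_base, c_base = 0, N * N, 2 * N * N
    pos_i, pos_j, pos_k = order.index("i"), order.index("j"), order.index("k")
    # itertools.product varies the last component fastest, like an innermost loop.
    for loop_vars in itertools.product(range(N), repeat=3):
        i, j, k = loop_vars[pos_i], loop_vars[pos_j], loop_vars[pos_k]
        touch(a_base + i * N + k)  # A[i][k]
        touch(b_base + k * N + j)  # B[k][j]
        touch(c_base + i * N + j)  # C[i][j]
    return misses


results = {order: count_misses(order) for order in ("ijk", "ikj", "jki")}
for order, misses in results.items():
    print(order, misses)
```

In this model the ikj ordering produces the fewest misses, because its innermost loop walks B and C along rows (stride 1) while A[i][k] stays fixed. The jki ordering produces the most, because its innermost loop walks both A and C down columns, touching a new cache line on every access. Compare this qualitative ranking with the D1 miss rates you measured.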