This shows you the differences between two versions of the page.
|
ep:labs:03:contents:tasks:ex4 [2025/02/11 23:28] cezar.craciunoiu created |
ep:labs:03:contents:tasks:ex4 [2026/03/16 19:48] (current) radu.mantu |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ==== 04. [25p] llvm-mca ==== | + | ==== 04. [20p] Hardware Counters ==== |
| - | **llvm-mca** is a machine code analyzer that simulates the execution of a sequence of instructions. By leveraging high-level knowledge of the micro-architectural implementation of the CPU, as well as its execution pipeline, this tool is able to determine the execution speed of said instructions in terms of clock cycles. More importantly though, it can highlight possible contentions of two or more instructions over CPU resources or rather, its [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler_Ports_.26_Execution_Units|ports]]. | + | A significant portion of the system statistics that can be generated involve hardware counters. As the name implies, these are special registers that count the number of occurrences of specific events in the CPU. These counters are implemented through **Model Specific Registers** (MSR), control registers used by developers for debugging, tracing, monitoring, etc. Since these registers may be subject to changes from one iteration of a microarchitecture to the next, we will need to consult chapters 18 and 19 from [[https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html|Intel 64 and IA-32 Architectures Developer's Manual: Vol. 3B]]. |
| - | Note that **llvm-mca** is not the most reliable tool when predicting the precise runtime of an instruction block (see [[https://dspace.mit.edu/bitstream/handle/1721.1/128755/ithemal-measurement.pdf?sequence=2&isAllowed=y|this paper]] for details). After all, CPUs are not as simple as the good old AVR microcontrollers. While calculating the execution time of an AVR linear program (i.e.: no conditional loops) is as simple as adding up the clock cycles associated with each instruction (from the reference manual), things are never that clear-cut when it comes to CPUs. CPU manufacturers such as Intel oftentimes implement hardware optimizations that are not documented or even publicized. For example, we know that the CPU caches instructions in case a loop is detected. If this is the case, then the instructions are dispatched once again from the buffer, thus avoiding extra instruction fetches. What happens, though, if the size of the loop's contents exceeds this buffer size? Obviously, without knowing certain aspects such as this buffer size, not to mention anything about microcode or unknown hardware optimizations, it is impossible to give accurate estimates. | + | The instructions that are used to interact with these counters are [[https://www.felixcloutier.com/x86/rdmsr|RDMSR]], [[https://www.felixcloutier.com/x86/wrmsr|WRMSR]] and [[https://www.felixcloutier.com/x86/rdpmc|RDPMC]]. Normally, these are considered privileged instructions (that can be executed only in ring0, aka. kernel space). As a result, acquiring this information from ring3 (user space) requires a context switch into ring0, which we all know to be a costly operation. The objective of this exercise is to prove that this is not necessarily the case and that it is possible to configure and examine these counters from ring3 in as few as a couple of clock cycles. |
| - | {{ :ep:labs:01:contents:tasks:cpu_exec_unit.png?800 |}} | + | Before getting started, one thing to note is that there are two types of performance counters: |
| - | <html> | + | - Fixed Function Counters |
| - | <center> | + | * each can monitor a single, distinct and predetermined event (burned in hardware) |
| - | <b>Figure 2:</b> Simplified view of a single Intel Skylake CPU core. Instructions are decoded into μOps and scheduled out-of-order onto the Execution Units. Your CPUs most likely have (many) more EUs. | + | * are configured a bit differently than the other type |
| - | </center> | + | * are not of interest to us in this laboratory |
| - | </html> | + | - General Purpose Counters |
| + | * can be configured to monitor a specific event from a list of over 200 (see chapters 19.1 and 19.2) | ||
| - | === [5p] Task A - Preparing the input === | + | Here is an overview of the following five tasks: |
| + | * **Task A**: check the version ID of your CPU to determine what it's capable of monitoring. | ||
| + | * **Task B**: set a certain bit in CR4 to enable ring3 usage of the RDPMC instruction. | ||
| + | * **Task C**: use some ring3 tools to enable the hardware counters. | ||
| + | * **Task D**: start counting L2 cache misses. | ||
| + | * **Task E**: use RDPMC to measure the cache misses for a familiar program. | ||
| - | As previously mentioned, llvm-mca requires assembly code as input, so start by preparing it from the source provided in the archive. | + | === Task A - Hardware info === |
| - | Since **llvm-mca** requires assembly code as input, we first need to translate the C source provided in the archive. Because the assembly parser it utilizes is the same as **clang**'s, use it to compile the C program but stop after the LLVM generation and optimization stages, when the target-specific assembly code is generated. | + | First of all, we need to know what we are working with. Namely, the microarchitecture //version ID// and the //number of counters// per core. To this end, we will use [[https://linux.die.net/man/1/cpuid|cpuid]] (basically a wrapper over the [[https://www.felixcloutier.com/x86/cpuid|CPUID]] instruction). All the information that we need will be contained in the 0AH leaf (you might want to get the raw output of **cpuid**): |
| + | * **CPUID.0AH:EAX[15:8]** : number of general purpose counters | ||
| + | * **CPUID.0AH:EAX[7:0]** : version ID | ||
| + | * **CPUID.0AH:EDX[7:0]** : number of fixed function counters | ||
| - | <note> | + | Note: the first two columns of the output represent the EAX and ECX registers used when calling CPUID. If the most significant bit in EAX is 1 (i.e.: starts with 0x8), the output is for extended options. ECX is a relatively new addition. So when looking for the 0AH leaf, search for a line starting with **0x0000000a**. The register contents following ':' represent the output of the instruction. |
| - | Note how in the [[https://llvm.org/docs/CommandGuide/llvm-mca.html|llvm-mca documentation]] it is stated that the ''LLVM-MCA-BEGIN'' and ''LLVM-MCA-END'' markers can be parsed (as assembly comments) in order to restrict the scope of the analysis. | + | |
| + | Point out to your assistant which is which in the **cpuid** output. | ||
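The same fields can also be pulled out programmatically. Below is a minimal sketch using GCC's ''<cpuid.h>'' wrapper around the CPUID instruction; the helper names are ours, and note that inside some virtual machines leaf 0AH may read as all zeros:

```c
#include <cpuid.h>
#include <stdint.h>

/* Decode the CPUID.0AH fields listed above. */
static inline void decode_leaf_0a(uint32_t eax, uint32_t edx,
                                  unsigned *version, unsigned *n_gp,
                                  unsigned *n_fixed)
{
    *version = eax & 0xff;         /* CPUID.0AH:EAX[7:0]  - version ID       */
    *n_gp    = (eax >> 8) & 0xff;  /* CPUID.0AH:EAX[15:8] - GP counters      */
    *n_fixed = edx & 0xff;         /* CPUID.0AH:EDX[7:0]  - fixed counters   */
}

/* Query leaf 0AH directly; returns 0 if the leaf is unavailable. */
static inline int pmu_info(unsigned *version, unsigned *n_gp, unsigned *n_fixed)
{
    uint32_t eax, ebx, ecx, edx;

    if (!__get_cpuid_count(0x0a, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    decode_leaf_0a(eax, edx, version, n_gp, n_fixed);
    return 1;
}
```

The decoded values should match what you read off the **0x0000000a** line of the raw **cpuid** dump.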
| - | These markers can also be placed in C code (see [[https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html|gcc extended asm]] and [[https://llvm.org/docs/LangRef.html#inline-assembler-expressions|llvm inline asm expressions]]): | + | === Task B - Unlock RDPMC in ring3 === |
| - | <code c> | + | |
| - | asm volatile("# LLVM-MCA-BEGIN" ::: "memory"); | + | |
| - | </code> | + | |
| - | Remember, however, that this approach is not always desirable, for two reasons: | + | Due to security considerations, reading the Performance Monitor Counters from userspace is normally not allowed. This is enforced at a hardware level via the **Performance-Monitor Counter Enable** bit in [[https://en.wikipedia.org/wiki/Control_register#CR4|CR4]]. |
| - | - Even though this is just a comment, the ''volatile'' modifier can pessimize optimization passes. As a result, the generated code may not correspond to what would normally be emitted. | + | |
| - | - Some code structures can not be included in the analysis region. For example, if you want to include the contents of a ''for'' loop, doing so by injecting assembly meta comments in C code will exclude the incrementation and condition check (which are also executed on every iteration). | + | Under normal circumstances, modifying Control Registers from userspace is not possible and you would have to write a kernel module for this. However, the [[https://man.archlinux.org/man/perf_event_open.2#perf_event_related_configuration_files|perf_event_open() man page]] documents a //sysfs// interface (i.e., ''/sys/bus/event_source/devices/cpu/rdpmc'') that does this for us. |
| - | </note> | + | |
| + | Use the //sysfs// interface to revert the **RDPMC** access behavior to the pre-4.0 version. | ||
| <solution -hidden> | <solution -hidden> | ||
| - | clang my_pow.c -masm=intel -S -o my_pow.S | + | <code bash> |
| + | $ echo 2 | sudo tee /sys/bus/event_source/devices/cpu/rdpmc | ||
| + | </code> | ||
| </solution> | </solution> | ||
| - | === [10p] Task B - Analyzing the assembly code === | + | === Task C - Configure IA32_PERF_GLOBAL_CTRL === |
| - | After disassembling the code, use **llvm-mca** to inspect its expected throughput and "pressure points" (check out [[https://en.algorithmica.org/hpc/profiling/mca/|this example]]). | + | {{ :ep:labs:01:contents:tasks:ia32_perf_global_ctrl.png?600 }} |
| + | <html> <center> | ||
| + | <b>Figure 2:</b> Control register for the Fixed Function and General Purpose counters. While setting a bit will enable the associated counter, clearing it will disable it. Note that for a counter to be enabled, both this bit and the EN bit in its configuration register must be set. If either is cleared, the counter is disabled. The purpose of this register is to simultaneously change the active state of multiple counters, with a single write instruction. | ||
| + | </center> </html> | ||
| - | One important thing to remember is that **llvm-mca** does not simulate the //behaviour// of each instruction, but only the time required for it to execute. In other words, if you load an immediate value in a register via ''mov rax, 0x1234'', the analyzer will not care //what// the instruction does (or what the value of ''rax'' even is), but how long it takes the CPU to do it. The implication is quite significant: **llvm-mca** is incapable of analyzing complex sequences of code that contain conditional structures, such as ''for'' loops or function calls. Instead, given the sequence of instructions, it will pass through each of them one by one, ignoring their intended effect: conditional jump instructions will fall through, ''call'' instructions will be passed over not even considering the cost of the associated ''ret'', etc. The closest we can come to analyzing a loop is by reducing the analysis scope via the aforementioned ''LLVM-MCA-*'' markers and controlling the number of simulated iterations from the command line. | + | The **IA32_PERF_GLOBAL_CTRL** (0x38f) MSR is an addition from //version 2// that allows enabling / disabling multiple counters with a single WRMSR instruction. What happens, in layman's terms, is that the CPU performs an AND between each ENABLE bit in this register and its counterpart in the counter's original configuration register from //version 1// (which we will deal with in the next task). If the result is 1, the counter registers the programmed event every clock cycle. Normally, all these bits should be set by default during the booting process but it never hurts to check. Also, note that this register exists for each logical core. |
| - | To solve this issue, you can set the number of iterations from the command line, so its behaviour can resemble an actual loop. | + | While for CR4 we could rely on the //sysfs// interface, for MSRs we have user space tools that take care of this for us ([[http://manpages.ubuntu.com/manpages/trusty/man1/rdmsr.1.html|rdmsr]] and [[http://manpages.ubuntu.com/manpages/trusty/man1/wrmsr.1.html|wrmsr]]) by interacting with a driver called **msr** (install **msr-tools** if it's missing from your system). But first, we must load this driver. |
| - | <note> | + | <code bash> |
| - | Read more on the [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler|Skylake instruction scheduler and ports]]. | + | $ lsmod | grep msr |
| + | $ sudo modprobe msr | ||
| + | $ lsmod | grep msr | ||
| + | msr 16384 0 | ||
| + | </code> | ||
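Under the hood, **rdmsr** simply uses the character devices that the **msr** driver creates (one ''/dev/cpu/<n>/msr'' per logical core): a ''pread()'' at a file offset equal to the MSR address returns that register's 8 bytes. A sketch of the same read in C — the helper name is ours, and it requires root privileges plus the loaded driver to actually succeed:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one MSR from one logical core via the msr driver's device
 * file; the file offset passed to pread() is the MSR address.
 * Returns 0 on success, -1 on failure (no device / no permission). */
static int read_msr(int cpu, uint32_t msr, uint64_t *val)
{
    char path[64];
    int fd, ret = -1;

    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (pread(fd, val, sizeof(*val), msr) == (ssize_t)sizeof(*val))
        ret = 0;
    close(fd);
    return ret;
}
```

For example, ''read_msr(0, 0x38f, &val)'' mirrors ''rdmsr -p 0 0x38f''.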
| - | A very short description of each port's main usage: | + | Next, let us read the value in the IA32_PERF_GLOBAL_CTRL register. If the result differs from what you see in the snippet below, overwrite the value (the **-a** flag specifies that we want the command to run on each individual logical core). |
| - | * **Port 0,1:** arithmetic instructions | + | |
| - | * **Port 2,3:** load operations, AGU (address generation unit) | + | |
| - | * **Port 4:** store operations, AGU | + | |
| - | * **Port 5:** vector operations | + | |
| - | * **Port 6:** integer and branch operations | + | |
| - | * **Port 7:** AGU | + | |
| - | The significance of the SKL ports reported by **llvm-mca** can be found in the [[https://github.com/llvm/llvm-project/blob/d9be232191c1c391a0d665e976808b2a12ea98f1/llvm/lib/Target/X86/X86SchedSkylakeClient.td#L32|Skylake machine model config]]. To find out if your CPU belongs to this category, [[https://github.com/llvm/llvm-project/blob/27c5a9bbb01a464bb85624db2d0808f30de7c996/llvm/lib/TargetParser/Host.cpp#L765|RTFS]] and run an ''inxi -Cx''. | + | <code bash> |
| + | $ sudo rdmsr -a 0x38f | ||
| + | 70000000f | ||
| + | $ sudo wrmsr -a 0x38f 0x70000000f | ||
| - | </note> | + | </code> |
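With the layout from Figure 2 in mind, ''0x70000000f'' decodes to "all counters on": bits 0-3 enable the four general purpose counters and bits 32-34 the three fixed function ones. A small sketch of that decoding (the helper names are ours; counter counts come from CPUID leaf 0AH):

```c
#include <stdint.h>

/* Extract the general purpose counter enable bits (low bits of the
 * IA32_PERF_GLOBAL_CTRL value, one per counter). */
static inline uint32_t global_ctrl_gp_mask(uint64_t msr, unsigned n_gp)
{
    return (uint32_t)(msr & ((1ull << n_gp) - 1));
}

/* Extract the fixed function counter enable bits (starting at bit 32). */
static inline uint32_t global_ctrl_ff_mask(uint64_t msr, unsigned n_fixed)
{
    return (uint32_t)((msr >> 32) & ((1ull << n_fixed) - 1));
}
```

If either mask comes back with cleared bits, the **wrmsr** call above restores the expected value.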
| + | |||
| + | === Task D - Configure IA32_PERFEVENTSELx === | ||
| + | |||
| + | {{ :ep:labs:01:contents:tasks:ia32_perfeventselx.png?600 }} | ||
| + | <html> <center> | ||
| + | <b>Figure 3:</b> Configuration register for individual counters. Of interest to us are the EN bit (mentioned in the previous subsection), | ||
+ | the event selection fields, and the user mode bit. Note how the USR bit can only distinguish between ring 0 and ring 3. While rings 1 and 2 are still present in the CPU's implementation today, no mainstream operating system has used them in over 30 years. The PMC, being a newer addition, acknowledges this reality by simplifying the control interface as much as possible. It is not clear if rings 1 and 2 are blind spots for PMCs or if they are covered under ring 0. | ||
| + | </center> </html> | ||
| + | |||
+ | The **IA32_PERFEVENTSELx** are MSRs from //version 1// that are used to configure the monitored event of a certain counter, its enabled state and a few other things. We will not go into detail and instead only mention the fields that interest us right now (you can read about the rest in the Intel manual). Note that the //x// in the MSR's name stands for the counter number. If we have 4 counters, it takes values in the 0:3 range. The one that we will configure is IA32_PERFEVENTSEL0 (0x186). If you want to configure more than one counter, note that they have consecutive register numbers (i.e.: 0x187, 0x188, etc.). | ||
| + | |||
| + | As for the register flags, those that are not mentioned in the following list should be left cleared: | ||
| + | * **EN** (enable flag) = **1** starts the counter | ||
| + | * **USR** (user mode flag) = **1** monitors only ring3 events | ||
| + | * **UMASK** (unit mask) = **??** depends on the monitored event (see chapter 19.2) | ||
| + | * **EVSEL** (event select) = **??** depends on the monitored event (see chapter 19.2) | ||
| + | |||
+ | Before actually writing to this register, we should verify that no one is currently using it. If it is indeed free, we might also want to clear **IA32_PMC0** (0xc1), the actual counter associated with PERFEVENTSEL0. | ||
| + | |||
| + | <code bash> | ||
| + | $ sudo rdmsr -a 0x186 | ||
| + | 0 | ||
| + | $ sudo wrmsr -a 0xc1 0x00 | ||
| + | $ sudo wrmsr -a 0x186 0x41???? | ||
| + | </code> | ||
| + | |||
+ | For the next (and //final//) task, we are going to monitor the number of L2 cache misses. Look for the **L2_RQSTS.MISS** event in table 19-3 or 19-11 (depending on CPU version ID) in the Intel manual and set the last two bytes (the unit mask and event select) accordingly. If the operation is successful and the counters have started, you should start seeing non-zero values in the PMC0 register, increasing in subsequent reads. | ||
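Composing the value for **wrmsr** is plain bit packing: EVSEL in bits 7:0, UMASK in bits 15:8, USR at bit 16 and EN at bit 22. A sketch of the packing, shown with the architectural **INST_RETIRED.ANY_P** event (event 0xC0, umask 0x00) purely as a stand-in — the L2 event encoding is still yours to look up:

```c
#include <stdint.h>

#define PERFEVTSEL_USR (1u << 16)  /* count ring3 events only */
#define PERFEVTSEL_EN  (1u << 22)  /* enable the counter      */

/* Pack the IA32_PERFEVENTSELx fields discussed above; all flags
 * other than EN and USR are left cleared, as the task requires. */
static inline uint64_t perfevtsel(uint8_t evsel, uint8_t umask)
{
    return PERFEVTSEL_EN | PERFEVTSEL_USR |
           ((uint64_t)umask << 8) | evsel;
}
```

''perfevtsel(0xc0, 0x00)'' yields ''0x4100c0''; substitute the event select and unit mask of **L2_RQSTS.MISS** to obtain the value needed for this task.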
| <note tip> | <note tip> | ||
| - | In the default view, look at the number of micro-operations (i.e.: ''#uOps'') associated with each instruction. These are the number of primitive operations that each instruction (from the x86 ISA) is broken into. Fun and irrelevant fact: the hardware implementation of certain instructions can be modified via microcode upgrades. | + | An easier alternative to scouring through the Intel manuals would be to use [[https://perfmon-events.intel.com/platforms/tigerlake/core-events/core/|perfmon-events.intel.com]]. Get your CPU //"model name"// from ''/proc/cpuinfo'' and identify your microarchitecture based on the table below. Then search for the desired event in the appropriate section of the site. |
| - | Anyway, keeping in mind this ''#uOps'' value (for each instruction), we'll notice that the sum of all //resource pressures per port// will equal that value. In other words //resource pressure// means the average number of micro-operations that depend on that resource. | + | ^ Generation ^ Microarchitecture (Core Codename) ^ Release Year ^ Typical CPU Numbers ^ |
| + | | 1st | Nehalem / Westmere | 2008–2010 | i3,5,7 3xx–9xx | | ||
| + | | 2nd | Sandy Bridge | 2011 | i3,5,7 2xxx | | ||
| + | | 3rd | Ivy Bridge | 2012 | i3,5,7 3xxx | | ||
| + | | 4th | Haswell | 2013 | i3,5,7 4xxx | | ||
| + | | 5th | Broadwell | 2014–2015 | i3,5,7 5xxx | | ||
| + | | 6th | Skylake | 2015 | i3,5,7 6xxx | | ||
| + | | 7th | Kaby Lake | 2016–2017 | i3,5,7 7xxx | | ||
| + | | 8th | Coffee Lake / Amber Lake / Whiskey Lake | 2017–2018 | i3,5,7 8xxx | | ||
| + | | 9th | Coffee Lake Refresh | 2018–2019 | i3,5,7,9 9xxx | | ||
| + | | 10th | Comet Lake / Ice Lake / Tiger Lake | 2019–2020 | i3,5,7,9 10xxx | | ||
| + | | 11th | Rocket Lake / Tiger Lake | 2021 | i3,5,7,9 11xxx | | ||
| + | | 12th | Alder Lake | 2021–2022 | i3,5,7,9 12xxx | | ||
| + | | 13th | Raptor Lake | 2022–2023 | i3,5,7,9 13xxx | | ||
| + | | 14th | Raptor Lake Refresh | 2023–2024 | i3,5,7,9 14xxx | | ||
| + | | — | Meteor Lake | 2023–2024 | Core Ultra 5,7,9 1xx | | ||
| + | | — | Arrow Lake / Lunar Lake | 2024–2025 | Core Ultra 5,7,9 2xx | | ||
| </note> | </note> | ||
| <solution -hidden> | <solution -hidden> | ||
| - | llvm-mca -march=x86-64 my_pow.S | + | The event they should find in Table 19-3 is called **L2_RQSTS.MISS** (event number: //0x24// | umask: //0x3f//). As such, the value that they need to write in the MSR is //0x413f24//. |
| + | </solution> | ||
| - | Contents of my_pow.S with # LLVM-MCA-BEGIN and END tags: | + | === Task E - Ring3 cache performance evaluation === |
| - | {{:ep:labs:01:contents:tasks:screenshot_from_2023-10-08_22-19-15.png?300|}} | + | As of now, we should be able to modify the **CR4** register via the //sysfs// interface, enable all counters in the **IA32_PERF_GLOBAL_CTRL** across all cores and start an **L2 cache miss** counter again, across all cores. What remains is putting everything into practice. |
| - | </solution> | + | |
| - | === [10p] Task C - In-depth examination === | + | Take //mat_mul.c//. This program may be familiar from an ASC laboratory but, in case it isn't, the gist of it is that when using the naive matrix multiplication algorithm (O(n^3)), the frequency with which each iterator varies can wildly affect the performance of the program. The reason behind this is (in)efficient use of the CPU cache. Take a look at the following snippet from the source and keep in mind that each matrix buffer is a contiguous area in memory. |
| - | Now that you've got the hang of things, try generating asm code with certain optimization levels (i.e.: ''O1,2,3,s'', etc.) \\ | + | <code C> |
| - | Use the ''-bottleneck-analysis'' flag to identify contentious instruction sequences. Explain the reason to the best of your abilities. | + | for (uint32_t i=0; i<N; ++i) /* line */ |
| + | for (uint32_t j=0; j<N; ++j) /* column */ | ||
| + | for (uint32_t k=0; k<N; ++k) | ||
| + | r[i*N + j] += m1[i*N + k] * m2[k*N + j]; | ||
| + | </code> | ||
| + | |||
+ | What is the problem here? The problem is that i and k are multiplied by a large number N when updating a certain element. Thus, fast variations in these two indices will cause huge strides in accessed memory areas (larger than a cache line) and will cause unnecessary cache misses. So what are the best and worst configurations for the three ''for'' loops? The best: i, k, j. The worst: j, k, i. As we can see, the configurations that we will monitor in //mat_mul.c// do not coincide with the aforementioned two (so... not great, not terrible). Even so, the difference in execution time and number of cache misses will still be significant. | ||
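For comparison, the cache-friendly //i, k, j// ordering mentioned above can be sketched as follows: the innermost index now walks ''r'' and ''m2'' sequentially, and ''m1[i*N + k]'' stays fixed for the entire inner loop (element type assumed ''int'' for the sketch; //mat_mul.c// may differ):

```c
#include <stdint.h>

/* i, k, j ordering: every inner iteration touches consecutive
 * elements of r and m2, so strides stay within cache lines */
static void mat_mul_ikj(const int *m1, const int *m2, int *r, uint32_t N)
{
    for (uint32_t i = 0; i < N; ++i)
        for (uint32_t k = 0; k < N; ++k) {
            int v = m1[i*N + k];   /* constant for the whole inner loop */
            for (uint32_t j = 0; j < N; ++j)
                r[i*N + j] += v * m2[k*N + j];
        }
}
```

The result is identical to the naive version; only the memory access pattern changes.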
| + | |||
| + | Which brings us to the task at hand: using the **RDPMC** instruction, calculate the number of L2 cache misses for each of the two multiplications __without performing any context switches__ (hint: look at [[https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html|gcc extended asm]] and the following macro from //mat_mul.c//). | ||
| + | |||
| + | <code C> | ||
| + | #define rdpmc(ecx, eax, edx) \ | ||
| + | asm volatile ( \ | ||
| + | "rdpmc" \ | ||
| + | : "=a"(eax), \ | ||
| + | "=d"(edx) \ | ||
| + | : "c"(ecx)) | ||
| + | </code> | ||
| + | |||
+ | A word of caution: remember that each logical core has its own PMC0 counter, so make sure to use [[https://linux.die.net/man/1/taskset|taskset]] in order to set the CPU affinity of the process. If you don't, the process may be migrated between cores and the counter value will become unreliable. | ||
| + | |||
| + | <code bash> | ||
| + | $ taskset 0x01 ./mat_mul 1024 | ||
| + | </code> | ||
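As an alternative to **taskset**, the process can pin itself with ''sched_setaffinity()'' before the first RDPMC read — a Linux-specific sketch, with a helper name of our own choosing:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to a single logical core so that all
 * RDPMC reads hit the same PMC0 instance; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling ''pin_to_cpu(0)'' at the top of ''main()'' has the same effect as launching the program under ''taskset 0x01''.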
| + | |||
| + | <note important> | ||
| + | Depending on your CPU cache size, a matrix size of 1024 may be insufficient and you won't see any significant difference between the two arrangements. | ||
| + | |||
| + | You can check your cache size by printing the contents of these files: | ||
| + | ''/sys/bus/cpu/devices/cpu0/cache/index*/size''. Indices 0 and 1 typically correspond to the L1 data and L1 instruction caches respectively (you can double check this by reading the ''type'' file instead of ''size''). Indices 2 and 3 correspond to the L2 and L3 caches. | ||
| + | |||
+ | If you run into this problem, either compute a sufficiently large matrix size or just ballpark it. | ||
| + | </note> | ||
| <solution -hidden> | <solution -hidden> | ||
| - | llvm-mca -march=x86-64 -bottleneck-analysis -timeline -iterations=10000 -all-stats my_pow.S | + | <code C> |
| + | /* hardware counter init */ | ||
| + | rdpmc(0, eax, edx); | ||
| + | counter = ((uint64_t)eax) | (((uint64_t)edx) << 32); | ||
| + | |||
| + | /* perform slow multiplication */ | ||
| + | for (uint32_t i=0; i<N; ++i) /* line */ | ||
| + | for (uint32_t j=0; j<N; ++j) /* column */ | ||
| + | for (uint32_t k=0; k<N; ++k) | ||
| + | r[i*N + j] += m1[i*N + k] * m2[k*N + j]; | ||
| + | |||
| + | /* hardware counter delta */ | ||
| + | rdpmc(0, eax, edx); | ||
| + | counter = (((uint64_t)eax) | (((uint64_t)edx) << 32)) - counter; | ||
| + | </code> | ||
| - | Or any other tags from the documentation as long as they explain what they do | ||
| </solution> | </solution> | ||
| + | |||
| + | |||