This shows you the differences between two versions of the page.
ep:labs:03:contents:tasks:ex4 [2025/02/11 23:28] cezar.craciunoiu created |
ep:labs:03:contents:tasks:ex4 [2025/03/18 00:49] (current) radu.mantu |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ==== 04. [25p] llvm-mca ==== | + | ==== 04. [25p] Microcode analysis ==== |
**llvm-mca** is a machine code analyzer that simulates the execution of a sequence of instructions. By leveraging high-level knowledge of the micro-architectural implementation of the CPU, as well as its execution pipeline, this tool is able to determine the execution speed of said instructions in terms of clock cycles. More importantly though, it can highlight possible contentions of two or more instructions over CPU resources or rather, its [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler_Ports_.26_Execution_Units|ports]]. | **llvm-mca** is a machine code analyzer that simulates the execution of a sequence of instructions. By leveraging high-level knowledge of the micro-architectural implementation of the CPU, as well as its execution pipeline, this tool is able to determine the execution speed of said instructions in terms of clock cycles. More importantly though, it can highlight possible contentions of two or more instructions over CPU resources or rather, its [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler_Ports_.26_Execution_Units|ports]]. | ||
Line 14: | Line 14: | ||
=== [5p] Task A - Preparing the input === | === [5p] Task A - Preparing the input === | ||
- | As previosuly mentioned, llvm-mca requires assembly code as input so start by preparing it from the source provided in the archive. | + | As a simple example we will look at ''task_04/csum.c''. This file contains the ''csum_16b1c()'' function that computes the 16-bit one's complement checksum used in the IP and TCP headers. |
- | Since **llvm-mca** requires assembly code as input, we first need to translate the C source provided in the archive. Because the assembly parser it utilizes is the same as **clang**'s, use it to compile the C program but stop after the LLVM generation and optmization stages, when the target-specific assembly code is generated. | + | Since **llvm-mca** requires assembly code as input, we first need to translate the provided C code. Because the assembly parser it utilizes is the same as **clang**'s, use it to compile the C program but stop after the LLVM generation and optimization stages, when the target-specific assembly code is emitted. |
+ | |||
+ | <code bash> | ||
+ | $ clang -S -masm=intel csum.c # output = csum.s | ||
+ | </code> | ||
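For orientation, here is a minimal sketch of what a 16-bit one's complement checksum does, written in the style of RFC 1071. It is not the contents of ''csum.c'': the name ''csum16_sketch'' and its signature are assumptions made for illustration only, and the lab's ''csum_16b1c()'' may be implemented differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (NOT the lab's csum_16b1c()): a plain 16-bit
 * one's complement checksum over a packet-sized buffer. */
uint16_t csum16_sketch(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Add up consecutive 16-bit words, read in network byte order. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }

    /* A trailing odd byte is treated as the high byte of a final word. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    /* The checksum is the one's complement of the folded sum. */
    return (uint16_t)~sum;
}
```

Running the function over a header that already contains its correct checksum should yield 0, which is how receivers typically validate it. The 32-bit accumulator is only safe for packet-sized inputs; very large buffers would need periodic folding.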
<note> | <note> | ||
Line 27: | Line 31: | ||
Remember, however, that this approach is not always desirable, for two reasons: | Remember, however, that this approach is not always desirable, for two reasons: | ||
- | - Even though this is just a comment, the ''volatile'' modifier can pessimize optimization passes. As a result, the generated code may not correspond to what would normally be emitted. | + | - Even though this is just a comment, the ''volatile'' qualifier can pessimize optimization passes. As a result, the generated code may not correspond to what would normally be emitted. |
- | - Some code structures can not be included in the analysis region. For example, if you want to include the contents of a ''for'' loop, doing so by injecting assembly meta comments in C code will exclude the incrementation and condition check (which are also executed on every iteration). | + | - Some code structures can not be included in the analysis region. For example, if you want to include the contents of a ''for'' loop, doing so by injecting assembly meta comments in C code will exclude the iterator increment and condition check (which are also executed on every iteration). |
</note> | </note> | ||
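The second limitation can be seen in a small sketch. The function name ''sum_words_marked'' and the region name ''loop_body'' are hypothetical, and the example assumes an x86 target, where ''#'' starts an assembly comment. Exactly where the compiler places the comments relative to the surrounding instructions is not guaranteed, especially at higher optimization levels.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example function; only the marker comments matter here. */
uint32_t sum_words_marked(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        /* Copied into the generated .s file as an assembly comment. */
        __asm__ volatile("# LLVM-MCA-BEGIN loop_body");
        sum += words[i];
        __asm__ volatile("# LLVM-MCA-END");
        /* The i++ and the i < n check are emitted outside the region. */
    }

    return sum;
}
```

If the loop bookkeeping must be part of the analysis, placing the ''# LLVM-MCA-BEGIN'' and ''# LLVM-MCA-END'' comments by hand in the generated assembly file avoids both problems.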
- | |||
- | <solution -hidden> | ||
- | clang my_pow.c -masm=intel -S -o my_pow.S | ||
- | </solution> | ||
=== [10p] Task B - Analyzing the assembly code === | === [10p] Task B - Analyzing the assembly code === | ||
- | After disassembling the code use **llvm-mca** to inspect its expected throughput and "pressure points" (check out [[https://en.algorithmica.org/hpc/profiling/mca/|this example]]. | + | Use **llvm-mca** to inspect its expected throughput and "pressure points" (check out [[https://en.algorithmica.org/hpc/profiling/mca/|this example]]). |
- | One important thing to remember is that **llvm-mca** does not simulate the //behaviour// of each instruction, but only the time required for it to execute. In other words, if you load an immediate value in a register via ''mov rax, 0x1234'', the analyzer will not care //what// the instruction does (or what the value of ''rax'' even is), but how long it takes the CPU to do it. The implication is quite significant: **llvm-mca** is incapable of analyzing complex sequences of code that contain conditional structures, such as ''for'' loops or function calls. Instead, given the sequence of instructions, it will pass through each of them one by one, ignoring their intended effect: conditional jump instructions will fall through, ''call'' instructions will by passed over not even considering the cost of the associated ''ret'', etc. The closest we can come to analyzing a loop is by reducing the analysis scope via the aforementioned ''LLVM-MCA-*'' markers and controlling the number of simulated iterations from the command line. | + | One important thing to remember is that **llvm-mca** does not simulate the //behavior// of each instruction, but only the time required for it to execute. In other words, if you load an immediate value in a register via ''mov rax, 0x1234'', the analyzer will not care //what// the instruction does (or what the value of ''rax'' even is), but how long it takes the CPU to do it. The implication is quite significant: **llvm-mca** is incapable of analyzing complex sequences of code that contain conditional structures, such as ''for'' loops or function calls. Instead, given the sequence of instructions, it will pass through each of them one by one, ignoring their intended effect: conditional jump instructions will fall through, ''call'' instructions will be passed over, not even considering the cost of the associated ''ret'', etc.
The closest we can come to analyzing a loop is by reducing the analysis scope via the aforementioned ''LLVM-MCA-*'' markers and controlling the number of simulated iterations from the command line. |
- | To solve this issue, you can set the number of iterations from the command line, so its behaviour can resemble an actual loop. | + | To solve this issue, you can set the number of iterations from the command line, so its behavior can resemble an actual loop. |
<note> | <note> | ||
Line 64: | Line 64: | ||
</note> | </note> | ||
- | <solution -hidden> | + | === [10p] Task C - In-depth examination === |
- | llvm-mca -march=x86-64 my_pow.S | + | |
- | Contents of my_pow.S with # LLVM-MCA-BEGIN and END tags: | + | Now that you've got the hang of things, use the ''-bottleneck-analysis'' flag to identify contentious instruction sequences. |
- | {{:ep:labs:01:contents:tasks:screenshot_from_2023-10-08_22-19-15.png?300|}} | + | Explain the reason to the best of your abilities. For example, the following two instructions display a register dependency because the ''mov'' instruction needs to wait for the ''push'' instruction to update the RSP register. |
- | </solution> | + | |
- | === [10p] Task C - In-depth examination === | + | <code> |
+ | 0. push rbp ## REGISTER dependency: rsp | ||
+ | 1. mov rbp, rsp ## REGISTER dependency: rsp | ||
+ | </code> | ||
- | Now that you've got the hang of things, try generating asm code with certain optimization levels (i.e.: ''O1,2,3,s'', etc.) \\ | + | How would you go about further optimizing this code? |
- | Use the ''-bottleneck-analysis'' flag to identify contentious instruction sequences. Explain the reason to the best of your abilities. | + | |
- | <solution -hidden> | + | <note> |
- | llvm-mca -march=x86-64 -bottleneck-analysis -timeline -iterations=10000 -all-stats my_pow.S | + | Also look at the kernel's implementation of a [[https://elixir.bootlin.com/linux/v6.13.7/source/arch/x86/include/asm/checksum_64.h#L45|checksum calculation]] over the variable-length IP header. |
+ | </note> | ||
- | Or any other tags from the documentation as long as they explain what they do | + | <solution -hidden> |
+ | llvm-mca -bottleneck-analysis -timeline -iterations=10000 -all-stats csum.s | ||
</solution> | </solution> |