ep:labs:03:contents:tasks:ex4 [2020/08/03 16:39] cristian.marin0805
ep:labs:03:contents:tasks:ex4 [2025/03/18 00:49] (current) radu.mantu

==== 04. [25p] Microcode analysis ====

**llvm-mca** is a machine code analyzer that simulates the execution of a sequence of instructions. By leveraging high-level knowledge of the micro-architectural implementation of the CPU, as well as its execution pipeline, this tool is able to estimate the execution speed of said instructions in terms of clock cycles. More importantly, though, it can highlight possible contention between two or more instructions over CPU resources or, rather, over its [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler_Ports_.26_Execution_Units|ports]].

Note that **llvm-mca** is not the most reliable tool for predicting the precise runtime of an instruction block (see [[https://dspace.mit.edu/bitstream/handle/1721.1/128755/ithemal-measurement.pdf?sequence=2&isAllowed=y|this paper]] for details). After all, modern CPUs are not as simple as the good old AVR microcontrollers. While calculating the execution time of a linear AVR program (i.e.: no conditional loops) is as simple as adding up the clock cycles associated with each instruction (from the reference manual), things are never that clear-cut for modern CPUs. Manufacturers such as Intel often implement hardware optimizations that are not documented or even publicized. For example, we know that the CPU buffers decoded instructions in case a loop is detected; if so, the instructions are dispatched once again from this buffer, thus avoiding extra instruction fetches. What happens, though, if the contents of the loop exceed this buffer's size? Obviously, without knowing certain details such as this buffer size, not to mention anything about microcode or other undocumented hardware optimizations, it is impossible to give accurate estimates.

{{ :ep:labs:01:contents:tasks:cpu_exec_unit.png?800 |}}

<html>
<center>
<b>Figure 2:</b> Simplified view of a single Intel Skylake CPU core. Instructions are decoded into μOps and scheduled out-of-order onto the Execution Units. Your CPUs most likely have (many) more EUs.
</center>
</html>

=== [5p] Task A - Preparing the input ===

As a simple example we will look at ''task_04/csum.c''. This file contains the ''csum_16b1c()'' function, which computes the 16-bit one's complement checksum used in the IP and TCP headers.

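To make the assembly easier to follow, here is a minimal sketch of such a checksum function, written from the generic RFC 1071 description. Note that this is only an assumption about its shape; the actual ''csum_16b1c()'' in ''task_04/csum.c'' may differ in signature and details:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a 16-bit one's complement checksum (RFC 1071 style).
 * NOTE: hypothetical reconstruction; the real csum_16b1c() in
 * task_04/csum.c may differ. */
uint16_t csum_16b1c(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    /* add up 16-bit words (big endian, as they appear on the wire) */
    while (len > 1) {
        sum += ((uint32_t)buf[0] << 8) | buf[1];
        buf += 2;
        len -= 2;
    }

    /* an odd trailing byte is padded with zero */
    if (len)
        sum += (uint32_t)buf[0] << 8;

    /* fold the carries back into the lower 16 bits */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    /* one's complement of the one's complement sum */
    return (uint16_t)~sum;
}
```

The carry-folding loop is what makes this a //one's complement// sum: every overflow out of bit 15 is added back into the least significant bit.
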
Since **llvm-mca** requires assembly code as input, we first need to translate the provided C code. Because the assembly parser it utilizes is the same as **clang**'s, use clang to compile the C program, but stop after the LLVM generation and optimization stages, when the target-specific assembly code is emitted.

<code bash>
$ clang -S -masm=intel csum.c     # output = csum.s
</code>

<note>
Note how the [[https://llvm.org/docs/CommandGuide/llvm-mca.html|llvm-mca documentation]] states that the ''LLVM-MCA-BEGIN'' and ''LLVM-MCA-END'' markers can be parsed (as assembly comments) in order to restrict the scope of the analysis.

These markers can also be placed in C code (see [[https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html|gcc extended asm]] and [[https://llvm.org/docs/LangRef.html#inline-assembler-expressions|llvm inline asm expressions]]):
<code c>
asm volatile("# LLVM-MCA-BEGIN" ::: "memory");
</code>

Remember, however, that this approach is not always desirable, for two reasons:
  - Even though the marker is just a comment, the ''volatile'' qualifier can pessimize optimization passes. As a result, the generated code may not correspond to what would normally be emitted.
  - Some code structures cannot be fully included in the analysis region. For example, if you want to include the contents of a ''for'' loop, injecting assembly meta-comments in the C code will exclude the iterator increment and the condition check (which are also executed on every iteration).
</note>

=== [10p] Task B - Analyzing the assembly code ===

Use **llvm-mca** to inspect the expected throughput and "pressure points" of the generated assembly code (check out [[https://en.algorithmica.org/hpc/profiling/mca/|this example]]).

One important thing to remember is that **llvm-mca** does not simulate the //behavior// of each instruction, but only the time required for it to execute. In other words, if you load an immediate value into a register via ''mov rax, 0x1234'', the analyzer will not care //what// the instruction does (or what the value of ''rax'' even is), but only how long it takes the CPU to do it. The implication is quite significant: **llvm-mca** is incapable of analyzing complex sequences of code that contain conditional structures, such as ''for'' loops or function calls. Instead, given the sequence of instructions, it will pass through each of them one by one, ignoring their intended effect: conditional jump instructions will fall through, ''call'' instructions will be stepped over without even considering the cost of the associated ''ret'', etc.

The closest we can come to analyzing a loop is to reduce the analysis scope via the aforementioned ''LLVM-MCA-*'' markers and to set the number of simulated iterations from the command line (the ''-iterations'' flag), so that the analysis resembles the behavior of an actual loop.

<note>
Read more about the [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler|Skylake instruction scheduler and ports]].

A very short description of each port's main usage:
  * **Ports 0, 1:** arithmetic instructions
  * **Ports 2, 3:** load operations, AGU (Address Generation Unit)
  * **Port 4:** store operations
  * **Port 5:** vector operations
  * **Port 6:** integer and branch operations
  * **Port 7:** store AGU

The significance of the SKL ports reported by **llvm-mca** can be found in the [[https://github.com/llvm/llvm-project/blob/d9be232191c1c391a0d665e976808b2a12ea98f1/llvm/lib/Target/X86/X86SchedSkylakeClient.td#L32|Skylake machine model config]]. To find out if your CPU belongs to this category, [[https://github.com/llvm/llvm-project/blob/27c5a9bbb01a464bb85624db2d0808f30de7c996/llvm/lib/TargetParser/Host.cpp#L765|RTFS]] and run ''inxi -Cx''.
</note>

<note tip>
In the default view, look at the number of micro-operations (i.e.: ''#uOps'') associated with each instruction. This is the number of primitive operations that each instruction (from the x86 ISA) is broken into. Fun and irrelevant fact: the hardware implementation of certain instructions can be modified via microcode upgrades.

Keeping this ''#uOps'' value in mind (for each instruction), notice that the sum of all //resource pressures per port// for an instruction equals that value. In other words, the //resource pressure// represents the average number of micro-operations that depend on that resource.
</note>

=== [10p] Task C - In-depth examination ===

Now that you've got the hang of things, use the ''-bottleneck-analysis'' flag to identify contending instruction sequences.

Explain the causes of the reported bottlenecks to the best of your ability. For example, the following two instructions display a register dependency, because the ''mov'' instruction needs to wait for the ''push'' instruction to update the RSP register.

<code>
0.  push rbp        ## REGISTER dependency: rsp
1.  mov  rbp, rsp   ## REGISTER dependency: rsp
</code>

How would you go about further optimizing this code?

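One common direction, sketched below under the assumption that the serial dependency chain through a single accumulator is the bottleneck (the function and names here are illustrative, not taken from ''csum.c''): split the summation across independent accumulators, so consecutive additions no longer wait on one another and can be dispatched to different ports in the same cycle.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: summing 16-bit words with two independent accumulators.
 * Each chain depends only on itself, so the two additions of one
 * iteration can execute in parallel on different ports. */
uint32_t sum_words_2acc(const uint16_t *w, size_t n)
{
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        s0 += w[i];         /* dependency chain 0 */
        s1 += w[i + 1];     /* dependency chain 1 */
    }
    if (i < n)              /* leftover word for odd n */
        s0 += w[i];

    return s0 + s1;         /* fold the chains at the end */
}
```

Re-running **llvm-mca** on variants like this one lets you check whether the reported pressure on the bottleneck resource actually drops.
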
<note>
Also look at the kernel's implementation of the [[https://elixir.bootlin.com/linux/v6.13.7/source/arch/x86/include/asm/checksum_64.h#L45|checksum calculation]] over the variable-length IP header.
</note>

<solution -hidden>
<code bash>
$ llvm-mca -bottleneck-analysis -timeline -iterations=10000 -all-stats csum.s
</code>
</solution>