==== 04. [25p] llvm-mca ====
  
**llvm-mca** is a machine code analyzer that simulates the execution of a sequence of instructions. By leveraging high-level knowledge of the micro-architectural implementation of the CPU, as well as its execution pipeline, this tool is able to determine the execution speed of said instructions in terms of clock cycles. More importantly though, it can highlight possible contention between two or more instructions over CPU resources or, rather, its [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler_Ports_.26_Execution_Units|ports]].
  
Note that **llvm-mca** is not the most reliable tool when predicting the precise runtime of an instruction block (see [[https://dspace.mit.edu/bitstream/handle/1721.1/128755/ithemal-measurement.pdf?sequence=2&isAllowed=y|this paper]] for details). After all, CPUs are not as simple as the good old AVR microcontrollers. While calculating the execution time of a linear AVR program (i.e.: no conditional loops) is as simple as adding up the clock cycles associated with each instruction (taken from the reference manual), things are never that clear-cut for modern CPUs. Manufacturers such as Intel oftentimes implement hardware optimizations that are not documented or even publicized. For example, we know that the CPU caches instructions in case a loop is detected; when that happens, the instructions are dispatched once again from this buffer, thus avoiding extra instruction fetches. What happens, though, if the loop's contents exceed this buffer size? Obviously, without knowing details such as this buffer size, not to mention anything about microcode or other undocumented hardware optimizations, it is impossible to give accurate estimates.
  
{{ :ep:labs:01:contents:tasks:cpu_exec_unit.png?800 |}}
<html>
<center>
<b>Figure 2:</b> Simplified view of a single Intel Skylake CPU core. Instructions are decoded into μOps and scheduled out-of-order onto the Execution Units. Your CPUs most likely have (many) more EUs.
</center>
</html>
  
=== [5p] Task A - Preparing the input ===
As previously mentioned, **llvm-mca** requires assembly code as input, so start by preparing it from the source provided in the archive.
  
Because the assembly parser it utilizes is the same as **clang**'s, use **clang** to compile the C program, but stop after the LLVM IR generation and optimization stages, when the target-specific assembly code is generated.
<note>
Note how in the [[https://llvm.org/docs/CommandGuide/llvm-mca.html|llvm-mca documentation]] it is stated that the ''LLVM-MCA-BEGIN'' and ''LLVM-MCA-END'' markers can be parsed (as assembly comments) in order to restrict the scope of the analysis.

These markers can also be placed in C code (see [[https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html|gcc extended asm]] and [[https://llvm.org/docs/LangRef.html#inline-assembler-expressions|llvm inline asm expressions]]):
<code c>
asm volatile("# LLVM-MCA-BEGIN" ::: "memory");  /* start of the analyzed region */
asm volatile("# LLVM-MCA-END" ::: "memory");    /* end of the analyzed region */
</code>

Remember, however, that this approach is not always desirable, for two reasons:
  - Even though this is just a comment, the ''volatile'' modifier can pessimize optimization passes. As a result, the generated code may not correspond to what would normally be emitted.
  - Some code structures cannot be included in the analysis region. For example, if you want to include the contents of a ''for'' loop, doing so by injecting assembly meta-comments in C code will exclude the increment and the condition check (which are also executed on every iteration).
</note>
  
=== [10p] Task B - Analyzing the assembly code ===
  
Once you have the assembly code, use **llvm-mca** to inspect its expected throughput and "pressure points" (check out [[https://en.algorithmica.org/hpc/profiling/mca/|this example]]).
  
-Because of the loopjumps in the assembly code are bound to be present ​Llvm-mca goes through all the instructions sequentially whilst simulating ​the duration and the resource access thus it does not care about the effects they produce. To put it differentlywhen it encounters a jump it will resort ​to a fallthrough.+One important thing to remember is that **llvm-mca** does not simulate ​the //​behaviour//​ of each instructionbut only the time required for it to executeIn other words, if you load an immediate value in a register via ''​mov rax, 0x1234'', ​the analyzer will not care //​what// ​the instruction does (or what the value of ''​rax''​ even is), but how long it takes the CPU to do it. The implication is quite significant:​ **llvm-mca** is incapable of analyzing complex sequences of code that contain conditional structures, such as ''​for''​ loops or function calls. Instead, given the sequence of instructions, it will pass through each of them one by one, ignoring their intended effect: conditional ​jump instructions ​will fall through, ''​call''​ instructions will by passed over not even considering the cost of the associated ''​ret'',​ etc. The closest we can come to analyzing ​loop is by reducing the analysis scope via the aforementioned ''​LLVM-MCA-*''​ markers and controlling the number of simulated iterations from the command line.
  
To solve this issue, you can set the number of iterations from the command line, so its behaviour can resemble an actual loop.
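
For example, a possible invocation (just a sketch: the file name, CPU model and iteration count below are assumptions, so adjust them to your setup):
<code bash>
# simulate 100 iterations of the marked region(s), scheduling for a Skylake client core
llvm-mca -mcpu=skylake -iterations=100 -timeline my_pow.S
</code>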
  
<note>
Read more on the [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler|Skylake instruction scheduler and ports]].
  
A very short description of each port's main usage:
  * **Port 0,1:** arithmetic instructions
  * **Port 2,3:** load operations, AGU (address generation unit)
  * **Port 4:** store operations, AGU
  * **Port 5:** vector operations
  * **Port 6:** integer and branch operations
  * **Port 7:** AGU
  
The significance of the SKL ports reported by **llvm-mca** can be found in the [[https://github.com/llvm/llvm-project/blob/d9be232191c1c391a0d665e976808b2a12ea98f1/llvm/lib/Target/X86/X86SchedSkylakeClient.td#L32|Skylake machine model config]]. To find out if your CPU belongs to this category, [[https://github.com/llvm/llvm-project/blob/27c5a9bbb01a464bb85624db2d0808f30de7c996/llvm/lib/TargetParser/Host.cpp#L765|RTFS]] and run ''inxi -Cx''.
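
For reference, checking the CPU model boils down to something like this (assuming a Linux machine):
<code bash>
# CPU model and microarchitecture details
inxi -Cx

# fallback if inxi is not installed: query the model name directly
grep -m1 'model name' /proc/cpuinfo
</code>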
  
</note>

<note tip>
In the default view, look at the number of micro-operations (i.e.: ''#uOps'') associated with each instruction. This is the number of primitive operations that each instruction (from the x86 ISA) is broken into. Fun and irrelevant fact: the hardware implementation of certain instructions can be modified via microcode upgrades.

Anyway, keeping this ''#uOps'' value in mind (for each instruction), we'll notice that the sum of all //resource pressures per port// equals that value. In other words, the //resource pressure// of a port is the average number of micro-operations that depend on that resource. For example (purely illustrative numbers): an instruction with ''#uOps'' = 2 whose micro-operations can be dispatched evenly across four ports would show a pressure of 0.5 on each of them.
</note>
  
<solution -hidden>
Contents of my_pow.S with # LLVM-MCA-BEGIN and END tags:
  
{{:ep:labs:01:contents:tasks:screenshot_from_2023-10-08_22-19-15.png?300|}}
</solution>
  
=== [10p] Task C - In-depth examination ===
  
Now that you've got the hang of things, try generating the assembly code at different optimization levels (e.g.: ''-O1'', ''-O2'', ''-O3'', ''-Os''). \\
Use the ''-bottleneck-analysis'' flag to identify contentious instruction sequences and explain the reported bottlenecks to the best of your abilities.
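
A possible workflow (a sketch: file names, CPU model and iteration count are assumptions):
<code bash>
# regenerate the assembly at different optimization levels
clang -S -O1 my_pow.c -o my_pow_O1.S
clang -S -O3 my_pow.c -o my_pow_O3.S

# compare the reports; -bottleneck-analysis highlights the resources that stall the pipeline
llvm-mca -mcpu=skylake -iterations=200 -bottleneck-analysis my_pow_O1.S
llvm-mca -mcpu=skylake -iterations=200 -bottleneck-analysis my_pow_O3.S
</code>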
  
<solution -hidden>