==== 04. [25p] llvm-mca ====
  
**llvm-mca** is a machine code analyzer that simulates the execution of a sequence of instructions. By leveraging high-level knowledge of the micro-architectural implementation of the CPU, as well as its execution pipeline, this tool is able to estimate the execution speed of said instructions in terms of clock cycles. More importantly, it can highlight possible contention between two or more instructions over CPU resources or, rather, over its [[https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler_Ports_.26_Execution_Units|ports]].
  
Note that **llvm-mca** is not the most reliable tool when predicting the precise runtime of an instruction block (see [[https://dspace.mit.edu/bitstream/handle/1721.1/128755/ithemal-measurement.pdf?sequence=2&isAllowed=y|this paper]] for details). After all, CPUs are not as simple as the good old AVR microcontrollers. While calculating the execution time of a linear AVR program (i.e.: no conditional loops) is as simple as adding up the clock cycles associated with each instruction (from the reference manual), things are never that clear-cut when it comes to modern CPUs. CPU manufacturers such as Intel oftentimes implement hardware optimizations that are not documented or even publicized. For example, we know that the CPU buffers instructions in case a loop is detected. If that is the case, the instructions are dispatched once again from this buffer, thus avoiding extra instruction fetches. What happens, though, if the loop body exceeds the buffer size? Obviously, without knowing certain details such as this buffer size, not to mention anything about microcode or undocumented hardware optimizations, it is impossible to give accurate estimates.

{{ :ep:labs:01:contents:tasks:cpu_exec_unit.png?800 |}}
<html>
<center>
<b>Figure 2:</b> Simplified view of a single Intel Skylake CPU core. Instructions are decoded into μOps and scheduled out-of-order onto the Execution Units. Your CPUs most likely have (many) more EUs.
</center>
</html>
  
=== [5p] Task A - Preparing the input ===
  * **Port 6:** integer and branch operations
  * **Port 7:** AGU

The significance of the SKL ports reported by **llvm-mca** can be found in the [[https://github.com/llvm/llvm-project/blob/d9be232191c1c391a0d665e976808b2a12ea98f1/llvm/lib/Target/X86/X86SchedSkylakeClient.td#L32|Skylake machine model config]]. To find out whether your CPU belongs to this category, [[https://github.com/llvm/llvm-project/blob/27c5a9bbb01a464bb85624db2d0808f30de7c996/llvm/lib/TargetParser/Host.cpp#L765|RTFS]] and run ''inxi -Cx''.

</note>

<note tip>
In the default view, look at the number of micro-operations (i.e.: ''#uOps'') associated with each instruction. This is the number of primitive operations that each (x86 ISA) instruction is broken into. Fun and irrelevant fact: the hardware implementation of certain instructions can be modified via microcode updates.

Keeping this per-instruction ''#uOps'' value in mind, notice that the sum of all //resource pressures per port// for an instruction equals that value. In other words, the //resource pressure// represents the average number of micro-operations that the instruction dispatches to that port per iteration.
</note>
  
ep/labs/01/contents/tasks/ex4.1696931965.txt.gz · Last modified: 2023/10/10 12:59 by radu.mantu
CC Attribution-Share Alike 3.0 Unported