

04. [25p] LLVM-MCA

llvm-mca is a machine code analyzer that simulates the execution of a sequence of instructions. By leveraging high-level knowledge of the micro-architectural implementation of the CPU, as well as its execution pipeline, this tool is able to determine the execution speed of said instructions in terms of clock cycles. More importantly though, it can highlight possible contention between two or more instructions over CPU resources or, more precisely, its execution ports.

Note that llvm-mca is not the most reliable tool for predicting the precise runtime of an instruction block (see this paper for details). After all, CPUs are not as simple as the good old AVR microcontrollers. While calculating the execution time of a linear AVR program (i.e., no conditional loops) is as simple as adding up the clock cycles associated with each instruction (from the reference manual), things are never that clear-cut when it comes to CPUs. CPU manufacturers such as Intel often implement hardware optimizations that are not documented or even publicized. For example, we know that the CPU caches instructions when a loop is detected. In that case, the instructions are dispatched again from the buffer, thus avoiding extra instruction fetches. What happens, though, if the size of the loop's contents exceeds this buffer size? Obviously, without knowing details such as this buffer size, not to mention anything about microcode or unknown hardware optimizations, it is impossible to give accurate estimates.
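To make the AVR comparison concrete, here is a minimal sketch of the "add up the cycles" approach. The cycle counts are the ones listed for classic 8-bit AVR cores (e.g. ATmega328P); the program itself and the 16 MHz clock are assumptions chosen only for illustration.

```python
# Cycle counts for a few instructions on classic 8-bit AVR cores
# (e.g. ATmega328P), as listed in the AVR instruction set manual.
AVR_CYCLES = {"ldi": 1, "add": 1, "nop": 1, "mul": 2, "lds": 2, "sts": 2}


def linear_program_cycles(mnemonics):
    """Sum the per-instruction cycle counts of a branch-free program."""
    return sum(AVR_CYCLES[m] for m in mnemonics)


# Hypothetical straight-line program: load two bytes, multiply them,
# store the 16-bit result.
program = ["lds", "lds", "mul", "sts", "sts"]
cycles = linear_program_cycles(program)

F_CPU_HZ = 16_000_000  # assumed clock frequency
print(f"{cycles} cycles, {cycles / F_CPU_HZ * 1e9:.1f} ns")
```

No such lookup table exists for a modern out-of-order CPU, which is why a simulator like llvm-mca is needed in the first place.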

Given a piece of assembly code, llvm-mca estimates the Instructions Per Cycle (IPC), as well as hardware resource pressure, among other things. The analysis and reporting style were taken from the IACA tool provided by Intel.
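As a quick reminder of what the IPC figure means, the sketch below computes it by hand. The numbers are illustrative only (a 3-instruction block simulated for 300 iterations in 610 cycles); they are not the result for the code in this lab.

```python
def ipc(instructions, total_cycles):
    """Instructions Per Cycle: retired instructions divided by elapsed cycles."""
    return instructions / total_cycles


# Illustrative values, not measured on the lab's code.
iterations = 300
block_size = 3       # instructions in the analyzed block
total_cycles = 610

instructions = iterations * block_size
print(f"Instructions: {instructions}")
print(f"IPC:          {ipc(instructions, total_cycles):.2f}")  # 900 / 610 -> 1.48
```

The higher the IPC, the better the block keeps the execution ports busy.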

In other words, a machine code analyzer helps developers understand how a piece of assembly code will run on specific hardware, which in turn helps with optimizing the code.

[5p] Task A - Preparing the input

As previously mentioned, llvm-mca requires assembly code as input, so start by generating it from the source provided in the archive.
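One possible way to obtain the assembly is to ask the compiler to stop after code generation (the -S flag). The sketch below only builds and prints the command; the file names loop.c and loop.s, the choice of clang and the -O2 optimization level are assumptions, so adapt them to what you find in the archive.

```python
import shlex


def assembly_command(source, output, compiler="clang", opt="-O2"):
    """Build a compiler command that emits assembly (-S) instead of an object file."""
    return [compiler, "-S", opt, "-o", output, source]


# Hypothetical file names; use the ones from the lab archive.
cmd = assembly_command("loop.c", "loop.s")
print(shlex.join(cmd))  # clang -S -O2 -o loop.s loop.c
```

To actually run it, pass the list to subprocess.run(cmd, check=True) or simply type the printed command in a terminal.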

[10p] Task B - Analyzing the assembly code

Once you have the assembly code, use the tool to inspect its performance. Note that you can add comments in the assembly code (# LLVM-MCA-BEGIN & # LLVM-MCA-END) to delimit a region of interest. In this particular scenario, we are dealing with a simple loop, so focusing on that area of the code is adequate.
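The markers can of course be added by hand in a text editor. If you prefer to script it, the following is a minimal sketch; the assembly listing is a hypothetical compiler output for a loop that sums an array (x86-64, AT&T syntax), and the label name .L3 is an assumption.

```python
def mark_region(asm_lines, first, last):
    """Return a copy of asm_lines with lines first..last (0-based, inclusive)
    wrapped in llvm-mca region markers."""
    return (
        asm_lines[:first]
        + ["# LLVM-MCA-BEGIN"]
        + asm_lines[first:last + 1]
        + ["# LLVM-MCA-END"]
        + asm_lines[last + 1:]
    )


# Hypothetical listing; your compiler output will look different.
asm = [
    "\txorl\t%eax, %eax",
    ".L3:",
    "\taddl\t(%rdi), %eax",
    "\taddq\t$4, %rdi",
    "\tcmpq\t%rsi, %rdi",
    "\tjne\t.L3",
    "\tret",
]

first = asm.index(".L3:")  # loop label
last = next(i for i, line in enumerate(asm) if line.strip().startswith("jne"))
marked = mark_region(asm, first, last)
print("\n".join(marked))
```

Only the instructions between the two comments are analyzed; everything else in the file is ignored by llvm-mca.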

Because of the loop, jumps are bound to be present in the assembly code. llvm-mca goes through the instructions sequentially, simulating only their duration and resource usage; it does not care about the effects they produce. To put it differently, when it encounters a jump, it simply falls through to the next instruction.

To solve this issue, you can set the number of iterations from the command line, so its behaviour can resemble an actual loop.
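A minimal sketch of such an invocation is shown below. -iterations and -mcpu are existing llvm-mca options; the file name loop.s, the value 500 and the skylake target (picked because the port list on this page refers to Skylake) are assumptions.

```python
import shlex


def mca_command(asm_file, iterations=100, mcpu="skylake"):
    """Build an llvm-mca command that simulates the block `iterations` times."""
    return ["llvm-mca", f"-mcpu={mcpu}", f"-iterations={iterations}", asm_file]


# Hypothetical input file and iteration count.
cmd = mca_command("loop.s", iterations=500)
print(shlex.join(cmd))  # llvm-mca -mcpu=skylake -iterations=500 loop.s
```

Since the simulated block is repeated back to back, a larger iteration count makes the steady-state behaviour of the loop dominate the reported numbers.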

It must be acknowledged that llvm-mca's predictions can deviate significantly from measurements on real hardware, but it is still a useful tool for analysis. https://dspace.mit.edu/bitstream/handle/1721.1/128755/ithemal-measurement.pdf?sequence=2&isAllowed=y

Note: Besides the default Instruction Info, the output also contains information about the Scheduler Ports: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)

https://www.reddit.com/r/intel/comments/pzix4a/what_do_the_ports_of_the_intel_processor_refer_to/

A very short description of each port's main usage:

  1. Port0 & Port1 → arithmetic instructions
  2. Port2 & Port3 → load operations, AGU (address generation unit)
  3. Port4 → store operations, AGU
  4. Port5 → vector operations
  5. Port6 → integer and branch operations
  6. Port7 → AGU

Those are the ports for the Skylake microarchitecture.

https://en.wikipedia.org/wiki/Skylake_(microarchitecture)
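When reading the resource pressure section of the report, it helps to map the busiest port back to the list above. The sketch below does just that; the port descriptions are copied from this page, while the pressure values are hypothetical and would normally be read off the llvm-mca output.

```python
# Port descriptions, as listed on this page for Skylake.
PORT_USAGE = {
    0: "arithmetic instructions",
    1: "arithmetic instructions",
    2: "load operations, AGU",
    3: "load operations, AGU",
    4: "store operations, AGU",
    5: "vector operations",
    6: "integer and branch operations",
    7: "AGU",
}


def busiest_port(pressure):
    """Return (port, description) for the port with the highest pressure."""
    port = max(pressure, key=pressure.get)
    return port, PORT_USAGE[port]


# Hypothetical per-iteration pressure values (cycles per port).
pressure = {0: 0.50, 1: 0.50, 2: 0.50, 3: 0.50, 4: 0.00, 5: 0.25, 6: 1.25, 7: 0.00}

port, usage = busiest_port(pressure)
print(f"Port{port} is the most loaded ({usage})")
```

A port that is much more loaded than the others is a good candidate for the bottleneck of the loop.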

[10p] Task C - In-depth examination

After getting the hang of working with llvm-mca, try adding command line options such as -bottleneck-analysis and changing the iteration count for a more thorough investigation.

The -bottleneck-analysis option reports the factors that limit throughput, such as resource pressure and data dependencies.
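If you want to compare several iteration counts, the runs can be scripted. The following is a sketch under a few assumptions: llvm-mca is in your PATH, the input file name is supplied by you, and the sample report text is illustrative rather than real output for this lab. Only the IPC extraction is exercised here; sweep() is defined but not called.

```python
import re
import subprocess


def extract_ipc(report):
    """Pull the IPC value out of the summary section of an llvm-mca report."""
    match = re.search(r"^IPC:\s+([0-9.]+)", report, re.MULTILINE)
    return float(match.group(1)) if match else None


def sweep(asm_file, counts=(1, 10, 100, 1000)):
    """Run llvm-mca with -bottleneck-analysis for each iteration count
    and collect the reported IPC. Requires llvm-mca to be installed."""
    results = {}
    for n in counts:
        cmd = ["llvm-mca", "-bottleneck-analysis", f"-iterations={n}", asm_file]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        results[n] = extract_ipc(out)
    return results


# Illustrative summary text in the layout llvm-mca uses for its report header.
sample = (
    "Iterations:        300\n"
    "Instructions:      900\n"
    "Total Cycles:      610\n"
    "IPC:               1.48\n"
)
print(extract_ipc(sample))  # 1.48
```

Calling sweep("loop.s") (hypothetical file name) returns a dictionary from iteration count to IPC, which makes it easy to see at what point the value stabilizes.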

ep/labs/01/contents/tasks/ex4.1696929850.txt.gz · Last modified: 2023/10/10 12:24 by radu.mantu
CC Attribution-Share Alike 3.0 Unported