ep:labs:01:contents:ex4 [2019/09/22 15:43]
radu.mantu removed
==== T04. [35p] Hardware counters ====
A significant portion of the system statistics that can be generated involve hardware counters. As the name implies, these are special registers that count the number of occurrences of specific events in the CPU. These counters are implemented through **Model Specific Registers** (MSRs), control registers used by developers for debugging, tracing, monitoring, etc. Since these registers may be subject to changes from one iteration of a microarchitecture to the next, we will need to consult chapters 18 and 19 from [[https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html|Intel 64 and IA-32 Architectures Software Developer's Manual: Vol. 3B]].

The instructions used to interact with these counters are [[https://www.felixcloutier.com/x86/rdmsr|RDMSR]], [[https://www.felixcloutier.com/x86/wrmsr|WRMSR]] and [[https://www.felixcloutier.com/x86/rdpmc|RDPMC]]. Normally, these are considered privileged instructions (that can be executed only in ring0, i.e. kernel space). As a result, acquiring this information from ring3 (user space) requires a context switch into ring0, which we all know to be a costly operation. The objective of this exercise is to prove that this is not necessarily the case and that it is possible to configure and examine these counters from ring3 in as few as a couple of clock cycles.

Before getting started, one thing to note is that there are two types of performance counters:
  - Fixed Function Counters
      * each can monitor a single, distinct and predetermined event (burned in hardware)
      * are configured a bit differently than the other type
      * are not of interest to us in this laboratory
  - General Purpose Counters
      * can be configured to monitor a specific event from a list of over 200 (see chapters 19.1 and 19.2)

Download {{:ep:labs:hw_counter.zip|}}.

=== [5p] Task A - Hardware Info ===
First of all, we need to know what we are working with. Namely, the microarchitecture //version ID// and the //number of counters// per core. To this end, we will use [[https://linux.die.net/man/1/cpuid|cpuid]] (basically a wrapper over the [[https://www.felixcloutier.com/x86/cpuid|CPUID]] instruction). All the information that we need is contained in the 0AH leaf (you might want to look at the raw output of **cpuid**):
  * **CPUID.0AH:EAX[15:8]** : number of general purpose counters
  * **CPUID.0AH:EAX[7:0]** : version ID
  * **CPUID.0AH:EDX[4:0]** : number of fixed function counters

Point out to your assistant which is which in the **cpuid** output.

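As a cross-check, the same fields can be read programmatically. Below is a minimal sketch using gcc's ''<cpuid.h>'' wrapper; the ''bits()'' helper is ours, not part of any standard API.

<code C>
#include <stdint.h>
#include <stdio.h>
#include <cpuid.h>

/* extract bits [hi:lo] of a 32-bit register value */
static uint32_t bits(uint32_t reg, int hi, int lo)
{
    return (reg >> lo) & ((1u << (hi - lo + 1)) - 1);
}

int main(void)
{
    uint32_t eax, ebx, ecx, edx;

    /* leaf 0AH: architectural performance monitoring */
    if (!__get_cpuid(0x0a, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0AH not supported\n");
        return 1;
    }

    printf("version ID         : %u\n", bits(eax, 7, 0));
    printf("GP counters / core : %u\n", bits(eax, 15, 8));
    printf("fixed counters     : %u\n", bits(edx, 4, 0));

    return 0;
}
</code>

The printed values should match what you identified in the raw **cpuid** output.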
=== [5p] Task B - Unlock RDPMC in ring3 ===
This is pretty straightforward. All you need to do is set the **Performance-Monitor Counter Enable** bit in [[https://en.wikipedia.org/wiki/Control_register#CR4|CR4]]. Naturally, this can't be done from ring3. As such, we provide a kernel module that does it for you (see //hack_cr4.c//). When the module is loaded, it will set the aforementioned bit. Similarly, when the module is unloaded, it will revert the change. Try compiling the module, loading and unloading it and, finally, check the kernel message log to verify that it works.
<code bash>
$ make
$ sudo insmod hack_cr4.ko
$ sudo rmmod hack_cr4
$ dmesg
</code>

=== [5p] Task C - Configure IA32_PERF_GLOBAL_CTRL ===
{{:ep:labs:ia32_perf_global_ctrl.png?600|}}

The **IA32_PERF_GLOBAL_CTRL** (0x38f) MSR is an addition from //version 2// that allows enabling / disabling multiple counters with a single WRMSR instruction. What happens, in layman's terms, is that the CPU performs an AND between each ENABLE bit in this register and its counterpart in the counter's original configuration register from //version 1// (which we will deal with in the next task). If the result is 1, the counter registers the programmed event every clock cycle. Normally, all these bits should be set by default during the boot process, but it never hurts to check. Also, note that this register exists for each logical core.

If for CR4 we had to write a kernel module, for MSRs we have user space tools that take care of this for us ([[http://manpages.ubuntu.com/manpages/trusty/man1/rdmsr.1.html|rdmsr]] and [[http://manpages.ubuntu.com/manpages/trusty/man1/wrmsr.1.html|wrmsr]]) by interacting with a driver called **msr**. But first, we must load this driver.

<code bash>
$ lsmod | grep msr
$ sudo modprobe msr
$ lsmod | grep msr
    msr                    16384  0
</code>

Next, let us read the value in the IA32_PERF_GLOBAL_CTRL register. If the result differs from what you see in the snippet below, overwrite the value (the **-a** flag specifies that we want the command to run on each individual logical core).

<code bash>
$ sudo rdmsr -a 0x38f
    70000000f
$ sudo wrmsr -a 0x38f 0x70000000f
</code>

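The value //0x70000000f// is not arbitrary: the low bits enable the general purpose counters and bits starting at position 32 enable the fixed-function ones. A small sketch illustrating the layout (the helper name is ours; the counts assume the 4 general purpose and 3 fixed-function counters typically reported by **cpuid** in Task A):

<code C>
#include <stdint.h>
#include <stdio.h>

/* build the IA32_PERF_GLOBAL_CTRL enable mask: bits [n_gp-1:0] enable the
 * general purpose counters, bits [32+n_fixed-1:32] the fixed-function ones */
static uint64_t global_ctrl_mask(unsigned n_gp, unsigned n_fixed)
{
    return (((1ULL << n_fixed) - 1) << 32) | ((1ULL << n_gp) - 1);
}

int main(void)
{
    /* 4 GP + 3 fixed counters -> the 0x70000000f from the snippet above */
    printf("%#llx\n", (unsigned long long)global_ctrl_mask(4, 3));
    return 0;
}
</code>

If your CPU reports different counter counts, the value you should write with **wrmsr** changes accordingly.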
=== [5p] Task D - Configure IA32_PERFEVENTSELx ===
{{:ep:labs:ia32_perfeventselx.png?600|}}

The **IA32_PERFEVENTSELx** are MSRs from //version 1// that are used to configure the monitored event of a certain counter, its enabled state and a few other things. We will not go into detail and instead only mention the fields that interest us right now (you can read about the rest in the Intel manual). Note that the //x// in the MSR's name stands for the counter number. If we have 4 counters, it takes values in the 0:3 range. The one that we will configure is IA32_PERFEVENTSEL0 (0x186). If you want to configure more than one counter, note that they have consecutive register numbers (i.e. 0x187, 0x188, etc.).

As for the register flags, those that are not mentioned in the following list should be left cleared:
  * **EN** (enable flag) = **1** starts the counter
  * **USR** (user mode flag) = **1** monitors only ring3 events
  * **UMASK** (unit mask) = **??** depends on the monitored event (see chapter 19.2)
  * **EVSEL** (event select) = **??** depends on the monitored event (see chapter 19.2)

Before actually writing to this register, we should verify that no one is currently using it. If it is indeed free, we might also want to clear **IA32_PMC0** (0xc1). PMC0 is the actual counter that is associated with PERFEVENTSEL0.

<code bash>
$ sudo rdmsr -a 0x186
    0
$ sudo wrmsr -a 0xc1 0x00
$ sudo wrmsr -a 0x186 0x41????
</code>
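The //0x41????// value simply packs the flags above into one word: EVSEL in bits [7:0], UMASK in bits [15:8], USR at bit 16 and EN at bit 22. A small encoder sketch (the helper name is ours; as a worked example we use the //architectural// LLC Misses event, event select 0x2e, umask 0x41, not the event you have to find for this task):

<code C>
#include <stdint.h>
#include <stdio.h>

/* pack an IA32_PERFEVENTSELx value: EVSEL[7:0], UMASK[15:8], USR(16), EN(22) */
static uint64_t perfevtsel(uint8_t evsel, uint8_t umask)
{
    return (uint64_t)evsel
         | ((uint64_t)umask << 8)
         | (1ULL << 16)     /* USR: count only ring3 events */
         | (1ULL << 22);    /* EN: start the counter */
}

int main(void)
{
    /* architectural LLC Misses event: evsel 0x2e, umask 0x41 */
    printf("%#llx\n", (unsigned long long)perfevtsel(0x2e, 0x41));
    return 0;
}
</code>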

For the next (and //final//) task we are going to monitor the number of L2 cache misses. Look for such an event in Table 19-3 from chapter 19.2 in the Intel manual and set the last two bytes (the unit mask and event select) accordingly. If the operation is successful and the counters have started, you should see non-zero values in the PMC0 register, increasing with each subsequent read.

<solution -hidden>
The event they should find in Table 19-3 is called **L2_RQSTS.MISS** (event number: //0x24// | umask: //0x3f//). As such, the value that they need to write in the MSR is //0x413f24//.
</solution>

=== [15p] Task E - Ring3 cache performance evaluation ===

By now, we should be able to modify the **CR4** register with the kernel module, enable all counters in **IA32_PERF_GLOBAL_CTRL** across all cores and start an **L2 cache miss** counter, again across all cores. What remains is putting everything into practice.

Take //mat_mul.c//. This program may be familiar from an ASC laboratory but, in case it isn't, the gist of it is that when using the naive matrix multiplication algorithm (O(n^3)), how quickly each iterator varies can wildly affect the performance of the program. The reason behind this is (in)efficient use of the CPU cache. Take a look at the following snippet from the source and keep in mind that each matrix buffer is a contiguous area in memory.

<code C>
for (uint32_t i = 0; i < N; ++i)             /* line   */
    for (uint32_t j = 0; j < N; ++j)         /* column */
        for (uint32_t k = 0; k < N; ++k)
            r[i*N + j] += m1[i*N + k] * m2[k*N + j];
</code>

What is the problem here? The indices i and k are multiplied by the large number N when addressing an element. Fast variations in these two indices therefore cause huge strides in the accessed memory areas (larger than a cache line) and lead to unnecessary cache misses. So what are the best and worst orderings of the three fors? The best: i, k, j. The worst: j, k, i. As we can see, the configurations that we will monitor in //mat_mul.c// do not coincide with the aforementioned two (so... not great, not terrible). Even so, the difference in execution time and number of cache misses will still be significant.
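For reference, the best ordering mentioned above moves //j// to the innermost position, so that the inner loop walks both //r// and //m2// with stride 1. A sketch, assuming square //N x N// matrices of doubles (the function name is ours, not from //mat_mul.c//):

<code C>
#include <stdint.h>
#include <string.h>

/* cache-friendly i, k, j ordering: the innermost loop accesses r and m2
 * at consecutive addresses, so successive iterations hit the same lines */
static void mat_mul_ikj(const double *m1, const double *m2, double *r, uint32_t N)
{
    memset(r, 0, N * N * sizeof(*r));
    for (uint32_t i = 0; i < N; ++i)
        for (uint32_t k = 0; k < N; ++k)
            for (uint32_t j = 0; j < N; ++j)
                r[i*N + j] += m1[i*N + k] * m2[k*N + j];
}
</code>

Note that the result is identical regardless of loop order; only the memory access pattern (and thus the miss count) changes.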

Which brings us to the task at hand: using the **RDPMC** instruction, calculate the number of L2 cache misses for each of the two multiplications __without performing any context switches__ (hint: look at [[https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html|gcc extended asm]] or //hack_cr4.c//).

A word of caution: remember that each logical core has its own PMC0 counter, so make sure to use [[https://linux.die.net/man/1/taskset|taskset]] in order to set the CPU affinity of the process.

<code bash>
$ taskset 0x01 ./mat_mul 1024
</code>
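If you prefer not to rely on an external tool, the same pinning can also be done from inside the program. A sketch using the Linux-specific [[https://linux.die.net/man/2/sched_setaffinity|sched_setaffinity]] syscall, pinning to logical core 0 to match the //0x01// mask above:

<code C>
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    /* allow the calling process to run only on logical core 0 */
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* from here on, RDPMC always reads core 0's counters */
    return 0;
}
</code>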

Here are some bonus questions:
  * What would happen if we did not set the CPU affinity?
  * Can you find a disadvantage of using RDPMC in ring3? Consider having more CPU intensive tasks than available cores.
  * What about PMC overflows?

<solution -hidden>
**Task E:**
<code C>
#define rdpmc(ecx, eax, edx)    \
    asm volatile (              \
        "rdpmc"                 \
        : "=a"(eax),            \
          "=d"(edx)             \
        : "c"(ecx))
</code>

<code C>
/* hardware counter init */
rdpmc(ecx, eax, edx);
counter = ((uint64_t)eax) | (((uint64_t)edx) << 32);

/* perform slow multiplication */
for (uint32_t i = 0; i < N; ++i)             /* line   */
    for (uint32_t j = 0; j < N; ++j)         /* column */
        for (uint32_t k = 0; k < N; ++k)
            r[i*N + j] += m1[i*N + k] * m2[k*N + j];

/* hardware counter delta */
rdpmc(ecx, eax, edx);
counter = (((uint64_t)eax) | (((uint64_t)edx) << 32)) - counter;
</code>

**Additional questions:**
  * RDPMC reads a PMC on the local core. If the process is bounced between cores while the multiplication is executed, you will end up calculating the difference between the L2 cache misses on cpu0 after the multiplication and the L2 cache misses on cpu7 before the multiplication, which makes no sense.
  * This is usually done in ring0 because the kernel knows when a process is scheduled and when it is not. If the system has many CPU intensive processes, ours might end up in the WAIT state and the PMC will keep being incremented by events caused by another process. However, if the system does not have that many CPU intensive processes (which, in our case, it doesn't), then the values that we obtain, while not exact, are good enough.
  * The PMCs are 40-bit counters and may overflow at some point. Short term, we don't need to worry about that. Long term, we can set up the counter to generate an interrupt when an overflow occurs.
</solution>
CC Attribution-Share Alike 3.0 Unported