# Lab 01 - CPU Monitoring (Linux)

## Objectives

• Offer an introduction to Performance Monitoring
• Present the main CPU metrics and how to interpret them
• Get you to use various tools for monitoring the performance of the CPU
• Familiarize you with the x86 Hardware Performance Counters

## Proof of Work

Before you start, create a Google Doc. Here, you will add screenshots / code snippets / comments for each exercise. Whatever you decide to include, it must prove that you managed to solve the given task (so don't show just the output, but how you obtained it and what conclusion can be drawn from it). If you decide to complete the feedback for bonus points, include a screenshot with the form submission confirmation, but not with its contents.

When done, export the document as a PDF and upload it to the appropriate assignment on Moodle. The deadline is 23:55 on Friday.

## Introduction

Performance Monitoring is the process of checking a set of metrics in order to ascertain the health of the system. Normally, the information gleaned from these metrics is in turn used to fine-tune the system in order to maximize its performance. As you may imagine, both acquiring and interpreting this data require at least some knowledge of the underlying operating system.

In the following four labs, we'll discuss the four main subsystems that are likely to have an impact either on a single process, or on the system as a whole. These are: CPU, memory, disk I/O and networking. Note that these subsystems are not independent of one another. For example, a web application may be dependent on the network stack of the kernel. The implementation of that stack determines the number of packets processed in a given amount of time. However, protocols that require checksum calculation (e.g.: TCP) will want to use a highly optimized implementation of this function (which is written directly in assembly). If your architecture does not have such an implementation and falls back to using the one written in C, you may prefer changing your choice of protocol.

When dealing strictly with the CPU, these are a few things to look out for:

##### Context Switches

A context switch is a transition from one runtime environment to another. It usually comes in the form of a privileged call into kernel space (e.g.: a system call) and the return from it. Whenever this happens, a copy of your register state must be saved and later restored, which takes up some time.

Note, however, how each individual process has its own address space, but in every address space, the only constant is the kernel. Why is that? Well, when the time slice of a process runs out and another is scheduled in, the kernel must perform a Translation Lookaside Buffer (TLB) flush. Otherwise, memory accesses in the new process might erroneously end up targeting the memory of the previous process. Yes, some shared objects (libraries) could have been mapped at the same virtual addresses and deleting those entries from the TLB is a shame, but there's no workaround for that. Now, back to our original question: why is the kernel mapped identically in each virtual address space? The reason is that when you perform a context switch into the kernel after calling open() or read(), a TLB flush is not necessary. If you wanted to write your own kernel, you could theoretically isolate the kernel's address space (like any other process), but you would see a huge performance drop.

The takeaway is that some context switches are more expensive than others. Not being able to keep a process scheduled on a single core 100% of the time comes with a huge cost (flushing the TLB). That being said, context switches from user space to kernel space are still expensive operations. As Terry Davis once demonstrated in his TempleOS, running everything at the same privilege level can reduce the cost of context switches by orders of magnitude.

##### CPU Utilization

Each process is given a time slice for it to utilize however it sees fit. The way that time is utilized can prove to be a meaningful metric. There are two ways that we can look at this data: system level or process level.

At system level, the data is offered by the kernel in /proc/stat (details in man 5 proc; look for this file). For each core, we get the number of time units (USER_HZ, configured at compile time in the kernel; one unit ~= 10ms) that the core has spent on a certain type of task. The more commonly encountered ones are of course:

• user: Running unprivileged code in ring3.
• system: Running privileged code in ring0.
• idle: Not running anything. In this case, the core voltage & frequency are usually reduced.
• nice: Same as user, but refers to processes with a nice > 0 personality. More details here.

The less reliable / relevant ones are:

• iowait: Time waiting for I/O. Not reliable because this is usually done via Direct Memory Access at kernel level and processes that perform blocking I/O operations (e.g.: read() – with the exception of certain types of files, such as sockets, opened with O_NONBLOCK) automatically yield their remaining time for another CPU bound process to be rescheduled.
• (soft)irq: Time servicing interrupts. This has nothing to do with user space processes. A high number can indicate high peripheral activity.
• steal: If the current system runs under a Hypervisor (i.e.: you are running in a Virtual Machine), know that the HV has every right to steal clock cycles from any VM in order to satisfy its own goals. Just like the kernel can steal clock cycles from a regular process to service an interrupt from, let's say, the Network Interface Controller, so can the HV steal clock cycles from the VM for exactly the same purpose.
• guest: The opposite of steal. If you are running a VM, then the kernel can take the role of a HV in some capacity (see kvm). This is the amount of time the CPU was used to run the guest VM.

At process level, the data can be found in /proc/[pid]/stat (see man 5 proc). Note that in this case, the amount of information the kernel interface provides is much more varied. While we still have utime (user time) and stime (system time), note that we also have statistics for child processes that have not been orphaned: cutime, cstime.
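As a rough illustration of these per-process counters, the sketch below pulls utime, stime, cutime and cstime out of one /proc/[pid]/stat line. The helper name parse_pid_stat and the sample line are made up for this example. Field positions follow man 5 proc, and the default of 100 ticks per second is an assumption (on a real system, query os.sysconf("SC_CLK_TCK")).

```python
def parse_pid_stat(text, ticks_per_second=100):
    """Return the CPU times (in seconds) found in one /proc/[pid]/stat line.

    ticks_per_second is assumed to be 100 here; the real value is
    os.sysconf("SC_CLK_TCK").
    """
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    rest = text[text.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field N lives at index N - 3.
    names = ("utime", "stime", "cutime", "cstime")   # fields 14..17
    return {
        name: int(rest[index]) / ticks_per_second
        for name, index in zip(names, range(11, 15))
    }


# Hypothetical sample line, truncated after the fields we care about.
SAMPLE_STAT = "1234 (my prog) S 1 1234 1234 0 -1 4194304 500 0 0 0 250 50 10 5 20 0 1 0"

times = parse_pid_stat(SAMPLE_STAT)
print(times)
```

To try it on a live process, pass the contents of /proc/self/stat (or any other PID's stat file) instead of the sample line.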

Although you may find many tools that offer similar information, remember that these files are the origin. Another thing to keep in mind is that this data is representative for the entire session, i.e.: from system boot or from process launch. If you want to interpret it in a meaningful manner, you need to get two data points and know the time interval between their acquisition.
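The "two data points" idea can be sketched as follows: take two snapshots of the aggregate cpu line from /proc/stat and work only with the differences. The function names and the sample lines are invented for this example. Counting iowait as idle time is a choice made here, not something the kernel mandates.

```python
def parse_cpu_line(line):
    """Split a 'cpu ...' line from /proc/stat into its name and counters."""
    fields = line.split()
    return fields[0], [int(value) for value in fields[1:]]


def cpu_utilization(before, after):
    """Percentage of non-idle time between two snapshots of the same line."""
    _, old = parse_cpu_line(before)
    _, new = parse_cpu_line(after)
    deltas = [n - o for o, n in zip(old, new)]
    # Order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already accounted in user and nice,
    # so only the first eight counters are summed.
    total = sum(deltas[:8])
    idle = deltas[3] + deltas[4]          # idle + iowait (a choice, see above)
    return 100.0 * (total - idle) / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line; not called here so the sketch runs anywhere."""
    with open(path) as stat_file:
        return stat_file.readline()


# Hypothetical snapshots taken some interval apart.
BEFORE = "cpu  100 0 50 800 50 0 0 0 0 0"
AFTER = "cpu  150 0 70 1500 80 0 0 0 0 0"
print(f"{cpu_utilization(BEFORE, AFTER):.2f}% busy")
```

On a real system, call read_cpu_line() twice with a time.sleep() in between and pass both results to cpu_utilization().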

##### Scheduling

When a CPU frees up, the kernel must decide which process gets to run next. To this end, it uses the Completely Fair Scheduler (CFS). Normally, we don't question the validity of the scheduler's design. That's a few levels above our pay grade. What we can do is adjust the value of /proc/sys/kernel/sched_min_granularity_ns. This virtual file contains the minimum number of nanoseconds that a task is allocated when scheduled on the CPU. A lower value guarantees that each process will be scheduled sooner rather than later, which is a good trait of a real-time system (e.g.: Android – you don't want unresponsive menus). A greater value, however, is better when you are doing batch processing (e.g.: rendering a video). We noted previously that switching active processes on a CPU core is an expensive operation. Thus, allowing each process to run for longer will reduce the CPU dead time in the long run.

Another aspect that's not talked about as much is core scheduling. Given that you have more available cores than active tasks, on what core do you schedule a task? When answering this question, we need to keep a few things in mind: the CPU does not operate at a constant frequency. The voltage of each core, and consequently its frequency, varies based on the amount of active time. That being said, if a core has been idle for quite some time and suddenly a new task is scheduled, it will take some time to get it from its low-power frequency to its maximum. But now consider this: what if the workload is not distributed among all cores and a small subset of cores overheats? The CPU is designed to forcibly reduce the frequency in such cases and, if the overall temperature exceeds a certain point, to shut down entirely.

At the moment, CFS likes to spread out the tasks to all cores. Of course, each process has the right to choose the cores it's comfortable to run on (more on this in the exercises section). Another reason why this may be preferable, one that we haven't mentioned before, is that it avoids invalidating the CPU cache. L1 and L2 caches are specific to each physical core. L3 is accessible to all cores. However, L1 and L2 have an access time of 1-10ns, while L3 can go as high as 30ns. If you have some time, read a bit about Nest, a newly proposed scheduler that aims to keep scheduled tasks on “warm cores” until it becomes necessary to power up idle cores as well. Can you come up with situations when Nest may be better or worse than CFS?

### 01. [30p] Vmstat

The vmstat utility provides a good low-overhead view of system performance. Since vmstat is such a low-overhead tool, it is practical to have it running even on heavily loaded servers when it is needed to monitor the system’s health.

#### [10p] Task A - Monitoring stress

Run vmstat on your machine with a 1 second delay between updates. Notice the CPU utilisation (info about the output columns here).

In another terminal, use the stress command to start N CPU workers, where N is the number of cores on your system. Do not pass the number directly. Instead, use command substitution.

Note: if you are trying to solve the lab on fep and you don't have stress installed, try cloning and compiling stress-ng.

#### [10p] Task B - How does it work?

Let us look at how vmstat works under the hood. We can assume that all these statistics (memory, swap, etc.) cannot normally be gathered in user space. So how does vmstat get these values from the kernel? Or rather, how does any process interact with the kernel? Most obvious answer: system calls.

$ strace vmstat

“All well and good. But what am I looking at?”

What you should be looking at are the system calls after the two writes that display the output header (hint: it has to do with the /proc/ file system). So, what are these files that vmstat opens?

$ file /proc/meminfo
$ cat /proc/meminfo
$ man 5 proc

The manual should contain enough information about what these kernel interfaces can provide. However, if you are interested in how the kernel generates the statistics in /proc/meminfo (for example), a good place to start would be meminfo.c (but first, SO2 wiki).
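To make the "it is just a text file" point concrete, here is a minimal sketch that turns /proc/meminfo-style text into a dictionary. The function name and the sample text are invented for illustration; this shows the file format and is not how vmstat itself is implemented.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (kB for most entries)."""
    info = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(":")
        if not separator:
            continue                      # skip anything that is not "Key: value"
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])   # drop the optional "kB" unit
    return info


# Hypothetical excerpt in the format shown by `cat /proc/meminfo`.
SAMPLE_MEMINFO = """MemTotal:       16384256 kB
MemFree:         8123456 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print(info["MemFree"], "kB free out of", info["MemTotal"], "kB")
```

On a Linux machine, the same function can be fed open("/proc/meminfo").read().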

#### [10p] Task C - USO flashbacks (1)

Write a one-liner that uses vmstat to report complete disk statistics and sort the output in descending order based on the total reads column.

You can eliminate the first two header lines from the vmstat output using tail -n +3.

### 02. [30p] Mpstat

Open fact_rcrs.zip and look at the code.

#### [10p] Task A - Python recursion depth

Try to run the script while passing 1000 as a command line argument. Why does it crash?

Luckily, Python allows you to both retrieve the current recursion limit and set a new value for it. Increase the recursion limit so that the process will never crash, regardless of input (assume that it still has a reasonable upper bound).
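The mechanism can be observed in isolation with the sketch below. The fact function is only a stand-in for the script in fact_rcrs.zip (whose contents are not reproduced here), and fact_with_limit is a made-up helper that sets the limit temporarily, tries the call, and restores the previous value. Only sys.getrecursionlimit and sys.setrecursionlimit are real interpreter APIs.

```python
import sys


def fact(n):
    """Naive recursive factorial, one Python stack frame per step."""
    return 1 if n <= 1 else n * fact(n - 1)


def fact_with_limit(n, limit):
    """Compute fact(n) under a temporary recursion limit.

    Returns None when the interpreter aborts the recursion.
    """
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        return fact(n)
    except RecursionError:
        return None
    finally:
        sys.setrecursionlimit(previous)   # always restore the old limit


print("current limit:", sys.getrecursionlimit())
print("fact(1000) with limit 500 :", fact_with_limit(1000, 500))
print("fact(1000) with limit 5000:", fact_with_limit(1000, 5000) is not None)
```

How large the limit can safely be made depends on the size of the native stack, so treat any concrete number as an assumption to verify on your machine.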

#### [10p] Task B - CPU affinity

Run the script again, this time passing 10000. Use mpstat to monitor the load on each individual CPU at 1s intervals. The one with close to 100% load will be the one running our script. Note that the process might be passed around from one core to another.

Stop the process. Use stress to create N-1 CPU workers, where N is the number of cores on your system. Use taskset to set the CPU affinity of the N-1 workers to CPUs 1-(N-1) and then run the script again. You should notice that the process is scheduled on cpu0.

Note: to get the best performance when running a process, make sure that it stays on the same core for as long as possible. Don't let the scheduler decide this for you, if you can help it. Allowing it to bounce your process between cores can drastically impact the efficient use of the cache and the TLB. This holds especially true when you are working with servers rather than your personal PCs. While the problem may not manifest on a system with only 4 cores, you can't guarantee that it also won't manifest on one with 40 cores. When running several experiments in parallel, aim for something like this:

Figure 1: htop output. Processes are bound to specific cores, increasing performance by not potentially invalidating L1 and L2 caches. This works out well since we have fewer active processes than available cores. Otherwise, setting the affinity to a single core may backfire; the rescheduling of these processes could be delayed until other processes are also allocated a time slice. We notice that CPU usage on these cores is maxed (green:user space, red:kernel space). The ratio tells us that a considerable amount of time is spent in kernel space, leading us to believe that the processes are I/O bound.
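taskset applies an affinity mask from the shell; a process can also do this to itself. The sketch below uses Python's os.sched_getaffinity and os.sched_setaffinity, which exist only on Linux. The helper name pin_to_core is an assumption of this example.

```python
import os


def pin_to_core(core, pid=0):
    """Restrict `pid` (0 means the calling process) to a single core.

    Returns the affinity set reported by the kernel afterwards.
    """
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)


# Usage, left as comments so that importing this file changes nothing:
#   allowed = os.sched_getaffinity(0)    # cores we may currently run on
#   pin_to_core(min(allowed))            # stay on the lowest-numbered one
#   os.sched_setaffinity(0, allowed)     # undo the pinning
```

Child processes inherit the affinity mask, so pinning a launcher script also pins whatever it starts afterwards.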

#### [10p] Task C - USO flashbacks (2)

Write a bash command that binds CPU stress workers on your odd-numbered cores (i.e.: 1,3,5,…). The list of cores and the number of stress workers must NOT be hardcoded, but constructed based on nproc (or whatever else you fancy).
In your submission, include both the bash command and an mpstat capture to prove that the command is working.

### 03. [15p] Zip with compression levels

The zip command is used for compression and file packaging on Linux/Unix operating systems. It provides 10 levels of compression, where:

• level 0 : provides no compression, only packaging
• level 6 : used as default compression level
• level 9 : provides maximum compression
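The size versus time trade-off behind these levels can be previewed without zip at all. The sketch below uses Python's zlib module, whose levels 0-9 are used here as a stand-in for zip's (an assumption: both are DEFLATE-based, but the numbers will not match zip exactly). The function name and the synthetic input are invented for this example.

```python
import time
import zlib


def measure_levels(data, levels=range(10)):
    """Return {level: (compressed_size, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results[level] = (len(packed), elapsed)
    return results


# Synthetic, highly repetitive input; real files will behave differently.
data = b"performance monitoring " * 50_000
results = measure_levels(data)
for level, (size, seconds) in results.items():
    ratio = size / len(data)
    print(f"level {level}: {ratio:.4f} of original size, {seconds * 1000:.2f} ms")
```

Level 0 produces output slightly larger than the input because the data is only wrapped, not compressed.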

$ zip -5 file.zip file.txt

#### [10p] Task A - Measurements

Write a script to measure the compression rate and the time required for each level. Use the following files:

• two largest bitmaps from here
• this large text file here

#### [5p] Task B - Plot

Fill the data you obtained into the python3 script in plot.zip.

Make sure you have python3 and python3-matplotlib installed.

### 04. [25p] Hardware counters

A significant portion of the system statistics that can be generated involve hardware counters. As the name implies, these are special registers that count the number of occurrences of specific events in the CPU. These counters are implemented through Model Specific Registers (MSR), control registers used by developers for debugging, tracing, monitoring, etc. Since these registers may be subject to changes from one iteration of a microarchitecture to the next, we will need to consult chapters 18 and 19 from Intel 64 and IA-32 Architectures Developer's Manual: Vol. 3B.

The instructions that are used to interact with these counters are RDMSR, WRMSR and RDPMC. Normally, these are considered privileged instructions (that can be executed only in ring0, aka. kernel space). As a result, acquiring this information from ring3 (user space) requires a context switch into ring0, which we all know to be a costly operation. The objective of this exercise is to prove that this is not necessarily the case and that it is possible to configure and examine these counters from ring3 in as few as a couple of clock cycles.

Before getting started, one thing to note is that there are two types of performance counters:

1. Fixed Function Counters
   • each can monitor a single, distinct and predetermined event (burned in hardware)
   • are configured a bit differently than the other type
   • are not of interest to us in this laboratory
2. General Purpose Counters
   • can be configured to monitor a specific event from a list of over 200 (see chapters 19.1 and 19.2)

Download hw_counter.zip.

Here is an overview of the following five tasks:

• Task A: check the version ID of your CPU to determine what it's capable of monitoring.
• Task B: set a certain bit in CR4 to enable ring3 usage of the RDPMC instruction.
• Task C: use some ring3 tools to enable the hardware counters.
• Task D: start counting L2 cache misses.
• Task E: use RDPMC to measure the cache misses for a familiar program.

#### [5p] Task A - Hardware Info

First of all, we need to know what we are working with. Namely, the microarchitecture version ID and the number of counters per core. To this end, we will use cpuid (basically a wrapper over the CPUID instruction). All the information that we need will be contained in the 0AH leaf (you might want to get the raw output of cpuid):

• CPUID.0AH:EAX[15:8] : number of general purpose counters
• CPUID.0AH:EAX[7:0] : version ID
• CPUID.0AH:EDX[4:0] : number of fixed function counters

Note: the first two columns of the output represent the EAX and ECX registers used when calling CPUID. If the most significant bit in EAX is 1 (i.e.: starts with 0x8) the output is for extended options. ECX is a relatively new addition. So when looking for the 0AH leaf, search for a line starting with 0x0000000a. The register contents following ':' represent the output of the instruction.

Point out to your assistant which is which in the cpuid output.

#### [5p] Task B - Unlock RDPMC in ring3

This is pretty straightforward. All you need to do is set the Performance-Monitor Counter Enable bit in CR4. Naturally, this can't be done from ring3. As such, we provide a kernel module that does it for you (see hack_cr4.c). When the module is loaded, it will set the aforementioned bit. Similarly, when the module is unloaded, it will revert the change. Try compiling the module, loading and unloading it and finally, check the kernel message log to verify that it works.

$ make
$ sudo insmod hack_cr4.ko
$ sudo rmmod hack_cr4
$ dmesg

Note: the module must remain loaded in the kernel in order to keep the bit set. If during Task E you get a segfault, the reason is that you (probably) unloaded the module and you no longer have permission to run the instruction in ring3. This does NOT invalidate your work in Tasks C and D; simply load the module once more.

#### [5p] Task C - Configure IA32_PERF_GLOBAL_CTRL

Figure 2: Control register for the Fixed Function and General Purpose counters. While setting a bit will enable the associated counter, clearing it will disable it. Note that for a counter to be enabled, both this bit and the EN bit in its configuration register must be set. If either is cleared, the counter is disabled. The purpose of this register is to simultaneously change the active state of multiple counters, with a single write instruction.

The IA32_PERF_GLOBAL_CTRL (0x38f) MSR is an addition from version 2 that allows enabling / disabling multiple counters with a single WRMSR instruction. What happens, in layman's terms, is that the CPU performs an AND between each ENABLE bit in this register and its counterpart in the counter's original configuration register from version 1 (which we will deal with in the next task). If the result is 1, the counter begins to register the programmed event every clock cycle. Normally, all these bits should be set by default during the booting process but it never hurts to check. Also, note that this register exists for each logical core.

If for CR4 we had to write a kernel module, for MSRs we have user space tools that take care of this for us (rdmsr and wrmsr) by interacting with a driver called msr (install msr-tools if it's missing from your system). But first, we must load this driver.

$ lsmod | grep msr
$ sudo modprobe msr
$ lsmod | grep msr
msr                    16384  0

Next, let us read the value in the IA32_PERF_GLOBAL_CTRL register. If the result differs from what you see in the snippet below, overwrite the value (the -a flag specifies that we want the command to run on each individual logical core).

$ sudo rdmsr -a 0x38f
70000000f
$ sudo wrmsr -a 0x38f 0x70000000f
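The value 0x70000000f is not arbitrary: it has one bit set for each of four general purpose counters (bits 0-3) and one for each of three fixed function counters (bits 32-34). The sketch below rebuilds it from the counter counts reported by the CPUID 0AH leaf. The function names and the sample register values are illustrative assumptions; substitute what cpuid prints on your machine. The fixed counter count is read from the low five bits of EDX, as laid out in the Intel SDM.

```python
def decode_leaf_0a(eax, edx):
    """Extract the fields used in this lab from CPUID leaf 0AH output."""
    return {
        "version": eax & 0xFF,              # EAX[7:0]
        "gp_counters": (eax >> 8) & 0xFF,   # EAX[15:8]
        "fixed_counters": edx & 0x1F,       # low bits of EDX
    }


def global_ctrl_mask(gp_counters, fixed_counters):
    """Build an IA32_PERF_GLOBAL_CTRL value that enables every counter."""
    gp_bits = (1 << gp_counters) - 1                  # bits 0 .. gp_counters-1
    fixed_bits = ((1 << fixed_counters) - 1) << 32    # fixed counters start at bit 32
    return fixed_bits | gp_bits


# Hypothetical register contents, NOT read from real hardware.
info = decode_leaf_0a(eax=0x07300404, edx=0x00000603)
mask = global_ctrl_mask(info["gp_counters"], info["fixed_counters"])
print(info, hex(mask))
```

If your CPU reports a different number of counters, the expected register value changes accordingly.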

#### [5p] Task D - Configure IA32_PERFEVENTSELx

Figure 3: Configuration register for individual counters. Of interest to us are the EN bit (mentioned in the previous subsection), the event selection fields, and the user mode bit. Note how the USR bit can only distinguish between ring 0 and ring 3. While rings 1 and 2 are still present in the CPU's implementation today, no mainstream operating system has used them in over 30 years. The PMC, being a newer addition, acknowledges this reality in trying to simplify the control interface as much as possible. It is not clear if rings 1 and 2 are blind spots for PMCs or if they are covered under ring 0.

The IA32_PERFEVENTSELx are MSRs from version 1 that are used to configure the monitored event of a certain counter, its enabled state and a few other things. We will not go into detail and will instead only mention the fields that interest us right now (you can read about the rest in the Intel manual). Note that the x in the MSR's name stands for the counter number. If we have 4 counters, it takes values in the 0:3 range. The one that we will configure is IA32_PERFEVENTSEL0 (0x186). If you want to configure more than one counter, note that they have consecutive register numbers (i.e. 0x187, 0x188, etc.).

As for the register flags, those that are not mentioned in the following list should be left cleared:

• EN (enable flag) = 1 starts the counter
• USR (user mode flag) = 1 monitors only ring3 events
• UMASK (unit mask) = ?? depends on the monitored event (see chapter 19.2)
• EVSEL (event select) = ?? depends on the monitored event (see chapter 19.2)
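Packing these four fields into a register value is plain bit arithmetic. The sketch below assumes the layout shown in Figure 3 (EVSEL in bits 7:0, UMASK in bits 15:8, USR at bit 16, EN at bit 22) and leaves every other flag cleared. The helper name perfevtsel is made up, and the example encodes the architectural "Instructions Retired" event (event 0xC0, umask 0x00) rather than the event this task asks for.

```python
USR_BIT = 1 << 16   # count only while running in ring3
EN_BIT = 1 << 22    # enable the counter


def perfevtsel(evsel, umask, usr=True, en=True):
    """Compose an IA32_PERFEVENTSELx value from its main fields."""
    if not (0 <= evsel <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("evsel and umask are 8-bit fields")
    value = evsel | (umask << 8)    # EVSEL in bits 7:0, UMASK in bits 15:8
    if usr:
        value |= USR_BIT
    if en:
        value |= EN_BIT
    return value


# Instructions Retired, user mode only, counter enabled.
print(hex(perfevtsel(0xC0, 0x00)))
```

The printed value is what would be passed to wrmsr for register 0x186; look up the event select and unit mask for your target event in chapter 19.2 before writing anything.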

Before actually writing in this register, we should verify that no one is currently using it. If this is indeed the case, we might also want to clear IA32_PMC0 (0xc1). PMC0 is the actual counter that is associated to PERFEVENTSEL0.

$ sudo rdmsr -a 0x186
0
$ sudo wrmsr -a 0xc1 0x00

### 05. [10p] Feedback

Please take a minute to fill in the feedback form for this lab.