# /usr/bin/time -v date
Once memory pages are mapped into the buffer cache, the kernel will attempt to reuse these pages, resulting in a Minor Page Fault (MnPF). A MnPF saves the kernel time by reusing a page already in memory as opposed to fetching it back from the disk.
To find out how many Major Page Faults (MPF) and Minor Page Faults (MnPF) occur when an application starts, the time command can be used:
# /usr/bin/time -v evolution
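If GNU time is unavailable, the same per-process counters can be read straight from procfs; a minimal sketch (field positions taken from proc(5)):

```shell
# minflt and majflt are fields 10 and 12 of /proc/[pid]/stat (see proc(5));
# here we read them for the awk process itself via /proc/self
awk '{ print "minor faults:", $10, "major faults:", $12 }' /proc/self/stat
```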
Certain conditions occur on a system that may create I/O bottlenecks. These conditions may be identified by using a standard set of system monitoring tools. These tools include top, vmstat, iostat, and sar. There are some similarities between the outputs of these commands, but for the most part, each offers a unique set of output that provides a different perspective on performance. The following subsections describe conditions that cause I/O bottlenecks.
Every I/O request to a disk takes a certain amount of time, primarily because the disk must spin and the head must seek. The spinning of the disk is often referred to as "rotational delay" (RD) and the moving of the head as a "disk seek" (DS). The time each I/O request takes is the sum of DS and RD. A disk's RD is fixed by the RPM of the drive and is taken to be the time of half a revolution, on average.
Each time an application issues an I/O, it takes an average of 8 ms to service that I/O on a 10K RPM disk. Since this is a fixed cost, it is imperative that the disk be as efficient as possible with the time it spends reading and writing. The number of I/O requests is often measured in I/Os Per Second (IOPS). A 10K RPM disk can push 120 to 150 (burst) IOPS. To gauge how effectively those IOPS are used, divide the amount of data read or written per second by the number of IOPS; this gives the KB per I/O.
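As a quick illustration of the KB-per-I/O arithmetic (the throughput and IOPS figures below are made up for the example):

```shell
# a disk sustaining 6000 KB/s at 120 IOPS moves 50 KB per I/O
awk 'BEGIN { printf "%.1f KB per I/O\n", 6000 / 120 }'
```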
The relevance of KB per I/O depends on the workload of the system. There are two different types of workload categories on a system: sequential and random.
Sequential I/O - Sequential workloads read or write large amounts of data at once, in order. These include applications such as enterprise databases executing large queries and streaming media services capturing data. With sequential workloads, the KB per I/O ratio should be high: performance relies on moving as much data as possible per I/O, since each I/O costs a fixed amount of time. The iostat command provides information on IOPS and the amount of data processed during each I/O; use the -x switch (iostat -x 1).
Random I/O - Random access workloads do not depend as much on the size of the data. They depend primarily on the number of IOPS a disk can push. Web and mail servers are examples of random access workloads: the I/O requests are rather small, and performance relies on how many requests can be processed at once. Therefore, the number of IOPS the disk can push becomes crucial.
If the system does not have enough RAM to accommodate all requests, it must start to use the SWAP device. Writes to the SWAP device are just as costly as file system I/Os. If the system is extremely deprived of RAM, it may create a paging storm to the SWAP disk. If the SWAP device is on the same physical disk as the data being accessed, the system will enter into contention for the I/O paths. This will cause a complete performance breakdown on the system. If pages can't be read from or written to disk, they stay in RAM longer; if they stay in RAM longer, the kernel needs to free RAM. The problem is that the I/O channels are so clogged that nothing can be done. This inevitably leads to a kernel panic and crash of the system.
The following vmstat output demonstrates a system under memory distress. It is writing data out to the swap device:
The previous output demonstrates a large number of read requests into memory (bi). There are so many requests that the system runs short on free memory (free), causing it to send blocks to the swap device (so); the amount of swap in use keeps growing (swpd). Also notice the large percentage of I/O wait time (wa): it indicates the CPU is spending an increasing share of its time waiting for I/O requests to complete.
To see the effect the swapping to disk is having on the system, check the swap partition on the drive using iostat.
Both the swap device (/dev/sda1) and the file system device (/dev/sda3) are contending for I/O. Both show high numbers of write requests per second (w/s) and a high ratio of wait time (await) to service time (svctm). This indicates contention between the two partitions, causing both to underperform.
The vmstat utility provides a good low-overhead view of system performance. Since vmstat is such a low-overhead tool, it is practical to keep it running even on heavily loaded servers whenever the system's health needs monitoring.
Run vmstat on your machine with a 1-second delay between updates. Notice the CPU utilisation (see vmstat(8) for the meaning of the output columns).
In another terminal, use the stress command to start N CPU workers, where N is the number of cores on your system. Do not pass the number directly. Instead, use command substitution.
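A sketch of the construction: the stress invocation itself is commented out, since it runs until interrupted and assumes stress is installed.

```shell
# Derive the worker count via command substitution instead of hardcoding it
N=$(nproc)
echo "starting $N CPU workers"
# On the lab machine (requires stress):
# stress --cpu "$N"
```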
Note: if you are trying to solve the lab on fep and you don't have stress installed, try cloning and compiling stress-ng.
Let us look at how vmstat works under the hood. We can assume that all these statistics (memory, swap, etc.) cannot normally be gathered in userspace. So how does vmstat get these values from the kernel? Or rather, how does any process interact with the kernel? The most obvious answer: system calls.
$ strace vmstat
“All well and good. But what am I looking at?”
What you should be looking at are the system calls after the two writes that display the output header (hint: it has to do with /proc/ file system). So, what are these files that vmstat opens?
$ file /proc/meminfo
$ cat /proc/meminfo
$ man 5 proc
The manual should contain enough information about what these kernel interfaces can provide. However, if you are interested in how the kernel generates the statistics in /proc/meminfo (for example), a good place to start would be meminfo.c (but first, SO2 wiki).
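For instance, the memory and swap figures that vmstat reports can be pulled straight out of /proc/meminfo; a small sketch:

```shell
# vmstat's memory/swap columns come from lines like these
awk '/^(MemTotal|MemFree|SwapTotal|SwapFree):/ { print $1, $2, $3 }' /proc/meminfo
```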
Write a one-liner that uses vmstat to report complete disk statistics and sorts the output in descending order based on the total reads column.
Hint: tail -n +3
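To sanity-check the sorting step in isolation, here it is on canned vmstat -d-style rows (column 2 plays the role of total reads); on a real system you would pipe `vmstat -d | tail -n +3` into the same sort.

```shell
# sort numerically and in reverse on the second column only
printf 'sda 500 0 0\nloop0 3 0 0\nsdb 900 0 0\n' | sort -k2,2 -nr
```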
Try to run the script while passing 1000 as a command line argument. Why does it crash?
Luckily, Python allows you to both retrieve the current recursion limit and set a new value for it (sys.getrecursionlimit() and sys.setrecursionlimit()). Increase the recursion limit so that the process will never crash, regardless of input (assume that the input still has a reasonable upper bound).
Run the script again, this time passing 10000. Use mpstat to monitor the load on each individual CPU at 1-second intervals (mpstat -P ALL 1). The one with close to 100% load will be the one running our script. Note that the process might be passed around from one core to another.
Stop the process. Use stress to create N-1 CPU workers, where N is the number of cores on your system. Use taskset to set the CPU affinity of the N-1 workers to CPUs 1-(N-1) and then run the script again. You should notice that the process is scheduled on cpu0.
Note: to get the best performance when running a process, make sure that it stays on the same core for as long as possible. Don't let the scheduler decide this for you, if you can help it. Allowing it to bounce your process between cores can drastically impact the efficient use of the cache and the TLB. This holds especially true when you are working with servers rather than your personal PCs. While the problem may not manifest on a system with only 4 cores, you can't guarantee that it also won't manifest on one with 40 cores. When running several experiments in parallel, aim for something like this:
Write a bash command that binds CPU stress workers on your odd-numbered cores (i.e.: 1,3,5,…). The list of cores and the number of stress workers must NOT be hardcoded, but constructed based on nproc (or whatever else you fancy).
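One possible way to build the pieces from nproc (the taskset line is commented out because the workers run indefinitely; it assumes at least 2 cores and that stress is installed):

```shell
# Odd-numbered core list (1,3,5,...) and worker count, derived from nproc
N=$(nproc)
ODD=$(seq -s, 1 2 $(( N - 1 )))   # e.g. 1,3,5,7 on an 8-core machine
NW=$(( N / 2 ))                    # one worker per odd core
echo "cores: $ODD, workers: $NW"
# taskset -c "$ODD" stress --cpu "$NW"
```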
In your submission, include both the bash command and a mpstat capture to prove that the command is working.
The zip command is used for compression and file packaging under Linux/Unix operating systems. It provides 10 compression levels (0-9), where 0 means no compression (store only), 9 means best compression at the slowest speed, and 6 is the default. For example, to compress at level 5:
$ zip -5 file.zip file.txt
Write a script to measure the compression rate and the time required for each level. You have a few large files in the code skeleton but feel free to add more. If you do add new files, make sure that they are not random data!
Generate a plot illustrating the compression rate, size decrease, etc. as a function of zip compression level. Make sure that your plot is understandable (i.e., has labels, a legend, etc.) Make sure to average multiple measurements for each compression level.
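A minimal sketch of the measurement loop from the script task. gzip (levels 1-9) is used here as a stand-in so the snippet runs anywhere; substitute the corresponding `zip -q -$lvl out$lvl.zip "$f"` invocation for the actual task, and average several runs per level.

```shell
f=input.txt
seq 1 100000 > "$f"                 # sample input; deliberately not random data
orig=$(stat -c%s "$f")
for lvl in $(seq 1 9); do
    t0=$(date +%s%N)
    gzip -c -"$lvl" "$f" > "out$lvl.gz"
    t1=$(date +%s%N)
    comp=$(stat -c%s "out$lvl.gz")
    awk -v l="$lvl" -v o="$orig" -v c="$comp" -v ms="$(( (t1 - t0) / 1000000 ))" \
        'BEGIN { printf "level %d: ratio %.2f, %d ms\n", l, o / c, ms }'
done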
A significant portion of the system statistics that can be generated involve hardware counters. As the name implies, these are special registers that count the number of occurrences of specific events in the CPU. These counters are implemented through Model Specific Registers (MSRs), control registers used by developers for debugging, tracing, monitoring, etc. Since these registers may be subject to changes from one iteration of a microarchitecture to the next, we will need to consult chapters 18 and 19 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3B.
The instructions that are used to interact with these counters are RDMSR, WRMSR and RDPMC. Normally, these are considered privileged instructions (that can be executed only in ring 0, a.k.a. kernel space). As a result, acquiring this information from ring 3 (user space) requires a context switch into ring 0, which we all know to be a costly operation. The objective of this exercise is to prove that this is not necessarily the case and that it is possible to configure and examine these counters from ring 3 in as few as a couple of clock cycles.
Before getting started, one thing to note is that there are two types of performance counters: fixed-function counters, each of which monitors a predetermined event, and general-purpose (programmable) counters, which can be configured to monitor an event of your choosing.
Here is an overview of the following five tasks:
First of all, we need to know what we are working with: namely, the microarchitecture version ID and the number of counters per core. To this end, we will use cpuid (basically a wrapper over the CPUID instruction). All the information that we need will be contained in the 0AH leaf (you might want the raw output, cpuid -r):
Note: the first two columns of the output represent the EAX and ECX registers used when calling CPUID. If the most significant bit in EAX is 1 (i.e., the value starts with 0x8), the output is for extended options. ECX is a relatively new addition. So when looking for the 0AH leaf, search for a line starting with 0x0000000a. The register contents following ':' are the output of the instruction.
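To illustrate how to decode the leaf, here is a canned raw line (the register values are made up for the example; the EAX layout is per the Intel manual: bits 7:0 hold the PMU version ID, bits 15:8 the number of programmable counters per core):

```shell
# Hypothetical raw cpuid output line for leaf 0x0a
line='0x0000000a 0x00: eax=0x07300404 ebx=0x00000000 ecx=0x00000000 edx=0x00000603'
eax=$(printf '%s\n' "$line" | awk '{ split($3, a, "="); print a[2] }')
printf 'PMU version %d, %d counters per core\n' \
    $(( eax & 0xff )) $(( (eax >> 8) & 0xff ))
```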
Point out to your assistant which is which in the cpuid output.
Due to security considerations, reading the Performance Monitor Counters from userspace is normally not allowed. This is enforced at a hardware level via the Performance-Monitor Counter Enable bit in CR4.
Under normal circumstances, modifying Control Registers from userspace is not possible and you would have to write a kernel module for this. However, the perf_event_open() man page documents a sysfs interface (i.e., /sys/bus/event_source/devices/cpu/rdpmc) that does this for us.
Use the sysfs interface to revert the RDPMC access behavior to the pre-4.0 version.
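For reference, the write itself; the value 2 restores the unrestricted, pre-4.0 behaviour (1 restricts RDPMC to processes with an active perf event mapping, 0 disables it entirely). Root is required.

```shell
# 2 = RDPMC allowed from user space at any time (pre-4.0 behaviour)
echo 2 | sudo tee /sys/bus/event_source/devices/cpu/rdpmc
```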
The IA32_PERF_GLOBAL_CTRL (0x38f) MSR is an addition from version 2 that allows enabling / disabling multiple counters with a single WRMSR instruction. What happens, in layman's terms, is that the CPU performs an AND between each ENABLE bit in this register and its counterpart in the counter's original configuration register from version 1 (which we will deal with in the next task). If the result is 1, the counter counts the programmed event every clock cycle. Normally, all these bits should be set by default during the boot process, but it never hurts to check. Also, note that this register exists for each logical core.
While for CR4 we would have needed a kernel module, for MSRs there are user-space tools (rdmsr and wrmsr) that take care of this for us by interacting with a driver called msr (install msr-tools if it is missing from your system). But first, we must load this driver:
$ lsmod | grep msr
$ sudo modprobe msr
$ lsmod | grep msr
msr                    16384  0
Next, let us read the value of the IA32_PERF_GLOBAL_CTRL register. If the result differs from what you see in the snippet below, overwrite it (the -a flag makes the command run on each individual logical core):
$ sudo rdmsr -a 0x38f
70000000f
$ sudo wrmsr -a 0x38f 0x70000000f
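The value 0x70000000f is easier to remember as a bitmask; the arithmetic, assuming 4 programmable and 3 fixed-function counters (the counts on your CPU may differ -- check the cpuid output):

```shell
# bits 0-3 enable PMC0-PMC3; bits 32-34 enable the three fixed-function counters
printf '0x%x\n' $(( (0x7 << 32) | 0xf ))
```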
The IA32_PERFEVENTSELx registers are MSRs from version 1 that are used to configure the monitored event of a certain counter, its enabled state and a few other things. We will not go into detail and instead only mention the fields that interest us right now (you can read about the rest in the Intel manual). Note that the x in the MSR's name stands for the counter number; if we have 4 counters, it takes values in the 0:3 range. The one that we will configure is IA32_PERFEVENTSEL0 (0x186). If you want to configure more than one counter, note that they have consecutive register numbers (i.e., 0x187, 0x188, etc.).
As for the register flags, those that are not mentioned in the following list should be left cleared:
Before actually writing to this register, we should verify that no one is currently using it. If it is indeed free, we might also want to clear IA32_PMC0 (0xc1). PMC0 is the actual counter associated with PERFEVENTSEL0.
$ sudo rdmsr -a 0x186
0
$ sudo wrmsr -a 0xc1 0x00
$ sudo wrmsr -a 0x186 0x41????
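The 0x41 prefix can be derived from the register layout; a sketch with placeholder event/umask values (the zeros below are NOT a real event -- take the actual values from the Intel manual):

```shell
# IA32_PERFEVENTSELx low bits: event select [7:0], unit mask [15:8],
# USR = bit 16, EN = bit 22 -- so 0x41 in bits [23:16] sets EN | USR
EVENT=0x00   # placeholder: event select from the Intel manual
UMASK=0x00   # placeholder: unit mask from the Intel manual
printf '0x%x\n' $(( (0x41 << 16) | (UMASK << 8) | EVENT ))
```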
For the next (and final) task, we are going to monitor the number of L2 cache misses. Look for the L2_RQSTS.MISS event in table 19-3 or 19-11 (depending on your CPU's version ID) in the Intel manual and set the last two bytes (the unit mask and event select) accordingly. If the operation is successful and the counters have started, you should see non-zero values in the PMC0 register, increasing with each subsequent read.
Check /proc/cpuinfo and identify your microarchitecture based on the table below. Then search for the desired event in the appropriate section of the site.
| Generation | Microarchitecture (Core Codename) | Release Year | Typical CPU Numbers |
|---|---|---|---|
| 1st | Nehalem / Westmere | 2008–2010 | i3,5,7 3xx–9xx |
| 2nd | Sandy Bridge | 2011 | i3,5,7 2xxx |
| 3rd | Ivy Bridge | 2012 | i3,5,7 3xxx |
| 4th | Haswell | 2013 | i3,5,7 4xxx |
| 5th | Broadwell | 2014–2015 | i3,5,7 5xxx |
| 6th | Skylake | 2015 | i3,5,7 6xxx |
| 7th | Kaby Lake | 2016–2017 | i3,5,7 7xxx |
| 8th | Coffee Lake / Amber Lake / Whiskey Lake | 2017–2018 | i3,5,7 8xxx |
| 9th | Coffee Lake Refresh | 2018–2019 | i3,5,7,9 9xxx |
| 10th | Comet Lake / Ice Lake / Tiger Lake | 2019–2020 | i3,5,7,9 10xxx |
| 11th | Rocket Lake / Tiger Lake | 2021 | i3,5,7,9 11xxx |
| 12th | Alder Lake | 2021–2022 | i3,5,7,9 12xxx |
| 13th | Raptor Lake | 2022–2023 | i3,5,7,9 13xxx |
| 14th | Raptor Lake Refresh | 2023–2024 | i3,5,7,9 14xxx |
| — | Meteor Lake | 2023–2024 | Core Ultra 5,7,9 1xx |
| — | Arrow Lake / Lunar Lake | 2024–2025 | Core Ultra 5,7,9 2xx |
As of now, we should be able to enable user-space RDPMC (the CR4 bit, via the sysfs interface), enable all counters in IA32_PERF_GLOBAL_CTRL across all cores, and start an L2 cache miss counter, again across all cores. What remains is putting everything into practice.
Take mat_mul.c. This program may be familiar from an ASC laboratory but, in case it isn't, the gist of it is that when using the naive matrix multiplication algorithm (O(n^3)), the order of the loops (i.e., how fast each iterator varies) can wildly affect the performance of the program. The reason behind this is (in)efficient use of the CPU cache. Take a look at the following snippet from the source and keep in mind that each matrix buffer is a contiguous area in memory.
for (uint32_t i = 0; i < N; ++i)         /* line */
    for (uint32_t j = 0; j < N; ++j)     /* column */
        for (uint32_t k = 0; k < N; ++k)
            r[i*N + j] += m1[i*N + k] * m2[k*N + j];
What is the problem here? The problem is that i and k are multiplied by a large number N when updating a certain element. Thus, fast variation of these two indices in an inner loop causes huge strides in the accessed memory areas (larger than a cache line) and unnecessary cache misses. So what are the best and worst orderings of the three loops? The best: i, k, j. The worst: j, k, i. As we can see, the configurations that we will monitor in mat_mul.c do not coincide with the aforementioned two (so… not great, not terrible). Even so, the difference in execution time and number of cache misses will still be significant.
Which brings us to the task at hand: using the RDPMC instruction, calculate the number of L2 cache misses for each of the two multiplications without performing any context switches (hint: look at gcc extended asm and the following macro from mat_mul.c).
#define rdpmc(ecx, eax, edx) \
asm volatile ( \
"rdpmc" \
: "=a"(eax), \
"=d"(edx) \
: "c"(ecx))
A word of caution: remember that each logical core has its own PMC0 counter, so make sure to use taskset to set the CPU affinity of the process. If you don't, the process may be passed around between cores and the counter values will become unreliable.
$ taskset 0x01 ./mat_mul 1024
You can check your cache size by printing the contents of these files:
/sys/bus/cpu/devices/cpu0/cache/index*/size. Indices 0 and 1 typically correspond to the L1 data and L1 instruction caches respectively (you can double check this by reading the type file instead of size). Indices 2 and 3 correspond to the L2 and L3 caches.
If you run into this problem, either calculate a sufficiently large matrix size, or just ballpark it.
Please take a minute to fill in the feedback form for this lab.