~~NOTOC~~

====== Lab 03 - CPU Monitoring ======
  
===== Objectives =====
  
  * Offer an introduction to Performance Monitoring
  * Present the main CPU metrics and how to interpret them
  * Get you to use various tools for monitoring the performance of the CPU
  * Familiarize you with the x86 Hardware Performance Counters
  
===== Contents =====
Before you start, create a [[http://docs.google.com/|Google Doc]]. Here, you will add screenshots / code snippets / comments for each exercise. Whatever you decide to include, it must prove that you managed to solve the given task (so don't show just the output, but how you obtained it and what conclusion can be drawn from it). If you decide to complete the feedback for bonus points, include a screenshot with the form submission confirmation, but not with its contents.
  
When done, export the document as a //pdf// and upload it to the appropriate assignment on [[https://curs.upb.ro/2022/course/view.php?id=5113#section-2|moodle]]. The deadline is 23:55 on Friday.
===== Introduction =====
  
<spoiler>
  
Performance Monitoring is the process of checking a set of metrics in order to ascertain the health of the system. Normally, the information gleaned from these metrics is in turn used to fine-tune the system in order to maximize its performance. As you may imagine, both acquiring and interpreting this data requires at least //some// knowledge of the underlying operating system.
  
In the following four labs, we'll discuss the four main subsystems that are likely to have an impact either on a single process, or on the system as a whole. These are: CPU, memory, disk I/O and networking. Note that these subsystems are not independent of one another. For example, a web application may be dependent on the network stack of the kernel. Its implementation determines the number of packets processed in a given amount of time. However, protocols that require checksum calculation (e.g.: TCP) will want to use a highly optimized implementation of this function (which is written directly in assembly). If your architecture does not have such an implementation and falls back to using the one written in C, you may prefer changing your choice of protocol.
  
When dealing strictly with the CPU, these are a few things to look out for:

**Context Switches**

A context switch is a transition from one runtime environment to another. This usually comes in the form of performing a privileged call to kernel space (e.g.: a syscall) and returning from it. Whenever this happens, a copy of your register state must be stored and later restored, which takes up some time.

Note, however, how each individual process has its own address space, but in every address space, the only constant is the kernel. Why is that? Well, when the time slice of a process runs out and another is scheduled in, the kernel must perform a Translation Lookaside Buffer (TLB) flush. Otherwise, memory accesses in the new process might erroneously end up targeting the memory of the previous process. Yes, some shared objects (libraries) //could// have been mapped at the same virtual addresses and deleting those entries from the TLB is a shame, but there's no workaround for that. Now, back to our original question: why is the kernel mapped identically in each virtual address space? The reason is that when you perform a context switch into the kernel after calling ''open()'' or ''read()'', a TLB flush is not necessary. If you wanted to write your own kernel, you could theoretically isolate the kernel's address space (like any other process), but you would see a huge performance drop.

The takeaway is that some context switches are more expensive than others. Not being able to keep a process scheduled on a single core 100% of the time comes with a huge cost (flushing the TLB). That being said, context switches from user space to kernel space are still expensive operations. As Terry Davis once demonstrated in his TempleOS, running everything at the same privilege level can reduce the cost of context switches by orders of magnitude.
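
To get a feel for how often a process pays this cost, you can query the kernel's per-process counters. Below is a minimal sketch (not part of the lab skeleton; the busy loop and the sleep are only there so that both counters have a chance to move) that uses ''getrusage()'' to print the number of voluntary context switches (the process blocked or yielded) and involuntary ones (the scheduler preempted it). The same counters are also exposed in ''/proc/[pid]/status''.

<code c>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    struct rusage ru;

    /* burn a little CPU, then sleep, so both counter types can grow */
    for (volatile unsigned long i = 0; i < 100000000UL; i++) ;
    usleep(100000);

    /* RUSAGE_SELF = statistics for the calling process only */
    if (getrusage(RUSAGE_SELF, &ru) == -1) {
        perror("getrusage");
        return 1;
    }

    printf("voluntary ctxt switches  : %ld\n", ru.ru_nvcsw);
    printf("involuntary ctxt switches: %ld\n", ru.ru_nivcsw);

    return 0;
}
</code>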
  
**CPU Utilization**

Each process is given a time slice to utilize however it sees fit. The way that time is utilized can prove to be a meaningful metric. There are two ways that we can look at this data: at system level or at process level.
  
At system level, the data is offered by the kernel in ''/proc/stat'' (details in **man 5 proc**; look for this file). For each core, we get the number of time units (''USER_HZ'', configured at compile time in the kernel, usually equivalent to 10ms) that the core has spent on a certain type of task. The more commonly encountered are of course:
  * **user:** Running unprivileged code in ring 3.
  * **system:** Running privileged code in ring 0.
  * **idle:** Not running anything. In this case, the core voltage & frequency is usually reduced.
  * **nice:** Same as **user**, but refers to processes with a positive //nice// value (i.e., a lowered priority). More details [[https://www.kernel.org/doc/html/next/scheduler/sched-nice-design.html|here]].
  
The less reliable / relevant ones are:
  * **iowait:** Time waiting for I/O. Not reliable, because I/O is usually handled via Direct Memory Access at kernel level, and processes that perform blocking I/O operations (e.g.: ''read()'' -- with the exception of certain types of files, such as sockets opened with ''O_NONBLOCK'') automatically yield their remaining time so that another CPU-bound process can be rescheduled.
  * **(soft)irq:** Time servicing interrupts. This has nothing to do with user space processes. A high number can indicate high peripheral activity.
  * **steal:** If the current system runs under a Hypervisor (i.e.: you are running in a Virtual Machine), know that the HV has every right to steal clock cycles from any VM in order to satisfy its own goals. Just like the kernel can steal clock cycles from a regular process to service an interrupt from, let's say, the Network Interface Controller, so can the HV steal clock cycles from the VM for exactly the same purpose.
  * **guest:** The opposite of **steal**. If you are running a VM, then the kernel can take the role of a HV in some capacity (see **kvm**). This is the amount of time the CPU was used to run the guest VM.
  
At process level, the data can be found in ''/proc/[pid]/stat'' (see **man 5 proc**). Note that in this case, the amount of information the kernel interface provides is much more varied. While we still have **utime** (user time) and **stime** (system time), note that we also have statistics for child processes that have not been orphaned: **cutime** and **cstime**.
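
A process that only wants its own (and its waited-for children's) CPU time does not have to parse ''/proc'' at all: the same four quantities are returned, in clock ticks, by the ''times()'' syscall. A minimal sketch, assuming we just print the raw values:

<code c>
#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
    struct tms t;
    long ticks_per_sec = sysconf(_SC_CLK_TCK);  /* USER_HZ as seen from user space */

    /* ... do some work here ... */

    if (times(&t) == (clock_t)-1) {
        perror("times");
        return 1;
    }

    printf("utime : %ld ticks\n", (long)t.tms_utime);   /* user time, this process      */
    printf("stime : %ld ticks\n", (long)t.tms_stime);   /* system time, this process    */
    printf("cutime: %ld ticks\n", (long)t.tms_cutime);  /* user time, waited-for kids   */
    printf("cstime: %ld ticks\n", (long)t.tms_cstime);  /* system time, waited-for kids */
    printf("(1 tick = 1/%ld s)\n", ticks_per_sec);

    return 0;
}
</code>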
  
Although you may find many tools that offer similar information, remember that these files are the origin. Another thing to keep in mind is that this data is representative for the entire session, i.e.: from system boot or from process launch. If you want to interpret it in a meaningful manner, you need to get two data points and know the time interval between their acquisition.
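
As an illustration of the two-data-point idea, here is a minimal sketch that samples ''/proc/stat'' twice and derives the overall CPU utilization from the difference of the counters. The one-second interval, the use of the aggregate ''cpu'' line and the decision to count **iowait** as idle are all arbitrary choices, not requirements:

<code c>
#include <stdio.h>
#include <unistd.h>

/* read the aggregate "cpu" line; returns total and idle time in USER_HZ units */
static int read_cpu_times(unsigned long long *total, unsigned long long *idle)
{
    unsigned long long v[10] = {0};
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return -1;

    /* cpu  user nice system idle iowait irq softirq steal guest guest_nice */
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
               &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7],
               &v[8], &v[9]) < 8) {
        fclose(f);
        return -1;
    }
    fclose(f);

    *total = 0;
    for (int i = 0; i < 8; i++)   /* guest time is already included in user */
        *total += v[i];
    *idle = v[3] + v[4];          /* idle + iowait */

    return 0;
}

int main(void)
{
    unsigned long long t1, i1, t2, i2;

    read_cpu_times(&t1, &i1);
    sleep(1);                     /* the interval between the two data points */
    read_cpu_times(&t2, &i2);

    double busy = 100.0 * (double)((t2 - t1) - (i2 - i1)) / (double)(t2 - t1);
    printf("CPU utilization over the last second: %.2f%%\n", busy);

    return 0;
}
</code>

The per-core lines (''cpu0'', ''cpu1'', ...) can be processed in exactly the same way if you need a per-core breakdown.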
  
**Scheduling**

When a CPU core frees up, the kernel must decide which process gets to run next. To this end, it uses the [[https://www.kernel.org/doc/html/v5.7/scheduler/sched-design-CFS.html|Completely Fair Scheduler (CFS)]]. Normally, we don't question the validity of the scheduler's design; that's a few levels above our pay grade. What we can do is adjust the value of ''/proc/sys/kernel/sched_min_granularity_ns''. This virtual file contains the minimum number of nanoseconds that a task is allocated when scheduled on the CPU. A lower value guarantees that each process will be scheduled sooner rather than later, which is a good trait for a real-time system (e.g.: Android -- you don't want unresponsive menus). A greater value, however, is better when you are doing batch processing (e.g.: rendering a video). We noted previously that switching the active process on a CPU core is an expensive operation, so allowing each process to run for longer will reduce the CPU dead time in the long run.
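
Note that the exact location of this knob depends on your kernel version; on more recent kernels it appears to have moved under ''/sys/kernel/debug/sched/'' (treat the second path below as an assumption and check your own system). A minimal sketch that just reads and prints whichever of the two candidate files exists:

<code c>
#include <stdio.h>

int main(void)
{
    /* candidate locations; which one exists depends on the kernel version */
    const char *paths[] = {
        "/proc/sys/kernel/sched_min_granularity_ns",
        "/sys/kernel/debug/sched/min_granularity",   /* assumption: newer kernels */
    };
    char buf[64];

    for (unsigned i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
        FILE *f = fopen(paths[i], "r");

        if (!f)
            continue;

        if (fgets(buf, sizeof(buf), f))
            printf("%s: %s", paths[i], buf);
        fclose(f);
    }

    return 0;
}
</code>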
  
Another aspect that's not talked about quite as much is core scheduling. Given that you have more available cores than active tasks, on what core do you schedule a task? When answering this question, we need to keep in mind a few things: the CPU does not operate at a constant frequency. The voltage of each core, and consequently its frequency, varies based on the amount of active time. That being said, if a core has been idle for quite some time and suddenly a new task is scheduled, it will take some time to bring it from its low-power frequency up to its maximum. But now consider this: what if the workload is not distributed among all cores and a small subset of cores overheats? The CPU is designed to forcibly reduce the frequency in such cases and, if the overall temperature exceeds a certain point, to shut down entirely.
  
At the moment, CFS likes to spread out the tasks to all cores. Of course, each process has the right to choose the cores it's comfortable running on (more on this in the exercises section). Another reason why this may be preferable, one we haven't mentioned before, is not invalidating the CPU cache. L1 and L2 caches are specific to each physical core, while L3 is accessible to all cores. However, L1 and L2 have an access time of 1-10ns, while L3 can go as high as 30ns. If you have some time, read a bit about [[https://www.phoronix.com/news/Nest-Linux-Scheduling-Warm-Core|Nest]], a newly proposed scheduler that aims to keep scheduled tasks on "warm cores" until it becomes necessary to power up idle cores as well. Can you come up with situations when Nest may be better or worse than CFS?
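
Restricting a process to a subset of cores is done through its CPU affinity mask. The sketch below shows how a process can change its own mask with ''sched_setaffinity()'' (pinning to core 0 is just an arbitrary example); the same effect can be obtained from the command line with ''taskset''.

<code c>
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    /* allow this process to run only on core 0 (an arbitrary choice) */
    CPU_ZERO(&set);
    CPU_SET(0, &set);

    /* pid 0 == the calling process */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* read the mask back to confirm the change */
    if (sched_getaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_getaffinity");
        return 1;
    }

    for (int i = 0; i < CPU_SETSIZE; i++)
        if (CPU_ISSET(i, &set))
            printf("allowed to run on core %d\n", i);

    return 0;
}
</code>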
  
</spoiler>
  
===== Tasks =====
  
The skeleton for this lab can be found in this [[https://github.com/cs-pub-ro/EP-labs|repository]]. Clone it locally before you start.
  
{{namespace>:ep:labs:03:contents:tasks&nofooter&noeditbutton}}
  